When my Apple TimeCapsule could no longer cope with my backup requirements, it became necessary to upgrade. Simply upgrading to a newer model would have sufficed, but the old one always ran very hot and, although I never measured it, both backups and network access seemed to slow significantly for the duration of a backup. After a little research I decided to try the Western Digital MyBook Live. This network-attached storage device claimed to support Apple’s TimeMachine backup process and offered the potential of being a central repository for media files. My first backup was excruciatingly slow, but it eventually completed and everything seemed fine from then on.
That was until last week, when a polite dialog box informed me that my previous backups couldn’t be trusted and that a fresh complete backup would be necessary. OK, not to worry; better to find out when backing up than when trying to restore. The trouble was that the backup process was now so excruciatingly slow that it could not complete. Fortunately, I had another portable drive that I could try backing up to … no problem.
Following a weekend of trawling through blog and forum discussions, it became clear that many others are experiencing similar issues, although interestingly it isn’t a consistent phenomenon. Fortunately, someone had posted how to enable SSH access to the device. Fantastic, now I could look at what the device was doing…
I logged in, and it was immediately clear that the media serving programs were using a lot of memory and CPU time. After I stopped these, the device soon started to breathe a little easier and backups began running a little more reliably. But after a few hours they appeared to have stalled again. Upon investigating further I noticed two commands were being called almost continuously: ls -s1NRA --block-size=1 /shares and du -m --max-depth=1. For those curious, these are called by /usr/local/sbin/monitorio.sh and /usr/local/sbin/createBackupTally.sh respectively.
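For anyone wanting to repeat this kind of investigation, here is a minimal sketch of one way to spot processes that are stuck waiting on the disk. It assumes a procps-style ps that accepts -eo, which I have not verified on every firmware version, and io_waiters is just a helper name made up for this example.

```shell
#!/bin/sh
# Sketch: list processes in uninterruptible sleep (state D), which usually
# means they are waiting on disk IO.
# Assumption: a procps-style ps that supports -eo is available on the device.

io_waiters() {
    # Reads "PID STAT COMMAND..." lines on stdin and prints only those
    # whose state column starts with D. The ps header line is dropped
    # because its second column is "STAT".
    awk '$2 ~ /^D/ { print }'
}

ps -eo pid,stat,args | io_waiters
```

Running it a few times in a row while a backup is stalled should show whether the same commands keep turning up.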
The issue became clear: there was an IO bottleneck. This is also why the problem was worse with the media serving services running, since they forced the device to use significant amounts of swap space (hard disk used as memory), which made the bottleneck worse. Western Digital has already tried to make those scripts run unobtrusively by using ‘nice’ process prioritisation, but nice only affects CPU scheduling. Unfortunately, once a backup has stalled for a while it appears to enter a state of limbo where it neither continues nor terminates, even after the IO bottleneck has cleared and free CPU cycles have returned.
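A rough way to check how much swap is in use is to read /proc/meminfo, which any Linux system provides. The following is only a sketch, and swap_used_kb is a hypothetical helper name, not something that ships with the device.

```shell
#!/bin/sh
# Sketch: report swap in use (in kB) from /proc/meminfo-style input.

swap_used_kb() {
    # Reads meminfo text on stdin and prints SwapTotal minus SwapFree.
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }'
}

# Only read the real file where it exists (i.e. on Linux).
[ -r /proc/meminfo ] && swap_used_kb < /proc/meminfo
```

Comparing the figure with the media services running and then stopped gives a quick sense of how much they push the device into swap.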
The solution I have found to work reliably, so far, is to use the ‘ionice’ program to increase the IO priority of the ‘afpd’ and ‘cnid_metad’ services. For this priority to be applied automatically, ionice -c 1 needs to be prepended to the start commands in /etc/init.d/netatalk as follows:
Line 73: ionice -c 1 /usr/sbin/cnid_metad $CNID_CONFIG
Line 80: ionice -c 1 nice -n $AFPD_NICE /usr/sbin/afpd $AFPD_UAMLIST -g $AFPD_GUEST -c $AFPD_MAX_CLIENTS -n "$ATALK_NAME$ATALK_ZONE"
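The init script changes only take effect the next time the services start. To try the idea on daemons that are already running, something like the sketch below may help. It assumes pidof and the util-linux ionice are present and that you are logged in as root (class 1 is the realtime IO class, which needs root). boost_io and DRY_RUN are names invented for this example.

```shell
#!/bin/sh
# Sketch: raise the IO priority of already-running netatalk daemons.
# Assumptions: pidof and ionice (util-linux) exist on the device; run as root.
# Set DRY_RUN=1 to print the commands instead of executing them.

boost_io() {
    # $1 = process name (informational only), remaining args = PIDs
    name=$1
    shift
    for pid in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ionice -c 1 -p $pid"
        else
            ionice -c 1 -p "$pid"
        fi
    done
}

for daemon in afpd cnid_metad; do
    # pidof prints nothing if the daemon is not running, so the loop is a no-op.
    boost_io "$daemon" $(pidof "$daemon" 2>/dev/null)
done
```

Running it first with DRY_RUN=1 shows which PIDs would be touched. The change is lost when the daemons restart, which is why the init script edit is still needed.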
Unfortunately, this change will probably have to be reapplied after each firmware upgrade until Western Digital includes it or something similar, or actually addresses the underlying cause. The best solution would be for them to find a more efficient means of achieving the objectives of their ‘monitorio.sh’ and ‘createBackupTally.sh’ scripts. Given the heavy use of swap when the media services are running, this device should probably have 512MB of RAM instead of the current 256MB.
At this stage, if given the opportunity again, I’d probably just buy another Apple TimeCapsule because my experience with Apple products, excluding OSX Server, is that they just work.