RAIDZ resilience put to the test on my new NAS!

In a previous post, I discussed building a NAS device for my home network using an HP Microserver kitted out with 4 x Samsung F4 2TB drives and 8GB of RAM, running the FreeBSD-based FreeNAS OS. One key objective was to build in a high degree of resilience, so I chose a ZFS RAIDZ2 disk configuration, which tolerates up to 2 concurrent drive failures while still maintaining the integrity of the data. At the time, I didn’t fully understand what a real-life drive failure would look and feel like… but that was soon to change!
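
For anyone curious about what that means in practice: FreeNAS builds the pool through its web GUI (and actually partitions the disks first), but the command-line equivalent is roughly the sketch below. The pool name and device names are placeholders, not what FreeNAS uses internally.

    # Create a RAIDZ2 pool from four drives (placeholder names; the FreeNAS GUI
    # does the equivalent using GPT partitions rather than whole disks):
    zpool create tank raidz2 ada0 ada1 ada2 ada3

    # Confirm the layout and redundancy level:
    zpool status tank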

Houston, we have a problem

The NAS had been up and running for a couple of days with no problems at all. I hadn’t started copying real data across to the device, but I had played with the web GUI, set up some ZFS datasets and associated CIFS and AFP shares, run some test transfers from different computers, and accessed the device using SSH. All seemed very well.
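
For reference, creating the datasets themselves is a one-liner apiece from the shell; the names below are just examples, and the CIFS and AFP share definitions sit on top of the datasets in the FreeNAS GUI rather than in ZFS itself.

    # Example datasets (names are illustrative):
    zfs create tank/media
    zfs create tank/backups

    # List datasets and their mountpoints:
    zfs list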

It was while logged onto the box over SSH that I first noticed strange problems. For some reason the connection would be terminated, and at the same time the web GUI would stop responding. Then, a few seconds later, I would be able to connect again and the whole cycle would repeat. I also noticed that if I put my ear to the box (it is pretty quiet under normal operation) I could hear what sounded like a drive repeatedly spinning up and down. That is the kind of noise that fills me with dread when it isn’t something I have deliberately triggered.

I then decided to restart the box and watch the boot sequence from the console. On reboot, the first thing that was apparent was that it was taking a long time, and the hold-up seemed to be drive detection. The first 3 of the 4 drives were detected OK, albeit slowly, but it refused to recognise the 4th drive. This was consistent with the unexpected drive noise and suggested the 4th drive was having problems.
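
I didn’t capture the console output, but on FreeBSD the same drive-detection information can be pulled up after boot, which makes for a handy sanity check. Something along these lines (the adaX device names are what I’d expect on this box, not confirmed):

    # List the SATA devices the kernel actually attached:
    camcontrol devlist

    # Review the boot messages relating to the drives:
    dmesg | grep -i ada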

RAIDZ to the rescue

At this point I decided to shut the box down again, pop out the 4th drive and reboot to see what happened. On restart, the box booted quickly without any problems and seemed stable once up. The web GUI alert indicator stated that the volume was in a degraded state, and viewing the disks making up the RAIDZ2 volume confirmed that the 4th drive was missing. However, throughout all of this the data remained fully accessible, despite the degraded volume. So, RAIDZ was doing its job!
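
The GUI alert is really just reflecting what ZFS itself reports. From an SSH session the same thing can be checked with zpool status (pool name assumed here); with the 4th drive out, that disk shows up in the config tree as unavailable while the pool as a whole carries on in a DEGRADED but working state.

    # Show pool health, the state of each member disk, and any errors:
    zpool status -v tank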

Testing the suspected failed drive

I didn’t want to believe that a brand new drive could be suffering a failure, so I decided to shut the box down again, put the 4th drive back in and reboot to see the result. At the same time I removed and reseated each drive in its caddy, making sure that the connection was solid, and double-checked the connections of all cables onto the motherboard, particularly the single heavy braided cable cluster for the hard drives. On restart, it booted up just fine – and this was with the suspected failed drive back in!

Once the box was back up and stable I decided to run a sequence of S.M.A.R.T. self-tests on the 4th drive. The first short test only took a couple of minutes and came back all OK. The second test was a long one and took several hours to complete. When I checked the results the following morning, everything was reporting OK!
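
For anyone wanting to do the same from the shell rather than the FreeNAS GUI, smartmontools drives it all; the device name below is a placeholder for whichever drive is under suspicion.

    # Kick off the self-tests (the long test runs in the background for hours):
    smartctl -t short /dev/ada3
    smartctl -t long /dev/ada3

    # Later, check the self-test log and the drive's own attribute counters:
    smartctl -l selftest /dev/ada3
    smartctl -A /dev/ada3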

Panic over?

So, could it be that the drive is actually perfectly OK and the problems were down to a different cause? Perhaps a dodgy drive caddy connection, or a loose cable?

I will definitely keep a very close watch on the system over the next few days.

But one positive thing has come out of this – RAIDZ is doing its job, and that gives me a fair bit of confidence that my data is safe in the event of a drive failure 🙂