How fast should my FreeNAS-based HP Microserver NAS be?

It’s been a couple of weeks since I built a home NAS using an HP Microserver N36L with 8GB RAM, FreeNAS 8.0.2-RELEASE and 4 x 2TB Samsung F4 hard drives configured as RAIDZ2. Apart from a scary incident which resulted in an unexpected real-world test of RAIDZ2 resilience, the NAS has been pretty stable, although I’ve not been blown away by read/write performance over the network. I didn’t really want to get into fine-tuning ZFS this early, as I was hoping the out-of-the-box performance would be good enough, but it looks like I’m going to have to do a bit of investigation to understand why performance is not as good as I had hoped.

Network dropouts

It’s worth mentioning that I was also experiencing regular incidents of the NAS dropping off the network and reappearing several seconds later. This was particularly noticeable when SSHing onto the box using PuTTY: the shell would stop responding and the connection would be terminated a few seconds later. At the same time the web GUI would stop responding and any remote file shares would disappear.

Checking the FreeNAS logs didn’t show anything scary such as disk problems, so I Googled a bit and found many reports of problems with the on-board Broadcom-based NC107i embedded network controller on the HP Microserver N36L. Users report regular network disconnection and reconnection problems, and many have resorted to installing a separate quality NIC (such as an Intel PRO/1000 server or desktop card) in one of the PCIe slots. This sounded promising and I was all set to order a NIC when it dawned on me that I had been playing about with configuring my various network devices for jumbo frames support, and when I couldn’t get it to work reliably I had forgotten to revert my Windows XP PC’s NIC settings back to the default MTU of 1500! As soon as I did this the NAS network connection was steady again, so I’ve delayed the purchase of a separate NIC… for now at least!
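
For reference, the same sort of check can be done on the FreeNAS side from the shell. This is just a sketch: the interface name bge0 is an assumption for the on-board Broadcom NIC, so substitute whatever ifconfig reports on your box.

# Show the current MTU on the on-board NIC (bge0 is assumed here)
ifconfig bge0
# Revert to the standard 1500-byte MTU if jumbo frames had been enabled
ifconfig bge0 mtu 1500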

Testing network speed with iperf / jperf

Given the numerous reports of problems with the on-board NIC in the N36L, the first test I wanted to perform was a low-level network test using iperf and its GUI front-end, jperf. Luckily, iperf is bundled with FreeNAS, so it was simply a case of starting it in server mode using the command:

iperf -s

Then I fired up jperf on my iMac and ran a few basic tests…
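
jperf is just a graphical front-end over iperf, so the equivalent run from the command line on a client machine would look roughly like this (192.168.1.10 is a placeholder for the NAS’s address, and the duration and reporting interval are just example values):

iperf -c 192.168.1.10 -t 30 -i 5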

The results were very positive! After several runs the average TCP transfer rate was around 910 Mb/s (roughly 113 MB/s), which is close to the practical maximum throughput for a Gigabit network once protocol overhead is taken into account. These were not exhaustive tests over a long period or under sustained load, but the on-board network controller appears to be doing its job at least some of the time, so I don’t think it’s the main cause of poor performance.

So next I think I need to start drilling down into raw hard drive IO performance, and then maybe move on to a bit of ZFS tuning. But that will have to wait until another post 🙂
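
As a rough idea of what that raw disk test might look like, a simple sequential read with dd is one option. This is only a sketch: ada0 is an assumption for the first data disk (camcontrol devlist will show the actual device names), and a sequential read says nothing about random IO.

# Sequential read of 4GB from the first data disk (ada0 is assumed)
dd if=/dev/ada0 of=/dev/null bs=1m count=4096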

RAIDZ resilience put to the test on my new NAS!

In a previous post, I discussed building a NAS device for my home network using an HP Microserver kitted out with 4 x Samsung F4 2TB drives and 8GB of RAM, and running the FreeBSD-based FreeNAS OS. One key objective was to build in a high degree of resilience and so I chose a ZFS RAIDZ2 disk configuration which would tolerate up to 2 concurrent drive failures while still maintaining integrity of the data. At the time, I didn’t fully understand what a real life drive failure would look and feel like… but that was soon to change!

Houston, we have a problem

The NAS had been up and running for a couple of days with no problems at all. I hadn’t started copying real data across to the device, but I had played with the web GUI, set up some ZFS datasets and associated CIFS and AFP shares, run some test transfers from different computers, and accessed the device over SSH. All seemed very well.

It was while logged onto the box over SSH that I first noticed strange problems. For no apparent reason, the connection would be terminated and at the same time the web GUI would stop responding. Then, a few seconds later, I would be able to connect again and the whole cycle would repeat. I also noticed that if I put my ear to the box (it is pretty quiet under normal operation) I could hear what sounded like a drive repeatedly spinning up and down. That kind of noise fills me with dread when it isn’t something I’ve deliberately caused.

I then decided to restart the box and watch the boot sequence from the console. On reboot, the first thing that was apparent was that it was taking a long time, and the delay seemed to be in detecting the drives. The first 3 of the 4 drives were detected OK, albeit slowly, but it refused to recognise the 4th drive. This was consistent with the unexpected drive noise and suggested the 4th drive was having problems.

RAIDZ to the rescue

At this point I decided to shut the box down again, pop out the 4th drive and reboot to see what happened. This time the box booted quickly without any problems and seemed stable once up. The web GUI alert indicator stated that the volume was in a degraded state, and viewing the disks making up the RAIDZ2 volume confirmed that the 4th drive was missing. Crucially, though, all of the data remained accessible despite the degraded state. So, RAIDZ was doing its job!
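
The same picture is visible from the shell: the standard ZFS status command reports the pool state as DEGRADED and flags the missing device (this is just the stock command rather than anything I captured at the time):

zpool status -v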

Testing the suspected failed drive

I didn’t want to believe that a brand new drive could be failing, so I decided to shut the box down again, put the 4th drive back in and reboot to see the result. At the same time I removed and reseated each drive in its caddy to make sure the connections were solid, and double-checked all the cable connections to the motherboard, particularly the single heavy braided cable cluster for the hard drives. On restart, it booted up just fine – and this was with the suspected failed drive back in!

Once the box was back up and stable I decided to run a sequence of S.M.A.R.T. self-tests on the 4th drive. The first short test only took a couple of minutes and came back all OK. The second test was a long one and took several hours to complete. When I checked the results the following morning, everything was reporting OK!
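
For anyone who prefers the shell to the web GUI, the same self-tests can be run with smartctl. The device name ada3 is an assumption for the 4th drive, so adjust it to match your system:

# Short self-test (a couple of minutes)
smartctl -t short /dev/ada3
# Extended self-test (several hours)
smartctl -t long /dev/ada3
# Review the results and SMART attributes once the tests have completed
smartctl -a /dev/ada3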

Panic over?

So, could it be that the drive is actually perfectly OK and the problems were down to a different cause? Possibly a dodgy drive caddy connection, or a loose cable connection?

I will definitely keep a very close watch on the system over the next few days.

But one positive thing has come out of this – RAIDZ is doing its job and gives me a fair bit of confidence that my data is safe in the event of a drive failure 🙂