SATA-RAID and ZFS: Infortrend/ADVUNI OXYGENRAID: top or flop?
SATA RAID devices have become quite popular because they offer a lot of storage for a low price tag. SATA disks are said to deliver far fewer input/output operations per second than SCSI/SAS or FC devices, and public opinion also holds that SATA devices fail much earlier than their SCSI/FC counterparts.
And then there is Sun's new combined filesystem and volume manager: ZFS. ZFS is also capable of creating RAID groups of its own - how does it perform with SATA devices?
I had the opportunity to use an Infortrend aka Advanced Unibyte OXYGENRAID device with 16 spindles of 750 GB each for three days to test it in ZFS environments. Not all the tests I wanted to make could be made in that short time. This is not meant as scientific work, because I did not have the time to take three or four results per configuration to estimate the error range of my results. So take the following numbers as a first impression. Let's begin.
- Sun X4200 M2 dual opteron server with 20 GB RAM, 10 GB used for ARC cache
- 2 Sun (Q-Logic) FC cards installed, 4 Gbit/sec capable, connected through a SAN switch fabric to the raid device
- SAN switch fabric without other traffic (isolated)
- Solaris 10 x86 with kernel patch 127112-11 installed (all relevant ZFS patches as of 2008/05/12)
- Advanced Unibyte (OEM Infortrend) OXYGENRAID, Firmware 4F614, 16 x 750 GB SATA Seagate ST3750640AS Barracuda 7200.10
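On Solaris 10, capping the ARC at 10 GB is done with a tunable in /etc/system; the fragment below is a sketch of how such a cap would be set (the hex value is 10 GB expressed in bytes):

```shell
# /etc/system fragment (takes effect after a reboot):
# cap the ZFS ARC at 10 GB = 10 * 1024^3 = 10737418240 bytes = 0x280000000
#
#   set zfs:zfs_arc_max = 0x280000000
#
# Sanity check of the hex value:
echo $((0x280000000))   # 10737418240
```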
This was a real-world test, so the caches on the system (ARC) and on the RAID device were ON. I explicitly did not want a lab benchmark with all caches turned off.
Test configurations

The following configurations were tested:
- 1x1: single disk, configured as JBOD/non-RAID, for reference
- 1x2r1: RAID1 (mirror) with 2 disks
- 1x2m: ZFS mirror with 2 disks as JBOD/NRAID
- 1x3r5: RAID5 with 3 disks
- 4x3r5: 4 RAID5 sets with 3 disks each, striped via ZFS
- 1x6r5: RAID5 with 6 disks
- 1x6z1: ZFS raidz1 with 6 disks (JBOD/NRAID)
- 2x6r5: 2 RAID5 sets with 6 disks each, striped via ZFS
- 1x12r5: RAID5 with 12 disks
- 1x12z1: ZFS raidz1 with 12 disks (JBOD/NRAID)
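As a sketch of how the ZFS side of these configurations maps to zpool commands (the device names c6tXd0 are placeholders, not the actual MPxIO paths used in the test):

```shell
# 1x2m: ZFS mirror over two NRAID/JBOD disks
zpool create tank1x2m mirror c6t0d0 c6t1d0

# 4x3r5: four 3-disk RAID5 LUNs (built inside the Infortrend), striped by ZFS
zpool create tank4x3r5 c6t2d0 c6t3d0 c6t4d0 c6t5d0

# 1x6z1: ZFS raidz1 over six NRAID/JBOD disks
zpool create tank1x6z1 raidz1 c6t6d0 c6t7d0 c6t8d0 c6t9d0 c6t10d0 c6t11d0
```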
All devices were accessed via Sun's scsi_vhci (MPxIO) driver, with a logical blocksize of 2^20 bytes (1 MB). For the 12-disk raidz1 test, 6 disks were configured per Infortrend controller, because one Infortrend controller can only handle up to 8 LUNs.
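On Solaris 10, multipathing over the FC HBAs is switched on with stmsboot; the commands below sketch how one would enable it and verify that the LUNs really sit under scsi_vhci:

```shell
# Enable MPxIO on all supported FC HBA ports (requires a reboot):
stmsboot -e

# After the reboot, multipathed LUNs appear as single scsi_vhci devices;
# list them together with their path counts:
mpathadm list lu
```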
A side note on the raidz1/JBOD configurations: I wanted to know whether raidz1 is usable with this RAID device by configuring the disks as plain JBOD. I did not want to test the usability of ZFS' RAID implementation in general. Keep in mind that Infortrend's NRAID/JBOD implementation is included in these numbers, too.
Sun's filebench tool was used to generate the results.
The following filebench personalities and parameters were used:
All tests were run for 300 seconds ("run 300").
ZFS was used as filesystem in all scenarios.
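Since the exact parameter tables did not survive here, the following is only a sketch of what a 300-second filebench run of the varmail personality over a ZFS dataset looks like; the test directory /tank/fbtest is an assumption:

```shell
# Drive a filebench session non-interactively (sketch):
filebench <<'EOF'
load varmail
set $dir=/tank/fbtest
run 300
EOF
```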
The result is not very impressive: more spindles mean more I/O bandwidth. More interesting is the fact that ZFS' mirror performs better with sequential reads than the RAID1 mirror implemented in the RAID device. A 12-disk RAID5 configuration performs a little better than a 4-way ZFS stripe of 3-disk RAID5 devices, but the differences are not substantial. It seems that 85 MB/s is something of a bottleneck for this RAID device. Note that ZFS' RAID configurations show slightly lower bandwidth.
The next graph compares the CPU time used:
Again - nothing impressive. Since ZFS' RAID implementation runs on the system's CPUs, it uses more CPU time than configurations where the RAID logic is handled by the RAID device itself.
The write numbers are more interesting - it seems that our RAID device handles writes very well, better than reads. Note that even in the RAID1 (mirror) configuration, write bandwidth is higher than in the single-disk configuration. ZFS' mirror does not enhance write bandwidth in any way. Infortrend's write-back cache seems to do a very good job.
The CPU time numbers are similar to the read numbers; again the CPU overhead of ZFS' RAID implementations is visible.
The varmail scenario is a heavy random I/O scenario with many files in one directory and concurrent access by many threads. If you don't want to use your RAID system just for single-user video streaming, read on.
The graph above shows the total number of operations per second measured. If you prefer to see real read/write operations, this is the graph for you:
- ZFS' mirror does not enhance performance.
- More spindles mean more operations per time unit.
- ZFS' concatenation/stripe algorithm introduces a performance decrease.
- ZFS' RAID implementation in conjunction with Infortrend's JBOD setting is not recommendable for this scenario.
More spindles effectively reduce latency. ZFS' RAID latency is higher than Infortrend's. 19 ms is the lower limit - about 5 times the value Seagate states for one of these disks (Barracuda 7200.10).
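For reference, the Seagate figure is presumably the drive's average rotational latency, which for a 7200 rpm disk works out to half a revolution:

```shell
# Average rotational latency of a 7200 rpm disk:
# one revolution = 60000 ms / 7200 = 8.33 ms; on average we wait half of that.
awk 'BEGIN { printf "%.2f ms\n", 60000 / 7200 / 2 }'   # 4.17 ms
```

19 ms measured at the filesystem level is thus roughly 4.5 to 5 times the bare mechanical figure.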
The amount of CPU time used by ZFS' raid implementation is not a big hit (times in microseconds):
Now for a more complex scenario. "oltp" simulates a transactional database (like Oracle, Postgres, ...) with (very) small database updates, a common shared memory mapped region and a transaction log file; 230 threads run in parallel. First result: do not use this RAID for that kind of workload - use many spindles in case you have to - and immediately bury any idea of using ZFS' raidz1 RAID implementation with this Infortrend device.
- raidz1 (ZFS raid) does not scale at all with the number of spindles.
- ZFS' concatenation/stripe algorithm performs very well with this kind of workload.
- These Seagate SATA disks seem to be able to handle about 100 OLTP operations per second; FC and SAS disks should handle more.
The last graph shows cpu usage:
The CPU time overhead of ZFS raidz1 is considerable for this kind of workload.
In conjunction with this Infortrend device, ZFS raidz1 performs very badly for everything besides sequential access patterns. Since many sources on the internet state that ZFS raidz IS in fact very fast, the culprit must be a rather dull JBOD implementation in this Infortrend device. As it is sold as a hardware RAID device, nobody wants to use it as a JBOD case anyway - that is not the job of this box.
For comparison, I would like to run this test on the same host system with other devices. The main problem: I don't have access to other FC storage devices at the moment. I'll have an old LSI Fibre Channel disk system with real FC disks next week, so I will be able to run the filebench tests against it.
The tests were done in the datacenter of University of Konstanz.