I did some tests with that array, as I did with an Infortrend OXYGENRAID device and an old LSI FC array.

Recently in Solaris Category
With patch 137137-09 (SPARC) and 137138-09 (x86), Solaris 10 can handle ZFS pools with version 10. Existing pools have to be upgraded; until then, "zpool status" shows this warning:
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
To upgrade, use the command
# zpool upgrade yourpoolname
The GZIP compression known from OpenSolaris and Solaris Express is now in the Solaris 10 code base, starting from (kernel) patch 137137-09/137138-09.
You may set:
# zfs set compression=gzip mypool/test
and - to set a gzip compression level:
# zfs set compression=gzip-9 mypool/test
Two mirrored volumes concatenated to give an 8 TB storage pool. I logged 18 hours of block read/write activity and made an MPEG-4 movie out of it.
Green pixels summarize sector reads, red pixels sector writes. Each pixel represents the same number of consecutive storage blocks. The pixels are doubled in height because of the 4:2:0 color coding of MPEG-4. Otherwise, a red pixel adjacent to a green one would not have been possible, as two lines are coded together with regard to color. It is a TV format, after all.
At half past eight, a backup job started, scanning many files. After 4 o'clock an expire run starts. At 5 o'clock it is really "night", with user activity resuming at 6:30.
I used 640x480 pixel frames at 30 frames per second, so it should be possible to watch it on an NTSC TV :-) Enjoy!
As a follow-up to this article, I used a ZFS volume and a UFS volume with an application doing the same work on both partitions (replicated SQL engines).
Click on the still frame to download the video (4 MB, ISO MPEG 4 format):

Continue reading UFS and ZFS write patterns [Update].
Important security update: I checked the wrong error condition in nvpair_lookup_*, so using -z on UFS volumes did not work (the program segfaulted).
Example:
# vhci_stat -z
vhci_stat
1.3.1; Pascal Gienger <pascal@southbrain.com>; http://southbrain.com/south/
Host: seattle
PATH WWPN,LUN STATE
------------------------------------------------- ------------------- -----
/scsi_vhci/ssd@g600d0230006b66680c50ab4f92f61000
ZFS: pool name = atl1, ZFS Version: 4
ZFS: size = 698.37G, member type = mirror
/pci@1f,0/pci@1/fibre-channel@5/fp@0,0 210000d0231c50ab,1 ONLINE
[...]
Binaries for SPARC Solaris 10 and x86 Solaris 10 are included.
You may download vhci_stat-1.3.1.tar.gz here.
MD5: 603e3a620085dc87197345687e4f740f vhci_stat-1.3.1.tar.gz
More info is available on the vhci_stat-Page.
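To check a downloaded archive against the MD5 sum given above, something like the following minimal Python sketch could be used. The function names are made up for this illustration and are not part of vhci_stat.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 digest of the file at `path` as a lowercase hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so that large archives need not fit into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's MD5 with a published value, ignoring case and spaces."""
    return md5_hex(path) == published_hex.strip().lower()
```

Calling matches_published("vhci_stat-1.3.1.tar.gz", "603e3a620085dc87197345687e4f740f") should return True for an intact download.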
Obsolete - please use Version 1.3.1!
I just made a small upgrade to my vhci_stat program: it now also shows ZFS header data. The pool does not need to be imported for this.
Example:
# vhci_stat -z
vhci_stat
1.3; Pascal Gienger <pascal@southbrain.com>; http://southbrain.com/south/
Host: seattle
PATH WWPN,LUN STATE
------------------------------------------------- ------------------- -----
/scsi_vhci/ssd@g600d0230006b66680c50ab4f92f61000
ZFS: pool name = atl1, ZFS Version: 4
ZFS: size = 698.37G, member type = mirror
/pci@1f,0/pci@1/fibre-channel@5/fp@0,0 210000d0231c50ab,1 ONLINE
[...]
Binaries for SPARC Solaris 10 and x86 Solaris 10 are included.
[Download removed]
More info is available on the vhci_stat-Page.
[This is obsolete, these binaries are included in the 1.3.1 vhci_stat-distribution: vhci_stat-1.3.1.tar.gz]
Due to popular demand, I made precompiled binaries for Solaris 10 on SPARC and x86 platforms, without any shared GCC libraries:
- For SPARC: vhci_stat-1.2.sparc10.bin.tar.gz
md5: 3394723d4f12ca95f15848fc6fac22f4
- For x86: vhci_stat-1.2.x86_10.bin.tar.gz
md5: 1c7fe153cb33f57a71783f39d0d3951a
First:
On our running Postfix-Graylisting Setup, the two MySQL nodes are quite loaded:
Uptime: 1 day 8 hours 54 min 29 sec
Threads: 320 Questions: 20519246 Slow queries: 15205 Opens: 522 Flush tables: 1 Open tables: 516 Queries per second avg: 173.204
And this is the result of implementing graylisting:

But let's start from the beginning.
The scenario was like this:
Two mail servers act as MX hosts for an educational institution. 1.2 million spam mails came in per day. The two Ironport appliances behind them did their best to sort them out, but in that institution spam cannot be deleted; there are laws against that. So the central Cyrus mail server sorts them out via sieve scripts and puts them into the Spam boxes of every user, wasting disk space and, more importantly, disk I/O.
So I had the idea to implement a graylisting setup.
Each of the two Postfix nodes has its own MySQL database server running. But how to replicate data between them?
Continue reading Two-node Graylisting (or Greylisting) setup with Postfix, MySQL and Perl.
I wanted a tool to show up network i/o statistics - just like iostat does for disk access. I used Sun::Solaris::Kstat in Perl, available in Solaris 10 package SUNWperl584core.
The only parameter my script accepts for now is the time delay between two measurements.
Example:
-bash-3.00$ knetstat 2
network i/o statistics
r/s t/s kr/s kt/s interface
network i/o statistics
r/s t/s kr/s kt/s interface
1086 1916 91.75 2400.10 e1000g0
45.5 48.5 6.78 8.51 e1000g1
network i/o statistics
r/s t/s kr/s kt/s interface
1129.5 1921 99.30 2396.48 e1000g0
17.5 18 2.46 2.59 e1000g1
network i/o statistics
r/s t/s kr/s kt/s interface
706.5 1063 79.56 1199.68 e1000g0
16 18 2.43 2.63 e1000g1
You may download the tiny perl script here.
Update July 12th, 2008: Version 1.1 corrects the problem that on some SPARC integrated network cards there is no kstat class "mac", so this update derives the nic instances by checking "obytes64".
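The arithmetic behind the four columns can be sketched in a few lines. This is not the Perl script itself, only an illustration in Python: it assumes the counters are the usual kstat statistics ipackets64, opackets64, rbytes64 and obytes64, and that the kr/s and kt/s columns are in units of 1024 bytes.

```python
def interface_rates(before, after, interval):
    """Compute per-second rates from two counter snapshots.

    `before` and `after` map counter names to cumulative values and were
    taken `interval` seconds apart. Counter names and the 1024-byte unit
    are assumptions made for this sketch.
    """
    def per_second(name):
        return (after[name] - before[name]) / interval

    return {
        "r/s": per_second("ipackets64"),         # packets received per second
        "t/s": per_second("opackets64"),         # packets transmitted per second
        "kr/s": per_second("rbytes64") / 1024,   # kilobytes received per second
        "kt/s": per_second("obytes64") / 1024,   # kilobytes transmitted per second
    }
```

With a delay of 2 seconds, a difference of 2172 received packets between the two snapshots gives 1086 in the r/s column.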
Some people say I have too much spare time - at least I had enough time to pull an old LSI RAID out of the trash to make some performance measurements with it, using the same test suite as on the Infortrend SATA RAID device tested a few days earlier.
The results are quite impressive - a 6-year-old device can compete with a new one - Fibre Channel disks of 2002 were as fast as SATA disks are nowadays.
I will get a combined SATA and SAS device in the next few days, so I will do some tests to compare new SAS disk packs with new SATA disk packs.
Results:
If you have SAN RAID storage and you run "zpool create" to create a new ZFS pool, you may encounter these messages:
May 26 08:40:22 atlanta scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g0080e52126cdf002 (sd69):
May 26 08:40:22 atlanta SCSI transport failed: reason 'tran_err': giving up
May 26 08:40:23 atlanta scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g0080e52126cdf002 (sd69):
May 26 08:40:23 atlanta SCSI transport failed: reason 'tran_err': giving up
May 26 08:40:24 atlanta scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g0080e52126cdf002 (sd69):
May 26 08:40:24 atlanta SCSI transport failed: reason 'tran_err': giving up
May 26 08:40:24 atlanta scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/disk@g0080e52126cdf002 (sd69):
May 26 08:40:24 atlanta SCSI transport failed: reason 'tran_err': giving up
This does not always mean that your SAN disk has a problem. It can simply be that the disks you use already carry a ZFS header stating the length of the "disk". On some RAID devices, disks are not initialized again when you reconfigure system disks or disk sets. If the new disk group is smaller than it was before, Solaris tries to read the last block of that "disk" as recorded in the now "wrong" ZFS header, and this fails.
Correct it by writing zeroes over the header:
dd if=/dev/zero count=10 of=/dev/dsk/...diskname....
and reboot.
Normal Fibre Channel devices are named
cXtWWPNdLUN
Example:
/dev/dsk/c5t21000080E511F169d0
WWPN in this case is 21:00:00:80:E5:11:F1:69. It appears as "target" in the standard Solaris SCSI notation.
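The naming scheme is easy to take apart programmatically. The following Python sketch (the function name is made up for this illustration) splits such a device name into controller number, WWPN and LUN. It only handles names with a 16-digit hexadecimal target as in the example above.

```python
import re

# cX, then t followed by 16 hex digits (the WWPN), then dLUN,
# optionally followed by a slice suffix such as s0.
_DEVICE_RE = re.compile(r"^c(\d+)t([0-9A-Fa-f]{16})d(\d+)(?:s\d+)?$")


def parse_fc_device(name):
    """Split a name like c5t21000080E511F169d0 into (controller, WWPN, LUN)."""
    base = name.rsplit("/", 1)[-1]  # accept /dev/dsk/... paths as well
    match = _DEVICE_RE.match(base)
    if match is None:
        raise ValueError("not a cXtWWPNdLUN device name: %r" % name)
    controller, wwpn, lun = match.groups()
    wwpn = wwpn.upper()
    # Group the 16 hex digits into colon-separated bytes.
    pretty = ":".join(wwpn[i:i + 2] for i in range(0, 16, 2))
    return int(controller), pretty, int(lun)
```

For /dev/dsk/c5t21000080E511F169d0 this yields controller 5, WWPN 21:00:00:80:E5:11:F1:69 and LUN 0.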
The problem: If you have multiple paths (to multiple WWPNs) to your devices, your volumes will appear as many identical volumes - one for each path:
1. c5t21000080E511F169d2 <LSI-ProFibre 4000R-5902-262.26GB>
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0/disk@w21000080e511f169,2
2. c5t21000080E511F169d3 <LSI-ProFibre 4000R-5902-262.26GB>
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0/disk@w21000080e511f169,3
3. c5t22000080E511F169d2 <LSI-ProFibre 4000R-5902-262.26GB>
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0/disk@w22000080e511f169,2
4. c5t22000080E511F169d3 <LSI-ProFibre 4000R-5902-262.26GB>
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0/disk@w22000080e511f169,3
5. c5t23000080E511F169d2 <LSI-ProFibre 4000R-5902-262.26GB>
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0/disk@w23000080e511f169,2
6. c5t23000080E511F169d3 <LSI-ProFibre 4000R-5902-262.26GB>
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0/disk@w23000080e511f169,3
7. c5t24000080E511F169d2 <LSI-ProFibre 4000R-5902-262.26GB>
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0/disk@w24000080e511f169,2
8. c5t24000080E511F169d3 <LSI-ProFibre 4000R-5902-262.26GB>
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0/disk@w24000080e511f169,3
How to correct this?
Tell scsi_vhci to accept this "LSI ProFibre" device as a multipath device:
device-type-scsi-options-list =
"LSI ProFibre 4000R", "symmetric-option";
symmetric-option = 0x1000000;
Set a suitable region size to use logical-block round-robin:
device-type-mpxio-options-list =
"device-type=LSI ProFibre 4000R", "load-balance-options=logical-block-options";
logical-block-options="load-balance=logical-block", "region-size=20";
Result:
1. c6t0080E52126CDF002d0 <LSI-ProFibre 4000R-5902-262.26GB>
/scsi_vhci/disk@g0080e52126cdf002
2. c6t0080E52126CDF003d0 <LSI-ProFibre 4000R-5902-262.26GB>
/scsi_vhci/disk@g0080e52126cdf003
Wonderful.
If you have devices of multiple vendors to be accessed by the Solaris MPxIO architecture, you must use this syntax in scsi_vhci.conf:
device-type-scsi-options-list =
"ADVUNI OXYGENRAID", "symmetric-option",
"LSI ProFibre 4000R", "symmetric-option";
symmetric-option = 0x1000000;
To set up logical block round-robin for both devices, use this (region-size=20 means 2^20 bytes: 1 MB):
device-type-mpxio-options-list =
"device-type=ADVUNI OXYGENRAID", "load-balance-options=logical-block-options",
"device-type=LSI ProFibre 4000R", "load-balance-options=logical-block-options";
logical-block-options="load-balance=logical-block", "region-size=20";
Note that the device lines are comma-separated.
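To make the region-size arithmetic concrete, here is a small Python sketch. It only illustrates the idea: it assumes that consecutive regions are handed to the available paths in turn, which may not match what the driver really does, and the function names are invented for this example.

```python
def region_bytes(region_size):
    """Size of one region in bytes; region-size is an exponent of two."""
    return 1 << region_size


def path_for_offset(byte_offset, region_size, n_paths):
    """Pick a path index for an I/O starting at `byte_offset`.

    Assumption for illustration only: consecutive regions are assigned
    to the paths round-robin. The real driver's selection may differ.
    """
    region_index = byte_offset >> region_size
    return region_index % n_paths
```

With region-size=20 and two paths, all I/O within the first megabyte would go to path 0, the second megabyte to path 1, the third to path 0 again, and so on.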
I used the past weekend to compare multiple configurations of an Infortrend Oxygenraid system. It is also available as Advanced Unibyte Oxygenraid. Other OEM brand names may exist.
The hardware configuration tested includes 16 SATA disks (Seagate Barracuda 7200.10, 750 GB per disk). 12 were used for these tests. Long RAID reconfiguration delays slowed down the whole story.
Four filebench tests were run against each configuration.
Read it in the articles section...
Comments appreciated!
Why?
Because it will just start the resilvering process from the beginning... Just in case you don't want to wait several more hours for the mirror to complete...
Installing patches on Solaris machines can be funny:
root 9635 9628 0 09:15:04 pts/3 0:00 sed s,'"'"',ApOsTrOpHe,g
just found after ps -edaf :)
(yes i KNOW what this sed command does, but it is funny to see).
To whom it may concern: Two tutorial pages are now available on how to set up zpools and how to manipulate ZFS filesystems. Basic commands are shown; I'll write a chapter on snapshots, export/import and promoting soon.
- How to create zpools
- How to create zfs filesystems
[Update: We're at version 1.3 now, just look at the vhci_stat-Page]
When you have some Solaris machines with MPxIO (scsi_vhci) enabled, you may begin to cry about why there isn't a simple tool to show all paths to a certain volume AND THEIR STATUS.
You may use stmsboot for this, but it only shows paths which were found before you configured scsi_vhci. Why? Because devices found afterwards, which were taken over immediately by scsi_vhci, do not have a separate devfs entry, neither in /devices nor in /dev.
So stmsboot does not show them. You may also use luxadm of StorEdge Traffic Manager.
I preferred a simpler solution: I wrote a tiny test program that uses scsi_vhci ioctls to get the required info and print it. It also sets a return code equal to the number of failed links; if no link is in FAILED state, it returns 0 (zero).
Example:
# vhci_stat -i
vhci_stat
1.1; Pascal Gienger <pascal@southbrain.com>; http://southbrain.com/south/
Host: atlanta
PATH WWPN,LUN STATE
------------------------------------------------- ------------------- -----
/scsi_vhci/disk@g600d0230006c1c4c0c50be27386c4900
(Vendor = ADVUNI Model = OXYGENRAID 416F4 Rev. = 347B VolID = 27386C49)
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0 220000d0232c50be,2 ONLINE
/pci@7b,0/pci10de,5d@d/pci1077,142@0/fp@0,0 210000d0231c50be,2 ONLINE
/scsi_vhci/disk@g600d0230006c1c4c0c50be6456ac0d00
(Vendor = ADVUNI Model = OXYGENRAID 416F4 Rev. = 347B VolID = 6456AC0D)
/pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0 220000d0232c50be,1 ONLINE
/pci@7b,0/pci10de,5d@d/pci1077,142@0/fp@0,0 210000d0231c50be,1 ONLINE
[... rest omitted ...]
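A monitoring script would normally just look at the return code. Purely as an illustration, the same number can also be derived from the printed report; the Python sketch below assumes the three-column layout shown above (PATH, WWPN,LUN, STATE) and a state name of FAILED, and the function name is made up.

```python
def count_failed_paths(report):
    """Count path lines whose last column is FAILED in vhci_stat-style output."""
    failed = 0
    for line in report.splitlines():
        fields = line.split()
        # A path line has exactly three columns and starts with a device path.
        if len(fields) == 3 and fields[0].startswith("/") and fields[2] == "FAILED":
            failed += 1
    return failed
```

For the example output above, where every path is ONLINE, this returns 0.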
You may even reprobe devices using "-p" as an option. Option "-i" sends a SCSI inquiry to each device to look up its vendor and product ID.
When I have some time I will add controller names instead of their PCI paths. In my example, you see two different Fibre Channel controller ports (each residing on a separate controller):
- /pci@7b,0/pci10de,5d@d/pci1077,142@0/fp@0,0
- /pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0
You may find this utility on my vhci_stat-Page.
I would be glad to get feedback from you!
Just a side note:
We are using ZFS across two SAN mirror pairs for our production mail server. It is running Cyrus IMAP with 14000 users and 50000 mailboxes. The zpool holds 8 terabytes, and the system experienced some latencies when accessing data on it.
The first problem was easy to resolve, but we still had IMAP SELECT times greater than 0.5 seconds on some accesses.
The machine was doing approx. 2000 random read accesses per second (we have approx. 1000 concurrent IMAP users, including our webmail application, which uses IMAP commands heavily).
The first step was to put more RAM for the ARC cache into the machine, which brought this down to 100-200 random read accesses per second. Cyrus metadata files of logged-in users are now mainly in RAM.
The second step was to disable the prefetching algorithm. I put this in /etc/system:
set zfs:zfs_prefetch_disable=1
And - hooray - SELECT times are under 0.1 seconds. The prefetch method seems not to be optimized for a highly random I/O environment.
