April 2008 Archives

Solaris 10 & up: Working with zpools and zfs

To whom it may concern: two tutorial pages are now available showing how to set up zpools and how to manipulate ZFS filesystems. Basic commands are covered; I'll write a chapter on snapshots, exports/imports and promoting soon.

- How to create zpools
- How to create zfs filesystems

[Update: We're at version 1.3 now, just look at the vhci_stat page]

When you have some Solaris machines with MPxIO (scsi_vhci) enabled, you may wonder why there isn't a simple tool to show all paths to a given volume AND THEIR STATUS.

You may use stmsboot for this, but it will only show paths that were found before you configured scsi_vhci. Why? Because devices found afterwards, which are taken over immediately by scsi_vhci, do not have a separate devfs entry - neither in /devices nor in /dev.
So stmsboot does not show them. You may also use luxadm from StorEdge Traffic Manager.

I preferred a simpler solution: I wrote a tiny test program using scsi_vhci ioctls to get the required info and print it - it also sets its return code to the number of failed links. If no link is in the FAILED state, it returns 0 (zero).


# vhci_stat -i
1.1; Pascal Gienger <pascal@southbrain.com>; http://southbrain.com/south/
Host: atlanta

PATH                                              WWPN,LUN            STATE
------------------------------------------------- ------------------- -----
  (Vendor = ADVUNI   Model = OXYGENRAID 416F4 Rev. = 347B VolID = 27386C49)
    /pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0   220000d0232c50be,2  ONLINE
    /pci@7b,0/pci10de,5d@d/pci1077,142@0/fp@0,0   210000d0231c50be,2  ONLINE
  (Vendor = ADVUNI   Model = OXYGENRAID 416F4 Rev. = 347B VolID = 6456AC0D)
    /pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0   220000d0232c50be,1  ONLINE
    /pci@7b,0/pci10de,5d@d/pci1077,142@0/fp@0,0   210000d0231c50be,1  ONLINE

[... rest omitted ...]

You may even reprobe devices using "-p" as an option. The "-i" option sends a SCSI INQUIRY to each device to look up its vendor and product ID.

When I have some time, I will add controller names instead of their PCI paths. In my example, you see two different FibreChannel controller ports (each residing on a separate controller):

- /pci@7b,0/pci10de,5d@d/pci1077,142@0/fp@0,0
- /pci@7b,0/pci10de,5d@e/pci1077,142@0/fp@0,0

You may find this utility on my vhci_stat page.
I would be glad to get feedback from you!

Ludwigshafen (Rh) by night - Excelsior hotel Sky bar!

Whenever you come to Ludwigshafen (adjacent to Mannheim, Germany) to visit one of the biggest chemical companies in Germany (BASF), you should take some time for a rather dull place - the main station area, a 1970s urban experiment which is nowadays a disaster: missing shops, broken windows, a subway/tram station in a desolate state, and a main railway station which once opened as one of the most modern stations in Europe...

In the middle of this area stands the Eurohotel Excelsior, which has a wonderful bar on the 17th floor. It is open to the public and it is absolutely 1970s style - since the hotel opened in the mid-1970s, the design of the bar has never changed. Cocktail prices are low, and when you look through the glass walls, you will see the bridge above. I took the photograph with a Canon EOS 30D and a tripod.

Just a side note:

We are using ZFS across two SAN mirror pairs for our production mailserver. It is running Cyrus IMAP with 14000 users and 50000 mailboxes. The zpool has 8 terabytes, and the system experienced some latency when accessing data on it.

The first problem was easy to resolve, but we still had IMAP SELECT times greater than 0.5 seconds on some accesses.

The machine was doing approx. 2000 random read accesses per second (we have approx. 1000 concurrent IMAP users, including our webmail application, which makes heavy use of IMAP commands).

The first step was to add more RAM to the machine for the ARC cache, which brought us down to 100-200 random read accesses per second. The Cyrus metadata files of logged-in users are now mainly held in RAM.

The second step was to disable the prefetching algorithm. I put that in /etc/system:

set zfs:zfs_prefetch_disable=1

And - hooray - SELECT times are under 0.1 seconds. The prefetch algorithm does not seem to be optimized for a highly random I/O environment.

The other day, I wanted to know how much memory my system is really using. The modular debugger "mdb" has a nifty macro for that: it is called ::memstat - really straightforward.

This is the result:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp ufs mpt ip hook neti sctp arp usba fcp fctl qlc lofs fcip cpc random crypto zfs logindmux ptm nfs ]
> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    4111973             16062   78%
Anon                       251805               983    5%
Exec and libs                6346                24    0%
Page cache                  37719               147    1%
Free (cachelist)           285302              1114    5%
Free (freelist)            547571              2138   10%

Total                     5240716             20471
Physical                  5121357             20005

Weird - a 16 GB kernel? Yes, that is normal, as we are using ZFS as a filesystem, and its cache (the ARC) is stored in kernel memory. You may use the famous arcstat.pl Perl program from Neelakanth Nadgir to get detailed ARC statistics, but to understand it a little, you may also start with the Sun::Solaris Perl modules shipped with Solaris 10.
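If you want to track these numbers over time rather than read the table by hand, the ::memstat output is easy to parse. Here is a small sketch in Python - an illustration only, with the column layout assumed from the sample above; on the real system you would feed it the output of `echo ::memstat | mdb -k`:

```python
# Parse the table printed by `echo ::memstat | mdb -k` and report the
# kernel share. The column layout is assumed from the sample above.
SAMPLE = """\
Kernel                    4111973             16062   78%
Anon                       251805               983    5%
Exec and libs                6346                24    0%
Page cache                  37719               147    1%
Free (cachelist)           285302              1114    5%
Free (freelist)            547571              2138   10%
"""

def parse_memstat(text):
    rows = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 3)       # category names may contain spaces
        if len(parts) == 4 and parts[1].isdigit():
            name, pages, mb, _pct = parts
            rows[name] = {"pages": int(pages), "mb": int(mb)}
    return rows

rows = parse_memstat(SAMPLE)
total_mb = sum(r["mb"] for r in rows.values())
print(f"Kernel: {rows['Kernel']['mb']} MB of {total_mb} MB")
```

With the sample above, this reports the kernel using 16062 MB (the per-row MB values are rounded, so their sum differs slightly from the Total line).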

For our ARC statistics we have to use Sun::Solaris::Kstat:


use Sun::Solaris::Kstat;

my $k = Sun::Solaris::Kstat->new() || die "No kernel statistics module available.";

while (1) {
  my $kstats = $k->{zfs}{0}{arcstats};
  my %khash  = %$kstats;

  foreach my $key (keys %khash) {
    printf "%-25s = %-20s\n", $key, $khash{$key};
  }
  print "----------\n";
  sleep 5;
  $k->update();   # refresh the kstat values for the next iteration
}

This example will print out something like this every 5 seconds:

mru_ghost_hits            = 31005
crtime                    = 134.940576581
demand_metadata_hits      = 7307803
c_min                     = 670811648
mru_hits                  = 4479479
demand_data_misses        = 1616108
hash_elements_max         = 1059239
c_max                     = 10737418240
size                      = 10737420288
prefetch_metadata_misses  = 0
hits                      = 14405090
hash_elements             = 940483
mfu_hits                  = 9925611
prefetch_data_hits        = 0
prefetch_metadata_hits    = 0
hash_collisions           = 2486320
demand_data_hits          = 7097287
hash_chains               = 280319
deleted                   = 1301979
misses                    = 2263351
demand_metadata_misses    = 647243
evict_skip                = 47
p                         = 10211474432
c                         = 10737418240
prefetch_data_misses      = 0
recycle_miss              = 595519
hash_chain_max            = 11
class                     = misc
snaptime                  = 12168.15032689
mutex_miss                = 13682
mfu_ghost_hits            = 139332

You can see fields for the current size ("size") and the maximum size ("c_max") - the latter is set in my case to 10 GB via

set zfs:zfs_arc_max=0x280000000

in /etc/system.
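A quick way to convince yourself that this hex value matches the c_max of 10737418240 seen in the kstat output (Python used here purely as a calculator):

```python
# Sanity check: 0x280000000 bytes equals the c_max value shown above.
arc_max = 0x280000000
print(arc_max)                # 10737418240, as in the kstat output
print(arc_max // (1024**3))   # 10, i.e. 10 GB
```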

You see the counters for hits, misses, metadata misses, and so on. To get values per time unit, just take the differences between two measurements and format them - or just use arcstat.pl, which will yield output like this:

# /var/home/pascal/bin/arcstat.pl
    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
10:45:21   16M    2M     13    2M   13     0    0  653K    8    10G   10G
10:45:22   512   196     38   196   38     0    0    96   57    10G   10G
10:45:23   736   219     29   219   29     0    0    76   27    10G   10G
10:45:24   647   210     32   210   32     0    0    74   39    10G   10G

So, in the end, we know that 10 GB of the 16 GB of kernel memory is used for the ZFS cache.
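What arcstat.pl does internally is essentially this: subtract two cumulative kstat snapshots and divide by the elapsed interval. A sketch in Python, with invented sample counters (on the real system they would come from zfs:0:arcstats):

```python
# Per-second rates from two cumulative counter snapshots, as arcstat.pl
# computes them. The sample counter values below are invented.
def rates(prev, curr, interval):
    """Difference of two snapshots divided by the elapsed time."""
    return {key: (curr[key] - prev[key]) / interval for key in curr}

prev = {"hits": 14405090, "misses": 2263351}   # snapshot at time t
curr = {"hits": 14405406, "misses": 2263547}   # snapshot at t + 1 second
r = rates(prev, curr, interval=1.0)
reads = r["hits"] + r["misses"]                # "read" column: hits + misses
print(f"read/s={reads:.0f}  miss%={100 * r['misses'] / reads:.0f}")
```

With these numbers it prints read/s=512 and miss%=38, matching the shape of the arcstat.pl columns above.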

Editor note, July 16th 2008: Yes, yes, yes, you are right, there is a "kstat" command available (/usr/bin/kstat) and you can write:

kstat zfs:0:arcstats:size

to get the actual ARC cache size.
Just take a look at the kstat program: it is written in Perl and uses... Sun::Solaris::Kstat to retrieve the values... :)

To see which devices are accessible through your fibrechannel controller cards and the fabric, just use fcinfo.

First, find out which fibrechannel host adapters you have on your machine:

# fcinfo hba-port
HBA Port WWN: 2100001b321fe69b
        OS Device Name: /dev/cfg/c5
        Manufacturer: QLogic Corp.
        Model: 375-3355-01
        Firmware Version: 4.2.2
        FCode/BIOS Version:  BIOS: 1.24; fcode: 1.24; EFI: 1.8;
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 4Gb
        Current Speed: 4Gb
        Node WWN: 2000001b321fe69b
HBA Port WWN: 2100001b321fbd9c
        OS Device Name: /dev/cfg/c4
        Manufacturer: QLogic Corp.
        Model: 375-3355-01
        Firmware Version: 4.2.2
        FCode/BIOS Version:  BIOS: 1.24; fcode: 1.24; EFI: 1.8;
        Type: N-port
        State: online
        Supported Speeds: 1Gb 2Gb 4Gb
        Current Speed: 4Gb
        Node WWN: 2000001b321fbd9c

Here you see the firmware levels and types of your controller cards. You will also see the WWPN (port WWN) and the WWNN (node WWN) of each controller. As each of these cards has only one port, you will see only one WWPN per WWNN.

To find out which devices can be reached by these controllers, use the remote-port option of fcinfo. You have to give the port to query as an argument to "-p":


# fcinfo remote-port -p 2100001b321fe69b
Remote Port WWN: 220000d0232c50ab
        Active FC4 Types: SCSI
        SCSI Target: yes
        Node WWN: 200000d0232c50ab
Remote Port WWN: 2100001b321f9eb3
        Active FC4 Types:
        SCSI Target: no
        Node WWN: 2000001b321f9eb3
Remote Port WWN: 210000d0231c50be
        Active FC4 Types: SCSI
        SCSI Target: yes
        Node WWN: 200000d0231c50be

Through the port with the WWPN 2100001b321fe69b (which is on c5, as seen in the first fcinfo output above) you see three devices: two SCSI volumes and one other target - the fibrechannel controller of another machine.
The fcinfo command helped me a lot in tracking down a configuration error in my fibrechannel fabric.

Just as a side note: the last SCSI device in the example above has the WWNN 200000d0231c50be. This is one of the paths that make up the virtual scsi_vhci device c6t600D0230006C1C4C0C50BE5BC9D49100d0 mentioned in my entry about scsi_vhci (MPxIO) names.

Solaris MPxIO scsi_vhci device names

In case you wondered about the lengthy scsi_vhci device names and you're stuck finding the "right" device: I had the same problem, so I tried to match known characteristics of my setup to those names. Some parts are easy to understand.

In scsi_vhci names, all components identify a VOLUME (fibrechannel disk device) - not a connection, a switch, or a path - as it is a common name for all paths leading to that volume.

Let's look at a typical MPxIO name:

c6t600D0230006C1C4C0C50BE5BC9D49100d0

"c6" is easy: it is the virtual scsi_vhci controller. In my case, it subsumes the two real controllers c4 and c5 (both 4 Gbit/s QLogic fibrechannel cards). My real devices are:

c4                             fc-fabric    connected    configured   unknown
c5                             fc-fabric    connected    configured   unknown

Don't be surprised that both controllers are no longer directly accessible and that you won't find any volumes on them (you will see "unusable"): they have been taken over by scsi_vhci, which creates "controller 6".

"d0" at the end means LUN 0; scsi_vhci always uses LUN 0, even if your multipathed volumes are on different LUNs.

The interesting part is the target specification (as you may have noticed, even these device names follow the Solaris cXtYdZ pattern). In this case, the target is NOT a number but a long string:

t6 00D023 000 6C1C4C0C50BE 5BC9D491 00
   wwwwww     cccccccccccc vvvvvvvv

w means the common bytes 2-4 of the fibrechannel WWN (World Wide Name) of the multiple end ports of that device/volume.
c is the common controller ID of your device. It is unique and on many fibrechannel devices it can be set manually.
v is the fibrechannel volume ID. It is unique to your device. Using this ID, scsi_vhci recognizes which paths access the same device/volume - so never use identical fibrechannel IDs for different devices. This is important because some RAID devices set the volume ID to 00000000.
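Given this layout, the target string can be decoded programmatically. A sketch in Python - the field offsets simply follow the breakdown above; they are an observation about this setup, not an official format specification:

```python
# Split an MPxIO target string into the fields described in the text.
# Offsets follow the breakdown above; this is not an official spec.
def decode_mpxio_target(t):
    assert len(t) == 32, "expected a 32-character target string"
    return {
        "wwn_bytes_2_4": t[1:7],     # common WWN bytes of the end ports
        "controller_id": t[10:22],   # common controller ID of the device
        "volume_id":     t[22:30],   # unique volume ID
    }

name = "c6t600D0230006C1C4C0C50BE5BC9D49100d0"
target = name[name.index("t") + 1 : name.rindex("d")]   # strip cX...dZ
print(decode_mpxio_target(target))
```

For the example name this yields the WWN bytes 00D023, the controller ID 6C1C4C0C50BE and the volume ID 5BC9D491, as in the breakdown above.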

You can obviously put the slice at the end of the specification, so this slice definition is perfectly correct:

c6t600D0230006C1C4C0C50BE5BC9D49100d0s7

(in case you want to access slice 7).

If you intend to use ZFS/zpool omit the slice and use the entire device.



This blog is owned by:

Pascal Gienger
Jägerstrasse 77
8406 Winterthur
