Creating and manipulating zpools (zfs)


Zpools are the underlying device layer for ZFS filesystems. Mirrors, RAID sets and concatenated/striped storage are defined here.

For pooling devices, a zpool can be (see the command sketch after this list):

- a mirror
- a RAIDz with single or double parity
- a concatenated/striped storage
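
To give a first impression, the corresponding creation commands look roughly like this. This is only a hedged sketch: the pool name "mypool" is made up, and the disk names are the ones used later in this text, so substitute your own:

# zpool create mypool mirror c0d1 c1d0            (a two-way mirror)
# zpool create mypool raidz1 c0d1 c1d0 c1d1       (RAIDz with single parity)
# zpool create mypool c0d1 c1d0 c1d1              (concatenated/striped storage)

Each variant is walked through step by step below.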

This worksheet was done with Solaris 10 running in a Parallels virtual machine. The disks are not real; they are virtualized by Parallels, with 8 GB each. Not much, but enough to play with.


First, let's look up the disks accessible to our system:


# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c1d0 <DEFAULT cyl 1042 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
Specify disk (enter its number): ^C



Type CTRL-C to quit "format".

If your disks do not show up, use devfsadm:


# devfsadm
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 2085 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c0d1 <DEFAULT cyl 1042 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@1,0
       2. c1d0 <DEFAULT cyl 1042 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
       3. c1d1 <DEFAULT cyl 1042 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@1,0
Specify disk (enter its number): ^C



You'll notice that the virtual disks are mapped as IDE/ATA drives, so the disk device names don't have a target specification "t".



Let's create our first pool by simply putting together the three remaining disks (c0d0 holds our root partition and is the boot disk, so we cannot use it for our example):

# zpool create zfstest c0d1 c1d0 c1d1

That's it. You have just created a zpool named "zfstest" containing all three disks. Your available space will be just the sum of all three disks:

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zfstest                23.8G     91K   23.8G     0%  ONLINE     -

Use "zpool status" to get detailed status information of the components of your zpool:

# zpool status
  pool: zfstest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c0d1      ONLINE       0     0     0
          c1d0      ONLINE       0     0     0
          c1d1      ONLINE       0     0     0

errors: No known data errors

To destroy a pool, use "zpool destroy":

# zpool destroy zfstest

and your pool is gone.
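
Should the pool refuse to be destroyed because its filesystems are still busy (open files, a shell sitting in the mountpoint), the operation can usually be forced. Handle this with care, as everything in the pool is gone afterwards:

# zpool destroy -f zfstest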

Let's try a mirror now:

# zpool create zfstest mirror c1d0 c1d1

You just created a mirror of disks c1d0 and c1d1. The available storage is the same as if you used only one of these disks; if the disk sizes differ, the smaller one determines the usable size. Data is replicated between the disks.

"zpool status" now reads:

# zpool status
  pool: zfstest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: No known data errors

So now we have a simple mirror. But how do we put data on it?

When you create a zpool, a ZFS filesystem is automatically created in it. The mountpoint defaults to the pool name, so your pool "zfstest" is mounted as a ZFS filesystem at /zfstest:

# df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0d0s0      14951508 5725085 9076908    39%    /
/devices                   0       0       0     0%    /devices
ctfs                       0       0       0     0%    /system/contract
proc                       0       0       0     0%    /proc
mnttab                     0       0       0     0%    /etc/mnttab
swap                 2104456     836 2103620     1%    /etc/svc/volatile
objfs                      0       0       0     0%    /system/object
/usr/lib/libc/libc_hwcap1.so.1
                     14951508 5725085 9076908    39%    /lib/libc.so.1
fd                         0       0       0     0%    /dev/fd
swap                 2103624       4 2103620     1%    /tmp
swap                 2103644      24 2103620     1%    /var/run
zfstest              8193024      24 8192938     1%    /zfstest
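
The default mountpoint is just a property of the ZFS filesystem and can be changed at any time; ZFS remounts the filesystem at the new location by itself, no /etc/vfstab editing needed. A small hedged sketch, the path /export/zfstest being only an example:

# zfs get mountpoint zfstest
# zfs set mountpoint=/export/zfstest zfstest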

We will create a big file on it:

# dd if=/dev/zero bs=128k count=40000 of=/zfstest/bigfile
40000+0 records in
40000+0 records out

It is really there now:

# ls -la /zfstest
total 10241344
drwxr-xr-x   2 root     sys            3 Apr 21 11:15 .
drwxr-xr-x  39 root     root        1024 Apr 21 11:13 ..
-rw-r--r--   1 root     root     5242880000 Apr 21 11:30 bigfile

Now the differences from classical volume managers begin. The underlying zpool "zfstest" actually KNOWS that approximately 5 GB are in use by ZFS:

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zfstest                7.94G   4.88G   3.05G    61%  ONLINE     -

This has enormous advantages: when replacing a mirrored disk, ZFS copies only the allocated blocks to the new disk, not all blocks of the pool. The same is true for RAID devices: only allocated data blocks are reconstructed.
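
The same knowledge of allocated blocks is used by a scrub, which walks all data in the pool in the background and verifies its checksums. A hedged example on our test pool:

# zpool scrub zfstest
# zpool status zfstest

The "scrub:" line of "zpool status" shows the progress, just as with a resilver.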


Now let's break the mirror. You do that by detaching one drive from it:


# zpool detach zfstest c1d0

This command removed disk c1d0 from the pool. Your mirror is not a mirror any more:

# zpool status
  pool: zfstest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c1d1      ONLINE       0     0     0

errors: No known data errors

However, you did not lose a single bit of data! Your zpool is available just as it was before (as long as disk c1d1 does not fail).

You may attach another disk to your pool to create a new mirror:

# zpool attach zfstest c1d1 c0d1

Now your mirror consists of disks c1d1 and c0d1. Solaris will immediately begin to duplicate any block that's used by zfs from drive c1d1 to drive c0d1:

# zpool status
  pool: zfstest
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 55.53% done, 0h7m to go
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0

errors: No known data errors

The process of copying data onto new or outdated disks is called "resilvering".

A mirror is not limited to two disks. If you are seriously worried about losing your valuable data, just attach another disk to your mirror:

# zpool attach zfstest c0d1 c1d0

Your mirror now has three components. Note that your storage size does not grow by attaching new mirror components. But now two drives may fail completely and you will still have all of your data:

# zpool status
  pool: zfstest
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0

errors: No known data errors

Let's detach two disks now:

# zpool detach zfstest c1d1
# zpool detach zfstest c1d0

Your mirror is gone once again. To set up concatenated or striped storage (write operations on ZFS are spread across ALL pool members, so it behaves more like a striped disk set), you may add these disks to your pool (never mistake "add" for "attach": the former ADDs storage, the latter attaches disks to mirrors):

# zpool add zfstest c1d0 c1d1

Your pool has become the same as the one we created at the beginning of our exercise:

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zfstest                23.8G   4.88G   18.9G    20%  ONLINE     -
# zpool status
  pool: zfstest
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c0d1      ONLINE       0     0     0
          c1d0      ONLINE       0     0     0
          c1d1      ONLINE       0     0     0

errors: No known data errors

The only difference is our file "bigfile", which is still there because we did not destroy the pool. You can see it in the output of "zpool list" above: 4.88G is still used.

Now we are stuck: it is not possible to remove disks that have been added to a zpool. As writes are spread over all members, newly written data ends up on all disks, so there is no way to simply pull one out. Mirror component disks, on the other hand, can be detached at any time, as long as it is not the last disk of the mirror.
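
What you can do, however, is add a complete mirror as a new pool member, so the pool grows while staying redundant. A hedged sketch with two hypothetical spare disks c2d0 and c2d1; zpool would complain about the mismatched redundancy of the existing plain-disk members and may require -f:

# zpool add zfstest mirror c2d0 c2d1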

Let's destroy the pool and set up RAID storage. ZFS offers two RAID types: raidz1 and raidz2. raidz1 means single parity, raidz2 means double parity.

# zpool destroy zfstest
# zpool create zfstest raidz1 c0d1 c1d0 c1d1

We have now created a raid group:

# zpool status
  pool: zfstest
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0

errors: No known data errors

Be aware that "zpool list" shows the gross capacity of your RAID set, not the usable capacity:

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zfstest                23.9G    157K   23.9G     0%  ONLINE     -

To see how much space we are able to allocate, use a zfs command (zfs commands are explained in the zfs tutorial text):

# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
zfstest   101K  15.7G  32.6K  /zfstest

One disk may fail in this scenario.
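
You can play through a disk failure without touching any hardware by taking one member offline: the pool keeps running in a DEGRADED state and resilvers the disk when it comes back. A hedged sketch:

# zpool offline zfstest c1d1
# zpool status zfstest
# zpool online zfstest c1d1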

To set up the same pool with double parity:

# zpool destroy zfstest
# zpool create zfstest raidz2 c0d1 c1d0 c1d1
# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
zfstest  86.7K  7.80G  24.4K  /zfstest

Only 7.8 GB left, compared to 15.7 GB with the single-parity RAID device. Now two drives may fail without losing data.

We have achieved the same level of redundancy as with a three-way mirror, and with only three disks we also end up with roughly the same usable storage.
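
The rough arithmetic behind these numbers, assuming N equally sized disks of size S: raidz1 offers about (N-1) x S of usable space, raidz2 about (N-2) x S, and a mirror always just S, no matter how many disks it has. With three 8 GB disks that gives about 16 GB for raidz1 and 8 GB for raidz2, which matches the "zfs list" outputs above (minus a little metadata overhead).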

As a practical example, here is the output of "zpool status" and "zpool list" of a mailserver. The zpool "mail" consists of two mirror pairs added to a pool.

The creation command was:

# zpool create mail \
 mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
 mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0

As you can see, it is perfectly legal to combine the storage of two mirrors in one pool.

# zpool status
  pool: mail
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        mail                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600D0230006C1C4C0C50BE5BC9D49100d0  ONLINE       0     0     0
            c6t600D0230006B66680C50AB7821F0E900d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600D0230006B66680C50AB0187D75000d0  ONLINE       0     0     0
            c6t600D0230006C1C4C0C50BE27386C4900d0  ONLINE       0     0     0


# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
mail                   6.81T    3.08T  3.73T    45%  ONLINE     -
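
Such a pool can later be grown just like in our exercise: add another mirror pair and the new space is available immediately. A hedged sketch with two hypothetical new LUNs:

# zpool add mail mirror c6t<newLUN1>d0 c6t<newLUN2>d0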


As you can see, you can also use Sun MPxIO devices; they have LONG device names. You may also use FDISK partitions on x86 machines (...p0, ...p1, ...) and Solaris slices (...s0, ...s1, ...) to set up zpools. Neither is recommended for production use, but they are fine for playing with zpool commands.

The MPxIO names are usable because they show up just like normal block disk devices in /dev/dsk:

# ls -la /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0
lrwxrwxrwx   1 root     root          65 Dec 11 06:22 /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0 -> ../../devices/scsi_vhci/disk@g600d0230006c1c4c0c50be5bc9d49100:wd
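
If you have no spare disks to dedicate, a Solaris slice or FDISK partition is enough for experimenting. A hedged sketch; the slice c1d1s0 is only an example, take one that is really unused on your machine:

# zpool create playpool c1d1s0
# zpool destroy playpool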



