ZFS: Practicing failures on virtual hardware

I always used to sweat, and sweat bullets, when it came time to replace a failed disk in ZFS. It happened infrequently enough that I never did remember the syntax quite right in between issues, and the last thing you want to do with production hardware and data is fumble around in a cold sweat – you want to know what the correct syntax for everything is ahead of time, and be practiced and confident.

Wonderfully, there’s no reason for that anymore – particularly on Linux, which boasts a really excellent set of tools for simulating storage hardware quickly and easily. So today, we’re going to walk through setting up a pool, destroying one of the disks in it, and recovering from the failure. The important part here isn’t really the syntax for the replacement, though… it’s learning how to set up the simulation in the first place, which allows you to test lots of things properly when your butt isn’t on the line!

Prerequisites

Obviously, you need to have ZFS available. If you’re on Ubuntu Xenial, you can get it with apt update ; apt install zfs-linux. If you’re on an earlier version of Ubuntu, you’ll need to add the zfs-native PPA from the ZFS on Linux project, and install from there: apt-add-repository ppa:zfs-native/stable ; apt update ; apt install ubuntu-zfs.

You’ll also need the QEMU tools, to create and manage .qcow2 storage files, and qemu-nbd loopback type devices that access them like real hardware. Again on Ubuntu, that’s apt update ; apt install qemu-utils.

If you aren’t running Ubuntu, the tools are definitely still available, but you’ll need to consult your own distro’s documentation / forums / etc to figure out how to install ’em.

Creating the virtual disk back-end

First up, we create the back end files for the storage. In today’s example, that’ll be a pair of 1GB disks.

root@banshee:~# qemu-img create -f qcow2 /tmp/0.qcow2 1G ; qemu-img create -f qcow2 /tmp/1.qcow2 1G
Formatting '/tmp/0.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off 
Formatting '/tmp/1.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off

That does exactly what it looks like: creates a pair of 1GB virtual disks, /tmp/0.qcow2 and /tmp/1.qcow2. Qcow2 files are sparsely allocated, so they don’t actually take up any room at all yet – they’ll expand as and if you put data in them. (If you omit the -f qcow2 argument, qemu-img will create fully allocated RAW files instead, which will take longer.)

Creating the virtual disk devices

By themselves, our .qcow2 files don’t help us much. In order to use them as disks with ZFS, what we really need are block devices, in the /dev filesystem, which reference our qcow2 files. That’s where qemu-nbd comes in.

root@banshee:~# modprobe nbd max_part=63

You might or might not actually need that bit – but it never hurts. This makes sure the nbd kernel module is loaded, and, for safety’s sake, that it won’t try to load more than 63 partitions on a single virtual device. This might keep your system from crashing if you try to access a stupendously corrupt or maliciously crafted qcow2 file – I’m not sure what happens if a system thinks it has devices sda1 through sda65537, and I’d really rather not find out!

root@banshee:~# qemu-nbd -c /dev/nbd0 /tmp/0.qcow2 ; qemu-nbd -c /dev/nbd1 /tmp/1.qcow2

Easy peasy – we now have virtual disks /dev/nbd0 and /dev/nbd1, which can be accessed by ZFS (or any other filesystem or linux system utility) just like any other disk would be.

Setting up a zpool

This looks just like setting up any other pool. We’ll use the ashift argument to make the pool use 4K blocks, since that matches my underlying block device (which might or might not really matter, but it’s a good habit to get into anyway, and this is all about building good habits and muscle memory, right?)

root@banshee:~# zpool create -oashift=12 test mirror /dev/nbd0 /dev/nbd1
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    ONLINE       0     0     0
	    nbd1    ONLINE       0     0     0

errors: No known data errors

Easy peasy! We now have a pool with a simple two-disk mirror vdev.

Playing with topology: add more devices

What if we wanted to make it a pool of mirrors, with a second mirror vdev?

root@banshee:~# qemu-img create -f qcow2 /tmp/2.qcow2 1G ; qemu-img create -f qcow2 /tmp/3.qcow2 1G
Formatting '/tmp/2.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off 
Formatting '/tmp/3.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off 
root@banshee:~# qemu-nbd -c /dev/nbd2 /tmp/2.qcow2 ; qemu-nbd -c /dev/nbd3 /tmp/3.qcow2
root@banshee:~# zpool add -oashift=12 test mirror /dev/nbd2 /dev/nbd3
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    ONLINE       0     0     0
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

Couldn’t be easier.

Save some data on your new virtual hardware

Before we do anything else, let’s store some data on the new pool, so that we have something to detect corruption on later. (ZFS doesn’t checksum or scrub empty blocks, so won’t find corruption in them.)

root@banshee:~# pv < /dev/urandom > /test/urandom.bin
 408MB 0:00:37 [  15MB/s] [                                          <=>                    ]
^C

I used the pipe viewer utility here, pv, which is available on Ubuntu with apt update ; apt install pv.

The ^C is because I hit control-C to interrupt it once I felt that I’d written enough data. In this case, a bit over 400MB of random data, saved on our new pool in the file /test/urandom.bin.

root@banshee:~# zpool list test
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
test  1.97G   414M  1.56G         -    26%    20%  1.00x  ONLINE  -

root@banshee:~# zfs list test
NAME   USED  AVAIL  REFER  MOUNTPOINT
test   413M  1.50G   413M  /test

root@banshee:~# ls -lh /test
total 413M
-rw-r--r-- 1 root root 413M May 16 12:12 urandom.bin

Gravy.

I just want to kill something beautiful

What happens if a drive or a controller port goes berserk, and overwrites large swathes of a disk with zeroes? No time like the present to find out!

We don’t really want to write directly over a .qcow2 file, since our goal today is to test ZFS, not to test the QEMU infrastructure itself. We want to write to the actual device we created, and let the device put the data in the .qcow2 file. Let’s pick on /dev/nbd0, backed by /tmp/0.qcow2 – it looks uppity and in need of some comeuppance.

root@banshee:~# pv < /dev/zero > /dev/nbd0
 777MB 0:00:06 [48.9MB/s] [       <=>                                                     ]
^C

Boom. Errors galore. Does ZFS know about them yet?

root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    ONLINE       0     0     0
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

Nope! We really, really did simulate on-disk corruption, which ZFS has no way of knowing about until it tries to actually access the data.

Easter egging for errors

So, let’s actually access the data by reading in the entire /test/urandom.bin file, and then check our status:

root@banshee:~# pv < /test/urandom.bin > /dev/null
 412MB 0:00:00 [6.99GB/s] [=============================================>] 100%            
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    ONLINE       0     0     0
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

Hey – still nothing! Why not? Because /test/urandom.qcow2 is still in the ARC, that’s why! ZFS didn’t actually need to hit the storage to hand us the file, so it didn’t – and we still haven’t detected the massive amount of corruption we injected into /dev/nbd0.

We can get ZFS to actually look for the problem in a couple of ways. The cruder way is to export the pool and reimport it, which will have the side effect of dumping the ARC as well. The more professional way is to scrub the pool, which is a technique with the explicit design of reading and verifying every block. But first, let’s see what happens just reading from the storage normally, after the ARC isn’t holding the data anymore:

root@banshee:~# zpool export test
root@banshee:~# zpool import test
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    ONLINE       0     0     4
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

Ooh, we’ve already found a few errors – we injected so much corruption that we nailed a few of ZFS’ metadata blocks on /dev/nbd0, which were found immediately on importing the pool again. There were redundant copies of the metadata blocks available, though, so ZFS repaired the errors it found already with those.

Now that we’ve exported and imported the pool, which also dumped the ARC, let’s re-read the file in its entirety again, then see what our status looks like:

root@banshee:~# pv < /test/urandom.bin > /dev/null
 412MB 0:00:00 [ 563MB/s] [================================================>] 100%            
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    ONLINE       0     0     9
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

Five more blocks detected corrupt. Doesn’t seem like a lot, does it? Keep in mind, ZFS is only finding blocks that we specifically attempt to read from right now – so a lot of what would be corrupt blocks on /dev/nbd0, we actually read from /dev/nbd1 instead. And only half-ish of the data was saved to mirror-0 in the first place – the rest went to mirror-1. So, ZFS is finding and repairing corrupt blocks… but only a few of them so far. This was worth playing with to see what happens with undetected errors in normal use, but now that we’ve done this, let’s look for errors the right way.

It isn’t really clean until it’s been scrubbed

When we scrub the pool, we explicitly read every single block that’s been written and is actively in use on every single device, all at once.

So far, we’ve found nine corrupt blocks just poking around the system like we normally would in normal use. Will a scrub find more?

root@banshee:~# zpool scrub test
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 170M in 0h0m with 0 errors on Mon May 16 12:32:00 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    ONLINE       0     0 1.40K
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

There we go! We just went from 9 checksum errors, to 1.4 thousand checksum errors detected.

And that, folks, is why you want to regularly scrub your pool.

Can we do worse? What happens if we blow through every block on /dev/nbd0, instead of “just” 3/4 or so of them?

root@banshee:~# pv < /dev/zero > /dev/nbd0
pv: write failed: No space left on device<=>                                                 ]
root@banshee:~# zpool scrub test
root@banshee:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 16 12:36:38 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    UNAVAIL      0     0 1.40K  corrupted data
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

The CKSUM column hasn’t changed – but /dev/nbd0 now shows as UNAVAIL, with the corrupted data flag, because we blew through all the metadata including all copies of the disk label itself. We still have no data errors on the pool itself, but mirror-0, and the pool itself, are operating degraded now. Which we’ll want to fix.

Interestingly, we also just found a bug in ZFS – the pool itself as well as the mirror-0 vdev, should be showing DEGRADED in the STATE column, not ONLINE! I’m gonna go report that, be back in a minute…

Replacing a failed disk (Rise chicken. Chicken, rise.)

OK, now that we’ve thoroughly failed a disk, we can replace it. We happen to know – because we’re evil bastards who did it ourselves – that the actual disk itself is fine on the hardware level, we just basically degaussed it. So we could just replace it in the array in situ. That’s not usually going to be best practice with real hardware, though, so let’s more thoroughly simulate actually removing and replacing the disk with a new one.

First, the removal:

root@banshee:~# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected

Simple enough. ZFS doesn’t really know the disk is gone yet, but we can force it to figure out by scrubbing again before checking the status:

root@banshee:~# zpool scrub test
root@banshee:~# zpool status test
  pool: test
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Mon May 16 12:42:35 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        DEGRADED     0     0     0
	  mirror-0  DEGRADED     0     0     0
	    nbd0    UNAVAIL      0     0 1.40K  corrupted data
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

Interestingly, now that we’ve actually removed /dev/nbd0 in a hardware sense, our pool and vdev show – correctly – as DEGRADED, in the actual STATUS column as well as in the status message.

Regardless, now that we’ve “pulled” the disk, let’s “replace” it.

root@banshee:~# rm /tmp/0.qcow2
root@banshee:~# qemu-img create -f qcow2 /tmp/0.qcow2 1G
Formatting '/tmp/0.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 lazy_refcounts=off 
root@banshee:~# qemu-nbd -c /dev/nbd0 /tmp/0.qcow2

Easy enough – now, we’re in exactly the same boat we’d be in if this was a real machine and we’d physically removed and replaced the offending drive, but done nothing else. ZFS, of course, is still going to show degraded status – we need to give the new drive to the pool, as well as physically plugging it in. So let’s do that:

root@banshee:~# zpool replace test /dev/nbd0

Is it really that easy? Let’s check the status and find out:

root@banshee:~# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 206M in 0h0m with 0 errors on Mon May 16 12:47:20 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nbd0    ONLINE       0     0     0
	    nbd1    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    nbd2    ONLINE       0     0     0
	    nbd3    ONLINE       0     0     0

errors: No known data errors

Yep, it really is that easy.

Party’s over: clean up before you go home

Now that you’ve successfully tested what you set out to test, the last step is to clean up your virtual mess. In order, you need to destroy the test pool, disconnect the virtual devices, and delete the back-end storage files they referenced.

root@banshee:~# zpool destroy test
root@banshee:~# qemu-nbd -d /dev/nbd0 ; qemu-nbd -d /dev/nbd1 ; qemu-nbd -d /dev/nbd2 ; qemu-nbd -d /dev/nbd3
/dev/nbd0 disconnected
/dev/nbd1 disconnected
/dev/nbd2 disconnected
/dev/nbd3 disconnected
root@banshee:~# rm /tmp/0.qcow2 ; rm /tmp/1.qcow2 ; rm /tmp/2.qcow2 ; rm /tmp/3.qcow2

All done – clean as a whistle, and ready to set up the next simulation!

If you learned how to fail out a disk, awesome, but…

If you ended up here trying to figure out how to deal with and replace a failed disk, cool, and I hope you got what you were looking for. But please, remember the testing steps we did for the environment – they’re what this post is actually about. Learning how to set up your own virtual test environment will make you a much, much better admin down the line – and make you as cool as an action hero that one fateful day when it’s a real disk that’s faulted out of your real pool, and your data’s on the line. Nothing gives you confidence and takes away the stress like plenty of experience having done the exact same thing before.

Testing copies=n resiliency

I decided to see how well ZFS copies=n would stand up to on-disk corruption today. Spoiler alert: not great.

First step, I created a 1GB virtual disk, made a zpool out of it with 8K blocks, and set copies=2.

me@locutus:~$ sudo qemu-img create -f qcow2 /data/test/copies/0.qcow2 1G
me@locutus:~$ sudo qemu-nbd -c /dev/nbd0 /data/test/copies/0.qcow2 1G
me@locutus:~$ sudo zpool create -oashift=12 test /data/test/copies/0.qcow2
me@locutus:~$ sudo zfs set copies=2 test

Now, I wrote 400 1MB files to it – just enough to make the pool roughly 85% full, including the overhead due to copies=2.

me@locutus:~$ cat /tmp/makefiles.pl
#!/usr/bin/perl

for ($x=0; $x<400 ; $x++) {
	print "dd if=/dev/zero bs=1M count=1 of=$x\n";
	print `dd if=/dev/zero bs=1M count=1 of=$x`;
}

With the files written, I backed up my virtual filesystem, fully populated, so I can repeat the experiment later.

me@locutus:~$ sudo zpool export test
me@locutus:~$ sudo cp /data/test/copies/0.qcow2 /data/test/copies/0.qcow2.bak
me@locutus:~$ sudo zpool import test

Now, I write corrupt blocks to 10% of the filesystem. (Roughly: it's possible that the same block was overwritten with a garbage block more than once.) Note that I used a specific seed, so that I can recreate the scenario exactly, for more runs later.

me@locutus:~$ cat /tmp/corruptor.pl
#!/usr/bin/perl

# total number of blocks in the test filesystem
$numblocks=131072;

# desired percentage corrupt blocks
$percentcorrupt=.1;

# so we write this many corrupt blocks
$corruptloop=$numblocks*$percentcorrupt;

# consistent results for testing
srand 32767;

# generate 8K of garbage data
for ($x=0; $x<8*1024; $x++) {
	$garbage .= chr(int(rand(256)));
}

open FH, "> /dev/nbd0";

for ($x=0; $x<$corruptloop; $x++) {
	$blocknum = int(rand($numblocks-100));
	print "Writing garbage data to block $blocknum\n";
	seek FH, ($blocknum*8*1024), 0;
	print FH $garbage;
}

close FH;

Okay. When I scrub the filesystem I just wrote those 10,000 or so corrupt blocks to, what happens?

me@locutus:~$ sudo zpool scrub test ; sudo zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 133M in 0h1m with 1989 errors on Mon May  9 15:56:11 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0 1.94K
	  nbd0      ONLINE       0     0 8.94K

errors: 1989 data errors, use '-v' for a list
me@locutus:~$ sudo zpool status -v test | grep /test/ | wc -l
385

OUCH. 385 of my 400 total files were still corrupt after the scrub! Copies=2 didn't do a great job for me here. 🙁

What if I try it again, this time just writing garbage to 1% of the blocks on disk, not 10%? First, let's restore that backup I cleverly made:

root@locutus:/data/test/copies# zpool export test
root@locutus:/data/test/copies# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected
root@locutus:/data/test/copies# pv < 0.qcow2.bak > 0.qcow2
 999MB 0:00:00 [1.63GB/s] [==================================>] 100%            
root@locutus:/data/test/copies# qemu-nbd -c /dev/nbd0 /data/test/copies/0.qcow2
root@locutus:/data/test/copies# zpool import test
root@locutus:/data/test/copies# zpool status test | tail -n 5
	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  nbd0      ONLINE       0     0     0

errors: No known data errors

Alright, now let's change $percentcorrupt from 0.1 to 0.01, and try again. How'd we do after only corrupting 1% of the blocks on disk?

root@locutus:/data/test/copies# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 101M in 0h0m with 72 errors on Mon May  9 16:13:49 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0    72
	  nbd0      ONLINE       0     0 1.08K

errors: 64 data errors, use '-v' for a list
root@locutus:/data/test/copies# zpool status test -v | grep /test/ | wc -l
64

Still not great. We lost 64 out of our 400 files. Tenth of a percent?

root@locutus:/data/test/copies# zpool status -v test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 12.1M in 0h0m with 2 errors on Mon May  9 16:22:30 2016
config:

	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     2
	  nbd0      ONLINE       0     0   105

errors: Permanent errors have been detected in the following files:

        /test/300
        /test/371

Damn. We still lost two files, even with only writing 130 or so corrupt blocks. (The missing 26 corrupt blocks weren't picked up by the scrub because they happened in the 15% or so of unused space on the pool, presumably: a scrub won't check unused blocks.) OK, what if we try a control - how about we corrupt the same tenth of a percent of the filesystem (105 blocks or so), this time without copies=2 set? To make it fair, I wrote 800 1MB files to the same filesystem this time using the default copies=1 - this is more files, but it's the same percentage of the filesystem full. (Interestingly, this went a LOT faster. Perceptibly, more than twice as fast, I think, although I didn't actually time it.)

Now with our still 84% full /test zpool, but this time with copies=1, I corrupted the same 0.1% of the total block count.

root@locutus:/data/test/copies# zpool status test
  pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 8K in 0h0m with 98 errors on Mon May  9 16:28:26 2016
config:

	NAME                         STATE     READ WRITE CKSUM
	test                         ONLINE       0     0    98
	  /data/test/copies/0.qcow2  ONLINE       0     0   198

errors: 93 data errors, use '-v' for a list

Without copies=2 set, we lost 93 files instead of 2. So copies=n was definitely better than nothing for our test of writing 0.1% of the filesystem as bad blocks... but it wasn't fabulous, either, and it fell flat on its face with 1% or 10% of the filesystem corrupted. By comparison, a truly redundant pool - one with two disks in it, in a mirror vdev - would have survived the same test (corrupting ANY number of blocks on a single disk) with flying colors.

The TL;DR here is that copies=n is better than nothing... but not by a long shot, and you do give up a lot of performance for it. Conclusion: play with it if you like, but limit it only to extremely important data, and don't make the mistake of thinking it's any substitute for device redundancy, much less backups.

zfs: copies=n is not a substitute for device redundancy!

I’ve been seeing a lot of misinformation flying around the web lately about the zfs dataset-level feature copies=n. To be clear, dangerous misinformation. So dangerous, I’m going to go ahead and give you the punchline in the title of this post and in its first paragraph: copies=n does not give you device fault tolerance!

Why does copies=n actually exist then? Well, it’s a sort of (extremely) poor cousin that helps give you a better chance of surviving data corruption. Let’s say you have a laptop, you’ve set copies=3 on some extremely critical work-related datasets, and the drive goes absolutely bonkers and starts throwing tens of thousands of checksum errors. Since there’s only one disk in the laptop, ZFS can’t correct the checksum errors, only detect them… except on that critical dataset, maybe and hopefully, because each block has multiple copies. So if a given block has been written three times and any single copy of that block reads so as to match its validation hash, that block will get served up to you intact.

So far, so good. The problem is that I am seeing people advocating scenarios like “oh, I’ll just add five disks as single-disk vdevs to a pool, then make sure to set copies=2, and that way even if I lose a disk I still have all the data.” No, no, and no. But don’t take my word for it: let’s demonstrate.

First, let’s set up a test pool using virtual disks.

root@banshee:/tmp# qemu-img create -f qcow2 0.qcow2 10G ; qemu-img create -f qcow2 1.qcow2 10G
Formatting '0.qcow2', fmt=qcow2 size=10737418240 encryption=off cluster_size=65536 lazy_refcounts=off 
Formatting '1.qcow2', fmt=qcow2 size=10737418240 encryption=off cluster_size=65536 lazy_refcounts=off 
root@banshee:/tmp# qemu-nbd -c /dev/nbd0 /tmp/0.qcow2 ; qemu-nbd -c /dev/nbd1 /tmp/1.qcow2
root@banshee:/tmp# zpool create test /dev/nbd0 /dev/nbd1
root@banshee:/tmp# zpool status test
  pool: test
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    test        ONLINE       0     0     0
      nbd0      ONLINE       0     0     0
      nbd1      ONLINE       0     0     0

errors: No known data errors

Now let’s set copies=2, and then create a couple of files in our pool.

root@banshee:/tmp# zfs set copies=2 test
root@banshee:/tmp# dd if=/dev/urandom bs=4M count=1 of=/test/test1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.310805 s, 13.5 MB/s
root@banshee:/tmp# dd if=/dev/urandom bs=4M count=1 of=/test/test2
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.285544 s, 14.7 MB/s

Let’s confirm that copies=2 is working.

We should see about 8MB of data on each of our virtual disks – one for each copy of each of our 4MB test files.

root@banshee:/tmp# ls -lh *.qcow2
-rw-r--r-- 1 root root 9.7M May  2 13:56 0.qcow2
-rw-r--r-- 1 root root 9.8M May  2 13:56 1.qcow2

Yep, we’re good – we’ve written a copy of each of our two 4MB files to each virtual disk.

Now fail out a disk:

root@banshee:/tmp# zpool export test
root@banshee:/tmp# qemu-nbd -d /dev/nbd1
/dev/nbd1 disconnected

Will a pool with copies=2 and one missing disk import?

root@banshee:/tmp# zpool import
   pool: test
     id: 15144803977964153230
  state: UNAVAIL
 status: One or more devices are missing from the system.
 action: The pool cannot be imported. Attach the missing
    devices and try again.
   see: http://zfsonlinux.org/msg/ZFS-8000-6X
 config:

    test         UNAVAIL  missing device
      nbd0       ONLINE

    Additional devices are known to be part of this pool, though their
    exact configuration cannot be determined.

That’s a resounding “no”.

Can we force an import?

root@banshee:/tmp# zpool import -f test
cannot import 'test': one or more devices is currently unavailable

No – your data is gone.

Please let this be a lesson: no, copies=n is not a substitute for redundancy or parity, and yes, losing any vdev does lose the pool.

Reshuffling pool storage on the fly

If you’re new here:

Sanoid is an open-source storage management project, built on top of the OpenZFS filesystem and Linux KVM hypervisor, with the aim of providing affordable, open source, enterprise-class hyperconverged infrastructure. Most of what we’re talking about today boils down to “managing ZFS storage” – although Sanoid’s replication management tool Syncoid does make the operation a lot less complicated.

Recently, I deployed two Sanoid appliances to a new customer in Raleigh, NC.

When the customer specced out their appliances, their plan was to deploy one production server and one offsite DR server – and they wanted to save a little money, so the servers were built out differently. Production had two SSDs and six conventional disks, but offsite DR just had eight conventional disks – not like DR needs a lot of IOPS performance, right?

Well, not so right. When I got onsite, I discovered that the “disaster recovery” site was actually a working space, with a mission critical server in it, backed up only by a USB external disk. So we changed the plan: instead of a production server and an offsite DR server, we now had two production servers, each of which replicated to the other for its offsite DR. This was a big perk for the customer, because the lower-specced “DR” appliance still handily outperformed their original server, as well as providing ZFS and Sanoid’s benefits of rolling snapshots, offsite replication, high data integrity, and so forth.

But it still bothered me that we didn’t have solid state in the second suite.

The main suite had two pools – one solid state, for boot disks and database instances, and one rust, for bulk storage (now including backups of this suite). Yes, our second suite was performing better now than it had been on their original, non-Sanoid server… but they had a MySQL instance that tended to be noticeably slow on inserts, and the desire to put that MySQL instance on solid state was just making me itch. Problem is, the client was 250 miles away, and their Sanoid Standard appliance was full – eight hot-swap bays, each of which already had a disk in it. No more room at the inn!

We needed minimal downtime, and we also needed minimal on-site time for me.

You can’t remove a vdev from an existing pool, so we couldn’t just drop the existing four-mirror pool to a three-mirror pool. So what do you do? We could have stuffed the new pair of SSDs somewhere inside the case, but I really didn’t want to give up the convenience of externally accessible hot swap bays.

So what do you do?

In this case, what you do – after discussing all the pros and cons with the client decision makers, of course – is you break some vdevs. Our existing pool had four mirrors, like this:

	NAME                              STATE     READ WRITE CKSUM
	data                              ONLINE       0     0     0
	  mirror-0                        ONLINE       0     0     0
	    wwn-0x50014ee20b8b7ba0-part3  ONLINE       0     0     0
	    wwn-0x50014ee20be7deb4-part3  ONLINE       0     0     0
	  mirror-1                        ONLINE       0     0     0
	    wwn-0x50014ee261102579-part3  ONLINE       0     0     0
	    wwn-0x50014ee2613cc470-part3  ONLINE       0     0     0
	  mirror-2                        ONLINE       0     0     0
	    wwn-0x50014ee2613cfdf8-part3  ONLINE       0     0     0
	    wwn-0x50014ee2b66693b9-part3  ONLINE       0     0     0
          mirror-3                        ONLINE       0     0     0
            wwn-0x50014ee20b9b4e0d-part3  ONLINE       0     0     0
            wwn-0x50014ee2610ffa17-part3  ONLINE       0     0     0

Each of those mirrors can be broken, freeing up one disk – at the expense of removing redundancy on that mirror, of course. At first, I thought I’d break all the mirrors, create a two-mirror pool, migrate the data, then destroy the old pool and add one more mirror to the new pool. And that would have worked – but it would have left the data unbalanced, so that the majority of reads would only hit two of my three mirrors. I decided to go for the cleanest result possible – a three mirror pool with all of its data distributed equally across all three mirrors – and that meant I’d need to do my migration in two stages, with two periods of user downtime.

First, I broke mirror-0 and mirror-1.

I detached a single disk from each of my first two mirrors, then cleared its ZFS label afterward.

    root@client-prod1:/# zpool detach wwn-0x50014ee20be7deb4-part3 ; zpool labelclear wwn-0x50014ee20be7deb4-part3
    root@client-prod1:/# zpool detach wwn-0x50014ee2613cc470-part3 ; zpool labelclear wwn-0x50014ee2613cc470-part3

Now mirror-0 and mirror-1 are in DEGRADED condition, as is the pool – but it’s still up and running, and the users (who are busily working on storage and MySQL virtual machines hosted on the Sanoid Standard appliance we’re shelled into) are none the wiser.

Now we can create a temporary pool with the two freed disks.

We’ll also be sure to set compression on by default for all datasets created on or replicated onto our new pool – something I truly wish was the default setting for OpenZFS, since for almost all possible cases, LZ4 compression is a big win.

    root@client-prod1:/# zpool create -o ashift=12 tmppool mirror wwn-0x50014ee20be7deb4-part3 wwn-0x50014ee2613cc470-part3
    root@client-prod1:/# zfs set compression=lz4 tmppool

We haven’t really done much yet, but it felt like a milestone – we can actually start moving data now!

Next, we use Syncoid to replicate our VMs onto the new pool.

At this point, these are still running VMs – so our users won’t see any downtime yet. After doing an initial replication with them up and running, we’ll shut them down and do a “touch-up” – but this way, we get the bulk of the work done with all systems up and running, keeping our users happy.

    root@client-prod1:/# syncoid -r data/images tmppool/images ; syncoid -r data/backup tmppool/backup

This took a while, but I was very happy with the performance – never dipped below 140MB/sec for the entire replication run. Which also strongly implies that my users weren’t seeing a noticeable amount of slowdown! This initial replication completed in a bit over an hour.

Now, I was ready for my first little “blip” of actual downtime.

First, I shut down all the VMs running on the machine:

    root@client-prod1:/# virsh shutdown suite100 ; virsh shutdown suite100-mysql ; virsh shutdown suite100-openvpn
    root@client-prod1:/# watch -n 1 virsh list

As soon as virsh list showed me that the last of my three VMs were down, I ctrl-C’ed out of my watch command and replicated again, to make absolutely certain that no user data would be lost.

    root@client-prod1:/# syncoid -r data/images tmppool/images ; syncoid -r data/backup tmppool/backup

This time, my replication was done in less than ten seconds.

Doing replication in two steps like this is a huge win for uptime, and a huge win for the users – while our initial replication needed a little more than an hour, the “touch-up” only had to copy as much data as the users could store in a few moments, so it was done in a flash.

Next, it’s time to rename the pools.

Our system expects to find the storage for its VMs in /data/images/VMname, so for minimum downtime and reconfiguration, we’ll just export and re-import our pools so that it finds what it’s looking for.

    root@client-prod1:/# zpool export data ; zpool import data olddata 
    root@client-prod1:/# zfs set mountpoint=/olddata/images/qemu olddata/images/qemu ; zpool export olddata

Wait, what was that extra step with the mountpoint?

Sanoid keeps the virtual machines’ hardware definitions on the zpool rather than on the root filesystem – so we want to make sure our old pool’s ‘qemu’ dataset doesn’t try to automount itself back to its original mountpoint, /etc/libvirt/qemu.

    root@client-prod1:/# zpool export tmppool ; zpool import tmppool data
    root@client-prod1:/# zfs set mountpoint=/etc/libvirt/qemu data/images/qemu

OK, at this point our original, degraded zpool still exists, intact, as an exported pool named olddata; and our temporary two disk pool exists as an active pool named data, ready to go.

After less than one minute of downtime, it’s time to fire up the VMs again.

    root@client-prod1:/# virsh start suite100 ; virsh start suite100-mysql ; virsh start suite100-openvpn

If anybody took a potty break or got up for a fresh cup of coffee, they probably missed our first downtime window entirely. Not bad!

Time to destroy the old pool, and re-use its remaining disks.

After a couple of checks to make absolutely sure everything was working – not that it shouldn’t have been, but I’m definitely of the “measure twice, cut once” school, especially when the equipment is a few hundred miles away – we’re ready for the first completely irreversible step in our eight-disk fandango: destroying our original pool, so that we can create our final one.

    root@client-prod1:/# zpool destroy olddata
    root@client-prod1:/# zpool create -o ashift=12 newdata mirror wwn-0x50014ee20b8b7ba0-part3 wwn-0x50014ee261102579-part3
    root@client-prod1:/# zpool add -o ashift=12 newdata mirror wwn-0x50014ee2613cfdf8-part3 wwn-0x50014ee2b66693b9-part3
    root@client-prod1:/# zpool add -o ashift=12 newdata mirror wwn-0x50014ee20b9b4e0d-part3 wwn-0x50014ee2610ffa17-part3
    root@client-prod1:/# zfs set compression=lz4 newdata

Perfect! Our new, final pool with three mirrors is up, LZ4 compression is enabled, and it’s ready to go.

Now we do an initial Syncoid replication to the final, six-disk pool:

    root@client-prod1:/# syncoid -r data/images newdata/images ; syncoid -r data/backup newdata/backup

About an hour later, it’s time to shut the VMs down for Brief Downtime Window #2.

    root@client-prod1:/# virsh shutdown suite100 ; virsh shutdown suite100-mysql ; virsh shutdown suite100-openvpn
    root@client-prod1:/# watch -n 1 virsh list

Once our three VMs are down, we ctrl-C out of ‘watch’ again, and…

Time for our final “touch-up” re-replication:

    root@client-prod1:/# syncoid -r data/images newdata/images ; syncoid -r data/backup newdata/backup

At this point, all the actual data is where it should be, in the right datasets, on the right pool.

We fix our mountpoints, shuffle the pool names, and fire up our VMs again:

    root@client-prod1:/# zpool export data ; zpool import data tmppool 
    root@client-prod1:/# zfs set mountpoint=/tmppool/images/qemu olddata/images/qemu ; zpool export tmppool
    root@client-prod1:/# zpool export newdata ; zpool import newdata data
    root@client-prod1:/# zfs set mountpoint=/etc/libvirt/qemu data/images/qemu
    root@client-prod1:/# virsh start suite100 ; virsh start suite100-mysql ; virsh start suite100-openvpn

Boom! Another downtime window over with in less than a minute.

Our TOTAL elapsed downtime was less than two minutes.

At this point, our users are up and running on the final three-mirror pool, and we won’t be inconveniencing them again today. Again we do some testing to make absolutely certain everything’s fine, and of course it is.

The very last step: destroying tmppool.

    root@client-prod1:/# zpool destroy tmppool

That’s it; we’re done for the day.

We’re now up and running on only six total disks, not eight, which gives us the room we need to physically remove two disks. With those two disks gone, we’ve got room to slap in a pair of SSDs for a second pool with a solid-state mirror vdev when we’re (well, I’m) there in person, in a week or so. That will also take a minute or less of actual downtime – and in that case, the preliminary replication will go ridiculously fast too, since we’ll only be moving the MySQL VM (less than 20G of data), and we’ll be writing at solid state device speeds (upwards of 400MB/sec, for the Samsung Pro 850 series I’ll be using).

None of this was exactly rocket science. So why am I sharing it?

Well, it’s pretty scary going in to deliberately degrade a production system, so I wanted to lay out a roadmap for anybody else considering it. And I definitely wanted to share the actual time taken for the various steps – I knew my downtime windows would be very short, but honestly I’d been a little unsure how the initial replication would go, given that I was deliberately breaking mirrors and degrading arrays. But it went great! 140MB/sec sustained throughput makes even pretty substantial tasks go by pretty quickly – and aside from the two intervals with a combined downtime of less than two minutes, my users never even noticed anything happening.

Closing with a plug: yes, you can afford it.

If this kind of converged infrastructure (storage and virtualization) management sounds great to you – high performance, rapid onsite and offsite replication, nearly zero user downtime, and a whole lot more – let me add another bullet point: low cost. Getting started isn’t prohibitively expensive.

Sanoid appliances like the ones we’re describing here – including all the operating systems, hardware, and software needed to run your VMs and manage their storage and automatically replicate them both on and offsite – start at less than $5,000. For more information, send us an email, or call us at (803) 250-1577.

libguestfs0 and ZFS on Linux in Ubuntu

Trying to get Kimchi installed this morning, I ran into a roadblock almost immediately: libguestfs-tools depends on libguestfs0, which, on Ubuntu at least, stupidly has a hard dependency on zfs-fuse. Which is a dead project, and which conflicts with zfsutils.

In the real world, you might want libguestfs-tools without ever wanting the first thing to do with ANY form of zfs, so this dependency is a really bad idea. Even if you DO want to use libguestfs-tools WITH zfs, it’s an incredibly bad idea because zfsutils – part of ZFS on Linux – provides all the functionality needed already. Unfortunately, the package maintainers don’t seem to quite understand the issues here – I’m guessing none of them are ZFS people – so that leaves you with the need to edit the dependencies yourself.

Luckily, that’s not too hard. First, you’ll need a script, which we’ll call debedit:

#!/bin/bash

EDITOR=nano

if [[ -z "$1" ]]; then
  echo "Syntax: $0 debfile"
  exit 1
fi

DEBFILE="$1"
TMPDIR=`mktemp -d /tmp/deb.XXXXXXXXXX` || exit 1
OUTPUT=`basename "$DEBFILE" .deb`.modified.deb

if [[ -e "$OUTPUT" ]]; then
  echo "$OUTPUT exists."
  rm -r "$TMPDIR"
  exit 1
fi

dpkg-deb -x "$DEBFILE" "$TMPDIR"
dpkg-deb --control "$DEBFILE" "$TMPDIR"/DEBIAN

if [[ ! -e "$TMPDIR"/DEBIAN/control ]]; then
  echo DEBIAN/control not found.

  rm -r "$TMPDIR"
  exit 1
fi

CONTROL="$TMPDIR"/DEBIAN/control

MOD=`stat -c "%y" "$CONTROL"`
$EDITOR "$CONTROL"

if [[ "$MOD" == `stat -c "%y" "$CONTROL"` ]]; then
  echo Not modfied.
else
  echo Building new deb...
  dpkg -b "$TMPDIR" "$OUTPUT"
fi

rm -r "$TMPDIR"

Save that, name it debedit, and chmod 755 it.

Now, you’ll need to download libguestfs0, which is the package that has the bad dependencies, which you’ll edit:

you@box:~$ apt-get download libguestfs0
you@box:~$ ./debedit libguest*deb

Remove the zfs-fuse dependency from the Depends: line in the deb file, and exit nano. Finally, install your modified libguestfs0 package:

you@box:~$ sudo dpkg -i *modified.deb ; sudo apt-get -f install

All done! At least, until and unless the next update to libguestfs0 downloads and attempts to install a new .deb that wants to put that dependency right back again, in which case you’ll need to lather-rinse-repeat.

I me-too’ed an existing bug at https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1053911 ; if you’re affected, you probably should too.

ZFS compression: yes, you want this

So ZFS dedup is a complete lose. What about compression?

Compression is a hands-down win. LZ4 compression should be on by default for nearly anything you ever set up under ZFS. I typically have LZ4 on even for datasets that will house database binaries… yes, really. Let’s look at two quick test runs, on a Xeon E3 server with 32GB ECC RAM and a pair of Samsung 850 EVO 1TB disks set up as a mirror vdev.

This is an inline compression torture test: we’re reading pseudorandom data (completely incompressible) and writing it to an LZ4 compressed dataset.

root@lab:/data# pv < in.rnd > incompressible/out.rnd
7.81GB 0:00:22 [ 359MB/s] [==================================>] 100%

root@lab:/data# zfs get compressratio data/incompressible
NAME                 PROPERTY       VALUE  SOURCE
data/incompressible  compressratio  1.00x  -

359MB/sec write… yyyyyeah, I’d say LZ4 isn’t hurting us too terribly here – and this is a worst case scenario. What about something a little more realistic? Let’s try again, this time with a raw binary of my Windows Server 2012 R2 “gold” image (the OS is installed and Windows Updates are applied, but nothing else is done to it):

root@lab:/data/test# pv < win2012r2-gold.raw > realworld/win2012r2-gold.out
8.87GB 0:00:17 [ 515MB/s] [==================================>] 100%

Oh yeah – 515MB/sec this time. Definitely not hurting from using our LZ4 compression. What’d we score for a compression ratio?

root@lab:/data# zfs get compressratio data/realworld
NAME            PROPERTY       VALUE  SOURCE
data/realworld  compressratio  1.48x  -

1.48x sounds pretty good! Can we see some real numbers on that?

root@lab:/data# ls -lh /data/realworld/win2012r2-gold.raw
-rw-rw-r-- 1 root root 8.9G Feb 24 18:01 win2012r2-gold.raw
root@lab:/data# du -hs /data/realworld
6.2G	/data/realworld

8.9G of data in 6.2G of space… with sustained writes of 515MB/sec.

What if we took our original 8G of incompressible data, and wrote it to an uncompressed dataset?

root@lab:/data#  zfs create data/uncompressed
root@lab:/data# zfs set compression=off data/uncompressed
root@lab:/data# cat 8G.in > /dev/null ; # this is to make sure our source data is preloaded in the ARC
root@lab:/data# pv < 8G.in > uncompressed/8G.out
7.81GB 0:00:21 [ 378MB/s] [==================================>] 100%

So, our worst case scenario – completely incompressible data – means a 5% performance hit, and a more real-world-ish scenario – copying a Windows Server installation – means a 27% performance increase. That’s on fast solid state, of course; the performance numbers will look even better on slower storage (read: spinning rust), where even worst-case writes are unlikely to slow down at all.

Yep, that’s a win.

ZFS dedup: tested, found wanting

Even if you have the RAM for it (and we’re talking a good 6GB or so per TB of storage), ZFS deduplication is, unfortunately, almost certainly a lose.

I don’t usually have that much RAM to spare, but one server has 192GB of RAM and only a few terabytes of storage – and it stores a lot of VM images, with obvious serious block-level duplication between images. Dedup shows at 1.35+ on all the datasets, and would be higher if one VM didn’t have a couple of terabytes of almost dup-free data on it.

That server’s been running for a few years now, and nobody using it has complained. But I was doing some maintenance on it today, splitting up VMs into their own datasets, and saw some truly abysmal performance.

root@virt0:/data/images# pv < jabberserver.qcow2 > jabber/jabberserver.qcow2 206MB 0:00:31 [7.14MB/s] [>                  ]  1% ETA 0:48:41

7MB/sec? UGH! And that’s not even a sustained average; that’s just where it happened to be when I killed the process. This server should be able to sustain MUCH better performance than that, even though it’s reading and writing from the same pool. So I checked, and saw that dedup was on:

root@virt0:~# zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
data  7.06T  2.52T  4.55T    35%  1.35x  ONLINE  -

In theory, you’d think that dedup would help tremendously with exactly this operation: copying a quiesced VM from one dataset to another on the same pool. There’s no need for a single block of data to be rewritten, just more pointers added to the metadata for the existing blocks. However, dedup looked like the obvious culprit for my performance woes here, so I disabled it and tried again:

root@virt0:/data/images# pv < jabberserver.qcow2 > jabber/jabberserver.qcow219.2GB 0:04:58 [65.7MB/s] [============>] 100%

Yep, that’s more like it.

TL;DR: ZFS dedup sounds like a great idea, but in the real world, it sucks. Even on a machine built to handle it. Even on exactly the kind of storage (a bunch of VMs with similar or identical operating systems) that seems tailor-made for it. I do not recommend its use for pretty much any conceivable workload.

(On the other hand, LZ4 compression is an unqualified win.)

ZFS: You should use mirror vdevs, not RAIDZ.

Continuing this week’s “making an article so I don’t have to keep typing it” ZFS series… here’s why you should stop using RAIDZ, and start using mirror vdevs instead.

The basics of pool topology

A pool is a collection of vdevs. Vdevs can be any of the following (and more, but we’re keeping this relatively simple):

single disks (think RAID0)
redundant vdevs (aka mirrors – think RAID1)
parity vdevs (aka stripes – think RAID5/RAID6/RAID7, aka single, dual, and triple parity stripes)

The pool itself will distribute writes among the vdevs inside it on a relatively even basis. However, this is not a “stripe” like you see in RAID10 – it’s just distribution of writes. If you make a RAID10 out of 2 2TB drives and 2 1TB drives, the second TB on the bigger drives is wasted, and your after-redundancy storage is still only 2 TB. If you put the same drives in a zpool as two mirror vdevs, they will be a 2x2TB mirror and a 2x1TB mirror, and your after-redundancy storage will be 3TB. If you keep writing to the pool until you fill it, you may completely fill the two 1TB disks long before the two 2TB disks are full. Exactly how the writes are distributed isn’t guaranteed by the specification, only that they will be distributed.

What if you have twelve disks, and you configure them as two RAIDZ2 (dual parity stripe) vdevs of six disks each? Well, your pool will consist of two RAIDZ2 arrays, and it will distribute writes across them just like it did with the pool of mirrors. What if you made a ten disk RAIDZ2, and a two disk mirror? Again, they go in the pool, the pool distributes writes across them. In general, you should probably expect a pool’s performance to exhibit the worst characteristics of each vdev inside it. In practice, there’s no guarantee where reads will come from inside the pool – they’ll come from “whatever vdev they were written to”, and the pool gets to write to whichever vdevs it wants to for any given block(s).

Storage Efficiency

If it isn’t clear from the name, storage efficiency is the ratio of usable storage capacity (after redundancy or parity) to raw storage capacity (before redundancy or parity).

This is where a lot of people get themselves into trouble. “Well obviously I want the most usable TB possible out of the disks I have, right?” Probably not. That big number might look sexy, but it’s liable to get you into a lot of trouble later. We’ll cover that further in the next section; for now, let’s just look at the storage efficiency of each vdev type.

single disk vdev(s) – 100% storage efficiency. Until you lose any single disk, and it becomes 0% storage efficency…
eight single-disk vdevs
RAIDZ1 vdev(s) – (n-1)/n, where n is the number of disks in each vdev. For example, a RAIDZ1 of eight disks has an SE of 7/8 = 87.5%.
an eight disk raidz1 vdev
RAIDZ2 vdev(s) – (n-2)/n. For example, a RAIDZ2 of eight disks has an SE of 6/8 = 75%.
an eight disk raidz2 vdev
RAIDZ3 vdev(s) – (n-3)/n. For example, a RAIDZ3 of eight disks has an SE of 5/8 = 62.5%.
an eight disk raidz3 vdev
mirror vdev(s) – 1/n, where n is the number of disks in each vdev. Eight disks set up as 4 2-disk mirror vdevs have an SE of 1/2 = 50%.
a pool of four 2-disk mirror vdevs

One final note: striped (RAIDZ) vdevs aren’t supposed to be “as big as you can possibly make them.” Experts are cagey about actually giving concrete recommendations about stripe width (the number of devices in a striped vdev), but they invariably recommend making them “not too wide.” If you consider yourself an expert, make your own expert decision about this. If you don’t consider yourself an expert, and you want more concrete general rule-of-thumb advice: no more than eight disks per vdev.

Fault tolerance / degraded performance

Be careful here. Keep in mind that if any single vdev fails, the entire pool fails with it. There is no fault tolerance at the pool level, only at the individual vdev level! So if you create a pool with single disk vdevs, any failure will bring the whole pool down.

It may be tempting to go for that big storage number and use RAIDZ1… but it’s just not enough. If a disk fails, the performance of your pool will be drastically degraded while you’re replacing it. And you have no fault tolerance at all until the disk has been replaced and completely resilvered… which could take days or even weeks, depending on the performance of your disks, the load your actual use places on the disks, etc. And if one of your disks failed, and age was a factor… you’re going to be sweating bullets wondering if another will fail before your resilver completes. And then you’ll have to go through the whole thing again every time you replace a disk. This sucks. Don’t do it. Conventional RAID5 is strongly deprecated for exactly the same reasons. According to Dell, “Raid 5 for all business critical data on any drive type [is] no longer best practice.”

RAIDZ2 and RAIDZ3 try to address this nightmare scenario by expanding to dual and triple parity, respectively. This means that a RAIDZ2 vdev can survive two drive failures, and a RAIDZ3 vdev can survive three. Problem solved, right? Well, problem mitigated – but the degraded performance and resilver time is even worse than a RAIDZ1, because the parity calculations are considerably gnarlier. And it gets worse the wider your stripe (number of disks in the vdev).

Saving the best for last: mirror vdevs. When a disk fails in a mirror vdev, your pool is minimally impacted – nothing needs to be rebuilt from parity, you just have one less device to distribute reads from. When you replace and resilver a disk in a mirror vdev, your pool is again minimally impacted – you’re doing simple reads from the remaining member of the vdev, and simple writes to the new member of the vdev. In no case are you re-writing entire stripes, all other vdevs in the pool are completely unaffected, etc. Mirror vdev resilvering goes really quickly, with very little impact on the performance of the pool. Resilience to multiple failure is very strong, though requires some calculation – your chance of surviving a disk failure is 1-(f/(n-f)), where f is the number of disks already failed, and n is the number of disks in the full pool. In an eight disk pool, this means 100% survival of the first disk failure, 85.7% survival of a second disk failure, 66.7% survival of a third disk failure. This assumes two disk vdevs, of course – three disk mirrors are even more resilient.

But wait, why would I want to trade guaranteed two disk failure in RAIDZ2 with only 85.7% survival of two disk failure in a pool of mirrors? Because of the drastically shorter time to resilver, and drastically lower load placed on the pool while doing so. The only disk more heavily loaded than usual during a mirror vdev resilvering is the other disk in the vdev – which might sound bad, but remember that it’s no more heavily loaded than it would’ve been as a RAIDZ member. Each block resilvered on a RAIDZ vdev requires a block to be read from each surviving RAIDZ member; each block written to a resilvering mirror only requires one block to be read from a surviving vdev member. For a six-disk RAIDZ1 vs a six disk pool of mirrors, that’s five times the extra I/O demands required of the surviving disks.

Resilvering a mirror is much less stressful than resilvering a RAIDZ.

One last note on fault tolerance

No matter what your ZFS pool topology looks like, you still need regular backup.

Say it again with me: I must back up my pool!

ZFS is awesome. Combining checksumming and parity/redundancy is awesome. But there are still lots of potential ways for your data to die, and you still need to back up your pool. Period. PERIOD!

Normal performance

It’s easy to think that a gigantic RAIDZ vdev would outperform a pool of mirror vdevs, for the same reason it’s got a greater storage efficiency. “Well when I read or write the data, it comes off of / goes onto more drives at once, so it’s got to be faster!” Sorry, doesn’t work that way. You might see results that look kinda like that if you’re doing a single read or write of a lot of data at once while absolutely no other activity is going on, if the RAIDZ is completely unfragmented… but the moment you start throwing in other simultaneous reads or writes, fragmentation on the vdev, etc then you start looking for random access IOPS. But don’t listen to me, listen to one of the core ZFS developers, Matthew Ahrens: “For best performance on random IOPS, use a small number of disks in each RAID-Z group. E.g, 3-wide RAIDZ1, 6-wide RAIDZ2, or 9-wide RAIDZ3 (all of which use ⅓ of total storage for parity, in the ideal case of using large blocks). This is because RAID-Z spreads each logical block across all the devices (similar to RAID-3, in contrast with RAID-4/5/6). For even better performance, consider using mirroring.“

Please read that last bit extra hard: For even better performance, consider using mirroring. He’s not kidding. Just like RAID10 has long been acknowledged the best performing conventional RAID topology, a pool of mirror vdevs is by far the best performing ZFS topology.

Future expansion

This is one that should strike near and dear to your heart if you’re a SOHO admin or a hobbyist. One of the things about ZFS that everybody knows to complain about is that you can’t expand RAIDZ. Once you create it, it’s created, and you’re stuck with it.

Well, sorta.

Let’s say you had a server with 12 slots to put drives in, and you put six drives in it as a RAIDZ2. When you bought it, 1TB drives were a great bang for the buck, so that’s what you used. You’ve got 6TB raw / 4TB usable. Two years later, 2TB drives are cheap, and you’re feeling cramped. So you fill the rest of the six available bays in your server, and now you’ve added an 12TB raw / 8TB usable vdev, for a total pool size of 18TB/12TB. Two years after that, 4TB drives are out, and you’re feeling cramped again… but you’ve got no place left to put drives. Now what?

Well, you actually can upgrade that original RAIDZ2 of 1TB drives – what you have to do is fail one disk out of the vdev and remove it, then replace it with one of your 4TB drives. Wait for the resilvering to complete, then fail a second one, and replace it. Lather, rinse, repeat until you’ve replaced all six drives, and resilvered the vdev six separate times – and after the sixth and last resilvering finishes, you have a 24TB raw / 16TB usable vdev in place of the original 6TB/4TB one. Question is, how long did it take to do all that resilvering? Well, if that 6TB raw vdev was nearly full, it’s not unreasonable to expect each resilvering to take twelve to sixteen hours… even if you’re doing absolutely nothing else with the system. The more you’re trying to actually do in the meantime, the slower the resilvering goes. You might manage to get six resilvers done in six full days, replacing one disk per day. But it might take twice that long or worse, depending on how willing to hover over the system you are, and how heavily loaded it is in the meantime.

What if you’d used mirror vdevs? Well, to start with, your original six drives would have given you 6TB raw / 3TB usable. So you did give up a terabyte there. But maybe you didn’t do such a big upgrade the first time you expanded. Maybe since you only needed to put in two more disks to get more storage, you only bought two 2TB drives, and by the time you were feeling cramped again the 4TB disks were available – and you still had four bays free. Eventually, though, you crammed the box full, and now you’re in that same position of wanting to upgrade those old tiny 1TB disks. You do it the same way – you replace, resilver, replace, resilver – but this time, you see the new space after only two resilvers. And each resilvering happens tremendously faster – it’s not unreasonable to expect nearly-full 1TB mirror vdevs to resilver in three or four hours. So you can probably upgrade an entire vdev in a single day, even without having to hover over the machine too crazily. The performance on the machine is hardly impacted during the resilver. And you see the new capacity after every two disks replaced, not every six.

TL;DR

Too many words, mister sysadmin. What’s all this boil down to?

don’t be greedy. 50% storage efficiency is plenty.
for a given number of disks, a pool of mirrors will significantly outperform a RAIDZ stripe.
a degraded pool of mirrors will severely outperform a degraded RAIDZ stripe.
a degraded pool of mirrors will rebuild tremendously faster than a degraded RAIDZ stripe.
a pool of mirrors is easier to manage, maintain, live with, and upgrade than a RAIDZ stripe.
BACK. UP. YOUR POOL. REGULARLY. TAKE THIS SERIOUSLY.

TL;DR to the TL;DR – unless you are really freaking sure you know what you’re doing… use mirrors. (And if you are really, really sure what you’re doing, you’ll probably change your mind after a few years and wish you’d done it this way to begin with.)

Will ZFS and non-ECC RAM kill your data?

This comes up far too often, so rather than continuing to explain it over and over again, I’m going to try to do a really good job of it once and link to it here.

What’s ECC RAM? Is it a good idea?

The ECC stands for Error Correcting Checksum. In a nutshell, ECC RAM is a special kind of server-grade memory that can detect and repair some of the most common kinds of in-memory corruption. For more detail on how ECC RAM does this, and which types of errors it can and cannot correct, the rabbit hole’s over here.

Now that we know what ECC RAM is, is it a good idea? Absolutely. In-memory errors, whether due to faults in the hardware or to the impact of cosmic radiation (yes, really) are a thing. They do happen. And if it happens in a particularly strategic place, you will lose data to it. Period. There’s no arguing this.

What’s ZFS? Is it a good idea?

ZFS is, among other things, a checksumming filesystem. This means that for every block committed to storage, a strong hash (somewhat misleadingly AKA checksum) for the contents of that block is also written. (The validation hash is written in the pointer to the block itself, which is also checksummed in the pointer leading to itself, and so on and so forth. It’s turtles all the way down. Rabbit hole begins over here for this one.)

Is this a good idea? Absolutely. Combine ZFS checksumming with redundancy or parity, and now you have a self-healing array. If a block is corrupt on disk, the next time it’s read, ZFS will see that it doesn’t match its checksum and will load a redundant copy (in the case of mirror vdevs or multiple copy storage) or rebuild a parity copy (in the case of RAIDZ vdevs), and assuming that copy of the block matches its checksum, will silently feed you the correct copy instead, and log a checksum error against the first block that didn’t pass.

ZFS also supports scrubs, which will become important in the next section. When you tell ZFS to scrub storage, it reads every block that it knows about – including redundant copies – and checks them versus their checksums. Any failing blocks are automatically overwritten with good blocks, assuming that a good (passing) copy exists, either redundant or as reconstructed from parity. Regular scrubs are a significant part of maintaining a ZFS storage pool against long term corruption.

Is ZFS and non-ECC worse than not-ZFS and non-ECC? What about the Scrub of Death?

OK, it’s pretty easy to demonstrate that a flipped bit in RAM means data corruption: if you write that flipped bit back out to disk, congrats, you just wrote bad data. There’s no arguing that. The real issue here isn’t whether ECC is good to have, it’s whether non-ECC is particularly problematic with ZFS. The scenario usually thrown out is the the much-dreaded Scrub Of Death.

TL;DR version of the scenario: ZFS is on a system with non-ECC RAM that has a stuck bit, its user initiates a scrub, and as a result of in-memory corruption good blocks fail checksum tests and are overwritten with corrupt data, thus instantly murdering an entire pool. As far as I can tell, this idea originates with a very prolific user on the FreeNAS forums named Cyberjock, and he lays it out in this thread here. It’s a scary idea – what if the very thing that’s supposed to keep your system safe kills it? A scrub gone mad! Nooooooo!

The problem is, the scenario as written doesn’t actually make sense. For one thing, even if you have a particular address in RAM with a stuck bit, you aren’t going to have your entire filesystem run through that address. That’s not how memory management works, and if it were how memory management works, you wouldn’t even have managed to boot the system: it would have crashed and burned horribly when it failed to load the operating system in the first place. So no, you might corrupt a block here and there, but you’re not going to wring the entire filesystem through a shredder block by precious block.

But we’re being cheap here. Say you only corrupt one block in 5,000 this way. That would still be hellacious. So let’s examine the more reasonable idea of corrupting some data due to bad RAM during a scrub. And let’s assume that we have RAM that not only isn’t working 100% properly, but is actively goddamn evil and trying its naive but enthusiastic best to specifically kill your data during a scrub:

First, you read a block. This block is good. It is perfectly good data written to a perfectly good disk with a perfectly matching checksum. But that block is read into evil RAM, and the evil RAM flips some bits. Perhaps those bits are in the data itself, or perhaps those bits are in the checksum. Either way, your perfectly good block now does not appear to match its checksum, and since we’re scrubbing, ZFS will attempt to actually repair the “bad” block on disk. Uh-oh! What now?

Next, you read a copy of the same block – this copy might be a redundant copy, or it might be reconstructed from parity, depending on your topology. The redundant copy is easy to visualize – you literally stored another copy of the block on another disk. Now, if your evil RAM leaves this block alone, ZFS will see that the second copy matches its checksum, and so it will overwrite the first block with the same data it had originally – no data was lost here, just a few wasted disk cycles. OK. But what if your evil RAM flips a bit in the second copy? Since it doesn’t match the checksum either, ZFS doesn’t overwrite anything. It logs an unrecoverable data error for that block, and leaves both copies untouched on disk. No data has been corrupted. A later scrub will attempt to read all copies of that block and validate them just as though the error had never happened, and if this time either copy passes, the error will be cleared and the block will be marked valid again (with any copies that don’t pass validation being overwritten from the one that did).

Well, huh. That doesn’t sound so bad. So what does your evil RAM need to do in order to actually overwrite your good data with corrupt data during a scrub? Well, first it needs to flip some bits during the initial read of every block that it wants to corrupt. Then, on the second read of a copy of the block from parity or redundancy, it needs to not only flip bits, it needs to flip them in such a way that you get a hash collision. In other words, random bit-flipping won’t do – you need some bit flipping in the data (with or without some more bit-flipping in the checksum) that adds up to the corrupt data correctly hashing to the value in the checksum. By default, ZFS uses 256-bit SHA validation hashes, which means that a single bit-flip has a 1 in 2^256 chance of giving you a corrupt block which now matches its checksum. To be fair, we’re using evil RAM here, so it’s probably going to do lots of experimenting, and it will try flipping bits in both the data and the checksum itself, and it will do so multiple times for any single block. However, that’s multiple 1 in 2^256 (aka roughly 1 in 10^77) chances, which still makes it vanishingly unlikely to actually happen… and if your RAM is that damn evil, it’s going to kill your data whether you’re using ZFS or not.

But what if I’m not scrubbing?

Well, if you aren’t scrubbing, then your evil RAM will have to wait for you to actually write to the blocks in question before it can corrupt them. Fortunately for it, though, you write to storage pretty much all day long… including to the metadata that organizes the whole kit and kaboodle. First time you update the directory that your files are contained in, BAM! It’s gotcha! If you stop and think about it, in this evil RAM scenario ZFS is incredibly helpful, because your RAM now needs to not only be evil but be bright enough to consistently pull off collision attacks. So if you’re running non-ECC RAM that turns out to be appallingly, Lovecraftianishly evil, ZFS will mitigate the damage, not amplify it.

If you are using ZFS and you aren’t scrubbing, by the way, you’re setting yourself up for long term failure. If you have on-disk corruption, a scrub can fix it only as long as you really do have a redundant or parity copy of the corrupted block which is good. Once you corrupt all copies of a given block, it’s too late to repair it – it’s gone. Don’t be afraid of scrubbing. (Well, maybe be a little wary of the performance impact of scrubbing during high demand times. But don’t be worried about scrubbing killing your data.)

I’ve constructed a doomsday scenario featuring RAM evil enough to kill my data after all! Mwahahaha!

OK. But would using any other filesystem that isn’t ZFS have protected that data? ‘Cause remember, nobody’s arguing that you can lose data to evil RAM – the argument is about whether evil RAM is more dangerous with ZFS than it would be without it.

I really, really want to use the Scrub Of Death in a movie or TV show. How can I make it happen?

What you need here isn’t evil RAM, but an evil disk controller. Have it flip one bit per block read or written from disk B, but leave the data from disk A alone. Now scrub – every block on disk B will be overwritten with a copy from disk A, but the evil controller will flip bits on write, so now, all of disk B is written with garbage blocks. Now start flipping bits on write to disk A, and it will be an unrecoverable wreck pretty quickly, since there’s no parity or redundancy left for any block. Your choice here is whether to ignore the metadata for as long as possible, giving you the chance to overwrite as many actual data blocks as you can before the jig is up as they are written to by the system, or whether to pounce straight on the metadata and render the entire vdev unusable in seconds – but leave the actual data blocks intact for possible forensic recovery.

Alternately you could just skip straight to step B and start flipping bits as data is written on any or all individual devices, and you’ll produce real data loss quickly enough. But you specifically wanted a scrub of death, not just bad hardware, right?

I don’t care about your logic! I wish to appeal to authority!

OK. “Authority” in this case doesn’t get much better than Matthew Ahrens, one of the cofounders of ZFS at Sun Microsystems and current ZFS developer at Delphix. In the comments to one of my filesystem articles on Ars Technica, Matthew said “There’s nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem.”

Hope that helps. =)

Benchmarking Windows Guests on KVM:I/O performance

I’ve been using KVM in production to host Windows Server guests for close to 4 years now. I’ve always been thoroughly impressed at what a great job KVM did with accelerating disk I/O for the guests – making Windows guests perform markedly faster virtualized than they used to on the bare metal. So when I got really, REALLY bad performance recently on a few Windows Server Standard 2012 guests – bad enough to make the entire guest seem “locked up tight” for minutes at a time – I did some delving to figure out what was going on.

Linux and KVM offer a wealth of options for handling caching and underlying subsystems of host storage… an almost embarassing wealth, which nobody seems to have really benchmarked. I have seen quite a few people tossing out offhanded comments about this cache mode or that cache mode being “safer”or “faster”or “better”, but no hard numbers. So to both fix my own immediate problem and do some much-needed documentation, I spent more hours this week than I really want to think about doing some real, no-kidding, here-are-the-numbers benchmarking.

Methodology

Test system: AMD FX-8320 8-core CPU, 32GB DDR3 SDRAM, 1x WD 2TB Black (WD2002FAEX) drive, 1x Samsung 840 PRO Series 512GB SSD, Ubuntu 12.04.2-LTS fully updated, Windows Server 2008 R2 guest OS, HD Tune Pro 5.50 Windows disk benchmark suite.

The host and guest OS are both installed on the WD 2TB Black conventional disk; the Samsung 840 PRO Series SSD is attached to the guest in various configurations for benchmarking. The guest OS is given approximately 30 seconds to “settle” after each boot and login before running any benchmarks. No other operations are occurring on either guest or host while benchmarks are run.

Exploratory Testing

Before diving straight into “which combination works the fastest”, I really wanted to explore the individual characteristics of all the various overlapping options available.

The first thing I wanted to find out: how much of a penalty, if any, do you pay for operating a raw disk virtualized under KVM, as opposed to under Windows on the bare metal? And how much of a boost do the VirtIO guest drivers offer over basic IDE drivers?

As you can see, we do pay a penalty – particularly without the VirtIO drivers, which offer a substantial increase in performance over the default IDE, even without caching. We can also see that LVM logical volumes perform effectively identically to actual raw disks. Nice to know!

Now that we know that “raw is raw”, and “VirtIO is significantly better than IDE”, the next burning question is: how much of a performance hit do we take if we use .qcow2 files on an actual filesystem, instead of feeding KVM a raw block device? Actually, let’s pause that question – before that, why would we want to use a .qcow2 file instead of a raw disk or LV? Two big answers: rsync, and state saves. Unless you compile rsync from source with an experimental patch, you can’t use it to synchronize copies of a guest that are stored on a block device – whereas you can, with a qcow2 or raw file on a filesystem. And you can’t save state (basically, like hibernation – only much faster, and handled by the host instead of the guest) with raw storage either – you need qcow2 for that. As a really, really nice aside, if you’re using qcow2 and your host runs out of space… your guest pauses instead of crashing, and as soon as you’ve made more space available on your host, you can resume the guest as though nothing ever happened. Which is nice.

So, if we can afford to, we really would like to have qcow2. Can we afford to?

Yes… yes we can. There’s nothing too exciting to see here – basically, the takeaway is “there is little to no performance penalty for using qcow2 files on a filesystem instead of raw disks.” So, performance is determined by cache settings and by the presence of VirtIO drivers in our guest… not so much by whether we’re using raw disks, or LV, or ZVOL, or qcow2 files.

One caveat: I tested using fully-allocated qcow2 files. I did a little bit of casual testing with sparsely allocated (aka “thin provisioned”) qcow2 files, and basically, they follow the same performance curves, but with roughly half the performance on some writes. This isn’t that big a deal, in my opinion – you only have to do a “slow” write to any given block once. After that, it’s a rewrite, not a new write, and you’re back to the same performance level you’d have had if you’d fully allocated your qcow2 file to start with. So, basically, it’s a self-correcting problem, with a tolerable temporary performance penalty. I’m more than willing to deal with that in return for not having to potentially synchronize gigabytes of slack space when I do backups and migrations.

So… since performance is determined largely by cache settings, let’s take a look at how those play out in the guest:

In plain English, “writethrough” – the default – is read caching with no write cache, and “writeback” is both read and write caching. “None” is a little deceptive – it’s not just “no caching”, it actually requires Direct I/O access to the storage medium, which not all filesystems provide.

The first thing we see here is that the default cache method, writethrough, is very very fast at reading, but painfully slow on writes – cripplingly so. On very small writes, writethrough is capable of less than 0.2 MB/sec in some cases! This is on a brand-new 840 Pro Series SSD... and it’s going to get even worse than this later, when we look at qcow2 storage. Avoid, avoid, avoid.

KVM caching really is pretty phenomenal when it hits, though. Take a look at the writeback cache method – it jumps well above bare metal performance for large reads and writes… and it’s not a small jump, either; 1MB random reads of well over 1GB / sec are completely normal. It’s potentially a little risky, though – you could potentially lose guest data if you have a power failure or host system crash during a write. This shouldn’t be an issue on a stable host with a UPS and apcupsd.

Finally, there’s cache=none. It works. It doesn’t impress. It isn’t risky in terms of data safety. And it generally performs somewhat better with extremely, extremely small random I/O… but without getting the truly mind-boggling wins that caching can offer. In my personal opinion, cache=none is mostly useful when you’re limited to IDE drivers in your guest. Also worth noting: “cache=none” isn’t available on ZFS or FUSE filesystems.

Moving on, we get to the stuff I really care about when I started this project – ZFS! Storing guests on ZFS is really exciting, because it offers you the ability to take block-level host-managed snapshots of your guests; set and modify quotas; set and configure compression; do asynchronous replication; do block-level deduplication – the list goes on and on and on. This is a really big deal. But… how’s the performance?

The performance is very, very solid… as long as you don’t use writethrough. If you use writethrough cache and ZFS, you’re going to have a bad time. Also worth noting: Direct I/O is not available on the ZFS filesystem – although it is available with ZFS zvols! – so there are no results here for “cache=none” and ZFS qcow2.

The big, big, big thing you need to take away from this is that abysmal write performance line for ZFS/qcow2/writethrough – well under 2MB/sec for any and all writes. If you set a server up this way, it will look blazing quick and you’ll love it… until the first time you or a user tries to write a few hundred MB of data to it across the network, at which point the whole thing will lock up tighter than a drum for half an hour plus. Which will make you, and your users, very unhappy.

What else can we learn here? Well, although we’ve got the option of using a zvol – which is basically ZFS’s answer to an LVM LV – we really would like to avoid it, for the same reasons we talked about when we compared qcow to raw. So, let’s look at the performance of that raw zvol – is it worth the hassle? In the end, no.

But here’s the big surprise – if we set up a ZVOL, then format it with ext4 and put a .qcow2 on top of that… it performs as well, and in some cases better than, the raw zvol itself did! As odd as it sounds, this leaves qcow2-on-ext4-on-zvol as one of our best performing overall storage methods, with the most convenient options for management. It sounds like it’d be a horrible Rube Goldberg, but it performs like best-in-breed. Who’d’a thunk it?

There’s one more scenario worth exploring – so far, since discovering how much faster it was, we’ve almost exclusively looked at VirtIO performance. But you can’t always get VirtIO – for example, I have a couple of particularly crotchety old P2V’ed Small Business Server images that absolutely refuse to boot under VirtIO without blue-screening. It happens. What are your best options if you’re stuck with IDE?

Without VirtIO drivers, caching does very, very little for you at best, and kills your performance at worst. (Hello again, horrible writethrough write performance.) So you really want to use “cache=none” if you’re stuck on IDE. If you can’t do that for some reason (like using ZFS as a filesystem, not a zvol), writeback will perform quite acceptably… but it will also expose you to whatever added data integrity risk that the write caching presents, without giving you any performance benefits in return. Caveat emptor.

Final Tests / Top Performers

At this point, we’ve pretty thoroughly explored how individual options affect performance, and the general ways in which they interact. So it’s time to cut to the chase: what are our top performers?

First, let’s look at our top read performers. My method for determining “best” read performance was to take the 4KB random read and the sequential read, then multiply the 4KB random by a factor which, when applied to the bare metal Windows performance, would leave a roughly identical value to the sequential read. Once you’ve done this, taking the average of the two gives you a mean weighted value that makes 4KB read performance roughly as “important” as sequential read performance. Sorting the data by these values gives us…

Woah, hey, what’s that joker in the deck? RAIDZ1…?

My primary workstation is also an FX-8320 with 32GB of DDR3, but instead of an SSD, it has a 4 drive RAIDZ1 array of Western Digital 1TB Black (WD1001FAEX) drives. I thought it would be interesting to see how the RAIDZ1 on spinning rust compared to the 840 Pro SSD… and was pretty surprised to see it completely stomping it, across the board. At least, that’s how it looks in these benchmarks. The benchmarks don’t tell the whole story, though, which we’ll cover in more detail later. For now, we just want to notice that yes, a relatively small and inexpensive RAIDZ1 array does surprisingly well compared to a top-of-the-line SSD – and that makes for some very interesting and affordable options, if you need to combine large amounts of data with high performance access.

Joker aside, the winner here is pretty obvious – qcow2 on xfs, writeback. Wait, xfs? Yep, xfs. I didn’t benchmark xfs as thoroughly as I did ext4 – never tried it layered on top of a zvol, in particular – but I did do an otherwise full set of xfs benchmarks. In general, xfs slightly outperforms ext4 across an identically shaped curve – enough of a difference notice on a graph, but not enough to write home about. The difference is just enough to punt ext4/writeback out of the top 5 – and even though we aren’t actually testing write performance here, it’s worth noting how much better xfs/writeback writes than the two bottom-of-the-barrel “top performers” do.

I keep harping on this, but seriously, look close at those two writethrough entries – it’s not as bad as writethrough-on-ZFS-qcow2, but it’s still way worse than any of the other contenders, with 4KB writes under 2MB/sec. That’s single-raw-spinning-disk territory, at best. Writethrough does give you great read performance, and it’s “safe” as in data integrity, but it’s dangerous as hell in terms of suddenly underperforming under really badly under heavy load. That’s not just “I see a valley on the graph” levels of bad, it’s very potentially “hey IT guy, why did the server lock up?” levels of bad. I do not recommend writethrough caching, “default” option or not.

How ’bout write performance? I calculated “weighted write” just like “weighted read” – divide the bare metal sequential write speed by the bare metal 4KB write speed, then apply the resulting factor to all the 4KB random writes, and average them with the sequential writes. Here are the top 5 weighted write performers:

The first thing to notice here is that while the top 5 slots have changed, the peak read numbers really haven’t. All we’ve really done here is kick the writethrough entries to the curb – we haven’t paid any significant penalty in read performance to do so. Realizing that, let’s not waste too much time talking about this one… instead, let’s cut straight to the “money graph” – our top performers in average mean weighted read and write performance. The following are, plain and simple, the best performers for any general purpose (and most fairly specialized) use cases:

Interestingly, our “jokers” – zvol/ext4/qcow2/writeback and zfs/qcow2/writeback on my workstation’s relatively humble 4-drive RAIDZ1 – are still dominating the pack, at #1 and #2 respectively. This is because they read as well as any of the heavy lifters do, and are showing significantly better write performance – with caveats, which we’ll cover in the conclusions.

Jokers aside, xfs/qcow2/writeback is next, followed by zvol/ext4/qcow2/writeback. You aren’t seeing any cache=none at all here – the gains some cache=none contenders make in very tiny writes just don’t offset the penalties paid in reads and in larger writes from foregoing the cache. If you have a very heavily teeny-tiny-write-loaded workload – like a super-heavy-traffic database – cache=none may still perform better… but you probably don’t want virtualization for a really heavy database workload in the first place, KVM or otherwise. If you hammer your disks, rust or solid state, to within an inch of their lives… you’re really going to feel that raw performance penalty.

Conclusions

In the end, the recommendation is pretty clear – a ZFS zvol with ext4, qcow2 files, and writeback caching offers you the absolute best performance. (Using xfs on a zvol would almost certainly perform as well, or even better, but I didn’t test that exact combination here.) Best read performance, best write performance, qcow2 management features, ZFS snapshots / replication / data integrity / optional compression / optional deduplication / etc – there really aren’t any drawbacks here… other than the complexity of the setup itself. In the real world, I use simple .qcow2 on ZFS, no zvols required. The difference in performance between the two was measurable on graphs, but it’s not significant enough to make me want to actually maintain the added complexity.

If you can’t or won’t use ZFS for whatever reason (like licensing concerns), xfs is probably your next best bet – but if that scares you, just use ext4 – the difference won’t be enough to matter much in the long run.

There’s really no reason to mess around with raw disks or raw files – they add significant extra hassle, remove significant features, and don’t offer tangible performance benefits.

If you’re going to use writeback caching, you should be extra certain of power integrity – UPS on the server with apcupsd. If you’re at all uncertain of your power integrity… take the read performance hit and go with nocache. (On the other hand, if you’re using ZFS and taking rolling hourly snapshots… maybe it’s worth taking more risks for the extra performance. Ultimately, you’re the admin, you get to call the shots.)

Caveats

It’s very important to note that these benchmarks do not tell the whole story about disk performance under KVM. In particular, both the random and sequential reads used here bypass the cache considerably more heavily than most general purpose workloads would… minimizing the impact that the cache has. And yes, it is significant. See those > 1GB/sec peaks in the random read performance? That kind of thing happens a lot more in a normal workload than it does in a random walk – particularly with ZFS storage, which uses the ARC (Adaptive Replacement Cache) rather than the simple FIFO cache used by other systems. The ARC makes decisions about cache eviction based not only on the time since an object was last seen, but the frequency at which it’s seen, among other things – with the result that it kicks serious butt, especially after a good amount of time to warm up and learn the behavior patterns of the system.

So, please take these numbers with a grain of salt. They’re quite accurate for what they are, but they’re synthetic benchmarks, not a real-world experience. Windows on KVM is actually a much (much) better experience than the raw numbers here would have you believe – the importance of better-managed cache, and persistent cache, really can’t be over-emphasized.

The actual guest OS installation for these tests was on a single spinning disk, but it used ZFS/qcow2/writeback for the underlying storage. I needed to reboot the guest after every single row of data – and in many cases, several times more, because I screwed something or another up. In the end, I rebooted that Windows Server 2008 R2 guest upwards of 50 times. And on a single spinning disk, shutdowns took about 4 seconds and boot times (from BIOS to desktop) took about 3 seconds. You don’t get that kind of performance out of the bare metal, and you can’t see it in these graphs, either. For most general-purpose workloads, these graphs are closer to being a “worst-place scenario” than are a direct model.

That sword cuts both ways, though – the “jokers in the deck”, my RAIDZ1 of spinning rust, isn’t really quite as impressive as it looks above. I’m spitballing here, but I think a lot of the difference is that ZFS is willing to more aggressively cache writes with a RAIDZ array than it is with a single member, probably because it expects that more spindles == faster writes, and it only wants to keep so many writes in cache before it flushes them. That’s a guess, and only a guess, but the reality is that after those jaw-dropping 1.9GB/sec 1MB random write runs, I could hear the spindles chattering for a significant chunk of time, getting all those writes committed to the rust. That also means that if I’d had the patience for bigger write runs than 5GB, I’d have seen performance dropping significantly. So you really shouldn’t think “hey, forget SSDs completely, spinning rust is fine for everybody!” It’s not. It is, however, a surprisingly good competitor on the cheap if you buy enough of it, and if your write runs come in bursts – even “bursts” ranging in the gigabytes.