ZFS tuning cheat sheet

Quick and dirty cheat sheet for anyone getting ready to set up a new ZFS pool. Here are all the settings you’ll want to think about, and the values I think you’ll probably want to use.

I am not generally a fan of tuning things unless you need to, but unfortunately a lot of the ZFS defaults aren’t optimal for most workloads.

SLOG and L2ARC are special devices, not parameters… but I included them anyway. Lean into it.

parameter best* value why / what does it do?
ashift 12 Ashift tells ZFS what the underlying physical block size your disks use is. It’s in bits, so ashift=9 means 512B sectors (used by all ancient drives), ashift=12 means 4K sectors (used by most modern hard drives), and ashift=13 means 8K sectors (used by some modern SSDs).

If you get this wrong, you want to get it wrong high. Too low an ashift value will cripple your performance. Too high an ashift value won’t have much impact on almost any normal workload.

Ashift is per vdev, and immutable once set. This means you should manually set it at pool creation, and any time you add a vdev to an existing pool, and should never get it wrong because if you do, it will screw up your entire pool and cannot be fixed.

 xattr  sa  Sets Linux eXtended ATTRibutes directly in the inodes, rather than as tiny little files in special hidden folders.

This can have a significant performance impact on datasets with lots of files in them, particularly if SELinux is in play. Unlikely to make any difference on datasets with very few, extremely large files (eg VM images).

 compression  lz4  Compression defaults to off, and that’s a losing default value. Even if your data is incompressible, your slack space is (highly) compressible.

LZ4 compression is faster than storage. Yes, really. Even if you have a $50 tinkertoy CPU and a blazing-fast SSD. Yes, really. I’ve tested it. It’s a win.

You might consider gzip compression for datasets with highly compressible files. It will have better compression rate but likely lower throughput. YMMV, caveat imperator.

 atime  off  If atime is on – which it is by default – your system has to update the “Accessed” attribute of every file every time you look at it. This can easily double the IOPS load on a system all by itself.

Do you care when the last time somebody opened a given file was, or the last time they ls’d a directory? Probably not. Turn this off.

 recordsize  16K  If you have files that will be read from or written to in random batches regularly, you want to match the recordsize to the size of the reads or writes you’re going to be digging out of / cramming into those large files.

For most database binaries or VM images, 16K is going to be either an exact match or at least a much better one than the default recordsize, 128K.

This can improve the IOPS capability of an array used for db binaries or VM images fourfold or more.

 recordsize  1M  Wait, didn’t we just do recordsize…? Well, yes, but different workloads call for different settings if you’re tuning.

If you’re only reading and writing in fairly large chunks – for example, a collection of 5-8MB JPEG images from a camera, or 100GB movie files, either of which will not be read or written random access – you’ll want to set recordsize=1M, to reduce the IOPS load on the system by requiring fewer individual records for the same amount of data. This can also increase compression ratio, for compressible data, since each record uses its own individual compression dictionary.

If you’re using bittorrent, recordsize=16K results in higher possible bittorrent write performance… but recordsize=1M results in lower overall fragmentation, and much better performance when reading the files you’ve acquired by torrent later.

 SLOG  maybe  SLOG isn’t a setting, it’s a special vdev type that acts as a write aggregation layer for the entire pool. It only affects synchronous writes – asynchronous writes are already aggregated in the ZIL in RAM.

SLOG doesn’t need to be a large device; it only has to accumulate a few seconds’ worth of writes. Having one means that synchronous writes perform like asynchronous writes; it doesn’t really act like a “write cache” in the way new ZFS users tend to hope it will.

Great for databases, NFS exports, or anything else that calls sync() a lot. Not too useful for more casual workloads.

 L2ARC  nope!  L2ARC is a layer of ARC that resides on fast storage rather than in RAM. It sounds amazing – super huge super fast read cache!

Yeah, it’s not really like that. For one thing, L2ARC is ephemeral – data in L2ARC  doesn’t survive reboots. For another thing, it costs a significant amount of RAM to index the L2ARC, which means now you have a smaller ARC due to the need for indexing your L2ARC.

Even the very fastest SSD is a couple orders of magnitude slower than RAM. When you have to go to L2ARC to fetch data that would have fit in the ARC if it hadn’t been for needing to index the L2ARC, it’s a massive lose.

Most people won’t see any real difference at all after adding L2ARC. A significant number of people will see performance decrease after adding L2ARC. There is such a thing as a workload that benefits from L2ARC… but you don’t have it. (Think hundreds of users, each with extremely large, extremely hot datasets.)

* “best” is always debatable. Read reasoning before applying. No warranties offered, explicit or implied.

ZFS does NOT favor lower latency devices. Don’t mix rust disks and SSDs!

In an earlier post, I addressed the never-ending urban legend that ZFS writes data to the lowest-latency vdev. Now the urban legend that never dies has reared its head again; this time with someone claiming that ZFS will issue read operations to the lowest-latency disk in a given mirror vdev.

TL;DR – this, too, is a myth. If you need or want an empirical demonstration, read on.

I’ve got an Ubuntu Bionic machine handy with both rust and SSD available; /tmp is an ext4 filesystem on an mdraid1 SSD mirror and /rust is an ext4 filesystem on a single WD 4TB black disk. Let’s play.

root@box:~# truncate -s 4G /tmp/ssd.bin
root@box:~# truncate -s 4G /rust/rust.bin
root@box:~# mkdir /tmp/disks
root@box:~# ln -s /tmp/ssd.bin /tmp/disks/ssd.bin ; ln -s /rust/rust.bin /tmp/disks/rust.bin
root@box:~# zpool create -oashift=12 test /tmp/disks/rust.bin
root@box:~# zfs set compression=off test

Now we’ve got a pool that is rust only… but we’ve got an ssd vdev off to the side, ready to attach. Let’s run an fio test on our rust-only pool first. Note: since this is read testing, we’re going to throw away our first result set; they’ll largely be served from ARC and that’s not what we’re trying to do here.

root@box:~# cd /test
root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1

OK, cool. Now that fio has generated its dataset, we’ll clear all caches by exporting the pool, then clearing the kernel page cache, then importing the pool again.

root@box:/test# cd ~
root@box:~# zpool export test
root@box:~# echo 3 > /proc/sys/vm/drop_caches
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Now we can get our first real, uncached read from our rust-only pool. It’s not terribly pretty; this is going to take 5 minutes or so.

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[ ... ]
Run status group 0 (all jobs):
  READ: bw=17.6MiB/s (18.5MB/s), 17.6MiB/s-17.6MiB/s (18.5MB/s-18.5MB/s), io=1024MiB (1074MB), run=58029-58029msec

Alright. Now let’s attach our ssd and make this a mirror vdev, with one rust and one SSD disk.

root@box:/test# zpool attach test /tmp/disks/rust.bin /tmp/disks/ssd.bin
root@box:/test# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 1.00G in 0h0m with 0 errors on Sat Jul 14 14:34:07 2018
config:

    NAME                     STATE     READ WRITE CKSUM
    test                     ONLINE       0     0     0
      mirror-0               ONLINE       0     0     0
        /tmp/disks/rust.bin  ONLINE       0     0     0
        /tmp/disks/ssd.bin   ONLINE       0     0     0

errors: No known data errors

Cool. Now that we have one rust and one SSD device in a mirror vdev, let’s export the pool, drop all the kernel page cache, and reimport the pool again.

root@box:/test# cd ~
root@box:~# zpool export test
root@box:~# echo 3 > /proc/sys/vm/drop_caches
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Gravy. Now, do we see massively improved throughput when we run the same fio test? If ZFS favors the SSD, we should see enormously improved results. If ZFS does not favor the SSD, we’ll not-quite-doubled results.

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
   READ: bw=31.1MiB/s (32.6MB/s), 31.1MiB/s-31.1MiB/s (32.6MB/s-32.6MB/s), io=1024MiB (1074MB), run=32977-32977msec

Welp. There you have it. Not-quite-doubled throughput, matching half – but only half – of the read ops coming from the SSD. To confirm, we’ll do this one more time; but this time we’ll detach the rust disk and run fio with nothing in the pool but the SSD.

root@box:/test# cd ~
root@box:~# zpool detach test /tmp/disks/rust.bin
root@box:~# zpool export test
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Moment of truth… this time, fio runs on pure solid state:

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
  READ: bw=153MiB/s (160MB/s), 153MiB/s-153MiB/s (160MB/s-160MB/s), io=1024MiB (1074MB), run=6710-6710msec

Welp, there you have it.

Rust only: reads 18.5 MB/sec
SSD only: reads 160 MB/sec
Rust + SSD: reads 32.6 MB/sec

No, ZFS does not read from the lowest-latency disk in a mirror vdev.

Please don’t perpetuate the myth that ZFS favors lower latency devices.

Primer: How data is stored on-disk with ZFS

As with a lot of things at this blog, I’m largely writing this to confirm and solidify my own knowledge. I tend to be pretty firm on how disks relate to vdevs, and vdevs relate to pools… but once you veer down deeper into the direct on-disk storage, I get a little hazier. So here’s an attempt to remedy that, with citations, for my benefit (and yours!) down the line.

Top level: the zpool

The zpool is the topmost unit of storage under ZFS. A zpool is a single, overarching storage system consisting of one or more vdevs. Writes are distributed among the vdevs according to how much FREE space each vdev has available – you may hear urban myths about ZFS distributing them according to the performance level of the disk, such that “faster disks end up with more writes”, but they’re just that – urban myths. (At least, they’re only myths as of this writing – 2018 April, and ZFS through 7.5.)

A zpool may be created with one or more vdevs, and may have any number of additional vdevs zpool added to it later – but, for the most part, you may not ever remove a vdev from a zpool. There is working code in development to make this possible, but it’s more of a “desperate save” than something you should use lightly – it involves building a permanent lookup table to redirect requests for records stored on the removed vdevs to their new locations on remaining vdevs; sort of a CNAME for storage blocks.

If you create a zpool with vdevs of different sizes, or you add vdevs later when the pool already has a substantial amount of data in it, you’ll end up with an imbalanced distribution of data that causes more writes to land on some vdevs than others, which will limit the performance profile of your pool.

A pool’s performance scales with the number of vdevs within the pool: in a pool of n vdevs, expect the pool to perform roughly equivalently to the slowest of those vdevs, multiplied by n. This is an important distinction – if you create a pool with three solid state disks and a single rust disk, the pool will trend towards the IOPS performance of four rust disks.

Also note that the pool’s performance scales with the number of vdevs, not the number of disks within the vdevs. If you have a single 12 disk wide RAIDZ2 vdev in your pool, expect to see roughly the IOPS profile of a single disk, not of ten!

There is absolutely no parity or redundancy at the pool level. If you lose any vdev, you’ve lost the entire pool, plain and simple. Even if you “didn’t write to anything on that vdev yet” – the pool has altered and distributed its metadata accordingly once the vdev was added; if you lose that vdev “with nothing on it” you’ve still lost the pool.

It’s important to realize that the zpool is not a RAID0; in conventional terms, it’s a JBOD – and a fairly unusual one, at that.

Second level: the vdev

A vdev consists of one or more disks. Standard vdev types are single-disk, mirror, and raidz. A raidz vdev can be raidz1, raidz2, or raidz3. There are also special vdev types – log and l2arc – which extend the ZIL and the ARC, respectively, onto those vdev types. (They aren’t really “write cache” and “read cache” in the traditional sense, which trips a lot of people up. More about that in another post, maybe.)

A single vdev, of any type, will generally have write IOPS characteristics similar to those of a single disk. Specifically, the write IOPS characteristics of its slowest member disk – which may not even be the same disk on every write.

All parity and/or redundancy in ZFS occurs within the vdev level.

Single-disk vdevs

This is as simple as it gets: a vdev that consists of a single disk, no more, no less.

The performance profile of a single-disk vdev is that of, you guessed it, that single disk.

Single-disk vdevs may be expanded in size by replacing that disk with a larger disk: if you zpool attach a 4T disk to a 2T disk, it will resilver into a 2T mirror vdev. When you then zpool detach the 2T disk, the vdev becomes a 4T vdev, expanding your total pool size.

Single-disk vdevs may also be upgraded permanently to mirror vdevs; just zpool attach one or more disks of the same or larger size.

Single-disk vdevs can detect, but not repair, corrupted data records. This makes operating with single-disk vdevs quite dangerous, by ZFS standards – the equivalent, danger-wise, of a conventional RAID0 array.

However, a pool of single-disk vdevs is not actually a RAID0, and really shouldn’t be referred to as one. For one thing, a RAID0 won’t distribute twice as many writes to a 2T disk as to a 1T disk. For another thing, you can’t start out with a three disk RAID0 array, then add a single two-disk RAID1 array (or three five-disk RAID5 arrays!) to your original array, and still call it “a RAID0”.

It may be tempting to use old terminology for conventional RAID, but doing so just makes it that much more difficult to get accustomed to thinking in terms of ZFS’ real topology, hindering both understanding and communication.

Mirror vdevs

Mirror vdevs work basically like traditional RAID1 arrays – each record destined for a mirror vdev is written redundantly to all disks within the vdev. A mirror vdev can have any number of constituent disks; common sizes are 2-disk and 3-disk, but there’s nothing stopping you from creating a 16-disk mirror vdev if that’s what floats your boat.

A mirror vdev offers usable storage capacity equivalent to that of its smallest member disk; and can survive intact as long as any single member disk survives. As long as the vdev has at least two surviving members, it can automatically repair corrupt records detected during normal use or during scrubbing – but once it’s down to the last disk, it can only detect corruption, not repair it. (If you don’t scrub regularly, this means you may already be screwed when you’re down to a single disk in the vdev – any blocks that were already corrupt are no longer repairable, as well as any blocks that become corrupt before you replace the failed disk(s).

You can expand a single disk to a mirror vdev at any time using the zpool attach command; you can also add new disks to an existing mirror in the same way. Disks may also be detached and/or replaced from mirror vdevs arbitrarily. You may also expand the size of an individual mirror vdev by replacing its disks one by one with larger disks; eg start with a mirror of 2T disks, then replace one disk with a 4T disk, wait for it to resilver, then replace the second 2T disk with another 4T disk. Once there are no disks smaller than 4T in the vdev, and it finishes resilvering, the vdev will expand to the new 4T size.

Mirror vdevs are extremely performant: like all vdevs, their write IOPS are roughly those of a single disk, but their read IOPS are roughly those of n disks, where n is the number of disks in the mirror – a mirror vdev n disks wide can read blocks from all n members in parallel.

A pool made of mirror vdevs closely resembles a conventional RAID10 array; each has write IOPS similar to n/2 disks and read IOPS similar to disks, where n is the total number of disks. As with single-disk vdevs, though, I’d advise you not to think and talk sloppily and call it “ZFS RAID10” – it really isn’t, and referring to it that way blurs the boundaries between pool and vdev, hindering both understanding and accurate communication.

RAIDZ vdevs

RAIDZ vdevs are striped parity arrays, similar to RAID5 or RAID6. RAIDZ1 has one parity block per stripe, RAIDZ2 has two parity blocks per stripe, and RAIDZ3 has three parity blocks per stripe. This means that RAIDZ1vdevs can survive loss of a single disk, RAIDZ2 can survive the loss of two disks, and RAIDZ3 vdevs can survive the loss of as many as three disks.

Note, however, that – just like mirror vdevs – once you’ve stripped away all the parity, you’re vulnerable to corruption that can’t be repaired. RAIDZ vdevs take typically take significantly longer to resilver than mirror vdevs do, as well – so you really don’t want to end up completely “uncovered” (surviving, but with no remaining parity blocks) with a RAIDZ array.

Each raidz vdev offers n-(parity*n) storage capacity, where n is the storage capacity of a single disk, and parity is the number of parity blocks per stripe. So a six-disk RAIDZ1 vdev offers the storage capacity of five disks, an eight-disk RAIDZ2 vdev offers the storage capacity of six disks, and so forth.

You may create RAIDZ vdevs using mismatched disk sizes, but the vdev’s capacity will be based around the smallest member disk. You can expand the size of an existing RAIDZ vdev by replacing all of its members individually with larger disks than were originally used, but you cannot expand a RAIDZ vdev by adding new disks to it and making it wider – a 5-disk RAIDZ1 vdev cannot be converted into a 6-disk RAIDZ1 vdev later; neither can a 6-disk RAIDZ2 be converted into a 6-disk RAIDZ1.

It’s a common misconception to think that RAIDZ vdev performance scales linearly with the number of disks used. Although throughput under ideal conditions can scale towards n-parity disks, throughput under moderate to serious load will rapidly degrade toward the profile of a single disk – or even slightly worse, since it scales down toward the profile of the slowest disk for any given operation. This is the difference between IOPS and bandwidth (and it works the same way for conventional RAID!)

RAIDZ vdev IOPS performance is generally more robust than that of a conventional RAID5 or RAID6 array of the same size, because RAIDZ offers variable stripe write sizes – if you routinely write data in records only one record wide, a RAIDZ1 vdev will write to only two of its disks (one for data, and one for parity); a RAIDZ2 vdev will write to only three of its disks (one for data, and two for parity) and so on. This can mitigate some of the otherwise-crushing IOPS penalty associated with wide striped arrays; a three-record variable stripe write to a six-disk RAIDZ vdev only lights up half the disks both when written, and later, when read – which can make the performance profile of that six-disk RAIDZ resemble that of two three-disk RAIDZ1 vdevs rather than that of a single vdev.

The performance improvement described above assumes that multiple reads and writes of the three-record stripes are being requested concurrently; otherwise the entire vdev still binds while waiting for a full-stripe read or write.

Remember that you can – and with larger servers, should – have multiple RAIDZ vdevs per pool, not just one. A pool of three eight-disk RAIDZ2 vdevs will significantly outperform a pool with a single 24-disk RAIDZ2 or RAIDZ3 vdev – and it will resilver much faster when replacing failed disks.

Third level: the metaslab

Each vdev is organized into metaslabs – typically, 200 metaslabs per vdev (although this number can change, if vdevs are expanded and/or as the ZFS codebase itself becomes further optimized over time).

When you issue writes to the pool, those writes are coalesced into a txg (transaction group), which is then distributed among individual vdevs, and finally allocated to specific metaslabs on each vdev. There’s a fairly hefty logic chain which determines exactly what metaslab a record is written to; it was explained to me (with no warranty offered) by a friend who worked with Oracle as follows:

• Is this metaslab “full”? (zfs_mg_noalloc_threshold)
• Is this metaslab excessively fragmented? (zfs_metaslab_fragmentation_threshold)
• Is this metaslab group excessively fragmented? (zfs_mg_fragmentation_threshold)
• Have we exceeded minimum free space thresholds? (metaslab_df_alloc_threshold) This one is weird; it changes the whole storage pool allocation strategy for ZFS if you cross it.
• Should we prefer lower-numbered metaslabs over higher ones? (metaslab_lba_weighting_enabled) This is totally irrelevant to all-SSD pools, and should be disabled there, because it’s pretty stupid without rust disks underneath.
• Should we prefer lower-numbered metaslab groups over higher ones? (metaslab_bias_enabled) Same as above.

You can dive into the hairy details of your pool’s metaslabs using the zdb command – this is a level which I have thankfully not personally needed so far, and I devoutly hope I will continue not to need it in the future.

Fourth level: the record

Each ZFS write is broken into records, the size of which is determined by the zfs set recordsize=command. The default recordsize is currently 128K; it may range from 512B to 1M.

Recordsize is a property which can be tuned individually per dataset, and for higher performance applications, should be tuned per dataset. If you expect to largely be moving large chunks of contiguous data – for example, reading and writing 5MB JPEG files – you’ll benefit from a larger recordsize than default. Setting recordsize=1M here will allow your writes to be less fragmented, resulting in higher performance both when making the writes, and later when reading them.

Conversely, if you expect a lot of small-block random I/O – like reading and writing database binaries, or VM (virtual machine) images – you should set recordsize smaller than the default 128K. MySQL, as an example, typically works with data in 16K chunks; if you set recordsize=16K you will tremendously improve IOPS when working with that data.

ZFS CSUMs – cryptographic hashes which verify its data’s integrity – are written on a per-record basis; data written with recordsize=1M will have a single CSUM per 1MB; data written with recordsize=8K will have 128 times as many CSUMs for the same 1MB of data.

Setting recordsize to a value smaller than your hardware’s individual sector size is a tremendously bad idea, and will lead to massive read/write amplification penalties.

Fifth (and final) level: ashift

Ashift is the property which tells ZFS what the underlying hardware’s actual sector size is. The individual blocksize within each record will be determined by ashift; unlike recordsize, however, ashift is set as a number of bits rather than an actual number.  For example, ashift=13 specifies 8K sectors, ashift=12 specifies 4K sectors, and ashift=9 specifies 512B sectors.

Ashift is per vdev, not per pool – and it’s immutable once set, so be careful not to screw it up!  In theory, ZFS will automatically set ashift to the proper value for your hardware; in practice, storage manufacturers very, very frequently lie about the underlying hardware sector size in order to keep older operating systems from getting confused, so you should do your homework and set it manually. Remember, once you add a vdev to your pool, you can’t get rid of it; so if you accidentally add a vdev with improper ashift value to your pool, you’ve permanently screwed up the entire pool!

Setting ashift too high is, for the most part, harmless – you’ll increase the amount of slack space on your storage, but unless you have a very specialized workload this is unlikely to have any significant impact. Setting ashift too low, on the other hand, is a horrorshow. If you end up with an ashift=9 vdev on a device with 8K sectors (thus, properly ashift=13), you’ll suffer from massive write amplification penalties as ZFS needs to write, read, rewrite again over and over on the same actual hardware sector. I have personally seen improperly set ashift cause a pool of Samsung 840 Pro SSDs perform slower than a pool of WD Black rust disks!

Even if you’ve done your homework and are absolutely certain that your disks use 512B hardware sectors, I strongly advise considering setting ashift=12 or even ashift=13 – because, remember, it’s immutable per vdev, and vdevs cannot be removed from pools. If you ever need to replace a 512B sector disk in a vdev with a 4K or 8K sector disk, you’ll be screwed if that vdev is ashift=9.

How data gets imbalanced on ZFS

In an earlier post, I demonstrated that ZFS distributes writes evenly across vdevs according to FREE space per vdev (not based on latency or anything else: just FREE).

There are three ways I know of that you can end up with an imbalanced distribution of data across your vdevs. The first two are dead obvious; the third took a little head-scratching and empirical testing before I was certain of it.

Different-sized vdevs

If you used vdevs of different sizes in the first place, you end up with more data on the larger vdevs than the smaller vdevs.

This one’s a no-brainer: we know that ZFS will distribute writes according to the amount of FREE on each vdev, so if you create a pool with one 1T vdev and one 2T vdev, twice as many writes will go on the 2T vdev as the 1T vdev; natch.

Vdevs ADDed after data was already written to the pool

If you zpool add one or more vdevs to an existing pool that already has data on it, ZFS isn’t going to redistribute the writes you already made to the older vdevs.

For example, let’s say you create a pool with a single 2T vdev, write 1T of data to it, then add another 2T vdev. You’ve got 1T FREE on one vdev and 2T FREE on the other vdev; ZFS will now write two records to the new vdev for every one record it writes to the old one; this means that while your writes will remain imbalanced for the rest of the pool’s life, each vdev will become full at about the same time.

You might ask, why not bias writes to the new vdevs even more heavily, so that they achieve balance before the pool’s full? The answer is consistency. If you distribute two writes to a 2T FREE vdev for every one write to a 1T FREE vdev, you have a consistent write performance profile for the remainder of the life of the pool, rather than a really bad performance profile either now (if you bias all the writes to the vdev with more FREE) or at the end of the pool’s life (if you deliver writes evenly until one vdev is entirely full, then have no choice but to send all writes to the one vdev that still has FREEspace remaining).

Balanced writes, imbalanced deletes

OK, this is the fun one. Let’s say you create a pool with two equally-sized vdevs, and a year later you look at it and you’ve got imbalanced writes. What gives?

Well, this is going to be more likely the larger your recordsize is, since as far as I can tell each record is written to a single vdev (not split across the pool as a whole in ashift-sized blocks). Basically, although ZFS wrote your data balanced across your equally-sized vdevs, you deleted more records from one vdev than another.

To demonstrate this effect (and give myself a sanity check!), I created a pool with two equally-sized 500GB vdevs, set recordsize=1M, and wrote a ton of 900K files to the pool.

root@banshee:~# zpool create -oashift=13 alloctest /ssd/alloctest/disk1.raw /rust/alloctest/disk2.raw
root@banshee:~# zfs set recordsize=1M alloctest

root@banshee:~# for i in {1..3636} do ; cp /tmp/900K.bin /alloctest/$i.bin ; done

root@banshee:~# zpool iostat -v alloctest
                               capacity   operations  bandwidth
pool                          alloc free  read  write read write
----------------------------- ----- ----- ----- ----- ----- -----
alloctest                     3.14G 989G  0     45    4.07K 14.2M
 /rust/alloctest/disk1.raw    1.57G 494G  0     22    2.04K 7.10M
 /ssd/alloctest/disk2.raw     1.57G 494G  0     22    2.04K 7.09M
----------------------------- ----- ----- ----- ----- ----- -----

As expected, these files are balanced equally across each vdev in the pool… even though one of the vdevs is much, much faster than the other, since they had the same FREE space available.

Now, we write a tiny bit of Perl to delete only the even-numbered files from alloctest

#!/usr/bin/perl

opendir (my $dh, "/alloctest") || die "Can't open directory: $!";

while (readdir $dh) { 
    my $file = $_; 
    $file =~ s/\.bin$// ; 
    if ($file/2 == int($file/2)) { 
        # this is an even-numbered file - delete it
        unlink "/alloctest/$file.bin"; 
    }
}

closedir $dh;

Now we run our little bit of Perl, delete the even-numbered files only, and see if we’re left with imbalanced data:

root@banshee:~# perl ~/deleteevens.pl

root@banshee:~# zpool iostat -v alloctest
                               capacity   operations  bandwidth
pool                          alloc free  read  write read write
----------------------------- ----- ----- ----- ----- ----- -----
alloctest                     1.57G 990G  0     24    2.13K 7.44M
 /rust/alloctest/disk1.raw    12.3M 496G  0     12    1.07K 3.72M
 /ssd/alloctest/disk2.raw     1.56G 494G  0     12    1.07K 3.72M
----------------------------- ----- ----- ----- ----- ----- -----

Bingo! 12.3M ALLOCed on disk1, and 1.56G ALLOCed on disk2 – it took some careful planning, but we now have imbalanced data on a pool with equally-sized vdevs that have been present since the pool’s creation.

However, it’s not imbalanced because ZFS wrote it that way, it’s imbalanced because we deleted it that way.  By deleting all the even-numbered files, we got rid of the files on /ssd/alloctest/disk1.raw while leaving all the files (actually, all the records) on /ssd/alloctest/disk2.rawintact. And since ZFS allocates writes according to FREE per vdev, we know that our data will slowly creep back into balance, as ZFS favors the vdev with a higher FREE count on new writes.

In practice, most people shouldn’t see a really large imbalance like this in normal usage, even with a large recordsize. I had to pretty specifically gimmick this scenario up to save files right at the desired recordsize and then delete them very specifically in a pattern which would produce the results I was looking for; organic deletions should be very unlikely to create a large imbalance.

ZFS allocates writes according to free space per vdev, not latency per vdev

I frequently see the mistaken idea popping up that ZFS allocates writes to the quickest vdev to respond. This isn’t the case: ZFS allocates pool writes in proportion to the amount of free space available on each vdev, so that the vdevs will become full at roughly the same time regardless of how small or large each was to begin with.

Testing: one large slow vdev, one small fast vdev

We can demonstrate this quickly and easily. Below, I use the truncate command to create raw storage files on two pools: rust and ssd.  By creating a 10G storage file on rust and a 2G storage file on ssd, we will see quickly whether ZFS prefers to allocate data according to free space or to latency: the ssd storage is tremendously lower latency, but the size of the device on the rust is larger.

root@banshee:~# zfs create ssd/alloctest
root@banshee:~# zfs create rust/alloctest
root@banshee:~# zfs set compression=off ssd/alloctest
root@banshee:~# zfs set compression=off rust/alloctest
root@banshee:~# truncate -s 10G /rust/alloctest/10Grust.raw
root@banshee:~# truncate -s 2G /ssd/alloctest/2Gssd.raw
root@banshee:~# zpool create -oashift=13 alloctest /rust/alloctest/10Grust.raw /ssd/alloctest/2Gssd.raw
root@banshee:~# zfs set compression=off alloctest

root@banshee:~# zpool list -v alloctest
NAME                          SIZE  ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                     11.9G 672K  11.9G -       0%   0% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 416K  9.94G -       0%   0%
 /ssd/alloctest/2Gssd.raw     1.98G 256K  1.98G -       0%   0%

OK, now we’ve got our lopsided pool “alloctest”, which has one very fast 2G vdev and one much slower 10G vdev. Let’s see what happens when we dump 2GB of data into it:

root@banshee:~# dd if=/dev/zero bs=256M count=8 of=/alloctest/2G.bin
8+0 records in
8+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 16.6184 s, 129 MB/s

root@banshee:~# zpool list -v alloctest
NAME                          SIZE  ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                     11.9G 2.00G 9.92G -       9%   16% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 1.56G 8.37G -       9%   15%
 /ssd/alloctest/2Gssd.raw     1.98G 451M  1.54G -       13%  22%

We’ve ALLOC’d 451M to the smaller vdev, and 1.56G to the larger vdev – a ratio of 3.54:1, quite close to the 5:1 ratio of the storage sizes themselves.

What if we dump more data in?

root@banshee:~# dd if=/dev/zero bs=256M count=12 of=/alloctest/3G.bin
12+0 records in
12+0 records out
3221225472 bytes (3.2 GB, 3.0 GiB) copied, 29.0672 s, 111 MB/s

root@banshee:~# zpool list -v alloctest
NAME                          SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                     11.9G 5.01G 6.91G -        24%  42% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 3.92G 6.02G -        23%  39%
 /ssd/alloctest/2Gssd.raw     1.98G 1.09G 916M  -        34%  54%

3.92G to 1.09G – 3.59 to 1, or no real change. Let’s fill the pool literally to bursting:

root@banshee:~# dd if=/dev/zero bs=256M count=48 of=/alloctest/12G.bin
dd: error writing '/alloctest/12G.bin': No space left on device
27+0 records in
26+0 records out
7014973440 bytes (7.0 GB, 6.5 GiB) copied, 99.4393 s, 70.5 MB/s

root@banshee:~# zpool list -v alloctest
NAME                          SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                     11.9G 11.5G 381M  -        58%  96% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 9.61G 330M  -        58%  96%
 /ssd/alloctest/2Gssd.raw     1.98G 1.93G 50.8M -        61%  97%

With the pool entirely full, we have a ratio of 4.98:1 – still not quite the exact 5:1 ratio of our vdevs’ sizes, but pretty damn close.

Testing: one large fast vdev, one small slow vdev

OK… now what if we repeat the same experiment, but this time we put the big vdev on ssd and the little one on rust?

root@banshee:~# truncate -s 10G /ssd/alloctest/10Gssd.raw
root@banshee:~# truncate -s 2G /rust/alloctest/2Grust.raw
root@banshee:~# zpool create -oashift=13 alloctest /ssd/alloctest/10Gssd.raw /rust/alloctest/2Grust.raw
root@banshee:~# zfs set compression=off alloctest

root@banshee:~# zpool list -v alloctest
NAME                        SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                   11.9G 552K  11.9G -        0%   0% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 336K  9.94G -        0%   0%
 /rust/alloctest/2Grust.raw 1.98G 216K  1.98G -        0%   0%

OK, the tables have turned. Now we’ve got a 12G pool with 10G of the storage on fast SSD, and 2G of the storage on slow rust. Let’s dump data in it:

root@banshee:~# dd if=/dev/zero bs=256M count=8 of=/alloctest/2G.bin
8+0 records in
8+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 13.5287 s, 159 MB/s

root@banshee:~# zpool list -v alloctest
NAME                        SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                   11.9G 1.98G 9.95G -        9%   16% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 1.55G 8.39G -        9%   15%
 /rust/alloctest/2Grust.raw 1.98G 440M  1.56G -        13%  21%

1.55G to 440M – 3.6:1. That’s a pretty familiar ratio, isn’t it? Let’s dump another 3G of data in, just like we did earlier, when the big vdev was rust:

root@banshee:~# dd if=/dev/zero bs=256M count=12 of=/alloctest/3G.bin
12+0 records in
12+0 records out
3221225472 bytes (3.2 GB, 3.0 GiB) copied, 23.5282 s, 137 MB/s

root@banshee:~# zpool list -v alloctest
NAME                        SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                   11.9G 5.01G 6.91G -        25%  42% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 3.92G 6.02G -        24%  39%
 /rust/alloctest/2Grust.raw 1.98G 1.09G 916M  -        34%  54%

1.09G to 3.92G ALLOCated… simplified, that’s 3.6:1 again. Just like it was when the big vdev was rust and the small vdev was ssd.

What about high-IOPS, small random writes?

For this one, I set up equally-sized vdevs on rust and ssd, created a pool with no compression, and began populating them with 4K synchronously written files, which is just about the maximum IOPS load you can put on a pool:

root@banshee:~# for i in {1..1048576}
> do
> cp /tmp/4K.bin /alloctest/$i.bin
> sync
> done

This gives us a stream of steady 4K synchronous writes to the pool (as ensured by that sync command in the loop).

Checking zpool iostat -v alloctest while the data is streaming onto the pool confirms that the writes are balanced equally between the equal-sized drives, even though we’re doing 4K writes, and one of the vdevs is an Intel 480GB SSD and the other is WD Red 4TB rust drive:

root@banshee:~# zpool iostat -v alloctest
 capacity operations bandwidth
pool                          alloc free  read  write read  write
----------------------------- ----- ----- ----- ----- ----- -----
alloctest                     4.57G 987G  171   334   1.34M 6.12M
 /ssd/alloctest/500G.raw      2.29G 494G  85    172   683K  3.08M
 /rust/alloctest/500G.raw     2.28G 494G  85    161   685K  3.05M
----------------------------- ----- ----- ----- ----- ----- -----

There’s no significant difference: each device is receiving roughly the same number of operations, and the same amount of bandwidth, at any given second; and we’re accumulating the same amount of data on each same-sized vdev.

The rule of thumb – as we’re seeing here – is that writes to any given vdev bind on the slowest disk in the vdev, and writes to a pool bind on the slowest vdev in the pool. In this case, we’re binding on the performance of the rust vdev. The reason we’re binding on that slower vdev is to keep the pool from filling imbalanced.

Conclusion

ZFS allocates writes to the pool according to the amount of free space left on each vdev, period. With the small vdev sizes we used for testing here, this didn’t result in a “perfect” allocation ratio exactly matching our vdev sizes – but the “imperfect” ratio we got was the same whether the smaller vdev was the slower one or the faster one. And when we tested with 4K synchronous writes to a pool with evenly sized vdevs, the throughput bound to the slower of the two vdevs, and we could see the data moving at the same pace onto each of those vdevs – not allocated according to their individual capacities.

This should remove any confusion about whether ZFS (at least, as of 0.6.5.6) “prefers” faster/lower latency vdevs when allocating writes. It does not.

If you’re frowning because you’ve got an imbalanced distribution of data across your pool and not sure how it happened, see here.

ZVOL vs QCOW2 with KVM

When mixing ZFS and KVM, should you put your virtual machine images on ZVOLs, or on .qcow2 files on plain datasets? It’s a topic that pops up a lot, usually with a ton of people weighing in on performance without having actually done any testing.  My old benchmarks are getting a little long in the tooth, so here’s an fio random write run with 4K blocksize, done on both a .qcow2 on a dataset, and a zvol.

Test Configuration

Host:

CPU :  Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz
RAM : 32 GB DDR4 SDRAM
SATA : Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
OS : Ubuntu 16.04.4 LTS, fully updated as of 2018-03-13
FS : ZFS 0.6.5.6-0ubuntu19, from Canonical main repo
Disks : 2x Samsung 850 Pro 1TB SATA3, mirror vdev
ZFS parameters: ashift=13,recordsize=8K,atime=off,compression=lz4

Guest:

CPU : Intel Core Processor (Broadwell), 2 threads
RAM : 512MB
OS : Ubuntu 16.04.4 LTS, fully updated as of 2018-03-13
FS : ext4
Disks: /mnt/zvol on 20G zvol volume, /mnt/dataset on 20G .qcow2 file

Synchronous 4K write results

ZVOL, –ioengine=sync:

root@benchmark:/mnt/zvol# fio --name=random-write --ioengine=sync --iodepth=4 \
                              --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                              --end_fsync=1
[...]
Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=50453KB/s, minb=3153KB/s, maxb=3153KB/s, mint=83116msec, maxt=83132msec

QCOW2, –ioengine=sync:

root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=sync --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
[...]
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=45767KB/s, minb=2860KB/s, maxb=2976KB/s, mint=88058msec, maxt=91643msec

So, 50.5 MB/sec (zvol) vs 45.8 MB/sec (qcow2). Yes, there’s a difference; at least on the most punishing I/O workloads. Is it perceptible enough to matter? Probably not, for most use cases, given the benefits in ease of management and maintenance for .qcow2 on datasets. QCOW2 are easier to provision, you don’t have to worry about refreservation keeping you from taking snapshots, they’re not significantly more difficult to mount offline (modprobe nbd ; qemu-nbd -c /dev/nbd0 /path/to/image.qcow2 ; mount -oro /mnt/image /dev/nbd0 or similar); and probably the most importantly, filling the underlying storage beneath a qcow2 won’t crash the guest.

Tuning QCOW2 for even better performance

I found out yesterday that you can tune the underlying cluster size of the .qcow2 format. Creating a new .qcow2 file tuned to use 8K clusters – matching our 8K recordsize, and the 8K underlying hardware blocksize of the Samsung 850 Pro drives in our vdev – produced tremendously better results. With the tuned qcow2, we more than tripled the performance of the zvol – going from 50.5 MB/sec (zvol) to 170 MB/sec (8K tuned qcow2)!

QCOW2 -o cluster_size=8K, –ioengine=sync:

root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=sync --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
[...]
Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=170002KB/s, minb=10625KB/s, maxb=12698KB/s, mint=20643msec, maxt=24672msec

ZVOL won’t pause the guest if storage is unavailable

If you fill the underlying pool with a guest that’s using a zvol for its storage, the filesystem in the guest will panic. From the guest’s perspective, this is a hardware I/O error, and the guest and/or its apps which use that virtual disk will crash, leaving it in an unknown and possibly corrput state.

If the guest uses a .qcow2 file on a dataset for storage, the same problem is handled much more safely. When writes become unavailable on host storage, the guest will be automatically paused by libvirt. This gives you a chance to free up space, then virsh resume the guest. The net effect is that the guest and its apps never realize there was ever a problem in the first place. Any pending writes complete automatically and without error once you’ve cleared the host storage problem and resumed the guest.

ZVOL doesn’t honor guest synchronous writes

It may also be worth noting that the guest seems a little less clued in with what’s going on with its storage when using the zvol. I specified --ioengine=sync for these test runs, which should – repeat, should – have made the also-specified parameter end_fsync=1 irrelevant, since all writes were supposed to be synchronous.

On the .qcow2-hosted storage, the data was written verifiably sync, since we can see there’s no pause at the end for end_fsync=1 to finish flushing the data to the metal:

Jobs: 16 (f=16): [w(16)] [66.7% done] [0KB/75346KB/0KB /s]
Jobs: 16 (f=16): [w(16)] [68.0% done] [0KB/0KB/0KB /s]
Jobs: 16 (f=16): [w(16)] [72.0% done] [0KB/263.8MB/0KB /s]
Jobs: 16 (f=16): [w(8),F(1),w(7)] [80.0% done] [0KB/199.1MB/0KB /s] 
Jobs: 15 (f=15): [w(8),_(1),w(7)] [80.8% done] [0KB/53866KB/0KB /s] 
Jobs: 15 (f=15): [w(3),F(1),w(4),_(1),w(3),F(1),w(3)] [84.6% done] 
Jobs: 12 (f=12): [F(1),w(2),_(1),w(4),_(2),w(2),_(1),w(3)] [85.2% done] 
Jobs: 8 (f=8): [_(4),w(4),_(2),w(2),_(1),w(1),_(1),w(1)] [88.9% done] Jobs: 4 (f=3): [_(4),F(1),_(1),w(1),_(3),F(1),_(4),w(1)] [100.0% done] 

random-readwrite: (groupid=0, jobs=1): err= 0: pid=1773: Tue Mar 13 13:57:16 2018

The ZVOL hosted storage, on the other hand, clearly was not honoring ioengine=sync, as it spent a significant amount of time after all data was supposedly already written, waiting for end_fsync=1 to finish:

Jobs: 16 (f=16): [w(16)] [81.0% done] [0KB/527.2MB/0KB /s] 
Jobs: 16 (f=16): [w(10),F(1),w(5)] [94.7% done] [0KB/551.6MB/0KB /s]
Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/155.2MB/0KB /s]
Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]
Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]
Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]

 ------[[[ above line repeats for 60 more lines ]]]------

Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]

random-readwrite: (groupid=0, jobs=1): err= 0: pid=1792: Tue Mar 13 13:57:42 2018

This strikes me as pretty disturbing; you could end up in a world of hurt if you’re expecting your host to honor the guest’s synchronous writes when, in fact, it’s not.

Asynchronous 4K write results

Well, hrm. Realizing now that zvol storage doesn’t actually honor synchronous write requests very well, what if we use the libaio (native Linux asynchronous I/O) engine instead?

ZVOL, –ioengine=libaio:

root@benchmark:/mnt/zvol# fio --name=random-write --ioengine=libaio --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
 ... Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=139484KB/s, minb=8717KB/s, maxb=8722KB/s, mint=30054msec, maxt=30070msec

QCOW2, –ioengine=libaio:

root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=libaio --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
 ... Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=164392KB/s, minb=10274KB/s, maxb=11651KB/s, mint=22498msec, maxt=25514msec

And there you have it – qcow2 at 164MB/sec vs zvol at 139 MB/sec. So when using asynchronous I/O, the qcow2-backed virtual disk actually finished the fio run faster than the zvol-backed disk.

What if we tune the .qcow2 for 8K cluster size, like we did above in the synchronous write test?

QCOW2 -o cluster_size=8K, –ioengine=libaio:

root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=libaio --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
 ... Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=181304KB/s, minb=11331KB/s, maxb=13543KB/s, mint=19356msec, maxt=23134msec

The improvements aren’t as drastic here – 181 MB/sec (tuned qcow2) vs 164 MB/sec (default qcow2) vs 139 MB/sec (zvol) – but they’re still a clear improvement, and the qcow2 storage is still faster than the zvol. (If anybody knows similar tuning that can be done to the zvol to improve its numbers, please tweet or DM me @jrssnet.)

Conclusion: .qcow2 FTW

For me, it’s a no-brainer: qcow2 files are only slightly slower on even the most punishing I/O workloads under default, untuned configuration, while being MUCH easier to manage, and arguably safer (won’t crash the guest if the host fills up the storage, honors sync write requests more predictably). And if you take the time to tune the .qcow2 on creation, they actually outperform the zvol. Winner: .qcow2.

Demonstrating ZFS zpool write distribution

One of my pet peeves is people talking about zfs “striping” writes across a pool. It doesn’t help any that zfs core developers use this terminology too – but it’s sloppy and not really correct.

ZFS distributes writes among all the vdevs in a pool.  If your vdevs all have the same amount of free space available, this will resemble a simple striping action closely enough.  But if you have different amounts of free space on different vdevs – either due to disks of different sizes, or vdevs which have been added to an existing pool – you’ll get more blocks written to the drives which have more free space available.

This came into contention on Reddit recently, when one senior sysadmin stated that a zpool queues the next write to the disk which responds with the least latency.  This statement did not match with my experience, which is that a zpool binds on the performance of the slowest vdev, period.  So, I tested, by creating a test pool with sparse images of mismatched sizes, stored side-by-side on the same backing SSD (which largely eliminates questions of latency).

root@banshee:/tmp# qemu-img create -f qcow2 512M.qcow2 512M root@banshee:/tmp# qemu-img create -f qcow2 2G.qcow2 2G
root@banshee:/tmp# qemu-nbd -c /dev/nbd0 /tmp/512M.qcow2
root@banshee:/tmp# qemu-nbd -c /dev/nbd1 /tmp/2G.qcow2
root@banshee:/tmp# zpool create -oashift=13 test nbd0 nbd1

OK, we’ve now got a 2.5 GB pool, with vdevs of 512M and 2G, and pretty much guaranteed equal latency between the two of them.  What happens when we write some data to it?

root@banshee:/tmp# dd if=/dev/zero bs=4M count=128 status=none | pv -s 512M > /test/512M.zero
 512MiB 0:00:12 [41.4MiB/s] [================================>] 100% 

root@banshee:/tmp# zpool export test
root@banshee:/tmp# ls -lh *qcow2
-rw-r--r-- 1 root root 406M Jul 27 15:25 2G.qcow2
-rw-r--r-- 1 root root 118M Jul 27 15:25 512M.qcow2

There you have it – writes distributed with a ratio of roughly 4:1, matching the mismatched vdev sizes. (I also tested with a 512M image and a 1G image, and got the expected roughly 2:1 ratio afterward.)

OK. What if we put one 512M image on SSD, and one 512M image on much slower rust?  Will the pool distribute more of the writes to the much faster SSD?

root@banshee:/tmp# qemu-img create -f qcow2 /tmp/512M.qcow2 512M
root@banshee:/tmp# qemu-img create -f qcow2 /data/512M.qcow2 512M

root@banshee:/tmp# qemu-nbd -c /dev/nbd0 /tmp/512M.qcow2
root@banshee:/tmp# qemu-nbd -c /dev/nbd1 /data/512M.qcow2

root@banshee:/tmp# zpool create test -oashift=13 nbd0 nbd1
root@banshee:/tmp# dd if=/dev/zero bs=4M count=128 | pv -s 512M > /test/512M.zero 
512MiB 0:00:48 [10.5MiB/s][================================>] 100%
root@banshee:/tmp# zpool export test
root@banshee:/tmp# ls -lh /tmp/512M.qcow2 ; ls -lh /data/512M.qcow2 
-rw-r--r-- 1 root root 266M Jul 27 15:07 /tmp/512M.qcow2 
-rw-r--r-- 1 root root 269M Jul 27 15:07 /data/512M.qcow2

Nope. Once again, zfs distributes the writes according to the amount of free space available – even when this causes performance to bind *severely* on the slowest vdev in the pool.

You should expect to see this happening if you have a vdev with failing hardware, as well – if any one disk is throwing massive latency instead of just returning errors, your entire pool will as well, until the deranged disk has been removed.  You can usually spot this sort of problem using iotop – all of the disks in your pool will have roughly the same throughput in MB/sec (assuming they’ve got equivalent amounts of free space left!), but your problem disk will show a much higher %UTIL than the rest.  Fault that slow disk, and your pool performance returns to normal.

 

ZFS clones: Probably not what you really want

ZFS clones look great on paper: they’re instantaneously generated, they’re read/write, they’re initially “free” because they reference the same blocks their parent snapshots do. They’re also (initially) frequently extra-snappy performance-wise, because a lot of those parent blocks are very likely already in the ARC. If you create ten clones of the same VM image (for instance), all ten clones will share the same blocks in the ARC instead of them needing to be in the ARC ten different times. Huge win!

But, as great as a clone sounds at first blush, you probably don’t want to use them for anything that isn’t ephemeral (intended to be destroyed in fairly short order). This is because a clone’s parent snapshot is forever immutable; you can’t destroy the parent snapshot without destroying the clone along with it… even if and when the clone becomes 100% divergent, and no longer shares any block references with its parent. Let’s examine this on a small scale.

Practical testing

On my workstation banshee, I create a new dataset, make sure compression is turned off so as not to confuse us, and populate it with a 256MB chunk of random binary stuff:

root@banshee:~# zfs create banshee/demo ; zfs set compression=off banshee/demo
root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.483868 s, 555 MB/s
 256MiB 0:00:00 [ 533MiB/s] [<=>                                               ]

I know this looks a little weird, but AES-256 is roughly an order of magnitude faster than /dev/urandom: so what I did here was use /dev/urandom to seed AES-256, then encrypt a 256MB chunk of /dev/zero with it. At the end of this procedure, we have a dataset with 256MB of data in it:

root@banshee:~# ls -lh /banshee/demo
total 262M
-rw-r--r-- 1 root root 256M Mar 15 14:39 random.bin
root@banshee:~# zfs list banshee/demo
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   262M  83.3G   262M  /banshee/demo

OK. Next step, we take a snapshot of banshee/demo, then create a clone using that snapshot as its parent.

Creating a clone

You don’t actually create a ZFS clone of a dataset at all; you create a clone from a snapshot of a dataset. So before we can “clone banshee/demo”, we first have to take a snapshot of it, and then we clone that.

root@banshee:~# zfs snapshot banshee/demo@parent-snapshot
root@banshee:~# zfs clone banshee/demo@parent-snapshot banshee/demo-clone
root@banshee:~# zfs list -rt all banshee/demo
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   262M  83.3G   262M  /banshee/demo
banshee/demo@parent-snapshot      0      -   262M  -
root@banshee:~# zfs list -rt all banshee/demo-clone
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.3G   262M  /banshee/demo-clone

So right now, we have the dataset banshee/demo, which shares all its blocks with banshee/demo@parent-snapshot, which in turn shares all its blocks with banshee/demo-clone. We see 262M in USED for banshee/demo, with nothing or next-to-nothing in USED for either banshee/demo@parent-snapshot or banshee/demo-clone.

Beginning divergence: removing data

Now, we remove all the data from banshee/demo:

root@banshee:~# rm /banshee/demo/random.bin
root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   262M  83.3G    19K  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.3G   262M  /banshee/demo-clone

We still only have 262M of USED – but it’s all actually in banshee/demo@parent-snapshot now. You can tell because the REFER column has changed – banshee/demo@parent-snapshot and banshee/demo-clone still both REFER 262M, but banshee/demo only REFERs 19K now. (You still see 262M in USED for banshee/demo because banshee/demo@parent-snapshot is a child of banshee/demo, so its contents count towards banshee/demo‘s USED figure.)

Next up: we re-fill the parent dataset, banshee/demo, with 256MB of different random garbage.

Continuing divergence: replacing data in the parent

root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.498349 s, 539 MB/s
 256MiB 0:00:00 [ 516MiB/s] [<=>                                               ]
root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   523M  83.2G   262M  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.2G   262M  /banshee/demo-clone

OK, at this point you see that the USED for banshee/demo shoots up to 523M: that’s the total of the 262M of original random garbage which is still preserved in banshee/demo@parent-snapshot, plus the new 262M of different random garbage in banshee/demo itself. The snapshot now diverges completely from the parent dataset, having no blocks in common at all.

So far, banshee/demo-clone is still 100% convergent with banshee/demo@parent-snapshot, so we’re still getting some conservation of space on disk and in ARC from that. But remember, the whole point of making the clone was so that we could write to it as well as read from it. So let’s do exactly that, and make the clone 100% divergent from its parent, too.

Diverging completely: replacing data in the clone

root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo-clone/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.50151 s, 535 MB/s
 256MiB 0:00:00 [ 534MiB/s] [<=>                                               ]
root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   523M  82.8G   262M  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone  262M  82.8G  262M  /banshee/demo-clone

There, done. We now have a parent dataset, banshee/demo, which diverges completely from its snapshot banshee/demo@parent-snapshot, and a clone, banshee/demo-clone, which also diverges completely from banshee/demo@parent-snapshot.

Examining the suck

Since neither the parent, its snapshot, nor the clone share any blocks with one another anymore, we’re using the full 786MB of on-disk space that the three of them add up to. And since they also don’t share any blocks in the ARC, we’re left with absolutely no benefit in either storage consumption or performance to our having used a clone.

Worse, despite having no blocks in common and no perceptible benefit to the clone structure, all three are still inextricably linked, and neither banshee/demo nor banshee/demo@parent-snapshot can be destroyed without also destroying banshee/demo-clone:

root@banshee:~# zfs destroy banshee/demo -r
cannot destroy 'banshee/demo': filesystem has dependent clones
use '-R' to destroy the following datasets:
banshee/demo-clone
root@banshee:~# zfs destroy banshee/demo@parent-snapshot
cannot destroy 'banshee/demo@parent-snapshot': snapshot has dependent clones
use '-R' to destroy the following datasets:
banshee/demo-clone

So now you’re left with a great unwieldy mass of tangled dependencies, wasted space, and no perceptible benefits at all.

Conclusion and practical example

Imagine that you’re storing VM images in ZFS, and you began with a “gold” image of a freshly installed operating system, and created ten different clones to run ten different VMs from. Initially, this seemed great: you could create the clones instantaneously, and they shared tons of blocks, so they consumed a fraction of the ARC they would as complete, separate copies.

A year later, however, your gold image – of, let’s say, Ubuntu 16.04.1 – has diverged to a staggering degree with the set of rolling updates necessary to bring it all the way to Ubuntu 16.04.2. Your VMs have also diverged tremendously, from their parent snapshot and from one another. And now you’re stuck with the year-old snapshot of the “gold” image, completely useless to you but forever engraved on your drive unless and until you’re willing to replicate or otherwise block-for-block copy your VMs painstakingly into self-sufficient datasets with no references. You also have no remaining performance benefits, and you have an extra SPOF (single point of failure) where some admin – maybe even you – might see that parent snapshot nobody cared about anymore taking up all that disk space, and…

root@banshee:~# zfs destroy -R banshee/demo@parent-snapshot
root@banshee:~# zfs list banshee/demo-clone
cannot open 'banshee/demo-clone': dataset does not exist

One “oops” later, that “useless” parent snapshot and every single one of those clones you were using in production are gone forever. (Or, hopefully, just gone until you can restore them from your off-pool backup. You are maintaining replicated backups on at least one other pool, preferably on another machine, aren’t you? Aren’t you?!)

Depressing Storage Calculator

When a Terabyte is not a Terabyte

It seems like a stupid question, if you’re not an IT professional – and maybe even if you are – how much storage does it take to store 1TB of data? Unfortunately, it’s not a stupid question in the vein of “what weighs more, a pound of feathers or a pound of bricks”, and the answer isn’t “one terabyte” either. I’m going to try to break down all the various things that make the answer harder – and unhappier – in easy steps. Not everybody will need all of these things, so I’ll try to lay it out in a reasonably likely order from “affects everybody” to “only affects mission-critical business data with real RTO and RPO defined”.

Counting the Costs

Simple Local Storage

Computer TB vs Manufacturer TB

To your computer, and to all computers since the dawn of computing, a KB is actually a “kibibyte”, a megabyte a “mebibyte”, and so forth – they’re powers of two, not of ten. So 1 KiB = 2^10 = 1024. That’s an extra 24 bytes from a proper Kilobyte, which is 10^3 = 1000. No big deal, right? Well, the difference squares itself with each hop up from KB to MB to GB to TB, and gets that much more significant. Storage manufacturers prefer – and also have, since the dawn of time – to measure in those proper power-of-ten units, since that means they get to put bigger numbers on a device of a given actual size and thus try to trick you into thinking it’s somehow better.

At the Terabyte/Tebibyte level, you’re talking about the difference between 2^40 and 10^12. So 1 TiB, as your computer measures data, is 1.0995 TB as the rat bastards who sell hard drives measure storage. Let’s just go ahead and round that up to a nice easy 1.1.

TL;DR: multiply times 1.1 to account for vendor units.

Working Free Space

Remember those sliding number puzzles you had as a kid, where the digits 1-8 were embedded in a 9-square grid, and you were supposed to slide them around one at a time until you got them in order? Without the “9” missing, you wouldn’t be able to slide them. That’s a pretty decent rough analogy of how storage generally works, for all sorts of general reasons. If you don’t have any free space, you can’t move the tiles around and actually get anything done. For our sliding number puzzles when we were kids, that was 8/9 of the available storage occupied. A better rule of thumb for us is 8/10, or 80%. Once your disk(s) are 80% full, you should consider them full, and you should immediately be either deleting things or upgrading. If they hit 90% full, you should consider your own personal pants to be on actual fire, and react with an appropriate amount of immediacy to remedy that.

TL;DR: multiply times 1.25 to account for working free space.

Growth

You’re probably not really planning on just storing one chunk of data you have right now and never changing it. You’re almost certainly talking about curating an ever-growing collection of data that changes and accumulates as time goes on. Most people and businesses should plan on their data storage needs to double about every five years – that’s pretty conservative; it can easily get worse than that. Still, five years is also a pretty decent – and very conservative, not aggressive – hardware refresh cycle. So let’s say we want to plan for our storage needs to be fulfilled by what we buy now, until we need new everything anyway. That means doubling everything so you don’t have to upgrade for another few years.

TL;DR: multiply times 2.0 to account for data growth over the next few years.

Disaster Recovery

What, you weren’t planning on not backing your stuff up, were you? At a bare minimum, you’re going to need as much storage for backup as you did for production – most likely, you’ll need considerably more. We’ll be super super generous here and assume all you need is enough space for one single full backup – which usually only applies if you also have redundancy and very heavy-duty “oops recovery” and maybe hotspares as well. But if you don’t have all those things… this really isn’t enough. Really.

TL;DR: multiply times 2.0 to account for one full backup, as disaster recovery.

Redundancy, Hotspares, and Snapshots

Snapshots / “Oops Recovery” Schemes

You want to have a way to fix it pretty much immediately if you accidentally break a document. What this scheme looks like may differ depending on the sophistication of the system you’re working on. At best, you’re talking something like ZFS snapshots. In the middle of the road, Windows’ Volume Shadow Copy service (what powers the “Previous Versions” tab in Windows Explorer). At worst, the Recycle Bin. (And that’s really not good enough and you should figure out a way to do better.) What these things all have in common is that they offer a limiting factor to how badly you can screw yourself with the stroke of a key – you can “undo” whatever it is you broke to a relatively recent version that wasn’t broken in just a few clicks.

Different “oops recovery” schemes have different levels of efficiency, and different amounts of point-in-time granularity. My own ZFS-based systems maintain 30 hourly snapshots, 30 daily snapshots, and 3 monthly snapshots. I generally plan for snapshot space to take up about 33% as much space as my production storage, and that’s not a bad rule of thumb across the board, even if you can’t cram as many of your own schemes level of “oops points” in the same amount of space.

TL;DR: multiply times 1.3 to account for snapshots, VSS, or other “oops recovery”.

Redundancy

Redundancy – in the form of mirrored drives, striped RAID arrays, and so forth – is not a backup! However, it is a very, very useful thing to help you avoid the downtime monster, and in the case of more advanced storage schema like ZFS, to avoid corruption and bitrot. If you’re using 1:1 redundancy – RAID1, RAID10, ZFS mirrors, or btrfs-RAID1 distributed redundancy – this means you need two of every drive. If you’re using two blocks of parity in each eight block stripe (think RAID6 or ZFS RAIDZ2 with eight drives in each vdev), you’re going to be looking at 75% theoretical efficiency that comes out to more like 70% actual efficiency after stripe overhead. I’m just going to go ahead and say “let’s calculate using the more pessimistic number”. So, double everything to account for redundancy.

TL;DR: multiply times 2.0 to account for redundant storage scheme.

Hotspare

This is probably going to be the least common item on the list, but the vast majority of my clients have opted for it at this point. A hotspare server is ready to take over for the production server at a moment’s notice, without an actual “restore the backup” type procedure. With Sanoid, this most frequently means hourly replication from production to hotspare, with the ability to spin up the replicated VMs – both storage and hypervisor – directly on the hotspare server. The hotspare is thus promoted to being production, and what was the production server can be repaired with reduced time pressure and restored into service as a hotspare itself.

If you have a hotspare – and if, say, ten or more people’s payroll and productivity is dependent on your systems being up and running, you probably should – that’s another full redundancy to add to the bill.

TL;DR: bump your “backup” allowance up from x2.0 to x3.0 if you also use hotspare hardware.

The Butcher’s Bill

If you have, and account for, everything we went through above, to store 1 “terabyte” of data you’ll need:

1 “terabyte” (really a tebibyte) of data
x 1.1 TiB per TB
x 1.25 for working free space
x 2.0 for planned growth over the next few years

x 3.0 for disaster recovery + hotspare systems
x 1.3 for snapshot or other “oops recovery”
x 2.0 for redundancy
==========================================
21.45 TB of actual storage hardware.

That can’t be right! You’re insane!

Alright, let’s break that down somewhat differently, then. Keep in mind that we’re talking about three separate computer systems in the above example, each with its own storage (production, hotspare, and disaster recovery). Now let’s instead assume that we’re talking about using drives of a given size, and see what that breaks down to in terms of actual usable storage on them.

Let’s forget about the hotspare and the disaster recovery boxes, so we’re looking at the purely local level now. Then let’s toss out the redundancy, since we’re only talking about one individual drive. That leaves us with 1TB / 1.1 TiB per TB / 1.25 working TiB per stored TiB / 1.3 TiB of prod+snapshots for every TiB of prod = 0.559 TiB of usable capacity per 1TB drive. Factor in planned growth by cutting that in half, and that means you shouldn’t be planning to start out storing more than 0.28 TiB of data on 1TB of storage.

TL;DR: If you have 280GiB of existing data, you need 1TB of local capacity.

That probably sounds more reasonable in terms of your “gut feel”, right? You have 280GiB of data, so you buy a 1TB disk, and that’ll give you some breathing room for a few years? Maybe you think it feels a bit aggressive (it isn’t), but it should at least be within the ballpark of how you’re used to thinking and feeling.

Now multiply by 2 for storage redundancy (mirrored disks), and by 3 for site/server redundancy (production, hotspare, and DR) and you’re at six 1TB disks total, to store 280GiB of data. 6/.28 = 21.43, and we’re right back where we started from, less a couple of rounding errors: we need to provision 21.45 TB for every 1TiB of data we’ve got right now.

8:1 rule of thumb

Based on the same calculations and with a healthy dose of rounding, we come up with another really handy, useful, memorable rule of thumb: when buying, you need eight times as much raw storage in production as the amount of data you have now.

So if you’ve got 1TiB of data, buy servers with 8TB of disks – whether it’s two 4TB disks in a single mirror, or four 2TB disks in two mirrors, or whatever, your rule of thumb is 8:1. Per system, so if you maintain hotspare and DR systems, you’ll need to do that twice more – but it’s still 8:1 in raw storage per machine.

ZFS snapshots and cold storage

OK, this isn’t really a practice I personally have any use for. I much, much prefer replicating snapshots to live systems using syncoid as a wrapper for zfs send and receive. But if you want to use cold storage, or just want to understand conceptually the way snapshots and streams work, read on.

Our working dataset

Let’s say you have a ZFS pool and dataset named, creatively enough, poolname/setname, and let’s say you’ve taken snapshots over a period of time, named @0 through @9.

root@banshee:~# zfs list -rt all poolname/setname
NAME                USED  AVAIL  REFER  MOUNTPOINT
poolname/setname    1004M  21.6G    19K  /poolname/setname
poolname/setname@0   100M      -   100M  -
poolname/setname@1   100M      -   100M  -
poolname/setname@2   100M      -   100M  -
poolname/setname@3   100M      -   100M  -
poolname/setname@4   100M      -   100M  -
poolname/setname@5   100M      -   100M  -
poolname/setname@6   100M      -   100M  -
poolname/setname@7   100M      -   100M  -
poolname/setname@8   100M      -   100M  -
poolname/setname@9   100M      -   100M  -

By looking at the USED and REFER columns, we can see that each snapshot has 100M of data in it, unique to that snapshot, which in total adds up to 1004M of data.

Sending a full backup to cold storage

Now if we want to use zfs send to move stuff to cold storage, we have to start out with a full backup of the oldest snapshot, @0:

root@banshee:~# zfs send poolname/setname@0 | pv > /coldstorage/setname@0.zfsfull
 108MB 0:00:00 [ 297MB/s] [<=>                                                 ]
root@banshee:~# ls -lh /coldstorage
total 1.8M
-rw-r--r-- 1 root root 108M Sep 15 16:26 setname@0.zfsfull

Sending incremental backups to cold storage

Let’s go ahead and take some incrementals of setname now:

root@banshee:~# zfs send -i poolname/setname@0 poolname/setname@1 | pv > /coldstorage/setname@0-@1.zfsinc
 108MB 0:00:00 [ 316MB/s] [<=>                                                 ]
root@banshee:~# zfs send -i poolname/setname@1 poolname/setname@2 | pv > /coldstorage/setname@1-@2.zfsinc
 108MB 0:00:00 [ 299MB/s] [<=>                                                 ]
root@banshee:~# ls -lh /coldstorage
total 5.2M
-rw-r--r-- 1 root root 108M Sep 15 16:33 setname@0-@1.zfsinc
-rw-r--r-- 1 root root 108M Sep 15 16:26 setname@0.zfsfull
-rw-r--r-- 1 root root 108M Sep 15 16:33 setname@1-@2.zfsinc

OK, so now we have one full and two incrementals. Before we do anything that works, let’s look at some of the things we can’t do, because these are really important to know about.

Without a full, incrementals are useless

First of all, we can’t restore an incremental without a full:

root@banshee:~# pv < /coldstorage/setname@0-@1.zfsinc | zfs receive poolname/restore
cannot receive incremental stream: destination 'poolname/restore' does not exist
  64kB 0:00:00 [94.8MB/s] [>                                   ]  0%            

Good thing we’ve got that full backup of @0, eh?

Restoring a full backup from cold storage

root@banshee:~# pv < /coldstorage/setname@0.zfsfull | zfs receive poolname/restore
 108MB 0:00:00 [ 496MB/s] [==================================>] 100%            
root@banshee:~# zfs list -rt all poolname/restore
NAME                USED  AVAIL  REFER  MOUNTPOINT
poolname/restore     100M  21.5G   100M  /poolname/restore
poolname/restore@0      0      -   100M  -

Simple – we restored our full backup of poolname to a new dataset named restore, which is in the condition poolname was in when @0 was taken, and contains @0 as a snapshot. Now, keeping in tune with our “things we cannot do” theme, can we skip straight from this full of @0 to our incremental from @1-@2?

Without ONE incremental, following incrementals are useless

root@banshee:~# pv < /coldstorage/setname@1-@2.zfsinc | zfs receive poolname/restore
cannot receive incremental stream: most recent snapshot of poolname/restore does not
match incremental source
  64kB 0:00:00 [49.2MB/s] [>                                   ]  0%

No, we cannot restore a later incremental directly onto our full of @0. We must use an incremental which starts with @0, which is the only snapshot we actually have present in full. Then we can restore the incremental from @1 to @2 after that.

Restoring incrementals in order

root@banshee:~# pv < /coldstorage/setname@0-@1.zfsinc | zfs receive poolname/restore
 108MB 0:00:00 [ 452MB/s] [==================================>] 100%            
root@banshee:~# pv < /coldstorage/setname@1-@2.zfsinc | zfs receive poolname/restore
 108MB 0:00:00 [ 448MB/s] [==================================>] 100%
root@banshee:~# zfs list -rt all poolname/restore
NAME                                                 USED  AVAIL  REFER  MOUNTPOINT
poolname/restore                                      301M  21.3G   100M  /poolname/restore
poolname/restore@0                                    100M      -   100M  -
poolname/restore@1                                    100M      -   100M  -
poolname/restore@2                                       0      -   100M  -

There we go – first do a full restore of @0 onto a new dataset, then do incremental restores of @0-@1 and @1-@2, in order, and once we’re done we’ve got a restore of setname in the condition it was in when @2 was taken, and containing @0, @1, and @2 as snapshots within it.

Incrementals that skip snapshots

Let’s try one more thing – what if we take, and restore, an incremental directly from @2 to @9?

root@banshee:~# zfs send -i poolname/setname@2 poolname/setname@9 | pv > /coldstorage/setname@2-@9.zfsinc
 108MB 0:00:00 [ 324MB/s] [<=>                                                 ]
root@banshee:~# pv < /coldstorage/setname@2-@9.zfsinc | zfs receive banshee/restore
 108MB 0:00:00 [ 464MB/s] [==================================>] 100%            
root@banshee:~# zfs list -rt all poolname/restore
NAME                USED  AVAIL  REFER  MOUNTPOINT
poolname/restore     402M  21.2G   100M  /poolname/restore
poolname/restore@0   100M      -   100M  -
poolname/restore@1   100M      -   100M  -
poolname/restore@2   100M      -   100M  -
poolname/restore@9      0      -   100M  -

Note that when we restored an incremental of @2-@9, we did restore our dataset to the condition it was in when @9 was taken, but we did not restore the actual snapshots between @2 and @9. If we’d wanted to save those in a single operation, we should’ve used an incremental snapshot stream.

Sending incremental streams to cold storage

To save an incremental stream, we use -I instead of -i when we invoke zfs send. Let’s send an incremental stream from @0 through @9 to cold storage:

root@banshee:~# zfs send -I poolname/setname@0 poolname/setname@9 | pv > /coldstorage/setname@0-@9.zfsincstream
 969MB 0:00:03 [ 294MB/s] [   <=>                                              ]
root@banshee:~# ls -lh /coldstorage
total 21M
-rw-r--r-- 1 root root 108M Sep 15 16:33 setname@0-@1.zfsinc
-rw-r--r-- 1 root root 969M Sep 15 16:48 setname@0-@9.zfsincstream
-rw-r--r-- 1 root root 108M Sep 15 16:26 setname@0.zfsfull
-rw-r--r-- 1 root root 108M Sep 15 16:33 setname@1-@2.zfsinc

Incremental streams can’t be used without a full, either

Again, what can’t we do? We can’t use our incremental stream by itself, either: we still need that full of @0 to do anything with the stream of @0-@9.
Let’s demonstrate that by starting over from scratch.

root@banshee:~# zfs destroy -R poolname/restore
root@banshee:~# pv < /coldstorage/setname@0-@9.zfsincstream | zfs receive poolname/restore cannot receive incremental stream: destination 'poolname/restore' does not exist 256kB 0:00:00 [ 464MB/s] [> ] 0%

Restoring from cold storage using snapshot streams

All right, let’s restore from our full first, and then see what the incremental stream can do:

root@banshee:~# pv < /coldstorage/setname@0.zfsfull | zfs receive poolname/restore
 108MB 0:00:00 [ 418MB/s] [==================================>] 100%            
root@banshee:~# pv < /coldstorage/setname@0-@9.zfsincstream | zfs receive poolname/restore
 969MB 0:00:06 [ 155MB/s] [==================================>] 100%            
root@banshee:~# zfs list -rt all poolname/restore
NAME                USED  AVAIL  REFER  MOUNTPOINT
poolname/restore    1004M  20.6G   100M  /poolname/restore
poolname/restore@0   100M      -   100M  -
poolname/restore@1   100M      -   100M  -
poolname/restore@2   100M      -   100M  -
poolname/restore@3   100M      -   100M  -
poolname/restore@4   100M      -   100M  -
poolname/restore@5   100M      -   100M  -
poolname/restore@6   100M      -   100M  -
poolname/restore@7   100M      -   100M  -
poolname/restore@8   100M      -   100M  -
poolname/restore@9      0      -   100M  -

There you have it – our incremental stream is a single file containing incrementals for all snapshots transitioning from @0 to @9, so after restoring it on top of a full from @0, we have a restore of setname which is in the condition setname was in when @9 was taken, and contains all the actual snapshots from @0 through @9.

Conclusions about cold storage backups

I still don’t particularly recommend doing backups this way due to the ultimate fragility of requiring lots of independent bits and bobs. If you delete or otherwise lose the full of @0, ALL incrementals you take since then are useless – which means eventually you’re going to want to take a new full, maybe of @100, and start over again. The problem is, as soon as you do that, you’re going to take up double the space on cold storage you were using in production, since you have all the data from @0-@99 in incrementals, AND all of the surviving data from those snapshots in @100, along with any new data in @100!

You can delete the full of @0 after the full of @100 finishes, and as soon as you do, all those incrementals of @1 to @99 become completely useless and need to be deleted as well. If you screw up somehow and think you have a good full of @100 when you really don’t, and delete @0… all of your backups become useless, and you’re completely dead in the water. So be careful.

By contrast, if you use syncoid to replicate directly to a second server or pool, you never need more space than you took up in production, you no longer have the nail-biting period when you delete old fulls, and you no longer have to worry about going agonizingly through ten or a hundred separate incremental restores to get back where you were – you can delete snapshots from the back or the middle of a working dataset on another pool without any concern that it’ll break snapshots that come later than the one you’re deleting, and you can restore them all directly in a single operation by replicating in the opposite direction.

But, if you’re still determined to send snapshots to cold storage – local, Glacier, or whatever – at least now you know how.