Primer: How data is stored on-disk with ZFS

As with a lot of things at this blog, I’m largely writing this to confirm and solidify my own knowledge. I tend to be pretty firm on how disks relate to vdevs, and vdevs relate to pools… but once you veer down deeper into the direct on-disk storage, I get a little hazier. So here’s an attempt to remedy that, with citations, for my benefit (and yours!) down the line.

Top level: the zpool

The zpool is the topmost unit of storage under ZFS. A zpool is a single, overarching storage system consisting of one or more vdevs. Writes are distributed among the vdevs according to how much FREE space each vdev has available – you may hear urban myths about ZFS distributing them according to the performance level of the disk, such that “faster disks end up with more writes”, but they’re just that – urban myths. (At least, they’re only myths as of this writing – 2018 April, and ZFS through 7.5.)

A zpool may be created with one or more vdevs, and may have any number of additional vdevs zpool added to it later – but, for the most part, you may not ever remove a vdev from a zpool. There is working code in development to make this possible, but it’s more of a “desperate save” than something you should use lightly – it involves building a permanent lookup table to redirect requests for records stored on the removed vdevs to their new locations on remaining vdevs; sort of a CNAME for storage blocks.

If you create a zpool with vdevs of different sizes, or you add vdevs later when the pool already has a substantial amount of data in it, you’ll end up with an imbalanced distribution of data that causes more writes to land on some vdevs than others, which will limit the performance profile of your pool.

A pool’s performance scales with the number of vdevs within the pool: in a pool of n vdevs, expect the pool to perform roughly equivalently to the slowest of those vdevs, multiplied by n. This is an important distinction – if you create a pool with three solid state disks and a single rust disk, the pool will trend towards the IOPS performance of four rust disks.

Also note that the pool’s performance scales with the number of vdevs, not the number of disks within the vdevs. If you have a single 12 disk wide RAIDZ2 vdev in your pool, expect to see roughly the IOPS profile of a single disk, not of ten!

There is absolutely no parity or redundancy at the pool level. If you lose any vdev, you’ve lost the entire pool, plain and simple. Even if you “didn’t write to anything on that vdev yet” – the pool has altered and distributed its metadata accordingly once the vdev was added; if you lose that vdev “with nothing on it” you’ve still lost the pool.

It’s important to realize that the zpool is not a RAID0; in conventional terms, it’s a JBOD – and a fairly unusual one, at that.

Second level: the vdev

A vdev consists of one or more disks. Standard vdev types are single-disk, mirror, and raidz. A raidz vdev can be raidz1, raidz2, or raidz3. There are also special vdev types – log and l2arc – which extend the ZIL and the ARC, respectively, onto those vdev types. (They aren’t really “write cache” and “read cache” in the traditional sense, which trips a lot of people up. More about that in another post, maybe.)

A single vdev, of any type, will generally have write IOPS characteristics similar to those of a single disk. Specifically, the write IOPS characteristics of its slowest member disk – which may not even be the same disk on every write.

All parity and/or redundancy in ZFS occurs within the vdev level.

Single-disk vdevs

This is as simple as it gets: a vdev that consists of a single disk, no more, no less.

The performance profile of a single-disk vdev is that of, you guessed it, that single disk.

Single-disk vdevs may be expanded in size by replacing that disk with a larger disk: if you zpool attach a 4T disk to a 2T disk, it will resilver into a 2T mirror vdev. When you then zpool detach the 2T disk, the vdev becomes a 4T vdev, expanding your total pool size.

Single-disk vdevs may also be upgraded permanently to mirror vdevs; just zpool attach one or more disks of the same or larger size.

Single-disk vdevs can detect, but not repair, corrupted data records. This makes operating with single-disk vdevs quite dangerous, by ZFS standards – the equivalent, danger-wise, of a conventional RAID0 array.

However, a pool of single-disk vdevs is not actually a RAID0, and really shouldn’t be referred to as one. For one thing, a RAID0 won’t distribute twice as many writes to a 2T disk as to a 1T disk. For another thing, you can’t start out with a three disk RAID0 array, then add a single two-disk RAID1 array (or three five-disk RAID5 arrays!) to your original array, and still call it “a RAID0”.

It may be tempting to use old terminology for conventional RAID, but doing so just makes it that much more difficult to get accustomed to thinking in terms of ZFS’ real topology, hindering both understanding and communication.

Mirror vdevs

Mirror vdevs work basically like traditional RAID1 arrays – each record destined for a mirror vdev is written redundantly to all disks within the vdev. A mirror vdev can have any number of constituent disks; common sizes are 2-disk and 3-disk, but there’s nothing stopping you from creating a 16-disk mirror vdev if that’s what floats your boat.

A mirror vdev offers usable storage capacity equivalent to that of its smallest member disk; and can survive intact as long as any single member disk survives. As long as the vdev has at least two surviving members, it can automatically repair corrupt records detected during normal use or during scrubbing – but once it’s down to the last disk, it can only detect corruption, not repair it. (If you don’t scrub regularly, this means you may already be screwed when you’re down to a single disk in the vdev – any blocks that were already corrupt are no longer repairable, as well as any blocks that become corrupt before you replace the failed disk(s).

You can expand a single disk to a mirror vdev at any time using the zpool attach command; you can also add new disks to an existing mirror in the same way. Disks may also be detached and/or replaced from mirror vdevs arbitrarily. You may also expand the size of an individual mirror vdev by replacing its disks one by one with larger disks; eg start with a mirror of 2T disks, then replace one disk with a 4T disk, wait for it to resilver, then replace the second 2T disk with another 4T disk. Once there are no disks smaller than 4T in the vdev, and it finishes resilvering, the vdev will expand to the new 4T size.

Mirror vdevs are extremely performant: like all vdevs, their write IOPS are roughly those of a single disk, but their read IOPS are roughly those of n disks, where n is the number of disks in the mirror – a mirror vdev n disks wide can read blocks from all n members in parallel.

A pool made of mirror vdevs closely resembles a conventional RAID10 array; each has write IOPS similar to n/2 disks and read IOPS similar to disks, where n is the total number of disks. As with single-disk vdevs, though, I’d advise you not to think and talk sloppily and call it “ZFS RAID10” – it really isn’t, and referring to it that way blurs the boundaries between pool and vdev, hindering both understanding and accurate communication.

RAIDZ vdevs

RAIDZ vdevs are striped parity arrays, similar to RAID5 or RAID6. RAIDZ1 has one parity block per stripe, RAIDZ2 has two parity blocks per stripe, and RAIDZ3 has three parity blocks per stripe. This means that RAIDZ1vdevs can survive loss of a single disk, RAIDZ2 can survive the loss of two disks, and RAIDZ3 vdevs can survive the loss of as many as three disks.

Note, however, that – just like mirror vdevs – once you’ve stripped away all the parity, you’re vulnerable to corruption that can’t be repaired. RAIDZ vdevs take typically take significantly longer to resilver than mirror vdevs do, as well – so you really don’t want to end up completely “uncovered” (surviving, but with no remaining parity blocks) with a RAIDZ array.

Each raidz vdev offers n-(parity*n) storage capacity, where n is the storage capacity of a single disk, and parity is the number of parity blocks per stripe. So a six-disk RAIDZ1 vdev offers the storage capacity of five disks, an eight-disk RAIDZ2 vdev offers the storage capacity of six disks, and so forth.

You may create RAIDZ vdevs using mismatched disk sizes, but the vdev’s capacity will be based around the smallest member disk. You can expand the size of an existing RAIDZ vdev by replacing all of its members individually with larger disks than were originally used, but you cannot expand a RAIDZ vdev by adding new disks to it and making it wider – a 5-disk RAIDZ1 vdev cannot be converted into a 6-disk RAIDZ1 vdev later; neither can a 6-disk RAIDZ2 be converted into a 6-disk RAIDZ1.

It’s a common misconception to think that RAIDZ vdev performance scales linearly with the number of disks used. Although throughput under ideal conditions can scale towards n-parity disks, throughput under moderate to serious load will rapidly degrade toward the profile of a single disk – or even slightly worse, since it scales down toward the profile of the slowest disk for any given operation. This is the difference between IOPS and bandwidth (and it works the same way for conventional RAID!)

RAIDZ vdev IOPS performance is generally more robust than that of a conventional RAID5 or RAID6 array of the same size, because RAIDZ offers variable stripe write sizes – if you routinely write data in records only one record wide, a RAIDZ1 vdev will write to only two of its disks (one for data, and one for parity); a RAIDZ2 vdev will write to only three of its disks (one for data, and two for parity) and so on. This can mitigate some of the otherwise-crushing IOPS penalty associated with wide striped arrays; a three-record variable stripe write to a six-disk RAIDZ vdev only lights up half the disks both when written, and later, when read – which can make the performance profile of that six-disk RAIDZ resemble that of two three-disk RAIDZ1 vdevs rather than that of a single vdev.

The performance improvement described above assumes that multiple reads and writes of the three-record stripes are being requested concurrently; otherwise the entire vdev still binds while waiting for a full-stripe read or write.

Remember that you can – and with larger servers, should – have multiple RAIDZ vdevs per pool, not just one. A pool of three eight-disk RAIDZ2 vdevs will significantly outperform a pool with a single 24-disk RAIDZ2 or RAIDZ3 vdev – and it will resilver much faster when replacing failed disks.

Third level: the metaslab

Each vdev is organized into metaslabs – typically, 200 metaslabs per vdev (although this number can change, if vdevs are expanded and/or as the ZFS codebase itself becomes further optimized over time).

When you issue writes to the pool, those writes are coalesced into a txg (transaction group), which is then distributed among individual vdevs, and finally allocated to specific metaslabs on each vdev. There’s a fairly hefty logic chain which determines exactly what metaslab a record is written to; it was explained to me (with no warranty offered) by a friend who worked with Oracle as follows:

• Is this metaslab “full”? (zfs_mg_noalloc_threshold)
• Is this metaslab excessively fragmented? (zfs_metaslab_fragmentation_threshold)
• Is this metaslab group excessively fragmented? (zfs_mg_fragmentation_threshold)
• Have we exceeded minimum free space thresholds? (metaslab_df_alloc_threshold) This one is weird; it changes the whole storage pool allocation strategy for ZFS if you cross it.
• Should we prefer lower-numbered metaslabs over higher ones? (metaslab_lba_weighting_enabled) This is totally irrelevant to all-SSD pools, and should be disabled there, because it’s pretty stupid without rust disks underneath.
• Should we prefer lower-numbered metaslab groups over higher ones? (metaslab_bias_enabled) Same as above.

You can dive into the hairy details of your pool’s metaslabs using the zdb command – this is a level which I have thankfully not personally needed so far, and I devoutly hope I will continue not to need it in the future.

Fourth level: the record

Each ZFS write is broken into records, the size of which is determined by the zfs set recordsize=command. The default recordsize is currently 128K; it may range from 512B to 1M.

Recordsize is a property which can be tuned individually per dataset, and for higher performance applications, should be tuned per dataset. If you expect to largely be moving large chunks of contiguous data – for example, reading and writing 5MB JPEG files – you’ll benefit from a larger recordsize than default. Setting recordsize=1M here will allow your writes to be less fragmented, resulting in higher performance both when making the writes, and later when reading them.

Conversely, if you expect a lot of small-block random I/O – like reading and writing database binaries, or VM (virtual machine) images – you should set recordsize smaller than the default 128K. MySQL, as an example, typically works with data in 16K chunks; if you set recordsize=16K you will tremendously improve IOPS when working with that data.

ZFS CSUMs – cryptographic hashes which verify its data’s integrity – are written on a per-record basis; data written with recordsize=1M will have a single CSUM per 1MB; data written with recordsize=8K will have 128 times as many CSUMs for the same 1MB of data.

Setting recordsize to a value smaller than your hardware’s individual sector size is a tremendously bad idea, and will lead to massive read/write amplification penalties.

Fifth (and final) level: ashift

Ashift is the property which tells ZFS what the underlying hardware’s actual sector size is. The individual blocksize within each record will be determined by ashift; unlike recordsize, however, ashift is set as a number of bits rather than an actual number.  For example, ashift=13 specifies 8K sectors, ashift=12 specifies 4K sectors, and ashift=9 specifies 512B sectors.

Ashift is per vdev, not per pool – and it’s immutable once set, so be careful not to screw it up!  In theory, ZFS will automatically set ashift to the proper value for your hardware; in practice, storage manufacturers very, very frequently lie about the underlying hardware sector size in order to keep older operating systems from getting confused, so you should do your homework and set it manually. Remember, once you add a vdev to your pool, you can’t get rid of it; so if you accidentally add a vdev with improper ashift value to your pool, you’ve permanently screwed up the entire pool!

Setting ashift too high is, for the most part, harmless – you’ll increase the amount of slack space on your storage, but unless you have a very specialized workload this is unlikely to have any significant impact. Setting ashift too low, on the other hand, is a horrorshow. If you end up with an ashift=9 vdev on a device with 8K sectors (thus, properly ashift=13), you’ll suffer from massive write amplification penalties as ZFS needs to write, read, rewrite again over and over on the same actual hardware sector. I have personally seen improperly set ashift cause a pool of Samsung 840 Pro SSDs perform slower than a pool of WD Black rust disks!

Even if you’ve done your homework and are absolutely certain that your disks use 512B hardware sectors, I strongly advise considering setting ashift=12 or even ashift=13 – because, remember, it’s immutable per vdev, and vdevs cannot be removed from pools. If you ever need to replace a 512B sector disk in a vdev with a 4K or 8K sector disk, you’ll be screwed if that vdev is ashift=9.

How data gets imbalanced on ZFS

In an earlier post, I demonstrated that ZFS distributes writes evenly across vdevs according to FREE space per vdev (not based on latency or anything else: just FREE).

There are three ways I know of that you can end up with an imbalanced distribution of data across your vdevs. The first two are dead obvious; the third took a little head-scratching and empirical testing before I was certain of it.

Different-sized vdevs

If you used vdevs of different sizes in the first place, you end up with more data on the larger vdevs than the smaller vdevs.

This one’s a no-brainer: we know that ZFS will distribute writes according to the amount of FREE on each vdev, so if you create a pool with one 1T vdev and one 2T vdev, twice as many writes will go on the 2T vdev as the 1T vdev; natch.

Vdevs ADDed after data was already written to the pool

If you zpool add one or more vdevs to an existing pool that already has data on it, ZFS isn’t going to redistribute the writes you already made to the older vdevs.

For example, let’s say you create a pool with a single 2T vdev, write 1T of data to it, then add another 2T vdev. You’ve got 1T FREE on one vdev and 2T FREE on the other vdev; ZFS will now write two records to the new vdev for every one record it writes to the old one; this means that while your writes will remain imbalanced for the rest of the pool’s life, each vdev will become full at about the same time.

You might ask, why not bias writes to the new vdevs even more heavily, so that they achieve balance before the pool’s full? The answer is consistency. If you distribute two writes to a 2T FREE vdev for every one write to a 1T FREE vdev, you have a consistent write performance profile for the remainder of the life of the pool, rather than a really bad performance profile either now (if you bias all the writes to the vdev with more FREE) or at the end of the pool’s life (if you deliver writes evenly until one vdev is entirely full, then have no choice but to send all writes to the one vdev that still has FREEspace remaining).

Balanced writes, imbalanced deletes

OK, this is the fun one. Let’s say you create a pool with two equally-sized vdevs, and a year later you look at it and you’ve got imbalanced writes. What gives?

Well, this is going to be more likely the larger your recordsize is, since as far as I can tell each record is written to a single vdev (not split across the pool as a whole in ashift-sized blocks). Basically, although ZFS wrote your data balanced across your equally-sized vdevs, you deleted more records from one vdev than another.

To demonstrate this effect (and give myself a sanity check!), I created a pool with two equally-sized 500GB vdevs, set recordsize=1M, and wrote a ton of 900K files to the pool.

root@banshee:~# zpool create -oashift=13 alloctest /ssd/alloctest/disk1.raw /rust/alloctest/disk2.raw
root@banshee:~# zfs set recordsize=1M alloctest

root@banshee:~# for i in {1..3636} do ; cp /tmp/900K.bin /alloctest/$i.bin ; done

root@banshee:~# zpool iostat -v alloctest
                               capacity   operations  bandwidth
pool                          alloc free  read  write read write
----------------------------- ----- ----- ----- ----- ----- -----
alloctest                     3.14G 989G  0     45    4.07K 14.2M
 /rust/alloctest/disk1.raw    1.57G 494G  0     22    2.04K 7.10M
 /ssd/alloctest/disk2.raw     1.57G 494G  0     22    2.04K 7.09M
----------------------------- ----- ----- ----- ----- ----- -----

As expected, these files are balanced equally across each vdev in the pool… even though one of the vdevs is much, much faster than the other, since they had the same FREE space available.

Now, we write a tiny bit of Perl to delete only the even-numbered files from alloctest


opendir (my $dh, "/alloctest") || die "Can't open directory: $!";

while (readdir $dh) { 
    my $file = $_; 
    $file =~ s/\.bin$// ; 
    if ($file/2 == int($file/2)) { 
        # this is an even-numbered file - delete it
        unlink "/alloctest/$file.bin"; 

closedir $dh;

Now we run our little bit of Perl, delete the even-numbered files only, and see if we’re left with imbalanced data:

root@banshee:~# perl ~/

root@banshee:~# zpool iostat -v alloctest
                               capacity   operations  bandwidth
pool                          alloc free  read  write read write
----------------------------- ----- ----- ----- ----- ----- -----
alloctest                     1.57G 990G  0     24    2.13K 7.44M
 /rust/alloctest/disk1.raw    12.3M 496G  0     12    1.07K 3.72M
 /ssd/alloctest/disk2.raw     1.56G 494G  0     12    1.07K 3.72M
----------------------------- ----- ----- ----- ----- ----- -----

Bingo! 12.3M ALLOCed on disk1, and 1.56G ALLOCed on disk2 – it took some careful planning, but we now have imbalanced data on a pool with equally-sized vdevs that have been present since the pool’s creation.

However, it’s not imbalanced because ZFS wrote it that way, it’s imbalanced because we deleted it that way.  By deleting all the even-numbered files, we got rid of the files on /ssd/alloctest/disk1.raw while leaving all the files (actually, all the records) on /ssd/alloctest/disk2.rawintact. And since ZFS allocates writes according to FREE per vdev, we know that our data will slowly creep back into balance, as ZFS favors the vdev with a higher FREE count on new writes.

In practice, most people shouldn’t see a really large imbalance like this in normal usage, even with a large recordsize. I had to pretty specifically gimmick this scenario up to save files right at the desired recordsize and then delete them very specifically in a pattern which would produce the results I was looking for; organic deletions should be very unlikely to create a large imbalance.

ZFS allocates writes according to free space per vdev, not latency per vdev

I frequently see the mistaken idea popping up that ZFS allocates writes to the quickest vdev to respond. This isn’t the case: ZFS allocates pool writes in proportion to the amount of free space available on each vdev, so that the vdevs will become full at roughly the same time regardless of how small or large each was to begin with.

Testing: one large slow vdev, one small fast vdev

We can demonstrate this quickly and easily. Below, I use the truncate command to create raw storage files on two pools: rust and ssd.  By creating a 10G storage file on rust and a 2G storage file on ssd, we will see quickly whether ZFS prefers to allocate data according to free space or to latency: the ssd storage is tremendously lower latency, but the size of the device on the rust is larger.

root@banshee:~# zfs create ssd/alloctest
root@banshee:~# zfs create rust/alloctest
root@banshee:~# zfs set compression=off ssd/alloctest
root@banshee:~# zfs set compression=off rust/alloctest
root@banshee:~# truncate -s 10G /rust/alloctest/10Grust.raw
root@banshee:~# truncate -s 2G /ssd/alloctest/2Gssd.raw
root@banshee:~# zpool create -oashift=13 alloctest /rust/alloctest/10Grust.raw /ssd/alloctest/2Gssd.raw
root@banshee:~# zfs set compression=off alloctest

root@banshee:~# zpool list -v alloctest
alloctest                     11.9G 672K  11.9G -       0%   0% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 416K  9.94G -       0%   0%
 /ssd/alloctest/2Gssd.raw     1.98G 256K  1.98G -       0%   0%

OK, now we’ve got our lopsided pool “alloctest”, which has one very fast 2G vdev and one much slower 10G vdev. Let’s see what happens when we dump 2GB of data into it:

root@banshee:~# dd if=/dev/zero bs=256M count=8 of=/alloctest/2G.bin
8+0 records in
8+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 16.6184 s, 129 MB/s

root@banshee:~# zpool list -v alloctest
alloctest                     11.9G 2.00G 9.92G -       9%   16% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 1.56G 8.37G -       9%   15%
 /ssd/alloctest/2Gssd.raw     1.98G 451M  1.54G -       13%  22%

We’ve ALLOC’d 451M to the smaller vdev, and 1.56G to the larger vdev – a ratio of 3.54:1, quite close to the 5:1 ratio of the storage sizes themselves.

What if we dump more data in?

root@banshee:~# dd if=/dev/zero bs=256M count=12 of=/alloctest/3G.bin
12+0 records in
12+0 records out
3221225472 bytes (3.2 GB, 3.0 GiB) copied, 29.0672 s, 111 MB/s

root@banshee:~# zpool list -v alloctest
alloctest                     11.9G 5.01G 6.91G -        24%  42% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 3.92G 6.02G -        23%  39%
 /ssd/alloctest/2Gssd.raw     1.98G 1.09G 916M  -        34%  54%

3.92G to 1.09G – 3.59 to 1, or no real change. Let’s fill the pool literally to bursting:

root@banshee:~# dd if=/dev/zero bs=256M count=48 of=/alloctest/12G.bin
dd: error writing '/alloctest/12G.bin': No space left on device
27+0 records in
26+0 records out
7014973440 bytes (7.0 GB, 6.5 GiB) copied, 99.4393 s, 70.5 MB/s

root@banshee:~# zpool list -v alloctest
alloctest                     11.9G 11.5G 381M  -        58%  96% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 9.61G 330M  -        58%  96%
 /ssd/alloctest/2Gssd.raw     1.98G 1.93G 50.8M -        61%  97%

With the pool entirely full, we have a ratio of 4.98:1 – still not quite the exact 5:1 ratio of our vdevs’ sizes, but pretty damn close.

Testing: one large fast vdev, one small slow vdev

OK… now what if we repeat the same experiment, but this time we put the big vdev on ssd and the little one on rust?

root@banshee:~# truncate -s 10G /ssd/alloctest/10Gssd.raw
root@banshee:~# truncate -s 2G /rust/alloctest/2Grust.raw
root@banshee:~# zpool create -oashift=13 alloctest /ssd/alloctest/10Gssd.raw /rust/alloctest/2Grust.raw
root@banshee:~# zfs set compression=off alloctest

root@banshee:~# zpool list -v alloctest
alloctest                   11.9G 552K  11.9G -        0%   0% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 336K  9.94G -        0%   0%
 /rust/alloctest/2Grust.raw 1.98G 216K  1.98G -        0%   0%

OK, the tables have turned. Now we’ve got a 12G pool with 10G of the storage on fast SSD, and 2G of the storage on slow rust. Let’s dump data in it:

root@banshee:~# dd if=/dev/zero bs=256M count=8 of=/alloctest/2G.bin
8+0 records in
8+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 13.5287 s, 159 MB/s

root@banshee:~# zpool list -v alloctest
alloctest                   11.9G 1.98G 9.95G -        9%   16% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 1.55G 8.39G -        9%   15%
 /rust/alloctest/2Grust.raw 1.98G 440M  1.56G -        13%  21%

1.55G to 440M – 3.6:1. That’s a pretty familiar ratio, isn’t it? Let’s dump another 3G of data in, just like we did earlier, when the big vdev was rust:

root@banshee:~# dd if=/dev/zero bs=256M count=12 of=/alloctest/3G.bin
12+0 records in
12+0 records out
3221225472 bytes (3.2 GB, 3.0 GiB) copied, 23.5282 s, 137 MB/s

root@banshee:~# zpool list -v alloctest
alloctest                   11.9G 5.01G 6.91G -        25%  42% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 3.92G 6.02G -        24%  39%
 /rust/alloctest/2Grust.raw 1.98G 1.09G 916M  -        34%  54%

1.09G to 3.92G ALLOCated… simplified, that’s 3.6:1 again. Just like it was when the big vdev was rust and the small vdev was ssd.

What about high-IOPS, small random writes?

For this one, I set up equally-sized vdevs on rust and ssd, created a pool with no compression, and began populating them with 4K synchronously written files, which is just about the maximum IOPS load you can put on a pool:

root@banshee:~# for i in {1..1048576}
> do
> cp /tmp/4K.bin /alloctest/$i.bin
> sync
> done

This gives us a stream of steady 4K synchronous writes to the pool (as ensured by that sync command in the loop).

Checking zpool iostat -v alloctest while the data is streaming onto the pool confirms that the writes are balanced equally between the equal-sized drives, even though we’re doing 4K writes, and one of the vdevs is an Intel 480GB SSD and the other is WD Red 4TB rust drive:

root@banshee:~# zpool iostat -v alloctest
 capacity operations bandwidth
pool                          alloc free  read  write read  write
----------------------------- ----- ----- ----- ----- ----- -----
alloctest                     4.57G 987G  171   334   1.34M 6.12M
 /ssd/alloctest/500G.raw      2.29G 494G  85    172   683K  3.08M
 /rust/alloctest/500G.raw     2.28G 494G  85    161   685K  3.05M
----------------------------- ----- ----- ----- ----- ----- -----

There’s no significant difference: each device is receiving roughly the same number of operations, and the same amount of bandwidth, at any given second; and we’re accumulating the same amount of data on each same-sized vdev.

The rule of thumb – as we’re seeing here – is that writes to any given vdev bind on the slowest disk in the vdev, and writes to a pool bind on the slowest vdev in the pool. In this case, we’re binding on the performance of the rust vdev. The reason we’re binding on that slower vdev is to keep the pool from filling imbalanced.


ZFS allocates writes to the pool according to the amount of free space left on each vdev, period. With the small vdev sizes we used for testing here, this didn’t result in a “perfect” allocation ratio exactly matching our vdev sizes – but the “imperfect” ratio we got was the same whether the smaller vdev was the slower one or the faster one. And when we tested with 4K synchronous writes to a pool with evenly sized vdevs, the throughput bound to the slower of the two vdevs, and we could see the data moving at the same pace onto each of those vdevs – not allocated according to their individual capacities.

This should remove any confusion about whether ZFS (at least, as of “prefers” faster/lower latency vdevs when allocating writes. It does not.

If you’re frowning because you’ve got an imbalanced distribution of data across your pool and not sure how it happened, see here.