ZFS allocates writes according to free space per vdev, not latency per vdev

I frequently see the mistaken idea popping up that ZFS allocates writes to the quickest vdev to respond. This isn’t the case: ZFS allocates pool writes in proportion to the amount of free space available on each vdev, so that the vdevs will become full at roughly the same time regardless of how small or large each was to begin with.

Testing: one large slow vdev, one small fast vdev

We can demonstrate this quickly and easily. Below, I use the truncate command to create raw storage files on two pools: rust and ssd.  By creating a 10G storage file on rust and a 2G storage file on ssd, we will see quickly whether ZFS prefers to allocate data according to free space or to latency: the ssd storage is tremendously lower latency, but the size of the device on the rust is larger.

root@banshee:~# zfs create ssd/alloctest
root@banshee:~# zfs create rust/alloctest
root@banshee:~# zfs set compression=off ssd/alloctest
root@banshee:~# zfs set compression=off rust/alloctest
root@banshee:~# truncate -s 10G /rust/alloctest/10Grust.raw
root@banshee:~# truncate -s 2G /ssd/alloctest/2Gssd.raw
root@banshee:~# zpool create -oashift=13 alloctest /rust/alloctest/10Grust.raw /ssd/alloctest/2Gssd.raw
root@banshee:~# zfs set compression=off alloctest

root@banshee:~# zpool list -v alloctest
NAME                          SIZE  ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                     11.9G 672K  11.9G -       0%   0% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 416K  9.94G -       0%   0%
 /ssd/alloctest/2Gssd.raw     1.98G 256K  1.98G -       0%   0%

OK, now we’ve got our lopsided pool “alloctest”, which has one very fast 2G vdev and one much slower 10G vdev. Let’s see what happens when we dump 2GB of data into it:

root@banshee:~# dd if=/dev/zero bs=256M count=8 of=/alloctest/2G.bin
8+0 records in
8+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 16.6184 s, 129 MB/s

root@banshee:~# zpool list -v alloctest
NAME                          SIZE  ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                     11.9G 2.00G 9.92G -       9%   16% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 1.56G 8.37G -       9%   15%
 /ssd/alloctest/2Gssd.raw     1.98G 451M  1.54G -       13%  22%

We’ve ALLOC’d 451M to the smaller vdev, and 1.56G to the larger vdev – a ratio of 3.54:1, quite close to the 5:1 ratio of the storage sizes themselves.

What if we dump more data in?

root@banshee:~# dd if=/dev/zero bs=256M count=12 of=/alloctest/3G.bin
12+0 records in
12+0 records out
3221225472 bytes (3.2 GB, 3.0 GiB) copied, 29.0672 s, 111 MB/s

root@banshee:~# zpool list -v alloctest
NAME                          SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                     11.9G 5.01G 6.91G -        24%  42% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 3.92G 6.02G -        23%  39%
 /ssd/alloctest/2Gssd.raw     1.98G 1.09G 916M  -        34%  54%

3.92G to 1.09G – 3.59 to 1, or no real change. Let’s fill the pool literally to bursting:

root@banshee:~# dd if=/dev/zero bs=256M count=48 of=/alloctest/12G.bin
dd: error writing '/alloctest/12G.bin': No space left on device
27+0 records in
26+0 records out
7014973440 bytes (7.0 GB, 6.5 GiB) copied, 99.4393 s, 70.5 MB/s

root@banshee:~# zpool list -v alloctest
NAME                          SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                     11.9G 11.5G 381M  -        58%  96% 1.00x ONLINE -
 /rust/alloctest/10Grust.raw  9.94G 9.61G 330M  -        58%  96%
 /ssd/alloctest/2Gssd.raw     1.98G 1.93G 50.8M -        61%  97%

With the pool entirely full, we have a ratio of 4.98:1 – still not quite the exact 5:1 ratio of our vdevs’ sizes, but pretty damn close.

Testing: one large fast vdev, one small slow vdev

OK… now what if we repeat the same experiment, but this time we put the big vdev on ssd and the little one on rust?

root@banshee:~# truncate -s 10G /ssd/alloctest/10Gssd.raw
root@banshee:~# truncate -s 2G /rust/alloctest/2Grust.raw
root@banshee:~# zpool create -oashift=13 alloctest /ssd/alloctest/10Gssd.raw /rust/alloctest/2Grust.raw
root@banshee:~# zfs set compression=off alloctest

root@banshee:~# zpool list -v alloctest
NAME                        SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                   11.9G 552K  11.9G -        0%   0% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 336K  9.94G -        0%   0%
 /rust/alloctest/2Grust.raw 1.98G 216K  1.98G -        0%   0%

OK, the tables have turned. Now we’ve got a 12G pool with 10G of the storage on fast SSD, and 2G of the storage on slow rust. Let’s dump data in it:

root@banshee:~# dd if=/dev/zero bs=256M count=8 of=/alloctest/2G.bin
8+0 records in
8+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 13.5287 s, 159 MB/s

root@banshee:~# zpool list -v alloctest
NAME                        SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                   11.9G 1.98G 9.95G -        9%   16% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 1.55G 8.39G -        9%   15%
 /rust/alloctest/2Grust.raw 1.98G 440M  1.56G -        13%  21%

1.55G to 440M – 3.6:1. That’s a pretty familiar ratio, isn’t it? Let’s dump another 3G of data in, just like we did earlier, when the big vdev was rust:

root@banshee:~# dd if=/dev/zero bs=256M count=12 of=/alloctest/3G.bin
12+0 records in
12+0 records out
3221225472 bytes (3.2 GB, 3.0 GiB) copied, 23.5282 s, 137 MB/s

root@banshee:~# zpool list -v alloctest
NAME                        SIZE  ALLOC FREE  EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
alloctest                   11.9G 5.01G 6.91G -        25%  42% 1.00x ONLINE -
 /ssd/alloctest/10Gssd.raw  9.94G 3.92G 6.02G -        24%  39%
 /rust/alloctest/2Grust.raw 1.98G 1.09G 916M  -        34%  54%

1.09G to 3.92G ALLOCated… simplified, that’s 3.6:1 again. Just like it was when the big vdev was rust and the small vdev was ssd.

What about high-IOPS, small random writes?

For this one, I set up equally-sized vdevs on rust and ssd, created a pool with no compression, and began populating them with 4K synchronously written files, which is just about the maximum IOPS load you can put on a pool:

root@banshee:~# for i in {1..1048576}
> do
> cp /tmp/4K.bin /alloctest/$i.bin
> sync
> done

This gives us a stream of steady 4K synchronous writes to the pool (as ensured by that sync command in the loop).

Checking zpool iostat -v alloctest while the data is streaming onto the pool confirms that the writes are balanced equally between the equal-sized drives, even though we’re doing 4K writes, and one of the vdevs is an Intel 480GB SSD and the other is WD Red 4TB rust drive:

root@banshee:~# zpool iostat -v alloctest
 capacity operations bandwidth
pool                          alloc free  read  write read  write
----------------------------- ----- ----- ----- ----- ----- -----
alloctest                     4.57G 987G  171   334   1.34M 6.12M
 /ssd/alloctest/500G.raw      2.29G 494G  85    172   683K  3.08M
 /rust/alloctest/500G.raw     2.28G 494G  85    161   685K  3.05M
----------------------------- ----- ----- ----- ----- ----- -----

There’s no significant difference: each device is receiving roughly the same number of operations, and the same amount of bandwidth, at any given second; and we’re accumulating the same amount of data on each same-sized vdev.

The rule of thumb – as we’re seeing here – is that writes to any given vdev bind on the slowest disk in the vdev, and writes to a pool bind on the slowest vdev in the pool. In this case, we’re binding on the performance of the rust vdev. The reason we’re binding on that slower vdev is to keep the pool from filling imbalanced.

Conclusion

ZFS allocates writes to the pool according to the amount of free space left on each vdev, period. With the small vdev sizes we used for testing here, this didn’t result in a “perfect” allocation ratio exactly matching our vdev sizes – but the “imperfect” ratio we got was the same whether the smaller vdev was the slower one or the faster one. And when we tested with 4K synchronous writes to a pool with evenly sized vdevs, the throughput bound to the slower of the two vdevs, and we could see the data moving at the same pace onto each of those vdevs – not allocated according to their individual capacities.

This should remove any confusion about whether ZFS (at least, as of 0.6.5.6) “prefers” faster/lower latency vdevs when allocating writes. It does not.

If you’re frowning because you’ve got an imbalanced distribution of data across your pool and not sure how it happened, see here.

Published by

Jim Salter

Mercenary sysadmin, open source advocate, and frotzer of the jim-jam.

Leave a Reply

Your email address will not be published. Required fields are marked *