In an earlier post, I demonstrated why you shouldn’t mix rust and SSDs – reads on your pool bind at the speed of the slowest vdev, effectively making SSDs in a pool containing rust little more than extremely small, expensive rust disks themselves. That post was a follow-up to an even earlier post demonstrating that – as of 0.6.x – ZFS did not allocate writes to the lowest-latency vdev.
An update to the Storage Pool Allocator (SPA) has changed the original write behavior; as of 0.7.0 (and Ubuntu Bionic includes 0.7.5), writes really are allocated to the lowest-latency vdev in the pool. To test this, I created a throwaway pool on a system with both rust and SSD devices on board. This isn’t the cleanest test possible – the vdevs are actually sparse files created on, respectively, an SSD mdraid1 and another pool consisting of one rust mirror vdev. It’s good enough for government work, though.
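If you want to reproduce the setup, file-backed vdevs like these are easy to fake up with truncate. This is only a rough sketch – the mountpoints and the 10G size are my assumptions, not the exact values from this test; all that matters is that each sparse file lives on the class of storage it’s meant to represent, and that the session below refers to them as /tmp/rust.bin and /tmp/ssd.bin.

# Sparse backing files for the throwaway pool -- hypothetical mountpoints
# and sizes. Each file inherits the performance of whatever it lives on.
truncate -s 10G /mnt/ssd-md1/ssd.bin      # on the SSD mdraid1
truncate -s 10G /mnt/rust-pool/rust.bin   # on the pool backed by a rust mirror

So – let’s see how small-block random write operations are allocated when you’ve got one rust vdev and one SSD vdev: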
root@demo0:/tmp# zpool create -oashift=12 test /tmp/rust.bin /tmp/ssd.bin
root@demo0:/tmp# zfs set compression=off test
root@demo0:/tmp# fio --name=write --ioengine=sync --rw=randwrite \
                     --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
  WRITE: bw=204MiB/s (214MB/s), 204MiB/s-204MiB/s (214MB/s-214MB/s), io=1024MiB (1074MB), run=5012-5012msec

root@demo0:/tmp# du -h /tmp/ssd.bin ; du -h /tmp/rust.bin
1.8M    /tmp/ssd.bin
237K    /tmp/rust.bin
Couldn’t be much clearer – 204 MiB/sec is higher throughput than a single rust mirror can manage for 16K random writes, and almost 90% of the write operations were committed to the SSD side. So the SPA updates in 0.7.0 work as intended – *when pushed to the limit*, ZFS will now allocate far more of its writes to the fastest vdevs available in the pool.
I italicized that for a reason, of course. When you don’t push ZFS hard with synchronous, small-block writes like we did with fio above, it still allocates writes according to the free space available on each vdev. To demonstrate this, we’ll destroy and recreate our hybrid test pool – and this time, we’ll write a GB or so of random data sequentially and asynchronously, using openssl to rapidly generate pseudo-random data, which we’ll pipe through pv into a file on our pool.
root@demo0:/tmp# zpool create -oashift=12 test /tmp/rust.bin /tmp/ssd.bin
root@demo0:/tmp# zfs set compression=off test
root@demo0:~# openssl enc -aes-256-ctr -pass \
              pass:"$(dd if=/dev/urandom bs=128 \
              count=1 2>/dev/null | base64)" \
              -nosalt < /dev/zero | pv > /test/randomfile.bin
1032MiB 0:00:04 [ 370MiB/s] [ <=> ]
^C

root@demo0:~# du -h /tmp/*bin
571M    /tmp/rust.bin
627M    /tmp/ssd.bin
Although we wrote our pseudorandom data to the pool very rapidly, this time we did so sequentially and asynchronously, rather than synchronously and in small random-access blocks. And this time, our writes were committed near-equally to each vdev, despite one being immensely faster than the other.
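Incidentally, you don’t have to du the backing files to see where the writes went – zpool itself reports per-vdev allocation and activity. A quick sketch (output omitted, since I didn’t capture it for this run):

# Per-vdev capacity: ALLOC should be split near-evenly between
# /tmp/rust.bin and /tmp/ssd.bin after the sequential run above.
zpool list -v test

# Per-vdev bandwidth and IOPS, refreshed every second -- handy to watch
# in a second terminal while a test is actually running.
zpool iostat -v test 1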
Please note that this describes the SPA’s behavior when allocating writes at the pool level – it has nothing at all to do with the behavior of individual vdevs which have both rust and SSD member devices. My recent test of half-rust/half-SSD mirror vdevs was also run on Bionic with ZFS 0.7.5, and demonstrated conclusively that even read behavior inside a vdev doesn’t favor lower-latency devices, let alone write behavior.
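If the distinction isn’t clear, here’s what the two layouts look like at pool creation time – device names entirely hypothetical:

# Pool-level mixing: two single-device vdevs, one rust and one SSD.
# This is the layout tested above; the SPA decides which vdev gets each write.
zpool create mixed /dev/sdb /dev/nvme0n1

# Intra-vdev mixing: one mirror vdev with a rust member and an SSD member.
# Every block is written to both devices, so the vdev can never outrun its
# slowest member, and the SPA's write allocator never comes into play.
zpool create mixed mirror /dev/sdb /dev/nvme0n1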
The new SPA code is great, and it absolutely does improve write performance on IOPS-saturated pools. However, it is not intended to enable the undying dream of mixing rust and SSD storage willy-nilly, and if you try to do so anyway, you’re gonna have a bad time.