In an earlier post, I demonstrated that ZFS distributes writes evenly across vdevs according to FREE space per vdev (not based on latency or anything else: just FREE).
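If you want to see the numbers ZFS is working from on your own pool, zpool list -v and zpool iostat -v both break allocation and free space out per vdev. A quick sketch (the pool name tank is just a placeholder; substitute your own):

# per-vdev SIZE, ALLOC, and FREE for the pool "tank"
zpool list -v tank

# or watch per-vdev write distribution live, refreshing every 5 seconds
zpool iostat -v tank 5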
There are three ways I know of that you can end up with an imbalanced distribution of data across your vdevs. The first two are dead obvious; the third took a little head-scratching and empirical testing before I was certain of it.
Different-sized vdevs
If you use vdevs of different sizes in the first place, you’ll end up with more data on the larger vdevs than on the smaller ones.
This one’s a no-brainer: we know that ZFS will distribute writes according to the amount of FREE on each vdev, so if you create a pool with one 1T vdev and one 2T vdev, twice as many writes will go to the 2T vdev as to the 1T vdev; natch.
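If you want to see this for yourself without burning real disks, a quick sketch using sparse file-backed vdevs works fine (the pool name, paths, and sizes here are made up for illustration; they aren’t from my original test):

# two sparse files of different sizes to serve as vdevs
truncate -s 1T /tmp/vdev-small.raw
truncate -s 2T /tmp/vdev-large.raw

# build a pool from them and write a few gigabytes of data
zpool create sizetest /tmp/vdev-small.raw /tmp/vdev-large.raw
dd if=/dev/urandom of=/sizetest/testdata.bin bs=1M count=2048

# ALLOC on the 2T vdev should come out roughly double the 1T vdev's
zpool list -v sizetest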
Vdevs ADDed after data was already written to the pool
If you zpool add one or more vdevs to an existing pool that already has data on it, ZFS isn’t going to redistribute the writes you already made to the older vdevs.
For example, let’s say you create a pool with a single 2T vdev, write 1T of data to it, then add another 2T vdev. You’ve got 1T FREE on one vdev and 2T FREE on the other; ZFS will now write two records to the new vdev for every one record it writes to the old one. This means that while your writes will remain imbalanced for the rest of the pool’s life, each vdev will become full at about the same time.
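Here’s the same scenario sketched out with hypothetical file-backed vdevs (again, the names and sizes are made up; the point is simply that zpool add never touches data that’s already on the older vdevs):

# hypothetical file-backed vdevs, as in the sketch above
truncate -s 2T /tmp/vdev-a.raw /tmp/vdev-b.raw

# the pool starts life with a single vdev, and data gets written to it
zpool create growtest /tmp/vdev-a.raw
dd if=/dev/urandom of=/growtest/olddata.bin bs=1M count=4096

# a second vdev is added later; the existing data is NOT redistributed
zpool add growtest /tmp/vdev-b.raw

# vdev-b shows essentially zero ALLOC until new writes arrive
zpool list -v growtest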
You might ask: why not bias writes to the new vdevs even more heavily, so that they achieve balance before the pool’s full? The answer is consistency. If you distribute two writes to a 2T FREE vdev for every one write to a 1T FREE vdev, you have a consistent write performance profile for the remainder of the life of the pool, rather than a really bad performance profile either now (if you bias all the writes to the vdev with more FREE) or at the end of the pool’s life (if you deliver writes evenly until one vdev is entirely full, then have no choice but to send all writes to the one vdev that still has FREE space remaining).
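To put rough numbers on that: under the simple model described here, each vdev’s share of new writes is just its FREE divided by the pool’s total FREE. A quick back-of-the-envelope check in plain shell, using the 1T-free / 2T-free example above:

# FREE per vdev from the example above, expressed in GiB
free_old=1024   # old vdev: 1T FREE
free_new=2048   # new vdev: 2T FREE
total=$((free_old + free_new))

echo "old vdev gets ~$((100 * free_old / total))% of new writes"   # ~33%
echo "new vdev gets ~$((100 * free_new / total))% of new writes"   # ~66%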
Balanced writes, imbalanced deletes
OK, this is the fun one. Let’s say you create a pool with two equally-sized vdevs, and a year later you look at it and you’ve got imbalanced writes. What gives?
Well, this is going to be more likely the larger your recordsize is, since as far as I can tell each record is written to a single vdev (not split across the pool as a whole in ashift-sized blocks). Basically, although ZFS wrote your data balanced across your equally-sized vdevs, you deleted more records from one vdev than another.
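If you want to check the one-record-one-vdev behavior on a real pool, zdb can dump a file’s block pointers; each DVA in the output is written as vdev:offset:asize, so the first field tells you which vdev a given record landed on. A rough sketch (on ZFS the object number matches the inode number ls -i reports; the zdb invocation here is from memory, so treat it as a starting point rather than gospel):

# find the object number of a file (inode number == object number on ZFS)
ls -i /alloctest/1.bin

# dump that object's block pointers and look at the DVA fields
# (<object-number> is whatever the ls -i above printed)
zdb -ddddd alloctest <object-number>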
To demonstrate this effect (and give myself a sanity check!), I created a pool with two equally-sized 500GB vdevs, set recordsize=1M, and wrote a ton of 900K files to the pool.
root@banshee:~# zpool create -oashift=13 alloctest /rust/alloctest/disk1.raw /ssd/alloctest/disk2.raw
root@banshee:~# zfs set recordsize=1M alloctest
root@banshee:~# for i in {1..3636}; do cp /tmp/900K.bin /alloctest/$i.bin; done
root@banshee:~# zpool iostat -v alloctest
                               capacity     operations    bandwidth
pool                          alloc   free   read  write   read  write
----------------------------  -----  -----  -----  -----  -----  -----
alloctest                     3.14G   989G      0     45  4.07K  14.2M
  /rust/alloctest/disk1.raw   1.57G   494G      0     22  2.04K  7.10M
  /ssd/alloctest/disk2.raw    1.57G   494G      0     22  2.04K  7.09M
----------------------------  -----  -----  -----  -----  -----  -----
As expected, these files are balanced equally across each vdev in the pool… even though one of the vdevs is much, much faster than the other, since they had the same FREE space available.
Now, we write a tiny bit of Perl to delete only the even-numbered files from alloctest…
#!/usr/bin/perl

opendir (my $dh, "/alloctest") || die "Can't open directory: $!";

while (readdir $dh) {
    my $file = $_;
    # only consider the .bin files we copied in; skip . and .. and anything else
    next unless $file =~ s/\.bin$//;
    if ($file/2 == int($file/2)) {
        # this is an even-numbered file - delete it
        unlink "/alloctest/$file.bin";
    }
}

closedir $dh;
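(For what it’s worth, a shell one-liner would have done the same job; this is just an equivalent alternative, not what was actually run:)

for i in $(seq 2 2 3636); do rm /alloctest/$i.bin; done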
Now we run our little bit of Perl, delete the even-numbered files only, and see if we’re left with imbalanced data:
root@banshee:~# perl ~/deleteevens.pl
root@banshee:~# zpool iostat -v alloctest
                               capacity     operations    bandwidth
pool                          alloc   free   read  write   read  write
----------------------------  -----  -----  -----  -----  -----  -----
alloctest                     1.57G   990G      0     24  2.13K  7.44M
  /rust/alloctest/disk1.raw   12.3M   496G      0     12  1.07K  3.72M
  /ssd/alloctest/disk2.raw    1.56G   494G      0     12  1.07K  3.72M
----------------------------  -----  -----  -----  -----  -----  -----
Bingo! 12.3M ALLOCed on disk1, and 1.56G ALLOCed on disk2 – it took some careful planning, but we now have imbalanced data on a pool with equally-sized vdevs that have been present since the pool’s creation.
However, it’s not imbalanced because ZFS wrote it that way; it’s imbalanced because we deleted it that way. By deleting all the even-numbered files, we got rid of the files on /rust/alloctest/disk1.raw while leaving all the files (actually, all the records) on /ssd/alloctest/disk2.raw intact. And since ZFS allocates writes according to FREE per vdev, we know that our data will slowly creep back into balance, as ZFS favors the vdev with a higher FREE count on new writes.
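If you wanted to watch that happen on the test pool above, you could keep feeding it files and re-check the allocation; a hypothetical continuation of the demo (the file numbers are made up):

# write another batch of 900K files and see where they land
for i in {4001..4800}; do cp /tmp/900K.bin /alloctest/$i.bin; done
zpool iostat -v alloctest

# disk1 gets a slightly larger share of the new writes (in proportion to its
# extra FREE), so the ALLOC gap closes - but only gradually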
In practice, most people shouldn’t see a really large imbalance like this in normal usage, even with a large recordsize. I had to gimmick this scenario up pretty specifically, saving files right at the desired recordsize and then deleting them in a pattern chosen to produce exactly the results I was looking for; organic deletions should be very unlikely to create a large imbalance.