Quick and dirty cheat sheet for anyone getting ready to set up a new ZFS pool. Here are all the settings you’ll want to think about, and the values I think you’ll probably want to use.
I am not generally a fan of tuning things unless you need to, but unfortunately a lot of the ZFS defaults aren’t optimal for most workloads.
SLOG and L2ARC are special devices, not parameters… but I included them anyway. Lean into it.
|why / what does it do?
|Ashift tells ZFS the underlying physical sector size of your disks. It’s expressed as a power of two, so ashift=9 means 512B sectors (2^9, used by all ancient drives), ashift=12 means 4K sectors (used by most modern hard drives), and ashift=13 means 8K sectors (used by some modern SSDs).
If you get this wrong, you want to get it wrong high. Too low an ashift value will cripple your performance. Too high an ashift value won’t have much impact on almost any normal workload.
Ashift is per vdev, and immutable once set. This means you should set it manually at pool creation, and again any time you add a vdev to an existing pool. Don’t get it wrong: if you do, it will screw up your entire pool and cannot be fixed.
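If the exponent business is hard to keep straight, here is a minimal Python sketch of the arithmetic. The function names are made up for illustration, and the pool name and device paths in the closing comment are placeholders, not a recommendation.

```python
def sector_size(ashift: int) -> int:
    """Sector size in bytes for a given ashift (a power-of-two exponent)."""
    return 1 << ashift


def ashift_for(sector_bytes: int) -> int:
    """Smallest ashift whose sector size is at least sector_bytes."""
    if sector_bytes < 1:
        raise ValueError("sector size must be positive")
    return (sector_bytes - 1).bit_length()


for a in (9, 12, 13):
    print(f"ashift={a} -> {sector_size(a)} bytes")

# Setting it explicitly at pool creation looks like this
# (pool name and devices are placeholders):
#   zpool create -o ashift=12 tank mirror /dev/sdX /dev/sdY
```

Note that `ashift_for` rounds up when the size isn't an exact power of two, which matches the "get it wrong high" advice.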
| Sets Linux eXtended ATTRibutes directly in the inodes, rather than as tiny little files in special hidden folders.
This can have a significant performance impact on datasets with lots of files in them, particularly if SELinux is in play. Unlikely to make any difference on datasets with very few, extremely large files (eg VM images).
| Compression defaults to off, and that’s a losing default value. Even if your data is incompressible, your slack space is (highly) compressible.
LZ4 compression is faster than storage. Yes, really. Even if you have a $50 tinkertoy CPU and a blazing-fast SSD. Yes, really. I’ve tested it. It’s a win.
You might consider gzip compression for datasets with highly compressible files. It will get a better compression ratio, but likely at lower throughput. YMMV, caveat imperator.
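The slack space claim is easy to convince yourself of. This is only a rough sketch: it uses Python's stdlib zlib as a stand-in because LZ4 isn't in the standard library, and it models a partly filled record as random bytes plus zero padding, which is my simplification rather than ZFS's actual code path.

```python
import os
import zlib

RECORD_SIZE = 128 * 1024   # ZFS default recordsize
PAYLOAD_SIZE = 100 * 1024  # incompressible data that doesn't fill the record

# Random bytes stand in for already-compressed data (JPEG, video, etc.).
payload = os.urandom(PAYLOAD_SIZE)
# Zero padding stands in for the slack at the end of a partly filled record.
record = payload + bytes(RECORD_SIZE - PAYLOAD_SIZE)

compressed = zlib.compress(record)

print(f"record with no compression: {len(record)} bytes")
print(f"record after compression:   {len(compressed)} bytes")
```

The payload itself doesn't shrink, but the padding all but disappears, so the compressed record ends up close to the size of the real data.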
| If atime is on – which it is by default – your system has to update the “Accessed” attribute of every file every time you look at it. This can easily double the IOPS load on a system all by itself.
Do you care when the last time somebody opened a given file was, or the last time they ls’d a directory? Probably not. Turn this off.
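For the three dataset properties covered so far, here is a small Python sketch that builds the matching `zfs set` commands. The dataset name `tank/data` and the helper name are placeholders; nothing is executed, it only prints what you would run.

```python
# Properties discussed so far, with the values suggested above.
SUGGESTED_PROPERTIES = {
    "xattr": "sa",
    "compression": "lz4",
    "atime": "off",
}


def zfs_set_commands(dataset: str, properties: dict[str, str]) -> list[list[str]]:
    """Build one `zfs set property=value dataset` argv list per property."""
    return [["zfs", "set", f"{name}={value}", dataset]
            for name, value in properties.items()]


# "tank/data" is a placeholder dataset name.
for argv in zfs_set_commands("tank/data", SUGGESTED_PROPERTIES):
    print(" ".join(argv))
```

Each argv list could be handed to `subprocess.run` as root if you'd rather apply the settings than print them.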
| If you have large files that will be read from or written to in random batches regularly, you want to match the recordsize to the size of the reads or writes you’re going to be digging out of / cramming into those files.
For most database binaries or VM images, 64K is going to be either an exact match to the VM’s back end storage cluster size (eg the default cluster_size=64K on QEMU’s QCOW2 storage) or at least a better one than the default recordsize, 128K.
If you’ve got a workload that wants even smaller blocks—for example, 16KiB to match MySQL InnoDB or 8KiB to match PostgreSQL back-ends—you should tune both ZFS recordsize and the VM back end storage (where applicable) to match.
This can improve the IOPS capability of an array used for db binaries or VM images fourfold or more.
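To see why matching matters, here is a back-of-the-envelope Python sketch of how many bytes get touched per byte requested. It is a deliberately simplified model (record-aligned I/O, no caching, no compression), and the function name is my own.

```python
import math

KIB = 1024


def io_amplification(recordsize: int, io_size: int) -> float:
    """Bytes touched per byte requested, for a record-aligned I/O."""
    records_touched = math.ceil(io_size / recordsize)
    return records_touched * recordsize / io_size


for recordsize in (128 * KIB, 64 * KIB, 16 * KIB):
    factor = io_amplification(recordsize, 16 * KIB)
    print(f"recordsize={recordsize // KIB}K, 16K I/O -> {factor:.0f}x")
```

In this model a 16K database page on the default 128K recordsize drags a whole 128K record along with it, and the overhead vanishes once recordsize matches the I/O size.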
| Wait, didn’t we just do recordsize…? Well, yes, but different workloads call for different settings if you’re tuning.
If you’re only reading and writing in fairly large chunks – for example, a collection of 5-8MB JPEG images from a camera, or 100GB movie files, either of which will not be read or written random access – you’ll want to set recordsize=1M, to reduce the IOPS load on the system by requiring fewer individual records for the same amount of data. This can also increase compression ratio, for compressible data, since each record uses its own individual compression dictionary.
If you’re using bittorrent, recordsize=16K results in higher possible bittorrent write performance… but recordsize=1M results in lower overall fragmentation, and much better performance when reading the files you’ve acquired by torrent later.
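The "fewer individual records" point is just division. A quick Python sketch, using an 8MB photo as an illustrative file size and a made-up helper name:

```python
KIB = 1024
MIB = 1024 * KIB


def records_needed(file_size: int, recordsize: int) -> int:
    """Number of records needed to hold file_size bytes (rounding up)."""
    return -(-file_size // recordsize)


photo = 8 * MIB  # upper end of the JPEG sizes mentioned above
for recordsize in (128 * KIB, 1 * MIB):
    count = records_needed(photo, recordsize)
    print(f"recordsize={recordsize // KIB}K -> {count} records")
```

Same file, an eighth as many records to track, read, and checksum at recordsize=1M.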
| SLOG isn’t a setting, it’s a special vdev type that acts as a write aggregation layer for the entire pool. It only affects synchronous writes – asynchronous writes are already aggregated in RAM before being committed to the pool.
SLOG doesn’t need to be a large device; it only has to accumulate a few seconds’ worth of writes… but if you’re using NAND flash, it probably should be a large device, since write endurance is proportional to device size. On systems that need a LOG vdev in the first place, that LOG vdev will generally get an awful lot of writes.
Having a LOG vdev means that synchronous writes perform like asynchronous writes; it doesn’t really act like a “write cache” in the way new ZFS users tend to hope it will.
Great for databases, NFS exports, or anything else that calls sync() a lot. Not too useful for more casual workloads.
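To put a number on "a few seconds’ worth of writes", here is a rough Python sizing sketch. The five second window and the 10 Gbit/s link are assumptions picked for illustration, not measured values, and the function name is hypothetical.

```python
def slog_bytes_needed(sync_write_bytes_per_sec: int, seconds_buffered: int = 5) -> int:
    """Rough upper bound on the data a LOG vdev holds at any one time."""
    return sync_write_bytes_per_sec * seconds_buffered


# Hypothetical worst case: a saturated 10 Gbit/s link, about 1.25 GB/s.
ten_gbit = 1_250_000_000
print(f"{slog_bytes_needed(ten_gbit) / 1e9:.2f} GB")
```

Even under that generous assumption the capacity you need is tiny compared to any current SSD, which is why write endurance, not size, ends up driving the purchase.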
| L2ARC is a layer of ARC that resides on fast storage rather than in RAM. It sounds amazing – super huge super fast read cache!
Yeah, it’s not really like that. For one thing, L2ARC has traditionally been ephemeral – before OpenZFS 2.0 added persistent L2ARC, data in L2ARC didn’t survive reboots. For another thing, it costs a significant amount of RAM to index the L2ARC, which means now you have a smaller ARC due to the need for indexing your L2ARC.
Even the very fastest SSD is a couple orders of magnitude slower than RAM. When you have to go to L2ARC to fetch data that would have fit in the ARC if it hadn’t been for needing to index the L2ARC, it’s a massive lose.
Most people won’t see any real difference at all after adding L2ARC. A significant number of people will see performance decrease after adding L2ARC. There is such a thing as a workload that benefits from L2ARC… but you don’t have it. (Think hundreds of users, each with extremely large, extremely hot datasets.)
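The indexing cost can be roughed out in Python as follows. Every input here is illustrative, and `header_bytes` in particular is a placeholder: the real per-record header size depends on your OpenZFS version, so look it up before trusting the result.

```python
KIB = 1024
GIB = 1024 ** 3


def l2arc_index_bytes(l2arc_size: int, avg_record_size: int, header_bytes: int) -> int:
    """RAM consumed by headers for a completely full L2ARC device."""
    records_cached = l2arc_size // avg_record_size
    return records_cached * header_bytes


# All three inputs are illustrative; header_bytes is a placeholder value.
ram = l2arc_index_bytes(
    l2arc_size=1024 * GIB,
    avg_record_size=16 * KIB,
    header_bytes=100,
)
print(f"{ram / GIB:.2f} GiB of RAM spent indexing L2ARC")
```

The thing to notice is the shape of the formula: small records on a big cache device multiply out to a lot of headers, and every byte of them comes out of your ARC.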
* “best” is always debatable. Read reasoning before applying. No warranties offered, explicit or implied.