Quick and dirty cheat sheet for anyone getting ready to set up a new ZFS pool. Here are all the settings you’ll want to think about, and the values I think you’ll probably want to use.
I am not generally a fan of tuning things unless you need to, but unfortunately a lot of the ZFS defaults aren’t optimal for most workloads.
SLOG and L2ARC are special devices, not parameters… but I included them anyway. Lean into it.
parameter | best* value | why / what does it do?
ashift | 12 | Ashift tells ZFS the physical sector size of the underlying disks. It's a binary exponent, so the sector size is 2^ashift bytes: ashift=9 means 512B sectors (used by all ancient drives), ashift=12 means 4K sectors (used by most modern hard drives), and ashift=13 means 8K sectors (used by some modern SSDs).
If you get this wrong, you want to get it wrong high. Too low an ashift value will cripple your performance. Too high an ashift value won't have much impact on almost any normal workload. Ashift is per vdev, and immutable once set. This means you should set it manually at pool creation, and again any time you add a vdev to an existing pool; and you should never get it wrong, because a single mis-set vdev will screw up your entire pool and cannot be fixed.
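For example, a minimal sketch (the pool and device names here are placeholders): check what sector size your drives report, then set ashift explicitly both at creation time and whenever you add a vdev:
lsblk -o NAME,PHY-SEC,LOG-SEC                          # physical vs logical sector size each drive reports
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb
zpool add -o ashift=12 tank mirror /dev/sdc /dev/sdd   # every new vdev needs its own explicit ashift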
xattr | sa | Sets Linux eXtended ATTRibutes directly in the inodes, rather than as tiny little files in special hidden folders.
This can have a significant positive performance impact on datasets with lots of files in them, particularly if SELinux is in play. Unlikely to make any difference on datasets with very few, extremely large files (eg VM images).
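A minimal sketch (the pool name is a placeholder); children inherit the property, so setting it at the top of the pool is usually enough:
zfs set xattr=sa tank
zfs get xattr tank/somedataset      # child datasets should report 'sa' as inherited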
compression | lz4 | Compression defaults to off, and that’s a losing default value. Even if your data is incompressible, your slack space is (highly) compressible.
LZ4 compression is faster than storage. Yes, really. Even if you have a $50 tinkertoy CPU and a blazing-fast SSD. Yes, really. I've tested it. It's a win. You might consider gzip compression for datasets with highly compressible files; it will achieve a better compression ratio, but likely at lower throughput. YMMV, caveat imperator.
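A minimal sketch (dataset names are placeholders); compression only applies to data written after the property is set:
zfs set compression=lz4 tank
zfs set compression=gzip-9 tank/textdumps     # heavier compression for a highly compressible dataset
zfs get compressratio tank                    # shows how much you're actually saving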
atime | off | If atime is on – which it is by default – your system has to update the “Accessed” attribute of every file every time you look at it. This can easily double the IOPS load on a system all by itself.
Do you care when the last time somebody opened a given file was, or the last time they ls'd a directory? Probably not. Turn this off.
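A one-liner (the pool name is a placeholder); child datasets inherit the value unless overridden:
zfs set atime=off tank
zfs get -r atime tank       # confirm every dataset picked it up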
recordsize | 64K | If you have files that will be read from or written to in random batches regularly, you want to match the recordsize to the size of the reads or writes you’re going to be digging out of / cramming into those large files.
For most database binaries or VM images, 64K is going to be either an exact match to the VM's back end storage cluster size (eg the default cluster_size=64K on QEMU's QCOW2 storage) or at least a better one than the default recordsize, 128K. If you've got a workload that wants even smaller blocks – for example, 16KiB to match MySQL InnoDB or 8KiB to match PostgreSQL back-ends – you should tune both ZFS recordsize and the VM back end storage (where applicable) to match. This can improve the IOPS capability of an array used for db binaries or VM images fourfold or more.
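A minimal sketch (dataset names are placeholders); recordsize only applies to files written after it's set, so set it before loading data:
zfs create -o recordsize=64K tank/vmstore     # matches QCOW2's default 64K cluster size
zfs create -o recordsize=16K tank/mysql       # matches InnoDB's 16K pages
zfs create -o recordsize=8K tank/postgres     # matches PostgreSQL's 8K pages
qemu-img create -f qcow2 -o cluster_size=64k vm0.qcow2 40G    # keeps the VM back end in step (64k is already QCOW2's default)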
recordsize | 1M | Wait, didn’t we just do recordsize…? Well, yes, but different workloads call for different settings if you’re tuning.
If you're only reading and writing in fairly large chunks – for example, a collection of 5-8MB JPEG images from a camera, or 100GB movie files, neither of which will be read or written randomly – you'll want to set recordsize=1M, to reduce the IOPS load on the system by requiring fewer individual records for the same amount of data. This can also increase the compression ratio for compressible data, since each record is compressed individually with its own dictionary. If you're using bittorrent, recordsize=16K results in higher possible bittorrent write performance… but recordsize=1M results in lower overall fragmentation, and much better performance when reading the files you've acquired by torrent later.
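A minimal sketch (dataset names are placeholders):
zfs create -o recordsize=1M tank/media        # big, sequentially-read files: photos, video, ISOs
zfs create -o recordsize=16K tank/downloads   # optional landing zone for in-progress torrents; move finished files to tank/media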
SLOG | maybe | SLOG isn't a setting, it's a special vdev type that moves the ZFS Intent Log (ZIL) off the main pool and onto a separate, faster device. It only affects synchronous writes – asynchronous writes are already aggregated in RAM and flushed to the pool in transaction groups.
SLOG doesn't need to be a large device; it only has to accumulate a few seconds' worth of writes… but if you're using NAND flash, it probably should be a large device, since write endurance is proportional to device size. On systems that need a LOG vdev in the first place, that LOG vdev will generally get an awful lot of writes. Having a LOG vdev means that synchronous writes perform like asynchronous writes; it doesn't really act like a "write cache" in the way new ZFS users tend to hope it will. Great for databases, NFS exports, or anything else that calls sync() a lot. Not too useful for more casual workloads.
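If you do add one, a minimal sketch (pool and device names are placeholders); mirroring the LOG vdev is worth considering, since losing it during a crash can cost you the last few seconds of synchronous writes:
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
zpool status tank        # the new vdev appears under its own 'logs' section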
L2ARC | nope! | L2ARC is a layer of ARC that resides on fast storage rather than in RAM. It sounds amazing – super huge super fast read cache!
Yeah, it's not really like that. For one thing, L2ARC is ephemeral – data in L2ARC doesn't survive reboots. For another thing, it costs a significant amount of RAM to index the L2ARC, which means now you have a smaller ARC due to the need for indexing your L2ARC. Even the very fastest SSD is a couple orders of magnitude slower than RAM. When you have to go to L2ARC to fetch data that would have fit in the ARC if it hadn't been for needing to index the L2ARC, it's a massive lose. Most people won't see any real difference at all after adding L2ARC. A significant number of people will see performance decrease after adding L2ARC. There is such a thing as a workload that benefits from L2ARC… but you don't have it. (Think hundreds of users, each with extremely large, extremely hot datasets.)
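If you decide to test one anyway, a minimal sketch (pool and device names are placeholders); cache vdevs can be added and removed at will, so experimenting is cheap:
zpool add tank cache /dev/nvme0n2
zpool remove tank /dev/nvme0n2     # changed your mind? no harm done
arcstat 5                          # sample ARC statistics every 5 seconds while you test, if the arcstat utility is installed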
* “best” is always debatable. Read reasoning before applying. No warranties offered, explicit or implied.
Just wanted to say thanks: this really puts many of the articles I've read in a nutshell and makes for a nice overview.
I appreciate it.
Hi,
urbackup recommends a cache drive (which I may be misunderstanding as an L2ARC) because of its use of dedup (https://www.urbackup.org/ServerAdminGuide-v2.4.x.pdf, page 34). That doesn't sound correct based on your tuning cheat sheet, and I was wondering what you thought.
Thanks in advance.
Dedupe is one of the few workloads that actually benefits from an L2ARC device. The reason being that you need to keep the Dedupe Table (DDT) in RAM in order to scan it for existing hashes, to update it when a new duplicate block is found, and to add to it when a new unique block is found. By default, the DDT is stored in the pool, so you’re adding a lot of extra read/writes to the pool, using up precious IOps. With an L2ARC, the DDT is loaded into the L2ARC, so any reads/writes go to that separate device instead (with updates written out to the pool in batches with the normal transaction group writes).
However, unless you absolutely *know* for a fact that you will have a *lot* of duplicate data (like 5x copies of everything), then you really, really, really don’t want to enable it. It’s the only feature of ZFS that *requires* gobs of RAM, and will absolutely kill the performance of your pool without said gobs of RAM (an L2ARC helps, but doesn’t remove the need for gobs of RAM). If you don’t have gobs of RAM, deleting files and snapshots from the pool will crash your system.
We enabled dedupe when we started with our ZFS-based backup servers, back when the "sweet spot" for buying hard drives was 500 GB and having 32 GB of RAM in a server was *expensive*. But we were backing up 50 nearly identical school servers, 50 nearly identical wireless servers, 50 nearly identical FreeBSD firewalls, and 100-odd other servers, so we thought it would be a win. We never saw a dedupe ratio over 3, and even combined with compression, the ratio never went over 4. Still impressive how much storage we saved, but it was a royal pain managing the pool due to all the RAM issues (started with 16 GB, then 24, then 32, finally maxed it out at 64) and teething pains with ZFS (we started with ZFSv6). Each new storage server we bring online, we make sure *NOT* to enable dedupe. Buying bigger drives brings way more benefits than using dedupe.
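If you're weighing dedup anyway, a minimal sketch of how to check what it's actually buying you (pool and dataset names are placeholders):
zdb -S tank                # simulate the dedup ratio you'd get on existing data, before committing to it
zfs set dedup=on tank/backups
zpool list tank            # the DEDUP column reports the pool-wide dedup ratio
zpool status -D tank       # DDT histogram, which also hints at how much RAM the table wants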
Starting with ZFS 2.0 there will be support for persistent L2ARC (https://github.com/openzfs/zfs/pull/9582). Would you nevertheless still recommend against L2ARC in general?
Since OpenZFS 2.0 we do have ZSTD as a compression option too.
https://github.com/openzfs/zfs/pull/10278
I definitely think it's worth a revisit.
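A minimal sketch, assuming OpenZFS 2.0 or newer (dataset names are placeholders):
zfs set compression=zstd tank/documents      # default zstd level (3): better ratios than lz4 at modest CPU cost
zfs set compression=zstd-9 tank/archives     # higher levels trade throughput for ratio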
Does the new “persistent L2ARC” feature change your mind about L2ARC?
I don't think L2ARC is often the kind of silver bullet users hope it will be, even with persistence, but persistence does make it considerably more attractive, without a doubt.
I would think single-user video editing is the ideal use case for a large L2ARC: keeping the 50-100 exceedingly large source media files (or even just proxies) hot in the L2, with a relatively small overhead to index them in RAM. You should theoretically be able to build a thrifty 32 GB RAM system with 4-6 rust disks and a snappy NVMe for your L2, and keep your 10G Ethernet connection more or less saturated the whole time you are editing, after the first few reads of your media.
Just don't store the project database files on there… they are small, keep them local. Or put them in a separate dataset optimized for that.
Excellent guide, I’ve come back to this in my bookmarks many times now. Do you have a best practice for volblocksize, perhaps in regards to proxmox?
Sorry, zfs beginner here. So for a regular homelab hosting multiple personal file types (documents, jpg photos, movies) on spinning HDDs, would these tuning tips still apply?
Creating a pool:
zpool create -o ashift=12 tank mirror sdc sdd
These can be set at any time, but they don't affect existing files, so they ought to be set immediately:
zfs set atime=off tank
zfs set compression=lz4 tank
zfs set xattr=sa tank
Adding to the pool:
zpool add -o ashift=12 tank mirror sde sdf
I find persistent L2ARC very useful.
Personally, my advice would be: if you're running a serious setup, devise a way to test performance both with and without a cache device, and just try it using some spare SSDs. NEVER buy high-end SSDs specifically for L2ARC until you KNOW you need it; even if your test SSDs aren't the fastest, they should be enough to get an idea. Tests need to be either representative or come from live use after a reasonable period (several days), measuring some mixture of average speed, IOPS, or hit/miss rate.
For casual use you probably don't need an L2ARC, but I'm getting good mileage out of a smaller persistent L2ARC (16 GB) with all of my datasets set to secondarycache=metadata (and only some set to primarycache=all); this ensures that my metadata is either in RAM or on the SSD, so searching folders etc. is always nice and fast no matter what has been churning through the primary ARC recently. Arguably I'd be better served with a special vdev, but I couldn't configure one initially, and can't add enough redundancy at the moment, so a metadata-only L2ARC has been a great way to speed up searching with no extra hardware (I'm using a partition from an internal system disk that's mostly used for booting and not much else). Works well enough for me.
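For anyone wanting to replicate that kind of setup, a minimal sketch (the pool, dataset, and partition names are placeholders for whatever you actually have):
zpool add tank cache /dev/disk/by-id/ata-SYSTEMDISK-part4    # spare partition pressed into service as L2ARC
zfs set secondarycache=metadata tank                         # pool-wide default: only metadata is eligible for L2ARC
zfs set primarycache=all tank/projects                       # datasets whose data should also stay in the RAM ARC
zfs set primarycache=metadata tank/media                     # bulk datasets: keep only their metadata in ARC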