ZFS does NOT favor lower latency devices. Don’t mix rust disks and SSDs!

In an earlier post, I addressed the never-ending urban legend that ZFS writes data to the lowest-latency vdev. Now the urban legend that never dies has reared its head again; this time with someone claiming that ZFS will issue read operations to the lowest-latency disk in a given mirror vdev.

TL;DR – this, too, is a myth. If you need or want an empirical demonstration, read on.

I’ve got an Ubuntu Bionic machine handy with both rust and SSD available; /tmp is an ext4 filesystem on an mdraid1 SSD mirror and /rust is an ext4 filesystem on a single WD 4TB black disk. Let’s play.

root@box:~# truncate -s 4G /tmp/ssd.bin
root@box:~# truncate -s 4G /rust/rust.bin
root@box:~# mkdir /tmp/disks
root@box:~# ln -s /tmp/ssd.bin /tmp/disks/ssd.bin ; ln -s /rust/rust.bin /tmp/disks/rust.bin
root@box:~# zpool create -oashift=12 test /tmp/disks/rust.bin
root@box:~# zfs set compression=off test

Now we’ve got a pool that is rust only… but we’ve got an ssd vdev off to the side, ready to attach. Let’s run an fio test on our rust-only pool first. Note: since this is read testing, we’re going to throw away our first result set; they’ll largely be served from ARC and that’s not what we’re trying to do here.

root@box:~# cd /test
root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1

OK, cool. Now that fio has generated its dataset, we’ll clear all caches by exporting the pool, then clearing the kernel page cache, then importing the pool again.

root@box:/test# cd ~
root@box:~# zpool export test
root@box:~# echo 3 > /proc/sys/vm/drop_caches
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Now we can get our first real, uncached read from our rust-only pool. It’s not terribly pretty; this is going to take 5 minutes or so.

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[ ... ]
Run status group 0 (all jobs):
  READ: bw=17.6MiB/s (18.5MB/s), 17.6MiB/s-17.6MiB/s (18.5MB/s-18.5MB/s), io=1024MiB (1074MB), run=58029-58029msec

Alright. Now let’s attach our ssd and make this a mirror vdev, with one rust and one SSD disk.

root@box:/test# zpool attach test /tmp/disks/rust.bin /tmp/disks/ssd.bin
root@box:/test# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 1.00G in 0h0m with 0 errors on Sat Jul 14 14:34:07 2018
config:

    NAME                     STATE     READ WRITE CKSUM
    test                     ONLINE       0     0     0
      mirror-0               ONLINE       0     0     0
        /tmp/disks/rust.bin  ONLINE       0     0     0
        /tmp/disks/ssd.bin   ONLINE       0     0     0

errors: No known data errors

Cool. Now that we have one rust and one SSD device in a mirror vdev, let’s export the pool, drop all the kernel page cache, and reimport the pool again.

root@box:/test# cd ~
root@box:~# zpool export test
root@box:~# echo 3 > /proc/sys/vm/drop_caches
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Gravy. Now, do we see massively improved throughput when we run the same fio test? If ZFS favors the SSD, we should see enormously improved results. If ZFS does not favor the SSD, we’ll not-quite-doubled results.

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
   READ: bw=31.1MiB/s (32.6MB/s), 31.1MiB/s-31.1MiB/s (32.6MB/s-32.6MB/s), io=1024MiB (1074MB), run=32977-32977msec

Welp. There you have it. Not-quite-doubled throughput, matching half – but only half – of the read ops coming from the SSD. To confirm, we’ll do this one more time; but this time we’ll detach the rust disk and run fio with nothing in the pool but the SSD.

root@box:/test# cd ~
root@box:~# zpool detach test /tmp/disks/rust.bin
root@box:~# zpool export test
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Moment of truth… this time, fio runs on pure solid state:

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
  READ: bw=153MiB/s (160MB/s), 153MiB/s-153MiB/s (160MB/s-160MB/s), io=1024MiB (1074MB), run=6710-6710msec

Welp, there you have it.

Rust only: reads 18.5 MB/sec
SSD only: reads 160 MB/sec
Rust + SSD: reads 32.6 MB/sec

No, ZFS does not read from the lowest-latency disk in a mirror vdev.

Please don’t perpetuate the myth that ZFS favors lower latency devices.

PSA: new SATA power standard / HGST 10TB drives

PSA to anyone who bought a new 10T or 12T drive and can’t figure out why the damn thing won’t power on: the SATA power standard changed. The 3.3v rail is now used to command a new-spec drive to spin down – which means that an old-style SATA power supply will never allow one of the newer spec drives to spin up.

I discovered this the hard way with two new HGST 10TB NAS drives this afternoon. I wondered why such shiny big drives shipped with molex->SATA power adapters… and now I know.

Fortunately, you don’t have to use those crappy molex->SATA power adapters to get the drives working; the fix is just to pull the 3.3V rail out of the SATA adapter coming off your PSU that you want to power the newer drive with. This should typically be the orange wire; it’s the one on the “dogleg down” side of the adapter:

To get newer drives to spin up on older SATA PSUs, remove the 3.3V rail from the plug. It’s the wire on the “dogleg down” side of the SATA power plug, and is typically orange in color.

From what I’ve read online, no production hard drive prior to the SATA standard change actually used that 3.3V rail for anything, so it should also be safe to power older drives (and backplanes) with the 3.3V rail forcibly removed. I can confirm that my HGST 10TB NAS drives worked after removing the orange rail as shown; and the WD 2TB Black drives that they are replacing also worked fine without the 3.3V rail; I successfully booted the system on one of them after removing the 3.3V as shown, with no apparent problems whatsoever.

I am expressly providing this information with NO WARRANTY; if your drives or backplane stops working / your cat gets pregnant / a republican congress is elected after you remove the 3.3V rail from a SATA adapter, that’s your problem not mine. With that said, this worked great for me, saved me from having to use one of those crappy little firetrap molex adapters, and does not seem to cause any issues whatsoever with either newer or older drives.

Wifi Acronym/Protocol Cheat Sheet

I can never find all this stuff in easy human-readable form in one place and have trouble remembering some of it, so here’s a cheat sheet for myself (and for you!)

AC Speed Ratings:

They’re basically complete snake oil and cannot be trusted to mean anything concrete. The only really meaningful basic hardware designator looks like “3×3:2”, which actually means “three input antennas, three output antennas, and two simultaneous MIMO streams.” The relevant part of that is the two MIMO streams. A laptop that supports two MIMO streams can get roughly double the throughput from a router or AP that also supports two MIMO streams than it could from a router or AP which only supported one.

Very, very few client devices (laptops, phones, tablets, etc) support more than two MIMO streams. But a rare handful can support three – the most common being recent-model Macbook Pro laptops. If a router supports more MIMO streams than any of the clients connected to it, it does nobody any good at all, though. (MU-MIMO changes that, slightly, but almost no client devices support MU-MIMO, either. Welcome to wifi.)

Unfortunately, without hitting a specialty site like wikidevi, you’re going to find it really, really difficult to find anything but AC speed ratings, so here’s a list of what each of them probably means. Assuming you’re talking about a single router or access point – if you’re looking at the “rating” on a box of wifi mesh nodes, you’re going to need a couple of hours and all the algebra you ever learned to try to reverse engineer something meaningful out of it!

  • AC1200 or AC1350 probably means a 2×2:2 dual-band device.
  • AC1750, AC1900, or AC2300 probably means a 3×3:3 dual-band device.
  • AC2600 probably means a 4×4:4 dual-band device.
  • AC3200 or higher probably means a tri-band device, with two 5 GHz radios as well as a 2.4 GHz radio, and god help you if you need to know the specifics of the MIMO streams beyond that.

You may see the MIMO ratings just listed as “2×2” or “3×3” instead of the full “2×2:2” or “3×3:3”; you can generally assume this will mean the same number of MIMO streams as antennae. Probably. But if you really want to know for sure, go look the device up at wikidevi.

Terms/Acronyms:

  • AP – Access Point. This is wifi infrastructure – a router or access point which offers network access to clients.
  • STA – Station. This is nerd shorthand for “client device”; a device that connects to APs in order to have access to the network.
  • SSID – Service Set IDentifier. Normal humans call this a “wifi network name”. What you see on the list of wifi networks to connect to.
  • BSSID – Basic Service Set IDentifier. This is the hardware address of the wifi chipset in an AP or STA; wired network nerds will be also familiar with this as the “MAC address”.
  • MAC Address – this is a string of text which uniquely identifies a particular network interface to other network interfaces on the network. It’s the fundamental network identity – IP addresses will get you to the right network domain, but from there you need a translation table (ARP) to tell you which MAC address owns which IP addresses. When speaking of Wifi, MAC address is synonymous with BSSID.
  • ARP – Address Resolution Protocol. ARP is not unique to wifi; much like MAC addresses, wired networking uses it too. ARP is the protocol which allows machines on the local network to convert IP addresses to MAC addresses (which are how the packets ultimately get to the right local-network destination).
  • NIC – Network Interface Card. Used to refer specifically to the network chipset doing the communicating; a STA or AP may have multiple NICs. Each NIC has its own MAC/BSSID.

Protocols:

802.11k – RF-based roaming report

802.11k and 802.11v are protocols which facilitate BSS (Basic Service Set) transitions. Normal humans tend to call this “roaming.” K, specifically, is how an AP offers a STA information about the network, so that the STA can choose a reasonable AP to roam to.

1. AP determines that STA is moving away from it
2. AP informs STA to prepare for roaming
3. STA requests list of nearby access points
4. AP gives site report
5. STA moves to best AP based on report

Both AP and STA must support 802.11k for it to be of use. Without K, roaming takes longer (since the STA must switch bands “sniffing” the air for new APs), and is more likely to send the STA to a suboptimal AP.

If you need more info, rabbit hole begins here: https://en.wikipedia.org/wiki/IEEE_802.11k-2008

802.11r – Fast BSS transition

802.11r is only relevant to networks using EAP (Extensible Authentication Protocol), an enterprise-typical technology which allows each individual STA on the same SSID to use different passwords, and thus separate encryption keys. 802.11r does not apply to PSK networks, eg WPA/WPA2 “personal”.

Without 802.11r, a roaming event is much slower on an EAP network than on a Pre-Shared-Key style network, because the STA must first complete the full roaming process it would on the PSK network – then it must renegotiate the crypto side of things all over again with the new AP.

With 802.11r enabled (and supported on both STA and AP), part of the authentication and encryption keys may be cached for a certain amount of time, speeding up handoffs from AP to AP on an EAP network.

The details get a little hairy if you’re not super up on both the crypto and the nitty-gritty of the protocol; rabbit hole begins here: https://en.wikipedia.org/wiki/IEEE_802.11r-2008

802.11s – Mesh infrastructure protocol

802.11s is a mesh networking extension. It’s how most, if not all, Wifi Mesh networking kits handle communication between APs. Key features include:

1. SAE – Simultaneous Authentication between Equals. The idea here is that the various nodes of the mesh network can recognize one another without dependence on a central, authoritative controller.
2. broadcast/multicast and unicast delivery – in a normal network, if you hit the broadcast address a packet is relayed out to each STA. This becomes more difficult in a mesh network as not every STA is connected to a single infrastructure node; 802.11s facilitates the delivery of these *cast packets to all the STAs on the network.

802.11s is for APs only – normal STAs do not need to support and do not know anything about 802.11s, even if they’re connected to a “mesh” Wifi network.

Rabbit hole starts here: https://en.wikipedia.org/wiki/IEEE_802.11s

802.11v – Load-based roaming report

802.11v assists roaming based on AP load conditions. 802.11v BSS-TM management frames include a list of APs, and a report of their current loads. Providing this information to a STA reduces the scan time necessary, and allows for more graceful, steered roaming.

An 802.11v-enabled STA may request an 802.11v BSS-TM management frame from an AP, or an AP may send an unsolicited BSS-TM frame to the STA (indicating to the STA that a more preferred AP is available).

Similarly to 802.11k, the AP doesn’t unconditionally command the STA to roam to a specific AP, and the STA does not unconditionally obey. Both STA and AP must support 802.11v for load-based roaming to function.

I haven’t found a really good rabbit hole start for this one, but try here, here, and here.

Demonstrating ZFS zpool write distribution

One of my pet peeves is people talking about zfs “striping” writes across a pool. It doesn’t help any that zfs core developers use this terminology too – but it’s sloppy and not really correct.

ZFS distributes writes among all the vdevs in a pool.  If your vdevs all have the same amount of free space available, this will resemble a simple striping action closely enough.  But if you have different amounts of free space on different vdevs – either due to disks of different sizes, or vdevs which have been added to an existing pool – you’ll get more blocks written to the drives which have more free space available.

This came into contention on Reddit recently, when one senior sysadmin stated that a zpool queues the next write to the disk which responds with the least latency.  This statement did not match with my experience, which is that a zpool binds on the performance of the slowest vdev, period.  So, I tested, by creating a test pool with sparse images of mismatched sizes, stored side-by-side on the same backing SSD (which largely eliminates questions of latency).

root@banshee:/tmp# qemu-img create -f qcow2 512M.qcow2 512M root@banshee:/tmp# qemu-img create -f qcow2 2G.qcow2 2G
root@banshee:/tmp# qemu-nbd -c /dev/nbd0 /tmp/512M.qcow2
root@banshee:/tmp# qemu-nbd -c /dev/nbd1 /tmp/2G.qcow2
root@banshee:/tmp# zpool create -oashift=13 test nbd0 nbd1

OK, we’ve now got a 2.5 GB pool, with vdevs of 512M and 2G, and pretty much guaranteed equal latency between the two of them.  What happens when we write some data to it?

root@banshee:/tmp# dd if=/dev/zero bs=4M count=128 status=none | pv -s 512M > /test/512M.zero
 512MiB 0:00:12 [41.4MiB/s] [================================>] 100% 

root@banshee:/tmp# zpool export test
root@banshee:/tmp# ls -lh *qcow2
-rw-r--r-- 1 root root 406M Jul 27 15:25 2G.qcow2
-rw-r--r-- 1 root root 118M Jul 27 15:25 512M.qcow2

There you have it – writes distributed with a ratio of roughly 4:1, matching the mismatched vdev sizes. (I also tested with a 512M image and a 1G image, and got the expected roughly 2:1 ratio afterward.)

OK. What if we put one 512M image on SSD, and one 512M image on much slower rust?  Will the pool distribute more of the writes to the much faster SSD?

root@banshee:/tmp# qemu-img create -f qcow2 /tmp/512M.qcow2 512M
root@banshee:/tmp# qemu-img create -f qcow2 /data/512M.qcow2 512M

root@banshee:/tmp# qemu-nbd -c /dev/nbd0 /tmp/512M.qcow2
root@banshee:/tmp# qemu-nbd -c /dev/nbd1 /data/512M.qcow2

root@banshee:/tmp# zpool create test -oashift=13 nbd0 nbd1
root@banshee:/tmp# dd if=/dev/zero bs=4M count=128 | pv -s 512M > /test/512M.zero 
512MiB 0:00:48 [10.5MiB/s][================================>] 100%
root@banshee:/tmp# zpool export test
root@banshee:/tmp# ls -lh /tmp/512M.qcow2 ; ls -lh /data/512M.qcow2 
-rw-r--r-- 1 root root 266M Jul 27 15:07 /tmp/512M.qcow2 
-rw-r--r-- 1 root root 269M Jul 27 15:07 /data/512M.qcow2

Nope. Once again, zfs distributes the writes according to the amount of free space available – even when this causes performance to bind *severely* on the slowest vdev in the pool.

You should expect to see this happening if you have a vdev with failing hardware, as well – if any one disk is throwing massive latency instead of just returning errors, your entire pool will as well, until the deranged disk has been removed.  You can usually spot this sort of problem using iotop – all of the disks in your pool will have roughly the same throughput in MB/sec (assuming they’ve got equivalent amounts of free space left!), but your problem disk will show a much higher %UTIL than the rest.  Fault that slow disk, and your pool performance returns to normal.

 

Multiple client wifi testing

I’ve started beta testing my new tools for modeling and testing multiple client network usage. The main tool is something I didn’t actually think I’d need to build, which I’ve named netburn. The overall concept is using an HTTP back end server to feed multiple client devices, and I thought I’d be able to just use ApacheBench (ab) for that… but it turned out that ab was missing some crucial features I needed. Ab is designed to test the HTTP server on the back end, whereas my goal is to test the network in the middle – if the server on the back end fails, my tests fail with it.

So, ab doesn’t feature any throttling at all, and that wouldn’t work for me. Netburn, like ab, is a flexible tool, but I have four basic workloads in mind:

  • browsing: a multiple-concurrent-fetch operation that’s extremely bursty and moderately latency-sensitive, but low Mbps over time
  • 4kstream: a consistent, latency-insensitive, serial 25 Mbps download that mustn’t fall below 20 Mbps (the dreaded buffering!)
  • voip: a 1 Mbps, steady/non-bursty, extremely latency-sensitive download
  • download: a completely unthrottled, serialized download of large object(s)

I installed GalliumOS Linux on four Chromebooks, set them up with Linksys WUSB-6300 USB3 802.11ac 2×2 NICs, and got to testing against a reference Archer C7 wifi router. For this first round of very-much-beta testing, the Chromebooks aren’t really properly distributed around the house – the “4kstream” Chromebook is a pretty reasonable 20-ish feet away in the next room, but the other three were just sitting on the workbench right next to the router.

The Archer C7 got default settings overall, with a single SSID for both 5 GHz and 2.4 GHz bands. There was clearly no band-steering in play on the C7, as all four Chromebooks associated with the 5 GHz radio. This lead to some unsurprisingly crappy results for our simultaneous tests:

The C7 clearly doesn't feature any band-steering: all four Chromebooks associated with the 5 GHz radio, with predictably awful results.
The C7 clearly doesn’t feature any band-steering: all four Chromebooks associated with the 5 GHz radio, with predictably awful results.

The latency was godawful for the web browsing workload, the voip was mostly tolerable but failed our 150ms goal significantly in one packet out of every 100, and the 4K stream very definitely buffered a lot. Sad face. While we got a totally respectable 156.8 Mbps overall throughput over the course of this 5 minute test, the actual experience for humans using it would have been quite bad.

Manually splitting the SSIDs and joining the "download" client to the 2.4 GHz radio produced significantly better results. We had some failures to meet latency goals, but overall I'd call this a "mediocre pass".
Manually splitting the SSIDs and joining the “download” client to the 2.4 GHz radio produced significantly better results. We had some failures to meet latency goals, but overall I’d call this a “mediocre pass”.

Splitting the SSIDs manually and forcing the “download” client to associate to the 2.4 GHz radio produced much better results. While we had some latency failures in the bottom 5% of the packets, they weren’t massively over our 500ms goal; this would have been a bit laggy maybe but tolerable. 99% of our VOIP packets met our 150ms latency goal, and even the absolute worst single packet wasn’t much over 200ms.

The interesting takeaways here are first, how important band steering – or manual management of clients to split them between radios – is, and second, that higher overall throughput does not correlate that strongly with a better actual experience. The second run produced only 113 Mbps throughput to the first run’s 157 Mbps… but it would have been a much better actual experience for users.

Depressing Storage Calculator

When a Terabyte is not a Terabyte

It seems like a stupid question, if you’re not an IT professional – and maybe even if you are – how much storage does it take to store 1TB of data? Unfortunately, it’s not a stupid question in the vein of “what weighs more, a pound of feathers or a pound of bricks”, and the answer isn’t “one terabyte” either. I’m going to try to break down all the various things that make the answer harder – and unhappier – in easy steps. Not everybody will need all of these things, so I’ll try to lay it out in a reasonably likely order from “affects everybody” to “only affects mission-critical business data with real RTO and RPO defined”.

Counting the Costs

Simple Local Storage

Computer TB vs Manufacturer TB

To your computer, and to all computers since the dawn of computing, a KB is actually a “kibibyte”, a megabyte a “mebibyte”, and so forth – they’re powers of two, not of ten. So 1 KiB = 2^10 = 1024. That’s an extra 24 bytes from a proper Kilobyte, which is 10^3 = 1000. No big deal, right? Well, the difference squares itself with each hop up from KB to MB to GB to TB, and gets that much more significant. Storage manufacturers prefer – and also have, since the dawn of time – to measure in those proper power-of-ten units, since that means they get to put bigger numbers on a device of a given actual size and thus try to trick you into thinking it’s somehow better.

At the Terabyte/Tebibyte level, you’re talking about the difference between 2^40 and 10^12. So 1 TiB, as your computer measures data, is 1.0995 TB as the rat bastards who sell hard drives measure storage. Let’s just go ahead and round that up to a nice easy 1.1.

TL;DR: multiply times 1.1 to account for vendor units.

Working Free Space

Remember those sliding number puzzles you had as a kid, where the digits 1-8 were embedded in a 9-square grid, and you were supposed to slide them around one at a time until you got them in order? Without the “9” missing, you wouldn’t be able to slide them. That’s a pretty decent rough analogy of how storage generally works, for all sorts of general reasons. If you don’t have any free space, you can’t move the tiles around and actually get anything done. For our sliding number puzzles when we were kids, that was 8/9 of the available storage occupied. A better rule of thumb for us is 8/10, or 80%. Once your disk(s) are 80% full, you should consider them full, and you should immediately be either deleting things or upgrading. If they hit 90% full, you should consider your own personal pants to be on actual fire, and react with an appropriate amount of immediacy to remedy that.

TL;DR: multiply times 1.25 to account for working free space.

Growth

You’re probably not really planning on just storing one chunk of data you have right now and never changing it. You’re almost certainly talking about curating an ever-growing collection of data that changes and accumulates as time goes on. Most people and businesses should plan on their data storage needs to double about every five years – that’s pretty conservative; it can easily get worse than that. Still, five years is also a pretty decent – and very conservative, not aggressive – hardware refresh cycle. So let’s say we want to plan for our storage needs to be fulfilled by what we buy now, until we need new everything anyway. That means doubling everything so you don’t have to upgrade for another few years.

TL;DR: multiply times 2.0 to account for data growth over the next few years.

Disaster Recovery

What, you weren’t planning on not backing your stuff up, were you? At a bare minimum, you’re going to need as much storage for backup as you did for production – most likely, you’ll need considerably more. We’ll be super super generous here and assume all you need is enough space for one single full backup – which usually only applies if you also have redundancy and very heavy-duty “oops recovery” and maybe hotspares as well. But if you don’t have all those things… this really isn’t enough. Really.

TL;DR: multiply times 2.0 to account for one full backup, as disaster recovery.

Redundancy, Hotspares, and Snapshots

Snapshots / “Oops Recovery” Schemes

You want to have a way to fix it pretty much immediately if you accidentally break a document. What this scheme looks like may differ depending on the sophistication of the system you’re working on. At best, you’re talking something like ZFS snapshots. In the middle of the road, Windows’ Volume Shadow Copy service (what powers the “Previous Versions” tab in Windows Explorer). At worst, the Recycle Bin. (And that’s really not good enough and you should figure out a way to do better.) What these things all have in common is that they offer a limiting factor to how badly you can screw yourself with the stroke of a key – you can “undo” whatever it is you broke to a relatively recent version that wasn’t broken in just a few clicks.

Different “oops recovery” schemes have different levels of efficiency, and different amounts of point-in-time granularity. My own ZFS-based systems maintain 30 hourly snapshots, 30 daily snapshots, and 3 monthly snapshots. I generally plan for snapshot space to take up about 33% as much space as my production storage, and that’s not a bad rule of thumb across the board, even if you can’t cram as many of your own schemes level of “oops points” in the same amount of space.

TL;DR: multiply times 1.3 to account for snapshots, VSS, or other “oops recovery”.

Redundancy

Redundancy – in the form of mirrored drives, striped RAID arrays, and so forth – is not a backup! However, it is a very, very useful thing to help you avoid the downtime monster, and in the case of more advanced storage schema like ZFS, to avoid corruption and bitrot. If you’re using 1:1 redundancy – RAID1, RAID10, ZFS mirrors, or btrfs-RAID1 distributed redundancy – this means you need two of every drive. If you’re using two blocks of parity in each eight block stripe (think RAID6 or ZFS RAIDZ2 with eight drives in each vdev), you’re going to be looking at 75% theoretical efficiency that comes out to more like 70% actual efficiency after stripe overhead. I’m just going to go ahead and say “let’s calculate using the more pessimistic number”. So, double everything to account for redundancy.

TL;DR: multiply times 2.0 to account for redundant storage scheme.

Hotspare

This is probably going to be the least common item on the list, but the vast majority of my clients have opted for it at this point. A hotspare server is ready to take over for the production server at a moment’s notice, without an actual “restore the backup” type procedure. With Sanoid, this most frequently means hourly replication from production to hotspare, with the ability to spin up the replicated VMs – both storage and hypervisor – directly on the hotspare server. The hotspare is thus promoted to being production, and what was the production server can be repaired with reduced time pressure and restored into service as a hotspare itself.

If you have a hotspare – and if, say, ten or more people’s payroll and productivity is dependent on your systems being up and running, you probably should – that’s another full redundancy to add to the bill.

TL;DR: bump your “backup” allowance up from x2.0 to x3.0 if you also use hotspare hardware.

The Butcher’s Bill

If you have, and account for, everything we went through above, to store 1 “terabyte” of data you’ll need:

1 “terabyte” (really a tebibyte) of data
x 1.1 TiB per TB
x 1.25 for working free space
x 2.0 for planned growth over the next few years

x 3.0 for disaster recovery + hotspare systems
x 1.3 for snapshot or other “oops recovery”
x 2.0 for redundancy
==========================================
21.45 TB of actual storage hardware.

That can’t be right! You’re insane!

Alright, let’s break that down somewhat differently, then. Keep in mind that we’re talking about three separate computer systems in the above example, each with its own storage (production, hotspare, and disaster recovery). Now let’s instead assume that we’re talking about using drives of a given size, and see what that breaks down to in terms of actual usable storage on them.

Let’s forget about the hotspare and the disaster recovery boxes, so we’re looking at the purely local level now. Then let’s toss out the redundancy, since we’re only talking about one individual drive. That leaves us with 1TB / 1.1 TiB per TB / 1.25 working TiB per stored TiB / 1.3 TiB of prod+snapshots for every TiB of prod = 0.559 TiB of usable capacity per 1TB drive. Factor in planned growth by cutting that in half, and that means you shouldn’t be planning to start out storing more than 0.28 TiB of data on 1TB of storage.

TL;DR: If you have 280GiB of existing data, you need 1TB of local capacity.

That probably sounds more reasonable in terms of your “gut feel”, right? You have 280GiB of data, so you buy a 1TB disk, and that’ll give you some breathing room for a few years? Maybe you think it feels a bit aggressive (it isn’t), but it should at least be within the ballpark of how you’re used to thinking and feeling.

Now multiply by 2 for storage redundancy (mirrored disks), and by 3 for site/server redundancy (production, hotspare, and DR) and you’re at six 1TB disks total, to store 280GiB of data. 6/.28 = 21.43, and we’re right back where we started from, less a couple of rounding errors: we need to provision 21.45 TB for every 1TiB of data we’ve got right now.

8:1 rule of thumb

Based on the same calculations and with a healthy dose of rounding, we come up with another really handy, useful, memorable rule of thumb: when buying, you need eight times as much raw storage in production as the amount of data you have now.

So if you’ve got 1TiB of data, buy servers with 8TB of disks – whether it’s two 4TB disks in a single mirror, or four 2TB disks in two mirrors, or whatever, your rule of thumb is 8:1. Per system, so if you maintain hotspare and DR systems, you’ll need to do that twice more – but it’s still 8:1 in raw storage per machine.

Dual-NIC fanless Celeron 1037u router test – promising!

fanless_celeron_1037u_box_routingFinally found the time to set up my little fanless Celeron 1037u router project today. So far, it’s very promising!

I installed Ubuntu Server on an elderly 4GB SD card I had lying around, with no problems other than the SD card being slow as molasses – which is no fault of the Alibaba machine, of course. Booted from it just fine. I plan on using this little critter at home and don’t want to deal with glacial I/O, though, so the next step was to reinstall Ubuntu Server on a 60GB Kingston SSD, which also had no problems.

With Ubuntu Server (14.04.3 LTS) installed, the next step was getting a basic router-with-NAT iptables config going. I used MASQUERADE so that the LAN side would have NAT, and I went ahead and set up a couple of basic service rules – including a pinhole for forwarding iperf from the WAN side to a client machine on the LAN side – and saved them in /etc/network/iptables, suitable for being restored using /sbin/iptables-restore (ruleset at the end of this post).

Once that was done and I’d gotten dhcpd serving IP addresses on the LAN side, I was ready to plug up the laptop and go! The results were very, very nice:

root@demoserver:~# iperf -c springbok
------------------------------------------------------------
Client connecting to 192.168.0.125, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local demoserver port 48808 connected with springbok port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec
You have new mail in /var/mail/root
root@demoserver:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local demoserver port 5001 connected with springbok port 40378
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec

935mbps up and down… not too freakin’ shabby for a lil’ completely fanless Celeron. What about OpenVPN, with 2048-bit SSL?

------------------------------------------------------------
Client connecting to 10.8.0.38, TCP port 5001
TCP window size: 22.6 KByte (default)
------------------------------------------------------------
[  3] local 10.8.0.1 port 45727 connected with 10.8.0.38 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.6 sec   364 MBytes   264 Mbits/sec 

264mbps? Yeah, that’ll do.

To be fair, though, LZO compression is enabled in my OpenVPN setup, which is undoubtedly improving our iperf run. So let’s be fair, and try a slightly more “real-world” test using ssh to bring in a hefty chunk of incompressible pseudorandom data, instead:

root@router:/etc/openvpn# ssh -c arcfour jrs@10.8.0.1 'cat /tmp/test.bin' | pv > /dev/null
 333MB 0:00:17 [19.5MB/s] [                         <=>                                  ]

Still rockin’ a solid 156mbps, over OpenVPN, after SSH overhead, using incompressible data. Niiiiiiice.

For posterity’s sake, here is the iptables ruleset I’m using for testing on the little Celeron.

*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]

# p4p1 is WAN interface
-A POSTROUTING -o p4p1 -j MASQUERADE

# NAT pinhole: iperf from WAN to LAN
-A PREROUTING -p tcp -m tcp -i p4p1 --dport 5001 -j DNAT --to-destination 192.168.100.101:5001

COMMIT

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:LOGDROP - [0:0]

# create LOGDROP target to log and drop packets
-A LOGDROP -j LOG
-A LOGDROP -j DROP

##### basic global accept rules - ICMP, loopback, traceroute, established all accepted
-A INPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -m state --state ESTABLISHED -j ACCEPT

# enable traceroute rejections to get sent out
-A INPUT -p udp -m udp --dport 33434:33523 -j REJECT --reject-with icmp-port-unreachable

##### Service rules
#
# OpenVPN
-A INPUT -p udp -m udp --dport 1194 -j ACCEPT

# ssh - drop any IP that tries more than 10 connections per minute
-A INPUT -i eth0 -p tcp -m tcp --dport 22 -m state --state NEW -m recent --set --name DEFAULT --mask 255.255.255.255 --rsource
-A INPUT -i eth0 -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 60 --hitcount 11 --name DEFAULT --mask 255.255.255.255 --rsource -j LOGDROP
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT

# www
-A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 443 -j ACCEPT

# default drop because I'm awesome
-A INPUT -j DROP

##### forwarding ruleset
#
# forward packets along established/related connections
-A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

# forward from LAN (p1p1) to WAN (p4p1)
-A FORWARD -i p1p1 -o p4p1 -j ACCEPT

# NAT pinhole: iperf from WAN to LAN
-A FORWARD -p tcp -d 192.168.100.101 --dport 5001 -j ACCEPT

# drop all other forwarded traffic
-A FORWARD -j DROP

COMMIT

Reshuffling pool storage on the fly

If you’re new here:

Sanoid is an open-source storage management project, built on top of the OpenZFS filesystem and Linux KVM hypervisor, with the aim of providing affordable, open source, enterprise-class hyperconverged infrastructure. Most of what we’re talking about today boils down to “managing ZFS storage” – although Sanoid’s replication management tool Syncoid does make the operation a lot less complicated.

Recently, I deployed two Sanoid appliances to a new customer in Raleigh, NC.

When the customer specced out their appliances, their plan was to deploy one production server and one offsite DR server – and they wanted to save a little money, so the servers were built out differently. Production had two SSDs and six conventional disks, but offsite DR just had eight conventional disks – not like DR needs a lot of IOPS performance, right?

Well, not so right. When I got onsite, I discovered that the “disaster recovery” site was actually a working space, with a mission critical server in it, backed up only by a USB external disk. So we changed the plan: instead of a production server and an offsite DR server, we now had two production servers, each of which replicated to the other for its offsite DR. This was a big perk for the customer, because the lower-specced “DR” appliance still handily outperformed their original server, as well as providing ZFS and Sanoid’s benefits of rolling snapshots, offsite replication, high data integrity, and so forth.

But it still bothered me that we didn’t have solid state in the second suite.

The main suite had two pools – one solid state, for boot disks and database instances, and one rust, for bulk storage (now including backups of this suite). Yes, our second suite was performing better now than it had been on their original, non-Sanoid server… but they had a MySQL instance that tended to be noticeably slow on inserts, and the desire to put that MySQL instance on solid state was just making me itch. Problem is, the client was 250 miles away, and their Sanoid Standard appliance was full – eight hot-swap bays, each of which already had a disk in it. No more room at the inn!

We needed minimal downtime, and we also needed minimal on-site time for me.

You can’t remove a vdev from an existing pool, so we couldn’t just drop the existing four-mirror pool to a three-mirror pool. So what do you do? We could have stuffed the new pair of SSDs somewhere inside the case, but I really didn’t want to give up the convenience of externally accessible hot swap bays.

So what do you do?

In this case, what you do – after discussing all the pros and cons with the client decision makers, of course – is you break some vdevs. Our existing pool had four mirrors, like this:

	NAME                              STATE     READ WRITE CKSUM
	data                              ONLINE       0     0     0
	  mirror-0                        ONLINE       0     0     0
	    wwn-0x50014ee20b8b7ba0-part3  ONLINE       0     0     0
	    wwn-0x50014ee20be7deb4-part3  ONLINE       0     0     0
	  mirror-1                        ONLINE       0     0     0
	    wwn-0x50014ee261102579-part3  ONLINE       0     0     0
	    wwn-0x50014ee2613cc470-part3  ONLINE       0     0     0
	  mirror-2                        ONLINE       0     0     0
	    wwn-0x50014ee2613cfdf8-part3  ONLINE       0     0     0
	    wwn-0x50014ee2b66693b9-part3  ONLINE       0     0     0
          mirror-3                        ONLINE       0     0     0
            wwn-0x50014ee20b9b4e0d-part3  ONLINE       0     0     0
            wwn-0x50014ee2610ffa17-part3  ONLINE       0     0     0

Each of those mirrors can be broken, freeing up one disk – at the expense of removing redundancy on that mirror, of course. At first, I thought I’d break all the mirrors, create a two-mirror pool, migrate the data, then destroy the old pool and add one more mirror to the new pool. And that would have worked – but it would have left the data unbalanced, so that the majority of reads would only hit two of my three mirrors. I decided to go for the cleanest result possible – a three mirror pool with all of its data distributed equally across all three mirrors – and that meant I’d need to do my migration in two stages, with two periods of user downtime.

First, I broke mirror-0 and mirror-1.

I detached a single disk from each of my first two mirrors, then cleared its ZFS label afterward.

    root@client-prod1:/# zpool detach wwn-0x50014ee20be7deb4-part3 ; zpool labelclear wwn-0x50014ee20be7deb4-part3
    root@client-prod1:/# zpool detach wwn-0x50014ee2613cc470-part3 ; zpool labelclear wwn-0x50014ee2613cc470-part3

Now mirror-0 and mirror-1 are in DEGRADED condition, as is the pool – but it’s still up and running, and the users (who are busily working on storage and MySQL virtual machines hosted on the Sanoid Standard appliance we’re shelled into) are none the wiser.

Now we can create a temporary pool with the two freed disks.

We’ll also be sure to set compression on by default for all datasets created on or replicated onto our new pool – something I truly wish was the default setting for OpenZFS, since for almost all possible cases, LZ4 compression is a big win.

    root@client-prod1:/# zpool create -o ashift=12 tmppool mirror wwn-0x50014ee20be7deb4-part3 wwn-0x50014ee2613cc470-part3
    root@client-prod1:/# zfs set compression=lz4 tmppool

We haven’t really done much yet, but it felt like a milestone – we can actually start moving data now!

Next, we use Syncoid to replicate our VMs onto the new pool.

At this point, these are still running VMs – so our users won’t see any downtime yet. After doing an initial replication with them up and running, we’ll shut them down and do a “touch-up” – but this way, we get the bulk of the work done with all systems up and running, keeping our users happy.

    root@client-prod1:/# syncoid -r data/images tmppool/images ; syncoid -r data/backup tmppool/backup

This took a while, but I was very happy with the performance – never dipped below 140MB/sec for the entire replication run. Which also strongly implies that my users weren’t seeing a noticeable amount of slowdown! This initial replication completed in a bit over an hour.

Now, I was ready for my first little “blip” of actual downtime.

First, I shut down all the VMs running on the machine:

    root@client-prod1:/# virsh shutdown suite100 ; virsh shutdown suite100-mysql ; virsh shutdown suite100-openvpn
    root@client-prod1:/# watch -n 1 virsh list

As soon as virsh list showed me that the last of my three VMs were down, I ctrl-C’ed out of my watch command and replicated again, to make absolutely certain that no user data would be lost.

    root@client-prod1:/# syncoid -r data/images tmppool/images ; syncoid -r data/backup tmppool/backup

This time, my replication was done in less than ten seconds.

Doing replication in two steps like this is a huge win for uptime, and a huge win for the users – while our initial replication needed a little more than an hour, the “touch-up” only had to copy as much data as the users could store in a few moments, so it was done in a flash.

Next, it’s time to rename the pools.

Our system expects to find the storage for its VMs in /data/images/VMname, so for minimum downtime and reconfiguration, we’ll just export and re-import our pools so that it finds what it’s looking for.

    root@client-prod1:/# zpool export data ; zpool import data olddata 
    root@client-prod1:/# zfs set mountpoint=/olddata/images/qemu olddata/images/qemu ; zpool export olddata

Wait, what was that extra step with the mountpoint?

Sanoid keeps the virtual machines’ hardware definitions on the zpool rather than on the root filesystem – so we want to make sure our old pool’s ‘qemu’ dataset doesn’t try to automount itself back to its original mountpoint, /etc/libvirt/qemu.

    root@client-prod1:/# zpool export tmppool ; zpool import tmppool data
    root@client-prod1:/# zfs set mountpoint=/etc/libvirt/qemu data/images/qemu

OK, at this point our original, degraded zpool still exists, intact, as an exported pool named olddata; and our temporary two disk pool exists as an active pool named data, ready to go.

After less than one minute of downtime, it’s time to fire up the VMs again.

    root@client-prod1:/# virsh start suite100 ; virsh start suite100-mysql ; virsh start suite100-openvpn

If anybody took a potty break or got up for a fresh cup of coffee, they probably missed our first downtime window entirely. Not bad!

Time to destroy the old pool, and re-use its remaining disks.

After a couple of checks to make absolutely sure everything was working – not that it shouldn’t have been, but I’m definitely of the “measure twice, cut once” school, especially when the equipment is a few hundred miles away – we’re ready for the first completely irreversible step in our eight-disk fandango: destroying our original pool, so that we can create our final one.

    root@client-prod1:/# zpool destroy olddata
    root@client-prod1:/# zpool create -o ashift=12 newdata mirror wwn-0x50014ee20b8b7ba0-part3 wwn-0x50014ee261102579-part3
    root@client-prod1:/# zpool add -o ashift=12 newdata mirror wwn-0x50014ee2613cfdf8-part3 wwn-0x50014ee2b66693b9-part3
    root@client-prod1:/# zpool add -o ashift=12 newdata mirror wwn-0x50014ee20b9b4e0d-part3 wwn-0x50014ee2610ffa17-part3
    root@client-prod1:/# zfs set compression=lz4 newdata

Perfect! Our new, final pool with three mirrors is up, LZ4 compression is enabled, and it’s ready to go.

Now we do an initial Syncoid replication to the final, six-disk pool:

    root@client-prod1:/# syncoid -r data/images newdata/images ; syncoid -r data/backup newdata/backup

About an hour later, it’s time to shut the VMs down for Brief Downtime Window #2.

    root@client-prod1:/# virsh shutdown suite100 ; virsh shutdown suite100-mysql ; virsh shutdown suite100-openvpn
    root@client-prod1:/# watch -n 1 virsh list

Once our three VMs are down, we ctrl-C out of ‘watch’ again, and…

Time for our final “touch-up” re-replication:

    root@client-prod1:/# syncoid -r data/images newdata/images ; syncoid -r data/backup newdata/backup

At this point, all the actual data is where it should be, in the right datasets, on the right pool.

We fix our mountpoints, shuffle the pool names, and fire up our VMs again:

    root@client-prod1:/# zpool export data ; zpool import data tmppool 
    root@client-prod1:/# zfs set mountpoint=/tmppool/images/qemu olddata/images/qemu ; zpool export tmppool
    root@client-prod1:/# zpool export newdata ; zpool import newdata data
    root@client-prod1:/# zfs set mountpoint=/etc/libvirt/qemu data/images/qemu
    root@client-prod1:/# virsh start suite100 ; virsh start suite100-mysql ; virsh start suite100-openvpn

Boom! Another downtime window over with in less than a minute.

Our TOTAL elapsed downtime was less than two minutes.

At this point, our users are up and running on the final three-mirror pool, and we won’t be inconveniencing them again today. Again we do some testing to make absolutely certain everything’s fine, and of course it is.

The very last step: destroying tmppool.

    root@client-prod1:/# zpool destroy tmppool

That’s it; we’re done for the day.

We’re now up and running on only six total disks, not eight, which gives us the room we need to physically remove two disks. With those two disks gone, we’ve got room to slap in a pair of SSDs for a second pool with a solid-state mirror vdev when we’re (well, I’m) there in person, in a week or so. That will also take a minute or less of actual downtime – and in that case, the preliminary replication will go ridiculously fast too, since we’ll only be moving the MySQL VM (less than 20G of data), and we’ll be writing at solid state device speeds (upwards of 400MB/sec, for the Samsung Pro 850 series I’ll be using).

None of this was exactly rocket science. So why am I sharing it?

Well, it’s pretty scary going in to deliberately degrade a production system, so I wanted to lay out a roadmap for anybody else considering it. And I definitely wanted to share the actual time taken for the various steps – I knew my downtime windows would be very short, but honestly I’d been a little unsure how the initial replication would go, given that I was deliberately breaking mirrors and degrading arrays. But it went great! 140MB/sec sustained throughput makes even pretty substantial tasks go by pretty quickly – and aside from the two intervals with a combined downtime of less than two minutes, my users never even noticed anything happening.

Closing with a plug: yes, you can afford it.

If this kind of converged infrastructure (storage and virtualization) management sounds great to you – high performance, rapid onsite and offsite replication, nearly zero user downtime, and a whole lot more – let me add another bullet point: low cost. Getting started isn’t prohibitively expensive.

Sanoid appliances like the ones we’re describing here – including all the operating systems, hardware, and software needed to run your VMs and manage their storage and automatically replicate them both on and offsite – start at less than $5,000. For more information, send us an email, or call us at (803) 250-1577.

PSA: don’t buy or trust Lenovo

There’s a big flurry in the IT world today about Lenovo shipping malware – oops, pardon me, a PUP or “Potentially Unwanted Program” – in some of its consumer laptops.

I’m going to try to keep my own technical coverage of this fairly brief; you can refer to ZDNet’s article for a somewhat glossier overview.

Superfish – the maladware in question – does the following:

  • installs a certificate in the Trusted CA store on the infected machine
  • installs an SSL-enabled proxy on the machine to intercept all HTTP and HTTPS traffic
  • automatically generates a new certificate from the Superfish CA onboard to match any SSL connection that’s being made

So Superfish is sniffing literally ALL of the traffic on your machine – everything from browsing Reddit to transferring funds online with your bank. But wait, it gets worse:

  • Superfish’s proxy does not pass on validation errors it encounters
  • uninstalling Superfish does not remove the bogus CA cert from your machine
  • all machines use the same private key for all Superfish-generated certs

This means that if you have Superfish, anyone can insert themselves in your traffic – go to a coffee shop, and anyone who wants to can intercept your wireless connection, use a completely bogus certificate to claim to be your bank, and Superfish will obligingly stamp its own bogus certificate on top of the connection – which your browser trusts, which means you get the green lock icon and no warning even though both Superfish and the other attacker are actively compromising your connection – they can steal credentials, change the content of the pages you see, perform actions as you while you’re logged in, sky’s the limit.

This also means that even after you remove Superfish, if you haven’t manually found and deleted the bogus CA certificate, anybody who is aware of Superfish can generate bogus certificates that pass the Superfish CA – so you’re still vulnerable to being MITM’ed by literally anybody anywhere, even though you’ve removed Superfish itself.

So, this is bad. Really bad. Far worse than the usual bloatware / shovelware crap found on consumer machines. In fact, this is unusually bad even by the already-terrible standards of “PUPs” which mangle and modify your web traffic. But that’s not the worst part. The worst part is Lenovo’s official statement (mirrored on the Wayback Machine in case they alter it):

We have thoroughly investigated this technology and do not find any evidence to substantiate security concerns. […] The relationship with Superfish is not financially significant; our goal was to enhance the experience for users.

this-is-fine

The company is looking you dead in the eye and telling you that they didn’t care about the money they got for installing software that injects ads into your web browsing experience, they did it because they thought it would be awesome for you.

You can take that one of two ways: either they’re far too malicious to trust with your IT purchases, or they’re far too ignorant to trust with your IT purchases. I cannot for the life of me think of a third option.

ZFS: You should use mirror vdevs, not RAIDZ.

Continuing this week’s “making an article so I don’t have to keep typing it” ZFS series… here’s why you should stop using RAIDZ, and start using mirror vdevs instead.

The basics of pool topology

A pool is a collection of vdevs. Vdevs can be any of the following (and more, but we’re keeping this relatively simple):

  • single disks (think RAID0)
  • redundant vdevs (aka mirrors – think RAID1)
  • parity vdevs (aka stripes – think RAID5/RAID6/RAID7, aka single, dual, and triple parity stripes)

The pool itself will distribute writes among the vdevs inside it on a relatively even basis. However, this is not a “stripe” like you see in RAID10 – it’s just distribution of writes. If you make a RAID10 out of 2 2TB drives and 2 1TB drives, the second TB on the bigger drives is wasted, and your after-redundancy storage is still only 2 TB. If you put the same drives in a zpool as two mirror vdevs, they will be a 2x2TB mirror and a 2x1TB mirror, and your after-redundancy storage will be 3TB. If you keep writing to the pool until you fill it, you may completely fill the two 1TB disks long before the two 2TB disks are full. Exactly how the writes are distributed isn’t guaranteed by the specification, only that they will be distributed.

What if you have twelve disks, and you configure them as two RAIDZ2 (dual parity stripe) vdevs of six disks each? Well, your pool will consist of two RAIDZ2 arrays, and it will distribute writes across them just like it did with the pool of mirrors. What if you made a ten disk RAIDZ2, and a two disk mirror? Again, they go in the pool, the pool distributes writes across them. In general, you should probably expect a pool’s performance to exhibit the worst characteristics of each vdev inside it. In practice, there’s no guarantee where reads will come from inside the pool – they’ll come from “whatever vdev they were written to”, and the pool gets to write to whichever vdevs it wants to for any given block(s).

Storage Efficiency

If it isn’t clear from the name, storage efficiency is the ratio of usable storage capacity (after redundancy or parity) to raw storage capacity (before redundancy or parity).

This is where a lot of people get themselves into trouble. “Well obviously I want the most usable TB possible out of the disks I have, right?” Probably not. That big number might look sexy, but it’s liable to get you into a lot of trouble later. We’ll cover that further in the next section; for now, let’s just look at the storage efficiency of each vdev type.

  • single disk vdev(s) – 100% storage efficiency. Until you lose any single disk, and it becomes 0% storage efficency…
    single-disk vdevs
    eight single-disk vdevs



  • RAIDZ1 vdev(s) – (n-1)/n, where n is the number of disks in each vdev. For example, a RAIDZ1 of eight disks has an SE of 7/8 = 87.5%.
    raidz1
    an eight disk raidz1 vdev



  • RAIDZ2 vdev(s) – (n-2)/n. For example, a RAIDZ2 of eight disks has an SE of 6/8 = 75%.
    raidz2
    an eight disk raidz2 vdev



  • RAIDZ3 vdev(s) – (n-3)/n. For example, a RAIDZ3 of eight disks has an SE of 5/8 = 62.5%.
    raidz3
    an eight disk raidz3 vdev



  • mirror vdev(s) – 1/n, where n is the number of disks in each vdev. Eight disks set up as 4 2-disk mirror vdevs have an SE of 1/2 = 50%.
    mirror vdevs
    a pool of four 2-disk mirror vdevs



One final note: striped (RAIDZ) vdevs aren’t supposed to be “as big as you can possibly make them.” Experts are cagey about actually giving concrete recommendations about stripe width (the number of devices in a striped vdev), but they invariably recommend making them “not too wide.” If you consider yourself an expert, make your own expert decision about this. If you don’t consider yourself an expert, and you want more concrete general rule-of-thumb advice: no more than eight disks per vdev.

Fault tolerance / degraded performance

Be careful here. Keep in mind that if any single vdev fails, the entire pool fails with it. There is no fault tolerance at the pool level, only at the individual vdev level! So if you create a pool with single disk vdevs, any failure will bring the whole pool down.

It may be tempting to go for that big storage number and use RAIDZ1… but it’s just not enough. If a disk fails, the performance of your pool will be drastically degraded while you’re replacing it. And you have no fault tolerance at all until the disk has been replaced and completely resilvered… which could take days or even weeks, depending on the performance of your disks, the load your actual use places on the disks, etc. And if one of your disks failed, and age was a factor… you’re going to be sweating bullets wondering if another will fail before your resilver completes. And then you’ll have to go through the whole thing again every time you replace a disk. This sucks. Don’t do it. Conventional RAID5 is strongly deprecated for exactly the same reasons. According to Dell, “Raid 5 for all business critical data on any drive type [is] no longer best practice.”

RAIDZ2 and RAIDZ3 try to address this nightmare scenario by expanding to dual and triple parity, respectively. This means that a RAIDZ2 vdev can survive two drive failures, and a RAIDZ3 vdev can survive three. Problem solved, right? Well, problem mitigated – but the degraded performance and resilver time is even worse than a RAIDZ1, because the parity calculations are considerably gnarlier. And it gets worse the wider your stripe (number of disks in the vdev).

Saving the best for last: mirror vdevs. When a disk fails in a mirror vdev, your pool is minimally impacted – nothing needs to be rebuilt from parity, you just have one less device to distribute reads from. When you replace and resilver a disk in a mirror vdev, your pool is again minimally impacted – you’re doing simple reads from the remaining member of the vdev, and simple writes to the new member of the vdev. In no case are you re-writing entire stripes, all other vdevs in the pool are completely unaffected, etc. Mirror vdev resilvering goes really quickly, with very little impact on the performance of the pool. Resilience to multiple failure is very strong, though requires some calculation – your chance of surviving a disk failure is 1-(f/(n-f)), where f is the number of disks already failed, and n is the number of disks in the full pool. In an eight disk pool, this means 100% survival of the first disk failure, 85.7% survival of a second disk failure, 66.7% survival of a third disk failure. This assumes two disk vdevs, of course – three disk mirrors are even more resilient.

But wait, why would I want to trade guaranteed two disk failure in RAIDZ2 with only 85.7% survival of two disk failure in a pool of mirrors? Because of the drastically shorter time to resilver, and drastically lower load placed on the pool while doing so. The only disk more heavily loaded than usual during a mirror vdev resilvering is the other disk in the vdev – which might sound bad, but remember that it’s no more heavily loaded than it would’ve been as a RAIDZ member.  Each block resilvered on a RAIDZ vdev requires a block to be read from each surviving RAIDZ member; each block written to a resilvering mirror only requires one block to be read from a surviving vdev member.  For a six-disk RAIDZ1 vs a six disk pool of mirrors, that’s five times the extra I/O demands required of the surviving disks.

Resilvering a mirror is much less stressful than resilvering a RAIDZ.

One last note on fault tolerance

No matter what your ZFS pool topology looks like, you still need regular backup.

Say it again with me: I must back up my pool!

ZFS is awesome. Combining checksumming and parity/redundancy is awesome. But there are still lots of potential ways for your data to die, and you still need to back up your pool. Period. PERIOD!

Normal performance

It’s easy to think that a gigantic RAIDZ vdev would outperform a pool of mirror vdevs, for the same reason it’s got a greater storage efficiency. “Well when I read or write the data, it comes off of / goes onto more drives at once, so it’s got to be faster!” Sorry, doesn’t work that way. You might see results that look kinda like that if you’re doing a single read or write of a lot of data at once while absolutely no other activity is going on, if the RAIDZ is completely unfragmented… but the moment you start throwing in other simultaneous reads or writes, fragmentation on the vdev, etc then you start looking for random access IOPS. But don’t listen to me, listen to one of the core ZFS developers, Matthew Ahrens: “For best performance on random IOPS, use a small number of disks in each RAID-Z group. E.g, 3-wide RAIDZ1, 6-wide RAIDZ2, or 9-wide RAIDZ3 (all of which use ⅓ of total storage for parity, in the ideal case of using large blocks). This is because RAID-Z spreads each logical block across all the devices (similar to RAID-3, in contrast with RAID-4/5/6). For even better performance, consider using mirroring.

Please read that last bit extra hard: For even better performance, consider using mirroring. He’s not kidding. Just like RAID10 has long been acknowledged the best performing conventional RAID topology, a pool of mirror vdevs is by far the best performing ZFS topology.

Future expansion

This is one that should strike near and dear to your heart if you’re a SOHO admin or a hobbyist. One of the things about ZFS that everybody knows to complain about is that you can’t expand RAIDZ. Once you create it, it’s created, and you’re stuck with it.

Well, sorta.

Let’s say you had a server with 12 slots to put drives in, and you put six drives in it as a RAIDZ2. When you bought it, 1TB drives were a great bang for the buck, so that’s what you used. You’ve got 6TB raw / 4TB usable. Two years later, 2TB drives are cheap, and you’re feeling cramped. So you fill the rest of the six available bays in your server, and now you’ve added an 12TB raw / 8TB usable vdev, for a total pool size of 18TB/12TB. Two years after that, 4TB drives are out, and you’re feeling cramped again… but you’ve got no place left to put drives. Now what?

Well, you actually can upgrade that original RAIDZ2 of 1TB drives – what you have to do is fail one disk out of the vdev and remove it, then replace it with one of your 4TB drives. Wait for the resilvering to complete, then fail a second one, and replace it. Lather, rinse, repeat until you’ve replaced all six drives, and resilvered the vdev six separate times – and after the sixth and last resilvering finishes, you have a 24TB raw / 16TB usable vdev in place of the original 6TB/4TB one. Question is, how long did it take to do all that resilvering? Well, if that 6TB raw vdev was nearly full, it’s not unreasonable to expect each resilvering to take twelve to sixteen hours… even if you’re doing absolutely nothing else with the system. The more you’re trying to actually do in the meantime, the slower the resilvering goes. You might manage to get six resilvers done in six full days, replacing one disk per day. But it might take twice that long or worse, depending on how willing to hover over the system you are, and how heavily loaded it is in the meantime.

What if you’d used mirror vdevs? Well, to start with, your original six drives would have given you 6TB raw / 3TB usable. So you did give up a terabyte there. But maybe you didn’t do such a big upgrade the first time you expanded. Maybe since you only needed to put in two more disks to get more storage, you only bought two 2TB drives, and by the time you were feeling cramped again the 4TB disks were available – and you still had four bays free. Eventually, though, you crammed the box full, and now you’re in that same position of wanting to upgrade those old tiny 1TB disks. You do it the same way – you replace, resilver, replace, resilver – but this time, you see the new space after only two resilvers. And each resilvering happens tremendously faster – it’s not unreasonable to expect nearly-full 1TB mirror vdevs to resilver in three or four hours. So you can probably upgrade an entire vdev in a single day, even without having to hover over the machine too crazily. The performance on the machine is hardly impacted during the resilver. And you see the new capacity after every two disks replaced, not every six.

TL;DR

Too many words, mister sysadmin. What’s all this boil down to?

  • don’t be greedy. 50% storage efficiency is plenty.
  • for a given number of disks, a pool of mirrors will significantly outperform a RAIDZ stripe.
  • a degraded pool of mirrors will severely outperform a degraded RAIDZ stripe.
  • a degraded pool of mirrors will rebuild tremendously faster than a degraded RAIDZ stripe.
  • a pool of mirrors is easier to manage, maintain, live with, and upgrade than a RAIDZ stripe.
  • BACK. UP. YOUR POOL. REGULARLY. TAKE THIS SERIOUSLY.

TL;DR to the TL;DR – unless you are really freaking sure you know what you’re doing… use mirrors. (And if you are really, really sure what you’re doing, you’ll probably change your mind after a few years and wish you’d done it this way to begin with.)