Installing WordPress on Apache the modern way

It's been bugging me for a while that there are no correct guides to be found for running WordPress on modern Apache 2.4 or above with the Event or Worker MPMs. We're going to go ahead and correct that lapse today, by walking through a brand-new WordPress install on a new Ubuntu 18.04 VM (grab one for $5/mo at Linode, Digital Ocean, or your favorite host).

Installing system packages

Once you’ve set up the VM itself, you’ll first need to update the package list:

root@VM:~# apt update

Once it’s updated, you’ll need to install Apache itself, along with PHP and the various extras needed for a WordPress installation.

root@VM:~# apt install apache2 mysql-server php-fpm php-common php-mbstring php-xmlrpc php-soap php-gd php-xml php-intl php-mysql php-cli php-ldap php-zip php-curl

The key bits here are Apache2, your HTTP server; MySQL, your database server; and php-fpm, which manages a pool of PHP worker processes your HTTP server can connect to in order to build WordPress's dynamic content as necessary.

What you absolutely, positively do not want to do here is install mod_php. If you do that, your nice modern Apache2 with its nice modern Event process model gets immediately switched back to your granddaddy’s late-90s-style prefork, loading PHP processors into every single child process, and preventing your site from scaling if you get any significant traffic!

Enable the proxy_fcgi module

Instead – and this is the bit none of the guides I've found mention – you just need to enable one module in Apache itself, and enable the PHP-FPM configuration snippet that apt already installed for you. (You will need to figure out which version of php-fpm is installed: dpkg --get-selections | grep fpm can help here if you aren't sure.)

root@VM:~# a2enmod proxy_fcgi
root@VM:~# a2enconf php7.2-fpm
root@VM:~# systemctl restart apache2

Your Apache2 server is now ready to serve PHP applications, like WordPress. (Note for more advanced admins: if you’re tuning for larger scale, don’t forget that it’s not only about the web server connections anymore; you also want to keep an eye on how many PHP worker processes you have in your pool. You’ll do that in /etc/php/[version]/fpm/pool.d/www.conf.)
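
For reference, the knobs you'll be adjusting in that file are the pm directives. The exact values depend entirely on your RAM and traffic, so treat the numbers below as placeholders rather than recommendations, and substitute your actual PHP version in the path:

; /etc/php/7.2/fpm/pool.d/www.conf (path varies with your PHP version)
pm = dynamic
pm.max_children = 20
pm.start_servers = 4
pm.min_spare_servers = 2
pm.max_spare_servers = 8

After changing pool settings, restart the php-fpm service itself (e.g. systemctl restart php7.2-fpm), not Apache.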

Download and extract WordPress

We’re going to keep things super simple in this guide, and just serve WordPress from the existing default vhost in its standard location, at /var/www/html.

root@VM:~# cd /var/www
root@VM:/var/www# wget https://wordpress.org/latest.tar.gz
root@VM:/var/www# tar zxvf latest.tar.gz
root@VM:/var/www# chown -R www-data:www-data wordpress
root@VM:/var/www# mv html html.dist
root@VM:/var/www# mv wordpress html

Create a database for WordPress

The last step before you can browse to your new WordPress installation is creating the database itself.

root@VM:/var/www# mysql -u root

mysql> create database wordpress;
Query OK, 1 row affected (0.01 sec)

mysql> grant all on wordpress.* to 'wordpress'@'localhost' identified by 'superduperpassword';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> exit;

This created a database named wordpress, with a user named wordpress, and a password superduperpassword. That’s a bad password. Don’t actually use that password. (Also, if mysql -u root wanted a password, and you don’t have it – cat /etc/mysql/debian.cnf, look for the debian-sys-maint password, and connect to mysql using mysql -u debian-sys-maint instead. Everything else will work fine.)

note for ubuntu 20.04 / mysql 8.0 users:

MySQL changed things a bit with 8.0: grant all on db.* to 'user'@'localhost' identified by 'password'; no longer works all in one step. Instead, you first need to create user 'user'@'localhost' identified by 'password'; and only then can you grant all on db.* to 'user'@'localhost'; – you no longer need to (or can) specify the password on the grant line itself.
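
Using the same illustrative database name, username, and (still bad, still don't use it) password from above, the MySQL 8.0 two-step version looks like this:

mysql> create user 'wordpress'@'localhost' identified by 'superduperpassword';
mysql> grant all on wordpress.* to 'wordpress'@'localhost';
mysql> exit;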

All done – browser time!

Now that you've set up Apache, dropped the WordPress installer in its default directory, and created a MySQL database, you're ready to run through the WordPress setup itself by browsing directly to http://your.servers.ip.address/. Have fun!

About ZFS recordsize

ZFS stores data in records—also known as, and interchangeably referred to by OpenZFS devs as, blocks—which are themselves composed of on-disk sectors. The physical sector size on disk is set by the ashift value at the time of vdev creation, and is immutable. The recordsize, on the other hand, is individual to each dataset (although it can be inherited from parent datasets), and can be changed at any time you like. In 2019, recordsize defaults to 128K if not explicitly set.

big files? big recordsize.

The general rule of recordsize is that it should closely match the typical workload experienced within that dataset. For example, a dataset used to store high-quality JPGs, averaging 5MB or more, should have recordsize=1M. This matches the typical I/O seen in that dataset – either reading or writing a full 5+ MB JPG, with no random access within each file – quite well; setting that larger recordsize prevents the files from becoming unduly fragmented, ensuring the fewest IOPS are consumed during either read or write of the data within that dataset.
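
Setting it is a one-liner. The pool and dataset names below are hypothetical, so substitute your own; also keep in mind that recordsize only affects newly-written data, so files already in the dataset keep their existing blocks until they're rewritten:

root@box:~# zfs set recordsize=1M tank/photos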

DB binaries? Smaller recordsize.

By contrast, a dataset which directly contains a MySQL InnoDB database should have recordsize=16K. That’s because InnoDB defaults to a 16KB page size, so most operations on an InnoDB database will be done in individual 16K chunks of data. Matching recordsize to MySQL’s page size here means we maximize the available IOPS, while minimizing latency on the highly sync()hronous reads and writes made by the database (since we don’t need to read or write extraneous data while handling our MySQL pages).
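
If you want to verify before tuning, you can ask MySQL itself what page size it's using, then set the dataset to match (dataset name here is hypothetical):

mysql> show variables like 'innodb_page_size';

root@box:~# zfs set recordsize=16K tank/mysql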

VMs? Match the recordsize to the VM storage format.

(That’s cluster_size, for QEMU/KVM .qcow2.)

On the other hand, if you’ve got a MySQL InnoDB database stored within a VM, your optimal recordsize won’t necessarily be either of the above – for example, KVM .qcow2 files default to a cluster_size of 64KB. If you’ve set up a VM on .qcow2 with default cluster_size, you don’t want to set recordsize any lower (or higher!) than the cluster_size of the .qcow2 file. So in this case, you’ll want recordsize=64K to match the .qcow2’s cluster_size=64K, even though the InnoDB database inside the VM is probably using smaller pages.
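
You can check the cluster_size of an existing image with qemu-img info – a default-cluster .qcow2 will report cluster_size: 65536 – and then set the dataset to match. The image path and dataset name below are just examples:

root@box:~# qemu-img info /tank/images/myvm.qcow2 | grep cluster_size
root@box:~# zfs set recordsize=64K tank/images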

An advanced administrator might look at all of this, determine that a VM’s primary function in life is to run MySQL, that MySQL’s default page size is good, and therefore set both the .qcow2 cluster_size and the dataset’s recordsize to match, at 16K each.
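
If you go that route, the matching happens at image creation time – something like the following, with hypothetical paths and sizes:

root@box:~# qemu-img create -f qcow2 -o cluster_size=16K /tank/images/mysql-vm.qcow2 40G
root@box:~# zfs set recordsize=16K tank/images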

A different administrator might look at all this, determine that the performance of MySQL in the VM with all the relevant settings left to their defaults was perfectly fine, and elect not to hand-tune all this crap at all. And that’s okay.

What if I set recordsize too high?

If recordsize is much higher than the size of the typical storage operation within the dataset, latency will be greatly increased and this is likely to be incredibly frustrating. IOPS will be very limited, databases will perform poorly, desktop UI will be glacial, etc.

What if I set recordsize too low?

If recordsize is a lot smaller than the size of the typical storage operation within the dataset, fragmentation will be greatly (and unnecessarily) increased, leading to unnecessary performance problems down the road. IOPS as measured by artificial tools will be super high, but performance profiles will be limited to those presented by random I/O at the record size you’ve set, which in turn can be significantly worse than the performance profile of larger block operations.

You’ll also screw up compression with an unnecessarily low recordsize; zfs inline compression dictionaries are per-record, and work by fitting the contents of a single logical record (aka block) into fewer on-disk sectors than it would have needed if uncompressed.

If you set compression=lz4, ashift=12, and recordsize=4K, you'll effectively have NO compression, because your sector size is equal to your recordsize – and you can't store data in "half a sector", so compressing a one-sector record down to 0.5 sectors' worth of data would not result in less usage of on-disk capacity.

Meanwhile, the same dataset with the default 128K recordsize might easily have a 1.7:1 compression ratio.
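
If you want to see how this plays out on your own data, the relevant properties are easy to inspect (dataset name below is an example):

root@box:~# zfs get compression,recordsize,compressratio tank/dataset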

Are the defaults good? Do I aim high, or do I aim low?

128K is a pretty reasonable “ah, what the heck, it works well enough” setting in general. It penalizes you significantly on IOPS and latency for small random I/O operations, and it presents more fragmentation than necessary for large contiguous files, but it’s not horrible at either task. There is a lot to be gained from tuning recordsize more appropriately for task, though.

What about bittorrent?

The “big records for big files” rule of thumb still applies for datasets used as bittorrent targets.

This is one of those cases where things work just the opposite of how you might think – torrents write data in relatively small chunks, and access them randomly for both read and write, so you might reasonably think this calls for a small recordsize. However, the actual data in the torrents is typically huge files, which are accessed in their entirety for everything but the initial bittorrent session.

Since the typical access pattern is "large-file", most people will be better off using recordsize=1M in the torrent target storage. This keeps the downloaded data unfragmented despite the bittorrent client's insanely random writing patterns. The small chunks acquired during the bittorrent session simply accumulate in RAM as dirty data until full records are available to write out, since the torrent client itself is not synchronous – it writes all the time, but rarely if ever calls sync().

As a proof-of-concept, I used the Transmission client on an Ubuntu 16.04 LTS workstation to download the Ubuntu 18.04.2 Server LTS ISO, with a dataset using recordsize=1M as the target. This workstation has a pool consisting of two mirror vdevs on rust, so high levels of fragmentation would be very easy to spot.

root@locutus:/# zpool export data ; modprobe -r zfs ; modprobe zfs ; zpool import data

root@locutus:/# pv < /data/torrent/ubu*18*iso > /dev/null
 883MB 0:00:03 [ 233MB/s] [==================================>] 100%

Exporting the pool and unloading the ZFS kernel module entirely is a weapons-grade-certain method of emptying the ARC entirely; getting better than 200 MB/sec average read throughput directly from the rust vdevs afterward (the transfer actually peaked at nearly 400 MB/sec!) confirms that our torrented ISO is not fragmented.

Note that preallocation settings in your bittorrent client are meaningless when the client is saving to ZFS – you can’t actually preallocate in any meaningful way on ZFS, because it’s a copy-on-write filesystem.

VLANs with KVM guests on Ubuntu 18.04 / netplan

There is a frustrating lack of information on how to set up multiple VLAN interfaces on a KVM host out there. I made my way through it in production today with great applications of thud and blunder; here’s an example of a working 01-netcfg.yaml with multiple VLANs on a single (real) bridge interface, presenting as multiple bridges.

Everything feeds through properly so that you can bring KVM guests up on br0 for the default VLAN, br100 for VLAN 100, or br200 for VLAN 200. Adapt as necessary for whatever VLANs you happen to be using.

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
      dhcp6: no
    eno2:
      dhcp4: no
      dhcp6: no
  vlans:
    br0.100:
      link: br0
      id: 100
    br0.200:
      link: br0
      id: 200
  bridges:
    br0:
      interfaces:
        - eno1
        - eno2
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.0.2/24 ]
      gateway4: 10.0.0.1
      nameservers:
        addresses: [ 8.8.8.8,1.1.1.1 ]
    br100:
      interfaces:
        - br0.100
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.100.1/24 ]
    br200:
      interfaces:
        - br0.200
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.200.1/24 ]
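
Once that file is in place, netplan apply brings the bridges up, and you can attach guests to whichever bridge matches the VLAN you want them on – for example with virsh, using a made-up guest name here:

root@kvmhost:~# netplan apply
root@kvmhost:~# virsh attach-interface --domain someguest --type bridge --source br100 --model virtio --config --live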

ZFS does NOT favor lower latency devices. Don’t mix rust disks and SSDs!

In an earlier post, I addressed the never-ending urban legend that ZFS writes data to the lowest-latency vdev. Now the urban legend that never dies has reared its head again; this time with someone claiming that ZFS will issue read operations to the lowest-latency disk in a given mirror vdev.

TL;DR – this, too, is a myth. If you need or want an empirical demonstration, read on.

I’ve got an Ubuntu Bionic machine handy with both rust and SSD available; /tmp is an ext4 filesystem on an mdraid1 SSD mirror and /rust is an ext4 filesystem on a single WD 4TB black disk. Let’s play.

root@box:~# truncate -s 4G /tmp/ssd.bin
root@box:~# truncate -s 4G /rust/rust.bin
root@box:~# mkdir /tmp/disks
root@box:~# ln -s /tmp/ssd.bin /tmp/disks/ssd.bin ; ln -s /rust/rust.bin /tmp/disks/rust.bin
root@box:~# zpool create -oashift=12 test /tmp/disks/rust.bin
root@box:~# zfs set compression=off test

Now we've got a pool that is rust only… but we've got an ssd vdev off to the side, ready to attach. Let's run an fio test on our rust-only pool first. Note: since this is read testing, we're going to throw away our first result set; those reads will largely be served from the ARC, and that's not what we're trying to measure here.

root@box:~# cd /test
root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1

OK, cool. Now that fio has generated its dataset, we’ll clear all caches by exporting the pool, then clearing the kernel page cache, then importing the pool again.

root@box:/test# cd ~
root@box:~# zpool export test
root@box:~# echo 3 > /proc/sys/vm/drop_caches
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Now we can get our first real, uncached read from our rust-only pool. It’s not terribly pretty; this is going to take 5 minutes or so.

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[ ... ]
Run status group 0 (all jobs):
  READ: bw=17.6MiB/s (18.5MB/s), 17.6MiB/s-17.6MiB/s (18.5MB/s-18.5MB/s), io=1024MiB (1074MB), run=58029-58029msec

Alright. Now let’s attach our ssd and make this a mirror vdev, with one rust and one SSD disk.

root@box:/test# zpool attach test /tmp/disks/rust.bin /tmp/disks/ssd.bin
root@box:/test# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 1.00G in 0h0m with 0 errors on Sat Jul 14 14:34:07 2018
config:

    NAME                     STATE     READ WRITE CKSUM
    test                     ONLINE       0     0     0
      mirror-0               ONLINE       0     0     0
        /tmp/disks/rust.bin  ONLINE       0     0     0
        /tmp/disks/ssd.bin   ONLINE       0     0     0

errors: No known data errors

Cool. Now that we have one rust and one SSD device in a mirror vdev, let’s export the pool, drop all the kernel page cache, and reimport the pool again.

root@box:/test# cd ~
root@box:~# zpool export test
root@box:~# echo 3 > /proc/sys/vm/drop_caches
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Gravy. Now, do we see massively improved throughput when we run the same fio test? If ZFS favors the SSD, we should see enormously improved results. If ZFS does not favor the SSD, we'll see not-quite-doubled results.

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
   READ: bw=31.1MiB/s (32.6MB/s), 31.1MiB/s-31.1MiB/s (32.6MB/s-32.6MB/s), io=1024MiB (1074MB), run=32977-32977msec

Welp. There you have it. Not-quite-doubled throughput, matching half – but only half – of the read ops coming from the SSD. To confirm, we’ll do this one more time; but this time we’ll detach the rust disk and run fio with nothing in the pool but the SSD.

root@box:/test# cd ~
root@box:~# zpool detach test /tmp/disks/rust.bin
root@box:~# zpool export test
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Moment of truth… this time, fio runs on pure solid state:

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
  READ: bw=153MiB/s (160MB/s), 153MiB/s-153MiB/s (160MB/s-160MB/s), io=1024MiB (1074MB), run=6710-6710msec

Welp, there you have it.

Rust only: reads 18.5 MB/sec
SSD only: reads 160 MB/sec
Rust + SSD: reads 32.6 MB/sec

No, ZFS does not read from the lowest-latency disk in a mirror vdev.

Please don’t perpetuate the myth that ZFS favors lower latency devices.

sample netplan config for ubuntu 18.04

Here’s a sample /etc/netplan config for Ubuntu 18.04. HUGE LIFE PRO TIP: against all expectations of decency, netplan refuses to function if you don’t indent everything exactly the way it likes it and returns incomprehensible wharrgarbl errors like “mapping values are not allowed in this context, line 17, column 15” if you, for example, have a single extra space somewhere in the config.

I wish I was kidding.

Anyway, here’s a sample /etc/netplan/01-config.yaml with a couple interfaces, one wired and static, one wireless and dynamic. Enjoy. And for the love of god, get the spacing exactly right; I really wasn’t kidding about it barfing if you have one too many spaces for a whitespace indent somewhere. Ask me how I know. >=\

If for any reason you have trouble reading this exact spacing, the rule is two spaces for each level of indent. So the v in “version” should line up under the t in “network”, the d in “dhcp4” should line up under the o in “eno1”, and so forth.

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
      dhcp6: no
      addresses: [192.168.0.2/24]
      gateway4: 192.168.0.1
      nameservers:
        addresses: [8.8.8.8, 1.1.1.1]
  wifis:
    wlp58s0:
      dhcp4: yes
      dhcp6: no
      access-points:
        "your-wifi-SSID-name":
          password: "your-wifi-password"

Wait for network to be configured (no limit)

In Ubuntu 16.04 or up (ie, post systemd) if you’re ever stuck staring for two straight minutes at “Waiting for network to be configured (no limit)” and despairing, there’s a simple fix:

systemctl mask systemd-networkd-wait-online.service

This links the service – the one that sits there with its thumb up its butt if you don't have a network connection – to /dev/null, causing it to just return instantly whenever it's called. Which is probably a good idea. There may indeed be a situation in which I want a machine to refuse to boot until it gets an IP address, but whatever that situation MIGHT be, I've never encountered it in 20+ years of professional system administration, so…

ZVOL vs QCOW2 with KVM

When mixing ZFS and KVM, should you put your virtual machine images on ZVOLs, or on .qcow2 files on plain datasets? It’s a topic that pops up a lot, usually with a ton of people weighing in on performance without having actually done any testing.  My old benchmarks are getting a little long in the tooth, so here’s an fio random write run with 4K blocksize, done on both a .qcow2 on a dataset, and a zvol.

Test Configuration

Host:

CPU :  Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz
RAM : 32 GB DDR4 SDRAM
SATA : Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
OS : Ubuntu 16.04.4 LTS, fully updated as of 2018-03-13
FS : ZFS 0.6.5.6-0ubuntu19, from Canonical main repo
Disks : 2x Samsung 850 Pro 1TB SATA3, mirror vdev
ZFS parameters: ashift=13,recordsize=8K,atime=off,compression=lz4

Guest:

CPU : Intel Core Processor (Broadwell), 2 threads
RAM : 512MB
OS : Ubuntu 16.04.4 LTS, fully updated as of 2018-03-13
FS : ext4
Disks: /mnt/zvol on 20G zvol volume, /mnt/qcow2 on 20G .qcow2 file

Synchronous 4K write results

ZVOL, --ioengine=sync:

root@benchmark:/mnt/zvol# fio --name=random-write --ioengine=sync --iodepth=4 \
                              --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                              --end_fsync=1
[...]
Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=50453KB/s, minb=3153KB/s, maxb=3153KB/s, mint=83116msec, maxt=83132msec

QCOW2, --ioengine=sync:

root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=sync --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
[...]
Run status group 0 (all jobs):
WRITE: io=4096.0MB, aggrb=45767KB/s, minb=2860KB/s, maxb=2976KB/s, mint=88058msec, maxt=91643msec

So, 50.5 MB/sec (zvol) vs 45.8 MB/sec (qcow2). Yes, there's a difference, at least on the most punishing I/O workloads. Is it perceptible enough to matter? Probably not, for most use cases, given the benefits in ease of management and maintenance for .qcow2 on datasets. QCOW2 files are easier to provision, you don't have to worry about refreservation keeping you from taking snapshots, they're not significantly more difficult to mount offline (modprobe nbd ; qemu-nbd -c /dev/nbd0 /path/to/image.qcow2 ; mount -o ro /dev/nbd0 /mnt/image or similar); and probably most importantly, filling the underlying storage beneath a qcow2 won't crash the guest.

Tuning QCOW2 for even better performance

I found out yesterday that you can tune the underlying cluster size of the .qcow2 format. Creating a new .qcow2 file tuned to use 8K clusters – matching our 8K recordsize, and the 8K underlying hardware blocksize of the Samsung 850 Pro drives in our vdev – produced tremendously better results. With the tuned qcow2, we more than tripled the performance of the zvol – going from 50.5 MB/sec (zvol) to 170 MB/sec (8K tuned qcow2)!
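
For reference, the cluster size gets baked in when the image is created – roughly like this, with a made-up path and size:

root@benchmark:~# qemu-img create -f qcow2 -o cluster_size=8K /data/images/test-8k.qcow2 20G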

QCOW2 -o cluster_size=8K, --ioengine=sync:

root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=sync --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
[...]
Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=170002KB/s, minb=10625KB/s, maxb=12698KB/s, mint=20643msec, maxt=24672msec

ZVOL won’t pause the guest if storage is unavailable

If you fill the underlying pool with a guest that's using a zvol for its storage, the filesystem in the guest will panic. From the guest's perspective, this is a hardware I/O error, and the guest and/or its apps which use that virtual disk will crash, leaving it in an unknown and possibly corrupt state.

If the guest uses a .qcow2 file on a dataset for storage, the same problem is handled much more safely. When writes become unavailable on host storage, the guest will be automatically paused by libvirt. This gives you a chance to free up space, then virsh resume the guest. The net effect is that the guest and its apps never realize there was ever a problem in the first place. Any pending writes complete automatically and without error once you’ve cleared the host storage problem and resumed the guest.

ZVOL doesn’t honor guest synchronous writes

It may also be worth noting that the guest seems a little less clued in with what’s going on with its storage when using the zvol. I specified --ioengine=sync for these test runs, which should – repeat, should – have made the also-specified parameter end_fsync=1 irrelevant, since all writes were supposed to be synchronous.

On the .qcow2-hosted storage, the data was written verifiably sync, since we can see there’s no pause at the end for end_fsync=1 to finish flushing the data to the metal:

Jobs: 16 (f=16): [w(16)] [66.7% done] [0KB/75346KB/0KB /s]
Jobs: 16 (f=16): [w(16)] [68.0% done] [0KB/0KB/0KB /s]
Jobs: 16 (f=16): [w(16)] [72.0% done] [0KB/263.8MB/0KB /s]
Jobs: 16 (f=16): [w(8),F(1),w(7)] [80.0% done] [0KB/199.1MB/0KB /s] 
Jobs: 15 (f=15): [w(8),_(1),w(7)] [80.8% done] [0KB/53866KB/0KB /s] 
Jobs: 15 (f=15): [w(3),F(1),w(4),_(1),w(3),F(1),w(3)] [84.6% done] 
Jobs: 12 (f=12): [F(1),w(2),_(1),w(4),_(2),w(2),_(1),w(3)] [85.2% done] 
Jobs: 8 (f=8): [_(4),w(4),_(2),w(2),_(1),w(1),_(1),w(1)] [88.9% done]
Jobs: 4 (f=3): [_(4),F(1),_(1),w(1),_(3),F(1),_(4),w(1)] [100.0% done]

random-readwrite: (groupid=0, jobs=1): err= 0: pid=1773: Tue Mar 13 13:57:16 2018

The ZVOL hosted storage, on the other hand, clearly was not honoring ioengine=sync, as it spent a significant amount of time after all data was supposedly already written, waiting for end_fsync=1 to finish:

Jobs: 16 (f=16): [w(16)] [81.0% done] [0KB/527.2MB/0KB /s] 
Jobs: 16 (f=16): [w(10),F(1),w(5)] [94.7% done] [0KB/551.6MB/0KB /s]
Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/155.2MB/0KB /s]
Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]
Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]
Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]

 ------[[[ above line repeats for 60 more lines ]]]------

Jobs: 16 (f=16): [F(16)] [100.0% done] [0KB/0KB/0KB /s] [0/0/0 iops]

random-readwrite: (groupid=0, jobs=1): err= 0: pid=1792: Tue Mar 13 13:57:42 2018

This strikes me as pretty disturbing; you could end up in a world of hurt if you’re expecting your host to honor the guest’s synchronous writes when, in fact, it’s not.

Asynchronous 4K write results

Well, hrm. Realizing now that zvol storage doesn’t actually honor synchronous write requests very well, what if we use the libaio (native Linux asynchronous I/O) engine instead?

ZVOL, --ioengine=libaio:

root@benchmark:/mnt/zvol# fio --name=random-write --ioengine=libaio --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
[...]
Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=139484KB/s, minb=8717KB/s, maxb=8722KB/s, mint=30054msec, maxt=30070msec

QCOW2, --ioengine=libaio:

root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=libaio --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
[...]
Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=164392KB/s, minb=10274KB/s, maxb=11651KB/s, mint=22498msec, maxt=25514msec

And there you have it – qcow2 at 164MB/sec vs zvol at 139 MB/sec. So when using asynchronous I/O, the qcow2-backed virtual disk actually finished the fio run faster than the zvol-backed disk.

What if we tune the .qcow2 for 8K cluster size, like we did above in the synchronous write test?

QCOW2 -o cluster_size=8K, --ioengine=libaio:

root@benchmark:/mnt/qcow2# fio --name=random-write --ioengine=libaio --iodepth=4 \
                               --rw=randwrite --bs=4k --direct=0 --size=256m --numjobs=16 \
                               --end_fsync=1
[...]
Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=181304KB/s, minb=11331KB/s, maxb=13543KB/s, mint=19356msec, maxt=23134msec

The improvements aren’t as drastic here – 181 MB/sec (tuned qcow2) vs 164 MB/sec (default qcow2) vs 139 MB/sec (zvol) – but they’re still a clear improvement, and the qcow2 storage is still faster than the zvol. (If anybody knows similar tuning that can be done to the zvol to improve its numbers, please tweet or DM me @jrssnet.)

Conclusion: .qcow2 FTW

For me, it’s a no-brainer: qcow2 files are only slightly slower on even the most punishing I/O workloads under default, untuned configuration, while being MUCH easier to manage, and arguably safer (won’t crash the guest if the host fills up the storage, honors sync write requests more predictably). And if you take the time to tune the .qcow2 on creation, they actually outperform the zvol. Winner: .qcow2.

A comprehensive guide to fixing slow SSH logins

The debug text that brought you here

Most of you are probably getting here just from frustratedly googling “slow ssh login”.  Those of you who got a little froggier and tried doing an ssh -vv to get lots of debug output saw things hanging at debug1: SSH2_MSG_SERVICE_ACCEPT received, likely for long enough that you assumed the entire process was hung and ctrl-C’d out.  If you’re patient enough, the process will generally eventually continue after the debug1: SSH2_MSG_SERVICE_ACCEPT received line, but it may take 30 seconds.  Or even five minutes.

You might also have enabled debug logging on the server, and discovered that your hang occurs immediately after debug1: KEX done [preauth] and before debug1: userauth-request for user in /var/log/auth.log.

I feel your frustration, dear reader. I have solved this problem after hours of screeching head-desking probably ten times over the years.  There are a few fixes for this, with the most common – DNS – tending to drown out the rest.  Which is why I keep screeching in frustration every few years; I remember the dreaded debug1: SSH2_MSG_SERVICE_ACCEPT received hang is something I’ve solved before, but I can only remember some of the fixes I’ve needed.

Anyway, here are all the fixes I’ve needed to deploy over the years, collected in one convenient place where I can find them again.

It’s usually DNS.

The most common cause of slow SSH login authentications is DNS. To fix this one, go to the SSH server, edit /etc/ssh/sshd_config, and set UseDNS no.  You’ll need to restart the service after changing sshd_config: /etc/init.d/ssh restart, systemctl restart ssh, etc as appropriate.

If it’s not DNS, it’s Avahi.

The next most common cause – which is devilishly difficult to find reference to online, and I hope this helps – is the never-to-be-sufficiently damned avahi daemon.  To fix this one, go to the SSH client, edit /etc/nsswitch.conf, and change this line:

hosts:          files mdns4_minimal [NOTFOUND=return] dns

to:

hosts:          files dns

In theory maybe something might stop working without that mdns4_minimal option?  But I haven’t got the foggiest notion what that might be, because nothing ever seems broken for me after disabling it.  No services need restarting after making this change, which again, must be made on the client.

You might think this isn’t your problem. Maybe your slow logins only happen when SSHing to one particular server, even one particular server on your local network, even one particular server on your local network which has UseDNS no and which you don’t need any DNS resolution to connect to in the first place.  But yeah, it can still be this avahi crap. Yay.

When it’s not Avahi… it’s PAM.

This is another one that’s really, really difficult to find references to online.  Optional PAM modules can really screw you here.  In my experience, you can’t get away with simply disabling PAM login in /etc/ssh/sshd_config – if you do, you won’t be able to log in at all.

What you need to do is go to the SSH server, edit /etc/pam.d/common-session and comment out the optional module that’s causing you grief.  In the past, that was pam_ck_connector.so.  More recently, in Ubuntu 16.04, the culprit that bit me hard was pam_systemd.so. Get in there and comment that bugger out.  No restarts required.

#session optional pam_systemd.so

GSSAPI, and ChallengeResponse.

I've seen a few seconds added to a pokey login from GSSAPIAuthentication, whatever that is. I feel slightly embarrassed about not knowing, but seriously, I have no clue.  Ditto for ChallengeResponseAuthentication.  All I can tell you is that neither covers standard interactive passwords or standard public/private keypair authentication (the keys you keep in ~/.ssh/authorized_keys).

If you aren’t using them either, then disable them.  If you’re not using Active Directory authentication, might as well go ahead and nuke Kerberos while you’re at it.  Make these changes on the server in /etc/ssh/sshd_config, and restart the service.

ChallengeResponseAuthentication no
KerberosAuthentication no
GSSAPIAuthentication no

Host-based Authentication.

If you’re actually using this, don’t disable it. But let’s get real with each other: you’re not using it.  I mean, I’m sure somebody out there is.  But it’s almost certainly not you.  Get rid of it.  This is also on the server in /etc/ssh/sshd_config, and also will require a service restart.

# Don't read the user's ~/.rhosts and ~/.shosts files
IgnoreRhosts yes
# For this to work you will also need host keys in /etc/ssh_known_hosts
RhostsRSAAuthentication no
# similar for protocol version 2
HostbasedAuthentication no

Need frequent connections? Consider control sockets.

If you need to do repetitive ssh’ing into another host, you can speed up the repeated ssh commands enormously with the use of control sockets.  The first time you SSH into the host, you establish a socket.  After that, the socket obviates the need for re-authentication.

ssh -M -S /path/to/socket -o ControlPersist=5m remotehost exit

This creates a new SSH socket at /path/to/socket which will survive for the next 5 minutes, after which it’ll automatically expire.  The exit at the end just causes the ssh connection you establish to immediately terminate (otherwise you’d have a perfectly normal ssh session going to remotehost).

After creating the control socket, you utilize it with -S /path/to/socket in any normal ssh command to remotehost.

The improvement in execution time for commands using that socket is delicious.

me@banshee:/~$ ssh -M -S /tmp/demo -o ControlPersist=5m eohippus exit

me@banshee:/~$ time ssh eohippus exit
real 0m0.660s
user 0m0.052s
sys 0m0.016s

me@banshee:/~$ time ssh -S /tmp/demo eohippus exit
real 0m0.050s
user 0m0.005s
sys 0m0.010s

Yep… from 660ms to 50ms.  SSH control sockets are pretty awesome.
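
If you'd rather not juggle -M and -S by hand, you can also make control sockets automatic per-host in ~/.ssh/config on the client – something like the following, where the host name is just an example and you'll need to create the sockets directory yourself first:

Host eohippus
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h:%p
    ControlPersist 5m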

ZFS clones: Probably not what you really want

ZFS clones look great on paper: they’re instantaneously generated, they’re read/write, they’re initially “free” because they reference the same blocks their parent snapshots do. They’re also (initially) frequently extra-snappy performance-wise, because a lot of those parent blocks are very likely already in the ARC. If you create ten clones of the same VM image (for instance), all ten clones will share the same blocks in the ARC instead of them needing to be in the ARC ten different times. Huge win!

But, as great as a clone sounds at first blush, you probably don’t want to use them for anything that isn’t ephemeral (intended to be destroyed in fairly short order). This is because a clone’s parent snapshot is forever immutable; you can’t destroy the parent snapshot without destroying the clone along with it… even if and when the clone becomes 100% divergent, and no longer shares any block references with its parent. Let’s examine this on a small scale.

Practical testing

On my workstation banshee, I create a new dataset, make sure compression is turned off so as not to confuse us, and populate it with a 256MB chunk of random binary stuff:

root@banshee:~# zfs create banshee/demo ; zfs set compression=off banshee/demo
root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.483868 s, 555 MB/s
 256MiB 0:00:00 [ 533MiB/s] [<=>                                               ]

I know this looks a little weird, but AES-256 is roughly an order of magnitude faster than /dev/urandom: so what I did here was use /dev/urandom to seed AES-256, then encrypt a 256MB chunk of /dev/zero with it. At the end of this procedure, we have a dataset with 256MB of data in it:

root@banshee:~# ls -lh /banshee/demo
total 262M
-rw-r--r-- 1 root root 256M Mar 15 14:39 random.bin
root@banshee:~# zfs list banshee/demo
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   262M  83.3G   262M  /banshee/demo

OK. Next step, we take a snapshot of banshee/demo, then create a clone using that snapshot as its parent.

Creating a clone

You don’t actually create a ZFS clone of a dataset at all; you create a clone from a snapshot of a dataset. So before we can “clone banshee/demo”, we first have to take a snapshot of it, and then we clone that.

root@banshee:~# zfs snapshot banshee/demo@parent-snapshot
root@banshee:~# zfs clone banshee/demo@parent-snapshot banshee/demo-clone
root@banshee:~# zfs list -rt all banshee/demo
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   262M  83.3G   262M  /banshee/demo
banshee/demo@parent-snapshot      0      -   262M  -
root@banshee:~# zfs list -rt all banshee/demo-clone
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.3G   262M  /banshee/demo-clone

So right now, we have the dataset banshee/demo, which shares all its blocks with banshee/demo@parent-snapshot, which in turn shares all its blocks with banshee/demo-clone. We see 262M in USED for banshee/demo, with nothing or next-to-nothing in USED for either banshee/demo@parent-snapshot or banshee/demo-clone.

Beginning divergence: removing data

Now, we remove all the data from banshee/demo:

root@banshee:~# rm /banshee/demo/random.bin
root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   262M  83.3G    19K  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.3G   262M  /banshee/demo-clone

We still only have 262M of USED – but it’s all actually in banshee/demo@parent-snapshot now. You can tell because the REFER column has changed – banshee/demo@parent-snapshot and banshee/demo-clone still both REFER 262M, but banshee/demo only REFERs 19K now. (You still see 262M in USED for banshee/demo because banshee/demo@parent-snapshot is a child of banshee/demo, so its contents count towards banshee/demo‘s USED figure.)

Next up: we re-fill the parent dataset, banshee/demo, with 256MB of different random garbage.

Continuing divergence: replacing data in the parent

root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.498349 s, 539 MB/s
 256MiB 0:00:00 [ 516MiB/s] [<=>                                               ]
root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   523M  83.2G   262M  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.2G   262M  /banshee/demo-clone

OK, at this point you see that the USED for banshee/demo shoots up to 523M: that’s the total of the 262M of original random garbage which is still preserved in banshee/demo@parent-snapshot, plus the new 262M of different random garbage in banshee/demo itself. The snapshot now diverges completely from the parent dataset, having no blocks in common at all.

So far, banshee/demo-clone is still 100% convergent with banshee/demo@parent-snapshot, so we’re still getting some conservation of space on disk and in ARC from that. But remember, the whole point of making the clone was so that we could write to it as well as read from it. So let’s do exactly that, and make the clone 100% divergent from its parent, too.

Diverging completely: replacing data in the clone

root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo-clone/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.50151 s, 535 MB/s
 256MiB 0:00:00 [ 534MiB/s] [<=>                                               ]
root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   523M  82.8G   262M  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone  262M  82.8G  262M  /banshee/demo-clone

There, done. We now have a parent dataset, banshee/demo, which diverges completely from its snapshot banshee/demo@parent-snapshot, and a clone, banshee/demo-clone, which also diverges completely from banshee/demo@parent-snapshot.

Examining the suck

Since neither the parent, its snapshot, nor the clone share any blocks with one another anymore, we’re using the full 786MB of on-disk space that the three of them add up to. And since they also don’t share any blocks in the ARC, we’re left with absolutely no benefit in either storage consumption or performance to our having used a clone.

Worse, despite having no blocks in common and no perceptible benefit to the clone structure, all three are still inextricably linked, and neither banshee/demo nor banshee/demo@parent-snapshot can be destroyed without also destroying banshee/demo-clone:

root@banshee:~# zfs destroy banshee/demo -r
cannot destroy 'banshee/demo': filesystem has dependent clones
use '-R' to destroy the following datasets:
banshee/demo-clone
root@banshee:~# zfs destroy banshee/demo@parent-snapshot
cannot destroy 'banshee/demo@parent-snapshot': snapshot has dependent clones
use '-R' to destroy the following datasets:
banshee/demo-clone

So now you’re left with a great unwieldy mass of tangled dependencies, wasted space, and no perceptible benefits at all.
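
If you do end up in this situation, the escape route is to copy the clone's current contents into a fully independent dataset – for instance with a local zfs send piped into zfs receive – and then destroy the tangled originals. A rough sketch using this post's demo datasets (the snapshot and target names are arbitrary, and you'll temporarily consume the space for the extra copy):

root@banshee:~# zfs snapshot banshee/demo-clone@flatten
root@banshee:~# zfs send banshee/demo-clone@flatten | zfs receive banshee/demo-standalone
root@banshee:~# zfs destroy -r banshee/demo-clone
root@banshee:~# zfs destroy -r banshee/demo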

Conclusion and practical example

Imagine that you’re storing VM images in ZFS, and you began with a “gold” image of a freshly installed operating system, and created ten different clones to run ten different VMs from. Initially, this seemed great: you could create the clones instantaneously, and they shared tons of blocks, so they consumed a fraction of the ARC they would as complete, separate copies.

A year later, however, your gold image – of, let’s say, Ubuntu 16.04.1 – has diverged to a staggering degree with the set of rolling updates necessary to bring it all the way to Ubuntu 16.04.2. Your VMs have also diverged tremendously, from their parent snapshot and from one another. And now you’re stuck with the year-old snapshot of the “gold” image, completely useless to you but forever engraved on your drive unless and until you’re willing to replicate or otherwise block-for-block copy your VMs painstakingly into self-sufficient datasets with no references. You also have no remaining performance benefits, and you have an extra SPOF (single point of failure) where some admin – maybe even you – might see that parent snapshot nobody cared about anymore taking up all that disk space, and…

root@banshee:~# zfs destroy -R banshee/demo@parent-snapshot
root@banshee:~# zfs list banshee/demo-clone
cannot open 'banshee/demo-clone': dataset does not exist

One “oops” later, that “useless” parent snapshot and every single one of those clones you were using in production are gone forever. (Or, hopefully, just gone until you can restore them from your off-pool backup. You are maintaining replicated backups on at least one other pool, preferably on another machine, aren’t you? Aren’t you?!)

ZFS snapshots and cold storage

OK, this isn’t really a practice I personally have any use for. I much, much prefer replicating snapshots to live systems using syncoid as a wrapper for zfs send and receive. But if you want to use cold storage, or just want to understand conceptually the way snapshots and streams work, read on.

Our working dataset

Let’s say you have a ZFS pool and dataset named, creatively enough, poolname/setname, and let’s say you’ve taken snapshots over a period of time, named @0 through @9.

root@banshee:~# zfs list -rt all poolname/setname
NAME                USED  AVAIL  REFER  MOUNTPOINT
poolname/setname    1004M  21.6G    19K  /poolname/setname
poolname/setname@0   100M      -   100M  -
poolname/setname@1   100M      -   100M  -
poolname/setname@2   100M      -   100M  -
poolname/setname@3   100M      -   100M  -
poolname/setname@4   100M      -   100M  -
poolname/setname@5   100M      -   100M  -
poolname/setname@6   100M      -   100M  -
poolname/setname@7   100M      -   100M  -
poolname/setname@8   100M      -   100M  -
poolname/setname@9   100M      -   100M  -

By looking at the USED and REFER columns, we can see that each snapshot has 100M of data in it, unique to that snapshot, which in total adds up to 1004M of data.

Sending a full backup to cold storage

Now if we want to use zfs send to move stuff to cold storage, we have to start out with a full backup of the oldest snapshot, @0:

root@banshee:~# zfs send poolname/setname@0 | pv > /coldstorage/setname@0.zfsfull
 108MB 0:00:00 [ 297MB/s] [<=>                                                 ]
root@banshee:~# ls -lh /coldstorage
total 1.8M
-rw-r--r-- 1 root root 108M Sep 15 16:26 setname@0.zfsfull

Sending incremental backups to cold storage

Let’s go ahead and take some incrementals of setname now:

root@banshee:~# zfs send -i poolname/setname@0 poolname/setname@1 | pv > /coldstorage/setname@0-@1.zfsinc
 108MB 0:00:00 [ 316MB/s] [<=>                                                 ]
root@banshee:~# zfs send -i poolname/setname@1 poolname/setname@2 | pv > /coldstorage/setname@1-@2.zfsinc
 108MB 0:00:00 [ 299MB/s] [<=>                                                 ]
root@banshee:~# ls -lh /coldstorage
total 5.2M
-rw-r--r-- 1 root root 108M Sep 15 16:33 setname@0-@1.zfsinc
-rw-r--r-- 1 root root 108M Sep 15 16:26 setname@0.zfsfull
-rw-r--r-- 1 root root 108M Sep 15 16:33 setname@1-@2.zfsinc

OK, so now we have one full and two incrementals. Before we do anything that works, let’s look at some of the things we can’t do, because these are really important to know about.

Without a full, incrementals are useless

First of all, we can’t restore an incremental without a full:

root@banshee:~# pv < /coldstorage/setname@0-@1.zfsinc | zfs receive poolname/restore
cannot receive incremental stream: destination 'poolname/restore' does not exist
  64kB 0:00:00 [94.8MB/s] [>                                   ]  0%            

Good thing we’ve got that full backup of @0, eh?

Restoring a full backup from cold storage

root@banshee:~# pv < /coldstorage/setname@0.zfsfull | zfs receive poolname/restore
 108MB 0:00:00 [ 496MB/s] [==================================>] 100%            
root@banshee:~# zfs list -rt all poolname/restore
NAME                USED  AVAIL  REFER  MOUNTPOINT
poolname/restore     100M  21.5G   100M  /poolname/restore
poolname/restore@0      0      -   100M  -

Simple – we restored our full backup of poolname to a new dataset named restore, which is in the condition poolname was in when @0 was taken, and contains @0 as a snapshot. Now, keeping in tune with our “things we cannot do” theme, can we skip straight from this full of @0 to our incremental from @1-@2?

Without ONE incremental, following incrementals are useless

root@banshee:~# pv < /coldstorage/setname@1-@2.zfsinc | zfs receive poolname/restore
cannot receive incremental stream: most recent snapshot of poolname/restore does not
match incremental source
  64kB 0:00:00 [49.2MB/s] [>                                   ]  0%

No, we cannot restore a later incremental directly onto our full of @0. We must use an incremental which starts with @0, which is the only snapshot we actually have present in full. Then we can restore the incremental from @1 to @2 after that.

Restoring incrementals in order

root@banshee:~# pv < /coldstorage/setname@0-@1.zfsinc | zfs receive poolname/restore
 108MB 0:00:00 [ 452MB/s] [==================================>] 100%            
root@banshee:~# pv < /coldstorage/setname@1-@2.zfsinc | zfs receive poolname/restore
 108MB 0:00:00 [ 448MB/s] [==================================>] 100%
root@banshee:~# zfs list -rt all poolname/restore
NAME                                                 USED  AVAIL  REFER  MOUNTPOINT
poolname/restore                                      301M  21.3G   100M  /poolname/restore
poolname/restore@0                                    100M      -   100M  -
poolname/restore@1                                    100M      -   100M  -
poolname/restore@2                                       0      -   100M  -

There we go – first do a full restore of @0 onto a new dataset, then do incremental restores of @0-@1 and @1-@2, in order, and once we’re done we’ve got a restore of setname in the condition it was in when @2 was taken, and containing @0, @1, and @2 as snapshots within it.

Incrementals that skip snapshots

Let’s try one more thing – what if we take, and restore, an incremental directly from @2 to @9?

root@banshee:~# zfs send -i poolname/setname@2 poolname/setname@9 | pv > /coldstorage/setname@2-@9.zfsinc
 108MB 0:00:00 [ 324MB/s] [<=>                                                 ]
root@banshee:~# pv < /coldstorage/setname@2-@9.zfsinc | zfs receive poolname/restore
 108MB 0:00:00 [ 464MB/s] [==================================>] 100%            
root@banshee:~# zfs list -rt all poolname/restore
NAME                USED  AVAIL  REFER  MOUNTPOINT
poolname/restore     402M  21.2G   100M  /poolname/restore
poolname/restore@0   100M      -   100M  -
poolname/restore@1   100M      -   100M  -
poolname/restore@2   100M      -   100M  -
poolname/restore@9      0      -   100M  -

Note that when we restored an incremental of @2-@9, we did restore our dataset to the condition it was in when @9 was taken, but we did not restore the actual snapshots between @2 and @9. If we’d wanted to save those in a single operation, we should’ve used an incremental snapshot stream.

Sending incremental streams to cold storage

To save an incremental stream, we use -I instead of -i when we invoke zfs send. Let’s send an incremental stream from @0 through @9 to cold storage:

root@banshee:~# zfs send -I poolname/setname@0 poolname/setname@9 | pv > /coldstorage/setname@0-@9.zfsincstream
 969MB 0:00:03 [ 294MB/s] [   <=>                                              ]
root@banshee:~# ls -lh /coldstorage
total 21M
-rw-r--r-- 1 root root 108M Sep 15 16:33 setname@0-@1.zfsinc
-rw-r--r-- 1 root root 969M Sep 15 16:48 setname@0-@9.zfsincstream
-rw-r--r-- 1 root root 108M Sep 15 16:26 setname@0.zfsfull
-rw-r--r-- 1 root root 108M Sep 15 16:33 setname@1-@2.zfsinc

Incremental streams can’t be used without a full, either

Again, what can’t we do? We can’t use our incremental stream by itself, either: we still need that full of @0 to do anything with the stream of @0-@9.
Let’s demonstrate that by starting over from scratch.

root@banshee:~# zfs destroy -R poolname/restore
root@banshee:~# pv < /coldstorage/setname@0-@9.zfsincstream | zfs receive poolname/restore
cannot receive incremental stream: destination 'poolname/restore' does not exist
 256kB 0:00:00 [ 464MB/s] [>                                   ]  0%

Restoring from cold storage using snapshot streams

All right, let’s restore from our full first, and then see what the incremental stream can do:

root@banshee:~# pv < /coldstorage/setname@0.zfsfull | zfs receive poolname/restore
 108MB 0:00:00 [ 418MB/s] [==================================>] 100%            
root@banshee:~# pv < /coldstorage/setname@0-@9.zfsincstream | zfs receive poolname/restore
 969MB 0:00:06 [ 155MB/s] [==================================>] 100%            
root@banshee:~# zfs list -rt all poolname/restore
NAME                USED  AVAIL  REFER  MOUNTPOINT
poolname/restore    1004M  20.6G   100M  /poolname/restore
poolname/restore@0   100M      -   100M  -
poolname/restore@1   100M      -   100M  -
poolname/restore@2   100M      -   100M  -
poolname/restore@3   100M      -   100M  -
poolname/restore@4   100M      -   100M  -
poolname/restore@5   100M      -   100M  -
poolname/restore@6   100M      -   100M  -
poolname/restore@7   100M      -   100M  -
poolname/restore@8   100M      -   100M  -
poolname/restore@9      0      -   100M  -

There you have it – our incremental stream is a single file containing incrementals for all snapshots transitioning from @0 to @9, so after restoring it on top of a full from @0, we have a restore of setname which is in the condition setname was in when @9 was taken, and contains all the actual snapshots from @0 through @9.

Conclusions about cold storage backups

I still don’t particularly recommend doing backups this way due to the ultimate fragility of requiring lots of independent bits and bobs. If you delete or otherwise lose the full of @0, ALL incrementals you take since then are useless – which means eventually you’re going to want to take a new full, maybe of @100, and start over again. The problem is, as soon as you do that, you’re going to take up double the space on cold storage you were using in production, since you have all the data from @0-@99 in incrementals, AND all of the surviving data from those snapshots in @100, along with any new data in @100!

You can delete the full of @0 after the full of @100 finishes, and as soon as you do, all those incrementals of @1 to @99 become completely useless and need to be deleted as well. If you screw up somehow and think you have a good full of @100 when you really don’t, and delete @0… all of your backups become useless, and you’re completely dead in the water. So be careful.

By contrast, if you use syncoid to replicate directly to a second server or pool, you never need more space than you took up in production, you no longer have the nail-biting period when you delete old fulls, and you no longer have to worry about going agonizingly through ten or a hundred separate incremental restores to get back where you were – you can delete snapshots from the back or the middle of a working dataset on another pool without any concern that it’ll break snapshots that come later than the one you’re deleting, and you can restore them all directly in a single operation by replicating in the opposite direction.
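
For the curious, that replication is a one-liner per dataset – something like the following, with made-up host and pool names:

root@banshee:~# syncoid poolname/setname root@backupbox:backuppool/setname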

But, if you’re still determined to send snapshots to cold storage – local, Glacier, or whatever – at least now you know how.