Ubuntu 18.04 hung at update-grub 66%

I’ve encountered this two or three times now, and it’s always a slog figuring out how to fix it. When doing a fresh install of Ubuntu 18.04 to a new system, it hangs forever (never times out, no matter how long you wait) at 66% running update-grub.

The problem is a bug in os-prober. The fix is to ctrl-alt-F2 into a new BusyBox shell, ps and grep for the offending process, and kill it:

BusyBox v1.27.2 (Ubuntu 1:1.27.2-2ubuntu3.1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

# ps wwaux | grep dmsetup | grep -v grep
6114   root   29466 S    dmsetup create -r osprober-linux-sdc9

# kill 6114
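If you'd rather not copy the PID by hand, the lookup can be sketched as a small helper. This is only a sketch under assumptions: the function name osprober_pids is made up, it assumes the installer's BusyBox sed handles basic regular expressions, and it assumes the PID is the first column, as in the ps output above.

```shell
# Hypothetical helper: read ps output on stdin and print the PID (first
# column) of every "dmsetup create ... osprober" process.
# The [d] bracket keeps this sed command from matching its own ps entry.
osprober_pids() {
    sed -n 's/^ *\([0-9][0-9]*\) .*[d]msetup create.*osprober.*/\1/p'
}

# Example using the line from the session above.
echo '6114   root   29466 S    dmsetup create -r osprober-linux-sdc9' | osprober_pids   # prints 6114
```

In the installer shell you would feed it real output with ps wwaux | osprober_pids, look at the result, and only then pass those PIDs to kill.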

Now ctrl-alt-F1 back into your installer session. After a moment, it’ll kick back into high gear and finish your Ubuntu 18.04 installation… but you’re unfortunately not done yet; killing os-prober got the install to complete, but it didn’t get GRUB to actually install onto your disks.

You can get a shell and chroot into your new install environment right now, but if you’re not intimately familiar with that process, it may be easier to just reboot using the same Ubuntu install media, but this time select “Rescue broken system”. Once you’ve made your way through selecting your keyboard layout and given your system a bogus name (it only persists for this rescue environment; it doesn’t change on-disk configuration) you’ll be asked to pick an environment to boot into, with a list of disks and partitions.

If you installed root to a simple partition, pick that partition. If, like me, you installed to an mdraid array, you should see that array listed as “md127”, which is Ubuntu’s default name for an array it knows is there but otherwise doesn’t know much about. Choose that, and you’ll get a shell with everything already conveniently mounted and chrooted for you.

(If you didn’t have the option to get into the environment the simple way, you can still do it from a standard installer environment: find your root partition or array and mount it to /mnt with mount /dev/md127 /mnt; bind-mount /dev, /proc, and /sys into it (for example, mount -o bind /dev /mnt/dev) so that GRUB can see your disks; then chroot into it with chroot /mnt and you’ll be caught up and ready to proceed.)
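The manual route can be sketched as a dry-run helper that only prints the commands, so you can read them over before typing anything as root. The function name chroot_plan is hypothetical, and the sketch assumes a single root filesystem with no separate /boot to mount.

```shell
# Hypothetical dry-run helper: print (do not run) the commands for a manual
# chroot into an installed system. $1 = root device, $2 = mount point.
chroot_plan() {
    root_dev=$1
    target=${2:-/mnt}
    echo "mount $root_dev $target"
    # update-grub and grub-install need the live /dev, /proc and /sys.
    for fs in dev proc sys; do
        echo "mount -o bind /$fs $target/$fs"
    done
    echo "chroot $target"
}

chroot_plan /dev/md127
```

Swap /dev/md127 for your own root partition or array; nothing is mounted until you run the printed commands yourself.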

The last part is easy. First, we need to get the buggy os-prober module out of the execution path.

root@ubuntu:~# cd /etc/grub.d
root@ubuntu:/etc/grub.d# mkdir nerfed
root@ubuntu:/etc/grub.d# mv 30_os-prober nerfed/

OK, that got rid of our problem module that locked up on us during the install. Now we’re ready to run update-grub and grub-install. I’m assuming here that you have two disks which should be bootable, /dev/sda and /dev/sdb; if that doesn’t match your situation, adjust accordingly. (If you’re using an mdraid array, run mdadm --detail /dev/md127 to tell you for sure which disks to make bootable.)

root@ubuntu:~# update-grub
root@ubuntu:~# grub-install /dev/sda
root@ubuntu:~# grub-install /dev/sdb
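For an mdraid root, the "which disks?" question can be answered with a rough filter over the mdadm output. This is a sketch, not a tested recipe: md_member_disks is a made-up name, and it assumes SATA/SCSI-style member names like /dev/sda1 at the end of each device line; NVMe names would need a different pattern.

```shell
# Hypothetical helper: read `mdadm --detail` output on stdin and print the
# whole-disk device for each member partition (e.g. /dev/sda1 -> /dev/sda).
md_member_disks() {
    sed -n 's|.* \(/dev/sd[a-z][a-z]*\)[0-9]*$|\1|p' | sort -u
}

# Usage inside the chroot, once you have checked the list by eye:
#   for disk in $(mdadm --detail /dev/md127 | md_member_disks); do
#       grub-install "$disk"
#   done
```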

That’s it; now you can shut down the system, pull the USB installer, and boot from the actual disks!

I’m stuck at update-grub, but it times out and errors!

If your update-grub process hangs for quite a while (couple full minutes?) at 50% but then falls to an angry error screen with a red background, you’ve got a different problem. If you’re trying to install with an mdraid root directory on a disk 4TiB or larger, you need to do a UEFI-style install – which requires EFI boot partitions available on each of your bootable disks.

You’re going to need to start the install process over again; this time when you partition your disks, make sure to create a small partition of type “EFI System Partition”. This is not the same partition you’ll use for your actual root; it’s also not the same thing as /boot – it’s a special snowflake all to itself, and it’s mandatory for systems booting from a drive or drives 4 TiB or larger. (You can still boot in BIOS mode, with no boot partition, from 2 TiB or smaller drives. Not sure about 3 TiB drives; I’ve never owned one IIRC.)

Installing WordPress on Apache the modern way

It’s been bugging me for a while that there are no correct guides to be found about using modern Apache 2.4 or above with the Event or Worker MPMs. We’re going to go ahead and correct that lapse today, by walking through a brand-new WordPress install on a new Ubuntu 18.04 VM (grab one for $5/mo at Linode, Digital Ocean, or your favorite host).

Installing system packages

Once you’ve set up the VM itself, you’ll first need to update the package list:

root@VM:~# apt update

Once it’s updated, you’ll need to install Apache itself, along with PHP and the various extras needed for a WordPress installation.

root@VM:~# apt install apache2 mysql-server php-fpm php-common php-mbstring php-xmlrpc php-soap php-gd php-xml php-intl php-mysql php-cli php-ldap php-zip php-curl

The key bits here are Apache2, your HTTP server; MySQL, your database server; and php-fpm, which is a pool of PHP worker processes your HTTP server can connect to in order to build WordPress dynamic content as necessary.

What you absolutely, positively do not want to do here is install mod_php. If you do that, your nice modern Apache2 with its nice modern Event process model gets immediately switched back to your granddaddy’s late-90s-style prefork, loading PHP processors into every single child process, and preventing your site from scaling if you get any significant traffic!

Enable the proxy_fcgi module

Instead – and this is the bit none of the guides I’ve found mention – you just need to enable one module in Apache itself, and enable the already-installed PHP configuration module. (You will need to figure out which version of php-fpm is installed: dpkg --get-selections | grep fpm can help here if you aren’t sure.)

root@VM:~# a2enmod proxy_fcgi
root@VM:~# a2enconf php7.4-fpm.conf
root@VM:~# systemctl restart apache2
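Since the conf name depends on the PHP version, here is one way the lookup might be scripted. It is a sketch that assumes the php-fpm package dropped its Apache snippet into /etc/apache2/conf-available under a name like php7.2-fpm.conf; fpm_conf_names is a made-up helper name.

```shell
# Hypothetical helper: read file names on stdin and print the conf name
# (without the .conf suffix) of every phpX.Y-fpm snippet found.
fpm_conf_names() {
    sed -n 's/^\(php[0-9][0-9.]*-fpm\)\.conf$/\1/p'
}

# Usage on the VM:
#   ls /etc/apache2/conf-available | fpm_conf_names
#   a2enconf "$(ls /etc/apache2/conf-available | fpm_conf_names | tail -n 1)"
```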

Your Apache2 server is now ready to serve PHP applications, like WordPress. (Note for more advanced admins: if you’re tuning for larger scale, don’t forget that it’s not only about the web server connections anymore; you also want to keep an eye on how many PHP worker processes you have in your pool. You’ll do that in /etc/php/[version]/fpm/pool.d/www.conf.)

Download and extract WordPress

We’re going to keep things super simple in this guide, and just serve WordPress from the existing default vhost in its standard location, at /var/www/html.

root@VM:~# cd /var/www
root@VM:/var/www# wget https://wordpress.org/latest.tar.gz
root@VM:/var/www# tar zxvf latest.tar.gz
root@VM:/var/www# chown -R www-data:www-data wordpress
root@VM:/var/www# mv html html.dist
root@VM:/var/www# mv wordpress html

Create a database for WordPress

The last step before you can browse to your new WordPress installation is creating the database itself.

root@VM:/var/www# mysql -u root

mysql> create database wordpress;
Query OK, 1 row affected (0.01 sec)

mysql> grant all on wordpress.* to 'wordpress'@'localhost' identified by 'superduperpassword';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> exit;

This created a database named wordpress, with a user named wordpress, and a password superduperpassword. That’s a bad password. Don’t actually use that password. (Also, if mysql -u root wanted a password, and you don’t have it – cat /etc/mysql/debian.cnf, look for the debian-sys-maint password, and connect to mysql using mysql -u debian-sys-maint instead. Everything else will work fine.)

note for ubuntu 20.04 / mysql 8.0 users:

MySQL changed things a bit with 8.0. grant all on db.* to 'user'@'localhost' identified by 'password'; no longer works all in one step. Instead, you’ll first need to create user 'user'@'localhost' identified by 'password'; then you can grant all on db.* to 'user'@'localhost'; (you no longer need to, or can, specify the password on the actual grant line itself).
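The two-step version can be sketched as a small generator that prints the statements for review. The helper name wp_db_sql and the placeholder password are made up, and the sketch does no escaping at all.

```shell
# Hypothetical helper: print the MySQL 8.0 statements for a new database
# and a matching local user. $1 = database, $2 = user, $3 = password.
# No escaping is done, so keep quotes and backslashes out of the arguments.
wp_db_sql() {
    cat <<EOF
CREATE DATABASE $1;
CREATE USER '$2'@'localhost' IDENTIFIED BY '$3';
GRANT ALL ON $1.* TO '$2'@'localhost';
EOF
}

# Review the statements first; pipe them into `mysql -u root` when satisfied.
wp_db_sql wordpress wordpress 'use-a-real-password-here'
```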

All done – browser time!

Now that you’ve set up Apache, dropped the WordPress installer in its default directory, and created a mysql database – you’re ready to run through the WordPress setup itself, by browsing directly to http://your.servers.ip.address/. Have fun!

About ZFS recordsize

ZFS stores data in records, which are themselves composed of blocks. The block size is set by the ashift value at time of vdev creation, and is immutable. The recordsize, on the other hand, is individual to each dataset (although it can be inherited from parent datasets), and can be changed at any time you like. In 2019, recordsize defaults to 128K if not explicitly set.

big files? big recordsize.

The general rule of recordsize is that it should closely match the typical workload experienced within that dataset. For example, a dataset used to store high-quality JPGs, averaging 5MB or more, should have recordsize=1M. This matches the typical I/O seen in that dataset – either reading or writing a full 5+ MB JPG, with no random access within each file – quite well; setting that larger recordsize prevents the files from becoming unduly fragmented, ensuring the fewest IOPS are consumed during either read or write of the data within that dataset.

DB binaries? Smaller recordsize.

By contrast, a dataset which directly contains a MySQL InnoDB database should have recordsize=16K. That’s because InnoDB defaults to a 16KB page size, so most operations on an InnoDB database will be done in individual 16K chunks of data. Matching recordsize to MySQL’s page size here means we maximize the available IOPS, while minimizing latency on the highly sync()hronous reads and writes made by the database (since we don’t need to read or write extraneous data while handling our MySQL pages).

VMs? Match the recordsize to the VM storage format.

(That’s cluster_size, for QEMU/KVM .qcow2.)

On the other hand, if you’ve got a MySQL InnoDB database stored within a VM, your optimal recordsize won’t necessarily be either of the above – for example, KVM .qcow2 files default to a cluster_size of 64KB. If you’ve set up a VM on .qcow2 with default cluster_size, you don’t want to set recordsize any lower (or higher!) than the cluster_size of the .qcow2 file. So in this case, you’ll want recordsize=64K to match the .qcow2’s cluster_size=64K, even though the InnoDB database inside the VM is probably using smaller pages.

An advanced administrator might look at all of this, determine that a VM’s primary function in life is to run MySQL, that MySQL’s default page size is good, and therefore set both the .qcow2 cluster_size and the dataset’s recordsize to match, at 16K each.

A different administrator might look at all this, determine that the performance of MySQL in the VM with all the relevant settings left to their defaults was perfectly fine, and elect not to hand-tune all this crap at all. And that’s okay.
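The rules of thumb so far can be condensed into a small lookup that prints the matching zfs set command. It's a sketch only: recordsize_cmd, the workload labels, and the dataset names are all made up for illustration.

```shell
# Hypothetical helper: print the `zfs set` command for a workload type,
# using the rule-of-thumb values from this post. $1 = workload, $2 = dataset.
recordsize_cmd() {
    workload=$1
    dataset=$2
    case $workload in
        bigfiles) size=1M   ;;  # large files read or written whole
        innodb)   size=16K  ;;  # matches InnoDB's default page size
        qcow2)    size=64K  ;;  # matches the default qcow2 cluster_size
        *)        size=128K ;;  # the ZFS default
    esac
    echo "zfs set recordsize=$size $dataset"
}

recordsize_cmd bigfiles data/photos
recordsize_cmd innodb data/mysql
recordsize_cmd qcow2 data/images
```

Keep in mind that a new recordsize only applies to data written after the change; existing files keep the record size they were written with.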

What if I set recordsize too high?

If recordsize is much higher than the size of the typical storage operation within the dataset, latency will be greatly increased and this is likely to be incredibly frustrating. IOPS will be very limited, databases will perform poorly, desktop UI will be glacial, etc.

What if I set recordsize too low?

If recordsize is a lot smaller than the size of the typical storage operation within the dataset, fragmentation will be greatly (and unnecessarily) increased, leading to unnecessary performance problems down the road. IOPS as measured by artificial tools will be super high, but performance profiles will be limited to those presented by random I/O at the record size you’ve set, which in turn can be significantly worse than the performance profile of larger block operations.

You’ll also screw up compression with an unnecessarily low recordsize; zfs inline compression dictionaries are per-record, and work by fitting more than one entire block into a single record’s space. If you set compression=lz4, ashift=12, and recordsize=4K you’ll effectively have NO compression, because your blocksize is equal to your recordsize – pretty much nothing but all-zero blocks can be compressed. Meanwhile, the same dataset with the default 128K recordsize might easily have a 1.7:1 compression ratio.
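The arithmetic behind that is easy to check: compressed data still has to be stored in whole blocks. A minimal sketch, with made-up compressed sizes (2500 and 75000 bytes) chosen purely for illustration:

```shell
# blocks_needed BYTES ASHIFT: number of 2^ASHIFT-byte blocks needed to
# store BYTES of (compressed) data, rounding up.
blocks_needed() {
    block=$((1 << $2))
    echo $(( ($1 + block - 1) / block ))
}

# A 4K record that compresses to 2500 bytes still occupies one 4K block.
blocks_needed 2500 12      # prints 1
# A 128K record (32 blocks) that compresses to 75000 bytes needs fewer.
blocks_needed 75000 12     # prints 19
```

In the second case the record shrinks from 32 blocks to 19, roughly 1.7:1; in the first case there is nothing smaller than one block to shrink to.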

Are the defaults good? Do I aim high, or do I aim low?

128K is a pretty reasonable “ah, what the heck, it works well enough” setting in general. It penalizes you significantly on IOPS and latency for small random I/O operations, and it presents more fragmentation than necessary for large contiguous files, but it’s not horrible at either task. There is a lot to be gained from tuning recordsize more appropriately for task, though.

What about bittorrent?

The “big records for big files” rule of thumb still applies for datasets used as bittorrent targets.

This is one of those cases where things work just the opposite of how you might think – torrents write data in relatively small chunks, and access them randomly for both read and write, so you might reasonably think this calls for a small recordsize. However, the actual data in the torrents is typically huge files, which are accessed in their entirety for everything but the initial bittorrent session.

Since the typical access pattern is “large-file”, most people will be better off using recordsize=1M in the torrent target storage. This keeps the downloaded data unfragmented despite the bittorrent client’s insanely random writing patterns. The data acquired during the bittorrent session in chunks is accumulated in the ZIL until a full record is available to write, since the torrent client itself is not synchronous – it writes all the time, but rarely if ever calls sync().

As a proof-of-concept, I used the Transmission client on an Ubuntu 16.04 LTS workstation to download the Ubuntu 18.04.2 Server LTS ISO, with a dataset using recordsize=1M as the target. This workstation has a pool consisting of two mirror vdevs on rust, so high levels of fragmentation would be very easy to spot.

root@locutus:/# zpool export data ; modprobe -r zfs ; modprobe zfs ; zpool import data

root@locutus:/# pv < /data/torrent/ubu*18*iso > /dev/null
 883MB 0:00:03 [ 233MB/s] [==================================>] 100%

Exporting the pool and unloading the ZFS kernel module entirely is a weapons-grade-certain method of emptying the ARC entirely; getting better than 200 MB/sec average read throughput directly from the rust vdevs afterward (the transfer actually peaked at nearly 400 MB/sec!) confirms that our torrented ISO is not fragmented.

Note that preallocation settings in your bittorrent client are meaningless when the client is saving to ZFS – you can’t actually preallocate in any meaningful way on ZFS, because it’s a copy-on-write filesystem.

Disable Twitter vibration on Android (8.0 and up)

Holy CRAP I can’t believe how difficult this was to figure out.

TL;DR: Settings –> Apps & Notifications –> Twitter –> Notifications

Now here’s the super crazy part. See all those checkboxes, to turn notifications on or off entirely for various events – Direct Messages, Emergency alerts, etc? Yeah, ignore the checkboxes, and tap the TEXT. Now tap “Behavior”, and set to “Show silently.” You will now continue to get notifications, but stop getting sounds and insanely-irritating vibrations, for each event type you change Behavior on.

Tap event category NAME (NOT checkbox!) –> Behavior –> Show silently

I’ve been suffering from over-vibrating apps, with Twitter being the absolute worst offender, for months. There still weren’t any vaguely decent how-tos today, but I finally pieced it together from vague clues on an androidcentral forum post.

 

VLANs with KVM guests on Ubuntu 18.04 / netplan

There is a frustrating lack of information on how to set up multiple VLAN interfaces on a KVM host out there. I made my way through it in production today with great applications of thud and blunder; here’s an example of a working 01-netcfg.yaml with multiple VLANs on a single (real) bridge interface, presenting as multiple bridges.

Everything feeds through properly so that you can bring KVM guests up on br0 for the default VLAN, br100 for VLAN 100, or br200 for VLAN 200. Adapt as necessary for whatever VLANs you happen to be using.

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
      dhcp6: no
    eno2:
      dhcp4: no
      dhcp6: no
  vlans:
    br0.100:
      link: br0
      id: 100
    br0.200:
      link: br0
      id: 200
  bridges:
    br0:
      interfaces:
        - eno1
        - eno2
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.0.2/24 ]
      gateway4: 10.0.0.1
      nameservers:
        addresses: [ 8.8.8.8,1.1.1.1 ]
    br100:
      interfaces:
        - br0.100
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.100.1/24 ]
    br200:
      interfaces:
        - br0.200
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.200.1/24 ]
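Adding another VLAN means touching two places in that file, so here is a sketch of a generator that prints both stanzas with the same indentation as the config above. The helper name vlan_stanzas and the example VLAN 300 and its address are hypothetical.

```shell
# Hypothetical helper: print the two netplan stanzas needed for one more
# VLAN on br0. $1 = VLAN id, $2 = host address on that VLAN (CIDR).
vlan_stanzas() {
    cat <<EOF
  # goes under "vlans:"
    br0.$1:
      link: br0
      id: $1
  # goes under "bridges:"
    br$1:
      interfaces:
        - br0.$1
      dhcp4: no
      dhcp6: no
      addresses: [ $2 ]
EOF
}

vlan_stanzas 300 10.0.30.1/24
```

Paste each half under its own section by hand; the indentation is already at the depth those sections expect.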

Some testing notes on WireGuard

I got super, super interested in WireGuard when Linus Torvalds heaped fulsome praise on its design (if you’re not familiar with Linus’ commentary, then trust me – that’s extremely fulsome in context) in an initial code review this week. WireGuard aims to be more secure and faster than competing VPN solutions; as far as security goes, it’s certainly one hell of a lot more auditable, at 4,000 lines of code compared to several hundred thousand lines of code for OpenVPN/OpenSSL or IPSEC/StrongSwan.

I’ve got a decade-and-a-half of production experience with OpenVPN and various IPSEC implementations, and “prettiness of code” aside, frankly they all suck. They’re not so bad if you only work with a client or ten at a time which are manually connected and disconnected; but if you’re working at a scale of hundred+ clients expected to be automatically connected 24/7/365, they’re a maintenance nightmare. The idea of something that connects quicker and cleaner, and is less of a buggy nightmare both in terms of security and ongoing usage, is pretty strongly appealing!

WARNING:  These are my initial testing notes, on 2018-Aug-05. I am not a WireGuard expert. This is my literal day zero. Proceed at own risk!

Alright, so clearly I wanna play with this stuff.  I’m an Ubuntu person, so my initial step is apt-add-repository ppa:wireguard/wireguard ; apt update ; apt install wireguard-dkms wireguard-tools .

After we’ve done that, we’ll need to generate a keypair for our wireguard instance. The basic commands here are wg genkey and wg pubkey. You’ll need to pipe the private key created with wg genkey into wg pubkey to get the matching public key. You don’t have to store your private key anywhere outside the wg0.conf itself, but if you’re a traditionalist and want your keys saved in nice organized files you can find (and which aren’t automagically monkeyed with – more on that later), you can do so like this:

root@box:/etc/wireguard# touch machinename.wg0.key ; chmod 600 machinename.wg0.key
root@box:/etc/wireguard# wg genkey > machinename.wg0.key
root@box:/etc/wireguard# wg pubkey < machinename.wg0.key > machinename.wg0.pub

You’ll need this keypair to connect to other wireguard machines; it’s generated the same way on servers or clients. The private key goes in the [Interface] section of the machine it belongs to; the public key isn’t used on that machine at all, but is given to machines it wants to connect to, where it’s specified in a [Peer] section.

From there, you need to generate a wg0.conf to define a wireguard network interface. I had some trouble finding definitive information on what would or wouldn’t work with various configs on the server side, so let’s dissect a (fairly) simple one:

# /etc/wireguard/wg0.conf - server configs

[Interface]
   Address = 10.0.0.1/24
   ListenPort = 51820
   PrivateKey = SERVER_PRIVATE_KEY
   
   # SaveConfig = true makes commenting, formatting impossible
   SaveConfig = false
   # This stuff sets up masquerading through the server's WAN,
   # if you want to route all internet traffic from your client
   # across the Wireguard link. 
   #
   # You'll also need to set net.ipv4.ip_forward=1 in /etc/sysctl.conf
   # if you're going this route; sysctl -p to reload sysctl.conf after
   # making your changes.
   #
   PostUp = iptables -A FORWARD -i wg0 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE; ip6tables -A FORWARD -i wg0 -j ACCEPT; ip6tables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
   PostDown = iptables -D FORWARD -i wg0 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE; ip6tables -D FORWARD -i wg0 -j ACCEPT; ip6tables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

OK, so far so good. Note that SERVER_PRIVATE_KEY above is not a reference to a filename – it’s the server’s private key itself, pasted directly into the config file!

With the above server config file (and a real private key on the private key line), wg0 will start, and will answer incoming connections. The problem is, it’ll answer incoming connections from anybody who has the server’s public key – no verification of the client necessary. (TESTED)

Here’s a sample client config:

# client config - client 1 - /etc/wireguard/wg0.conf

[Interface]
   Address = 10.0.0.2/24
   SaveConfig = true
   PrivateKey = MY_PRIVATE_KEY

   # Warning: setting DNS here won't work if you don't
   # have resolvconf installed... and if you're running
   # Ubuntu 18.04, you probably don't have resolvconf
   # installed. If you set this without resolvconf available,
   # the whole interface will fail to come up.
   #
   # DNS = 1.1.1.1

[Peer]
   PublicKey = SERVER_PUBLIC_KEY
   Endpoint = wireguard.mydomain.wtflol:51820

   # this restricts tunnel traffic to the VPN server itself
   AllowedIPs = 10.0.0.1/32

   # if you wanted to route ALL traffic across the VPN, do this instead:
   # AllowedIPs = 0.0.0.0/0

Notice that we set SaveConfig=true in wg0.conf here on our client. This may be more of a bug than a feature. See those nice helpful comments we put in there? And notice how we specified an FQDN instead of a raw IP address for our server endpoint? Well, with SaveConfig=true on, those are going to get wiped out every time the service is restarted (such as on boot). The comments will just get wiped, stuff like the random dynamic port the client service uses will get hard-coded into the file, and the FQDN will be replaced with whatever IP address it resolved to the last time the service was started.

So, yes, you can use an FQDN in your configs – but if you use SaveConfig=true you might as well not bother, since it’ll get immediately replaced with a raw IP address anyway. Caveat imperator.

If we want our server to refuse random anonymous clients and only accept clients who have a private key matching a pubkey in our possession, we need to add [Peer] section(s):

[Peer]
PublicKey = PUBLIC_KEY_OF_CLIENT_ONE
AllowedIPs = 10.0.0.2/32

This works… and with it in place, we will no longer accept connections from anonymous clients. If we haven’t specifically authorized the pubkey for a connecting client, it won’t be allowed to send or receive any traffic. (TESTED.)

We can have multiple peers defined, and they’ll all work simultaneously, on the same port on the same server: (TESTED)

# appended to wg0.conf on SERVER

[Peer]
PublicKey = PUBKEY_OF_CLIENT_ONE
AllowedIPs = 10.0.0.2/32

[Peer]
PublicKey = PUBKEY_OF_CLIENT_TWO
AllowedIPs = 10.0.0.3/32

Wireguard won’t dynamically reload wg0.conf looking for new keys, though; so if we’re adding our new peers manually to the config file like this we’ll have to bring the wg0 interface down and back up again to load the changes, with wg-quick down wg0 && wg-quick up wg0. This is definitely not a good way to do things in production at scale, because it means approximately 15 seconds of downtime for existing clients before they automatically reconnect themselves: (TESTED)

64 bytes from 172.29.128.1: icmp_seq=10 ttl=64 time=35.0 ms
64 bytes from 172.29.128.1: icmp_seq=11 ttl=64 time=39.3 ms
64 bytes from 172.29.128.1: icmp_seq=12 ttl=64 time=37.6 ms

[[[       client disconnected due to server restart      ]]]
[[[  16 pings dropped ==> approx 15-16 seconds downtime  ]]]
[[[ client automatically reconnects itself after timeout ]]]

64 bytes from 172.29.128.1: icmp_seq=28 ttl=64 time=51.4 ms
64 bytes from 172.29.128.1: icmp_seq=29 ttl=64 time=37.6 ms

A better way to do things in production is to add our clients with the wg command itself. This allows us to dynamically add clients without bringing the server down; the catch, covered below, is that the addition isn’t written to wg0.conf unless we take an extra step.

If we wanted to use this method, the CLI commands we’ll need to run on the server look like this: (TESTED)

root@server:/etc/wireguard# wg set wg0 peer CLIENT3_PUBKEY allowed-ips 10.0.0.4/32

The client CLIENT3 will immediately be able to connect to the server after running this command; but its config information won’t be added to wg0.conf, so this isn’t a persistent addition. To make it persistent, we’ll either need to append a [Peer] block for CLIENT3 to wg0.conf manually, or we could use wg-quick save wg0 to do it automatically. (TESTED)

root@server:/etc/wireguard# wg-quick save wg0

The problem with using wg-quick save (which does not require, but shares the limitations of, SaveConfig = true in the wg0.conf itself) is that it strips all comments and formatting, permanently resolves FQDNs to raw IP addresses, and makes some things permanent that you might wish to keep ephemeral (such as ListenPort on client machines). So in production at scale, while you will likely want to use the wg set command to directly add peers to the server, you probably won’t want to use wg-quick save to make the addition permanent; you’re better off scripting something to append a well-formatted [Peer] block to your existing wg0.conf instead.
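The "script something" idea might look like the sketch below. The helper name peer_block and the comment label are made up, and it assumes SaveConfig = false on the server so that nothing rewrites the file behind your back.

```shell
# Hypothetical helper: print a [Peer] block in the same layout used above.
# $1 = comment label, $2 = client public key, $3 = client tunnel IP.
peer_block() {
    printf '\n# %s\n[Peer]\nPublicKey = %s\nAllowedIPs = %s/32\n' "$1" "$2" "$3"
}

# Example with placeholder values; prints the block to stdout.
peer_block client3 CLIENT3_PUBKEY 10.0.0.4

# On the server, the live change and the persistent one might look like:
#   wg set wg0 peer CLIENT3_PUBKEY allowed-ips 10.0.0.4/32
#   peer_block client3 CLIENT3_PUBKEY 10.0.0.4 >> /etc/wireguard/wg0.conf
```

That way the running interface picks up the peer immediately, and the config file keeps its comments, formatting, and FQDNs.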

Once you’ve gotten everything working to your liking, you’ll want to make your wg0 interface come up automatically on boot. On Ubuntu Xenial or later, this is (of course, and however you may feel about it) a systemd thing:

root@box:/etc/wireguard# systemctl enable wg-quick@wg0

This is sufficient to automatically bring up wg0 at boot; but note that since we’ve already brought it up manually with wg-quick up in this session, an attempt to systemctl status wg-quick@wg0 will show an error. This is harmless, but if it bugs you, you’ll need to manually bring wg0 down, then start it up again using systemctl:

root@box:/etc/wireguard# wg-quick down wg0
root@box:/etc/wireguard# systemctl start wg-quick@wg0

At this point, you’ve got a working wireguard interface on server and client(s), that’s persistent across reboots (and other disconnections) if you want it to be.

What we haven’t covered

Note that we haven’t covered getting packets from CLIENT1 to CLIENT2 here – if you try to communicate directly between two clients with this setup and no additional work, you’ll see the following error: (TESTED)

root@client1:/etc/wireguard# ping -c1 CLIENT2
From 10.0.0.2 icmp_seq=1 Destination Host Unreachable
ping: sendmsg: Required key not available
--- 10.0.0.3 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

We also haven’t looked around at any kind of crypto configuration yet; at this point we’re blindly accepting whatever defaults for algorithms, key sizes, and so forth and hoping for the best. Make sure you understand these (and I don’t, yet!) before deploying in production.

At this point, though, we’ve at least got something working we can play with. Happy hacking, and good luck!

ZFS does NOT favor lower latency devices. Don’t mix rust disks and SSDs!

In an earlier post, I addressed the never-ending urban legend that ZFS writes data to the lowest-latency vdev. Now the urban legend that never dies has reared its head again; this time with someone claiming that ZFS will issue read operations to the lowest-latency disk in a given mirror vdev.

TL;DR – this, too, is a myth. If you need or want an empirical demonstration, read on.

I’ve got an Ubuntu Bionic machine handy with both rust and SSD available; /tmp is an ext4 filesystem on an mdraid1 SSD mirror and /rust is an ext4 filesystem on a single WD 4TB black disk. Let’s play.

root@box:~# truncate -s 4G /tmp/ssd.bin
root@box:~# truncate -s 4G /rust/rust.bin
root@box:~# mkdir /tmp/disks
root@box:~# ln -s /tmp/ssd.bin /tmp/disks/ssd.bin ; ln -s /rust/rust.bin /tmp/disks/rust.bin
root@box:~# zpool create -oashift=12 test /tmp/disks/rust.bin
root@box:~# zfs set compression=off test

Now we’ve got a pool that is rust only… but we’ve got an ssd vdev off to the side, ready to attach. Let’s run an fio test on our rust-only pool first. Note: since this is read testing, we’re going to throw away our first result set; they’ll largely be served from ARC and that’s not what we’re trying to do here.

root@box:~# cd /test
root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1

OK, cool. Now that fio has generated its dataset, we’ll clear all caches by exporting the pool, then clearing the kernel page cache, then importing the pool again.

root@box:/test# cd ~
root@box:~# zpool export test
root@box:~# echo 3 > /proc/sys/vm/drop_caches
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Now we can get our first real, uncached read from our rust-only pool. It’s not terribly pretty; this is going to take a minute or so.

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[ ... ]
Run status group 0 (all jobs):
  READ: bw=17.6MiB/s (18.5MB/s), 17.6MiB/s-17.6MiB/s (18.5MB/s-18.5MB/s), io=1024MiB (1074MB), run=58029-58029msec

Alright. Now let’s attach our ssd and make this a mirror vdev, with one rust and one SSD disk.

root@box:/test# zpool attach test /tmp/disks/rust.bin /tmp/disks/ssd.bin
root@box:/test# zpool status test
  pool: test
 state: ONLINE
  scan: resilvered 1.00G in 0h0m with 0 errors on Sat Jul 14 14:34:07 2018
config:

    NAME                     STATE     READ WRITE CKSUM
    test                     ONLINE       0     0     0
      mirror-0               ONLINE       0     0     0
        /tmp/disks/rust.bin  ONLINE       0     0     0
        /tmp/disks/ssd.bin   ONLINE       0     0     0

errors: No known data errors

Cool. Now that we have one rust and one SSD device in a mirror vdev, let’s export the pool, drop all the kernel page cache, and reimport the pool again.

root@box:/test# cd ~
root@box:~# zpool export test
root@box:~# echo 3 > /proc/sys/vm/drop_caches
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Gravy. Now, do we see massively improved throughput when we run the same fio test? If ZFS favors the SSD, we should see enormously improved results. If ZFS does not favor the SSD, we’ll see not-quite-doubled results.

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
   READ: bw=31.1MiB/s (32.6MB/s), 31.1MiB/s-31.1MiB/s (32.6MB/s-32.6MB/s), io=1024MiB (1074MB), run=32977-32977msec

Welp. There you have it. Not-quite-doubled throughput, matching half – but only half – of the read ops coming from the SSD. To confirm, we’ll do this one more time; but this time we’ll detach the rust disk and run fio with nothing in the pool but the SSD.

root@box:/test# cd ~
root@box:~# zpool detach test /tmp/disks/rust.bin
root@box:~# zpool export test
root@box:~# zpool import -d /tmp/disks test
root@box:~# cd /test

Moment of truth… this time, fio runs on pure solid state:

root@box:/test# fio --name=read --ioengine=sync  --rw=randread --bs=16K --size=1G --numjobs=1 --end_fsync=1
[...]
Run status group 0 (all jobs):
  READ: bw=153MiB/s (160MB/s), 153MiB/s-153MiB/s (160MB/s-160MB/s), io=1024MiB (1074MB), run=6710-6710msec

Welp, there you have it.

Rust only: reads 18.5 MB/sec
SSD only: reads 160 MB/sec
Rust + SSD: reads 32.6 MB/sec
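As a rough sanity check on those numbers: if a single serial reader alternates evenly between the two disks, each byte costs the average of the two per-byte times, which works out to the harmonic mean of the two throughputs. This is a back-of-the-envelope sketch under that even-split assumption, and even_split_mbps is a made-up name.

```shell
# Expected throughput if serial reads alternate evenly between two devices:
# the harmonic mean 2ab/(a+b). Inputs are the fio results above, in MB/s.
even_split_mbps() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 2 * a * b / (a + b) }'
}

even_split_mbps 18.5 160   # prints 33.2, close to the measured 32.6
```

A mirror that favored the SSD would land near the 160 MB/sec end instead.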

No, ZFS does not read from the lowest-latency disk in a mirror vdev.
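For the curious, here's a quick back-of-the-envelope check on those numbers. It's only a sketch, and it rests on an assumption of mine: a single synchronous reader (which is what numjobs=1 with ioengine=sync gives us) whose reads alternate evenly between the two disks should land at the harmonic mean of the two single-disk rates. The function name is made up for illustration.

```python
def expected_even_split_throughput(rates_mb_s):
    """Throughput of one synchronous reader whose reads are split evenly
    across devices: the harmonic mean of the per-device rates."""
    return len(rates_mb_s) / sum(1.0 / r for r in rates_mb_s)

rust_only = 18.5   # MB/s, from the results above
ssd_only = 160.0   # MB/s, from the results above
mirror = 32.6      # MB/s, from the results above

predicted = expected_even_split_throughput([rust_only, ssd_only])
print(f"predicted with an even split: {predicted:.1f} MB/s")
print(f"measured on the mirror:       {mirror:.1f} MB/s")
print(f"speedup over rust alone:      {mirror / rust_only:.2f}x")
```

The even-split prediction comes out within about 1 MB/s of the measured mirror result, which is what you'd expect if roughly half the reads are still waiting on the rust disk.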

Please don’t perpetuate the myth that ZFS favors lower latency devices.

sample netplan config for ubuntu 18.04

Here’s a sample /etc/netplan config for Ubuntu 18.04. HUGE LIFE PRO TIP: against all expectations of decency, netplan refuses to function if you don’t indent everything exactly the way it likes it and returns incomprehensible wharrgarbl errors like “mapping values are not allowed in this context, line 17, column 15” if you, for example, have a single extra space somewhere in the config.

I wish I was kidding.

Anyway, here’s a sample /etc/netplan/01-config.yaml with a couple interfaces, one wired and static, one wireless and dynamic. Enjoy. And for the love of god, get the spacing exactly right; I really wasn’t kidding about it barfing if you have one too many spaces for a whitespace indent somewhere. Ask me how I know. >=\

If for any reason you have trouble reading this exact spacing, the rule is two spaces for each level of indent. So the v in “version” should line up under the t in “network”, the d in “dhcp4” should line up under the o in “eno1”, and so forth.

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
      dhcp6: no
      addresses: [192.168.0.2/24]
      gateway4: 192.168.0.1
      nameservers:
        addresses: [8.8.8.8, 1.1.1.1]
  wifis:
    wlp58s0:
      dhcp4: yes
      dhcp6: no
      access-points:
        "your-wifi-SSID-name":
          password: "your-wifi-password"
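If you'd rather not count spaces by eye, here's a rough pre-flight check. It's a sketch, not a YAML parser: it only enforces the two-spaces-per-level rule described above (which is stricter than YAML itself requires), and the function name is made up for illustration. netplan itself still has the final word.

```python
def find_indent_problems(text, step=2):
    """Return (line_number, message) pairs for suspicious indentation."""
    problems = []
    previous_indent = 0
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments don't affect structure
        leading = line[: len(line) - len(line.lstrip())]
        if "\t" in leading:
            problems.append((number, "tab in indentation"))
            continue
        indent = len(leading)
        if indent % step:
            problems.append((number, f"indent of {indent} is not a multiple of {step}"))
        elif indent > previous_indent + step:
            problems.append((number, f"indent jumps from {previous_indent} to {indent}"))
        previous_indent = indent
    return problems

# Deliberately broken sample: the dhcp4 line has one space too many.
sample = """network:
  version: 2
  ethernets:
    eno1:
       dhcp4: no
"""
for number, message in find_indent_problems(sample):
    print(f"line {number}: {message}")
```

To check a real file, read it in and pass the text to find_indent_problems; an empty list means the indentation at least follows the two-space rule.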

Wait for network to be configured (no limit)

In Ubuntu 16.04 or up (ie, post systemd) if you’re ever stuck staring for two straight minutes at “Waiting for network to be configured (no limit)” and despairing, there’s a simple fix:

systemctl mask systemd-networkd-wait-online.service

This links the service that sits there with its thumb up its butt if you don’t have a network connection to /dev/null, causing it to just return instantly whenever it’s called. Which is probably a good idea. There may indeed be a situation in which I want a machine to refuse to boot until it gets an IP address, but whatever that situation MIGHT be, I’ve never encountered it in 20+ years of professional system administration, so…

Primer: How data is stored on-disk with ZFS

As with a lot of things at this blog, I’m largely writing this to confirm and solidify my own knowledge. I tend to be pretty firm on how disks relate to vdevs, and vdevs relate to pools… but once you veer down deeper into the direct on-disk storage, I get a little hazier. So here’s an attempt to remedy that, with citations, for my benefit (and yours!) down the line.

Top level: the zpool

The zpool is the topmost unit of storage under ZFS. A zpool is a single, overarching storage system consisting of one or more vdevs. Writes are distributed among the vdevs according to how much FREE space each vdev has available – you may hear urban myths about ZFS distributing them according to the performance level of the disk, such that “faster disks end up with more writes”, but they’re just that – urban myths. (At least, they’re only myths as of this writing – 2018 April, and ZFS through 0.7.5.)
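Here's a toy model of that free-space rule. It's a simplification for illustration only (the real allocator is considerably more nuanced), and the vdev names and sizes are hypothetical.

```python
def split_writes_by_free_space(free_by_vdev, write_size):
    """Split write_size across vdevs in proportion to each vdev's free space."""
    total_free = sum(free_by_vdev.values())
    return {name: write_size * free / total_free
            for name, free in free_by_vdev.items()}

# Hypothetical pool: an older, fuller vdev and a freshly added one (GiB free).
free = {"mirror-0": 500, "mirror-1": 2000}
shares = split_writes_by_free_space(free, write_size=100)
for name, share in shares.items():
    print(f"{name}: {share:.0f} GiB of the next 100 GiB written")
```

In this model the newer vdev, with four times the free space, takes four times the writes, regardless of how fast either vdev is.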

A zpool may be created with one or more vdevs, and may have any number of additional vdevs added to it later with zpool add – but, for the most part, you may not ever remove a vdev from a zpool. There is working code in development to make this possible, but it’s more of a “desperate save” than something you should use lightly – it involves building a permanent lookup table to redirect requests for records stored on the removed vdevs to their new locations on remaining vdevs; sort of a CNAME for storage blocks.

If you create a zpool with vdevs of different sizes, or you add vdevs later when the pool already has a substantial amount of data in it, you’ll end up with an imbalanced distribution of data that causes more writes to land on some vdevs than others, which will limit the performance profile of your pool.

A pool’s performance scales with the number of vdevs within the pool: in a pool of n vdevs, expect the pool to perform roughly equivalently to the slowest of those vdevs, multiplied by n. This is an important distinction – if you create a pool with three solid state disks and a single rust disk, the pool will trend towards the IOPS performance of four rust disks.

Also note that the pool’s performance scales with the number of vdevs, not the number of disks within the vdevs. If you have a single 12 disk wide RAIDZ2 vdev in your pool, expect to see roughly the IOPS profile of a single disk, not of ten!
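As a sketch of that rule of thumb (slowest vdev times number of vdevs), with per-vdev IOPS figures that are purely hypothetical:

```python
def estimate_pool_iops(vdev_iops):
    """Rule of thumb from the text: slowest vdev's IOPS times the vdev count."""
    return min(vdev_iops) * len(vdev_iops)

# Hypothetical per-vdev figures: three SSD vdevs and one rust vdev.
ssd, rust = 10_000, 200
mixed_pool = estimate_pool_iops([ssd, ssd, ssd, rust])
all_rust_pool = estimate_pool_iops([rust] * 4)
print(mixed_pool, all_rust_pool)
```

Both estimates come out the same, which is the point: under this rule the three SSD vdevs buy you nothing once a single rust vdev is in the pool.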

There is absolutely no parity or redundancy at the pool level. If you lose any vdev, you’ve lost the entire pool, plain and simple. Even if you “didn’t write to anything on that vdev yet” – the pool has altered and distributed its metadata accordingly once the vdev was added; if you lose that vdev “with nothing on it” you’ve still lost the pool.

It’s important to realize that the zpool is not a RAID0; in conventional terms, it’s a JBOD – and a fairly unusual one, at that.

Second level: the vdev

A vdev consists of one or more disks. Standard vdev types are single-disk, mirror, and raidz. A raidz vdev can be raidz1, raidz2, or raidz3. There are also special vdev types – log and l2arc – which extend the ZIL and the ARC, respectively, onto those vdev types. (They aren’t really “write cache” and “read cache” in the traditional sense, which trips a lot of people up. More about that in another post, maybe.)

A single vdev, of any type, will generally have write IOPS characteristics similar to those of a single disk. Specifically, the write IOPS characteristics of its slowest member disk – which may not even be the same disk on every write.

All parity and/or redundancy in ZFS occurs within the vdev level.

Single-disk vdevs

This is as simple as it gets: a vdev that consists of a single disk, no more, no less.

The performance profile of a single-disk vdev is that of, you guessed it, that single disk.

Single-disk vdevs may be expanded in size by replacing that disk with a larger disk: if you zpool attach a 4T disk to a 2T disk, it will resilver into a 2T mirror vdev. When you then zpool detach the 2T disk, the vdev becomes a 4T vdev, expanding your total pool size.

Single-disk vdevs may also be upgraded permanently to mirror vdevs; just zpool attach one or more disks of the same or larger size.

Single-disk vdevs can detect, but not repair, corrupted data records. This makes operating with single-disk vdevs quite dangerous, by ZFS standards – the equivalent, danger-wise, of a conventional RAID0 array.

However, a pool of single-disk vdevs is not actually a RAID0, and really shouldn’t be referred to as one. For one thing, a RAID0 won’t distribute twice as many writes to a 2T disk as to a 1T disk. For another thing, you can’t start out with a three disk RAID0 array, then add a single two-disk RAID1 array (or three five-disk RAID5 arrays!) to your original array, and still call it “a RAID0”.

It may be tempting to use old terminology for conventional RAID, but doing so just makes it that much more difficult to get accustomed to thinking in terms of ZFS’ real topology, hindering both understanding and communication.

Mirror vdevs

Mirror vdevs work basically like traditional RAID1 arrays – each record destined for a mirror vdev is written redundantly to all disks within the vdev. A mirror vdev can have any number of constituent disks; common sizes are 2-disk and 3-disk, but there’s nothing stopping you from creating a 16-disk mirror vdev if that’s what floats your boat.

A mirror vdev offers usable storage capacity equivalent to that of its smallest member disk; and can survive intact as long as any single member disk survives. As long as the vdev has at least two surviving members, it can automatically repair corrupt records detected during normal use or during scrubbing – but once it’s down to the last disk, it can only detect corruption, not repair it. (If you don’t scrub regularly, this means you may already be screwed when you’re down to a single disk in the vdev – any blocks that were already corrupt are no longer repairable, as well as any blocks that become corrupt before you replace the failed disk(s).)

You can expand a single disk to a mirror vdev at any time using the zpool attach command; you can also add new disks to an existing mirror in the same way. Disks may also be detached and/or replaced from mirror vdevs arbitrarily. You may also expand the size of an individual mirror vdev by replacing its disks one by one with larger disks; eg start with a mirror of 2T disks, then replace one disk with a 4T disk, wait for it to resilver, then replace the second 2T disk with another 4T disk. Once there are no disks smaller than 4T in the vdev, and it finishes resilvering, the vdev will expand to the new 4T size.
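To make the one-disk-at-a-time expansion concrete, here's a tiny sketch that tracks usable capacity through the 2T-to-4T example above. The only rule it encodes is that a mirror is as large as its smallest member; the function name is made up.

```python
def mirror_capacity(disk_sizes_tb):
    """A mirror vdev is only as large as its smallest member disk."""
    return min(disk_sizes_tb)

steps = [
    ("original mirror", [2, 2]),
    ("first disk replaced", [4, 2]),
    ("second disk replaced", [4, 4]),
]
for label, disks in steps:
    print(f"{label}: {mirror_capacity(disks)}T usable")
```

Capacity stays at 2T until the last small disk is gone, then jumps to 4T.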

Mirror vdevs are extremely performant: like all vdevs, their write IOPS are roughly those of a single disk, but their read IOPS are roughly those of n disks, where n is the number of disks in the mirror – a mirror vdev n disks wide can read blocks from all n members in parallel.

A pool made of mirror vdevs closely resembles a conventional RAID10 array; each has write IOPS similar to n/2 disks and read IOPS similar to n disks, where n is the total number of disks. As with single-disk vdevs, though, I’d advise you not to think and talk sloppily and call it “ZFS RAID10” – it really isn’t, and referring to it that way blurs the boundaries between pool and vdev, hindering both understanding and accurate communication.
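A rough sketch of that arithmetic for a pool of identical mirror vdevs follows. The per-disk IOPS figure is hypothetical, and the read number is a best case that assumes enough parallel requests to keep every member busy.

```python
def mirror_pool_iops(disk_iops, vdevs, disks_per_mirror=2):
    """Rough read/write IOPS for a pool of identical mirror vdevs."""
    write_iops = disk_iops * vdevs                      # one disk's worth per vdev
    read_iops = disk_iops * vdevs * disks_per_mirror    # every member can serve reads
    return read_iops, write_iops

# Hypothetical: eight disks at 200 IOPS each, arranged as four 2-way mirrors.
reads, writes = mirror_pool_iops(disk_iops=200, vdevs=4)
print(f"reads ~{reads} IOPS, writes ~{writes} IOPS")
```

With eight disks that works out to eight disks' worth of reads and four disks' worth of writes.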

RAIDZ vdevs

RAIDZ vdevs are striped parity arrays, similar to RAID5 or RAID6. RAIDZ1 has one parity block per stripe, RAIDZ2 has two parity blocks per stripe, and RAIDZ3 has three parity blocks per stripe. This means that RAIDZ1 vdevs can survive the loss of a single disk, RAIDZ2 vdevs can survive the loss of two disks, and RAIDZ3 vdevs can survive the loss of as many as three disks.

Note, however, that – just like mirror vdevs – once you’ve stripped away all the parity, you’re vulnerable to corruption that can’t be repaired. RAIDZ vdevs typically take significantly longer to resilver than mirror vdevs do, as well – so you really don’t want to end up completely “uncovered” (surviving, but with no remaining parity blocks) with a RAIDZ array.

Each raidz vdev offers (n - parity) * s storage capacity, where n is the number of disks in the vdev, s is the storage capacity of a single disk, and parity is the number of parity blocks per stripe. So a six-disk RAIDZ1 vdev offers the storage capacity of five disks, an eight-disk RAIDZ2 vdev offers the storage capacity of six disks, and so forth.
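The two examples above work out to (number of disks minus parity) times the per-disk capacity. Here's a minimal sketch of that; it ignores padding and metadata overhead, and the function names and the 4T disk size are made up for illustration.

```python
def raidz_usable_disks(disks, parity):
    """Number of disks' worth of usable space in a RAIDZ vdev."""
    if not 1 <= parity <= 3:
        raise ValueError("RAIDZ parity level must be 1, 2, or 3")
    if disks <= parity:
        raise ValueError("a RAIDZ vdev needs more disks than parity blocks")
    return disks - parity

def raidz_capacity_tb(disks, parity, disk_size_tb):
    """Approximate usable capacity, based on the smallest member disk."""
    return raidz_usable_disks(disks, parity) * disk_size_tb

print(raidz_usable_disks(6, 1))      # six-disk RAIDZ1
print(raidz_usable_disks(8, 2))      # eight-disk RAIDZ2
print(raidz_capacity_tb(8, 2, 4))    # the same RAIDZ2 built from 4T disks
```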

You may create RAIDZ vdevs using mismatched disk sizes, but the vdev’s capacity will be based around the smallest member disk. You can expand the size of an existing RAIDZ vdev by replacing all of its members individually with larger disks than were originally used, but you cannot expand a RAIDZ vdev by adding new disks to it and making it wider – a 5-disk RAIDZ1 vdev cannot be converted into a 6-disk RAIDZ1 vdev later; neither can a 6-disk RAIDZ2 be converted into a 6-disk RAIDZ1.

It’s a common misconception to think that RAIDZ vdev performance scales linearly with the number of disks used. Although throughput under ideal conditions can scale towards n-parity disks, throughput under moderate to serious load will rapidly degrade toward the profile of a single disk – or even slightly worse, since it scales down toward the profile of the slowest disk for any given operation. This is the difference between IOPS and bandwidth (and it works the same way for conventional RAID!)

RAIDZ vdev IOPS performance is generally more robust than that of a conventional RAID5 or RAID6 array of the same size, because RAIDZ offers variable stripe write sizes – if you routinely write data in records only one record wide, a RAIDZ1 vdev will write to only two of its disks (one for data, and one for parity); a RAIDZ2 vdev will write to only three of its disks (one for data, and two for parity) and so on. This can mitigate some of the otherwise-crushing IOPS penalty associated with wide striped arrays; a three-record variable stripe write to a six-disk RAIDZ vdev only lights up half the disks both when written, and later, when read – which can make the performance profile of that six-disk RAIDZ resemble that of two three-disk RAIDZ1 vdevs rather than that of a single vdev.

The performance improvement described above assumes that multiple reads and writes of the three-record stripes are being requested concurrently; otherwise the entire vdev still binds while waiting for a full-stripe read or write.
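Here's a toy model of how many disks a single variable-width stripe lights up. One assumption to flag loudly: I'm counting data blocks and parity blocks separately, so the “three-record” stripe on a six-disk RAIDZ1 is modeled as two data blocks plus one parity block. The function name is hypothetical.

```python
def disks_touched(data_blocks, parity, vdev_width):
    """Disks involved in one stripe: data blocks plus parity, capped at the vdev width."""
    return min(data_blocks + parity, vdev_width)

print(disks_touched(1, parity=1, vdev_width=6))  # one data block on RAIDZ1
print(disks_touched(1, parity=2, vdev_width=6))  # one data block on RAIDZ2
print(disks_touched(2, parity=1, vdev_width=6))  # half of a six-disk RAIDZ1
```

When a stripe only touches half the disks, the other half are free to service a different request at the same time, which is where the concurrency requirement comes from.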

Remember that you can – and with larger servers, should – have multiple RAIDZ vdevs per pool, not just one. A pool of three eight-disk RAIDZ2 vdevs will significantly outperform a pool with a single 24-disk RAIDZ2 or RAIDZ3 vdev – and it will resilver much faster when replacing failed disks.

Third level: the metaslab

Each vdev is organized into metaslabs – typically, 200 metaslabs per vdev (although this number can change, if vdevs are expanded and/or as the ZFS codebase itself becomes further optimized over time).

When you issue writes to the pool, those writes are coalesced into a txg (transaction group), which is then distributed among individual vdevs, and finally allocated to specific metaslabs on each vdev. There’s a fairly hefty logic chain which determines exactly what metaslab a record is written to; it was explained to me (with no warranty offered) by a friend who worked with Oracle as follows:

• Is this metaslab “full”? (zfs_mg_noalloc_threshold)
• Is this metaslab excessively fragmented? (zfs_metaslab_fragmentation_threshold)
• Is this metaslab group excessively fragmented? (zfs_mg_fragmentation_threshold)
• Have we exceeded minimum free space thresholds? (metaslab_df_alloc_threshold) This one is weird; it changes the whole storage pool allocation strategy for ZFS if you cross it.
• Should we prefer lower-numbered metaslabs over higher ones? (metaslab_lba_weighting_enabled) This is totally irrelevant to all-SSD pools, and should be disabled there, because it’s pretty stupid without rust disks underneath.
• Should we prefer lower-numbered metaslab groups over higher ones? (metaslab_bias_enabled) Same as above.
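If you want to see what those tunables are set to on your own box, here's a small sketch. It assumes ZFS on Linux, which exposes its module parameters as files under /sys/module/zfs/parameters; on other platforms, or on versions where a tunable has been renamed or removed, the value simply comes back as unavailable. The tunable names are taken straight from the list above.

```python
from pathlib import Path

TUNABLES = [
    "zfs_mg_noalloc_threshold",
    "zfs_metaslab_fragmentation_threshold",
    "zfs_mg_fragmentation_threshold",
    "metaslab_df_alloc_threshold",
    "metaslab_lba_weighting_enabled",
    "metaslab_bias_enabled",
]

def read_tunables(names=TUNABLES, base="/sys/module/zfs/parameters"):
    """Return {name: value or None}; None means the file is missing or unreadable."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values

for name, value in read_tunables().items():
    print(f"{name} = {value if value is not None else '(not available)'}")
```

This only reads the values; it doesn't change anything.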

You can dive into the hairy details of your pool’s metaslabs using the zdb command – this is a level which I have thankfully not personally needed so far, and I devoutly hope I will continue not to need it in the future.

Fourth level: the record

Each ZFS write is broken into records, the size of which is determined by the zfs set recordsize= command. The default recordsize is currently 128K; it may range from 512B to 1M.

Recordsize is a property which can be tuned individually per dataset, and for higher performance applications, should be tuned per dataset. If you expect to largely be moving large chunks of contiguous data – for example, reading and writing 5MB JPEG files – you’ll benefit from a larger recordsize than default. Setting recordsize=1M here will allow your writes to be less fragmented, resulting in higher performance both when making the writes, and later when reading them.

Conversely, if you expect a lot of small-block random I/O – like reading and writing database binaries, or VM (virtual machine) images – you should set recordsize smaller than the default 128K. MySQL, as an example, typically works with data in 16K chunks; if you set recordsize=16K you will tremendously improve IOPS when working with that data.

ZFS CSUMs – checksums which verify its data’s integrity (fletcher4 by default, with cryptographic hashes such as sha256 available as an option) – are written on a per-record basis; data written with recordsize=1M will have a single CSUM per 1MB; data written with recordsize=8K will have 128 times as many CSUMs for the same 1MB of data.
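The record and checksum counts are easy to work out for yourself. This sketch just does the division; it ignores compression, metadata, and the fact that a file smaller than recordsize gets a single smaller block. The function name is made up.

```python
KIB = 1024
MIB = 1024 * KIB

def records_needed(data_bytes, recordsize_bytes):
    """Number of records (and so per-record checksums) needed to hold data_bytes."""
    return -(-data_bytes // recordsize_bytes)  # ceiling division

print(records_needed(1 * MIB, 1 * MIB))     # 1MB of data at recordsize=1M
print(records_needed(1 * MIB, 8 * KIB))     # 1MB of data at recordsize=8K
print(records_needed(5 * MIB, 128 * KIB))   # a 5MB JPEG at the default 128K
print(records_needed(5 * MIB, 1 * MIB))     # the same JPEG at recordsize=1M
```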

Setting recordsize to a value smaller than your hardware’s individual sector size is a tremendously bad idea, and will lead to massive read/write amplification penalties.

Fifth (and final) level: ashift

Ashift is the property which tells ZFS what the underlying hardware’s actual sector size is. The individual blocksize within each record will be determined by ashift; unlike recordsize, however, ashift is set as a binary exponent (the sector size is 2^ashift bytes) rather than as a size in bytes. For example, ashift=13 specifies 8K sectors, ashift=12 specifies 4K sectors, and ashift=9 specifies 512B sectors.
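Converting between ashift and sector size is just a power of two in either direction. A minimal sketch, with helper names made up for illustration:

```python
def sector_size(ashift):
    """Sector size in bytes for a given ashift value."""
    return 1 << ashift

def ashift_for(sector_bytes):
    """ashift value for a power-of-two sector size in bytes."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1

for ashift in (9, 12, 13):
    print(f"ashift={ashift} means {sector_size(ashift)} byte sectors")
```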

Ashift is per vdev, not per pool – and it’s immutable once set, so be careful not to screw it up! In theory, ZFS will automatically set ashift to the proper value for your hardware; in practice, storage manufacturers very, very frequently lie about the underlying hardware sector size in order to keep older operating systems from getting confused, so you should do your homework and set it manually. Remember, once you add a vdev to your pool, you can’t get rid of it; so if you accidentally add a vdev with improper ashift value to your pool, you’ve permanently screwed up the entire pool!

Setting ashift too high is, for the most part, harmless – you’ll increase the amount of slack space on your storage, but unless you have a very specialized workload this is unlikely to have any significant impact. Setting ashift too low, on the other hand, is a horrorshow. If you end up with an ashift=9 vdev on a device with 8K sectors (thus, properly ashift=13), you’ll suffer from massive write amplification penalties as ZFS needs to write, read, rewrite again over and over on the same actual hardware sector. I have personally seen improperly set ashift cause a pool of Samsung 840 Pro SSDs to perform slower than a pool of WD Black rust disks!
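To put a rough number on “massive”: in a deliberately simplified worst-case model, every write smaller than a physical sector forces the device to rewrite that whole sector. Real devices cache and coalesce, so treat this as an upper bound on the penalty, not a measurement. The function name is made up.

```python
def worst_case_write_amplification(ashift, physical_sector_bytes):
    """Bytes the device rewrites per byte ZFS asks it to write, in the worst case."""
    logical = 1 << ashift
    if logical >= physical_sector_bytes:
        return 1  # ashift at or above the real sector size: no amplification
    return physical_sector_bytes // logical

print(worst_case_write_amplification(9, 8192))    # ashift=9 on 8K sectors
print(worst_case_write_amplification(12, 8192))   # ashift=12 on 8K sectors
print(worst_case_write_amplification(13, 8192))   # ashift=13 on 8K sectors
print(worst_case_write_amplification(13, 512))    # ashift too high: no amplification
```

In this model the ashift=9 case on 8K sectors rewrites sixteen times as many bytes as it was asked to write, while setting ashift too high costs nothing but slack space.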

Even if you’ve done your homework and are absolutely certain that your disks use 512B hardware sectors, I strongly advise considering setting ashift=12 or even ashift=13 – because, remember, it’s immutable per vdev, and vdevs cannot be removed from pools. If you ever need to replace a 512B sector disk in a vdev with a 4K or 8K sector disk, you’ll be screwed if that vdev is ashift=9.