Installing WordPress on Apache the modern way

It’s been bugging me for a while that there are no correct guides to be found about using modern Apache 2.4 or above with the Event or Worker MPMs. We’re going to go ahead and correct that lapse today, by walking through a brand-new WordPress install on a new Ubuntu 18.04 VM (grab one for $5/mo at Linode, Digital Ocean, or your favorite host).

Installing system packages

Once you’ve set up the VM itself, you’ll first need to update the package list:

root@VM:~# apt update

Once it’s updated, you’ll need to install Apache itself, along with PHP and the various extras needed for a WordPress installation.

root@VM:~# apt install apache2 mysql-server php-fpm php-common php-mbstring php-xmlrpc php-soap php-gd php-xml php-intl php-mysql php-cli php-ldap php-zip php-curl

The key bits here are Apache2, your HTTP server; MySQL, your database server; and php-fpm, which is a pool of PHP worker processes your HTTP server can connect to in order to build WordPress dynamic content as necessary.

What you absolutely, positively do not want to do here is install mod_php. If you do that, your nice modern Apache2 with its nice modern Event process model gets immediately switched back to your granddaddy’s late-90s-style prefork, loading PHP processors into every single child process, and preventing your site from scaling if you get any significant traffic!

Enable the proxy_fcgi module

Instead – and this is the bit none of the guides I’ve found mention – you just need to enable one module in Apache itself, and enable the already-installed PHP configuration module. (You will need to figure out which version of php-fpm is installed: dpkg –get-selections | grep fpm can help here if you aren’t sure.)

root@VM:~# a2enmod proxy_fcgi
root@VM:~# a2enconf php7.4-fpm.conf
root@VM:~# systemctl restart apache2

Your Apache2 server is now ready to serve PHP applications, like WordPress. (Note for more advanced admins: if you’re tuning for larger scale, don’t forget that it’s not only about the web server connections anymore; you also want to keep an eye on how many PHP worker processes you have in your pool. You’ll do that in /etc/php/[version]/fpm/pool.d/www.conf.)

Download and extract WordPress

We’re going to keep things super simple in this guide, and just serve WordPress from the existing default vhost in its standard location, at /var/www/html.

root@VM:~# cd /var/www
root@VM:/var/www# wget https://wordpress.org/latest.tar.gz
root@VM:/var/www# tar zxvf latest.tar.gz
root@VM:/var/www# chown -R www-data.www-data wordpress
root@VM:/var/www# mv html html.dist
root@VM:/var/www# mv wordpress html

Create a database for WordPress

The last step before you can browse to your new WordPress installation is creating the database itself.

root@VM:/var/www# mysql -u root

mysql> create database wordpress;
Query OK, 1 row affected (0.01 sec)

mysql> grant all on wordpress.* to 'wordpress'@'localhost' identified by 'superduperpassword';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> exit;

This created a database named wordpress, with a user named wordpress, and a password superduperpassword. That’s a bad password. Don’t actually use that password. (Also, if mysql -u root wanted a password, and you don’t have it – cat /etc/mysql/debian.cnf, look for the debian-sys-maint password, and connect to mysql using mysql -u debian-sys-maint instead. Everything else will work fine.)

note for ubuntu 20.04 / mysql 8.0 users:

MySQL changed things a bit with 8.0. grant all on db.* to ‘user’@’localhost’ identified by ‘password’; no longer works all in one step. Instead, you’ll need first to create user ‘user’@’localhost’ identified by ‘password’; then you can grant all on db.* to ‘user’@’localhost’; —you no longer need to (or can) specify password on the actual grant line itself.

All done – browser time!

Now that you’ve set up Apache, dropped the WordPress installer in its default directory, and created a mysql database – you’re ready to run through the WordPress setup itself, by browsing directly to http://your.servers.ip.address/. Have fun!

ZFS sync/async + ZIL/SLOG, explained

Recently on r/zfs, the topic of ZIL (ZFS Intent Log) and SLOG (Secondary LOG device) came up again. It’s a frequently misunderstood part of the ZFS workflow, and I had to go back and correct some of my own misconceptions about it during the thread. ixSystems has a reasonably good explainer up – with the great advantage that it was apparently error-checked by Matt Ahrens, founding ZFS developer – but it could use a diagram or two to make the workflow clear.

In the normal course of operations on a basic pool with no special devices (such as a SLOG), the write workflow looks like this:

Unless explicitly declared as synchronous (by opening with O_SYNC set, or manually calling sync()), all writes are asynchronous. And – here’s the bit I find most people misunderstand – all writes, including synchronous writes, are aggregated in RAM and committed to the pool in TXGs (Transaction Groups) on a regular basis.

The difference with sync writes is, they’re also written to a special area of the pool called the ZIL – ZFS Intent Log – in parallel with writing them to the aggregator in RAM. This doesn’t mean the sync writes are actually committed to main storage immediately; it just means they’re buffered on-disk in a way that will survive a crash if necessary. The other key difference is that any asynchronous write operation returns immediately; but sync() calls don’t return until they’ve been committed to disk in the ZIL.

I want you to go back and look at that diagram again, though, and notice that there’s no arrow coming out of the ZIL. That’s not a bug – in normal operation, blocks written to the ZIL are never read from again; the sync writes still get committed to the main pool in TXGs from RAM alongside the async writes. The sync write blocks in the ZIL get unlinked after the copies of them in RAM get written out to the pool in TXGs.

During the import process for a zpool, ZFS checks the ZIL for any dirty writes. If it finds some (due to a kernel crash or system power event), it will replay them from the ZIL, aggregating them into TXG(s), and committing the TXG(s) to the pool as normal. Once the dirty writes from the ZIL have been committed and the ZIL itself cleared, the pool import can proceed normally and we’re back to diagram 1, normal operation.

Why would we want a SLOG?

While normal operation with the ZIL works very reliably, it introduces a couple of pretty serious performance drawbacks. With any filesystem, writing small groups of blocks to disk immediately without benefit of aggregation and ordering introduces serious IOPS (I/O Operations per Second) penalties.

With most filesystems, sync writes also introduce severe fragmentation penalties for any future reads of that data. ZFS avoids the increased future fragmentation penalty by writing the sync blocks out to disk as though they’d been asynchronous to begin with. While this avoids the future read fragmentation, it introduces a write amplification penalty at the time of committing the writes; small writes must be written out twice (once to ZIL and then again later in TXGs to main storage).

Larger writes avoid some of this write amplification by committing the blocks directly to main storage, committing a pointer to those blocks to the ZIL, and then only needing to update the pointer when writing out the permanent TXG later. This is pretty effective at minimizing the write throughput amplification, but doesn’t do much to mitigate write IOPS amplification – and, please repeat with me, most storage workloads bind on IOPS.

So if your system experiences a lot of sync write operations, a SLOG – Secondary LOG device – can help. The SLOG is a special standalone vdev that takes the place of the ZIL. It performs exactly like the ZIL, it just happens to be on a separate, isolated device – which means that “double writes” due to sync don’t consume the IOPS or throughput of the main storage itself. This also means the latency of the sync write operations themselves improves, since the call to sync() doesn’t return until after the data has been committed temporarily to disk – in this case, to the SLOG, which should be nice and idle in comparison with our busy main storage vdevs.

Ideally, your SLOG device should also be extremely fast, with tons of IOPS – read “fast solid state drive” – to get that sync write latency down as low as possible. However, the only speed we care about here is write speed; the SLOG, just like the ZIL, is never read from at all during normal operation. It also doesn’t need to be very large – just enough to hold a few seconds’ worth of writes. Remember, every time ZFS commits TXGs to the pool, it unlinks whatever’s in the SLOG/ZIL!

Pictured above is the only time the SLOG gets read from – after a crash, just like the ZIL. There really is zero difference between SLOG and ZIL, apart from the SLOG being separate from the main pool vdevs in order to conserve write throughput and IOPS, and minimize sync write latency.

Should I set sync=always with a fast SLOG?

Yes, you can zfs set sync=always to force all writes to a given dataset or zvol to be committed to the SLOG. But it won’t make your asynchronous writes go any faster. Remember, asynchronous write calls already return immediately – you literally can’t improve on that, no matter what you do.

You also can’t materially improve throughput, since the SLOG is only going to buffer a few seconds of writes before main commits to the pool via TXGs from RAM kick in.

The potential benefit to setting zfs sync=always isn’t speed, it’s safety.

If you’ve got applications that notoriously write unsafely and tend to screw themselves after a power outage or other crash – eg any databases using myISAM or other non-journalling storage engine – you might decide to set zfs sync=always on the dataset or zvol containing their back ends, to make certain that you don’t end up with a corrupt db after a crash. Again, you’re not going faster, you’re going safer.

OK, what about sync=disabled?

No matter how fast a SLOG you add, setting sync=always won’t make anything go faster. Setting sync=disabled, on the other hand, will definitely speed up any workload with a lot of synchronous writes.

sync=disabled decreases latency at the expense of safety.

If you have an application that calls sync() (or opens O_SYNC) far too often for your tastes and you think it’s just a nervous nelly, setting sync=disabled forces its synchronous writes to be handled as asynchronous, eliminating any double write penalty (with only ZIL) or added latency waiting for on-disk commits. But you’d better know exactly what you’re doing – and be willing to cheerfully say “welp, that one’s on me” if you have a kernel crash or power failure, and your application comes back with corrupt data due to missing writes that it had depended on being already committed to disk.

About ZFS recordsize

ZFS stores data in records—also known as, and interchangeably referred to by OpenZFS devs as blocks—which are themselves composed of on-disk sectors. The physical sector size on disk is set by the ashift value at time of vdev creation, and is immutable. The recordsize, on the other hand, is individual to each dataset(although it can be inherited from parent datasets), and can be changed at any time you like. In 2019, recordsize defaults to 128K if not explicitly set.

The general rule of recordsize is that it should closely match the typical workload experienced within that dataset. For example, a dataset used to store high-quality JPGs, averaging 5MB or more, should have recordsize=1M. This matches the typical I/O seen in that dataset – either reading or writing a full 5+ MB JPG, with no random access within each file – quite well; setting that larger recordsize prevents the files from becoming unduly fragmented, ensuring the fewest IOPS are consumed during either read or write of the data within that dataset.

By contrast, a dataset which directly contains a MySQL InnoDB database should have recordsize=16K. That’s because InnoDB defaults to a 16KB page size, so most operations on an InnoDB database will be done in individual 16K chunks of data. Matching recordsize to MySQL’s page size here means we maximize the available IOPS, while minimizing latency on the highly sync()hronous reads and writes made by the database (since we don’t need to read or write extraneous data while handling our MySQL pages).

VMs? Match the recordsize to the VM storage format.

(That’s cluster_size, for QEMU/KVM .qcow2.)

On the other hand, if you’ve got a MySQL InnoDB database stored within a VM, your optimal recordsize won’t necessarily be either of the above – for example, KVM .qcow2 files default to a cluster_size of 64KB. If you’ve set up a VM on .qcow2 with default cluster_size, you don’t want to set recordsize any lower (or higher!) than the cluster_size of the .qcow2 file. So in this case, you’ll want recordsize=64K to match the .qcow2’s cluster_size=64K, even though the InnoDB database inside the VM is probably using smaller pages.

An advanced administrator might look at all of this, determine that a VM’s primary function in life is to run MySQL, that MySQL’s default page size is good, and therefore set both the .qcow2 cluster_size and the dataset’s recordsize to match, at 16K each.

A different administrator might look at all this, determine that the performance of MySQL in the VM with all the relevant settings left to their defaults was perfectly fine, and elect not to hand-tune all this crap at all. And that’s okay.

What if I set `recordsize` too high?

If recordsize is much higher than the size of the typical storage operation within the dataset, latency will be greatly increased and this is likely to be incredibly frustrating. IOPS will be very limited, databases will perform poorly, desktop UI will be glacial, etc.

What if I set `recordsize` too low?

If recordsize is a lot smaller than the size of the typical storage operation within the dataset, fragmentation will be greatly (and unnecessarily) increased, leading to unnecessary performance problems down the road. IOPS as measured by artificial tools will be super high, but performance profiles will be limited to those presented by random I/O at the record size you’ve set, which in turn can be significantly worse than the performance profile of larger block operations.

You’ll also screw up compression with an unnecessarily low recordsize; zfs inline compression dictionaries are per-record, and work by fitting the contents of a single logical record (aka block) into fewer on-disk sectors than it would have needed if uncompressed.

If you set compression=lz4, ashift=12, and recordsize=4K you’ll effectively have NO compression, because your sector size is equal to your recordsize—and you can’t store data in “half a sector”, so compressing a one-sector record down to 0.5 sectors’ worth of data would not result in less usage of on-disk capacity.

Meanwhile, the same dataset with the default 128K recordsize might easily have a 1.7:1 compression ratio.

Are the defaults good? Do I aim high, or do I aim low?

128K is a pretty reasonable “ah, what the heck, it works well enough” setting in general. It penalizes you significantly on IOPS and latency for small random I/O operations, and it presents more fragmentation than necessary for large contiguous files, but it’s not horrible at either task. There is a lot to be gained from tuning recordsize more appropriately for task, though.

What about bittorrent?

The “big records for big files” rule of thumb still applies for datasets used as bittorrent targets.

This is one of those cases where things work just the opposite of how you might think – torrents write data in relatively small chunks, and access them randomly for both read and write, so you might reasonably think this calls for a small recordsize. However, the actual data in the torrents is typically huge files, which are accessed in their entirety for everything but the initial bittorrent session.

Since the typical access pattern is “large-file”, most people will be better off using recordsize=1M in the torrent target storage. This keeps the downloaded data unfragmented despite the bittorrent client’s insanely random writing patterns. The data acquired during the bittorrent session in chunks is accumulated in the ZIL until a full record is available to write, since the torrent client itself is not synchronous – it writes all the time, but rarely if ever calls sync().

As a proof-of-concept, I used the Transmission client on an Ubuntu 16.04 LTS workstation to download the Ubuntu 18.04.2 Server LTS ISO, with a dataset using recordsize=1M as the target. This workstation has a pool consisting of two mirror vdevs on rust, so high levels of fragmentation would be very easy to spot.

root@locutus:/# zpool export data ; modprobe -r zfs ; modprobe zfs ; zpool import data

root@locutus:/# pv < /data/torrent/ubu*18*iso > /dev/null
 883MB 0:00:03 [ 233MB/s] [==================================>] 100%

Exporting the pool and unloading the ZFS kernel module entirely is a weapons-grade-certain method of emptying the ARC entirely; getting better than 200 MB/sec average read throughput directly from the rust vdevs afterward (the transfer actually peaked at nearly 400 MB/sec!) confirms that our torrented ISO is not fragmented.

Note that preallocation settings in your bittorrent client are meaningless when the client is saving to ZFS – you can’t actually preallocate in any meaningful way on ZFS, because it’s a copy-on-write filesystem.

Reverse proxy + Let’s Encrypt CertBot

I have a client who’s using Varnish as a caching reverse proxy for a site which sometimes experiences extremely heavy traffic that the actual web server can’t handle. The client wanted to start offering HTTPS on the website, and I suggested Let’s Encrypt, with its automatically self-renewing 90-day certs, to avoid the yearly hassle of manually renewing certificates with a provider, updating configs, etc.

Varnish does an incredibly good job at handling absurd amounts of traffic… but what it doesn’t do is SSL termination. Nginx, on the other hand, does SSL termination just fine – I’ve had difficulty in the past getting Apache to reverse-proxy HTTPS down to HTTP without hiccups, but it’s a breeze with nginx.

The problem is, how do we get Let’s Encrypt working when it isn’t installed on the root webserver? Certbot’s automatic renewal process depends on updating files on the actual website, in order to demonstrate to the CA that we are who we say we are (since our website is what DNS resolves to). The secret is nginx’s extremely flexible Location blocks, and knowing exactly where CertBot wants to create its identify files for remote verification.

Long story short, we want nginx proxying everything on https://website.tld/ to http//website.tld/ on the back end server. The first step for me was putting an entry in /etc/hosts that points website.tld to the back end server’s IP address (5.6.7.8) , not to the front-end proxy’s own (1.2.3.4).

127.0.0.1 localhost
127.0.1.1 website-proxy
1.2.3.4 website.tld www.website.tld

That way when we tell nginx that we want it to proxy requests made for website.tld to website.tld, it will correctly connect out to the back end server rather than to itself.

With that done, we create a stub directory for the site using mkdir -p /var/www/website.tld ; chown -R www-data.www-data /var/www and then create a simple site config at /etc/nginx/sites-available/website.tld:

server {
   server_name website.tld;
   server_name www.website.tld;
   root /var/www/website.tld;

   location / {
     proxy_pass http://www.website.tld;
   }

   location ^~ /.well_known {
     root /var/www/website.tld;
   }
}

This simple config proxies all requests made to website.tld except those looking for content in or under .well_known – CertBot’s challenge directory – which it serves directly from the local machine itself.

After symlinking the site live and restarting nginx – ln -s /etc/nginx/sites-available/website.tld /etc/nginx/sites-enabled/website.tld && /etc/init.d/nginx restart – we’re ready to install certbot.

root@proxy:~# apt-add-repository ppa:certbot/certbot ; apt update ; apt install -y python-certbot-nginx
                [[[ output elided ]]]

Now that certbot’s installed, we just need to run it once and let it automatically configure. Note that I chose to have Certbot automatically configure nginx to force-redirect all HTTP traffic to HTTPS for me. (And don’t judge me for not saying yes to giving the email address to the EFF: that’s because I’m already on the mailinglist, and a dues-paid-up member!)

root@proxy:~# certbot --nginx

Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator nginx, Installer nginx
Enter email address (used for urgent renewal and security notices) (Enter 'c' to cancel): you@website.tld

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please read the Terms of Service at https://letsencrypt.org/documents/LE-SA-v1.2-November-15-2017.pdf. You must agree in order to register with the ACME server at https://acme-v02.api.letsencrypt.org/directory
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(A)gree/(C)ancel: A

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Would you be willing to share your email address with the Electronic Frontier Foundation, a founding partner of the Let's Encrypt project and the non-profit organization that develops Certbot? We'd like to send you email about our work encrypting the web, EFF news, campaigns, and ways to support digital freedom.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: N

Which names would you like to activate HTTPS for?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1: website.tld
2: www.website.tld
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Select the appropriate numbers separated by commas and/or spaces, or leave input
blank to select all options shown (Enter 'c' to cancel): 
Obtaining a new certificate
Performing the following challenges:
http-01 challenge for website.tld
http-01 challenge for www.website.tld
Waiting for verification...
Cleaning up challenges
Deploying Certificate to VirtualHost /etc/nginx/sites-enabled/website.tld
Deploying Certificate to VirtualHost /etc/nginx/sites-enabled/website.tld

Please choose whether or not to redirect HTTP traffic to HTTPS, removing HTTP access.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1: No redirect - Make no further changes to the webserver configuration.
2: Redirect - Make all requests redirect to secure HTTPS access. Choose this for new sites, or if you're confident your site works on HTTPS. You can undo this change by editing your web server's configuration.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Select the appropriate number [1-2] then [enter] (press 'c' to cancel): 2
Redirecting all traffic on port 80 to ssl in /etc/nginx/sites-enabled/website.tld
Redirecting all traffic on port 80 to ssl in /etc/nginx/sites-enabled/website.tld

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Congratulations! You have successfully enabled https://website.tld and
https://www.website.tld

You should test your configuration at:
https://www.ssllabs.com/ssltest/analyze.html?d=website.tld
https://www.ssllabs.com/ssltest/analyze.html?d=www.website.tld
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
 /etc/letsencrypt/live/website.tld/fullchain.pem
 Your key file has been saved at:
 /etc/letsencrypt/live/website.tld/privkey.pem
 Your cert will expire on 2019-03-22. To obtain a new or tweaked
 version of this certificate in the future, simply run certbot again
 with the "certonly" option. To non-interactively renew *all* of
 your certificates, run "certbot renew"
 - Your account credentials have been saved in your Certbot
 configuration directory at /etc/letsencrypt. You should make a
 secure backup of this folder now. This configuration directory will
 also contain certificates and private keys obtained by Certbot so
 making regular backups of this folder is ideal.
 - If you like Certbot, please consider supporting our work by:

Donating to ISRG / Let's Encrypt: https://letsencrypt.org/donate
 Donating to EFF: https://eff.org/donate-le

That’s it. You should test the renewal process once using certbot renew --dry-run, and make sure it works (it should; mine certainly did). There’s already a cron job created for you in /etc/cron.d, so you don’t need to do anything else – you’ve got working SSL on your reverse proxy server now.

Disable Twitter vibration on Android (8.0 and up)

Holy CRAP I can’t believe how difficult this was to figure out.

TL;DR: Settings –> Apps & Notifications –> Twitter –> Notifications

Now here’s the super crazy part. See all those checkboxes, to turn notifications on or off entirely for various events – Direct Messages, Emergency alerts, etc? Yeah, ignore the checkboxes, and tap the TEXT. Now tap “Behavior”, and set to “Show silently.” You will now continue to get notifications, but stop getting sounds and insanely-irritating vibrations, for each event type you change Behavior on.

Tap event category NAME (NOT checkbox!) –> Behavior –> Show silently

I’ve been suffering from over-vibrating apps, with Twitter being the absolute worst offender, for months. There still weren’t any vaguely decent how-tos today, but I finally pieced it together from vague clues on an androidcentral forum post.

VLANs with KVM guests on Ubuntu 18.04 / netplan

There is a frustrating lack of information on how to set up multiple VLAN interfaces on a KVM host out there. I made my way through it in production today with great applications of thud and blunder; here’s an example of a working 01-netcfg.yaml with multiple VLANs on a single (real) bridge interface, presenting as multiple bridges.

Everything feeds through properly so that you can bring KVM guests up on br0 for the default VLAN, br100 for VLAN 100, or br200 for VLAN 200. Adapt as necessary for whatever VLANs you happen to be using.

# This file describes the network interfaces available on your system
# For more information, see netplan(5).
network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: no
      dhcp6: no
    eno2:
      dhcp4: no
      dhcp6: no
  vlans:
    br0.100:
      link: br0
      id: 100
    br0.200:
      link: br0
      id: 200
  bridges:
    br0:
      interfaces:
        - eno1
        - eno2
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.0.2/24 ]
      gateway4: 10.0.0.1
      nameservers:
        addresses: [ 8.8.8.8,1.1.1.1 ]
    br100:
      interfaces:
        - br0.100
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.100.1/24 ]
    br200:
      interfaces:
        - br0.200
      dhcp4: no
      dhcp6: no
      addresses: [ 10.0.200.1/24 ]

ZFS write allocation in 0.7.x

In an earlier post, I demonstrated why you shouldn’t mix rust and SSDs – reads on your pool bind at the speed of the slowest vdev; effectively making SSDs in a pool containing rust little more than extremely small, expensive rust disks themselves. That post was a follow-up to an even earlier post demonstrating that – as of 0.6.x – ZFS did not allocate writes to the lowest latency vdev.

An update to the Storage Pool Allocator (SPA) has changed the original write behavior; as of 0.7.0 (and Ubuntu Bionic includes 0.7.5) writes really are allocated to the lowest-latency vdev in the pool. To test this, I created a throwaway pool on a system with both rust and SSD devices on board. This isn’t the cleanest test possible – the vdevs are actually sparse files created on, respectively, an SSD mdraid1 and another pool consisting of on rust mirror vdev. It’s good enough for government work, though, so let’s see how small-block random write operations are allocated when you’ve got one rust vdev and one SSD vdev:

root@demo0:/tmp# zpool create -oashift=12 test /tmp/rust.bin /tmp/ssd.bin
root@demo0:/tmp# zfs set compression=off test

root@demo0:/tmp# fio --name=write --ioengine=sync --rw=randwrite \
--bs=16K --size=1G --numjobs=1 --end_fsync=1

[...]

Run status group 0 (all jobs):
WRITE: bw=204MiB/s (214MB/s), 204MiB/s-204MiB/s (214MB/s-214MB/s),
io=1024MiB (1074MB), run=5012-5012msec

root@demo0:/tmp# du -h /tmp/ssd.bin ; du -h /tmp/rust.bin
1.8M /tmp/ssd.bin
237K /tmp/rust.bin

Couldn’t be much clearer – 204 MB/sec is higher throughput than a single rust mirror can manage for 16K random writes, and almost 90% of the write operations were committed to the SSD side. So the SPA updates in 0.7.0 work as intended – when pushed to the limit, ZFS will now allocate far more of its writes to the fastest vdevs available in the pool.

I italicized that for a reason, of course. When you don’t push ZFS hard with synchronous, small-block writes like we did with fio above, it still allocates according to free space available. To demonstrate this, we’ll destroy and recreate our hybrid test pool – and this time, we’ll write a GB or so of random data sequentially and asynchronously, using openssl to rapidly generate pseudo-random data, which we’ll pipe through pv into a file on our pool.

root@demo0:/tmp# zpool create -oashift=12 test /tmp/rust.bin /tmp/ssd.bin root@demo0:/tmp# zfs set compression=off test 

root@demo0:~# openssl enc -aes-256-ctr -pass \
              pass:"$(dd if=/dev/urandom bs=128 \
                    count=1 2>/dev/null | base64)" \
              -nosalt < /dev/zero | pv > /test/randomfile.bin

 1032MiB 0:00:04 [ 370MiB/s] [    <=>                 ] ^C

root@demo0:~# du -h /tmp/*bin
root@demo0:~# du -h /tmp/*bin
571M /tmp/rust.bin
627M /tmp/ssd.bin

Although we wrote our pseudorandom data very rapidly to the pool, in this case we did so sequentially and asynchronously, rather than in small random access blocks and synchronously. And in this case, our writes were committed near-equally to each vdev, despite one being immensely faster than the other.

Please note that this describes the SPA’s behavior when allocating writes at the pool level – it has nothing at all to do with the behavior of individual vdevs which have both rust and SSD member devices. My recent test of half-rust/half-SSD mirror vdevs was also run on Bionic with ZFS 0.7.5, and demonstrated conclusively that even read behavior inside a vdev doesn’t favor lower-latency devices, let alone write behavior.

The new SPA code is great, and it absolutely does improve write performance on IOPS-saturated pools. However, it is not intended to enable the undying dream of mixing rust and SSD storage willy-nilly, and if you try to do so anyway, you’re gonna have a bad time.

I still do not recommend mixing SSDs and rust in the same pool, or in the same vdev.

ZFS tuning cheat sheet

Quick and dirty cheat sheet for anyone getting ready to set up a new ZFS pool. Here are all the settings you’ll want to think about, and the values I think you’ll probably want to use.

I am not generally a fan of tuning things unless you need to, but unfortunately a lot of the ZFS defaults aren’t optimal for most workloads.

SLOG and L2ARC are special devices, not parameters… but I included them anyway. Lean into it.

parameter	*best value**	why / what does it do?
ashift	12	Ashift tells ZFS what the underlying physical block size your disks use is. It’s in bits, so ashift=9 means 512B sectors (used by all ancient drives), ashift=12 means 4K sectors (used by most modern hard drives), and ashift=13 means 8K sectors (used by some modern SSDs). If you get this wrong, you want to get it wrong high. Too low an ashift value will cripple your performance. Too high an ashift value won’t have much impact on almost any normal workload. Ashift is per vdev, and immutable once set. This means you should manually set it at pool creation, and any time you add a vdev to an existing pool, and should never get it wrong because if you do, it will screw up your entire pool and cannot be fixed.
xattr	sa	Sets Linux eXtended ATTRibutes directly in the inodes, rather than as tiny little files in special hidden folders. This can have a significant performance impact on datasets with lots of files in them, particularly if SELinux is in play. Unlikely to make any difference on datasets with very few, extremely large files (eg VM images).
compression	lz4	Compression defaults to off, and that’s a losing default value. Even if your data is incompressible, your slack space is (highly) compressible. LZ4 compression is faster than storage. Yes, really. Even if you have a $50 tinkertoy CPU and a blazing-fast SSD. Yes, really. I’ve tested it. It’s a win. You might consider gzip compression for datasets with highly compressible files. It will have better compression rate but likely lower throughput. YMMV, caveat imperator.
atime	off	If atime is on – which it is by default – your system has to update the “Accessed” attribute of every file every time you look at it. This can easily double the IOPS load on a system all by itself. Do you care when the last time somebody opened a given file was, or the last time they ls’d a directory? Probably not. Turn this off.
recordsize	64K	If you have files that will be read from or written to in random batches regularly, you want to match the recordsize to the size of the reads or writes you’re going to be digging out of / cramming into those large files. For most database binaries or VM images, 64K is going to be either an exact match to the VM’s back end storage cluster size (eg the default cluster_size=64K on QEMU’s QCOW2 storage) or at least a better one than the default recordsize, 128K. If you’ve got a workload that wants even smaller blocks—for example, 16KiB to match MySQL InnoDB or 8KiB to match PostgreSQL back-ends—you should tune both ZFS recordsize and the VM back end storage (where applicable) to match. This can improve the IOPS capability of an array used for db binaries or VM images fourfold or more.
recordsize	1M	Wait, didn’t we just do recordsize…? Well, yes, but different workloads call for different settings if you’re tuning. If you’re only reading and writing in fairly large chunks – for example, a collection of 5-8MB JPEG images from a camera, or 100GB movie files, either of which will not be read or written random access – you’ll want to set recordsize=1M, to reduce the IOPS load on the system by requiring fewer individual records for the same amount of data. This can also increase compression ratio, for compressible data, since each record uses its own individual compression dictionary. If you’re using bittorrent, recordsize=16K results in higher possible bittorrent write performance… but recordsize=1M results in lower overall fragmentation, and much better performance when reading the files you’ve acquired by torrent later.
SLOG	maybe	SLOG isn’t a setting, it’s a special vdev type that acts as a write aggregation layer for the entire pool. It only affects synchronous writes – asynchronous writes are already aggregated in the ZIL in RAM. SLOG doesn’t need to be a large device; it only has to accumulate a few seconds’ worth of writes… but if you’re using NAND flash, it probably should be a large device, since write endurance is proportional to device size. On systems that need a LOG vdev in the first place, that LOG vdev will generally get an awful lot of writes. Having a LOG vdev means that synchronous writes perform like asynchronous writes; it doesn’t really act like a “write cache” in the way new ZFS users tend to hope it will. Great for databases, NFS exports, or anything else that calls sync() a lot. Not too useful for more casual workloads.
L2ARC	nope!	L2ARC is a layer of ARC that resides on fast storage rather than in RAM. It sounds amazing – super huge super fast read cache! Yeah, it’s not really like that. For one thing, L2ARC is ephemeral – data in L2ARC doesn’t survive reboots. For another thing, it costs a significant amount of RAM to index the L2ARC, which means now you have a smaller ARC due to the need for indexing your L2ARC. Even the very fastest SSD is a couple orders of magnitude slower than RAM. When you have to go to L2ARC to fetch data that would have fit in the ARC if it hadn’t been for needing to index the L2ARC, it’s a massive lose. Most people won’t see any real difference at all after adding L2ARC. A significant number of people will see performance decrease after adding L2ARC. There is such a thing as a workload that benefits from L2ARC… but you don’t have it. (Think hundreds of users, each with extremely large, extremely hot datasets.)

* “best” is always debatable. Read reasoning before applying. No warranties offered, explicit or implied.

Routing between wg interfaces with WireGuard

Aha! This was the last piece I was really looking for with WireGuard. It gets a bit tricky when you want packets to route between WireGuard clients. But once you grok how it works, well, it works.

This also works for passing traffic between WireGuard clients on the same interface – the trick is in making certain that AllowedIPs in the client configs includes the entire IP subnet services by the server, not just the single IP address of the server itself (with a /32 subnet)… and that you not only set up the tunnel on each client, but initialize it with a bit of data as well.

Set up your server with two WireGuard interfaces:

root@server:~# touch /etc/wireguard/keys/server.wg0.key
root@server:~# chmod 600 /etc/wireguard/keys/server.wg0.key 
root@server:~# wg genkey > /etc/wireguard/keys/server.wg0.key root@server:~# wg pubkey < /etc/wireguard/keys/server.wg0.key > /etc/wireguard/keys/server.wg0.pub

root@server:~# touch /etc/wireguard/keys/server.wg1.key
root@server:~# chmod 600 /etc/wireguard/keys/server.wg1.key 
root@server:~# wg genkey > /etc/wireguard/keys/server.wg1.key root@server:~# wg pubkey < /etc/wireguard/keys/server.wg1.key > /etc/wireguard/keys/server.wg1.pub

Don’t forget to make sure ipv4 forwarding is enabled on your server:

root@server:~# sed -i 's/^#net\.ipv4\.ip_forward=1/net.ipv4.ip_forward=1/' /etc/sysctl.conf
root@server:~# sysctl -p

Now set up your wg0.conf:

# server.wg0.conf

[Interface] 
   Address = 10.0.0.1/24 
   ListenPort = 51820
   PrivateKey = WG0_SERVER_PRIVATE_KEY
   SaveConfig = false

[Peer] 
   # client1 
   PublicKey = PUBKEY_FROM_CLIENT_ONE
   AllowedIPs = 10.0.0.2/32

And your wg1.conf:

# server.wg1.conf

[Interface] 
 Address = 10.0.1.1/24 
 ListenPort = 51821
 PrivateKey = WG1_SERVER_PRIVATE_KEY
 SaveConfig = false

[Peer] 
 # client2 
 PublicKey = PUBKEY_FROM_CLIENT_TWO
 AllowedIPs = 10.0.1.2/32

Gravy. Now enable both interfaces, and bring them online.

root@server:~# systemctl enable wg-quick@wg0 ; systemctl enable wg-quick@wg1
root@server:~# systemctl start wg-quick@wg0 ; systemctl start wg-quick@wg1

Server’s done. Now, set up your clients.

Client install, multi-wg server:

Client one will connect to the server’s wg0, and client two will connect to the server’s wg1. After creating your keys, set them up as follows:

# /etc/wireguard/wg0.conf on Client1
#    connecting to server/wg0
 
[Interface]
   Address = 10.0.0.2/24
   PrivateKey = PRIVATE_KEY_FROM_CLIENT1
   # set up routing from server/wg0 to server/wg1
   PostUp = route add -net 10.0.1.0/24 gw 10.0.0.1 ; ping -c1 10.0.0.1
   PostDown = route delete -net 10.0.1.0/24 gw 10.0.0.1
   SaveConfig = false

[Peer]
   PublicKey = PUBKEY_FROM_SERVER
   AllowedIPs = 10.0.0.1/24, 10.0.1.1/24
   Endpoint = wireguard.yourdomain.tld:51820

Now set up wg0 on Client2:

# /etc/wireguard/wg0.conf on Client2
#   connecting to server/wg1 

[Interface]
   Address = 10.0.1.2/24
   PrivateKey = PRIVATE_KEY_FROM_CLIENT2
   # set up routing from server/wg1 to server/wg0
   PostUp = route add -net 10.0.0.0/24 gw 10.0.1.1 ; ping -c1 10.0.1.1
   PostDown = route delete -net 10.0.0.0/24 gw 10.0.1.1
   SaveConfig = false

[Peer]
   PublicKey = PUBKEY_FROM_SERVER
   AllowedIPs = 10.0.0.1/24, 10.0.1.1/24
   Endpoint = wireguard.yourdomain.tld:51821

Now start up the client interfaces. First, Client1:

root@client1:~# systemctl start wg-quick@wg0
root@client1:~# wg
interface: wg0
 public key: MY_PUBLIC_KEY
 private key: (hidden)
 listening port: some-random-port

peer: SERVER_PUBLIC_KEY
 endpoint: server-ip-address:51820
 allowed ips: 10.0.0.0/24, 10.0.1.0/24
 latest handshake: 3 seconds ago
 transfer: 2.62 KiB received, 3.05 KiB sent

Now Client2:

root@client2:~# systemctl start wg-quick@wg0
root@client2:~# wg
interface: wg0
 public key: MY_PUBLIC_KEY
 private key: (hidden)
 listening port: some-random-port

peer: SERVER_PUBLIC_KEY
 endpoint: server-ip-address:51821
 allowed ips: 10.0.0.0/24, 10.0.1.0/24
 latest handshake: 4 seconds ago
 transfer: 2.62 KiB received, 3.05 KiB sent

Now that both clients are connected, we can successfully send traffic back and forth between client1 and client2.

root@client1:~# ping -c3 10.0.1.2
PING 10.0.1.2 (10.0.1.2) 56(84) bytes of data.
64 bytes from 10.0.1.2: icmp_seq=1 ttl=63 time=80.4 ms
64 bytes from 10.0.1.2: icmp_seq=2 ttl=63 time=83.5 ms
64 bytes from 10.0.1.2: icmp_seq=3 ttl=63 time=83.2 ms

--- 10.0.1.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 80.492/82.410/83.501/1.380 ms

Sweet.

The step some of you undoubtedly missed:

There’s a little trick that will frustrate you to no end if you glossed over the client config files without paying attention.

PostUp = route add -net 10.0.0.0/24 gw 10.0.1.1 ; ping -c1 10.0.1.1

If you left out that ping command in one or both of the clients, then you won’t be able to ping client1 from client2, or vice versa, until each client has sent some data down the tunnel. This frustrated the crap out of me initially – ping client2 from client1 would be broken, but then ping client1 from client2 would work and then ping client2 from client1 would work. The solution, as shown, is just to make sure we immediately chuck a packet down the tunnel whenever each client connects; then everything works as intended.

What if I don’t want multiple interfaces?

If all you want to do is pass traffic from one client to the next, you don’t need two interfaces on the server, and you don’t need PostUp and PostDown route commands, either. The trick is that you do need to make sure that AllowedIPs on each client is set to a range that includes the other client, and you do need the PostUp ping -c1 server-ip-addresscommand on each client as well. Without that PostUp ping, you’re going to get frustrated by connectivity that “sometimes works and sometimes doesn’t”.

If you want to give access to some clients but not all clients, you can do that by setting multiple AllowedIPs arguments on the clients, like so:

[Peer]
   PublicKey = PUBKEY_FROM_SERVER
   # this stanza allows access from the server (.1), client one (.2),
   # and client two (.3) - but not from any clients at .4-.254.
   AllowedIPs = 10.0.0.1/32, 10.0.0.2/32, 10.0.0.3/32
   Endpoint = wireguard.yourdomain.tld:51820

That is a sample [Peer] stanza of a client wg config, not a[Peer] stanza of the server wg config! The[Peer] stanzas of the server config should only allow connection to a single IP (using a /32 subnet) for each individual[Peer] definition.

If you try to set AllowedIPs 10.0.0.0/24 on both client1 and client2’s[Peer] stanzas in the server’s wg config, you’ll break one or the other client – they can’t BOTH be allowed the entire subnet.

This bit’s a little confusing, I know, but that’s how it works.

As always, caveat imperator.

Just like the last couple of posts, everything here is tested and working. Also just like the last couple of posts, this is day zero of my personal experience with Wireguard. While everything here works, I still have not delved into any of the actual crypto components’ configurability or personally evaluated/researched how sane the defaults are.

These configs absolutely will encrypt the data sent down the tunnel – I’ve verified that much! – but I cannot offer a well-established and researched opinion on how well they encrypt that data, or what weaknesses there might be. I have no particular reason to believe they’re not secure, but I haven’t done any donkey-work to thoroughly establish that they are secure either.

Caveat imperator.

I may come back to edit out these dire warnings in a few weeks or months as I hammer at this stuff more. Or I might forget to! Feel free to tweet @jrssnet if you want to ask how or if things have changed; I am definitely going to continue my evaluation and testing.

Working VPN-gateway configs for WireGuard

Want to set up a simple security VPN, that routes all your internet traffic out of a potentially hostile network through a trusted VM somewhere? Here you go. Note that while all this is tested and working, this is still literal day zero of my personal experience with Wireguard; in particular while Wireguard claims to use only the most secure crypto (the best, everybody says that!) I not only have not really investigated that, I don’t know how to configure that part of it, so this is just using whatever the WG defaults are. Caveat imperator.

Installing Wireguard, generating keys:

This first set of steps is the same for all machines. Substitute the actual machine name as appropriate; you want to make sure you know which of these keys is which later on down the line, so actually name them and don’t be sloppy about it.

root@machine:~# apt-add-repository ppa:wireguard/wireguard ; apt update ; apt install wireguard-dkms wireguard-tools

root@machinename:~# mkdir /etc/wireguard/keys
root@machinename:~# chmod 700 /etc/wireguard/keys

root@machinename:~# touch /etc/wireguard/keys/machinename.wg0.key
root@machinename:~# chmod 600 /etc/wireguard/keys/machinename.wg0.key

root@machinename:~# wg genkey > /etc/wireguard/keys/machinename.wg0.key
root@machinename:~# wg pubkey < /etc/wireguard/keys/machinename.wg0.key > /etc/wireguard/keys/machinename.wg0.pub

OK, you’ve installed wireguard on your server VM and one or two clients, and you’ve generated some keys.

Setting up your server VM:

Create your config file on the server, at /etc/wireguard/wg0.conf:

[Interface] 
   Address = 10.0.0.1/24 
   ListenPort = 51820 
   PrivateKey = YOUR_SERVER_PRIVATE_KEY
   SaveConfig = false
 
   # Internet Gateway config: nat wg1 out to the internet on eth0 
   PostUp = iptables -A FORWARD -i wg1 -j ACCEPT; iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE 
   PostDown = iptables -D FORWARD -i wg1 -j ACCEPT; iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer] 
   # Client1
   PublicKey = PUBLIC_KEY_FROM_CLIENT1
   AllowedIPs = 10.0.0.2/32

[Peer] 
   # Client2 
   PublicKey = PUBLIC_KEY_FROM_CLIENT2
   AllowedIPs = 10.0.0.3/32

Now you’ll need to enable ipv4 forwarding in /etc/sysctl.conf.

root@server:~# sed -i 's/^#net\.ipv4\.ip_forward=1/net.ipv4.ip_forward=1/' /etc/sysctl.conf
root@server:~# sysctl -p

Enable your wg0 interface to start automatically at boot, and bring it up:

root@server:~# sysctl enable wg-quick@wg0
root@server:~# sysctl start wg-quick@wg0

Server should be good to go now.

Setting up your clients:

Client setup is a bit simpler; all you really need is the /etc/wireguard/wg0.conf file itself.

[Interface] 
   # CLIENT1 
   Address = 10.0.0.2/24 
   PrivateKey = CLIENT1_PRIVATE_KEY
   SaveConfig = false

   # the DNS line is broken on 18.04 due to lack of resolvconf 
   # DNS = 1.1.1.1

[Peer] 
   # SERVER 
   PublicKey = PUBLIC_KEY_FROM_SERVER
   Endpoint = wireguard.yourdomain.fqdn:51820

   # gateway rule - send all traffic out over the VPN
   AllowedIPs = 0.0.0.0/0

Note that I have the DNS = 1.1.1.1 line commented out above – its syntax is correct, and it works fine on Ubuntu 16.04, but on 18.04 it will cause the entire interface not to come up due to a lack of installed resolvconf.

You can use sysctl enable wg-quick@wg0 to have the wg0 interface automatically start at boot the way we did on the server, but you likely won’t want to. Without enabling it to start automatically at boot, you can use sysctl start wg-quick@wg0 by itself to manually start it, and sysctl stop wg-quick@wg0 to manually disconnect it. Or if you’re not in love with systemd, you can accomplish the same thing with the raw wg-quick commands: wg-quick up wg0 to start it, and wg-quick down wg0 to bring it down again. Your choice.

What about Windows? Android? Etc?

You can use TunSafe as a Windows client, and the WireGuard app on Android. Setup steps will basically be the same as shown above. On a Mac, you can reportedly brew install wireguard-tools and have everything work as above (though you’ll need to invoke wg-quick directly; systemd isn’t a thing there).

If you’ve rooted your Android phone, you can build a kernel that includes the Wireguard kernel module; if you haven’t, stock kernels work fine – the Android app just runs in userspace mode, which is somewhat less efficient. (You’re currently stuck in userspace mode on a Mac no matter what, AFAIK; not sure what the story is with TunSafe on Windows.)

If you’re using iOS, there’s a Git repository that purports to be a Wireguard client for iPhone/iPad; but good f’n luck actually doing anything with it unless you’re pretty deep into the iOS development world already.