Dual-NIC fanless Celeron 1037u router test – promising!

Finally found the time to set up my little fanless Celeron 1037u router project today. So far, it’s very promising!

I installed Ubuntu Server on an elderly 4GB SD card I had lying around, with no problems other than the SD card being slow as molasses – which is no fault of the Alibaba machine, of course. Booted from it just fine. I plan on using this little critter at home and don’t want to deal with glacial I/O, though, so the next step was to reinstall Ubuntu Server on a 60GB Kingston SSD, which also had no problems.

With Ubuntu Server (14.04.3 LTS) installed, the next step was getting a basic router-with-NAT iptables config going. I used MASQUERADE so that the LAN side would have NAT, and I went ahead and set up a couple of basic service rules – including a pinhole for forwarding iperf from the WAN side to a client machine on the LAN side – and saved them in /etc/network/iptables, suitable for being restored using /sbin/iptables-restore (ruleset at the end of this post).
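(If you want that ruleset to come back on its own at boot, one straightforward way to do it – assuming p4p1 stays the WAN NIC and you’re using plain ifupdown, as stock Ubuntu Server does – is a sysctl tweak to turn on forwarding plus a pre-up hook in /etc/network/interfaces. Something along these lines:)

# /etc/sysctl.conf – the box won't route anything without IPv4 forwarding
net.ipv4.ip_forward=1

# /etc/network/interfaces – reload the saved ruleset whenever the WAN NIC comes up
auto p4p1
iface p4p1 inet dhcp
    pre-up /sbin/iptables-restore < /etc/network/iptables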

Once that was done and I’d gotten dhcpd serving IP addresses on the LAN side, I was ready to plug up the laptop and go! The results were very, very nice:

root@demoserver:~# iperf -c springbok
------------------------------------------------------------
Client connecting to 192.168.0.125, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local demoserver port 48808 connected with springbok port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec
You have new mail in /var/mail/root
root@demoserver:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local demoserver port 5001 connected with springbok port 40378
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec

935mbps up and down… not too freakin’ shabby for a lil’ completely fanless Celeron. What about OpenVPN, with 2048-bit SSL?

------------------------------------------------------------
Client connecting to 10.8.0.38, TCP port 5001
TCP window size: 22.6 KByte (default)
------------------------------------------------------------
[  3] local 10.8.0.1 port 45727 connected with 10.8.0.38 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.6 sec   364 MBytes   264 Mbits/sec 

264mbps? Yeah, that’ll do.

To be fair, though, LZO compression is enabled in my OpenVPN setup, which is undoubtedly flattering that iperf number. So let’s level the playing field and try a slightly more “real-world” test, using ssh to bring in a hefty chunk of incompressible pseudorandom data instead:

root@router:/etc/openvpn# ssh -c arcfour jrs@10.8.0.1 'cat /tmp/test.bin' | pv > /dev/null
 333MB 0:00:17 [19.5MB/s] [                         <=>                                  ]

Still rockin’ a solid 156mbps, over OpenVPN, after SSH overhead, using incompressible data. Niiiiiiice.
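(If you want to reproduce that test, the file on the far end is nothing special – just a chunk of /dev/urandom. The size below is arbitrary:)

# run on the machine that will serve up /tmp/test.bin
dd if=/dev/urandom of=/tmp/test.bin bs=1M count=500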

For posterity’s sake, here is the iptables ruleset I’m using for testing on the little Celeron.

*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]

# p4p1 is WAN interface
-A POSTROUTING -o p4p1 -j MASQUERADE

# NAT pinhole: iperf from WAN to LAN
-A PREROUTING -p tcp -m tcp -i p4p1 --dport 5001 -j DNAT --to-destination 192.168.100.101:5001

COMMIT

*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:LOGDROP - [0:0]

# create LOGDROP target to log and drop packets
-A LOGDROP -j LOG
-A LOGDROP -j DROP

##### basic global accept rules - ICMP, loopback, traceroute, established all accepted
-A INPUT -s 127.0.0.0/8 -d 127.0.0.0/8 -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -m state --state ESTABLISHED -j ACCEPT

# enable traceroute rejections to get sent out
-A INPUT -p udp -m udp --dport 33434:33523 -j REJECT --reject-with icmp-port-unreachable

##### Service rules
#
# OpenVPN
-A INPUT -p udp -m udp --dport 1194 -j ACCEPT

# ssh - drop any IP that tries more than 10 connections per minute
-A INPUT -i p4p1 -p tcp -m tcp --dport 22 -m state --state NEW -m recent --set --name DEFAULT --mask 255.255.255.255 --rsource
-A INPUT -i p4p1 -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 60 --hitcount 11 --name DEFAULT --mask 255.255.255.255 --rsource -j LOGDROP
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT

# www
-A INPUT -p tcp -m tcp --dport 80 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 443 -j ACCEPT

# default drop because I'm awesome
-A INPUT -j DROP

##### forwarding ruleset
#
# forward packets along established/related connections
-A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

# forward from LAN (p1p1) to WAN (p4p1)
-A FORWARD -i p1p1 -o p4p1 -j ACCEPT

# NAT pinhole: iperf from WAN to LAN
-A FORWARD -p tcp -d 192.168.100.101 --dport 5001 -j ACCEPT

# drop all other forwarded traffic
-A FORWARD -j DROP

COMMIT

Emoji on Ubuntu Trusty

OK, so this is maybe kinda useless. But I wanted an emoji ONCE for a presentation, and the ONE time I wanted it, the fool thing wouldn’t display on my presentation laptop, and I had to scramble at the last minute to do something that wasn’t quite as entertaining. So here’s how you fix that problem:

you@box:~$ sudo apt-get update ; sudo apt-get install ttf-ancient-fonts unifont

Poof, you got emojis. In my case, the one I wanted was this:

?

Yes, I AM comfortable in my masculinity, why do you ask…?

Blindrename.pl – a tool to aid blinded analysis in a lab setting

I made a tiny contribution to science this morning – a friend in neuroscience lamented that she couldn’t find any tools to automate the process of renaming a set of images for blinded analysis, so I made one.

https://github.com/jimsalterjrs/blindanalysis

TL;DR on what it does: you feed it a folder full of files, and it renames them all to random names while preserving their original extensions (such as .tif, .lsm, .jpeg, etc). While doing so, it creates a keyfile.csv which ties the original filename to the new, randomized filename – so that you can open up keyfile.csv in Excel, LibreOffice Calc, etc after your blind analysis is done and associate your blind results with your original data.

It’s reasonably smart and cautious – it refuses to run as root, won’t mess with dotfiles, won’t touch or traverse subdirectories, won’t let you accidentally randomize the same folder twice, and spits out human-readable errors if things go wrong.
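If you just want the gist of the algorithm without reading the Perl, here’s a rough bash sketch of the core idea – illustrative only, and missing most of the safety checks described above:

#!/bin/bash
# Rough sketch of what blindrename.pl does: random names, same extensions,
# and a keyfile.csv tying the originals to the cloaked names.
dir="$1"
keyfile="$dir/keyfile.csv"
echo '"Original Filename","Cloaked Filename"' > "$keyfile"
for f in "$dir"/*; do
  [[ -f "$f" ]] || continue                    # skip subdirectories
  name=$(basename "$f")
  [[ "$name" == "keyfile.csv" ]] && continue   # don't rename our own keyfile
  ext=""
  [[ "$name" == *.* ]] && ext=".${name##*.}"   # preserve the extension, if any
  new="$(tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 5)$ext"
  mv "$f" "$dir/$new"
  echo "\"$name\",\"$new\"" >> "$keyfile"
done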

This is what it looks like in operation:

me@banshee:~$ ls -l /tmp/test
total 24
-rw-rw-r-- 1 me me 2 Oct 14 13:44 1.tif
-rw-rw-r-- 1 me me 2 Oct 14 13:44 2.tif
-rw-rw-r-- 1 me me 2 Oct 14 13:44 3.tif
-rw-rw-r-- 1 me me 2 Oct 14 13:44 4.tif
-rw-rw-r-- 1 me me 2 Oct 14 13:44 5
drwxrwxr-x 2 me me 4096 Oct 14 12:56 subdir

me@banshee:~$ blindrename.pl /tmp/test
Renaming: 1.tif... 2.tif... 3.tif... 4.tif... 5...
5 files successfully blind renamed; keyfile saved to /tmp/test/keyfile.csv.

me@banshee:~$ ls -l /tmp/test
total 28
-rw-rw-r-- 1 me me 2 Oct 14 13:44 B4LOz.tif
-rw-rw-r-- 1 me me 2 Oct 14 13:44 Ek76e.tif
-rw-rw-r-- 1 me me 2 Oct 14 13:44 kdVFM.tif
-rw-rw-r-- 1 me me 131 Oct 14 14:02 keyfile.csv
-rw-rw-r-- 1 me me 2 Oct 14 13:44 Oklr1
drwxrwxr-x 2 me me 4096 Oct 14 12:56 subdir
-rw-rw-r-- 1 me me 2 Oct 14 13:44 wsy7e.tif

me@banshee:~$ cat /tmp/test/keyfile.csv
"Original Filename","Cloaked Filename"
"1.tif","kdVFM.tif"
"2.tif","Ek76e.tif"
"3.tif","B4LOz.tif"
"4.tif","wsy7e.tif"
"5","Oklr1"

There are no dependencies other than Perl itself, and the script is licensed GPLv3 – free for all to use, as in beer and as in speech. I hope this helps somebody (else); this task has got to come up frequently enough in all sorts of labwork that a free tool should be easy to find!

Future science workers: if this helped you and you’re feeling grateful, the EFF can always use a donation, whether large, small, or micro. =)

Another KVM storage comparison article

http://www.ilsistemista.net/index.php/virtualization/47-zfs-btrfs-xfs-ext4-and-lvm-with-kvm-a-storage-performance-comparison.html

Good stuff. Nice in-depth run of several different benchmarks for everything from fileserver to mailserver to database type usage. A bit thin on the ground for configuration, and probably no surprises here if you already read my KVM storage article from 2013, but it’s always nice to get completely independent confirmation.

The author’s hardware setup was surprisingly wimpy – an AMD Phenom II with only 8GB of RAM – which may explain part of why the advanced filesystems fared even worse, relative to the simpler ones, than expected in his testing. (ZFS did fine, but it didn’t blow the doors off of everything else the way it did in my testing, which was on a machine with four times as much RAM onboard. And btrfs absolutely tanked, across the board, whereas in my experience it’s typically more a case of “btrfs works really well until it works really badly.”) It was also interesting and gratifying to me that this article tested on CentOS, where mine tested on Ubuntu. Not that I expected any tremendous changes, but it’s always nice to see things holding up across distributions!

Thank you for the article, Gionatan – your work is appreciated!

Nagios initial configuration / NSClient++ / check_nt

A note to my future self:

Not ALL of the configs for a newly installed Nagios server are in /etc/nagios3/conf.d. Some of them are in /etc/nagios-plugins/config. In particular, the configuration for the raw check_nt command is in there, and it’s a little buggy. You’ll need to specify the correct port, and you’ll need to make it pass more than one argument along.

This is the commented-out dist definition of check_nt, followed by the correct way to define check_nt:

# 'check_nt' command definition
#define command {
# command_name check_nt
# command_line /usr/lib/nagios/plugins/check_nt -H '$HOSTADDRESS$' -v '$ARG1$'
#}

define command {
command_name check_nt
command_line /usr/lib/nagios/plugins/check_nt -H $HOSTADDRESS$ -p 12489 -v '$ARG1$' '$ARG2$'
}
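For context, a service definition that actually uses both arguments ends up looking something like this – the hostname, template, and thresholds here are just placeholders:

define service {
use generic-service
host_name winbox01
service_description CPU Load
check_command check_nt!CPULOAD!-l 5,80,90
}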

You’re welcome, future self. You’re welcome.

Blank reports in SimpleInvoices

For any fellow SimpleInvoices users – here is a godawful monkeypatch which works around the “blank reports” problem in SI.

A little background: the issue is that phpreports, the long-obsolete code SI uses to generate reports, does call-by-reference function calls in a couple of places. Worse, it does it using the eval() function on a variable populated from an object. I threw up in my mouth a little bit just typing that.

I frankly couldn’t be stuffed to chase everything ALL the way down to the end to find the object method which populates the string and fix IT, but I DID write a simple six-line “monkey patch” which at least fixes the output so that reports work again.

In library/phpmaker/PHPReportMaker.php, find the following line:

$sRst = $this->_oProc->run();

And IMMEDIATELY after that line, insert these lines:

// this is a godawful monkeypatch to keep the eval($sRst) line from
// trying to do pass-by-reference function calls. This is awful and
// I am not proud, but it does at least get reports working again.
$pcre_pattern = '/\&\$_o/'; // jrs
$pcre_replace = '\$_o'; // jrs
$sRst = preg_replace ($pcre_pattern,$pcre_replace,$sRst); // jrs
$pcre_pattern = '/\&\$o/'; // jrs
$pcre_replace = '\$o'; // jrs
$sRst = preg_replace ($pcre_pattern,$pcre_replace,$sRst); // jrs
// print $sRst; //jrs debug

This changes the function calls eval()’ed in $sRst from pass-by-reference – which looks like myFunc(&$variable) – to pass-by-value – which looks like myFunc($variable). Simple stuff, and somebody should REALLY chase down the actual code which POPULATES $sRst in the first place and fix it in phpreportmaker, but for the moment, this was enough to get reports working again in my SI.

Hope this helped somebody else.

Reshuffling pool storage on the fly

If you’re new here:

Sanoid is an open-source storage management project, built on top of the OpenZFS filesystem and Linux KVM hypervisor, with the aim of providing affordable, open source, enterprise-class hyperconverged infrastructure. Most of what we’re talking about today boils down to “managing ZFS storage” – although Sanoid’s replication management tool Syncoid does make the operation a lot less complicated.

Recently, I deployed two Sanoid appliances to a new customer in Raleigh, NC.

When the customer specced out their appliances, their plan was to deploy one production server and one offsite DR server – and they wanted to save a little money, so the servers were built out differently. Production had two SSDs and six conventional disks, but offsite DR just had eight conventional disks – not like DR needs a lot of IOPS performance, right?

Well, not so right. When I got onsite, I discovered that the “disaster recovery” site was actually a working space, with a mission critical server in it, backed up only by a USB external disk. So we changed the plan: instead of a production server and an offsite DR server, we now had two production servers, each of which replicated to the other for its offsite DR. This was a big perk for the customer, because the lower-specced “DR” appliance still handily outperformed their original server, as well as providing ZFS and Sanoid’s benefits of rolling snapshots, offsite replication, high data integrity, and so forth.

But it still bothered me that we didn’t have solid state in the second suite.

The main suite had two pools – one solid state, for boot disks and database instances, and one rust, for bulk storage (now including backups of this suite). Yes, our second suite was performing better now than it had been on their original, non-Sanoid server… but they had a MySQL instance that tended to be noticeably slow on inserts, and the desire to put that MySQL instance on solid state was just making me itch. Problem is, the client was 250 miles away, and their Sanoid Standard appliance was full – eight hot-swap bays, each of which already had a disk in it. No more room at the inn!

We needed minimal downtime, and we also needed minimal on-site time for me.

You can’t remove a vdev from an existing pool, so we couldn’t just drop the existing four-mirror pool down to three mirrors and free up a pair of bays that way. We could have stuffed the new pair of SSDs somewhere inside the case, but I really didn’t want to give up the convenience of externally accessible hot swap bays.

So what do you do?

In this case, what you do – after discussing all the pros and cons with the client decision makers, of course – is you break some vdevs. Our existing pool had four mirrors, like this:

        NAME                              STATE     READ WRITE CKSUM
        data                              ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            wwn-0x50014ee20b8b7ba0-part3  ONLINE       0     0     0
            wwn-0x50014ee20be7deb4-part3  ONLINE       0     0     0
          mirror-1                        ONLINE       0     0     0
            wwn-0x50014ee261102579-part3  ONLINE       0     0     0
            wwn-0x50014ee2613cc470-part3  ONLINE       0     0     0
          mirror-2                        ONLINE       0     0     0
            wwn-0x50014ee2613cfdf8-part3  ONLINE       0     0     0
            wwn-0x50014ee2b66693b9-part3  ONLINE       0     0     0
          mirror-3                        ONLINE       0     0     0
            wwn-0x50014ee20b9b4e0d-part3  ONLINE       0     0     0
            wwn-0x50014ee2610ffa17-part3  ONLINE       0     0     0

Each of those mirrors can be broken, freeing up one disk – at the expense of removing redundancy on that mirror, of course. At first, I thought I’d break all the mirrors, create a two-mirror pool, migrate the data, then destroy the old pool and add one more mirror to the new pool. And that would have worked – but it would have left the data unbalanced, so that the majority of reads would only hit two of my three mirrors. I decided to go for the cleanest result possible – a three mirror pool with all of its data distributed equally across all three mirrors – and that meant I’d need to do my migration in two stages, with two periods of user downtime.

First, I broke mirror-0 and mirror-1.

I detached a single disk from each of my first two mirrors, then cleared its ZFS label afterward.

    root@client-prod1:/# zpool detach data wwn-0x50014ee20be7deb4-part3 ; zpool labelclear wwn-0x50014ee20be7deb4-part3
    root@client-prod1:/# zpool detach data wwn-0x50014ee2613cc470-part3 ; zpool labelclear wwn-0x50014ee2613cc470-part3

Now mirror-0 and mirror-1 are each down to a single disk – no redundancy left on those vdevs – but the pool is still up and running, and the users (who are busily working on storage and MySQL virtual machines hosted on the Sanoid Standard appliance we’re shelled into) are none the wiser.

Now we can create a temporary pool with the two freed disks.

We’ll also be sure to set compression on by default for all datasets created on or replicated onto our new pool – something I truly wish was the default setting for OpenZFS, since for almost all possible cases, LZ4 compression is a big win.

    root@client-prod1:/# zpool create -o ashift=12 tmppool mirror wwn-0x50014ee20be7deb4-part3 wwn-0x50014ee2613cc470-part3
    root@client-prod1:/# zfs set compression=lz4 tmppool

We haven’t really done much yet, but it felt like a milestone – we can actually start moving data now!

Next, we use Syncoid to replicate our VMs onto the new pool.

At this point, these are still running VMs – so our users won’t see any downtime yet. After doing an initial replication with them up and running, we’ll shut them down and do a “touch-up” – but this way, we get the bulk of the work done with all systems up and running, keeping our users happy.

    root@client-prod1:/# syncoid -r data/images tmppool/images ; syncoid -r data/backup tmppool/backup

This took a while, but I was very happy with the performance – never dipped below 140MB/sec for the entire replication run. Which also strongly implies that my users weren’t seeing a noticeable amount of slowdown! This initial replication completed in a bit over an hour.

Now, I was ready for my first little “blip” of actual downtime.

First, I shut down all the VMs running on the machine:

    root@client-prod1:/# virsh shutdown suite100 ; virsh shutdown suite100-mysql ; virsh shutdown suite100-openvpn
    root@client-prod1:/# watch -n 1 virsh list

As soon as virsh list showed me that the last of my three VMs was down, I ctrl-C’ed out of my watch command and replicated again, to make absolutely certain that no user data would be lost.

    root@client-prod1:/# syncoid -r data/images tmppool/images ; syncoid -r data/backup tmppool/backup

This time, my replication was done in less than ten seconds.

Doing replication in two steps like this is a huge win for uptime, and a huge win for the users – while our initial replication needed a little more than an hour, the “touch-up” only had to copy as much data as the users could store in a few moments, so it was done in a flash.

Next, it’s time to rename the pools.

Our system expects to find the storage for its VMs in /data/images/VMname, so for minimum downtime and reconfiguration, we’ll just export and re-import our pools so that it finds what it’s looking for.

    root@client-prod1:/# zpool export data ; zpool import data olddata 
    root@client-prod1:/# zfs set mountpoint=/olddata/images/qemu olddata/images/qemu ; zpool export olddata

Wait, what was that extra step with the mountpoint?

Sanoid keeps the virtual machines’ hardware definitions on the zpool rather than on the root filesystem – so we want to make sure our old pool’s ‘qemu’ dataset doesn’t try to automount itself back to its original mountpoint, /etc/libvirt/qemu.

    root@client-prod1:/# zpool export tmppool ; zpool import tmppool data
    root@client-prod1:/# zfs set mountpoint=/etc/libvirt/qemu data/images/qemu

OK, at this point our original, degraded zpool still exists, intact, as an exported pool named olddata; and our temporary two-disk pool exists as an active pool named data, ready to go.

After less than one minute of downtime, it’s time to fire up the VMs again.

    root@client-prod1:/# virsh start suite100 ; virsh start suite100-mysql ; virsh start suite100-openvpn

If anybody took a potty break or got up for a fresh cup of coffee, they probably missed our first downtime window entirely. Not bad!

Time to destroy the old pool, and re-use its remaining disks.

After a couple of checks to make absolutely sure everything was working – not that it shouldn’t have been, but I’m definitely of the “measure twice, cut once” school, especially when the equipment is a few hundred miles away – we’re ready for the first completely irreversible step in our eight-disk fandango: destroying our original pool, so that we can create our final one.

    root@client-prod1:/# zpool destroy olddata
    root@client-prod1:/# zpool create -o ashift=12 newdata mirror wwn-0x50014ee20b8b7ba0-part3 wwn-0x50014ee261102579-part3
    root@client-prod1:/# zpool add -o ashift=12 newdata mirror wwn-0x50014ee2613cfdf8-part3 wwn-0x50014ee2b66693b9-part3
    root@client-prod1:/# zpool add -o ashift=12 newdata mirror wwn-0x50014ee20b9b4e0d-part3 wwn-0x50014ee2610ffa17-part3
    root@client-prod1:/# zfs set compression=lz4 newdata

Perfect! Our new, final pool with three mirrors is up, LZ4 compression is enabled, and it’s ready to go.

Now we do an initial Syncoid replication to the final, six-disk pool:

    root@client-prod1:/# syncoid -r data/images newdata/images ; syncoid -r data/backup newdata/backup

About an hour later, it’s time to shut the VMs down for Brief Downtime Window #2.

    root@client-prod1:/# virsh shutdown suite100 ; virsh shutdown suite100-mysql ; virsh shutdown suite100-openvpn
    root@client-prod1:/# watch -n 1 virsh list

Once our three VMs are down, we ctrl-C out of ‘watch’ again, and…

Time for our final “touch-up” re-replication:

    root@client-prod1:/# syncoid -r data/images newdata/images ; syncoid -r data/backup newdata/backup

At this point, all the actual data is where it should be, in the right datasets, on the right pool.

We fix our mountpoints, shuffle the pool names, and fire up our VMs again:

    root@client-prod1:/# zpool export data ; zpool import data tmppool 
    root@client-prod1:/# zfs set mountpoint=/tmppool/images/qemu tmppool/images/qemu ; zpool export tmppool
    root@client-prod1:/# zpool export newdata ; zpool import newdata data
    root@client-prod1:/# zfs set mountpoint=/etc/libvirt/qemu data/images/qemu
    root@client-prod1:/# virsh start suite100 ; virsh start suite100-mysql ; virsh start suite100-openvpn

Boom! Another downtime window over with in less than a minute.

Our TOTAL elapsed downtime was less than two minutes.

At this point, our users are up and running on the final three-mirror pool, and we won’t be inconveniencing them again today. Again we do some testing to make absolutely certain everything’s fine, and of course it is.
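(That testing isn’t anything exotic, by the way – roughly the following, plus poking at the applications themselves:)

    root@client-prod1:/# zpool status data
    root@client-prod1:/# zfs list -r data
    root@client-prod1:/# zfs get compression data
    root@client-prod1:/# virsh list --all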

The very last step: destroying tmppool.

    root@client-prod1:/# zpool destroy tmppool

That’s it; we’re done for the day.

We’re now up and running on only six total disks, not eight, which gives us the room we need to physically remove two disks. With those two disks gone, we’ve got room to slap in a pair of SSDs for a second pool with a solid-state mirror vdev when we’re (well, I’m) there in person, in a week or so. That will also take a minute or less of actual downtime – and in that case, the preliminary replication will go ridiculously fast too, since we’ll only be moving the MySQL VM (less than 20G of data), and we’ll be writing at solid state device speeds (upwards of 400MB/sec, for the Samsung 850 Pro series I’ll be using).

None of this was exactly rocket science. So why am I sharing it?

Well, it’s pretty scary going in to deliberately degrade a production system, so I wanted to lay out a roadmap for anybody else considering it. And I definitely wanted to share the actual time taken for the various steps – I knew my downtime windows would be very short, but honestly I’d been a little unsure how the initial replication would go, given that I was deliberately breaking mirrors and degrading arrays. But it went great! 140MB/sec sustained throughput makes even pretty substantial tasks go by pretty quickly – and aside from the two intervals with a combined downtime of less than two minutes, my users never even noticed anything happening.

Closing with a plug: yes, you can afford it.

If this kind of converged infrastructure (storage and virtualization) management sounds great to you – high performance, rapid onsite and offsite replication, nearly zero user downtime, and a whole lot more – let me add another bullet point: low cost. Getting started isn’t prohibitively expensive.

Sanoid appliances like the ones we’re describing here – including all the operating systems, hardware, and software needed to run your VMs and manage their storage and automatically replicate them both on and offsite – start at less than $5,000. For more information, send us an email, or call us at (803) 250-1577.

libguestfs0 and ZFS on Linux in Ubuntu

Trying to get Kimchi installed this morning, I ran into a roadblock almost immediately: libguestfs-tools depends on libguestfs0, which, on Ubuntu at least, stupidly has a hard dependency on zfs-fuse. Which is a dead project, and which conflicts with zfsutils.

In the real world, you might want libguestfs-tools without ever wanting the first thing to do with ANY form of zfs, so this dependency is a really bad idea. Even if you DO want to use libguestfs-tools WITH zfs, it’s an incredibly bad idea because zfsutils – part of ZFS on Linux – provides all the functionality needed already. Unfortunately, the package maintainers don’t seem to quite understand the issues here – I’m guessing none of them are ZFS people – so that leaves you with the need to edit the dependencies yourself.
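(You can see the offending dependency for yourself before touching anything – zfs-fuse shows up in the Depends list:)

you@box:~$ apt-cache depends libguestfs0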

Luckily, that’s not too hard. First, you’ll need a script, which we’ll call debedit:

#!/bin/bash

EDITOR=nano

if [[ -z "$1" ]]; then
  echo "Syntax: $0 debfile"
  exit 1
fi

DEBFILE="$1"
TMPDIR=`mktemp -d /tmp/deb.XXXXXXXXXX` || exit 1
OUTPUT=`basename "$DEBFILE" .deb`.modified.deb

if [[ -e "$OUTPUT" ]]; then
  echo "$OUTPUT exists."
  rm -r "$TMPDIR"
  exit 1
fi

dpkg-deb -x "$DEBFILE" "$TMPDIR"
dpkg-deb --control "$DEBFILE" "$TMPDIR"/DEBIAN

if [[ ! -e "$TMPDIR"/DEBIAN/control ]]; then
  echo DEBIAN/control not found.

  rm -r "$TMPDIR"
  exit 1
fi

CONTROL="$TMPDIR"/DEBIAN/control

MOD=`stat -c "%y" "$CONTROL"`
$EDITOR "$CONTROL"

if [[ "$MOD" == `stat -c "%y" "$CONTROL"` ]]; then
  echo Not modified.
else
  echo Building new deb...
  dpkg -b "$TMPDIR" "$OUTPUT"
fi

rm -r "$TMPDIR"

Save that, name it debedit, and chmod 755 it.

Now, you’ll need to download libguestfs0, which is the package that has the bad dependencies, which you’ll edit:

you@box:~$ apt-get download libguestfs0
you@box:~$ ./debedit libguest*deb

Remove the zfs-fuse dependency from the Depends: line in the deb file, and exit nano. Finally, install your modified libguestfs0 package:

you@box:~$ sudo dpkg -i *modified.deb ; sudo apt-get -f install

All done! At least, until and unless the next update to libguestfs0 downloads and attempts to install a new .deb that wants to put that dependency right back again, in which case you’ll need to lather-rinse-repeat.
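One way to avoid being surprised by that is to put the package on hold until you feel like repeating the dance – with the obvious tradeoff that you won’t get any updates to it at all in the meantime:

you@box:~$ sudo apt-mark hold libguestfs0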

I me-too’ed an existing bug at https://bugs.launchpad.net/ubuntu/+source/libguestfs/+bug/1053911 ; if you’re affected, you probably should too.

ZFS compression: yes, you want this

So ZFS dedup is a complete lose. What about compression?

Compression is a hands-down win. LZ4 compression should be on by default for nearly anything you ever set up under ZFS. I typically have LZ4 on even for datasets that will house database binaries… yes, really. Let’s look at two quick test runs, on a Xeon E3 server with 32GB ECC RAM and a pair of Samsung 850 EVO 1TB disks set up as a mirror vdev.

This is an inline compression torture test: we’re reading pseudorandom data (completely incompressible) and writing it to an LZ4 compressed dataset.
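(The setup for that dataset isn’t shown in the transcript, but it amounts to something like this – in.rnd being roughly 8GB pulled straight from /dev/urandom:)

root@lab:/data# zfs create data/incompressible
root@lab:/data# zfs set compression=lz4 data/incompressible
root@lab:/data# dd if=/dev/urandom of=in.rnd bs=1M count=8000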

root@lab:/data# pv < in.rnd > incompressible/out.rnd
7.81GB 0:00:22 [ 359MB/s] [==================================>] 100%

root@lab:/data# zfs get compressratio data/incompressible
NAME                 PROPERTY       VALUE  SOURCE
data/incompressible  compressratio  1.00x  -

359MB/sec write… yyyyyeah, I’d say LZ4 isn’t hurting us too terribly here – and this is a worst case scenario. What about something a little more realistic? Let’s try again, this time with a raw binary of my Windows Server 2012 R2 “gold” image (the OS is installed and Windows Updates are applied, but nothing else is done to it):

root@lab:/data/test# pv < win2012r2-gold.raw > realworld/win2012r2-gold.out
8.87GB 0:00:17 [ 515MB/s] [==================================>] 100%

Oh yeah – 515MB/sec this time. Definitely not hurting from using our LZ4 compression. What’d we score for a compression ratio?

root@lab:/data# zfs get compressratio data/realworld
NAME            PROPERTY       VALUE  SOURCE
data/realworld  compressratio  1.48x  -

1.48x sounds pretty good! Can we see some real numbers on that?

root@lab:/data# ls -lh /data/realworld/win2012r2-gold.raw
-rw-rw-r-- 1 root root 8.9G Feb 24 18:01 win2012r2-gold.raw
root@lab:/data# du -hs /data/realworld
6.2G	/data/realworld

8.9G of data in 6.2G of space… with sustained writes of 515MB/sec.

What if we took our original 8G of incompressible data, and wrote it to an uncompressed dataset?

root@lab:/data#  zfs create data/uncompressed
root@lab:/data# zfs set compression=off data/uncompressed
root@lab:/data# cat 8G.in > /dev/null ; # this is to make sure our source data is preloaded in the ARC
root@lab:/data# pv < 8G.in > uncompressed/8G.out
7.81GB 0:00:21 [ 378MB/s] [==================================>] 100% 

So, our worst case scenario – completely incompressible data – means a 5% performance hit, and a more real-world-ish scenario – copying a Windows Server installation – means a 27% performance increase. That’s on fast solid state, of course; the performance numbers will look even better on slower storage (read: spinning rust), where even worst-case writes are unlikely to slow down at all.

Yep, that’s a win.

ZFS dedup: tested, found wanting

Even if you have the RAM for it (and we’re talking a good 6GB or so per TB of storage), ZFS deduplication is, unfortunately, almost certainly a lose.
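If you’re tempted to try it anyway, you can at least estimate what dedup would buy you on a given pool before committing to it – zdb will simulate dedup and print a block histogram. Be warned that it can take a long time, and plenty of RAM of its own, on a big pool (poolname below is a placeholder):

zdb -S poolname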

I don’t usually have that much RAM to spare, but one server has 192GB of RAM and only a few terabytes of storage – and it stores a lot of VM images, with obvious serious block-level duplication between images. Dedup shows at 1.35+ on all the datasets, and would be higher if one VM didn’t have a couple of terabytes of almost dup-free data on it.

That server’s been running for a few years now, and nobody using it has complained. But I was doing some maintenance on it today, splitting up VMs into their own datasets, and saw some truly abysmal performance.

root@virt0:/data/images# pv < jabberserver.qcow2 > jabber/jabberserver.qcow2
 206MB 0:00:31 [7.14MB/s] [>                  ]  1% ETA 0:48:41

7MB/sec? UGH! And that’s not even a sustained average; that’s just where it happened to be when I killed the process. This server should be able to sustain MUCH better performance than that, even though it’s reading and writing from the same pool. So I checked, and saw that dedup was on:

root@virt0:~# zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
data  7.06T  2.52T  4.55T    35%  1.35x  ONLINE  -

In theory, you’d think that dedup would help tremendously with exactly this operation: copying a quiesced VM from one dataset to another on the same pool. There’s no need for a single block of data to be rewritten, just more pointers added to the metadata for the existing blocks. However, dedup looked like the obvious culprit for my performance woes here, so I disabled it and tried again:

root@virt0:/data/images# pv < jabberserver.qcow2 > jabber/jabberserver.qcow2
19.2GB 0:04:58 [65.7MB/s] [============>] 100%

Yep, that’s more like it.
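(For the record, “disabled it” just means something like the below – and note that turning dedup off only affects newly-written blocks; anything already deduplicated stays that way until it’s rewritten:)

root@virt0:~# zfs set dedup=off data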

TL;DR: ZFS dedup sounds like a great idea, but in the real world, it sucks. Even on a machine built to handle it. Even on exactly the kind of storage (a bunch of VMs with similar or identical operating systems) that seems tailor-made for it. I do not recommend its use for pretty much any conceivable workload.

(On the other hand, LZ4 compression is an unqualified win.)