Open Source – Page 8 – JRS Systems: the blog

Heartbleed SSL vulnerability

Last night (2014 Apr 7) a massive security vulnerability was publicly disclosed in OpenSSL, the library that encrypts most of the world’s sensitive traffic. The bug in question is approximately two years old – systems older than 2012 are not vulnerable – and affects the TLS “heartbeat” function, which is why the vulnerability has been nicknamed HeartBleed.

The bug allows a malicious remote user to scan arbitrary 64K chunks of the affected server’s memory. This can disclose any and ALL information in that affected server’s memory, including SSL private keys, usernames and passwords of ANY running service accepting logins, and more. Nobody knows if the vulnerability was known or exploited in the wild prior to its public disclosure last night.

If you are an end user:

You will need to change any passwords you use online unless you are absolutely sure that the servers you used them on were not vulnerable. If you are not a HIGHLY experienced admin or developer, you absolutely should NOT assume that sites and servers you use were not vulnerable. They almost certainly were. If you are a highly experienced ops or dev person… you still absolutely should not assume that, but hey, it’s your rope, do what you want with it.

Note that most sites and servers are not yet patched, meaning that changing your password right now will only expose that password as well. If you have not received any notification directly from the site or server in question, you may try a scanner like the one at http://filippo.io/Heartbleed/ to see if your site/server has been patched. Note that this script is not bulletproof, and in fact it’s less than 24 hours old as of the time of this writing, up on a free site, and under massive load.

The most important thing for end users to understand: You must not, must not, MUST NOT reuse passwords between sites. If you have been using one or two passwords for every site and service you access – your email, forums you post on, Facebook, Twitter, chat, YouTube, whatever – you are now compromised everywhere and will continue to be compromised everywhere until ALL sites are patched. Further, this will by no means be the last time a site is compromised. Criminals can and absolutely DO test compromised credentials from one site on other sites and reuse them elsewhere when they work! You absolutely MUST use different passwords – and I don’t just mean tacking a “2” on the end instead of a “1”, or similar cheats – on different sites if you care at all about your online presence, the money and accounts attached to your online presence, etc.

If you are a sysadmin, ops person, dev, etc:

Any systems, sites, services, or code that you are responsible for needs to be checked for links against OpenSSL versions 1.0.1 through 1.0.1f. Note, that’s the OpenSSL vendor versioning system – your individual distribution, if you are using repo versions like a sane person, may have different numbering schemes. (For example, Ubuntu is vulnerable from 1.0.1-0 through 1.0.1-4ubuntu5.11.)

Examples of affected services: HTTPS, IMAPS, POP3S, SMTPS, OpenVPN. Fabulously enough, for once OpenSSH is not affected, even in versions linking to the affected OpenSSL library, since OpenSSH did not use the Heartbeat function. If you are a developer and are concerned about code that you wrote, the key here is whether your code exposed access to the Heartbeat function of OpenSSL. If it was possible for an attacker to access the TLS heartbeat functionality, your code was vulnerable. If it was absolutely not possible to check an SSL heartbeat through your application, then your application was not vulnerable even if it linked to the vulnerable OpenSSL library.

In contrast, please realize that just because your service passed an automated scanner like the one linked above doesn’t mean it was safe. Most of those scanners do not test services that use STARTTLS instead of being TLS-encrypted from the get-go, but services using STARTTLS are absolutely still affected. Similarly, none of the scanners I’ve seen will test UDP services – but UDP services are affected. In short, if you as a developer don’t absolutely know that you weren’t exposing access to the TLS heartbeat function, then you should assume that your OpenSSL-using application or service was/is exploitable until your libraries are brought up to date.

You need to update all copies of the OpenSSL library to 1.0.1g or later (or your distribution’s equivalent), both dynamically AND statically linked (PS: stop using static links, for exactly things like this!), and restart any affected services. You should also, unfortunately, consider any and all credentials, passwords, certificates, keys, etc. that were used on any vulnerable servers, whether directly related to SSL or not, as compromised and regenerate them. The Heartbleed bug allowed scanning ALL memory on any affected server and thus could be used by a sufficiently skilled user to extract ANY sensitive data held in server RAM. As a trivial example, as of today (2014-Apr-08) users at the Ars Technica forums are logging on as other users using password credentials held in server RAM, as exposed by standard exploit test scripts publicly disclosed.

Completely eradicating all potential vulnerability is a STAGGERING amount of work and will involve a lot of user disruption. When estimating your paranoia level, please do remember that the bug itself has been in the wild since 2012 – the public disclosure was not until 2014-Apr-07, but we have no way of knowing how long private, possibly criminal entities have been aware of and/or exploiting the bug in the wild.

Restoring Legacy Boot (Linux Boot) on a Chromebook

press ctrl-alt-forward to jump to TTY2 and a standard login prompt

I let the battery die completely on my Acer C720 Chromebook, and discovered that unfortunately if you do that, your Chromebook will no longer Legacy boot when you press Ctrl-L – it just beeps at you despondently, with no error message to indicate what’s going wrong.

Sadly, I found message after message on forums indicating that people encountering this issue just reinstalled ChrUbuntu from scratch. THIS IS NOT NECESSARY!

If you just get several beeps when you press Ctrl-L to boot into Linux on your Chromebook, don’t fret – press Ctrl-D to boot into ChromeOS, but DON’T LOG IN. Instead, change terminals to get a shell. The function keys at the top of the keyboard (the row with “Esc” at the far left) map to the F-keys on a normal keyboard, and ctrl-alt-[Fkey] works here just as it would in Linux. The [forward arrow] key two keys to the right of Esc maps to F2, so pressing ctrl-alt-[forward arrow on top row] will bring you to tty2, which presents you with a standard Linux login prompt.

sudo crossystem dev_boot_usb=1 dev_boot_legacy=1

That’s it. You’re now ready to reboot and SUCCESSFULLY Ctrl-L into your existing Linux install.

btrfs RAID awesomeness

I am INDECENTLY excited about some plans for further btrfs development that I just got confirmed on the mailing list.

Btrfs can already support a redundant array on an arbitrary collection of “mutt” hard drives via btrfs-raid1 – as an example, say you’ve got five hard drives lying around; a 4TB drive, two 2TB drives, a 1TB drive, and an old 750GB drive. You can dump all of those mutts into one btrfs-raid1 array for a total of 9.75TB of raw capacity, and roughly half that (4.5TB-ish) of usable capacity. The btrfs-raid1 will just store two copies of each data block it writes on two separate drives, doing its best to keep each drive in the array about the same percentage of its overall capacity “full”.

This is actually a huge win for midmarket and enterprise as well as for small business and prosumers – if you assume a more business-y environment with matched drives, the performance characteristics of an array like this are very interesting; you aren’t tied to stripes, you don’t have to do a read-write-rewrite dance when data is modified, and you don’t have to have an extremely performance-penalized restriping event if a drive fails and is replaced – the array can just quickly write new redundant copies of any orphaned blocks out to the new replacement drives, without needing to disturb (or, worse, read / write / rewrite) blocks that weren’t orphaned.

For small business and especially prosumer, of course, the benefits are even more obvious – the world is littered with people who want to throw an arbitrary number of arbitrarily sized “mutt” drives they have lying around into one big array, and finally, they can do so – quickly, easily, effectively. The only drawback is that you’re locked into n/2 storage efficiency – since this is still a system that relies on a redundant copy of each data block to be written to a different disk than the first copy is written to, the 9.75TB array up there will only be able to store about 4.5TB of data, which is just not efficient enough for a lot of people, especially the small business and prosumer types.

OK, now we’ve wrapped our brain around the idea of a “raid1” that just distributes data blocks evenly around an arbitrary array while making sure that redundant blocks are on separate physical disks – and can handle weird numbers and sizes of disks with aplomb. (I’ve personally tested a btrfs-RAID1 with three equal sized drives, and yes, it can store half the raw capacity of all three drives just fine.) What’s next?

For those folks who aren’t satisfied with n/2 storage efficiency… there are plans on the roadmap for striped/parity storage that ALSO isn’t tied to the number of physical drives present. Let’s say you have five disks and you’re satisfied with single-disk fault tolerance – you might choose to create a striped array with four data blocks and one parity block per stripe. This is pretty obvious and easy to picture (if it were actually tied directly to disks, and the disks had to be identically sized, we’d have just described a RAID3 array). Now, what happens if you add two more disks to your array without changing your stripe size and parity level? No problem – (future versions of) btrfs will just distribute blocks from each stripe equally around the array of physical disks, so that “half full” on the array roughly corresponds to “half full” on each individual disk as well, just like we demo’ed above on the btrfs-raid1 array.

I am VERY excited about being able to decide, for example, “I want 80% storage efficiency and single-failure fault tolerance”, and therefore being able to select a stripe length of 4 data blocks and one parity block… or “I want 75% storage efficiency and two-failure fault tolerance” and selecting a stripe length of 6 data blocks and 2 parity blocks… and in either case, to then be able to say “OK, now I want to expand my array with these six new drives I just bought” and be able to do so just as simply as adding the new drives, without worrying about how that will affect my storage efficency, fault tolerance level, or having to do a severely expensive and somewhat dangerous “restriping” of the array.

We are living in very interesting times, when it comes to data storage, and I have a feeling the majority of the storage industry – even the relatively well-informed bits that keep up with ZFS, Ceph, OrangeFS, etc – are going to be utterly gobsmacked and scrambling for footing when btrfs hits mainstream.

Slow performance with dovecot – mysql – roundcube

This drove me crazy forever, and Google wasn’t too helpful. If you’re running dovecot with mysql authentication, your logins will be exceedingly slow. This isn’t much of a problem with traditional mail clients – just an annoying bit of a hiccup you probably won’t even notice except for SASL authentication when sending mail – but it makes Roundcube webmail PAINFULLY painfully slow in a VERY obvious way.

The issue is due to PAM authentication being enabled by default in Dovecot, and on Ubuntu at least, it’s done in a really hidden little out-of-the-way file with no easy way to forcibly override it elsewhere that I’m aware of.

Again on Ubuntu, you’ll find the file in question at /etc/dovecot/conf.d/auth-system.conf.ext, and the relevant block should be commented out COMPLETELY, like this:

# PAM authentication. Preferred nowadays by most systems.
# PAM is typically used with either userdb passwd or userdb static.
# REMEMBER: You'll need /etc/pam.d/dovecot file created for PAM
# authentication to actually work. <doc/wiki/PasswordDatabase.PAM.txt>
#passdb {
  # driver = pam
  # [session=yes] [setcred=yes] [failure_show_msg=yes] [max_requests=]
  # [cache_key=] []
  #args = dovecot
#}

Once you’ve done this (and remember, we’re assuming you’re using SQL auth here, and NOT actually USING the PAM!) you’ll auth immediately instead of having to fail PAM and then fall back to SQL auth on every auth request, and things will speed up IMMENSELY. This turns Roundcube from “painfully slow” to “blazing fast”.

Block common trojans in SpamAssassin

If you have a reasonably modern (>= 3.1) version of SpamAssassin, you should by default have a MIMEHeader plugin available (at least on Ubuntu). This enables you to create a couple of custom rules that block the more pernicious “open this ZIP file” style trojans.

Put the following in /etc/spamassassin/local.cf:

loadplugin Mail::SpamAssassin::Plugin::MIMEHeader

mimeheader ZIP_ATTACHED Content-Type =~ /zip/i
describe ZIP_ATTACHED email contains a zip file attachment
score ZIP_ATTACHED 0.1

header SUBJ_PACKAGE_PICKUP Subject =~ /(parcel|package).*avail.*pickup/i
describe SUBJ_PACKAGE_PICKUP 1 of 2 for meta-rule TROJAN_PACKAGE_PICKUP
score SUBJ_PACKAGE_PICKUP 0.1

header FROM_IRS_GOV From =~ /irs\.gov/i
describe FROM_IRS_GOV 1 of 2 for meta-rule TROJAN_IRS_ZIPFILE
score FROM_IRS_GOV 0.1

meta TROJAN_PACKAGE_PICKUP (SUBJ_PACKAGE_PICKUP && ZIP_ATTACHED)
describe TROJAN_PACKAGE_PICKUP Nobody sends a ZIP file to say “your package is ready”.
score TROJAN_PACKAGE_PICKUP 4.0

meta TROJAN_IRS_ZIPFILE (FROM_IRS_GOV && ZIP_ATTACHED)
describe TROJAN_IRS_ZIPFILE If the IRS really wants to send a ZIP, they’ll have to find another way
score TROJAN_IRS_ZIPFILE 4.0

As always, spamassassin –lint to make sure the new rules work okay, and /etc/init.d/spamassassin reload to activate them in your running spamd process.

Benchmarking Windows Guests on KVM:I/O performance

I’ve been using KVM in production to host Windows Server guests for close to 4 years now. I’ve always been thoroughly impressed at what a great job KVM did with accelerating disk I/O for the guests – making Windows guests perform markedly faster virtualized than they used to on the bare metal. So when I got really, REALLY bad performance recently on a few Windows Server Standard 2012 guests – bad enough to make the entire guest seem “locked up tight” for minutes at a time – I did some delving to figure out what was going on.

Linux and KVM offer a wealth of options for handling caching and underlying subsystems of host storage… an almost embarassing wealth, which nobody seems to have really benchmarked. I have seen quite a few people tossing out offhanded comments about this cache mode or that cache mode being “safer”or “faster”or “better”, but no hard numbers. So to both fix my own immediate problem and do some much-needed documentation, I spent more hours this week than I really want to think about doing some real, no-kidding, here-are-the-numbers benchmarking.

Methodology

Test system: AMD FX-8320 8-core CPU, 32GB DDR3 SDRAM, 1x WD 2TB Black (WD2002FAEX) drive, 1x Samsung 840 PRO Series 512GB SSD, Ubuntu 12.04.2-LTS fully updated, Windows Server 2008 R2 guest OS, HD Tune Pro 5.50 Windows disk benchmark suite.

The host and guest OS are both installed on the WD 2TB Black conventional disk; the Samsung 840 PRO Series SSD is attached to the guest in various configurations for benchmarking. The guest OS is given approximately 30 seconds to “settle” after each boot and login before running any benchmarks. No other operations are occurring on either guest or host while benchmarks are run.

Exploratory Testing

Before diving straight into “which combination works the fastest”, I really wanted to explore the individual characteristics of all the various overlapping options available.

The first thing I wanted to find out: how much of a penalty, if any, do you pay for operating a raw disk virtualized under KVM, as opposed to under Windows on the bare metal? And how much of a boost do the VirtIO guest drivers offer over basic IDE drivers?

As you can see, we do pay a penalty – particularly without the VirtIO drivers, which offer a substantial increase in performance over the default IDE, even without caching. We can also see that LVM logical volumes perform effectively identically to actual raw disks. Nice to know!

Now that we know that “raw is raw”, and “VirtIO is significantly better than IDE”, the next burning question is: how much of a performance hit do we take if we use .qcow2 files on an actual filesystem, instead of feeding KVM a raw block device? Actually, let’s pause that question – before that, why would we want to use a .qcow2 file instead of a raw disk or LV? Two big answers: rsync, and state saves. Unless you compile rsync from source with an experimental patch, you can’t use it to synchronize copies of a guest that are stored on a block device – whereas you can, with a qcow2 or raw file on a filesystem. And you can’t save state (basically, like hibernation – only much faster, and handled by the host instead of the guest) with raw storage either – you need qcow2 for that. As a really, really nice aside, if you’re using qcow2 and your host runs out of space… your guest pauses instead of crashing, and as soon as you’ve made more space available on your host, you can resume the guest as though nothing ever happened. Which is nice.

So, if we can afford to, we really would like to have qcow2. Can we afford to?

Yes… yes we can. There’s nothing too exciting to see here – basically, the takeaway is “there is little to no performance penalty for using qcow2 files on a filesystem instead of raw disks.” So, performance is determined by cache settings and by the presence of VirtIO drivers in our guest… not so much by whether we’re using raw disks, or LV, or ZVOL, or qcow2 files.

One caveat: I tested using fully-allocated qcow2 files. I did a little bit of casual testing with sparsely allocated (aka “thin provisioned”) qcow2 files, and basically, they follow the same performance curves, but with roughly half the performance on some writes. This isn’t that big a deal, in my opinion – you only have to do a “slow” write to any given block once. After that, it’s a rewrite, not a new write, and you’re back to the same performance level you’d have had if you’d fully allocated your qcow2 file to start with. So, basically, it’s a self-correcting problem, with a tolerable temporary performance penalty. I’m more than willing to deal with that in return for not having to potentially synchronize gigabytes of slack space when I do backups and migrations.

So… since performance is determined largely by cache settings, let’s take a look at how those play out in the guest:

In plain English, “writethrough” – the default – is read caching with no write cache, and “writeback” is both read and write caching. “None” is a little deceptive – it’s not just “no caching”, it actually requires Direct I/O access to the storage medium, which not all filesystems provide.

The first thing we see here is that the default cache method, writethrough, is very very fast at reading, but painfully slow on writes – cripplingly so. On very small writes, writethrough is capable of less than 0.2 MB/sec in some cases! This is on a brand-new 840 Pro Series SSD... and it’s going to get even worse than this later, when we look at qcow2 storage. Avoid, avoid, avoid.

KVM caching really is pretty phenomenal when it hits, though. Take a look at the writeback cache method – it jumps well above bare metal performance for large reads and writes… and it’s not a small jump, either; 1MB random reads of well over 1GB / sec are completely normal. It’s potentially a little risky, though – you could potentially lose guest data if you have a power failure or host system crash during a write. This shouldn’t be an issue on a stable host with a UPS and apcupsd.

Finally, there’s cache=none. It works. It doesn’t impress. It isn’t risky in terms of data safety. And it generally performs somewhat better with extremely, extremely small random I/O… but without getting the truly mind-boggling wins that caching can offer. In my personal opinion, cache=none is mostly useful when you’re limited to IDE drivers in your guest. Also worth noting: “cache=none” isn’t available on ZFS or FUSE filesystems.

Moving on, we get to the stuff I really care about when I started this project – ZFS! Storing guests on ZFS is really exciting, because it offers you the ability to take block-level host-managed snapshots of your guests; set and modify quotas; set and configure compression; do asynchronous replication; do block-level deduplication – the list goes on and on and on. This is a really big deal. But… how’s the performance?

The performance is very, very solid… as long as you don’t use writethrough. If you use writethrough cache and ZFS, you’re going to have a bad time. Also worth noting: Direct I/O is not available on the ZFS filesystem – although it is available with ZFS zvols! – so there are no results here for “cache=none” and ZFS qcow2.

The big, big, big thing you need to take away from this is that abysmal write performance line for ZFS/qcow2/writethrough – well under 2MB/sec for any and all writes. If you set a server up this way, it will look blazing quick and you’ll love it… until the first time you or a user tries to write a few hundred MB of data to it across the network, at which point the whole thing will lock up tighter than a drum for half an hour plus. Which will make you, and your users, very unhappy.

What else can we learn here? Well, although we’ve got the option of using a zvol – which is basically ZFS’s answer to an LVM LV – we really would like to avoid it, for the same reasons we talked about when we compared qcow to raw. So, let’s look at the performance of that raw zvol – is it worth the hassle? In the end, no.

But here’s the big surprise – if we set up a ZVOL, then format it with ext4 and put a .qcow2 on top of that… it performs as well, and in some cases better than, the raw zvol itself did! As odd as it sounds, this leaves qcow2-on-ext4-on-zvol as one of our best performing overall storage methods, with the most convenient options for management. It sounds like it’d be a horrible Rube Goldberg, but it performs like best-in-breed. Who’d’a thunk it?

There’s one more scenario worth exploring – so far, since discovering how much faster it was, we’ve almost exclusively looked at VirtIO performance. But you can’t always get VirtIO – for example, I have a couple of particularly crotchety old P2V’ed Small Business Server images that absolutely refuse to boot under VirtIO without blue-screening. It happens. What are your best options if you’re stuck with IDE?

Without VirtIO drivers, caching does very, very little for you at best, and kills your performance at worst. (Hello again, horrible writethrough write performance.) So you really want to use “cache=none” if you’re stuck on IDE. If you can’t do that for some reason (like using ZFS as a filesystem, not a zvol), writeback will perform quite acceptably… but it will also expose you to whatever added data integrity risk that the write caching presents, without giving you any performance benefits in return. Caveat emptor.

Final Tests / Top Performers

At this point, we’ve pretty thoroughly explored how individual options affect performance, and the general ways in which they interact. So it’s time to cut to the chase: what are our top performers?

First, let’s look at our top read performers. My method for determining “best” read performance was to take the 4KB random read and the sequential read, then multiply the 4KB random by a factor which, when applied to the bare metal Windows performance, would leave a roughly identical value to the sequential read. Once you’ve done this, taking the average of the two gives you a mean weighted value that makes 4KB read performance roughly as “important” as sequential read performance. Sorting the data by these values gives us…

Woah, hey, what’s that joker in the deck? RAIDZ1…?

My primary workstation is also an FX-8320 with 32GB of DDR3, but instead of an SSD, it has a 4 drive RAIDZ1 array of Western Digital 1TB Black (WD1001FAEX) drives. I thought it would be interesting to see how the RAIDZ1 on spinning rust compared to the 840 Pro SSD… and was pretty surprised to see it completely stomping it, across the board. At least, that’s how it looks in these benchmarks. The benchmarks don’t tell the whole story, though, which we’ll cover in more detail later. For now, we just want to notice that yes, a relatively small and inexpensive RAIDZ1 array does surprisingly well compared to a top-of-the-line SSD – and that makes for some very interesting and affordable options, if you need to combine large amounts of data with high performance access.

Joker aside, the winner here is pretty obvious – qcow2 on xfs, writeback. Wait, xfs? Yep, xfs. I didn’t benchmark xfs as thoroughly as I did ext4 – never tried it layered on top of a zvol, in particular – but I did do an otherwise full set of xfs benchmarks. In general, xfs slightly outperforms ext4 across an identically shaped curve – enough of a difference notice on a graph, but not enough to write home about. The difference is just enough to punt ext4/writeback out of the top 5 – and even though we aren’t actually testing write performance here, it’s worth noting how much better xfs/writeback writes than the two bottom-of-the-barrel “top performers” do.

I keep harping on this, but seriously, look close at those two writethrough entries – it’s not as bad as writethrough-on-ZFS-qcow2, but it’s still way worse than any of the other contenders, with 4KB writes under 2MB/sec. That’s single-raw-spinning-disk territory, at best. Writethrough does give you great read performance, and it’s “safe” as in data integrity, but it’s dangerous as hell in terms of suddenly underperforming under really badly under heavy load. That’s not just “I see a valley on the graph” levels of bad, it’s very potentially “hey IT guy, why did the server lock up?” levels of bad. I do not recommend writethrough caching, “default” option or not.

How ’bout write performance? I calculated “weighted write” just like “weighted read” – divide the bare metal sequential write speed by the bare metal 4KB write speed, then apply the resulting factor to all the 4KB random writes, and average them with the sequential writes. Here are the top 5 weighted write performers:

The first thing to notice here is that while the top 5 slots have changed, the peak read numbers really haven’t. All we’ve really done here is kick the writethrough entries to the curb – we haven’t paid any significant penalty in read performance to do so. Realizing that, let’s not waste too much time talking about this one… instead, let’s cut straight to the “money graph” – our top performers in average mean weighted read and write performance. The following are, plain and simple, the best performers for any general purpose (and most fairly specialized) use cases:

Interestingly, our “jokers” – zvol/ext4/qcow2/writeback and zfs/qcow2/writeback on my workstation’s relatively humble 4-drive RAIDZ1 – are still dominating the pack, at #1 and #2 respectively. This is because they read as well as any of the heavy lifters do, and are showing significantly better write performance – with caveats, which we’ll cover in the conclusions.

Jokers aside, xfs/qcow2/writeback is next, followed by zvol/ext4/qcow2/writeback. You aren’t seeing any cache=none at all here – the gains some cache=none contenders make in very tiny writes just don’t offset the penalties paid in reads and in larger writes from foregoing the cache. If you have a very heavily teeny-tiny-write-loaded workload – like a super-heavy-traffic database – cache=none may still perform better… but you probably don’t want virtualization for a really heavy database workload in the first place, KVM or otherwise. If you hammer your disks, rust or solid state, to within an inch of their lives… you’re really going to feel that raw performance penalty.

Conclusions

In the end, the recommendation is pretty clear – a ZFS zvol with ext4, qcow2 files, and writeback caching offers you the absolute best performance. (Using xfs on a zvol would almost certainly perform as well, or even better, but I didn’t test that exact combination here.) Best read performance, best write performance, qcow2 management features, ZFS snapshots / replication / data integrity / optional compression / optional deduplication / etc – there really aren’t any drawbacks here… other than the complexity of the setup itself. In the real world, I use simple .qcow2 on ZFS, no zvols required. The difference in performance between the two was measurable on graphs, but it’s not significant enough to make me want to actually maintain the added complexity.

If you can’t or won’t use ZFS for whatever reason (like licensing concerns), xfs is probably your next best bet – but if that scares you, just use ext4 – the difference won’t be enough to matter much in the long run.

There’s really no reason to mess around with raw disks or raw files – they add significant extra hassle, remove significant features, and don’t offer tangible performance benefits.

If you’re going to use writeback caching, you should be extra certain of power integrity – UPS on the server with apcupsd. If you’re at all uncertain of your power integrity… take the read performance hit and go with nocache. (On the other hand, if you’re using ZFS and taking rolling hourly snapshots… maybe it’s worth taking more risks for the extra performance. Ultimately, you’re the admin, you get to call the shots.)

Caveats

It’s very important to note that these benchmarks do not tell the whole story about disk performance under KVM. In particular, both the random and sequential reads used here bypass the cache considerably more heavily than most general purpose workloads would… minimizing the impact that the cache has. And yes, it is significant. See those > 1GB/sec peaks in the random read performance? That kind of thing happens a lot more in a normal workload than it does in a random walk – particularly with ZFS storage, which uses the ARC (Adaptive Replacement Cache) rather than the simple FIFO cache used by other systems. The ARC makes decisions about cache eviction based not only on the time since an object was last seen, but the frequency at which it’s seen, among other things – with the result that it kicks serious butt, especially after a good amount of time to warm up and learn the behavior patterns of the system.

So, please take these numbers with a grain of salt. They’re quite accurate for what they are, but they’re synthetic benchmarks, not a real-world experience. Windows on KVM is actually a much (much) better experience than the raw numbers here would have you believe – the importance of better-managed cache, and persistent cache, really can’t be over-emphasized.

The actual guest OS installation for these tests was on a single spinning disk, but it used ZFS/qcow2/writeback for the underlying storage. I needed to reboot the guest after every single row of data – and in many cases, several times more, because I screwed something or another up. In the end, I rebooted that Windows Server 2008 R2 guest upwards of 50 times. And on a single spinning disk, shutdowns took about 4 seconds and boot times (from BIOS to desktop) took about 3 seconds. You don’t get that kind of performance out of the bare metal, and you can’t see it in these graphs, either. For most general-purpose workloads, these graphs are closer to being a “worst-place scenario” than are a direct model.

That sword cuts both ways, though – the “jokers in the deck”, my RAIDZ1 of spinning rust, isn’t really quite as impressive as it looks above. I’m spitballing here, but I think a lot of the difference is that ZFS is willing to more aggressively cache writes with a RAIDZ array than it is with a single member, probably because it expects that more spindles == faster writes, and it only wants to keep so many writes in cache before it flushes them. That’s a guess, and only a guess, but the reality is that after those jaw-dropping 1.9GB/sec 1MB random write runs, I could hear the spindles chattering for a significant chunk of time, getting all those writes committed to the rust. That also means that if I’d had the patience for bigger write runs than 5GB, I’d have seen performance dropping significantly. So you really shouldn’t think “hey, forget SSDs completely, spinning rust is fine for everybody!” It’s not. It is, however, a surprisingly good competitor on the cheap if you buy enough of it, and if your write runs come in bursts – even “bursts” ranging in the gigabytes.

latest signed VirtIO drivers

I never can remember how to get to the latest SIGNED VirtIO drivers for Windows guests under KVM, and it always takes me a surprising amount of time to find them. So, here’s a bookmark for me:

http://alt.fedoraproject.org/pub/alt/virtio-win/latest/images/

Note: very much fresher drivers than the latest copy I had! Newest drivers there at this time are 2013-04-17. 🙂

Configuring retry duration in Postfix

By default, Postfix (like most mailservers) will keep trying to deliver mail for unconscionably long periods of time – 5 days. Most users do NOT expect mail that isn’t getting delivered to just sorta hang out for 5 days before they find out it didn’t go through – the typical scenario is that a mistyped email address (or a receiving mailserver that decides to SMTP 4xx you instead of SMTP 5xx you when it really doesn’t want that email, ever) means that three or four days later, the message still hasn’t gotten there, the user has no idea anything went wrong, and then there’s a giant kerfuffle “but I sent that to you days ago!”

If you want to reconfigure Postfix to something a little more in tune with modern sensibilities, add the following to /etc/postfix/main.cf near the top:

# if you can't deliver it in an hour - it can't be delivered!

maximal_queue_lifetime = 1h

maximal_backoff_time = 15m

minimal_backoff_time = 5m

queue_run_delay = 5m

Note that this will cause you problems if you need to deliver to someone who has a particularly aggressive graylisting setup in place that would require retries at or near a full hour later from the original bounce that it sends you. But that’s okay – greylisting is bad, and the remote admin should FEEL bad for doing it. (Alternately – adjust the numbers above as desired to 2h, 4h, whatever will allow you to pass the greylisting on the remote end. Ultimately, you’re the one who has to answer to your users about a balance between “knowing right away that the mail didn’t go through” vs “managing to survive dumb things the other mail admin may have done on his end”.)

force directory mode on Samba 3.x

I just spent bloody forever trying to force any folder created by a Windows user to automatically be 770 instead of 750 on a Samba share (3.6.2 on Ubuntu).

The issue at hand is, with force directory mode = 0770 if a user creates a folder DIRECTLY on the Samba share, all is well… but if they drag an already existing folder FROM their windows machine INTO the samba share, it ends up 0750 and nobody else can see it. (I don’t want to tell you how long it took me to figure out WHEN, exactly, the mode wasn’t getting set correctly, either.)

I chased my tail with masks, directory security mode, unix extensions, and more, before I finally, FINALLY found the fix:

edit /etc/login.defs and set the default umask to 002 instead of 022. *poof*, problem solved. Thank you, thank you, THANK you, random dude from ubuntuforums.org…

Getting SPICE working in Ubuntu 12.04.1-LTS

Well, I finally got Spice working with KVM in Ubuntu 12.04.1 LTS today. It… was not pleasant. But it can be done (and without any PPAs, with some ugly hackery).

THESE STEPS ARE NECESSARY AS OF OCTOBER 2012. THEY MAY NOT BE BY THE TIME YOU READ THIS. IF YOU’RE FROM THE FUTURE, DO NOT HACK YOUR APPARMOR PROFILE OR MOVE SYSTEM BINARIES AROUND UNTIL AFTER YOU HAVE CONFIRMED THAT SPICE DOESN’T WORK WITHOUT YOU DOING THESE THINGS.

First, there’s a problem in the AppArmor profile for KVM that prevents it getting read access to /proc/*/auxv, which is necessary for QXL/Spice stuff to work. So let’s fix that: nano /etc/apparmor.d/abstractions/libvirt-qemu, find the line @{PROC}/*/status r, and add the following line:

  @{PROC}/*/auxv r,

NOTE THE TRAILING COMMA – it’s not a bug, it’s a (necessary) feature! Now sudo /etc/init.d/apparmor restart. This fixes the apparmor problem.

Now, let’s add all the packages you’ll need (assuming you’re already successfully running KVM, and just want to add Spice support):

sudo apt-get update ; sudo apt-get install qemu-kvm-spice python-spice-client-gtk

Another bit of REALLY dirty hackery is necessary here – you’ll need to replace your /usr/bin/kvm with the spice-enabled binary you just installed, which nothing in your system knows how to find. Sigh.

sudo mv /usr/bin/kvm /usr/bin/kvm.dist ; sudo ln -s /usr/bin/qemu-system-x86_64 /usr/bin/kvm

Alright, now – assuming you’re running a Windows guest – you’ll need to boot into that guest, download the Windows spice-guest-tools from http://spice-space.org/download.html, and install them. THINGS GET REALLY, REALLY UGLY HERE: YOU NEED TO MAKE VERY SURE YOU HAVE RDP (or some other client side remote control tool) INSTALLED AND WORKING BEFORE GOING ANY FURTHER. Seriously, TEST. IT. NOW.

Okay, tested? You can RDP or otherwise get into your Windows guest WITHOUT use of virt-manager, or the VNC channel virt-manager provides? Good. Shut down your Windows guest, and change the Display to Spice (say yes to “install Spice channels”) and the Video to QXL. Reboot your guest. YOU WILL GET NO VIDEO, EITHER FOR BIOS OR FOR THE GUEST (reference bug https://bugs.launchpad.net/ubuntu/+source/seabios/+bug/958549 ). Remote into your guest using RDP or other client-side remote tool as mandated in last step. Open Device Manager, find “Standard VGA Adapter”, right-click and select “Update Driver Software…” Choose “Search automatically for updated driver software” and it will find the “RedHat QXL Driver” and install it, after which it will need a reboot.

After rebooting, virt-manager will now successfully connect to the guest using Spice, and you will have a nice wicked-cool uber-accelerated control window, with working sound, to your Windows guest. HOWEVER cut and paste, unfortunately, still won’t work. You also won’t have any display during the SeaBIOS boot sequence or the guest’s boot sequence – video won’t start happening until you get all the way to the login screen for Win7.

—–

From the “don’t blame me” dept: if you don’t follow these instructions exactly, it can get REALLY annoying – in particular, if you have Display VNC set while you have Video QXL *and the spice-guest-tools installed on your guest*, kvm will generate crash after crash, REPEATEDLY, and manage to expose a bug in apport-gtk as well so that you’re constantly asked to submit problem reports but then informed that the problem reports are corrupt (incorrect padding) and therefore you can neither see them nor send them. Once you get tired of that, force your guest shut in virt-manager and change its video to QXL and you should be fine. I learned this one the hard way while figuring this crap out today.