Will ZFS and non-ECC RAM kill your data?

This comes up far too often, so rather than continuing to explain it over and over again, I’m going to try to do a really good job of it once and link to it here.

What’s ECC RAM? Is it a good idea?

The ECC stands for Error-Correcting Code. In a nutshell, ECC RAM is a special kind of server-grade memory that can detect and repair some of the most common kinds of in-memory corruption. For more detail on how ECC RAM does this, and which types of errors it can and cannot correct, the rabbit hole’s over here.

Now that we know what ECC RAM is, is it a good idea? Absolutely. In-memory errors, whether due to faults in the hardware or to the impact of cosmic radiation (yes, really), are a thing. They do happen. And if one happens in a particularly strategic place, you will lose data to it. Period. There’s no arguing this.

What’s ZFS? Is it a good idea?

ZFS is, among other things, a checksumming filesystem. This means that for every block committed to storage, a strong hash (somewhat misleadingly AKA checksum) of the contents of that block is also written. (The validation hash is written in the pointer to the block itself, which is also checksummed in the pointer leading to it, and so on and so forth. It’s turtles all the way down. Rabbit hole begins over here for this one.)

Is this a good idea? Absolutely. Combine ZFS checksumming with redundancy or parity, and now you have a self-healing array. If a block is corrupt on disk, the next time it’s read, ZFS will see that it doesn’t match its checksum and will load a redundant copy (in the case of mirror vdevs or multiple-copy storage) or rebuild it from parity (in the case of RAIDZ vdevs). Assuming that copy of the block matches its checksum, ZFS will silently feed you the correct copy instead, and log a checksum error against the first block that didn’t pass.
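You can actually watch this healing happen: the CKSUM column of zpool status counts exactly these caught-and-repaired blocks. Here’s a sketch of what that looks like on a two-disk mirror – the pool name and device names are entirely made up for illustration:

me@box:~$ zpool status tank
  (...)
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     3

The 3 in sdb’s CKSUM column means three blocks on that disk failed validation and were silently rewritten from the good copies on sda.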

ZFS also supports scrubs, which will become important in the next section. When you tell ZFS to scrub storage, it reads every block that it knows about – including redundant copies – and checks them against their checksums. Any failing blocks are automatically overwritten with good blocks, assuming that a good (passing) copy exists, either redundant or as reconstructed from parity. Regular scrubs are a significant part of maintaining a ZFS storage pool against long-term corruption.
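Starting and monitoring a scrub is a one-liner each (the pool name tank is, again, hypothetical):

# kick off a scrub of the pool
me@box:~$ sudo zpool scrub tank

# check on progress, and any errors found, in the "scan:" line
me@box:~$ zpool status tank

# many admins just scrub from cron, e.g. at 2 AM on the first of each month:
# 0 2 1 * * /sbin/zpool scrub tank

A scrub runs in the background against live storage, so there’s no downtime involved – just some extra I/O load while it runs.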

Is ZFS and non-ECC worse than not-ZFS and non-ECC? What about the Scrub of Death?

OK, it’s pretty easy to demonstrate that a flipped bit in RAM means data corruption: if you write that flipped bit back out to disk, congrats, you just wrote bad data. There’s no arguing that. The real issue here isn’t whether ECC is good to have, it’s whether non-ECC is particularly problematic with ZFS. The scenario usually thrown out is the much-dreaded Scrub Of Death.

TL;DR version of the scenario: ZFS is on a system with non-ECC RAM that has a stuck bit, its user initiates a scrub, and as a result of in-memory corruption good blocks fail checksum tests and are overwritten with corrupt data, thus instantly murdering an entire pool. As far as I can tell, this idea originates with a very prolific user on the FreeNAS forums named Cyberjock, and he lays it out in this thread here. It’s a scary idea – what if the very thing that’s supposed to keep your system safe kills it? A scrub gone mad! Nooooooo!

The problem is, the scenario as written doesn’t actually make sense. For one thing, even if you have a particular address in RAM with a stuck bit, you aren’t going to have your entire filesystem run through that address. That’s not how memory management works, and if that were how memory management worked, you wouldn’t even have managed to boot the system: it would have crashed and burned horribly when it failed to load the operating system in the first place. So no, you might corrupt a block here and there, but you’re not going to wring the entire filesystem through a shredder block by precious block.

But let’s be generous here. Say you corrupt only one block in 5,000 this way. That would still be hellacious. So let’s examine the more reasonable idea of corrupting some data due to bad RAM during a scrub. And let’s assume that we have RAM that not only isn’t working 100% properly, but is actively goddamn evil and trying its naive but enthusiastic best to specifically kill your data during a scrub:

First, you read a block. This block is good. It is perfectly good data written to a perfectly good disk with a perfectly matching checksum. But that block is read into evil RAM, and the evil RAM flips some bits. Perhaps those bits are in the data itself, or perhaps those bits are in the checksum. Either way, your perfectly good block now does not appear to match its checksum, and since we’re scrubbing, ZFS will attempt to actually repair the “bad” block on disk. Uh-oh! What now?

Next, you read a copy of the same block – this copy might be a redundant copy, or it might be reconstructed from parity, depending on your topology. The redundant copy is easy to visualize – you literally stored another copy of the block on another disk. Now, if your evil RAM leaves this block alone, ZFS will see that the second copy matches its checksum, and so it will overwrite the first block with the same data it had originally – no data was lost here, just a few wasted disk cycles. OK. But what if your evil RAM flips a bit in the second copy? Since it doesn’t match the checksum either, ZFS doesn’t overwrite anything. It logs an unrecoverable data error for that block, and leaves both copies untouched on disk. No data has been corrupted. A later scrub will attempt to read all copies of that block and validate them just as though the error had never happened, and if this time either copy passes, the error will be cleared and the block will be marked valid again (with any copies that don’t pass validation being overwritten from the one that did).

Well, huh. That doesn’t sound so bad. So what does your evil RAM need to do in order to actually overwrite your good data with corrupt data during a scrub? Well, first it needs to flip some bits during the initial read of every block that it wants to corrupt. Then, on the second read of a copy of the block from parity or redundancy, it needs to not only flip bits, it needs to flip them in such a way that you get a hash collision. In other words, random bit-flipping won’t do – you need some bit-flipping in the data (with or without some more bit-flipping in the checksum) that adds up to the corrupt data correctly hashing to the value in the checksum. ZFS validation hashes are 256 bits wide – SHA-256 if you set checksum=sha256; the default fletcher4 is also 256 bits, just not cryptographically strong – which means that a single bit-flip has a 1 in 2^256 chance of giving you a corrupt block which now matches its checksum. To be fair, we’re using evil RAM here, so it’s probably going to do lots of experimenting, and it will try flipping bits in both the data and the checksum itself, and it will do so multiple times for any single block. However, that’s multiple 1 in 2^256 (aka roughly 1 in 10^77) chances, which still makes it vanishingly unlikely to actually happen… and if your RAM is that damn evil, it’s going to kill your data whether you’re using ZFS or not.
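If you want a feel for just how hopeless random bit-flipping is against a 256-bit hash, you can sketch the experiment with sha256sum standing in for ZFS’s internal checksumming (the file name and block size here are arbitrary):

# make a 128 KiB "block" of random data and hash it
me@box:/tmp$ dd if=/dev/urandom of=block.bin bs=128k count=1 2>/dev/null
me@box:/tmp$ sha256sum block.bin

# flip a single bit (the low bit of the first byte) and hash again --
# the new hash shares nothing with the old one, and matching the old
# hash by dumb luck is a 1-in-2^256 shot
me@box:/tmp$ old=$(xxd -p -l1 block.bin)
me@box:/tmp$ printf '%02x' $(( 0x$old ^ 0x01 )) | xxd -r -p | dd of=block.bin conv=notrunc 2>/dev/null
me@box:/tmp$ sha256sum block.bin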

But what if I’m not scrubbing?

Well, if you aren’t scrubbing, then your evil RAM will have to wait for you to actually write to the blocks in question before it can corrupt them. Fortunately for it, though, you write to storage pretty much all day long… including to the metadata that organizes the whole kit and kaboodle. The first time you update the directory your files live in – BAM! It’s got you! If you stop and think about it, in this evil RAM scenario ZFS is incredibly helpful, because your RAM now needs to not only be evil but be bright enough to consistently pull off collision attacks. So if you’re running non-ECC RAM that turns out to be appallingly, Lovecraftianishly evil, ZFS will mitigate the damage, not amplify it.

If you are using ZFS and you aren’t scrubbing, by the way, you’re setting yourself up for long term failure. If you have on-disk corruption, a scrub can fix it only as long as you really do have a redundant or parity copy of the corrupted block which is good. Once you corrupt all copies of a given block, it’s too late to repair it – it’s gone. Don’t be afraid of scrubbing. (Well, maybe be a little wary of the performance impact of scrubbing during high demand times. But don’t be worried about scrubbing killing your data.)

I’ve constructed a doomsday scenario featuring RAM evil enough to kill my data after all! Mwahahaha!

OK. But would using any other filesystem that isn’t ZFS have protected that data? ‘Cause remember, nobody’s arguing that you can lose data to evil RAM – the argument is about whether evil RAM is more dangerous with ZFS than it would be without it.

I really, really want to use the Scrub Of Death in a movie or TV show. How can I make it happen?

What you need here isn’t evil RAM, but an evil disk controller. Have it flip one bit per block read or written from disk B, but leave the data from disk A alone. Now scrub – every block on disk B will be overwritten with a copy from disk A, but the evil controller will flip bits on write, so now, all of disk B is written with garbage blocks. Now start flipping bits on write to disk A, and it will be an unrecoverable wreck pretty quickly, since there’s no parity or redundancy left for any block. Your choice here is whether to ignore the metadata for as long as possible, giving you the chance to overwrite as many actual data blocks as you can before the jig is up as they are written to by the system, or whether to pounce straight on the metadata and render the entire vdev unusable in seconds – but leave the actual data blocks intact for possible forensic recovery.

Alternately you could just skip straight to that second phase and start flipping bits as data is written on any or all individual devices, and you’ll produce real data loss quickly enough. But you specifically wanted a scrub of death, not just bad hardware, right?

I don’t care about your logic! I wish to appeal to authority!

OK. “Authority” in this case doesn’t get much better than Matthew Ahrens, one of the cofounders of ZFS at Sun Microsystems and current ZFS developer at Delphix. In the comments to one of my filesystem articles on Ars Technica, Matthew said “There’s nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem.”

Hope that helps. =)

Three Step Guide to X11 Forwarding

Got a graphical application you want to run on a Linux box, but display on a Windows box? It’s stupidly easy. I can’t believe how long it took me to learn how to do this, even though I knew it was possible. Hopefully this will get the trick into some other sysadmin’s toolbox sooner. (It’s particularly useful for running virt-manager when you don’t have a Linux machine to sit in front of.)

Install Xming
Step 1: download and install Xming (probably from Softpedia, since Sourceforge is full of malware and BS misleading downloads now)

Enable X11 Forwarding
Step 2: in PuTTY’s configs on your Windows box, Connection –> SSH –> X11 –> check the “Enable X11 Forwarding” box.

Run from SSH
Step 3: SSH into a Linux box, and run a GUI application from the command line. Poof, the app shows up on your Windows desktop!
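Bonus step for non-Windows clients: if you’re SSHing in from a Linux or Mac box instead, you can skip Xming and PuTTY entirely – plain OpenSSH does it all with one flag. (User and host names below are made up; the server side also needs X11Forwarding yes in /etc/ssh/sshd_config and the xauth package installed, which most distros ship by default.)

# -X enables X11 forwarding; try -Y (trusted forwarding) if an app
# dies complaining about X security extensions
me@laptop:~$ ssh -X me@kvmhost
me@kvmhost:~$ virt-manager &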

Heartbleed SSL vulnerability

Last night (2014 Apr 7) a massive security vulnerability was publicly disclosed in OpenSSL, the library that encrypts most of the world’s sensitive traffic. The bug in question is approximately two years old – OpenSSL releases older than 2012 are not vulnerable – and affects the TLS “heartbeat” function, which is why the vulnerability has been nicknamed Heartbleed.

The bug allows a malicious remote user to scan arbitrary 64K chunks of the affected server’s memory. This can disclose any and ALL information in that affected server’s memory, including SSL private keys, usernames and passwords of ANY running service accepting logins, and more. Nobody knows if the vulnerability was known or exploited in the wild prior to its public disclosure last night.

If you are an end user:

You will need to change any passwords you use online unless you are absolutely sure that the servers you used them on were not vulnerable. If you are not a HIGHLY experienced admin or developer, you absolutely should NOT assume that sites and servers you use were not vulnerable. They almost certainly were. If you are a highly experienced ops or dev person… you still absolutely should not assume that, but hey, it’s your rope, do what you want with it.

Note that most sites and servers are not yet patched, meaning that changing your password right now will only expose that password as well. If you have not received any notification directly from the site or server in question, you may try a scanner like the one at http://filippo.io/Heartbleed/ to see if your site/server has been patched. Note that this script is not bulletproof, and in fact it’s less than 24 hours old as of the time of this writing, up on a free site, and under massive load.

The most important thing for end users to understand: You must not, must not, MUST NOT reuse passwords between sites. If you have been using one or two passwords for every site and service you access – your email, forums you post on, Facebook, Twitter, chat, YouTube, whatever – you are now compromised everywhere and will continue to be compromised everywhere until ALL sites are patched. Further, this will by no means be the last time a site is compromised. Criminals can and absolutely DO test compromised credentials from one site on other sites and reuse them elsewhere when they work! You absolutely MUST use different passwords – and I don’t just mean tacking a “2” on the end instead of a “1”, or similar cheats – on different sites if you care at all about your online presence, the money and accounts attached to your online presence, etc.

If you are a sysadmin, ops person, dev, etc:

Any systems, sites, services, or code that you are responsible for need to be checked for links against OpenSSL versions 1.0.1 through 1.0.1f. Note that this is the upstream OpenSSL versioning scheme – your individual distribution, if you are using repo versions like a sane person, may have a different numbering scheme. (For example, Ubuntu is vulnerable from 1.0.1-0 through 1.0.1-4ubuntu5.11.)
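A quick sketch of how to check what you’re actually running – this is Debian/Ubuntu-flavored, and package names will differ on other distros:

# upstream version string -- vulnerable if it reports 1.0.1 through 1.0.1f
me@box:~$ openssl version

# on Debian/Ubuntu the packaged version is what actually matters, since
# distros backport security fixes without bumping the upstream string
me@box:~$ dpkg -l libssl1.0.0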

Examples of affected services: HTTPS, IMAPS, POP3S, SMTPS, OpenVPN. Fabulously enough, for once OpenSSH is not affected, even in versions linking to the affected OpenSSL library, since OpenSSH did not use the Heartbeat function. If you are a developer and are concerned about code that you wrote, the key here is whether your code exposed access to the Heartbeat function of OpenSSL. If it was possible for an attacker to access the TLS heartbeat functionality, your code was vulnerable. If it was absolutely not possible to check an SSL heartbeat through your application, then your application was not vulnerable even if it linked to the vulnerable OpenSSL library.

On the flip side, please realize that just because your service passed an automated scanner like the one linked above doesn’t mean it was safe. Most of those scanners do not test services that use STARTTLS instead of being TLS-encrypted from the get-go, but services using STARTTLS are absolutely still affected. Similarly, none of the scanners I’ve seen will test UDP services – but UDP services are affected. In short, if you as a developer don’t absolutely know that you weren’t exposing access to the TLS heartbeat function, then you should assume that your OpenSSL-using application or service was/is exploitable until your libraries are brought up to date.
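One check that does cover STARTTLS services is OpenSSL’s own client. Be careful with the logic here: a service that doesn’t advertise the heartbeat extension at all can’t be Heartbled, but advertising it does NOT prove vulnerability – patched servers advertise it too. (Host names below are placeholders.)

# HTTPS
me@box:~$ openssl s_client -connect www.example.com:443 -tlsextdebug </dev/null 2>&1 | grep -i heartbeat

# SMTP with STARTTLS -- exactly the kind of service the web scanners miss
me@box:~$ openssl s_client -connect mail.example.com:25 -starttls smtp -tlsextdebug </dev/null 2>&1 | grep -i heartbeat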

You need to update all copies of the OpenSSL library to 1.0.1g or later (or your distribution’s equivalent), both dynamically AND statically linked (PS: stop using static linking, for exactly this kind of reason!), and restart any affected services. You should also, unfortunately, consider any and all credentials, passwords, certificates, keys, etc. that were used on any vulnerable servers, whether directly related to SSL or not, as compromised, and regenerate them. The Heartbleed bug allowed scanning ALL memory on any affected server and thus could be used by a sufficiently skilled attacker to extract ANY sensitive data held in server RAM. As a trivial example, as of today (2014-Apr-08) users at the Ars Technica forums are logging on as other users, using password credentials held in server RAM and exposed by publicly disclosed exploit test scripts.
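Finding everything that still has the old library mapped into memory is the part people miss; a sketch (the service names are examples, and checkrestart comes from the debian-goodies package):

# list processes still mapped to a deleted (pre-upgrade) copy of libssl
me@box:~$ sudo lsof -n | grep -i libssl | grep DEL

# or on Debian/Ubuntu, let checkrestart figure it out for you
me@box:~$ sudo checkrestart

# then bounce everything it names
me@box:~$ sudo service apache2 restart
me@box:~$ sudo service postfix restart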

Completely eradicating all potential vulnerability is a STAGGERING amount of work and will involve a lot of user disruption. When estimating your paranoia level, please do remember that the bug itself has been in the wild since 2012 – the public disclosure was not until 2014-Apr-07, but we have no way of knowing how long private, possibly criminal entities have been aware of and/or exploiting the bug in the wild.

Understanding ctime

ctimes have got to be one of the most widely misunderstood features of POSIX filesystems. It’s very intuitive – and very wrong – to believe that just as mtime is the time the file was last modified and atime is the time the file was last accessed, ctime must be the time the file was created. Nope!

ctime is not the file creation time, it’s the inode change time. Any time the inode for a file changes, the ctime changes. Which isn’t very intuitive, and it happens much more frequently than you might think. In fact, I can’t think of a case where the mtime changes without the ctime changing along with it… and in a lot of cases, the ctime will update when the mtime doesn’t! There’s a LOT of bad information floating around about this… so let’s examine it directly:

me@box:/tmp$ touch test
me@box:/tmp$ SHOWTIMES="%P \t ctime:  %Cb %Cd %CT \t mtime:  %Tb %Td %TT \t atime:  %Ab %Ad %AT \n" ;   export SHOWTIMES
me@box:/tmp$ find . -maxdepth 1 -name test -printf "$SHOWTIMES"
test ctime:  Jul 29 15:53:51.4814403150 mtime:  Jul 29 15:53:51.4814403150 atime:  Jul 29 15:53:51.4814403150
me@box:/tmp$ cat test > /dev/null
me@box:/tmp$ find . -maxdepth 1 -name test -printf "$SHOWTIMES"
test ctime:  Jul 29 15:53:51.4814403150 mtime:  Jul 29 15:53:51.4814403150 atime:  Jul 29 15:54:18.1014495830
me@box:/tmp$ touch test
me@box:/tmp$ find . -maxdepth 1 -name test -printf "$SHOWTIMES"
test ctime:  Jul 29 15:54:25.6522832920 mtime:  Jul 29 15:54:25.6522832920 atime:  Jul 29 15:54:25.6522832920
me@box:/tmp$ chmod 777 test
me@box:/tmp$ find . -maxdepth 1 -name test -printf "$SHOWTIMES"
test ctime:  Jul 29 15:54:32.7214485080 mtime:  Jul 29 15:54:25.6522832920 atime:  Jul 29 15:54:25.6522832920
me@box:/tmp$ mv test test2 ; mv test2 test
me@box:/tmp$ find . -maxdepth 1 -name test -printf "$SHOWTIMES"
test ctime:  Jul 29 15:54:54.6322825980 mtime:  Jul 29 15:54:25.6522832920 atime:  Jul 29 15:54:25.6522832920

OK, from top to bottom:

1. we create a file, and we check its ctime, mtime, and atime. No surprises.
2. we access the file, by cat’ing it to /dev/null. atime updates, ctime and mtime remain the same. No surprises.
3. we modify the file, by touching it. atime, mtime, and ctime update… not what you expected, amirite?
4. we change permissions on the file. ctime updates, but mtime and atime do not. Again… not what you expected, right?
5. we mv the file a couple times. ctime updates again – mtime and atime still don’t.

The demonstration above is usually the first answer people need to the question “how do I modify the ctime on a file?” (The second answer is: don’t. By design, there is no feature within a POSIX filesystem to set a ctime to anything other than the current system time, so the only ways to do it are either to reset the system time and then mv the file, or to unmount the filesystem entirely and hex-edit the metadata with a debugging tool. Neither is a good idea, and wanting to do so in the first place usually stems from a misunderstanding of what ctime is and does.)
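By the way, if you just want a quick look at all three timestamps without the find incantation above, GNU stat will show them (the --printf format directives below are GNU-specific: %x is atime, %y is mtime, %z is ctime):

me@box:/tmp$ stat --printf 'atime: %x\nmtime: %y\nctime: %z\n' test

Plain stat test works too, and labels them Access, Modify, and Change.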

SouthEast Linux Fest (SELF) 2012

I had a great time at SELF in Charlotte, NC this weekend. One of the highlights, for me, was for the first time meeting some fellow BSD types in the flesh – Kris Moore (founder of the PC-BSD distribution) and Dru Lavigne (Director of the FreeBSD Foundation), no less. While discussing the pain of doing FreeBSD installs onto RAIDZ, Kris told me about the new graphical installer in PC-BSD that lets you create new RAIDZ arrays of all types and install directly to them, all without ever having to leave the installer, which I found pretty exciting. I found it even more exciting when he told me that the procedures taken by the installer were based partly on my own work at freebsdwiki.net!

I set up a new VM with three virtual disks pretty much the minute I got home, and started a new PC-BSD 9.0 install. Sure enough, although the option is a little hard to discover, I managed to figure it out without having to go in search of any documentation – and without ever leaving the installer, and with a bare minimum of blood and chicken feathers, I got a brand new RAIDZ1 across my three virtual disks set up, and PC-BSD cheerfully installed onto it. (This is testing only, of course – in production, you should only do RAIDZ onto bare metal, not onto an abstraction like Linux logical volumes or raw files accessed through a hypervisor.) Pretty heady stuff!

To the right – Tux dropped by the table while Dru and Kris and I were chatting, and posed for me with BSD’s horns. How great is that?

Storing PHP sessions in memcached instead of in files

in php.ini:

session.save_handler = memcache
session.save_path = "tcp://serv01:11211,tcp://serv02:11211,tcp://serv03:11211"

Obviously, you need php5-memcache installed, replace “serv01”, “serv02”, and “serv03” with valid server address(es), and you’ll need to restart Apache after making the change.
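A couple of quick sanity checks after the restart – serv01 here is the hypothetical server name from the config above, and nc is plain netcat:

# confirm the memcache extension is actually loaded (CLI and Apache
# configs can differ; check phpinfo() under Apache to be certain)
me@box:~$ php5 -m | grep -i memcache

# hit a page that calls session_start(), then watch the item counts
# climb on the memcached side
me@box:~$ printf 'stats\r\nquit\r\n' | nc serv01 11211 | grep -E 'curr_items|total_items'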

Why would you need to do this? Well, this week I had to deal with a web application server pool that kept slowly increasing its number of children all the way up to MaxClients, no matter what. It was unpredictable, other than being a slow creep. Eventually, stracing the pids showed that they were getting stuck in FLOCK on files in /var/lib/php5/php_sess*. This turns out to be an endemic problem with PHP that the devs don’t seem inclined to fix: php’s garbage collector will delete session files dirtily if a php process (which in the case of mod_php, means an Apache process) violates any of the php limits, such as max_execution_time (among many, many others). So you end up with your php script trying to lock a session file (file descriptor 3) that php’s garbage collector already deleted, and therefore an infinitely hung process that will never go away on its own.

Changing over to using memcache to store php sessions eliminated this file lock issue and resulted in a much more stable situation – a server pool that had been creeping up to 800 children per server over the course of a couple hours has been running stable and sweet on less than 150 children per server for days now.