This comes up far too often, so rather than continuing to explain it over and over again, I’m going to try to do a really good job of it once and link to it here.
What’s ECC RAM? Is it a good idea?
ECC stands for Error Correcting Code. In a nutshell, ECC RAM is a special kind of server-grade memory that can detect and repair some of the most common kinds of in-memory corruption. For more detail on how ECC RAM does this, and which types of errors it can and cannot correct, the rabbit hole’s over here.
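For flavor, here’s a toy sketch in Python of the general mechanism – extra parity bits that let the hardware locate and flip back a single bad bit. This is a little Hamming(7,4) example with made-up function names, not what real ECC DIMMs implement (they use wider SECDED codes across 64-bit words), but the principle is the same:

    # Toy Hamming(7,4) single-bit error correction -- the idea behind ECC RAM.
    def encode(d):                      # d: list of 4 data bits
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

    def correct(c):                     # c: 7 code bits, possibly corrupted
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4,5,6,7
        syndrome = s1 + 2 * s2 + 4 * s3 # non-zero = 1-based position of the bad bit
        if syndrome:
            c[syndrome - 1] ^= 1        # flip the bad bit back
        return [c[2], c[4], c[5], c[6]] # recover the 4 data bits

    word = [1, 0, 1, 1]
    cw = encode(word)
    cw[4] ^= 1                          # simulate a cosmic-ray bit flip
    assert correct(cw) == word          # the single-bit error is repaired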
Now that we know what ECC RAM is, is it a good idea? Absolutely. In-memory errors, whether due to faults in the hardware or to the impact of cosmic radiation (yes, really), are a thing. They do happen. And if one happens in a particularly strategic place, you will lose data to it. Period. There’s no arguing this.
What’s ZFS? Is it a good idea?
ZFS is, among other things, a checksumming filesystem. This means that for every block committed to storage, a strong hash (somewhat misleadingly AKA a checksum) of the contents of that block is also written. (The validation hash is stored in the pointer to the block, that pointer is itself checksummed in the pointer above it, and so on and so forth. It’s turtles all the way down. Rabbit hole begins over here for this one.)
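As a rough illustration of the idea – the names here are invented for the example, nothing corresponds to real ZFS on-disk structures – a pointer that carries the hash of the block it points to can catch corruption in that block on every read:

    import hashlib

    class BlockPointer:
        """Points at a child block and carries a hash of its contents."""
        def __init__(self, data: bytes):
            self.data = data
            self.checksum = hashlib.sha256(data).digest()

        def read(self) -> bytes:
            # The hash lives in the *pointer*, not in the block itself,
            # so a corrupt block can't vouch for its own integrity.
            if hashlib.sha256(self.data).digest() != self.checksum:
                raise IOError("checksum mismatch: block is corrupt")
            return self.data

    ptr = BlockPointer(b"perfectly good data")
    ptr.data = b"perfectly good dataX"   # simulate on-disk corruption
    try:
        ptr.read()
    except IOError as err:
        print(err)                       # corruption is detected, never silently returned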
Is this a good idea? Absolutely. Combine ZFS checksumming with redundancy or parity, and now you have a self-healing array. If a block is corrupt on disk, then the next time it’s read, ZFS will see that it doesn’t match its checksum and will load a redundant copy (in the case of mirror vdevs or multiple-copy storage) or rebuild it from parity (in the case of RAIDZ vdevs). Assuming that copy of the block matches its checksum, ZFS will silently feed you the correct copy instead and log a checksum error against the copy that didn’t pass.
ZFS also supports scrubs, which will become important in the next section. When you tell ZFS to scrub storage, it reads every block that it knows about – including redundant copies – and checks them against their checksums. Any failing blocks are automatically overwritten with good blocks, assuming that a good (passing) copy exists, either redundant or reconstructed from parity. Regular scrubs are a significant part of maintaining a ZFS storage pool against long-term corruption.
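Here’s a minimal, hypothetical sketch of that verify-and-heal pass – again, invented names, not actual ZFS internals:

    import hashlib

    def sha(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def scrub_block(copies: list, checksum: bytes) -> str:
        """copies: the mirrored (or parity-reconstructed) versions of one block."""
        good = next((c for c in copies if sha(c) == checksum), None)
        if good is None:
            return "unrecoverable error logged; nothing overwritten"
        healed = 0
        for i, c in enumerate(copies):
            if sha(c) != checksum:
                copies[i] = good          # overwrite the failing copy with a passing one
                healed += 1
        return f"{healed} copy(ies) healed" if healed else "all copies clean"

    block = b"important data"
    cksum = sha(block)
    print(scrub_block([block, b"important dXta"], cksum))   # mirror with one bad copy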
Is ZFS and non-ECC worse than not-ZFS and non-ECC? What about the Scrub of Death?
OK, it’s pretty easy to demonstrate that a flipped bit in RAM means data corruption: if you write that flipped bit back out to disk, congrats, you just wrote bad data. There’s no arguing that. The real issue here isn’t whether ECC is good to have, it’s whether non-ECC is particularly problematic with ZFS. The scenario usually thrown out is the much-dreaded Scrub Of Death.
TL;DR version of the scenario: ZFS is on a system with non-ECC RAM that has a stuck bit, its user initiates a scrub, and as a result of in-memory corruption good blocks fail checksum tests and are overwritten with corrupt data, thus instantly murdering an entire pool. As far as I can tell, this idea originates with a very prolific user on the FreeNAS forums named Cyberjock, and he lays it out in this thread here. It’s a scary idea – what if the very thing that’s supposed to keep your system safe kills it? A scrub gone mad! Nooooooo!
The problem is, the scenario as written doesn’t actually make sense. For one thing, even if you have a particular address in RAM with a stuck bit, you aren’t going to have your entire filesystem run through that address. That’s not how memory management works, and if it were, you wouldn’t even have managed to boot the system: it would have crashed and burned horribly when it failed to load the operating system in the first place. So no, you might corrupt a block here and there, but you’re not going to wring the entire filesystem through a shredder block by precious block.
But we’re being cheap here, remember – say you only corrupt one block in 5,000 this way. That would still be hellacious. So let’s examine the more reasonable idea of corrupting some data due to bad RAM during a scrub. And let’s assume that we have RAM that not only isn’t working 100% properly, but is actively goddamn evil and trying its naive but enthusiastic best to specifically kill your data during a scrub:
First, you read a block. This block is good. It is perfectly good data written to a perfectly good disk with a perfectly matching checksum. But that block is read into evil RAM, and the evil RAM flips some bits. Perhaps those bits are in the data itself, or perhaps those bits are in the checksum. Either way, your perfectly good block now does not appear to match its checksum, and since we’re scrubbing, ZFS will attempt to actually repair the “bad” block on disk. Uh-oh! What now?
Next, you read a copy of the same block – this copy might be a redundant copy, or it might be reconstructed from parity, depending on your topology. The redundant copy is easy to visualize – you literally stored another copy of the block on another disk. Now, if your evil RAM leaves this block alone, ZFS will see that the second copy matches its checksum, and so it will overwrite the first block with the same data it had originally – no data was lost here, just a few wasted disk cycles. OK. But what if your evil RAM flips a bit in the second copy? Since it doesn’t match the checksum either, ZFS doesn’t overwrite anything. It logs an unrecoverable data error for that block, and leaves both copies untouched on disk. No data has been corrupted. A later scrub will attempt to read all copies of that block and validate them just as though the error had never happened, and if this time either copy passes, the error will be cleared and the block will be marked valid again (with any copies that don’t pass validation being overwritten from the one that did).
Well, huh. That doesn’t sound so bad. So what does your evil RAM need to do in order to actually overwrite your good data with corrupt data during a scrub? Well, first it needs to flip some bits during the initial read of every block that it wants to corrupt. Then, on the second read of a copy of the block from parity or redundancy, it needs to not only flip bits, it needs to flip them in such a way that you get a hash collision. In other words, random bit-flipping won’t do – you need some bit-flipping in the data (with or without some more bit-flipping in the checksum) that adds up to the corrupt data correctly hashing to the value in the checksum. ZFS uses 256-bit validation checksums (fletcher4 by default, with SHA-256 available), which means that randomly corrupted data has on the order of a 1 in 2^256 chance of still matching its checksum. To be fair, we’re using evil RAM here, so it’s probably going to do lots of experimenting, and it will try flipping bits in both the data and the checksum itself, and it will do so multiple times for any single block. However, that’s multiple 1 in 2^256 (aka roughly 1 in 10^77) chances, which still makes it vanishingly unlikely to actually happen… and if your RAM is that damn evil, it’s going to kill your data whether you’re using ZFS or not.
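If you want to sanity-check that arithmetic yourself:

    # Rough odds that randomly corrupted data still matches a 256-bit checksum.
    odds = 2 ** 256
    print(f"1 in {odds:.2e}")             # roughly 1 in 1.16e+77

    # Even a million corrupt re-reads per second, sustained for a century,
    # barely scratches those odds:
    attempts = 1_000_000 * 60 * 60 * 24 * 365 * 100
    print(attempts / odds)                # ~2.7e-62 -- effectively zero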
But what if I’m not scrubbing?
Well, if you aren’t scrubbing, then your evil RAM will have to wait for you to actually write to the blocks in question before it can corrupt them. Fortunately for it, though, you write to storage pretty much all day long… including to the metadata that organizes the whole kit and kaboodle. First time you update the directory that your files are contained in, BAM! It’s gotcha! If you stop and think about it, in this evil RAM scenario ZFS is incredibly helpful, because your RAM now needs to not only be evil but be bright enough to consistently pull off collision attacks. So if you’re running non-ECC RAM that turns out to be appallingly, Lovecraftianishly evil, ZFS will mitigate the damage, not amplify it.
If you are using ZFS and you aren’t scrubbing, by the way, you’re setting yourself up for long term failure. If you have on-disk corruption, a scrub can fix it only as long as you really do have a redundant or parity copy of the corrupted block which is good. Once you corrupt all copies of a given block, it’s too late to repair it – it’s gone. Don’t be afraid of scrubbing. (Well, maybe be a little wary of the performance impact of scrubbing during high demand times. But don’t be worried about scrubbing killing your data.)
I’ve constructed a doomsday scenario featuring RAM evil enough to kill my data after all! Mwahahaha!
OK. But would using any other filesystem that isn’t ZFS have protected that data? ‘Cause remember, nobody’s arguing that you can lose data to evil RAM – the argument is about whether evil RAM is more dangerous with ZFS than it would be without it.
I really, really want to use the Scrub Of Death in a movie or TV show. How can I make it happen?
What you need here isn’t evil RAM, but an evil disk controller. Have it flip one bit per block read or written from disk B, but leave the data from disk A alone. Now scrub – every block on disk B will be overwritten with a copy from disk A, but the evil controller will flip bits on write, so now, all of disk B is written with garbage blocks. Now start flipping bits on write to disk A, and it will be an unrecoverable wreck pretty quickly, since there’s no parity or redundancy left for any block. Your choice here is whether to ignore the metadata for as long as possible, giving you the chance to overwrite as many actual data blocks as you can before the jig is up as they are written to by the system, or whether to pounce straight on the metadata and render the entire vdev unusable in seconds – but leave the actual data blocks intact for possible forensic recovery.
Alternately, you could just skip straight to that second step and start flipping bits as data is written to any or all individual devices, and you’ll produce real data loss quickly enough. But you specifically wanted a scrub of death, not just bad hardware, right?
I don’t care about your logic! I wish to appeal to authority!
OK. “Authority” in this case doesn’t get much better than Matthew Ahrens, one of the cofounders of ZFS at Sun Microsystems and current ZFS developer at Delphix. In the comments to one of my filesystem articles on Ars Technica, Matthew said “There’s nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem.”
Hope that helps. =)
Are there any spec sheets you referenced regarding how corruption mitigation works with non-ECC RAM? Preferably from Oracle? Or is this document just a hypothesis?
Follow the second link in the article. It will lead you to the relevant section of the ZFS article on Wikipedia, which references Jeff Bonwick’s blog, which is where I first learned about both ZFS itself and how its data integrity features work, back in 2008 IIRC.
If you aren’t familiar, Jeff is one of the original creators of ZFS. Matt Ahrens has also publicly, if quietly, weighed in on this to say that ZFS poses no special risk to non-ECC systems.
The ultimate reference, of course, is the source code itself.
Also, for the metadata (like directory entries), ZFS automatically uses copies=2, so even if you have no redundancy, there are two copies of your metadata. Now, the memory error might have corrupted both of those. If it’s a case of ‘memory error causes crash, reboot, directory bad’, then because ZFS is copy-on-write, it can roll back and get back to good data.
On a long-running system, if that directory entry is updated again before it is read, it might be silently repaired.
Scientific study showing ZFS can recover from any disk bit flipping, but not from memory bit flipping: http://research.cs.wisc.edu/adsl/Publications/zfs-corruption-fast10.pdf
The important thing is, ZFS almost always detects the problem, whereas another filesystem would not even tell you something was going wrong.
I wonder what happens if the system thinks it’s handling bad data that is in fact bad memory and thus rewrites the presumably bad data back to the appropriate disk. Will this result in any metadata being written to disk?
Metadata is stored as copies=2 on disk, but at some point it needs to be generated from data in memory. If that data is garbage and it affects the wrong metadata, your pool could be gone.
If a scrub can never result in metadata being regenerated and rewritten, I agree that conceptually, a scrub cannot kill your pool, even with bad RAM.
Otherwise, it can, because the self-healing feature actually may become the thing that kills the pool.
> I wonder what happens if the system thinks it’s handling bad data that is in fact bad memory and thus rewrites the presumably bad data back to the appropriate disk.
The only way that a scrub writes data is if ALL of the following things occur:
1. A block is read and fails its validation hash (data or metadata).
2. A redundant copy of that block – literal, or reconstructed from parity – is read, and it *does* pass its validation hash.
If both of those things happen, the block which did not pass its validation hash is overwritten with the copy of that block that did pass its check. The copy which validated successfully is not itself overwritten – why would it be? It *validated*, it doesn’t need fixing!
If a block did not initially fail its validation hash test, it won’t be overwritten. Period.
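Put as a tiny sketch – hypothetical code, not anything from the actual ZFS source – the whole decision boils down to this:

    def scrub_action(copies_pass):
        """copies_pass: whether each stored copy of one block matched its validation hash."""
        if all(copies_pass):
            return "nothing written"                   # everything validated
        if any(copies_pass):
            return "rewrite only the failing copies"   # healed from a copy that passed
        return "log unrecoverable error; write nothing"

A copy that validated is never a candidate for overwriting, which is why a scrub on its own can’t propagate corruption the way the Scrub of Death story requires.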
I was never talking about the good copy. So a block is corrupted by memory and thus considered bad. ZFS then reads the redundant copy and it passes the checksum, because both checksum and data are stored in a good section of RAM. ZFS then tries to correct the initial bad block by overwriting it. Now my question is: will ZFS write just the data & checksum to disk, or will there be a whole copy-on-write sort of thing going on where metadata is updated? Because in that case, if metadata is updated ‘up the chain’ to the uberblock, a scrub could kill the filesystem. If that doesn’t happen, then ZFS is fine and the scrub of death does not exist.
There are new DRAM chips on the market which have integrated ECC logic (and the extra memory space for storing the ECC parity bits).
As these ECC DRAMs are 100% compatible with conventional DRAMs, they fit on any board and can also be used to produce memory modules. The good thing is: the CPU does not need to be ECC capable, because the complete error correction process is performed by and inside these ECC DRAM memory chips, independently from the processor.
Take a look at http://www.intelligentmemory.com
They have these ECC DRAMs in DDR3, DDR2, and DDR1 technologies with up to 1Gbit per chip. Currently they are not listing any memory modules built with these new components, but I know they can make DIMMs and SO-DIMMs of up to 2GB with them. Those modules would then be standard non-ECC, 64 bits wide, while each ECC DRAM chip on them performs its own ECC correction. As multiple DRAMs run in parallel on a module, the strength of this method of ECC will be far better than on a server with ECC.
I just think they will be quite expensive…
^^ interesting… question is, is that an actual product, or is it just vapor?
After reading your article, I think it makes more sense than blindly following the FreeNAS forum, where everyone yells “ECC or death!” It’s becoming like a religious cult: if you don’t believe in my god, you’ll be damned to hell. It’s just pure fear-mongering over there. To make it worse, most people head to FreeNAS because they want a cheap system and are probably noobs about ZFS, like me. And the admin there bashes anyone who doesn’t run ECC, making everyone believe it’s ECC or bust. For the record, I’ve had a ZFS setup with cheap non-ECC RAM for almost 3 years and never had a problem. “I’m so lucky! yay” My QNAP, on the other hand, gave me headaches with folder corruption.
Some of us use open source freeware for a reason: we are trying to be cheap and save cost. If I had the money, I would no doubt go to an enterprise-level system. It’s not about ECC vs non-ECC, it’s about how far you want to go to protect your data. If I head down the ECC path, then what next? Dual PSU? UPS? An indestructible case in case my apartment collapses?
Of course I am willing to accept some data loss (but not all). Just like on a Windows system, if I pull the power while it’s writing I will have corruption, and a check disk would fix it, though not perfectly, but I can live with it. But it seems like the proponents are saying that if I don’t use ECC RAM, my zpool will die and everything is lost. If ZFS is so flimsy, why put it in production? The ZFS creators should warn people not to use their software on home-built PCs.
Then again, we are human; if we had the ability to reason logically and not follow blindly, cults would not exist, and neither would world wars. OK, enough ranting, let’s get back to the real world of working and dealing with morons.
Actually, ZFS (with or without ECC RAM) is extremely resilient to crashes (power failure, kernel crash, or other) due in part to the use of the ZIL as a transactional journal. And no, you don’t need a checkdisk, in fact you *can’t* do a checkdisk; ZFS has no such function and doesn’t need one. It has scrubbing, but you don’t need a scrub any more after a power crash than you did before the power crash.
Funny that you should say that an evil disk controller might be needed to do a scrub of death.
I have personally and unwittingly been running 15 drives (of a 19-drive RAIDZ3 pool) off of three evil SATA controllers plus port multipliers that were cheap as fudge. Cheap enough that they really didn’t like the kernel in my then-version of Ubuntu (I forget which). Not that they wouldn’t work; they’d just corrupt data on the fly for the fun of it when running in SATA2 mode.
Actually, I just Googled a bit and found a couple of posts by me from around the time:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/987353
https://askubuntu.com/questions/145965/how-do-i-target-a-specific-driver-for-libata-kernel-parameter-modding
They ran like this for a long time, with tons of traffic in my pool, reads and writes galore, with me none-the-wiser. At some point something caused me to notice that reads directly from the drive block devices never produced the same data — I think I was mirroring a drive and the checksums didn’t match… Either way, I offlined the pool and started dd’ing ranges on drives on the controllers and md5’ing the data, noticing that it just kept changing, always.
I eventually fixed the problem with the boot option forcing the SATA ports into legacy mode as per the second link I posted above, and I checked all data on the pool. The scrub revealed no issues, and on top of that, I’ve made a habit of always checksumming storage data with SFVs and protecting it with 1% PAR2 parity data, and verifying these checksums revealed zero problems as well.
This was with three controllers actively being evil against data going to and from 15 out of 19 drives on my zpool. ZFS effectively just flipped those controllers the finger, fixed the data transparently, and allowed me to continue on my merry way. Oh, and this was on a non-ECC box, built around my old desktop computer’s entry-level ASRock motherboard.
If non-ECC RAM flips bits before ZFS receives those bits, then ZFS cannot know if they are corrupt.
Thanks for sharing this article post. ECC RAM can repair some of the most common kinds of in-memory corruption. In-memory errors, whether due to faults in the hardware or to the impact of cosmic radiation, really are a thing.
For a read-only filesystem, is ZFS recommended if we are interested in its data reliability capabilities to, at least, detect data corruption from an SSD or from evil non-ECC RAM?
The long and skinny is, ZFS is just as affected as any other filesystem by bad RAM.
It can write data with a bad bit to the drive. Likely the scrub of death won’t occur, but if the metadata is touched in such a way, a scrub won’t fix it.
And unlike any other filesystem, there are no chkdsk/fsck or data recovery tools for a ZFS block device that won’t mount.
I believe Cyberjock more than you. I’m also deeply concerned about Ubuntu promoting desktop installs with ZFS. Probably nobody uses ECC there. We will see a lot of data loss because no ECC is installed.
Hi Jim,
Would you update the link to Matthew A’s comment #129 “There’s nothing special about ZFS that requires …” to …
https://arstechnica.com/civis/threads/ars-walkthrough-using-the-zfs-next-gen-filesystem-on-linux.1235679/page-4#post-26303271
… as the link currently goes to the start of the thread?
Cyberjock is one of the reasons so many people never actually got started with FreeNAS (now TrueNAS), and is probably a cause for many of unRAID’s sales. He’s such a mean-spirited know-it-all who went out of his way to berate anyone in the forums who didn’t know as much as him.