ZFS clones look great on paper: they’re created instantaneously, they’re read/write, and they’re initially “free,” since they reference the same blocks their parent snapshots do. They also tend to be extra snappy performance-wise at first, because many of those parent blocks are very likely already in the ARC. If you create ten clones of the same VM image (for instance), all ten clones share the same blocks in the ARC, instead of those blocks needing to be cached ten separate times. Huge win!
But as great as a clone sounds at first blush, you probably don’t want to use one for anything that isn’t ephemeral (i.e., intended to be destroyed in fairly short order). That’s because a clone pins its parent snapshot forever: you can’t destroy the parent snapshot without destroying the clone along with it… even if and when the clone becomes 100% divergent and no longer shares a single block reference with its parent. Let’s examine this on a small scale.
Practical testing
On my workstation banshee, I create a new dataset, make sure compression is turned off so as not to confuse us, and populate it with a 256MB chunk of random binary stuff:
root@banshee:~# zfs create banshee/demo ; zfs set compression=off banshee/demo
root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.483868 s, 555 MB/s
 256MiB 0:00:00 [ 533MiB/s] [<=>                                        ]
I know this looks a little weird, but AES-256 is roughly an order of magnitude faster than /dev/urandom, so what I did here was use /dev/urandom to seed AES-256, then encrypt a 256MB chunk of /dev/zero with it. At the end of this procedure, we have a dataset with 256MB of data in it:
root@banshee:~# ls -lh /banshee/demo
total 262M
-rw-r--r-- 1 root root 256M Mar 15 14:39 random.bin

root@banshee:~# zfs list banshee/demo
NAME           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo   262M  83.3G   262M  /banshee/demo
OK. Next step, we take a snapshot of banshee/demo, then create a clone using that snapshot as its parent.
Creating a clone
You don’t actually create a ZFS clone of a dataset at all; you create a clone from a snapshot of a dataset. So before we can “clone banshee/demo”, we first have to take a snapshot of it, and then we clone that.
root@banshee:~# zfs snapshot banshee/demo@parent-snapshot
root@banshee:~# zfs clone banshee/demo@parent-snapshot banshee/demo-clone

root@banshee:~# zfs list -rt all banshee/demo
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   262M  83.3G   262M  /banshee/demo
banshee/demo@parent-snapshot      0      -   262M  -

root@banshee:~# zfs list -rt all banshee/demo-clone
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.3G   262M  /banshee/demo-clone
So right now, we have the dataset banshee/demo, which shares all its blocks with banshee/demo@parent-snapshot, which in turn shares all its blocks with banshee/demo-clone. We see 262M in USED for banshee/demo, with nothing or next-to-nothing in USED for either banshee/demo@parent-snapshot or banshee/demo-clone.
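If you ever need to see this relationship spelled out, a clone carries a read-only origin property pointing at its parent snapshot. Output should look roughly like this (exact column widths will vary):

root@banshee:~# zfs get origin banshee/demo-clone
NAME                PROPERTY  VALUE                         SOURCE
banshee/demo-clone  origin    banshee/demo@parent-snapshot  -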
Beginning divergence: removing data
Now, we remove all the data from banshee/demo:
root@banshee:~# rm /banshee/demo/random.bin

root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   262M  83.3G    19K  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.3G   262M  /banshee/demo-clone
We still only have 262M of USED – but it’s all actually in banshee/demo@parent-snapshot now. You can tell because the REFER column has changed – banshee/demo@parent-snapshot and banshee/demo-clone still both REFER 262M, but banshee/demo only REFERs 19K now. (You still see 262M in USED for banshee/demo because banshee/demo@parent-snapshot is a child of banshee/demo, so its contents count towards banshee/demo’s USED figure.)
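If you’d rather have ZFS break that accounting down for you instead of doing the arithmetic in your head, zfs list -o space splits USED into per-category columns – USEDSNAP is the space pinned by snapshots, USEDDS the space referenced by the live dataset itself. Output at this point would be approximately:

root@banshee:~# zfs list -o space banshee/demo
NAME          AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
banshee/demo  83.3G   262M      262M     19K              0          0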
Next up: we re-fill the parent dataset, banshee/demo, with 256MB of different random garbage.
Continuing divergence: replacing data in the parent
root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.498349 s, 539 MB/s
 256MiB 0:00:00 [ 516MiB/s] [<=>                                        ]

root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   523M  83.2G   262M  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone     1K  83.2G   262M  /banshee/demo-clone
OK, at this point you see that the USED for banshee/demo shoots up to 523M: that’s the total of the 262M of original random garbage which is still preserved in banshee/demo@parent-snapshot, plus the new 262M of different random garbage in banshee/demo itself. The snapshot now diverges completely from the parent dataset, having no blocks in common at all.
So far, banshee/demo-clone is still 100% convergent with banshee/demo@parent-snapshot, so we’re still getting some conservation of space on disk and in ARC from that. But remember, the whole point of making the clone was so that we could write to it as well as read from it. So let’s do exactly that, and make the clone 100% divergent from its parent, too.
Diverging completely: replacing data in the clone
root@banshee:~# dd if=/dev/zero bs=16M count=16 | openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt | pv > /banshee/demo-clone/random.bin
16+0 records in
16+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 0.50151 s, 535 MB/s
 256MiB 0:00:00 [ 534MiB/s] [<=>                                        ]

root@banshee:~# zfs list -rt all banshee/demo ; zfs list banshee/demo-clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
banshee/demo                   523M  82.8G   262M  /banshee/demo
banshee/demo@parent-snapshot   262M      -   262M  -
NAME                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone   262M  82.8G   262M  /banshee/demo-clone
There, done. We now have a parent dataset, banshee/demo, which diverges completely from its snapshot banshee/demo@parent-snapshot, and a clone, banshee/demo-clone, which also diverges completely from banshee/demo@parent-snapshot.
Examining the suck
Since the parent, its snapshot, and the clone no longer share any blocks with one another, we’re using the full 786MB of on-disk space that the three of them add up to. And since they don’t share any blocks in the ARC either, we’re left with absolutely no benefit in either storage consumption or performance from having used a clone.
Worse, despite having no blocks in common and no perceptible benefit to the clone structure, all three are still inextricably linked, and neither banshee/demo nor banshee/demo@parent-snapshot can be destroyed without also destroying banshee/demo-clone:
root@banshee:~# zfs destroy banshee/demo -r
cannot destroy 'banshee/demo': filesystem has dependent clones
use '-R' to destroy the following datasets:
banshee/demo-clone

root@banshee:~# zfs destroy banshee/demo@parent-snapshot
cannot destroy 'banshee/demo@parent-snapshot': snapshot has dependent clones
use '-R' to destroy the following datasets:
banshee/demo-clone
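Before reaching for destroy at all, you can audit exactly which clones are pinning a snapshot: on reasonably modern OpenZFS, snapshots expose a read-only clones property listing their dependents. Output should look approximately like this:

root@banshee:~# zfs get clones banshee/demo@parent-snapshot
NAME                          PROPERTY  VALUE               SOURCE
banshee/demo@parent-snapshot  clones    banshee/demo-clone  -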
So now you’re left with a great unwieldy mass of tangled dependencies, wasted space, and no perceptible benefits at all.
Conclusion and practical example
Imagine that you’re storing VM images in ZFS: you began with a “gold” image of a freshly installed operating system, and created ten different clones to run ten different VMs from it. Initially, this seemed great: you could create the clones instantaneously, and they shared tons of blocks, so they consumed a fraction of the ARC that ten complete, separate copies would have.
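In case you want to try that pattern yourself, the deployment step really is that trivial – a minimal sketch, with hypothetical pool and dataset names:

root@banshee:~# zfs snapshot tank/images/gold-ubuntu1604@deploy
root@banshee:~# for n in $(seq 1 10); do zfs clone tank/images/gold-ubuntu1604@deploy tank/vm/vm$n; done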
A year later, however, your gold image – of, let’s say, Ubuntu 16.04.1 – has diverged to a staggering degree, thanks to the rolling updates necessary to bring it all the way to Ubuntu 16.04.2. Your VMs have also diverged tremendously, both from their parent snapshot and from one another. And now you’re stuck with the year-old snapshot of the “gold” image: completely useless to you, but forever engraved on your drive unless and until you’re willing to replicate or otherwise block-for-block copy your VMs painstakingly into self-sufficient datasets with no outside references. You have no remaining performance benefits, and you’ve gained an extra SPOF (single point of failure): some admin – maybe even you – might see that parent snapshot nobody cared about anymore taking up all that disk space, and…
root@banshee:~# zfs destroy -R banshee/demo@parent-snapshot
root@banshee:~# zfs list banshee/demo-clone
cannot open 'banshee/demo-clone': dataset does not exist
One “oops” later, that “useless” parent snapshot and every single one of those clones you were using in production are gone forever. (Or, hopefully, just gone until you can restore them from your off-pool backup. You are maintaining replicated backups on at least one other pool, preferably on another machine, aren’t you? Aren’t you?!)
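For the record, the painstaking block-for-block escape route mentioned above looks something like this – a sketch, assuming you can afford the space and downtime for a full copy:

root@banshee:~# zfs snapshot banshee/demo-clone@detach
root@banshee:~# zfs send banshee/demo-clone@detach | zfs receive banshee/demo-standalone
root@banshee:~# zfs destroy -r banshee/demo-clone

A full (non-incremental) zfs send stream is self-contained, so banshee/demo-standalone shares no block references with the old parent snapshot – and the clone, along with its whole tangled ancestry, can finally be destroyed.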
A common rejoinder: “zfs promote should mitigate some of these issues.”
I don’t think zfs promote helps this specific case at all. If you need the data in both the parent and the child, how does simply flipping the relationship do anything for you? You need to sever the relationship entirely, so that you can destroy the original snapshot without having to also destroy one of the two datasets that depend on it.
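To make that concrete, here’s roughly what promotion does with the demo datasets from above (figures approximate; promotion also migrates the space accounting):

root@banshee:~# zfs promote banshee/demo-clone

root@banshee:~# zfs list -rt all banshee/demo-clone
NAME                                 USED  AVAIL  REFER  MOUNTPOINT
banshee/demo-clone                   523M  82.8G   262M  /banshee/demo-clone
banshee/demo-clone@parent-snapshot   262M      -   262M  -

root@banshee:~# zfs get origin banshee/demo
NAME          PROPERTY  VALUE                               SOURCE
banshee/demo  origin    banshee/demo-clone@parent-snapshot  -

The snapshot and its 262M simply move under the clone, and banshee/demo becomes the dependent. The knot is exactly as tight as before; it’s just tied to a different post.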