ZFS: You should use mirror vdevs, not RAIDZ.

Continuing this week’s “making an article so I don’t have to keep typing it” ZFS series… here’s why you should stop using RAIDZ, and start using mirror vdevs instead.

The basics of pool topology

A pool is a collection of vdevs. Vdevs can be any of the following (and more, but we’re keeping this relatively simple):

  • single disks (think RAID0)
  • redundant vdevs (aka mirrors – think RAID1)
  • parity vdevs (aka RAIDZ stripes – think RAID5/RAID6/RAID7, i.e. single, dual, and triple parity stripes)

The pool itself will distribute writes among the vdevs inside it on a relatively even basis. However, this is not a “stripe” like you see in RAID10 – it’s just distribution of writes. If you make a RAID10 out of 2 2TB drives and 2 1TB drives, the second TB on the bigger drives is wasted, and your after-redundancy storage is still only 2 TB. If you put the same drives in a zpool as two mirror vdevs, they will be a 2x2TB mirror and a 2x1TB mirror, and your after-redundancy storage will be 3TB. If you keep writing to the pool until you fill it, you may completely fill the two 1TB disks long before the two 2TB disks are full. Exactly how the writes are distributed isn’t guaranteed by the specification, only that they will be distributed.
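
For concreteness, creating that mixed-size pool as two mirror vdevs looks something like this (a minimal sketch – the pool name “tank” and the device names are hypothetical, substitute your own):

zpool create tank mirror /dev/ada0 /dev/ada1 mirror /dev/ada2 /dev/ada3
zpool status tank    # shows two separate mirror vdevs; the pool distributes writes across both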

What if you have twelve disks, and you configure them as two RAIDZ2 (dual parity stripe) vdevs of six disks each? Well, your pool will consist of two RAIDZ2 arrays, and it will distribute writes across them just like it did with the pool of mirrors. What if you made a ten disk RAIDZ2, and a two disk mirror? Again, they go in the pool, the pool distributes writes across them. In general, you should probably expect a pool’s performance to exhibit the worst characteristics of each vdev inside it. In practice, there’s no guarantee where reads will come from inside the pool – they’ll come from “whatever vdev they were written to”, and the pool gets to write to whichever vdevs it wants to for any given block(s).
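
On the command line, a pool of two six-disk RAIDZ2 vdevs is built the same way – you just name each vdev type before its member disks (again, pool and device names here are hypothetical):

zpool create tank raidz2 da0 da1 da2 da3 da4 da5 raidz2 da6 da7 da8 da9 da10 da11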

Storage Efficiency

If it isn’t clear from the name, storage efficiency is the ratio of usable storage capacity (after redundancy or parity) to raw storage capacity (before redundancy or parity).

This is where a lot of people get themselves into trouble. “Well obviously I want the most usable TB possible out of the disks I have, right?” Probably not. That big number might look sexy, but it’s liable to get you into a lot of trouble later. We’ll cover that further in the next section; for now, let’s just look at the storage efficiency of each vdev type.

  • single disk vdev(s) – 100% storage efficiency. Until you lose any single disk, and it becomes 0% storage efficiency…
    (figure: eight single-disk vdevs)

  • RAIDZ1 vdev(s) – (n-1)/n, where n is the number of disks in each vdev. For example, a RAIDZ1 of eight disks has an SE of 7/8 = 87.5%.
    (figure: an eight disk raidz1 vdev)

  • RAIDZ2 vdev(s) – (n-2)/n. For example, a RAIDZ2 of eight disks has an SE of 6/8 = 75%.
    (figure: an eight disk raidz2 vdev)

  • RAIDZ3 vdev(s) – (n-3)/n. For example, a RAIDZ3 of eight disks has an SE of 5/8 = 62.5%.
    (figure: an eight disk raidz3 vdev)

  • mirror vdev(s) – 1/n, where n is the number of disks in each vdev. Eight disks set up as four 2-disk mirror vdevs have an SE of 1/2 = 50%.
    (figure: a pool of four 2-disk mirror vdevs)

One final note: striped (RAIDZ) vdevs aren’t supposed to be “as big as you can possibly make them.” Experts are cagey about actually giving concrete recommendations about stripe width (the number of devices in a striped vdev), but they invariably recommend making them “not too wide.” If you consider yourself an expert, make your own expert decision about this. If you don’t consider yourself an expert, and you want more concrete general rule-of-thumb advice: no more than eight disks per vdev.
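
If you want to sanity-check the storage efficiency of a pool you’ve already built, compare what zpool and zfs report – keep in mind that for RAIDZ vdevs, zpool list shows raw capacity including parity, while zfs list shows space after parity (pool name below is hypothetical):

zpool list tank    # raw pool capacity, including parity
zfs list tank      # usable space, after parity/redundancy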

Fault tolerance / degraded performance

Be careful here. Keep in mind that if any single vdev fails, the entire pool fails with it. There is no fault tolerance at the pool level, only at the individual vdev level! So if you create a pool with single disk vdevs, any failure will bring the whole pool down.

It may be tempting to go for that big storage number and use RAIDZ1… but it’s just not enough. If a disk fails, the performance of your pool will be drastically degraded while you’re replacing it. And you have no fault tolerance at all until the disk has been replaced and completely resilvered… which could take days or even weeks, depending on the performance of your disks, the load your actual use places on the disks, etc. And if one of your disks failed, and age was a factor… you’re going to be sweating bullets wondering if another will fail before your resilver completes. And then you’ll have to go through the whole thing again every time you replace a disk. This sucks. Don’t do it. Conventional RAID5 is strongly deprecated for exactly the same reasons. According to Dell, “Raid 5 for all business critical data on any drive type [is] no longer best practice.”

RAIDZ2 and RAIDZ3 try to address this nightmare scenario by expanding to dual and triple parity, respectively. This means that a RAIDZ2 vdev can survive two drive failures, and a RAIDZ3 vdev can survive three. Problem solved, right? Well, problem mitigated – but the degraded performance and resilver time is even worse than a RAIDZ1, because the parity calculations are considerably gnarlier. And it gets worse the wider your stripe (number of disks in the vdev).

Saving the best for last: mirror vdevs. When a disk fails in a mirror vdev, your pool is minimally impacted – nothing needs to be rebuilt from parity, you just have one less device to distribute reads from. When you replace and resilver a disk in a mirror vdev, your pool is again minimally impacted – you’re doing simple reads from the remaining member of the vdev, and simple writes to the new member of the vdev. In no case are you re-writing entire stripes, and all other vdevs in the pool are completely unaffected. Mirror vdev resilvering goes really quickly, with very little impact on the performance of the pool. Resilience to multiple failures is very strong, though it requires some calculation – your chance of surviving a disk failure is 1-(f/(n-f)), where f is the number of disks already failed, and n is the number of disks in the full pool. In an eight disk pool, this means 100% survival of the first disk failure, 85.7% survival of a second disk failure, and 66.7% survival of a third disk failure. This assumes two disk vdevs, of course – three disk mirrors are even more resilient.

But wait, why would I want to trade guaranteed survival of two disk failures in RAIDZ2 for only 85.7% survival of a second disk failure in a pool of mirrors? Because of the drastically shorter time to resilver, and the drastically lower load placed on the pool while doing so. The only disk more heavily loaded than usual during a mirror vdev resilvering is the other disk in the vdev – which might sound bad, but remember that it’s no more heavily loaded than it would’ve been as a RAIDZ member.  Each block resilvered on a RAIDZ vdev requires a block to be read from each surviving RAIDZ member; each block written to a resilvering mirror only requires one block to be read from a surviving vdev member.  For a six-disk RAIDZ1 vs a six-disk pool of mirrors, that’s five times the extra I/O demanded of the surviving disks.

Resilvering a mirror is much less stressful than resilvering a RAIDZ.

One last note on fault tolerance

No matter what your ZFS pool topology looks like, you still need regular backup.

Say it again with me: I must back up my pool!

ZFS is awesome. Combining checksumming and parity/redundancy is awesome. But there are still lots of potential ways for your data to die, and you still need to back up your pool. Period. PERIOD!

Normal performance

It’s easy to think that a gigantic RAIDZ vdev would outperform a pool of mirror vdevs, for the same reason it’s got a greater storage efficiency. “Well, when I read or write the data, it comes off of / goes onto more drives at once, so it’s got to be faster!” Sorry, it doesn’t work that way. You might see results that look kinda like that if you’re doing a single read or write of a lot of data at once while absolutely no other activity is going on, and the RAIDZ is completely unfragmented… but the moment you start throwing in other simultaneous reads or writes, fragmentation on the vdev, etc., what you actually need is random access IOPS. But don’t listen to me, listen to one of the core ZFS developers, Matthew Ahrens: “For best performance on random IOPS, use a small number of disks in each RAID-Z group. E.g, 3-wide RAIDZ1, 6-wide RAIDZ2, or 9-wide RAIDZ3 (all of which use ⅓ of total storage for parity, in the ideal case of using large blocks). This is because RAID-Z spreads each logical block across all the devices (similar to RAID-3, in contrast with RAID-4/5/6). For even better performance, consider using mirroring.”

Please read that last bit extra hard: For even better performance, consider using mirroring. He’s not kidding. Just like RAID10 has long been acknowledged the best performing conventional RAID topology, a pool of mirror vdevs is by far the best performing ZFS topology.
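
If you want to see for yourself how reads and writes get spread across the vdevs in a live pool, zpool iostat with the -v flag breaks the numbers out per vdev and per disk (pool name is hypothetical; the trailing 5 samples every five seconds):

zpool iostat -v tank 5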

Future expansion

This is one that should strike near and dear to your heart if you’re a SOHO admin or a hobbyist. One of the things about ZFS that everybody knows to complain about is that you can’t expand RAIDZ. Once you create it, it’s created, and you’re stuck with it.

Well, sorta.

Let’s say you had a server with 12 slots to put drives in, and you put six drives in it as a RAIDZ2. When you bought it, 1TB drives were a great bang for the buck, so that’s what you used. You’ve got 6TB raw / 4TB usable. Two years later, 2TB drives are cheap, and you’re feeling cramped. So you fill the remaining six bays in your server, and now you’ve added a 12TB raw / 8TB usable vdev, for a total pool size of 18TB raw / 12TB usable. Two years after that, 4TB drives are out, and you’re feeling cramped again… but you’ve got no place left to put drives. Now what?

Well, you actually can upgrade that original RAIDZ2 of 1TB drives – what you have to do is fail one disk out of the vdev and remove it, then replace it with one of your 4TB drives. Wait for the resilvering to complete, then fail a second one, and replace it. Lather, rinse, repeat until you’ve replaced all six drives, and resilvered the vdev six separate times – and after the sixth and last resilvering finishes, you have a 24TB raw / 16TB usable vdev in place of the original 6TB/4TB one. Question is, how long did it take to do all that resilvering? Well, if that 6TB raw vdev was nearly full, it’s not unreasonable to expect each resilvering to take twelve to sixteen hours… even if you’re doing absolutely nothing else with the system. The more you’re trying to actually do in the meantime, the slower the resilvering goes. You might manage to get six resilvers done in six full days, replacing one disk per day. But it might take twice that long or worse, depending on how willing to hover over the system you are, and how heavily loaded it is in the meantime.
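
The replace-and-resilver dance itself looks roughly like this, repeated once per disk (a sketch with hypothetical pool and device names – and note that the autoexpand property needs to be on before the vdev will grow to use the larger disks):

zpool set autoexpand=on tank
zpool offline tank da0
# physically swap the old 1TB drive for a new 4TB drive in the same bay, then:
zpool replace tank da0
zpool status tank    # wait for the resilver to finish before starting on the next disk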

What if you’d used mirror vdevs? Well, to start with, your original six drives would have given you 6TB raw / 3TB usable. So you did give up a terabyte there. But maybe you didn’t do such a big upgrade the first time you expanded. Maybe since you only needed to put in two more disks to get more storage, you only bought two 2TB drives, and by the time you were feeling cramped again the 4TB disks were available – and you still had four bays free. Eventually, though, you crammed the box full, and now you’re in that same position of wanting to upgrade those old tiny 1TB disks. You do it the same way – you replace, resilver, replace, resilver – but this time, you see the new space after only two resilvers. And each resilvering happens tremendously faster – it’s not unreasonable to expect nearly-full 1TB mirror vdevs to resilver in three or four hours. So you can probably upgrade an entire vdev in a single day, even without having to hover over the machine too crazily. The performance on the machine is hardly impacted during the resilver. And you see the new capacity after every two disks replaced, not every six.

TL;DR

Too many words, mister sysadmin. What’s all this boil down to?

  • don’t be greedy. 50% storage efficiency is plenty.
  • for a given number of disks, a pool of mirrors will significantly outperform a RAIDZ stripe.
  • a degraded pool of mirrors will severely outperform a degraded RAIDZ stripe.
  • a degraded pool of mirrors will rebuild tremendously faster than a degraded RAIDZ stripe.
  • a pool of mirrors is easier to manage, maintain, live with, and upgrade than a RAIDZ stripe.
  • BACK. UP. YOUR POOL. REGULARLY. TAKE THIS SERIOUSLY.

TL;DR to the TL;DR – unless you are really freaking sure you know what you’re doing… use mirrors. (And if you are really, really sure what you’re doing, you’ll probably change your mind after a few years and wish you’d done it this way to begin with.)

OpenVPN on BeagleBone Black

(photo: This is my new Beaglebone Black. Enormous, isn’t it?)

I needed an inexpensive embedded device for OpenVPN use, and my first thought (actually, my tech David’s first thought) was the obvious in this day and age: “Raspberry Pi.”

Unfortunately, the Pi didn’t really fit the bill.  Aside from the unfortunate fact that my particular Pi arrived with a broken ethernet port, some quick network-less testing of OpenSSL gave me very disappointing numbers – 5mbps or so running flat out, and that’s encryption alone, never mind any actual routing.  This squared with some reviews I found online, so I had to give up on the Pi as an embedded solution for OpenVPN use.

Luckily, that didn’t mean I was sunk yet – enter the Beaglebone Black.  Beaglebone doesn’t get as much press as the Pi does, but it’s an interesting device with an interesting history – it’s been around longer than the Pi, it’s fully open source where the Pi is not (hardware plans are published online, and other vendors are not only allowed but encouraged to build bit-for-bit identical devices!), and although it doesn’t have the video chops of the Pi (no 1080p resolution supported), it has a much better CPU – a 1GHz Cortex A8, vs the Pi’s 700MHz ARM11.  If all that isn’t enough, the Beaglebone also has built-in 2GB eMMC flash with a preloaded installation of Angstrom Linux, and – again unlike the Pi – directly supports being powered from plain old USB connected to a computer.  Pretty nifty.

The only real hitch I had with my Beaglebone was not realizing that if I had an SD card in, it would attempt to boot from the SD card, not from the onboard eMMC.  Once I disconnected my brand new Samsung MicroSD card and power cycled the Beaglebone, though, I was off to the races.  It boots into Angstrom pretty quickly, and thanks to the inclusion of the Avahi daemon in the default installation, you can discover the device (from linux at least – haven’t tested Windows) by just pinging beaglebone.local.  Once that resolves, ssh root@beaglebone.local with a default password, and you’re embedded-Linux-ing!

Angstrom doesn’t have any prebuilt packages for OpenVPN, so I downloaded the source from openvpn.net and did the usual ./configure ; make ; make install fandango.  I did have one minor hitch – the system clock wasn’t set, so ./configure bombed out complaining about files in the future.  Easily fixed – ntpdate us.pool.ntp.org updated my clock, and this time the package built without incident, needing somewhere south of 5 minutes to finish.  After that, it was time to test OpenVPN’s throughput – which, spoiler alert, was a total win!
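
For reference, the whole build boils down to something like this (the tarball version number is a placeholder – grab whatever the current source release from openvpn.net actually is):

ntpdate us.pool.ntp.org                            # set the clock first, or ./configure complains about files in the future
tar xzf openvpn-2.x.x.tar.gz && cd openvpn-2.x.x
./configure && make && make install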

root@beaglebone:~# openvpn --genkey --secret beagle.key ; scp beagle.key me@locutus:/tmp/
root@beaglebone:~# openvpn --secret beagle.key --port 666 --ifconfig 10.98.0.1 10.98.0.2 --dev tun
me@locutus:/tmp$ sudo openvpn --secret beagle.key --remote beaglebone.local --port 666 --ifconfig 10.98.0.2 10.98.0.1 --dev tun

Now I have a working tunnel between locutus and my beaglebone.  Opening a new terminal on each, I ran iperf to test throughput.  To run iperf (which was already available on Angstrom), you just run iperf -s on the server machine, and run iperf -c [ip address] on the client machine to connect to the server.  I tested connectivity both ways across my OpenVPN tunnel:

me@locutus:~$ iperf -c 10.98.0.1
------------------------------------------------------------
Client connecting to 10.98.0.1, TCP port 5001
TCP window size: 21.9 KByte (default)
------------------------------------------------------------
[ 3] local 10.98.0.2 port 55873 connected with 10.98.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.1 sec 46.2 MBytes 38.5 Mbits/sec
me@locutus:~$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 10.98.0.2 port 5001 connected with 10.98.0.1 port 32902
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 47.0 MBytes 39.2 Mbits/sec

38+ mbps from an inexpensive embedded device?  I’ll take it!

PSA: AMD Socket C32 CPU Coolers

I’ve started to build what I’ve dubbed “medium iron” virtualization servers: 12-24 cores, up to 128GB RAM, etc etc. For this particular task, AMD is the bang-for-the-buck king; they can’t touch Intel right now on per-core performance, but if what you want are tons and tons of cores and lots of memory addressing capability for cheap, they can’t be beat.

(figure: big iron is dead. long live "medium iron")

One problem: the new Socket C32 CPUs, such as the Opteron 4180/4184/418x, don’t come with a cooler… and there are only two coolers AT ALL on the market that specifically state they support Socket C32 – the Noctua NH-12DO, which is a fine, quiet, lotsa-air-flow cooler but very pricey at typically $80 or so (and rather hard to find), and some Dynatron that everybody and their brother warns sounds like a drill press even at idle (and isn’t cheap either). Making matters worse, the Noctua is an unusually large, super-tall design that just plain won’t work in a lot of cases.

If you dig further, you’ll find some hardware enthusiasts mentioning that the Socket C32 “should accept most Socket F coolers”, going further to mention that they’ll need to be a certain “pitch” – which usually isn’t specified on the cooler’s details page.

Well, as a public service announcement, here’s your specific inexpensive alternative for Socket C32 cooling: the Thermaltake A4022 (sometimes aka the “TR2-R1”). At under $20 retail, it’s somewhere south of 1/4 the price of the Noctua cooler; it’s also much easier to find, and it’s a standard low-profile design that fits in a lot more places.

With two of the A4022 coolers installed, my dual Opteron 4180 system idles at about 40C and runs about 65C when running 12 instances of K7Burn to maximize thermal output. Better yet, this was a deliberately low precision installation: I put a small dab of generic thermal grease on top of each CPU, and did not remove the thermal tape stripe from the bottom of each cooler (which generally will give you lower temps, if you properly grease the CPU/cooler interface). So, no “overclocker tricks” required – these coolers, in my testing, “just work” perfectly well.

Review: ASUS “Transformer” Tablet

(photo: ASUS Transformer, tablet only – 1.6 lbs)

ASUS has entered the Android tablet market with a compelling new contender – the Eee TF-101 “Transformer.”  Featuring an Nvidia Tegra dual-core CPU at 1.0GHz, the device feels “snappier” than most relatively high-end desktop PCs – nothing lags; when you open an app, it pops onto the screen smartly.  If you’re accustomed to browsing on smartphones, you’ll feel the performance difference immediately – even notoriously heavy pages like CNN or ESPN render as quickly as they would if you were using a high-end desktop computer.

I first got my hands on a TF101 that one of my clients had purchased, sans docking station.  After I’d played with it for a few minutes, I knew I wanted one, but that left the $150 question – what will it be like when it’s docked?  The answer is “there’s a lot of potential here” – but there are problems to be worked through before you can give it an unqualified “hey, awesome!”

CNET complains that the TF101 feels cheap, with poorly-rounded corners and a flimsy backplate.  After a week or so of ownership and something like 20 hours of active use, I do not agree on either issue.  I find the tablet nicely balanced, easy to grip, and solid feeling.  It weighs in at 1.6 pounds (tablet only) and 2.9 lbs (tablet and docking station), which puts it pretty much dead center in standard weight for both tablets and netbooks.  However, while the weight of the docked TF101 is an ounce heavier than the weight of my Dell Mini 10v, the TF101 feels much less cumbersome – probably because even though it’s slightly heavier, it’s much, much slimmer.

Docking the tablet feels easy and intuitive; line up the edges of the tablet with the edges of the docking station, and you’re in the right position for the sockets to mate. Pressing down first gently, then firmly produces good tactile feedback for whether it’s lined up properly, and whether it’s “clicked” all the way in. The hinge itself is very solid and doesn’t feel “loose” or sloppy at all – in fact, it’s stiff enough that most people would have trouble moving it at all without the tablet already inserted. Undocking the tablet is easy; there’s a release toggle that slides to the left (marked with an arrow POINTING to the left, which is a nice touch); the release toggle also has a solid, not-too-sloppy but not-too-stiff action.

The docking station offers more than just the keyboard.  There are also two USB ports (a convenience which is missing on the tablet itself), a full-size SD card slot, and an internal battery pack, roughly comparable to the battery in the tablet itself.  The extra battery life is a great feature; the tablet itself gets 9 hours or so of fully active use, and the docking station roughly doubles that.  In practical terms, most people will be able to go away for a long weekend with a fully-charged TF101 sans charger, use it for 4 hours a day without ever bothering to turn it off, and come home with a significant fraction of the battery left – especially if they’ve taken the time to set the “disconnect from wireless when screen is off” option in the Power settings. I used the docked tablet 2 to 4 hours a day for a full seven-day week, playing games, emailing, and browsing; at the end of the week I was at 15% charge remaining.

(photo: ASUS Transformer with docking station – 2.9 lbs)

Polaris Office, the office suite shipped with the TF101, was a pleasant surprise – a client asked if I could display PowerPoint presentations on the tablet, and the answer turned out to be “yes, I certainly can.” I’ve only tried a few of them, none of which had any particularly fancy animations; but 40MB slideshows load and display just fine. Paired with an HDMI projector, the TF101 should make a pretty solid little presentation device, particularly since it feels just as “fast” running slideshows as it does browsing and playing games from the Market.

Moving on to the docking station itself, I quite liked the way the touchpad is integrated into the Android OS – instead of an arrow cursor, you get a translucent “bubble” roughly the size of a fingertip press, which felt much more intuitive to me. With the arrow, I tend to try to be just as precise as I would with a mouse – which can be frustrating. The “fingertip bubble” made it easier for me to relax and just “get what you want inside the circle” without trying to be overly finicky. Sensitivity for both tracking and tapping was also very good; the touchpad feels slick and responsive to use.

The keyboard, unfortunately, is a mixed bag – it’s better-suited to large hands than many netbook keyboards, but you won’t ever mistake it for the full-size keyboard on your desk.  The dimensions are almost exactly the same as the keyboard on my Dell Mini 10v; but I find that it feels significantly more cramped and awkward – probably because ASUS elected to go the trendy new route of “raised keys with space between them”, where the Mini’s keys are literally edge-to-edge with one another.  This should make the TF101 less likely to collect crumbs, skin flakes, and other kinds of “yuck” than the Mini 10v, but I personally would rather deal with more cleaning than less roominess.

Several applications don’t really play well with the keyboard – ConnectBot, which I use as an SSH client to operate remote servers, becomes completely unusable due to handling the shift key wrong – you can’t type anything from ! through + without resorting to re-enabling the onscreen keyboard.  In the Android Browser, typing URLs in the address bar works fine, but if you do any significant amount of typing in a form – for example, writing this post in the TinyMCE control WordPress uses – the up and down arrow keys frequently map to the wrong thing.  Sometimes up/down arrow would scroll through the text I was typing, sometimes they would tab me to different controls on the page, and some OTHER times they would simply scroll the entire page up and down.

In Polaris Office (the office suite shipped with the TF101), the keyboard itself worked perfectly – but the touchpad was too sensitive and placed too closely. It was difficult to type more than one sentence at a time without the heel of my hand brushing the touchpad and registering as a “tap”, causing the last half of a sentence to appear in the midst of the sentence before it.

Can these problems be mitigated?  Probably.  The ConnectBot issue was solved pretty simply by Googling “connectbot transformer”, which immediately leads you to a Transformer-specific fork of ConnectBot – after uninstalling the original ConnectBot, temporarily enabling off-Market app installation, and downloading and installing the fork directly from GitHub, my shift-key problems there were solved.  Presumably either Google or ASUS will eventually deal with the arrow key behavior in the Android Browser.  I tried using the Dolphin HD Browser in the meantime, but had no better luck with it – it is at least consistent in how it handles arrow key usage, but unfortunately it’s consistently wrong – it always scrolls the entire page up and down when you press up or down arrow keys, no matter where the focus on the page is.   Finally, you can toggle the touchpad completely on or off by using a function button at the top of the keyboard – but it would be nice to simply change the sensitivity instead, or automatically disable it for half a second or so after keypresses, the way you can on a traditional (non-Android) netbook.

In the end, though, you can’t really fix all the problems by yourself with “tweaking”; some of the frustrations with the poor integration of the physical keyboard into the Android environment are going to keep ambushing you until Google itself addresses them. ASUS and individual app developers can and likely will continue working to mitigate these issues, but it will be a never-ending game of whack-a-mole until Android itself takes adapting to the “netbook” environment more seriously.

Final verdict: The tablet looks, feels, and performs incredibly well; in most cases it “feels faster” than even high-end desktop computers. Even though my Atom-powered Dell Mini 10v has a Crucial C300 SSD (Solid State Drive), the TF101 spanks it thoroughly in pretty much every performance category possible and sends it home crying. Battery life is also phenomenal, at 9-ish active hours undocked or 18-ish hours docked. It looks and feels, on first blush, like it would make a truly incredible netbook when docked – but Android 3.2 and its apps clearly haven’t come to the party well-prepared for a physical keyboard – and it shows, which knocks the initial blush well off the device as a netbook competitor. If you really need physical keyboard and conventional data entry, this is probably not going to be the device for you – at least, not until the rest of the OS and its apps evolve to support it better.

If you want a tablet, I can recommend the TF-101 without reservation.  If you want a netbook, though, you should probably give the TF-101 a pass unless and until Google starts taking the idea of “Android Netbooks” seriously.

More virtualization: multiple Win7 guests on a single Debian host

As a proof-of-concept for USC computer science labs, I set up eight Windows 7 VMs on the same physical host as in the Windows Server demonstration below, and recorded firing them up simultaneously and doing some light web browsing, etc. on several of them. Performance is pretty solid; you could probably cram double this many guests on that host and still have as good or better performance than the typical physical lab workstation.


update: replaced video with somewhat more watchable version, with all eight guests tiled on one screen.

Aside from good performance and a single box to maintain, this setup offers some fairly compelling advantages over the traditional computer lab: the host also has a 2TB conventional drive in it, which is where a “gold” image of the Win7 guests is maintained. It only takes about 10 minutes total to reset all of the guests to the “gold” standard; and it would be just as easy to keep multiple gold images on the conventional drive for different classes – Linux images for one class, Windows images with Office for a basic class, Windows images with Visual Studio for another, Solaris for yet another… you get the idea.

Also, the time to “reset” the guests could be substantially faster than that, even, with a little tweaking – using .qcow files instead of whole LVM volumes would allow you to use rsync with the --inplace argument and only have to write over the (relatively few) changed blocks, for example; or in a more advanced layout, a separate FreeBSD machine with a large RAIDZ array and iSCSI exports could be used to store the images. There’s still plenty of room for improvement and innovation, but even the simple proof-of-concept (which I put together in roughly half an hour) looks pretty compelling to me.
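
As a rough sketch of that .qcow approach (paths and filenames here are hypothetical):

rsync -a --inplace --no-whole-file /gold/win7-gold.qcow /virt/images/lab03.qcow
# --no-whole-file forces rsync’s delta algorithm even on a local copy,
# so only the blocks that differ from the gold image actually get rewritten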

Virtualizing Windows Server with KVM

I’ve been surprised and pleased at just how well Windows Server 2008 runs virtualized under Debian Squeeze. I first started running virtual Windows Servers purely for the disaster recovery and portability aspects, expecting to pay with a drop in performance… but what I found was that in a lot of cases, Windows 2008’s performance is actually somewhat better when running virtually. In particular, the ever-annoying reboot cycle gets cut to a tiny, tiny fraction of what it would be if running on “the bare metal.”

It’s also pretty nice never, ever having to play “hunt-the-driver” – the virtual “hardware” is all natively supported by Windows, so a virtual install “just works” the moment it’s done, no fuss no muss. But what about that performance?

Smokin’! Which exposes yet another reason to think about virtualization: being able to take advantage of Linux’s highly superior kernel RAID capabilities. The box shown above is running four Crucial C300 128GB solid state drives connected to SATA-3 6Gbps ports on an ASUS board; the Debian Squeeze host has them set up in a kernel RAID10. The resulting 250GB or so of storage is on a performance level that just has to be seen to be believed.
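
For anyone curious, building a kernel RAID10 across four drives is a one-liner with mdadm (device names below are examples, not this particular box’s):

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[bcde]
# /dev/md0 then gets partitioned and formatted like any other block device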

Note that while this IS a really “hot” machine, it’s still just one machine, running on commodity hardware – there’s no $50,000 SAN lurking in the background somewhere; that performance is ALL coming from a single machine with a price tag of WELL under $10K.

Ready to upgrade yet? =)

Solid State Drives

If you’ve never seen a machine equipped with a good Solid State Drive (SSD)… they’re pretty impressive.  In this clip, I’m putting an Ubuntu 9.10 workstation with an Intel SSD through its paces.

Some of the reason that machine is so fast is Ubuntu – the newest release has some pretty significant disk speed related enhancements – but the vast majority of it is the solid state drive.  (For those of you not familiar with Linux, it might help you to think of GIMP as “Photoshop” – both because it does pretty much the same job, and because both are notorious for being EXTREMELY slow to start up.)

You do have to be careful when you’re buying an SSD, though – they’re not all created equal.  In fact, some of them are absolutely atrocious, with significantly worse performance than conventional hard drives… so you need to know what you’re doing (or trust who you’re buying from) when you go that route.  In particular, anything with a JMicron controller in it is better taken out back and shot than put in a production machine.  You also need to be aware that you’re going to pay a lot more per megabyte for solid state – an 80GB SSD costs about as much as two 1.5 terabyte conventional hard drives.  So you probably don’t want SSDs (yet) for tasks involving large amounts of bulk storage.

But, as the video demonstrates… if what you need is performance, there’s nothing else in the same league; a few hundred bucks spent on a good SSD will give you more real-world performance benefit for most tasks than several thousand dollars spent otherwise.

It’s also worth noting that the current generation of SSDs are generally 2.5″ form factor, meaning they fit interchangeably in notebooks, netbooks, or desktop computers.  You typically won’t see as much of the top-end performance on a notebook or netbook – their SATA controllers usually bottleneck at a third of the top-end speed of the best SSDs – but they’re just as much (if not more) worth the upgrade, because conventional laptop HDDs perform much more poorly than full-size HDDs, so the speed boost is even more of a blessing.