When NOTHING else will remove a half-installed .deb package

WARNING: ALL WARRANTIES NULL AND VOID

With that important disclaimer out of the way… when you’re stuck in the world’s worst apt -f install loop and can’t figure out any other way to get the damn thing unwedged when there’s a half-installed package (eg if you’ve removed an /etc directory for a package you installed before, and this breaks an installer script—or the installer script “knows you already have it” and refuses to replace a removed config directory), this is the nuclear option:

sudo nano /var/lib/dpkg/status

Remove the offending package (and any packages that depend on it) entirely from this file, then apt install the offending package again.

If you’re still broken after that… did I mention all warranties null and void? This is an extremely nuclear option, and I really wouldn’t recommend it outside a throwaway test environment; you’re probably better off just nuking the whole server and reinstalling.

With that said, the next thing I had to do to clean out the remnants of the mysql/mariadb coinstall debacle that inspired this post was:

find / -iname “*mysql*” | grep -v php | grep -v snap | xargs rm -r

This got me out of a half-broken state on a machine that somebody had installed both mariadb-server and mysql-server on, leaving neither working properly.

Rebalancing data on ZFS mirrors

One of the questions that comes up time and time again about ZFS is “how can I migrate my data to a pool on a few of my disks, then add the rest of the disks afterward?”

If you just want to get the data moved and don’t care about balance, you can just copy the data over, then add the new disks and be done with it. But, it won’t be distributed evenly over the vdevs in your pool.

Don’t fret, though, it’s actually pretty easy to rebalance mirrors. In the following example, we’ll assume you’ve got four disks in a RAID array on an old machine, and two disks available to copy the data to in the short term.

Step one: create the new pool, copy data to it

First up, we create a simple temporary zpool with the two available disks.

zpool create temp -oashift=12 mirror /dev/disk/by-id/wwn-disk0 /dev/disk/by-id/disk1

Simple. Now you’ve got a ZFS mirror named temp, and you can start copying your data to it.

Step two: scrub the pool

Do not skip this step!

zpool scrub temp

Once this is done, do a zpool status temp to make sure you don’t have any errors. Assuming you don’t, you’re ready to proceed.

Step three: break the mirror, create a new pool

zpool detach temp /dev/disk/by-id/disk1

Now, your temp pool is down to one single disk vdev, and you’ve freed up one of its original disks. You’ve also got a known good copy of all your data on disk0, and you’ve verified it’s all good by using a zpool scrub command in step two. So, destroy the old machine’s storage, freeing up its four disks for use. 

zpool create tank /dev/disk/by-id/disk1 mirror /dev/disk/by-id/disk2 /dev/disk/by-id/disk3 mirror /dev/disk/by-id/disk4 /dev/disk/by-id/disk5

Now you’ve got your original temporary pool named temp, and a new permanent pool named tank. Pool “temp” is down to one single-disk vdev, and pool “tank” has one single-disk vdev, and two mirror vdevs.

Step four: copy your data from temp to tank

Copy all your data one more time, from the single-disk pool “temp” to the new pool “tank.” You can use zfs replication for this, or just plain old cp or rsync. Your choice.

Step five: scrub tank, destroy temp

Do not skip this step.

zpool scrub tank

Once this is done, do a zpool status tank to make sure you don’t have any errors. Assuming you don’t, now it’s time to destroy your temporary pool to free up its disk.

zpool destroy temp

Almost done!

Step six: attach the final disk from temp to the single-disk vdev in tank

zpool attach tank /dev/disk/by-id/disk0 /dev/disk/by-id/disk1

That’s it—you now have all of your data imported to a six-disk pool of mirrors, and all of the data is evenly distributed (according to disk size, at least) across all vdevs, not all clumped up on the first one to be added.

You can obviously adjust this formula for larger (or smaller!) pools, and it doesn’t necessarily require importing from an older machine—you can use this basic technique to redistribute data across an existing pool of mirrors, too, if you add a new mirror vdev. 

The important concept here is the idea of breaking mirror vdevs using zpool detach, and creating mirror vdevs from single-disk vdevs using zpool attach.

 

Continuously updated iostat

Finally, after I don’t know HOW many years, I figured out how to get continuously updated stats from iostat that don’t just scroll up the screen and piss you off.

For those of you who aren’t familiar, iostat gives you some really awesome per-disk reports that you can use to look for problems. Eg, on a system I’m moving a bunch of data around on at the moment:

root@dr0:~# iostat --human -xs
Linux 4.15.0-45-generic (dr0) 06/04/2019 x86_64 (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.1% 0.0% 2.6% 5.8% 0.0% 91.5%
Device tps kB/s rqm/s await aqu-sz areq-sz %util
loop0 0.00 0.0k 0.00 0.00 0.00 1.6k 0.0%
sda 214.70 17.6M 0.14 2.25 0.48 83.9k 29.8%
sdb 364.41 38.1M 0.63 9.61 3.50 107.0k 76.5%
sdc 236.49 22.2M 4.13 2.13 0.50 96.3k 20.4%
sdd 237.14 22.2M 4.09 2.14 0.51 95.9k 20.4%
md0 12.09 221.5k 0.00 0.00 0.00 18.3k 0.0%

In particular, note that %util column. That lets me see that /dev/sdb is the bottleneck on my current copy operation. (I expect this, since it’s a single disk reading small blocks and writing large blocks to a two-vdev pool, but if this were one big pool, it would be an indication of problems with sdb.)

But what if I want to see a continuously updated feed? Well, I can do iostat –human -xs 1 and get a new listing every second… but it just scrolls up the screen, too fast to read. Yuck.

OK, how about using the watch command instead? Well, normally, when you call iostat, the first output is a reading that averages the stats for all devices since the first boot. This one won’t change visibly very often unless the system was JUST booted, and almost certainly isn’t what you want. It also frustrates the heck out of any attempt to simply use watch.

The key here is the -y argument, which skips that first report which always gives you the summary of history since last boot, and gets straight to the continuous interval reports – and knowing that you need to specify an interval, and a count for iostat output. If you get all that right, you can finally use watch -n 1 to get a running output of iostat that doesn’t scroll up off the screen and drive you insane trying to follow it:

root@dr0:~# watch -n 1 iostat -xy --human 1 1

Have fun!

About ZFS recordsize

ZFS stores data in records, which are themselves composed of blocks. The block size is set by the ashift value at time of vdev creation, and is immutable. The recordsize, on the other hand, is individual to each dataset(although it can be inherited from parent datasets), and can be changed at any time you like. In 2019, recordsize defaults to 128K if not explicitly set.

big files? big recordsize.

The general rule of recordsize is that it should closely match the typical workload experienced within that dataset. For example, a dataset used to store high-quality JPGs, averaging 5MB or more, should have recordsize=1M. This matches the typical I/O seen in that dataset – either reading or writing a full 5+ MB JPG, with no random access within each file – quite well; setting that larger recordsize prevents the files from becoming unduly fragmented, ensuring the fewest IOPS are consumed during either read or write of the data within that dataset.

DB binaries? Smaller recordsize.

By contrast, a dataset which directly contains a MySQL InnoDB database should have recordsize=16K. That’s because InnoDB defaults to a 16KB page size, so most operations on an InnoDB database will be done in individual 16K chunks of data. Matching recordsize to MySQL’s page size here means we maximize the available IOPS, while minimizing latency on the highly sync()hronous reads and writes made by the database (since we don’t need to read or write extraneous data while handling our MySQL pages).

VMs? Match the recordsize to the VM storage format.

(That’s cluster_size, for QEMU/KVM .qcow2.)

On the other hand, if you’ve got a MySQL InnoDB database stored within a VM, your optimal recordsize won’t necessarily be either of the above – for example, KVM .qcow2 files default to a cluster_size of 64KB. If you’ve set up a VM on .qcow2 with default cluster_size, you don’t want to set recordsize any lower (or higher!) than the cluster_size of the .qcow2 file. So in this case, you’ll want recordsize=64K to match the .qcow2’s cluster_size=64K, even though the InnoDB database inside the VM is probably using smaller pages.

An advanced administrator might look at all of this, determine that a VM’s primary function in life is to run MySQL, that MySQL’s default page size is good, and therefore set both the .qcow2 cluster_size and the dataset’s recordsize to match, at 16K each.

A different administrator might look at all this, determine that the performance of MySQL in the VM with all the relevant settings left to their defaults was perfectly fine, and elect not to hand-tune all this crap at all. And that’s okay.

What if I set recordsize too high?

If recordsize is much higher than the size of the typical storage operation within the dataset, latency will be greatly increased and this is likely to be incredibly frustrating. IOPS will be very limited, databases will perform poorly, desktop UI will be glacial, etc.

What if I set recordsize too low?

If recordsize is a lot smaller than the size of the typical storage operation within the dataset, fragmentation will be greatly (and unnecessarily) increased, leading to unnecessary performance problems down the road. IOPS as measured by artificial tools will be super high, but performance profiles will be limited to those presented by random I/O at the record size you’ve set, which in turn can be significantly worse than the performance profile of larger block operations.

You’ll also screw up compression with an unnecessarily low recordsize; zfs inline compression dictionaries are per-record, and work by fitting more than one entire block into a single record’s space. If you set compression=lz4ashift=12, and recordsize=4K you’ll effectively have NO compression, because your blocksize is equal to your recordsize – pretty much nothing but all-zero blocks can be compressed. Meanwhile, the same dataset with the default 128K recordsize might easily have a 1.7:1 compression ratio.

Are the defaults good? Do I aim high, or do I aim low?

128K is a pretty reasonable “ah, what the heck, it works well enough” setting in general. It penalizes you significantly on IOPS and latency for small random I/O operations, and it presents more fragmentation than necessary for large contiguous files, but it’s not horrible at either task. There is a lot to be gained from tuning recordsize more appropriately for task, though.

What about bittorrent?

The “big records for big files” rule of thumb still applies for datasets used as bittorrent targets.

This is one of those cases where things work just the opposite of how you might think – torrents write data in relatively small chunks, and access them randomly for both read and write, so you might reasonably think this calls for a small recordsize. However, the actual data in the torrents is typically huge files, which are accessed in their entirety for everything but the initial bittorrent session.

Since the typical access pattern is “large-file”, most people will be better off using recordsize=1M in the torrent target storage. This keeps the downloaded data unfragmented despite the bittorrent client’s insanely random writing patterns. The data acquired during the bittorrent session in chunks is accumulated in the ZIL until a full record is available to write, since the torrent client itself is not synchronous – it writes all the time, but rarely if ever calls sync().

As a proof-of-concept, I used the Transmission client on an Ubuntu 16.04 LTS workstation to download the Ubuntu 18.04.2 Server LTS ISO, with a dataset using recordsize=1M as the target. This workstation has a pool consisting of two mirror vdevs on rust, so high levels of fragmentation would be very easy to spot.

root@locutus:/# zpool export data ; modprobe -r zfs ; modprobe zfs ; zpool import data

root@locutus:/# pv < /data/torrent/ubu*18*iso > /dev/null
 883MB 0:00:03 [ 233MB/s] [==================================>] 100%

Exporting the pool and unloading the ZFS kernel module entirely is a weapons-grade-certain method of emptying the ARC entirely; getting better than 200 MB/sec average read throughput directly from the rust vdevs afterward (the transfer actually peaked at nearly 400 MB/sec!) confirms that our torrented ISO is not fragmented.

Note that preallocation settings in your bittorrent client are meaningless when the client is saving to ZFS – you can’t actually preallocate in any meaningful way on ZFS, because it’s a copy-on-write filesystem.

A comprehensive guide to fixing slow SSH logins

The debug text that brought you here

Most of you are probably getting here just from frustratedly googling “slow ssh login”.  Those of you who got a little froggier and tried doing an ssh -vv to get lots of debug output saw things hanging at debug1: SSH2_MSG_SERVICE_ACCEPT received, likely for long enough that you assumed the entire process was hung and ctrl-C’d out.  If you’re patient enough, the process will generally eventually continue after the debug1: SSH2_MSG_SERVICE_ACCEPT received line, but it may take 30 seconds.  Or even five minutes.

You might also have enabled debug logging on the server, and discovered that your hang occurs immediately after debug1: KEX done [preauth]
and before debug1: userauth-request for user in /var/log/auth.log.

I feel your frustration, dear reader. I have solved this problem after hours of screeching head-desking probably ten times over the years.  There are a few fixes for this, with the most common – DNS – tending to drown out the rest.  Which is why I keep screeching in frustration every few years; I remember the dreaded debug1: SSH2_MSG_SERVICE_ACCEPT received hang is something I’ve solved before, but I can only remember some of the fixes I’ve needed.

Anyway, here are all the fixes I’ve needed to deploy over the years, collected in one convenient place where I can find them again.

It’s usually DNS.

The most common cause of slow SSH login authentications is DNS. To fix this one, go to the SSH server, edit /etc/ssh/sshd_config, and set UseDNS no.  You’ll need to restart the service after changing sshd_config: /etc/init.d/ssh restart, systemctl restart ssh, etc as appropriate.

If it’s not DNS, it’s Avahi.

The next most common cause – which is devilishly difficult to find reference to online, and I hope this helps – is the never-to-be-sufficiently damned avahi daemon.  To fix this one, go to the SSH client, edit /etc/nsswitch.conf, and change this line:

hosts:          files mdns4_minimal [NOTFOUND=return] dns

to:

hosts:          files dns

In theory maybe something might stop working without that mdns4_minimal option?  But I haven’t got the foggiest notion what that might be, because nothing ever seems broken for me after disabling it.  No services need restarting after making this change, which again, must be made on the client.

You might think this isn’t your problem. Maybe your slow logins only happen when SSHing to one particular server, even one particular server on your local network, even one particular server on your local network which has UseDNS no and which you don’t need any DNS resolution to connect to in the first place.  But yeah, it can still be this avahi crap. Yay.

When it’s not Avahi… it’s PAM.

This is another one that’s really, really difficult to find references to online.  Optional PAM modules can really screw you here.  In my experience, you can’t get away with simply disabling PAM login in /etc/ssh/sshd_config – if you do, you won’t be able to log in at all.

What you need to do is go to the SSH server, edit /etc/pam.d/common-session and comment out the optional module that’s causing you grief.  In the past, that was pam_ck_connector.so.  More recently, in Ubuntu 16.04, the culprit that bit me hard was pam_systemd.so. Get in there and comment that bugger out.  No restarts required.

#session optional pam_systemd.so

GSSAPI, and ChallengeResponse.

I’ve seen a few seconds added to a pokey login from GSSAPIAuthentication, whatever that is. I feel slightly embarrassed about not knowing, but seriously, I have no clue.  Ditto for ChallengeResponseAuthentication.  All I can tell you is that neither cover standard interactive passwords, or standard public/private keypair authentication (the keys you keep in ~/ssh/authorized_keys).

If you aren’t using them either, then disable them.  If you’re not using Active Directory authentication, might as well go ahead and nuke Kerberos while you’re at it.  Make these changes on the server in /etc/ssh/sshd_config, and restart the service.

ChallengeResponseAuthentication no
KerberosAuthentication no
GSSAPIAuthentication no

Host-based Authentication.

If you’re actually using this, don’t disable it. But let’s get real with each other: you’re not using it.  I mean, I’m sure somebody out there is.  But it’s almost certainly not you.  Get rid of it.  This is also on the server in /etc/ssh/sshd_config, and also will require a service restart.

# Don't read the user's ~/.rhosts and ~/.shosts files
IgnoreRhosts yes
# For this to work you will also need host keys in /etc/ssh_known_hosts
RhostsRSAAuthentication no
# similar for protocol version 2
HostbasedAuthentication no

Need frequent connections? Consider control sockets.

If you need to do repetitive ssh’ing into another host, you can speed up the repeated ssh commands enormously with the use of control sockets.  The first time you SSH into the host, you establish a socket.  After that, the socket obviates the need for re-authentication.

ssh -M -S /path/to/socket -o ControlPersist=5m remotehost exit

This creates a new SSH socket at /path/to/socket which will survive for the next 5 minutes, after which it’ll automatically expire.  The exit at the end just causes the ssh connection you establish to immediately terminate (otherwise you’d have a perfectly normal ssh session going to remotehost).

After creating the control socket, you utilize it with -S /path/to/socket in any normal ssh command to remotehost.

The improvement in execution time for commands using that socket is delicious.

me@banshee:/~$ ssh -M -S /tmp/demo -o ControlPersist=5m eohippus exit

me@banshee:/~$ time ssh eohippus exit
real 0m0.660s
user 0m0.052s
sys 0m0.016s

me@banshee:/~$ time ssh -S /tmp/demo eohippus exit
real 0m0.050s
user 0m0.005s
sys 0m0.010s

Yep… from 660ms to 50ms.  SSH control sockets are pretty awesome.

Heartbleed SSL vulnerability

Last night (2014 Apr 7) a massive security vulnerability was publicly disclosed in OpenSSL, the library that encrypts most of the world’s sensitive traffic. The bug in question is approximately two years old – systems older than 2012 are not vulnerable – and affects the TLS “heartbeat” function, which is why the vulnerability has been nicknamed HeartBleed.

The bug allows a malicious remote user to scan arbitrary 64K chunks of the affected server’s memory. This can disclose any and ALL information in that affected server’s memory, including SSL private keys, usernames and passwords of ANY running service accepting logins, and more. Nobody knows if the vulnerability was known or exploited in the wild prior to its public disclosure last night.

If you are an end user:

You will need to change any passwords you use online unless you are absolutely sure that the servers you used them on were not vulnerable. If you are not a HIGHLY experienced admin or developer, you absolutely should NOT assume that sites and servers you use were not vulnerable. They almost certainly were. If you are a highly experienced ops or dev person… you still absolutely should not assume that, but hey, it’s your rope, do what you want with it.

Note that most sites and servers are not yet patched, meaning that changing your password right now will only expose that password as well. If you have not received any notification directly from the site or server in question, you may try a scanner like the one at http://filippo.io/Heartbleed/ to see if your site/server has been patched. Note that this script is not bulletproof, and in fact it’s less than 24 hours old as of the time of this writing, up on a free site, and under massive load.

The most important thing for end users to understand: You must not, must not, MUST NOT reuse passwords between sites. If you have been using one or two passwords for every site and service you access – your email, forums you post on, Facebook, Twitter, chat, YouTube, whatever – you are now compromised everywhere and will continue to be compromised everywhere until ALL sites are patched. Further, this will by no means be the last time a site is compromised. Criminals can and absolutely DO test compromised credentials from one site on other sites and reuse them elsewhere when they work! You absolutely MUST use different passwords – and I don’t just mean tacking a “2” on the end instead of a “1”, or similar cheats – on different sites if you care at all about your online presence, the money and accounts attached to your online presence, etc.

If you are a sysadmin, ops person, dev, etc:

Any systems, sites, services, or code that you are responsible for needs to be checked for links against OpenSSL versions 1.0.1 through 1.0.1f. Note, that’s the OpenSSL vendor versioning system – your individual distribution, if you are using repo versions like a sane person, may have different numbering schemes. (For example, Ubuntu is vulnerable from 1.0.1-0 through 1.0.1-4ubuntu5.11.)

Examples of affected services: HTTPS, IMAPS, POP3S, SMTPS, OpenVPN. Fabulously enough, for once OpenSSH is not affected, even in versions linking to the affected OpenSSL library, since OpenSSH did not use the Heartbeat function. If you are a developer and are concerned about code that you wrote, the key here is whether your code exposed access to the Heartbeat function of OpenSSL. If it was possible for an attacker to access the TLS heartbeat functionality, your code was vulnerable. If it was absolutely not possible to check an SSL heartbeat through your application, then your application was not vulnerable even if it linked to the vulnerable OpenSSL library.

In contrast, please realize that just because your service passed an automated scanner like the one linked above doesn’t mean it was safe. Most of those scanners do not test services that use STARTTLS instead of being TLS-encrypted from the get-go, but services using STARTTLS are absolutely still affected. Similarly, none of the scanners I’ve seen will test UDP services – but UDP services are affected. In short, if you as a developer don’t absolutely know that you weren’t exposing access to the TLS heartbeat function, then you should assume that your OpenSSL-using application or service was/is exploitable until your libraries are brought up to date.

You need to update all copies of the OpenSSL library to 1.0.1g or later (or your distribution’s equivalent), both dynamically AND statically linked (PS: stop using static links, for exactly things like this!), and restart any affected services. You should also, unfortunately, consider any and all credentials, passwords, certificates, keys, etc. that were used on any vulnerable servers, whether directly related to SSL or not, as compromised and regenerate them. The Heartbleed bug allowed scanning ALL memory on any affected server and thus could be used by a sufficiently skilled user to extract ANY sensitive data held in server RAM. As a trivial example, as of today (2014-Apr-08) users at the Ars Technica forums are logging on as other users using password credentials held in server RAM, as exposed by standard exploit test scripts publicly disclosed.

Completely eradicating all potential vulnerability is a STAGGERING amount of work and will involve a lot of user disruption. When estimating your paranoia level, please do remember that the bug itself has been in the wild since 2012 – the public disclosure was not until 2014-Apr-07, but we have no way of knowing how long private, possibly criminal entities have been aware of and/or exploiting the bug in the wild.

Enabling core dumps on Apache2.2 on Debian

It was quite an adventure today, figuring out how to get a segfaulting Apache to give me core dumps to analyze on Debian Squeeze. What SHOULD have been easy… wasn’t. Here’s what all you must do:

First of all, you’ll need to set ulimit -c unlimited in your /etc/init.d/apache2 script’s start section.

case $1 in
start)
log_daemon_msg "Starting web server" "apache2"
# set ulimit for debugging
# ulimit -c unlimited

Now make a directory for core dumps – mkdir /tmp/apache2-dumps ; chmod 777 /tmp/apache2-dumps – then you’ll need to apt-get install apache2-dbg libapr1-dbg libaprutil1-dbg

And, the current (Debian Squeeze, in May 2012) version of Debian does not have PIE support in the default gdb, so you’ll need to install gdb from backports. So, add deb http://backports.debian.org/debian-backports squeeze-backports main to /etc/apt/sources.list, then apt-get update && apt-get install -t squeeze-backports gdb

Now add “CoreDumpDirectory /tmp/apache2-dumps” to /etc/apache2/apache2.conf (or its own file in conf.d, whatever), then /etc/init.d/apache2 stop ; /etc/init.d/apache2 start

And once you start getting segfaults, you’ll get a core in /tmp/apache2-dumps/core.

Finally, now that you have your core, you can gdb apache2 /tmp/apache2-dumps/core, bt full, and debug to your heart’s content. WHEW.

Linux Sysadmin 101

I just finished the LibreOffice Impress presentation I’ll be using when I give my first Linux Sysadmin 101 talk at IT-ology this weekend.  The hardest part is always making graphics!

A diagram of the UNIX filesystem structure of a simple webserver
A diagram of the UNIX filesystem structure of a simple webserver

It’s licensed Creative Commons non-commercial share-alike; if you’d like a copy, you can grab one here:  https://jrs-s.net/linux_sysadmin_101/linux_sysadmin_101.odp (ODP, 2.4MB)

More virtualization: multiple Win7 guests on a single Debian host

WordPress › Error

There has been a critical error on this website.

Learn more about troubleshooting WordPress.