Death of a Server

Murphy’s law is typically (mis)phrased as “if anything can go wrong, it will”. My new extension to this law (Wes’ law?) will now read: “If anything can go wrong, it will, at the most inconvenient time”, because Murphy didn’t take into account the 4th dimension: time.

Roughly two weeks ago I was wandering around the streets of Prague, CZ when I noticed that I could no longer log into my server back in the U.S. After checking everything leading up to the system, my wife reported “the power button is still doing nothing and no lights are coming on”. At this point I suspected the power supply. Unfortunately, I still had another week of work travel to complete before I could get back to fix it. (And of course, during part of that trip, I was planning on using the server remotely for a work-related demonstration involving DNSSEC.)

Hence my new extension: “… at the most inconvenient time”.

Returning to the U.S.

Upon returning to the physical system I confirmed my guess that the power supply had died. (Note: in front of the system are multiple surge protectors and a decent UPS, so it was definitely the supply itself breaking, not a surge coming through the power lines.) I quickly removed the old supply and replaced it with a nice, shiny, dust-free new one. Click the switch, and still no go. Power went to the motherboard but it refused to do anything.

Back to the store for a new motherboard. And a CPU. And memory. My original estimate of a $75 replacement power supply was beginning to look very, very off. After replacing the motherboard, taking out all the original cards and leaving only the original hard drives in place (OK, the physical case was still the same), I tried booting up again. At least the BIOS bootstrapping began, but the system still failed to boot, and the screen showing “no hard drives detected” had to be a bad sign.

That left the 3 hard drives, all apparently in some state of “bad”. So, booting from a Fedora rescue disk, I attempted to examine how each drive was functioning. One at a time. None of them would even spin up. All 3 exhibited complete failure conditions. Two of the three were identically configured drives (from different manufacturers) in a RAID1 array to ensure that if either drive died, the data would still remain intact. Redundancy is great until everything fails at once. Murphy doesn’t believe redundancy will help. The third drive contained (daily) backups of the system from the other two, but it had catastrophically failed too. That meant that there was no chance of a complete recovery unless I could get at least one of the drives working.

Salvage Operations

That can’t be good

In a last-ditch effort, I ordered brand-new, exact copies of the dead drives (which themselves were only 5 months old, so finding duplicates was easy). If I were lucky, only the controllers on the drives would be dead and the physical drives themselves would still function. When the new duplicate drives arrived, I swapped the good controller from a new drive onto a bad drive and hoped. Unfortunately, the first old drive with the new controller still failed (though at least it sounded like it was trying to spin up this time). I crossed my fingers and moved on to the second bad drive. Unfortunately, even that was a no-go. I even tried various other tricks, being at the true “last resort” stage. It’s amazing the things that people suggest might fix a dead hard drive, from knocking it on a table (I didn’t try that) to pretending to throw it like a frisbee to putting it in the freezer for 30 minutes.

Eventually, I had to admit defeat and start from my oldest, external backups. Sigh. They were from 4 months ago. Double sigh.

It’s better than nothing at least, but… I lost mail. I lost some pictures. And I lost some reputation points from having run a very solid, rarely down, server for various mailing lists and other services for the last 15+ years.

Looking Forward

So what did I learn from this? The first thing: one set of backups is never enough. And most importantly, at least one set should be electrically isolated from the machine. This means that the very common technique of storing backups on an external USB drive probably isn’t wise either since it’s just as likely that the USB system would spike a few volts to the external drive too.

So what are my future plans? I’ve replaced the system, got it back up and running on the old data, and restarted the backup system using the exact same nightly routine. But now I’m going to add an external USB drive to a completely different machine and (r)sync the backups to it on a daily-ish basis. That, combined with a backup MX server that keeps mail copies of critical domains for 30 days and off-site backups of truly critical data, should suffice, right?

Shush Murphy. Yes, I can hear you whispering behind me, but I’m not on speaking terms with you right now.
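
For the curious, the daily-ish sync to a second machine could look something like the minimal sketch below. It is an illustration under assumptions, not the actual setup: the host name, the paths, and both function names are hypothetical, and it assumes rsync and key-based ssh access are already available on both machines.

```python
import subprocess


def build_rsync_command(source_dir, remote_host, remote_dir):
    """Return the rsync argument list that copies source_dir to remote_host:remote_dir."""
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    destination = f"{remote_host}:{remote_dir}"
    # --archive preserves permissions, timestamps and symlinks;
    # --compress compresses data in transit; -e ssh selects the transport.
    # --delete is left out on purpose, so an emptied source directory
    # cannot empty the copy on the next run.
    return ["rsync", "--archive", "--compress", "-e", "ssh", source, destination]


def sync_backups(source_dir, remote_host, remote_dir):
    """Run the sync and raise CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, remote_host, remote_dir), check=True)


# Hypothetical example values; a cron job on the server could call this once a day:
# sync_backups("/var/backups", "backuphost.example", "/mnt/usb/backups")
```

Pushing from the server is only one option. Having the second machine pull the backups instead would keep the server from holding credentials for the machine that stores its safety copy.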
