Death of a Server

Murphy’s law is typically (mis)phrased as “if anything can go wrong, it will”. My new extension to this law (Wes’ law?) will now read: “If anything can go wrong, it will, at the most inconvenient time”, because Murphy didn’t take into account the 4th dimension: time.

Roughly two weeks ago I was wandering around the streets of Prague, CZ when I noticed that I could no longer log into my server back in the U.S. After checking everything leading up to the system, my wife reported “the power button is still doing nothing and no lights are coming on”. I suspected, at this point, it was the power supply. But unfortunately I still had another week of work travel to complete before I could get back to fix it. (And of course, during part of my trip, I was planning on using it remotely for a work-related demonstration involving DNSSEC.)

Hence my new extension: “… at the most inconvenient time”.

Returning to the U.S.

Upon returning to the physical system I did confirm my guess that it was the power supply that died. (Note: in front of the system are multiple surge protectors and a decent UPS, so it was definitely the supply itself breaking, not a surge coming through the power lines.) I quickly removed the old supply and replaced it with a nice, shiny, dust-free new one. Click the switch, and still no go. Power went to the motherboard but it refused to do anything.

Back to the store for a new motherboard. And a CPU. And memory. My original estimate of a $75 replacement power supply was beginning to look very, very off. After replacing the motherboard, taking out all the original cards and leaving only the original hard drives in place (ok, the physical case was still the same) I tried booting up again. At least the BIOS bootstrapping began, but the system still failed to boot and the screen showing “no hard drives detected” had to be a bad sign.

That left the 3 hard drives as being still in some state of “bad”. So, booting from a Fedora rescue disk, I attempted to examine how each drive was functioning. One at a time. None of them would even spin up. All 3 exhibited complete failure conditions. Two of the three were identically configured drives (from different manufacturers) in a RAID1 array to ensure that if either drive died, the data would still remain intact. Redundancy is great until everything fails at once. Murphy doesn’t believe redundancy will help. The third drive contained (daily) backups of the system from the other two, but it had catastrophically failed too. That meant that there was no chance of a complete recovery unless I could get at least one of the drives working.

Salvage Operations


That can’t be good

In a last-ditch effort, I ordered brand-new, exact copies of the dead drives (which themselves were only 5 months old so finding duplicates was easy). If I was lucky, only the controllers on the drives would be dead and the physical drives themselves would still function. When the new duplicate drives arrived, I swapped the good controller on a new drive onto a bad drive and hoped. Unfortunately, the first old drive with the new controller still failed (though at least it sounded like it was trying to spin up this time). I crossed my fingers and moved on to the second bad drive. Unfortunately, even that was a no-go. I even tried various other tricks, being at the true “last resort” stage. It’s amazing the things that people suggest that might fix a dead hard drive, from knocking it on a table (I didn’t try that) to pretending to throw it like a frisbee to putting it in the freezer for 30 minutes.

Eventually, I had to admit defeat and start from my oldest, external backups. Sigh. They were from 4 months ago. Double sigh.

It’s better than nothing at least, but… I lost mail. I lost some pictures. And I lost some reputation points from having run a very solid, rarely down, server for various mailing lists and other services for the last 15+ years.

Looking Forward

So what did I learn from this? The first thing: one set of backups is never enough. And most importantly, at least one set should be electrically isolated from the machine. This means that the very common technique of storing backups on an external USB drive probably isn’t wise either since it’s just as likely that the USB system would spike a few volts to the external drive too.

So what are my future plans? I’ve replaced the system, got it back up and running on the old data, and restarted the backup system using the exact same nightly routine. But now I’m going to add an external USB drive to a completely different machine and (r)sync the backups to it on a daily-ish basis. That, combined with a backup MX server that keeps mail copies of critical domains for 30 days and off-site backups of truly critical data, should suffice, right?

Shush Murphy. Yes, I can hear you whispering behind me, but I’m not on speaking terms with you right now.
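
For the curious, here is a minimal sketch of what that daily-ish sync to a second machine could look like. It is only an illustration, not my actual script: the host name `backuphost` and both paths are made-up placeholders, and it assumes rsync is installed on both machines and that SSH access between them is already set up. The `--archive`, `--compress` and `--dry-run` options are standard rsync flags.

```python
import subprocess


def build_rsync_command(source_dir, dest_host, dest_dir, dry_run=False):
    """Build the rsync argument list that copies source_dir to dest_host:dest_dir."""
    command = ["rsync", "--archive", "--compress"]
    if dry_run:
        # Show what would be transferred without copying anything.
        command.append("--dry-run")
    # A trailing slash on the source copies the directory's contents
    # rather than nesting the directory itself inside the destination.
    command.append(source_dir.rstrip("/") + "/")
    command.append(f"{dest_host}:{dest_dir.rstrip('/')}/")
    return command


def sync_backups(source_dir, dest_host, dest_dir, dry_run=False):
    """Run rsync and raise CalledProcessError if it exits with a failure."""
    command = build_rsync_command(source_dir, dest_host, dest_dir, dry_run)
    return subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical names: adjust the paths and host to the real setup.
    print(" ".join(build_rsync_command(
        "/var/backups", "backuphost", "/mnt/usb/backups", dry_run=True)))
```

Run from cron (or similar) on a daily schedule, something like this keeps a second copy on a drive that is not plugged into the server. Note that it deliberately leaves out `--delete`, so a file that vanishes from the server is not also removed from the copy.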

2 Comments

  1. Bill Broadley Said,

    April 14, 2011 @ 2:44 am

    Youch, yeah happens. I’d replace “electrically isolated” with offsite, especially for photos. Ship a dsl/cable enabled friend/family a drive, or backup to S3 or something. Use rsync, one of the rsync-based tools, or duplicity (if to S3).

    Sadly UPSs provide minimal if any protection against spikes unless they are the more expensive, heavier, noisier, and hotter fully online variety that continuously puts the power from AC -> DC -> AC again. Of course power strips are even more limited in their protection. To test, just unplug your UPS: if you hear a loud click then it’s not online, and you can hear the latency between power flicker and UPS switch over.

    Hmmm, there’s a big loss going from ac -> dc -> ac -> dc (wall -> ups -> power cable -> power supply). I wonder if laptops run online… if so a usb drive connected to (and powered by) a laptop might be significantly more reliable than a usb drive connected to a desktop.

    BTW, typically only the 2.5" drives can be powered by USB.

    Not sure a surge that destroys a power supply, motherboard, and down the molex connect to 3 drives is going to stop just because it’s going over USB.

    Were any other computers online, in the house, and fine?

  2. Wes Said,

    April 27, 2011 @ 9:09 pm

    Bill,

    I actually have off-sites for most critical things. Like photos. But there is a delay while “I process, copy, etc”, photos I’ve just taken. I think I may change this process to include “backup right away” now…

    And you’re right, UPSes aren’t fantastic. But they’re still “better than nothing”. I’m still fairly confident that it was the PS itself, but you’re right that there was always the chance it wasn’t.

    Personally, I think we should do 12V around the house and have a converter at the edge. The DC lines in a house should be short enough to limit the voltage drop to a reasonable amount.