East of the Sun, West of the Moon



Filed under: Software,Technology — Erwin @ 5:15 am
A month ago:
Prepare server C, ship it off to work (which is two timezones away from where I live).
Two weeks ago:
Moved game M from server K to server C. This gave M a new home, and allowed me to rearrange the RAID5 setups and ext3 file-systems on top of them.
One week ago:
Moved several games from server T to server K, after the rearranging had been finished. T is one of our oldest machines, so that was a nice improvement. At the end of the week it becomes clear to me that something isn’t quite right in the current setup of K, but I can’t immediately put my finger on it.
This week:
I start to suspect that I’ve forgotten to use the -R stride=<number> parameter when creating the ext3 filesystems. To verify this I download the e2fsprogs 1.39 sources and slightly edit the resize2fs program, which has a built-in heuristic to determine what stride was originally used (unfortunately the value isn’t stored anywhere for easy checking afterward), and use that to confirm my suspicion. Damn. That would explain the less than stellar harddisk/filesystem performance, then. It turns out I also overlooked that for server C.
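For reference, the stride value is the RAID chunk size divided by the filesystem block size. Here is a minimal sketch of that arithmetic in Python. The 64 KiB chunk size, the 4 KiB block size and the /dev/md1 device name are assumptions for illustration, not the values used on these servers, and the helper names are made up.

```python
def ext3_stride(chunk_size_kib, block_size_kib=4):
    """Return the stride: RAID chunk size divided by filesystem block size."""
    if chunk_size_kib <= 0 or block_size_kib <= 0:
        raise ValueError("sizes must be positive")
    if chunk_size_kib % block_size_kib != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    return chunk_size_kib // block_size_kib


def mke2fs_command(device, chunk_size_kib, block_size_kib=4):
    """Build (but do not run) an mke2fs command line for an ext3 filesystem."""
    stride = ext3_stride(chunk_size_kib, block_size_kib)
    return [
        "mke2fs",
        "-j",                                # add a journal, i.e. ext3
        "-b", str(block_size_kib * 1024),    # block size in bytes
        "-R", "stride=%d" % stride,          # the option that was forgotten
        device,
    ]


if __name__ == "__main__":
    # Assumed example values: 64 KiB chunks, 4 KiB blocks, hypothetical device.
    print(" ".join(mke2fs_command("/dev/md1", 64)))
```

The sketch only prints the command, so it is safe to run anywhere. Check the chunk size of the actual array (for example in /proc/mdstat) before trusting the number.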
After a few days of juggling data around in preparation, I begin an announced maintenance downtime. (The RAID5 setups have 3 harddisks in the active set and 1 spare, so I can take two out, create the new RAID5 from those, copy data over, and later dismantle the original RAID5 and add its components to the new one.) I start with server C, which has only one game to deal with, but has 3 filesystems that need to be juggled around. Everything goes fine for the first two (non-root) filesystems. I reboot to make sure that the RAID autodetect code picks them up correctly, and that works fine. Then I proceed to make adjustments for the 3rd (root) filesystem. This shouldn’t be a problem because LILO, the boot-loader, should be using the /boot filesystem, which is on a RAID1 setup. I make adjustments to /etc/lilo.conf and /etc/fstab and rerun the lilo program to make sure the master boot records are updated on all the harddisks involved.
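The disk juggling described above can be sketched as a small model in Python. This is only an illustration of the bookkeeping, not the commands that were run. It assumes that "taking two out" means one active disk plus the spare, and the disk names are hypothetical. A 3-disk RAID5 keeps working with one member missing, which is what makes the shuffle possible, but neither array has any redundancy while the copy is in progress.

```python
RAID5_SLOTS = 3  # three active members per array, as in the post


def is_usable(members):
    """A RAID5 array tolerates exactly one missing member."""
    return len(members) >= RAID5_SLOTS - 1


def plan_migration(old_active, spare):
    """Return (steps, new_members, new_spare) for the disk shuffle."""
    old = list(old_active)
    steps = []

    # Step 1: pull one active disk plus the spare out of the old array and
    # build the new array from them, with its third slot missing.
    new = [old.pop(), spare]
    steps.append(("create new array (degraded)", tuple(old), tuple(new)))

    # Step 2: copy the data. Both arrays must still be usable here.
    if not (is_usable(old) and is_usable(new)):
        raise RuntimeError("an array lost too many members to copy data")
    steps.append(("copy data", tuple(old), tuple(new)))

    # Step 3: dismantle the old array. One disk fills the missing slot of
    # the new array, the other becomes its spare.
    new.append(old.pop())
    new_spare = old.pop()
    steps.append(("dismantle old array", tuple(old), tuple(new)))
    return steps, new, new_spare


if __name__ == "__main__":
    steps, members, spare = plan_migration(["sda", "sdb", "sdc"], "sdd")
    for name, old, new in steps:
        print(name, "| old:", old, "| new:", new)
    print("new array:", members, "spare:", spare)
```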

I reboot… and the machine doesn’t come back up. Damnit!

The short version of what follows is that I wake up my colleague in Berkeley, California, who goes over to the colo facility to analyze the problem and helps me by swapping two of the harddisks into another machine, so I can extract the data I need to at least restart the M game on another machine. (Yes, we have backups, but those would be 8 hours older than the most current snapshot, so if we can restore that, that’s always better.) After 6 hours the maintenance on machine K has been finished as intended (it didn’t involve a root filesystem, thankfully) and I can bring back all games, including M, on it. We’ll deal with C later.

In the excitement I never ate anything for breakfast or lunch during those six hours, just had my usual morning coffee, so at the end of it I wasn’t feeling too cheerful and had a bit of a headache. Ouch.

I hope your day went a little better than that. 🙂
