Server upgrade (by alaric)
I host a heap of web sites (including this blog), email domains, source control repositories, mailing lists, and various other things (such as one of the official Chicken Scheme egg mirrors, a Jabber server, and an IRC server with bots). I do this with a combination of dedicated server hardware which I hire space, power, and connectivity for in London for the primary stuff, and a virtual private server in California for backup services and rapid DNS lookups from the USA.
This is a costly hobby, but it gives us a platform upon which to do interesting things, and lets me help other people out with free hosting; as I need to put in the time and money to run the infrastructure anyway, the spare capacity on it is essentially free.
The most demanding part is server upgrades. Periodically, I buy a new physical server, install it with all the software it will need, put it alongside the current hardware in the data centre, and transfer the data and settings across and configure everything that needs configuring on the new server until it works just like the old, then switch them over. I do this when the current hardware is getting full or overloaded or unreliable or just plain out of date, as I don't trust in-place updates of the core system software - it's too easy to end up with NOTHING working.
However, this has been overdue for several years. I bought the new hardware (this time, with a contribution from my biggest user of disk space!) nearly two years ago, and installed it in the rack nearly a year ago, but only yesterday did I get the chance to spend a day sitting next to it in London coaxing it into readiness then doing the final switch over...
It didn't go entirely to plan, of course. I'd previously written a script that used rsync to copy all the user data over; the first time I ran it, it copied everything, then subsequent runs only had to copy the differences. The idea was that I would have less down time while I copied the data from the old server to the new (which has to happen with both servers offline, so that nothing can change during the copying process) if there were only the final changes to copy. However, I realised that the accounts of my biggest user of disk space weren't covered by my script, as they had been slightly hacked to accommodate their growth.
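A minimal sketch of that kind of incremental copy, not the actual script: the host name, the login user, and the function names are all hypothetical, and the rsync options are one reasonable choice rather than the ones really used.

```shell
#!/bin/sh
# Sketch only: the host name below is a hypothetical placeholder.
OLD_HOST="${OLD_HOST:-old-server.example.org}"

# Copy one directory tree from the old server onto the same path here.
# -a keeps owners, permissions and timestamps; -H keeps hard links;
# --numeric-ids avoids remapping uids/gids by name between the machines;
# --delete removes files here that have since been deleted on the old server.
sync_tree() {
    rsync -aH --numeric-ids --delete "root@${OLD_HOST}:$1/" "$1/"
}

# Sync every tree named as an argument, stopping at the first failure.
sync_all() {
    for tree in "$@"; do
        sync_tree "$tree" || return 1
    done
}
```

Running something like `sync_all /home /var/mail` repeatedly while both servers are live does the bulk of the copying, so the same call during the offline window only has to move what changed since. The weakness is the one described above: any data living outside the listed trees is silently skipped.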
And the whole process of moving the software configuration was made more complex by the fact that I had previously been running two servers in a kind of symbiotic cluster, in order to meet the load with the hardware of the time. Nowadays 64-bit multi-core behemoths with gigabytes of RAM are cheaply available and well supported by NetBSD, so everything can be done on one box. This is a much simpler setup, but it means that I had to undo the complexity of the previous setup when transferring everything across!
I ran into a few other unexpected problems, too; I noticed that the clock on the new server was terribly wrong, despite it running NTP. I did a manual ntpdate, and then just in case, another to check that it was now only a few milliseconds out - but it was already half a second out again! It quickly became apparent that the clock was ticking about one second in every two seconds of real time...
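The arithmetic behind that conclusion can be sketched as below. The function name and the sign convention (offset = true time minus local clock) are assumptions made for illustration, and the one-second gap between the two checks in the example is a guess, since the real interval isn't recorded.

```shell
#!/bin/sh
# clock_rate OFFSET1 OFFSET2 ELAPSED
#   OFFSET1, OFFSET2: (true time - local clock) in seconds at two checks
#   ELAPSED:          real seconds between the two checks
# Prints how many seconds the local clock advances per real second.
# A clock that falls behind by (OFFSET2 - OFFSET1) over ELAPSED real
# seconds only advanced ELAPSED - (OFFSET2 - OFFSET1) seconds itself.
clock_rate() {
    awk -v a="$1" -v b="$2" -v t="$3" \
        'BEGIN { printf "%.2f\n", 1 - (b - a) / t }'
}
```

With an offset of 0.002 seconds just after the first correction and 0.5 seconds at a check assumed to be one second later, `clock_rate 0.002 0.5 1` prints `0.50`: the clock is running at about half speed.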
Looking in the output of sysctl -a, it became apparent that I had a choice of time counter sources: it was using the TSC, but I also had an HPET, a clock interrupt, an APIC clock, and the good old 8254; my machine was brimming with alternate clocks. I tried switching to the HPET with sysctl -w kern.timecounter.hardware=hpet0 and suddenly time was running as expected. I popped that in /etc/sysctl.conf so it would come back on reboots, resynched the clocks, and everything's been fine since. I can only presume that the kernel was reading the CPU clock speed wrong, or some kind of dynamic clock scaling is happening, so that the (CPU-based) TSC wasn't having its ticks converted to seconds properly.
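The steps above can be wrapped up roughly as follows. This is a sketch for a NetBSD-style sysctl: the function names are hypothetical, and the optional second argument to persist_timecounter exists only so the helper can be tried against a scratch file instead of the real /etc/sysctl.conf.

```shell
#!/bin/sh
# List the time counter sources the kernel offers, and the one in use.
show_timecounters() {
    sysctl kern.timecounter.choice kern.timecounter.hardware
}

# Switch the running kernel to the named source, e.g. hpet0.
use_timecounter() {
    sysctl -w "kern.timecounter.hardware=$1"
}

# Record the choice in sysctl.conf so it is reapplied at boot.
# The line is appended only if an identical line is not already there.
persist_timecounter() {
    conf="${2:-/etc/sysctl.conf}"
    line="kern.timecounter.hardware=$1"
    grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
}
```

Run as root, `use_timecounter hpet0` followed by `persist_timecounter hpet0` matches the fix described here; checking `show_timecounters` first confirms that hpet0 is actually on offer on a given machine.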
I had a big setback with the email setup. NetBSD comes with Postfix as part of the base system, but I wanted a more recent version from packages, and I ended up getting tangled over which version was being run in various situations and which configuration file was being used; that took a while to sort out. And then of course there's Mailman, the mailing list server software, which is complicated by needing write access to its filesystem-based state when run from the mail system (for incoming mail) or the Web server (for the web interface), so it uses lots of setgid binaries and group-writable files and the like, and always takes a lot of fiddling to get working properly.
But... I did it. And so, having completed my tax returns earlier this year (which is what freed up the time to prepare for and do this mission), I have now gotten rid of all the major obligations that have been hanging over me for the past few years.
I still need to visit London again - I've left the old servers running alongside the new in case I missed any files that need to be transferred; I'll give people a chance to check I've not missed any of their stuff before remotely powering them down (to save electricity, which I pay for) and coming in to take them (to free up the space). But that's relatively easy!