Whew. On Monday I upgraded some of the software on my primary web server, since it was running some old stuff with security holes in.
Annoyingly, the www/apache2 package in NetBSD seemed to now conflict with devel/subversion-base, since Apache 2 required devel/apr0 while devel/subversion-base required devel/apr, and they were conflicting packages. So, I had to upgrade to www/apache22. Fair enough.
One recompile later, I start Apache and begin checking the various web applications I host to see if they all still work...
...and my browser times out. Hmm, OK. I go to an open ssh window to look at the log files, and it's frozen.
I quickly check the network hasn't failed, then resign myself to the fact that my server has just dropped off of the net. It won't even ping, and I can't reach any of the services it forwards in to the backend server either, so the network stack is totally down.
So that evening I head down to the datacentre and take a look... to find that it's died handling the exit() syscall from Apache. Apparently an assertion failure inside knote_destroy or something.
Reboot. Start Apache. Start taking a look at sites.
Kerboom! It dies again in the same way.
Hmmm... Clearly, my three-year-old NetBSD 2.0 kernel is none too happy with Apache 2.2. It looks like Apache is doing something that triggers a bug in the kernel. Knotes are event notification things, so I bet Apache is doing some kind of asynchronous I/O that trips a bug in the kernel code implementing it. That would leave the process's knotes in an invalid state, so the kernel panics when it tries to clean them up after the process terminates.
So I reboot it again, stop Apache starting, and leave it at that for the time being. No web service, but everything else works.
Then this evening (the day after), I returned, now with a shiny NetBSD 4.0 install CD in hand. Nervously I backed up some critical directories, then bit the bullet and did an upgrade.
And, to my delight, it was nearly seamless. The NetBSD installer upgraded and rebooted into a nearly perfectly working system. All my existing software, compiled under 2.0, ran fine under 4.0's 2.0 emulation, with the mysterious exception of net/bind9, which wouldn't start. A quick cd /usr/pkgsrc/net/bind9; make install later, and it was starting fine. Even Apache worked without hosing the system!
I had to compile a custom kernel with routing enabled, to allow the NAT that the server provides between the single public IP of the love.warhead.org.uk cluster and the backend server infatuation; then a quick reboot and that was working too.
All in all a successful mission, and it only took an hour or two. I still need to recompile all of my packages, but only to avoid the risk of there being a problem in the 2.0 emulation. While I was there I recompiled bash and sudo, just because it's nice to be able to rely on them.
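
A rough sketch of how the list of packages still waiting for a recompile could be worked out. It assumes each package's build information records the OS version it was built under, and that pkg_info -Q OS_VERSION reports it. The stale_packages helper is a made-up name for illustration, not part of pkgsrc.

```shell
#!/bin/sh
# Hypothetical helper: reads "package os_version" pairs on stdin and prints
# the packages whose recorded build OS version differs from the one given.
stale_packages() {
    current="$1"
    while read -r pkg ver; do
        if [ "$ver" != "$current" ]; then
            echo "$pkg"
        fi
    done
}

# Assumed usage on the server itself (untested sketch):
#   pkg_info | awk '{print $1}' | while read -r p; do
#       echo "$p $(pkg_info -Q OS_VERSION "$p")"
#   done | stale_packages "$(uname -r)"
```

Anything it prints was built under a release other than the running one, so it is a candidate for a rebuild from /usr/pkgsrc. A package with no recorded version shows up too, which errs on the side of rebuilding.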