DNS issues today (by alaric)
Gah! This morning, my alerting system texted me to say that love (the primary server) couldn't talk to ihibehu (the backup server in the USA). A quick look confirmed that we seemed to have some kind of routing loop in level3's network, which was therefore returning "TTL exceeded" to pings. I could connect to ihibehu OK from another network, confirming that it was just a local routing spat of some kind. I shrugged and moved on with life.
However, people started complaining they couldn't resolve DNS for stuff I host, so I had another look. love and ihibehu are both DNS servers (they go by the names ns0.warhead.org.uk and ns1.warhead.org.uk in that role), and if one is unreachable, then the other should be contacted, so all should have been fine. However, it turned out that the IP address for ns0.warhead.org.uk was still pointing to its old location (and love don't live there anymore), so ns0.warhead.org.uk wasn't "working"; and for the people whose route to ihibehu went via the routing problem, ns1.warhead.org.uk wasn't working either.
Oops! One tricky aspect of distributed fault-tolerant systems is that sometimes part of them fails and you don't realise because all the user-visible stuff silently fails over. Therefore, you need to test things below the failover layer to make sure they work individually. Although I was checking that both DNS servers were up, I wasn't checking that the "glue records" mapping the nameserver names to IPs pointed to the right place...
But I clearly remembered sending in the request to the registrar to change the glue record for ns0.warhead.org.uk when I moved it, didn't I? I checked my emails and, yes, I'd sent that request, but with all the other stuff I was dealing with in the migration, I never chased it up. And lo, nestled among my spam emails was a response from the registrar, reminding me that I still had access to the interface to do it myself (the registrar used to be me, but I passed that mantle on to somebody else), and suggesting I do so. So it had never gotten done.
"No time like the present, then," I thought, and set out to make the change myself, only to find that I no longer have access to the interface, because it also needs a password which I removed from my password databases when I passed control of the registry interface over. Doh!
So I've re-requested that the registrar does it for me. Thankfully, the routing loop has healed up and all is working again while I wait for that to happen. And I'm going to add a test to my monitoring system that checks my glue records are correct, because that was just sloppy!
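A minimal sketch of what such a glue-record test might look like, in Python. Everything specific here is an assumption for illustration: the IP addresses are documentation-range placeholders rather than the real ones, and `lookup_glue` is a hypothetical callable that the monitoring system would have to supply (for example, by sending a non-recursive query to one of the parent zone's nameservers and reading the addresses out of the referral). Only the comparison logic is shown.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical expected glue: nameserver name -> addresses it should map to.
# These are documentation-range placeholder IPs, not the real ones.
EXPECTED_GLUE: Dict[str, Set[str]] = {
    "ns0.warhead.org.uk": {"192.0.2.10"},
    "ns1.warhead.org.uk": {"198.51.100.20"},
}


def check_glue(
    expected: Dict[str, Set[str]],
    lookup_glue: Callable[[str], Iterable[str]],
) -> List[str]:
    """Compare the glue seen at the parent zone with what we expect.

    lookup_glue is an assumed helper: given a nameserver name, it returns
    the addresses the parent zone currently hands out for it.
    Returns a list of problem descriptions; an empty list means all is well.
    """
    problems: List[str] = []
    for name, want in sorted(expected.items()):
        try:
            got = set(lookup_glue(name))
        except Exception as exc:
            # A failed lookup is itself worth an alert.
            problems.append(f"{name}: lookup failed ({exc})")
            continue
        if got != want:
            problems.append(
                f"{name}: glue is {sorted(got)}, expected {sorted(want)}"
            )
    return problems
```

The point of checking each nameserver separately, and of treating a failed lookup as a problem in its own right, is that the check sits below the failover layer: a stale record for one server gets reported even while the other server is happily answering queries.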