Debugger is not a naughty word (by alaric)
Other practices
Unit tests
The principle of unit tests is that, before or in parallel with the writing of each and every subroutine, subsystem, or sub-anything in your program, you should write other short programs that test that subroutine. These tests are filed into a library of tests, so that at any point, it's relatively easy for any programmer on the team to request that all the tests be run, and any that fail reported.
This has two benefits.
- If you run all the unit tests after every change you make to the application, then any remote parts of the system you have unexpectedly broken will quickly make themselves known by failing their unit test. Which is brilliant news!
- Structuring your software to make it easy to write unit tests for in the first place tends to make it clearer and less prone to bugs. For a start, unit testing means running internal bits of your program in unusual orders, often several times over with different initial states to check different aspects of its behaviour. This is easier to do if your program is in a more functional style, since it just involves calling the functions with different inputs and checking the outputs. More imperative programs tend to be harder to test, as you often have to jump through hoops to engineer the desired initial state of the system in order to test an operation. But this only applies if the system is written with unit tests from the ground up, rather than unit tests being retrofitted.
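To make the idea concrete, here's a minimal sketch of a unit test in C - the function and the test are hypothetical, not taken from any real codebase - showing how a pure function can simply be called with various inputs and its outputs checked:

    #include <assert.h>
    #include <stdio.h>

    /* The code under test: a small, pure function. */
    static int clamp(int value, int low, int high)
    {
        if (value < low) return low;
        if (value > high) return high;
        return value;
    }

    /* The unit test: exercise the function with different inputs and
       check each result, so a later change that breaks it gets noticed. */
    static void test_clamp(void)
    {
        assert(clamp(5, 0, 10) == 5);    /* value already within range */
        assert(clamp(-3, 0, 10) == 0);   /* below the range */
        assert(clamp(42, 0, 10) == 10);  /* above the range */
        printf("test_clamp: ok\n");
    }

    int main(void)
    {
        test_clamp();
        return 0;
    }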
However, I'd have been unlikely to have predicted that the reminder would be sensitive to a table creation occurring during its operation, and thus unlikely to have explicitly written a unit test to check that case. Still, a unit test of the reminder would have increased the likelihood of the problem being noticed, since a table creation could have ended up happening during the remind operation by accident and thus caused the problem.
Also, some things are hard to unit test. Software that deals with external state, like an SQL database or the filesystem, cannot be written in a more functional style, so you still have to jump through hoops to set up the external states for testing. Software that deals with multiple independent agents - be they different threads, different processes on the same machine, or different processes communicating over a network - is inherently nondeterministic, and so difficult to create repeatable tests for, too.
And unit tests are extra up-front work. The programmers are writing unit tests rather than writing new features! This can be hard to justify to the management...
Formal methods
Formal methods are, perhaps, the ultimate in bug-prevention techniques.
The idea is simple: you document absolutely every consequence of every operation in your program, in a formal mathematical notation that leaves no ambiguity. Nothing can be left out "because it's obvious".
If I were using formal methods, then when I wanted to modify the Get operation to sometimes close and reopen the DB4 environment, I'd initially have been unable to: closing the DB4 environment would not have been permitted within the "get" operation, since it was not among the declared consequences of "get".
So before I could do it, I'd have had to change the declared consequences of "get" - which, sure enough, would then have caused an error to be raised in my proof checker where I'd declared in the reminder algorithm that the cursor had to remain valid over the call to "get", since the interface to the DB4 library would have formally declared that closing the DB4 environment invalidates all cursors on tables within that environment.
And if I'd forgotten to declare that the cursor had to remain valid, then the proof checker would raise an error when I tried to prove the correctness of the reminder algorithm, as using an invalid cursor would be declared illegal by the DB4 library's formal specification.
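To make that a little more concrete, here's a schematic sketch of the kind of declarations involved - the names, types and notation are hypothetical, not any real specification language and not the actual DB4 interface:

    /* Hypothetical placeholder types, standing in for DB4's handles. */
    typedef struct table table_t;
    typedef struct environment environment_t;

    /* get:
         requires: the cursor is valid
         ensures:  *result is set to the value stored under key;
                   nothing else changes, so the cursor is still valid */
    int get(table_t *table, const char *key, const char **result);

    /* close_environment:
         ensures:  every cursor on every table in the environment
                   is invalidated */
    void close_environment(environment_t *env);

Adding "may close and reopen the environment" to the declared consequences of "get" would then contradict any proof that relies on a cursor remaining valid across the call, which is exactly what the proof checker would flag.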
Needless to say, there's still scope for human error, as humans have to sit and declare everything. It's possible for some things to be undeclared, or declared wrongly. Most of the time this will be noticed by the proof checker since it invalidates the proof of correctness of the program, but occasionally - rarely - a mistake can be made that does not generate a logical inconsistency.
And, needless to say, it slows down development a lot. The formal declarations can be more complex than the program itself. And whenever you want to change anything, there can be a lot of reshuffling as the implications of this change are made clear throughout the system. If operation A now may have an extra side-effect, then all the operations that use A may now have that side effect - and all that use those operations - and so on.
So formal methods are pretty unpopular in the mainstream programming world, while logicians and computer scientists work on better proof checkers and theorem solvers to try and automate more of the process.
But all is not lost - I've used them myself, for small but critical parts of a program. They're a lot easier to apply to programs written in functional style, and for certain types of problem. And they can be worth applying to parts of a program where the costs of a bug are high.
Bugs in third party components
Sometimes, a bug is reported, and upon investigation, it turns out to be the fault of a third-party component you're using rather than your own code.
A case in point: one function of some software I wrote seemed unable to run any faster than 25 times per second, when it should have been able to manage far more. Investigation revealed that part of this operation involved writing to memcached, and that, for some reason, this write was taking 0.04 seconds.
Writing to memcached involves sending the following over a TCP connection:
- A header line
- The data itself
- A trailer to mark the end of the data
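For example, with the standard memcached text protocol (the key, flags and expiry values here are arbitrary), storing the five-byte value "hello" under the key "foo" means sending:

    set foo 0 0 5\r\n    <- the header line: command, key, flags, expiry, data length
    hello                <- the data itself
    \r\n                 <- the trailer marking the end of the data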
My software was sending each of these things as a separate write to the TCP connection, since the block of data was already at some location in memory and the header and trailer were not adjacent to it. However, when I recorded what was happening at the IP level, it transpired that there was a 0.04s delay between sending the header line and getting the acknowledgement from the server that it had arrived.
This was quite odd: there was no such delay in getting either of the other two bits of data acknowledged.
So rather than finding out why this delay arose, I adjusted the code that writes to memcached. I used something called scatter/gather I/O, which is a way of telling the operating system to write several independent blocks of data in one operation, and told it to write the header, the data, and the trailer. This made my code a bit more complex, but when I tested it, the problem went away.
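On a POSIX system, scatter/gather I/O means the writev() call; here's a minimal sketch (with hypothetical names - this isn't my actual client code, and a real client would also loop to cope with short writes) of sending the three pieces in one go:

    #include <sys/types.h>
    #include <sys/uio.h>

    /* Send the memcached header line, the data block and the trailing "\r\n"
       with a single writev(), instead of three separate write() calls. */
    static ssize_t send_memcached_value(int sock,
                                        const char *header, size_t header_len,
                                        const char *data, size_t data_len)
    {
        struct iovec iov[3];
        iov[0].iov_base = (void *)header;   /* the header line */
        iov[0].iov_len  = header_len;
        iov[1].iov_base = (void *)data;     /* the data itself, wherever it already lives */
        iov[1].iov_len  = data_len;
        iov[2].iov_base = (void *)"\r\n";   /* the trailer marking the end of the data */
        iov[2].iov_len  = 2;
        return writev(sock, iov, 3);        /* one system call, three blocks of memory */
    }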
So I think I was hitting some obscure behaviour of the Linux TCP/IP stack. I'm not sure. Either way, it didn't seem to be anything my code was doing wrong, and I managed to work around it.
But it's not always easy to work around these problems. The reason my application does its own writing to memcached, rather than using an off-the-shelf memcached client library, is that the client library I was using turned out to be causing segmentation violations. I was aware that this could have been caused by my own code - just as the crash in the reminder was happening within a DB4 library function but was due to my (ab)use of DB4 - but I couldn't see how. So I wrote my own memcached client library and used that; if the problem was really being caused elsewhere, then at least writing my own client library would probably reveal that fact, since I would be able to better understand any interference with its operation caused by other bits of code.
And as it turned out, things worked perfectly with the new library, so I suspect it was the old library at fault. But still, I'm not sure.
It's also quite painful to hit a bug in a third-party component, since what people see is your software failing to operate correctly. They're not usually very amused if you try and tell them it's not your fault - particularly if you chose the third-party component in the first place...
One of the advantages sometimes cited for open source software is that when a bug appears in something you are using, you can go in and fix it yourself. This is true: the fact that I had the source code for DB4 made it a lot easier for me to trace what it was doing when it crashed. But it's not always easy to do that; the memcache library I was using, for example, was not easy to understand, and I found it easier to rewrite it than to debug and fix it.
Installing binary packages makes it a little harder, too. When I was working on the crash in DB4, I was using a binary DB4 library installed by apt. I downloaded the DB4 source code and could have compiled it with debugging symbols, thus letting the debugger automate my work in tracing the assembly, but I knew it would be hard to recreate the situation in which the installed version of DB4 had been compiled - it would have been more trouble than it was worth. Package systems that build from source, like NetBSD's pkgsrc, are in my experience easier for developers - if I want to rebuild a package slightly differently, I can just ask for the build process to be paused before compilation, go in and hack the makefile or run configure by hand, then install it as usual.
There are some bugs you'll never catch
No matter what you do, there are always some bugs you'll never catch. For example, even if you can perfectly map from a design to a bit of software implementing that design, there can still be flaws in the design. Things like formal methods can catch a lot of design flaws - those where the design is internally inconsistent - but by no means all.
One example of a difficult bug I had recently was in a piece of software that has (for reasons irrelevant to this discussion) to listen to a stream of numbers that arrive in approximate order, and track the largest number such that we have seen every number less than or equal to it.
Confused yet?
Imagine our stream of numbers so far was 1,5,2,4,3,6,8,9.
The highest number we've seen all numbers less than or equal to is 6, because we've seen 1,2,3,4,5 and 6 but not 7.
The way we do this is to remember the highest number we've seen all the numbers up to and including - let's call it C. We also keep a list of numbers we've seen since. Since computer memory isn't infinite, we have a fixed maximum size for this list - in this case, I chose 16.
As we receive each number, we follow this process:
- If the number is one more than C, then we set C to it. However, this may now mean that some numbers in the buffer can be moved out - in the example above, imagine that 7 arrived; 8 and 9 can then come out of the buffer. So if the number is one more than C, after setting C to it, we see if C+1 is in the buffer. If so, we remove it, set C to the new value of C+1, and repeat the process - seeing if the new C+1 is in the buffer. This is repeated until we don't find C+1 in the buffer.
- If it's not C+1, then we try to put it in the buffer.
- But if the buffer is full, we have a problem. The list of numbers is supposed to be nearly sorted, so numbers shouldn't appear too far out of order. If the buffer is full, it means that we've had more than 16 numbers turn up while we're still waiting for a particular one. This is a bigger gap than should ever occur, so the number we're waiting for has probably been lost, in which case we have a recovery process:
- Find the lowest number in the buffer. Call it L.
- Issue a warning message that the numbers from C+1 to L appear to have gone missing.
- Set C to L.
- Scan the buffer for C+1 - if it's found, remove it, and set C to C+1, and repeat this step.
This worked fine, until in a test under extreme load, some numbers were delayed by more than 16 places. My algorithm had assumed that if a number was more than 16 places out, it would never arrive; its behaviour when a number turned up after we'd given up on it turned out to be rather suboptimal.
Imagine in our example above, 7 didn't appear for quite some time. So the buffer eventually filled up, a warning message was issued, and 7 was skipped.
So say we get to C being about 30, when 7 turns up. Step through the process above. 7 is not C+1 (31), so we don't advance C; instead, we put 7 into the buffer.
Uhoh!
There are two ways a number comes out of the buffer: by being C+1 after C has just been updated, or by the buffer overflowing and the lowest number in it being removed.
Once C has passed 7, 7 will never equal C+1. So that 7 will sit in the buffer, until the buffer next overflows. Since the 7 is sitting there, it's taking up space, meaning the buffer will then overflow that much sooner, too. When the buffer does overflow, disaster strikes - 7 is the lowest number there, so it gets picked as the new C. But the numbers currently arriving are in the 30+ range, so once C has become 7, all the numbers will be going into the buffer until it overflows again, and the new lowest number - 31, say - then becomes C, and we carry on from there as usual, having recovered after a period of incorrect operation.
Whoops.
The fix was simple - add a step after step 1: "If the number is less than C, issue a warning message, and then don't try to put it in the buffer."
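For illustration, here's a minimal sketch in C of the process described above, including that fix. The names are hypothetical, warnings just go to stdout, and the real code also has to cope with the numbers eventually wrapping back to zero, which is ignored here:

    #include <stdio.h>

    #define BUF_SIZE 16

    static unsigned int C = 0;             /* highest number up to which we've seen everything */
    static unsigned int buffer[BUF_SIZE];  /* numbers seen since, in no particular order */
    static int buf_count = 0;

    /* Remove n from the buffer if it's there; returns 1 if it was found. */
    static int take_from_buffer(unsigned int n)
    {
        int i;
        for (i = 0; i < buf_count; i++) {
            if (buffer[i] == n) {
                buffer[i] = buffer[--buf_count];   /* swap the last entry into the hole */
                return 1;
            }
        }
        return 0;
    }

    static void receive(unsigned int n)
    {
        if (n == C + 1) {
            /* Step 1: advance C, then keep draining the buffer while C+1 is in it. */
            C = n;
            while (take_from_buffer(C + 1))
                C = C + 1;
            return;
        }
        if (n < C) {
            /* The fix: a number we had already given up on has turned up late. */
            printf("warning: %u arrived late, ignoring it\n", n);
            return;
        }
        if (buf_count == BUF_SIZE) {
            /* Buffer full: assume the gap below the lowest buffered number is lost. */
            unsigned int lowest = buffer[0];
            int i;
            for (i = 1; i < buf_count; i++)
                if (buffer[i] < lowest)
                    lowest = buffer[i];
            printf("warning: %u to %u appear to have gone missing\n", C + 1, lowest - 1);
            take_from_buffer(lowest);
            C = lowest;
            while (take_from_buffer(C + 1))    /* drain anything now made contiguous */
                C = C + 1;
        }
        buffer[buf_count++] = n;               /* not C+1, so remember it for later */
    }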
By @ndy Macolleague, Tue 15th Jan 2008 @ 2:02 pm
Hi,
Here is a quick "highest number we've seen all numbers less than or equal" algorithm that I just cooked up whilst reading this. It's pretty much at the pseudo code level and I've not even tried to compile it but I think it illustrates a point.
I don't store the numbers that come in: I just store a set of flags so that we know if we have seen a given number or not.
int main (void) {
    int i;          /* Most recent value read */
    int blk, idx;   /* The block that i lives in and the index into block */
    int mem[1024];  /* Some space. Highest value is 1024 * sizeof(int) * 8. */
    int h = 0;      /* The block that we are up to */
    int val;        /* The answer so far */
}
Seeing as we are tracking h, we could probably throw away / reuse the early parts of mem as they fill up. That would mean that mem would only have to be big enough to store the range of values that might be undecided at any given time. To do that would make the illustration above less clear, so I didn't bother, as it'd be mostly memory management rather than part of the core algorithm.
By @ndy Macolleague, Tue 15th Jan 2008 @ 2:04 pm
Hmph! The formatting of the variable declarations has been messed up. Here it is again:
int main (void) {
    int i;          /* Most recent value read */
    int blk, idx;   /* The block that i lives in and the index into block */
    int mem[1024];  /* Some space. Highest value is 1024 * sizeof(int) * 8. */
    int h = 0;      /* The block that we are up to */
    int val;        /* The answer so far */
}
By Gavan Fantom, Tue 15th Jan 2008 @ 2:12 pm
I would have expected that for nearly-sequential input, storing ranges (ie extents) would have been a reasonable compression. The answer is then the size (well, strictly speaking, the higher end) of the first extent.
The size of the extent map would then be a reasonable metric for the extent of disorder in the input.
Depending on the quality of the input, an extent map of all numbers not seen so far may also be a suitable choice.
By @ndy Macolleague, Tue 15th Jan 2008 @ 2:19 pm
Hi,
This thing probably has loads of bugs... The for loop will prevent the while loop terminating. So change it to "while (val < (1024 * sizeof(int) * 8))" or just change the memory management model and make it "while (1)".
By @ndy Macolleague, Tue 15th Jan 2008 @ 2:27 pm
Well... the for loop should have prevented the while loop from terminating as it should be h<1023 not h<1024 otherwise we end up with a bounds overflow on the next line. Perhaps I should have compiled this... Maybe I still should...
Oh dear.
By Gavan Fantom, Tue 15th Jan 2008 @ 2:36 pm
In my experience, the biggest factor in improving quality in software is visibility. You touched on a few of the "rules" which have been proposed over time, but the overriding rule as far as I'm concerned is "Write clear and readable code". If you can read it and understand it, you have more chance of spotting when something is not right. This usually extends to having a clear write-up or diagram describing the design clearly so that someone trying to understand the code can understand the design at the same time.
But that's not the be all and end all of visibility. Unit tests can help here, but only if you actually run them.
Amazingly, large codebases with multiple developers often suffer from the problem of the head of the repository not even compiling. When everybody is updating all the time it gets noticed and fixed regularly, but when you have people doing the bulk of their development on a snapshot or a branch and only updating to the head or the trunk occasionally, this can become a really big problem. Again the way to solve this is by greater visibility - in this case autobuilding.
Another frequent failure is to spot a bug (or a potential bug), not fix it straight away, and then forget about it. Two years later it rears its ugly head, and by that time you've completely forgotten about it and have to debug from scratch. It is only after spending weeks debugging that you realise that you've already seen this, and then you wish you'd fixed it at the time, or at least written a TODO item, or tracked it in a bug database.
And if it's in a bug database, it's visible. It's measurable. You can generate a report of all known bugs.
So once you have readable code, documented design, (some) unit tests which you run daily right after your nightly builds, and a bug database full of all the bugs which you spotted and didn't have time to fix, you still have bugs that you don't know about. What about them?
Well, software is not fault-tolerant except in extremely specific cases. So if a fault does occur, make sure your code spots it early and fails noisily. This will increase visibility of the problem. It will also prevent knock-on failures, as the code will simply stop executing once bad data has been detected. And also, hopefully, it will make it easier to isolate where the problem occurred. This is all extremely good news when it comes to debugging. A failed assertion is much more useful than a Segmentation violation.
The more you can see, and the clearer it is, the higher the quality of the resulting software, given a competent and conscientious programmer.
By alaric, Thu 17th Jan 2008 @ 10:29 am
@ndy: That's similar to the principle I used, I guess. Since the algorithm runs continuously with an ever-increasing (and, eventually, wrapping back to 0) sequence of numbers (yes, before people ask, this is for packet sequence numbering in a network client!), I indeed have a buffer, but set up so that it slides along the space of sequence numbers, so it only stores the "frothy" region; the "solid" region where we have seen all the sequence numbers gets pushed off of the bottom of the buffer ASAP to free up space for froth.
Gavan: Yep, good points, thanks