Debugging Fu, Part 1
A How-To Guide for Debugging, Part 1, Reproducibility
Get It Reproducible.
If you can't demonstrate a bug 100% of the time, you are in hell. You are in absolute and total hell and you will be burning in a lake of fire until such a time as the bug is either fixed or is reproducible. Have you ever watched someone debugging multithreaded code or an OS race condition? Yeah, that's hell.
So when a bug is reported or assigned to you, your first assumption should be that there might be a bug until you have it conclusively demonstrated on your own box or your build server under controlled conditions. So after QA or support has shown you an issue, or after you have pulled a task out of the product backlog, your first job is to make that bug happen.
You, like me, might suffer from hubris or excessive curiosity from time to time: you leap right to the source code, start reading, spot the bug by sight, and fix it. Stop. Just stop and walk away from your source. You might be right, but you really want to start with a test first. Putting in a pre-emptive fix can have a few horrible side effects. It can mask the problem if the actual cause was deeper. It can also introduce new bugs. Finally, it changes your core assumption from "there might be a bug" to "there isn't a bug." All of these represent shifts in the playing field, and none of them are in your favor.
In the best of all possible worlds, you should be able to write a unit test that induces the bug with machinery to automatically detect it. Hooray - this is terrific. This is your goal. This unit test will hopefully live as long as the code it tests and will be a shining beacon of reassurance that all is well.
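The article's examples are .NET-flavored (NUnit), but the shape of such a test is the same in any language. Here is a minimal Python sketch; `parse_price` and the reported input are hypothetical stand-ins for whatever your bug report describes.

```python
import unittest

# Hypothetical function under test. Pretend the bug report said:
# "parse_price('1,000.50') returns the wrong value." The version shown
# here is already fixed, so the test stands guard and stays green.
def parse_price(text):
    return float(text.replace(",", ""))

class TestReportedBug(unittest.TestCase):
    def test_parse_price_with_thousands_separator(self):
        # Encode the exact input from the bug report, not a cleaned-up
        # version of it. Run with: python -m unittest <this file>
        self.assertEqual(parse_price("1,000.50"), 1000.50)
```

The test fails while the bug lives and passes forever after, which is exactly the beacon the paragraph above describes.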
Avoid doing anything at all to upset this beacon.
You might have gotten a test case from a customer or from QA which you've put into your unit test. You, being an experienced developer, want to read between the lines and rewrite or refactor the code to get to the crux of the matter. Don't. Again, just don't. Walk away from your instinct. You're at war here and you're now fighting yourself as well as the bug. If you refactor the test case and in the process lose the reproducibility of the bug, you are inviting the assertion, "clearly the test must be wrong" instead of the question, "did I just screw up the test?" If you want to refactor or simplify, copy your failing unit test and simplify that. Leave both tests in until they both go green. Then and only then, consider removing the wordier test.
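A sketch of that discipline in Python (all names hypothetical): the verbatim customer case and the simplified copy live side by side until both pass.

```python
import unittest

# Hypothetical function under test.
def parse_fields(record):
    return record.strip().split(",")

class TestCustomerBug(unittest.TestCase):
    def test_verbatim_customer_case(self):
        # The customer's input exactly as reported, whitespace and all.
        # Do not "clean this up" -- the noise may be the bug.
        self.assertEqual(parse_fields("  a,b,c  "), ["a", "b", "c"])

    def test_simplified_copy(self):
        # A pared-down copy for readability. Remove the verbatim test
        # above only after both tests go green.
        self.assertEqual(parse_fields("a,b"), ["a", "b"])
```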
What if I can't reproduce it?
I sit down at my machine, follow the steps in the bug report, and don't see it. What now? Hopefully you put the repro into a unit test and checked it in so your build server picks it up. Make sure the build server goes green. Then go back to whoever reported the bug and have them run your test. You might have to repackage it into an app or an assembly, but so be it. If your test passes and theirs does not, play "one of these things is not like the other" until you can figure out why. You will have to drop assumptions and work from givens. Is the OS the same? Is the version of .NET the same? Same compiler? Same runtime libraries? And so on. We had unit tests that passed on one build server but failed on another because the versions of the installed TrueType fonts were different. Sad, but true.
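One low-tech way to play "one of these things is not like the other" is to dump a fingerprint of each environment and diff the two. A Python sketch (the keys here are illustrative; extend the dictionary with whatever your stack depends on):

```python
import platform
import sys

# Print an environment fingerprint on both machines, then diff the
# output. Extend this with runtime versions, installed fonts, locale,
# hardware details -- anything your code might silently depend on.
def environment_fingerprint():
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }

for key, value in sorted(environment_fingerprint().items()):
    print(f"{key}: {value}")
```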
Do whatever you can to eliminate target environment as the cause.
What if I can reproduce it...uh...sometimes?
Welcome to my world. If you routinely get handed bugs like this, it means one of two things. Either you or your peers are writing some really terrible code that needs a solid dose of code review, or you are already a ninja debugger and the hard cases get escalated to you because you're the only one who finds them. I like to think that I'm in the latter camp.
So here's what you do. If your test only fails sometimes, there are usually five main causes of this:
- Persistent side-effects. Has previous code (or a previous test) modified global state thereby affecting later code? I see this most often in tests that fail on the server but pass on local machines. NUnit and TestDriven don't run tests in the same order, for example, and if a previous test hoses a global and doesn't restore it, everything that uses it later on will behave differently.
- Pointer chicanery. If you use uninitialized pointers or write outside an allocated block, you are creating bugs that may be benign initially but whose behavior changes as the runtime environment changes. An uninitialized pointer in a stack variable contains whatever was last on the stack, which might be valid for a while; if the path of execution changes, it may suddenly become crashworthy at best or shotgun your heap at worst. If you write outside a block, your code might keep running until the memory layout changes. For either failure mode, gflags may be your only hope.
- Threading or multiprocessing. Ah, we love multiprocessing. We love it dearly. The first pass of the code is usually so beautiful, and then it deadlocks; you inject debugging code, which changes the timing, and you no longer get the deadlock. Or you put in a fix that eliminates the deadlock in your debug rig, and then it (or a different deadlock) comes right back when the debugging rig goes away. You have my sympathies. Look at Joe Duffy's book (here's his blog) or Stephen Toub's blog.
- Tests that rely on randomness. No kidding. I'm of two minds on this. I hate tests based on random input because you only get an indication when something is wrong, not when it's right. I love random tests because they do turn up unexpected errors. So here is my current internal guideline for writing tests that exploit random inputs: the test itself, upon failure, should generate output which is source for a new unit test that will fail because it uses the same input as the failing iteration of the random test. This means that when a random test fails, you immediately have source code for a reproducible test. Yes, writing a test that generates tests is a royal pain, but that pain should serve as a deterrent to writing random tests casually, and it will change your reaction to a random test failure. My old reaction was, "oh crap, not the random test again." Now my reaction is much calmer, since the next step is trivial.
- Resource limitations or other issues that affect timing. This is usually only an issue in network programming, so I won't touch on it here.
You want to get to the point where you can either reproduce the bug 100% of the time or identify which of the five causes above is in play. Once you know what is causing the irreproducibility, you can start laying traps to make the bug reproducible.