This weekend I got quite a lot done on liboca, mainly working on the network stack, and the basic implementation of OCP.1, which is the TCP/IP-based protocol for OCA (There is room in the OCA standard for other transports). Anyhow, at one point, my unit tests broke in a suite of tests that tested code that should not have been affected by the code I was working on. In the end the fact that I had set up Valgrind, ended up showing me how to solve my problem, so I learned that hard way that it is an essential tool to run along with your standard test suite.
What is Valgrind?
"Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools. " - valgrind.org
In its simplest form, it takes your code, instruments it and runs it on a processor emulator, and tells you if you’ve leaked memory, accessed memory you shouldn’t have, and a whole bunch of other stuff – I encourage you to check out the website for more detail.
So what happened?
So we already got to the fact that I had a theoretically non-impacted test stubbornly crashing with:
Segmentation fault (Core Dumped)
Frustratingly, I couldn’t reproduce the crash in a debugger, which left me thinking it was a race condition for a while, but all the tricks I tried to hunt down the race didn’t give me any inclination as to what was going on. Also, the core was not being dumped, by default Ubuntu doesn’t do that. You need to enable core dumps with:
$ ulimit -c unlimited
Anyhow, I then debugged the core dump:
$ gdb build/tests/all core
Finding that the Seg-fault was inside Google Test! Googling for this issue showed no known issues around this part of the code, so I was a little stumped.
So what next?
Looking further up the stack trace showed that it happened inside _malloc_consolidate which previous investigations have lead me to attribute to memory corruption, most often heap trashing. Anyhow, I have a bunch of verification scripts set up to automatically run the unit tests, static analysis and dynamic analysis (valgrind!), and was indeed running them. Lo and behold, valgrind (memcheck) was turning up 47 issues, all of which were relating to a few lines of code (that I had been working on) where I was deleting an object which was still needed by an asynchronous process. I fixed the errors rather easily, and hey presto! No more Segmentation Fault.
You see the problem was not even the unit test in question, but about 10 tests earlier, the heap was getting trashed, which was causing the test runner to seg-fault later on.
So what did I learn?
In the past I’ve had policies on repositories that no code makes it in without a successful build and test-run. Usually I compile with -Werror (or similar) so that my code never has any warnings (because I have to fix them, since they’re treated as errors). This time I added cppcheck and valgrind because I figured they would be useful. This experience taught me that if I ever encounter weird issues while I’m working, I should run the full verification suite because I will probably find something related to the weird issues that way.