Programmer Smart, C Smarter
For the second or third time in the course of a very large programming project I’m working on, I have discovered that a big runtime problem I was having was due not to program or computational complexity but because I did something very basically stupid. I’ve been programming a population genetic simulation that is, by its nature and rationale, very complex. One of the biggest problems in the history of population genetics is that programming multilocus (and in some case, multi-allele) problems results in hugely costly use of programming memory. Programs that seek to model systems with non-free recombination (recombination rates other than 0.5) must hold a huge number of variables.
Part of this problem I’ve solved by deploying haploid. However, the whole point of programming that algorithm was so that I could code even more complex things, like populations with age-structure, the project I’m talking about now. However, in trying to get this program to run I’ve had a string of problems that looked like the program was running out of resources. I have been using OpenMP to speed up the computations, but it seemed that I was running into either a race condition or some bizarre side-effect of parallelism where threads were waiting for threads that were otherwise blocked. In other words, I’d run my program, on my quad-core workstation, or on the university’s big cluster, and it would just stop after a while, despite the processors continuing to be totally occupied. A few times I had to actually turn my machine off; setting
OMP_NUM_THREADS didn’t seem to be helping either. I was really puzzled.
Then I met with my advisor and she suggested that instead of running a huge number of initial values, I should just be using one initial value (0.1) for one of my variables. I did this and got the same hangup behavior from the program. This was totally irritating me by this point, especially because my much-wiser adviser (she’s my sensei) looked at me with furrowed brow and said “It shouldn’t be taking that long.”
So I ran the program in
gdb (GNU Debugger). This had screwed me before, and it turned out gdb had a bug. I had spent all day chasing what was not a bug: when I ran the program it looked like the pointer I passed to a function was not the pointer evaluated when gdb entered the function. Then I realized that not only had
gcc failed to produce good debugging code, but I had just given the program incorrect input values.
In this case I came upon something else weird: a check that should have stopped the iteration instead evaluated to false. I was really confused until I looked at the types of the variables:
_Bool keep_going_p; keep_going_p = ... ; /* set the value */ if (keep_going_p == ERANGE) raise hell; else return library_book;
See the problem?
ERANGE is an integer. Booleans are stored as integers (i.e. you can assign an
int to a
_Bool), but they only evaluate to either true or false. 0 = false and everything else is true. So even though keep_going_p does in fact equal
ERANGE, the program can’t tell. I changed keep_going_p to an int and the program now completes (with up to 6 loci, that’s 64 genotypes to iterate over up to a billion iterations on four processors) in less than five minutes.
C Respects the Programmer’s Intellect
So one explanation for all this freedom to make stupid mistakes is that you just have to be a really smart programmer to use a language like C. C is supposed to be good for creating fast programs that run on system-level, and possibly in very resource-sparse environments (like a PDP-11). In other words C favors the creation of good compilers. The typical example is that arrays can run over their boundaries without throwing a compiler error, only resulting in (sometimes) hard to pin down runtime errors. In other words, you won’t know that you programmed the wrong number of elements in the array until you run the program and it returns garbage. That can be a long time, considering that the part of the program containing the bug might not be used for a few years after the program is released to the public.
However, what that means is that you have to know how the compiler deals with things inside the computer; you have to think of the memory of the computer in the way that the compiler deals with it. This is not a restriction, but a liberation, because it forces you to think in terms of how the computer actually works. Not as tightly as assembly language, but still pretty darn close. I see this as forcing me to think in terms of how the computer really works, and therefore coming up with better algorithms. In other words: I like it.
On top of this, this attitude of C compiler-writers fits in with the rest of the Unix philosophy: computer programs should be written for users who see themselves fundamentally as programmers. They are essentially people who know as much about the machine as the programmer, and it is completely immoral for the programmer to program in a way that insults the user by presuming the user to be a stupid and unsophisticated person.
Inevitably when someone who’s an adherent of a different philosophy hears me talking about one of these stupid mistakes, they say I should be using Java, C# or some other garbage-collected or dynamically-typed language. However, that disregards the nature of the stupid mistake, and disregards a basic problem with programming in general: I could have made the same mistake in any language with a type system. If I had done that in Lisp, or Java or whatever, the compiler still would not have caught it. There’s a difference between dynamically-typed languages and strongly-typed lagnuages. Almost every programming language is strongly-typed, and it has nothing to do with when those types are decided (compile-time versus run time). And none of the stupid mistakes I have made have involved dynamic allocation, so garbage-collection wouldn’t matter either. People often say we should all be using interpreted languages like Matlab, which I find objectionable on both philosophical (programming) grounds and moral grounds.
There’s just no beating the right tool for the job, which in the case of complex simulations that need to be portable and freely distributable, is C.