Friday, April 28, 2006 8:56 AM
by
loufranco
On Optimizing: Tips for the inner loop
Everyone knows the first two rules of optimizing: (1) Don't pre-optimize, and (2) Use a profiling tool to make sure you spend time optimizing the right code. I mostly agree with (1) -- I'll get back to that in a bit, and nothing beats a good profiling tool. We use
AQTime here, and it's .NET integration and seamless support of both managed and unmanaged code is a big help when you're writing fast algorithms in unmanged code with a managed wrapper.
Another pearl of wisdom is "blah blah blah is fast enough (unless you're in a large nested loop)." Well, here at
Atalasoft, we do image processing, and practically everything you do is in a nested loop over all of the pixels of the image, so all of the sudden, nothing is "fast enough". All of the sudden, you find yourself counting operations and conditionals, removing all floating-point, any division and as much of anything else that you can.
Here are some tips if you find yourself needing to optimize an inner loop. These tips are for C/C++, because if you are even thinking of doing this, you want to be in unmanaged land (one of the reasons why not pre-optimizing can get you into trouble -- choosing the right technology for when you need speed, is an important first step).
- Use integers. Modern instruction sets mean that naive compilation of C to machine code results in similar speed for ints and double (my quick tests show doubles as faster), but turn optimization on, and it's a different story. My double code improved slightly, but my int code was three times faster when optimized. A quick look at the assembly showed that the code for doubles was similar, but for ints, it was transformational. Biggest difference was in using shifts for multiply and divide by a constant (I did not use powers of 2, but it figured out tricks anyway). Optimizers are better at improving calculations with ints than with doubles.
- Don't divide. A corollary to using integers is that you can't divide without losing precision -- that's ok, it was probably slow anyway. A common reason to want to divide in image manipulation is for calculating ratios. In line drawing and scaling algorithms, watching a ratio is a way to know when the error term is large enough to move onto the next row or column. It often looks something like this:
if ((xSrc+1)/wSrc > xDst/wDst) ++xSrc;
You can't do this with ints, so cross multiply.
if ((xSrc+1)*wDst > xDst*wSrc) ++xSrc;
We aren't going to keep this either, but this step helps with the next one.
- Don't mulitply. Often, in a loop, we're multiplying by the index. This can be replaced by an accumulation. The code above gets turned into this:
xDst_Times_WSrc += wSrc;
if (xSrc_Times_wDst_Plus_wDst <= xDst_Times_WSrc) {
++xSrc;
xSrc_Times_wDst_Plus_wDst += wDst;
}
- Don't recalculate on each iteration. The optimization above also gets a big boost from the fact that xSrc doesn't change every iteration (and therefore, (xSrc+1)*wDst doesn't either).
- Use Table lookups to replace costly calculations. This is a common technique. Steve described it here.
- Replace array indexing with pointer incrementing. Table lookups are a multiplication and an addition, so it's not necessarily faster than a calculation at this level. But, often we are indexing by the inner loop index (because a calculation is the same for every row). In that case, increment a pointer into the table, rather than indexing.
- Look for repeating patterns. Often a calculation results in a repeating pattern as you loop. If so, you can precalculate the table of results (just the smaller repeating section) outside the loop and then use a table lookup in the inner loop. This will be slower if there is no pattern, because you have to detect the pattern with a conditional.
In my particular case, the biggest wins were with (4) and (7). The repeat size was often orders of magnitude smaller than the image width (and not a function of it), so the larger the image, the more the potential speedup, which meant my algorithm time grew much slower than the original. However, without rewriting the algorithm with (1), (2) and (3), it might not have been obvious how to apply them, and (7) without (6) would not have been as fast.