## Never Forget - Floating Point Sucks

There are a number things in software engineering that look like great ideas and look like the obvious and perfect solution to a particular problem.  Ultimately, they fail in practice.

Floating point numbers are just such an example.  It looks like they are a great way to represent elements of the set of Real numbers, so much so that diligent engineers have built silicon devices to perform operations on them in ways that are efficient.

There is a problem, though.  Floating point numbers have holes in them.  Even though I can write out a constant in C or Java or C# as 1.371, there is no guarantee that the actual number represented in the final executable is actual 1.371.  I might get 1.37099999997 or 1.37100000000001 or something else that is very close to 1.371, but really isn't.

So let's say that you're weighting a vector like this:

`double final = cx * x + cy * y + cz * z;`

let's also say that x, y, and z have a known upper limit k, such that x + y + z <= 3x and each of the three coefficients is greater than or equal to 0 and less than or equal to 1 and that cx + cy + cz = 1

Under these conditions you should be able to say that final <= k

Great.  Now lets consider one of the standard formulae for calculating perceptual gray:

gray = 0.299 * r + 0.587 * g + 0.114 * b

the coefficients add up nicely to 1, so this looks like it is a good formula to put into code:

`static int Luminance(BYTE r, BYTE g, BYTE b){    return (int)(0.299 * r + 0.587 * g + 0.114 * b); }`

This works great - we're assuming 8 bits per component in the colors and Luminance gives me 0 for black and 255 for white.  The problem happens when you bump up to 16 bit per component color:

```static int Luminance(UNS16 r, UNS16 g, UNS16 b) {     return (int)(0.299 * r + 0.587 * g + 0.114 * b); }```

When you apply this to white, you get 65534 not 65535.  The problem is that the intermediate result is 65534.999999997 (or something like that).  The rules in C for casting to int call for truncation, so that fractional part goes away and you have the wrong answer.  So you say, "ah-ha! clearly you should be rounding!".  I say this is the wrong solution, primarily for performance reasons, but also because you only really want to round if you need to, but the issue is hardware/type related, so since this is a platform issue, the code should be set up to be conditionally compiled for the platform.

Fixed point, which offers the hope of greater performance, fails.  16 bits of fractional part, the maximum you can use for 16 bit per component color, isn't nearly enough, again due to holes.

I've found that by treating this like a first semester calculus problem, I can get to an more appropriate solution.  So I defined an epsilon that when put into the calculation creates a delta that will give us the effect that we want:

Not only that, I can use the compiler to tell me if I even need it:

`#define needRgbEpsilon  ((int)(rCoef * 65535 + gCoef * 65535 + bCoef * 65535) < 65535)`

Then I change my luminance code to:

`return (int)(rCoef * r + gCoef * g + bCoef *b + (needRgbEpsilon ? rgbEpsilon : 0));`

and I get the expected results.  Now, how about performance?  Well - that needRgbEpsilon #define is constant at compile time, therefore that conditional in the return folds to a constant 0 or rgbEpsilon.  If it's not needed, the addition doesn't happen.  If it is needed, I get the addition and pay the penalty of one additional floating point add - no more.  That will be far less expensive than calling round or ceil or your other favorite library routine.

What is my rgbEpsilon?  That would be telling.

Published Friday, January 25, 2008 10:28 AM by Steve Hawley