The joys of floating point variables

The other week I had to examine a rather simple data import program, written in VB6. The program reads some numbers from a CSV file, and creates a bunch of SQL INSERTs to transfer them to a database: it worked flawlessly, except for some values with .15 or .45 as the decimal part.

In the end it all boiled down to a code snippet like this (it's VB6, but should compile in VB.NET too, just create a form with a button and a click event in both cases)

Dim s As Double
Dim s3 As Double

s = 8.45 - 5.45
s3 = 3

MsgBox "8.45 - 5.45 = :" & s
MsgBox "8.45 - 5.45 = :" & Format(s, "0.000000000000000000000")
MsgBox "8.45 - 5.45 = 3 :" & (s = s3)
MsgBox "Int(8.45-5.45) :" & Int(s)
MsgBox "Fix(8.45-5.45) :" & Fix(s)

Now, the funny part is that if you print, even with as many decimal digits as you like, 8.45 - 5.45, you get, apparently, the number 3.000. Then, why s = 3 is equal to False ? Why Int(s) and Fix(s) = 2?

Never mind that using Double is not recommended in cases like this, in any way - one should use Currency : I wanted to know exactly why there is such an apparently disconcerting behaviour.

I tried to replicate the snippet in C# (using Math.Floor instead of Int()) : same result. After that I extended the snippet with a routine found searching stackoverflow.com (look in the comments for the URL) to decode the internal IEEE 754 format used to store doubles (it is the same for each and every language using double precision floating point numbers since the last iterations of microsoft compilers for DOS I believe. Here's the code:

using System;

using System.Text;

namespace TestDouble
{
    class Program
    {
        static void Main(string[] args)
        {
            double s, s3;

            s = 8.45 - 5.45;
            s3 = 3;

            Console.WriteLine("s = 8.45 - 5.45 :" + s.ToString());
            Console.WriteLine("s (formatted)   :" + String.Format("{0:0.00000000000000000000000000000000}", s));

            Console.WriteLine("s3 = 3          :" + s.ToString());
            Console.WriteLine("s == s3         :" + (s == s3).ToString());
            Console.WriteLine("Math.Floor(s)   :" + Math.Floor(s).ToString());
            Console.WriteLine("Math.Floor(s3) :" + Math.Floor(s3).ToString());

            Console.WriteLine("s = 8.45-5.45 IEEE 754 : " + d2s(s));
            Console.WriteLine("s3 = 3         IEEE 754 : " + d2s(s3));


            Console.WriteLine();
        }

        static string d2s(double d)
        {
            // yanked from: http://stackoverflow.com/questions/389993/extracting-mantissa-and-exponent-from-double-in-c-sharp
            // Translate the double into sign, exponent and mantissa.
            long bits = BitConverter.DoubleToInt64Bits(d);
            // Note that the shift is sign-extended, hence the test against -1 not 1
            bool negative = (bits < 0);
            int exponent = (int)((bits >> 52) & 0x7ffL);
            long mantissa = bits & 0xfffffffffffffL;

            // Subnormal numbers; exponent is effectively one higher,
            // but there's no extra normalisation bit in the mantissa
            if (exponent == 0)
            {
                exponent++;
            }
            // Normal numbers; leave exponent as it is but add extra
            // bit to the front of the mantissa
            else
            {
                mantissa = mantissa | (1L << 52);
            }

            // Bias the exponent. It's actually biased by 1023, but we're
            // treating the mantissa as m.0 rather than 0.m, so we need
            // to subtract another 52 from it.
            exponent -= 1075;

            if (mantissa == 0)
            {
                return " 0";
            }

            /* Normalize */
            while ((mantissa & 1) == 0)
            {    /* i.e., Mantissa is even */
                mantissa >>= 1;
                exponent++;
            }

            return (negative ? "-" : "+") + " " + mantissa.ToString() + " * 2 ^ " + exponent.ToString();
        }
    }
}

Now we're starting to shed some light: 8.45-5.45 and 3 are stored in a completely different way in IEEE 754.

8.45 - 5.45 is stored with a mantissa of 3377699720527871 and an exponent of -50.

I fired up Mathematica to check the numbers:

In[5]:= +3377699720527871 * 2^-50
Out[5]= 3377699720527871/1125899906842624

The fraction above is the exact value of 8.45-5.45, as stored in the 64 bit double variable.

With a 15 decimal digit precision this is..

In[21]:= N[3377699720527871/1125899906842624, 15]
Out[21]= 3.00000000000000

But with a precision of more than 15 digits..

In[22]:= N[3377699720527871/1125899906842624, 16]
Out[22]= 2.999999999999999

The problem with VB6/VB.NET and C# is that a double precision number is apparently converted to a string representing a decimal number with a maximum of 15 decimal digits. This is in compliance with the standard, as far as I've seen, but it's all that the language can churn out - if I specifiy more decimal digits in the formatting string I end up with the same precision anyway, not surprisingly of course. By the way, '3' is stored as what the IEEE 754 standard defines a 'subnormal' number, all in the mantissa with a zero exponent.

It seems that I stumbled on a really pathological case of double precision rounding. In the original program the numbers represented hours (the integer part) and minutes (the decimal part): it looked like the calculations were wrong only with quarter of an hour! I've event translated the C# program in Java. It seems that Java uses one decimal digit (but only one) more formatting double data - in my case it would have helped though...

public class TestDouble {

    public static void main(String[] args) {

        double s, s3;

        s = 8.45 - 5.45;
        s3 = 3;

        System.out.printf("s = 8.45 - 5.45 :%f\n",s);
        System.out.printf("s (formatted .14f)   :%.14f\n",s);
        System.out.printf("s (formatted .15f)   :%.15f\n",s);
        System.out.printf("s (formatted .30f)   :%.30f\n",s);
        System.out.printf("s3 = 3          :%f\n",s3);
        System.out.printf("s == s3         :%b\n",(s == s3));
        System.out.printf("Math.Floor(s)   :%f\n",Math.floor(s));
        System.out.printf("Math.Floor(s3)   :%f\n",Math.floor(s3));


        System.out.println("s = 8.45-5.45 IEEE 754 : " + d2s(s));
        System.out.println("s3 = 3         IEEE 754 : " + d2s(s3));


    }

    static String d2s(double d)
      {
        // adapted from http://stackoverflow.com/questions/389993/extracting-mantissa-and-exponent-from-double-in-c-sharp
          // Translate the double into sign, exponent and mantissa.
          long bits = Double.doubleToLongBits(d);
          // Note that the shift is sign-extended, hence the test against -1 not 1
          Boolean negative = (bits < 0);
          int exponent = (int)((bits >> 52) & 0x7ffL);
          long mantissa = bits & 0xfffffffffffffL;


          // Subnormal numbers; exponent is effectively one higher,
          // but there's no extra normalisation bit in the mantissa
          if (exponent == 0)
          {
              exponent++;
          }
          // Normal numbers; leave exponent as it is but add extra
          // bit to the front of the mantissa
          else
          {
              mantissa = mantissa | (1L << 52);
          }

          // Bias the exponent. It's actually biased by 1023, but we're
          // treating the mantissa as m.0 rather than 0.m, so we need
          // to subtract another 52 from it.
          exponent -= 1075;

          if (mantissa == 0)
          {
              return " 0";
          }

          /* Normalize */
          while ((mantissa & 1) == 0)
          {    /* i.e., Mantissa is even */
              mantissa >>= 1;
              exponent++;
          }

          return (negative ? "-" : "+") + " " + String.format("%4d * 2 ^ %4d", mantissa, exponent);

      }

}

delenda.net - ephemeris

Wednesday, 2 January 2013

The joys of floating point variables

No comments:

Post a Comment

Blog Archive