My Photo
Location: Downham Market, Norfolk, United Kingdom

Saturday, December 17, 2016

On the numeric precision provided by programming languages

Probably the most common floating-point format supported by modern programming languages is what C calls a double, that is a 64-bit data type consisting of a 53-bit mantissa, an 11-bit exponent and a sign bit (yes, I know that's 65 bits in all but as is common with floating-point formats the most-significant bit of the mantissa is assumed to be a 1 and isn't stored). More formally this format is standardised in IEEE 754 as binary64.

Some languages provide this as one of a number of different options, for example in C both float (32-bits) and double (64-bits) floating-point formats are commonly supported. In others double is the only floating-point format available; Liberty BASIC is an example of this. The precision provided by this format is sufficient for a wide range of real-world applications; to put it in perspective it corresponds to an uncertainty of about one millimetre in the distance between the Earth and Neptune!

However there is of course a big difference between precision and accuracy. One particularly important issue is that a typical calculation consists of many cascaded steps, and (depending on the kinds of operations performed) at each of these steps there is a possibility that an error of half-an-LSB or more is introduced. After a large number of steps the potential accumulated error may be very much greater than the precision of a double might lead you to expect. For this reason it is common for the 'floating point unit' (FPU) of a modern CPU to work internally to an even higher precision, for example a format with a 64-bit mantissa is often used. So long as intermediate values are held to this higher precision there is a greater likelihood of the final result, represented as a double, having an accuracy commensurate with its precision.

When the highest precision floating-point data type provided by the programming language itself is a double, this desirable outcome can be achieved only when the programmer specifies the calculation in his source code in such a way that it can take place entirely within the FPU, and when the language interpreter/compiler is designed to achieve that. That may be the case in a language like C, when the computation is described in the form of a single expression and the compiler generates code that keeps intermediate values on the FPU stack, but that will not always be the case.

If the programmer's source code splits the calculation into a number of discrete steps, the compiler may not be able to recognise that it could combine them into one computation (and anyway the programmer may want to see the intermediate values to help him debug the code). Even more significantly, an interpreted programming language may simply not be capable of keeping intermediate values in the FPU and may always transfer them temporarily to whatever representation it uses internally for floating-point values. If this is a double a truncation may occur after each step of the calculation.

Of course this may not be important. Many calculations won't need anything like the precision of a 53-bit mantissa and even the loss of several bits of accuracy as a result of multiple truncations may not matter. But sometimes it may, and indeed the programmer may not even know what accuracy is required (if you're not a mathematician - and perhaps even if you are! - it's only too easy to perform a computation in such a way that it is unduly sensitive to the accuracy of an intermediate term, without even realising it).

So ideally there needs to be a way of overcoming the limitation that intermediate values need to be held in the FPU, so that the full accuracy of which the CPU is capable can be taken advantage of even when those values are transferred elsewhere (e.g. to a 'variable'). Some C compilers (notably GCC, but not Microsoft's compilers) support the 80-bit 'temporary float' format as a native data type, when the underlying CPU provides access to it, which solves the problem. By declaring a variable as a long double it can hold an intermediate value with the same precision as that of the FPU's internal registers.

Because the problem is potentially even more acute with an interpreted language, I was very keen that version 6 of BBC BASIC for Windows should have that capability too. Indeed, the default numeric data type is a variant which can contain either an 80-bit floating-point value or a 64-bit integer (it is noteworthy that an 80-bit float can represent a 64-bit integer value precisely). So you can perform cascaded floating-point calculations in BB4W safe in the knowledge that the final result is as accurate as it could be given the capabilities of the CPU/FPU. I don't think BBC BASIC is unique in this respect, but it is certainly very unusual for an interpreted BASIC to provide native support for 80-bit floats and it is something that I like to point to as a valuable feature.

Given what I have said above, it is irritating that Microsoft decided not to support 80-bit extended-precision floats in their compiler family. You can declare a variable to be of long double type, but in fact that is treated by the compiler as a synonym of double and the precision is no different. That can make it difficult or impossible to produce results which are as accurate as they theoretically could be. Even more surprisingly the ARM CPU family provides no access at all to 80-bit float values, even if they are used temporarily by the FPU (and I'm not sure whether that is the case or not). Unfortunately that means even GCC treats long double as a synonym of double when targeting ARM.


Post a Comment

<< Home