Floating point numbers

JavaScript does not have separate numeric data types like int or float, as languages such as Java and C++ do. Every numeric value is simply a number, stored as a 64-bit IEEE 754 double-precision float.

This can create issues when dealing with floating-point arithmetic, such as the classic example of asking JavaScript to calculate the value of 0.1 + 0.2.

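For example, in JavaScript:

```ts
// Binary doubles cannot represent 0.1 or 0.2 exactly, so their sum picks up
// a visible rounding error.
console.log(0.1 + 0.2);         // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3); // false
```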

This is not unique to JavaScript: languages like Python produce the same result, as does any system that implements the IEEE 754 standard for floating-point arithmetic.

Rounding errors

Computers can't store real numbers exactly, since memory space is inevitably finite. At some point, real numbers have to be approximated using a limited number of bits (0's and 1's). This means that most floating-point arithmetic results are inexact, and rounding becomes necessary. The techniques for rounding are a big part of what makes floating-point arithmetic unique.

One key idea employed in modern floating-point arithmetic is guard digits, which help reduce rounding errors - especially when subtracting numbers that are very close together. IBM thought guard digits were so important that in 1968 it added a guard digit to the double precision format of its entire System/360 architecture. It even went and upgraded machines already in the field!

The IEEE floating-point standard goes a step further. It defines specific algorithms for basic operations (addition, subtraction, multiplication, division, and square root), and requires that all implementations produce results identical to those algorithms. That means a program will give exactly the same results - bit for bit - on any system that follows the standard. This consistency makes software behavior predictable when moving between different machines.

Dissecting a number

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires $\beta = 2$, with a precision of $p = 24$ for single precision and $p = 53$ for double precision. It also specifies the precise layout of bits in single and double precision. IEEE 854 allows either $\beta = 2$ or $\beta = 10$ and, unlike 754, does not specify how floating-point numbers are encoded into bits. It does not require a particular value for $p$, but instead specifies constraints on the allowable values of $p$ for single and double precision. For a 64-bit double precision number, the layout is as follows (a code sketch for inspecting these fields comes after the list):

  • the first bit is the sign bit
  • the next 11 bits are the exponent
  • the remaining 52 bits are the significand (also called the mantissa)
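As an illustration, here is a minimal sketch that extracts these three fields from a JavaScript number using a DataView; the `dissect` helper name is just for this example:

```ts
// Extract the sign, exponent, and significand fields of a 64-bit double.
function dissect(x: number): { sign: number; exponent: number; significand: bigint } {
  const buf = new ArrayBuffer(8);
  const view = new DataView(buf);
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  return {
    sign: Number(bits >> 63n),                // 1 sign bit
    exponent: Number((bits >> 52n) & 0x7ffn), // 11 exponent bits (biased by 1023)
    significand: bits & 0xfffffffffffffn,     // 52 significand bits
  };
}

console.log(dissect(0.1));
// { sign: 0, exponent: 1019, significand: 2702159776422298n }
```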

When computers represent real numbers, the most common method is called floating-point representation. It works a bit like scientific notation and involves a base $\beta$ (usually 2 or 10) and a precision $p$ (how many digits to keep). For example, with base $\beta = 10$ and precision $p = 3$, the number $0.1$ would be written as $1.00 \times 10^{-1}$.
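JavaScript can print numbers in this decimal scientific notation directly; toExponential(2) keeps three significant digits, mirroring the $p = 3$ example:

```ts
// Three significant digits: one leading digit plus two fraction digits.
console.log((0.1).toExponential(2));     // "1.00e-1"
console.log((3.14159).toExponential(2)); // "3.14e+0"
```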

In a binary number system, certain decimal numbers like $0.1$ cannot be represented exactly in this form, since their binary representation requires an infinitely repeating pattern (similar to $1/3 = 0.333\ldots$ in decimal). Instead, floating-point formats store the closest possible approximation.
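You can see the stored approximation by asking for more decimal digits than default printing shows:

```ts
// The double closest to 0.1 is in fact slightly larger than 0.1.
console.log((0.1).toFixed(20)); // "0.10000000000000000555"
```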

In floating-point arithmetic and associated proofs, a floating-point number is typically written as $\pm d.dd\ldots d \times \beta^{e}$, where $d.dd\ldots d$ represents the $p$ digits of the significand, $\beta$ is the base, and $e$ is the exponent.

Reasons for precision loss

There are two main reasons for losing precision; both are shown in the sketch after this list.

  • Rounding error - numbers like 0.1 can't be expressed precisely in binary, even though they're trivial to express precisely in decimal.
  • Out of range - the number is too large or too small to fit in the available exponent range.
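A minimal sketch of both failure modes in JavaScript:

```ts
// Rounding error: 0.1 and 0.2 are stored as approximations.
console.log(0.1 + 0.2 === 0.3);    // false
// Out of range: the exponent field cannot stretch any further.
console.log(Number.MAX_VALUE * 2); // Infinity (overflow)
console.log(Number.MIN_VALUE / 2); // 0 (underflow to zero)
```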

Normalization and Uniqueness

Floating-point numbers can sometimes have more than one representation - for example, $0.01 \times 10^{1}$ and $1.00 \times 10^{-1}$ both equal $0.1$. To avoid this ambiguity, formats typically use normalized numbers, where the first digit is non-zero. This makes the representation unique, but introduces a new problem: it can't represent zero! Special handling is required for that case.

Note that when you see a floating-point number like $1.00 \times 10^{-1}$, the $\times 10^{-1}$ is just part of the notation, not an actual multiplication operation.
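In IEEE 754 the special handling is a reserved encoding (all-zero exponent and significand), and since the sign bit is still present, there are actually two zeros. This is observable in JavaScript:

```ts
// +0 and -0 are distinct encodings, though === treats them as equal.
console.log(Object.is(0, -0)); // false
console.log(1 / -0);           // -Infinity
console.log(0 === -0);         // true
```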

Relative error and Ulps

Rounding error is measured by units in the last place (ulps).

Suppose you are working in base 10 with precision $p = 3$, and the real value is $0.0314159$. If the closest floating-point representation is $3.14 \times 10^{-2}$, the error is $0.159$ units in the last place (ulps), since one ulp of $3.14 \times 10^{-2}$ is $0.01 \times 10^{-2}$. In general, if a floating-point number is used to represent a real number, the error is the difference between the real number and the floating-point approximation. Even if the floating-point number is the closest possible one, the error can still be up to 0.5 ulp.
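The same calculation as a quick sketch (the variable names are just for this example):

```ts
// Error in ulps for the base-10, p = 3 example above.
const real = 0.0314159;
const approx = 3.14e-2;
const oneUlp = 0.01e-2; // weight of the last significand digit of 3.14 x 10^-2
console.log(Math.abs(approx - real) / oneUlp); // ~0.159
```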

Another way to measure error is relative error, which is the difference between the floating-point number and the real number, divided by the real number. For example, if you approximate $3.14159$ by $3.14 \times 10^{0}$, the relative error is $0.00159 / 3.14159 \approx 0.0005$, or about 0.05% of the true value.
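The same arithmetic in code:

```ts
// Relative error when approximating 3.14159 by 3.14.
const x = 3.14159;
const approximation = 3.14;
console.log(Math.abs(approximation - x) / x); // ~0.000506 (about 0.05%)
```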

Ulps vs Relative Error

When rounding to the closest floating-point number, the error is always at most 0.5 ulp. However, the relative error that corresponds to 0.5 ulp is not fixed: it can vary by a factor of $\beta$, depending on where the number falls relative to a power of the base.

This factor is called the wobble: it measures how much the relative error corresponding to a fixed error in ulps can vary, purely because of how floating-point numbers are spaced. In other words, relative error is influenced by the base and precision of the representation.
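A small sketch of the wobble in JavaScript's binary doubles; the `ulpAt` helper is an approximation for illustration only (it ignores subnormals):

```ts
// One ulp has the same absolute size across a binade [2^k, 2^(k+1)), so its
// relative size shrinks by almost a factor of beta = 2 across that range.
const ulpAt = (x: number) => 2 ** (Math.floor(Math.log2(x)) - 52);
console.log(ulpAt(1.0) / 1.0);     // 2.22e-16 (Number.EPSILON)
console.log(ulpAt(1.999) / 1.999); // ~1.11e-16, about half as large
```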

When you care about the exact error introduced by a single rounding, ulps are the more natural measure. But when you're analyzing longer formulas, relative error gives a clearer picture of how rounding propagates through to the final result.

If you're only interested in the general size of rounding errors, ulps and relative error are often interchangeable, though ulps are usually easier to work with for smaller errors.

Guard Digits

When subtracting two floating-point numbers—especially numbers that are very close to each other—rounding errors can get much worse than usual. One way to reduce this kind of error is to use guard digits.

Suppose we're using a base-10 floating-point format that keeps only 3 digits (p = 3). If we compute something like $2.15 \times 10^{12} - 1.25 \times 10^{-5}$ (a subtraction between a huge number and a tiny number) with full precision and then round the result to 3 digits, we get the right answer: $2.15 \times 10^{12}$.

However, hardware is limited to a fixed number of digits, so when the smaller number is shifted to line up with the big one, its digits are truncated. In this case, $1.25 \times 10^{-5}$ becomes $0.00 \times 10^{12}$, and the computed difference is still $2.15 \times 10^{12}$ - no harm done.
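The same alignment effect is easy to trigger with JavaScript's binary doubles: when a small operand is shifted to match a large one, its bits fall off the end of the 53-bit significand:

```ts
// 1 is lost when aligned against 1e16 (one ulp there is 2), but 2 survives.
console.log(1e16 + 1 - 1e16); // 0
console.log(1e16 + 2 - 1e16); // 2
```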

However, this fails when the numbers are close together. For example, $10.1 - 9.93$ in this floating-point format is $1.01 \times 10^{1} - 0.99 \times 10^{1}$ (the second operand loses its last digit when shifted), which gives a result of $0.02 \times 10^{1}$, or $0.2$. However, the correct answer is $0.17$ - a 30 ulp error!

Guard digits provide extra digits for intermediate calculations. Even if the final result is rounded to 3 digits, those extra digits in the middle help catch and reduce big subtraction errors like the one above. In 1968, IBM added a guard digit to all of their floating-point units, and even retrofitted all of their older machines - it was that important.

Floating point arithmetic

What is NaN

In IEEE 754, NaN (Not a Number) values are represented as floating-point numbers with the exponent $e_{max} + 1$ and nonzero significands. They are produced by invalid operations such as $0/0$ or $\sqrt{-1}$.
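In JavaScript this looks like the following; note that NaN is the only value that compares unequal to itself:

```ts
console.log(0 / 0);               // NaN
console.log(Math.sqrt(-1));       // NaN
console.log(NaN === NaN);         // false
console.log(Number.isNaN(0 / 0)); // true
```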

Limits of precision

The global Number object provides constants that describe these limits.

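A minimal sketch of the upper-bound constants:

```ts
console.log(Number.MAX_SAFE_INTEGER); // 9007199254740991, i.e. 2^53 - 1
console.log(Number.MAX_VALUE);        // 1.7976931348623157e+308
// Above MAX_SAFE_INTEGER, adjacent integers can no longer be distinguished.
console.log(Number.MAX_SAFE_INTEGER + 1 === Number.MAX_SAFE_INTEGER + 2); // true
```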

Corresponding bounds exist at the lower end as well.

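A minimal sketch of the lower-end constants:

```ts
console.log(Number.MIN_SAFE_INTEGER); // -9007199254740991, i.e. -(2^53 - 1)
console.log(Number.MIN_VALUE);        // 5e-324, the smallest positive number
console.log(Number.EPSILON);          // 2.220446049250313e-16, the gap between 1 and the next double
```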
