Floating point numbers

JavaScript does not have separate numeric data types like the int and float found in languages such as Java and C++. Every numeric value is simply a number.

JavaScript represents all numeric values with floating point numbers. The precision limitations of floating point numbers can lead to surprising issues, such as the canonical example of asking JavaScript to calculate the value of 0.1 + 0.2.

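In a JavaScript console, the computation looks like this:

```javascript
// The canonical floating point surprise: 0.1 + 0.2 is not exactly 0.3.
const sum = 0.1 + 0.2;

console.log(sum);         // 0.30000000000000004
console.log(sum === 0.3); // false
```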

This is a common result that you will also get from other languages like Python. Generally, any system that implements the IEEE 754 standard for floating point arithmetic will produce this result.

Python 3.13.2 (...) on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False

While most computer engineers are familiar with this behavior, very few understand why this happens. To quote the opening sentence of David Goldberg's comprehensive guide:

Floating-point arithmetic is considered an esoteric subject by many people.

Goldberg's guide to floating point arithmetic is written for an engineering audience. This guide is designed to be comprehensible by anyone - young people, the mathematically uninitiated, and the intellectually curious.

Binary numbers

Computers represent all numbers with binary digits - 0s and 1s - where each digit is called a bit. You can experiment with the 8-bit binary number below by clicking on the bits to change them from 0 to 1.

[interactive 8-bit binary number] = 0

Each digit represents a power of 2, just like our human numbering system uses each digit to represent a power of 10. A group of 8 bits is called a byte, and a single byte can represent 2^8 (256) different values.
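Reading a byte's bits as powers of 2 can be checked directly in JavaScript:

```javascript
// 10110101 = 128 + 32 + 16 + 4 + 1 = 181
console.log(parseInt("10110101", 2)); // 181

// 8 bits can represent 2^8 = 256 distinct values (0 through 255)
console.log(2 ** 8); // 256
```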

In the example above, we are using 8 bits to represent an unsigned integer. If we want to store negative numbers, we can dedicate 1 bit as a sign bit to store 0 for positive numbers and 1 for negative numbers.

[interactive: sign bit + 7-bit value] = 0

Try clicking the sign bit to see how the resulting value changes. If we allowed the sign bit to simply control whether there is a negative sign or not, we would have two binary numbers that both represent zero - 10000000 (negative zero) and 00000000 (positive zero). To avoid ambiguity, along with a variety of other reasons, computers represent negative numbers in two's complement: if the sign bit is 1, invert all bits (each 1 becomes a 0, and each 0 becomes a 1) and add 1 to the result to get the magnitude of the negative number.
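The two's complement rule can be sketched in a few lines (the helper name toSigned8 is just for illustration):

```javascript
// Interpret an 8-bit pattern as a two's complement signed integer:
// if the sign bit is set, the value is the unsigned reading minus 2^8.
function toSigned8(bits) {
  const unsigned = parseInt(bits, 2);
  return unsigned >= 128 ? unsigned - 256 : unsigned;
}

console.log(toSigned8("00000000")); // 0 (a single representation of zero)
console.log(toSigned8("11111111")); // -1
console.log(toSigned8("10000000")); // -128
```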

Floating point numbers

Floating point numbers represent numeric values in a binary form of scientific notation (± significand × 2^exponent). Each number consists of three parts:

  • The sign - a single bit which indicates whether the number is positive or negative, where 0 is positive and 1 is negative.
  • The exponent (written as e) - determines the scale of the number (how big or small it is).
  • The significand (written as d.dd…d) - determines the precision of the number (the actual digits), and is often referred to as the mantissa.

JavaScript uses 64-bit IEEE 754 double precision format for all numbers. This means that each 64 bit number uses 1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand.

[ sign (1 bit) | exponent (11 bits) | significand (52 bits) ]

The equation for computing a decimal value from the binary representation above is:

value = (-1)^sign × 1.significand × 2^(exponent - 1023)

where 1023 is the exponent bias used by double precision.
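The 64-bit layout can be inspected from JavaScript itself. The sketch below (the decompose helper is an illustrative name) extracts the three fields using a DataView:

```javascript
// Split a JavaScript number into its IEEE 754 double precision fields.
function decompose(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  return {
    sign: Number(bits >> 63n),                // 1 bit
    exponent: Number((bits >> 52n) & 0x7ffn), // 11 bits, biased by 1023
    significand: bits & 0xfffffffffffffn,     // 52 fraction bits
  };
}

// 1.5 = (-1)^0 × 1.1₂ × 2^(1023 - 1023)
console.log(decompose(1.5)); // { sign: 0, exponent: 1023, significand: 2251799813685248n }
```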

Floating point intuition

Let's first try to understand this intuitively. The exponent really just selects an interval between two successive powers of 2, like [1, 2), [2, 4), [4, 8), and so on. The significand is simply the offset within this interval. To see what this looks like, try entering values below to see their floating point decomposition, or click along the number line to see the sign, exponent, and significand change.


[interactive: floating point decomposition and number line]


As a floating point number gets larger, it "floats" to the next interval, and as it gets smaller, it "floats" to the previous interval. Intervals closer to zero are "more dense", in the sense that the significand provides a more precise number along that interval.

The significand is being displayed as a number, but it actually represents the sequence of binary digits of a normalized value, which takes the form 1.dd…d × 2^e. Since we know the first digit is always a 1, it does not need to be stored, so we get an extra bit of precision for free - this is called the hidden bit trick.

You will often see floating point numbers written in scientific notation (d.dd…d × β^e), but as you can see, the × here is part of the notation and does not indicate a standard multiplication operation.

Floating point proofs

In floating point arithmetic and associated proofs, a floating point number is typically written as ± d.dd…d × β^e, where d.dd…d represents the p digits of the significand, β is the base, and e is the exponent.

Rounding errors

Just as we cannot precisely represent 1/3 in our decimal (base 10) number system, a computer cannot precisely represent a number like 0.1 in binary (base 2) representation. Instead, floating-point formats store the closest possible approximation. Consequently, most floating-point arithmetic results are not exact, so rounding must be carefully executed in a way that minimizes precision loss.
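You can see the stored approximation by asking JavaScript for more decimal digits than it normally prints:

```javascript
// The double closest to 0.1 is slightly larger than 0.1.
console.log((0.1).toFixed(20)); // 0.10000000000000000555

// Powers of two are exact: 0.5 = 2^-1 needs no approximation.
console.log((0.5).toFixed(20)); // 0.50000000000000000000
```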

One key idea employed in modern floating point arithmetic is guard digits, which help reduce rounding errors - especially when subtracting numbers that are very close together. IBM thought guard digits were so important that in 1968 it updated the entire System/360 architecture to add a guard digit to its double precision format, and even upgraded machines already in the field.

IEEE standards

The IEEE floating-point standards define specific algorithms for basic operations like addition, subtraction, multiplication, division, and square root. All implementations must produce results identical to those algorithms, especially with respect to rounding. This is to safeguard the assumption that a program will give exactly the same results - bit for bit - on any system that follows the standard. The IEEE standard has been widely adopted by hardware manufacturers for implementing floating point arithmetic.

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, with p = 24 for single precision and p = 53 for double precision. It also specifies the precise layout of bits in single and double precision. IEEE 854 allows either β = 2 or β = 10 and, unlike 754, does not specify how floating-point numbers are encoded into bits. It does not require a particular value for p, and instead specifies constraints on the allowable values of p for single and double precision.

Reasons for precision loss

There are two main reasons for losing precision.

  • Rounding error - numbers like 0.1 can't be precisely expressed in binary, even though they're simple to precisely express in decimal.
  • Out of range - the number is too large or too small to fit in the available exponent range.

Normalization and Uniqueness

Floating-point numbers can sometimes have more than one representation - for example, 0.01 × 10^1 and 1.00 × 10^-1 both equal 0.1. To avoid this ambiguity, formats typically use normalized numbers where the first digit is non-zero. This makes the representation unique, but introduces a new problem: it can't represent zero! Special handling is required for that case.

Note that when you see a floating-point number like 1.00 × 10^-1, the × is just part of the notation, not an actual multiplication operation.

Relative error and Ulps

Rounding error is measured by units in the last place (ulps).

Suppose you are working with a decimal format with p = 3, and the real value is 0.0314159. If the closest floating-point representation is 3.14 × 10^-2, the error is 0.159 units in the last place (ulps). In general, if a floating-point number is used to represent a real number, the error will be the difference between the real number and the floating-point approximation. Even if the floating-point number is the closest possible one, the error can still be up to 0.5 ulp.

Another way to measure error is relative error, which compares the difference between the floating-point number and the real number, relative to the real number. For example, if you approximate 3.14159 by 3.14 × 10^0, the relative error is 0.00159/3.14159 ≈ 0.0005, or about 0.05% of the true value.
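For instance, the relative error of the 0.1 + 0.2 computation from the introduction can be measured directly:

```javascript
// Relative error = |computed - true| / |true|
const computed = 0.1 + 0.2; // 0.30000000000000004
const relativeError = Math.abs(computed - 0.3) / 0.3;

console.log(relativeError); // ≈ 1.85e-16, a tiny fraction of the true value
```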

Ulps vs Relative Error

When rounding to the closest floating-point number, the error in ulps is always less than or equal to 0.5 ulp. However, when you express that same error as a relative error, its size can vary noticeably depending on where the number falls relative to a power of the base.

Wobble is the factor that measures how much the relative error can vary due to the way floating-point numbers are stored. Essentially, relative error can be influenced by the base and precision of the representation.

When you care about the exact error introduced by rounding, ulps is a more natural measure. But when you're analyzing more complex formulas, relative error gives you a clearer picture of how rounding affects the final result, especially in terms of overall computation.

If you're only interested in the general size of rounding errors, ulps and relative error are often interchangeable, though ulps are usually easier to work with for smaller errors.

Guard Digits

When subtracting two floating-point numbers—especially numbers that are very close to each other—rounding errors can get much worse than usual. One way to reduce this kind of error is to use guard digits.

Suppose we're using a decimal floating-point format that keeps only 3 significant digits (β = 10, p = 3). If we computed something like 2.15 × 10^12 − 1.25 × 10^-5 (a subtraction between a huge number and a tiny number) with full precision and then rounded the result, we would get 2.15 × 10^12.

However, hardware is limited to a fixed number of digits, so when the smaller number is shifted to line up with the big one, its digits are truncated. In this case it becomes 0.00 × 10^12, and the computed result, 2.15 × 10^12, is still essentially correct.

This breaks down when the numbers are close together. For example, 10.1 − 9.93 in this format becomes 1.01 × 10^1 − 0.99 × 10^1 (9.93 loses a digit when shifted), which gives a result of 0.02 × 10^1, or 0.2. However, the correct answer is 0.17 - a 30 ulp error!
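This truncation behavior can be simulated in JavaScript. The sketch below models a hypothetical 3-significant-digit decimal unit with no guard digit (round3 and subtractNoGuard are illustrative names, not real APIs):

```javascript
// Round a value to 3 significant decimal digits (p = 3).
function round3(x) { return Number(x.toPrecision(3)); }

// Subtract in a 3-digit format with no guard digit: the smaller operand
// is truncated to the 3 digit positions of the larger one before subtracting.
function subtractNoGuard(x, y) {
  const exponent = Math.floor(Math.log10(Math.abs(x)));
  const lastDigit = 10 ** (exponent - 2); // value of x's 3rd digit
  const yTruncated = Math.trunc(y / lastDigit) * lastDigit;
  return round3(x - yTruncated);
}

console.log(subtractNoGuard(10.1, 9.93)); // 0.2  (no guard digit)
console.log(round3(10.1 - 9.93));         // 0.17 (full precision, then rounded)
```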

Guard digits provide extra digits for intermediate calculations. Even if the final result is rounded to 3 digits, those extra digits in the middle help catch and reduce big subtraction errors like the one above. In 1968, IBM added a guard digit to all of their floating-point units, and even retrofitted all of their older machines - it was that important.

What is NaN

In IEEE 754, NaNs are often represented as floating-point numbers with the exponent emax + 1 and nonzero significands.
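In JavaScript, NaN results from invalid operations, and it is the only value that is not equal to itself:

```javascript
console.log(0 / 0);               // NaN
console.log(Math.sqrt(-1));       // NaN
console.log(NaN === NaN);         // false - NaN never equals anything, even itself
console.log(Number.isNaN(0 / 0)); // true - the reliable way to test for NaN
```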

Limits of precision

The built-in Number object provides constants that describe these limits.

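The original interactive snippet is not shown here; a minimal sketch, assuming the text refers to the upper bound on exact integers:

```javascript
// Integers are exactly representable only up to 2^53 - 1.
console.log(Number.MAX_SAFE_INTEGER); // 9007199254740991
console.log(2 ** 53 === 2 ** 53 + 1); // true - past the limit, integers collide
```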

Corresponding bounds exist at the lower end as well.

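A sketch of the lower-bound constants, assuming the text refers to Number.MIN_SAFE_INTEGER, Number.EPSILON, and Number.MIN_VALUE:

```javascript
console.log(Number.MIN_SAFE_INTEGER); // -9007199254740991
console.log(Number.EPSILON);          // 2.220446049250313e-16, the gap between 1 and the next double
console.log(Number.MIN_VALUE);        // 5e-324, the smallest positive representable value
```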
