Round-off error


A roundoff error, also called rounding error, is the difference between the result produced by a given algorithm using exact arithmetic and the result produced by the same algorithm using finite-precision, rounded arithmetic. Rounding errors are due to inexactness in the representation of real numbers and the arithmetic operations done with them. This is a form of quantization error. When using approximation equations or algorithms, especially when using finitely many digits to represent real numbers, one of the goals of numerical analysis is to estimate computation errors. Computation errors, also called numerical errors, include both truncation errors and roundoff errors.
When a sequence of calculations is performed on an input involving roundoff error, errors may accumulate, sometimes dominating the calculation. In ill-conditioned problems, significant error may accumulate.
In short, there are two major facets of roundoff errors involved in numerical calculations:
  1. Digital computers have magnitude and precision limits on their ability to represent numbers.
  2. Certain numerical manipulations are highly sensitive to roundoff errors. This can result both from mathematical considerations and from the way in which computers perform arithmetic operations.

Representation error

The error introduced by attempting to represent a number using a finite string of digits is a form of roundoff error called representation error. Here are some examples of representation error in decimal representations:
Notation | Representation                | Approximation         | Error
1/7      | 0.142 857 142 857...          | 0.142 857             | 0.000 000 142 857...
ln 2     | 0.693 147 180 559 945 309 41... | 0.693 147           | 0.000 000 180 559 945 309 41...
log10 2  | 0.301 029 995 663 981 195 21... | 0.3010              | 0.000 029 995 663 981 195 21...
∛2       | 1.259 921 049 894 873 164 76... | 1.259 92            | 0.000 001 049 894 873 164 76...
√2       | 1.414 213 562 373 095 048 80... | 1.414 21            | 0.000 003 562 373 095 048 80...
e        | 2.718 281 828 459 045 235 36... | 2.718 281 828 459 045 | 0.000 000 000 000 000 235 36...
π        | 3.141 592 653 589 793 238 46... | 3.141 592 653 589 793 | 0.000 000 000 000 000 238 46...

Increasing the number of digits allowed in a representation reduces the magnitude of possible roundoff errors, but any representation limited to finitely many digits will still cause some degree of roundoff error for uncountably many real numbers. Additional digits used for intermediary steps of a calculation are known as guard digits.
Rounding multiple times can cause error to accumulate. For example, if 9.945309 is rounded to two decimal places (9.95) and the result is then rounded again to one decimal place (10.0), the total error is 0.054691. Rounding 9.945309 to one decimal place in a single step (9.9) introduces a smaller error of 0.045309. Such repeated rounding commonly occurs when performing sequences of arithmetic operations.
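This double-rounding effect can be reproduced with Python's decimal module, which performs true decimal rounding (half-up rounding is used here, which matches the arithmetic in the example):

```python
from decimal import Decimal, ROUND_HALF_UP

x = Decimal("9.945309")
# Round twice: first to two decimal places, then to one
twice = x.quantize(Decimal("0.01"), ROUND_HALF_UP).quantize(Decimal("0.1"), ROUND_HALF_UP)
# Round once, directly to one decimal place
once = x.quantize(Decimal("0.1"), ROUND_HALF_UP)
print(twice, abs(twice - x))   # 10.0  0.054691  (rounded twice)
print(once, abs(once - x))     # 9.9   0.045309  (rounded once)
```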

Floating-point number system

Compared with the fixed-point number system, the floating-point number system is more efficient in representing real numbers, so it is widely used in modern computers. While the real numbers are infinite and continuous, a floating-point number system is finite and discrete. Thus, representation error, which leads to roundoff error, occurs under the floating-point number system.
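For example, the decimal value 0.1 has no finite binary expansion, so the double-precision value actually stored is only an approximation. A quick way to see the representation error in Python:

```python
from decimal import Decimal
from fractions import Fraction

print(Decimal(0.1))                            # the exact value of the stored double for 0.1
print(float(Fraction(0.1) - Fraction(1, 10)))  # its representation error, about 5.6e-18
```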

Notation of floating-point number system

A floating-point number system is characterized by four integers: the base (or radix) β, the precision p (the number of digits in the mantissa), and the smallest and largest allowed exponents.
In the IEEE standard the base is binary, i.e. β = 2, and normalization is used. The IEEE standard stores the sign, exponent, and mantissa in separate fields of a floating-point word, each of which has a fixed width in bits. The two most commonly used levels of precision for floating-point numbers are single precision and double precision.
Precision | Sign (bits) | Exponent (bits) | Mantissa (bits)
Single    | 1           | 8               | 23
Double    | 1           | 11              | 52
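As a small sketch of how these fields sit inside a double-precision word, the Python snippet below uses the struct module to expose the 64 bits of a stored value and split them into the three fields:

```python
import struct

def double_fields(x):
    # Pack x as a big-endian IEEE double and split its 64 bits into the three fields
    bits = format(int.from_bytes(struct.pack(">d", x), "big"), "064b")
    return bits[0], bits[1:12], bits[12:]

sign, exponent, mantissa = double_fields(-9.4)
print(sign)      # 1 bit:   '1' because the value is negative
print(exponent)  # 11 bits: the biased exponent
print(mantissa)  # 52 bits: the stored mantissa (the leading 1 is implicit)
```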

Machine epsilon

Machine epsilon (ε_mach) can be used to measure the level of roundoff error in the floating-point number system. Two different definitions are common: by the first, ε_mach = β^(1−p), the distance from 1 to the next larger floating-point number; by the second, ε_mach is the maximum relative error incurred by rounding, which for round-to-nearest is half of that, β^(1−p)/2.
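A minimal sketch of the first definition in Python, assuming IEEE double precision (β = 2, p = 53):

```python
import sys

# Find the smallest power of two eps with 1.0 + eps > 1.0; this is the distance
# from 1 to the next larger double, i.e. beta**(1 - p) = 2**-52
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0
print(eps)                     # 2.220446049250313e-16
print(sys.float_info.epsilon)  # the same value, as reported by the runtime
```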
There are two common rounding rules: round-by-chop, which simply discards the extra digits, and round-to-nearest, which rounds to the closest representable value (with ties rounded to the value ending in an even digit). The IEEE standard uses round-to-nearest. The table below shows both rules applied to rounding x to one decimal place, together with the corresponding roundoff error x − fl(x).
x     | Round-by-chop | Roundoff error | Round-to-nearest | Roundoff error
1.649 | 1.6           | 0.049          | 1.6              | 0.049
1.650 | 1.6           | 0.050          | 1.6              | 0.050
1.651 | 1.6           | 0.051          | 1.7              | -0.049
1.699 | 1.6           | 0.099          | 1.7              | -0.001
1.749 | 1.7           | 0.049          | 1.7              | 0.049
1.750 | 1.7           | 0.050          | 1.8              | -0.050
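A rough imitation of the two rules in Python (note that the decimal inputs are themselves stored as nearby binary doubles, so behaviour at exact ties such as 1.650 depends on that stored value):

```python
import math

def round_by_chop(x, digits=1):
    # Truncate toward zero after the given number of decimal places
    factor = 10 ** digits
    return math.trunc(x * factor) / factor

def round_to_nearest(x, digits=1):
    # Python's built-in round() rounds to nearest, with ties going to an even last digit
    return round(x, digits)

for x in [1.649, 1.650, 1.651, 1.699, 1.749, 1.750]:
    print(x, round_by_chop(x), round_to_nearest(x))
```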

Calculating roundoff error in IEEE standard

Suppose round-to-nearest is used with IEEE double precision, and consider representing a number whose binary expansion does not terminate within the 52-bit mantissa.
Since the first bit to the right of the last retained mantissa bit is a 1 and is followed by other nonzero bits, the round-to-nearest rule requires rounding up, that is, adding 1 to the last retained bit. The normalized floating-point representation in the IEEE standard is therefore obtained by discarding the infinite tail of bits beyond the mantissa and then adding 1 to the last retained bit in the rounding step.
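As a concrete illustration, the exact roundoff error of such a representation can be computed with Python's fractions module; 9.4 is used here purely as an example value with a non-terminating binary expansion:

```python
from fractions import Fraction

x = Fraction(94, 10)        # the exact decimal value 9.4
fl_x = Fraction(9.4)        # the IEEE double actually stored for the literal 9.4
err = fl_x - x
print(float(err))           # ~3.55e-16: positive, so the stored value was rounded up
print(float(abs(err) / x))  # relative error ~3.8e-17, below eps/2 = 2**-53
```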

Measuring roundoff error by using machine epsilon

The machine epsilon can be used to measure the level of roundoff error when using the two rounding rules above. Below are the formulas and corresponding proof. The first definition of machine epsilon is used here.

Theorem

  1. Round-by-chop: |fl(x) − x| / |x| ≤ ε_mach
  2. Round-to-nearest: |fl(x) − x| / |x| ≤ ½ ε_mach

Proof

Let x = d_0.d_1 d_2 … d_{p−1} d_p d_{p+1} … × β^n, where d_0 ≠ 0, and let fl(x) be the floating-point representation of x.
Since round-by-chop is being used, the relative error is
  |fl(x) − x| / |x| = |d_p.d_{p+1} … × β^{n−p}| / |d_0.d_1 … × β^n| = (|d_p.d_{p+1} …| / |d_0.d_1 …|) × β^{−p} ≤ (β / 1) × β^{−p} = β^{1−p} = ε_mach.
The proof for round-to-nearest is similar.
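These bounds can be spot-checked numerically. The sketch below (assuming IEEE double precision, for which ε_mach = 2^−52) compares a few exact decimal values with their stored doubles:

```python
import sys
from fractions import Fraction

eps = sys.float_info.epsilon              # 2**-52, i.e. beta**(1 - p) for IEEE doubles

for s in ["0.1", "9.4", "2.718281828459045235", "123456.789123456789"]:
    x = Fraction(s)                       # the exact decimal value
    fl_x = Fraction(float(s))             # its round-to-nearest double representation
    rel_err = abs(fl_x - x) / x
    print(s, float(rel_err), rel_err <= Fraction(eps) / 2)   # bound from the theorem
```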
Even though some numbers can be represented exactly by floating-point numbers (such numbers are called machine numbers), performing floating-point arithmetic on them may still lead to roundoff error in the final result.

Addition

Machine addition consists of lining up the decimal points of the two numbers to be added, adding them, and then storing the result again as a floating-point number. The addition itself can be done in higher precision but the result must be rounded back to the specified precision, which may lead to roundoff error.
For example, consider adding a very small number to a much larger one in IEEE double precision. To make the exponents match, the smaller operand's mantissa must be shifted to the right, and the bits shifted past the available precision are lost when the result is rounded back to 53 significant bits.
From this example, it can be seen that roundoff error can be introduced when adding a large number and a small number, because the shifting of the mantissas needed to align the exponents may cause the loss of some digits.
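For instance, in IEEE double precision the mantissa holds 53 significant bits, so adding 2^−53 to 1 produces a result that rounds back to exactly 1:

```python
small = 2.0 ** -53                # half of machine epsilon for double precision
print(1.0 + small == 1.0)         # True: the small operand's bits are shifted out entirely
print(1.0 + 2.0 ** -52 == 1.0)    # False: this operand is just large enough to survive
```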

Multiplication

In general, the product of two p-digit mantissas contains up to 2p digits, so the result might not fit in the mantissa and must be rounded, which introduces roundoff error.

Division

Similarly, the quotient of two p-digit mantissas may contain more than p digits, so the result must be rounded and roundoff error will be involved.

Subtractive cancellation

The subtraction of two nearly equal numbers is called subtractive cancellation. Because their leading digits cancel, the result retains only a few significant digits, so any roundoff error already present in the operands is greatly magnified relative to the result.
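A small Python illustration of cancellation: for small x, the expression sqrt(x² + 1) − 1 subtracts two nearly equal numbers, while the algebraically equivalent form x² / (sqrt(x² + 1) + 1) avoids the subtraction:

```python
import math

x = 1e-8
naive = math.sqrt(x * x + 1.0) - 1.0                 # subtracts two nearly equal numbers
stable = (x * x) / (math.sqrt(x * x + 1.0) + 1.0)    # algebraically equal, no cancellation
print(naive)   # 0.0    -- every significant digit cancelled
print(stable)  # 5e-17  -- close to the true value, which is about x**2 / 2
```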
Accumulation of roundoff error

Errors can be magnified or accumulated when a sequence of calculations is applied to an initial input that carries roundoff error due to inexact representation.

Unstable algorithms

An algorithm or numerical process is called stable if small changes in the input produce only small changes in the output, and unstable if small changes in the input produce large changes in the output.
A sequence of calculations normally occurs when running an algorithm, and the amount of error in the result depends on the stability of the algorithm. Roundoff error will be magnified by unstable algorithms.
For example, consider a recurrence that computes each value from the previous one, starting from a given initial value. Suppose the initial value carries a small representation error, so the input actually supplied to the algorithm differs slightly from the intended one. If each step of the recurrence multiplies the inherited error by a factor of magnitude greater than one, the roundoff error is amplified in the succeeding calculations, and the algorithm is unstable.
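A classic textbook illustration of such amplification (not necessarily the recurrence the paragraph above had in mind) is I_n = 1 − n·I_{n−1} for the integrals I_n = ∫₀¹ xⁿ·e^(x−1) dx, whose true values stay between 0 and 1 but whose computed values are quickly destroyed because each step multiplies the inherited error by n:

```python
import math

# I_0 = 1 - 1/e; the true I_n decrease slowly toward zero, but the recurrence
# multiplies the tiny initial representation error by n!, so the computed values
# eventually bear no resemblance to the true ones.
I = 1.0 - 1.0 / math.e
for n in range(1, 21):
    I = 1.0 - n * I
    if n % 5 == 0:
        print(n, I)   # by n = 20 the computed value is far outside [0, 1] (true value ~0.05)
```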

Ill-conditioned problems

Even if a stable algorithm is used, the solution to a problem may still be inaccurate due to the accumulation of roundoff error when the problem itself is ill-conditioned.
The condition number of a problem is the ratio of the relative change in the solution to the relative change in the input. A problem is well-conditioned if small relative changes in the input result in small relative changes in the solution. Otherwise, the problem is called ill-conditioned. In other words, a problem is called ill-conditioned if its condition number is "much larger" than 1.
The condition number is introduced as a measure of the roundoff errors that can result when solving ill-conditioned problems.
For example, higher-order polynomials tend to be very ill-conditioned, that is, they tend to be highly sensitive to roundoff error.
In 1901, Carl Runge published a study on the dangers of higher-order polynomial interpolation. He looked at the following simple-looking function:

  f(x) = 1 / (1 + 25x²)

which is now called Runge's function. He took equidistantly spaced data points from this function over the interval [−1, 1]. He then used interpolating polynomials of increasing order and found that as he took more points, the polynomials and the original curve differed considerably. Further, the situation deteriorated greatly as the order was increased: the fit became even worse, particularly at the ends of the interval.

Real world example: Patriot missile failure due to magnification of roundoff error

On 25 February 1991, during the Gulf War, an American Patriot missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks and killed 28 soldiers. A report of the Government Accountability Office entitled "Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia" reported on the cause of the failure: an inaccurate calculation of the time since boot due to computer arithmetic errors. Specifically, the time in tenths of a second, as measured by the system's internal clock, was multiplied by 10 to produce the time in seconds. This calculation was performed using a 24-bit fixed-point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds. A Scud travels at about 1,676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.
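The arithmetic behind the 0.34-second figure can be sketched in Python. Assuming the register effectively retained 23 fraction bits of 1/10 (which reproduces the commonly cited chopping error of about 9.5 × 10⁻⁸):

```python
from fractions import Fraction

exact_tenth = Fraction(1, 10)
stored_tenth = Fraction(int(exact_tenth * 2**23), 2**23)   # keep 23 fraction bits (chop)
chop_error = float(exact_tenth - stored_tenth)
print(chop_error)        # ~9.5e-08 per stored tenth-of-a-second tick

ticks = 100 * 3600 * 10  # 0.1 s clock ticks in about 100 hours of uptime
drift = chop_error * ticks
print(drift)             # ~0.34 seconds of accumulated clock error
print(drift * 1676)      # ~570 m at roughly 1,676 m/s: more than half a kilometer
```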