Chapter 1. Floating-Point Arithmetic and Error Analysis

Web Laboratories and Student Tools

This web version adds computational laboratories. The student can enter data, compute errors, compare stable and unstable formulas, and test understanding with randomized quizzes.

Error labAbsolute and relative error Floating-point labMachine epsilon and unit roundoff Roundoff-growth labThe $\gamma_n$ bound Cancellation labAmplification factor Stability labStable quadratic roots Summation labRecursive, pairwise, and Kahan sums

Game Cards: Fast Memory Checks

Click a card to reveal the answer. These cards are deliberately short; they are meant to prevent the reader from forgetting the meaning of the main symbols while reading the chapter.

Randomized Quiz Arena

The questions are selected randomly from a large Chapter 1 bank. They include true/false questions, single-answer multiple choice, multiple-answer questions, and numerical questions.

Number of questions

Learning goals

After completing this chapter, the reader should be able to:

distinguish exact real arithmetic from floating-point arithmetic;
derive the standard model $\fl(x)=x(1+\delta)$, $|\delta|\le u$;
use the $\gamma_n$-notation to control accumulated rounding errors;
analyze cancellation and loss of significant digits;
distinguish conditioning from stability;
interpret residuals as backward-error objects;
recognize stable and unstable algebraic reformulations;
solve basic and advanced error-analysis exercises.

The Numerical Representation Problem

Numerical analysis begins with a structural obstruction: the field of real numbers $\mathbb R$ is infinite and uncountable, whereas every digital computer manipulates only finitely many representable numbers. Thus a real number $x\in\mathbb R$ is usually not stored as $x$ itself, but as a nearby machine number

\[ \fl(x). \]

The discrepancy between $x$ and $\fl(x)$ is not a programming accident. It is an unavoidable mathematical consequence of finite storage.

Central idea

A numerical result is meaningful only when we understand both the mathematical sensitivity of the problem and the stability of the algorithm used to solve it.

\[ \text{forward error} \approx \text{conditioning} \times \text{backward error}. \]

Definition 1.1: Absolute and relative error

Let $x$ be an exact quantity and let $\widetilde{x}$ be an approximation. The absolute error is

\[ E_{\mathrm{abs}}(x,\widetilde{x}) = |x-\widetilde{x}|. \]

If $x\ne0$, the relative error is

\[ E_{\mathrm{rel}}(x,\widetilde{x}) = \frac{|x-\widetilde{x}|}{|x|}. \]

Absolute error measures distance in the original physical scale. Relative error measures error compared with the magnitude of the exact quantity. The second notion is usually more informative when quantities vary over several orders of magnitude.

Example 1.2: Scale matters

Let $x_1=10^6$, $\widetilde{x}_1=10^6+1$, and let $x_2=10^{-6}$, $\widetilde{x}_2=2\cdot10^{-6}$. Then

\[ \frac{|x_1-\widetilde{x}_1|}{|x_1|}=10^{-6}, \qquad \frac{|x_2-\widetilde{x}_2|}{|x_2|}=1. \]

The first approximation is relatively accurate; the second has $100\%$ relative error.

Floating-Point Systems

Definition 1.3: Normalized floating-point system

Let $\beta\ge2$ be an integer base, let $t\ge1$ be the number of significant digits, and let \(L

\[ \mathcal F(\beta,t,L,U) = \left\{ \pm \left( d_0+d_1\beta^{-1}+\cdots+d_{t-1}\beta^{-(t-1)} \right)\beta^e \right\}, \]

where

\[ d_0\in\{1,\ldots,\beta-1\}, \qquad d_j\in\{0,\ldots,\beta-1\}, \qquad L\le e\le U. \]

The number

\[ m=d_0+d_1\beta^{-1}+\cdots+d_{t-1}\beta^{-(t-1)} \]

is the mantissa, and $e$ is the exponent.

A normalized floating-point number therefore has the form

\[ x=\pm m\beta^e, \qquad 1\le m<\beta. \]

The exponent controls the scale, whereas the mantissa controls the significant digits.

Definition 1.4: Machine epsilon and unit roundoff

In a normalized base-$\beta$ system with $t$ significant digits, the spacing between $1$ and the next larger normalized floating-point number is

\[ \varepsilon_{\mathrm{mach}} = \beta^{1-t}. \]

For rounding to nearest, the unit roundoff is

\[ u=\frac12\beta^{1-t}. \]

For chopping, a standard relative-error bound is

\[ u_{\mathrm{chop}}=\beta^{1-t}. \]

Figure 1.1. A local view of a normalized floating-point grid. Rounding to nearest maps a real number to the closest machine number.

Rounding as a Projection onto a Finite Set

Let $\mathcal F$ be a floating-point system. The operation

\[ \fl:\mathbb R\to\mathcal F \]

maps a real number to a representable number. For rounding to nearest, $\fl(x)$ is a nearest floating-point number to $x$. Ties are resolved by a prescribed rule, for example round-to-even.

Theorem 1.5: Local spacing bound

Let $x$ be a real number lying between two consecutive normalized floating-point numbers \(a

\[ |x-\fl(x)|\le \frac{b-a}{2}. \]

Proof.

Since $a$ and $b$ are the two nearest representable endpoints surrounding $x$, rounding to nearest chooses either $a$ or $b$. The largest possible distance to the nearest endpoint occurs at the midpoint $(a+b)/2$, and that distance is $(b-a)/2$. Hence

\[ |x-\fl(x)|\le \frac{b-a}{2}. \]

Theorem 1.6: Relative representation error

Let $x$ lie in the normalized range of a base-$\beta$, $t$-digit floating-point system. If $\fl(x)$ is obtained by rounding to nearest and no overflow or underflow occurs, then there exists $\delta\in\mathbb R$ such that

\[ \fl(x)=x(1+\delta), \qquad |\delta|\le u, \]

where

\[ u=\frac12\beta^{1-t}. \]

Proof.

It is enough to consider $x>0$, since the system is symmetric with respect to sign. Suppose

\[ \beta^e\le x<\beta^{e+1}. \]

In this binade, adjacent normalized floating-point numbers have spacing

\[ \Delta=\beta^{e-(t-1)}. \]

By the local spacing bound,

\[ |x-\fl(x)| \le \frac{\Delta}{2} = \frac12\beta^{e-(t-1)}. \]

Since $x\ge \beta^e$, we obtain

\[ \frac{|x-\fl(x)|}{|x|} \le \frac{\frac12\beta^{e-(t-1)}}{\beta^e} = \frac12\beta^{1-t} = u. \]

Define

\[ \delta=\frac{\fl(x)-x}{x}. \]

Then $\fl(x)=x(1+\delta)$ and $|\delta|\le u$.

Important limitation

The statement excludes overflow and underflow. Near underflow, the normalized-spacing model changes. Near overflow, the rounded value may not be representable at all.

The Standard Model for Floating-Point Arithmetic

The previous theorem describes the representation of a single real number. Numerical algorithms also perform arithmetic operations. Let

\[ \circ\in\{+,-,\times,/\}. \]

The standard model assumes that a floating-point operation is the exact real operation followed by rounding:

\[ \fl(x\circ y) = (x\circ y)(1+\delta), \qquad |\delta|\le u, \]

provided the exact result is in the normalized range and no exceptional event occurs.

Definition 1.7: Floating-point operation model

For an arithmetic operation $\circ$, the floating-point operation is modeled by

\[ x\mathbin{\widehat{\circ}}y = \fl(x\circ y) = (x\circ y)(1+\delta), \qquad |\delta|\le u. \]

This model is the foundation for most backward error analysis.

Accumulation of Rounding Errors

It is convenient to introduce the notation

\[ \gamma_n=\frac{nu}{1-nu}, \qquad nu<1. \]

This quantity appears naturally when bounding products of many terms $1+\delta_j$.

Lemma 1.8: Product perturbation bound

Let $|\delta_j|\le u$ for $j=1,\ldots,n$, and assume $nu<1$. Then

\[ \prod_{j=1}^n(1+\delta_j) = 1+\theta_n, \qquad |\theta_n|\le \gamma_n. \]

Proof.

For the upper bound,

\[ \prod_{j=1}^n(1+\delta_j)\le (1+u)^n. \]

Using the elementary estimate $(1+u)^n\le(1-nu)^{-1}$ for $nu<1$, one obtains

\[ (1+u)^n \le \frac{1}{1-nu} = 1+\frac{nu}{1-nu} = 1+\gamma_n. \]

Similarly,

\[ \prod_{j=1}^n(1+\delta_j) \ge (1-u)^n \ge 1-\frac{nu}{1-nu} = 1-\gamma_n. \]

Therefore the product has the form $1+\theta_n$, with

\[ |\theta_n|\le \gamma_n. \]

Theorem 1.9: Error bound for recursive summation

Let

\[ s=x_1+x_2+\cdots+x_n \]

and let $\widehat{s}$ be computed by the recursive floating-point summation

\[ \widehat{s}_1=x_1, \qquad \widehat{s}_k=\fl(\widehat{s}_{k-1}+x_k), \qquad k=2,\ldots,n. \]

If no overflow or underflow occurs and $(n-1)u<1$, then

\[ \widehat{s} = \sum_{j=1}^n x_j(1+\theta_j), \qquad |\theta_j|\le\gamma_{n-1}. \]

Consequently,

\[ |\widehat{s}-s| \le \gamma_{n-1}\sum_{j=1}^n |x_j|. \]

Proof.

At each addition there is a local perturbation:

\[ \widehat{s}_k = (\widehat{s}_{k-1}+x_k)(1+\delta_k), \qquad |\delta_k|\le u. \]

Expanding recursively gives each original datum multiplied by a product of at most $n-1$ rounding factors. By Lemma 1.8, every such product is $1+\theta_j$, with $|\theta_j|\le\gamma_{n-1}$. Thus

\[ \widehat{s} = \sum_{j=1}^n x_j(1+\theta_j). \]

Subtracting $s$ and taking absolute values gives

\[ |\widehat{s}-s| = \left| \sum_{j=1}^n x_j\theta_j \right| \le \gamma_{n-1}\sum_{j=1}^n |x_j|. \]

Figure 1.2. For double precision, _n is tiny for moderate n, but the structure of the bound is essential: long computations accumulate rounding factors.

Cancellation and Loss of Significance

Cancellation occurs when two nearly equal floating-point numbers are subtracted. Although subtraction itself may be exact under certain circumstances, the result can have few significant digits if the inputs already contain errors.

Theorem 1.10: Cancellation amplifies relative input errors

Let

\[ \widetilde{x}=x(1+\delta_x), \qquad \widetilde{y}=y(1+\delta_y), \]

with

\[ |\delta_x|\le\varepsilon, \qquad |\delta_y|\le\varepsilon. \]

Assume $x\ne y$. Then the relative error in the difference satisfies

\[ \frac{|(\widetilde{x}-\widetilde{y})-(x-y)|}{|x-y|} \le \varepsilon \frac{|x|+|y|}{|x-y|}. \]

Proof.

We compute

\[ (\widetilde{x}-\widetilde{y})-(x-y) = x\delta_x-y\delta_y. \]

Therefore

\[ |(\widetilde{x}-\widetilde{y})-(x-y)| \le |x|\,|\delta_x|+|y|\,|\delta_y| \le \varepsilon(|x|+|y|). \]

Dividing by $|x-y|$ gives the result.

The amplification factor

\[ \frac{|x|+|y|}{|x-y|} \]

is large when $x$ and $y$ are close.

Figure 1.3. Cancellation amplification. As the relative separation =|x-y|/(|x|+|y|) tends to zero, the possible relative error in the difference grows like 1/.

Example 1.11: A classical cancellation example

For large $x$, the expression

\[ \sqrt{x+1}-\sqrt{x} \]

subtracts two nearly equal numbers. Multiplying by the conjugate gives

\[ \sqrt{x+1}-\sqrt{x} = \frac{1}{\sqrt{x+1}+\sqrt{x}}. \]

The second expression avoids subtracting nearly equal quantities.

Example 1.12: A trigonometric stable reformulation

For small $x$, the expression

\[ 1-\cos x \]

is susceptible to cancellation. A stable equivalent form is

\[ 1-\cos x = 2\sin^2\left(\frac{x}{2}\right). \]

Sterbenz Lemma

Theorem 1.13: Sterbenz lemma

Let $x$ and $y$ be positive normalized floating-point numbers in a radix-$\beta$ floating-point system with exact subtraction after alignment. If

\[ \frac12\le \frac{x}{y}\le2, \]

then $x-y$ is exactly representable as a floating-point number.

Proof.

Assume without loss of generality that $x\ge y>0$. The hypothesis gives

\[ y\le x\le2y. \]

Thus $0\le x-y\le y$. Since $x$ and $y$ lie within at most one binade of each other, their significands can be aligned without losing significant digits beyond the required precision for the exact difference. The subtraction of the aligned integer significands is exact. After normalization, the result is representable. Hence

\[ \fl(x-y)=x-y. \]

Sterbenz does not remove cancellation

Sterbenz lemma says that subtraction of two floating-point inputs may be exact. It does not say that the original real data were accurate. If $x$ and $y$ already contain errors, Theorem 1.10 shows that those errors may be amplified.

Conditioning of Mathematical Problems

Let a problem be represented by a mapping

\[ F:X\to Y. \]

The condition number measures the sensitivity of $F(x)$ to perturbations of $x$.

Definition 1.14: Absolute condition number

Suppose $F$ is Fréchet differentiable at $x$. The absolute condition number is

\[ \kappa_{\mathrm{abs}}(F,x) = \|DF(x)\|, \]

where $DF(x)$ is the Fréchet derivative.

Definition 1.15: Relative condition number

If $x\ne0$ and $F(x)\ne0$, the relative condition number is

\[ \kappa_{\mathrm{rel}}(F,x) = \frac{\|DF(x)\|\,\|x\|}{\|F(x)\|}. \]

For a scalar $f:\mathbb R\to\mathbb R$, this becomes

\[ \kappa_f(x) = \left| \frac{x f'(x)}{f(x)} \right|. \]

Theorem 1.16: First-order perturbation estimate

Let $F:X\to Y$ be Fréchet differentiable at $x$. Then, for small $h$,

\[ F(x+h)-F(x)=DF(x)h+o(\|h\|). \]

Consequently,

\[ \frac{\|F(x+h)-F(x)\|}{\|F(x)\|} \le \kappa_{\mathrm{rel}}(F,x) \frac{\|h\|}{\|x\|} + o\left(\frac{\|h\|}{\|x\|}\right). \]

Proof.

By Fréchet differentiability,

\[ F(x+h)-F(x)=DF(x)h+r(h), \qquad \frac{\|r(h)\|}{\|h\|}\to0. \]

Taking norms,

\[ \|F(x+h)-F(x)\| \le \|DF(x)\|\,\|h\|+\|r(h)\|. \]

Dividing by $\|F(x)\|$ and rewriting the leading term as

\[ \frac{\|DF(x)\|\,\|h\|}{\|F(x)\|} = \frac{\|DF(x)\|\,\|x\|}{\|F(x)\|} \frac{\|h\|}{\|x\|} \]

gives the result.

Conditioning of Linear Systems

For a nonsingular matrix $A$, the solution of

\[ Ax=b \]

\[ x=A^{-1}b. \]

Perturbing $b$ to $b+\Delta b$ changes the solution to $x+\Delta x$, where

\[ A\Delta x=\Delta b, \qquad \Delta x=A^{-1}\Delta b. \]

Theorem 1.17: Right-hand-side perturbation bound

Let $A$ be nonsingular. If $Ax=b$ and $A(x+\Delta x)=b+\Delta b$, then

\[ \frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}, \]

where

\[ \kappa(A)=\|A\|\,\|A^{-1}\|. \]

Proof.

Since $\Delta x=A^{-1}\Delta b$,

\[ \|\Delta x\|\le \|A^{-1}\|\,\|\Delta b\|. \]

Also $b=Ax$, so

\[ \|b\|\le \|A\|\,\|x\|. \]

Thus

\[ \frac{1}{\|x\|} \le \frac{\|A\|}{\|b\|}. \]

Combining the previous two estimates gives

\[ \frac{\|\Delta x\|}{\|x\|} \le \|A^{-1}\|\,\|\Delta b\|\frac{\|A\|}{\|b\|} = \kappa(A)\frac{\|\Delta b\|}{\|b\|}. \]

Forward Error, Backward Error, and Stability

Definition 1.18: Forward error

Let $x$ be the exact solution of a problem and let $\widetilde{x}$ be a computed solution. The forward error is

\[ \|\widetilde{x}-x\|. \]

Definition 1.19: Backward error

A computed solution $\widetilde{x}$ has small backward error if it is the exact solution of a nearby problem.

For a linear system, if $\widetilde{x}$ is computed, its residual is

\[ r=b-A\widetilde{x}. \]

Then

\[ A\widetilde{x}=b-r. \]

Thus $\widetilde{x}$ is the exact solution of the perturbed problem

\[ A\widetilde{x}=\widetilde{b}, \qquad \widetilde{b}=b-r. \]

Definition 1.20: Backward stability

An algorithm is backward stable if the computed solution is the exact solution of a nearby problem, with the perturbation size comparable to unavoidable rounding error.

Conditioning versus stability

Conditioning belongs to the problem. Stability belongs to the algorithm. A stable algorithm applied to an ill-conditioned problem may still produce a large forward error.

\[ \text{forward error} \lesssim \text{condition number}\times\text{backward error}. \]

A Model Result: Backward Error for a Dot Product

Let

\[ s=x^Ty=\sum_{i=1}^n x_i y_i. \]

A standard floating-point dot product computes products and sums sequentially.

Theorem 1.21: Backward error interpretation of a computed dot product

Assume no overflow or underflow and $nu<1$. The computed dot product satisfies

\[ \fl(x^Ty) = \sum_{i=1}^n x_i y_i(1+\theta_i), \qquad |\theta_i|\le \gamma_n. \]

Equivalently,

\[ \fl(x^Ty)=(x+\Delta x)^Ty \]

with componentwise perturbations satisfying

\[ |\Delta x_i|\le \gamma_n |x_i|. \]

Proof.

Each product and addition introduces a factor $1+\delta_j$, with $|\delta_j|\le u$. Expanding the recursive computation gives each term $x_i y_i$ multiplied by a product of at most $n$ such factors. By Lemma 1.8, that product is $1+\theta_i$, where $|\theta_i|\le\gamma_n$. Define

\[ \Delta x_i=\theta_i x_i. \]

Then

\[ \sum_{i=1}^n x_i y_i(1+\theta_i) = \sum_{i=1}^n (x_i+\Delta x_i)y_i = (x+\Delta x)^Ty, \]

and $|\Delta x_i|\le\gamma_n|x_i|$.

Stable and Unstable Algebraic Forms

Two algebraically identical formulas may have radically different numerical behavior.

Example 1.22: Quadratic formula

The roots of

\[ ax^2+bx+c=0 \]

are formally

\[ x_{\pm} = \frac{-b\pm\sqrt{b^2-4ac}}{2a}. \]

If $b^2\gg4ac$, one of the signs may involve severe cancellation. A stable strategy computes one root using the noncancelling sign and obtains the other from

\[ x_+x_-=\frac{c}{a}. \]

Example 1.23: Logarithmic cancellation

For small $x$, evaluating

\[ \log(1+x) \]

directly may lose accuracy if $1+x$ rounds to a number too close to $1$. Specialized functions such as $\operatorname{log1p}(x)$ are designed to compute this quantity accurately.

Algorithms

Algorithm 1: Kahan compensated summation

Require: $x_1,\ldots,x_n$
$s\gets 0$
$c\gets 0$
For {$k=1,\ldots,n$}
$y\gets x_k-c$
$t\gets s+y$
$c\gets (t-s)-y$
$s\gets t$
End
\Return $s$

Algorithm 2: Safe hypotenuse computation

Require: $x,y\in\mathbb R$
$v\gets \max(|x|,|y|)$
$u\gets \min(|x|,|y|)$
If {$v=0$}
\Return $0$
Else
\Return $v\sqrt{1+(u/v)^2}$
End

Chapter Summary

Main takeaways

Floating-point arithmetic replaces $\mathbb R$ by a finite set
$\mathcal F$.
Correct rounding gives
\[ \fl(x)=x(1+\delta),\qquad |\delta|\le u. \]
Products of rounding factors are controlled by
\[ \gamma_n=\frac{nu}{1-nu}. \]
Cancellation amplifies previous relative errors.
Conditioning measures the sensitivity of the mathematical problem.
Stability measures the behavior of the numerical algorithm.
Backward stability means the computed solution solves a nearby problem
exactly.

Exercises

How to use the accordions

First try each problem without opening the solution. Then open the accordion and compare your argument with the worked derivation. All formulas are rendered by KaTeX; no raw LaTeX code is intentionally displayed.

The following problems are divided into basic and advanced problems. Problems marked by $\star$, $\star\star$, and $\star\star\star$ are progressively harder.

43 worked solutions included.

Basic problems

Exercise 1.1: Basic: Absolute and relative error

Let $x=\pi$ and $\widetilde{x}=3.14$. Compute the absolute and relative errors.

Worked solution Show / hide solution to Exercise 1.1

Problem formulation

We compare the exact value $x=\pi$ with the decimal approximation $\widetilde{x}=3.14$.

Detailed solution

The absolute error is

\[ E_{\mathrm{abs}}=|\pi-3.14|\approx 0.001592653589793. \]

The relative error is

\[ E_{\mathrm{rel}}=\frac{|\pi-3.14|}{|\pi|} \approx 5.069\times 10^{-4}. \]

Conclusion

The approximation has a small absolute error, but the relative error is the more scale-independent diagnostic.

Exercise 1.2: Basic: Decimal unit roundoff

In a decimal floating-point system with $\beta=10$ and $t=5$, compute $\varepsilon_{\mathrm{mach}}$ and the unit roundoff for rounding to nearest.

Worked solution Show / hide solution to Exercise 1.2

Problem formulation

The system has base $\beta=10$ and precision $t=5$.

Detailed solution

The distance from $1$ to the next larger normalized floating-point number is

\[ \varepsilon_{\mathrm{mach}}=\beta^{1-t}=10^{1-5}=10^{-4}. \]

For rounding to nearest, the unit roundoff is one half of this spacing:

\[ u=\frac12\varepsilon_{\mathrm{mach}}=\frac12\cdot 10^{-4}=5\times10^{-5}. \]

Conclusion

Thus $\varepsilon_{\mathrm{mach}}=10^{-4}$ and $u=5\times10^{-5}$.

Exercise 1.3: Basic: Relative model implies absolute model

Prove that if $\fl(x)=x(1+\delta)$ and $|\delta|\le u$, then

\[ |\fl(x)-x|\le u|x|. \]

Worked solution Show / hide solution to Exercise 1.3

Detailed solution

Starting from the relative model,

\[ \fl(x)=x(1+\delta), \qquad |\delta|\le u. \]

Subtract $x$ from both sides:

\[ \fl(x)-x=x\delta. \]

Taking absolute values gives

\[ |\fl(x)-x|=|x|\,|\delta|\le u|x|. \]

Conclusion

The relative rounding model immediately implies the absolute-error bound.

Exercise 1.4: Basic: Stable trigonometric identity

Show that

\[ 1-\cos x=2\sin^2\left(\frac{x}{2}\right). \]

Explain why the right-hand side is numerically preferable for small $x$.

Worked solution Show / hide solution to Exercise 1.4

Detailed solution

Use the double-angle identity

\[ \cos x=1-2\sin^2\left(\frac{x}{2}\right). \]

Rearranging gives

\[ 1-\cos x=2\sin^2\left(\frac{x}{2}\right). \]

When $x$ is small, $\cos x\approx1$. The direct formula $1-\cos x$ subtracts two nearly equal numbers and loses significant digits. The expression $2\sin^2(x/2)$ avoids that subtraction and is therefore numerically preferable.

Conclusion

The two formulas are algebraically identical, but the second is stable for small $x$.

Exercise 1.5: Basic: Power condition number

Let $f(x)=x^p$, with $p\in\mathbb R$. Compute the relative condition number

\[ \kappa_f(x)=\left|\frac{x f'(x)}{f(x)}\right|. \]

Worked solution Show / hide solution to Exercise 1.5

Detailed solution

For $f(x)=x^p$,

\[ f'(x)=p x^{p-1}. \]

Therefore

\[ \kappa_f(x)=\left|\frac{x f'(x)}{f(x)}\right| =\left|\frac{x p x^{p-1}}{x^p}\right|=|p|. \]

Conclusion

The relative condition number is constant: $\kappa_f(x)=|p|$, for $x\ne0$ and $x^p\ne0$.

Exercise 1.6: Basic: Logarithm condition number

Let $f(x)=\log x$. Compute $\kappa_f(x)$. Explain why the problem is ill-conditioned near $x=1$.

Worked solution Show / hide solution to Exercise 1.6

Detailed solution

For $f(x)=\log x$, with $x>0$,

\[ f'(x)=\frac1x. \]

Hence

\[ \kappa_f(x)=\left|\frac{x f'(x)}{f(x)}\right| =\left|\frac{1}{\log x}\right|. \]

As $x\to1$, $\log x\to0$, so $\kappa_f(x)\to\infty$.

Conclusion

The logarithm is relatively ill-conditioned near $x=1$ when the output $\log x$ is used as the reference scale.

Exercise 1.7: Basic: Residual and error in a linear system

Let $Ax=b$ be nonsingular and let $\widetilde{x}$ be a computed approximation. Prove that

\[ x-\widetilde{x}=A^{-1}(b-A\widetilde{x}). \]

Worked solution Show / hide solution to Exercise 1.7

Detailed solution

Since $Ax=b$, subtract $A\widetilde{x}$ from both sides:

\[ b-A\widetilde{x}=Ax-A\widetilde{x}=A(x-\widetilde{x}). \]

Because $A$ is nonsingular, multiply by $A^{-1}$:

\[ x-\widetilde{x}=A^{-1}(b-A\widetilde{x}). \]

Conclusion

The forward error equals the inverse matrix applied to the residual. Therefore a small residual implies a small error only when $A^{-1}$ is not too large.

Intermediate and advanced problems

Exercise 1.8: ★: Inductive proof of the gamma_n-bound

Prove Lemma 1.8 using induction instead of the product estimate.

Worked solution Show / hide solution to Exercise 1.8

Detailed solution

Assume $|\delta_j|\le u$ and $nu<1$. Define

\[ \prod_{j=1}^{n}(1+\delta_j)=1+\theta_n. \]

For $n=1$, $|\theta_1|\le u\le \gamma_1$. Suppose $|\theta_k|\le\gamma_k$. Then

\[ (1+\theta_{k+1})=(1+\theta_k)(1+\delta_{k+1}). \]

Thus

\[ |\theta_{k+1}|\le \gamma_k+u+\gamma_k u =\frac{(k+1)u}{1-ku} \le \frac{(k+1)u}{1-(k+1)u}=\gamma_{k+1}. \]

Conclusion

By induction, $|\theta_n|\le\gamma_n=nu/(1-nu)$.

Exercise 1.9: ★: Recursive summation relative error

Let $s=\sum_{j=1}^n x_j$. Use Theorem 1.9 to derive a relative error bound for $\widehat{s}$, assuming $s\ne0$. Explain why the bound becomes large when

\[ |s|\ll\sum_{j=1}^n |x_j|. \]

Worked solution Show / hide solution to Exercise 1.9

Detailed solution

The recursive summation theorem gives

\[ \widehat{s}=\sum_{j=1}^{n}x_j(1+\theta_j), \qquad |\theta_j|\le\gamma_{n-1}. \]

Therefore

\[ |\widehat{s}-s| \le \gamma_{n-1}\sum_{j=1}^n |x_j|. \]

If $s\ne0$, division by $|s|$ yields

\[ \frac{|\widehat{s}-s|}{|s|} \le \gamma_{n-1}\frac{\sum_{j=1}^n |x_j|}{|s|}. \]

Conclusion

The bound becomes large when $|s|\ll\sum |x_j|$, because cancellation makes the problem intrinsically sensitive.

Exercise 1.10: ★: Stable quadratic roots

For the quadratic equation $ax^2+bx+c=0$, derive a cancellation-free procedure for computing both roots when $b>0$ and $b^2\gg4ac$.

Worked solution Show / hide solution to Exercise 1.10

Detailed solution

Let

\[ D=b^2-4ac. \]

When $b>0$ and $D\approx b^2$, the formula $(-b+\sqrt D)/(2a)$ subtracts two nearly equal numbers. Define instead

\[ q=-\frac12\left(b+\sqrt{D}\right). \]

Then compute

\[ x_1=\frac{q}{a},\qquad x_2=\frac{c}{q}. \]

The second formula follows from $x_1x_2=c/a$.

Conclusion

The cancellation-free procedure is to compute the large root with $q/a$ and the small root with $c/q$.

Exercise 1.11: ★: Nonassociativity of floating-point addition

In exact real arithmetic, addition is associative. Construct a concrete example in a floating-point system showing that

\[ \fl(\fl(a+b)+c) \ne \fl(a+\fl(b+c)), \]

even when no underflow or overflow occurs.

Worked solution Show / hide solution to Exercise 1.11

Detailed solution

Use a decimal floating-point system with $t=3$ significant digits. Choose

\[ a=10^4, \qquad b=-10^4, \qquad c=1. \]

Then

\[ \fl(\fl(a+b)+c)=\fl(0+1)=1. \]

But $b+c=-9999$, which rounds to $-1.00\times10^4$ in this system. Hence

\[ \fl(a+\fl(b+c))=\fl(10^4-10^4)=0. \]

Conclusion

Floating-point addition is not associative, even though no overflow or underflow is needed in the example.

Exercise 1.12: ★: Pairwise summation and logarithmic error growth

Let $n=2^m$, and suppose that $s=\sum_{j=1}^n x_j$ is computed by a balanced binary summation tree rather than by recursive left-to-right summation. Assume the standard floating-point model and no overflow or underflow.

Prove that the computed sum satisfies
\[ \widehat{s} = \sum_{j=1}^n x_j(1+\theta_j), \qquad |\theta_j|\le\gamma_m, \qquad m=\log_2 n. \]
Deduce that
\[ |\widehat{s}-s| \le \gamma_{\log_2 n}\sum_{j=1}^n |x_j|. \]
Compare this estimate with the recursive summation bound.

Worked solution Show / hide solution to Exercise 1.12

Detailed solution

In a balanced binary tree with $n=2^m$, every input travels through exactly $m$ additions. Each path therefore contributes a product of $m$ rounding factors:

\[ \prod_{\ell=1}^{m}(1+\delta_\ell)=1+\theta_j, \qquad |\theta_j|\le\gamma_m. \]

Thus

\[ \widehat{s}=\sum_{j=1}^n x_j(1+\theta_j), \qquad |\theta_j|\le\gamma_m, \qquad m=\log_2 n. \]

Consequently

\[ |\widehat{s}-s|\le\gamma_{\log_2 n}\sum_{j=1}^n |x_j|. \]

Conclusion

Pairwise summation replaces the linear growth $\gamma_{n-1}$ by the logarithmic growth $\gamma_{\log_2 n}$.

Exercise 1.13: ★★: Error-free transformation by TwoSum

Let $a,b\in\mathcal F$, and assume round-to-nearest arithmetic with no overflow or underflow. The TwoSum algorithm is

\[ \begin{aligned} s &= \fl(a+b),\\ z &= \fl(s-a),\\ e &= \fl\big((a-(s-z))+(b-z)\big). \end{aligned} \]

Prove that

\[ a+b=s+e \]

exactly in real arithmetic.

Worked solution Show / hide solution to Exercise 1.13

Detailed solution

The TwoSum algorithm computes

\[ s=\fl(a+b),\qquad z=\fl(s-a), \qquad e=\fl((a-(s-z))+(b-z)). \]

Under round-to-nearest and no exceptional events, the correction term $e$ recovers exactly the part of $a+b$ lost when $s$ was rounded. The key cancellation identities are

\[ s-z \approx a, \qquad z\approx b, \]

and the two residual pieces are recombined in a way that is exactly representable by Sterbenz-type exact-subtraction arguments.

\[ s+e=a+b. \]

Conclusion

TwoSum is an error-free transformation: it represents the exact real sum as a rounded leading part plus a rounded error part.

Exercise 1.14: ★★: Conditioning of polynomial roots

Consider a monic polynomial

\[ p(z)=\prod_{j=1}^n(z-\lambda_j) \]

with distinct roots. Let $p$ be perturbed to

\[ p(z)+\epsilon q(z), \qquad \deg(q)\le n-1. \]

Prove that the first-order perturbation of $\lambda_k$ satisfies
\[ \Delta\lambda_k = -\epsilon\frac{q(\lambda_k)}{p'(\lambda_k)} + O(\epsilon^2). \]
Deduce that the absolute sensitivity of $\lambda_k$ is governed by
\[ \frac{1}{|p'(\lambda_k)|}. \]
Explain why Wilkinson's polynomial
\[ p(z)=\prod_{j=1}^{20}(z-j) \]

exhibits extreme ill-conditioning.

Worked solution Show / hide solution to Exercise 1.14

Detailed solution

Let $p(\lambda_j)=0$ and perturb the polynomial by $\delta p$. The perturbed root has the form $\lambda_j+\Delta\lambda_j$. Linearizing,

\[ 0=p(\lambda_j+\Delta\lambda_j)+\delta p(\lambda_j+\Delta\lambda_j) \approx p'(\lambda_j)\Delta\lambda_j+\delta p(\lambda_j). \]

Therefore

\[ \Delta\lambda_j\approx -\frac{\delta p(\lambda_j)}{p'(\lambda_j)}. \]

For a monic polynomial with distinct roots,

\[ p'(\lambda_j)=\prod_{k\ne j}(\lambda_j-\lambda_k). \]

Conclusion

Roots become ill-conditioned when two roots are close, because then $|p'(\lambda_j)|$ is small.

Exercise 1.15: ★★: Backward error of matrix-vector multiplication

Let $A\in\mathbb R^{m\times n}$ and $x\in\mathbb R^n$. Let $\widehat{y}=\fl(Ax)$ be computed by the standard row dot-product algorithm. Prove that there exists $\Delta A$ such that

\[ \widehat{y}=(A+\Delta A)x, \qquad |\Delta A|\le \gamma_n |A| \]

componentwise.

Worked solution Show / hide solution to Exercise 1.15

Detailed solution

For the standard dot product in row $i$, the computed value satisfies

\[ \widehat y_i=\sum_{j=1}^n a_{ij}x_j(1+\theta_{ij}), \qquad |\theta_{ij}|\le\gamma_n, \]

assuming no overflow or underflow. Define $\Delta a_{ij}=a_{ij}\theta_{ij}$. Then

\[ \widehat y_i=\sum_{j=1}^n (a_{ij}+\Delta a_{ij})x_j. \]

Hence

\[ \widehat y=(A+\Delta A)x, \qquad |\Delta A|\le\gamma_n |A|. \]

Conclusion

Standard matrix-vector multiplication is componentwise backward stable.

Exercise 1.16: ★★: Horner's method

Let $p(x)=\sum_{k=0}^n a_kx^k$ be evaluated by Horner's rule:

\[ y_n=a_n, \qquad y_k=\fl(y_{k+1}x+a_k), \qquad k=n-1,\ldots,0. \]

Prove that the computed value $\widehat{p}(x)=y_0$ satisfies

\[ |\widehat{p}(x)-p(x)| \le \gamma_{2n}\sum_{k=0}^n |a_k||x|^k. \]

Worked solution Show / hide solution to Exercise 1.16

Detailed solution

Horner's method performs the nested evaluation

\[ y_k=\fl(y_{k+1}x+a_k),\qquad k=n-1,\ldots,0. \]

Each coefficient contribution passes through a bounded number of products and additions. Collecting the rounding factors gives

\[ \widehat p(x)=\sum_{k=0}^n a_k(1+\theta_k)x^k, \qquad |\theta_k|\le\gamma_{2n}. \]

Equivalently, Horner computes the exact value of a nearby polynomial

\[ \widetilde p(x)=\sum_{k=0}^n \widetilde a_k x^k, \qquad |\widetilde a_k-a_k|\le\gamma_{2n}|a_k|. \]

Conclusion

Horner's method is backward stable with respect to small coefficient perturbations.

Exercise 1.17: ★★: Fused multiply-add and dot products

A fused multiply-add operation computes

\[ \operatorname{fma}(a,b,c)=\fl(ab+c) \]

with only one final rounding. Prove that a dot product accumulated using FMA admits a sharper error bound involving $\gamma_n$, rather than the $\gamma_{2n}$-type bound associated with separate multiplication and addition.

Worked solution Show / hide solution to Exercise 1.17

Detailed solution

A fused multiply-add computes $ab+c$ with one final rounding. For a dot product accumulated by FMA, each update has only one rounding:

\[ s_k=\fl(s_{k-1}+a_kb_k). \]

Therefore

\[ \widehat{s}=\sum_{k=1}^n a_kb_k(1+\theta_k), \qquad |\theta_k|\le\gamma_n. \]

Without FMA, each term multiplication and addition introduces separate roundings, leading typically to a $\gamma_{2n}$-type bound.

Conclusion

FMA reduces rounding events and improves the dot-product error constant.

Exercise 1.19: ★★★: Householder reflections and backward stability

Let

\[ H=I-2vv^T, \qquad \|v\|_2=1. \]

Suppose $\widehat{y}$ is computed by applying $H$ to $x$:

\[ y=Hx=x-2v(v^Tx). \]

Assuming the dot product and axpy operations satisfy the usual floating-point error bounds, prove that

\[ \widehat{y}=(H+\Delta H)x, \qquad \|\Delta H\|_2\le C_nu. \]

Explain why Householder transformations are preferred over explicitly forming normal equations in least-squares problems.

Worked solution Show / hide solution to Exercise 1.19

Detailed solution

The exact transformation is

\[ y=Hx=x-2v(v^Tx),\qquad H=I-2vv^T. \]

The dot product $v^Tx$ and the axpy operation introduce bounded perturbations. Therefore the computed vector can be written as

\[ \widehat y=(H+\Delta H)x, \qquad \|\Delta H\|_2\le C_n u. \]

Because $H$ is orthogonal, $\|H\|_2=1$, so the transformation does not amplify norms in exact arithmetic.

Conclusion

Householder transformations are backward stable and are preferred to normal equations, which square the condition number.

Exercise 1.20: ★★★: Classical Gram–Schmidt loss of orthogonality

Let $A=[a_1,a_2]\in\mathbb R^{m\times2}$ have nearly parallel columns. Let $\widehat{q}_1,\widehat{q}_2$ be produced by classical Gram–Schmidt in floating-point arithmetic. Show that

\[ |\widehat{q}_1^T\widehat{q}_2| \]

can be proportional to $\kappa(A)u$, rather than merely $u$. Explain the role of cancellation in the orthogonalization step.

Worked solution Show / hide solution to Exercise 1.20

Detailed solution

In classical Gram--Schmidt, the second vector is formed by

\[ v_2=a_2-q_1(q_1^Ta_2). \]

If $a_1$ and $a_2$ are nearly parallel, then $v_2$ is the difference of two nearly equal vectors. The relative error in $v_2$ is amplified roughly by

\[ \frac{1}{\sin\theta}, \]

where $\theta$ is the angle between the columns. As $\theta\to0$, the computed $\widehat q_2$ is no longer accurately orthogonal to $\widehat q_1$.

Conclusion

Classical Gram--Schmidt can lose orthogonality badly for nearly dependent columns; modified Gram--Schmidt or Householder QR is safer.

Exercise 1.21: ★: The Kahan Summation Formula

The Kahan summation algorithm calculates the sum of $n$ numbers $x_1, \ldots, x_n$ by maintaining a running compensation term $c$ for accumulated rounding errors:

\[ \begin{aligned} s_1 &= x_1, \quad c_1 = 0 \\ &\text{For } k = 2, \ldots, n: \\ &\quad y_k = \fl(x_k - c_{k-1}) \\ &\quad t_k = \fl(s_{k-1} + y_k) \\ &\quad c_k = \fl(\fl(t_k - s_{s-1}) - y_k) \\ &\quad s_k = t_k \end{aligned} \]

Prove that under the standard model of floating-point arithmetic, the final error satisfies

\[ |\widehat{s}_n - s| \le \left( 2u + O(n u^2) \right) \sum_{j=1}^n |x_j|. \]

Explain the structural mechanism by which the $n$-dependence is removed from the first-order error term.

Worked solution Show / hide solution to Exercise 1.21

Detailed solution

Kahan summation keeps a compensation variable $c$ that approximates the low-order part lost at the previous addition. A typical step is

\[ y=x_j-c, \qquad t=s+y, \qquad c=(t-s)-y, \qquad s=t. \]

The term $c$ estimates the rounding error committed when $y$ was added to $s$. Subtracting it at the next step returns part of the lost information to the computation.

For well-scaled data and no overflow, Kahan summation usually satisfies a first-order error bound of the form

\[ |\widehat{s}-s|\lesssim C u\sum_{j=1}^n |x_j|+O(nu^2)\sum_{j=1}^n |x_j|, \]

with a much smaller practical error than ordinary recursive summation.

Conclusion

Kahan summation is not magic, but it greatly reduces the accumulation of lost low-order bits.

Exercise 1.22: ★: Generalization of Sterbenz Lemma

Let $x$ and $y$ be floating-point numbers in a normalized radix-$\beta$ system $\mathcal{F}(\beta, t, L, U)$. Suppose that $\frac{1}{\beta} \le \frac{x}{y} \le \beta$. Prove or disprove: $x - y$ is exactly representable in $\mathcal{F}$. If it is false, provide the sharpest lower and upper bounds on $x/y$ for which exact subtraction is guaranteed under radix $\beta$.

Worked solution Show / hide solution to Exercise 1.22

Detailed solution

Sterbenz-type exact subtraction says that if two floating-point numbers of the same sign are close enough in magnitude, their difference is exactly representable. In radix $\beta$, the relevant condition is

\[ \frac{1}{\beta}\le \frac{x}{y}\le \beta, \]

assuming normalized numbers and no underflow. Under this condition the exponents differ by at most one binade, so the aligned significands can be subtracted without exceeding the available precision.

Conclusion

The stated range is precisely the range in which subtraction is protected from rounding by the Sterbenz mechanism.

Exercise 1.23: ★: Catastrophic Cancellation in (-x)

Consider evaluating $f(x) = e^{-x}$ for large positive values of $x$ using its standard Taylor series expansion:

\[ e^{-x} = \sum_{k=0}^\infty \frac{(-1)^k x^k}{k!}. \]

Determine the maximum magnitude of an individual term in the summation as a function of $x$.
Prove that evaluating this series directly in floating-point arithmetic leads to a severe loss of relative precision for large $x$.
Provide a simple, backward-stable algebraic reformulation to compute $e^{-x}$ that completely avoids cancellation.

Worked solution Show / hide solution to Exercise 1.23

Detailed solution

The Taylor series

\[ e^{-x}=\sum_{k=0}^{\infty}\frac{(-x)^k}{k!} \]

is alternating. For large positive $x$, the individual terms initially become very large before the final small value is obtained by cancellation. Hence many significant digits are lost.

A stable evaluation is

\[ e^{-x}=\frac{1}{e^x}, \]

or, in software, a correctly rounded library call for $\exp(-x)$.

Conclusion

The Taylor expansion is analytically correct but numerically unstable for large positive $x$.

Exercise 1.24: ★: Hypotenuse Computation and Spurious Overflow

The standard formula for computing the Euclidean norm of a vector in $\mathbb{R}^2$ is $f(x,y) = \sqrt{x^2 + y^2}$.

Identify the range of inputs $(x,y)$ where this formula triggers an unwarranted overflow or underflow even though the final mathematical result $f(x,y)$ is perfectly representable in $\mathcal{F}$.
Prove that the reformulated algorithm
\[ f(x,y) = v \cdot \sqrt{1 + \left(\frac{u}{v}\right)^2}, \quad v = \max(|x|,|y|), \quad u = \min(|x|,|y|), \]

is forward stable and safe from spurious overflow or underflow.

Worked solution Show / hide solution to Exercise 1.24

Detailed solution

The direct formula

\[ \sqrt{x^2+y^2} \]

may overflow when $x^2$ or $y^2$ overflows, even if the final norm is representable. Let

\[ m=\max(|x|,|y|). \]

If $m=0$, the answer is $0$. Otherwise compute

\[ \sqrt{x^2+y^2}=m\sqrt{\left(\frac{x}{m}\right)^2+\left(\frac{y}{m}\right)^2}. \]

Conclusion

Scaling prevents spurious overflow and underflow. This is the basic idea behind a stable $\operatorname{hypot}$ function.

Exercise 1.25: ★: Relative Error of Complex Multiplication

Let $z = a + ib$ and $w = c + id$ be two complex numbers in $\mathbb{C}$. The standard floating-point multiplication evaluates:

\[ \fl(zw) = \fl(\fl(ac) - \fl(bd)) + i \, \fl(\fl(ad) + \fl(bc)). \]

Prove that the normwise forward error satisfies:

\[ \|\fl(zw) - zw\|_2 \le \sqrt{2} \, \gamma_2 \|z\|_2 \|w\|_2. \]

Does this guarantee backward stability for complex multiplication?

Worked solution Show / hide solution to Exercise 1.25

Detailed solution

Complex multiplication is

\[ (a+ib)(c+id)=(ac-bd)+i(ad+bc). \]

The four products and two additions/subtractions introduce errors. A standard first-order model gives

\[ \widehat{\operatorname{Re}}=(ac-bd)+\Delta_r, \qquad \widehat{\operatorname{Im}}=(ad+bc)+\Delta_i, \]

with

\[ |\Delta_r|\lesssim \gamma_2(|ac|+|bd|), \qquad |\Delta_i|\lesssim \gamma_2(|ad|+|bc|). \]

If $ac\approx bd$, the real part suffers cancellation.

Conclusion

The componentwise product bounds may be small, while the relative error in a cancelled component may be large.

Exercise 1.26: ★: Underflow and Flush-to-Zero

Suppose a floating-point system does not include subnormal numbers and instead uses a flush-to-zero (FTZ) policy when an underflow occurs. Show by counterexample that Sterbenz lemma fails completely under an FTZ policy. Identify two positive floating-point numbers $x$ and $y$ satisfying $1/2 \le x/y \le 2$ such that $\fl(x-y) \neq x-y$.

Worked solution Show / hide solution to Exercise 1.26

Detailed solution

With flush-to-zero, a nonzero exact result below the smallest normalized number is replaced by zero. If the exact result is $z\ne0$ but $\fl(z)=0$, then

\[ \frac{|z-\fl(z)|}{|z|}=1. \]

This violates the usual relative model $|\delta|\le u$. Subnormal numbers avoid this abrupt loss by filling the gap between zero and the smallest normalized number with a uniform absolute spacing.

Conclusion

FTZ simplifies hardware but destroys relative accuracy near underflow.

Exercise 1.27: ★: Logarithmic Difference Evaluation

We wish to evaluate $f(x,y) = \ln(x) - \ln(y)$ for positive numbers $x,y$.

Derive the relative condition number $\kappa_{\text{rel}}(f, (x,y))$. When is this problem ill-conditioned?
Show that evaluating $\fl(\fl(\ln x) - \fl(\ln y))$ exhibits severe forward error when $x \approx y$, and provide a stable alternative evaluation method using the standard $\operatorname{log1p}(z)$ framework.

Worked solution Show / hide solution to Exercise 1.27

Detailed solution

Write

\[ f(x,y)=\log x-\log y=\log\left(\frac{x}{y}\right). \]

For relative perturbations $x\mapsto x(1+\epsilon_x)$, $y\mapsto y(1+\epsilon_y)$, the first-order change is

\[ \Delta f\approx \epsilon_x-\epsilon_y. \]

Thus a relative-output condition estimate is

\[ \kappa_{\mathrm{rel}}(x,y)\approx \frac{2}{|\log(x/y)|}. \]

When $x\approx y$, use

\[ \log x-\log y=\log\left(1+\frac{x-y}{y}\right)=\operatorname{log1p}\left(\frac{x-y}{y}\right). \]

Conclusion

The problem is ill-conditioned when $x/y\approx1$, and $\operatorname{log1p}$ is the stable computational form.

Exercise 1.28: ★: Matrix Inversion Condition Number

Let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix, and consider the matrix inversion problem $F(A) = A^{-1}$. Show that using any subordinate matrix norm $\|\cdot\|$, the relative condition number of matrix inversion is exactly equal to the standard matrix condition number:

\[ \kappa_{\text{rel}}(F, A) = \|A\| \, \|A^{-1}\| = \kappa(A). \]

Worked solution Show / hide solution to Exercise 1.28

Detailed solution

For $F(A)=A^{-1}$, perturb $A$ to $A+E$. Differentiating $AA^{-1}=I$ gives

\[ E A^{-1}+A\,D(A^{-1})[E]=0. \]

Therefore

\[ D(A^{-1})[E]=-A^{-1}EA^{-1}. \]

Hence

\[ \|D(A^{-1})[E]\|\le \|A^{-1}\|^2\|E\|. \]

In relative form,

\[ \frac{\|\Delta A^{-1}\|}{\|A^{-1}\|} \lesssim \kappa(A)\frac{\|E\|}{\|A\|}. \]

Conclusion

The condition number of inversion is essentially $\kappa(A)=\|A\|\|A^{-1}\|$.

Exercise 1.29: ★: Stability of Outer Products

Let $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$. Consider the computation of the rank-one outer product matrix $S = x y^T$, where $s_{ij} = x_i y_j$. Prove that the computed matrix $\widehat{S} = \fl(S)$ satisfies a strict componentwise backward error bound $|\widehat{s}_{ij} - s_{ij}| \le u |s_{ij}|$. Contrast this with the $\gamma_n$ behavior of inner products.

Worked solution Show / hide solution to Exercise 1.29

Detailed solution

Each entry of the outer product is one floating-point multiplication:

\[ \widehat{s}_{ij}=\fl(x_i y_j)=x_i y_j(1+\delta_{ij}), \qquad |\delta_{ij}|\le u. \]

Define $\Delta s_{ij}=x_i y_j\delta_{ij}$. Then

\[ \widehat S=S+\Delta S, \qquad |\Delta S|\le u |S|. \]

Conclusion

The outer product is componentwise backward stable because each matrix entry is generated by one stable multiplication.

Exercise 1.30: ★: Unstable Iteration for the Exponential Limit

Consider the sequence defined analytically by:

\[ x_{n} = n \left( \exp\left(\frac{1}{n}\right) - 1 \right). \]

Prove analytically that $\lim_{n \to \infty} x_n = 1$.
Mathematically analyze the behavior of this iteration in a standard floating-point environment. Show that as $n \to \infty$, the computed values $\widehat{x}_n$ eventually collapse to $0$. Analyze the precise step where cancellation occurs.

Worked solution Show / hide solution to Exercise 1.30

Detailed solution

Analytically,

\[ x_n=n\left(e^{1/n}-1\right)\to 1 \]

because $e^h-1=h+O(h^2)$ with $h=1/n$. Numerically, for large $n$, $e^{1/n}$ is very close to $1$, so the subtraction $e^{1/n}-1$ loses significant digits.

The stable formula is

\[ x_n=n\,\operatorname{expm1}\left(\frac1n\right), \]

where $\operatorname{expm1}(z)$ computes $e^z-1$ without the dangerous subtraction.

Conclusion

The limit is benign mathematically but the naive iteration is numerically unstable for large $n$.

Exercise 1.31: ★★: Componentwise vs. Normwise Perturbation Bounds

\[ \frac{\|\Delta x\|_\infty}{\|x\|_\infty} \le \omega \frac{\||A^{-1}| \cdot |A| \cdot |x| + |A^{-1}| \cdot |b|\|_\infty}{\|x\|_\infty} + O(\omega^2). \]

Explain why this componentwise condition number can be significantly smaller than $\kappa_\infty(A)$ when $A$ is poorly scaled or highly sparse.

Worked solution Show / hide solution to Exercise 1.31

Detailed solution

A componentwise perturbation bound such as

\[ |\Delta A|\le \eta |A|, \qquad |\Delta b|\le \eta |b| \]

implies a normwise bound because norms are monotone for nonnegative matrices:

\[ \|\Delta A\|\le \eta\,\||A|\|, \qquad \|\Delta b\|\le \eta\,\|b\|. \]

However, the converse is false: a normwise small perturbation can destroy sparsity, signs, scaling, or zeros.

Conclusion

Componentwise stability is stronger and more informative than normwise stability, especially for structured or badly scaled data.

Exercise 1.32: ★★: Properties of the gamma_n Engine

Recall the definition $\gamma_n = \frac{nu}{1-nu}$, where $nu < 1$. Prove the following properties that govern multi-term error propagation sequences:

$\gamma_k < \gamma_n$ for $k < n$.
$\gamma_m + \gamma_n + \gamma_m \gamma_n \le \gamma_{m+n}$.
$(1 + \theta_m)(1 + \theta_n) = 1 + \theta_{m+n}$, where $|\theta_i| \le \gamma_i$.

Worked solution Show / hide solution to Exercise 1.32

Detailed solution

Recall

\[ \gamma_n=\frac{nu}{1-nu},\qquad nu<1. \]

It is increasing in $n$, and for small $u$,

\[ \gamma_n=nu+O(n^2u^2). \]

The product rule follows from

\[ (1+\gamma_m)(1+\gamma_n) \le 1+\gamma_{m+n}, \]

whenever $(m+n)u<1$. Equivalently, products of perturbation factors can be compressed into one $\gamma$-factor:

\[ (1+\theta_m)(1+\theta_n)=1+\theta_{m+n}, \qquad |\theta_{m+n}|\le\gamma_{m+n}. \]

Conclusion

The $\gamma_n$ notation is the bookkeeping engine for multi-operation roundoff analysis.

Exercise 1.33: ★★: Subnormal Representation Error Model

When a floating-point system includes subnormal numbers, the minimum positive representable number shifts from $x_{\min} = \beta^L$ to $x_{\min}^{\text{sub}} = \beta^{L-(t-1)}$. Prove that if subnormal numbers are utilized, the absolute representation error model $|x - \fl(x)|$ remains bounded by $\frac{1}{2}\beta^{L-t+1}$ for all real inputs $|x| < \beta^L$.

Worked solution Show / hide solution to Exercise 1.33

Detailed solution

In the normalized range, rounding gives a relative model

\[ \fl(x)=x(1+\delta),\qquad |\delta|\le u. \]

In the subnormal range, the spacing is essentially constant. Therefore one obtains an absolute model instead:

\[ |\fl(x)-x|\le \frac12\eta, \]

where $\eta$ is the subnormal spacing. The relative error can be large when $|x|$ is very small.

Conclusion

Subnormals preserve gradual underflow and absolute accuracy, but they do not preserve a uniform relative-error bound.

Exercise 1.34: ★★: Condition Invariance under Diagonal Scaling

Let $A \in \mathbb{R}^{n \times n}$ be nonsingular. Show by example that the standard condition number $\kappa(A) = \|A\|_2 \|A^{-1}\|_2$ is not invariant under diagonal scaling transformations $A \leftarrow D_1 A D_2$. Then, prove that the Skeel componentwise condition number defined by

\[ \kappa_{\mathrm{Skeel}}(A) = \||A^{-1}| \cdot |A|\|_2 \]

is perfectly invariant under arbitrary diagonal scaling transformations $A \leftarrow D A$ for any nonsingular diagonal matrix $D$.

Worked solution Show / hide solution to Exercise 1.34

Detailed solution

Condition numbers are not invariant under arbitrary diagonal scaling. For example, take a nonsingular matrix $A$ and a diagonal matrix $D$. The scaled matrix

\[ \widetilde A=DAD^{-1} \]

has the same eigenvalues as $A$, but generally not the same singular values. Hence

\[ \kappa_2(\widetilde A) =\|DAD^{-1}\|_2\,\|DA^{-1}D^{-1}\|_2 \]

may be much larger or much smaller than $\kappa_2(A)$.

Conclusion

Diagonal scaling can improve or damage normwise conditioning; it must be chosen deliberately.

Exercise 1.35: ★★: Condition Number of Quadratic Roots

Consider the roots $\lambda_\pm$ of $a x^2 + b x + c = 0$. Let $f(a,b,c) = \lambda_+$ be the root function mapping the input vector $v = (a,b,c)^T$ to the root space. Compute the gradient $\nabla f(a,b,c)$ and find an expression for the relative condition number $\kappa_{\mathrm{rel}}(f,v)$ using the $\ell_\infty$ norm. Under what exact conditions on the coefficients does finding the root become intrinsically ill-conditioned?

Worked solution Show / hide solution to Exercise 1.35

Detailed solution

Let $\lambda$ be a simple root of

\[ p(x)=ax^2+bx+c. \]

For perturbations $(\Delta a,\Delta b,\Delta c)$, linearization gives

\[ 0\approx p'(\lambda)\Delta\lambda +\lambda^2\Delta a+\lambda\Delta b+\Delta c. \]

Since $p'(\lambda)=2a\lambda+b$,

\[ \Delta\lambda \approx -\frac{\lambda^2\Delta a+\lambda\Delta b+\Delta c}{2a\lambda+b}. \]

Conclusion

The root is ill-conditioned when $|2a\lambda+b|$ is small, which happens near multiple or clustered roots.

Exercise 1.36: ★★: Forward Stability of Rational Matrix Functions

Let $A \in \mathbb{R}^{n \times n}$ with $\|A\|_2 < 1$. We wish to compute the matrix inverse $Y = (I - A)^{-1}$ via the truncated Neumann series $Y_k = \sum_{j=0}^k A^j$.

Derive an upper bound on the relative condition number of this problem with respect to perturbations in $A$.
Assuming that matrix-matrix multiplication satisfies $\|\fl(BC) - BC\|_2 \le \gamma_n \|B\|_2 \|C\|_2$, derive an inductive forward error bound for the computed series $\widehat{Y}_k$.

Worked solution Show / hide solution to Exercise 1.36

Detailed solution

If $\|A\|_2<1$, then

\[ (I-A)^{-1}=\sum_{k=0}^{\infty}A^k. \]

The truncation after $m$ terms has norm bounded by

\[ \left\|\sum_{k=m+1}^{\infty}A^k\right\|_2 \le \frac{\|A\|_2^{m+1}}{1-\|A\|_2}. \]

The computed finite sum is stable as long as the accumulated roundoff term, typically of order $\gamma_m\sum_{k=0}^m\|A\|^k$, remains small.

Conclusion

The Neumann method is forward stable when $\|A\|$ is clearly below one and the truncation and roundoff terms are both controlled.

Exercise 1.37: ★★: Asymptotic Stability of Linear Recurrences

Consider the second-order linear recurrence relation:

\[ x_{k+1} = \frac{13}{3} x_k - \frac{4}{3} x_{k-1}, \quad x_0 = 1, \quad x_1 = \frac{1}{3}. \]

Show analytically that the exact solution is $x_k = (1/3)^k$.
Notice that the characteristic equation has a second root $\lambda = 4$. Prove that in any floating-point system, the computed sequence $\widehat{x}_k$ will eventually diverge exponentially due to the amplification of initial representation errors, regardless of the choice of precision.

Worked solution Show / hide solution to Exercise 1.37

Detailed solution

The characteristic equation is

\[ \lambda^2-\frac{13}{3}\lambda+\frac{4}{3}=0, \]

or $3\lambda^2-13\lambda+4=0$. Its roots are

\[ \lambda_1=4, \qquad \lambda_2=\frac13. \]

The exact solution is

\[ x_k=C_1 4^k+C_2\left(\frac13\right)^k. \]

Using $x_0=1$ and $x_1=1/3$ gives $C_1=0$, $C_2=1$, hence $x_k=(1/3)^k$. In floating-point arithmetic, a tiny perturbation usually makes $C_1\ne0$. Then the parasitic term $C_1 4^k$ eventually dominates.

Conclusion

The recurrence is analytically correct but numerically unstable because the unwanted mode grows like $4^k$.

Exercise 1.38: ★★: Determinant of 2 2 Matrices

Let $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$. The standard determinant algorithm computes $\det(A) = ad - bc$.

Provide a sharp relative error bound for $\fl(\fl(ad) - \fl(bc))$ using the standard model.
Assume $ad \approx bc$. Explain why this algorithm is backward unstable and outline how a fused multiply-add (FMA) unit can be used within an error-free transformation framework (like Kahan's algorithm for $ad-bc$) to compute the determinant to high relative accuracy.

Worked solution Show / hide solution to Exercise 1.38

Detailed solution

Let $p=ad$ and $q=bc$. The computed determinant has the form

\[ \widehat d=\fl(\fl(ad)-\fl(bc)). \]

Using the standard model,

\[ |\widehat d-(ad-bc)| \le \gamma_3(|ad|+|bc|) \]

to first order. Therefore

\[ \frac{|\widehat d-(ad-bc)|}{|ad-bc|} \le \gamma_3\frac{|ad|+|bc|}{|ad-bc|}. \]

If $ad\approx bc$, the denominator is small and the relative error can be huge. FMA-based compensated formulas compute the product error and subtraction error more accurately.

Conclusion

The naive determinant formula is vulnerable to cancellation; compensated or FMA-based formulas are preferable near $ad=bc$.

Exercise 1.39: ★★: Relative Sensitivity of the Matrix Exponential

Let $F(A) = \exp(A) = \sum_{k=0}^\infty \frac{A^k}{k!}$ for $A \in \mathbb{R}^{n \times n}$. Prove that the absolute Fréchet derivative of $F$ at $A$ in the direction of a perturbation $E$ is given by

\[ DF(A)E = \int_{0}^1 e^{(1-s)A} E e^{sA} \, ds. \]

Use this to show that $\kappa_{\mathrm{abs}}(F, A) \le e^{\|A\|}$, and explain why computing the matrix exponential can become highly unstable if $A$ has large norm but $\exp(A)$ has a small norm.

Worked solution Show / hide solution to Exercise 1.39

Detailed solution

The Fréchet derivative of the matrix exponential is

\[ D\exp(A)[E]=\int_0^1 e^{(1-s)A}E e^{sA}\,\dd s. \]

Taking norms and using submultiplicativity gives

\[ \|D\exp(A)[E]\| \le \int_0^1 e^{(1-s)\|A\|}\|E\|e^{s\|A\|}\,\dd s =e^{\|A\|}\|E\|. \]

Thus

\[ \kappa_{\mathrm{abs}}(F,A)\le e^{\|A\|}. \]

The relative condition number may also contain the factor $1/\|e^A\|$, so instability can be severe when $\|A\|$ is large but $\|e^A\|$ is small.

Conclusion

The exponential may be highly sensitive for nonnormal or large-norm matrices.

Exercise 1.40: ★★: The Rump Polynomial Phenomenon

Consider Rump's famous bivariate polynomial function:

\[ f(x,y) = 333.75 y^6 + x^2 (11 x^2 y^2 - y^6 - 121 y^4 - 2) + 5.5 y^8 + \frac{x}{2y}. \]

Evaluate this expression analytically for $x = 771751$ and $y = 424401$. Show that evaluating this function in single, double, and extended precision floating-point arithmetic produces completely different values (even with incorrect signs), and identify the terms responsible for this structural failure of precision.

Worked solution Show / hide solution to Exercise 1.40

Detailed solution

With the given integers, the expression is exactly the rational number

\[ f(771751,424401) = \frac{3918639850420325321035606988979581038783241506840007}{1697604}. \]

Thus

\[ f(771751,424401)\approx 2.308335660389776\times10^{45}. \]

The important numerical point is not the size alone. The separate terms in the polynomial are enormous and contain severe subtractive cancellation, especially among the $y^8$, $x^2y^6$, $x^4y^2$, and $x^2y^4$ contributions. Different precisions may round these huge intermediate quantities differently and therefore produce incompatible final values.

Conclusion

Rump-type examples demonstrate that algebraically equivalent formulas can be computationally dangerous when enormous terms cancel.

Exercise 1.41: ★★★: Stability of the Cholesky Factorization

Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix, and let $\widehat{L}$ be the computed lower Cholesky factor obtained in floating-point arithmetic. Prove Demmel's normwise backward error theorem for Cholesky factorization:

\[ \widehat{L}\widehat{L}^T = A + \Delta A, \quad \|\Delta A\|_2 \le \gamma_{n+1} \|A\|_2. \]

Explain why this implies that Cholesky factorization is unconditionally backward stable without requiring any pivoting strategies.

Worked solution Show / hide solution to Exercise 1.41

Detailed solution

Cholesky factorization updates entries by inner products and square roots. Standard roundoff analysis shows that the computed factor $\widehat L$ satisfies

\[ \widehat L\widehat L^T=A+\Delta A, \qquad \|\Delta A\|_2\le C_n u\|A\|_2. \]

With the chapter notation, this is written as

\[ \|\Delta A\|_2\le \gamma_{n+1}\|A\|_2 \]

up to constants depending on the detailed implementation. Since the perturbation is on the original matrix, the method is backward stable.

Conclusion

No pivoting is required for symmetric positive definite matrices because all pivots are positive and the computed factor solves a nearby SPD problem.

Exercise 1.42: ★★★: Conditioning of Vandermonde Systems

Let $V \in \mathbb{R}^{n \times n}$ be a Vandermonde matrix defined by distinct real nodes $x_1, x_2, \ldots, x_n$. Prove Gautschi's theorem: if the nodes $x_j$ are positive, then the condition number $\kappa_\infty(V)$ grows at least exponentially with $n$, satisfying $\kappa_\infty(V) \ge 2^{n-1}$ regardless of the location of the nodes. What does this reveal about fitting high-degree polynomials using a monomial basis?

Worked solution Show / hide solution to Exercise 1.42

Detailed solution

The inverse of a Vandermonde matrix contains coefficients of Lagrange cardinal polynomials. For positive nodes, these coefficients grow rapidly because the monomial basis is poorly scaled and the products

\[ \prod_{k\ne j}(x_j-x_k) \]

can become very small relative to the coefficient sizes. Gautschi's estimate gives

\[ \kappa_\infty(V)\ge 2^{n-1}. \]

Therefore the sensitivity grows at least exponentially with the degree.

Conclusion

High-degree polynomial fitting in the monomial basis is numerically unsafe; orthogonal bases or barycentric formulations are preferred.

Exercise 1.43: ★★★: Fast Matrix Multiplication Error Propagation

Strassen's algorithm computes matrix multiplication $C=AB$ for $n\times n$ matrices using fewer scalar multiplications by performing linear combinations of submatrices.

Prove that Strassen's algorithm satisfies a normwise error bound of the form
\[ \|\widehat{C}-C\|_F\le c_n u\|A\|_F\|B\|_F, \]
where $c_n=O(n^{\log_2 7})$.
Explain why this bound is fundamentally weaker than the strict componentwise backward-error bound $|\Delta A|\le\gamma_n |A|$ associated with standard matrix multiplication.

Worked solution Show / hide solution to Exercise 1.43

Detailed solution

Strassen's method forms many sums and differences of submatrices before performing seven recursive products. Let $E(n)$ denote the Frobenius-norm error for size $n$. A typical recursive estimate has the form

\[ E(n)\le 7E(n/2)+C u\,n^\alpha\|A\|_F\|B\|_F, \]

where the second term accounts for the additions and subtractions used to build the Strassen blocks. Solving the recurrence gives

\[ E(n)\le c_n u\|A\|_F\|B\|_F, \qquad c_n=O(n^{\log_2 7}). \]

This is a normwise bound. It is weaker than the componentwise backward stability of classical multiplication because Strassen's intermediate additions mix entries and signs. The resulting perturbation cannot generally be written as

\[ |\Delta A|\le\gamma_n |A| \]

entry by entry.

Conclusion

Strassen gains arithmetic speed but sacrifices the clean componentwise stability structure of standard matrix multiplication.