Chapter 1. Floating-Point Arithmetic and Error Analysis
Web Laboratories and Student Tools
This web version adds computational laboratories. The student can enter data, compute errors, compare stable and unstable formulas, and test understanding with randomized quizzes.
Game Cards: Fast Memory Checks
Click a card to reveal the answer. These cards are deliberately short; they are meant to prevent the reader from forgetting the meaning of the main symbols while reading the chapter.
Randomized Quiz Arena
The questions are selected randomly from a large Chapter 1 bank. They include true/false questions, single-answer multiple choice, multiple-answer questions, and numerical questions.
After completing this chapter, the reader should be able to:
- distinguish exact real arithmetic from floating-point arithmetic;
- derive the standard model \(\fl(x)=x(1+\delta)\), \(|\delta|\le u\);
- use the \(\gamma_n\)-notation to control accumulated rounding errors;
- analyze cancellation and loss of significant digits;
- distinguish conditioning from stability;
- interpret residuals as backward-error objects;
- recognize stable and unstable algebraic reformulations;
- solve basic and advanced error-analysis exercises.
The Numerical Representation Problem
Numerical analysis begins with a structural obstruction: the field of real numbers \(\mathbb R\) is infinite and uncountable, whereas every digital computer manipulates only finitely many representable numbers. Thus a real number \(x\in\mathbb R\) is usually not stored as \(x\) itself, but as a nearby machine number
The discrepancy between \(x\) and \(\fl(x)\) is not a programming accident. It is an unavoidable mathematical consequence of finite storage.
A numerical result is meaningful only when we understand both the mathematical sensitivity of the problem and the stability of the algorithm used to solve it.
Let \(x\) be an exact quantity and let \(\widetilde{x}\) be an approximation. The absolute error is
If \(x\ne0\), the relative error is
Absolute error measures distance in the original physical scale. Relative error measures error compared with the magnitude of the exact quantity. The second notion is usually more informative when quantities vary over several orders of magnitude.
Let \(x_1=10^6\), \(\widetilde{x}_1=10^6+1\), and let \(x_2=10^{-6}\), \(\widetilde{x}_2=2\cdot10^{-6}\). Then
The first approximation is relatively accurate; the second has \(100\%\) relative error.
Floating-Point Systems
Let \(\beta\ge2\) be an integer base, let \(t\ge1\) be the number of significant digits, and let \(L
where
The number
is the mantissa, and \(e\) is the exponent.
A normalized floating-point number therefore has the form
The exponent controls the scale, whereas the mantissa controls the significant digits.
In a normalized base-\(\beta\) system with \(t\) significant digits, the spacing between \(1\) and the next larger normalized floating-point number is
For rounding to nearest, the unit roundoff is
For chopping, a standard relative-error bound is
Rounding as a Projection onto a Finite Set
Let \(\mathcal F\) be a floating-point system. The operation
maps a real number to a representable number. For rounding to nearest, \(\fl(x)\) is a nearest floating-point number to \(x\). Ties are resolved by a prescribed rule, for example round-to-even.
Let \(x\) be a real number lying between two consecutive normalized floating-point numbers \(a
Since \(a\) and \(b\) are the two nearest representable endpoints surrounding \(x\), rounding to nearest chooses either \(a\) or \(b\). The largest possible distance to the nearest endpoint occurs at the midpoint \((a+b)/2\), and that distance is \((b-a)/2\). Hence
Let \(x\) lie in the normalized range of a base-\(\beta\), \(t\)-digit floating-point system. If \(\fl(x)\) is obtained by rounding to nearest and no overflow or underflow occurs, then there exists \(\delta\in\mathbb R\) such that
where
It is enough to consider \(x>0\), since the system is symmetric with respect to sign. Suppose
In this binade, adjacent normalized floating-point numbers have spacing
By the local spacing bound,
Since \(x\ge \beta^e\), we obtain
Define
Then \(\fl(x)=x(1+\delta)\) and \(|\delta|\le u\).
The statement excludes overflow and underflow. Near underflow, the normalized-spacing model changes. Near overflow, the rounded value may not be representable at all.
The Standard Model for Floating-Point Arithmetic
The previous theorem describes the representation of a single real number. Numerical algorithms also perform arithmetic operations. Let
The standard model assumes that a floating-point operation is the exact real operation followed by rounding:
provided the exact result is in the normalized range and no exceptional event occurs.
For an arithmetic operation \(\circ\), the floating-point operation is modeled by
This model is the foundation for most backward error analysis.
Accumulation of Rounding Errors
It is convenient to introduce the notation
This quantity appears naturally when bounding products of many terms \(1+\delta_j\).
Let \(|\delta_j|\le u\) for \(j=1,\ldots,n\), and assume \(nu<1\). Then
For the upper bound,
Using the elementary estimate \((1+u)^n\le(1-nu)^{-1}\) for \(nu<1\), one obtains
Similarly,
Therefore the product has the form \(1+\theta_n\), with
Let
and let \(\widehat{s}\) be computed by the recursive floating-point summation
If no overflow or underflow occurs and \((n-1)u<1\), then
Consequently,
At each addition there is a local perturbation:
Expanding recursively gives each original datum multiplied by a product of at most \(n-1\) rounding factors. By Lemma 1.8, every such product is \(1+\theta_j\), with \(|\theta_j|\le\gamma_{n-1}\). Thus
Subtracting \(s\) and taking absolute values gives
Cancellation and Loss of Significance
Cancellation occurs when two nearly equal floating-point numbers are subtracted. Although subtraction itself may be exact under certain circumstances, the result can have few significant digits if the inputs already contain errors.
Let
with
Assume \(x\ne y\). Then the relative error in the difference satisfies
We compute
Therefore
Dividing by \(|x-y|\) gives the result.
The amplification factor
is large when \(x\) and \(y\) are close.
For large \(x\), the expression
subtracts two nearly equal numbers. Multiplying by the conjugate gives
The second expression avoids subtracting nearly equal quantities.
For small \(x\), the expression
is susceptible to cancellation. A stable equivalent form is
Sterbenz Lemma
Let \(x\) and \(y\) be positive normalized floating-point numbers in a radix-\(\beta\) floating-point system with exact subtraction after alignment. If
then \(x-y\) is exactly representable as a floating-point number.
Assume without loss of generality that \(x\ge y>0\). The hypothesis gives
Thus \(0\le x-y\le y\). Since \(x\) and \(y\) lie within at most one binade of each other, their significands can be aligned without losing significant digits beyond the required precision for the exact difference. The subtraction of the aligned integer significands is exact. After normalization, the result is representable. Hence
Sterbenz lemma says that subtraction of two floating-point inputs may be exact. It does not say that the original real data were accurate. If \(x\) and \(y\) already contain errors, Theorem 1.10 shows that those errors may be amplified.
Conditioning of Mathematical Problems
Let a problem be represented by a mapping
The condition number measures the sensitivity of \(F(x)\) to perturbations of \(x\).
Suppose \(F\) is Fréchet differentiable at \(x\). The absolute condition number is
where \(DF(x)\) is the Fréchet derivative.
If \(x\ne0\) and \(F(x)\ne0\), the relative condition number is
For a scalar \(f:\mathbb R\to\mathbb R\), this becomes
Let \(F:X\to Y\) be Fréchet differentiable at \(x\). Then, for small \(h\),
Consequently,
By Fréchet differentiability,
Taking norms,
Dividing by \(\|F(x)\|\) and rewriting the leading term as
gives the result.
Conditioning of Linear Systems
For a nonsingular matrix \(A\), the solution of
is
Perturbing \(b\) to \(b+\Delta b\) changes the solution to \(x+\Delta x\), where
Let \(A\) be nonsingular. If \(Ax=b\) and \(A(x+\Delta x)=b+\Delta b\), then
where
Since \(\Delta x=A^{-1}\Delta b\),
Also \(b=Ax\), so
Thus
Combining the previous two estimates gives
Forward Error, Backward Error, and Stability
Let \(x\) be the exact solution of a problem and let \(\widetilde{x}\) be a computed solution. The forward error is
A computed solution \(\widetilde{x}\) has small backward error if it is the exact solution of a nearby problem.
For a linear system, if \(\widetilde{x}\) is computed, its residual is
Then
Thus \(\widetilde{x}\) is the exact solution of the perturbed problem
An algorithm is backward stable if the computed solution is the exact solution of a nearby problem, with the perturbation size comparable to unavoidable rounding error.
Conditioning belongs to the problem. Stability belongs to the algorithm. A stable algorithm applied to an ill-conditioned problem may still produce a large forward error.
A Model Result: Backward Error for a Dot Product
Let
A standard floating-point dot product computes products and sums sequentially.
Assume no overflow or underflow and \(nu<1\). The computed dot product satisfies
Equivalently,
with componentwise perturbations satisfying
Each product and addition introduces a factor \(1+\delta_j\), with \(|\delta_j|\le u\). Expanding the recursive computation gives each term \(x_i y_i\) multiplied by a product of at most \(n\) such factors. By Lemma 1.8, that product is \(1+\theta_i\), where \(|\theta_i|\le\gamma_n\). Define
Then
and \(|\Delta x_i|\le\gamma_n|x_i|\).
Stable and Unstable Algebraic Forms
Two algebraically identical formulas may have radically different numerical behavior.
The roots of
are formally
If \(b^2\gg4ac\), one of the signs may involve severe cancellation. A stable strategy computes one root using the noncancelling sign and obtains the other from
For small \(x\), evaluating
directly may lose accuracy if \(1+x\) rounds to a number too close to \(1\). Specialized functions such as \(\operatorname{log1p}(x)\) are designed to compute this quantity accurately.
Algorithms
- Require: \(x_1,\ldots,x_n\)
- \(s\gets 0\)
- \(c\gets 0\)
- For {\(k=1,\ldots,n\)}
- \(y\gets x_k-c\)
- \(t\gets s+y\)
- \(c\gets (t-s)-y\)
- \(s\gets t\)
- End
- \Return \(s\)
- Require: \(x,y\in\mathbb R\)
- \(v\gets \max(|x|,|y|)\)
- \(u\gets \min(|x|,|y|)\)
- If {\(v=0\)}
- \Return \(0\)
- Else
- \Return \(v\sqrt{1+(u/v)^2}\)
- End
Chapter Summary
- Floating-point arithmetic replaces \(\mathbb R\) by a finite set
\(\mathcal F\).
- Correct rounding gives
\[ \fl(x)=x(1+\delta),\qquad |\delta|\le u. \]
- Products of rounding factors are controlled by
\[ \gamma_n=\frac{nu}{1-nu}. \]
- Cancellation amplifies previous relative errors.
- Conditioning measures the sensitivity of the mathematical problem.
- Stability measures the behavior of the numerical algorithm.
- Backward stability means the computed solution solves a nearby problem
exactly.
Exercises
First try each problem without opening the solution. Then open the accordion and compare your argument with the worked derivation. All formulas are rendered by KaTeX; no raw LaTeX code is intentionally displayed.
The following problems are divided into basic and advanced problems. Problems marked by \(\star\), \(\star\star\), and \(\star\star\star\) are progressively harder.
Basic problems
Let \(x=\pi\) and \(\widetilde{x}=3.14\). Compute the absolute and relative errors.
Worked solution Show / hide solution to Exercise 1.1
Problem formulation
We compare the exact value \(x=\pi\) with the decimal approximation \(\widetilde{x}=3.14\).
Detailed solution
The absolute error is
\[ E_{\mathrm{abs}}=|\pi-3.14|\approx 0.001592653589793. \]The relative error is
\[ E_{\mathrm{rel}}=\frac{|\pi-3.14|}{|\pi|} \approx 5.069\times 10^{-4}. \]Conclusion
The approximation has a small absolute error, but the relative error is the more scale-independent diagnostic.
In a decimal floating-point system with \(\beta=10\) and \(t=5\), compute \(\varepsilon_{\mathrm{mach}}\) and the unit roundoff for rounding to nearest.
Worked solution Show / hide solution to Exercise 1.2
Problem formulation
The system has base \(\beta=10\) and precision \(t=5\).
Detailed solution
The distance from \(1\) to the next larger normalized floating-point number is
\[ \varepsilon_{\mathrm{mach}}=\beta^{1-t}=10^{1-5}=10^{-4}. \]For rounding to nearest, the unit roundoff is one half of this spacing:
\[ u=\frac12\varepsilon_{\mathrm{mach}}=\frac12\cdot 10^{-4}=5\times10^{-5}. \]Conclusion
Thus \(\varepsilon_{\mathrm{mach}}=10^{-4}\) and \(u=5\times10^{-5}\).
Prove that if \(\fl(x)=x(1+\delta)\) and \(|\delta|\le u\), then
Worked solution Show / hide solution to Exercise 1.3
Detailed solution
Starting from the relative model,
\[ \fl(x)=x(1+\delta), \qquad |\delta|\le u. \]Subtract \(x\) from both sides:
\[ \fl(x)-x=x\delta. \]Taking absolute values gives
\[ |\fl(x)-x|=|x|\,|\delta|\le u|x|. \]Conclusion
The relative rounding model immediately implies the absolute-error bound.
Show that
Explain why the right-hand side is numerically preferable for small \(x\).
Worked solution Show / hide solution to Exercise 1.4
Detailed solution
Use the double-angle identity
\[ \cos x=1-2\sin^2\left(\frac{x}{2}\right). \]Rearranging gives
\[ 1-\cos x=2\sin^2\left(\frac{x}{2}\right). \]When \(x\) is small, \(\cos x\approx1\). The direct formula \(1-\cos x\) subtracts two nearly equal numbers and loses significant digits. The expression \(2\sin^2(x/2)\) avoids that subtraction and is therefore numerically preferable.
Conclusion
The two formulas are algebraically identical, but the second is stable for small \(x\).
Let \(f(x)=x^p\), with \(p\in\mathbb R\). Compute the relative condition number
Worked solution Show / hide solution to Exercise 1.5
Detailed solution
For \(f(x)=x^p\),
\[ f'(x)=p x^{p-1}. \]Therefore
\[ \kappa_f(x)=\left|\frac{x f'(x)}{f(x)}\right| =\left|\frac{x p x^{p-1}}{x^p}\right|=|p|. \]Conclusion
The relative condition number is constant: \(\kappa_f(x)=|p|\), for \(x\ne0\) and \(x^p\ne0\).
Let \(f(x)=\log x\). Compute \(\kappa_f(x)\). Explain why the problem is ill-conditioned near \(x=1\).
Worked solution Show / hide solution to Exercise 1.6
Detailed solution
For \(f(x)=\log x\), with \(x>0\),
\[ f'(x)=\frac1x. \]Hence
\[ \kappa_f(x)=\left|\frac{x f'(x)}{f(x)}\right| =\left|\frac{1}{\log x}\right|. \]As \(x\to1\), \(\log x\to0\), so \(\kappa_f(x)\to\infty\).
Conclusion
The logarithm is relatively ill-conditioned near \(x=1\) when the output \(\log x\) is used as the reference scale.
Let \(Ax=b\) be nonsingular and let \(\widetilde{x}\) be a computed approximation. Prove that
Worked solution Show / hide solution to Exercise 1.7
Detailed solution
Since \(Ax=b\), subtract \(A\widetilde{x}\) from both sides:
\[ b-A\widetilde{x}=Ax-A\widetilde{x}=A(x-\widetilde{x}). \]Because \(A\) is nonsingular, multiply by \(A^{-1}\):
\[ x-\widetilde{x}=A^{-1}(b-A\widetilde{x}). \]Conclusion
The forward error equals the inverse matrix applied to the residual. Therefore a small residual implies a small error only when \(A^{-1}\) is not too large.
Intermediate and advanced problems
Prove Lemma 1.8 using induction instead of the product estimate.
Worked solution Show / hide solution to Exercise 1.8
Detailed solution
Assume \(|\delta_j|\le u\) and \(nu<1\). Define
\[ \prod_{j=1}^{n}(1+\delta_j)=1+\theta_n. \]For \(n=1\), \(|\theta_1|\le u\le \gamma_1\). Suppose \(|\theta_k|\le\gamma_k\). Then
\[ (1+\theta_{k+1})=(1+\theta_k)(1+\delta_{k+1}). \]Thus
\[ |\theta_{k+1}|\le \gamma_k+u+\gamma_k u =\frac{(k+1)u}{1-ku} \le \frac{(k+1)u}{1-(k+1)u}=\gamma_{k+1}. \]Conclusion
By induction, \(|\theta_n|\le\gamma_n=nu/(1-nu)\).
Let \(s=\sum_{j=1}^n x_j\). Use Theorem 1.9 to derive a relative error bound for \(\widehat{s}\), assuming \(s\ne0\). Explain why the bound becomes large when
Worked solution Show / hide solution to Exercise 1.9
Detailed solution
The recursive summation theorem gives
\[ \widehat{s}=\sum_{j=1}^{n}x_j(1+\theta_j), \qquad |\theta_j|\le\gamma_{n-1}. \]Therefore
\[ |\widehat{s}-s| \le \gamma_{n-1}\sum_{j=1}^n |x_j|. \]If \(s\ne0\), division by \(|s|\) yields
\[ \frac{|\widehat{s}-s|}{|s|} \le \gamma_{n-1}\frac{\sum_{j=1}^n |x_j|}{|s|}. \]Conclusion
The bound becomes large when \(|s|\ll\sum |x_j|\), because cancellation makes the problem intrinsically sensitive.
For the quadratic equation \(ax^2+bx+c=0\), derive a cancellation-free procedure for computing both roots when \(b>0\) and \(b^2\gg4ac\).
Worked solution Show / hide solution to Exercise 1.10
Detailed solution
Let
\[ D=b^2-4ac. \]When \(b>0\) and \(D\approx b^2\), the formula \((-b+\sqrt D)/(2a)\) subtracts two nearly equal numbers. Define instead
\[ q=-\frac12\left(b+\sqrt{D}\right). \]Then compute
\[ x_1=\frac{q}{a},\qquad x_2=\frac{c}{q}. \]The second formula follows from \(x_1x_2=c/a\).
Conclusion
The cancellation-free procedure is to compute the large root with \(q/a\) and the small root with \(c/q\).
In exact real arithmetic, addition is associative. Construct a concrete example in a floating-point system showing that
even when no underflow or overflow occurs.
Worked solution Show / hide solution to Exercise 1.11
Detailed solution
Use a decimal floating-point system with \(t=3\) significant digits. Choose
\[ a=10^4, \qquad b=-10^4, \qquad c=1. \]Then
\[ \fl(\fl(a+b)+c)=\fl(0+1)=1. \]But \(b+c=-9999\), which rounds to \(-1.00\times10^4\) in this system. Hence
\[ \fl(a+\fl(b+c))=\fl(10^4-10^4)=0. \]Conclusion
Floating-point addition is not associative, even though no overflow or underflow is needed in the example.
Let \(n=2^m\), and suppose that \(s=\sum_{j=1}^n x_j\) is computed by a balanced binary summation tree rather than by recursive left-to-right summation. Assume the standard floating-point model and no overflow or underflow.
- Prove that the computed sum satisfies
\[ \widehat{s} = \sum_{j=1}^n x_j(1+\theta_j), \qquad |\theta_j|\le\gamma_m, \qquad m=\log_2 n. \]
- Deduce that
\[ |\widehat{s}-s| \le \gamma_{\log_2 n}\sum_{j=1}^n |x_j|. \]
- Compare this estimate with the recursive summation bound.
Worked solution Show / hide solution to Exercise 1.12
Detailed solution
In a balanced binary tree with \(n=2^m\), every input travels through exactly \(m\) additions. Each path therefore contributes a product of \(m\) rounding factors:
\[ \prod_{\ell=1}^{m}(1+\delta_\ell)=1+\theta_j, \qquad |\theta_j|\le\gamma_m. \]Thus
\[ \widehat{s}=\sum_{j=1}^n x_j(1+\theta_j), \qquad |\theta_j|\le\gamma_m, \qquad m=\log_2 n. \]Consequently
\[ |\widehat{s}-s|\le\gamma_{\log_2 n}\sum_{j=1}^n |x_j|. \]Conclusion
Pairwise summation replaces the linear growth \(\gamma_{n-1}\) by the logarithmic growth \(\gamma_{\log_2 n}\).
Let \(a,b\in\mathcal F\), and assume round-to-nearest arithmetic with no overflow or underflow. The TwoSum algorithm is
Prove that
exactly in real arithmetic.
Worked solution Show / hide solution to Exercise 1.13
Detailed solution
The TwoSum algorithm computes
\[ s=\fl(a+b),\qquad z=\fl(s-a), \qquad e=\fl((a-(s-z))+(b-z)). \]Under round-to-nearest and no exceptional events, the correction term \(e\) recovers exactly the part of \(a+b\) lost when \(s\) was rounded. The key cancellation identities are
\[ s-z \approx a, \qquad z\approx b, \]and the two residual pieces are recombined in a way that is exactly representable by Sterbenz-type exact-subtraction arguments.
\[ s+e=a+b. \]Conclusion
TwoSum is an error-free transformation: it represents the exact real sum as a rounded leading part plus a rounded error part.
Consider a monic polynomial
with distinct roots. Let \(p\) be perturbed to
- Prove that the first-order perturbation of \(\lambda_k\) satisfies
\[ \Delta\lambda_k = -\epsilon\frac{q(\lambda_k)}{p'(\lambda_k)} + O(\epsilon^2). \]
- Deduce that the absolute sensitivity of \(\lambda_k\) is governed by
\[ \frac{1}{|p'(\lambda_k)|}. \]
- Explain why Wilkinson's polynomial
\[ p(z)=\prod_{j=1}^{20}(z-j) \]
exhibits extreme ill-conditioning.
Worked solution Show / hide solution to Exercise 1.14
Detailed solution
Let \(p(\lambda_j)=0\) and perturb the polynomial by \(\delta p\). The perturbed root has the form \(\lambda_j+\Delta\lambda_j\). Linearizing,
\[ 0=p(\lambda_j+\Delta\lambda_j)+\delta p(\lambda_j+\Delta\lambda_j) \approx p'(\lambda_j)\Delta\lambda_j+\delta p(\lambda_j). \]Therefore
\[ \Delta\lambda_j\approx -\frac{\delta p(\lambda_j)}{p'(\lambda_j)}. \]For a monic polynomial with distinct roots,
\[ p'(\lambda_j)=\prod_{k\ne j}(\lambda_j-\lambda_k). \]Conclusion
Roots become ill-conditioned when two roots are close, because then \(|p'(\lambda_j)|\) is small.
Let \(A\in\mathbb R^{m\times n}\) and \(x\in\mathbb R^n\). Let \(\widehat{y}=\fl(Ax)\) be computed by the standard row dot-product algorithm. Prove that there exists \(\Delta A\) such that
componentwise.
Worked solution Show / hide solution to Exercise 1.15
Detailed solution
For the standard dot product in row \(i\), the computed value satisfies
\[ \widehat y_i=\sum_{j=1}^n a_{ij}x_j(1+\theta_{ij}), \qquad |\theta_{ij}|\le\gamma_n, \]assuming no overflow or underflow. Define \(\Delta a_{ij}=a_{ij}\theta_{ij}\). Then
\[ \widehat y_i=\sum_{j=1}^n (a_{ij}+\Delta a_{ij})x_j. \]Hence
\[ \widehat y=(A+\Delta A)x, \qquad |\Delta A|\le\gamma_n |A|. \]Conclusion
Standard matrix-vector multiplication is componentwise backward stable.
Let \(p(x)=\sum_{k=0}^n a_kx^k\) be evaluated by Horner's rule:
Prove that the computed value \(\widehat{p}(x)=y_0\) satisfies
Worked solution Show / hide solution to Exercise 1.16
Detailed solution
Horner's method performs the nested evaluation
\[ y_k=\fl(y_{k+1}x+a_k),\qquad k=n-1,\ldots,0. \]Each coefficient contribution passes through a bounded number of products and additions. Collecting the rounding factors gives
\[ \widehat p(x)=\sum_{k=0}^n a_k(1+\theta_k)x^k, \qquad |\theta_k|\le\gamma_{2n}. \]Equivalently, Horner computes the exact value of a nearby polynomial
\[ \widetilde p(x)=\sum_{k=0}^n \widetilde a_k x^k, \qquad |\widetilde a_k-a_k|\le\gamma_{2n}|a_k|. \]Conclusion
Horner's method is backward stable with respect to small coefficient perturbations.
A fused multiply-add operation computes
with only one final rounding. Prove that a dot product accumulated using FMA admits a sharper error bound involving \(\gamma_n\), rather than the \(\gamma_{2n}\)-type bound associated with separate multiplication and addition.
Worked solution Show / hide solution to Exercise 1.17
Detailed solution
A fused multiply-add computes \(ab+c\) with one final rounding. For a dot product accumulated by FMA, each update has only one rounding:
\[ s_k=\fl(s_{k-1}+a_kb_k). \]Therefore
\[ \widehat{s}=\sum_{k=1}^n a_kb_k(1+\theta_k), \qquad |\theta_k|\le\gamma_n. \]Without FMA, each term multiplication and addition introduces separate roundings, leading typically to a \(\gamma_{2n}\)-type bound.
Conclusion
FMA reduces rounding events and improves the dot-product error constant.
Let \(Ax=b\) be solved using an LU factorization computed in a low precision with unit roundoff \(u_\ell\). Suppose residuals are computed in a higher precision with unit roundoff \(u_h\ll u_\ell\). The refinement process is
Model the low-precision correction solve as applying \(A^{-1}+\Delta\), where
Derive the recurrence
Conclude that refinement converges if
Worked solution Show / hide solution to Exercise 1.18
Detailed solution
Let \(e_k=x-x_k\). The residual is computed in high precision:
\[ r_k=b-Ax_k=Ae_k+\hbox{small high-precision error}. \]The low-precision correction solve can be modeled as
\[ d_k=(A^{-1}+\Delta)r_k, \qquad \|\Delta\|\le C u_\ell \|A^{-1}\|. \]Then
\[ e_{k+1}=x-x_{k+1}=e_k-d_k \]and hence
\[ \|e_{k+1}\|\le C u_\ell\kappa(A)\|e_k\|+O(u_h)\|x\|. \]Conclusion
Iterative refinement converges when \(C u_\ell\kappa(A)<1\), explaining the success of mixed-precision solvers.
Let
Suppose \(\widehat{y}\) is computed by applying \(H\) to \(x\):
Assuming the dot product and axpy operations satisfy the usual floating-point error bounds, prove that
Explain why Householder transformations are preferred over explicitly forming normal equations in least-squares problems.
Worked solution Show / hide solution to Exercise 1.19
Detailed solution
The exact transformation is
\[ y=Hx=x-2v(v^Tx),\qquad H=I-2vv^T. \]The dot product \(v^Tx\) and the axpy operation introduce bounded perturbations. Therefore the computed vector can be written as
\[ \widehat y=(H+\Delta H)x, \qquad \|\Delta H\|_2\le C_n u. \]Because \(H\) is orthogonal, \(\|H\|_2=1\), so the transformation does not amplify norms in exact arithmetic.
Conclusion
Householder transformations are backward stable and are preferred to normal equations, which square the condition number.
Let \(A=[a_1,a_2]\in\mathbb R^{m\times2}\) have nearly parallel columns. Let \(\widehat{q}_1,\widehat{q}_2\) be produced by classical Gram–Schmidt in floating-point arithmetic. Show that
can be proportional to \(\kappa(A)u\), rather than merely \(u\). Explain the role of cancellation in the orthogonalization step.
Worked solution Show / hide solution to Exercise 1.20
Detailed solution
In classical Gram--Schmidt, the second vector is formed by
\[ v_2=a_2-q_1(q_1^Ta_2). \]If \(a_1\) and \(a_2\) are nearly parallel, then \(v_2\) is the difference of two nearly equal vectors. The relative error in \(v_2\) is amplified roughly by
\[ \frac{1}{\sin\theta}, \]where \(\theta\) is the angle between the columns. As \(\theta\to0\), the computed \(\widehat q_2\) is no longer accurately orthogonal to \(\widehat q_1\).
Conclusion
Classical Gram--Schmidt can lose orthogonality badly for nearly dependent columns; modified Gram--Schmidt or Householder QR is safer.
The Kahan summation algorithm calculates the sum of $n$ numbers $x_1, \ldots, x_n$ by maintaining a running compensation term $c$ for accumulated rounding errors:
Prove that under the standard model of floating-point arithmetic, the final error satisfies
Explain the structural mechanism by which the $n$-dependence is removed from the first-order error term.
Worked solution Show / hide solution to Exercise 1.21
Detailed solution
Kahan summation keeps a compensation variable \(c\) that approximates the low-order part lost at the previous addition. A typical step is
\[ y=x_j-c, \qquad t=s+y, \qquad c=(t-s)-y, \qquad s=t. \]The term \(c\) estimates the rounding error committed when \(y\) was added to \(s\). Subtracting it at the next step returns part of the lost information to the computation.
For well-scaled data and no overflow, Kahan summation usually satisfies a first-order error bound of the form
\[ |\widehat{s}-s|\lesssim C u\sum_{j=1}^n |x_j|+O(nu^2)\sum_{j=1}^n |x_j|, \]with a much smaller practical error than ordinary recursive summation.
Conclusion
Kahan summation is not magic, but it greatly reduces the accumulation of lost low-order bits.
Let $x$ and $y$ be floating-point numbers in a normalized radix-$\beta$ system $\mathcal{F}(\beta, t, L, U)$. Suppose that $\frac{1}{\beta} \le \frac{x}{y} \le \beta$. Prove or disprove: $x - y$ is exactly representable in $\mathcal{F}$. If it is false, provide the sharpest lower and upper bounds on $x/y$ for which exact subtraction is guaranteed under radix $\beta$.
Worked solution Show / hide solution to Exercise 1.22
Detailed solution
Sterbenz-type exact subtraction says that if two floating-point numbers of the same sign are close enough in magnitude, their difference is exactly representable. In radix \(\beta\), the relevant condition is
\[ \frac{1}{\beta}\le \frac{x}{y}\le \beta, \]assuming normalized numbers and no underflow. Under this condition the exponents differ by at most one binade, so the aligned significands can be subtracted without exceeding the available precision.
Conclusion
The stated range is precisely the range in which subtraction is protected from rounding by the Sterbenz mechanism.
Consider evaluating $f(x) = e^{-x}$ for large positive values of $x$ using its standard Taylor series expansion:
- Determine the maximum magnitude of an individual term in the summation as a function of $x$.
- Prove that evaluating this series directly in floating-point arithmetic leads to a severe loss of relative precision for large $x$.
- Provide a simple, backward-stable algebraic reformulation to compute $e^{-x}$ that completely avoids cancellation.
Worked solution Show / hide solution to Exercise 1.23
Detailed solution
The Taylor series
\[ e^{-x}=\sum_{k=0}^{\infty}\frac{(-x)^k}{k!} \]is alternating. For large positive \(x\), the individual terms initially become very large before the final small value is obtained by cancellation. Hence many significant digits are lost.
A stable evaluation is
\[ e^{-x}=\frac{1}{e^x}, \]or, in software, a correctly rounded library call for \(\exp(-x)\).
Conclusion
The Taylor expansion is analytically correct but numerically unstable for large positive \(x\).
The standard formula for computing the Euclidean norm of a vector in $\mathbb{R}^2$ is $f(x,y) = \sqrt{x^2 + y^2}$.
- Identify the range of inputs $(x,y)$ where this formula triggers an unwarranted overflow or underflow even though the final mathematical result $f(x,y)$ is perfectly representable in $\mathcal{F}$.
- Prove that the reformulated algorithm
\[ f(x,y) = v \cdot \sqrt{1 + \left(\frac{u}{v}\right)^2}, \quad v = \max(|x|,|y|), \quad u = \min(|x|,|y|), \]
is forward stable and safe from spurious overflow or underflow.
Worked solution Show / hide solution to Exercise 1.24
Detailed solution
The direct formula
\[ \sqrt{x^2+y^2} \]may overflow when \(x^2\) or \(y^2\) overflows, even if the final norm is representable. Let
\[ m=\max(|x|,|y|). \]If \(m=0\), the answer is \(0\). Otherwise compute
\[ \sqrt{x^2+y^2}=m\sqrt{\left(\frac{x}{m}\right)^2+\left(\frac{y}{m}\right)^2}. \]Conclusion
Scaling prevents spurious overflow and underflow. This is the basic idea behind a stable \(\operatorname{hypot}\) function.
Let $z = a + ib$ and $w = c + id$ be two complex numbers in $\mathbb{C}$. The standard floating-point multiplication evaluates:
Prove that the normwise forward error satisfies:
Does this guarantee backward stability for complex multiplication?
Worked solution Show / hide solution to Exercise 1.25
Detailed solution
Complex multiplication is
\[ (a+ib)(c+id)=(ac-bd)+i(ad+bc). \]The four products and two additions/subtractions introduce errors. A standard first-order model gives
\[ \widehat{\operatorname{Re}}=(ac-bd)+\Delta_r, \qquad \widehat{\operatorname{Im}}=(ad+bc)+\Delta_i, \]with
\[ |\Delta_r|\lesssim \gamma_2(|ac|+|bd|), \qquad |\Delta_i|\lesssim \gamma_2(|ad|+|bc|). \]If \(ac\approx bd\), the real part suffers cancellation.
Conclusion
The componentwise product bounds may be small, while the relative error in a cancelled component may be large.
Suppose a floating-point system does not include subnormal numbers and instead uses a flush-to-zero (FTZ) policy when an underflow occurs. Show by counterexample that Sterbenz lemma fails completely under an FTZ policy. Identify two positive floating-point numbers $x$ and $y$ satisfying $1/2 \le x/y \le 2$ such that $\fl(x-y) \neq x-y$.
Worked solution Show / hide solution to Exercise 1.26
Detailed solution
With flush-to-zero, a nonzero exact result below the smallest normalized number is replaced by zero. If the exact result is \(z\ne0\) but \(\fl(z)=0\), then
\[ \frac{|z-\fl(z)|}{|z|}=1. \]This violates the usual relative model \(|\delta|\le u\). Subnormal numbers avoid this abrupt loss by filling the gap between zero and the smallest normalized number with a uniform absolute spacing.
Conclusion
FTZ simplifies hardware but destroys relative accuracy near underflow.
We wish to evaluate $f(x,y) = \ln(x) - \ln(y)$ for positive numbers $x,y$.
- Derive the relative condition number $\kappa_{\text{rel}}(f, (x,y))$. When is this problem ill-conditioned?
- Show that evaluating $\fl(\fl(\ln x) - \fl(\ln y))$ exhibits severe forward error when $x \approx y$, and provide a stable alternative evaluation method using the standard $\operatorname{log1p}(z)$ framework.
Worked solution Show / hide solution to Exercise 1.27
Detailed solution
Write
\[ f(x,y)=\log x-\log y=\log\left(\frac{x}{y}\right). \]For relative perturbations \(x\mapsto x(1+\epsilon_x)\), \(y\mapsto y(1+\epsilon_y)\), the first-order change is
\[ \Delta f\approx \epsilon_x-\epsilon_y. \]Thus a relative-output condition estimate is
\[ \kappa_{\mathrm{rel}}(x,y)\approx \frac{2}{|\log(x/y)|}. \]When \(x\approx y\), use
\[ \log x-\log y=\log\left(1+\frac{x-y}{y}\right)=\operatorname{log1p}\left(\frac{x-y}{y}\right). \]Conclusion
The problem is ill-conditioned when \(x/y\approx1\), and \(\operatorname{log1p}\) is the stable computational form.
Let $A \in \mathbb{R}^{n \times n}$ be a nonsingular matrix, and consider the matrix inversion problem $F(A) = A^{-1}$. Show that using any subordinate matrix norm $\|\cdot\|$, the relative condition number of matrix inversion is exactly equal to the standard matrix condition number:
Worked solution Show / hide solution to Exercise 1.28
Detailed solution
For \(F(A)=A^{-1}\), perturb \(A\) to \(A+E\). Differentiating \(AA^{-1}=I\) gives
\[ E A^{-1}+A\,D(A^{-1})[E]=0. \]Therefore
\[ D(A^{-1})[E]=-A^{-1}EA^{-1}. \]Hence
\[ \|D(A^{-1})[E]\|\le \|A^{-1}\|^2\|E\|. \]In relative form,
\[ \frac{\|\Delta A^{-1}\|}{\|A^{-1}\|} \lesssim \kappa(A)\frac{\|E\|}{\|A\|}. \]Conclusion
The condition number of inversion is essentially \(\kappa(A)=\|A\|\|A^{-1}\|\).
Let $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$. Consider the computation of the rank-one outer product matrix $S = x y^T$, where $s_{ij} = x_i y_j$. Prove that the computed matrix $\widehat{S} = \fl(S)$ satisfies a strict componentwise backward error bound $|\widehat{s}_{ij} - s_{ij}| \le u |s_{ij}|$. Contrast this with the $\gamma_n$ behavior of inner products.
Worked solution Show / hide solution to Exercise 1.29
Detailed solution
Each entry of the outer product is one floating-point multiplication:
\[ \widehat{s}_{ij}=\fl(x_i y_j)=x_i y_j(1+\delta_{ij}), \qquad |\delta_{ij}|\le u. \]Define \(\Delta s_{ij}=x_i y_j\delta_{ij}\). Then
\[ \widehat S=S+\Delta S, \qquad |\Delta S|\le u |S|. \]Conclusion
The outer product is componentwise backward stable because each matrix entry is generated by one stable multiplication.
Consider the sequence defined analytically by:
- Prove analytically that $\lim_{n \to \infty} x_n = 1$.
- Mathematically analyze the behavior of this iteration in a standard floating-point environment. Show that as $n \to \infty$, the computed values $\widehat{x}_n$ eventually collapse to $0$. Analyze the precise step where cancellation occurs.
Worked solution Show / hide solution to Exercise 1.30
Detailed solution
Analytically,
\[ x_n=n\left(e^{1/n}-1\right)\to 1 \]because \(e^h-1=h+O(h^2)\) with \(h=1/n\). Numerically, for large \(n\), \(e^{1/n}\) is very close to \(1\), so the subtraction \(e^{1/n}-1\) loses significant digits.
The stable formula is
\[ x_n=n\,\operatorname{expm1}\left(\frac1n\right), \]where \(\operatorname{expm1}(z)\) computes \(e^z-1\) without the dangerous subtraction.
Conclusion
The limit is benign mathematically but the naive iteration is numerically unstable for large \(n\).
Let $Ax = b$ and $(A + \Delta A)(x + \Delta x) = b + \Delta b$. Suppose we have a componentwise structural bound $|\Delta A| \le \omega |A|$ and $|\Delta b| \le \omega |b|$, where the inequalities and absolute values hold element-by-element. Prove Oberguggenberger's first-order bound:
Explain why this componentwise condition number can be significantly smaller than $\kappa_\infty(A)$ when $A$ is poorly scaled or highly sparse.
Worked solution Show / hide solution to Exercise 1.31
Detailed solution
A componentwise perturbation bound such as
\[ |\Delta A|\le \eta |A|, \qquad |\Delta b|\le \eta |b| \]implies a normwise bound because norms are monotone for nonnegative matrices:
\[ \|\Delta A\|\le \eta\,\||A|\|, \qquad \|\Delta b\|\le \eta\,\|b\|. \]However, the converse is false: a normwise small perturbation can destroy sparsity, signs, scaling, or zeros.
Conclusion
Componentwise stability is stronger and more informative than normwise stability, especially for structured or badly scaled data.
Recall the definition $\gamma_n = \frac{nu}{1-nu}$, where $nu < 1$. Prove the following properties that govern multi-term error propagation sequences:
- $\gamma_k < \gamma_n$ for $k < n$.
- $\gamma_m + \gamma_n + \gamma_m \gamma_n \le \gamma_{m+n}$.
- $(1 + \theta_m)(1 + \theta_n) = 1 + \theta_{m+n}$, where $|\theta_i| \le \gamma_i$.
Worked solution Show / hide solution to Exercise 1.32
Detailed solution
Recall
\[ \gamma_n=\frac{nu}{1-nu},\qquad nu<1. \]It is increasing in \(n\), and for small \(u\),
\[ \gamma_n=nu+O(n^2u^2). \]The product rule follows from
\[ (1+\gamma_m)(1+\gamma_n) \le 1+\gamma_{m+n}, \]whenever \((m+n)u<1\). Equivalently, products of perturbation factors can be compressed into one \(\gamma\)-factor:
\[ (1+\theta_m)(1+\theta_n)=1+\theta_{m+n}, \qquad |\theta_{m+n}|\le\gamma_{m+n}. \]Conclusion
The \(\gamma_n\) notation is the bookkeeping engine for multi-operation roundoff analysis.
When a floating-point system includes subnormal numbers, the minimum positive representable number shifts from $x_{\min} = \beta^L$ to $x_{\min}^{\text{sub}} = \beta^{L-(t-1)}$. Prove that if subnormal numbers are utilized, the absolute representation error model $|x - \fl(x)|$ remains bounded by $\frac{1}{2}\beta^{L-t+1}$ for all real inputs $|x| < \beta^L$.
Worked solution Show / hide solution to Exercise 1.33
Detailed solution
In the normalized range, rounding gives a relative model
\[ \fl(x)=x(1+\delta),\qquad |\delta|\le u. \]In the subnormal range, the spacing is essentially constant. Therefore one obtains an absolute model instead:
\[ |\fl(x)-x|\le \frac12\eta, \]where \(\eta\) is the subnormal spacing. The relative error can be large when \(|x|\) is very small.
Conclusion
Subnormals preserve gradual underflow and absolute accuracy, but they do not preserve a uniform relative-error bound.
Let $A \in \mathbb{R}^{n \times n}$ be nonsingular. Show by example that the standard condition number $\kappa(A) = \|A\|_2 \|A^{-1}\|_2$ is not invariant under diagonal scaling transformations $A \leftarrow D_1 A D_2$. Then, prove that the Skeel componentwise condition number defined by
is perfectly invariant under arbitrary diagonal scaling transformations $A \leftarrow D A$ for any nonsingular diagonal matrix $D$.
Worked solution Show / hide solution to Exercise 1.34
Detailed solution
Condition numbers are not invariant under arbitrary diagonal scaling. For example, take a nonsingular matrix \(A\) and a diagonal matrix \(D\). The scaled matrix
\[ \widetilde A=DAD^{-1} \]has the same eigenvalues as \(A\), but generally not the same singular values. Hence
\[ \kappa_2(\widetilde A) =\|DAD^{-1}\|_2\,\|DA^{-1}D^{-1}\|_2 \]may be much larger or much smaller than \(\kappa_2(A)\).
Conclusion
Diagonal scaling can improve or damage normwise conditioning; it must be chosen deliberately.
Consider the roots $\lambda_\pm$ of $a x^2 + b x + c = 0$. Let $f(a,b,c) = \lambda_+$ be the root function mapping the input vector $v = (a,b,c)^T$ to the root space. Compute the gradient $\nabla f(a,b,c)$ and find an expression for the relative condition number $\kappa_{\mathrm{rel}}(f,v)$ using the $\ell_\infty$ norm. Under what exact conditions on the coefficients does finding the root become intrinsically ill-conditioned?
Worked solution Show / hide solution to Exercise 1.35
Detailed solution
Let \(\lambda\) be a simple root of
\[ p(x)=ax^2+bx+c. \]For perturbations \((\Delta a,\Delta b,\Delta c)\), linearization gives
\[ 0\approx p'(\lambda)\Delta\lambda +\lambda^2\Delta a+\lambda\Delta b+\Delta c. \]Since \(p'(\lambda)=2a\lambda+b\),
\[ \Delta\lambda \approx -\frac{\lambda^2\Delta a+\lambda\Delta b+\Delta c}{2a\lambda+b}. \]Conclusion
The root is ill-conditioned when \(|2a\lambda+b|\) is small, which happens near multiple or clustered roots.
Let $A \in \mathbb{R}^{n \times n}$ with $\|A\|_2 < 1$. We wish to compute the matrix inverse $Y = (I - A)^{-1}$ via the truncated Neumann series $Y_k = \sum_{j=0}^k A^j$.
- Derive an upper bound on the relative condition number of this problem with respect to perturbations in $A$.
- Assuming that matrix-matrix multiplication satisfies $\|\fl(BC) - BC\|_2 \le \gamma_n \|B\|_2 \|C\|_2$, derive an inductive forward error bound for the computed series $\widehat{Y}_k$.
Worked solution Show / hide solution to Exercise 1.36
Detailed solution
If \(\|A\|_2<1\), then
\[ (I-A)^{-1}=\sum_{k=0}^{\infty}A^k. \]The truncation after \(m\) terms has norm bounded by
\[ \left\|\sum_{k=m+1}^{\infty}A^k\right\|_2 \le \frac{\|A\|_2^{m+1}}{1-\|A\|_2}. \]The computed finite sum is stable as long as the accumulated roundoff term, typically of order \(\gamma_m\sum_{k=0}^m\|A\|^k\), remains small.
Conclusion
The Neumann method is forward stable when \(\|A\|\) is clearly below one and the truncation and roundoff terms are both controlled.
Consider the second-order linear recurrence relation:
- Show analytically that the exact solution is $x_k = (1/3)^k$.
- Notice that the characteristic equation has a second root $\lambda = 4$. Prove that in any floating-point system, the computed sequence $\widehat{x}_k$ will eventually diverge exponentially due to the amplification of initial representation errors, regardless of the choice of precision.
Worked solution Show / hide solution to Exercise 1.37
Detailed solution
The characteristic equation is
\[ \lambda^2-\frac{13}{3}\lambda+\frac{4}{3}=0, \]or \(3\lambda^2-13\lambda+4=0\). Its roots are
\[ \lambda_1=4, \qquad \lambda_2=\frac13. \]The exact solution is
\[ x_k=C_1 4^k+C_2\left(\frac13\right)^k. \]Using \(x_0=1\) and \(x_1=1/3\) gives \(C_1=0\), \(C_2=1\), hence \(x_k=(1/3)^k\). In floating-point arithmetic, a tiny perturbation usually makes \(C_1\ne0\). Then the parasitic term \(C_1 4^k\) eventually dominates.
Conclusion
The recurrence is analytically correct but numerically unstable because the unwanted mode grows like \(4^k\).
Let $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$. The standard determinant algorithm computes $\det(A) = ad - bc$.
- Provide a sharp relative error bound for $\fl(\fl(ad) - \fl(bc))$ using the standard model.
- Assume $ad \approx bc$. Explain why this algorithm is backward unstable and outline how a fused multiply-add (FMA) unit can be used within an error-free transformation framework (like Kahan's algorithm for $ad-bc$) to compute the determinant to high relative accuracy.
Worked solution Show / hide solution to Exercise 1.38
Detailed solution
Let \(p=ad\) and \(q=bc\). The computed determinant has the form
\[ \widehat d=\fl(\fl(ad)-\fl(bc)). \]Using the standard model,
\[ |\widehat d-(ad-bc)| \le \gamma_3(|ad|+|bc|) \]to first order. Therefore
\[ \frac{|\widehat d-(ad-bc)|}{|ad-bc|} \le \gamma_3\frac{|ad|+|bc|}{|ad-bc|}. \]If \(ad\approx bc\), the denominator is small and the relative error can be huge. FMA-based compensated formulas compute the product error and subtraction error more accurately.
Conclusion
The naive determinant formula is vulnerable to cancellation; compensated or FMA-based formulas are preferable near \(ad=bc\).
Let $F(A) = \exp(A) = \sum_{k=0}^\infty \frac{A^k}{k!}$ for $A \in \mathbb{R}^{n \times n}$. Prove that the absolute Fréchet derivative of $F$ at $A$ in the direction of a perturbation $E$ is given by
Use this to show that $\kappa_{\mathrm{abs}}(F, A) \le e^{\|A\|}$, and explain why computing the matrix exponential can become highly unstable if $A$ has large norm but $\exp(A)$ has a small norm.
Worked solution Show / hide solution to Exercise 1.39
Detailed solution
The Fréchet derivative of the matrix exponential is
\[ D\exp(A)[E]=\int_0^1 e^{(1-s)A}E e^{sA}\,\dd s. \]Taking norms and using submultiplicativity gives
\[ \|D\exp(A)[E]\| \le \int_0^1 e^{(1-s)\|A\|}\|E\|e^{s\|A\|}\,\dd s =e^{\|A\|}\|E\|. \]Thus
\[ \kappa_{\mathrm{abs}}(F,A)\le e^{\|A\|}. \]The relative condition number may also contain the factor \(1/\|e^A\|\), so instability can be severe when \(\|A\|\) is large but \(\|e^A\|\) is small.
Conclusion
The exponential may be highly sensitive for nonnormal or large-norm matrices.
Consider Rump's famous bivariate polynomial function:
Evaluate this expression analytically for $x = 771751$ and $y = 424401$. Show that evaluating this function in single, double, and extended precision floating-point arithmetic produces completely different values (even with incorrect signs), and identify the terms responsible for this structural failure of precision.
Worked solution Show / hide solution to Exercise 1.40
Detailed solution
With the given integers, the expression is exactly the rational number
\[ f(771751,424401) = \frac{3918639850420325321035606988979581038783241506840007}{1697604}. \]Thus
\[ f(771751,424401)\approx 2.308335660389776\times10^{45}. \]The important numerical point is not the size alone. The separate terms in the polynomial are enormous and contain severe subtractive cancellation, especially among the \(y^8\), \(x^2y^6\), \(x^4y^2\), and \(x^2y^4\) contributions. Different precisions may round these huge intermediate quantities differently and therefore produce incompatible final values.
Conclusion
Rump-type examples demonstrate that algebraically equivalent formulas can be computationally dangerous when enormous terms cancel.
Let $A \in \mathbb{R}^{n \times n}$ be a symmetric positive definite matrix, and let $\widehat{L}$ be the computed lower Cholesky factor obtained in floating-point arithmetic. Prove Demmel's normwise backward error theorem for Cholesky factorization:
Explain why this implies that Cholesky factorization is unconditionally backward stable without requiring any pivoting strategies.
Worked solution Show / hide solution to Exercise 1.41
Detailed solution
Cholesky factorization updates entries by inner products and square roots. Standard roundoff analysis shows that the computed factor \(\widehat L\) satisfies
\[ \widehat L\widehat L^T=A+\Delta A, \qquad \|\Delta A\|_2\le C_n u\|A\|_2. \]With the chapter notation, this is written as
\[ \|\Delta A\|_2\le \gamma_{n+1}\|A\|_2 \]up to constants depending on the detailed implementation. Since the perturbation is on the original matrix, the method is backward stable.
Conclusion
No pivoting is required for symmetric positive definite matrices because all pivots are positive and the computed factor solves a nearby SPD problem.
Let $V \in \mathbb{R}^{n \times n}$ be a Vandermonde matrix defined by distinct real nodes $x_1, x_2, \ldots, x_n$. Prove Gautschi's theorem: if the nodes $x_j$ are positive, then the condition number $\kappa_\infty(V)$ grows at least exponentially with $n$, satisfying $\kappa_\infty(V) \ge 2^{n-1}$ regardless of the location of the nodes. What does this reveal about fitting high-degree polynomials using a monomial basis?
Worked solution Show / hide solution to Exercise 1.42
Detailed solution
The inverse of a Vandermonde matrix contains coefficients of Lagrange cardinal polynomials. For positive nodes, these coefficients grow rapidly because the monomial basis is poorly scaled and the products
\[ \prod_{k\ne j}(x_j-x_k) \]can become very small relative to the coefficient sizes. Gautschi's estimate gives
\[ \kappa_\infty(V)\ge 2^{n-1}. \]Therefore the sensitivity grows at least exponentially with the degree.
Conclusion
High-degree polynomial fitting in the monomial basis is numerically unsafe; orthogonal bases or barycentric formulations are preferred.
Strassen's algorithm computes matrix multiplication \(C=AB\) for \(n\times n\) matrices using fewer scalar multiplications by performing linear combinations of submatrices.
- Prove that Strassen's algorithm satisfies a normwise error bound of the form
\[ \|\widehat{C}-C\|_F\le c_n u\|A\|_F\|B\|_F, \]where \(c_n=O(n^{\log_2 7})\).
- Explain why this bound is fundamentally weaker than the strict componentwise backward-error bound \(|\Delta A|\le\gamma_n |A|\) associated with standard matrix multiplication.
Worked solution Show / hide solution to Exercise 1.43
Detailed solution
Strassen's method forms many sums and differences of submatrices before performing seven recursive products. Let \(E(n)\) denote the Frobenius-norm error for size \(n\). A typical recursive estimate has the form
\[ E(n)\le 7E(n/2)+C u\,n^\alpha\|A\|_F\|B\|_F, \]where the second term accounts for the additions and subtractions used to build the Strassen blocks. Solving the recurrence gives
\[ E(n)\le c_n u\|A\|_F\|B\|_F, \qquad c_n=O(n^{\log_2 7}). \]This is a normwise bound. It is weaker than the componentwise backward stability of classical multiplication because Strassen's intermediate additions mix entries and signs. The resulting perturbation cannot generally be written as
\[ |\Delta A|\le\gamma_n |A| \]entry by entry.
Conclusion
Strassen gains arithmetic speed but sacrifices the clean componentwise stability structure of standard matrix multiplication.