Appendix C: Numerical Analysis Review

From Exact to Approximate

Most statistical computations cannot be done in closed form. Maximum likelihood requires solving nonlinear score equations. Bayesian posteriors require intractable integrals. Confidence intervals require finding roots of profile likelihood functions. Numerical analysis provides the algorithms that turn these mathematical expressions into numbers — and, crucially, the error bounds and convergence guarantees that tell us when to trust those numbers.

This appendix develops the numerical foundations that the rest of the course assumes. We explain why the standard algorithms work (convergence theory, contraction mappings), when they fail (floating-point arithmetic, saddle points, ill-conditioning), and how fast they converge (linear vs quadratic rates). The specific applications of these tools — Newton-Raphson for MLE in Ch3.2, IRLS for GLMs in Ch3.5, quadrature vs Monte Carlo in Ch2.1 — live in the chapters. This appendix provides the engine under the hood.

Road Map 🧭

Section 1 (Floating-Point): IEEE 754 representation, machine epsilon, catastrophic cancellation, log-space arithmetic — the source of most “mysterious” numerical errors
Section 2 (Root Finding): Fixed-point iteration, contraction mapping theorem, bisection, Newton’s method — the framework that unifies all iterative algorithms
Section 3 (Optimization): Gradient descent, Newton’s method for optimization, quasi-Newton — why some algorithms converge in 5 steps and others need 500
Section 4 (Differentiation): Finite differences, optimal step size, the complex-step method — computing derivatives when formulas are unavailable
Section 5 (Integration): Quadrature error theory, adaptive quadrature, Monte Carlo — choosing the right tool for numerical integrals
Section 6 (Convergence): Convergence orders, stopping criteria, diagnosing failure — knowing when to stop and when something has gone wrong

Prerequisites Check

This appendix assumes familiarity with:

Single-variable calculus: Taylor’s theorem and basic derivatives (Appendix A)
Linear algebra basics: Eigenvalues, matrix inverse, solving linear systems (Appendix B)
Python/NumPy: Basic array operations (Appendix F)

Connection to Course Material

Chapter 2 (Monte Carlo): Quadrature vs Monte Carlo convergence rates (Ch2.1), Brent’s method for inverse CDF (Ch2.3), logsumexp for importance weights (Ch2.6)
Chapter 3 (Parametric Inference): Newton-Raphson and Fisher scoring for MLE (Ch3.2), numerical Hessians for standard errors (Ch3.3), IRLS for GLMs (Ch3.5)
Chapter 5 (Bayesian): HMC leapfrog integrator step size (Ch5.5)
Appendix B: Condition numbers and numerical stability of matrix computations

Floating-Point Arithmetic

IEEE 754 Basics

Definition

A floating-point number in IEEE 754 double precision (float64) is stored as \((-1)^s \times 1.m \times 2^{e-1023}\), where \(s\) is a sign bit, \(m\) is a 52-bit mantissa (significand), and \(e\) is an 11-bit exponent. This gives approximately 15–16 significant decimal digits.

Definition

Machine epsilon \(\varepsilon_{\text{mach}}\) is the smallest \(\varepsilon > 0\) such that \(\text{fl}(1 + \varepsilon) \ne 1\) in floating-point arithmetic. For float64, \(\varepsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}\).

The key consequence: the spacing between consecutive floating-point numbers is not uniform. Near \(x\), the gap to the next representable float is approximately \(|x| \cdot \varepsilon_{\text{mach}}\). This means relative precision is roughly constant (~16 digits), but absolute precision varies enormously — fine near zero, coarse for large numbers.

import numpy as np

info = np.finfo(float)
print("IEEE 754 double precision (float64)")
print(f"{'Property':>25}  {'Value':>25}")
print("-" * 54)
print(f"{'Machine epsilon':>25}  {info.eps:>25.2e}")
print(f"{'Smallest normal':>25}  {info.tiny:>25.2e}")
print(f"{'Largest finite':>25}  {info.max:>25.2e}")
print(f"{'Mantissa bits':>25}  {info.nmant:>25d}")
print()
print("Float spacing at different magnitudes:")
print(f"{'x':>15}  {'np.spacing(x)':>20}  "
      f"{'x · ε_mach':>20}")
print("-" * 58)
for x in [1e-10, 1e-5, 1.0, 1e5, 1e10, 1e15]:
    print(f"{x:>15.0e}  {np.spacing(x):>20.2e}  "
          f"{x * info.eps:>20.2e}")

IEEE 754 double precision (float64)
                 Property                      Value
------------------------------------------------------
          Machine epsilon                   2.22e-16
          Smallest normal                  2.23e-308
           Largest finite                  1.80e+308
            Mantissa bits                         52

Float spacing at different magnitudes:
              x         np.spacing(x)            x · ε_mach
----------------------------------------------------------
          1e-10              1.29e-26              2.22e-26
          1e-05              1.69e-21              2.22e-21
          1e+00              2.22e-16              2.22e-16
          1e+05              1.46e-11              2.22e-11
          1e+10              1.91e-06              2.22e-06
          1e+15              1.25e-01              2.22e-01

At \(x = 10^{15}\), the gap between consecutive floats is 0.125 — integers above \(2^{53} \approx 9 \times 10^{15}\) cannot be represented exactly. At \(x = 1\), the gap is \(\varepsilon_{\text{mach}}\) itself.

Catastrophic Cancellation

Definition

Catastrophic cancellation occurs when subtracting two nearly equal floating-point numbers. The leading significant digits cancel, leaving only the less significant (and potentially inaccurate) trailing digits. The result has far fewer correct digits than either operand.

Worked Example: The Quadratic Formula. Consider \(x^2 - Bx + 1 = 0\) with \(B\) large. The roots are \((B \pm \sqrt{B^2 - 4})/2\). The larger root \((B + \sqrt{B^2 - 4})/2 \approx B\) is computed accurately. The smaller root \((B - \sqrt{B^2 - 4})/2\) subtracts two nearly equal numbers — \(B\) and \(\sqrt{B^2 - 4}\) agree in their leading digits when \(B \gg 2\).

The fix: use the algebraically equivalent form \(2/(B + \sqrt{B^2 - 4})\), which involves no cancellation.

import numpy as np

print("Quadratic formula: x² − Bx + 1 = 0")
print(f"{'B':>12}  {'Naive root':>18}  {'Stable root':>18}"
      f"  {'Rel error':>12}")
print("-" * 66)
for B in [1e1, 1e2, 1e3, 1e4, 1e6, 1e8]:
    disc = B**2 - 4
    sqrt_d = np.sqrt(disc)
    naive = (B - sqrt_d) / 2
    stable = 2.0 / (B + sqrt_d)
    rel_err = abs(naive - stable) / stable
    print(f"{B:>12.0e}  {naive:>18.15f}  "
          f"{stable:>18.15f}  {rel_err:>12.2e}")

Quadratic formula: x² − Bx + 1 = 0
           B          Naive root         Stable root   Rel error
------------------------------------------------------------------
       1e+01   0.101020514433644   0.101020514433644      4.26e-15
       1e+02   0.010001000200049   0.010001000200050      1.22e-13
       1e+03   0.001000001000023   0.001000001000002      2.06e-11
       1e+04   0.000100000001112   0.000100000001000      1.12e-09
       1e+06   0.000001000007614   0.000001000000000      7.61e-06
       1e+08   0.000000007450581   0.000000010000000      2.55e-01

At \(B = 10^8\), the naive formula loses 25% of the answer to cancellation — a catastrophic failure. The stable formula is exact to machine precision at every \(B\). The general principle: whenever a formula subtracts nearly equal quantities, look for an algebraically equivalent form that avoids the subtraction.

For a full worked example of catastrophic cancellation in variance computation and its fix via Welford’s algorithm, see Ch2.1.

Log-Space Arithmetic

In likelihood-based inference, products of many small probabilities routinely underflow to zero, and exponentials of large arguments overflow to infinity. The standard solution: work in log space.

Products become sums: \(\log \prod f_i = \sum \log f_i\)
Ratios become differences: \(\log(a/b) = \log a - \log b\)
No overflow/underflow: \(\log L(\theta)\) stays finite even when \(L(\theta)\) is astronomically small

For computing \(\log(\sum \exp(x_i))\) — the log-normalizing constant for importance weights and posterior distributions — the logsumexp trick is essential. The full treatment is in Ch2.6; SciPy provides scipy.special.logsumexp.

Two specialized functions avoid cancellation near zero:

np.log1p(x) computes \(\log(1+x)\) accurately for small \(x\), avoiding \(1 + x \approx 1\)
np.expm1(x) computes \(e^x - 1\) accurately for small \(x\), avoiding \(e^x \approx 1\)

import numpy as np

print("Log-space arithmetic near zero")
print(f"{'x':>12}  {'log(1+x)':>20}  {'log1p(x)':>20}  "
      f"{'Rel diff':>12}")
print("-" * 70)
for x in [1e-1, 1e-5, 1e-10, 1e-15, 1e-16]:
    naive = np.log(1 + x)
    stable = np.log1p(x)
    rel = abs(naive - stable) / abs(stable) if stable != 0 else 0
    print(f"{x:>12.0e}  {naive:>20.16f}  "
          f"{stable:>20.16f}  {rel:>12.2e}")

print()
print(f"{'x':>12}  {'exp(x)-1':>20}  {'expm1(x)':>20}  "
      f"{'Rel diff':>12}")
print("-" * 70)
for x in [1e-1, 1e-5, 1e-10, 1e-15, 1e-16]:
    naive = np.exp(x) - 1
    stable = np.expm1(x)
    rel = abs(naive - stable) / abs(stable) if stable != 0 else 0
    print(f"{x:>12.0e}  {naive:>20.16e}  "
          f"{stable:>20.16e}  {rel:>12.2e}")

Log-space arithmetic near zero
           x              log(1+x)              log1p(x)      Rel diff
----------------------------------------------------------------------
       1e-01    0.0953101798043249    0.0953101798043249      7.28e-16
       1e-05    0.0000099999500004    0.0000099999500003      6.55e-12
       1e-10    0.0000000001000000    0.0000000001000000      8.27e-08
       1e-15    0.0000000000000011    0.0000000000000010      1.10e-01
       1e-16    0.0000000000000000    0.0000000000000001      1.00e+00

           x              exp(x)-1              expm1(x)      Rel diff
----------------------------------------------------------------------
       1e-01  1.0517091807564771e-01  1.0517091807564763e-01      7.92e-16
       1e-05  1.0000050000069649e-05  1.0000050000166668e-05      9.70e-12
       1e-10  1.0000000827403710e-10  1.0000000000500000e-10      8.27e-08
       1e-15  1.1102230246251565e-15  1.0000000000000007e-15      1.10e-01
       1e-16  0.0000000000000000e+00  9.9999999999999998e-17      1.00e+00

At \(x = 10^{-16}\), log(1 + x) returns exactly 0 (total loss of the input), while log1p gives the correct answer to 16 digits. Similarly, exp(x) - 1 returns exactly 0, while expm1 preserves the information. These functions are essential whenever likelihood ratios or probability differences involve values near 1.

float64 spacing growing linearly with magnitude, and catastrophic cancellation in the quadratic formula where the smaller root's relative error rises to about 25 percent at B = 1e8. — Fig. 262 **Figure C.1: Floating-Point Arithmetic.** (a) The spacing between consecutive `float64` numbers scales linearly with magnitude — the line \(x \cdot \varepsilon_{\text{mach}}\) is an excellent approximation. (b) Catastrophic cancellation in the quadratic formula \(x^2 - Bx + 1 = 0\): the naive formula for the smaller root loses digits proportionally to \(B\), reaching 25% relative error at \(B = 10^8\).

With the arithmetic substrate understood, we now turn to the iterative algorithms that operate on it. The floating-point issues from Section 1 — cancellation, roundoff, overflow — will resurface in convergence checks and stopping criteria throughout.

Root Finding

Fixed-Point Iteration and Contraction Mapping

Definition

A fixed point of a function \(g\) is a value \(x^*\) such that \(g(x^*) = x^*\).

Many algorithms reduce to fixed-point iteration: given \(x_k\), compute \(x_{k+1} = g(x_k)\) and repeat. Newton’s method, gradient descent, and IRLS all have this structure — they differ only in the choice of \(g\).

Definition

A function \(g\) is a contraction mapping on an interval \([a, b]\) if there exists \(L < 1\) such that \(|g(x) - g(y)| \le L|x - y|\) for all \(x, y \in [a, b]\). The constant \(L\) is the Lipschitz constant (or contraction rate).

Theorem: Banach Fixed-Point Theorem

If \(g: [a, b] \to [a, b]\) is a contraction mapping with Lipschitz constant \(L < 1\), then:

\(g\) has a unique fixed point \(x^*\) in \([a, b]\)
For any starting point \(x_0 \in [a, b]\), the iterates \(x_{k+1} = g(x_k)\) converge to \(x^*\)
The error satisfies \(|x_k - x^*| \le L^k |x_0 - x^*|\) — linear convergence with rate \(L\)

Why this matters. Every iterative algorithm in this course is a fixed-point iteration. The contraction condition \(L < 1\) is what guarantees convergence. For gradient descent, the contraction rate depends on the step size and the curvature of the objective. For Newton’s method, the contraction rate depends on the distance to the root. Understanding this framework explains why some algorithms converge and others diverge.

import numpy as np

print("Fixed-point iteration: x = cos(x)")
print(f"  True fixed point: x* = 0.739085...")
print()

x = 0.0
x_star = 0.7390851332151607
print(f"{'k':>4}  {'x_k':>18}  {'|x_k - x*|':>14}  "
      f"{'Ratio':>10}")
print("-" * 52)
prev_err = abs(x - x_star)
for k in range(15):
    err = abs(x - x_star)
    if k > 0:
        ratio = err / prev_err
        print(f"{k:>4}  {x:>18.15f}  {err:>14.2e}  "
              f"{ratio:>10.4f}")
    else:
        print(f"{k:>4}  {x:>18.15f}  {err:>14.2e}  "
              f"{'—':>10}")
    prev_err = err
    x = np.cos(x)

print(f"\nContraction constant L ≈ |sin(x*)| = "
      f"{abs(np.sin(x_star)):.6f}")
print(f"Observed ratio converges to L ✓")

Fixed-point iteration: x = cos(x)
  True fixed point: x* = 0.739085...

   k                 x_k      |x_k - x*|       Ratio
----------------------------------------------------
 0.000000000000000        7.39e-01           —
 1.000000000000000        2.61e-01      0.3530
 0.540302305868140        1.99e-01      0.7619
 0.857553215846393        1.18e-01      0.5960
 0.654289790497779        8.48e-02      0.7158
 0.793480358742566        5.44e-02      0.6415
 0.701368773622757        3.77e-02      0.6934
 0.763959682900654        2.49e-02      0.6595
 0.722102425026708        1.70e-02      0.6827
 0.750417761763761        1.13e-02      0.6673
 0.731404042422510        7.68e-03      0.6778
 0.744237354900557        5.15e-03      0.6708
 0.735604740436347        3.48e-03      0.6755
 0.741425086610109        2.34e-03      0.6723
 0.737506890513243        1.58e-03      0.6745

Contraction constant L ≈ |sin(x*)| = 0.673612
Observed ratio converges to L ✓

The error ratio converges to \(L \approx 0.674\) — the contraction constant. Each iteration reduces the error by roughly 33%, so about 6 iterations are needed per additional decimal digit. This is linear convergence, and for smooth \(g\), the contraction constant is \(L = |g'(x^*)|\).

Bisection

The simplest root-finding algorithm: given \(f(a) < 0 < f(b)\), the intermediate value theorem guarantees a root in \([a, b]\). Bisection halves the interval at each step, choosing the half that still brackets the root.

Theorem: Bisection Convergence

After \(k\) bisection steps, the error satisfies \(|x_k - x^*| \le (b - a) / 2^k\). This is linear convergence with rate \(r = 1/2\), requiring about 3.3 iterations per decimal digit. Bisection requires no derivatives and no smoothness — only that \(f\) is continuous and changes sign.

The guarantee of convergence makes bisection a reliable fallback when other methods fail. SciPy’s scipy.optimize.brentq combines bisection with interpolation for faster convergence while retaining the convergence guarantee — see Ch2.3 for its use in inverse CDF computation.

Newton’s Method for Root Finding

Newton’s method finds roots of \(f(x) = 0\) by linearizing \(f\) at each iterate:

\[x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}\]

Geometrically, each step replaces \(f\) by its tangent line and takes the root of the tangent as the next iterate.

Theorem: Quadratic Convergence

If \(f(x^*) = 0\), \(f'(x^*) \ne 0\), and \(f''\) is continuous near \(x^*\), then Newton’s method converges quadratically near \(x^*\):

\[|x_{k+1} - x^*| \le C |x_k - x^*|^2\]

where \(C = |f''(x^*)|/(2|f'(x^*)|)\). Quadratic convergence roughly doubles the number of correct digits per iteration.

When Newton fails:

Multiple roots (\(f'(x^*) = 0\)): convergence slows to linear
Inflection points: the tangent line sends the iterate far away
Cycling: the iterates oscillate without converging
Division by zero: \(f'(x_k) = 0\) at a non-root

Newton’s method applied to optimization (\(f = \ell'\), finding where the score is zero) is the Newton-Raphson algorithm developed in Ch3.2.

import numpy as np

f = lambda x: x**3 - 2
fp = lambda x: 3 * x**2
root = 2.0**(1.0/3.0)

# Bisection
a, b = 1.0, 2.0
bisect_errs = []
for i in range(20):
    mid = (a + b) / 2.0
    bisect_errs.append(abs(mid - root))
    if f(mid) * f(a) < 0:
        b = mid
    else:
        a = mid

# Newton
x = 2.0
newton_errs = [abs(x - root)]
for i in range(7):
    x = x - f(x) / fp(x)
    newton_errs.append(abs(x - root))

print("Root finding: f(x) = x³ − 2,  x* = ∛2 = "
      f"{root:.15f}")
print()
print(f"{'Iter':>4}  {'Bisection |err|':>18}  "
      f"{'Newton |err|':>18}")
print("-" * 44)
for i in range(max(len(bisect_errs), len(newton_errs))):
    b_str = (f"{bisect_errs[i]:>18.2e}"
             if i < len(bisect_errs) else f"{'':>18}")
    n_str = (f"{newton_errs[i]:>18.2e}"
             if i < len(newton_errs)
             else f"{'converged':>18}")
    print(f"{i:>4}  {b_str}  {n_str}")

Root finding: f(x) = x³ − 2,  x* = ∛2 = 1.259921049894873

Iter     Bisection |err|        Newton |err|
--------------------------------------------
          2.40e-01            7.40e-01
          9.92e-03            2.40e-01
          1.15e-01            3.64e-02
          5.26e-02            1.01e-03
          2.13e-02            8.11e-07
          5.70e-03            5.22e-13
          2.11e-03            0.00e+00
          1.80e-03            0.00e+00
          1.55e-04           converged
          8.21e-04           converged
          3.33e-04           converged
          8.87e-05           converged
          3.34e-05           converged
          2.77e-05           converged
          2.84e-06           converged
          1.24e-05           converged
          4.79e-06           converged
          9.78e-07           converged
          9.30e-07           converged
          2.40e-08           converged

Newton reaches machine precision in 6 iterations; bisection needs 20 iterations to reach \(10^{-8}\). The contrast illustrates the power of quadratic vs linear convergence. But bisection always works — Newton requires a good starting point and a well-behaved function.

Root-finding for the cube root of 2: Newton's method with tangent lines converging in a few steps while bisection slowly narrows the bracket, and a log-scale error plot showing Newton's quadratic rate versus bisection's linear rate. — Fig. 263 **Figure C.2: Root-Finding Convergence.** (a) Finding \(\sqrt[3]{2}\) as a root of \(f(x) = x^3 - 2\). Newton’s method (colored circles with tangent lines) converges rapidly from \(x_0 = 2\); bisection (blue squares) narrows the interval steadily. (b) Error on a log scale: bisection decreases linearly (constant slope), while Newton’s error plummets quadratically (slope doubles each step).

Optimization is root finding in disguise: minimizing \(f(\theta)\) means finding where the score \(\nabla f = 0\). The contraction mapping framework from above carries over directly — the key new question is how the problem’s conditioning affects convergence speed.

Numerical Optimization

Optimization as Fixed-Point Iteration

Minimizing \(f(\theta)\) by gradient descent produces the iteration \(\theta_{k+1} = \theta_k - \alpha \nabla f(\theta_k)\), which is a fixed-point iteration with \(g(\theta) = \theta - \alpha \nabla f(\theta)\). The fixed points of \(g\) are exactly the critical points of \(f\) (where \(\nabla f = 0\)).

The contraction condition from the Banach theorem translates directly into step size constraints.

Theorem: Gradient Descent Convergence

If \(f\) is convex with \(L\)-Lipschitz continuous gradient (i.e., \(\|\nabla^2 f\| \le L\) everywhere), then gradient descent with step size \(0 < \alpha < 2/L\) converges: the function value decreases at rate \(f(\theta_k) - f(\theta^*) = O(1/k)\).

If \(f\) is additionally strongly convex with condition number \(\kappa = L/\mu\) (ratio of largest to smallest eigenvalue of the Hessian), then the error converges linearly:

\[\|\theta_k - \theta^*\| \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^k \|\theta_0 - \theta^*\|\]

Gradient Descent

The update \(\theta_{k+1} = \theta_k - \alpha \nabla f(\theta_k)\) moves in the direction of steepest descent. The step size (learning rate) \(\alpha\) controls the trade-off:

Too large: the iterates overshoot and oscillate (or diverge)
Too small: convergence is unnecessarily slow
Optimal for quadratics: \(\alpha = 2/(L + \mu)\), giving rate \((\kappa - 1)/(\kappa + 1)\)

Why ill-conditioned problems converge slowly. When the Hessian eigenvalues span a wide range (large \(\kappa\)), gradient descent zigzags: it makes rapid progress along the well-conditioned directions but creeps along the poorly conditioned ones. For \(\kappa = 100\), the linear convergence rate is \(99/101 \approx 0.98\), requiring about 115 iterations per decimal digit. This motivates Newton’s method, which is immune to conditioning.

For step size strategies beyond fixed \(\alpha\) — diminishing sequences, line search with the Armijo condition, trust regions — see the practical safeguards in Ch3.2.

Newton’s Method for Optimization

Newton’s method for minimizing \(f\) solves the score equation \(\nabla f = 0\) using the update:

\[\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \mathbf{H}^{-1}(\boldsymbol{\theta}_k) \nabla f(\boldsymbol{\theta}_k)\]

where \(\mathbf{H}\) is the Hessian. Each step minimizes the local quadratic Taylor approximation — see Appendix A for the connection between second-order Taylor and the quadratic likelihood.

Theorem: Quadratic Convergence of Newton for Optimization

Under regularity conditions (Hessian Lipschitz continuous, positive definite at the minimizer), Newton’s method converges quadratically near the minimizer: the number of correct digits roughly doubles per iteration.

Affine invariance. Newton’s method is invariant to affine reparameterizations: if \(\phi = A\theta + b\) for a nonsingular \(A\), Newton on \(f(\theta)\) and Newton on \(f(A^{-1}(\phi - b))\) produce the same iterates (in transformed coordinates). This means Newton is unaffected by poor conditioning — unlike gradient descent. The cost is computing and solving with the Hessian at each step.

For the full statistical treatment of Newton’s method for MLE — including Fisher scoring (replacing the observed Hessian with expected information) and IRLS (Fisher scoring for GLMs) — see Ch3.2 and Ch3.5.

import numpy as np
from scipy.special import gammaln, digamma, polygamma

rng = np.random.default_rng(42)
n = 50
data = rng.gamma(4.0, 1.0/2.0, n)

def neg_ll(alpha, beta):
    return -(n*alpha*np.log(beta) - n*gammaln(alpha)
             + (alpha-1)*np.sum(np.log(data))
             - beta*np.sum(data))

def grad_neg(alpha, beta):
    g_a = -(n*np.log(beta) - n*digamma(alpha)
            + np.sum(np.log(data)))
    g_b = -(n*alpha/beta - np.sum(data))
    return np.array([g_a, g_b])

def hess_neg(alpha, beta):
    h11 = n * polygamma(1, alpha)
    h12 = -n / beta
    h22 = n * alpha / beta**2
    return np.array([[h11, h12], [h12, h22]])

# Newton (find MLE and track convergence)
a_nw, b_nw = 2.0, 1.0
nw_path = [(a_nw, b_nw)]
for _ in range(20):
    g = grad_neg(a_nw, b_nw)
    H = hess_neg(a_nw, b_nw)
    step = np.linalg.solve(H, g)
    a_nw -= step[0]; b_nw -= step[1]
    nw_path.append((a_nw, b_nw))
    if np.linalg.norm(step) < 1e-14:
        break
a_star, b_star = a_nw, b_nw

# Gradient descent
H_mle = hess_neg(a_star, b_star)
L = max(np.linalg.eigvalsh(H_mle))
kappa = L / min(np.linalg.eigvalsh(H_mle))
lr = 1.0 / L
a_gd, b_gd = 2.0, 1.0
gd_path = [(a_gd, b_gd)]
for _ in range(200):
    g = grad_neg(a_gd, b_gd)
    a_gd -= lr * g[0]; b_gd -= lr * g[1]
    a_gd = max(a_gd, 0.01); b_gd = max(b_gd, 0.01)
    gd_path.append((a_gd, b_gd))

dist = lambda p: np.sqrt(
    (p[0]-a_star)**2 + (p[1]-b_star)**2)
print(f"Gamma MLE (n={n}): α̂ = {a_star:.4f}, "
      f"β̂ = {b_star:.4f}")
print(f"Condition number κ = {kappa:.1f}")
print(f"GD step size α = 1/L = {lr:.4f}")
print()
print(f"{'Iter':>4}  {'GD ‖θ−θ*‖':>14}  "
      f"{'Newton ‖θ−θ*‖':>14}")
print("-" * 36)
for k in [0, 1, 2, 3, 4, 5, 10, 25, 50, 100, 200]:
    gd_e = (f"{dist(gd_path[k]):>14.6f}"
            if k < len(gd_path) else "")
    nw_e = (f"{dist(nw_path[k]):>14.2e}"
            if k < len(nw_path)
            else f"{'converged':>14}")
    print(f"{k:>4}  {gd_e}  {nw_e}")

Gamma MLE (n=50): α̂ = 6.6685, β̂ = 3.5977
Condition number κ = 73.3
GD step size α = 1/L = 0.0299

Iter       GD ‖θ−θ*‖   Newton ‖θ−θ*‖
------------------------------------
   0        5.342546        5.34e+00
   1        5.083337        3.84e+00
   2        4.862029        1.98e+00
   3        4.664648        5.31e-01
   4        4.452428        3.80e-02
   5        4.283482        1.95e-04
  10        3.013962       converged
  25        2.173839       converged
  50        1.384307       converged
 100        0.628403       converged
 200        0.149146       converged

Newton converges to the MLE in 5 iterations (error \(< 10^{-4}\)) while gradient descent still has error 0.15 after 200 iterations. The condition number \(\kappa = 73.3\) explains the slow GD convergence: the theoretical linear rate is \((73.3 - 1)/(73.3 + 1) = 0.973\), requiring about \(\log(0.1)/\log(0.973) \approx 84\) iterations per decimal digit.

Contour plot of the negative Gamma log-likelihood where gradient descent zigzags through the elongated contours while Newton's method steps almost straight to the MLE, with an error-versus-iteration plot showing Newton reaching machine precision in about six steps. — Fig. 264 **Figure C.3: Gradient Descent vs Newton.** (a) Contour plot of the negative Gamma log-likelihood. Gradient descent (red) zigzags through the elongated contours; Newton (green) cuts directly to the MLE (star). (b) Error vs iteration: Newton reaches machine precision in ~6 steps (quadratic convergence), while GD converges linearly with a slow rate set by the condition number.

Quasi-Newton Methods

Newton’s method requires computing and factoring the \(p \times p\) Hessian at every step — expensive for large \(p\). Quasi-Newton methods approximate the Hessian using only gradient information, updating the approximation incrementally.

BFGS (Broyden-Fletcher-Goldfarb-Shanno) maintains a dense \(p \times p\) approximate Hessian, achieving superlinear convergence without exact second derivatives. L-BFGS-B (limited-memory BFGS with bounds) stores only the last \(m\) gradient differences (typically \(m = 10\)), reducing memory from \(O(p^2)\) to \(O(mp)\) and supporting simple box constraints \(\ell_j \le \theta_j \le u_j\).

In SciPy, scipy.optimize.minimize(method='L-BFGS-B') is the standard workhorse for unconstrained and box-constrained optimization — it is used throughout Ch3.2 for production MLE computation.

The tradeoff is clear: more derivative information means fewer evaluations.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
data = rng.normal(5, 3, 200)

# Parameterize as (mu, log_sigma2) for unconstrained opt
def neg_loglik(params):
    mu, log_s2 = params
    s2 = np.exp(log_s2)
    n = len(data)
    return (n/2*np.log(2*np.pi*s2)
            + np.sum((data - mu)**2) / (2*s2))

def grad_neg(params):
    mu, log_s2 = params
    s2 = np.exp(log_s2)
    n = len(data)
    g_mu = -np.sum(data - mu) / s2
    g_ls2 = n/2 - np.sum((data - mu)**2) / (2*s2)
    return np.array([g_mu, g_ls2])

def hess_neg(params):
    mu, log_s2 = params
    s2 = np.exp(log_s2)
    n = len(data)
    H11 = n / s2
    H12 = np.sum(data - mu) / s2
    H22 = np.sum((data - mu)**2) / (2*s2)
    return np.array([[H11, H12], [H12, H22]])

x0 = [0.0, 0.0]  # start far from MLE

res_ncg = minimize(neg_loglik, x0, method='Newton-CG',
                   jac=grad_neg, hess=hess_neg)
res_lbfgs = minimize(neg_loglik, x0, method='L-BFGS-B',
                     jac=grad_neg)
res_nm = minimize(neg_loglik, x0, method='Nelder-Mead')

mu_hat = np.mean(data)
s2_hat = np.var(data)

print("Method comparison for Normal MLE (n=200):")
print(f"True MLE: mu={mu_hat:.4f}, "
      f"sigma2={s2_hat:.4f}")
print()
print(f"{'Method':>12}  {'Derivs':>8}  {'Iters':>6}  "
      f"{'f-evals':>8}  {'mu':>10}  {'sigma2':>10}")
print("-" * 65)
for name, res, derivs in [
    ('Newton-CG', res_ncg, 'grad+H'),
    ('L-BFGS-B', res_lbfgs, 'grad'),
    ('Nelder-Mead', res_nm, 'none')]:
    mu = res.x[0]
    s2 = np.exp(res.x[1])
    print(f"{name:>12}  {derivs:>8}  {res.nit:>6}  "
          f"{res.nfev:>8}  {mu:>10.4f}  {s2:>10.4f}")

Method comparison for Normal MLE (n=200):
True MLE: mu=4.9086, sigma2=6.9653

      Method    Derivs   Iters   f-evals          mu     sigma2
-----------------------------------------------------------------
   Newton-CG    grad+H      12        14      4.9086      6.9653
    L-BFGS-B      grad      15        16      4.9086      6.9652
 Nelder-Mead      none      82       156      4.9087      6.9652

Newton-CG converges fastest with access to both gradient and Hessian; L-BFGS-B is nearly as fast using only gradients (it approximates the Hessian internally); Nelder-Mead needs 10× more function evaluations since it must infer geometry from function values alone.

All of the optimization methods above require gradients, and Newton additionally requires the Hessian. When analytical derivatives are unavailable or impractical, we need to compute them numerically — and the choice of step size turns out to be a surprisingly delicate problem.

Numerical Differentiation

Forward and Central Differences

When analytical derivatives are unavailable, we approximate \(f'(x)\) from function evaluations.

Definition

The forward difference approximation is \(f'(x) \approx \frac{f(x+h) - f(x)}{h}\), with error \(O(h)\) from the first-order Taylor remainder.

Definition

The central difference approximation is \(f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}\), with error \(O(h^2)\). The extra order comes from the cancellation of odd-powered error terms when subtracting symmetric evaluations.

The derivations follow directly from Taylor’s theorem (Appendix A). For the forward difference: \(f(x+h) = f(x) + hf'(x) + \frac{h^2}{2}f''(c)\), so \(\frac{f(x+h) - f(x)}{h} = f'(x) + O(h)\). For the central difference: the odd terms cancel, leaving \(O(h^2)\).

The Step Size Dilemma

Smaller \(h\) should give more accurate derivatives — but floating-point arithmetic introduces a competing error source.

Theorem: Truncation-Roundoff Tradeoff

The total error in a finite-difference approximation has two components:

Truncation error (from the Taylor remainder): \(O(h)\) for forward, \(O(h^2)\) for central
Roundoff error (from subtracting nearly equal floats): \(O(\varepsilon_{\text{mach}} / h)\)

Minimizing the total error gives optimal step sizes:

Forward difference: \(h^* \sim \sqrt{\varepsilon_{\text{mach}}} \approx 1.5 \times 10^{-8}\)
Central difference: \(h^* \sim \varepsilon_{\text{mach}}^{1/3} \approx 6 \times 10^{-6}\)

This explains why epsilon=1e-5 appears as the default step size for numerical Hessians in Ch3.3 — it is near-optimal for central differences in float64.

import numpy as np

x0 = 1.0
true_deriv = np.exp(x0)

print("Numerical differentiation of f(x) = eˣ "
      "at x = 1")
print(f"True f′(1) = e = {true_deriv:.15f}")
print()
print(f"{'h':>12}  {'Forward err':>14}  "
      f"{'Central err':>14}")
print("-" * 44)
for p in range(1, 17):
    h = 10.0**(-p)
    fwd = (np.exp(x0 + h) - np.exp(x0)) / h
    ctr = (np.exp(x0 + h) - np.exp(x0 - h)) / (2*h)
    fwd_err = abs(fwd - true_deriv)
    ctr_err = abs(ctr - true_deriv)
    print(f"{h:>12.0e}  {fwd_err:>14.2e}  "
          f"{ctr_err:>14.2e}")

print(f"\nOptimal h (forward): "
      f"√ε ≈ {np.sqrt(np.finfo(float).eps):.2e}")
print(f"Optimal h (central): "
      f"ε^(1/3) ≈ {np.finfo(float).eps**(1/3):.2e}")

Numerical differentiation of f(x) = eˣ at x = 1
True f′(1) = e = 2.718281828459045

           h     Forward err     Central err
--------------------------------------------
       1e-01        1.41e-01        4.53e-03
       1e-02        1.36e-02        4.53e-05
       1e-03        1.36e-03        4.53e-07
       1e-04        1.36e-04        4.53e-09
       1e-05        1.36e-05        5.86e-11
       1e-06        1.36e-06        1.63e-10
       1e-07        1.40e-07        5.86e-11
       1e-08        6.60e-09        6.60e-09
       1e-09        2.15e-07        6.60e-09
       1e-10        1.55e-06        6.73e-07
       1e-11        3.26e-05        1.04e-05
       1e-12        4.32e-04        2.10e-04
       1e-13        4.56e-04        4.56e-04
       1e-14        9.34e-03        9.34e-03
       1e-15        3.90e-01        1.68e-01
       1e-16        2.72e+00        2.72e+00

Optimal h (forward): √ε ≈ 1.49e-08
Optimal h (central): ε^(1/3) ≈ 6.06e-06

The V-shaped error pattern is unmistakable. For \(h > h^*\), truncation error dominates (error decreases with smaller \(h\)). For \(h < h^*\), roundoff error dominates (error increases with smaller \(h\)). The forward difference achieves a best error of \(\sim 10^{-8}\) at \(h \approx 10^{-8}\); the central difference achieves \(\sim 10^{-11}\) at \(h \approx 10^{-5}\).

The Complex-Step Method

Definition

The complex-step derivative approximates \(f'(x)\) using a purely imaginary perturbation:

\[f'(x) \approx \frac{\text{Im}(f(x + ih))}{h}\]

This formula has \(O(h^2)\) truncation error and no roundoff error from subtraction, because it uses only the imaginary part — no subtraction of real-valued quantities occurs.

Why it works. Taylor-expanding \(f\) in the complex plane: \(f(x + ih) = f(x) + ihf'(x) - h^2 f''(x)/2 - \ldots\). Taking the imaginary part: \(\text{Im}(f(x+ih)) = hf'(x) + O(h^3)\). No cancellation-prone subtraction is involved.

Limitation: the function must accept complex inputs. Functions containing abs, conditional branches, or non-analytic operations will fail or give incorrect results.

import numpy as np

x0 = 1.0
true_deriv = np.exp(x0)

print("Complex-step vs central differences "
      "for f(x) = eˣ")
print(f"{'h':>12}  {'Central err':>14}  "
      f"{'Complex-step err':>18}")
print("-" * 48)
for p in range(1, 17):
    h = 10.0**(-p)
    ctr = (np.exp(x0 + h) - np.exp(x0 - h)) / (2*h)
    cplx = np.imag(np.exp(x0 + 1j*h)) / h
    ctr_err = abs(ctr - true_deriv)
    cplx_err = abs(cplx - true_deriv)
    print(f"{h:>12.0e}  {ctr_err:>14.2e}  "
          f"{cplx_err:>18.2e}")

print(f"\nCentral differences plateau at "
      f"h ≈ 1e-5 to 1e-8")
print(f"Complex-step improves indefinitely "
      f"(no subtraction)")

Complex-step vs central differences for f(x) = eˣ
           h     Central err    Complex-step err
------------------------------------------------
       1e-01        4.53e-03            4.53e-03
       1e-02        4.53e-05            4.53e-05
       1e-03        4.53e-07            4.53e-07
       1e-04        4.53e-09            4.53e-09
       1e-05        5.86e-11            4.53e-11
       1e-06        1.63e-10            4.53e-13
       1e-07        5.86e-11            4.44e-15
       1e-08        6.60e-09            0.00e+00
       1e-09        6.60e-09            0.00e+00
       1e-10        6.73e-07            4.44e-16
       1e-11        1.04e-05            0.00e+00
       1e-12        2.10e-04            0.00e+00
       1e-13        4.56e-04            0.00e+00
       1e-14        9.34e-03            0.00e+00
       1e-15        1.68e-01            0.00e+00
       1e-16        2.72e+00            0.00e+00

Central differences plateau at h ≈ 1e-5 to 1e-8
Complex-step improves indefinitely (no subtraction)

The complex-step method achieves machine-precision accuracy at \(h = 10^{-8}\) and maintains it for arbitrarily smaller \(h\) — while central differences degrade below \(h \approx 10^{-5}\).

Automatic differentiation. Modern probabilistic programming frameworks (PyMC, JAX, PyTorch) use automatic differentiation (AD) rather than finite differences. AD computes exact derivatives (to machine precision) by applying the chain rule to the computational graph of the function. Reverse-mode AD is particularly efficient for gradients of scalar-valued functions — the cost of the gradient equals a small multiple of the cost of \(f\) itself, regardless of the number of parameters. This is what makes gradient-based MCMC (HMC, NUTS) practical in Ch5.5.

The complementary problem to differentiation is integration: computing expectations, normalizing constants, and marginal likelihoods. The error analysis follows a similar pattern — polynomial approximation order determines the convergence rate — but a fundamentally different tool (Monte Carlo) dominates in high dimensions.

Numerical Integration

Error Analysis for Quadrature

The quadrature rules introduced in Ch2.1 have error rates that follow directly from how well their piecewise-polynomial interpolants approximate the integrand.

Theorem: Quadrature Error Rates

For a function \(f\) with \(n\) equally spaced evaluation points on \([a, b]\) with spacing \(h = (b-a)/(n-1)\):

Trapezoidal rule (piecewise linear): error \(= O(h^2)\). Doubling \(n\) reduces error by 4×.
Simpson’s rule (piecewise quadratic): error \(= O(h^4)\). Doubling \(n\) reduces error by 16×.
Gaussian quadrature (optimal polynomial nodes): for analytic integrands, error decreases exponentially in \(n\).

These rates assume sufficient smoothness of \(f\) — each rule achieves its full order only if \(f\) has enough continuous derivatives.

The trapezoidal error rate \(O(h^2)\) comes from the Euler-Maclaurin formula, which relates the trapezoidal sum to the integral plus correction terms involving even powers of \(h\) (derivatives of \(f\) at the endpoints). Simpson’s rule is equivalent to Richardson extrapolation of the trapezoidal rule — the weighted combination \(S = (4T(h) - T(2h))/3\) eliminates the leading \(O(h^2)\) error term, yielding \(O(h^4)\). Gaussian quadrature achieves exponential rates by choosing nodes at roots of orthogonal polynomials, optimizing the interpolation points rather than spacing them uniformly.

We can verify these rates empirically by doubling \(n\) and checking the error ratio:

import numpy as np
from scipy.integrate import simpson

def f(x):
    return np.exp(-x**2)

a, b = 0.0, 2.0
exact = 0.8820813907624215  # from scipy.integrate.quad

print("Quadrature convergence for integral of "
      "exp(-x^2) on [0, 2]")
print(f"Exact value: {exact:.10f}")
print()
print(f"{'n':>6}  {'Trap err':>12}  {'Simp err':>12}  "
      f"{'Trap ratio':>11}  {'Simp ratio':>11}")
print("-" * 60)

prev_trap = prev_simp = None
for n in [5, 9, 17, 33, 65, 129]:
    x = np.linspace(a, b, n)
    y = f(x)
    trap = np.trapz(y, x)
    simp = simpson(y, x=x)
    e_trap = abs(trap - exact)
    e_simp = abs(simp - exact)

    if prev_trap is not None:
        r_trap = prev_trap / e_trap
        r_simp = prev_simp / e_simp
        print(f"{n:>6}  {e_trap:>12.2e}  "
              f"{e_simp:>12.2e}  "
              f"{r_trap:>11.1f}  {r_simp:>11.1f}")
    else:
        print(f"{n:>6}  {e_trap:>12.2e}  "
              f"{e_simp:>12.2e}  "
              f"{'---':>11}  {'---':>11}")
    prev_trap = e_trap
    prev_simp = e_simp

Quadrature convergence for integral of exp(-x^2) on [0, 2]
Exact value: 0.8820813908

     n      Trap err      Simp err   Trap ratio   Simp ratio
------------------------------------------------------------
     5      1.46e-03      2.69e-04          ---          ---
     9      3.78e-04      1.59e-05          3.9         16.9
    17      9.51e-05      9.94e-07          4.0         16.0
    33      2.38e-05      6.21e-08          4.0         16.0
    65      5.96e-06      3.88e-09          4.0         16.0
   129      1.49e-06      2.43e-10          4.0         16.0

The trapezoidal error ratio converges to 4 (\(= 2^2\)), confirming \(O(h^2)\); Simpson’s ratio converges to 16 (\(= 2^4\)), confirming \(O(h^4)\). Each doubling of \(n\) buys about twice as many digits of accuracy with Simpson as with the trapezoidal rule.

For the detailed convergence comparison code and figures, including the crossover dimension where Monte Carlo beats deterministic quadrature, see Ch2.1.

Adaptive Quadrature

SciPy’s scipy.integrate.quad uses adaptive Gaussian quadrature: it estimates the integral and its error on each subinterval, then recursively subdivides intervals where the error estimate is too large. This concentrates function evaluations where the integrand varies rapidly, avoiding wasted evaluations in smooth regions.

import numpy as np
from scipy.integrate import quad

def multi_peak(x):
    """Integrand with features at multiple scales."""
    return (np.exp(-100*(x - 0.3)**2) +
            0.5*np.exp(-1000*(x - 0.7)**2) + 0.1)

result, error, info = quad(multi_peak, 0, 1,
                            full_output=True, limit=100)

print("Adaptive quadrature on multi-scale integrand")
print(f"  I = ∫₀¹ [exp(-100(x−0.3)²) + "
      f"0.5·exp(-1000(x−0.7)²) + 0.1] dx")
print(f"  Result = {result:.10f}")
print(f"  Estimated error = {error:.2e}")
print(f"  Function evaluations: {info['neval']}")
print(f"  Subintervals: {info['last']}")
print()

for n_pts in [21, 101, 1001]:
    x = np.linspace(0, 1, n_pts)
    trap = np.trapz(multi_peak(x), x)
    print(f"  Trapezoidal (n={n_pts:>4}): "
          f"{trap:.10f}  "
          f"error = {abs(trap - result):.2e}")

Adaptive quadrature on multi-scale integrand
  I = ∫₀¹ [exp(-100(x−0.3)²) + 0.5·exp(-1000(x−0.7)²) + 0.1] dx
  Result = 0.3052683835
  Estimated error = 2.15e-09
  Function evaluations: 231
  Subintervals: 6

  Trapezoidal (n=  21): 0.3063485748  error = 1.08e-03
  Trapezoidal (n= 101): 0.3052683221  error = 6.14e-08
  Trapezoidal (n=1001): 0.3052683828  error = 6.17e-10

Adaptive quadrature achieves \(10^{-9}\) accuracy with 231 function evaluations. The trapezoidal rule needs 1001 points for comparable accuracy — and it doesn’t provide an error estimate. For a single integral with potentially sharp features, quad is almost always the right tool.

Monte Carlo Error Rate

Theorem: Monte Carlo Convergence

For \(X_1, \ldots, X_n \sim f\) with \(\text{Var}(g(X)) < \infty\):

\[\frac{1}{n}\sum_{i=1}^n g(X_i) \to \mathbb{E}[g(X)] \quad \text{at rate } O(n^{-1/2})\]

The error is dimension-independent: the same \(O(n^{-1/2})\) rate holds whether \(X\) is scalar or 1000-dimensional. Quadrature on a \(d\)-dimensional grid with \(m\) points per axis requires \(m^d\) evaluations, while Monte Carlo uses \(n\) evaluations regardless of \(d\).

This curse of dimensionality is why Monte Carlo dominates in high dimensions: the crossover typically occurs at \(d \approx 3\text{--}5\) dimensions. Below that, quadrature’s faster polynomial rate wins; above it, the exponential growth in grid points makes quadrature impractical. The detailed analysis and figures are in Ch2.1.

Sections 2–5 each introduced specific convergence rates — bisection’s \(O(1/2^k)\), Newton’s quadratic convergence, GD’s linear rate, and the \(O(h^2)\) vs \(O(h^4)\) quadrature rates. The next section formalizes this into a unified framework.

Convergence and Stopping Criteria

This section formalizes the general framework. We begin with the asymptotic notation that has appeared informally throughout this appendix, then define convergence orders precisely, and finally address the practical question of when to stop iterating.

Asymptotic (Big-O) Notation

Throughout this appendix we have written statements like “the forward difference has error \(O(h)\)” and “the trapezoidal rule converges at rate \(O(h^2)\).” This asymptotic notation, introduced by Bachmann (1894) and popularized by Landau, makes precise the idea that one quantity “grows no faster than” another.

Definition

Let \(f\) and \(g\) be functions defined near some limit (typically \(h \to 0\) or \(n \to \infty\)). We write

\[f(x) = O(g(x))\]

if there exist constants \(C > 0\) and a threshold \(x_0\) such that

\[|f(x)| \le C\,|g(x)| \quad \text{for all } x \text{ past the threshold.}\]

In words: \(f\) grows (or shrinks) at most as fast as \(g\), up to a constant factor.

The constant \(C\) and threshold \(x_0\) exist but are deliberately left unspecified. Big-O captures the rate of growth or decay while ignoring the constant prefactor. This is exactly what we need for comparing numerical methods: whether an error shrinks like \(h\) or like \(h^2\) matters far more than whether the leading coefficient is 0.3 or 7.

Big-O appears in two distinct contexts in this book, and recognizing which one is in play prevents confusion:

As \(h \to 0\) (error analysis). When we write “the forward difference error is \(O(h)\),” we mean the error satisfies \(|\text{error}| \le Ch\) for all sufficiently small \(h\). Making \(h\) smaller improves accuracy, and the exponent tells us by how much.
As \(n \to \infty\) (algorithmic cost or convergence). When we write “Cholesky factorization costs \(O(n^3/3)\) operations” (Appendix B), we mean the operation count grows at most cubically. When we write “Monte Carlo error is \(O(n^{-1/2})\),” we mean the error shrinks like \(1/\sqrt{n}\) as the sample size grows.

Example 1: Taylor remainder and numerical differentiation. From Taylor’s theorem (Appendix A), the exponential function satisfies

\[e^h = 1 + h + \frac{h^2}{2} + O(h^3)\]

The \(O(h^3)\) term means the approximation \(e^h \approx 1 + h + h^2/2\) has an error bounded by \(C h^3\) for some constant \(C\). We can verify this: at \(h = 0.1\) the error is \(1.71 \times 10^{-4}\), at \(h = 0.01\) it is \(1.67 \times 10^{-7}\), and at \(h = 0.001\) it is \(1.67 \times 10^{-10}\). Each factor-of-10 reduction in \(h\) reduces the error by a factor of \(10^3 = 1000\), confirming the cubic rate. The hidden constant turns out to be \(C = 1/6\) (from the next Taylor term \(h^3/3!\)), but the \(O(h^3)\) statement is useful precisely because we don’t need to know \(C\) to predict how fast the error shrinks.

This is how all the error rates in Section 4 arise: the forward difference error is \(O(h)\) because Taylor expansion leaves a remainder proportional to \(h\); the central difference error is \(O(h^2)\) because the leading remainder cancels.

Example 2: Monte Carlo convergence. The Central Limit Theorem implies that the Monte Carlo estimator \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\) has root-mean-square error

\[\text{RMSE} = \frac{\sigma}{\sqrt{n}} = O(n^{-1/2})\]

Here \(C = \sigma\) (the population standard deviation) is the hidden constant. The key insight from Section 5 is that this rate is dimension-independent: whether \(X_i \in \mathbb{R}\) or \(X_i \in \mathbb{R}^{1000}\), the error still decreases as \(O(n^{-1/2})\). Only the constant \(\sigma\) changes with dimension — and big-O notation lets us separate the rate (which determines feasibility) from the constant (which determines cost).

Related notation. Big-O gives an upper bound on growth rate. Two companion notations complete the picture:

\(f(x) = \Omega(g(x))\) means \(f\) grows at least as fast as \(g\): there exists \(c > 0\) such that \(|f(x)| \ge c\,|g(x)|\) past some threshold. This is the lower-bound counterpart of \(O\).
\(f(x) = \Theta(g(x))\) means \(f\) grows at the same rate as \(g\): both \(f = O(g)\) and \(f = \Omega(g)\) hold. When we say “Cholesky costs \(\Theta(n^3)\) operations,” we mean the cost is exactly cubic — not just at most cubic. Most of the convergence rates in this appendix are tight (\(\Theta\) rather than just \(O\)), but we follow the common convention of writing \(O\) unless the distinction matters.

With this notation in hand, we can precisely classify how quickly iterative algorithms converge.

Convergence Orders

Definition

An iterative sequence \(\{x_k\}\) converging to \(x^*\) has linear convergence with rate \(r \in (0, 1)\) if:

\[|x_{k+1} - x^*| \le r \, |x_k - x^*| \quad \text{for all } k \text{ sufficiently large}\]

The error decreases by a constant fraction each step. Smaller \(r\) means faster convergence.

Definition

The sequence has quadratic convergence if:

\[|x_{k+1} - x^*| \le C \, |x_k - x^*|^2\]

for some constant \(C\). The number of correct digits approximately doubles per iteration.

Table 69 Convergence orders of algorithms in this course
Algorithm	Order	Rate
Bisection	Linear	\(r = 0.5\) (halving the interval)
Fixed-point \(x = g(x)\)	Linear	\(r = \|g'(x^*)\|\)
Gradient descent (strongly convex)	Linear	\(r = (\kappa-1)/(\kappa+1)\)
Newton’s method	Quadratic	Doubles correct digits per step
IRLS (canonical link)	Quadratic	Same as Newton (Fisher scoring = Newton)

Estimating convergence order from iterates. If we have a sequence of errors \(e_0, e_1, \ldots\), the convergence order \(p\) satisfies \(\log|e_{k+1}| \approx p \log|e_k| + \text{const}\). So the ratio \(\log|e_{k+1}|/\log|e_k|\) estimates \(p\): values near 1 indicate linear convergence, near 2 indicate quadratic.

We demonstrate both diagnostics on the \(x = \cos(x)\) problem from Section 2 — the same problem, two algorithms, two convergence orders.

import numpy as np

x_star = 0.7390851332151607

# Fixed-point iteration: g(x) = cos(x)  [linear]
x = 0.0
fp_errors = [abs(x - x_star)]
for _ in range(30):
    x = np.cos(x)
    fp_errors.append(abs(x - x_star))

# Newton on f(x) = x - cos(x)  [quadratic]
x = 0.0
nw_errors = [abs(x - x_star)]
for _ in range(7):
    f_val = x - np.cos(x)
    fp_val = 1 + np.sin(x)
    x = x - f_val / fp_val
    nw_errors.append(abs(x - x_star))

print("Convergence order estimation: x = cos(x)")
print(f"  x* = {x_star:.10f}")
print()

print("Newton on f(x) = x − cos(x):")
print(f"{'k':>4}  {'|e_k|':>14}  {'|e_{k+1}|/|e_k|':>18}"
      f"  {'log ratio':>12}")
print("-" * 54)
for k in range(len(nw_errors) - 1):
    e_k = nw_errors[k]
    e_k1 = nw_errors[k + 1]
    if e_k > 0 and e_k1 > 0:
        ratio = e_k1 / e_k
        log_ratio = (np.log(e_k1) / np.log(e_k)
                     if abs(np.log(e_k)) > 0.01
                     else float('nan'))
        print(f"{k:>4}  {e_k:>14.2e}  "
              f"{ratio:>18.6f}  "
              f"{log_ratio:>12.2f}")
    else:
        print(f"{k:>4}  {e_k:>14.2e}  "
              f"{'— (converged)':>18}")

print()
print("Fixed-point g(x) = cos(x):")
print(f"{'k':>4}  {'|e_k|':>14}  "
      f"{'|e_{k+1}|/|e_k|':>18}")
print("-" * 40)
for k in [0, 5, 10, 15, 20, 25]:
    if k + 1 < len(fp_errors):
        ratio = fp_errors[k+1] / fp_errors[k]
        print(f"{k:>4}  {fp_errors[k]:>14.2e}  "
              f"{ratio:>18.6f}")

L = abs(np.sin(x_star))
print(f"\nNewton: log ratio → 2 (quadratic)")
print(f"Fixed-point: ratio → L = |sin(x*)| "
      f"= {L:.4f} (linear)")

Convergence order estimation: x = cos(x)
  x* = 0.7390851332

Newton on f(x) = x − cos(x):
   k           |e_k|     |e_{k+1}|/|e_k|     log ratio
------------------------------------------------------
   0        7.39e-01            0.353024          4.44
   1        2.61e-01            0.043228          3.34
   2        1.13e-02            0.002461          2.34
   3        2.78e-05            0.000006          2.14
   4        1.70e-10       — (converged)
   5        0.00e+00       — (converged)
   6        0.00e+00       — (converged)

Fixed-point g(x) = cos(x):
   k           |e_k|     |e_{k+1}|/|e_k|
----------------------------------------
   0        7.39e-01            0.353024
   5        5.44e-02            0.693376
  10        7.68e-03            0.670767
  15        1.06e-03            0.674004
  20        1.47e-04            0.673558
  25        2.04e-05            0.673620

Newton: log ratio → 2 (quadratic)
Fixed-point: ratio → L = |sin(x*)| = 0.6736 (linear)

Same problem, same starting point, same fixed point — but Newton’s log ratio converges to 2 (quadratic), while the fixed-point ratio converges to \(L = |g'(x^*)| = |\sin(x^*)| \approx 0.674\) (linear). The contrast between 5 Newton iterations and 25+ fixed-point iterations is explained entirely by the convergence order.

Stopping Criteria

Every iterative algorithm needs a rule for when to stop. Common criteria:

Absolute change: \(\|\theta_{k+1} - \theta_k\| < \varepsilon_{\text{abs}}\). Simple but scale-dependent — a tolerance of \(10^{-8}\) is tight for \(\theta \sim 1\) but meaningless for \(\theta \sim 10^{10}\).
Relative change: \(\|\theta_{k+1} - \theta_k\| / \|\theta_k\| < \varepsilon_{\text{rel}}\). Scale-invariant, usually preferred. A relative tolerance of \(10^{-8}\) means the last 8 digits have stabilized.
Gradient norm: \(\|\nabla f(\theta_k)\| < \varepsilon\). Directly tests the first-order optimality condition. Used by scipy.optimize (the gtol parameter).
Function value change: \(|f(\theta_{k+1}) - f(\theta_k)| < \varepsilon\). Can be misleading near flat regions — the function value barely changes but the parameter is far from optimal.

Common pitfalls:

Premature stopping on a plateau: the function value or gradient may be small on a flat region far from the optimum, particularly near saddle points
Tolerance too tight: below \(\varepsilon_{\text{mach}}\) there is no information left — tighter tolerance just wastes iterations or causes cycling
Maximum iterations without convergence: always set a maximum iteration count and check whether it was hit

Diagnosing Non-Convergence

When an optimization algorithm claims convergence, verify by checking the Hessian at the reported solution.

import numpy as np

# f(x,y) = x^4/4 - x^2/2 + y^2/2
# Saddle at origin, minima at (±1, 0)
def grad_f(x, y):
    return np.array([x**3 - x, y])

def hess_f(x, y):
    return np.array([[3*x**2 - 1, 0], [0, 1]])

# Newton starting near origin
x, y = 0.1, 0.3
print("Newton on f(x,y) = x⁴/4 − x²/2 + y²/2")
print(f"Start: ({x}, {y})")
print()
print(f"{'k':>3}  {'x':>12}  {'y':>12}  {'‖∇f‖':>12}")
print("-" * 45)
for k in range(5):
    g = grad_f(x, y)
    H = hess_f(x, y)
    print(f"{k:>3}  {x:>12.8f}  {y:>12.8f}  "
          f"{np.linalg.norm(g):>12.2e}")
    step = np.linalg.solve(H, g)
    x -= step[0]; y -= step[1]

print()
print(f"Converged to ({x:.10f}, {y:.10f})")
print(f"Gradient norm: "
      f"{np.linalg.norm(grad_f(x, y)):.2e}")
print()
print("Diagnosis: Hessian eigenvalues")
H_conv = hess_f(x, y)
eigs = np.linalg.eigvalsh(H_conv)
print(f"  Eigenvalues: [{eigs[0]:.4f}, {eigs[1]:.4f}]")
if np.any(eigs < 0):
    print(f"  Negative eigenvalue → SADDLE POINT")
    print(f"  True minima are at (±1, 0) "
          f"with f = −0.25")
else:
    print(f"  All positive → confirmed minimum")

Newton on f(x,y) = x⁴/4 − x²/2 + y²/2
Start: (0.1, 0.3)

  k             x             y          ‖∇f‖
---------------------------------------------
  0    0.10000000    0.30000000      3.16e-01
  1   -0.00206186    0.00000000      2.06e-03
  2    0.00000002    0.00000000      1.75e-08
  3   -0.00000000    0.00000000      9.93e-24
  4    0.00000000    0.00000000      0.00e+00

Converged to (0.0000000000, 0.0000000000)
Gradient norm: 0.00e+00

Diagnosis: Hessian eigenvalues
  Eigenvalues: [-1.0000, 1.0000]
  Negative eigenvalue → SADDLE POINT
  True minima are at (±1, 0) with f = −0.25

Newton converged to the origin — a saddle point, not a minimum. The gradient is exactly zero, so gradient-based stopping criteria cannot detect the problem. Only the Hessian eigenvalue check reveals it. In log-likelihood optimization, we maximize \(\ell(\theta)\), so the Hessian should be negative definite at a true maximum. A positive eigenvalue of the Hessian of \(\ell(\theta)\) at a reported MLE means the algorithm found a saddle point, not a maximum.

Connections: From Numerical Analysis to Computation

Numerical Tool	Appendix C Section	Course Location
Floating-point stability	Section 1	Ch 2.3 (`log1p`), Ch 2.6 (`logsumexp`), Ch 2.4 (`finfo`)
Root finding	Section 2	Ch 2.3 (`brentq` for inverse CDF), Ch 3.2 (profile likelihood CI)
Newton-Raphson for MLE	Section 3	Ch 3.2 (full treatment), Ch 3.5 (IRLS)
Gradient descent	Section 3	Ch 3.2 (mentioned), Ch 5.5 (HMC leapfrog)
Numerical differentiation	Section 4	Ch 3.3 (numerical Hessian), Ch 3.1 (moment verification)
Quadrature vs Monte Carlo	Section 5	Ch 2.1 (full comparison)
Convergence orders	Section 6	Ch 3.2 (quadratic convergence), Ch 3.5 (IRLS convergence)
Condition numbers	Sections 3, 6 / Appendix B	Ch 3.2 (scaling/conditioning), Ch 3.4 (QR vs normal equations)

Practice Problems

Exercise 1: Floating-Point Catastrophe and Stable Alternatives

(a) The expression \((1 - \cos x)/x^2 \to 1/2\) as \(x \to 0\) (by L’Hopital or Taylor). Compute this for \(x \in \{10^{-1}, 10^{-3}, 10^{-5}, 10^{-7}, 10^{-8}, 10^{-9}\}\) using both the naive formula (1 - np.cos(x)) / x**2 and the stable alternative 2 * np.sin(x/2)**2 / x**2. Report the relative error of each against the true value \(1/2\). At what magnitude of \(x\) does the naive formula fail completely, and why?

(b) The equation \(x^2 - (10^8 + 10^{-8})x + 1 = 0\) has roots \(r_1 = 10^8\) and \(r_2 = 10^{-8}\). Compute both roots using the standard quadratic formula. Which root is accurate? Compute \(r_2\) using the identity \(r_2 = c/(a \cdot r_1)\) and verify it matches the true value.

(c) Generate data = 1e12 + rng.standard_normal(1000) with rng = np.random.default_rng(seed=42). Compute the sample variance using: (i) the naive two-pass formula \(\frac{1}{n-1}\sum(x_i - \bar{x})^2\), (ii) np.var(data, ddof=1). Are they different? Explain why adding a large constant does or does not cause problems with float64 arithmetic. (For Welford’s online algorithm, see Ch2.1.)

(d) Explain why np.log(np.sum(np.exp([1000, 1001, 1002]))) produces inf. Compute the correct answer using scipy.special.logsumexp.

Solution

import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(seed=42)

# (a) (1 - cos(x))/x^2 cancellation
print("(a) (1 - cos(x))/x²: naive vs stable")
print(f"{'x':>12}  {'Naive':>20}  {'Stable':>20}  "
      f"{'Rel err (naive)':>16}")
print("-" * 74)
for x in [1e-1, 1e-3, 1e-5, 1e-7, 1e-8, 1e-9]:
    naive = (1 - np.cos(x)) / x**2
    stable = 2 * np.sin(x/2)**2 / x**2
    exact = 0.5
    rel = abs(naive - exact) / exact
    print(f"{x:>12.0e}  {naive:>20.16e}  "
          f"{stable:>20.16e}  {rel:>16.2e}")
print("  Naive loses all digits at x ~ 1e-8")
print("  (cos rounds to 1.0, so 1-cos = 0)")
print("  Stable: 2*sin²(x/2)/x² = 0.5 throughout")
print()

# (b) Quadratic formula
a_coef = 1.0
b_coef = -(1e8 + 1e-8)
c_coef = 1.0
disc = b_coef**2 - 4*a_coef*c_coef
r1_naive = (-b_coef + np.sqrt(disc)) / (2*a_coef)
r2_naive = (-b_coef - np.sqrt(disc)) / (2*a_coef)
r2_stable = c_coef / (a_coef * r1_naive)

print("(b) Roots of x² − (10⁸+10⁻⁸)x + 1 = 0")
print(f"  r₁ (naive):  {r1_naive:.15e}")
print(f"  r₁ (true):   1.000000000000000e+08")
print(f"  r₂ (naive):  {r2_naive:.15e}")
print(f"  r₂ (stable): {r2_stable:.15e}")
print(f"  r₂ (true):   1.000000000000000e-08")
print()

# (c) Variance with large offset
data = 1e12 + rng.standard_normal(1000)
var_naive = np.sum((data - np.mean(data))**2) / 999
var_numpy = np.var(data, ddof=1)

print("(c) Sample variance of 10¹² + N(0,1)")
print(f"  Naive two-pass: {var_naive:.10f}")
print(f"  np.var:         {var_numpy:.10f}")
print(f"  Difference:     {abs(var_naive - var_numpy):.2e}")
print(f"  Two-pass with float64 is fine here:")
print(f"  x̄ ≈ 10¹² is computed accurately, and")
print(f"  (xᵢ − x̄) ≈ N(0,1) recovers the signal.")
print()

# (d) logsumexp
x = np.array([1000, 1001, 1002])
print("(d) log(sum(exp([1000, 1001, 1002])))")
try:
    naive = np.log(np.sum(np.exp(x)))
    print(f"  Naive: {naive}")
except Exception:
    print(f"  Naive: overflow")
print(f"  logsumexp: {logsumexp(x):.6f}")
print(f"  Expected: 1002 + log(e⁻² + e⁻¹ + 1) "
      f"= {1002 + np.log(np.exp(-2)+np.exp(-1)+1):.6f}")

(a) (1 - cos(x))/x²: naive vs stable
           x                 Naive                Stable   Rel err (naive)
--------------------------------------------------------------------------
       1e-01  4.9958347219741783e-01  4.9958347219742333e-01         8.33e-04
       1e-03  4.9999995832550326e-01  4.9999995833333472e-01         8.33e-08
       1e-05  5.0000004137018539e-01  4.9999999999583328e-01         8.27e-08
       1e-07  4.9960036108132050e-01  4.9999999999999961e-01         7.99e-04
       1e-08  0.0000000000000000e+00  5.0000000000000000e-01         1.00e+00
       1e-09  0.0000000000000000e+00  5.0000000000000000e-01         1.00e+00
  Naive loses all digits at x ~ 1e-8
  (cos rounds to 1.0, so 1-cos = 0)
  Stable: 2*sin²(x/2)/x² = 0.5 throughout

(b) Roots of x² − (10⁸+10⁻⁸)x + 1 = 0
  r₁ (naive):  1.000000000000000e+08
  r₁ (true):   1.000000000000000e+08
  r₂ (naive):  1.490116119384766e-08
  r₂ (stable): 1.000000000000000e-08
  r₂ (true):   1.000000000000000e-08

(c) Sample variance of 10¹² + N(0,1)
  Naive two-pass: 0.9785502705
  np.var:         0.9785502705
  Difference:     0.00e+00
  Two-pass with float64 is fine here:
  x̄ ≈ 10¹² is computed accurately, and
  (xᵢ − x̄) ≈ N(0,1) recovers the signal.

(d) log(sum(exp([1000, 1001, 1002])))
  Naive: inf
  logsumexp: 1002.407606
  Expected: 1002 + log(e⁻² + e⁻¹ + 1) = 1002.407606

Exercise 2: Root-Finding for Confidence Interval Endpoints

(a) Generate \(n = 50\) observations from \(\text{Exp}(\lambda = 2)\) using rng = np.random.default_rng(seed=42) and data = rng.exponential(1/2, 50). Write the log-likelihood \(\ell(\lambda) = n\log\lambda - \lambda \sum x_i\). The 95% profile likelihood confidence interval endpoints solve \(2[\ell(\hat{\lambda}) - \ell(\lambda)] = \chi^2_{1, 0.05} = 3.841\). Formulate this as a root-finding problem: find \(\lambda\) where \(h(\lambda) = 2[\ell(\hat{\lambda}) - \ell(\lambda)] - 3.841 = 0\).

(b) Find both CI endpoints using: (i) manual bisection (implement yourself, count iterations to reach tolerance \(10^{-8}\)), (ii) scipy.optimize.brentq. Compare the number of iterations.

(c) For the exponential model, the exact CI based on the sufficient statistic \(T = \sum X_i \sim \text{Gamma}(n, \lambda)\) can be obtained via chi-squared quantiles. Compute this exact CI and compare to the profile likelihood CI. When is numerical root-finding the only option?

Exercise 3: Gradient Descent vs Newton for Normal MLE

(a) Generate \(n = 100\) observations from \(N(\mu = 5, \sigma^2 = 9)\) using rng = np.random.default_rng(seed=42) and data = rng.normal(5, 3, 100). Write the negative log-likelihood as a function of \((\mu, \sigma^2)\). Compute the gradient and \(2 \times 2\) Hessian analytically.

(b) Implement gradient descent with step size \(\alpha = 0.001\) starting from \((\mu_0, \sigma^2_0) = (3, 3)\). Track the iterates and report how many iterations are needed to reach relative tolerance \(10^{-8}\).

(c) Implement Newton’s method from the same starting point. How many iterations?

(d) Estimate the convergence order for both methods using the technique from Section 6 (log error ratios). Compare your results to scipy.optimize.minimize(method='Newton-CG').

Solution

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=42)

n = 100
data = rng.normal(5, 3, n)
xbar = np.mean(data)
s2_mle = np.var(data)

def neg_ll(params):
    mu, s2 = params
    if s2 <= 0:
        return 1e20
    return (n/2*np.log(2*np.pi*s2)
            + np.sum((data - mu)**2) / (2*s2))

def grad_neg(params):
    mu, s2 = params
    g_mu = -np.sum(data - mu) / s2
    g_s2 = (n/(2*s2)
            - np.sum((data - mu)**2) / (2*s2**2))
    return np.array([g_mu, g_s2])

def hess_neg(params):
    mu, s2 = params
    h11 = n / s2
    h12 = np.sum(data - mu) / s2**2
    h22 = (-n/(2*s2**2)
           + np.sum((data - mu)**2) / s2**3)
    return np.array([[h11, h12], [h12, h22]])

theta_star = np.array([xbar, s2_mle])
print(f"(a) MLE: μ̂={xbar:.4f}, σ̂²={s2_mle:.4f}")
print()

# (b) Gradient descent
theta = np.array([3.0, 3.0])
lr = 0.001
gd_errors = []
gd_iters = 0
for k in range(10000):
    err = np.linalg.norm(theta - theta_star)
    gd_errors.append(err)
    g = grad_neg(theta)
    theta_new = theta - lr * g
    theta_new[1] = max(theta_new[1], 0.01)
    rel_change = (np.linalg.norm(theta_new - theta)
                  / np.linalg.norm(theta))
    theta = theta_new
    if rel_change < 1e-8:
        gd_iters = k + 1
        gd_errors.append(
            np.linalg.norm(theta - theta_star))
        break
else:
    gd_iters = 10000

print(f"(b) Gradient descent (α=0.001):")
print(f"  Converged in {gd_iters} iterations")
print(f"  Final: μ={theta[0]:.6f}, "
      f"σ²={theta[1]:.6f}")
print()

# (c) Newton
theta = np.array([3.0, 3.0])
nw_errors = []
nw_iters = 0
for k in range(50):
    err = np.linalg.norm(theta - theta_star)
    nw_errors.append(err)
    g = grad_neg(theta)
    H = hess_neg(theta)
    step = np.linalg.solve(H, g)
    theta_new = theta - step
    theta_new[1] = max(theta_new[1], 0.01)
    rel_change = (np.linalg.norm(step)
                  / np.linalg.norm(theta))
    theta = theta_new
    if rel_change < 1e-12:
        nw_iters = k + 1
        nw_errors.append(
            np.linalg.norm(theta - theta_star))
        break
else:
    nw_iters = 50

print(f"(c) Newton:")
print(f"  Converged in {nw_iters} iterations")
print(f"  Final: μ={theta[0]:.6f}, "
      f"σ²={theta[1]:.6f}")
print()

# (d) Convergence order
print("(d) Convergence order estimation:")
print("  Newton (last 4 iterations):")
start_k = max(0, len(nw_errors) - 5)
for k in range(start_k, len(nw_errors)-1):
    e0, e1 = nw_errors[k], nw_errors[k+1]
    if e0 > 1e-15 and e1 > 1e-15:
        p = np.log(e1) / np.log(e0)
        print(f"    k={k}: |e|={e0:.2e}, "
              f"order ≈ {p:.2f}")

print("  Gradient descent (sampled):")
for k in [0, 100, 500, 1000]:
    if k+1 < len(gd_errors):
        e0, e1 = gd_errors[k], gd_errors[k+1]
        if e0 > 0 and e1 > 0:
            ratio = e1 / e0
            print(f"    k={k}: |e|={e0:.2e}, "
                  f"ratio={ratio:.6f}")

# scipy comparison
res = minimize(neg_ll, x0=[3, 3],
               method='Newton-CG',
               jac=grad_neg, hess=hess_neg)
print(f"\n  scipy Newton-CG: {res.nit} iterations, "
      f"μ={res.x[0]:.6f}, σ²={res.x[1]:.6f}")

(a) MLE: μ̂=4.8492, σ̂²=5.3748

(b) Gradient descent (α=0.001):
  Converged in 5838 iterations
  Final: μ=4.849191, σ²=5.374715

(c) Newton:
  Converged in 9 iterations
  Final: μ=4.849191, σ²=5.374757

(d) Convergence order estimation:
  Newton (last 4 iterations):
    k=5: |e|=3.58e-02, order ≈ 2.30
    k=6: |e|=4.74e-04, order ≈ 2.13
    k=7: |e|=8.34e-08, order ≈ 2.06
  Gradient descent (sampled):
    k=0: |e|=3.01e+00, ratio=0.979026
    k=100: |e|=1.42e+00, ratio=0.996645
    k=500: |e|=5.23e-01, ratio=0.997876
    k=1000: |e|=1.96e-01, ratio=0.998136

  scipy Newton-CG: 7 iterations, μ=4.849191, σ²=5.374757

Exercise 4: The Step Size Dilemma

Consider \(f(x) = x e^x\), whose derivative \(f'(x) = (x+1)e^x\) is known exactly. Evaluate \(f'(1) = 2e\) numerically using step sizes \(h = 10^{-1}, 10^{-2}, \ldots, 10^{-16}\).

(a) Use the forward difference \([f(x+h) - f(x)]/h\). Plot the relative error versus \(h\) (log-log). Find the optimal \(h\) and compare it to the theoretical prediction \(h^* \approx \sqrt{\varepsilon_{\text{mach}}} \approx 1.5 \times 10^{-8}\).

(b) Repeat with the central difference \([f(x+h) - f(x-h)]/(2h)\). The optimal \(h\) should be near \(\varepsilon_{\text{mach}}^{1/3} \approx 6 \times 10^{-6}\). Does it match?

(c) Repeat with the complex-step method \(\text{Im}(f(x+ih))/h\). Does the error show a V-shaped curve or monotonic improvement? Why?

Solution

import numpy as np

x = 1.0
exact = 2 * np.exp(1)
eps = np.finfo(float).eps
h_vals = [10**(-k) for k in range(1, 17)]

print(f"f(x) = x*exp(x), f'(1) = 2e = {exact:.10f}")
print()

# (a) Forward differences
print("(a) Forward differences:")
print(f"{'h':>12}  {'approx':>18}  "
      f"{'Rel error':>12}")
print("-" * 48)
fwd_errors = []
for h in h_vals:
    approx = ((x+h)*np.exp(x+h)
              - x*np.exp(x)) / h
    rel = abs(approx - exact) / exact
    fwd_errors.append(rel)
    print(f"{h:>12.0e}  {approx:>18.10f}  "
          f"{rel:>12.2e}")

best = h_vals[np.argmin(fwd_errors)]
print(f"  Optimal h ~ {best:.0e} ~ sqrt(eps)")
print(f"  (sqrt(eps) = {np.sqrt(eps):.2e})")
print()

# (b) Central differences
print("(b) Central differences:")
print(f"{'h':>12}  {'approx':>18}  "
      f"{'Rel error':>12}")
print("-" * 48)
cen_errors = []
for h in h_vals:
    fp = (x+h)*np.exp(x+h)
    fm = (x-h)*np.exp(x-h)
    approx = (fp - fm) / (2*h)
    rel = abs(approx - exact) / exact
    cen_errors.append(rel)
    print(f"{h:>12.0e}  {approx:>18.10f}  "
          f"{rel:>12.2e}")

best_c = h_vals[np.argmin(cen_errors)]
print(f"  Best on grid: h = {best_c:.0e}")
print(f"  Theory: h* ~ eps^(1/3) = "
      f"{eps**(1/3):.2e}")
print()

# (c) Complex-step
print("(c) Complex-step derivative:")
print(f"{'h':>12}  {'approx':>18}  "
      f"{'Rel error':>12}")
print("-" * 48)
for h in h_vals:
    z = complex(x, h)
    fz = z * np.exp(z)
    approx = fz.imag / h
    rel = abs(approx - exact) / exact
    print(f"{h:>12.0e}  {approx:>18.10f}  "
          f"{rel:>12.2e}")
print("  No roundoff floor: error decreases")
print("  monotonically as h -> 0.")

f(x) = x*exp(x), f'(1) = 2e = 5.4365636569

(a) Forward differences:
           h              approx     Rel error
------------------------------------------------
       1e-01        5.8630079788      7.84e-02
       1e-02        5.4775196708      7.53e-03
       1e-03        5.4406428924      7.50e-04
       1e-04        5.4369714173      7.50e-05
       1e-05        5.4366044314      7.50e-06
       1e-06        5.4365677338      7.50e-07
       1e-07        5.4365640700      7.60e-08
       1e-08        5.4365636437      2.43e-09
       1e-09        5.4365640878      7.93e-08
       1e-10        5.4365667523      5.69e-07
       1e-11        5.4365845159      3.84e-06
       1e-12        5.4374282854      1.59e-04
       1e-13        5.4356519286      1.68e-04
       1e-14        5.4178883602      3.44e-03
       1e-15        6.2172489379      1.44e-01
       1e-16        0.0000000000      1.00e+00
  Optimal h ~ 1e-08 ~ sqrt(eps)
  (sqrt(eps) = 1.49e-08)

(b) Central differences:
           h              approx     Rel error
------------------------------------------------
       1e-01        5.4546991315      3.34e-03
       1e-02        5.4367448771      3.33e-05
       1e-03        5.4365654691      3.33e-07
       1e-04        5.4365636750      3.33e-09
       1e-05        5.4365636571      3.79e-11
       1e-06        5.4365636566      6.01e-11
       1e-07        5.4365636570      2.16e-11
       1e-08        5.4365636437      2.43e-09
       1e-09        5.4365636437      2.43e-09
       1e-10        5.4365645319      1.61e-07
       1e-11        5.4365623114      2.47e-07
       1e-12        5.4369841962      7.74e-05
       1e-13        5.4356519286      1.68e-04
       1e-14        5.4178883602      3.44e-03
       1e-15        5.7731597281      6.19e-02
       1e-16        2.2204460493      5.92e-01
  Best on grid: h = 1e-07
  Theory: h* ~ eps^(1/3) = 6.06e-06

(c) Complex-step derivative:
           h              approx     Rel error
------------------------------------------------
       1e-01        5.4184553652      3.33e-03
       1e-02        5.4363824395      3.33e-05
       1e-03        5.4365618447      3.33e-07
       1e-04        5.4365636388      3.33e-09
       1e-05        5.4365636567      3.33e-11
       1e-06        5.4365636569      3.33e-13
       1e-07        5.4365636569      3.43e-15
       1e-08        5.4365636569      0.00e+00
       1e-09        5.4365636569      0.00e+00
       1e-10        5.4365636569      1.63e-16
       1e-11        5.4365636569      0.00e+00
       1e-12        5.4365636569      0.00e+00
       1e-13        5.4365636569      0.00e+00
       1e-14        5.4365636569      0.00e+00
       1e-15        5.4365636569      0.00e+00
       1e-16        5.4365636569      0.00e+00
  No roundoff floor: error decreases
  monotonically as h -> 0.

Forward and central differences both show V-shaped error curves: as \(h\) decreases, truncation error shrinks but roundoff error grows. The optimal \(h\) balances the two. The complex-step method avoids cancellation entirely because it never subtracts nearly equal real numbers — the derivative information is isolated in the imaginary part.

Exercise 5: Diagnosing Optimization Results

Consider \(f(x, y) = \frac{x^4}{4} - \frac{x^2}{2} + \frac{y^4}{4} - \frac{y^2}{2}\), which has 9 critical points.

(a) Find all critical points by solving \(\nabla f = 0\). Classify each as a minimum, maximum, or saddle point using the Hessian eigenvalues.

(b) Run Newton’s method from three starting points: \((1.5, 1.5)\), \((0.6, 0.1)\), and \((0.1, 0.3)\). For each, report the convergence point, function value, and Hessian eigenvalues. Does Newton always find a minimum?

(c) What stopping criterion would detect that Newton converged to a non-minimum? Explain why gradient-based stopping criteria alone are insufficient.

Solution

import numpy as np

def f(x, y):
    return x**4/4 - x**2/2 + y**4/4 - y**2/2

def grad_f(x, y):
    return np.array([x**3 - x, y**3 - y])

def hess_f(x, y):
    return np.array([[3*x**2 - 1, 0],
                     [0, 3*y**2 - 1]])

# (a) Critical points
print("(a) Critical points (grad f = 0):")
print("  x^3 - x = 0  =>  x in {-1, 0, 1}")
print("  y^3 - y = 0  =>  y in {-1, 0, 1}")
print("  9 critical points:")
print()
for xc in [-1, 0, 1]:
    for yc in [-1, 0, 1]:
        H = hess_f(xc, yc)
        eigs = np.linalg.eigvalsh(H)
        if np.all(eigs > 0):
            ctype = "minimum"
        elif np.all(eigs < 0):
            ctype = "maximum"
        else:
            ctype = "saddle"
        print(f"  ({xc:>2},{yc:>2}): "
              f"eigs=[{eigs[0]:.0f},{eigs[1]:.0f}],"
              f" f={f(xc,yc):.2f} -> {ctype}")
print()

# (b) Newton from different starts
starts = [(1.5, 1.5), (0.6, 0.1), (0.1, 0.3)]
for x0, y0 in starts:
    x, y = x0, y0
    print(f"(b) Newton from ({x0}, {y0}):")
    print(f"{'k':>5}  {'x':>12}  {'y':>12}  "
          f"{'||grad||':>12}")
    print("-" * 48)
    for k in range(8):
        g = grad_f(x, y)
        gnorm = np.linalg.norm(g)
        print(f"{k:>5}  {x:>12.8f}  "
              f"{y:>12.8f}  {gnorm:>12.2e}")
        if gnorm < 1e-14:
            break
        H = hess_f(x, y)
        step = np.linalg.solve(H, g)
        x -= step[0]; y -= step[1]

    H = hess_f(x, y)
    eigs = np.linalg.eigvalsh(H)
    print(f"  Converged to ({x:.4f}, {y:.4f})")
    print(f"  f = {f(x,y):.4f}")
    print(f"  Hessian eigs: "
          f"[{eigs[0]:.2f}, {eigs[1]:.2f}]")
    if np.all(eigs > 0):
        print("  All positive -> minimum")
    elif np.all(eigs < 0):
        print("  All negative -> maximum!")
    else:
        print("  Mixed signs -> saddle point!")
    print()

(a) Critical points (grad f = 0):
  x^3 - x = 0  =>  x in {-1, 0, 1}
  y^3 - y = 0  =>  y in {-1, 0, 1}
  9 critical points:

  (-1,-1): eigs=[2,2], f=-0.50 -> minimum
  (-1, 0): eigs=[-1,2], f=-0.25 -> saddle
  (-1, 1): eigs=[2,2], f=-0.50 -> minimum
  ( 0,-1): eigs=[-1,2], f=-0.25 -> saddle
  ( 0, 0): eigs=[-1,-1], f=0.00 -> maximum
  ( 0, 1): eigs=[-1,2], f=-0.25 -> saddle
  ( 1,-1): eigs=[2,2], f=-0.50 -> minimum
  ( 1, 0): eigs=[-1,2], f=-0.25 -> saddle
  ( 1, 1): eigs=[2,2], f=-0.50 -> minimum

(b) Newton from (1.5, 1.5):
    k             x             y      ||grad||
------------------------------------------------
    0    1.50000000    1.50000000      2.65e+00
    1    1.17391304    1.17391304      6.28e-01
    2    1.03230713    1.03230713      9.59e-02
    3    1.00145595    1.00145595      4.13e-03
    4    1.00000317    1.00000317      8.96e-06
    5    1.00000000    1.00000000      4.26e-11
    6    1.00000000    1.00000000      0.00e+00
  Converged to (1.0000, 1.0000)
  f = -0.5000
  Hessian eigs: [2.00, 2.00]
  All positive -> minimum

(b) Newton from (0.6, 0.1):
    k             x             y      ||grad||
------------------------------------------------
    0    0.60000000    0.10000000      3.97e-01
    1    5.40000000   -0.00206186      1.52e+02
    2    3.64162812    0.00000002      4.47e+01
    3    2.49034823   -0.00000000      1.30e+01
    4    1.75453404    0.00000000      3.65e+00
    5    1.31172523    0.00000000      9.45e-01
    6    1.08460145    0.00000000      1.91e-01
    7    1.00896898    0.00000000      1.82e-02
  Converged to (1.0001, 0.0000)
  f = -0.2500
  Hessian eigs: [-1.00, 2.00]
  Mixed signs -> saddle point!

(b) Newton from (0.1, 0.3):
    k             x             y      ||grad||
------------------------------------------------
    0    0.10000000    0.30000000      2.90e-01
    1   -0.00206186   -0.07397260      7.36e-02
    2    0.00000002    0.00082306      8.23e-04
    3   -0.00000000   -0.00000000      1.12e-09
    4    0.00000000    0.00000000      0.00e+00
  Converged to (0.0000, 0.0000)
  f = 0.0000
  Hessian eigs: [-1.00, -1.00]
  All negative -> maximum!

Newton converges to whichever critical point is nearest in its basin — it does not distinguish minima from saddles or maxima because all have \(\nabla f = 0\). The gradient norm stopping criterion passes at any critical point. Only checking the Hessian eigenvalues at the converged point reveals whether the result is actually a minimum (all positive), a saddle (mixed signs), or a maximum (all negative).