.. _appendix-numerical-analysis: =============================================== Appendix C: Numerical Analysis Review =============================================== From Exact to Approximate ---------------------------- Most statistical computations cannot be done in closed form. Maximum likelihood requires solving nonlinear score equations. Bayesian posteriors require intractable integrals. Confidence intervals require finding roots of profile likelihood functions. Numerical analysis provides the algorithms that turn these mathematical expressions into numbers — and, crucially, the error bounds and convergence guarantees that tell us when to trust those numbers. This appendix develops the numerical foundations that the rest of the course assumes. We explain *why* the standard algorithms work (convergence theory, contraction mappings), *when* they fail (floating-point arithmetic, saddle points, ill-conditioning), and *how fast* they converge (linear vs quadratic rates). The specific applications of these tools — Newton-Raphson for MLE in :ref:`Ch3.2 `, IRLS for GLMs in :ref:`Ch3.5 `, quadrature vs Monte Carlo in :ref:`Ch2.1 ` — live in the chapters. This appendix provides the engine under the hood. .. admonition:: Road Map 🧭 :class: important - **Section 1 (Floating-Point):** IEEE 754 representation, machine epsilon, catastrophic cancellation, log-space arithmetic — the source of most "mysterious" numerical errors - **Section 2 (Root Finding):** Fixed-point iteration, contraction mapping theorem, bisection, Newton's method — the framework that unifies all iterative algorithms - **Section 3 (Optimization):** Gradient descent, Newton's method for optimization, quasi-Newton — why some algorithms converge in 5 steps and others need 500 - **Section 4 (Differentiation):** Finite differences, optimal step size, the complex-step method — computing derivatives when formulas are unavailable - **Section 5 (Integration):** Quadrature error theory, adaptive quadrature, Monte Carlo — choosing the right tool for numerical integrals - **Section 6 (Convergence):** Convergence orders, stopping criteria, diagnosing failure — knowing when to stop and when something has gone wrong .. admonition:: Prerequisites Check :class: tip This appendix assumes familiarity with: - **Single-variable calculus**: Taylor's theorem and basic derivatives (:ref:`Appendix A `) - **Linear algebra basics**: Eigenvalues, matrix inverse, solving linear systems (:ref:`Appendix B `) - **Python/NumPy**: Basic array operations (:ref:`Appendix F `) .. admonition:: Connection to Course Material :class: tip - **Chapter 2** (Monte Carlo): Quadrature vs Monte Carlo convergence rates (Ch2.1), Brent's method for inverse CDF (Ch2.3), logsumexp for importance weights (Ch2.6) - **Chapter 3** (Parametric Inference): Newton-Raphson and Fisher scoring for MLE (Ch3.2), numerical Hessians for standard errors (Ch3.3), IRLS for GLMs (Ch3.5) - **Chapter 5** (Bayesian): HMC leapfrog integrator step size (Ch5.5) - **Appendix B**: Condition numbers and numerical stability of matrix computations .. contents:: Contents :local: :depth: 2 Floating-Point Arithmetic ---------------------------- IEEE 754 Basics ~~~~~~~~~~~~~~~~~ .. admonition:: Definition :class: definition A **floating-point number** in IEEE 754 double precision (``float64``) is stored as :math:`(-1)^s \times 1.m \times 2^{e-1023}`, where :math:`s` is a sign bit, :math:`m` is a 52-bit mantissa (significand), and :math:`e` is an 11-bit exponent. This gives approximately 15–16 significant decimal digits. .. admonition:: Definition :class: definition **Machine epsilon** :math:`\varepsilon_{\text{mach}}` is the smallest :math:`\varepsilon > 0` such that :math:`\text{fl}(1 + \varepsilon) \ne 1` in floating-point arithmetic. For ``float64``, :math:`\varepsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}`. The key consequence: the spacing between consecutive floating-point numbers is *not* uniform. Near :math:`x`, the gap to the next representable float is approximately :math:`|x| \cdot \varepsilon_{\text{mach}}`. This means relative precision is roughly constant (~16 digits), but absolute precision varies enormously — fine near zero, coarse for large numbers. .. code-block:: python import numpy as np info = np.finfo(float) print("IEEE 754 double precision (float64)") print(f"{'Property':>25} {'Value':>25}") print("-" * 54) print(f"{'Machine epsilon':>25} {info.eps:>25.2e}") print(f"{'Smallest normal':>25} {info.tiny:>25.2e}") print(f"{'Largest finite':>25} {info.max:>25.2e}") print(f"{'Mantissa bits':>25} {info.nmant:>25d}") print() print("Float spacing at different magnitudes:") print(f"{'x':>15} {'np.spacing(x)':>20} " f"{'x · ε_mach':>20}") print("-" * 58) for x in [1e-10, 1e-5, 1.0, 1e5, 1e10, 1e15]: print(f"{x:>15.0e} {np.spacing(x):>20.2e} " f"{x * info.eps:>20.2e}") .. code-block:: text IEEE 754 double precision (float64) Property Value ------------------------------------------------------ Machine epsilon 2.22e-16 Smallest normal 2.23e-308 Largest finite 1.80e+308 Mantissa bits 52 Float spacing at different magnitudes: x np.spacing(x) x · ε_mach ---------------------------------------------------------- 1e-10 1.29e-26 2.22e-26 1e-05 1.69e-21 2.22e-21 1e+00 2.22e-16 2.22e-16 1e+05 1.46e-11 2.22e-11 1e+10 1.91e-06 2.22e-06 1e+15 1.25e-01 2.22e-01 At :math:`x = 10^{15}`, the gap between consecutive floats is 0.125 — integers above :math:`2^{53} \approx 9 \times 10^{15}` cannot be represented exactly. At :math:`x = 1`, the gap is :math:`\varepsilon_{\text{mach}}` itself. Catastrophic Cancellation ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. admonition:: Definition :class: definition **Catastrophic cancellation** occurs when subtracting two nearly equal floating-point numbers. The leading significant digits cancel, leaving only the less significant (and potentially inaccurate) trailing digits. The result has far fewer correct digits than either operand. **Worked Example: The Quadratic Formula.** Consider :math:`x^2 - Bx + 1 = 0` with :math:`B` large. The roots are :math:`(B \pm \sqrt{B^2 - 4})/2`. The larger root :math:`(B + \sqrt{B^2 - 4})/2 \approx B` is computed accurately. The smaller root :math:`(B - \sqrt{B^2 - 4})/2` subtracts two nearly equal numbers — :math:`B` and :math:`\sqrt{B^2 - 4}` agree in their leading digits when :math:`B \gg 2`. **The fix:** use the algebraically equivalent form :math:`2/(B + \sqrt{B^2 - 4})`, which involves no cancellation. .. code-block:: python import numpy as np print("Quadratic formula: x² − Bx + 1 = 0") print(f"{'B':>12} {'Naive root':>18} {'Stable root':>18}" f" {'Rel error':>12}") print("-" * 66) for B in [1e1, 1e2, 1e3, 1e4, 1e6, 1e8]: disc = B**2 - 4 sqrt_d = np.sqrt(disc) naive = (B - sqrt_d) / 2 stable = 2.0 / (B + sqrt_d) rel_err = abs(naive - stable) / stable print(f"{B:>12.0e} {naive:>18.15f} " f"{stable:>18.15f} {rel_err:>12.2e}") .. code-block:: text Quadratic formula: x² − Bx + 1 = 0 B Naive root Stable root Rel error ------------------------------------------------------------------ 1e+01 0.101020514433644 0.101020514433644 4.26e-15 1e+02 0.010001000200049 0.010001000200050 1.22e-13 1e+03 0.001000001000023 0.001000001000002 2.06e-11 1e+04 0.000100000001112 0.000100000001000 1.12e-09 1e+06 0.000001000007614 0.000001000000000 7.61e-06 1e+08 0.000000007450581 0.000000010000000 2.55e-01 At :math:`B = 10^8`, the naive formula loses 25% of the answer to cancellation — a catastrophic failure. The stable formula is exact to machine precision at every :math:`B`. The general principle: whenever a formula subtracts nearly equal quantities, look for an algebraically equivalent form that avoids the subtraction. For a full worked example of catastrophic cancellation in variance computation and its fix via Welford's algorithm, see :ref:`Ch2.1 `. Log-Space Arithmetic ~~~~~~~~~~~~~~~~~~~~~~~ In likelihood-based inference, products of many small probabilities routinely underflow to zero, and exponentials of large arguments overflow to infinity. The standard solution: **work in log space**. - **Products become sums:** :math:`\log \prod f_i = \sum \log f_i` - **Ratios become differences:** :math:`\log(a/b) = \log a - \log b` - **No overflow/underflow:** :math:`\log L(\theta)` stays finite even when :math:`L(\theta)` is astronomically small For computing :math:`\log(\sum \exp(x_i))` — the log-normalizing constant for importance weights and posterior distributions — the **logsumexp** trick is essential. The full treatment is in :ref:`Ch2.6 `; SciPy provides ``scipy.special.logsumexp``. Two specialized functions avoid cancellation near zero: - ``np.log1p(x)`` computes :math:`\log(1+x)` accurately for small :math:`x`, avoiding :math:`1 + x \approx 1` - ``np.expm1(x)`` computes :math:`e^x - 1` accurately for small :math:`x`, avoiding :math:`e^x \approx 1` .. code-block:: python import numpy as np print("Log-space arithmetic near zero") print(f"{'x':>12} {'log(1+x)':>20} {'log1p(x)':>20} " f"{'Rel diff':>12}") print("-" * 70) for x in [1e-1, 1e-5, 1e-10, 1e-15, 1e-16]: naive = np.log(1 + x) stable = np.log1p(x) rel = abs(naive - stable) / abs(stable) if stable != 0 else 0 print(f"{x:>12.0e} {naive:>20.16f} " f"{stable:>20.16f} {rel:>12.2e}") print() print(f"{'x':>12} {'exp(x)-1':>20} {'expm1(x)':>20} " f"{'Rel diff':>12}") print("-" * 70) for x in [1e-1, 1e-5, 1e-10, 1e-15, 1e-16]: naive = np.exp(x) - 1 stable = np.expm1(x) rel = abs(naive - stable) / abs(stable) if stable != 0 else 0 print(f"{x:>12.0e} {naive:>20.16e} " f"{stable:>20.16e} {rel:>12.2e}") .. code-block:: text Log-space arithmetic near zero x log(1+x) log1p(x) Rel diff ---------------------------------------------------------------------- 1e-01 0.0953101798043249 0.0953101798043249 7.28e-16 1e-05 0.0000099999500004 0.0000099999500003 6.55e-12 1e-10 0.0000000001000000 0.0000000001000000 8.27e-08 1e-15 0.0000000000000011 0.0000000000000010 1.10e-01 1e-16 0.0000000000000000 0.0000000000000001 1.00e+00 x exp(x)-1 expm1(x) Rel diff ---------------------------------------------------------------------- 1e-01 1.0517091807564771e-01 1.0517091807564763e-01 7.92e-16 1e-05 1.0000050000069649e-05 1.0000050000166668e-05 9.70e-12 1e-10 1.0000000827403710e-10 1.0000000000500000e-10 8.27e-08 1e-15 1.1102230246251565e-15 1.0000000000000007e-15 1.10e-01 1e-16 0.0000000000000000e+00 9.9999999999999998e-17 1.00e+00 At :math:`x = 10^{-16}`, ``log(1 + x)`` returns exactly 0 (total loss of the input), while ``log1p`` gives the correct answer to 16 digits. Similarly, ``exp(x) - 1`` returns exactly 0, while ``expm1`` preserves the information. These functions are essential whenever likelihood ratios or probability differences involve values near 1. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/Appendices/appendix_c_num_fig01_floating_point.png :alt: float64 spacing growing linearly with magnitude, and catastrophic cancellation in the quadratic formula where the smaller root's relative error rises to about 25 percent at B = 1e8. :width: 100% :align: center **Figure C.1: Floating-Point Arithmetic.** (a) The spacing between consecutive ``float64`` numbers scales linearly with magnitude — the line :math:`x \cdot \varepsilon_{\text{mach}}` is an excellent approximation. (b) Catastrophic cancellation in the quadratic formula :math:`x^2 - Bx + 1 = 0`: the naive formula for the smaller root loses digits proportionally to :math:`B`, reaching 25% relative error at :math:`B = 10^8`. With the arithmetic substrate understood, we now turn to the iterative algorithms that operate on it. The floating-point issues from Section 1 — cancellation, roundoff, overflow — will resurface in convergence checks and stopping criteria throughout. Root Finding -------------- Fixed-Point Iteration and Contraction Mapping ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. admonition:: Definition :class: definition A **fixed point** of a function :math:`g` is a value :math:`x^*` such that :math:`g(x^*) = x^*`. Many algorithms reduce to fixed-point iteration: given :math:`x_k`, compute :math:`x_{k+1} = g(x_k)` and repeat. Newton's method, gradient descent, and IRLS all have this structure — they differ only in the choice of :math:`g`. .. admonition:: Definition :class: definition A function :math:`g` is a **contraction mapping** on an interval :math:`[a, b]` if there exists :math:`L < 1` such that :math:`|g(x) - g(y)| \le L|x - y|` for all :math:`x, y \in [a, b]`. The constant :math:`L` is the **Lipschitz constant** (or **contraction rate**). .. admonition:: Theorem: Banach Fixed-Point Theorem :class: note If :math:`g: [a, b] \to [a, b]` is a contraction mapping with Lipschitz constant :math:`L < 1`, then: 1. :math:`g` has a **unique** fixed point :math:`x^*` in :math:`[a, b]` 2. For any starting point :math:`x_0 \in [a, b]`, the iterates :math:`x_{k+1} = g(x_k)` converge to :math:`x^*` 3. The error satisfies :math:`|x_k - x^*| \le L^k |x_0 - x^*|` — **linear convergence** with rate :math:`L` **Why this matters.** Every iterative algorithm in this course is a fixed-point iteration. The contraction condition :math:`L < 1` is what guarantees convergence. For gradient descent, the contraction rate depends on the step size and the curvature of the objective. For Newton's method, the contraction rate depends on the distance to the root. Understanding this framework explains *why* some algorithms converge and others diverge. .. code-block:: python import numpy as np print("Fixed-point iteration: x = cos(x)") print(f" True fixed point: x* = 0.739085...") print() x = 0.0 x_star = 0.7390851332151607 print(f"{'k':>4} {'x_k':>18} {'|x_k - x*|':>14} " f"{'Ratio':>10}") print("-" * 52) prev_err = abs(x - x_star) for k in range(15): err = abs(x - x_star) if k > 0: ratio = err / prev_err print(f"{k:>4} {x:>18.15f} {err:>14.2e} " f"{ratio:>10.4f}") else: print(f"{k:>4} {x:>18.15f} {err:>14.2e} " f"{'—':>10}") prev_err = err x = np.cos(x) print(f"\nContraction constant L ≈ |sin(x*)| = " f"{abs(np.sin(x_star)):.6f}") print(f"Observed ratio converges to L ✓") .. code-block:: text Fixed-point iteration: x = cos(x) True fixed point: x* = 0.739085... k x_k |x_k - x*| Ratio ---------------------------------------------------- 0 0.000000000000000 7.39e-01 — 1 1.000000000000000 2.61e-01 0.3530 2 0.540302305868140 1.99e-01 0.7619 3 0.857553215846393 1.18e-01 0.5960 4 0.654289790497779 8.48e-02 0.7158 5 0.793480358742566 5.44e-02 0.6415 6 0.701368773622757 3.77e-02 0.6934 7 0.763959682900654 2.49e-02 0.6595 8 0.722102425026708 1.70e-02 0.6827 9 0.750417761763761 1.13e-02 0.6673 10 0.731404042422510 7.68e-03 0.6778 11 0.744237354900557 5.15e-03 0.6708 12 0.735604740436347 3.48e-03 0.6755 13 0.741425086610109 2.34e-03 0.6723 14 0.737506890513243 1.58e-03 0.6745 Contraction constant L ≈ |sin(x*)| = 0.673612 Observed ratio converges to L ✓ The error ratio converges to :math:`L \approx 0.674` — the contraction constant. Each iteration reduces the error by roughly 33%, so about 6 iterations are needed per additional decimal digit. This is **linear convergence**, and for smooth :math:`g`, the contraction constant is :math:`L = |g'(x^*)|`. Bisection ~~~~~~~~~~~ The simplest root-finding algorithm: given :math:`f(a) < 0 < f(b)`, the intermediate value theorem guarantees a root in :math:`[a, b]`. Bisection halves the interval at each step, choosing the half that still brackets the root. .. admonition:: Theorem: Bisection Convergence :class: note After :math:`k` bisection steps, the error satisfies :math:`|x_k - x^*| \le (b - a) / 2^k`. This is linear convergence with rate :math:`r = 1/2`, requiring about 3.3 iterations per decimal digit. Bisection requires no derivatives and no smoothness — only that :math:`f` is continuous and changes sign. The guarantee of convergence makes bisection a reliable fallback when other methods fail. SciPy's ``scipy.optimize.brentq`` combines bisection with interpolation for faster convergence while retaining the convergence guarantee — see :ref:`Ch2.3 ` for its use in inverse CDF computation. Newton's Method for Root Finding ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Newton's method finds roots of :math:`f(x) = 0` by linearizing :math:`f` at each iterate: .. math:: x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} Geometrically, each step replaces :math:`f` by its tangent line and takes the root of the tangent as the next iterate. .. admonition:: Theorem: Quadratic Convergence :class: note If :math:`f(x^*) = 0`, :math:`f'(x^*) \ne 0`, and :math:`f''` is continuous near :math:`x^*`, then Newton's method converges **quadratically** near :math:`x^*`: .. math:: |x_{k+1} - x^*| \le C |x_k - x^*|^2 where :math:`C = |f''(x^*)|/(2|f'(x^*)|)`. Quadratic convergence roughly doubles the number of correct digits per iteration. **When Newton fails:** - **Multiple roots** (:math:`f'(x^*) = 0`): convergence slows to linear - **Inflection points**: the tangent line sends the iterate far away - **Cycling**: the iterates oscillate without converging - **Division by zero**: :math:`f'(x_k) = 0` at a non-root Newton's method applied to optimization (:math:`f = \ell'`, finding where the score is zero) is the **Newton-Raphson** algorithm developed in :ref:`Ch3.2 `. .. code-block:: python import numpy as np f = lambda x: x**3 - 2 fp = lambda x: 3 * x**2 root = 2.0**(1.0/3.0) # Bisection a, b = 1.0, 2.0 bisect_errs = [] for i in range(20): mid = (a + b) / 2.0 bisect_errs.append(abs(mid - root)) if f(mid) * f(a) < 0: b = mid else: a = mid # Newton x = 2.0 newton_errs = [abs(x - root)] for i in range(7): x = x - f(x) / fp(x) newton_errs.append(abs(x - root)) print("Root finding: f(x) = x³ − 2, x* = ∛2 = " f"{root:.15f}") print() print(f"{'Iter':>4} {'Bisection |err|':>18} " f"{'Newton |err|':>18}") print("-" * 44) for i in range(max(len(bisect_errs), len(newton_errs))): b_str = (f"{bisect_errs[i]:>18.2e}" if i < len(bisect_errs) else f"{'':>18}") n_str = (f"{newton_errs[i]:>18.2e}" if i < len(newton_errs) else f"{'converged':>18}") print(f"{i:>4} {b_str} {n_str}") .. code-block:: text Root finding: f(x) = x³ − 2, x* = ∛2 = 1.259921049894873 Iter Bisection |err| Newton |err| -------------------------------------------- 0 2.40e-01 7.40e-01 1 9.92e-03 2.40e-01 2 1.15e-01 3.64e-02 3 5.26e-02 1.01e-03 4 2.13e-02 8.11e-07 5 5.70e-03 5.22e-13 6 2.11e-03 0.00e+00 7 1.80e-03 0.00e+00 8 1.55e-04 converged 9 8.21e-04 converged 10 3.33e-04 converged 11 8.87e-05 converged 12 3.34e-05 converged 13 2.77e-05 converged 14 2.84e-06 converged 15 1.24e-05 converged 16 4.79e-06 converged 17 9.78e-07 converged 18 9.30e-07 converged 19 2.40e-08 converged Newton reaches machine precision in 6 iterations; bisection needs 20 iterations to reach :math:`10^{-8}`. The contrast illustrates the power of quadratic vs linear convergence. But bisection *always* works — Newton requires a good starting point and a well-behaved function. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/Appendices/appendix_c_num_fig02_root_finding.png :alt: Root-finding for the cube root of 2: Newton's method with tangent lines converging in a few steps while bisection slowly narrows the bracket, and a log-scale error plot showing Newton's quadratic rate versus bisection's linear rate. :width: 100% :align: center **Figure C.2: Root-Finding Convergence.** (a) Finding :math:`\sqrt[3]{2}` as a root of :math:`f(x) = x^3 - 2`. Newton's method (colored circles with tangent lines) converges rapidly from :math:`x_0 = 2`; bisection (blue squares) narrows the interval steadily. (b) Error on a log scale: bisection decreases linearly (constant slope), while Newton's error plummets quadratically (slope doubles each step). Optimization is root finding in disguise: minimizing :math:`f(\theta)` means finding where the score :math:`\nabla f = 0`. The contraction mapping framework from above carries over directly — the key new question is how the problem's conditioning affects convergence speed. Numerical Optimization ------------------------ Optimization as Fixed-Point Iteration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Minimizing :math:`f(\theta)` by gradient descent produces the iteration :math:`\theta_{k+1} = \theta_k - \alpha \nabla f(\theta_k)`, which is a fixed-point iteration with :math:`g(\theta) = \theta - \alpha \nabla f(\theta)`. The fixed points of :math:`g` are exactly the critical points of :math:`f` (where :math:`\nabla f = 0`). The contraction condition from the Banach theorem translates directly into step size constraints. .. admonition:: Theorem: Gradient Descent Convergence :class: note If :math:`f` is convex with :math:`L`-Lipschitz continuous gradient (i.e., :math:`\|\nabla^2 f\| \le L` everywhere), then gradient descent with step size :math:`0 < \alpha < 2/L` converges: the function value decreases at rate :math:`f(\theta_k) - f(\theta^*) = O(1/k)`. If :math:`f` is additionally **strongly convex** with condition number :math:`\kappa = L/\mu` (ratio of largest to smallest eigenvalue of the Hessian), then the error converges linearly: .. math:: \|\theta_k - \theta^*\| \le \left(\frac{\kappa - 1}{\kappa + 1}\right)^k \|\theta_0 - \theta^*\| Gradient Descent ~~~~~~~~~~~~~~~~~ The update :math:`\theta_{k+1} = \theta_k - \alpha \nabla f(\theta_k)` moves in the direction of steepest descent. The **step size** (learning rate) :math:`\alpha` controls the trade-off: - **Too large:** the iterates overshoot and oscillate (or diverge) - **Too small:** convergence is unnecessarily slow - **Optimal for quadratics:** :math:`\alpha = 2/(L + \mu)`, giving rate :math:`(\kappa - 1)/(\kappa + 1)` **Why ill-conditioned problems converge slowly.** When the Hessian eigenvalues span a wide range (large :math:`\kappa`), gradient descent zigzags: it makes rapid progress along the well-conditioned directions but creeps along the poorly conditioned ones. For :math:`\kappa = 100`, the linear convergence rate is :math:`99/101 \approx 0.98`, requiring about 115 iterations per decimal digit. This motivates Newton's method, which is immune to conditioning. For step size strategies beyond fixed :math:`\alpha` — diminishing sequences, line search with the Armijo condition, trust regions — see the practical safeguards in :ref:`Ch3.2 `. Newton's Method for Optimization ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Newton's method for minimizing :math:`f` solves the score equation :math:`\nabla f = 0` using the update: .. math:: \boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \mathbf{H}^{-1}(\boldsymbol{\theta}_k) \nabla f(\boldsymbol{\theta}_k) where :math:`\mathbf{H}` is the Hessian. Each step minimizes the local quadratic Taylor approximation — see :ref:`Appendix A ` for the connection between second-order Taylor and the quadratic likelihood. .. admonition:: Theorem: Quadratic Convergence of Newton for Optimization :class: note Under regularity conditions (Hessian Lipschitz continuous, positive definite at the minimizer), Newton's method converges quadratically near the minimizer: the number of correct digits roughly doubles per iteration. **Affine invariance.** Newton's method is invariant to affine reparameterizations: if :math:`\phi = A\theta + b` for a nonsingular :math:`A`, Newton on :math:`f(\theta)` and Newton on :math:`f(A^{-1}(\phi - b))` produce the same iterates (in transformed coordinates). This means Newton is unaffected by poor conditioning — unlike gradient descent. The cost is computing and solving with the Hessian at each step. For the full statistical treatment of Newton's method for MLE — including Fisher scoring (replacing the observed Hessian with expected information) and IRLS (Fisher scoring for GLMs) — see :ref:`Ch3.2 ` and :ref:`Ch3.5 `. .. code-block:: python import numpy as np from scipy.special import gammaln, digamma, polygamma rng = np.random.default_rng(42) n = 50 data = rng.gamma(4.0, 1.0/2.0, n) def neg_ll(alpha, beta): return -(n*alpha*np.log(beta) - n*gammaln(alpha) + (alpha-1)*np.sum(np.log(data)) - beta*np.sum(data)) def grad_neg(alpha, beta): g_a = -(n*np.log(beta) - n*digamma(alpha) + np.sum(np.log(data))) g_b = -(n*alpha/beta - np.sum(data)) return np.array([g_a, g_b]) def hess_neg(alpha, beta): h11 = n * polygamma(1, alpha) h12 = -n / beta h22 = n * alpha / beta**2 return np.array([[h11, h12], [h12, h22]]) # Newton (find MLE and track convergence) a_nw, b_nw = 2.0, 1.0 nw_path = [(a_nw, b_nw)] for _ in range(20): g = grad_neg(a_nw, b_nw) H = hess_neg(a_nw, b_nw) step = np.linalg.solve(H, g) a_nw -= step[0]; b_nw -= step[1] nw_path.append((a_nw, b_nw)) if np.linalg.norm(step) < 1e-14: break a_star, b_star = a_nw, b_nw # Gradient descent H_mle = hess_neg(a_star, b_star) L = max(np.linalg.eigvalsh(H_mle)) kappa = L / min(np.linalg.eigvalsh(H_mle)) lr = 1.0 / L a_gd, b_gd = 2.0, 1.0 gd_path = [(a_gd, b_gd)] for _ in range(200): g = grad_neg(a_gd, b_gd) a_gd -= lr * g[0]; b_gd -= lr * g[1] a_gd = max(a_gd, 0.01); b_gd = max(b_gd, 0.01) gd_path.append((a_gd, b_gd)) dist = lambda p: np.sqrt( (p[0]-a_star)**2 + (p[1]-b_star)**2) print(f"Gamma MLE (n={n}): α̂ = {a_star:.4f}, " f"β̂ = {b_star:.4f}") print(f"Condition number κ = {kappa:.1f}") print(f"GD step size α = 1/L = {lr:.4f}") print() print(f"{'Iter':>4} {'GD ‖θ−θ*‖':>14} " f"{'Newton ‖θ−θ*‖':>14}") print("-" * 36) for k in [0, 1, 2, 3, 4, 5, 10, 25, 50, 100, 200]: gd_e = (f"{dist(gd_path[k]):>14.6f}" if k < len(gd_path) else "") nw_e = (f"{dist(nw_path[k]):>14.2e}" if k < len(nw_path) else f"{'converged':>14}") print(f"{k:>4} {gd_e} {nw_e}") .. code-block:: text Gamma MLE (n=50): α̂ = 6.6685, β̂ = 3.5977 Condition number κ = 73.3 GD step size α = 1/L = 0.0299 Iter GD ‖θ−θ*‖ Newton ‖θ−θ*‖ ------------------------------------ 0 5.342546 5.34e+00 1 5.083337 3.84e+00 2 4.862029 1.98e+00 3 4.664648 5.31e-01 4 4.452428 3.80e-02 5 4.283482 1.95e-04 10 3.013962 converged 25 2.173839 converged 50 1.384307 converged 100 0.628403 converged 200 0.149146 converged Newton converges to the MLE in 5 iterations (error :math:`< 10^{-4}`) while gradient descent still has error 0.15 after 200 iterations. The condition number :math:`\kappa = 73.3` explains the slow GD convergence: the theoretical linear rate is :math:`(73.3 - 1)/(73.3 + 1) = 0.973`, requiring about :math:`\log(0.1)/\log(0.973) \approx 84` iterations per decimal digit. .. figure:: https://pqyjaywwccbnqpwgeiuv.supabase.co/storage/v1/object/public/STAT%20418%20Images/assets/Appendices/appendix_c_num_fig03_optimization.png :alt: Contour plot of the negative Gamma log-likelihood where gradient descent zigzags through the elongated contours while Newton's method steps almost straight to the MLE, with an error-versus-iteration plot showing Newton reaching machine precision in about six steps. :width: 100% :align: center **Figure C.3: Gradient Descent vs Newton.** (a) Contour plot of the negative Gamma log-likelihood. Gradient descent (red) zigzags through the elongated contours; Newton (green) cuts directly to the MLE (star). (b) Error vs iteration: Newton reaches machine precision in ~6 steps (quadratic convergence), while GD converges linearly with a slow rate set by the condition number. Quasi-Newton Methods ~~~~~~~~~~~~~~~~~~~~~~ Newton's method requires computing and factoring the :math:`p \times p` Hessian at every step — expensive for large :math:`p`. **Quasi-Newton methods** approximate the Hessian using only gradient information, updating the approximation incrementally. **BFGS** (Broyden-Fletcher-Goldfarb-Shanno) maintains a dense :math:`p \times p` approximate Hessian, achieving superlinear convergence without exact second derivatives. **L-BFGS-B** (limited-memory BFGS with bounds) stores only the last :math:`m` gradient differences (typically :math:`m = 10`), reducing memory from :math:`O(p^2)` to :math:`O(mp)` and supporting simple box constraints :math:`\ell_j \le \theta_j \le u_j`. In SciPy, ``scipy.optimize.minimize(method='L-BFGS-B')`` is the standard workhorse for unconstrained and box-constrained optimization — it is used throughout :ref:`Ch3.2 ` for production MLE computation. The tradeoff is clear: more derivative information means fewer evaluations. .. code-block:: python import numpy as np from scipy.optimize import minimize rng = np.random.default_rng(42) data = rng.normal(5, 3, 200) # Parameterize as (mu, log_sigma2) for unconstrained opt def neg_loglik(params): mu, log_s2 = params s2 = np.exp(log_s2) n = len(data) return (n/2*np.log(2*np.pi*s2) + np.sum((data - mu)**2) / (2*s2)) def grad_neg(params): mu, log_s2 = params s2 = np.exp(log_s2) n = len(data) g_mu = -np.sum(data - mu) / s2 g_ls2 = n/2 - np.sum((data - mu)**2) / (2*s2) return np.array([g_mu, g_ls2]) def hess_neg(params): mu, log_s2 = params s2 = np.exp(log_s2) n = len(data) H11 = n / s2 H12 = np.sum(data - mu) / s2 H22 = np.sum((data - mu)**2) / (2*s2) return np.array([[H11, H12], [H12, H22]]) x0 = [0.0, 0.0] # start far from MLE res_ncg = minimize(neg_loglik, x0, method='Newton-CG', jac=grad_neg, hess=hess_neg) res_lbfgs = minimize(neg_loglik, x0, method='L-BFGS-B', jac=grad_neg) res_nm = minimize(neg_loglik, x0, method='Nelder-Mead') mu_hat = np.mean(data) s2_hat = np.var(data) print("Method comparison for Normal MLE (n=200):") print(f"True MLE: mu={mu_hat:.4f}, " f"sigma2={s2_hat:.4f}") print() print(f"{'Method':>12} {'Derivs':>8} {'Iters':>6} " f"{'f-evals':>8} {'mu':>10} {'sigma2':>10}") print("-" * 65) for name, res, derivs in [ ('Newton-CG', res_ncg, 'grad+H'), ('L-BFGS-B', res_lbfgs, 'grad'), ('Nelder-Mead', res_nm, 'none')]: mu = res.x[0] s2 = np.exp(res.x[1]) print(f"{name:>12} {derivs:>8} {res.nit:>6} " f"{res.nfev:>8} {mu:>10.4f} {s2:>10.4f}") .. code-block:: text Method comparison for Normal MLE (n=200): True MLE: mu=4.9086, sigma2=6.9653 Method Derivs Iters f-evals mu sigma2 ----------------------------------------------------------------- Newton-CG grad+H 12 14 4.9086 6.9653 L-BFGS-B grad 15 16 4.9086 6.9652 Nelder-Mead none 82 156 4.9087 6.9652 Newton-CG converges fastest with access to both gradient and Hessian; L-BFGS-B is nearly as fast using only gradients (it approximates the Hessian internally); Nelder-Mead needs 10× more function evaluations since it must infer geometry from function values alone. All of the optimization methods above require gradients, and Newton additionally requires the Hessian. When analytical derivatives are unavailable or impractical, we need to compute them numerically — and the choice of step size turns out to be a surprisingly delicate problem. Numerical Differentiation ---------------------------- Forward and Central Differences ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When analytical derivatives are unavailable, we approximate :math:`f'(x)` from function evaluations. .. admonition:: Definition :class: definition The **forward difference** approximation is :math:`f'(x) \approx \frac{f(x+h) - f(x)}{h}`, with error :math:`O(h)` from the first-order Taylor remainder. .. admonition:: Definition :class: definition The **central difference** approximation is :math:`f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}`, with error :math:`O(h^2)`. The extra order comes from the cancellation of odd-powered error terms when subtracting symmetric evaluations. The derivations follow directly from Taylor's theorem (:ref:`Appendix A `). For the forward difference: :math:`f(x+h) = f(x) + hf'(x) + \frac{h^2}{2}f''(c)`, so :math:`\frac{f(x+h) - f(x)}{h} = f'(x) + O(h)`. For the central difference: the odd terms cancel, leaving :math:`O(h^2)`. The Step Size Dilemma ~~~~~~~~~~~~~~~~~~~~~~~ Smaller :math:`h` should give more accurate derivatives — but floating-point arithmetic introduces a competing error source. .. admonition:: Theorem: Truncation-Roundoff Tradeoff :class: note The total error in a finite-difference approximation has two components: - **Truncation error** (from the Taylor remainder): :math:`O(h)` for forward, :math:`O(h^2)` for central - **Roundoff error** (from subtracting nearly equal floats): :math:`O(\varepsilon_{\text{mach}} / h)` Minimizing the total error gives optimal step sizes: - **Forward difference:** :math:`h^* \sim \sqrt{\varepsilon_{\text{mach}}} \approx 1.5 \times 10^{-8}` - **Central difference:** :math:`h^* \sim \varepsilon_{\text{mach}}^{1/3} \approx 6 \times 10^{-6}` This explains why ``epsilon=1e-5`` appears as the default step size for numerical Hessians in :ref:`Ch3.3 ` — it is near-optimal for central differences in ``float64``. .. code-block:: python import numpy as np x0 = 1.0 true_deriv = np.exp(x0) print("Numerical differentiation of f(x) = eˣ " "at x = 1") print(f"True f′(1) = e = {true_deriv:.15f}") print() print(f"{'h':>12} {'Forward err':>14} " f"{'Central err':>14}") print("-" * 44) for p in range(1, 17): h = 10.0**(-p) fwd = (np.exp(x0 + h) - np.exp(x0)) / h ctr = (np.exp(x0 + h) - np.exp(x0 - h)) / (2*h) fwd_err = abs(fwd - true_deriv) ctr_err = abs(ctr - true_deriv) print(f"{h:>12.0e} {fwd_err:>14.2e} " f"{ctr_err:>14.2e}") print(f"\nOptimal h (forward): " f"√ε ≈ {np.sqrt(np.finfo(float).eps):.2e}") print(f"Optimal h (central): " f"ε^(1/3) ≈ {np.finfo(float).eps**(1/3):.2e}") .. code-block:: text Numerical differentiation of f(x) = eˣ at x = 1 True f′(1) = e = 2.718281828459045 h Forward err Central err -------------------------------------------- 1e-01 1.41e-01 4.53e-03 1e-02 1.36e-02 4.53e-05 1e-03 1.36e-03 4.53e-07 1e-04 1.36e-04 4.53e-09 1e-05 1.36e-05 5.86e-11 1e-06 1.36e-06 1.63e-10 1e-07 1.40e-07 5.86e-11 1e-08 6.60e-09 6.60e-09 1e-09 2.15e-07 6.60e-09 1e-10 1.55e-06 6.73e-07 1e-11 3.26e-05 1.04e-05 1e-12 4.32e-04 2.10e-04 1e-13 4.56e-04 4.56e-04 1e-14 9.34e-03 9.34e-03 1e-15 3.90e-01 1.68e-01 1e-16 2.72e+00 2.72e+00 Optimal h (forward): √ε ≈ 1.49e-08 Optimal h (central): ε^(1/3) ≈ 6.06e-06 The V-shaped error pattern is unmistakable. For :math:`h > h^*`, truncation error dominates (error decreases with smaller :math:`h`). For :math:`h < h^*`, roundoff error dominates (error *increases* with smaller :math:`h`). The forward difference achieves a best error of :math:`\sim 10^{-8}` at :math:`h \approx 10^{-8}`; the central difference achieves :math:`\sim 10^{-11}` at :math:`h \approx 10^{-5}`. The Complex-Step Method ~~~~~~~~~~~~~~~~~~~~~~~~~ .. admonition:: Definition :class: definition The **complex-step derivative** approximates :math:`f'(x)` using a purely imaginary perturbation: .. math:: f'(x) \approx \frac{\text{Im}(f(x + ih))}{h} This formula has :math:`O(h^2)` truncation error and **no roundoff error from subtraction**, because it uses only the imaginary part — no subtraction of real-valued quantities occurs. **Why it works.** Taylor-expanding :math:`f` in the complex plane: :math:`f(x + ih) = f(x) + ihf'(x) - h^2 f''(x)/2 - \ldots`. Taking the imaginary part: :math:`\text{Im}(f(x+ih)) = hf'(x) + O(h^3)`. No cancellation-prone subtraction is involved. **Limitation:** the function must accept complex inputs. Functions containing ``abs``, conditional branches, or non-analytic operations will fail or give incorrect results. .. code-block:: python import numpy as np x0 = 1.0 true_deriv = np.exp(x0) print("Complex-step vs central differences " "for f(x) = eˣ") print(f"{'h':>12} {'Central err':>14} " f"{'Complex-step err':>18}") print("-" * 48) for p in range(1, 17): h = 10.0**(-p) ctr = (np.exp(x0 + h) - np.exp(x0 - h)) / (2*h) cplx = np.imag(np.exp(x0 + 1j*h)) / h ctr_err = abs(ctr - true_deriv) cplx_err = abs(cplx - true_deriv) print(f"{h:>12.0e} {ctr_err:>14.2e} " f"{cplx_err:>18.2e}") print(f"\nCentral differences plateau at " f"h ≈ 1e-5 to 1e-8") print(f"Complex-step improves indefinitely " f"(no subtraction)") .. code-block:: text Complex-step vs central differences for f(x) = eˣ h Central err Complex-step err ------------------------------------------------ 1e-01 4.53e-03 4.53e-03 1e-02 4.53e-05 4.53e-05 1e-03 4.53e-07 4.53e-07 1e-04 4.53e-09 4.53e-09 1e-05 5.86e-11 4.53e-11 1e-06 1.63e-10 4.53e-13 1e-07 5.86e-11 4.44e-15 1e-08 6.60e-09 0.00e+00 1e-09 6.60e-09 0.00e+00 1e-10 6.73e-07 4.44e-16 1e-11 1.04e-05 0.00e+00 1e-12 2.10e-04 0.00e+00 1e-13 4.56e-04 0.00e+00 1e-14 9.34e-03 0.00e+00 1e-15 1.68e-01 0.00e+00 1e-16 2.72e+00 0.00e+00 Central differences plateau at h ≈ 1e-5 to 1e-8 Complex-step improves indefinitely (no subtraction) The complex-step method achieves machine-precision accuracy at :math:`h = 10^{-8}` and maintains it for arbitrarily smaller :math:`h` — while central differences degrade below :math:`h \approx 10^{-5}`. **Automatic differentiation.** Modern probabilistic programming frameworks (PyMC, JAX, PyTorch) use **automatic differentiation** (AD) rather than finite differences. AD computes exact derivatives (to machine precision) by applying the chain rule to the computational graph of the function. **Reverse-mode AD** is particularly efficient for gradients of scalar-valued functions — the cost of the gradient equals a small multiple of the cost of :math:`f` itself, regardless of the number of parameters. This is what makes gradient-based MCMC (HMC, NUTS) practical in :ref:`Ch5.5 `. The complementary problem to differentiation is integration: computing expectations, normalizing constants, and marginal likelihoods. The error analysis follows a similar pattern — polynomial approximation order determines the convergence rate — but a fundamentally different tool (Monte Carlo) dominates in high dimensions. Numerical Integration ------------------------ Error Analysis for Quadrature ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The quadrature rules introduced in :ref:`Ch2.1 ` have error rates that follow directly from how well their piecewise-polynomial interpolants approximate the integrand. .. admonition:: Theorem: Quadrature Error Rates :class: note For a function :math:`f` with :math:`n` equally spaced evaluation points on :math:`[a, b]` with spacing :math:`h = (b-a)/(n-1)`: - **Trapezoidal rule** (piecewise linear): error :math:`= O(h^2)`. Doubling :math:`n` reduces error by 4×. - **Simpson's rule** (piecewise quadratic): error :math:`= O(h^4)`. Doubling :math:`n` reduces error by 16×. - **Gaussian quadrature** (optimal polynomial nodes): for analytic integrands, error decreases *exponentially* in :math:`n`. These rates assume sufficient smoothness of :math:`f` — each rule achieves its full order only if :math:`f` has enough continuous derivatives. The trapezoidal error rate :math:`O(h^2)` comes from the **Euler-Maclaurin formula**, which relates the trapezoidal sum to the integral plus correction terms involving even powers of :math:`h` (derivatives of :math:`f` at the endpoints). Simpson's rule is equivalent to Richardson extrapolation of the trapezoidal rule — the weighted combination :math:`S = (4T(h) - T(2h))/3` eliminates the leading :math:`O(h^2)` error term, yielding :math:`O(h^4)`. Gaussian quadrature achieves exponential rates by choosing nodes at roots of orthogonal polynomials, optimizing the interpolation points rather than spacing them uniformly. We can verify these rates empirically by doubling :math:`n` and checking the error ratio: .. code-block:: python import numpy as np from scipy.integrate import simpson def f(x): return np.exp(-x**2) a, b = 0.0, 2.0 exact = 0.8820813907624215 # from scipy.integrate.quad print("Quadrature convergence for integral of " "exp(-x^2) on [0, 2]") print(f"Exact value: {exact:.10f}") print() print(f"{'n':>6} {'Trap err':>12} {'Simp err':>12} " f"{'Trap ratio':>11} {'Simp ratio':>11}") print("-" * 60) prev_trap = prev_simp = None for n in [5, 9, 17, 33, 65, 129]: x = np.linspace(a, b, n) y = f(x) trap = np.trapz(y, x) simp = simpson(y, x=x) e_trap = abs(trap - exact) e_simp = abs(simp - exact) if prev_trap is not None: r_trap = prev_trap / e_trap r_simp = prev_simp / e_simp print(f"{n:>6} {e_trap:>12.2e} " f"{e_simp:>12.2e} " f"{r_trap:>11.1f} {r_simp:>11.1f}") else: print(f"{n:>6} {e_trap:>12.2e} " f"{e_simp:>12.2e} " f"{'---':>11} {'---':>11}") prev_trap = e_trap prev_simp = e_simp .. code-block:: text Quadrature convergence for integral of exp(-x^2) on [0, 2] Exact value: 0.8820813908 n Trap err Simp err Trap ratio Simp ratio ------------------------------------------------------------ 5 1.46e-03 2.69e-04 --- --- 9 3.78e-04 1.59e-05 3.9 16.9 17 9.51e-05 9.94e-07 4.0 16.0 33 2.38e-05 6.21e-08 4.0 16.0 65 5.96e-06 3.88e-09 4.0 16.0 129 1.49e-06 2.43e-10 4.0 16.0 The trapezoidal error ratio converges to 4 (:math:`= 2^2`), confirming :math:`O(h^2)`; Simpson's ratio converges to 16 (:math:`= 2^4`), confirming :math:`O(h^4)`. Each doubling of :math:`n` buys about twice as many digits of accuracy with Simpson as with the trapezoidal rule. For the detailed convergence comparison code and figures, including the crossover dimension where Monte Carlo beats deterministic quadrature, see :ref:`Ch2.1 `. Adaptive Quadrature ~~~~~~~~~~~~~~~~~~~~~ SciPy's ``scipy.integrate.quad`` uses **adaptive Gaussian quadrature**: it estimates the integral and its error on each subinterval, then recursively subdivides intervals where the error estimate is too large. This concentrates function evaluations where the integrand varies rapidly, avoiding wasted evaluations in smooth regions. .. code-block:: python import numpy as np from scipy.integrate import quad def multi_peak(x): """Integrand with features at multiple scales.""" return (np.exp(-100*(x - 0.3)**2) + 0.5*np.exp(-1000*(x - 0.7)**2) + 0.1) result, error, info = quad(multi_peak, 0, 1, full_output=True, limit=100) print("Adaptive quadrature on multi-scale integrand") print(f" I = ∫₀¹ [exp(-100(x−0.3)²) + " f"0.5·exp(-1000(x−0.7)²) + 0.1] dx") print(f" Result = {result:.10f}") print(f" Estimated error = {error:.2e}") print(f" Function evaluations: {info['neval']}") print(f" Subintervals: {info['last']}") print() for n_pts in [21, 101, 1001]: x = np.linspace(0, 1, n_pts) trap = np.trapz(multi_peak(x), x) print(f" Trapezoidal (n={n_pts:>4}): " f"{trap:.10f} " f"error = {abs(trap - result):.2e}") .. code-block:: text Adaptive quadrature on multi-scale integrand I = ∫₀¹ [exp(-100(x−0.3)²) + 0.5·exp(-1000(x−0.7)²) + 0.1] dx Result = 0.3052683835 Estimated error = 2.15e-09 Function evaluations: 231 Subintervals: 6 Trapezoidal (n= 21): 0.3063485748 error = 1.08e-03 Trapezoidal (n= 101): 0.3052683221 error = 6.14e-08 Trapezoidal (n=1001): 0.3052683828 error = 6.17e-10 Adaptive quadrature achieves :math:`10^{-9}` accuracy with 231 function evaluations. The trapezoidal rule needs 1001 points for comparable accuracy — and it doesn't provide an error estimate. For a single integral with potentially sharp features, ``quad`` is almost always the right tool. Monte Carlo Error Rate ~~~~~~~~~~~~~~~~~~~~~~~~ .. admonition:: Theorem: Monte Carlo Convergence :class: note For :math:`X_1, \ldots, X_n \sim f` with :math:`\text{Var}(g(X)) < \infty`: .. math:: \frac{1}{n}\sum_{i=1}^n g(X_i) \to \mathbb{E}[g(X)] \quad \text{at rate } O(n^{-1/2}) The error is **dimension-independent**: the same :math:`O(n^{-1/2})` rate holds whether :math:`X` is scalar or 1000-dimensional. Quadrature on a :math:`d`-dimensional grid with :math:`m` points per axis requires :math:`m^d` evaluations, while Monte Carlo uses :math:`n` evaluations regardless of :math:`d`. This **curse of dimensionality** is why Monte Carlo dominates in high dimensions: the crossover typically occurs at :math:`d \approx 3\text{--}5` dimensions. Below that, quadrature's faster polynomial rate wins; above it, the exponential growth in grid points makes quadrature impractical. The detailed analysis and figures are in :ref:`Ch2.1 `. Sections 2–5 each introduced specific convergence rates — bisection's :math:`O(1/2^k)`, Newton's quadratic convergence, GD's linear rate, and the :math:`O(h^2)` vs :math:`O(h^4)` quadrature rates. The next section formalizes this into a unified framework. Convergence and Stopping Criteria ------------------------------------ This section formalizes the general framework. We begin with the asymptotic notation that has appeared informally throughout this appendix, then define convergence orders precisely, and finally address the practical question of when to stop iterating. Asymptotic (Big-O) Notation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Throughout this appendix we have written statements like "the forward difference has error :math:`O(h)`" and "the trapezoidal rule converges at rate :math:`O(h^2)`." This **asymptotic notation**, introduced by Bachmann (1894) and popularized by Landau, makes precise the idea that one quantity "grows no faster than" another. .. admonition:: Definition :class: definition Let :math:`f` and :math:`g` be functions defined near some limit (typically :math:`h \to 0` or :math:`n \to \infty`). We write .. math:: f(x) = O(g(x)) if there exist constants :math:`C > 0` and a threshold :math:`x_0` such that .. math:: |f(x)| \le C\,|g(x)| \quad \text{for all } x \text{ past the threshold.} In words: :math:`f` grows (or shrinks) **at most as fast** as :math:`g`, up to a constant factor. The constant :math:`C` and threshold :math:`x_0` exist but are deliberately left unspecified. Big-O captures the *rate* of growth or decay while ignoring the constant prefactor. This is exactly what we need for comparing numerical methods: whether an error shrinks like :math:`h` or like :math:`h^2` matters far more than whether the leading coefficient is 0.3 or 7. Big-O appears in two distinct contexts in this book, and recognizing which one is in play prevents confusion: - **As** :math:`h \to 0` **(error analysis)**. When we write "the forward difference error is :math:`O(h)`," we mean the error satisfies :math:`|\text{error}| \le Ch` for all sufficiently small :math:`h`. Making :math:`h` smaller *improves* accuracy, and the exponent tells us by how much. - **As** :math:`n \to \infty` **(algorithmic cost or convergence)**. When we write "Cholesky factorization costs :math:`O(n^3/3)` operations" (:ref:`Appendix B `), we mean the operation count grows at most cubically. When we write "Monte Carlo error is :math:`O(n^{-1/2})`," we mean the error shrinks like :math:`1/\sqrt{n}` as the sample size grows. **Example 1: Taylor remainder and numerical differentiation.** From Taylor's theorem (:ref:`Appendix A `), the exponential function satisfies .. math:: e^h = 1 + h + \frac{h^2}{2} + O(h^3) The :math:`O(h^3)` term means the approximation :math:`e^h \approx 1 + h + h^2/2` has an error bounded by :math:`C h^3` for some constant :math:`C`. We can verify this: at :math:`h = 0.1` the error is :math:`1.71 \times 10^{-4}`, at :math:`h = 0.01` it is :math:`1.67 \times 10^{-7}`, and at :math:`h = 0.001` it is :math:`1.67 \times 10^{-10}`. Each factor-of-10 reduction in :math:`h` reduces the error by a factor of :math:`10^3 = 1000`, confirming the cubic rate. The hidden constant turns out to be :math:`C = 1/6` (from the next Taylor term :math:`h^3/3!`), but the :math:`O(h^3)` statement is useful precisely because we don't need to know :math:`C` to predict *how fast* the error shrinks. This is how all the error rates in Section 4 arise: the forward difference error is :math:`O(h)` because Taylor expansion leaves a remainder proportional to :math:`h`; the central difference error is :math:`O(h^2)` because the leading remainder cancels. **Example 2: Monte Carlo convergence.** The Central Limit Theorem implies that the Monte Carlo estimator :math:`\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i` has root-mean-square error .. math:: \text{RMSE} = \frac{\sigma}{\sqrt{n}} = O(n^{-1/2}) Here :math:`C = \sigma` (the population standard deviation) is the hidden constant. The key insight from Section 5 is that this rate is **dimension-independent**: whether :math:`X_i \in \mathbb{R}` or :math:`X_i \in \mathbb{R}^{1000}`, the error still decreases as :math:`O(n^{-1/2})`. Only the constant :math:`\sigma` changes with dimension — and big-O notation lets us separate the rate (which determines feasibility) from the constant (which determines cost). **Related notation.** Big-O gives an *upper bound* on growth rate. Two companion notations complete the picture: - :math:`f(x) = \Omega(g(x))` means :math:`f` grows **at least as fast** as :math:`g`: there exists :math:`c > 0` such that :math:`|f(x)| \ge c\,|g(x)|` past some threshold. This is the lower-bound counterpart of :math:`O`. - :math:`f(x) = \Theta(g(x))` means :math:`f` grows **at the same rate** as :math:`g`: both :math:`f = O(g)` and :math:`f = \Omega(g)` hold. When we say "Cholesky costs :math:`\Theta(n^3)` operations," we mean the cost is *exactly* cubic — not just at most cubic. Most of the convergence rates in this appendix are tight (:math:`\Theta` rather than just :math:`O`), but we follow the common convention of writing :math:`O` unless the distinction matters. With this notation in hand, we can precisely classify how quickly iterative algorithms converge. Convergence Orders ~~~~~~~~~~~~~~~~~~~~ .. admonition:: Definition :class: definition An iterative sequence :math:`\{x_k\}` converging to :math:`x^*` has **linear convergence with rate** :math:`r \in (0, 1)` if: .. math:: |x_{k+1} - x^*| \le r \, |x_k - x^*| \quad \text{for all } k \text{ sufficiently large} The error decreases by a constant fraction each step. Smaller :math:`r` means faster convergence. .. admonition:: Definition :class: definition The sequence has **quadratic convergence** if: .. math:: |x_{k+1} - x^*| \le C \, |x_k - x^*|^2 for some constant :math:`C`. The number of correct digits approximately doubles per iteration. .. list-table:: Convergence orders of algorithms in this course :header-rows: 1 :widths: 35 20 45 * - Algorithm - Order - Rate * - Bisection - Linear - :math:`r = 0.5` (halving the interval) * - Fixed-point :math:`x = g(x)` - Linear - :math:`r = |g'(x^*)|` * - Gradient descent (strongly convex) - Linear - :math:`r = (\kappa-1)/(\kappa+1)` * - Newton's method - Quadratic - Doubles correct digits per step * - IRLS (canonical link) - Quadratic - Same as Newton (Fisher scoring = Newton) **Estimating convergence order from iterates.** If we have a sequence of errors :math:`e_0, e_1, \ldots`, the convergence order :math:`p` satisfies :math:`\log|e_{k+1}| \approx p \log|e_k| + \text{const}`. So the ratio :math:`\log|e_{k+1}|/\log|e_k|` estimates :math:`p`: values near 1 indicate linear convergence, near 2 indicate quadratic. We demonstrate both diagnostics on the :math:`x = \cos(x)` problem from Section 2 — the same problem, two algorithms, two convergence orders. .. code-block:: python import numpy as np x_star = 0.7390851332151607 # Fixed-point iteration: g(x) = cos(x) [linear] x = 0.0 fp_errors = [abs(x - x_star)] for _ in range(30): x = np.cos(x) fp_errors.append(abs(x - x_star)) # Newton on f(x) = x - cos(x) [quadratic] x = 0.0 nw_errors = [abs(x - x_star)] for _ in range(7): f_val = x - np.cos(x) fp_val = 1 + np.sin(x) x = x - f_val / fp_val nw_errors.append(abs(x - x_star)) print("Convergence order estimation: x = cos(x)") print(f" x* = {x_star:.10f}") print() print("Newton on f(x) = x − cos(x):") print(f"{'k':>4} {'|e_k|':>14} {'|e_{k+1}|/|e_k|':>18}" f" {'log ratio':>12}") print("-" * 54) for k in range(len(nw_errors) - 1): e_k = nw_errors[k] e_k1 = nw_errors[k + 1] if e_k > 0 and e_k1 > 0: ratio = e_k1 / e_k log_ratio = (np.log(e_k1) / np.log(e_k) if abs(np.log(e_k)) > 0.01 else float('nan')) print(f"{k:>4} {e_k:>14.2e} " f"{ratio:>18.6f} " f"{log_ratio:>12.2f}") else: print(f"{k:>4} {e_k:>14.2e} " f"{'— (converged)':>18}") print() print("Fixed-point g(x) = cos(x):") print(f"{'k':>4} {'|e_k|':>14} " f"{'|e_{k+1}|/|e_k|':>18}") print("-" * 40) for k in [0, 5, 10, 15, 20, 25]: if k + 1 < len(fp_errors): ratio = fp_errors[k+1] / fp_errors[k] print(f"{k:>4} {fp_errors[k]:>14.2e} " f"{ratio:>18.6f}") L = abs(np.sin(x_star)) print(f"\nNewton: log ratio → 2 (quadratic)") print(f"Fixed-point: ratio → L = |sin(x*)| " f"= {L:.4f} (linear)") .. code-block:: text Convergence order estimation: x = cos(x) x* = 0.7390851332 Newton on f(x) = x − cos(x): k |e_k| |e_{k+1}|/|e_k| log ratio ------------------------------------------------------ 0 7.39e-01 0.353024 4.44 1 2.61e-01 0.043228 3.34 2 1.13e-02 0.002461 2.34 3 2.78e-05 0.000006 2.14 4 1.70e-10 — (converged) 5 0.00e+00 — (converged) 6 0.00e+00 — (converged) Fixed-point g(x) = cos(x): k |e_k| |e_{k+1}|/|e_k| ---------------------------------------- 0 7.39e-01 0.353024 5 5.44e-02 0.693376 10 7.68e-03 0.670767 15 1.06e-03 0.674004 20 1.47e-04 0.673558 25 2.04e-05 0.673620 Newton: log ratio → 2 (quadratic) Fixed-point: ratio → L = |sin(x*)| = 0.6736 (linear) Same problem, same starting point, same fixed point — but Newton's log ratio converges to 2 (quadratic), while the fixed-point ratio converges to :math:`L = |g'(x^*)| = |\sin(x^*)| \approx 0.674` (linear). The contrast between 5 Newton iterations and 25+ fixed-point iterations is explained entirely by the convergence order. Stopping Criteria ~~~~~~~~~~~~~~~~~~~ Every iterative algorithm needs a rule for when to stop. Common criteria: - **Absolute change:** :math:`\|\theta_{k+1} - \theta_k\| < \varepsilon_{\text{abs}}`. Simple but scale-dependent — a tolerance of :math:`10^{-8}` is tight for :math:`\theta \sim 1` but meaningless for :math:`\theta \sim 10^{10}`. - **Relative change:** :math:`\|\theta_{k+1} - \theta_k\| / \|\theta_k\| < \varepsilon_{\text{rel}}`. Scale-invariant, usually preferred. A relative tolerance of :math:`10^{-8}` means the last 8 digits have stabilized. - **Gradient norm:** :math:`\|\nabla f(\theta_k)\| < \varepsilon`. Directly tests the first-order optimality condition. Used by ``scipy.optimize`` (the ``gtol`` parameter). - **Function value change:** :math:`|f(\theta_{k+1}) - f(\theta_k)| < \varepsilon`. Can be misleading near flat regions — the function value barely changes but the parameter is far from optimal. **Common pitfalls:** - **Premature stopping on a plateau:** the function value or gradient may be small on a flat region far from the optimum, particularly near saddle points - **Tolerance too tight:** below :math:`\varepsilon_{\text{mach}}` there is no information left — tighter tolerance just wastes iterations or causes cycling - **Maximum iterations without convergence:** always set a maximum iteration count and check whether it was hit Diagnosing Non-Convergence ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When an optimization algorithm claims convergence, verify by checking the Hessian at the reported solution. .. code-block:: python import numpy as np # f(x,y) = x^4/4 - x^2/2 + y^2/2 # Saddle at origin, minima at (±1, 0) def grad_f(x, y): return np.array([x**3 - x, y]) def hess_f(x, y): return np.array([[3*x**2 - 1, 0], [0, 1]]) # Newton starting near origin x, y = 0.1, 0.3 print("Newton on f(x,y) = x⁴/4 − x²/2 + y²/2") print(f"Start: ({x}, {y})") print() print(f"{'k':>3} {'x':>12} {'y':>12} {'‖∇f‖':>12}") print("-" * 45) for k in range(5): g = grad_f(x, y) H = hess_f(x, y) print(f"{k:>3} {x:>12.8f} {y:>12.8f} " f"{np.linalg.norm(g):>12.2e}") step = np.linalg.solve(H, g) x -= step[0]; y -= step[1] print() print(f"Converged to ({x:.10f}, {y:.10f})") print(f"Gradient norm: " f"{np.linalg.norm(grad_f(x, y)):.2e}") print() print("Diagnosis: Hessian eigenvalues") H_conv = hess_f(x, y) eigs = np.linalg.eigvalsh(H_conv) print(f" Eigenvalues: [{eigs[0]:.4f}, {eigs[1]:.4f}]") if np.any(eigs < 0): print(f" Negative eigenvalue → SADDLE POINT") print(f" True minima are at (±1, 0) " f"with f = −0.25") else: print(f" All positive → confirmed minimum") .. code-block:: text Newton on f(x,y) = x⁴/4 − x²/2 + y²/2 Start: (0.1, 0.3) k x y ‖∇f‖ --------------------------------------------- 0 0.10000000 0.30000000 3.16e-01 1 -0.00206186 0.00000000 2.06e-03 2 0.00000002 0.00000000 1.75e-08 3 -0.00000000 0.00000000 9.93e-24 4 0.00000000 0.00000000 0.00e+00 Converged to (0.0000000000, 0.0000000000) Gradient norm: 0.00e+00 Diagnosis: Hessian eigenvalues Eigenvalues: [-1.0000, 1.0000] Negative eigenvalue → SADDLE POINT True minima are at (±1, 0) with f = −0.25 Newton converged to the origin — a saddle point, not a minimum. The gradient is exactly zero, so gradient-based stopping criteria cannot detect the problem. Only the Hessian eigenvalue check reveals it. In log-likelihood optimization, we maximize :math:`\ell(\theta)`, so the Hessian should be *negative* definite at a true maximum. A *positive* eigenvalue of the Hessian of :math:`\ell(\theta)` at a reported MLE means the algorithm found a saddle point, not a maximum. Connections: From Numerical Analysis to Computation ------------------------------------------------------ .. list-table:: :header-rows: 1 :widths: 20 30 50 * - Numerical Tool - Appendix C Section - Course Location * - Floating-point stability - Section 1 - :ref:`Ch 2.3 ` (``log1p``), :ref:`Ch 2.6 ` (``logsumexp``), :ref:`Ch 2.4 ` (``finfo``) * - Root finding - Section 2 - :ref:`Ch 2.3 ` (``brentq`` for inverse CDF), :ref:`Ch 3.2 ` (profile likelihood CI) * - Newton-Raphson for MLE - Section 3 - :ref:`Ch 3.2 ` (full treatment), :ref:`Ch 3.5 ` (IRLS) * - Gradient descent - Section 3 - :ref:`Ch 3.2 ` (mentioned), :ref:`Ch 5.5 ` (HMC leapfrog) * - Numerical differentiation - Section 4 - :ref:`Ch 3.3 ` (numerical Hessian), :ref:`Ch 3.1 ` (moment verification) * - Quadrature vs Monte Carlo - Section 5 - :ref:`Ch 2.1 ` (full comparison) * - Convergence orders - Section 6 - :ref:`Ch 3.2 ` (quadratic convergence), :ref:`Ch 3.5 ` (IRLS convergence) * - Condition numbers - Sections 3, 6 / :ref:`Appendix B ` - :ref:`Ch 3.2 ` (scaling/conditioning), :ref:`Ch 3.4 ` (QR vs normal equations) Practice Problems ------------------- .. admonition:: Exercise 1: Floating-Point Catastrophe and Stable Alternatives :class: exercise **(a)** The expression :math:`(1 - \cos x)/x^2 \to 1/2` as :math:`x \to 0` (by L'Hopital or Taylor). Compute this for :math:`x \in \{10^{-1}, 10^{-3}, 10^{-5}, 10^{-7}, 10^{-8}, 10^{-9}\}` using both the naive formula ``(1 - np.cos(x)) / x**2`` and the stable alternative ``2 * np.sin(x/2)**2 / x**2``. Report the relative error of each against the true value :math:`1/2`. At what magnitude of :math:`x` does the naive formula fail completely, and why? **(b)** The equation :math:`x^2 - (10^8 + 10^{-8})x + 1 = 0` has roots :math:`r_1 = 10^8` and :math:`r_2 = 10^{-8}`. Compute both roots using the standard quadratic formula. Which root is accurate? Compute :math:`r_2` using the identity :math:`r_2 = c/(a \cdot r_1)` and verify it matches the true value. **(c)** Generate ``data = 1e12 + rng.standard_normal(1000)`` with ``rng = np.random.default_rng(seed=42)``. Compute the sample variance using: (i) the naive two-pass formula :math:`\frac{1}{n-1}\sum(x_i - \bar{x})^2`, (ii) ``np.var(data, ddof=1)``. Are they different? Explain why adding a large constant does or does not cause problems with ``float64`` arithmetic. (For Welford's online algorithm, see :ref:`Ch2.1 `.) **(d)** Explain why ``np.log(np.sum(np.exp([1000, 1001, 1002])))`` produces ``inf``. Compute the correct answer using ``scipy.special.logsumexp``. .. dropdown:: Solution :class-container: solution .. code-block:: python import numpy as np from scipy.special import logsumexp rng = np.random.default_rng(seed=42) # (a) (1 - cos(x))/x^2 cancellation print("(a) (1 - cos(x))/x²: naive vs stable") print(f"{'x':>12} {'Naive':>20} {'Stable':>20} " f"{'Rel err (naive)':>16}") print("-" * 74) for x in [1e-1, 1e-3, 1e-5, 1e-7, 1e-8, 1e-9]: naive = (1 - np.cos(x)) / x**2 stable = 2 * np.sin(x/2)**2 / x**2 exact = 0.5 rel = abs(naive - exact) / exact print(f"{x:>12.0e} {naive:>20.16e} " f"{stable:>20.16e} {rel:>16.2e}") print(" Naive loses all digits at x ~ 1e-8") print(" (cos rounds to 1.0, so 1-cos = 0)") print(" Stable: 2*sin²(x/2)/x² = 0.5 throughout") print() # (b) Quadratic formula a_coef = 1.0 b_coef = -(1e8 + 1e-8) c_coef = 1.0 disc = b_coef**2 - 4*a_coef*c_coef r1_naive = (-b_coef + np.sqrt(disc)) / (2*a_coef) r2_naive = (-b_coef - np.sqrt(disc)) / (2*a_coef) r2_stable = c_coef / (a_coef * r1_naive) print("(b) Roots of x² − (10⁸+10⁻⁸)x + 1 = 0") print(f" r₁ (naive): {r1_naive:.15e}") print(f" r₁ (true): 1.000000000000000e+08") print(f" r₂ (naive): {r2_naive:.15e}") print(f" r₂ (stable): {r2_stable:.15e}") print(f" r₂ (true): 1.000000000000000e-08") print() # (c) Variance with large offset data = 1e12 + rng.standard_normal(1000) var_naive = np.sum((data - np.mean(data))**2) / 999 var_numpy = np.var(data, ddof=1) print("(c) Sample variance of 10¹² + N(0,1)") print(f" Naive two-pass: {var_naive:.10f}") print(f" np.var: {var_numpy:.10f}") print(f" Difference: {abs(var_naive - var_numpy):.2e}") print(f" Two-pass with float64 is fine here:") print(f" x̄ ≈ 10¹² is computed accurately, and") print(f" (xᵢ − x̄) ≈ N(0,1) recovers the signal.") print() # (d) logsumexp x = np.array([1000, 1001, 1002]) print("(d) log(sum(exp([1000, 1001, 1002])))") try: naive = np.log(np.sum(np.exp(x))) print(f" Naive: {naive}") except Exception: print(f" Naive: overflow") print(f" logsumexp: {logsumexp(x):.6f}") print(f" Expected: 1002 + log(e⁻² + e⁻¹ + 1) " f"= {1002 + np.log(np.exp(-2)+np.exp(-1)+1):.6f}") .. code-block:: text (a) (1 - cos(x))/x²: naive vs stable x Naive Stable Rel err (naive) -------------------------------------------------------------------------- 1e-01 4.9958347219741783e-01 4.9958347219742333e-01 8.33e-04 1e-03 4.9999995832550326e-01 4.9999995833333472e-01 8.33e-08 1e-05 5.0000004137018539e-01 4.9999999999583328e-01 8.27e-08 1e-07 4.9960036108132050e-01 4.9999999999999961e-01 7.99e-04 1e-08 0.0000000000000000e+00 5.0000000000000000e-01 1.00e+00 1e-09 0.0000000000000000e+00 5.0000000000000000e-01 1.00e+00 Naive loses all digits at x ~ 1e-8 (cos rounds to 1.0, so 1-cos = 0) Stable: 2*sin²(x/2)/x² = 0.5 throughout (b) Roots of x² − (10⁸+10⁻⁸)x + 1 = 0 r₁ (naive): 1.000000000000000e+08 r₁ (true): 1.000000000000000e+08 r₂ (naive): 1.490116119384766e-08 r₂ (stable): 1.000000000000000e-08 r₂ (true): 1.000000000000000e-08 (c) Sample variance of 10¹² + N(0,1) Naive two-pass: 0.9785502705 np.var: 0.9785502705 Difference: 0.00e+00 Two-pass with float64 is fine here: x̄ ≈ 10¹² is computed accurately, and (xᵢ − x̄) ≈ N(0,1) recovers the signal. (d) log(sum(exp([1000, 1001, 1002]))) Naive: inf logsumexp: 1002.407606 Expected: 1002 + log(e⁻² + e⁻¹ + 1) = 1002.407606 .. admonition:: Exercise 2: Root-Finding for Confidence Interval Endpoints :class: exercise **(a)** Generate :math:`n = 50` observations from :math:`\text{Exp}(\lambda = 2)` using ``rng = np.random.default_rng(seed=42)`` and ``data = rng.exponential(1/2, 50)``. Write the log-likelihood :math:`\ell(\lambda) = n\log\lambda - \lambda \sum x_i`. The 95% profile likelihood confidence interval endpoints solve :math:`2[\ell(\hat{\lambda}) - \ell(\lambda)] = \chi^2_{1, 0.05} = 3.841`. Formulate this as a root-finding problem: find :math:`\lambda` where :math:`h(\lambda) = 2[\ell(\hat{\lambda}) - \ell(\lambda)] - 3.841 = 0`. **(b)** Find both CI endpoints using: (i) manual bisection (implement yourself, count iterations to reach tolerance :math:`10^{-8}`), (ii) ``scipy.optimize.brentq``. Compare the number of iterations. **(c)** For the exponential model, the exact CI based on the sufficient statistic :math:`T = \sum X_i \sim \text{Gamma}(n, \lambda)` can be obtained via chi-squared quantiles. Compute this exact CI and compare to the profile likelihood CI. When is numerical root-finding the *only* option? .. dropdown:: Solution :class-container: solution .. code-block:: python import numpy as np from scipy.optimize import brentq from scipy import stats rng = np.random.default_rng(seed=42) n = 50 data = rng.exponential(1/2, n) sum_x = np.sum(data) lam_hat = n / sum_x def loglik(lam): return n * np.log(lam) - lam * sum_x ll_max = loglik(lam_hat) chi2_crit = stats.chi2.ppf(0.95, 1) def h(lam): return 2 * (ll_max - loglik(lam)) - chi2_crit print(f"(a) Exponential data: n={n}, Σxᵢ={sum_x:.4f}") print(f" MLE: λ̂ = n/Σxᵢ = {lam_hat:.6f}") print(f" χ²₁,₀.₀₅ = {chi2_crit:.3f}") print() # (b) Manual bisection for lower endpoint a, b = 0.5, lam_hat bisect_iters = 0 while (b - a) > 1e-8: mid = (a + b) / 2 if h(mid) * h(a) < 0: b = mid else: a = mid bisect_iters += 1 lam_lo_bisect = (a + b) / 2 # Upper endpoint a, b = lam_hat, 5.0 bisect_iters_up = 0 while (b - a) > 1e-8: mid = (a + b) / 2 if h(mid) * h(a) < 0: b = mid else: a = mid bisect_iters_up += 1 lam_hi_bisect = (a + b) / 2 # brentq lam_lo_brent = brentq(h, 0.5, lam_hat) lam_hi_brent = brentq(h, lam_hat, 5.0) print(f"(b) Profile likelihood 95% CI:") print(f" Bisection: [{lam_lo_bisect:.6f}, " f"{lam_hi_bisect:.6f}]") print(f" Iters: {bisect_iters} (lower), " f"{bisect_iters_up} (upper)") print(f" Brentq: [{lam_lo_brent:.6f}, " f"{lam_hi_brent:.6f}]") print() # (c) Exact CI via chi-squared # 2λΣXᵢ ~ χ²(2n), so CI for λ is # χ²(2n, α/2)/(2Σxᵢ) to χ²(2n, 1-α/2)/(2Σxᵢ) lo_exact = stats.chi2.ppf(0.025, 2*n) / (2*sum_x) hi_exact = stats.chi2.ppf(0.975, 2*n) / (2*sum_x) print(f"(c) Exact CI (chi-squared): " f"[{lo_exact:.6f}, {hi_exact:.6f}]") print(f" Profile LR CI: " f"[{lam_lo_brent:.6f}, {lam_hi_brent:.6f}]") print(f" Root-finding is the only option when") print(f" no exact pivotal quantity exists (e.g.,") print(f" Gamma shape, mixture models, GLMs).") .. code-block:: text (a) Exponential data: n=50, Σxᵢ=20.8974 MLE: λ̂ = n/Σxᵢ = 2.392642 χ²₁,₀.₀₅ = 3.841 (b) Profile likelihood 95% CI: Bisection: [1.789254, 3.118475] Iters: 28 (lower), 28 (upper) Brentq: [1.789254, 3.118475] (c) Exact CI (chi-squared): [1.775865, 3.099936] Profile LR CI: [1.789254, 3.118475] Root-finding is the only option when no exact pivotal quantity exists (e.g., Gamma shape, mixture models, GLMs). .. admonition:: Exercise 3: Gradient Descent vs Newton for Normal MLE :class: exercise **(a)** Generate :math:`n = 100` observations from :math:`N(\mu = 5, \sigma^2 = 9)` using ``rng = np.random.default_rng(seed=42)`` and ``data = rng.normal(5, 3, 100)``. Write the negative log-likelihood as a function of :math:`(\mu, \sigma^2)`. Compute the gradient and :math:`2 \times 2` Hessian analytically. **(b)** Implement gradient descent with step size :math:`\alpha = 0.001` starting from :math:`(\mu_0, \sigma^2_0) = (3, 3)`. Track the iterates and report how many iterations are needed to reach relative tolerance :math:`10^{-8}`. **(c)** Implement Newton's method from the same starting point. How many iterations? **(d)** Estimate the convergence order for both methods using the technique from Section 6 (log error ratios). Compare your results to ``scipy.optimize.minimize(method='Newton-CG')``. .. dropdown:: Solution :class-container: solution .. code-block:: python import numpy as np from scipy.optimize import minimize rng = np.random.default_rng(seed=42) n = 100 data = rng.normal(5, 3, n) xbar = np.mean(data) s2_mle = np.var(data) def neg_ll(params): mu, s2 = params if s2 <= 0: return 1e20 return (n/2*np.log(2*np.pi*s2) + np.sum((data - mu)**2) / (2*s2)) def grad_neg(params): mu, s2 = params g_mu = -np.sum(data - mu) / s2 g_s2 = (n/(2*s2) - np.sum((data - mu)**2) / (2*s2**2)) return np.array([g_mu, g_s2]) def hess_neg(params): mu, s2 = params h11 = n / s2 h12 = np.sum(data - mu) / s2**2 h22 = (-n/(2*s2**2) + np.sum((data - mu)**2) / s2**3) return np.array([[h11, h12], [h12, h22]]) theta_star = np.array([xbar, s2_mle]) print(f"(a) MLE: μ̂={xbar:.4f}, σ̂²={s2_mle:.4f}") print() # (b) Gradient descent theta = np.array([3.0, 3.0]) lr = 0.001 gd_errors = [] gd_iters = 0 for k in range(10000): err = np.linalg.norm(theta - theta_star) gd_errors.append(err) g = grad_neg(theta) theta_new = theta - lr * g theta_new[1] = max(theta_new[1], 0.01) rel_change = (np.linalg.norm(theta_new - theta) / np.linalg.norm(theta)) theta = theta_new if rel_change < 1e-8: gd_iters = k + 1 gd_errors.append( np.linalg.norm(theta - theta_star)) break else: gd_iters = 10000 print(f"(b) Gradient descent (α=0.001):") print(f" Converged in {gd_iters} iterations") print(f" Final: μ={theta[0]:.6f}, " f"σ²={theta[1]:.6f}") print() # (c) Newton theta = np.array([3.0, 3.0]) nw_errors = [] nw_iters = 0 for k in range(50): err = np.linalg.norm(theta - theta_star) nw_errors.append(err) g = grad_neg(theta) H = hess_neg(theta) step = np.linalg.solve(H, g) theta_new = theta - step theta_new[1] = max(theta_new[1], 0.01) rel_change = (np.linalg.norm(step) / np.linalg.norm(theta)) theta = theta_new if rel_change < 1e-12: nw_iters = k + 1 nw_errors.append( np.linalg.norm(theta - theta_star)) break else: nw_iters = 50 print(f"(c) Newton:") print(f" Converged in {nw_iters} iterations") print(f" Final: μ={theta[0]:.6f}, " f"σ²={theta[1]:.6f}") print() # (d) Convergence order print("(d) Convergence order estimation:") print(" Newton (last 4 iterations):") start_k = max(0, len(nw_errors) - 5) for k in range(start_k, len(nw_errors)-1): e0, e1 = nw_errors[k], nw_errors[k+1] if e0 > 1e-15 and e1 > 1e-15: p = np.log(e1) / np.log(e0) print(f" k={k}: |e|={e0:.2e}, " f"order ≈ {p:.2f}") print(" Gradient descent (sampled):") for k in [0, 100, 500, 1000]: if k+1 < len(gd_errors): e0, e1 = gd_errors[k], gd_errors[k+1] if e0 > 0 and e1 > 0: ratio = e1 / e0 print(f" k={k}: |e|={e0:.2e}, " f"ratio={ratio:.6f}") # scipy comparison res = minimize(neg_ll, x0=[3, 3], method='Newton-CG', jac=grad_neg, hess=hess_neg) print(f"\n scipy Newton-CG: {res.nit} iterations, " f"μ={res.x[0]:.6f}, σ²={res.x[1]:.6f}") .. code-block:: text (a) MLE: μ̂=4.8492, σ̂²=5.3748 (b) Gradient descent (α=0.001): Converged in 5838 iterations Final: μ=4.849191, σ²=5.374715 (c) Newton: Converged in 9 iterations Final: μ=4.849191, σ²=5.374757 (d) Convergence order estimation: Newton (last 4 iterations): k=5: |e|=3.58e-02, order ≈ 2.30 k=6: |e|=4.74e-04, order ≈ 2.13 k=7: |e|=8.34e-08, order ≈ 2.06 Gradient descent (sampled): k=0: |e|=3.01e+00, ratio=0.979026 k=100: |e|=1.42e+00, ratio=0.996645 k=500: |e|=5.23e-01, ratio=0.997876 k=1000: |e|=1.96e-01, ratio=0.998136 scipy Newton-CG: 7 iterations, μ=4.849191, σ²=5.374757 .. admonition:: Exercise 4: The Step Size Dilemma :class: exercise Consider :math:`f(x) = x e^x`, whose derivative :math:`f'(x) = (x+1)e^x` is known exactly. Evaluate :math:`f'(1) = 2e` numerically using step sizes :math:`h = 10^{-1}, 10^{-2}, \ldots, 10^{-16}`. **(a)** Use the **forward difference** :math:`[f(x+h) - f(x)]/h`. Plot the relative error versus :math:`h` (log-log). Find the optimal :math:`h` and compare it to the theoretical prediction :math:`h^* \approx \sqrt{\varepsilon_{\text{mach}}} \approx 1.5 \times 10^{-8}`. **(b)** Repeat with the **central difference** :math:`[f(x+h) - f(x-h)]/(2h)`. The optimal :math:`h` should be near :math:`\varepsilon_{\text{mach}}^{1/3} \approx 6 \times 10^{-6}`. Does it match? **(c)** Repeat with the **complex-step** method :math:`\text{Im}(f(x+ih))/h`. Does the error show a V-shaped curve or monotonic improvement? Why? .. dropdown:: Solution :class-container: solution .. code-block:: python import numpy as np x = 1.0 exact = 2 * np.exp(1) eps = np.finfo(float).eps h_vals = [10**(-k) for k in range(1, 17)] print(f"f(x) = x*exp(x), f'(1) = 2e = {exact:.10f}") print() # (a) Forward differences print("(a) Forward differences:") print(f"{'h':>12} {'approx':>18} " f"{'Rel error':>12}") print("-" * 48) fwd_errors = [] for h in h_vals: approx = ((x+h)*np.exp(x+h) - x*np.exp(x)) / h rel = abs(approx - exact) / exact fwd_errors.append(rel) print(f"{h:>12.0e} {approx:>18.10f} " f"{rel:>12.2e}") best = h_vals[np.argmin(fwd_errors)] print(f" Optimal h ~ {best:.0e} ~ sqrt(eps)") print(f" (sqrt(eps) = {np.sqrt(eps):.2e})") print() # (b) Central differences print("(b) Central differences:") print(f"{'h':>12} {'approx':>18} " f"{'Rel error':>12}") print("-" * 48) cen_errors = [] for h in h_vals: fp = (x+h)*np.exp(x+h) fm = (x-h)*np.exp(x-h) approx = (fp - fm) / (2*h) rel = abs(approx - exact) / exact cen_errors.append(rel) print(f"{h:>12.0e} {approx:>18.10f} " f"{rel:>12.2e}") best_c = h_vals[np.argmin(cen_errors)] print(f" Best on grid: h = {best_c:.0e}") print(f" Theory: h* ~ eps^(1/3) = " f"{eps**(1/3):.2e}") print() # (c) Complex-step print("(c) Complex-step derivative:") print(f"{'h':>12} {'approx':>18} " f"{'Rel error':>12}") print("-" * 48) for h in h_vals: z = complex(x, h) fz = z * np.exp(z) approx = fz.imag / h rel = abs(approx - exact) / exact print(f"{h:>12.0e} {approx:>18.10f} " f"{rel:>12.2e}") print(" No roundoff floor: error decreases") print(" monotonically as h -> 0.") .. code-block:: text f(x) = x*exp(x), f'(1) = 2e = 5.4365636569 (a) Forward differences: h approx Rel error ------------------------------------------------ 1e-01 5.8630079788 7.84e-02 1e-02 5.4775196708 7.53e-03 1e-03 5.4406428924 7.50e-04 1e-04 5.4369714173 7.50e-05 1e-05 5.4366044314 7.50e-06 1e-06 5.4365677338 7.50e-07 1e-07 5.4365640700 7.60e-08 1e-08 5.4365636437 2.43e-09 1e-09 5.4365640878 7.93e-08 1e-10 5.4365667523 5.69e-07 1e-11 5.4365845159 3.84e-06 1e-12 5.4374282854 1.59e-04 1e-13 5.4356519286 1.68e-04 1e-14 5.4178883602 3.44e-03 1e-15 6.2172489379 1.44e-01 1e-16 0.0000000000 1.00e+00 Optimal h ~ 1e-08 ~ sqrt(eps) (sqrt(eps) = 1.49e-08) (b) Central differences: h approx Rel error ------------------------------------------------ 1e-01 5.4546991315 3.34e-03 1e-02 5.4367448771 3.33e-05 1e-03 5.4365654691 3.33e-07 1e-04 5.4365636750 3.33e-09 1e-05 5.4365636571 3.79e-11 1e-06 5.4365636566 6.01e-11 1e-07 5.4365636570 2.16e-11 1e-08 5.4365636437 2.43e-09 1e-09 5.4365636437 2.43e-09 1e-10 5.4365645319 1.61e-07 1e-11 5.4365623114 2.47e-07 1e-12 5.4369841962 7.74e-05 1e-13 5.4356519286 1.68e-04 1e-14 5.4178883602 3.44e-03 1e-15 5.7731597281 6.19e-02 1e-16 2.2204460493 5.92e-01 Best on grid: h = 1e-07 Theory: h* ~ eps^(1/3) = 6.06e-06 (c) Complex-step derivative: h approx Rel error ------------------------------------------------ 1e-01 5.4184553652 3.33e-03 1e-02 5.4363824395 3.33e-05 1e-03 5.4365618447 3.33e-07 1e-04 5.4365636388 3.33e-09 1e-05 5.4365636567 3.33e-11 1e-06 5.4365636569 3.33e-13 1e-07 5.4365636569 3.43e-15 1e-08 5.4365636569 0.00e+00 1e-09 5.4365636569 0.00e+00 1e-10 5.4365636569 1.63e-16 1e-11 5.4365636569 0.00e+00 1e-12 5.4365636569 0.00e+00 1e-13 5.4365636569 0.00e+00 1e-14 5.4365636569 0.00e+00 1e-15 5.4365636569 0.00e+00 1e-16 5.4365636569 0.00e+00 No roundoff floor: error decreases monotonically as h -> 0. Forward and central differences both show V-shaped error curves: as :math:`h` decreases, truncation error shrinks but roundoff error grows. The optimal :math:`h` balances the two. The complex-step method avoids cancellation entirely because it never subtracts nearly equal real numbers — the derivative information is isolated in the imaginary part. .. admonition:: Exercise 5: Diagnosing Optimization Results :class: exercise Consider :math:`f(x, y) = \frac{x^4}{4} - \frac{x^2}{2} + \frac{y^4}{4} - \frac{y^2}{2}`, which has 9 critical points. **(a)** Find all critical points by solving :math:`\nabla f = 0`. Classify each as a minimum, maximum, or saddle point using the Hessian eigenvalues. **(b)** Run Newton's method from three starting points: :math:`(1.5, 1.5)`, :math:`(0.6, 0.1)`, and :math:`(0.1, 0.3)`. For each, report the convergence point, function value, and Hessian eigenvalues. Does Newton always find a minimum? **(c)** What stopping criterion would detect that Newton converged to a non-minimum? Explain why gradient-based stopping criteria alone are insufficient. .. dropdown:: Solution :class-container: solution .. code-block:: python import numpy as np def f(x, y): return x**4/4 - x**2/2 + y**4/4 - y**2/2 def grad_f(x, y): return np.array([x**3 - x, y**3 - y]) def hess_f(x, y): return np.array([[3*x**2 - 1, 0], [0, 3*y**2 - 1]]) # (a) Critical points print("(a) Critical points (grad f = 0):") print(" x^3 - x = 0 => x in {-1, 0, 1}") print(" y^3 - y = 0 => y in {-1, 0, 1}") print(" 9 critical points:") print() for xc in [-1, 0, 1]: for yc in [-1, 0, 1]: H = hess_f(xc, yc) eigs = np.linalg.eigvalsh(H) if np.all(eigs > 0): ctype = "minimum" elif np.all(eigs < 0): ctype = "maximum" else: ctype = "saddle" print(f" ({xc:>2},{yc:>2}): " f"eigs=[{eigs[0]:.0f},{eigs[1]:.0f}]," f" f={f(xc,yc):.2f} -> {ctype}") print() # (b) Newton from different starts starts = [(1.5, 1.5), (0.6, 0.1), (0.1, 0.3)] for x0, y0 in starts: x, y = x0, y0 print(f"(b) Newton from ({x0}, {y0}):") print(f"{'k':>5} {'x':>12} {'y':>12} " f"{'||grad||':>12}") print("-" * 48) for k in range(8): g = grad_f(x, y) gnorm = np.linalg.norm(g) print(f"{k:>5} {x:>12.8f} " f"{y:>12.8f} {gnorm:>12.2e}") if gnorm < 1e-14: break H = hess_f(x, y) step = np.linalg.solve(H, g) x -= step[0]; y -= step[1] H = hess_f(x, y) eigs = np.linalg.eigvalsh(H) print(f" Converged to ({x:.4f}, {y:.4f})") print(f" f = {f(x,y):.4f}") print(f" Hessian eigs: " f"[{eigs[0]:.2f}, {eigs[1]:.2f}]") if np.all(eigs > 0): print(" All positive -> minimum") elif np.all(eigs < 0): print(" All negative -> maximum!") else: print(" Mixed signs -> saddle point!") print() .. code-block:: text (a) Critical points (grad f = 0): x^3 - x = 0 => x in {-1, 0, 1} y^3 - y = 0 => y in {-1, 0, 1} 9 critical points: (-1,-1): eigs=[2,2], f=-0.50 -> minimum (-1, 0): eigs=[-1,2], f=-0.25 -> saddle (-1, 1): eigs=[2,2], f=-0.50 -> minimum ( 0,-1): eigs=[-1,2], f=-0.25 -> saddle ( 0, 0): eigs=[-1,-1], f=0.00 -> maximum ( 0, 1): eigs=[-1,2], f=-0.25 -> saddle ( 1,-1): eigs=[2,2], f=-0.50 -> minimum ( 1, 0): eigs=[-1,2], f=-0.25 -> saddle ( 1, 1): eigs=[2,2], f=-0.50 -> minimum (b) Newton from (1.5, 1.5): k x y ||grad|| ------------------------------------------------ 0 1.50000000 1.50000000 2.65e+00 1 1.17391304 1.17391304 6.28e-01 2 1.03230713 1.03230713 9.59e-02 3 1.00145595 1.00145595 4.13e-03 4 1.00000317 1.00000317 8.96e-06 5 1.00000000 1.00000000 4.26e-11 6 1.00000000 1.00000000 0.00e+00 Converged to (1.0000, 1.0000) f = -0.5000 Hessian eigs: [2.00, 2.00] All positive -> minimum (b) Newton from (0.6, 0.1): k x y ||grad|| ------------------------------------------------ 0 0.60000000 0.10000000 3.97e-01 1 5.40000000 -0.00206186 1.52e+02 2 3.64162812 0.00000002 4.47e+01 3 2.49034823 -0.00000000 1.30e+01 4 1.75453404 0.00000000 3.65e+00 5 1.31172523 0.00000000 9.45e-01 6 1.08460145 0.00000000 1.91e-01 7 1.00896898 0.00000000 1.82e-02 Converged to (1.0001, 0.0000) f = -0.2500 Hessian eigs: [-1.00, 2.00] Mixed signs -> saddle point! (b) Newton from (0.1, 0.3): k x y ||grad|| ------------------------------------------------ 0 0.10000000 0.30000000 2.90e-01 1 -0.00206186 -0.07397260 7.36e-02 2 0.00000002 0.00082306 8.23e-04 3 -0.00000000 -0.00000000 1.12e-09 4 0.00000000 0.00000000 0.00e+00 Converged to (0.0000, 0.0000) f = 0.0000 Hessian eigs: [-1.00, -1.00] All negative -> maximum! Newton converges to whichever critical point is nearest in its basin — it does not distinguish minima from saddles or maxima because all have :math:`\nabla f = 0`. The gradient norm stopping criterion passes at any critical point. Only checking the Hessian eigenvalues at the converged point reveals whether the result is actually a minimum (all positive), a saddle (mixed signs), or a maximum (all negative).