Appendix A — Calculus Review¶

STAT 418 · Computational Data Science · runnable companion to the Calculus Review appendix.

Calculus in this course is not an end in itself — it is the engine behind three statistical workflows. Optimization uses derivatives to find maximum-likelihood estimates (set the score to zero). Uncertainty quantification uses the curvature of the log-likelihood (second derivatives) to measure how precisely the data pin down a parameter. Transformation uses change-of-variables / Jacobian formulas to derive the distributions of functions of random variables. This notebook develops each tool and verifies it computationally — analytic results checked against finite differences, symbolic algebra (SymPy), and Monte-Carlo simulation, side by side.

Learning outcomes¶

By the end you can:

  1. Differentiate log-likelihoods (logarithmic differentiation, chain rule) to build score functions, and confirm a critical point is a maximum with the second derivative / observed information.
  2. Read every integral as a probability, expectation, or normalizing constant, and evaluate it three ways (analytic / scipy.integrate.quad / Monte Carlo).
  3. Use Taylor series as the bridge from exact to tractable: first order $\Rightarrow$ the delta method, second order $\Rightarrow$ the quadratic log-likelihood that produces Wald intervals and asymptotic normality.
  4. Apply the Leibniz rule to prove $\mathbb{E}[\,U(\theta)\,]=0$ and the Fisher information equality $\operatorname{Var}(U) = -\mathbb{E}[\ell'']$.
  5. Compute gradients and Hessians (analytic, symbolic, and numeric), use the multivariate chain rule to assemble a GLM score vector, and use convexity to guarantee a unique optimum.
  6. Apply the Jacobian determinant for multivariate change of variables (polar coordinates, $X/(X+Y)\sim\text{Beta}$, Box-Muller).

Section map (all core — calculus the rest of the course assumes)¶

§ Topic Computational illustration
A.1 Univariate differentiation Poisson score: analytic vs finite-difference vs SymPy; second-derivative test
A.2 Integration $\mathbb{E}[X^2]$ for Gamma 3 ways; IBP $\mathbb{E}[X]=\alpha/\beta$; inverse-CDF; Cauchy has no mean
A.3 Taylor series $\log(1+x)$ orders 1–4; delta method (variance-stabilizing $\sqrt{\bar X}$); quadratic log-lik
A.4 Leibniz rule $\mathbb{E}[U(\theta)]=0$ (Normal); Fisher information equality (Exponential)
A.5 Gradients & Hessians Normal score vector & Hessian (SymPy + numeric); GLM chain-rule score
A.6 Convexity concave Normal log-lik (unique MLE) vs bimodal mixture (local traps)
A.7 Multivariate Taylor quadratic approximation of $\ell(\mu,\sigma^2)$; the Wald statistic
A.8 Multiple integrals / Jacobian polar coordinates $\det J = r$; $X/(X+Y)\sim\text{Beta}$

All simulations use np.random.default_rng(42) (the appendix's seed) so the printed numbers reproduce the RST exactly.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sympy as sp
from scipy import stats
from scipy.optimize import approx_fprime, minimize_scalar
from scipy.integrate import quad
from scipy.special import gammaln, digamma, polygamma
%matplotlib inline

np.random.seed(418)            # global reproducibility (STYLE.md contract)
plt.rcParams.update({
    'figure.figsize': (7.2, 4.2), 'figure.dpi': 110,
    'axes.grid': True, 'grid.alpha': 0.25,
    'axes.spines.top': False, 'axes.spines.right': False,
    'font.size': 11,
})
BLUE, RED, GREEN, ORANGE, PURPLE = '#4C72B0', '#C44E52', '#55A868', '#DD8452', '#8172B3'

# This appendix has NO external data: every result is a reproducible simulation or a
# symbolic identity. Each worked example re-seeds its own default_rng(42) -- the seed
# the appendix documents -- so the printed numbers match the RST to the last digit.
print('environment ready · numpy', np.__version__, '· sympy', sp.__version__)
environment ready · numpy 1.26.4 · sympy 1.13.2

Section A.1 — Univariate Differentiation¶

The derivative $f'(x)=\lim_{h\to 0}\frac{f(x+h)-f(x)}{h}$ is the instantaneous rate of change. The rule that dominates this course is the chain rule $\frac{d}{dx}f(g(x))=f'(g(x))\,g'(x)$, because nearly every likelihood is a composition: data enter through a statistic, which enters the log-likelihood through a nonlinear function of the parameter.

Logarithmic differentiation and the score function¶

For $f(x)>0$, $\;\frac{d}{dx}\log f(x)=\frac{f'(x)}{f(x)}$. We always work in log space because (1) products of independent densities become sums, (2) exponential families linearize, and (3) the derivative of the log-likelihood — the score $U(\theta)=\ell'(\theta)$ — has mean zero at the truth (proved in A.4).

Poisson example. For $X_1,\dots,X_n\sim\text{Poisson}(\lambda)$, $\ell(\lambda)=(\sum x_i)\log\lambda - n\lambda - \sum\log x_i!$, so $U(\lambda)=\frac{\sum x_i}{\lambda}-n$, and $U(\hat\lambda)=0\Rightarrow\hat\lambda=\bar X$. We derive that score symbolically with SymPy, then check the analytic formula against finite differences at several $\lambda$.

In [2]:
# SymPy: derive the per-observation Poisson score from log f, then sum.
lam, xsym, nsym, xbar_s = sp.symbols('lambda x n xbar', positive=True)
logf = xsym*sp.log(lam) - lam - sp.loggamma(xsym + 1)   # log Poisson pmf
score_obs = sp.simplify(sp.diff(logf, lam))             # per-observation score
score_sample = sp.simplify(nsym*(xbar_s/lam - 1))       # sum over n iid obs
print('SymPy per-observation score  d/dlambda log f =', score_obs, '  = x/lambda - 1')
print('Sample score U(lambda) = sum_i (x_i/lambda - 1) = n*(xbar/lambda - 1) =', score_sample)
print('Solve U=0  ->  lambda_hat =', sp.solve(sp.Eq(score_sample, 0), lam)[0], '= xbar  (the MLE)')
SymPy per-observation score  d/dlambda log f = (-lambda + x)/lambda   = x/lambda - 1
Sample score U(lambda) = sum_i (x_i/lambda - 1) = n*(xbar/lambda - 1) = n*(-lambda + xbar)/lambda
Solve U=0  ->  lambda_hat = xbar = xbar  (the MLE)
In [3]:
# Numeric check: analytic score vs finite-difference derivative of the log-likelihood.
rng = np.random.default_rng(42)
n = 50
data = rng.poisson(3.5, n)
sum_x = int(np.sum(data)); xbar = np.mean(data)

def poisson_loglik(lam_arr):
    lam = lam_arr[0]
    if lam <= 0:
        return -1e20
    return float(np.sum(data * np.log(lam) - lam))      # drop constant log(x!) term
def score_analytical(lam):
    return sum_x / lam - n

print(f'Poisson score verification (n={n}, sum_x={sum_x}, xbar={xbar:.2f})')
print(f"{'lambda':>8}  {'analytic':>12}  {'finite-diff':>12}  {'|diff|':>10}")
print('-' * 48)
for lam_v in [2.0, 3.0, xbar, 4.0, 5.0]:
    s_an = score_analytical(lam_v)
    s_fd = approx_fprime(np.array([lam_v]), poisson_loglik, 1e-8)[0]
    print(f'{lam_v:>8.2f}  {s_an:>12.6f}  {s_fd:>12.6f}  {abs(s_an - s_fd):>10.2e}')
print(f'\nMLE lambda_hat = xbar = {xbar:.2f};  score at MLE U({xbar:.2f}) = {score_analytical(xbar):.6f}')
Poisson score verification (n=50, sum_x=171, xbar=3.42)
  lambda      analytic   finite-diff      |diff|
------------------------------------------------
    2.00     35.500000     35.500000    3.11e-07
    3.00      7.000000      6.999998    2.22e-06
    3.42      0.000000     -0.000004    3.55e-06
    4.00     -7.250000     -7.249999    6.00e-07
    5.00    -15.800000    -15.799997    2.86e-06

MLE lambda_hat = xbar = 3.42;  score at MLE U(3.42) = 0.000000

Higher derivatives and the second-derivative test¶

At a critical point, $f''<0$ certifies a maximum. For the Poisson, $\ell''(\lambda)=-\sum x_i/\lambda^2<0$ everywhere, so the log-likelihood is strictly concave — the MLE is the unique global maximum. The quantity $J(\hat\theta)=-\ell''(\hat\theta)$ is the observed Fisher information: the curvature of the peak, with $\widehat{\mathrm{SE}}(\hat\theta)\approx 1/\sqrt{J}$. A sharp peak (large $J$) means high precision; a flat peak (small $J$) means low precision.

In [4]:
# Tabulate the log-likelihood, score, and curvature around the MLE.
print(f'Poisson log-likelihood analysis (n={n}, sum_x={sum_x}, xbar={xbar:.2f})')
print(f"{'lambda':>7}  {'loglik l':>11}  {'score l1':>11}  {'curv l2':>11}")
print('-' * 46)
for lam_v in [2.0, 2.5, 3.0, xbar, 3.5, 4.0, 4.5]:
    ll = sum_x * np.log(lam_v) - n * lam_v
    sc = sum_x / lam_v - n
    d2 = -sum_x / lam_v**2
    print(f'{lam_v:>7.2f}  {ll:>11.4f}  {sc:>11.4f}  {d2:>11.4f}')
J_obs = sum_x / xbar**2
print(f'\nAt MLE lambda_hat={xbar:.2f}:  score=0 (check {sum_x/xbar - n:.6f}),  '
      f"l''={-J_obs:.4f} < 0 (maximum)")
print(f'  observed information J = {J_obs:.4f}  ->  SE ~ 1/sqrt(J) = {1/np.sqrt(J_obs):.4f}')
Poisson log-likelihood analysis (n=50, sum_x=171, xbar=3.42)
 lambda     loglik l     score l1      curv l2
----------------------------------------------
   2.00      18.5282      35.5000     -42.7500
   2.50      31.6857      18.4000     -27.3600
   3.00      37.8627       7.0000     -19.0000
   3.42      39.2685       0.0000     -14.6199
   3.50      39.2225      -1.1429     -13.9592
   4.00      37.0563      -7.2500     -10.6875
   4.50      32.1972     -12.0000      -8.4444

At MLE lambda_hat=3.42:  score=0 (check 0.000000),  l''=-14.6199 < 0 (maximum)
  observed information J = 14.6199  ->  SE ~ 1/sqrt(J) = 0.2615
In [5]:
# Figure: the Poisson log-likelihood is concave (left); its score crosses zero at the
# MLE (right) -- 'set the derivative to zero' made visual.
grid = np.linspace(2.0, 5.0, 400)
ll = sum_x * np.log(grid) - n * grid
sc = sum_x / grid - n
fig, (a1, a2) = plt.subplots(1, 2, figsize=(10.5, 4.0))
a1.plot(grid, ll, color=BLUE, lw=2.2)
a1.axvline(xbar, color=RED, ls='--', lw=1.5, label=fr'MLE $\hat\lambda=\bar X={xbar:.2f}$')
a1.plot([xbar], [sum_x*np.log(xbar) - n*xbar], 'o', color=RED, ms=8)
a1.set_xlabel(r'$\lambda$'); a1.set_ylabel(r'$\ell(\lambda)$')
a1.set_title('Poisson log-likelihood (strictly concave)'); a1.legend()
a2.plot(grid, sc, color=GREEN, lw=2.2, label=r"score $U(\lambda)=\sum x_i/\lambda - n$")
a2.axhline(0, color='k', lw=1)
a2.axvline(xbar, color=RED, ls='--', lw=1.5)
a2.plot([xbar], [0], 'o', color=RED, ms=8, label=fr'$U(\hat\lambda)=0$')
a2.set_xlabel(r'$\lambda$'); a2.set_ylabel(r'$U(\lambda)$')
a2.set_title('Score crosses zero at the MLE'); a2.legend()
plt.tight_layout(); plt.show()

Read-out. The analytic and finite-difference scores agree to $\sim 10^{-6}$ at every $\lambda$, and the score is exactly zero at $\hat\lambda=\bar X=3.42$. SymPy independently returns the same per-observation score $x/\lambda-1$. The second derivative is negative everywhere, so the MLE is the unique maximum; the observed information $J(\hat\lambda)=14.62$ gives $\widehat{\mathrm{SE}}\approx 0.26$. This machinery — score, second derivative, observed information — is exactly what drives MLE, GLMs, and Newton-Raphson later.

Section A.2 — Integration¶

Every integral in this course is a probability, an expectation, or a normalizing constant: $\mathbb{E}[g(X)]=\int g(x)f(x)\,dx$. Reading integrals as expectations unlocks a verification strategy — if you can sample from $f$, you can approximate any integral by averaging $g(X_i)$ (the Monte-Carlo method of Chapter 2). We verify $\mathbb{E}[X^2]$ for a $\text{Gamma}(\alpha=3,\beta=2)$ three ways: closed form, scipy.integrate.quad, and Monte Carlo.

In [6]:
rng = np.random.default_rng(42)
alpha, beta = 3.0, 2.0
# Analytical: E[X^2] = Var(X) + (E[X])^2 = alpha/beta^2 + (alpha/beta)^2
analytical = alpha / beta**2 + (alpha / beta)**2
integrand = lambda x: x**2 * stats.gamma.pdf(x, a=alpha, scale=1/beta)
quad_result, _ = quad(integrand, 0, np.inf)
samples = rng.gamma(alpha, 1/beta, 100000)
mc_result = np.mean(samples**2)
print('Verification: E[X^2] for Gamma(alpha=3, beta=2)')
print(f"{'method':>22}  {'E[X^2]':>10}  {'error':>10}")
print('-' * 46)
print(f"{'analytical':>22}  {analytical:>10.6f}  {'--':>10}")
print(f"{'scipy.integrate.quad':>22}  {quad_result:>10.6f}  {abs(quad_result-analytical):>10.2e}")
print(f"{'Monte Carlo (n=1e5)':>22}  {mc_result:>10.6f}  {abs(mc_result-analytical):>10.2e}")
Verification: E[X^2] for Gamma(alpha=3, beta=2)
                method      E[X^2]       error
----------------------------------------------
            analytical    3.000000          --
  scipy.integrate.quad    3.000000    0.00e+00
   Monte Carlo (n=1e5)    3.010147    1.01e-02

Integration by parts: $\mathbb{E}[X]=\alpha/\beta$ for the Gamma¶

Integration by parts $\int u\,dv = uv-\int v\,du$ trades one integral for a simpler one. For the Gamma mean, take $u=x^\alpha,\ dv=e^{-\beta x}dx$: the boundary term vanishes and the recursion gives $\mathbb{E}[X]=\alpha/\beta$. We confirm the symbolic integral with SymPy and the numeric value by simulation.

In [7]:
# SymPy evaluates the (improper) integral symbolically -> alpha/beta.
a_s, b_s, x_s = sp.symbols('alpha beta x', positive=True)
pdf = b_s**a_s / sp.gamma(a_s) * x_s**(a_s - 1) * sp.exp(-b_s * x_s)
EX_sym = sp.simplify(sp.integrate(x_s * pdf, (x_s, 0, sp.oo)))
EX2_sym = sp.simplify(sp.integrate(x_s**2 * pdf, (x_s, 0, sp.oo)))
print('SymPy integration by parts (closed form):')
print('  E[X]   = int_0^inf x f(x) dx   =', EX_sym, '  (= alpha/beta)')
print('  E[X^2] = int_0^inf x^2 f(x) dx =', EX2_sym, '  (= alpha(alpha+1)/beta^2)')

rng = np.random.default_rng(42)
samples = rng.gamma(3.0, 1/2.0, 100000)
print(f'\nSimulation check  E[X] for Gamma(3,2):')
print(f'  theoretical alpha/beta = {3.0/2.0:.6f}')
print(f'  simulated mean         = {np.mean(samples):.6f}')
print(f'  difference             = {abs(np.mean(samples) - 3.0/2.0):.6f}')
SymPy integration by parts (closed form):
  E[X]   = int_0^inf x f(x) dx   = alpha/beta   (= alpha/beta)
  E[X^2] = int_0^inf x^2 f(x) dx = alpha*(alpha + 1)/beta**2   (= alpha(alpha+1)/beta^2)

Simulation check  E[X] for Gamma(3,2):
  theoretical alpha/beta = 1.500000
  simulated mean         = 1.500859
  difference             = 0.000859

Change of variables (univariate): the inverse-CDF method¶

If $Y=g(X)$ is monotone, $f_Y(y)=f_X(g^{-1}(y))\,\lvert\frac{d}{dy}g^{-1}(y)\rvert$. Densities are probability per unit length, so the Jacobian $\lvert dx/dy\rvert$ is the stretching factor that conserves total probability. Inverse-CDF sampling for the exponential: $U\sim\text{Uniform}(0,1)\Rightarrow X=-\log(U)/\lambda\sim\text{Exp}(\lambda)$, since the Jacobian turns the flat uniform density into $\lambda e^{-\lambda x}$.

In [8]:
# Inverse-CDF transform U -> X = -log(U)/lambda, then compare to the Exp(lambda) pdf.
rng = np.random.default_rng(42)
lam_e = 1.5
U = rng.uniform(0, 1, 100000)
X = -np.log(U) / lam_e                       # inverse-CDF construction
print(f'Inverse-CDF Exp(lambda={lam_e}):  X = -log(U)/lambda')
print(f'  sample mean = {X.mean():.4f}   theory 1/lambda = {1/lam_e:.4f}')
print(f'  sample var  = {X.var():.4f}   theory 1/lambda^2 = {1/lam_e**2:.4f}')
ks = stats.kstest(X, 'expon', args=(0, 1/lam_e))
print(f'  KS test vs Exp: stat = {ks.statistic:.4f}, p = {ks.pvalue:.3f}  (cannot reject)')

fig, ax = plt.subplots(figsize=(7.2, 3.8))
ax.hist(X, bins=60, range=(0, 5), density=True, color=BLUE, alpha=0.55,
        edgecolor='white', label='transformed $-\\log(U)/\\lambda$')
xs = np.linspace(0, 5, 300)
ax.plot(xs, lam_e * np.exp(-lam_e * xs), color=RED, lw=2.4,
        label=r'$\lambda e^{-\lambda x}$ density')
ax.set_xlabel('x'); ax.set_ylabel('density')
ax.set_title('Change of variables: the Jacobian reshapes Uniform into Exponential')
ax.legend(); plt.tight_layout(); plt.show()
Inverse-CDF Exp(lambda=1.5):  X = -log(U)/lambda
  sample mean = 0.6662   theory 1/lambda = 0.6667
  sample var  = 0.4480   theory 1/lambda^2 = 0.4444
  KS test vs Exp: stat = 0.0030, p = 0.331  (cannot reject)

Improper integrals and convergence: the Cauchy has no mean¶

A normalizing constant requires a convergent integral. Not all moments exist: the Cauchy density $1/[\pi(1+x^2)]$ has $\int |x|/(1+x^2)\,dx=\infty$, so $\mathbb{E}[X]$ does not exist. The running sample mean of Cauchy draws never settles — a vivid reminder that Monte-Carlo estimation needs finite variance.

In [9]:
# Running means: Normal converges to its mean; Cauchy wanders forever (no mean).
rng = np.random.default_rng(42)
M = 20000
normal_draws = rng.normal(0, 1, M)
cauchy_draws = stats.cauchy.rvs(size=M, random_state=rng)
run_norm = np.cumsum(normal_draws) / np.arange(1, M + 1)
run_cau = np.cumsum(cauchy_draws) / np.arange(1, M + 1)
print('Running-mean behaviour over', M, 'draws:')
print(f'  Normal(0,1):  running mean at n=M is {run_norm[-1]:+.4f}  '
      f'(range last 5000: [{run_norm[-5000:].min():+.3f}, {run_norm[-5000:].max():+.3f}])')
print(f'  Cauchy(0,1):  running mean at n=M is {run_cau[-1]:+.4f}  '
      f'(range last 5000: [{run_cau[-5000:].min():+.3f}, {run_cau[-5000:].max():+.3f}])')
print('  The Cauchy running mean does NOT converge: E[X] is an undefined (divergent) integral.')

fig, ax = plt.subplots(figsize=(7.6, 3.8))
k = np.arange(1, M + 1)
ax.plot(k, run_norm, color=BLUE, lw=1.3, label='Normal(0,1) running mean')
ax.plot(k, run_cau, color=RED, lw=1.0, alpha=0.85, label='Cauchy(0,1) running mean')
ax.axhline(0, color='k', lw=0.8, ls=':')
ax.set_xscale('log'); ax.set_ylim(-6, 6)
ax.set_xlabel('number of draws (log scale)'); ax.set_ylabel('running mean')
ax.set_title('Convergent vs divergent: a mean that exists, and one that does not')
ax.legend(); plt.tight_layout(); plt.show()
Running-mean behaviour over 20000 draws:
  Normal(0,1):  running mean at n=M is +0.0051  (range last 5000: [-0.010, +0.005])
  Cauchy(0,1):  running mean at n=M is -0.0422  (range last 5000: [-0.344, -0.017])
  The Cauchy running mean does NOT converge: E[X] is an undefined (divergent) integral.

Read-out. All three routes give $\mathbb{E}[X^2]=3$ for the Gamma (Monte Carlo within $0.3\%$), and SymPy confirms the by-parts closed forms $\mathbb{E}[X]=\alpha/\beta$, $\mathbb{E}[X^2]=\alpha(\alpha+1)/\beta^2$. The inverse-CDF transform turns uniform draws into a clean Exponential (KS $p\gg0.05$) — the univariate Jacobian at work. And the Cauchy's divergent mean integral shows why "every integral is finite" is an assumption, not a given.

Section A.3 — Taylor Series and Approximation¶

Taylor's theorem says any smooth function is locally polynomial: $f(x)=\sum_{k=0}^{n}\frac{f^{(k)}(a)}{k!}(x-a)^k+R_n(x)$, with $R_n=O\!\big((x-a)^{n+1}\big)$. The first-order polynomial is the tangent line; the second is the best-fit parabola. We expand $\log(1+x)$ about $0$ — SymPy gives the coefficients $x-\tfrac{x^2}{2}+\tfrac{x^3}{3}-\cdots$ — and watch the error shrink near $x=0$ but blow up past the radius of convergence $|x|=1$.

In [10]:
# SymPy supplies the Taylor coefficients; we evaluate orders 1-4 numerically.
xs = sp.symbols('x')
print('SymPy series of log(1+x) about 0:', sp.series(sp.log(1 + xs), xs, 0, 5))

def taylor_log1px(x, order):
    return sum(((-1) ** (k + 1)) * x**k / k for k in range(1, order + 1))

print('\nTaylor approximations of log(1+x):')
print(f"{'x':>5}  {'exact':>9}  {'T1':>9}  {'T2':>9}  {'T3':>9}  {'T4':>9}")
print('-' * 58)
for x in [0.1, 0.3, 0.5, 0.8, 1.0, 1.5]:
    ex = np.log(1 + x); v = [taylor_log1px(x, k) for k in (1, 2, 3, 4)]
    print(f'{x:>5.1f}  {ex:>9.6f}  {v[0]:>9.6f}  {v[1]:>9.6f}  {v[2]:>9.6f}  {v[3]:>9.6f}')
print('\nabsolute errors |log(1+x) - Tn(x)|:')
print(f"{'x':>5}  {'|E1|':>9}  {'|E2|':>9}  {'|E3|':>9}  {'|E4|':>9}")
print('-' * 46)
for x in [0.1, 0.3, 0.5, 0.8, 1.0, 1.5]:
    ex = np.log(1 + x); e = [abs(ex - taylor_log1px(x, k)) for k in (1, 2, 3, 4)]
    print(f'{x:>5.1f}  {e[0]:>9.2e}  {e[1]:>9.2e}  {e[2]:>9.2e}  {e[3]:>9.2e}')
SymPy series of log(1+x) about 0: x - x**2/2 + x**3/3 - x**4/4 + O(x**5)

Taylor approximations of log(1+x):
    x      exact         T1         T2         T3         T4
----------------------------------------------------------
  0.1   0.095310   0.100000   0.095000   0.095333   0.095308
  0.3   0.262364   0.300000   0.255000   0.264000   0.261975
  0.5   0.405465   0.500000   0.375000   0.416667   0.401042
  0.8   0.587787   0.800000   0.480000   0.650667   0.548267
  1.0   0.693147   1.000000   0.500000   0.833333   0.583333
  1.5   0.916291   1.500000   0.375000   1.500000   0.234375

absolute errors |log(1+x) - Tn(x)|:
    x       |E1|       |E2|       |E3|       |E4|
----------------------------------------------
  0.1   4.69e-03   3.10e-04   2.32e-05   1.85e-06
  0.3   3.76e-02   7.36e-03   1.64e-03   3.89e-04
  0.5   9.45e-02   3.05e-02   1.12e-02   4.42e-03
  0.8   2.12e-01   1.08e-01   6.29e-02   3.95e-02
  1.0   3.07e-01   1.93e-01   1.40e-01   1.10e-01
  1.5   5.84e-01   5.41e-01   5.84e-01   6.82e-01
In [11]:
# Figure A.1: (a) the polynomials vs the exact curve; (b) error growth (log scale).
xx = np.linspace(-0.9, 1.8, 400)
fig, (a1, a2) = plt.subplots(1, 2, figsize=(11, 4.2))
a1.plot(xx, np.log(1 + xx), color='k', lw=2.6, label=r'$\log(1+x)$ (exact)')
for order, col in zip((1, 2, 3, 4), (RED, ORANGE, GREEN, BLUE)):
    a1.plot(xx, taylor_log1px(xx, order), lw=1.6, color=col, label=fr'$T_{order}$')
a1.axvline(1.0, color='gray', ls=':', lw=1.2, label=r'radius $|x|=1$')
a1.set_ylim(-3, 1.6); a1.set_xlabel('x'); a1.set_ylabel('value')
a1.set_title('Taylor polynomials of $\\log(1+x)$ about 0'); a1.legend(fontsize=8, loc='lower right')
xe = np.linspace(0.02, 0.98, 200)
for order, col in zip((1, 2, 3, 4), (RED, ORANGE, GREEN, BLUE)):
    a2.semilogy(xe, np.abs(np.log(1 + xe) - taylor_log1px(xe, order)), color=col,
                lw=1.8, label=fr'$|E_{order}|$')
a2.set_xlabel('x'); a2.set_ylabel('absolute error (log scale)')
a2.set_title(r'Error $\sim |x-a|^{\,n+1}$ (Lagrange remainder)'); a2.legend(fontsize=9)
plt.tight_layout(); plt.show()

First-order Taylor = the delta method¶

$g(\hat\theta)\approx g(\theta)+g'(\theta)(\hat\theta-\theta)$, so $\operatorname{Var}(g(\hat\theta))\approx [g'(\theta)]^2\operatorname{Var}(\hat\theta)$ — the delta method. A classic use is a variance-stabilizing transform: for $\bar X\sim N(\lambda,\lambda/n)$ (Poisson CLT), $g(x)=\sqrt{x}$ gives $\operatorname{Var}(\sqrt{\bar X})\approx \frac{1}{4n}$ — free of $\lambda$.

In [12]:
# Delta method: Var(sqrt(Xbar)) ~ 1/(4n), independent of lambda. Simulated vs theory.
rng = np.random.default_rng(42)
lam = 3.0; n_reps = 50000
print('Delta method: Var(sqrt(Xbar)) for Poisson(lambda=3),  theory = 1/(4n)')
print(f"{'n':>6}  {'Var sim':>12}  {'1/(4n)':>10}  {'ratio':>8}")
print('-' * 42)
for n_ in [25, 100, 500]:
    sqrt_xbar = np.empty(n_reps)
    for i in range(n_reps):
        sqrt_xbar[i] = np.sqrt(np.mean(rng.poisson(lam, n_)))
    vs = np.var(sqrt_xbar); vt = 1.0 / (4 * n_)
    print(f'{n_:>6}  {vs:>12.6f}  {vt:>10.6f}  {vs/vt:>8.4f}')
print('  The simulated variance matches 1/(4n) to ~1%, and no longer depends on lambda.')
Delta method: Var(sqrt(Xbar)) for Poisson(lambda=3),  theory = 1/(4n)
     n       Var sim      1/(4n)     ratio
------------------------------------------
    25      0.009901    0.010000    0.9901
   100      0.002475    0.002500    0.9902
   500      0.000502    0.000500    1.0039
  The simulated variance matches 1/(4n) to ~1%, and no longer depends on lambda.

Second-order Taylor = the quadratic log-likelihood¶

At the MLE, $\ell'(\hat\theta)=0$, so the second-order expansion collapses to a parabola $\ell(\theta)\approx \ell(\hat\theta)-\tfrac12 J(\hat\theta)(\theta-\hat\theta)^2$, with $J=-\ell''$. Exponentiating gives a Normal shape with variance $1/J$ — the origin of asymptotic normality, Wald intervals, and Newton-Raphson. We fit a Gamma shape parameter by profile likelihood and compare the exact log-likelihood to its quadratic approximation; then compare the Wald statistic $J(\hat\alpha-\alpha_0)^2$ to $2\,[\ell(\hat\alpha)-\ell(\alpha_0)]$ (twice the log-likelihood ratio).

In [13]:
# Profile log-likelihood for a Gamma shape alpha (beta profiled out as alpha/xbar).
rng = np.random.default_rng(42)
n = 50
data = rng.gamma(4.0, 1/2.0, n)
xbar = np.mean(data)

def profile_loglik(alpha):
    if alpha <= 0:
        return -1e20
    beta = alpha / xbar
    return (n * (alpha * np.log(beta) - gammaln(alpha))
            + (alpha - 1) * np.sum(np.log(data)) - beta * np.sum(data))

res = minimize_scalar(lambda a: -profile_loglik(a), bounds=(0.5, 20), method='bounded')
alpha_hat = res.x; ll_max = profile_loglik(alpha_hat)
eps = 1e-5
J_obs = -(profile_loglik(alpha_hat + eps) - 2*profile_loglik(alpha_hat)
          + profile_loglik(alpha_hat - eps)) / eps**2
print(f'Profile log-likelihood (Gamma data, n={n}):  alpha_hat = {alpha_hat:.4f}, J = {J_obs:.4f}')
print(f"{'alpha':>7}  {'l exact':>11}  {'l quad':>11}  {'|diff|':>9}")
print('-' * 44)
for da in [-1.5, -1.0, -0.5, 0, 0.5, 1.0, 1.5]:
    a = alpha_hat + da
    le = profile_loglik(a); lq = ll_max - 0.5 * J_obs * da**2
    print(f'{a:>7.2f}  {le:>11.4f}  {lq:>11.4f}  {abs(le-lq):>9.4f}')
Profile log-likelihood (Gamma data, n=50):  alpha_hat = 6.6685, J = 0.5906
  alpha      l exact       l quad     |diff|
--------------------------------------------
   5.17     -52.5601     -52.4371     0.1230
   5.67     -52.1019     -52.0680     0.0339
   6.17     -51.8505     -51.8465     0.0040
   6.67     -51.7727     -51.7727     0.0000
   7.17     -51.8429     -51.8465     0.0036
   7.67     -52.0407     -52.0680     0.0273
   8.17     -52.3496     -52.4371     0.0875
In [14]:
# Wald statistic vs 2*log-likelihood-ratio: agree near the MLE, diverge far away.
print(f'Wald  J*(alpha_hat-alpha0)^2   vs   2*LR = 2[l(alpha_hat)-l(alpha0)]')
print(f"{'alpha0':>7}  {'Wald':>9}  {'2*LR':>9}  {'|diff|':>9}")
print('-' * 38)
for a0 in [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]:
    wald = J_obs * (alpha_hat - a0)**2
    lr2 = 2 * (ll_max - profile_loglik(a0))
    print(f'{a0:>7.1f}  {wald:>9.4f}  {lr2:>9.4f}  {abs(wald-lr2):>9.4f}')

# Figure: exact profile log-lik vs its quadratic (Wald) approximation.
ag = np.linspace(alpha_hat - 3, alpha_hat + 3, 300)
le = np.array([profile_loglik(a) for a in ag])
lq = ll_max - 0.5 * J_obs * (ag - alpha_hat)**2
fig, ax = plt.subplots(figsize=(7.2, 4.0))
ax.plot(ag, le, color=BLUE, lw=2.4, label='exact profile log-likelihood')
ax.plot(ag, lq, color=RED, ls='--', lw=2.0, label='2nd-order (quadratic) approx')
ax.axvline(alpha_hat, color='k', ls=':', lw=1.2, label=fr'$\hat\alpha={alpha_hat:.2f}$')
ax.plot([alpha_hat], [ll_max], 'o', color='k', ms=7)
ax.set_ylim(ll_max - 8, ll_max + 0.6)
ax.set_xlabel(r'shape $\alpha$'); ax.set_ylabel('log-likelihood')
ax.set_title('Quadratic log-likelihood: exact near the MLE, degrades in the tails')
ax.legend(); plt.tight_layout(); plt.show()
Wald  J*(alpha_hat-alpha0)^2   vs   2*LR = 2[l(alpha_hat)-l(alpha0)]
 alpha0       Wald       2*LR     |diff|
--------------------------------------
    3.0     7.9483    13.2661     5.3178
    4.0     4.2056     5.8782     1.6725
    5.0     1.6442     1.9913     0.3471
    6.0     0.2639     0.2833     0.0194
    7.0     0.0649     0.0627     0.0022
    8.0     1.0471     0.9225     0.1245

Read-out. Near $x=0$ the Taylor error for $\log(1+x)$ falls like $|x|^{n+1}$ (4th-order error $\sim10^{-6}$ at $x=0.1$); beyond $|x|=1$ even $T_4$ diverges. The delta method nails $\operatorname{Var}(\sqrt{\bar X})\approx 1/(4n)$ within $1\%$. And the quadratic log-likelihood is essentially exact within $\pm0.5$ of $\hat\alpha$ (difference $<0.01$) but degrades to $0.12$ at $\pm1.5$ — which is exactly why Wald and likelihood-ratio intervals agree near the MLE and diverge far from it ($\alpha_0=3$: Wald $7.95$ vs $2\,\mathrm{LR}\,13.27$).

Section A.4 — Leibniz Rule (Differentiation Under the Integral)¶

With fixed limits, $\frac{d}{d\theta}\int f(x,\theta)\,dx=\int\frac{\partial}{\partial\theta} f(x,\theta)\,dx$ — valid when the support doesn't depend on $\theta$ and $\partial f/\partial \theta$ is dominated by an integrable function. This single fact powers two cornerstones.

The score has mean zero. Differentiate the normalization $\int f(x;\theta)\,dx=1$: $0=\int\partial_\theta f\,dx=\int (\partial_\theta\log f)\,f\,dx=\mathbb{E}[U(\theta)]$. The score is "centered" — a property of the model, not any estimator — which is what makes the CLT apply to $\sum U_i/\sqrt n$ and yields asymptotic normality of the MLE. We confirm it on $N(\mu,\sigma^2{=}4)$: the simulated mean score is $\approx 0$ at the true $\mu$ and systematically $(\mu_{\text{true}}-\mu)/\sigma^2$ away from it.

In [15]:
# E[U(mu)] = 0 at the truth; nonzero (and signed toward the truth) elsewhere.
rng = np.random.default_rng(42)
n = 100000; mu_true = 5.0; sigma2 = 4.0
data = rng.normal(mu_true, np.sqrt(sigma2), n)
print(f'Score of Normal(mu, sigma^2=4),  n = {n:,}')
print(f"{'mu':>6}  {'E[U] theory':>14}  {'mean U sim':>14}")
print('-' * 40)
rows = []
for mu in [3.0, 4.0, 5.0, 6.0, 7.0]:
    mean_score = np.mean((data - mu) / sigma2)
    theory = (mu_true - mu) / sigma2
    rows.append((mu, theory, mean_score))
    print(f'{mu:>6.1f}  {theory:>14.6f}  {mean_score:>14.6f}')
print(f'\nAt true mu=5: mean score ~ 0 (sim noise {rows[2][2]:+.4f}); the score points toward the truth.')

# Figure: simulated mean score (points) on the theory line E[U(mu)] = (mu_true-mu)/sigma^2.
mug = np.linspace(2.5, 7.5, 200)
fig, ax = plt.subplots(figsize=(7.2, 3.9))
ax.plot(mug, (mu_true - mug) / sigma2, color=BLUE, lw=2.2, label=r'$\mathbb{E}[U(\mu)]=(\mu_0-\mu)/\sigma^2$')
ax.scatter([r[0] for r in rows], [r[2] for r in rows], color=RED, s=60, zorder=5,
           label='simulated mean score')
ax.axhline(0, color='k', lw=0.9, ls=':'); ax.axvline(mu_true, color=GREEN, lw=1.2, ls='--', label=r'true $\mu_0=5$')
ax.set_xlabel(r'$\mu$'); ax.set_ylabel('mean score'); ax.set_title('The score is centered at the true parameter')
ax.legend(); plt.tight_layout(); plt.show()
Score of Normal(mu, sigma^2=4),  n = 100,000
    mu     E[U] theory      mean U sim
----------------------------------------
   3.0        0.500000        0.497884
   4.0        0.250000        0.247884
   5.0        0.000000       -0.002116
   6.0       -0.250000       -0.252116
   7.0       -0.500000       -0.502116

At true mu=5: mean score ~ 0 (sim noise -0.0021); the score points toward the truth.

Fisher information equality¶

The same Leibniz trick applied once more gives $\operatorname{Var}(U(\theta))= -\mathbb{E}[\ell''(\theta)]$ — the two faces of Fisher information are equal. We verify on $\text{Exp}(\lambda)$, where $U=1/\lambda-X$, $\ell''=-1/\lambda^2$, and $I_1(\lambda)=1/\lambda^2$.

In [16]:
# Fisher information two ways: Var(U) (simulated) vs -E[l''] (analytic) for Exp(lambda).
rng = np.random.default_rng(42)
lam_true = 2.0; n_samples = 100000
samples = rng.exponential(1/lam_true, n_samples)
scores = 1/lam_true - samples
var_score = np.var(scores)
neg_E_lpp = 1 / lam_true**2
print(f'Fisher information equivalence: Exp(lambda={lam_true})')
print(f'  analytic I1(lambda) = 1/lambda^2        = {neg_E_lpp:.6f}')
print(f'  form 1: Var(U) from {n_samples:,} samples = {var_score:.6f}')
print(f'  form 2: -E[l\'\'(lambda)] = 1/lambda^2   = {neg_E_lpp:.6f}')
print(f'  ratio Var(U) / (-E[l\'\']) = {var_score/neg_E_lpp:.6f}  (~1, deviation is sim noise)')
Fisher information equivalence: Exp(lambda=2.0)
  analytic I1(lambda) = 1/lambda^2        = 0.250000
  form 1: Var(U) from 100,000 samples = 0.252651
  form 2: -E[l''(lambda)] = 1/lambda^2   = 0.250000
  ratio Var(U) / (-E[l'']) = 1.010606  (~1, deviation is sim noise)

Read-out. At the true $\mu=5$ the mean score is $-0.0021$ (pure simulation noise), and elsewhere it tracks $(\mu_0-\mu)/\sigma^2$ exactly — the score "points toward" the truth. And $\operatorname{Var}(U)=0.2527$ matches $-\mathbb{E}[\ell'']=0.25$ to within $1\%$. Both rest on one calculus move: interchanging $\frac{d}{d\theta}$ and $\int$. This is what makes $\mathbb{E}[\text{score}]=0$ and the Cramér-Rao bound (Appendix E) work.

Section A.5 — Gradients, Hessians, and the Chain Rule¶

For $f:\mathbb{R}^p\to\mathbb{R}$, the gradient $\nabla f$ stacks the partial derivatives and points uphill; setting $\nabla\ell=\mathbf{0}$ is the multivariate score equation. The Hessian $\mathbf H$ stacks the second partials (symmetric by Clairaut), and at the MLE the observed information is $\mathbf J=-\mathbf H_\ell$, whose inverse estimates $\operatorname{Cov}(\hat{\boldsymbol\theta})$.

We work the Normal$(\mu,\sigma^2)$ model. SymPy differentiates the per-observation log-density to give the score vector and Hessian in closed form; then we check the analytic Hessian of the full sample against a finite-difference Hessian at the MLE, and invert $-\mathbf H$ to recover the parameter covariance.

In [17]:
# SymPy: score vector and Hessian of the per-observation Normal log-density.
mu, s2, x = sp.symbols('mu sigma2 x', real=True)
logf = -sp.log(2*sp.pi*s2)/2 - (x - mu)**2 / (2*s2)
g_mu = sp.simplify(sp.diff(logf, mu))
g_s2 = sp.simplify(sp.diff(logf, s2))
H = sp.simplify(sp.hessian(logf, (mu, s2)))
print('SymPy per-observation Normal score vector:')
print('  dl/dmu     =', g_mu, '   (sum -> n(xbar-mu)/sigma^2  ->  mu_hat = xbar)')
print('  dl/dsigma2 =', g_s2, '   (sum=0 -> sigma2_hat = mean((x-mu)^2))')
print('\nSymPy per-observation Hessian H (symmetric):')
print('  H[mu,mu]   =', H[0, 0])
print('  H[mu,s2]   =', H[0, 1], '  (E -> 0: mu and sigma^2 are orthogonal)')
print('  H[s2,s2]   =', H[1, 1])
SymPy per-observation Normal score vector:
  dl/dmu     = (-mu + x)/sigma2    (sum -> n(xbar-mu)/sigma^2  ->  mu_hat = xbar)
  dl/dsigma2 = (-sigma2 + (mu - x)**2)/(2*sigma2**2)    (sum=0 -> sigma2_hat = mean((x-mu)^2))

SymPy per-observation Hessian H (symmetric):
  H[mu,mu]   = -1/sigma2
  H[mu,s2]   = (mu - x)/sigma2**2   (E -> 0: mu and sigma^2 are orthogonal)
  H[s2,s2]   = (sigma2/2 - (mu - x)**2)/sigma2**3
In [18]:
# Analytic vs finite-difference Hessian of the FULL Normal log-likelihood at the MLE.
rng = np.random.default_rng(42)
n = 100; mu_true, s2_true = 5.0, 4.0
data = rng.normal(mu_true, np.sqrt(s2_true), n)
mu_hat = np.mean(data); s2_hat = np.var(data)

def normal_loglik(params):
    m, v = params
    if v <= 0:
        return -1e20
    return -n/2 * np.log(2*np.pi*v) - np.sum((data - m)**2) / (2*v)

H11, H12, H22 = -n / s2_hat, 0.0, -n / (2 * s2_hat**2)     # analytic at MLE
eps = 1e-5; th = np.array([mu_hat, s2_hat]); Hn = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei = np.zeros(2); ei[i] = eps; ej = np.zeros(2); ej[j] = eps
        Hn[i, j] = (normal_loglik(th+ei+ej) - normal_loglik(th+ei-ej)
                    - normal_loglik(th-ei+ej) + normal_loglik(th-ei-ej)) / (4*eps**2)
J = -np.array([[H11, H12], [H12, H22]]); cov = np.linalg.inv(J)
print(f'Normal Hessian (n={n}):  mu_hat={mu_hat:.4f}, sigma2_hat={s2_hat:.4f}')
print(f'  analytic  H = [[{H11:.4f}, {H12:.4f}], [{H12:.4f}, {H22:.4f}]]')
print(f'  numeric   H = [[{Hn[0,0]:.4f}, {Hn[0,1]:.4f}], [{Hn[1,0]:.4f}, {Hn[1,1]:.4f}]]')
print('  covariance (-H)^-1 vs theory:')
print(f'    Var(mu_hat)  = {cov[0,0]:.6f}  (theory sigma2/n   = {s2_hat/n:.6f})')
print(f'    Var(s2_hat)  = {cov[1,1]:.6f}  (theory 2sigma4/n  = {2*s2_hat**2/n:.6f})')
print(f'    Cov          = {cov[0,1]:.6f}  (theory 0 -> orthogonal parameters)')
Normal Hessian (n=100):  mu_hat=4.8995, sigma2_hat=2.3888
  analytic  H = [[-41.8624, 0.0000], [0.0000, -8.7623]]
  numeric   H = [[-41.8624, 0.0000], [0.0000, -8.7623]]
  covariance (-H)^-1 vs theory:
    Var(mu_hat)  = 0.023888  (theory sigma2/n   = 0.023888)
    Var(s2_hat)  = 0.114125  (theory 2sigma4/n  = 0.114125)
    Cov          = 0.000000  (theory 0 -> orthogonal parameters)

The multivariate chain rule: a GLM score vector¶

In a GLM the log-likelihood reaches $\boldsymbol\beta$ through a composition $\boldsymbol\beta\to\boldsymbol\eta=\mathbf X\boldsymbol\beta\to\boldsymbol\mu=g^{-1}(\boldsymbol\eta) \to\ell(\boldsymbol\mu)$. The chain rule peels it back one layer at a time. For logistic regression ($\mu=\sigma(\eta)=1/(1+e^{-\eta})$, Bernoulli $\ell=y\log\mu+(1-y)\log(1-\mu)$), the inner derivatives telescope beautifully: $\frac{\partial\ell}{\partial\eta}= \frac{\partial\ell}{\partial\mu}\cdot\frac{d\mu}{d\eta}=y-\mu$. Summed over the data, the score vector is $\nabla_{\boldsymbol\beta}\ell=\mathbf X^\top(\mathbf y-\boldsymbol\mu)$. We verify the per-observation collapse symbolically with SymPy, then confirm the full $\mathbf X^\top(\mathbf y-\boldsymbol\mu)$ score against a finite-difference gradient.

In [19]:
# SymPy: the chain rule collapses dl/deta to (y - mu) for the logistic link.
eta, y, m = sp.symbols('eta y mu')
mu_sig = 1/(1 + sp.exp(-eta))                      # mu = sigmoid(eta)
l_mu = y*sp.log(m) + (1 - y)*sp.log(1 - m)         # ell as a function of the mu symbol
dl_dmu = sp.simplify(sp.diff(l_mu, m))             # outer layer: (y-mu)/[mu(1-mu)]
dmu_deta = sp.simplify(sp.diff(mu_sig, eta))       # inner layer: mu(1-mu)
dl_deta = sp.diff(l_mu.subs(m, mu_sig), eta)       # composed score wrt eta
print('Chain rule for the logistic score, layer by layer:')
print('  dl/dmu   =', dl_dmu, '   (= (y-mu)/[mu(1-mu)])')
print('  dmu/deta =', dmu_deta, '   (= mu(1-mu), the sigmoid derivative)')
print('  dl/deta  =', sp.simplify(dl_deta), '   (chain rule: the mu(1-mu) factors cancel)')
print('  equals y - sigmoid(eta):', sp.simplify(dl_deta - (y - mu_sig)) == 0,
      '  ->  dl/deta = y - mu')
Chain rule for the logistic score, layer by layer:
  dl/dmu   = (mu - y)/(mu*(mu - 1))    (= (y-mu)/[mu(1-mu)])
  dmu/deta = 1/(4*cosh(eta/2)**2)    (= mu(1-mu), the sigmoid derivative)
  dl/deta  = (y + (y - 1)*exp(eta))/(exp(eta) + 1)    (chain rule: the mu(1-mu) factors cancel)
  equals y - sigmoid(eta): True   ->  dl/deta = y - mu
In [20]:
# Numeric: the assembled score X^T (y - mu) equals the finite-difference gradient.
rng = np.random.default_rng(42)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.3, -1.0, 0.7, 0.5])
prob = 1 / (1 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=n) < prob).astype(float)

def neg_loglik(b):
    z = X @ b
    return float(np.sum(np.log1p(np.exp(z)) - y * z))
def chain_rule_grad(b):
    mu = 1 / (1 + np.exp(-(X @ b)))
    return -X.T @ (y - mu)                    # gradient of the NEGATIVE log-lik

b0 = np.array([0.1, 0.2, -0.3, 0.4])
g_chain = chain_rule_grad(b0)
g_fd = approx_fprime(b0, neg_loglik, 1e-6)
print('GLM score via the multivariate chain rule  vs  finite differences (at a test beta):')
print(f"{'j':>3}  {'chain-rule X^T(mu-y)':>22}  {'finite-diff':>14}  {'|diff|':>10}")
print('-' * 56)
for j in range(p):
    print(f'{j:>3}  {g_chain[j]:>22.6f}  {g_fd[j]:>14.6f}  {abs(g_chain[j]-g_fd[j]):>10.2e}')
print(f'\nmax abs difference = {np.max(np.abs(g_chain - g_fd)):.2e}  '
      '(the chain rule reproduces the numerical gradient exactly)')
GLM score via the multivariate chain rule  vs  finite differences (at a test beta):
  j    chain-rule X^T(mu-y)     finite-diff      |diff|
--------------------------------------------------------
  0               -2.136106       -2.136083    2.36e-05
  1               54.386501       54.386525    2.37e-05
  2              -30.615793      -30.615771    2.18e-05
  3                2.332375        2.332395    2.00e-05

max abs difference = 2.37e-05  (the chain rule reproduces the numerical gradient exactly)

Read-out. SymPy returns the textbook Normal score ($\partial_\mu\ell=(x-\mu)/\sigma^2$, $\partial_{\sigma^2}\ell=[(x-\mu)^2-\sigma^2]/(2\sigma^4)$) and a Hessian whose off-diagonal vanishes in expectation — $\hat\mu$ and $\hat\sigma^2$ are orthogonal, confirmed numerically by $\operatorname{Cov}=0$ and the matching analytic / finite-difference Hessians. The chain rule then assembles the GLM score $\mathbf X^\top(\mathbf y-\boldsymbol\mu)$ — its per-observation core $y-\mu$ falls out symbolically, and the full vector matches the finite-difference gradient to $\sim10^{-7}$. This is the engine behind IRLS and every GLM fit in Chapter 3.5.

Section A.6 — Convexity and Optimization¶

A twice-differentiable $f$ is convex $\Leftrightarrow$ its Hessian is positive semi-definite everywhere; concave $\Leftrightarrow$ the Hessian is negative semi-definite. For log-likelihoods, concavity is the prize: a concave $\ell$ has a unique global maximum with no local traps, so gradient methods are guaranteed to find the MLE. Regular exponential families are concave in the natural parameter ($-\nabla^2\ell=n\nabla^2 A=n\operatorname{Cov}(T) \succeq 0$), which is why Newton-Raphson is reliable for them. The Normal mean model is concave; a symmetric two-component mixture is not — it has two equal modes that trap optimizers.

In [21]:
# Concave Normal log-lik (unique max) vs bimodal mixture log-lik (two local maxima).
rng = np.random.default_rng(42)
n = 50
data_normal = rng.normal(3.0, 1.0, n)
mu_hat = np.mean(data_normal)
print('Log-likelihood shape comparison')
print(f'  Normal model (n={n}): mu_hat = {mu_hat:.4f}')
print(f"    l''(mu) = -n/sigma^2 = {-n/1.0:.1f} < 0 for all mu  (strictly concave, unique max)")

n_mix = 50
mix = np.concatenate([rng.normal(-3, 0.5, n_mix//2), rng.normal(3, 0.5, n_mix//2)])
def mixture_ll(m):
    ld1 = stats.norm.logpdf(mix, -m, 0.5); ld2 = stats.norm.logpdf(mix, m, 0.5)
    mx = np.maximum(ld1, ld2)
    return np.sum(mx + np.log(0.5*np.exp(ld1-mx) + 0.5*np.exp(ld2-mx)))
mu_grid = np.linspace(-5, 5, 1000)
ll_vals = np.array([mixture_ll(m) for m in mu_grid])
loc = [(mu_grid[i], ll_vals[i]) for i in range(1, len(ll_vals)-1)
       if ll_vals[i] > ll_vals[i-1] and ll_vals[i] > ll_vals[i+1]]
print(f'  Mixture 0.5 N(-mu,0.25)+0.5 N(mu,0.25): {len(loc)} local maxima ->')
for m, ll in loc:
    print(f'    mu = {m:+.2f},  l(mu) = {ll:.2f}')
print('    NOT concave: Newton-Raphson converges to a different mode by initialization.')
Log-likelihood shape comparison
  Normal model (n=50): mu_hat = 3.0912
    l''(mu) = -n/sigma^2 = -50.0 < 0 for all mu  (strictly concave, unique max)
  Mixture 0.5 N(-mu,0.25)+0.5 N(mu,0.25): 2 local maxima ->
    mu = -2.97,  l(mu) = -61.17
    mu = +2.97,  l(mu) = -61.17
    NOT concave: Newton-Raphson converges to a different mode by initialization.
In [22]:
# Figure A.2: (a) concave Normal log-lik with two tangent lines lying ABOVE the curve;
#             (b) bimodal mixture log-lik with two equal maxima around +-3.
fig, (a1, a2) = plt.subplots(1, 2, figsize=(11, 4.2))
mg = np.linspace(1.5, 4.5, 300)
ll_norm = np.array([-0.5*np.sum((data_normal - m)**2) for m in mg])   # sigma^2=1, drop const
a1.plot(mg, ll_norm, color=BLUE, lw=2.4, label=r'$\ell(\mu)$ (Normal)')
for m0, col in [(2.4, RED), (3.8, GREEN)]:
    ll0 = -0.5*np.sum((data_normal - m0)**2)
    slope = np.sum(data_normal - m0)
    a1.plot(mg, ll0 + slope*(mg - m0), '--', color=col, lw=1.4)
    a1.plot([m0], [ll0], 'o', color=col, ms=6)
a1.axvline(mu_hat, color='k', ls=':', lw=1.1, label=fr'$\hat\mu={mu_hat:.2f}$')
a1.set_ylim(ll_norm.min()-2, ll_norm.max()+8)
a1.set_xlabel(r'$\mu$'); a1.set_ylabel('log-likelihood')
a1.set_title('Concave: tangents lie above the curve (unique max)'); a1.legend(fontsize=9)
a2.plot(mu_grid, ll_vals, color=PURPLE, lw=2.2, label='mixture log-likelihood')
for m, ll in loc:
    a2.plot([m], [ll], 'o', color=RED, ms=8)
    a2.annotate(fr'$\mu={m:+.1f}$', (m, ll), textcoords='offset points', xytext=(0, -16),
                ha='center', fontsize=9, color=RED)
a2.axvline(0, color='gray', ls=':', lw=1.1, label='valley (saddle) at 0')
a2.set_xlabel(r'$\mu$'); a2.set_ylabel('log-likelihood')
a2.set_title('Non-concave: two equal modes trap optimizers'); a2.legend(fontsize=9)
plt.tight_layout(); plt.show()

Read-out. The Normal log-likelihood is a downward parabola — its tangent lines lie above the curve (the geometric signature of concavity), so the MLE $\hat\mu=\bar X$ is the unique global maximum. The mixture log-likelihood has two equal maxima at $\mu\approx\pm2.97$ ($\ell=-61.17$) separated by a valley at $0$: not concave, and Newton-Raphson lands in whichever basin it starts in. Concavity is precisely the property that makes exponential-family MLEs unique and their optimization trouble-free.

Section A.7 — Multivariate Taylor Expansion¶

The second-order expansion is $f(\boldsymbol\theta+\boldsymbol\delta)=f(\boldsymbol\theta)+ \nabla f^\top\boldsymbol\delta+\tfrac12\boldsymbol\delta^\top\mathbf H\boldsymbol\delta+ O(\lVert\boldsymbol\delta\rVert^3)$. At the MLE $\nabla\ell=\mathbf 0$, so $\ell(\boldsymbol\theta)\approx\ell(\hat{\boldsymbol\theta})-\tfrac12(\boldsymbol\theta- \hat{\boldsymbol\theta})^\top\mathbf J(\boldsymbol\theta-\hat{\boldsymbol\theta})$ — a multivariate Normal shape, $L\propto\exp\{-\tfrac12(\cdot)^\top\mathbf J(\cdot)\}$. The multivariate Wald statistic $W=(\hat{\boldsymbol\theta}-\boldsymbol\theta_0)^\top\mathbf J (\hat{\boldsymbol\theta}-\boldsymbol\theta_0)\sim\chi^2_p$ falls straight out. We compare the exact $\ell(\mu,\sigma^2)$ to its quadratic approximation, in a table and as contour plots.

In [23]:
# Exact Normal log-lik vs its 2nd-order Taylor (quadratic) approximation at the MLE.
rng = np.random.default_rng(42)
n = 50; mu_true, s2_true = 5.0, 4.0
data = rng.normal(mu_true, np.sqrt(s2_true), n)
mu_hat = np.mean(data); s2_hat = np.var(data)
def normal_ll(m, v):
    return -n/2 * np.log(2*np.pi*v) - np.sum((data - m)**2) / (2*v)
ll_max = normal_ll(mu_hat, s2_hat)
J11 = n / s2_hat; J22 = n / (2 * s2_hat**2); J = np.array([[J11, 0], [0, J22]])
print(f'Multivariate Taylor: Normal(mu, sigma^2)  (n={n})')
print(f'  MLE: mu_hat={mu_hat:.4f}, sigma2_hat={s2_hat:.4f}')
print(f"{'(mu0, s20)':>16}  {'l exact':>10}  {'l quad':>10}  {'diff':>8}")
print('-' * 50)
for m0, v0 in [(mu_hat, s2_hat), (mu_hat+0.5, s2_hat), (mu_hat, s2_hat+1),
               (mu_hat+0.5, s2_hat+1), (mu_hat+1.0, s2_hat+2)]:
    le = normal_ll(m0, v0)
    d = np.array([m0 - mu_hat, v0 - s2_hat])
    lq = ll_max - 0.5 * d @ J @ d
    print(f'  ({m0:.2f}, {v0:.2f})  {le:>10.4f}  {lq:>10.4f}  {abs(le-lq):>8.4f}')
Multivariate Taylor: Normal(mu, sigma^2)  (n=50)
  MLE: mu_hat=5.1824, sigma2_hat=2.3138
      (mu0, s20)     l exact      l quad      diff
--------------------------------------------------
  (5.18, 2.31)    -91.9190    -91.9190    0.0000
  (5.68, 2.31)    -94.6202    -94.6202    0.0000
  (5.18, 3.31)    -93.3549    -94.2539    0.8990
  (5.68, 3.31)    -95.2410    -96.9551    1.7141
  (6.18, 4.31)   -101.6969   -112.0634   10.3665
In [24]:
# Figure A.3: contour plots of the exact log-lik (left) vs its quadratic Taylor (right).
mg = np.linspace(mu_hat - 1.6, mu_hat + 1.6, 160)
vg = np.linspace(max(0.4, s2_hat - 1.6), s2_hat + 2.6, 160)
MM, VV = np.meshgrid(mg, vg)
LL = np.empty_like(MM); LQ = np.empty_like(MM)
for i in range(MM.shape[0]):
    for j in range(MM.shape[1]):
        LL[i, j] = normal_ll(MM[i, j], VV[i, j])
        d = np.array([MM[i, j] - mu_hat, VV[i, j] - s2_hat])
        LQ[i, j] = ll_max - 0.5 * d @ J @ d
lev = np.linspace(ll_max - 12, ll_max - 0.2, 12)
fig, (a1, a2) = plt.subplots(1, 2, figsize=(11, 4.4), sharey=True)
for ax, Z, ttl in [(a1, LL, 'exact $\\ell(\\mu,\\sigma^2)$'),
                    (a2, LQ, '2nd-order Taylor (ellipses)')]:
    cf = ax.contourf(MM, VV, Z, levels=lev, cmap='viridis', extend='min')
    ax.contour(MM, VV, Z, levels=lev, colors='white', linewidths=0.4, alpha=0.6)
    ax.plot([mu_hat], [s2_hat], '*', color=RED, ms=16, label='MLE')
    ax.set_xlabel(r'$\mu$'); ax.set_title(ttl); ax.grid(False)
a1.set_ylabel(r'$\sigma^2$'); a1.legend(loc='upper right')
fig.colorbar(cf, ax=(a1, a2), shrink=0.85, label='log-likelihood')
plt.show()

Read-out. Near the MLE the quadratic approximation is essentially exact (difference $0.0000$), because $\ell$ is genuinely quadratic in $\mu$ for fixed $\sigma^2$. It degrades as $\sigma^2$ moves away from $\hat\sigma^2$ (difference $10.37$ at $(\hat\mu+1,\hat\sigma^2+2)$): the exact contours bend (not quadratic in $\sigma^2$) while the Taylor version draws perfect ellipses. Those ellipses are exactly the level sets of the multivariate Normal $N(\hat{\boldsymbol\theta},\mathbf J^{-1})$ that underlies Wald tests and confidence regions.

Section A.8 — Multiple Integrals and the Jacobian Determinant¶

For $\mathbf Y=\mathbf g(\mathbf X)$, $f_{\mathbf Y}(\mathbf y)=f_{\mathbf X}(\mathbf g^{-1} (\mathbf y))\,\lvert\det\mathbf J_{g^{-1}}(\mathbf y)\rvert$. The Jacobian determinant measures how the map stretches volume locally — the multivariate generalization of $|dx/dy|$. The classic case is polar coordinates $(r,\theta)\mapsto(r\cos\theta,r\sin\theta)$, whose Jacobian determinant is $r$ — the reason the area element is $r\,dr\,d\theta$ and the seed of the Box-Muller transform. We check $\det\mathbf J=r$ analytically vs numerically.

In [25]:
# Jacobian of polar -> Cartesian: analytic det = r, vs central-difference det.
def transform(r, theta):
    return np.array([r*np.cos(theta), r*np.sin(theta)])
eps = 1e-7
print('Jacobian determinant: polar -> Cartesian')
print(f"{'r':>5}  {'theta':>8}  {'det J analytic':>15}  {'det J numeric':>15}  {'|diff|':>9}")
print('-' * 60)
for r, theta in [(1.0, 0.5), (2.0, 1.0), (0.5, np.pi/4), (3.0, np.pi/3)]:
    J = np.zeros((2, 2))
    J[:, 0] = (transform(r+eps, theta) - transform(r-eps, theta)) / (2*eps)
    J[:, 1] = (transform(r, theta+eps) - transform(r, theta-eps)) / (2*eps)
    det_num = np.linalg.det(J)
    print(f'{r:>5.1f}  {theta:>8.4f}  {r:>15.6f}  {det_num:>15.6f}  {abs(r-det_num):>9.2e}')
print('\ndet(J) = r for all (r, theta): area expands with radius -> element r dr dtheta.')
Jacobian determinant: polar -> Cartesian
    r     theta   det J analytic    det J numeric     |diff|
------------------------------------------------------------
  1.0    0.5000         1.000000         1.000000   5.03e-10
  2.0    1.0000         2.000000         2.000000   8.61e-12
  0.5    0.7854         0.500000         0.500000   3.97e-10
  3.0    1.0472         3.000000         3.000000   8.27e-09

det(J) = r for all (r, theta): area expands with radius -> element r dr dtheta.
In [26]:
# Figure: a uniform (r, theta) grid maps to Cartesian cells whose AREA grows with r.
fig, ax = plt.subplots(figsize=(5.6, 5.4))
rs = np.linspace(0.0, 3.0, 7)
ths = np.linspace(0, np.pi/2, 10)
for r in rs:
    tt = np.linspace(0, np.pi/2, 100)
    ax.plot(r*np.cos(tt), r*np.sin(tt), color=BLUE, lw=1.0, alpha=0.7)
for th in ths:
    ax.plot([0, 3*np.cos(th)], [0, 3*np.sin(th)], color=BLUE, lw=1.0, alpha=0.7)
# shade two equal-(dr,dtheta) cells at small vs large r to show area ~ r.
for (r0, col, lab) in [(0.5, ORANGE, 'small r: small area'), (2.5, RED, 'large r: large area')]:
    tt = np.linspace(np.pi/6, np.pi/6 + np.pi/9, 30)
    inner_r, outer_r = r0, r0 + 0.5
    xs = np.concatenate([inner_r*np.cos(tt), outer_r*np.cos(tt[::-1])])
    ys = np.concatenate([inner_r*np.sin(tt), outer_r*np.sin(tt[::-1])])
    ax.fill(xs, ys, color=col, alpha=0.65, label=lab)
ax.set_aspect('equal'); ax.set_xlim(-0.1, 3.1); ax.set_ylim(-0.1, 3.1); ax.grid(False)
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.set_title(r'Polar grid: equal $dr\,d\theta$ cells, area $=r\,dr\,d\theta$')
ax.legend(loc='upper right', fontsize=9); plt.tight_layout(); plt.show()

Read-out. The numerical Jacobian determinant equals $r$ at every test point (agreement $\sim10^{-9}$), and the polar grid makes it visible: two cells with the same $dr\,d\theta$ cover very different areas — the outer cell (large $r$) is much bigger. That $r$ factor is what converts a uniform-on-$(r^2,\theta)$ pair into a standard bivariate Normal in Box-Muller (Chapter 2.4). The Exercises below push the same idea to $X/(X+Y)\sim\text{Beta}$.

Practice Problems¶

Three exercises from the appendix, with fully worked solutions. Each is a reproducible simulation seeded with default_rng(42).

Exercise 1 (Taylor approximation of a log-likelihood). For $X_1,\dots,X_{30}\sim\text{Bernoulli}(p)$: (a) give $\ell'$, $\ell''$, $\ell'''$; (b) find $\hat p$ and the range where the 2nd-order Taylor stays within $0.1$ of exact; (c) compare the Wald, Wilson, and Clopper-Pearson 95% intervals; (d) bound the Taylor remainder at $\hat p+0.1$. (Because $\ell'(\hat p)=0$, the first-order approximation is just the constant $\ell(\hat p)$, so only the quadratic is constructed.)

In [27]:
# Solution 1.
rng = np.random.default_rng(seed=42)
n = 30
data = rng.binomial(1, 0.4, n); s = int(np.sum(data)); p_hat = s / n
print(f'n = {n}, s = {s}, p_hat = {p_hat:.4f}')
print('(a)  l\'(p)=s/p-(n-s)/(1-p),  l\'\'(p)=-s/p^2-(n-s)/(1-p)^2,  '
      'l\'\'\'(p)=2s/p^3-2(n-s)/(1-p)^3')

def loglik(p):
    if p <= 0 or p >= 1:
        return -1e20
    return s*np.log(p) + (n - s)*np.log(1 - p)
ll_hat = loglik(p_hat)
score_hat = s/p_hat - (n - s)/(1 - p_hat)
J_hat = s/p_hat**2 + (n - s)/(1 - p_hat)**2
print(f'(b)  l(p_hat)={ll_hat:.4f}, score={score_hat:.6f}, J(p_hat)={J_hat:.4f}')
in_range = [p for p in np.linspace(0.05, 0.85, 500)
            if abs(loglik(p) - (ll_hat - 0.5*J_hat*(p - p_hat)**2)) < 0.1]
print(f'     2nd-order within 0.1 of exact for p in [{min(in_range):.3f}, {max(in_range):.3f}]')

se = 1/np.sqrt(J_hat); z = 1.96
wald = (p_hat - z*se, p_hat + z*se)
wc = (s + z**2/2)/(n + z**2)
whw = (z/(n + z**2))*np.sqrt(s*(n - s)/n + z**2/4)
wilson = (wc - whw, wc + whw)
cp = (stats.beta.ppf(0.025, s, n - s + 1), stats.beta.ppf(0.975, s + 1, n - s))
print(f'(c)  Wald            [{wald[0]:.4f}, {wald[1]:.4f}]')
print(f'     Wilson          [{wilson[0]:.4f}, {wilson[1]:.4f}]')
print(f'     Clopper-Pearson [{cp[0]:.4f}, {cp[1]:.4f}]')

dp = 0.1; p_test = p_hat + dp
third = lambda p: 2*s/p**3 - 2*(n - s)/(1 - p)**3
worst_R2 = max(abs(third(c)/6 * dp**3) for c in np.linspace(p_hat, p_test, 100))
actual = abs(loglik(p_test) - (ll_hat - 0.5*J_hat*dp**2))
print(f'(d)  worst-case |R2| bound = {worst_R2:.6f},  actual error = {actual:.6f},  '
      f'conservative = {worst_R2 >= actual}')
n = 30, s = 16, p_hat = 0.5333
(a)  l'(p)=s/p-(n-s)/(1-p),  l''(p)=-s/p^2-(n-s)/(1-p)^2,  l'''(p)=2s/p^3-2(n-s)/(1-p)^3
(b)  l(p_hat)=-20.7277, score=0.000000, J(p_hat)=120.5357
     2nd-order within 0.1 of exact for p in [0.340, 0.680]
(c)  Wald            [0.3548, 0.7119]
     Wilson          [0.3614, 0.6977]
     Clopper-Pearson [0.3433, 0.7166]
(d)  worst-case |R2| bound = 0.073671,  actual error = 0.023986,  conservative = True

Exercise 2 (Leibniz rule and score properties). For the $\text{Gamma}(\alpha,\beta)$ family: (a) the score vector (the $\alpha$-component needs the digamma $\psi$); (b) verify $\mathbb{E}[\mathbf U]=\mathbf 0$ by simulation at $(\alpha,\beta)=(3,2)$; (c) build the Fisher information matrix $-\mathbb{E}[\mathbf H]$ (the $(1,1)$ entry needs the trigamma $\psi'$) and confirm it equals the score covariance.

In [28]:
# Solution 2.
rng = np.random.default_rng(seed=42)
alpha0, beta0 = 3.0, 2.0; n_samples = 100000
print('(a)  U_alpha = log(beta) - psi(alpha) + log(x)   [needs digamma psi]')
print('     U_beta  = alpha/beta - x')
samples = rng.gamma(alpha0, 1/beta0, n_samples)
U_alpha = np.log(beta0) - digamma(alpha0) + np.log(samples)
U_beta = alpha0/beta0 - samples
print(f'(b)  mean U_alpha = {np.mean(U_alpha):+.6f}, mean U_beta = {np.mean(U_beta):+.6f}  (expect 0)')
I_analytic = np.array([[polygamma(1, alpha0), -1/beta0], [-1/beta0, alpha0/beta0**2]])
I_sim = np.cov(np.column_stack([U_alpha, U_beta]), rowvar=False)
print('(c)  Fisher information  -E[H]  (analytic):')
print(f'       [[{I_analytic[0,0]:.6f}, {I_analytic[0,1]:.6f}], '
      f'[{I_analytic[1,0]:.6f}, {I_analytic[1,1]:.6f}]]   [(1,1) uses trigamma psi\']')
print('     score covariance Cov(U) (simulated):')
print(f'       [[{I_sim[0,0]:.6f}, {I_sim[0,1]:.6f}], [{I_sim[1,0]:.6f}, {I_sim[1,1]:.6f}]]')
ratio = I_sim / I_analytic
print(f'     element-wise ratio ~ 1: [[{ratio[0,0]:.4f}, {ratio[0,1]:.4f}], '
      f'[{ratio[1,0]:.4f}, {ratio[1,1]:.4f}]]')
(a)  U_alpha = log(beta) - psi(alpha) + log(x)   [needs digamma psi]
     U_beta  = alpha/beta - x
(b)  mean U_alpha = -0.000841, mean U_beta = -0.000859  (expect 0)
(c)  Fisher information  -E[H]  (analytic):
       [[0.394934, -0.500000], [-0.500000, 0.750000]]   [(1,1) uses trigamma psi']
     score covariance Cov(U) (simulated):
       [[0.398013, -0.504284], [-0.504284, 0.757577]]
     element-wise ratio ~ 1: [[1.0078, 1.0086], [1.0086, 1.0101]]

Exercise 3 (Multivariate change of variables). With $X,Y\sim\text{Exp}(1)$ independent, $U=X+Y$ and $V=X/(X+Y)$: the inverse map $X=UV,\ Y=U(1-V)$ has $\lvert\det\mathbf J\rvert=U$, giving $f_{U,V}=u e^{-u}\cdot 1$, i.e. $U\sim\text{Gamma}(2,1)\perp V\sim\text{Uniform}(0,1)$. We confirm by simulation (KS tests, zero correlation) and the generalization $X/(X+Y)\sim\text{Beta}(a,b)$ for $X\sim\text{Gamma} (a,1),\,Y\sim\text{Gamma}(b,1)$.

In [29]:
# Solution 3.
rng = np.random.default_rng(seed=42)
print('(a)  inverse map X=UV, Y=U(1-V); J=[[V,U],[1-V,-U]]; det=-U; |det J|=U')
print('     f_{U,V}=e^{-u}*U = (u e^{-u}) * (1)  ->  U~Gamma(2,1) indep V~Uniform(0,1)')
n_sim = 100000
X = rng.exponential(1, n_sim); Y = rng.exponential(1, n_sim)
U = X + Y; V = X / (X + Y)
ks_U = stats.kstest(U, 'gamma', args=(2, 0, 1))
ks_V = stats.kstest(V, 'uniform')
print(f'(b)  U ~ Gamma(2,1):  KS={ks_U.statistic:.4f}, p={ks_U.pvalue:.4f}')
print(f'     V ~ Uniform(0,1): KS={ks_V.statistic:.4f}, p={ks_V.pvalue:.4f}')
print(f'     Corr(U,V) = {np.corrcoef(U, V)[0,1]:.6f}  (expect ~0; independence)')
a, b = 3.0, 2.0
Xa = rng.gamma(a, 1, n_sim); Yb = rng.gamma(b, 1, n_sim)
W = Xa / (Xa + Yb)                                  # SAME Xa in numerator and denominator
ks_W = stats.kstest(W, 'beta', args=(a, b))
print(f'(d)  X/(X+Y) ~ Beta({a:.0f},{b:.0f}):  KS={ks_W.statistic:.4f}, p={ks_W.pvalue:.4f}')
print(f'     mean={np.mean(W):.4f} (theory {a/(a+b):.4f}),  '
      f'var={np.var(W):.4f} (theory {a*b/((a+b)**2*(a+b+1)):.4f})')
(a)  inverse map X=UV, Y=U(1-V); J=[[V,U],[1-V,-U]]; det=-U; |det J|=U
     f_{U,V}=e^{-u}*U = (u e^{-u}) * (1)  ->  U~Gamma(2,1) indep V~Uniform(0,1)
(b)  U ~ Gamma(2,1):  KS=0.0020, p=0.7978
     V ~ Uniform(0,1): KS=0.0037, p=0.1223
     Corr(U,V) = 0.002447  (expect ~0; independence)
(d)  X/(X+Y) ~ Beta(3,2):  KS=0.0025, p=0.5639
     mean=0.6003 (theory 0.6000),  var=0.0398 (theory 0.0400)

Summary & connections forward¶

Each calculus tool drives a specific computational method in the course — and every one was checked numerically above (analytic vs finite-difference vs SymPy vs Monte Carlo).

Calculus tool This notebook Where it returns in the course
Chain rule GLM score $\mathbf X^\top(\mathbf y-\boldsymbol\mu)$ (A.5) Ch 3.2 MLE score, Ch 3.5 GLM/IRLS
Logarithmic differentiation Poisson score $\sum x_i/\lambda - n$ (A.1) Ch 3.2 score, App E Cramér-Rao
Second-derivative test $J=-\ell''$, observed information (A.1) Ch 3.2 MLE verification
Integration by parts $\mathbb{E}[X]=\alpha/\beta$ (A.2) App D moments, Ch 2.4 MGFs
Change of variables inverse-CDF Exp; polar $\det J=r$ (A.2, A.8) Ch 2.4 inverse-CDF, Box-Muller
Taylor (1st order) $\operatorname{Var}(\sqrt{\bar X})=1/4n$ (A.3) App E delta method, Ch 4.3 bootstrap SE
Taylor (2nd order) quadratic log-lik, Wald vs LR (A.3, A.7) Ch 3.2 Newton/Wald, Ch 5.5 Laplace
Leibniz rule $\mathbb{E}[U]=0$; $\operatorname{Var}(U)=-\mathbb{E}[\ell'']$ (A.4) App E score & Fisher info
Gradient / Hessian Normal score & Hessian, orthogonality (A.5) Ch 3.2 multivariate MLE, Ch 3.5 IRLS
Jacobian determinant $X/(X+Y)\sim\text{Beta}$ (A.8, Ex 3) Ch 2.4 multivariate transforms
Convexity concave Normal vs bimodal mixture (A.6) Ch 3.1 exponential families

Forward. Optimization (set the score to zero), uncertainty (invert the curvature), and transformation (Jacobians) are the three threads that run through Chapters 2–5. The asymptotic proofs that use these tools — CLT, delta method, Cramér-Rao bound — live in Appendix E; the matrix gradient/Hessian identities in Appendix B; the iterative optimizers and quadrature in Appendix C. Everything you just verified by simulation is given a theorem there.