Computational Methods in Data Science

Four Faces of Leibniz

Each identity starts from an integral that equals something known, passes $\partial/\partial(\cdot)$ through $\int$, and extracts a useful relationship from the resulting constraint. We don’t learn these rules just to find the MLE of a coin flip. We learn them because they are the mathematical engines that make reinforcement learning, generative AI, and complex spatial modeling computationally possible.

Throughout Cases 1–3, interchanging differentiation and integration is valid under standard regularity conditions: support that does not depend on the parameter, differentiability a.e., and a dominating integrable bound. When these fail ($\text{Uniform}(0,\theta)$ violates fixed support; $\text{Laplace}(\theta)$ violates smoothness), the downstream identities can fail with them. In a computational pipeline these failures are not abstract: Monte Carlo gradient estimates become biased, Hessian-based estimates of the Fisher information become singular or negative definite, and optimizers receive corrupted curvature information. What a theorist calls a “regularity violation” shows up as gradients that explode, vanish, or point the wrong way. Case 4 requires a boundary condition instead (stated explicitly below).

Case 1

Classical Score

Starting identity

$$\int f(x|\theta)\,dx = 1 \quad \text{for all } \theta$$

First Leibniz — pass $\frac{\partial}{\partial\theta}$ through $\int$

$$0 \;=\; \frac{\partial}{\partial\theta}\int f(x|\theta)\,dx \;=\; \int \frac{\partial f}{\partial\theta}\,dx \;=\; \int \frac{\partial \log f}{\partial\theta}\cdot f\,dx \;=\; \mathbb{E}_\theta\!\left[\frac{\partial \log f}{\partial\theta}\right]$$
$$\mathbb{E}_\theta\!\left[U(\theta)\right] = 0$$

where $U(\theta) = \frac{\partial \log f(x|\theta)}{\partial\theta}$ denotes the score.
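To see the identity numerically, here is a minimal Monte Carlo sketch (assuming only NumPy, with illustrative values of $\theta$ and $n$): for $\text{Normal}(\theta,1)$ the score $x - \theta$ averages to zero, while for $\text{Uniform}(0,\theta)$, whose support moves with $\theta$, the score is the constant $-1/\theta$ and the identity fails.

```python
# Minimal sketch (assumes only NumPy): Monte Carlo check of E_theta[U] = 0.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 2.0, 1_000_000

# Regular case -- Normal(theta, 1): score = d/dtheta log f = x - theta.
x = rng.normal(loc=theta, scale=1.0, size=n)
print(f"Normal  E[U] ~ {np.mean(x - theta):+.4f}")      # ~ +0.0000

# Irregular case -- Uniform(0, theta): f = 1/theta on (0, theta), so the
# score is the constant -1/theta for every draw. Its mean is -1/theta, not 0:
# the support depends on theta, the interchange is invalid, the identity fails.
u = rng.uniform(low=0.0, high=theta, size=n)
score_u = np.full_like(u, -1.0 / theta)
print(f"Uniform E[U] ~ {np.mean(score_u):+.4f}")        # -0.5000
```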

Second Leibniz — pass $\frac{\partial}{\partial\theta}$ through $\int$ again

Starting from $\int \frac{\partial \log f}{\partial\theta} \cdot f\,dx = 0$, differentiate the integrand via the product rule:

$$\frac{\partial}{\partial\theta}\!\!\left(\frac{\partial \log f}{\partial\theta}\cdot f\right) = \frac{\partial^2 \log f}{\partial\theta^2}\cdot f \;+\; \left(\frac{\partial \log f}{\partial\theta}\right)^{\!2}\cdot f$$

Integrate both sides over $x$ and pass $\frac{\partial}{\partial\theta}$ through $\int$ a second time: the left side becomes $\frac{\partial}{\partial\theta}\int \frac{\partial \log f}{\partial\theta}\,f\,dx = \frac{\partial}{\partial\theta}\,0 = 0$, so:

$$0 = \mathbb{E}_\theta\!\left[\frac{\partial^2 \log f}{\partial\theta^2}\right] + \mathbb{E}_\theta\!\left[\left(\frac{\partial \log f}{\partial\theta}\right)^{\!2}\right]$$
$$I(\theta) = \mathbb{E}_\theta\!\left[\left(\frac{\partial \log f}{\partial\theta}\right)^{\!2}\right]$$
$$I(\theta) = -\,\mathbb{E}_\theta\!\left[\frac{\partial^2 \log f}{\partial\theta^2}\right]$$

The definition is a variance — always non-negative, requires only $\mathbb{E}[U]=0$. The second form is a theorem: it requires the second interchange to be valid. When it fails — $\text{Uniform}(0,\theta)$, $\text{Laplace}(\theta)$ — the two expressions disagree, and the definition governs.
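The disagreement is easy to exhibit. A sketch, again assuming only NumPy: for $\text{Normal}(\theta,1)$ both forms estimate $I(\theta) = 1$; for $\text{Laplace}(\theta)$ the variance form gives 1 while the Hessian form gives 0, since $\partial^2 \log f/\partial\theta^2 = 0$ almost everywhere.

```python
# Sketch (NumPy only): the two Fisher information expressions, by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.0, 1_000_000

# Normal(theta, 1): d log f/dtheta = x - theta, d^2 log f/dtheta^2 = -1.
x = rng.normal(theta, 1.0, size=n)
print(np.mean((x - theta) ** 2), -np.mean(np.full(n, -1.0)))    # ~1.0 and 1.0

# Laplace(theta): d log f/dtheta = sign(x - theta) a.e., second derivative 0 a.e.
y = rng.laplace(loc=theta, scale=1.0, size=n)
print(np.mean(np.sign(y - theta) ** 2), -np.mean(np.zeros(n)))  # 1.0 and -0.0
```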

What it gives you
MLE theory, Fisher information, Cramér–Rao bound, score tests, natural gradient.

Case 2

Log-Partition Function

Starting identity

$$e^{A(\eta)} = \int h(x)\exp\!\big\{\eta^\top T(x)\big\}\,dx$$

where $A(\eta) = \log \int h(x)\exp\{\eta^\top T(x)\}\,dx$ is the log-partition function. (For discrete $X$, replace $\int\cdots\,dx$ with $\sum_x$ throughout; the parameter $\eta$ is always continuous, so all derivatives with respect to $\eta$ proceed unchanged.)

First Leibniz — pass $\frac{\partial}{\partial\eta_j}$ through $\int$

$$\frac{\partial}{\partial\eta_j}e^{A(\eta)} = \int h(x)\,T_j(x)\,\exp\!\big\{\eta^\top T(x)\big\}\,dx$$

Left side: $\frac{\partial A}{\partial\eta_j}\,e^{A(\eta)}$ by the chain rule. Right side: $e^{A(\eta)}\,\mathbb{E}_\eta[T_j(X)]$, since $f(x|\eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\}$. Cancel $e^{A(\eta)}$:

$$\nabla A(\eta) = \mathbb{E}_\eta[T(X)]$$

Second Leibniz — pass $\frac{\partial}{\partial\eta_k}$ through $\int$ again

Differentiate $\frac{\partial A}{\partial\eta_j} = \mathbb{E}_\eta[T_j] = \int T_j(x)\,f(x|\eta)\,dx$ with respect to $\eta_k$. By Leibniz:

$$\frac{\partial^2 A}{\partial\eta_j\,\partial\eta_k} = \int T_j(x)\,\frac{\partial f(x|\eta)}{\partial\eta_k}\,dx$$

We need $\frac{\partial f}{\partial\eta_k}$. Since $f(x|\eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\}$, the chain rule gives:

$$\frac{\partial f}{\partial\eta_k} = f(x|\eta)\cdot\!\left(T_k(x) - \frac{\partial A}{\partial\eta_k}\right) = f(x|\eta)\cdot\!\left(T_k(x) - \mathbb{E}_\eta[T_k]\right)$$

where the last step uses the first Leibniz result $\frac{\partial A}{\partial\eta_k} = \mathbb{E}_\eta[T_k]$. Substituting back:

$$\frac{\partial^2 A}{\partial\eta_j\,\partial\eta_k} = \int T_j(x)\big(T_k(x) - \mathbb{E}_\eta[T_k]\big)\,f(x|\eta)\,dx = \mathbb{E}_\eta[T_j\,T_k] - \mathbb{E}_\eta[T_j]\,\mathbb{E}_\eta[T_k]$$
$$\nabla^2 A(\eta) = \mathrm{Var}_\eta\!\big(T(X)\big) \;\succeq\; 0$$

$A$ is convex. Fisher information in natural parameters: $I(\eta) = \nabla^2 A(\eta) = \mathrm{Var}_\eta(T)$. Moments of any order follow by further differentiation.
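A quick numeric check of both identities, a sketch assuming only NumPy, using the Poisson family in natural form ($T(x) = x$, $A(\eta) = e^\eta$): finite differences of $A$ should match the sample mean and variance of $T$.

```python
# Sketch (NumPy only): check grad A = E[T] and Hessian of A = Var(T) for the
# Poisson family in natural parameterization: T(x) = x, A(eta) = exp(eta).
import numpy as np

rng = np.random.default_rng(0)
eta, n, h = 0.5, 1_000_000, 1e-4
A = np.exp                                   # log-partition function

dA  = (A(eta + h) - A(eta - h)) / (2 * h)    # central finite differences
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2

xs = rng.poisson(lam=np.exp(eta), size=n)    # mean parameter lambda = e^eta
print(f"A'(eta)  = {dA:.4f}   E[T]   = {xs.mean():.4f}")
print(f"A''(eta) = {d2A:.4f}   Var(T) = {xs.var():.4f}")
```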

What it gives you
Exponential family theory, moment-generating properties, convexity of the log-partition, Fisher information via $\nabla^2 A(\eta)$.

Case 3

REINFORCE / Policy Gradient

Starting identity

$$V(\theta) = \mathbb{E}_{p(x|\theta)}[f(x)] = \int f(x)\,p(x|\theta)\,dx$$

Here $f(x)$ is a reward that does not depend on $\theta$, and $p(x|\theta)$ is a policy or sampling distribution parameterized by $\theta$. The goal is to find $\theta$ that maximizes $V(\theta)$ — we want the policy that produces, on average, the highest reward. The natural approach is gradient ascent: $\theta \leftarrow \theta + \alpha\,\nabla_\theta V(\theta)$. So we need $\nabla_\theta V(\theta)$.

The problem — gradient of an expectation you can only sample from

We want $\nabla_\theta V(\theta)$, but the expectation is taken over $p(x|\theta)$, the very distribution we're differentiating with respect to. We can draw samples from $p(x|\theta)$, but we cannot differentiate through the sampling process itself: if $x$ is discrete, or if $p$ is a black-box simulator, automatic differentiation sees no path from $\theta$ through the samples, and there is no way to push $\nabla_\theta$ inside naively. The Leibniz interchange is what makes this computable at all: without it, there is no gradient signal, and gradient ascent cannot even start.

The solution — same idea as Chapter 2

Recall the fundamental Monte Carlo move from Chapter 2.1: to estimate an integral, rewrite it as an expectation under a distribution you can sample from, then average over samples. Algorithm 2.1 computes $\hat{I}_n = \frac{1}{n}\sum_i h(X_i)$ where $X_i \sim f$. REINFORCE does the same thing for a gradient: it rewrites $\nabla_\theta V$ as $\mathbb{E}_{p(x|\theta)}[\,\cdot\,]$, giving a Monte Carlo estimator

$$\widehat{\nabla_\theta V} = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\,\nabla_\theta \log p(x_i|\theta), \qquad x_i \sim p(\cdot|\theta)$$

Draw samples, evaluate $f(x_i)\,\nabla_\theta \log p(x_i|\theta)$ at each one, average. The only new ingredient is the Leibniz step that makes the rewrite possible — the estimation itself is Algorithm 2.1 applied to a vector-valued integrand.
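Here is the estimator on a toy problem where the exact answer is known, a sketch assuming only NumPy: policy $x \sim \text{Normal}(\theta,1)$, reward $f(x) = x^2$, so $V(\theta) = \theta^2 + 1$ and $\nabla_\theta V = 2\theta$.

```python
# Sketch (NumPy only): REINFORCE on a toy problem with a known answer.
# Policy: x ~ Normal(theta, 1). Reward: f(x) = x^2. Then V(theta) = theta^2 + 1.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 1_000_000

x = rng.normal(theta, 1.0, size=n)
score = x - theta                    # grad_theta log p(x|theta) for unit variance
grad_hat = np.mean(x**2 * score)     # Algorithm 2.1 applied to f(x) * score

print(f"REINFORCE estimate: {grad_hat:.3f}   exact 2*theta: {2 * theta:.3f}")
```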

Leibniz — pass $\nabla_\theta$ through $\int$

$$\nabla_\theta V(\theta) = \int f(x)\,\nabla_\theta p(x|\theta)\,dx$$

Apply the same identity as Case 1: $\nabla_\theta p = p\,\nabla_\theta \log p$:

$$= \int f(x)\,p(x|\theta)\,\nabla_\theta \log p(x|\theta)\,dx$$
$$\nabla_\theta\,\mathbb{E}_{p(x|\theta)}[f(x)] = \mathbb{E}_{p(x|\theta)}\!\left[f(x)\,\nabla_\theta \log p(x|\theta)\right]$$

Variance reduction via Case 1

The estimator above is unbiased, but just as Chapter 2 warned that naive Monte Carlo has high variance for rare events, the REINFORCE estimator can have enormous variance — $f(x)$ can be large and $\nabla_\theta \log p(x|\theta)$ fluctuates across samples. The baseline trick exploits Case 1 to reduce this variance without introducing bias.

Since $\mathbb{E}_{p(x|\theta)}[\nabla_\theta \log p(x|\theta)] = 0$ (from Case 1), for any constant $b$ that does not depend on $x$:

$$\mathbb{E}_{p(x|\theta)}\!\left[b\,\nabla_\theta \log p(x|\theta)\right] = b\,\mathbb{E}_{p(x|\theta)}\!\left[\nabla_\theta \log p(x|\theta)\right] = 0$$

So we can subtract $b$ from $f(x)$ without changing the expected gradient:

$$\nabla_\theta V(\theta) = \mathbb{E}_{p(x|\theta)}\!\left[\big(f(x) - b\big)\,\nabla_\theta \log p(x|\theta)\right]$$

This helps because the variance of the estimator scales with $\mathbb{E}[(f(x) - b)^2 \|\nabla_\theta \log p(x|\theta)\|^2]$. By centering $f(x)$ around $b$, we shrink the magnitude of the terms being averaged. The optimal scalar baseline minimizes this variance:

$$b^* = \frac{\mathbb{E}_{p(x|\theta)}\!\left[f(x)\,\|\nabla_\theta \log p(x|\theta)\|^2\right]}{\mathbb{E}_{p(x|\theta)}\!\left[\|\nabla_\theta \log p(x|\theta)\|^2\right]}$$

In practice, $b^*$ is estimated from samples, and simpler choices work well: the running mean of rewards in RL, or a learned value function $V_\phi(s)$ in actor-critic methods. The key point is that any $b$ independent of $x$ is valid — the mean-zero property of the score guarantees unbiasedness.
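Continuing the toy problem from above (a sketch, NumPy only): with $b$ set to the mean reward, the per-sample gradient terms keep the same mean, $2\theta$, but their variance roughly halves.

```python
# Sketch (NumPy only): the baseline leaves the mean unchanged, shrinks variance.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 1_000_000

x = rng.normal(theta, 1.0, size=n)
f, score = x**2, x - theta               # reward and score, as before

for b in (0.0, f.mean()):                # b = 0 vs. a running-mean-style baseline
    g = (f - b) * score                  # per-sample gradient terms
    print(f"b = {b:5.2f}:  mean = {g.mean():.3f},  variance = {g.var():.2f}")
```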

What it gives you
Policy gradients in RL, RLHF fine-tuning of language models, score function estimator for ELBO gradients in variational inference, discrete latent variable models where reparameterization is unavailable.

Case 4

Score Matching

The problem — flexible models with intractable normalizing constants

Suppose you want to model a complex distribution by specifying how “unlikely” each configuration $x$ is through an energy function $E_\theta(x)$ — a scalar-valued function where low energy means high probability and high energy means low probability. The model takes the Boltzmann form $p_\theta(x) = \exp(-E_\theta(x))\,/\,Z(\theta)$, where the normalizing constant $Z(\theta) = \int \exp(-E_\theta(x))\,dx$ ensures the density integrates to 1. The energy can be as flexible as you like — a neural network, a pairwise interaction model, any function that captures the structure you believe is in the data.

The catch: MLE requires evaluating the likelihood, which requires the normalizing constant $Z(\theta) = \int \exp(-E_\theta(x))\,dx$. For any model rich enough to be interesting, this integral is intractable: it is a sum or integral over all possible configurations of $x$. This is not a rare corner case. It arises in undirected graphical models, exponential random graph models, energy-based models with neural-network energies, and non-Gaussian spatial models.

Without $Z(\theta)$, you cannot compute the likelihood, so you cannot do MLE, you cannot do Bayesian updates, and you cannot compare models by AIC or BIC. Score matching offers an escape: fit the model without ever evaluating $Z(\theta)$.

The key observation — $\nabla_x$ kills the normalizing constant

We want to fit a model $p_\theta(x)$ to data from unknown $p_{\text{data}}(x)$. The data-side score is $\mathbf{s}(x) = \nabla_x \log p(x)$ — gradient w.r.t. $x$, not $\theta$. Unlike the classical score from Case 1 (a scalar measuring sensitivity to the parameter), this is a vector field over data space: at every point $x$, it points in the direction where the density increases fastest. Think of the density $p(x)$ as a landscape — the data-side score is the slope at every point, telling you which way is “uphill” toward higher probability. Score matching fits this vector field rather than the probability surface itself, and this is precisely what lets the normalizing constant drop out.

If $p_\theta(x) = q_\theta(x)/Z(\theta)$, then:

$$\nabla_x \log p_\theta(x) = \nabla_x \log q_\theta(x) - \underbrace{\nabla_x \log Z(\theta)}_{=\;0} = \nabla_x \log q_\theta(x)$$

$Z(\theta)$ does not depend on $x$, so its $x$-gradient vanishes.
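A tiny numeric check, a sketch assuming only NumPy, with a made-up unnormalized model $q(x) = \exp(-x^4/4 - x^2/2)$: the finite-difference $x$-gradient of $\log p = \log q - \log Z$ coincides with the analytic gradient of $\log q$, because the constant $\log Z$ cancels.

```python
# Sketch (NumPy only): grad_x log p == grad_x log q, since log Z is constant in x.
# Hypothetical unnormalized model: q(x) = exp(-x^4/4 - x^2/2), Z "intractable".
import numpy as np

log_q = lambda t: -t**4 / 4 - t**2 / 2

# Compute Z by brute-force quadrature ONLY to show it was never needed.
grid = np.linspace(-10.0, 10.0, 200_001)
Z = np.sum(np.exp(log_q(grid))) * (grid[1] - grid[0])

x, h = 0.7, 1e-5
score_fd = ((log_q(x + h) - np.log(Z)) - (log_q(x - h) - np.log(Z))) / (2 * h)
score_analytic = -(x**3 + x)                 # grad_x log q, no Z in sight
print(score_fd, score_analytic)              # agree to finite-difference error
```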

The Fisher divergence

$$D_F(p_{\text{data}} \| p_\theta) = \tfrac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\!\left[\left\|\nabla_x \log p_{\text{data}}(x) - \nabla_x \log p_\theta(x)\right\|^2\right]$$

Expanding the square gives three terms: $\tfrac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\|\nabla_x \log p_{\text{data}}\|^2$, which is constant in $\theta$; $\tfrac{1}{2}\,\mathbb{E}_{p_{\text{data}}}\|\nabla_x \log p_\theta\|^2$, which is computable; and the cross term $-\,\mathbb{E}_{p_{\text{data}}}\!\left[\nabla_x \log p_{\text{data}} \cdot \nabla_x \log p_\theta\right]$, which requires the unknown $\nabla_x \log p_{\text{data}}$.

Integration by parts in $x$ — the key move

For each component $i$:

$$\mathbb{E}_{p_{\text{data}}}\!\left[\frac{\partial \log p_{\text{data}}}{\partial x_i}\cdot\frac{\partial \log p_\theta}{\partial x_i}\right] = \int \frac{\partial \log p_\theta}{\partial x_i}\cdot\frac{\partial p_{\text{data}}}{\partial x_i}\,dx$$

Integrate by parts in $x_i$ (assuming $p_{\text{data}}(x)\,\frac{\partial \log p_\theta(x)}{\partial x_i} \to 0$ as $\|x\| \to \infty$, so the boundary term vanishes):

$$= -\int p_{\text{data}}(x)\,\frac{\partial^2 \log p_\theta(x)}{\partial x_i^2}\,dx = -\,\mathbb{E}_{p_{\text{data}}}\!\left[\frac{\partial^2 \log p_\theta}{\partial x_i^2}\right]$$
Substituting back into the expansion of $D_F$:

$$D_F(p_{\text{data}} \| p_\theta) = \mathbb{E}_{p_{\text{data}}}\!\left[\sum_{i=1}^d \frac{\partial^2 \log p_\theta}{\partial x_i^2} + \frac{1}{2}\!\left(\frac{\partial \log p_\theta}{\partial x_i}\right)^{\!2}\right] + \text{const}$$

The right side involves only derivatives of $\log p_\theta$ evaluated at data points — no $p_{\text{data}}$ needed, no normalizing constant needed.
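To make this concrete, a sketch assuming only NumPy, fitting a hypothetical unnormalized Gaussian model $q(x) = \exp\{-(x-\mu)^2/(2s)\}$: minimizing the empirical objective recovers the sample mean and variance without ever evaluating the normalizer $\sqrt{2\pi s}$.

```python
# Sketch (NumPy only): score matching for an unnormalized 1-D Gaussian model
#   q(x) = exp(-(x - mu)^2 / (2 s)),
# using only d log q/dx = -(x - mu)/s and d^2 log q/dx^2 = -1/s.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 2.0, size=50_000)          # true mu = 3, s = sigma^2 = 4

def J(mu, s, x):
    # Empirical Hyvarinen objective: second derivative + half squared score.
    return np.mean(-1.0 / s + 0.5 * ((x - mu) / s) ** 2)

# Setting dJ/dmu = 0 and dJ/ds = 0 gives closed-form minimizers:
#   mu_hat = sample mean,  s_hat = mean squared deviation.
mu_hat = data.mean()
s_hat = np.mean((data - mu_hat) ** 2)

print(f"score-matching fit: mu = {mu_hat:.3f}, s = {s_hat:.3f}")
print(f"J at fit:    {J(mu_hat, s_hat, data):+.4f}")
print(f"J perturbed: {J(mu_hat + 0.5, s_hat, data):+.4f}")   # higher, as expected
```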

Where is Leibniz? Integration by parts is Leibniz

In Cases 1–3, we used $\int \frac{\partial}{\partial\theta}[\cdots]\,dx = \frac{\partial}{\partial\theta}\int[\cdots]\,dx$ — moving a $\theta$-derivative past an $x$-integral. Here the derivative is in $x$, the same variable we integrate over, so the tool is integration by parts rather than Leibniz in its standard form. But it is the same structural move. Start from the product rule:

$$\frac{\partial}{\partial x_i}\!\left[p_{\text{data}}(x)\,\frac{\partial \log p_\theta(x)}{\partial x_i}\right] = \frac{\partial p_{\text{data}}}{\partial x_i}\cdot\frac{\partial \log p_\theta}{\partial x_i} \;+\; p_{\text{data}}\cdot\frac{\partial^2 \log p_\theta}{\partial x_i^2}$$

Integrate both sides over all $x$. The left side is a total derivative — it integrates to a boundary term $\big[p_{\text{data}}\cdot\frac{\partial \log p_\theta}{\partial x_i}\big]_{-\infty}^{+\infty}$ that vanishes when $p_{\text{data}}(x)\,\frac{\partial \log p_\theta}{\partial x_i} \to 0$ as $\|x\|\to\infty$:

$$0 = \int \frac{\partial p_{\text{data}}}{\partial x_i}\cdot\frac{\partial \log p_\theta}{\partial x_i}\,dx \;+\; \int p_{\text{data}}\cdot\frac{\partial^2 \log p_\theta}{\partial x_i^2}\,dx$$

This is exactly the pattern: an integral equals something known (zero, because the boundary terms vanish), and the constraint transfers a derivative from one factor ($p_{\text{data}}$, which we don't know) to another ($\log p_\theta$, which we do). In Cases 1–3, the known constant was $\int f\,dx = 1$. Here it is $\big[p_{\text{data}}\cdot\frac{\partial \log p_\theta}{\partial x_i}\big]_{-\infty}^{+\infty} = 0$. Different constant, same move.
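The transfer can be verified numerically. A sketch assuming only NumPy, with $p_{\text{data}} = N(0,1)$ and an illustrative model $p_\theta = N(\mu, 1)$, so that $\partial_x \log p_\theta = -(x - \mu)$ and $\partial_x^2 \log p_\theta = -1$; both sides come out equal to 1.

```python
# Sketch (NumPy only): numeric check of the integration-by-parts transfer
#   integral (dp_data/dx)(d log p_theta/dx) dx = -E_pdata[d^2 log p_theta/dx^2]
# with p_data = N(0, 1) and model p_theta = N(mu, 1).
import numpy as np

mu = 0.7
x = np.linspace(-12.0, 12.0, 200_001)
dx = x[1] - x[0]

p_data = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
dp_data = -x * p_data                       # analytic derivative of the N(0,1) pdf

lhs = np.sum(dp_data * (-(x - mu))) * dx    # derivative sits on p_data
rhs = -np.sum(p_data * (-1.0)) * dx         # derivative moved onto log p_theta
print(lhs, rhs)                             # both ~ 1.0
```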

What it gives you
Parameter estimation for any model where the normalizing constant is intractable: undirected graphical models, exponential random graph models, energy-based models, non-Gaussian spatial models. This same idea — matching scores instead of likelihoods — is the foundation of several important methods you may encounter beyond this course, including denoising score matching, sliced score matching, and the diffusion models (e.g., DALL-E, Stable Diffusion) that drive modern generative AI.

One Move, Four Domains

The same structural trick — differentiate an integral that equals something known — across statistics, exponential family theory, reinforcement learning, and generative modeling.

| Case | Integral | Diff. w.r.t. | Key identity | Applications |
|------|----------|--------------|--------------|--------------|
| Score | $\int f(x\mid\theta)\,dx = 1$ | $\theta$ | $\mathbb{E}_\theta[\nabla_\theta \log f] = 0$ | MLE, Cramér–Rao, Fisher info, natural gradient |
| Log-partition | $e^{A(\eta)} = \int h\,e^{\eta^\top T}\,dx$ | $\eta$ | $\nabla A = \mathbb{E}[T]$, $\nabla^2 A = \mathrm{Var}(T)$ | Exp. families, convexity, moment generation |
| REINFORCE | $V = \int f(x)\,p(x\mid\theta)\,dx$ | $\theta$ | $\nabla_\theta V = \mathbb{E}[f\cdot\nabla_\theta\!\log p]$ | Policy gradients, RLHF, variational inference |
| Score matching | $\big[p_{\text{data}}\,\partial_{x_i}\!\log p_\theta\big]_{-\infty}^{\infty} = 0$ | $x$ | cross term $= -\mathbb{E}[\nabla_x^2\!\log p_\theta]$ | Energy-based models, graphical models, ERGMs |