IRLS for Logistic Regression — 2D Classification

No intercept: P(Y=1|x) = σ(β1x1 + β2x2). Decision boundary is the p = 0.5 contour through the origin.
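A minimal sketch of the model and its boundary (the β values here are illustrative, not fitted): p = 0.5 exactly where β1x1 + β2x2 = 0, i.e. the line x2 = −(β1/β2)x1 through the origin.

```python
import numpy as np

def sigmoid(eta):
    # logistic function: maps the linear predictor to (0, 1)
    return 1 / (1 + np.exp(-eta))

beta = np.array([1.5, -1.0])    # illustrative coefficients

# boundary slope: beta1*x1 + beta2*x2 = 0  =>  x2 = -(beta1/beta2)*x1
slope = -beta[0] / beta[1]

x_on_boundary = np.array([1.0, slope])
print(sigmoid(x_on_boundary @ beta))   # → 0.5
```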
Complete separation detected. The MLE does not exist in finite parameter space — β diverges along the separating direction. Watch the weights w_i collapse toward 0 as μ_i → 0 or 1, and XᵀWX become increasingly ill-conditioned.
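That failure mode is easy to reproduce. A sketch on a made-up, perfectly separable four-point dataset (not the demo's data): each IRLS step inflates ‖β‖ while min(w_i) heads toward 0, so no finite fixed point exists.

```python
import numpy as np
from scipy.special import expit

# Perfectly separable toy data (illustrative): sign(x1) predicts y exactly
X = np.array([[-2.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [2.0, -1.0]])
y = np.array([0, 0, 1, 1])

beta = np.zeros(2)
norms, min_ws = [], []
for t in range(6):
    mu = expit(X @ beta)
    w = np.clip(mu * (1 - mu), 1e-12, None)   # guard, like the JS demo's cap
    z = X @ beta + (y - mu) / w               # working response
    Xw = X * w[:, None]
    beta = np.linalg.solve(Xw.T @ X, Xw.T @ z)
    norms.append(np.linalg.norm(beta))
    min_ws.append(w.min())

print(norms)    # ||beta|| keeps growing: no finite MLE
print(min_ws)   # weights collapse as fitted probabilities saturate
```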
[Interactive figure, four panels: (1) P(Y=1|x) surface with the data (y=0, y=1), the true and fitted p = 0.5 decision boundaries, and probability contours at .1/.25/.75/.9; (2) the log-likelihood surface in (β1, β2) space with the IRLS path, the full GD path, the MLE, and the true β; (3) the working response — the WLS view at the current iteration, point size ∝ w_i, showing the WLS fit z = Xβ(t+1) and the identity line z = η; (4) log-log convergence, IRLS (quadratic) vs. GD (linear). Tables below show the IRLS computation at the current iteration, per-observation quantities, and the convergence history.]
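The convergence panel's claim — quadratic for IRLS/Newton, linear for GD — can be checked numerically. A sketch on made-up non-separable data (seed and the hand-tuned GD step size are assumptions), comparing the gradient norm per iteration:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.binomial(1, expit(X @ np.array([1.0, -0.5])))

def grad(beta):
    # score vector: grad of the log-likelihood = X^T (y - mu)
    return X.T @ (y - expit(X @ beta))

# IRLS / Newton: solve against the Fisher information each step
b = np.zeros(2)
irls_gnorm = []
for _ in range(8):
    mu = expit(X @ b)
    H = (X * (mu * (1 - mu))[:, None]).T @ X
    b = b + np.linalg.solve(H, grad(b))
    irls_gnorm.append(np.linalg.norm(grad(b)))

# Gradient ascent with a fixed, hand-tuned step (assumption)
b = np.zeros(2)
gd_gnorm = []
for _ in range(8):
    b = b + 0.01 * grad(b)
    gd_gnorm.append(np.linalg.norm(grad(b)))

print(irls_gnorm)   # collapses superlinearly
print(gd_gnorm)     # shrinks by a roughly constant factor per step
```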
Python reference — what you should actually write vs. what's under the hood
The JavaScript above uses explicit 2×2 inversion and scalar loops. In practice, you fit GLMs with statsmodels. Below: the two-line version first, then the manual IRLS that sm.GLM.fit() runs internally.
What you write
import numpy as np
import statsmodels.api as sm

# Generate data (same setup as the simulation)
rng = np.random.default_rng(418)
n = 20
X = rng.normal(size=(n, 2)) * 1.5
beta_true = np.array([1.5, -1.0])
p_true = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)

# Fit: GLM with Binomial family, logit link (canonical), no intercept
# fit() calls IRLS internally — uses QR on the weighted system
model = sm.GLM(y, X, family=sm.families.Binomial())
result = model.fit()

print(result.summary())
print(f"Converged in {result.fit_history['iteration']} IRLS iterations")

# Diagnostics that connect to this simulation:
print(f"Coefficients:   {result.params}")
print(f"Log-likelihood: {result.llf:.6f}")
print(f"Deviance:       {result.deviance:.6f}")

# Detect separation: check the condition number of the final X'WX
mu_hat = result.mu
w_hat = mu_hat * (1 - mu_hat)
XtWX = (X * w_hat[:, None]).T @ X
print(f"cond(X'WX):     {np.linalg.cond(XtWX):.2e}")
print(f"min(w_i):       {w_hat.min():.4e}")
What statsmodels does internally (manual IRLS)
from scipy.special import expit  # numerically stable sigmoid

def irls_logistic(X, y, beta_init, max_iter=25, tol=1e-10):
    """Manual IRLS matching the simulation step-by-step."""
    beta = beta_init.copy()
    history = []

    for t in range(max_iter):
        # --- All vectorized: no loops over observations ---
        eta = X @ beta                        # (n,) linear predictor
        mu = expit(eta)                       # (n,) stable sigmoid, not 1/(1+exp(-x))
        w = mu * (1 - mu)                    # (n,) weights = b''(eta)
        z = eta + (y - mu) / w                # (n,) working response

        # --- Broadcasting: X * w[:,None] avoids n×n diag(w) ---
        Xw = X * w[:, None]                   # (n, p) row-scaled design matrix
        XtWX = Xw.T @ X                       # (p, p) Fisher information
        XtWz = Xw.T @ z                       # (p,) weighted RHS

        # --- SOLVE, never invert ---
        # statsmodels uses QR on sqrt(W)*X instead of forming X'WX
        # for better numerical stability; here we use the normal equations
        beta_new = np.linalg.solve(XtWX, XtWz)

        # --- Monitor stability (what the JS hides) ---
        cond = np.linalg.cond(XtWX)
        score = X.T @ (y - mu)
        grad_norm = np.linalg.norm(score)

        history.append({
            'beta': beta.copy(), 'll': np.sum(y*np.log(mu+1e-15) + (1-y)*np.log(1-mu+1e-15)),
            'grad_norm': grad_norm, 'cond': cond, 'min_w': w.min(),
        })

        if grad_norm < tol:
            beta = beta_new
            break

        if cond > 1e12:
            print(f"⚠ Iter {t}: cond(X'WX)={cond:.1e}, min(w)={w.min():.1e} — separation?")

        beta = beta_new

    return beta, history

# Verify: manual matches statsmodels
beta_manual, hist = irls_logistic(X, y, np.zeros(2))
print(f"statsmodels: {result.params}")
print(f"manual IRLS: {beta_manual}")
print(f"max |diff|:  {np.max(np.abs(result.params - beta_manual)):.2e}")
Three anti-patterns the JS demo commits (and why):
1. Explicit inversion — inv2(M) computes (XᵀWX)⁻¹ directly. In production, statsmodels uses QR on sqrt(W) @ X, which avoids forming XᵀWX entirely. The JS gets away with Cramer's formula because p = 2.
2. Scalar loops — The JS accumulates the Hessian element-by-element. In NumPy: (X * w[:, None]).T @ X. The broadcasting trick X * w[:, None] scales each row by its weight without the n×n diagonal.
3. Silent weight thresholding — The JS caps w_i at 10⁻¹² to prevent NaN. Don't threshold silently — monitor np.linalg.cond(XtWX) and switch to penalized likelihood or a prior when the condition number explodes. statsmodels raises PerfectSeparationError when it detects this.
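The fix for anti-pattern 1 can be sketched in a few lines. This is an illustration on made-up data, not statsmodels' actual code path, and it uses np.linalg.lstsq (SVD-based) as a stand-in for a QR solve: both work on sqrt(W)·X directly, so the condition number is not squared the way it is when XᵀWX is formed.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = rng.binomial(1, expit(X @ np.array([1.0, -0.5])))

# One IRLS step from beta = 0
beta = np.zeros(2)
mu = expit(X @ beta)
w = mu * (1 - mu)
z = X @ beta + (y - mu) / w

# Normal-equations route (what the manual IRLS above does)
Xw = X * w[:, None]
b_normal = np.linalg.solve(Xw.T @ X, Xw.T @ z)

# Factorization route: minimize ||sqrt(W)(z - X b)||^2 without forming X'WX
sw = np.sqrt(w)
b_stable, *_ = np.linalg.lstsq(sw[:, None] * X, sw * z, rcond=None)

print(np.max(np.abs(b_normal - b_stable)))   # agree to machine precision here
```

On well-conditioned problems the two routes agree; the difference only shows up as XᵀWX approaches singularity, exactly the separation regime above.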