Python reference — what you should actually write vs. what's under the hood
The JavaScript above uses explicit 2×2 inversion and scalar loops.
In practice, you fit GLMs with statsmodels. Below: the two-line
version first, then the manual IRLS that sm.GLM.fit() runs internally.
What you write
import numpy as np
import statsmodels.api as sm
# Generate data (same setup as the simulation)
rng = np.random.default_rng(418)
n = 20
X = rng.normal(size=(n, 2)) * 1.5
beta_true = np.array([1.5, -1.0])
p_true = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)
# Fit: GLM with Binomial family, logit link (canonical), no intercept
# fit() runs IRLS internally — each step solves a weighted least-squares problem
model = sm.GLM(y, X, family=sm.families.Binomial())
result = model.fit()
print(result.summary())
print(f"Converged in {result.fit_history['iteration']} IRLS iterations")
# Diagnostics that connect to this simulation:
print(f"Coefficients: {result.params}")
print(f"Log-likelihood: {result.llf:.6f}")
print(f"Deviance: {result.deviance:.6f}")
# Detect separation: check the condition number of the final X'WX
mu_hat = result.mu
w_hat = mu_hat * (1 - mu_hat)
XtWX = (X * w_hat[:, None]).T @ X
print(f"cond(X'WX): {np.linalg.cond(XtWX):.2e}")
print(f"min(w_i): {w_hat.min():.4e}")
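The row-scaling expression above, X * w_hat[:, None], does the work of X'WX without ever building the n×n diagonal weight matrix. A quick self-contained check on synthetic data (nothing here comes from the fit above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
w = rng.uniform(0.01, 0.25, size=n)  # logistic weights mu*(1-mu) lie in (0, 0.25]

# Broadcasting: scale row i of X by w[i], only O(n*p) extra memory
XtWX_fast = (X * w[:, None]).T @ X

# Naive: materialize the full n×n diagonal matrix
XtWX_slow = X.T @ np.diag(w) @ X

print(np.allclose(XtWX_fast, XtWX_slow))  # True
```

Both compute sum_i w_i x_i x_i', element for element; only the memory footprint differs.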
What statsmodels does internally (manual IRLS)
from scipy.special import expit # numerically stable sigmoid
def irls_logistic(X, y, beta_init, max_iter=25, tol=1e-10):
    """Manual IRLS matching the simulation step-by-step."""
    beta = beta_init.copy()
    history = []
    for t in range(max_iter):
        # --- All vectorized: no loops over observations ---
        eta = X @ beta           # (n,) linear predictor
        mu = expit(eta)          # (n,) stable sigmoid, not 1/(1+exp(-x))
        w = mu * (1 - mu)        # (n,) weights = b''(eta)
        z = eta + (y - mu) / w   # (n,) working response
        # --- Broadcasting: X * w[:, None] avoids the n×n diag(w) ---
        Xw = X * w[:, None]      # (n, p) row-scaled design matrix
        XtWX = Xw.T @ X          # (p, p) Fisher information
        XtWz = Xw.T @ z          # (p,) weighted RHS
        # --- SOLVE, never invert ---
        # statsmodels solves weighted least squares on sqrt(W)*X (pinv/QR)
        # instead of forming X'WX, for better numerical stability;
        # here we use the normal equations
        beta_new = np.linalg.solve(XtWX, XtWz)
        # --- Monitor stability (what the JS hides) ---
        cond = np.linalg.cond(XtWX)
        score = X.T @ (y - mu)
        grad_norm = np.linalg.norm(score)
        history.append({
            'beta': beta.copy(),
            'll': np.sum(y * np.log(mu + 1e-15) + (1 - y) * np.log(1 - mu + 1e-15)),
            'grad_norm': grad_norm, 'cond': cond, 'min_w': w.min(),
        })
        if grad_norm < tol:
            beta = beta_new
            break
        if cond > 1e12:
            print(f"⚠ Iter {t}: cond(X'WX)={cond:.1e}, min(w)={w.min():.1e} — separation?")
        beta = beta_new
    return beta, history
# Verify: manual matches statsmodels
beta_manual, hist = irls_logistic(X, y, np.zeros(2))
print(f"statsmodels: {result.params}")
print(f"manual IRLS: {beta_manual}")
print(f"max |diff|: {np.max(np.abs(result.params - beta_manual)):.2e}")
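The comment in irls_logistic about solving on sqrt(W)*X deserves a concrete illustration: forming X'WX squares the condition number of sqrt(W)X, while solving the weighted least-squares problem directly avoids that. A sketch of one step done both ways, on synthetic stand-ins for the IRLS quantities (the variable names are illustrative, not statsmodels internals):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
w = rng.uniform(0.05, 0.25, size=n)  # stand-in IRLS weights
z = rng.normal(size=n)               # stand-in working response

# Route 1: normal equations (condition number of X'WX is cond(sqrt(W)X)^2)
Xw = X * w[:, None]
beta_ne = np.linalg.solve(Xw.T @ X, Xw.T @ z)

# Route 2: least squares on the square-root-weighted system,
# which is what a QR- or SVD-based WLS solver effectively does
sw = np.sqrt(w)
beta_ls, *_ = np.linalg.lstsq(sw[:, None] * X, sw * z, rcond=None)

print(np.allclose(beta_ne, beta_ls))  # True
```

On well-conditioned data the two routes agree to machine precision; the difference only matters when sqrt(W)X is nearly rank-deficient, exactly the separation regime monitored above.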
Three anti-patterns the JS demo commits (and why):
1. Explicit inversion —
inv2(M) computes (X'WX)⁻¹ directly. In production, statsmodels instead solves
weighted least squares on sqrt(W) @ X (via pinv or QR), which avoids forming
X'WX entirely. The JS gets away with Cramer's formula only because p = 2.
2. Scalar loops —
The JS accumulates the Hessian element-by-element. In NumPy it is one
expression: (X * w[:, None]).T @ X. The broadcasting trick X * w[:, None]
scales each row by its weight without materializing the n×n diagonal matrix.
3. Silent weight thresholding —
The JS floors w_i at 10⁻¹² to prevent NaN. Don't threshold silently —
monitor np.linalg.cond(XtWX) and switch to penalized likelihood or a prior
when the condition number explodes. statsmodels raises a
PerfectSeparationError (a warning in newer versions) when it detects this.
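To see anti-pattern 3 bite, run unguarded IRLS on a perfectly separated toy dataset (a deliberate construction, not the simulation above): the MLE does not exist, so the coefficients diverge and the weights collapse toward zero instead of the iteration converging. The stopping rule below is the kind of explicit monitoring the JS demo skips:

```python
import numpy as np
from scipy.special import expit

# Perfectly separated toy data: y = 1 exactly when x > 0
rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = (x > 0).astype(float)

beta = np.zeros(2)
w_min_trace = []
for t in range(50):
    eta = X @ beta
    mu = expit(eta)
    w = mu * (1 - mu)
    w_min_trace.append(w.min())
    if w.min() < 1e-8:  # explicit guard instead of a silent floor
        print(f"stopping at iter {t}: min(w)={w.min():.1e}, "
              f"cond(X'WX)={np.linalg.cond((X * w[:, None]).T @ X):.1e}")
        break
    z = eta + (y - mu) / w
    Xw = X * w[:, None]
    beta = np.linalg.solve(Xw.T @ X, Xw.T @ z)
print(f"|beta| at stop: {np.linalg.norm(beta):.1f}")
```

Adding a ridge term to the solve, e.g. np.linalg.solve(XtWX + lam * np.eye(p), XtWz), keeps the step finite; that is one concrete form of the penalized-likelihood escape hatch, alongside Firth's correction or a Bayesian prior.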