Logistic Regression and the GLM Framework


Intuition

Linear regression assumes Y is continuous and roughly Normal. When Y is binary (admitted/rejected, saved/scored, recovered/not), OLS shatters: predictions leak outside [0, 1]; residuals can take only two values; variance is largest at p = 0.5 and smallest at the extremes; the true relationship between predictor and probability is S-shaped, not linear. The fix isn't to abandon regression — it's to make it model the *log-odds* of the outcome instead of the outcome directly. That's logistic regression. And logistic regression is just one member of a larger family — Generalised Linear Models (GLMs) — that handles any non-Normal Y by combining three parts: a distribution, a linear predictor, and a *link function* that bridges them.

Explanation

Why OLS fails for binary Y — four problems. (1) Predictions out of bounds: fitted values ŷ can be < 0 or > 1, but a probability can't. (2) Non-Normal residuals: for binary Y, each residual can take only two values, 1 − p̂ᵢ or 0 − p̂ᵢ. (3) Heteroscedasticity by construction: Var(Y) = p(1 − p), maximised at p = 0.5, near zero at the extremes — variance changes with the mean. (4) Wrong shape: the relationship between X and probability is *S-shaped*, not linear: pushing from p = 0.85 to 0.93 (extreme) vs from 0.30 to 0.55 (middle) requires very different changes in X.

The Generalised Linear Model framework. Three components: (1) Random component — the distribution of Y (Normal, Bernoulli, Poisson, Gamma, multinomial); (2) Systematic component — the linear predictor η = β₀ + β₁X₁ + … + βₖXₖ; (3) Link function g — connects E[Y] to η via g(E[Y]) = η, equivalently E[Y] = g⁻¹(η).

The GLM zoo. Continuous Y → Normal + identity link = ordinary linear regression. Binary Y → Bernoulli + logit link = logistic regression. Count Y → Poisson + log link = Poisson regression. Proportion Y → Binomial + logit = logistic for proportions. Skewed positive Y → Gamma + log link. Categorical > 2 levels → multinomial. Ordered Y → ordinal (proportional odds). One framework. Different combinations of distribution and link.
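
A minimal R sketch of how these combinations map onto glm()'s family argument (simulated data and variable names are illustrative, not from the course materials):

    set.seed(1)
    d <- data.frame(x = rnorm(100))
    d$y_bin   <- rbinom(100, 1, plogis(0.5 * d$x))   # binary outcome
    d$y_count <- rpois(100, exp(0.3 * d$x))          # count outcome

    glm(y_bin ~ x,   data = d, family = binomial(link = "logit"))  # Bernoulli/binomial + logit
    glm(y_count ~ x, data = d, family = poisson(link = "log"))     # Poisson + log
    # Normal + identity (= OLS):        family = gaussian(link = "identity")
    # Skewed positive Y, Gamma + log:   family = Gamma(link = "log")
    # Multinomial / ordinal outcomes need nnet::multinom() / MASS::polr() rather than glm()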

Why the logit link. Probability p lives in (0, 1). The linear predictor η lives on (−∞, +∞). We need a transformation that maps (0, 1) to (−∞, +∞). Two steps. Step 1 — odds: p/(1 − p) maps (0, 1) → (0, ∞). Step 2 — log: maps (0, ∞) → (−∞, ∞). The composition, logit(p) = ln(p/(1 − p)), is the logit — perfectly matched to the linear predictor's scale.

Logistic regression model. logit(p) = ln(p/(1 − p)) = β₀ + β₁X₁ + … + βₖXₖ = η. Equivalently, solving for p: p = e^η/(1 + e^η) = 1/(1 + e^−η). The logistic / sigmoid S-curve: flat at the bottom (η ≪ 0), steepest at η = 0 (p = 0.5), flat at the top (η ≫ 0).
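
A quick numeric check of the logit/sigmoid pair using base R's built-ins (qlogis() is the logit, plogis() the sigmoid):

    p   <- c(0.1, 0.5, 0.9)
    eta <- qlogis(p)          # logit: log(p / (1 - p))  ->  -2.197  0.000  2.197
    plogis(eta)               # sigmoid: 1 / (1 + exp(-eta))  ->  recovers 0.1 0.5 0.9
    log(p / (1 - p))          # identical to qlogis(p)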

Coefficient interpretation 1 — log-odds units (raw β). A one-unit increase in Xⱼ changes the log-odds of Y by βⱼ, holding other predictors constant. Mathematically clean. Not intuitive — readers don't think in log-odds.

**Coefficient interpretation 2 — odds ratio (e^βⱼ).** A one-unit increase in Xⱼ multiplies the odds of Y by e^βⱼ. This is the standard reporting format. Bands: e^β > 1 → odds increase; e^β = 1 → no effect; e^β < 1 → odds decrease. Examples: β = 0.7 → e^0.7 ≈ 2.0 → odds double per unit X. β = −0.7 → e^−0.7 ≈ 0.5 → odds halve. β = 0.05 → e^0.05 ≈ 1.05 → odds rise by 5% per unit.

OR ≠ probability ratio (and ≠ risk ratio). Odds ratio = 2 doesn't mean 'twice as likely to occur.' It means the odds change from p/(1 − p) to 2p/(1 − p) — which only approximates 'doubled probability' when p is small. For p = 0.1: OR = 2 → new p ≈ 0.18, not 0.20. Always state 'odds ratio,' never 'risk ratio,' for logistic output.
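
A short check of the p = 0.1 example in R: doubling the odds does not double the probability.

    p        <- 0.1
    odds     <- p / (1 - p)           # 0.111
    new_odds <- 2 * odds              # an odds ratio of 2 doubles the odds...
    new_odds / (1 + new_odds)         # ...but the probability becomes ~0.18, not 0.20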

MLE — not OLS. Logistic regression can't minimise squared residuals (the residuals are neither Normal nor homoscedastic). It maximises the likelihood of the observed data: choose β so that the model assigns the highest joint probability to what we actually saw. Likelihood: L(β) = ∏ᵢ pᵢ^yᵢ (1 − pᵢ)^(1 − yᵢ). Take logs for numerical stability: ℓ(β) = Σᵢ [yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ)]. Maximise via iterative numerical methods (Newton-Raphson / Fisher scoring / IRLS).
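
A sketch of what this means computationally: hand-code the Bernoulli negative log-likelihood and let a generic optimiser maximise it; on simulated data it lands on essentially the same estimates as glm() (illustrative code, not the course's):

    set.seed(42)
    x <- rnorm(200)
    y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))

    negloglik <- function(beta) {
      p <- plogis(beta[1] + beta[2] * x)            # fitted probabilities
      -sum(y * log(p) + (1 - y) * log(1 - p))       # negative Bernoulli log-likelihood
    }
    optim(c(0, 0), negloglik)$par                   # roughly (-0.5, 1.2)
    coef(glm(y ~ x, family = binomial))             # glm()'s MLE agrees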

R syntax. glm(admit ~ gre + gpa + factor(rank), data = d, family = binomial(link = "logit")). The family argument flips you to logistic; for Poisson use family = poisson(link = "log"). summary(model) gives coefficients with z-tests. exp(coef(model)) gives odds ratios. exp(confint(model)) gives CIs for odds ratios.
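
A fuller usage sketch on a simulated admissions-style data frame (column names gre, gpa, rank, admit are assumed for illustration):

    set.seed(7)
    d <- data.frame(gre  = rnorm(400, mean = 580, sd = 110),
                    gpa  = rnorm(400, mean = 3.4, sd = 0.4),
                    rank = factor(sample(1:4, 400, replace = TRUE)))
    d$admit <- rbinom(400, 1, plogis(-12 + 0.01 * d$gre + 1.6 * d$gpa))

    m <- glm(admit ~ gre + gpa + rank, data = d, family = binomial(link = "logit"))
    summary(m)          # coefficients on the log-odds scale, Wald z-tests
    exp(coef(m))        # odds ratios
    exp(confint(m))     # 95% CIs for the odds ratios (profile likelihood)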

Assumptions of logistic regression. (1) Binary (or binomial proportion) response. (2) Independence of observations. (3) Linearity in the log-odds — not in probability — between each X and logit(p). Polynomial / interaction terms allowed if log-odds is non-linear. (4) No severe multicollinearity — VIF same as OLS. (5) Large sample: rule of thumb 10 events (Y = 1) per predictor — applies to the rarer class. Not required: normality of Y, normality of residuals, homoscedasticity (handled implicitly by the Bernoulli distribution).

Categorical predictors and interactions. Same as OLS: dummies for k-level categorical. Interactions = cross-product terms; interpretation on the log-odds scale ('a 1-unit GPA increase changes log-odds by β_GPA + β_GPA:rank · rank_value').

Goalkeeper example (worked). 24 penalties when the team is behind: 2 saved, 22 scored → odds(save) = 2/22 ≈ 0.091. Comparable 20 penalties when not behind: 6 saved, 14 scored → odds(save) = 6/14 ≈ 0.429. Logistic model: log(odds_save) = β₀ + β₁X with X = 0/1 for not-behind/behind. Then e^β₁ = 0.091/0.429 ≈ 0.21 — being behind multiplies the save-odds by 0.21. Flip it: 1/0.21 ≈ 4.7 → roughly five times the odds of being *scored* on when the team is behind. (The slide deck gives ~3 for a sample with different counts; the key is interpreting the ratio.)
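
The same 2×2 table can be fed to glm() in aggregated successes/failures form; the fitted e^β₁ recovers the 0.21 odds ratio (a sketch, not the slide-deck pipeline):

    saves  <- c(6, 2)       # saved:  not behind, behind
    misses <- c(14, 22)     # scored: not behind, behind
    behind <- c(0, 1)

    gk <- glm(cbind(saves, misses) ~ behind, family = binomial)
    exp(coef(gk))                   # (Intercept) ~ 0.429 = 6/14;  behind ~ 0.21
    1 / exp(coef(gk))["behind"]     # ~ 4.7: odds of conceding, behind vs not behind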

Admission example (multivariable). logit(p_admit) = β₀ + β₁·GRE + β₂·GPA + β₃·rank. Each e^βⱼ is an odds ratio per unit increase in its Xⱼ. Predicted probability for an applicant: compute η, then p = 1/(1 + e^−η). Decision threshold (e.g., admit if p > 0.5) gives a classifier.
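
Continuing the simulated model m from the R-syntax sketch above, a predicted probability and a threshold classifier look like this (new_app is a made-up applicant):

    new_app <- data.frame(gre = 640, gpa = 3.7, rank = factor(2, levels = 1:4))
    predict(m, newdata = new_app, type = "link")                # eta, the linear predictor
    p_hat <- predict(m, newdata = new_app, type = "response")   # p = 1 / (1 + exp(-eta))
    ifelse(p_hat > 0.5, "admit", "reject")                      # simple 0.5-threshold classifier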

Model fit metrics. Deviance — analog of SS_res; smaller is better. AIC = −2ℓ + 2k. BIC = −2ℓ + k·ln(n). Pseudo R² (McFadden: 1 − ℓ_model/ℓ_null; also Nagelkerke, Cox-Snell) — not directly comparable to OLS R². Hosmer-Lemeshow goodness-of-fit test for calibration.
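
These metrics can be pulled straight from a fitted glm object; again continuing the simulated model m:

    deviance(m)                        # residual deviance -- GLM analog of SS_res
    AIC(m); BIC(m)                     # information criteria (lower is better)
    null_m <- update(m, . ~ 1)         # intercept-only model
    1 - as.numeric(logLik(m)) / as.numeric(logLik(null_m))   # McFadden pseudo R^2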

Classification metrics. Once predicted p's exist, choose threshold (default 0.5) and classify. Confusion matrix: TP, FP, TN, FN. Accuracy = (TP + TN) / N. Precision = TP / (TP + FP). Recall (sensitivity) = TP / (TP + FN). Specificity = TN / (TN + FP). F1 = 2·P·R/(P + R). ROC curve plots sensitivity vs (1 − specificity) across thresholds. AUC = area under ROC = probability that a random positive ranks above a random negative; 0.5 = chance, 1.0 = perfect.
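
A base-R sketch of the confusion-matrix metrics and a rank-based AUC, continuing the simulated model m (no extra packages needed):

    p_hat <- predict(m, type = "response")
    pred  <- as.integer(p_hat > 0.5)                 # classify at the default 0.5 threshold
    table(actual = d$admit, predicted = pred)        # confusion matrix

    TP <- sum(pred == 1 & d$admit == 1); FP <- sum(pred == 1 & d$admit == 0)
    TN <- sum(pred == 0 & d$admit == 0); FN <- sum(pred == 0 & d$admit == 1)
    c(accuracy  = (TP + TN) / length(pred),
      precision = TP / (TP + FP),
      recall    = TP / (TP + FN),
      f1        = 2 * TP / (2 * TP + FP + FN))

    # AUC via its ranking definition: P(random positive scores above a random negative)
    pos <- p_hat[d$admit == 1]; neg <- p_hat[d$admit == 0]
    mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))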

Nested model comparison. Use the likelihood ratio test (LRT) instead of the F-test: G² = 2(ℓ_full − ℓ_reduced) = deviance_reduced − deviance_full. Under H₀ (reduced model adequate), G² ~ χ² with df equal to the difference in number of parameters. In R: anova(model1, model2, test = "Chisq"). Small p → larger model significantly better.
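
The statistic can be computed by hand from the log-likelihoods and should agree with anova(..., test = "Chisq"); a sketch with nested models on the simulated data d:

    m1 <- glm(admit ~ gre + gpa,        data = d, family = binomial)
    m2 <- glm(admit ~ gre + gpa + rank, data = d, family = binomial)

    G2 <- as.numeric(2 * (logLik(m2) - logLik(m1)))   # = deviance(m1) - deviance(m2)
    pchisq(G2, df = df.residual(m1) - df.residual(m2), lower.tail = FALSE)
    anova(m1, m2, test = "Chisq")                     # same LRT, done for you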

Other GLMs you should know. Poisson regression: count Y; log link; β = log rate ratio; assumes Var(Y) = E[Y] (equidispersion). Negative binomial: for over-dispersed counts. Multinomial logistic: k > 2 unordered categories; one logit per non-reference category vs reference. Ordinal logistic (proportional odds): ordered Y; cumulative logits with a single slope assumption across categories.
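
A Poisson sketch with an over-dispersion check and the negative-binomial fallback (simulated, deliberately over-dispersed counts; variable names are made up):

    set.seed(3)
    pubs <- data.frame(mentors = rpois(150, 2))
    pubs$papers <- rnbinom(150, mu = exp(0.2 + 0.3 * pubs$mentors), size = 1.2)  # over-dispersed

    mp <- glm(papers ~ mentors, data = pubs, family = poisson(link = "log"))
    exp(coef(mp))                                               # rate ratios
    sum(residuals(mp, type = "pearson")^2) / df.residual(mp)    # >> 1 flags over-dispersion
    # MASS::glm.nb(papers ~ mentors, data = pubs)               # negative-binomial fallback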

Wrapping the family. Linear regression generalises t-tests, ANOVA, correlation. GLMs generalise linear regression. Logistic + Poisson + multinomial + Gamma all live under one roof. The exam will probe whether you can identify which member of the family fits a given problem and interpret coefficients on the right scale.

Definitions

  • GLM (Generalised Linear Model): framework with three components — the distribution of Y, the linear predictor η = Xβ, and the link function g(E[Y]) = η. Encompasses OLS, logistic, Poisson, etc.
  • Random component: the assumed distribution of Y in a GLM (Normal, Bernoulli, Poisson, Gamma, multinomial).
  • Systematic component: the linear predictor η = β₀ + β₁X₁ + … + βₖXₖ. Identical in structure to OLS.
  • Link function (g): maps E[Y] to η. Identity for OLS, logit for logistic, log for Poisson.
  • Logit function: logit(p) = ln(p/(1 − p)). Maps p ∈ (0, 1) to η ∈ (−∞, ∞). The canonical link for the binomial.
  • Logistic function (sigmoid): p = 1/(1 + e^−η). Inverse of the logit. Maps η to p ∈ (0, 1) via the S-curve.
  • Odds: p/(1 − p). Ratio of the probability of the event to the probability of the non-event.
  • Log-odds (logit): logarithm of the odds. Lives on (−∞, +∞).
  • Odds ratio (OR): e^βⱼ. Multiplicative change in the odds per unit increase in Xⱼ. Standard reporting format.
  • Maximum Likelihood Estimation (MLE): estimate β by maximising the likelihood of the observed data. Standard for all GLMs. Fit numerically via Newton-Raphson / IRLS.
  • Deviance: GLM analog of SS_res. Smaller = better fit. Used in likelihood-ratio tests.
  • Likelihood Ratio Test (LRT): compares nested GLMs via G² = 2(ℓ_full − ℓ_reduced) ~ χ². Replaces the F-test of OLS.
  • AIC / BIC: information criteria for non-nested comparison. AIC = −2ℓ + 2k; BIC replaces 2k with the harsher penalty k·ln(n). Lower is better.
  • McFadden pseudo R²: 1 − ℓ_model/ℓ_null. Logistic analog of R². Bands very different — 0.2 = excellent.
  • Confusion matrix: 2×2 table of predicted vs actual class. TP, FP, TN, FN — the basis of accuracy, precision, recall, F1.
  • Precision: TP / (TP + FP). Of those predicted positive, how many actually are.
  • Recall (sensitivity): TP / (TP + FN). Of the actual positives, how many are caught.
  • ROC curve: sensitivity vs 1 − specificity across decision thresholds. Diagonal = chance.
  • AUC: area under the ROC curve. Probability that a random positive ranks above a random negative. Threshold-independent.
  • Perfect separation: a predictor / combination that perfectly classifies the outcome. The MLE diverges → infinite β. Use Firth's correction.
  • Poisson regression: GLM for count data. Log link. e^β = rate ratio. Assumes Var = Mean.
  • Multinomial logistic: GLM for unordered categorical Y with > 2 levels. One logit per non-reference category vs the reference.
  • Ordinal logistic (proportional odds): GLM for ordered Y. Cumulative logits with a single slope assumption.

Formulas

  • Odds: p/(1 − p).
  • Logit (link): logit(p) = ln(p/(1 − p)) = η = β₀ + β₁X₁ + … + βₖXₖ.
  • Sigmoid (inverse link): p = e^η/(1 + e^η) = 1/(1 + e^−η).
  • Odds ratio: ORⱼ = e^βⱼ.
  • Bernoulli variance: Var(Y) = p(1 − p).
  • Log-likelihood: ℓ(β) = Σᵢ [yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ)].
  • Likelihood ratio statistic: G² = 2(ℓ_full − ℓ_reduced) ~ χ² with df = difference in parameters.
  • AIC = −2ℓ + 2k; BIC = −2ℓ + k·ln(n).
  • McFadden pseudo R² = 1 − ℓ_model/ℓ_null.

Derivations

From logit to probability — the sigmoid. Start with ln(p/(1 − p)) = η. Exponentiate: p/(1 − p) = e^η. Multiply: p = e^η(1 − p). Move p-terms: p(1 + e^η) = e^η. Divide: p = e^η/(1 + e^η) = 1/(1 + e^−η). The S-curve. QED.

**Why OR = e^β.** Take two values of X differing by 1: x and x + 1. Log-odds change: [β₀ + β(x + 1)] − [β₀ + βx] = β. So odds(x + 1)/odds(x) = e^β. By definition that ratio *is* the odds ratio. QED.

Why the OLS variance assumption breaks. For binary Y with success probability p, Var(Y) = p(1 − p). As p varies with X (via the model), Var(Y) varies too — maximised at p = 0.5 (Var = 0.25), shrinking toward 0 at the extremes. Homoscedasticity is *false by construction*.

Why MLE recovers β. Likelihood L(β) = ∏ᵢ pᵢ^yᵢ (1 − pᵢ)^(1 − yᵢ). Log-likelihood ℓ(β) = Σᵢ [yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ)]. Substituting pᵢ = 1/(1 + e^−xᵢᵀβ) and differentiating: ∂ℓ/∂β = Σᵢ (yᵢ − pᵢ)xᵢ = Xᵀ(y − p). No closed form — solved iteratively (Newton-Raphson uses the Hessian; IRLS reweights each iteration).

Likelihood ratio test asymptotics. Under H₀ (smaller model true), G² = 2(ℓ_full − ℓ_reduced) ~ χ²_Δdf by Wilks's theorem. Difference in number of parameters = degrees of freedom. This is the GLM analog of the F-test for nested OLS models.

Examples

  • β interpretation. β_GRE = 0.005 → for each extra point of GRE, the log-odds of admission rise by 0.005; the odds are multiplied by e^0.005 ≈ 1.005 — a 0.5% odds increase per GRE point.
  • β = 0.7 → OR ≈ 2.01. Odds approximately double per unit predictor increase. β = −0.4 → OR ≈ 0.67, odds shrink by ~33%.
  • Predicted probability calculation. β₀ = −5, β_GPA = 1.5; applicant has GPA = 3.5. η = −5 + 1.5·3.5 = 0.25. p = 1/(1 + e⁻⁰·²⁵) ≈ 0.562. About 56% chance of admission.
  • Goalkeeper / penalty data. Team behind: 2/24 saved → odds(save | behind) = 2/22 ≈ 0.091. Team not behind: 6/20 saved → odds(save | ¬behind) = 6/14 ≈ 0.429. OR(save | behind vs ¬behind) = 0.091/0.429 ≈ 0.21 → being behind cuts save-odds to ~21%. Equivalently, OR(score | behind vs ¬behind) = 22/2 ÷ 14/6 = 11/2.33 ≈ 4.7 → the odds of conceding are ~5× higher when the team is behind.
  • Admission with interaction. glm(admit ~ gre + gpa * rank, ...). The gpa:rank term tells you how the GPA slope (in log-odds) shifts across rank levels. If β_gpa:rank > 0 (with rank coded so larger numbers mean lower-ranked schools), the GPA boost is *stronger* for lower-ranked schools.
  • Pseudo R² caution. A logistic model with McFadden pseudo R² = 0.20 is considered an *excellent* fit. Don't compare to OLS R² (which would call 0.20 modest).
  • Class imbalance. If 5% of patients have the disease and your model predicts 'no disease' for everyone, accuracy = 95% — but the model is useless. Switch to AUC (0.5 → chance), F1, or balanced accuracy. Threshold calibration matters.
  • Likelihood ratio test in R. Two nested models: model1 (gre + gpa), model2 (gre + gpa + rank). anova(model1, model2, test="Chisq") returns the deviance difference G² and a chi-square p. If significant, rank improves the model.
  • Poisson regression sketch. Number of citations ~ Poisson; log link: log(E[Y]) = β₀ + β₁X₁ + … . e^βⱼ = rate ratio. Check for over-dispersion (Var ≫ Mean) → may need negative binomial.

Diagrams

  • Logistic S-curve: p vs η. Flat at extremes; steepest slope at η = 0, p = 0.5. Annotate the asymptotes at 0 and 1 — bounded.
  • Why OLS fails: a scatter of binary Y (0/1) with a fitted OLS line crossing 0 and 1 → impossible predicted probabilities.
  • GLM three-part diagram: 'Random component (Y's distribution)' → 'Link function g' → 'Systematic component (η = Xβ)'. Examples filled in for Normal/identity, binomial/logit, Poisson/log.
  • Variance of binary Y: Var(Y) = p(1−p), a downward-opening parabola peaking at p = 0.5.
  • ROC curve: sensitivity on y, 1 − specificity on x, diagonal = chance, AUC = area under the curve.
  • Sigmoid for varying β₁: steeper curves for larger |β₁|, slope at η = 0 equals β₁/4 in probability units.

Edge cases

  • Perfect / quasi-complete separation: a predictor (or linear combination) perfectly classifies the outcome → MLE diverges to ±∞. Detection: huge SEs, β estimates of ±10 or more. Fix: penalised regression (Firth's correction), Bayesian regularised priors, or drop the offending predictor. See the sketch after this list.
  • Rare-event bias: with very few events (Y = 1), MLE underestimates probabilities. Use Firth or exact logistic regression.
  • Sample-size rule: at least 10 events per predictor (the *rarer* class drives this) — and 20 is safer.
  • Non-linear log-odds: check by binning a continuous predictor and plotting empirical logit. If non-linear, add polynomial / spline terms.
  • Class imbalance distorts threshold-dependent metrics (accuracy) but not coefficient estimation. Use AUC, F1, or set the decision threshold to maximise the metric you actually care about.
  • Over-dispersion in Poisson regression (Var ≫ Mean) → use negative binomial GLM or quasi-Poisson.
  • Repeated measures in a binary outcome → fit a mixed-effects logistic regression (GLMM) instead of plain glm.
  • Correlated observations (clusters) → cluster-robust SEs or GEE.
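
As flagged in the separation bullet above, a tiny sketch that provokes quasi-complete separation so the symptoms are visible (base R only; the logistf call is an optional illustration):

    x <- 1:6
    y <- c(0, 0, 0, 1, 1, 1)                  # x >= 4 predicts y = 1 perfectly
    sep <- glm(y ~ x, family = binomial)      # warns: fitted probabilities numerically 0 or 1
    summary(sep)$coefficients                 # |beta| and its SE blow up -- the MLE is diverging
    # One remedy: Firth-penalised logistic regression, e.g. logistf::logistf(y ~ x)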

Common mistakes

  • Interpreting β as change in *probability*. β changes log-odds; e^β changes odds; neither directly equals a probability change.
  • Calling the odds ratio a 'risk ratio' or 'probability ratio'. They're only approximately equal when p is small. For common outcomes, OR overstates the ratio of probabilities.
  • Applying OLS assumptions (Normality of residuals, homoscedasticity) to logistic. Not applicable.
  • Using R² for a logistic model — OLS R² is meaningless. Use McFadden / Nagelkerke pseudo R² and don't compare across model families.
  • Reporting only accuracy on imbalanced data — use AUC, F1, recall/precision.
  • Comparing nested logistic models by F-test — use the likelihood-ratio test (chi-square) instead.
  • Forgetting that 'no multicollinearity' still applies — VIF is just as needed here as in OLS.
  • Not exponentiating: reporting log-odds coefficients in papers without converting to odds ratios. Readers expect ORs.
  • Treating perfect separation as a great fit — it's a numerical pathology requiring intervention.
  • Using logistic regression on a continuous Y dichotomised at the median — you lose information; fit OLS on the original Y instead.

Shortcuts

  • OLS fails on binary Y for four reasons: bounds, non-Normal residuals, heteroscedasticity, S-shape.
  • GLM = random + systematic + link. Linear is GLM with Normal + identity.
  • Logit: logit(p) = ln(p/(1 − p)). Sigmoid: p = 1/(1 + e^−η).
  • **OR = e^β.** Always report odds ratios + 95% CIs.
  • Estimation = MLE, fit by glm() with family = binomial.
  • Assumptions: binary Y, independence, linearity *in log-odds*, no severe multicollinearity, ≥ 10 events/predictor.
  • Nested comparison: LRT (chi-square), not F-test.
  • Pseudo R² ≠ OLS R² — different scale, McFadden 0.2 = strong fit.
  • ROC + AUC for threshold-independent classifier evaluation.
  • Poisson regression for counts (log link); negative binomial for over-dispersion.

Proofs / Algorithms

Logit ↔ logistic invertibility. Define g(p) = ln(p/(1 − p)) for p ∈ (0, 1). The logit is continuous and strictly increasing on (0, 1) with range (−∞, ∞) — hence a bijection. Inverting η = ln(p/(1 − p)) gives p = e^η/(1 + e^η) = 1/(1 + e^−η). QED.

MLE score equation for logistic regression. Log-likelihood ℓ(β) = Σᵢ [yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ)] where pᵢ = 1/(1 + e^−xᵢᵀβ). Compute ∂pᵢ/∂β = pᵢ(1 − pᵢ)xᵢ. Then ∂ℓ/∂β = Σᵢ (yᵢ − pᵢ)xᵢ = Xᵀ(y − p). Setting this to zero gives the score equation — no closed form, solved iteratively. The Hessian is H = −XᵀWX where W is diagonal with Wᵢᵢ = pᵢ(1 − pᵢ) → negative-definite → strict concavity → unique global maximum (when X is full rank and there is no separation). QED.
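
A compact Newton-Raphson / IRLS sketch in R implementing this score and Hessian; a hand-rolled illustration rather than glm()'s actual internals, but it converges to the same estimates:

    irls_logit <- function(X, y, tol = 1e-8, max_iter = 25) {
      beta <- rep(0, ncol(X))
      for (i in seq_len(max_iter)) {
        p <- plogis(X %*% beta)                             # current fitted probabilities
        W <- diag(as.vector(p * (1 - p)))                   # weights p_i (1 - p_i)
        step <- solve(t(X) %*% W %*% X, t(X) %*% (y - p))   # Newton step: (X'WX)^{-1} X'(y - p)
        beta <- beta + step
        if (max(abs(step)) < tol) break
      }
      drop(beta)
    }

    set.seed(9)
    x <- rnorm(300); y <- rbinom(300, 1, plogis(0.3 + 0.8 * x))
    X <- cbind(1, x)
    irls_logit(X, y)                     # roughly (0.3, 0.8)
    coef(glm(y ~ x, family = binomial))  # same answer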