Logistic Regression and the GLM Framework


Intuition

Linear regression assumes Y is continuous and roughly Normal. When Y is binary (admitted/rejected, saved/scored, recovered/not), OLS shatters: predictions leak outside [0, 1]; residuals can take only two values; variance is largest at p = 0.5 and smallest at the extremes; the true relationship between predictor and probability is S-shaped, not linear. The fix isn't to abandon regression — it's to make it model the *log-odds* of the outcome instead of the outcome directly. That's logistic regression. And logistic regression is just one member of a larger family — Generalised Linear Models (GLMs) — that handles any non-Normal Y by combining three parts: a distribution, a linear predictor, and a *link function* that bridges them.

Explanation

Why OLS fails for binary Y — four problems. (1) Predictions out of bounds: fitted values ŷ can be < 0 or > 1, but a probability can't. (2) Non-Normal residuals: for binary Y, each residual can take only two values, 1 − p̂ᵢ or 0 − p̂ᵢ. (3) Heteroscedasticity by construction: Var(Y) = p(1 − p), maximised at p = 0.5, near zero at the extremes — variance changes with the mean. (4) Wrong shape: the relationship between X and probability is *S-shaped*, not linear: pushing from p = 0.85 to 0.93 (extreme) vs from 0.30 to 0.55 (middle) requires very different changes in X.

The Generalised Linear Model framework. Three components: (1) Random component — the distribution of Y (Normal, Bernoulli, Poisson, Gamma, multinomial); (2) Systematic component — the linear predictor η = β₀ + β₁X₁ + … + βₖXₖ; (3) Link function g — connects E[Y] to η via g(E[Y]) = η, equivalently E[Y] = g⁻¹(η).

The GLM zoo. Continuous Y → Normal + identity link = ordinary linear regression. Binary Y → Bernoulli + logit link = logistic regression. Count Y → Poisson + log link = Poisson regression. Proportion Y → Binomial + logit = logistic for proportions. Skewed positive Y → Gamma + log link. Categorical > 2 levels → multinomial. Ordered Y → ordinal (proportional odds). One framework. Different combinations of distribution and link.
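
A minimal R sketch of how these combinations map onto glm()'s family argument (simulated data and variable names are illustrative, not from the course materials):

    set.seed(1)
    d <- data.frame(x = rnorm(100))
    d$y_bin   <- rbinom(100, 1, plogis(0.5 * d$x))   # binary outcome
    d$y_count <- rpois(100, exp(0.3 * d$x))          # count outcome

    glm(y_bin ~ x,   data = d, family = binomial(link = "logit"))  # Bernoulli/binomial + logit
    glm(y_count ~ x, data = d, family = poisson(link = "log"))     # Poisson + log
    # Normal + identity (= OLS):        family = gaussian(link = "identity")
    # Skewed positive Y, Gamma + log:   family = Gamma(link = "log")
    # Multinomial / ordinal outcomes need nnet::multinom() / MASS::polr() rather than glm()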

Why the logit link. Probability p lives in (0, 1). The linear predictor η lives on (−∞, +∞). We need a transformation that maps (0, 1) to (−∞, +∞). Two steps. Step 1 — odds: p/(1 − p) maps (0, 1) → (0, ∞). Step 2 — log: maps (0, ∞) → (−∞, ∞). The composition, logit(p) = ln(p/(1 − p)), is the logit — perfectly matched to the linear predictor's scale.

Logistic regression model. logit(p) = ln(p/(1 − p)) = β₀ + β₁X₁ + … + βₖXₖ = η. Equivalently, solving for p: p = e^η/(1 + e^η) = 1/(1 + e^−η). The logistic / sigmoid S-curve: flat at the bottom (η ≪ 0), steepest at η = 0 (p = 0.5), flat at the top (η ≫ 0).
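
A quick numeric check of the logit/sigmoid pair using base R's built-ins (qlogis() is the logit, plogis() the sigmoid):

    p   <- c(0.1, 0.5, 0.9)
    eta <- qlogis(p)          # logit: log(p / (1 - p))  ->  -2.197  0.000  2.197
    plogis(eta)               # sigmoid: 1 / (1 + exp(-eta))  ->  recovers 0.1 0.5 0.9
    log(p / (1 - p))          # identical to qlogis(p)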

Coefficient interpretation 1 — log-odds units (raw β). A one-unit increase in Xⱼ changes the log-odds of Y by βⱼ, holding other predictors constant. Mathematically clean. Not intuitive — readers don't think in log-odds.

**Coefficient interpretation 2 — odds ratio (e^βⱼ).** A one-unit increase in Xⱼ multiplies the odds of Y by e^βⱼ. This is the standard reporting format. Bands: e^β > 1 → odds increase; e^β = 1 → no effect; e^β < 1 → odds decrease. Examples: β = 0.7 → e^0.7 ≈ 2.0 → odds double per unit X. β = −0.7 → e^−0.7 ≈ 0.5 → odds halve. β = 0.05 → e^0.05 ≈ 1.05 → odds rise by 5% per unit.

OR ≠ probability ratio (and ≠ risk ratio). Odds ratio = 2 doesn't mean 'twice as likely to occur.' It means the odds change from p/(1 − p) to 2p/(1 − p) — which only approximates 'doubled probability' when p is small. For p = 0.1: OR = 2 → new p ≈ 0.18, not 0.20. Always state 'odds ratio,' never 'risk ratio,' for logistic output.
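
A short check of the p = 0.1 example in R: doubling the odds does not double the probability.

    p        <- 0.1
    odds     <- p / (1 - p)           # 0.111
    new_odds <- 2 * odds              # an odds ratio of 2 doubles the odds...
    new_odds / (1 + new_odds)         # ...but the probability becomes ~0.18, not 0.20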

MLE — not OLS. Logistic regression can't minimise squared residuals (the residuals are neither Normal nor homoscedastic). It maximises the likelihood of the observed data: choose β so that the model assigns the highest joint probability to what we actually saw. Likelihood: L(β) = ∏ᵢ pᵢ^yᵢ (1 − pᵢ)^(1 − yᵢ). Take logs for numerical stability: ℓ(β) = Σᵢ [yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ)]. Maximise via iterative numerical methods (Newton-Raphson / Fisher scoring / IRLS).
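
A sketch of what this means computationally: hand-code the Bernoulli negative log-likelihood and let a generic optimiser maximise it; on simulated data it lands on essentially the same estimates as glm() (illustrative code, not the course's):

    set.seed(42)
    x <- rnorm(200)
    y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))

    negloglik <- function(beta) {
      p <- plogis(beta[1] + beta[2] * x)            # fitted probabilities
      -sum(y * log(p) + (1 - y) * log(1 - p))       # negative Bernoulli log-likelihood
    }
    optim(c(0, 0), negloglik)$par                   # roughly (-0.5, 1.2)
    coef(glm(y ~ x, family = binomial))             # glm()'s MLE agrees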

R syntax. glm(admit ~ gre + gpa + factor(rank), data = d, family = binomial(link = "logit")). The family argument flips you to logistic; for Poisson use family = poisson(link = "log"). summary(model) gives coefficients with z-tests. exp(coef(model)) gives odds ratios. exp(confint(model)) gives CIs for odds ratios.
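
A fuller usage sketch on a simulated admissions-style data frame (column names gre, gpa, rank, admit are assumed for illustration):

    set.seed(7)
    d <- data.frame(gre  = rnorm(400, mean = 580, sd = 110),
                    gpa  = rnorm(400, mean = 3.4, sd = 0.4),
                    rank = factor(sample(1:4, 400, replace = TRUE)))
    d$admit <- rbinom(400, 1, plogis(-12 + 0.01 * d$gre + 1.6 * d$gpa))

    m <- glm(admit ~ gre + gpa + rank, data = d, family = binomial(link = "logit"))
    summary(m)          # coefficients on the log-odds scale, Wald z-tests
    exp(coef(m))        # odds ratios
    exp(confint(m))     # 95% CIs for the odds ratios (profile likelihood)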

Assumptions of logistic regression. (1) Binary (or binomial proportion) response. (2) Independence of observations. (3) Linearity in the log-odds — not in probability — between each X and logit(p). Polynomial / interaction terms allowed if log-odds is non-linear. (4) No severe multicollinearity — VIF same as OLS. (5) Large sample: rule of thumb 10 events (Y = 1) per predictor — applies to the rarer class. Not required: normality of Y, normality of residuals, homoscedasticity (handled implicitly by the Bernoulli distribution).

Categorical predictors and interactions. Same as OLS: dummies for k-level categorical. Interactions = cross-product terms; interpretation on the log-odds scale ('a 1-unit GPA increase changes log-odds by β_GPA + β_GPA:rank · rank_value').

Goalkeeper example (worked). 24 penalties when the team is behind: 2 saved, 22 scored → odds(save) = 2/22 ≈ 0.091. Comparable 20 penalties when not behind: 6 saved, 14 scored → odds(save) = 6/14 ≈ 0.429. Logistic model: log(odds_save) = β₀ + β₁X with X = 0/1 for not-behind/behind. Then e^β₁ = 0.091/0.429 ≈ 0.21 — being behind multiplies the save-odds by 0.21. Flip it: 1/0.21 ≈ 4.7 → roughly five times the odds of being *scored* on when the team is behind. (The slide deck gives ~3 for a sample with different counts; the key is interpreting the ratio.)
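
The same 2×2 table can be fed to glm() in aggregated successes/failures form; the fitted e^β₁ recovers the 0.21 odds ratio (a sketch, not the slide-deck pipeline):

    saves  <- c(6, 2)       # saved:  not behind, behind
    misses <- c(14, 22)     # scored: not behind, behind
    behind <- c(0, 1)

    gk <- glm(cbind(saves, misses) ~ behind, family = binomial)
    exp(coef(gk))                   # (Intercept) ~ 0.429 = 6/14;  behind ~ 0.21
    1 / exp(coef(gk))["behind"]     # ~ 4.7: odds of conceding, behind vs not behind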

Admission example (multivariable). logit(p_admit) = β₀ + β₁·GRE + β₂·GPA + β₃·rank. Each e^βⱼ is an odds ratio per unit increase in its Xⱼ. Predicted probability for an applicant: compute η, then p = 1/(1 + e^−η). Decision threshold (e.g., admit if p > 0.5) gives a classifier.
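
Continuing the simulated model m from the R-syntax sketch above, a predicted probability and a threshold classifier look like this (new_app is a made-up applicant):

    new_app <- data.frame(gre = 640, gpa = 3.7, rank = factor(2, levels = 1:4))
    predict(m, newdata = new_app, type = "link")                # eta, the linear predictor
    p_hat <- predict(m, newdata = new_app, type = "response")   # p = 1 / (1 + exp(-eta))
    ifelse(p_hat > 0.5, "admit", "reject")                      # simple 0.5-threshold classifier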

Model fit metrics. Deviance — analog of SS_res; smaller is better. AIC = −2ℓ + 2k. BIC = −2ℓ + k·ln(n). Pseudo R² (McFadden: 1 − ℓ_model/ℓ_null; also Nagelkerke, Cox-Snell) — not directly comparable to OLS R². Hosmer-Lemeshow goodness-of-fit test for calibration.
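
These metrics can be pulled straight from a fitted glm object; again continuing the simulated model m:

    deviance(m)                        # residual deviance -- GLM analog of SS_res
    AIC(m); BIC(m)                     # information criteria (lower is better)
    null_m <- update(m, . ~ 1)         # intercept-only model
    1 - as.numeric(logLik(m)) / as.numeric(logLik(null_m))   # McFadden pseudo R^2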

Classification metrics. Once predicted p's exist, choose threshold (default 0.5) and classify. Confusion matrix: TP, FP, TN, FN. Accuracy = (TP + TN) / N. Precision = TP / (TP + FP). Recall (sensitivity) = TP / (TP + FN). Specificity = TN / (TN + FP). F1 = 2·P·R/(P + R). ROC curve plots sensitivity vs (1 − specificity) across thresholds. AUC = area under ROC = probability that a random positive ranks above a random negative; 0.5 = chance, 1.0 = perfect.
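
A base-R sketch of the confusion-matrix metrics and a rank-based AUC, continuing the simulated model m (no extra packages needed):

    p_hat <- predict(m, type = "response")
    pred  <- as.integer(p_hat > 0.5)                 # classify at the default 0.5 threshold
    table(actual = d$admit, predicted = pred)        # confusion matrix

    TP <- sum(pred == 1 & d$admit == 1); FP <- sum(pred == 1 & d$admit == 0)
    TN <- sum(pred == 0 & d$admit == 0); FN <- sum(pred == 0 & d$admit == 1)
    c(accuracy  = (TP + TN) / length(pred),
      precision = TP / (TP + FP),
      recall    = TP / (TP + FN),
      f1        = 2 * TP / (2 * TP + FP + FN))

    # AUC via its ranking definition: P(random positive scores above a random negative)
    pos <- p_hat[d$admit == 1]; neg <- p_hat[d$admit == 0]
    mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))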

Nested model comparison. Use the likelihood ratio test (LRT) instead of the F-test: G² = 2(ℓ_full − ℓ_reduced) = deviance_reduced − deviance_full. Under H₀ (reduced model adequate), G² ~ χ² with df equal to the difference in number of parameters. In R: anova(model1, model2, test = "Chisq"). Small p → larger model significantly better.
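
The statistic can be computed by hand from the log-likelihoods and should agree with anova(..., test = "Chisq"); a sketch with nested models on the simulated data d:

    m1 <- glm(admit ~ gre + gpa,        data = d, family = binomial)
    m2 <- glm(admit ~ gre + gpa + rank, data = d, family = binomial)

    G2 <- as.numeric(2 * (logLik(m2) - logLik(m1)))   # = deviance(m1) - deviance(m2)
    pchisq(G2, df = df.residual(m1) - df.residual(m2), lower.tail = FALSE)
    anova(m1, m2, test = "Chisq")                     # same LRT, done for you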

Other GLMs you should know. Poisson regression: count Y; log link; β = log rate ratio; assumes Var(Y) = E[Y] (equidispersion). Negative binomial: for over-dispersed counts. Multinomial logistic: k > 2 unordered categories; one logit per non-reference category vs reference. Ordinal logistic (proportional odds): ordered Y; cumulative logits with a single slope assumption across categories.
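
A Poisson sketch with an over-dispersion check and the negative-binomial fallback (simulated, deliberately over-dispersed counts; variable names are made up):

    set.seed(3)
    pubs <- data.frame(mentors = rpois(150, 2))
    pubs$papers <- rnbinom(150, mu = exp(0.2 + 0.3 * pubs$mentors), size = 1.2)  # over-dispersed

    mp <- glm(papers ~ mentors, data = pubs, family = poisson(link = "log"))
    exp(coef(mp))                                               # rate ratios
    sum(residuals(mp, type = "pearson")^2) / df.residual(mp)    # >> 1 flags over-dispersion
    # MASS::glm.nb(papers ~ mentors, data = pubs)               # negative-binomial fallback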

Wrapping the family. Linear regression generalises t-tests, ANOVA, correlation. GLMs generalise linear regression. Logistic + Poisson + multinomial + Gamma all live under one roof. The exam will probe whether you can identify which member of the family fits a given problem and interpret coefficients on the right scale.

Definitions

  • GLM (Generalised Linear Model): framework with three components — the distribution of Y, the linear predictor η = Xβ, and the link function g(E[Y]) = η. Encompasses OLS, logistic, Poisson, etc.
  • Random component: the assumed distribution of Y in a GLM (Normal, Bernoulli, Poisson, Gamma, multinomial).
  • Systematic component: the linear predictor η = β₀ + β₁X₁ + … + βₖXₖ. Identical in structure to OLS.
  • Link function (g): maps E[Y] to η. Identity for OLS, logit for logistic, log for Poisson.
  • Logit function: logit(p) = ln(p/(1 − p)). Maps p ∈ (0, 1) to η ∈ (−∞, ∞). The canonical link for the binomial.
  • Logistic function (sigmoid): p = 1/(1 + e^−η). Inverse of the logit. Maps η to p ∈ (0, 1) via the S-curve.
  • Odds: p/(1 − p). Ratio of the probability of the event to the probability of the non-event.
  • Log-odds (logit): logarithm of the odds. Lives on (−∞, +∞).
  • Odds ratio (OR): e^βⱼ. Multiplicative change in the odds per unit increase in Xⱼ. Standard reporting format.
  • Maximum Likelihood Estimation (MLE): estimate β by maximising the likelihood of the observed data. Standard for all GLMs. Fit numerically via Newton-Raphson / IRLS.
  • Deviance: GLM analog of SS_res. Smaller = better fit. Used in likelihood-ratio tests.
  • Likelihood Ratio Test (LRT): compares nested GLMs via G² = 2(ℓ_full − ℓ_reduced) ~ χ². Replaces the F-test of OLS.
  • AIC / BIC: information criteria for non-nested comparison. AIC = −2ℓ + 2k; BIC replaces 2k with the harsher penalty k·ln(n). Lower is better.
  • McFadden pseudo R²: 1 − ℓ_model/ℓ_null. Logistic analog of R². Bands very different — 0.2 = excellent.
  • Confusion matrix: 2×2 table of predicted vs actual class. TP, FP, TN, FN — the basis of accuracy, precision, recall, F1.
  • Precision: TP / (TP + FP). Of those predicted positive, how many actually are.
  • Recall (sensitivity): TP / (TP + FN). Of the actual positives, how many are caught.
  • ROC curve: sensitivity vs 1 − specificity across decision thresholds. Diagonal = chance.
  • AUC: area under the ROC curve. Probability that a random positive ranks above a random negative. Threshold-independent.
  • Perfect separation: a predictor / combination that perfectly classifies the outcome. The MLE diverges → infinite β. Use Firth's correction.
  • Poisson regression: GLM for count data. Log link. e^β = rate ratio. Assumes Var = Mean.
  • Multinomial logistic: GLM for unordered categorical Y with > 2 levels. One logit per non-reference category vs the reference.
  • Ordinal logistic (proportional odds): GLM for ordered Y. Cumulative logits with a single slope assumption.

Formulas

  • Odds: p/(1 − p).
  • Logit (link): logit(p) = ln(p/(1 − p)) = η = β₀ + β₁X₁ + … + βₖXₖ.
  • Sigmoid (inverse link): p = e^η/(1 + e^η) = 1/(1 + e^−η).
  • Odds ratio: ORⱼ = e^βⱼ.
  • Bernoulli variance: Var(Y) = p(1 − p).
  • Log-likelihood: ℓ(β) = Σᵢ [yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ)].
  • Likelihood ratio statistic: G² = 2(ℓ_full − ℓ_reduced) ~ χ² with df = difference in parameters.
  • AIC = −2ℓ + 2k; BIC = −2ℓ + k·ln(n).
  • McFadden pseudo R² = 1 − ℓ_model/ℓ_null.

Derivations

From logit to probability — the sigmoid. Start with ln(p/(1 − p)) = η. Exponentiate: p/(1 − p) = e^η. Multiply: p = e^η(1 − p). Move p-terms: p(1 + e^η) = e^η. Divide: p = e^η/(1 + e^η) = 1/(1 + e^−η). The S-curve. QED.

**Why OR = e^β.** Take two values of X differing by 1: x and x + 1. Log-odds change: [β₀ + β(x + 1)] − [β₀ + βx] = β. So odds(x + 1)/odds(x) = e^β. By definition that ratio *is* the odds ratio. QED.

Why the OLS variance assumption breaks. For binary Y with success probability p, Var(Y) = p(1 − p). As p varies with X (via the model), Var(Y) varies too — maximised at p = 0.5 (Var = 0.25), shrinking toward 0 at the extremes. Homoscedasticity is *false by construction*.

Why MLE recovers β. Likelihood L(β) = ∏ᵢ pᵢ^yᵢ (1 − pᵢ)^(1 − yᵢ). Log-likelihood ℓ(β) = Σᵢ [yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ)]. Substituting pᵢ = 1/(1 + e^−xᵢᵀβ) and differentiating: ∂ℓ/∂β = Σᵢ (yᵢ − pᵢ)xᵢ = Xᵀ(y − p). No closed form — solved iteratively (Newton-Raphson uses the Hessian; IRLS reweights each iteration).

Likelihood ratio test asymptotics. Under H₀ (smaller model true), G² = 2(ℓ_full − ℓ_reduced) ~ χ²_Δdf by Wilks's theorem. Difference in number of parameters = degrees of freedom. This is the GLM analog of the F-test for nested OLS models.

Examples

  • β interpretation. β_GRE = 0.005 → for each extra point of GRE, the log-odds of admission rise by 0.005; the odds are multiplied by e^0.005 ≈ 1.005 — a 0.5% odds increase per GRE point.
  • β = 0.7 → OR ≈ 2.01. Odds approximately double per unit predictor increase. β = −0.4 → OR ≈ 0.67, odds shrink by ~33%.
  • Predicted probability calculation. β₀ = −5, β_GPA = 1.5; applicant has GPA = 3.5. η = −5 + 1.5·3.5 = 0.25. p = 1/(1 + e⁻⁰·²⁵) ≈ 0.562. About 56% chance of admission.
  • Goalkeeper / penalty data. Team behind: 2/24 saved → odds(save | behind) = 2/22 ≈ 0.091. Team not behind: 6/20 saved → odds(save | ¬behind) = 6/14 ≈ 0.429. OR(save | behind vs ¬behind) = 0.091/0.429 ≈ 0.21 → being behind cuts save-odds to ~21%. Equivalently, OR(score | behind vs ¬behind) = 22/2 ÷ 14/6 = 11/2.33 ≈ 4.7 → the odds of conceding are ~5× higher when the team is behind.
  • Admission with interaction. glm(admit ~ gre + gpa * rank, ...). The gpa:rank term tells you how the GPA slope (in log-odds) shifts across rank levels. If β_gpa:rank > 0 (with rank coded so larger numbers mean lower-ranked schools), the GPA boost is *stronger* for lower-ranked schools.
  • Pseudo R² caution. A logistic model with McFadden pseudo R² = 0.20 is considered an *excellent* fit. Don't compare to OLS R² (which would call 0.20 modest).
  • Class imbalance. If 5% of patients have the disease and your model predicts 'no disease' for everyone, accuracy = 95% — but the model is useless. Switch to AUC (0.5 → chance), F1, or balanced accuracy. Threshold calibration matters.
  • Likelihood ratio test in R. Two nested models: model1 (gre + gpa), model2 (gre + gpa + rank). anova(model1, model2, test="Chisq") returns the deviance difference G² and a chi-square p. If significant, rank improves the model.
  • Poisson regression sketch. Number of citations ~ Poisson; log link: log(E[Y]) = β₀ + β₁X₁ + … . e^βⱼ = rate ratio. Check for over-dispersion (Var ≫ Mean) → may need negative binomial.

Diagrams

  • Logistic S-curve: p vs η. Flat at extremes; steepest slope at η = 0, p = 0.5. Annotate the asymptotes at 0 and 1 — bounded.
  • Why OLS fails: a scatter of binary Y (0/1) with a fitted OLS line crossing 0 and 1 → impossible predicted probabilities.
  • GLM three-part diagram: 'Random component (Y's distribution)' → 'Link function g' → 'Systematic component (η = Xβ)'. Examples filled in for Normal/identity, binomial/logit, Poisson/log.
  • Variance of binary Y: Var(Y) = p(1−p), a downward-opening parabola peaking at p = 0.5.
  • ROC curve: sensitivity on y, 1 − specificity on x, diagonal = chance, AUC = area under the curve.
  • Sigmoid for varying β₁: steeper curves for larger |β₁|, slope at η = 0 equals β₁/4 in probability units.

Edge cases

  • Perfect / quasi-complete separation: a predictor (or linear combination) perfectly classifies the outcome → MLE diverges to ±∞. Detection: huge SEs, β estimates of ±10 or more. Fix: penalised regression (Firth's correction), Bayesian regularised priors, or drop the offending predictor. See the sketch after this list.
  • Rare-event bias: with very few events (Y = 1), MLE underestimates probabilities. Use Firth or exact logistic regression.
  • Sample-size rule: at least 10 events per predictor (the *rarer* class drives this) — and 20 is safer.
  • Non-linear log-odds: check by binning a continuous predictor and plotting empirical logit. If non-linear, add polynomial / spline terms.
  • Class imbalance distorts threshold-dependent metrics (accuracy) but not coefficient estimation. Use AUC, F1, or set the decision threshold to maximise the metric you actually care about.
  • Over-dispersion in Poisson regression (Var ≫ Mean) → use negative binomial GLM or quasi-Poisson.
  • Repeated measures in a binary outcome → fit a mixed-effects logistic regression (GLMM) instead of plain glm.
  • Correlated observations (clusters) → cluster-robust SEs or GEE.
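
As flagged in the separation bullet above, a tiny sketch that provokes quasi-complete separation so the symptoms are visible (base R only; the logistf call is an optional illustration):

    x <- 1:6
    y <- c(0, 0, 0, 1, 1, 1)                  # x >= 4 predicts y = 1 perfectly
    sep <- glm(y ~ x, family = binomial)      # warns: fitted probabilities numerically 0 or 1
    summary(sep)$coefficients                 # |beta| and its SE blow up -- the MLE is diverging
    # One remedy: Firth-penalised logistic regression, e.g. logistf::logistf(y ~ x)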

Common mistakes

  • Interpreting β as change in *probability*. β changes log-odds; e^β changes odds; neither directly equals a probability change.
  • Calling the odds ratio a 'risk ratio' or 'probability ratio'. They're only approximately equal when p is small. For common outcomes, OR overstates the ratio of probabilities.
  • Applying OLS assumptions (Normality of residuals, homoscedasticity) to logistic. Not applicable.
  • Using R² for a logistic model — OLS R² is meaningless. Use McFadden / Nagelkerke pseudo R² and don't compare across model families.
  • Reporting only accuracy on imbalanced data — use AUC, F1, recall/precision.
  • Comparing nested logistic models by F-test — use the likelihood-ratio test (chi-square) instead.
  • Forgetting that 'no multicollinearity' still applies — VIF is just as needed here as in OLS.
  • Not exponentiating: reporting log-odds coefficients in papers without converting to odds ratios. Readers expect ORs.
  • Treating perfect separation as a great fit — it's a numerical pathology requiring intervention.
  • Using logistic regression on a continuous Y dichotomised at the median — you lose information; fit OLS on the original Y instead.

Shortcuts

  • OLS fails on binary Y for four reasons: bounds, non-Normal residuals, heteroscedasticity, S-shape.
  • GLM = random + systematic + link. Linear is GLM with Normal + identity.
  • Logit: logit(p) = ln(p/(1 − p)). Sigmoid: p = 1/(1 + e^−η).
  • **OR = e^β.** Always report odds ratios + 95% CIs.
  • Estimation = MLE, fit by glm() with family = binomial.
  • Assumptions: binary Y, independence, linearity *in log-odds*, no severe multicollinearity, ≥ 10 events/predictor.
  • Nested comparison: LRT (chi-square), not F-test.
  • Pseudo R² ≠ OLS R² — different scale, McFadden 0.2 = strong fit.
  • ROC + AUC for threshold-independent classifier evaluation.
  • Poisson regression for counts (log link); negative binomial for over-dispersion.

Proofs / Algorithms

Logit ↔ logistic invertibility. Define g(p) = ln(p/(1 − p)) for p ∈ (0, 1). The logit is continuous and strictly increasing on (0, 1) with range (−∞, ∞) — hence a bijection. Inverting η = ln(p/(1 − p)) gives p = e^η/(1 + e^η) = 1/(1 + e^−η). QED.

MLE score equation for logistic regression. Log-likelihood ℓ(β) = Σᵢ [yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ)] where pᵢ = 1/(1 + e^−xᵢᵀβ). Compute ∂pᵢ/∂β = pᵢ(1 − pᵢ)xᵢ. Then ∂ℓ/∂β = Σᵢ (yᵢ − pᵢ)xᵢ = Xᵀ(y − p). Setting this to zero gives the score equation — no closed form, solved iteratively. The Hessian is H = −XᵀWX where W is diagonal with Wᵢᵢ = pᵢ(1 − pᵢ) → negative-definite → strict concavity → unique global maximum (when X is full rank and there is no separation). QED.
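
A compact Newton-Raphson / IRLS sketch in R implementing this score and Hessian; a hand-rolled illustration rather than glm()'s actual internals, but it converges to the same estimates:

    irls_logit <- function(X, y, tol = 1e-8, max_iter = 25) {
      beta <- rep(0, ncol(X))
      for (i in seq_len(max_iter)) {
        p <- plogis(X %*% beta)                             # current fitted probabilities
        W <- diag(as.vector(p * (1 - p)))                   # weights p_i (1 - p_i)
        step <- solve(t(X) %*% W %*% X, t(X) %*% (y - p))   # Newton step: (X'WX)^{-1} X'(y - p)
        beta <- beta + step
        if (max(abs(step)) < tol) break
      }
      drop(beta)
    }

    set.seed(9)
    x <- rnorm(300); y <- rbinom(300, 1, plogis(0.3 + 0.8 * x))
    X <- cbind(1, x)
    irls_logit(X, y)                     # roughly (0.3, 0.8)
    coef(glm(y ~ x, family = binomial))  # same answer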