Logistic Regression and the GLM Framework
Maya Meets the GLM, and Statistics Finally Closes Its Loop
The semester is almost done. Maya has assembled an arsenal: t-tests, ANOVAs, regression in nine flavors, Bayes. But last Tuesday she ran into a problem that broke everything.
A friend in psychology asked her to help analyse data: 400 graduate-school applicants, 127 admitted, 273 rejected. Predictors: GRE, GPA, undergrad institution rank. *Predict admission.*
Maya's reflex was linear regression:

$$\text{admit}_i = \beta_0 + \beta_1\,\text{GRE}_i + \beta_2\,\text{GPA}_i + \beta_3\,\text{rank}_i + \varepsilon_i$$
She typed it into R. The output came back. R² = 0.18. Coefficients with t-values and p-values. Looked fine until she generated predictions for a few applicants and saw:
- Applicant A: predicted probability 1.34. *A 134% chance of admission?*
- Applicant B: a predicted probability below zero. *A negative probability?*
Probabilities are bounded between 0 and 1. Linear regression doesn't know that. It hasn't been told. It will happily predict whatever the line says.
She stared at the screen for a while.
*"There must be something specifically built for this kind of outcome."*
There is. It's called logistic regression, and it's just one member of a much wider family.
---
The Four Things That Go Wrong With OLS on Binary Y
She lists them carefully in her notebook because she suspects the exam will ask:
1. Predictions escape (0, 1). The linear model produces anything in (−∞, ∞); a quick simulation after this list makes it visible.
2. Residuals can take only two values. For a fixed prediction p̂, every actual residual is either 1 − p̂ (when Y = 1) or −p̂ (when Y = 0). Not Normal. Not even close.
3. Heteroscedasticity is built in. For binary outcomes, Var(Y) = p(1 − p). Variance peaks at p = 0.5 and shrinks to nearly zero at the extremes. The OLS assumption of constant variance is mathematically *guaranteed* false.
4. The relationship is S-shaped, not linear. Moving GRE from 250 to 260 might shift admit probability from 0.10 to 0.15 (small change in the flat region). Moving from 300 to 310 might shift it from 0.30 to 0.55 (steep middle). Moving from 330 to 340 might shift it from 0.85 to 0.93 (flat again). Sigmoid, not line.
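A minimal sketch on invented data (not the admissions file) shows problem 1 directly:

```r
set.seed(42)
gre   <- runif(200, 250, 340)                  # GRE-like predictor, invented scale
p_tru <- plogis(-30 + 0.1 * gre)               # true S-shaped admit probability
admit <- rbinom(200, size = 1, prob = p_tru)   # 0/1 outcome

ols <- lm(admit ~ gre)                         # Maya's reflex model
predict(ols, newdata = data.frame(gre = c(220, 370)))
# the fitted line keeps going: typically below 0 at gre = 220, above 1 at 370
```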
The fix isn't to throw out regression. The fix is to *model a different quantity* that *is* approximately linear in the predictors.
---
The Two-Step Trick That Makes Everything Work
What quantity lives on the same scale as the linear predictor (−∞ to +∞), but is built from probability?
Step 1: Probability → Odds.

$$\text{odds} = \frac{p}{1-p}$$

Maps (0, 1) → (0, ∞). Better. But still bounded below by 0.
Step 2: Odds → Log-odds.

$$\text{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)$$

Maps (0, ∞) → (−∞, ∞). Now the scale matches. We can write:

$$\ln\!\left(\frac{p}{1-p}\right) = \eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
This is logistic regression. The thing we model linearly isn't probability; it's the logit — the log of the odds.
Inverting back gives the logistic (sigmoid) function:

$$p = \frac{1}{1 + e^{-\eta}}$$
Plot it: flat near 0 for very negative η, steepest slope at η = 0 (p = 0.5), flat near 1 for very positive η. Exactly the S-shape we wanted. Probability is bounded by construction.
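Base R has both directions of the map built in: `qlogis()` is the logit, `plogis()` its inverse. A quick check:

```r
p <- c(0.10, 0.50, 0.90)
eta <- qlogis(p)    # log-odds: -2.197  0.000  2.197
plogis(eta)         # back to 0.10 0.50 0.90 exactly

curve(plogis(x), from = -6, to = 6,
      xlab = "eta (linear predictor)", ylab = "p")  # the S-curve
```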
Maya stares at the curve. It's the same shape as her data when she binned it. It's the same shape as the neural-network activation she'd seen in a friend's deep-learning notes. It's everywhere.
---
How To Read the Coefficients (The Exam Loves This)
Two ways to interpret β:
1. Raw β = change in log-odds per unit X (holding others constant). Mathematically clean, but no human thinks in log-odds.
2. e^β = odds ratio. A one-unit increase in X multiplies the odds of Y by e^β.
- e^β = 2 → odds double per unit.
- e^β = 1 → no effect.
- e^β = 0.5 → odds halve.
- e^β = 1.05 → odds rise by 5%.
This is the standard reporting format: 'For each additional point of GRE, the odds of admission increased by a factor of 1.005 (95% CI: 1.003 to 1.007), p < .001.'
She underlines a warning: OR ≠ probability ratio. An OR of 2 doesn't mean 'twice as likely.' It means the *odds* (not the probability) doubled. For p = 0.10, OR = 2 takes p to 0.18, not 0.20. The two coincide only when p is small.
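The warning takes three lines of arithmetic to verify:

```r
p0    <- 0.10
odds0 <- p0 / (1 - p0)    # 0.111
odds1 <- 2 * odds0        # apply OR = 2 on the odds scale
odds1 / (1 + odds1)       # 0.1818 -> p moves to 0.18, not 0.20
```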
---
The Goalkeeper Problem
Her professor's slide deck has a clean toy example. *Do goalkeepers save more penalties when their team is behind?* (Pop psychology says no — pressure hurts.)
24 penalties faced when team was behind: 2 saved, 22 scored.
20 penalties faced when team was not behind: 6 saved, 14 scored.
A logistic regression with X = 0/1 for not behind/behind gives:

$$\text{logit}(p_{\text{save}}) = -0.85 - 1.55\,X$$

Being behind multiplies save-odds by e^−1.55 ≈ 0.21 — slashes them by ~80%. Or flip it: 1/0.21 ≈ 4.7 — the odds of conceding are nearly 5× higher when the keeper's team is behind.
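The slide's numbers can be reproduced from the aggregated counts alone; a sketch using R's two-column binomial response (object names invented here):

```r
pens <- data.frame(behind = c(0, 1),
                   saved  = c(6, 2),
                   scored = c(14, 22))
gk <- glm(cbind(saved, scored) ~ behind, data = pens, family = binomial)
coef(gk)        # intercept ~ -0.85, slope ~ -1.55
exp(coef(gk))   # behind: ~0.21, the odds ratio above
```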
*"So the 'pressure breaks the keeper' folk wisdom isn't wrong. The data agree."*
She writes this and circles it.
---
How It's Fitted — Maximum Likelihood
OLS minimised squared residuals. Logistic regression can't — residuals are binary noise. Instead it uses Maximum Likelihood Estimation:
*Choose the β that makes the observed data most likely under the model.*
Each observation gets a model-predicted probability of being the class it actually is. Multiply them all:

$$L(\beta) = \prod_{i=1}^{n} \hat{p}_i^{\,y_i}\,(1 - \hat{p}_i)^{1 - y_i}$$

Take logs (for numerical stability) to get the log-likelihood,

$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \ln \hat{p}_i + (1 - y_i) \ln (1 - \hat{p}_i) \right],$$

and find the β that maximises it. There's no closed-form solution — software iterates (Newton-Raphson, IRLS, Fisher scoring) until it converges.
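To see the machinery once, here is a toy sketch that maximises this log-likelihood numerically on simulated data (`glm()` itself uses IRLS rather than a general-purpose optimiser):

```r
# negative Bernoulli log-likelihood for a logistic model
neg_loglik <- function(beta, X, y) {
  p <- plogis(X %*% beta)                  # model-implied probabilities
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

set.seed(1)
X <- cbind(1, rnorm(500))                  # intercept + one predictor
y <- rbinom(500, 1, plogis(X %*% c(-0.5, 1.2)))

optim(c(0, 0), neg_loglik, X = X, y = y)$par   # recovers roughly (-0.5, 1.2)
```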
You don't compute this by hand. You write:
```r
model <- glm(admit ~ gre + gpa + factor(rank),
             data = mydata,
             family = binomial(link = "logit"))
summary(model)
exp(coef(model))     # odds ratios
exp(confint(model))  # 95% CIs for ORs
```
`family = binomial(link = "logit")` tells R: 'logistic regression, please.' Change to `family = poisson(link = "log")` and you've got Poisson regression for counts. Same line. Different distribution. Different link. Same framework.
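For instance, with a hypothetical count outcome (`publications` is invented here; it is not in the admissions data):

```r
pois_model <- glm(publications ~ gre + gpa, data = mydata,
                  family = poisson(link = "log"))
exp(coef(pois_model))   # rate ratios, the Poisson analogue of odds ratios
```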
---
The Bigger Picture — Generalised Linear Models
Maya's professor flips to the big slide: GLM = three components.
| Component | What it is | Linear regression | Logistic | Poisson |
|---|---|---|---|---|
| Random | Y's distribution | Normal | Bernoulli | Poisson |
| Systematic | η = Xβ | same | same | same |
| Link | g(E[Y]) = η | identity | logit | log |
Everything is GLM. Linear regression is GLM with Normal + identity link. Logistic is GLM with Bernoulli + logit. Poisson regression for counts is GLM with Poisson + log. Multinomial logistic for k > 2 categories. Ordinal logistic for ordered Y. Gamma regression for positive, right-skewed Y.
The unifying picture from Session 11 ('everything is regression') just got bigger: not just continuous Y. Everything is GLM.
Maya writes in her notebook:
> *linear models ⊂ GLM,*
>
> *and t-tests, ANOVA, ANCOVA, correlation ⊂ linear,*
>
> *and logistic, Poisson, multinomial ⊂ GLM*
Six tests collapse to one framework. Her statistics is finally finite.
---
Assumptions She'll Be Asked About
Logistic regression makes *fewer* assumptions than OLS — but they're not zero:
1. Binary (or binomial) Y.
2. Independence of observations.
3. Linearity in the log-odds (not in p) — add polynomials / splines if violated.
4. No severe multicollinearity — same VIF concerns as OLS.
5. Adequate sample: at least 10 events per predictor (events = the count of the rarer class, here Y = 1).
What's *missing* compared to OLS: no Normality, no homoscedasticity. These are handled implicitly by the choice of Bernoulli distribution.
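Rule 5 is quick arithmetic for Maya's data (127 admits is the rarer class; her model estimates five slope terms):

```r
events <- 127   # admitted applicants, the rarer outcome
k      <- 5     # gre, gpa, and three rank dummies
events / k      # ~25 events per predictor, well above the 10 minimum
```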
---
Maya's Worked Admission Model
Back to the original data:
```
              Estimate Std.Error z value Pr(>|z|)
(Intercept)    -3.99      1.14    -3.50   0.0005
gre             0.0023    0.0011   2.07   0.0385
gpa             0.804     0.332    2.42   0.0154
factor(rank)2  -0.675     0.316   -2.13   0.0328
factor(rank)3  -1.34      0.345   -3.88   0.0001
factor(rank)4  -1.55      0.418   -3.71   0.0002
```
She converts:
- e^0.0023 ≈ 1.0023 — each GRE point → 0.23% odds increase. Tiny per point; meaningful across hundreds of points.
- e^0.804 ≈ 2.23 — each full point of GPA → odds of admission more than double.
- Rank 4 vs rank 1 (reference): e^−1.55 ≈ 0.21 — applicants from rank-4 schools have ~21% the admission odds of otherwise identical applicants from rank-1 schools.
She predicts for an applicant: GRE = 580, GPA = 3.7, rank = 2.

$$\eta = -3.99 + 0.0023(580) + 0.804(3.7) - 0.675 \approx -0.36, \qquad p = \frac{1}{1 + e^{0.36}} \approx 0.41$$

41% probability of admission. Reasonable.
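The same number drops out of `predict()` (assuming the fitted `model` object from earlier):

```r
new_app <- data.frame(gre = 580, gpa = 3.7, rank = 2)
predict(model, newdata = new_app, type = "response")   # ~0.41
```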
---
How Maya Closes the Year
By the time the session is over, Maya can feel the pieces locking. The list of named tests has collapsed twice. First into 'everything is regression' (Session 11). Now into 'everything is GLM.' Tomorrow, she'll review the entire course in Session 14 with this unified picture in her head.
She makes one last note:
*"OLS asks: how does Y change with X? Logistic asks: how do the odds of Y change with X? Same question, two scales."*
*"GLMs are the language of behavioural data. Whatever your Y looks like — continuous, binary, count, ordinal, categorical — there's a GLM for it. The question is never 'is there a regression for this?' The question is: 'what's the right random component and link?'"*
She turns off her desk lamp. Outside, the campus is quiet. Tomorrow is exam revision. She is, for the first time all semester, ready.