OLS, Diagnostics, Multiple Regression
Maya, Her Master Tool, and the Day Regression Ate Statistics
The semester is half over. Maya has met t-tests, ANOVAs (in all nine flavors), correlations, chi-square. They feel like separate tools — different scenarios, different formulas, different reporting conventions. She has memorised them dutifully.
Then comes Session 11. Her professor stands in front of the board and says, dead-pan:
*"Today, all of them die. Or rather, today you realise they were never separate. Today, you meet regression."*
Maya thinks she's joking. She isn't.
---
Where the Magic Lives — A Line Through a Cloud
Maya pulls up her dataset of 100 students. X = hours studied per week. Y = exam score. She scatter-plots it. A diffuse upward cloud. Some structure, lots of noise. She wants to predict Y from X.
The simplest model: a straight line.
But which line? There are infinitely many. Her instinct: the line where the prediction errors are smallest. Each data point's miss — its residual — is eᵢ = yᵢ − ŷᵢ. Some residuals are positive (points above the line), some negative (below). If she summed raw residuals, they'd cancel. If she summed absolute residuals, the math gets ugly — absolute values aren't differentiable at zero.
So: square them and add up: SS_res = Σᵢ (yᵢ − ŷᵢ)².
The line minimising this sum is the OLS (Ordinary Least Squares) line. Calculus does the rest. Setting partial derivatives to zero gives a closed-form solution:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  β̂₀ = ȳ − β̂₁x̄
Maya stops. The numerator of β̂₁ is the covariance (up to a constant factor). The denominator is the variance of X. She remembers Session 6: Pearson's r is cov(X, Y)/(s_X·s_Y). The OLS slope and Pearson's r are *almost* the same quantity — they differ only by a scale factor: β̂₁ = r·(s_Y/s_X).
In fact, if she z-scores both X and Y first and re-runs the regression, the slope she gets is exactly r. Standardised regression slope = correlation coefficient.
*"They're the same thing,"* she mutters. *"Correlation is regression in standardised units."*
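The slope-equals-cov/var identity and the standardised-slope claim can be checked numerically. A minimal pure-Python sketch on made-up toy data (the numbers are illustrative, not Maya's dataset):

```python
# Sketch: OLS slope = cov(X, Y) / var(X), and after z-scoring both
# variables the slope collapses to Pearson's r. Toy data, pure stdlib.
from statistics import mean, pstdev

x = [2, 4, 5, 7, 9, 10, 12, 14]          # hours studied (toy data)
y = [55, 58, 63, 67, 72, 74, 80, 85]     # exam scores (toy data)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def ols_slope(a, b):
    return cov(a, b) / cov(a, a)          # cov(X, Y) / var(X)

def zscore(v):
    m, s = mean(v), pstdev(v)
    return [(vi - m) / s for vi in v]

r = cov(x, y) / (pstdev(x) * pstdev(y))   # Pearson's r
slope_std = ols_slope(zscore(x), zscore(y))

print(round(r, 6) == round(slope_std, 6))  # True: standardised slope = r
```

Population (divide-by-n) versions of covariance and SD are used throughout, so the constants cancel exactly.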
She computes for her data: ŷ = 50 + 3·hours. Predict for someone studying 10 hours: 50 + 3·10 = 80. Plain English: each extra hour adds 3 points to predicted score. R² = 0.60. Hours studied alone explains 60% of the variance in exam scores.
---
Many Predictors, One Idea
She adds two more variables: IQ and hours of sleep. The model grows:

ŷ = β₀ + β₁·hours + β₂·IQ + β₃·sleep
OLS still minimises Σ(yᵢ − ŷᵢ)², just now with four coefficients to estimate. The math becomes matrix algebra (β̂ = (XᵀX)⁻¹Xᵀy), but the principle is identical.
The R output:
```
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     50.23        4.12    12.20   < 2e-16 *
hours            2.50        0.32     7.81   < 0.001 *
IQ               0.81        0.18     4.50   < 0.001 *
sleep            1.42        0.51     2.78     0.006 *

Multiple R-squared: 0.65, Adjusted R-squared: 0.64
F-statistic: 59.4 on 3 and 96 DF, p-value: < 2.2e-16
```
The interpretation that matters most: **each β is the effect of that predictor *holding the others constant*.** The 2.50 on hours isn't 'study time correlated with score' — it's 'study time *after partialling out* IQ and sleep.' That nuance is what regression buys you over raw correlation.
Maya reads each line:
- Model F = 59.4, p < .001 → the model as a whole beats predicting the mean for everyone.
- R² = 0.65 → 65% of variance explained.
- Each coefficient's t-test → each predictor uniquely matters.
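The normal-equation formula β̂ = (XᵀX)⁻¹Xᵀy can be exercised directly. A pure-Python sketch on invented toy data (two predictors plus an intercept; the helper names and numbers are illustrative, not Maya's model):

```python
# Sketch: multiple OLS via the normal equations, solved with
# Gaussian elimination. Toy data; no external libraries.
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    # Gaussian elimination with partial pivoting on [A | b]
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

# design matrix with an intercept column: [1, hours, sleep]
X = [[1, 2, 6], [1, 4, 7], [1, 5, 5], [1, 7, 8], [1, 9, 6], [1, 11, 7]]
y = [54, 60, 61, 70, 73, 79]

Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
beta = solve(XtX, Xty)          # [intercept, β_hours, β_sleep]
print([round(b, 3) for b in beta])
```

By construction the residuals of this fit are orthogonal to every column of X, which is exactly what "setting partial derivatives to zero" means in matrix form.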
She notices the F-statistic. The same F from ANOVA. Because — wait.
---
The Moment Maya's Mind Breaks: Everything Is Regression
She remembers her professor's claim that today everything dies. She tests it.
*"Run a one-way ANOVA: anxiety scores across 3 treatments."*
She does. F(2, 87) = 4.61, p = .013, η² = .10.
*"Now: encode treatment as two dummy variables (D_meds, D_both; counselling = reference). Run regression Y ~ D_meds + D_both."*
She does. Identical F. Identical p. The two dummy coefficients give: β_meds = mean(meds) − mean(counselling); β_both = mean(both) − mean(counselling). The regression's overall F-test asks 'are both dummies zero?' — which is precisely the ANOVA's H₀ that all group means are equal.
*"Independent t-test, two groups?"*
Same as regression on a single 0/1 dummy. β̂₁ = mean difference; t-statistic identical.
*"Pearson correlation?"*
Same as the standardised slope of simple regression. β̂_std = r exactly.
*"ANCOVA?"*
Regression with a categorical predictor (dummies) + a continuous covariate.
She writes in big letters across the top of her notebook:
EVERYTHING IS REGRESSION.
T-tests, ANOVA, ANCOVA, correlation — all special cases of a single General Linear Model. The reporting conventions differ for historical reasons; the math is one math. The board calls it the GLM. Maya calls it relief.
---
R² Lies if You Let It
She turns to goodness of fit.
R² = 1 − SS_res/SS_tot: the proportion of variance in Y explained. Clean. Intuitive.
But there's a trap. R² never decreases when you add a predictor, even if the predictor is *random noise*. The proof is mechanical: enlarging the parameter space can only let OLS find a smaller (or equal) SS_res. So you could throw 10 random columns into your model and R² will climb, *just by chance*.
The fix is adjusted R²:

R²_adj = 1 − (1 − R²)·(n − 1)/(n − k − 1)

where n is the sample size and k the number of predictors.
The penalty factor (n − 1)/(n − k − 1) grows with k. Useless predictors increase k but barely raise R² — adjusted R² *decreases*. Useful predictors raise R² more than enough to overcome the penalty. Maya's rule:
Report R² for interpretation ('% explained'). Report adjusted R² for honest model comparison.
She writes another rule in red: never compare models by raw R² if they have different numbers of predictors. Always adjusted R², AIC, or a nested F-test.
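The penalty is a one-line computation. A small sketch, using the section's formula and plausible numbers (the "10 noise predictors" scenario is invented for illustration):

```python
# Sketch of the adjusted-R² penalty: a higher raw R² can still lose
# once you pay for extra predictors. n = 100 as in Maya's data.
def adj_r2(r2, n, k):
    # R²_adj = 1 − (1 − R²)·(n − 1)/(n − k − 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 100
print(round(adj_r2(0.65, n, 3), 3))    # Maya's model, 3 predictors -> 0.639
# hypothetical: 10 noise columns nudge raw R² up to 0.66...
print(round(adj_r2(0.66, n, 13), 3))   # ...but adjusted R² drops
```

With R² = 0.65, n = 100, k = 3 this reproduces the 0.64 (to two decimals) in the R output above.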
---
The Six Assumptions That Decide Whether Any of This Is Valid
Her professor is now flipping through slides. The next one is titled *"Six Assumptions, Memorise Them."*
Maya counts:
| # | Assumption | What it means | How to check |
|---|---|---|---|
| 1 | Linearity in parameters | β's enter linearly (X can enter non-linearly via polynomials, logs) | Residuals-vs-fitted plot — no curvature |
| 2 | Normality of *residuals* | ε ~ Normal | Q-Q plot; Shapiro-Wilk |
| 3 | Homoscedasticity | Var(ε) constant | Residuals-vs-fitted — no fan |
| 4 | Exogeneity | E[ε \| X] = 0 | Domain knowledge; random assignment helps |
| 5 | Independence | Residuals uncorrelated | Design; Durbin-Watson for time series |
| 6 | No severe multicollinearity | Predictors not redundant | VIF < 5 (or 10) |
She circles #2 twice. The most common student mistake: people check whether *Y* is normal, or *X* is normal, when the assumption is about residuals. Y can be flatly non-normal and the regression still works fine if residuals are roughly Normal.
She circles #3 twice. Heteroscedasticity — non-constant residual variance, often growing with X — is everywhere in real data (income, reaction times, expenditure). The fix isn't to abandon regression; it's heteroscedasticity-consistent (robust, HC) standard errors. Point estimates of β are unchanged; the SEs are corrected, so p-values and CIs become trustworthy.
She circles #4 three times. Exogeneity is the deep one. Omit a confounder, measure X with error, or have Y secretly affecting X (reverse causality) and your β is biased. Random assignment in experiments enforces exogeneity by design — which is why experimentalists sleep better than observational researchers.
---
The Diagnostic Plots Maya Now Demands From Every Model
After fitting any regression, she produces four plots:
1. Residuals vs fitted. A horizontal cloud of constant width = ideal. Curvature → linearity violated; transform X or add polynomial terms. Fan shape → heteroscedasticity; use robust SEs or transform Y.
2. Q-Q of residuals. Straight diagonal = Normal. Curl at the ends = heavy tails. Hockey stick = skew.
3. Scale-location (√|standardised residuals| vs fitted). Another lens on heteroscedasticity, more sensitive.
4. Residuals vs leverage with Cook's distance contours. Cook's d > 1 flags points whose removal would substantially move the coefficients. Inspect them — don't silently drop.
She thinks of this as the regression equivalent of brushing her teeth. You don't trust a model until you've looked at its residuals.
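The Cook's distance flag in plot 4 can be computed directly for simple regression. A sketch on invented data with one deliberately wild point, using dᵢ = eᵢ²·hᵢᵢ / (p·s²·(1 − hᵢᵢ)²) with p = 2 parameters:

```python
# Sketch: Cook's distance for simple OLS. The last toy point is a
# planted outlier with high leverage; it should dominate.
from statistics import mean

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 30.0]   # last point is wild

n, p = len(x), 2
mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in resid) / (n - p)          # residual variance

cooks = []
for xi, ei in zip(x, resid):
    h = 1 / n + (xi - mx) ** 2 / sxx               # leverage h_ii
    cooks.append(ei ** 2 * h / (p * s2 * (1 - h) ** 2))

print([round(d, 2) for d in cooks])   # the planted outlier dominates
```

Only the planted point crosses the d > 1 threshold here; the rest of the cloud stays well below it.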
---
Simpson Strikes Back
She remembers Simpson's paradox from Session 1 (the Berkeley admissions data). It shows up here too, in regression flesh.
She runs Y ~ education on her income data. β = 5000, p < .001. Education matters!
Then she adds job type (manual / professional) as a dummy and refits. β_education drops to 800, p = .12.
What happened? The original positive effect was almost entirely *because* more education leads to professional jobs (which pay more) — not because, within a job type, an extra year of education adds much to income. Partialling out job type kills the apparent effect.
*"Omit an important covariate and your β tells the wrong story. Include the right ones and the truth surfaces."*
The flip side haunts her too: including *too many* variables can also distort. There's no automatic rule. Theory tells you what to include. Stepwise regression and AIC are tools, not authorities.
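The reversal has a simple numerical skeleton. A hedged toy illustration (made-up education/income numbers, not the dataset from the text): the pooled slope is steep because job type is confounded with education, while the within-group slopes are nearly flat.

```python
# Sketch of Simpson-style confounding: education predicts income
# strongly overall, but barely within each job type. Toy data.
from statistics import mean

def slope(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

# manual jobs: lower education, lower income, flat within-group trend
edu_manual = [10, 11, 12, 12, 13]
inc_manual = [30, 31, 30, 32, 31]
# professional jobs: higher education, higher income, also nearly flat
edu_prof = [16, 17, 18, 18, 19]
inc_prof = [70, 71, 70, 72, 71]

pooled = slope(edu_manual + edu_prof, inc_manual + inc_prof)
within_manual = slope(edu_manual, inc_manual)
within_prof = slope(edu_prof, inc_prof)

print(round(pooled, 2))                                 # steep overall
print(round(within_manual, 2), round(within_prof, 2))   # nearly flat
```

Adding a job-type dummy to the regression is what makes the model report the within-group slope instead of the confounded pooled one.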
---
How Maya Closes the Session
By 2am she has built three regressions for her thesis. Each has:
- A clear research question.
- The right scale-level predictors.
- Adjusted R² and a model F-test.
- Coefficient t-tests for each predictor.
- Standardised β's so she can compare magnitudes.
- The four diagnostic plots tucked into the appendix.
- A note on which assumptions hold and which she's worried about.
- A sensitivity analysis dropping the one high-Cook's-d outlier.
She remembers her professor's claim from the morning: that today everything dies. She no longer hears it as a threat. The fragmented arsenal of tests — t, ANOVA, χ², correlation — collapses into one principled model. She doesn't have to remember nine reporting conventions. She has to understand one.
*"R² explains. β explains, holding others constant. Diagnostics check whether you're allowed to trust the explanation."*
She turns off the desk lamp. Tomorrow she'll meet logistic regression — the GLM stretched to binary outcomes — and Session 11 will turn out to have prepared her completely. Because tomorrow's model is just another link in the same chain.
Regression isn't a tool. It's the language.