OLS, Diagnostics, Multiple Regression
Maya, Her Master Tool, and the Day Regression Ate Statistics
The semester is half over. Maya has met t-tests, ANOVAs (in all nine flavors), correlations, chi-square. They feel like separate tools — different scenarios, different formulas, different reporting conventions. She has memorised them dutifully.
Then comes Session 11. Her professor stands in front of the board and says, dead-pan:
*"Today, all of them die. Or rather, today you realise they were never separate. Today, you meet regression."*
Maya thinks she's joking. She isn't.
---
Where the Magic Lives — A Line Through a Cloud
Maya pulls up her dataset of 100 students. X = hours studied per week. Y = exam score. She scatter-plots it. A diffuse upward cloud. Some structure, lots of noise. She wants to predict Y from X.
The simplest model: a straight line.
But which line? There are infinitely many. Her instinct: the line where the prediction errors are smallest. Each data point's miss — its residual — is eᵢ = yᵢ − ŷᵢ. Some residuals are positive (points above the line), some negative (below). If she summed raw residuals, they'd cancel. If she summed absolute residuals, the math gets ugly — absolute values aren't differentiable at zero.
So: square them and add up: SS_res = Σᵢ (yᵢ − ŷᵢ)².
The line minimising this sum is the OLS (Ordinary Least Squares) line. Calculus does the rest. Setting partial derivatives to zero gives a closed-form solution:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  β̂₀ = ȳ − β̂₁x̄
Maya stops. The numerator of β̂₁ is the covariance (up to a constant factor). The denominator is the variance of X. She remembers Session 6: Pearson's r is cov(X, Y)/(s_X·s_Y). The OLS slope and Pearson's r are *almost* the same quantity — they differ only by a scale factor: β̂₁ = r·(s_Y/s_X).
In fact, if she z-scores both X and Y first and re-runs the regression, the slope she gets is exactly r. Standardised regression slope = correlation coefficient.
*"They're the same thing,"* she mutters. *"Correlation is regression in standardised units."*
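The slope-equals-cov/var identity and the standardised-slope claim can be checked numerically. A minimal pure-Python sketch on made-up toy data (the numbers are illustrative, not Maya's dataset):

```python
# Sketch: OLS slope = cov(X, Y) / var(X), and after z-scoring both
# variables the slope collapses to Pearson's r. Toy data, pure stdlib.
from statistics import mean, pstdev

x = [2, 4, 5, 7, 9, 10, 12, 14]          # hours studied (toy data)
y = [55, 58, 63, 67, 72, 74, 80, 85]     # exam scores (toy data)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def ols_slope(a, b):
    return cov(a, b) / cov(a, a)          # cov(X, Y) / var(X)

def zscore(v):
    m, s = mean(v), pstdev(v)
    return [(vi - m) / s for vi in v]

r = cov(x, y) / (pstdev(x) * pstdev(y))   # Pearson's r
slope_std = ols_slope(zscore(x), zscore(y))

print(round(r, 6) == round(slope_std, 6))  # True: standardised slope = r
```

Population (divide-by-n) versions of covariance and SD are used throughout, so the constants cancel exactly.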
She computes for her data: ŷ = 50 + 3·hours. Predict for someone studying 10 hours: 50 + 3·10 = 80. Plain English: each extra hour adds 3 points to predicted score. R² = 0.60. Hours studied alone explains 60% of the variance in exam scores.
---
Many Predictors, One Idea
She adds two more variables: IQ and hours of sleep. The model grows:

ŷ = β₀ + β₁·hours + β₂·IQ + β₃·sleep
OLS still minimises Σ(yᵢ − ŷᵢ)², just now with four coefficients to estimate. The math becomes matrix algebra (β̂ = (XᵀX)⁻¹Xᵀy), but the principle is identical.
The R output:
```
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     50.23        4.12    12.20   < 2e-16 *
hours            2.50        0.32     7.81   < 0.001 *
IQ               0.81        0.18     4.50   < 0.001 *
sleep            1.42        0.51     2.78     0.006 *

Multiple R-squared: 0.65, Adjusted R-squared: 0.64
F-statistic: 59.4 on 3 and 96 DF, p-value: < 2.2e-16
```
The interpretation that matters most: **each β is the effect of that predictor *holding the others constant*.** The 2.50 on hours isn't 'study time correlated with score' — it's 'study time *after partialling out* IQ and sleep.' That nuance is what regression buys you over raw correlation.
Maya reads each line:
- Model F = 59.4, p < .001 → the model as a whole beats predicting the mean for everyone.
- R² = 0.65 → 65% of variance explained.
- Each coefficient's t-test → each predictor uniquely matters.
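The normal-equation formula β̂ = (XᵀX)⁻¹Xᵀy can be exercised directly. A pure-Python sketch on invented toy data (two predictors plus an intercept; the helper names and numbers are illustrative, not Maya's model):

```python
# Sketch: multiple OLS via the normal equations, solved with
# Gaussian elimination. Toy data; no external libraries.
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    # Gaussian elimination with partial pivoting on [A | b]
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

# design matrix with an intercept column: [1, hours, sleep]
X = [[1, 2, 6], [1, 4, 7], [1, 5, 5], [1, 7, 8], [1, 9, 6], [1, 11, 7]]
y = [54, 60, 61, 70, 73, 79]

Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
beta = solve(XtX, Xty)          # [intercept, β_hours, β_sleep]
print([round(b, 3) for b in beta])
```

By construction the residuals of this fit are orthogonal to every column of X, which is exactly what "setting partial derivatives to zero" means in matrix form.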
She notices the F-statistic. The same F from ANOVA. Because — wait.
---
The Moment Maya's Mind Breaks: Everything Is Regression
She remembers her professor's claim that today everything dies. She tests it.
*"Run a one-way ANOVA: anxiety scores across 3 treatments."*
She does. F(2, 87) = 4.61, p = .013, η² = .10.
*"Now: encode treatment as two dummy variables (D_meds, D_both; counselling = reference). Run regression Y ~ D_meds + D_both."*
She does. Identical F. Identical p. The two dummy coefficients give: β_meds = mean(meds) − mean(counselling); β_both = mean(both) − mean(counselling). The regression's overall F-test asks 'are both dummies zero?' — which is precisely the ANOVA's H₀ that all group means are equal.
*"Independent t-test, two groups?"*
Same as regression on a single 0/1 dummy. β̂₁ = mean difference; t-statistic identical.
*"Pearson correlation?"*
Same as the standardised slope of simple regression. β̂_std = r exactly.
*"ANCOVA?"*
Regression with a categorical predictor (dummies) + a continuous covariate.
She writes in big letters across the top of her notebook:
EVERYTHING IS REGRESSION.
T-tests, ANOVA, ANCOVA, correlation — all special cases of a single General Linear Model. The reporting conventions differ for historical reasons; the math is one math. The board calls it the GLM. Maya calls it relief.
---
R² Lies if You Let It
She turns to goodness of fit.
R² = 1 − SS_res/SS_tot: the proportion of variance in Y explained. Clean. Intuitive.
But there's a trap. R² never decreases when you add a predictor, even if the predictor is *random noise*. The proof is mechanical: enlarging the parameter space can only let OLS find a smaller (or equal) SS_res. So you could throw 10 random columns into your model and R² will climb, *just by chance*.
The fix is adjusted R²:

R²_adj = 1 − (1 − R²)·(n − 1)/(n − k − 1)

where n is the sample size and k the number of predictors.
The penalty factor (n − 1)/(n − k − 1) grows with k. Useless predictors increase k but barely raise R² — adjusted R² *decreases*. Useful predictors raise R² more than enough to overcome the penalty. Maya's rule:
Report R² for interpretation ('% explained'). Report adjusted R² for honest model comparison.
She writes another rule in red: never compare models by raw R² if they have different numbers of predictors. Always adjusted R², AIC, or a nested F-test.
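The penalty is a one-line computation. A small sketch, using the section's formula and plausible numbers (the "10 noise predictors" scenario is invented for illustration):

```python
# Sketch of the adjusted-R² penalty: a higher raw R² can still lose
# once you pay for extra predictors. n = 100 as in Maya's data.
def adj_r2(r2, n, k):
    # R²_adj = 1 − (1 − R²)·(n − 1)/(n − k − 1)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 100
print(round(adj_r2(0.65, n, 3), 3))    # Maya's model, 3 predictors -> 0.639
# hypothetical: 10 noise columns nudge raw R² up to 0.66...
print(round(adj_r2(0.66, n, 13), 3))   # ...but adjusted R² drops
```

With R² = 0.65, n = 100, k = 3 this reproduces the 0.64 (to two decimals) in the R output above.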
---
The Six Assumptions That Decide Whether Any of This Is Valid
Her professor is now flipping through slides. The next one is titled *"Six Assumptions, Memorise Them."*
Maya counts:
| # | Assumption | What it means | How to check |
|---|---|---|---|
| 1 | Linearity in parameters | β's enter linearly (X can enter non-linearly via polynomials, logs) | Residuals-vs-fitted plot — no curvature |
| 2 | Normality of *residuals* | ε ~ Normal | Q-Q plot; Shapiro-Wilk |
| 3 | Homoscedasticity | Var(ε) constant | Residuals-vs-fitted — no fan |
| 4 | Exogeneity | E[ε \| X] = 0 | Domain knowledge; random assignment helps |
| 5 | Independence | Residuals uncorrelated | Design; Durbin-Watson for time series |
| 6 | No severe multicollinearity | Predictors not redundant | VIF < 5 (or 10) |
She circles #2 twice. The most common student mistake: people check whether *Y* is normal, or *X* is normal, when the assumption is about residuals. Y can be flatly non-normal and the regression still works fine if residuals are roughly Normal.
She circles #3 twice. Heteroscedasticity — non-constant residual variance, often growing with X — is everywhere in real data (income, reaction times, expenditure). The fix isn't to abandon regression; it's heteroscedasticity-consistent (robust, HC) standard errors. Point estimates of β are unchanged; the SEs are corrected, so p-values and CIs become trustworthy.
She circles #4 three times. Exogeneity is the deep one. Omit a confounder, measure X with error, or have Y secretly affecting X (reverse causality) and your β is biased. Random assignment in experiments enforces exogeneity by design — which is why experimentalists sleep better than observational researchers.
---
The Diagnostic Plots Maya Now Demands From Every Model
After fitting any regression, she produces four plots:
1. Residuals vs fitted. A horizontal cloud of constant width = ideal. Curvature → linearity violated; transform X or add polynomial terms. Fan shape → heteroscedasticity; use robust SEs or transform Y.
2. Q-Q of residuals. Straight diagonal = Normal. Curl at the ends = heavy tails. Hockey stick = skew.
3. Scale-location (√|standardised residuals| vs fitted). Another lens on heteroscedasticity, more sensitive.
4. Residuals vs leverage with Cook's distance contours. Cook's d > 1 flags points whose removal would substantially move the coefficients. Inspect them — don't silently drop.
She thinks of this as the regression equivalent of brushing her teeth. You don't trust a model until you've looked at its residuals.
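The Cook's distance flag in plot 4 can be computed directly for simple regression. A sketch on invented data with one deliberately wild point, using dᵢ = eᵢ²·hᵢᵢ / (p·s²·(1 − hᵢᵢ)²) with p = 2 parameters:

```python
# Sketch: Cook's distance for simple OLS. The last toy point is a
# planted outlier with high leverage; it should dominate.
from statistics import mean

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 30.0]   # last point is wild

n, p = len(x), 2
mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in resid) / (n - p)          # residual variance

cooks = []
for xi, ei in zip(x, resid):
    h = 1 / n + (xi - mx) ** 2 / sxx               # leverage h_ii
    cooks.append(ei ** 2 * h / (p * s2 * (1 - h) ** 2))

print([round(d, 2) for d in cooks])   # the planted outlier dominates
```

Only the planted point crosses the d > 1 threshold here; the rest of the cloud stays well below it.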
---
Simpson Strikes Back
She remembers Simpson's paradox from Session 1 (the Berkeley admissions data). It shows up here too, in regression flesh.
She runs Y ~ education on her income data. β = 5000, p < .001. Education matters!
Then she adds job type (manual / professional) as a dummy and refits. β_education drops to 800, p = .12.
What happened? The original positive effect was almost entirely *because* more education leads to professional jobs (which pay more) — not because, within a job type, an extra year of education adds much to income. Partialling out job type kills the apparent effect.
*"Omit an important covariate and your β tells the wrong story. Include the right ones and the truth surfaces."*
The flip side haunts her too: including *too many* variables can also distort. There's no automatic rule. Theory tells you what to include. Stepwise regression and AIC are tools, not authorities.
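The reversal has a simple numerical skeleton. A hedged toy illustration (made-up education/income numbers, not the dataset from the text): the pooled slope is steep because job type is confounded with education, while the within-group slopes are nearly flat.

```python
# Sketch of Simpson-style confounding: education predicts income
# strongly overall, but barely within each job type. Toy data.
from statistics import mean

def slope(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

# manual jobs: lower education, lower income, flat within-group trend
edu_manual = [10, 11, 12, 12, 13]
inc_manual = [30, 31, 30, 32, 31]
# professional jobs: higher education, higher income, also nearly flat
edu_prof = [16, 17, 18, 18, 19]
inc_prof = [70, 71, 70, 72, 71]

pooled = slope(edu_manual + edu_prof, inc_manual + inc_prof)
within_manual = slope(edu_manual, inc_manual)
within_prof = slope(edu_prof, inc_prof)

print(round(pooled, 2))                                 # steep overall
print(round(within_manual, 2), round(within_prof, 2))   # nearly flat
```

Adding a job-type dummy to the regression is what makes the model report the within-group slope instead of the confounded pooled one.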
---
How Maya Closes the Session
By 2am she has built three regressions for her thesis. Each has:
- A clear research question.
- The right scale-level predictors.
- Adjusted R² and a model F-test.
- Coefficient t-tests for each predictor.
- Standardised β's so she can compare magnitudes.
- The four diagnostic plots tucked into the appendix.
- A note on which assumptions hold and which she's worried about.
- A sensitivity analysis dropping the one high-Cook's-d outlier.
She remembers her professor's claim from the morning: that today everything dies. She no longer hears it as a threat. The fragmented arsenal of tests — t, ANOVA, χ², correlation — collapses into one principled model. She doesn't have to remember nine reporting conventions. She has to understand one.
*"R² explains. β explains, holding others constant. Diagnostics check whether you're allowed to trust the explanation."*
She turns off the desk lamp. Tomorrow she'll meet logistic regression — the GLM stretched to binary outcomes — and Session 11 will turn out to have prepared her completely. Because tomorrow's model is just another link in the same chain.
Regression isn't a tool. It's the language.