
Behavioral Research: Statistical Methods

CG3.402
Vinoo Alluri · Monsoon 2025-26 · 4 credits
Revision Notes · Unit 12 — Regression (Linear, Multiple)

OLS, Diagnostics, Multiple Regression


Intuition

If ANOVA is the workhorse, regression is the racehorse. The same one model — fit a line, or a hyperplane, that minimises squared residuals — subsumes correlation, t-tests, ANOVA, and ANCOVA as special cases. Each coefficient asks: *holding all other predictors constant, how much does Y change per unit of X_j?* R² is the headline ('% variance explained'). But the model only earns its conclusions if six assumptions hold — and that's why half of regression is diagnostics, not estimation.

Explanation

Simple linear regression. Predict Y from one X: Ŷ = β₀ + β₁X, and any actual data point sits at Y_i = β₀ + β₁X_i + ε_i, where ε_i is the residual (vertical distance from the fitted line). β₀ = intercept (predicted Y at X = 0; often not directly meaningful but anchors the line). β₁ = slope (change in Y per unit X).

Why OLS — the least-squares principle. Among the infinite lines we could draw, choose the one minimising Σε_i² = Σ(Y_i − Ŷ_i)². Why squared? Absolute values are mathematically awkward; raw residuals cancel. Squaring penalises large misses more than small ones and gives a smooth, differentiable loss → closed-form solution: β₁ = cov(X, Y)/Var(X), β₀ = Ȳ − β₁X̄. (Calculus: set partial derivatives of the squared-error loss to zero.)
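
A minimal numpy sketch of the closed-form estimates on simulated data (all variable names and numbers are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.5, size=100)   # true intercept 2, slope 3

# Closed-form OLS: beta1 = cov(X, Y) / var(X), beta0 = Ybar - beta1 * Xbar
beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)          # close to 2 and 3

# Sanity check against numpy's own least-squares fit
print(np.polyfit(x, y, 1))   # returns [slope, intercept]
```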

Connection to correlation. The slope of standardised regression (z-score X and Y first) equals Pearson's r. And for one predictor, R² = r². Regression, correlation, and ANOVA all live in the same family.

Multiple linear regression. With k predictors: Ŷ = β₀ + β₁X₁ + β₂X₂ + … + β_kX_k. Each β_j is the change in Y per unit X_j holding all other predictors constant — the *unique* contribution after partialling out everything else. This 'holding constant' clause is what makes regression more nuanced than pairwise correlations.

R² — goodness of fit. R² = SS_reg/SS_tot = 1 − SS_res/SS_tot. Proportion of variance in Y explained by the predictors. 0 → model explains nothing (predicting Ȳ for everyone would do equally well). 1 → every data point falls exactly on the regression surface. R² = 0.65 reads as 'predictors explain 65% of variance in Y.'

Adjusted R² — penalising complexity. R² never decreases when you add predictors, even random garbage — the model just has more flexibility. Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1) subtracts a penalty for k. Useless predictors *decrease* adjusted R². Report both: R² for interpretation, adjusted R² for honest model comparison.
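
A small simulation, on made-up data, showing R² creeping up while adjusted R² typically drops when a useless predictor is added (the fit_r2 helper is invented for this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
junk = rng.normal(size=n)               # predictor unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

def fit_r2(y, *predictors):
    """OLS fit; return (R^2, adjusted R^2) for the given predictors."""
    X = np.column_stack([np.ones_like(y), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = np.sum(resid**2)
    ss_tot = np.sum((y - y.mean())**2)
    r2 = 1 - ss_res / ss_tot
    k = X.shape[1] - 1                  # predictors, excluding the intercept
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

print(fit_r2(y, x1))          # baseline model
print(fit_r2(y, x1, junk))    # R^2 creeps up, adjusted R^2 typically drops
```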

Three layers of hypothesis tests. (1) Model F-test (F = MS_reg/MS_res): does the model as a whole beat the null model 'predict Ȳ for everyone'? (2) Coefficient t-tests (t = β_j/SE(β_j), df = n − k − 1): does each predictor contribute uniquely after controlling for the others? (3) Confidence intervals (β_j ± t_crit·SE(β_j)) — equivalent to the t-test: if the CI excludes 0, the predictor is significant.
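
The three layers computed by hand with numpy/scipy on simulated data (one real predictor, one null predictor, both invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])            # design matrix with intercept
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
resid = y - Xd @ beta
ss_res = resid @ resid
ss_tot = np.sum((y - y.mean())**2)
ss_reg = ss_tot - ss_res
df_res = n - k - 1

# (1) Model F-test: does the model beat the intercept-only null?
F = (ss_reg / k) / (ss_res / df_res)
p_F = stats.f.sf(F, k, df_res)

# (2) Coefficient t-tests: beta_j / SE(beta_j), df = n - k - 1
sigma2 = ss_res / df_res
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
t = beta / se
p_t = 2 * stats.t.sf(np.abs(t), df_res)

# (3) 95% CIs: beta_j +/- t_crit * SE(beta_j)
crit = stats.t.ppf(0.975, df_res)
ci = np.column_stack([beta - crit * se, beta + crit * se])
print(F, p_F, t, p_t, ci, sep="\n")
```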

Standardised coefficients (β). Predictors on different scales (income in lakhs, education in years) can't be compared by raw coefficients. Z-score X and Y, rerun, get standardised β's — '1 SD of X → β SDs of Y'. Compares predictor magnitudes meaningfully. Caveat: the comparison rests on each predictor's SD in this particular sample, so it need not generalise across samples.
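
A sketch of standardisation with hypothetical hours/IQ predictors (names borrowed from the worked example further below; data simulated):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
hours = rng.normal(20, 5, n)              # predictors on very different scales
iq = rng.normal(100, 15, n)
score = 10 + 2.5 * hours + 0.3 * iq + rng.normal(0, 8, n)

def ols(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

raw = ols(np.column_stack([hours, iq]), score)

# Standardise: z-score every variable, refit; slopes are now in SD units
z = lambda v: (v - v.mean()) / v.std(ddof=1)
std = ols(np.column_stack([z(hours), z(iq)]), z(score))
print(raw[1:])   # raw slopes, not comparable across units
print(std[1:])   # standardised betas: '1 SD of X -> beta SDs of Y'
```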

Assumption 1 — Linearity in parameters. The model is linear in the *coefficients*, not necessarily in X. Y = β₀ + β₁X + β₂X² + ε is still linear regression — coefficients enter linearly even though the X-relationship is curved. The functional form must capture the true shape. Check: residuals-vs-fitted plot; curvature signals violation.
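
A sketch of the polynomial case, mirroring the caffeine example in the Examples section, on simulated data:

```python
import numpy as np

rng = np.random.default_rng(12)
caffeine = rng.uniform(0, 10, 150)
# Inverted-U by construction: productivity peaks at moderate caffeine
productivity = 2 + 3 * caffeine - 0.3 * caffeine**2 + rng.normal(0, 1, 150)

# Still *linear* regression: the design matrix gets a squared column,
# but the coefficients enter linearly.
X = np.column_stack([np.ones_like(caffeine), caffeine, caffeine**2])
beta, *_ = np.linalg.lstsq(X, productivity, rcond=None)
print(beta)   # roughly [2, 3, -0.3]; negative quadratic term -> inverted-U
```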

Assumption 2 — Normality of residuals. Not X. Not Y. *The residuals.* X and Y can individually be non-normal so long as the residuals are. Check: Q-Q plot (residuals against theoretical normal quantiles — straight diagonal = normal). Shapiro-Wilk for formal test (over-sensitive at huge n). CLT cushions for large n.
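
A quick residual-normality check on simulated data where X is deliberately skewed but the residuals are normal; scipy's shapiro is the Shapiro-Wilk test mentioned above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
x = rng.exponential(size=200)                  # X itself is skewed -- that's fine
y = 1 + 2 * x + rng.normal(size=200)           # residuals are normal by construction

Xd = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta

# Shapiro-Wilk on the *residuals*, not on X or Y (over-sensitive at very large n)
stat, p = stats.shapiro(resid)
print(stat, p)                                  # p typically > .05 here
```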

Assumption 3 — Homoscedasticity (homogeneity of variance). Residual variance should be constant across the range of fitted values. Classic violation: residual variance grows with the mean (e.g., income data). Check: residuals-vs-fitted plot (should be a horizontal cloud of constant width; fan shape = heteroscedastic). Formal: Breusch-Pagan / ncvTest. Fix: transform DV (log, sqrt), robust HC standard errors ('sandwich estimators' — keeps β estimates but fixes their SEs), or weighted least squares.
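
A sketch using statsmodels on simulated heteroscedastic data: het_breuschpagan is the Breusch-Pagan test named above (ncvTest is the analogous check in R's car package), and cov_type='HC3' requests the robust sandwich SEs:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(1, 10, n)
# Error SD grows with x -> heteroscedastic by construction
y = 2 + 3 * x + rng.normal(0, x, n)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Breusch-Pagan: small p-value -> reject constant residual variance
lm, lm_p, fval, f_p = het_breuschpagan(fit.resid, X)
print("Breusch-Pagan p =", lm_p)

# Robust (HC3) standard errors: betas unchanged, SEs / p-values corrected
robust = sm.OLS(y, X).fit(cov_type="HC3")
print(fit.bse, robust.bse)     # classical vs robust SEs
```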

Assumption 4 — Exogeneity (E[ε|X] = 0). No systematic relationship between predictors and the unobserved error. Violated by: omitted confounders, measurement error in X, reverse causality. Random assignment in experiments forces exogeneity. In observational data, this is the deepest and most often-violated assumption — addressed via covariate control, instrumental variables, or causal-inference methods.

Assumption 5 — Independence of observations. Each observation contributes independent information. Violations: repeated measures (same person multiple times), time-series (close-in-time observations correlated), clustered sampling (students within schools). Standard regression underestimates SEs when violated; use mixed-effects models, time-series regression, etc.

Assumption 6 — No severe multicollinearity. Predictors should not be near-perfectly correlated with each other. Check VIF; > 5–10 = severe (see Unit 10). Inflates SEs, can flip signs, makes individual β's uninterpretable.
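
A hand-rolled VIF check in numpy (the vif helper is invented for this sketch), regressing each predictor on the others:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the others, VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        Xd = np.column_stack([np.ones(len(target)), others])
        beta, *_ = np.linalg.lstsq(Xd, target, rcond=None)
        resid = target - Xd @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean())**2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)    # nearly a copy of x1 -> collinear
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))    # x1, x2 have huge VIFs; x3 is near 1
```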

Outliers and influence. Not strictly an assumption, but critical. Cook's distance measures how much each observation moves the coefficient vector if removed. Rule: Cook's d > 1 is influential (some use 4/n as a stricter threshold). Always inspect — don't blindly drop, but understand and justify.
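
A sketch computing Cook's distance from the hat matrix on simulated data with one planted influential point (the cooks_distance helper is invented here):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D: D_i = e_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2."""
    Xd = np.column_stack([np.ones(len(y)), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T        # hat matrix
    h = np.diag(H)                                   # leverages
    resid = y - H @ y
    p = Xd.shape[1]                                  # parameters incl. intercept
    s2 = resid @ resid / (len(y) - p)                # residual variance (MSE)
    return resid**2 / (p * s2) * h / (1 - h)**2

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(size=50)
x[0], y[0] = 5.0, -20.0                              # plant one influential point
d = cooks_distance(x.reshape(-1, 1), y)
print(d.argmax(), d.max())                           # the planted point, D well above 1
```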

Residual diagnostic suite. Always produce four plots: (1) *Residuals vs fitted* — linearity + homoscedasticity (the single most useful plot); (2) *Q-Q* — normality of residuals; (3) *Scale-location* — √|standardised residuals| vs fitted (a more sensitive heteroscedasticity check); (4) *Residuals vs leverage* — Cook's distance contours highlight influential points.
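
A rough matplotlib version of the canonical 2×2 grid, built from a simple fit on simulated data (the residual standardisation here is deliberately crude):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 1 + 2 * x + rng.normal(size=200)

# Fit, then pull fitted values, residuals, and leverages from the hat matrix
Xd = np.column_stack([np.ones_like(x), x])
H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
fitted = H @ y
resid = y - fitted
leverage = np.diag(H)
std_resid = resid / resid.std(ddof=2)          # crude standardisation

fig, ax = plt.subplots(2, 2, figsize=(8, 8))
ax[0, 0].scatter(fitted, resid, s=10); ax[0, 0].axhline(0, color="grey")
ax[0, 0].set_title("Residuals vs fitted")
theo = stats.norm.ppf((np.arange(1, len(y) + 1) - 0.5) / len(y))
ax[0, 1].scatter(theo, np.sort(std_resid), s=10)
ax[0, 1].set_title("Normal Q-Q")
ax[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)), s=10)
ax[1, 0].set_title("Scale-location")
ax[1, 1].scatter(leverage, std_resid, s=10)
ax[1, 1].set_title("Residuals vs leverage")
plt.tight_layout()
plt.show()
```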

Dummy coding for categorical predictors. A categorical predictor with k levels needs dummy (0/1) variables. The omitted level is the reference category; each β is the difference vs reference. *Never* include all k dummies — perfect multicollinearity (the 'dummy variable trap').
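
A sketch of dummy coding with pandas, mirroring the transport-mode example in the Examples section: get_dummies(drop_first=True) creates the k − 1 indicators and the dropped level becomes the reference (tiny made-up dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mode": ["car", "bus", "train", "bus", "car", "train"],   # hypothetical data
    "y": [10.0, 12.0, 9.0, 13.0, 11.0, 8.0],
})

# k = 3 levels -> k - 1 = 2 dummies; drop_first drops the first level in sorted
# order ("bus"), which therefore becomes the reference category here.
dummies = pd.get_dummies(df["mode"], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
beta, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
print(dummies.columns.tolist())   # levels that got dummies; the missing one is the reference
print(beta)                       # [reference mean, level-minus-reference differences...]
```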

Model selection. Occam's razor — prefer the simplest model that fits well. Adjusted R² for nested comparison. AIC (lower better; differences > 10 strong, < 2 weak). BIC has a larger penalty — favours simpler models. Stepwise (forward/backward selection by AIC) — automated but unreliable; treat as a suggestion, not an answer. Nested F-test for comparing two specific models that differ by some predictors: F = [(SS_res,reduced − SS_res,full)/(k_full − k_reduced)] / [SS_res,full/(n − k_full − 1)].
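
A sketch of the nested F-test plus a Gaussian-likelihood AIC (computed up to an additive constant) on simulated data; the predictors echo the hours/IQ/sleep/diet example below, and the ss_res helper is invented for this sketch:

```python
import numpy as np
from scipy import stats

def ss_res(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r, Xd.shape[1] - 1          # residual SS, number of predictors

rng = np.random.default_rng(8)
n = 100
hours, iq, sleep, diet = (rng.normal(size=n) for _ in range(4))
y = 2 * hours + 1 * iq + 0.8 * sleep + rng.normal(size=n)

ss_a, k_a = ss_res(np.column_stack([hours, iq]), y)                  # reduced model
ss_b, k_b = ss_res(np.column_stack([hours, iq, sleep, diet]), y)     # full model

# Nested F-test: do the extra predictors jointly improve fit?
F = ((ss_a - ss_b) / (k_b - k_a)) / (ss_b / (n - k_b - 1))
p = stats.f.sf(F, k_b - k_a, n - k_b - 1)

# Gaussian-likelihood AIC, up to an additive constant: n*ln(SS_res/n) + 2*(k + 1)
aic = lambda ss, k: n * np.log(ss / n) + 2 * (k + 1)
print(F, p, aic(ss_a, k_a), aic(ss_b, k_b))
```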

Simpson's paradox in regression. The coefficient on a predictor can flip sign when controlling for another variable. Income ~ education looks positive *until* you control for job type, when within-job-type the relationship goes flat or negative. Omitting an important covariate produces misleading coefficients — the original positive coefficient was an artifact of education leading to professional jobs. The flip side warns that adding *or* omitting variables can change conclusions; both are decisions to defend, not defaults.
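
A small simulation of the sign-collapse: education correlates with job type, job type drives income, and the education coefficient shrinks once the group indicator enters the model (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(9)
n_per = 200
# Two job types: type 1 pays much more overall, and its workers have more education
group = np.repeat([0, 1], n_per)
education = rng.normal(12, 2, 2 * n_per) + 4 * group
income = 20_000 * group + 100 * education + rng.normal(0, 2_000, 2 * n_per)

def slope_of_education(X, y):
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta[1]                                   # coefficient on education

print(slope_of_education(education, income))                              # inflated (thousands)
print(slope_of_education(np.column_stack([education, group]), income))    # collapses toward ~100
```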

Prediction vs explanation. Two distinct goals. For *prediction* (forecasting new Y's), multicollinearity is fine — only fit matters. For *explanation* (which X drives Y), multicollinearity bites because individual β's become unstable. The two goals call for different modelling decisions.

Regression as the General Linear Model. One-sample t-test = regression on intercept only. Independent t-test = regression on a binary dummy. One-way ANOVA = regression on dummies. Factorial ANOVA = regression with multiple categorical predictors + interactions. ANCOVA = regression with categorical + continuous predictors. Pearson r = standardised simple regression. Everything is regression.
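
A numerical check of one of these equivalences: the independent-samples t-test (pooled variances) and the regression on a 0/1 dummy give the same t and p on simulated groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
g0 = rng.normal(50, 10, 25)
g1 = rng.normal(55, 10, 25)

# Classic independent-samples t-test (equal variances assumed, as in pooled-SE regression)
t_classic, p_classic = stats.ttest_ind(g1, g0)

# Same comparison as a regression on a 0/1 dummy
y = np.concatenate([g0, g1])
d = np.concatenate([np.zeros(25), np.ones(25)])
Xd = np.column_stack([np.ones(50), d])
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
resid = y - Xd @ beta
s2 = resid @ resid / (50 - 2)
se = np.sqrt(s2 * np.linalg.inv(Xd.T @ Xd)[1, 1])
t_reg = beta[1] / se
p_reg = 2 * stats.t.sf(abs(t_reg), 48)

print(t_classic, t_reg)   # identical up to rounding
print(p_classic, p_reg)   # identical p-values
```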

Definitions

  • OLS (Ordinary Least Squares) — Estimator that minimises Σε_i² = Σ(Y_i − Ŷ_i)². Closed-form solution; unbiased under regression assumptions.
  • Intercept (β₀) — Predicted Y when all X = 0. Often not directly meaningful, but anchors the line.
  • Slope / coefficient (β_j) — Predicted change in Y per unit change in X_j, holding all other predictors constant.
  • Residual (ε_i) — Difference between observed Y_i and predicted Ŷ_i. Used to compute SS_res and to check assumptions.
  • R² (coefficient of determination) — Proportion of variance in Y explained by predictors. R² = SS_reg/SS_tot = 1 − SS_res/SS_tot. Always ↑ with added predictors.
  • Adjusted R² — R² penalised by the number of predictors. Can decrease when a useless predictor is added — honest for model comparison.
  • Model F-test — Tests whether the model as a whole beats the intercept-only null. F = MS_reg/MS_res.
  • Coefficient t-test — Tests H₀: β_j = 0 via t = β_j/SE(β_j) with df = n − k − 1.
  • Standardised coefficient (β) — Slope after z-scoring X and Y. Allows magnitude comparison across predictors on different scales. For one predictor, equals r.
  • LINeM assumptions — Linearity, Independence of errors, Normality of residuals, Equal variance (homoscedasticity), no Multicollinearity. Plus exogeneity (E[ε|X] = 0).
  • Linearity in parameters — Coefficients enter linearly even if X enters non-linearly (polynomials, logs, interactions are fine).
  • Homoscedasticity — Constant residual variance across fitted values. Violation = heteroscedasticity.
  • Heteroscedasticity — Residual variance changes with X or fitted values. Biases SEs; fix with robust HC SEs or transformations.
  • Exogeneity — E[ε|X] = 0: predictors are uncorrelated with the unobserved error. Violated by omitted confounders, reverse causality, measurement error.
  • Multicollinearity — Correlated predictors. Detect via VIF > 5–10. Inflates β SEs, can flip signs.
  • Dummy variable — 0/1 indicator for a categorical level. For k levels create k − 1 dummies; one is the reference category.
  • Cook's distance — Influence diagnostic — how much each observation shifts β if removed. > 1 flags influential outliers.
  • Leverage — How extreme a data point's X-values are. High-leverage + large residual = influential.
  • AIC / BIC — Information criteria for model comparison. Lower better. AIC penalises 2k; BIC penalises ln(n)·k.
  • Nested F-test — Compares two models where one is a subset of the other. Tests whether the extra predictors collectively add fit.
  • Stepwise regression — Automated forward/backward selection by AIC. Heuristic; can disagree across directions; do not treat as theorem.
  • Simpson's paradox — Coefficient direction or magnitude flips when an additional variable is included. A sign of confounding.
  • General Linear Model — Umbrella framework: regression with continuous + categorical predictors. Subsumes t-tests, ANOVA, ANCOVA.

Formulas

  • Model: Ŷ = β₀ + β₁X₁ + … + β_kX_k; residual ε_i = Y_i − Ŷ_i.
  • Simple OLS: β₁ = cov(X, Y)/Var(X); β₀ = Ȳ − β₁X̄.
  • R² = SS_reg/SS_tot = 1 − SS_res/SS_tot.
  • Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1).
  • Model F-test: F = (SS_reg/k)/(SS_res/(n − k − 1)).
  • Coefficient t-test: t = β_j/SE(β_j), df = n − k − 1.
  • Nested F-test: F = [(SS_res,reduced − SS_res,full)/(k_full − k_reduced)] / [SS_res,full/(n − k_full − 1)].
  • Matrix form: β = (XᵀX)⁻¹XᵀY.

Derivations

OLS closed form (simple). Loss L(β₀, β₁) = Σ(Y_i − β₀ − β₁X_i)². Setting ∂L/∂β₀ = 0 gives β₀ = Ȳ − β₁X̄. Substitute back and solve ∂L/∂β₁ = 0: β₁ = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² = cov(X, Y)/Var(X).

Why R² never decreases with added predictors. Adding a predictor enlarges the parameter space. OLS minimises SS_res over this space; the minimum over a *larger* set is ≤ the minimum over a smaller set. So SS_res can only stay equal or shrink → R² can only grow. Hence the need for adjusted R².

SS partition for regression. Y_i − Ȳ = (Ŷ_i − Ȳ) + (Y_i − Ŷ_i). Square and sum: the cross-term vanishes (because OLS residuals are orthogonal to fitted values), giving SS_tot = SS_reg + SS_res. Same logic as ANOVA's SS_Total = SS_Within + SS_Between.

Standardised slope = r. After z-scoring X and Y, Var(z_X) = 1 and cov(z_X, z_Y) = r. So β₁ = cov(z_X, z_Y)/Var(z_X) = r.

Why adjusted R² can decrease. Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). Differentiate wrt k holding R² fixed: the factor (n − 1)/(n − k − 1) grows with k → (1 − R²) times a growing factor → the subtracted term grows → adjusted R² shrinks unless the new predictor raised R² enough to compensate. Useless predictor → tiny R² bump, big k penalty, net decrease.

Why one-way ANOVA = regression on dummies. Encode k groups as k − 1 dummies. The regression model becomes Y = β₀ + β₁D₁ + … + β_{k−1}D_{k−1} + ε. β₀ = mean of reference group; each β_j = mean of group j minus mean of reference. The F-test for 'all β_j = 0' is mathematically identical to the one-way ANOVA F-test.

Examples

  • Hours of study on exam score. Ŷ = 50 + 3·hours, R² = 0.6. Predict: 10 hours → 50 + 3·10 = 80. Interpretation: each extra hour of study predicts 3 more points on the exam; 60% of variance in scores explained by study hours.
  • Multiple regression — exam score on hours, IQ, sleep. Output: Intercept 50.23 (SE 4.12, p < .001), hours 2.50 (p < .001), IQ 0.81 (p < .001), sleep 1.42 (p = .006). Multiple R² = 0.65, adj R² = 0.64. F(3, 96) = 59.4, p < .001. Each hour of study → 2.50 points *holding IQ and sleep constant*; each IQ point → 0.81 points *holding hours and sleep constant*. All three predictors uniquely contribute.
  • Standardised β comparison. Raw coefficients hours = 2.5, IQ = 0.81. Can't compare — different units. Standardised: β_hours = 0.42, β_IQ = 0.31. Now: 1 SD of hours moves Y 1.35× more than 1 SD of IQ.
  • Dummy coding — transport mode (car / bus / train). Two dummies: D_bus and D_train; car is the reference. Model: Ŷ = β₀ + β₁D_bus + β₂D_train. β₀ = mean Y for car users. β₁ = bus − car difference. β₂ = train − car difference. To test bus vs train, refit with bus as reference, or test the linear contrast β₂ − β₁.
  • Curvilinear via polynomial. Productivity vs caffeine isn't monotone — peaks then falls. Fit Y = β₀ + β₁·caffeine + β₂·caffeine² + ε. Coefficient on caffeine² is significant and negative → inverted-U. Linear regression handles this because the *coefficients* are still linear.
  • Heteroscedasticity in income data. Plot residuals-vs-fitted: a clear fan, residual SD grows with predicted income. Refit using robust HC3 standard errors. β estimates unchanged; SEs (and hence p-values, CIs) corrected.
  • Influential outlier. N = 50, one extreme point with Cook's d = 1.8. Refit without it: β changes from 2.5 to 1.2 — the point was driving the result. Document, investigate why it's extreme, decide to retain (with sensitivity analysis) or remove (with justification). Never silently drop.
  • Simpson's paradox. Univariate: education → income, β = 5000, p < .001. Add job-type dummies: education β collapses to 800, p = .12. The univariate effect was largely *because more education leads to professional jobs* — partialling out job type leaves a much smaller direct effect.
  • Nested model comparison. Model A: hours + IQ; Model B: hours + IQ + sleep + diet. F-test compares them. SS_res,A = 8480, SS_res,B = 7320, k_A = 2, k_B = 4, n = 100. F = [(8480 − 7320)/2] / [7320/95] ≈ 7.5, df (2, 95), p < .001 → sleep + diet *together* significantly improve fit.
  • Regression-as-t-test. Two groups, n₁ = n₂ = 25, mean diff 5. Run independent t: t(48) = 3.4, p = .001. Run regression Y ~ D (D = 0/1 dummy): β₁ = 5 (the mean diff), t(48) = 3.4, identical p. Same test, different framing.

Diagrams

  • Scatter with OLS line + vertical residual segments showing Σε_i² as the squared-distance pile being minimised.
  • Four diagnostic plots (the canonical 2×2 grid): (top-left) Residuals vs fitted; (top-right) Q-Q of standardised residuals; (bottom-left) Scale-location; (bottom-right) Residuals vs leverage with Cook's distance contours.
  • Heteroscedasticity fan: residuals-vs-fitted plot where the spread of residuals widens systematically with fitted value.
  • Curvilinear residual pattern: residuals-vs-fitted shows a U or inverted-U → linearity assumption violated.
  • Simpson's paradox illustration: within-subgroup regression lines all sloping one way, aggregate line sloping the other.
  • Interaction plot for X1 × X2 in regression: regression line for X1 has different slopes at each level of X2.

Edge cases

  • Heteroscedasticity → robust HC3 SEs preserve β but correct inference; transformations (log, sqrt) often help.
  • Influential outliers flagged by Cook's d > 1 (or 4/n). Always inspect, never silently drop.
  • Endogeneity (predictor correlated with error) → β biased; need instrumental variables or causal-inference design.
  • Categorical with many levels (zip code, hospital) → mixed-effects / random-effects models or regularisation; raw dummies eat df.
  • Perfect multicollinearity (one predictor a linear combo of others) → software drops a column. Dummy-variable trap is the textbook example.
  • Tiny n relative to k (k near n) → overfitting; use cross-validation or regularised regression (ridge / lasso).
  • Auto-correlated residuals (time series) → Durbin-Watson test; use autoregressive errors or time-series models.
  • Non-linear relationship missed → look at residual plots before trusting any coefficient.

Common mistakes

  • Comparing models via raw R² with different numbers of predictors — use adjusted R² or AIC.
  • Inferring causation from observational regression — β reflects association after partialling, not cause.
  • Skipping residual diagnostics — assumptions matter and a single plot reveals most violations.
  • Interpreting individual β under multicollinearity — they become unstable; report jointly.
  • Including k dummies for k categorical levels (perfect collinearity — software silently drops one or errors).
  • Forgetting that 'normality' applies to residuals, not X or Y individually.
  • Treating significant overall F as proof every coefficient matters — check individual t's.
  • Confusing 'standardised β' with 'β/SE' — different quantities (standardised β is on z-score units; β/SE is a t-statistic).
  • Stepwise regression treated as ground truth — it's a heuristic, not a theorem.
  • Reporting R² without sample size or any CI — a high R² with n = 10 is fragile.

Shortcuts

  • OLS = minimise Σ(residual)².
  • Simple regression β₁ = cov(X,Y)/Var(X); β₀ = Ȳ − β₁X̄.
  • R² = SS_reg/SS_tot; always ↑ with added predictors. Adjusted R² penalises k.
  • For one predictor: R² = r²; standardised β = r.
  • Six assumptions: Linearity, Independence, Normality of residuals, Equal variance, no Multicollinearity (LINeM), plus exogeneity (E[ε|X] = 0).
  • Coefficient t-test: df = n − k − 1.
  • Dummy coding: k levels → k − 1 dummies; omitted level is the reference.
  • Cook's d > 1 (or 4/n) = influential outlier.
  • AIC lower better; BIC penalises more. Differences > 10 strong; < 2 weak.
  • Everything is regression — t-tests, ANOVA, ANCOVA, correlation all special cases.

Proofs / Algorithms

OLS minimises squared residuals (existence + uniqueness). The loss is a convex quadratic in β with Hessian 2XᵀX. If XᵀX is positive-definite (no perfect multicollinearity), there is a unique minimiser given by the normal equations XᵀXβ = XᵀY, i.e., β = (XᵀX)⁻¹XᵀY. The simple regression case is the scalar reduction.
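
A numpy sketch of the matrix form: solve the normal equations directly and confirm XᵀX is positive-definite (simulated design, invented coefficients):

```python
import numpy as np

rng = np.random.default_rng(11)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # design matrix with intercept
true_beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(size=n)

# Normal equations: (X'X) beta = X'y  ->  beta = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                               # close to true_beta

# X'X is positive-definite iff the columns are linearly independent
print(np.all(np.linalg.eigvalsh(X.T @ X) > 0))                # True: no perfect multicollinearity
```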

Standardised regression slope = Pearson r. Standardise: z_X = (X − X̄)/s_X, z_Y = (Y − Ȳ)/s_Y. Then Var(z_X) = 1 and cov(z_X, z_Y) = r. From the OLS formula: β₁ = cov(z_X, z_Y)/Var(z_X) = r. QED.