Partition, F-test, Sphericity, Post-hoc
Intuition
If t-tests answer 'do these two groups differ?', ANOVA answers it for three or more. The trick: instead of running every pairwise t-test (and paying a brutal Type I tax), ANOVA invents a single number — F — that asks 'is the variability *between group means* bigger than the variability *inside groups*?' Under H₀ (all means equal) F sits near 1; under H₁ it grows. One omnibus test, one α, no matter how many groups. The cost: F tells you *somewhere* the means differ, not *where* — that's what post-hoc tests are for.
Explanation
Why not multiple t-tests. With k groups there are k(k − 1)/2 pairwise comparisons: k=3→3, k=4→6, k=5→10. At α = 0.05 per test under all-true nulls, family-wise error climbs steeply: 3 tests ≈ 14%, 10 tests ≈ 40%. ANOVA performs every comparison simultaneously with one α — the elegant fix.
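The inflation above can be checked in a few lines (plain Python, no stats libraries — a sketch of the 1 − (1 − α)^m formula derived later in this section):

```python
def n_pairs(k):
    """Number of pairwise comparisons among k groups: k(k-1)/2."""
    return k * (k - 1) // 2

def fwer(m, alpha=0.05):
    """Family-wise error rate for m independent tests when all nulls are true."""
    return 1 - (1 - alpha) ** m

for k in (3, 4, 5):
    m = n_pairs(k)
    print(f"k={k}: {m} tests, FWER = {fwer(m):.1%}")   # ~14%, ~26%, ~40%
```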
One-way ANOVA — the design. One IV (factor) with k ≥ 3 levels, one continuous DV, between-subjects (each person in exactly one group). H₀: μ₁ = μ₂ = … = μ_k. H₁: at least one μ_j differs (notice — *omnibus*, doesn't say which).
Why 'analysis of variance' to compare means. Two kinds of variance live in any grouped dataset: between-group variance (how much group means scatter around the grand mean — the signal if treatments work) and within-group variance (how much individuals vary inside their own group — the natural noise). The ratio MSB/MSW is F. If treatments matter, MSB ≫ MSW → F large. If they don't, MSB ≈ MSW → F ≈ 1.
SS decomposition (the heart of ANOVA). Total variability splits exactly: SS_Total = SS_Between + SS_Within. SS_B = Σ_j n_j (x̄_j − x̄)² (signal). SS_W = Σ_j Σ_i (x_ij − x̄_j)² (noise). This identity is the heart of ANOVA — every flavor builds on it.
Degrees of freedom. df_Total = N − 1; df_B = k − 1; df_W = N − k. Mean squares: MSB = SS_B / df_B, MSW = SS_W / df_W. Finally F = MSB / MSW.
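The whole pipeline — SS partition, df, F, η² — fits in a short pure-Python sketch; the three groups of scores below are made up for illustration:

```python
from statistics import mean

def one_way_anova(groups):
    """SS partition, F and eta-squared for a one-way between-subjects design."""
    grand = mean([x for g in groups for x in g])
    k = len(groups)
    N = sum(len(g) for g in groups)
    means = [mean(g) for g in groups]
    ss_b = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))  # signal
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)    # noise
    df_b, df_w = k - 1, N - k
    F = (ss_b / df_b) / (ss_w / df_w)
    return ss_b, ss_w, F, ss_b / (ss_b + ss_w)

# Toy (hypothetical) scores for three groups of five:
ss_b, ss_w, F, eta_sq = one_way_anova([[4, 5, 6, 5, 5],
                                       [7, 8, 7, 9, 8],
                                       [6, 6, 7, 5, 6]])
# The identity holds exactly: SS_Total computed directly equals ss_b + ss_w
```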
Interpreting F. F-distribution is positively skewed and one-tailed (ratio is non-negative). F < 1: signal smaller than noise → no effect. F ≈ 1: noise. F large enough that p < α → reject H₀. F can't tell you direction — only that *some* difference exists.
Reporting format (memorise). F(df_B, df_W) = value, p = value, η² = value. Example: F(2, 42) = 6.47, p = .003, η² = .24. Always include effect size — without it F is half the story.
Effect size — eta-squared. η² = SS_B / SS_Total. Proportion of total variance explained by the factor. Benchmarks: < 0.01 negligible, 0.01–0.06 small, 0.06–0.14 medium, ≥ 0.14 large. Partial η² = SS_effect / (SS_effect + SS_error) is used in factorial designs to isolate one factor's contribution.
After significant F — which groups differ? Two routes. Planned contrasts (a priori, pre-specified from theory — few, focused, mild Type I cost). Post-hoc tests (exploratory, all pairwise, explicit FWER control). Choose route *before* collecting data — switching after is p-hacking.
Tukey HSD. Standard post-hoc for equal-n one-way ANOVA. Mean difference must exceed HSD = q · √(MSW / n), where q is the studentized range statistic depending on α, k, df_W. Worked: q = 3.44, MSW = 88.53, n = 15 → HSD ≈ 6.95. Any pairwise mean difference larger than 6.95 is significant.
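The threshold computation itself is one line; the inputs below are hypothetical round numbers (in practice q comes from a studentized-range table for the chosen α, k, df_W):

```python
import math

def tukey_hsd_threshold(q, msw, n):
    """HSD = q * sqrt(MSW / n) for equal group sizes n."""
    return q * math.sqrt(msw / n)

# Hypothetical inputs: q = 3.0, MSW = 100, n = 25 per group
print(tukey_hsd_threshold(3.0, 100, 25))   # 6.0
```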
Other post-hocs. Bonferroni-corrected pairwise t-tests (each p compared to α/m) — simple, conservative, good for few comparisons. Games-Howell for unequal n or unequal variances (Welch-style adjusted df). Scheffé for arbitrary linear contrasts (most conservative). Dunnett for comparing each group against a single control — more powerful when that's the focus.
ANOVA assumptions. (1) Normality of each group (Shapiro-Wilk; visualise with Q-Q). (2) Homogeneity of variance across groups (Levene's test; rule of thumb: largest variance < 4–5× smallest → ANOVA still valid). (3) Independence of observations (between-subjects design — each person in exactly one group).
Violations & remedies. Non-normal + small n → Kruskal-Wallis (rank-based, Session 8). Unequal variances → Welch's ANOVA (doesn't assume equal variances, default in modern software). Both violated → non-parametric. Normality matters less with n > 25 per group (CLT). Levene severity matters more than Levene's p-value.
Repeated-measures (RM) ANOVA. Same participants under all k conditions (or k time points). Each person is their own control. SS decomposition splits finer: SS_Total = SS_Between + SS_Subjects + SS_Error. The subject term is *pulled out of* the error, giving a smaller denominator → more power for the same n. F = MS_Between / MS_Error. df: Between = k − 1, Subjects = n − 1, Error = (k − 1)(n − 1).
Sphericity (new assumption for RM). Variances of *pairwise differences* between conditions are equal across all pairs — the within-subjects analog of homogeneity of variance. Tested by Mauchly's W. H₀ of Mauchly's: sphericity holds. p < 0.05 → violated. Don't switch tests — apply a correction.
Sphericity corrections. Greenhouse-Geisser (more conservative, recommended when ε < 0.75) and Huynh-Feldt (less conservative, for ε > 0.75). Both adjust df by multiplying by ε (an estimate of how badly sphericity is violated; ε = 1 means perfect sphericity). F stays the same; the reference distribution moves. Software reports both — pick by severity.
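The df adjustment is a straight multiplication; this sketch reuses the caffeine-dose numbers from the examples later in these notes (k = 4, n = 20, GG ε = 0.74):

```python
def corrected_df(k, n, epsilon):
    """Sphericity correction: multiply both RM-ANOVA df by epsilon.
    F itself is unchanged; only the reference distribution (and hence p) moves."""
    return epsilon * (k - 1), epsilon * (k - 1) * (n - 1)

df1, df2 = corrected_df(4, 20, 0.74)
print(round(df1, 2), round(df2, 2))   # 2.22 42.18
```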
Friedman test. Non-parametric alternative to RM-ANOVA. Ranks within each subject across conditions, then tests whether rank sums differ. Use when data are non-normal or sphericity is severely violated and corrections feel unsafe.
ANCOVA — Analysis of Covariance. ANOVA + a continuous covariate that confounds the IV-DV relationship. Two-step logic: (1) statistically remove the covariate's linear effect on DV, (2) run ANOVA on the residualised DV. Result: effect of IV *net of* the covariate. Boosts power by shrinking the error term. Extra assumption: homogeneity of regression slopes — covariate-DV relationship is the same across groups (no IV × covariate interaction).
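Step (1) of this two-step logic can be sketched as a residualisation (real ANCOVA software estimates covariate and IV effects in one model; this is only an illustration of the adjustment, on made-up numbers):

```python
from statistics import mean

def residualize(y, cov):
    """Subtract the covariate's fitted linear effect (OLS slope) from the DV."""
    my, mc = mean(y), mean(cov)
    slope = (sum((c - mc) * (v - my) for c, v in zip(cov, y))
             / sum((c - mc) ** 2 for c in cov))
    return [v - (my + slope * (c - mc)) for v, c in zip(y, cov)]

# Hypothetical DV entirely driven by the covariate: residuals collapse to ~0,
# so an ANOVA on the residuals would find no IV effect left to explain.
resid = residualize([10.0, 12.0, 14.0, 16.0, 18.0], [1.0, 2.0, 3.0, 4.0, 5.0])
```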
Factorial ANOVA — multiple IVs. A k × m design has two IVs with k and m levels. Tests three things: main effect of A (averaged over B), main effect of B (averaged over A), interaction A × B (does A's effect depend on B?). The interaction is usually the most interesting question. Parallel lines in interaction plot → no interaction; non-parallel or crossing → interaction.
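For a 2 × 2 design, 'do the lines cross?' reduces to one contrast on the cell means — zero means parallel lines. The crossover cell means below are taken from the teaching-method example later in this section:

```python
def interaction_contrast(cells):
    """2x2 interaction contrast (A1B1 - A1B2) - (A2B1 - A2B2).
    Zero -> A's effect is identical at both levels of B (parallel lines)."""
    return (cells[0][0] - cells[0][1]) - (cells[1][0] - cells[1][1])

parallel  = [[10, 14], [12, 16]]   # both rows shift by the same amount
crossover = [[78, 68], [75, 88]]   # method A wins for novices, B for experts
print(interaction_contrast(parallel))    # 0
print(interaction_contrast(crossover))   # 23
```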
MANOVA — multiple DVs. Two or more *different* DVs tested simultaneously (e.g., reaction time *and* memory). Controls Type I across the DV set and exploits correlations between DVs. Test statistics: Pillai's trace (most robust, default), Wilks' lambda (when covariance matrices unequal), plus Hotelling and Roy. Significant MANOVA → follow up with univariate ANOVAs on each DV.
MANOVA vs RM-ANOVA — the distinction the exam loves. RM-ANOVA = same DV measured multiple times on same people (sphericity matters). MANOVA = different DVs measured once (homogeneity of covariance matrices matters). Phrase to memorise: *RM = same thing many times; MANOVA = different things once.*
Mixed ANOVA. Combines a between-subjects factor (e.g., age group) and a within-subjects factor (e.g., pre/post). Common for intervention designs. Between factor uses between error term; within factor uses within error term (with sphericity check for the within factor).
Definitions
- One-way ANOVA — Omnibus F-test for differences across group means, one IV, between-subjects. Partitions SS_Total = SS_Between + SS_Within.
- F-ratio — F = MSB / MSW. Under H₀ centres near 1; under H₁ exceeds 1. Always one-tailed.
- MSB / MSW — Mean squares: SS divided by df. MSB = signal estimate; MSW = noise estimate.
- Eta-squared (η²) — Effect size = SS_B / SS_Total. Proportion of variance explained by the factor. Bands .01/.06/.14.
- Partial η² — SS_effect / (SS_effect + SS_error). Used in factorial / RM ANOVA to isolate one effect's contribution.
- Tukey HSD — Post-hoc pairwise comparisons for equal-n one-way ANOVA. Uses the studentized range q. Controls FWER.
- Bonferroni post-hoc — Run all pairwise t-tests, compare each p to α/m. Simple, conservative, good for few comparisons.
- Games-Howell — Post-hoc for unequal n or unequal variances. Welch-style df adjustment.
- Scheffé — Most conservative post-hoc; valid for arbitrary linear contrasts including non-pairwise.
- Dunnett — Post-hoc for comparing each group to a single control. More powerful when control comparisons are the focus.
- Planned contrast — Pre-specified comparison from theory or prior literature. Few in number, mild Type I cost.
- Repeated-measures ANOVA — Same participants in all conditions. SS partition adds SS_Subjects; F = MS_Between / MS_Error. More power than between-subjects.
- Sphericity — Equality of variances of pairwise differences across all condition pairs in RM-ANOVA. Tested by Mauchly's W.
- Mauchly's test — Test of sphericity. H₀: sphericity holds. p < .05 → violated → apply correction.
- Greenhouse-Geisser correction — Multiplies df by ε estimate to correct sphericity violation. Recommended when ε < 0.75.
- Huynh-Feldt correction — Less conservative sphericity correction. Recommended when ε > 0.75.
- Friedman test — Non-parametric counterpart of RM-ANOVA. Ranks within subjects across conditions.
- Kruskal-Wallis — Non-parametric counterpart of one-way ANOVA. Ranks all data, compares group rank sums.
- Welch's ANOVA — ANOVA variant that doesn't assume equal variances. Default in modern software.
- ANCOVA — ANOVA + continuous covariate. Adjusts DV for covariate's linear effect before testing IV. Assumes equal regression slopes across groups.
- Factorial ANOVA — Two or more categorical IVs. Tests main effects + interactions.
- Main effect — Effect of one IV averaged over the other(s).
- Interaction effect — Effect of one IV depends on the level of another. Non-parallel lines in interaction plot.
- MANOVA — Multivariate ANOVA — 2+ DVs tested simultaneously. Pillai's trace / Wilks' lambda. Controls Type I across DV set.
- Pillai's trace — Most robust MANOVA test statistic. Default when covariance matrices are homogeneous (Box's M test).
- Mixed ANOVA — Combines between-subjects and within-subjects factors. Common for pre/post intervention designs.
Formulas
- SS_Total = SS_Between + SS_Within (one-way); SS_Total = SS_Between + SS_Subjects + SS_Error (RM)
- SS_B = Σ_j n_j (x̄_j − x̄)²; SS_W = Σ_j Σ_i (x_ij − x̄_j)²
- df_B = k − 1; df_W = N − k; MSB = SS_B / df_B; MSW = SS_W / df_W; F = MSB / MSW
- η² = SS_B / SS_Total; partial η² = SS_effect / (SS_effect + SS_error)
- Tukey HSD = q · √(MSW / n)
- FWER over m independent tests = 1 − (1 − α)^m; Bonferroni per-test α = α_FW / m
Derivations
SS decomposition identity. For any data point x_ij in group j, write x_ij − x̄ = (x_ij − x̄_j) + (x̄_j − x̄). Square both sides and sum over all i, j. The cross-term vanishes because Σ_i (x_ij − x̄_j) = 0 within each group. Left with SS_Total = SS_W + SS_B. Total variability is exactly partitioned.
Why F = MSB/MSW tests equal means. Under H₀ (all μ_j equal), both MSB and MSW are unbiased estimators of σ² → their ratio centres near 1. Under H₁, E[MSB] = σ² + Σ_j n_j (μ_j − μ̄)² / (k − 1) — strictly larger than σ² — while E[MSW] = σ². So on average F > 1 when H₁ is true.
Why RM-ANOVA is more powerful. The within-subject SS gets split into SS_Subjects + SS_Error. The denominator MS_Error = SS_Error / [(k−1)(n−1)] is *smaller* than the one-way MSW because subject-level variance has been removed. Smaller denominator → larger F → more power to detect the same effect with the same n.
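This power gain can be demonstrated numerically: take scores (hypothetical, rows shifted by large stable subject offsets), compute the one-way MSW, then pull SS_Subjects out and recompute the RM error term:

```python
from statistics import mean

def msw_vs_ms_error(scores):
    """scores[i][j] = subject i under condition j. Returns (MSW, MS_Error):
    the between-subjects error term vs the RM error term after removing
    SS_Subjects from SS_Within."""
    n, k = len(scores), len(scores[0])
    grand = mean([v for row in scores for v in row])
    cond_means = [mean([row[j] for row in scores]) for j in range(k)]
    ss_w = sum((scores[i][j] - cond_means[j]) ** 2
               for i in range(n) for j in range(k))
    ss_subj = k * sum((mean(row) - grand) ** 2 for row in scores)
    return ss_w / (k * (n - 1)), (ss_w - ss_subj) / ((k - 1) * (n - 1))

# Subjects sit at very different baselines (10s, 20s, 30s) but move in lockstep:
msw, ms_error = msw_vs_ms_error([[10, 12, 15], [20, 22, 25], [30, 32, 35]])
# ms_error is far smaller than msw: the subject variance left the error term
```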
Greenhouse-Geisser ε derivation (sketch). ε measures departure from sphericity. For perfect sphericity ε = 1; for maximum violation ε = 1/(k − 1). Multiply both df by ε to get the corrected reference distribution. F stays unchanged — only the p-value moves (becomes larger, i.e., more conservative).
Family-wise error inflation. For m independent tests at per-test α, P(no false positive) = (1 − α)^m, so P(at least one FP) = 1 − (1 − α)^m. Bonferroni inverts: set per-test α to α_FW / m to bound family-wise error at α_FW.
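A quick numeric check that the Bonferroni inversion really bounds the family-wise rate (m = 10 tests, target α_FW = 0.05):

```python
alpha_fw, m = 0.05, 10
per_test = alpha_fw / m                     # Bonferroni per-test threshold
realised = 1 - (1 - per_test) ** m          # actual FWER at that threshold
print(f"per-test alpha = {per_test}, FWER = {realised:.4f}")   # 0.0489 <= 0.05
```

The bound is slightly conservative (0.0489 < 0.05), which is exactly why Bonferroni loses a little power.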
Examples
- Anxiety treatments (motivating example). 90 participants randomised to counselling / anti-anxiety meds / both. DV: anxiety score (lower = better). H₀: μ_C = μ_M = μ_B. F(2, 87) = 4.61, p = .013, η² = .096 → medium effect. Tukey HSD: Both vs Counselling difference = 3.4 (p = .008, sig); Meds vs Counselling = 1.8 (p = .12, ns); Meds vs Both = 1.6 (p = .047, sig). Conclusion: combination beats counselling alone; meds alone indistinguishable from counselling.
- Schooling outcomes (worked). 45 students across 3 schools (home, boarding, regular day), exam scores. Group means 78.8 / 71.87 / 84.2, variances 141 / 74 / 50. SS_B = 1146.71, SS_W = 3718.53, F(2, 42) = 6.47, p = .003, η² = .24 (large). Tukey HSD = 6.95. Differences: Home-Regular = 5.4 (ns); Boarding-Regular = 12.33 (sig); Home-Boarding = 6.93 (just below threshold).
- Study strategies — repeated-measures. 6 participants each try 3 strategies (reread, answer Qs, create-and-answer). DV: post-test score. F(2, 10) = 19.09, p < .001, η² = .79. Within-subjects linear contrast: F(1, 5) = 93.75, p < .001 — performance climbs across the three strategies in order. Mauchly's W = 0.82, p = .42 → sphericity OK, no correction needed.
- RM-ANOVA with sphericity violation. Reaction times across 4 caffeine doses, n = 20. Mauchly's W = 0.42, p = .002 → violated. Greenhouse-Geisser ε = 0.74. Original df (3, 57) → adjusted df (2.22, 42.18). Reported F(2.22, 42.18) = 8.91, p < .001 — still significant after correction.
- ANCOVA — time-of-day on RT with sleep as covariate. 60 participants tested morning or afternoon. Hours of sleep happen to be higher in the morning group (confound). ANCOVA partials sleep out: main effect of time of day after controlling for sleep, F(1, 57) = 5.2, p = .026, partial η² = .083. Without ANCOVA, the effect was inflated by sleep imbalance.
- Factorial ANOVA — caffeine × time of day on RT. 2 (time: AM, PM) × 3 (caffeine: none, some, lots). Main effect time F(1, 50) = 9.4, p = .003, η²_p = .16. Main effect caffeine F(2, 50) = 12.1, p < .001, η²_p = .33. Interaction F(2, 50) = 1.3, p = .28 (ns). Conclusion: caffeine helps regardless of time of day; effects are additive.
- Crossover interaction (the most interesting kind). Teaching method (A, B) × prior experience (novice, expert) on test score. Method A wins for novices (M = 78 vs 68); Method B wins for experts (M = 88 vs 75). Lines cross in the interaction plot. F_interaction(1, 80) = 24.6, p < .001 — must interpret the interaction; reporting only the (ns) main effect of method would be misleading.
- Mixed ANOVA — anxiety pre/post by treatment group. Between: 3 treatments. Within: 2 time points (pre, post). Main effect time F(1, 87) = 32.4, p < .001 (everyone improves). Main effect treatment F(2, 87) = 2.1, p = .13. Time × treatment interaction F(2, 87) = 7.8, p < .001 — improvement *depends* on treatment, the key finding.
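The schooling example's F and η² can be recomputed directly from its reported sums of squares (the result lands within rounding of the reported F(2, 42) = 6.47, η² = .24):

```python
ss_b, ss_w = 1146.71, 3718.53   # sums of squares from the schooling example
k, N = 3, 45
df_b, df_w = k - 1, N - k       # 2 and 42
F = (ss_b / df_b) / (ss_w / df_w)
eta_sq = ss_b / (ss_b + ss_w)
print(f"F({df_b}, {df_w}) = {F:.3f}, eta^2 = {eta_sq:.3f}")
```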
Diagrams
- SS partition diagram: total bar (length SS_Total) splits horizontally into SS_Between (signal, left) and SS_Within (noise, right). For RM: the SS_Within bar further splits into SS_Subjects + SS_Error.
- F-distribution density curve: right-skewed, one-tailed, with critical F_α shaded in the upper tail. Mean near df_W/(df_W − 2). The 'reject' region is the upper tail only.
- Group means scatter chart: three horizontal bars at group means with error bars (95% CI). Stars (* p<.05, ** p<.01, *** p<.001) above pairs that survive post-hoc.
- Interaction plot: x-axis = factor B levels, y-axis = DV mean, separate lines for each level of factor A. Parallel lines = no interaction; non-parallel = interaction; crossing = full crossover.
- Sphericity scatter: each point is the variance of a pairwise difference across conditions. Equal variances → sphericity holds. Spread of dots → violation.
- Decision-tree flowchart: count DVs and IVs → branch to t-test / ANOVA / RM-ANOVA / Welch / Kruskal-Wallis / Friedman / ANCOVA / factorial / mixed / MANOVA.
Edge cases
- Unequal sample sizes make ANOVA sensitive to variance heterogeneity. Welch's ANOVA + Games-Howell post-hoc is the safe default when n's are unbalanced.
- Severely non-normal + small n → Kruskal-Wallis (one-way) or Friedman (RM). For n > 25 per group, CLT typically saves you.
- Sphericity ε near 1 — apply no correction (or Huynh-Feldt). ε near 1/(k−1) — apply Greenhouse-Geisser (more conservative).
- Significant interaction usually overrides main effects. Always interpret the interaction first; main effects with a strong interaction can be misleading.
- ANCOVA assumption of equal slopes — test for IV × covariate interaction. Significant → ANCOVA invalid; report separate regressions per group.
- MANOVA with very correlated DVs — equivalent to running ANOVA on a single linear combination of them; consider whether the DVs are really distinct constructs.
- One-tailed F is wrong if the question is 'is variance A bigger than variance B' (e.g., variance ratio tests) — that's a different F use case from ANOVA. ANOVA F is implicitly two-sided regarding mean direction.
- Levene's test is itself underpowered with small n — visualise variances (boxplots) and use rule-of-thumb (largest variance < 4× smallest → safe).
Common mistakes
- Running multiple t-tests across 3+ groups instead of one ANOVA — inflating Type I to ~14–40%.
- Reporting only F without effect size — η² or partial η² is mandatory.
- Running post-hocs on a non-significant omnibus F (fishing expedition).
- Skipping Mauchly's test in RM-ANOVA — sphericity violations are common and inflate Type I if uncorrected.
- Interpreting main effects without checking the interaction in factorial designs — can lead to opposite conclusions.
- Confusing MANOVA with RM-ANOVA. MANOVA = different DVs. RM = same DV across times.
- Forgetting that F-tests are one-tailed; reporting two-tailed p for ANOVA F.
- Calling ANCOVA invalid for any baseline imbalance — only the homogeneity-of-slopes failure invalidates it.
- Applying Bonferroni to ANOVA post-hocs when Tukey HSD already controls FWER — needless double-correction.
- Treating non-significant F as proof of no effect — could be underpowered. Report power.
Shortcuts
- SS partition (memorise): Total = Between + Within (one-way); Total = Between + Subjects + Error (RM).
- F = MSB / MSW. df_B = k − 1; df_W = N − k.
- Reporting: F(df_B, df_W) = value, p = value, η² = value.
- η² benchmarks: .01 / .06 / .14 small / medium / large.
- **k(k − 1)/2 pairwise comparisons:** 3 / 6 / 10 for k = 3 / 4 / 5.
- Sphericity violated → Greenhouse-Geisser if ε < 0.75, else Huynh-Feldt.
- Significant interaction first, main effects second.
- Unequal variances → Welch ANOVA + Games-Howell.
- RM more powerful than between-subjects because subject variance is removed from error.
- Decision flow: count DVs and IVs, ask between/within, check assumptions → pick test.
Proofs / Algorithms
SS_Total = SS_B + SS_W (exact partition). Decompose each data point: x_ij − x̄ = (x_ij − x̄_j) + (x̄_j − x̄). Square and sum: SS_Total = SS_W + SS_B + 2 Σ_j Σ_i (x_ij − x̄_j)(x̄_j − x̄). Within each group Σ_i (x_ij − x̄_j) = 0, so the cross term vanishes. Result: SS_Total = SS_W + SS_B. QED.
E[MSW] = σ² under any H. MSW is a pooled within-group variance estimate; each group's sample variance s_j² is unbiased for σ², so the pooled value is too. E[MSB] = σ² under H₀, strictly larger under H₁. When all μ_j are equal, group means scatter only by sampling noise → MSB estimates σ². When they differ, MSB also captures the between-means spread Σ_j n_j (μ_j − μ̄)² / (k − 1). Therefore F = MSB/MSW ≈ 1 under H₀ and > 1 on average under H₁ — which is what makes F a test of mean equality.