Partition, F-test, Sphericity, Post-hoc
Intuition
If t-tests answer 'do these two groups differ?', ANOVA answers it for three or more. The trick: instead of running every pairwise t-test (and paying a brutal Type I tax), ANOVA invents a single number — F — that asks 'is the variability *between group means* bigger than the variability *inside groups*?' Under H₀ (all means equal) F sits near 1; under H₁ it grows. One omnibus test, one α, no matter how many groups. The cost: F tells you *somewhere* the means differ, not *where* — that's what post-hoc tests are for.
Explanation
Why not multiple t-tests. With k groups there are k(k − 1)/2 pairwise comparisons: k=3→3, k=4→6, k=5→10. At α = 0.05 per test under all-true nulls, family-wise error climbs steeply: 3 tests ≈ 14%, 10 tests ≈ 40%. ANOVA performs every comparison simultaneously with one α — the elegant fix.
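The inflation above can be checked in a few lines (plain Python, no stats libraries — a sketch of the 1 − (1 − α)^m formula derived later in this section):

```python
def n_pairs(k):
    """Number of pairwise comparisons among k groups: k(k-1)/2."""
    return k * (k - 1) // 2

def fwer(m, alpha=0.05):
    """Family-wise error rate for m independent tests when all nulls are true."""
    return 1 - (1 - alpha) ** m

for k in (3, 4, 5):
    m = n_pairs(k)
    print(f"k={k}: {m} tests, FWER = {fwer(m):.1%}")   # ~14%, ~26%, ~40%
```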
One-way ANOVA — the design. One IV (factor) with k ≥ 3 levels, one continuous DV, between-subjects (each person in exactly one group). H₀: μ₁ = μ₂ = … = μ_k. H₁: at least one μ_j differs (notice — *omnibus*, doesn't say which).
Why 'analysis of variance' to compare means. Two kinds of variance live in any grouped dataset: between-group variance (how much group means scatter around the grand mean — the signal if treatments work) and within-group variance (how much individuals vary inside their own group — the natural noise). The ratio MSB/MSW is F. If treatments matter, MSB ≫ MSW → F large. If they don't, MSB ≈ MSW → F ≈ 1.
SS decomposition (the heart of ANOVA). Total variability splits exactly: SS_Total = SS_Between + SS_Within. SS_B = Σ_j n_j (x̄_j − x̄)² (signal). SS_W = Σ_j Σ_i (x_ij − x̄_j)² (noise). This identity is the heart of ANOVA — every flavor builds on it.
Degrees of freedom. df_Total = N − 1; df_B = k − 1; df_W = N − k. Mean squares: MSB = SS_B / df_B, MSW = SS_W / df_W. Finally F = MSB / MSW.
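The whole pipeline — SS partition, df, F, η² — fits in a short pure-Python sketch; the three groups of scores below are made up for illustration:

```python
from statistics import mean

def one_way_anova(groups):
    """SS partition, F and eta-squared for a one-way between-subjects design."""
    grand = mean([x for g in groups for x in g])
    k = len(groups)
    N = sum(len(g) for g in groups)
    means = [mean(g) for g in groups]
    ss_b = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))  # signal
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)    # noise
    df_b, df_w = k - 1, N - k
    F = (ss_b / df_b) / (ss_w / df_w)
    return ss_b, ss_w, F, ss_b / (ss_b + ss_w)

# Toy (hypothetical) scores for three groups of five:
ss_b, ss_w, F, eta_sq = one_way_anova([[4, 5, 6, 5, 5],
                                       [7, 8, 7, 9, 8],
                                       [6, 6, 7, 5, 6]])
# The identity holds exactly: SS_Total computed directly equals ss_b + ss_w
```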
Interpreting F. F-distribution is positively skewed and one-tailed (ratio is non-negative). F < 1: signal smaller than noise → no effect. F ≈ 1: noise. F large enough that p < α → reject H₀. F can't tell you direction — only that *some* difference exists.
Reporting format (memorise). F(df_B, df_W) = value, p = value, η² = value. Example: F(2, 42) = 6.47, p = .003, η² = .24. Always include effect size — without it F is half the story.
Effect size — eta-squared. η² = SS_B / SS_Total. Proportion of total variance explained by the factor. Benchmarks: < 0.01 negligible, 0.01–0.06 small, 0.06–0.14 medium, ≥ 0.14 large. Partial η² = SS_effect / (SS_effect + SS_error) is used in factorial designs to isolate one factor's contribution.
After significant F — which groups differ? Two routes. Planned contrasts (a priori, pre-specified from theory — few, focused, mild Type I cost). Post-hoc tests (exploratory, all pairwise, explicit FWER control). Choose route *before* collecting data — switching after is p-hacking.
Tukey HSD. Standard post-hoc for equal-n one-way ANOVA. Mean difference must exceed HSD = q · √(MSW / n), where q is the studentized range statistic depending on α, k, df_W. Worked: q = 3.44, MSW = 88.53, n = 15 → HSD ≈ 6.95. Any pairwise mean difference larger than 6.95 is significant.
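The threshold computation itself is one line; the inputs below are hypothetical round numbers (in practice q comes from a studentized-range table for the chosen α, k, df_W):

```python
import math

def tukey_hsd_threshold(q, msw, n):
    """HSD = q * sqrt(MSW / n) for equal group sizes n."""
    return q * math.sqrt(msw / n)

# Hypothetical inputs: q = 3.0, MSW = 100, n = 25 per group
print(tukey_hsd_threshold(3.0, 100, 25))   # 6.0
```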
Other post-hocs. Bonferroni-corrected pairwise t-tests (each p compared to α/m) — simple, conservative, good for few comparisons. Games-Howell for unequal n or unequal variances (Welch-style adjusted df). Scheffé for arbitrary linear contrasts (most conservative). Dunnett for comparing each group against a single control — more powerful when that's the focus.
ANOVA assumptions. (1) Normality of each group (Shapiro-Wilk; visualise with Q-Q). (2) Homogeneity of variance across groups (Levene's test; rule of thumb: largest variance < 4–5× smallest → ANOVA still valid). (3) Independence of observations (between-subjects design — each person in exactly one group).
Violations & remedies. Non-normal + small n → Kruskal-Wallis (rank-based, Session 8). Unequal variances → Welch's ANOVA (doesn't assume equal variances, default in modern software). Both violated → non-parametric. Normality matters less with n > 25 per group (CLT). Levene severity matters more than Levene's p-value.
Repeated-measures (RM) ANOVA. Same participants under all k conditions (or k time points). Each person is their own control. SS decomposition splits finer: SS_Total = SS_Between + SS_Subjects + SS_Error. The subject term is *pulled out of* the error, giving a smaller denominator → more power for the same n. F = MS_Between / MS_Error. df: Between = k − 1, Subjects = n − 1, Error = (k − 1)(n − 1).
Sphericity (new assumption for RM). Variances of *pairwise differences* between conditions are equal across all pairs — the within-subjects analog of homogeneity of variance. Tested by Mauchly's W. H₀ of Mauchly's: sphericity holds. p < 0.05 → violated. Don't switch tests — apply a correction.
Sphericity corrections. Greenhouse-Geisser (more conservative, recommended when ε < 0.75) and Huynh-Feldt (less conservative, for ε > 0.75). Both adjust df by multiplying by ε (an estimate of how badly sphericity is violated; ε = 1 means perfect sphericity). F stays the same; the reference distribution moves. Software reports both — pick by severity.
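The df adjustment is a straight multiplication; this sketch reuses the caffeine-dose numbers from the examples later in these notes (k = 4, n = 20, GG ε = 0.74):

```python
def corrected_df(k, n, epsilon):
    """Sphericity correction: multiply both RM-ANOVA df by epsilon.
    F itself is unchanged; only the reference distribution (and hence p) moves."""
    return epsilon * (k - 1), epsilon * (k - 1) * (n - 1)

df1, df2 = corrected_df(4, 20, 0.74)
print(round(df1, 2), round(df2, 2))   # 2.22 42.18
```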
Friedman test. Non-parametric alternative to RM-ANOVA. Ranks within each subject across conditions, then tests whether rank sums differ. Use when data are non-normal or sphericity is severely violated and corrections feel unsafe.
ANCOVA — Analysis of Covariance. ANOVA + a continuous covariate that confounds the IV-DV relationship. Two-step logic: (1) statistically remove the covariate's linear effect on DV, (2) run ANOVA on the residualised DV. Result: effect of IV *net of* the covariate. Boosts power by shrinking the error term. Extra assumption: homogeneity of regression slopes — covariate-DV relationship is the same across groups (no IV × covariate interaction).
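Step (1) of this two-step logic can be sketched as a residualisation (real ANCOVA software estimates covariate and IV effects in one model; this is only an illustration of the adjustment, on made-up numbers):

```python
from statistics import mean

def residualize(y, cov):
    """Subtract the covariate's fitted linear effect (OLS slope) from the DV."""
    my, mc = mean(y), mean(cov)
    slope = (sum((c - mc) * (v - my) for c, v in zip(cov, y))
             / sum((c - mc) ** 2 for c in cov))
    return [v - (my + slope * (c - mc)) for v, c in zip(y, cov)]

# Hypothetical DV entirely driven by the covariate: residuals collapse to ~0,
# so an ANOVA on the residuals would find no IV effect left to explain.
resid = residualize([10.0, 12.0, 14.0, 16.0, 18.0], [1.0, 2.0, 3.0, 4.0, 5.0])
```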
Factorial ANOVA — multiple IVs. A k × m design has two IVs with k and m levels. Tests three things: main effect of A (averaged over B), main effect of B (averaged over A), interaction A × B (does A's effect depend on B?). The interaction is usually the most interesting question. Parallel lines in interaction plot → no interaction; non-parallel or crossing → interaction.
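For a 2 × 2 design, 'do the lines cross?' reduces to one contrast on the cell means — zero means parallel lines. The crossover cell means below are taken from the teaching-method example later in this section:

```python
def interaction_contrast(cells):
    """2x2 interaction contrast (A1B1 - A1B2) - (A2B1 - A2B2).
    Zero -> A's effect is identical at both levels of B (parallel lines)."""
    return (cells[0][0] - cells[0][1]) - (cells[1][0] - cells[1][1])

parallel  = [[10, 14], [12, 16]]   # both rows shift by the same amount
crossover = [[78, 68], [75, 88]]   # method A wins for novices, B for experts
print(interaction_contrast(parallel))    # 0
print(interaction_contrast(crossover))   # 23
```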
MANOVA — multiple DVs. Two or more *different* DVs tested simultaneously (e.g., reaction time *and* memory). Controls Type I across the DV set and exploits correlations between DVs. Test statistics: Pillai's trace (most robust, default), Wilks' lambda (when covariance matrices unequal), plus Hotelling and Roy. Significant MANOVA → follow up with univariate ANOVAs on each DV.
MANOVA vs RM-ANOVA — the distinction the exam loves. RM-ANOVA = same DV measured multiple times on same people (sphericity matters). MANOVA = different DVs measured once (homogeneity of covariance matrices matters). Phrase to memorise: *RM = same thing many times; MANOVA = different things once.*
Mixed ANOVA. Combines a between-subjects factor (e.g., age group) and a within-subjects factor (e.g., pre/post). Common for intervention designs. Between factor uses between error term; within factor uses within error term (with sphericity check for the within factor).
Definitions
- One-way ANOVA — Omnibus F-test for differences across group means, one IV, between-subjects. Partitions SS_Total = SS_Between + SS_Within.
- F-ratio — F = MSB / MSW. Under H₀ centres near 1; under H₁ exceeds 1. Always one-tailed.
- MSB / MSW — Mean squares: SS divided by df. MSB = signal estimate; MSW = noise estimate.
- Eta-squared (η²) — Effect size = SS_B / SS_Total. Proportion of variance explained by the factor. Bands .01/.06/.14.
- Partial η² — SS_effect / (SS_effect + SS_error). Used in factorial / RM ANOVA to isolate one effect's contribution.
- Tukey HSD — Post-hoc pairwise comparisons for equal-n one-way ANOVA. Uses the studentized range q. Controls FWER.
- Bonferroni post-hoc — Run all pairwise t-tests, compare each p to α/m. Simple, conservative, good for few comparisons.
- Games-Howell — Post-hoc for unequal n or unequal variances. Welch-style df adjustment.
- Scheffé — Most conservative post-hoc; valid for arbitrary linear contrasts including non-pairwise.
- Dunnett — Post-hoc for comparing each group to a single control. More powerful when control comparisons are the focus.
- Planned contrast — Pre-specified comparison from theory or prior literature. Few in number, mild Type I cost.
- Repeated-measures ANOVA — Same participants in all conditions. SS partition adds SS_Subjects; F = MS_Between / MS_Error. More power than between-subjects.
- Sphericity — Equality of variances of pairwise differences across all condition pairs in RM-ANOVA. Tested by Mauchly's W.
- Mauchly's test — Test of sphericity. H₀: sphericity holds. p < .05 → violated → apply correction.
- Greenhouse-Geisser correction — Multiplies df by ε estimate to correct sphericity violation. Recommended when ε < 0.75.
- Huynh-Feldt correction — Less conservative sphericity correction. Recommended when ε > 0.75.
- Friedman test — Non-parametric counterpart of RM-ANOVA. Ranks within subjects across conditions.
- Kruskal-Wallis — Non-parametric counterpart of one-way ANOVA. Ranks all data, compares group rank sums.
- Welch's ANOVA — ANOVA variant that doesn't assume equal variances. Default in modern software.
- ANCOVA — ANOVA + continuous covariate. Adjusts DV for covariate's linear effect before testing IV. Assumes equal regression slopes across groups.
- Factorial ANOVA — Two or more categorical IVs. Tests main effects + interactions.
- Main effect — Effect of one IV averaged over the other(s).
- Interaction effect — Effect of one IV depends on the level of another. Non-parallel lines in interaction plot.
- MANOVA — Multivariate ANOVA — 2+ DVs tested simultaneously. Pillai's trace / Wilks' lambda. Controls Type I across DV set.
- Pillai's trace — Most robust MANOVA test statistic. Default when covariance matrices are homogeneous (Box's M test).
- Mixed ANOVA — Combines between-subjects and within-subjects factors. Common for pre/post intervention designs.
Formulas
- SS_Total = SS_Between + SS_Within (one-way); SS_Total = SS_Between + SS_Subjects + SS_Error (RM)
- SS_B = Σ_j n_j (x̄_j − x̄)²; SS_W = Σ_j Σ_i (x_ij − x̄_j)²
- df_B = k − 1; df_W = N − k; MSB = SS_B / df_B; MSW = SS_W / df_W; F = MSB / MSW
- η² = SS_B / SS_Total; partial η² = SS_effect / (SS_effect + SS_error)
- Tukey HSD = q · √(MSW / n)
- FWER over m independent tests = 1 − (1 − α)^m; Bonferroni per-test α = α_FW / m
Derivations
SS decomposition identity. For any data point x_ij in group j, write x_ij − x̄ = (x_ij − x̄_j) + (x̄_j − x̄). Square both sides and sum over all i, j. The cross-term vanishes because Σ_i (x_ij − x̄_j) = 0 within each group. Left with SS_Total = SS_W + SS_B. Total variability is exactly partitioned.
Why F = MSB/MSW tests equal means. Under H₀ (all μ_j equal), both MSB and MSW are unbiased estimators of σ² → their ratio centres near 1. Under H₁, E[MSB] = σ² + Σ_j n_j (μ_j − μ̄)² / (k − 1) — strictly larger than σ² — while E[MSW] = σ². So on average F > 1 when H₁ is true.
Why RM-ANOVA is more powerful. The within-subject SS gets split into SS_Subjects + SS_Error. The denominator MS_Error = SS_Error / [(k−1)(n−1)] is *smaller* than the one-way MSW because subject-level variance has been removed. Smaller denominator → larger F → more power to detect the same effect with the same n.
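This power gain can be demonstrated numerically: take scores (hypothetical, rows shifted by large stable subject offsets), compute the one-way MSW, then pull SS_Subjects out and recompute the RM error term:

```python
from statistics import mean

def msw_vs_ms_error(scores):
    """scores[i][j] = subject i under condition j. Returns (MSW, MS_Error):
    the between-subjects error term vs the RM error term after removing
    SS_Subjects from SS_Within."""
    n, k = len(scores), len(scores[0])
    grand = mean([v for row in scores for v in row])
    cond_means = [mean([row[j] for row in scores]) for j in range(k)]
    ss_w = sum((scores[i][j] - cond_means[j]) ** 2
               for i in range(n) for j in range(k))
    ss_subj = k * sum((mean(row) - grand) ** 2 for row in scores)
    return ss_w / (k * (n - 1)), (ss_w - ss_subj) / ((k - 1) * (n - 1))

# Subjects sit at very different baselines (10s, 20s, 30s) but move in lockstep:
msw, ms_error = msw_vs_ms_error([[10, 12, 15], [20, 22, 25], [30, 32, 35]])
# ms_error is far smaller than msw: the subject variance left the error term
```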
Greenhouse-Geisser ε derivation (sketch). ε measures departure from sphericity. For perfect sphericity ε = 1; for maximum violation ε = 1/(k − 1). Multiply both df by ε to get the corrected reference distribution. F stays unchanged — only the p-value moves (becomes larger, i.e., more conservative).
Family-wise error inflation. For m independent tests at per-test α, P(no false positive) = (1 − α)^m, so P(at least one FP) = 1 − (1 − α)^m. Bonferroni inverts: set per-test α to α_FW / m to bound family-wise error at α_FW.
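A quick numeric check that the Bonferroni inversion really bounds the family-wise rate (m = 10 tests, target α_FW = 0.05):

```python
alpha_fw, m = 0.05, 10
per_test = alpha_fw / m                     # Bonferroni per-test threshold
realised = 1 - (1 - per_test) ** m          # actual FWER at that threshold
print(f"per-test alpha = {per_test}, FWER = {realised:.4f}")   # 0.0489 <= 0.05
```

The bound is slightly conservative (0.0489 < 0.05), which is exactly why Bonferroni loses a little power.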
Examples
- Anxiety treatments (motivating example). 90 participants randomised to counselling / anti-anxiety meds / both. DV: anxiety score (lower = better). H₀: μ_C = μ_M = μ_B. F(2, 87) = 4.61, p = .013, η² = .096 → medium effect. Tukey HSD: Both vs Counselling difference = 3.4 (p = .008, sig); Meds vs Counselling = 1.8 (p = .12, ns); Meds vs Both = 1.6 (p = .047, sig). Conclusion: combination beats counselling alone; meds alone indistinguishable from counselling.
- Schooling outcomes (worked). 45 students across 3 schools (home, boarding, regular day), exam scores. Group means 78.8 / 71.87 / 84.2, variances 141 / 74 / 50. SS_B = 1146.71, SS_W = 3718.53, F(2, 42) = 6.47, p = .003, η² = .24 (large). Tukey HSD = 6.95. Differences: Home-Regular = 5.4 (ns); Boarding-Regular = 12.33 (sig); Home-Boarding = 6.93 (just below threshold).
- Study strategies — repeated-measures. 6 participants each try 3 strategies (reread, answer Qs, create-and-answer). DV: post-test score. F(2, 10) = 19.09, p < .001, η² = .79. Within-subjects linear contrast: F(1, 5) = 93.75, p < .001 — performance climbs across the three strategies in order. Mauchly's W = 0.82, p = .42 → sphericity OK, no correction needed.
- RM-ANOVA with sphericity violation. Reaction times across 4 caffeine doses, n = 20. Mauchly's W = 0.42, p = .002 → violated. Greenhouse-Geisser ε = 0.74. Original df (3, 57) → adjusted df (2.22, 42.18). Reported F(2.22, 42.18) = 8.91, p < .001 — still significant after correction.
- ANCOVA — time-of-day on RT with sleep as covariate. 60 participants tested morning or afternoon. Hours of sleep happen to be higher in the morning group (confound). ANCOVA partials sleep out: main effect of time of day after controlling for sleep, F(1, 57) = 5.2, p = .026, partial η² = .083. Without ANCOVA, the effect was inflated by sleep imbalance.
- Factorial ANOVA — caffeine × time of day on RT. 2 (time: AM, PM) × 3 (caffeine: none, some, lots). Main effect time F(1, 50) = 9.4, p = .003, η²_p = .16. Main effect caffeine F(2, 50) = 12.1, p < .001, η²_p = .33. Interaction F(2, 50) = 1.3, p = .28 (ns). Conclusion: caffeine helps regardless of time of day; effects are additive.
- Crossover interaction (the most interesting kind). Teaching method (A, B) × prior experience (novice, expert) on test score. Method A wins for novices (M = 78 vs 68); Method B wins for experts (M = 88 vs 75). Lines cross in the interaction plot. F_interaction(1, 80) = 24.6, p < .001 — must interpret the interaction; reporting only the (ns) main effect of method would be misleading.
- Mixed ANOVA — anxiety pre/post by treatment group. Between: 3 treatments. Within: 2 time points (pre, post). Main effect time F(1, 87) = 32.4, p < .001 (everyone improves). Main effect treatment F(2, 87) = 2.1, p = .13. Time × treatment interaction F(2, 87) = 7.8, p < .001 — improvement *depends* on treatment, the key finding.
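The schooling example's F and η² can be recomputed directly from its reported sums of squares (the result lands within rounding of the reported F(2, 42) = 6.47, η² = .24):

```python
ss_b, ss_w = 1146.71, 3718.53   # sums of squares from the schooling example
k, N = 3, 45
df_b, df_w = k - 1, N - k       # 2 and 42
F = (ss_b / df_b) / (ss_w / df_w)
eta_sq = ss_b / (ss_b + ss_w)
print(f"F({df_b}, {df_w}) = {F:.3f}, eta^2 = {eta_sq:.3f}")
```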
Diagrams
- SS partition diagram: total bar (length SS_Total) splits horizontally into SS_Between (signal, left) and SS_Within (noise, right). For RM: the SS_Within bar further splits into SS_Subjects + SS_Error.
- F-distribution density curve: right-skewed, one-tailed, with critical F_α shaded in the upper tail. Mean near df_W/(df_W − 2). The 'reject' region is the upper tail only.
- Group means scatter chart: three horizontal bars at group means with error bars (95% CI). Stars (* p<.05, ** p<.01, *** p<.001) above pairs that survive post-hoc.
- Interaction plot: x-axis = factor B levels, y-axis = DV mean, separate lines for each level of factor A. Parallel lines = no interaction; non-parallel = interaction; crossing = full crossover.
- Sphericity scatter: each point is the variance of a pairwise difference across conditions. Equal variances → sphericity holds. Spread of dots → violation.
- Decision-tree flowchart: count DVs and IVs → branch to t-test / ANOVA / RM-ANOVA / Welch / Kruskal-Wallis / Friedman / ANCOVA / factorial / mixed / MANOVA.
Edge cases
- Unequal sample sizes make ANOVA sensitive to variance heterogeneity. Welch's ANOVA + Games-Howell post-hoc is the safe default when n's are unbalanced.
- Severely non-normal + small n → Kruskal-Wallis (one-way) or Friedman (RM). For n > 25 per group, CLT typically saves you.
- Sphericity ε near 1 — apply no correction (or Huynh-Feldt). ε near 1/(k−1) — apply Greenhouse-Geisser (more conservative).
- Significant interaction usually overrides main effects. Always interpret the interaction first; main effects with a strong interaction can be misleading.
- ANCOVA assumption of equal slopes — test for IV × covariate interaction. Significant → ANCOVA invalid; report separate regressions per group.
- MANOVA with very correlated DVs — equivalent to running ANOVA on a single linear combination of them; consider whether the DVs are really distinct constructs.
- One-tailed F is wrong if the question is 'is variance A bigger than variance B' (e.g., variance ratio tests) — that's a different F use case from ANOVA. ANOVA F is implicitly two-sided regarding mean direction.
- Levene's test is itself underpowered with small n — visualise variances (boxplots) and use rule-of-thumb (largest variance < 4× smallest → safe).
Common mistakes
- Running multiple t-tests across 3+ groups instead of one ANOVA — inflating Type I to ~14–40%.
- Reporting only F without effect size — η² or partial η² is mandatory.
- Running post-hocs on a non-significant omnibus F (fishing expedition).
- Skipping Mauchly's test in RM-ANOVA — sphericity violations are common and inflate Type I if uncorrected.
- Interpreting main effects without checking the interaction in factorial designs — can lead to opposite conclusions.
- Confusing MANOVA with RM-ANOVA. MANOVA = different DVs. RM = same DV across times.
- Forgetting that F-tests are one-tailed; reporting two-tailed p for ANOVA F.
- Calling ANCOVA invalid for any baseline imbalance — only the homogeneity-of-slopes failure invalidates it.
- Applying Bonferroni to ANOVA post-hocs when Tukey HSD already controls FWER — needless double-correction.
- Treating non-significant F as proof of no effect — could be underpowered. Report power.
Shortcuts
- SS partition (memorise): Total = Between + Within (one-way); Total = Between + Subjects + Error (RM).
- F = MSB / MSW. df_B = k − 1; df_W = N − k.
- Reporting: F(df_B, df_W) = value, p = value, η² = value.
- η² benchmarks: .01 / .06 / .14 small / medium / large.
- **k(k − 1)/2 pairwise comparisons:** 3 / 6 / 10 for k = 3 / 4 / 5.
- Sphericity violated → Greenhouse-Geisser if ε < 0.75, else Huynh-Feldt.
- Significant interaction first, main effects second.
- Unequal variances → Welch ANOVA + Games-Howell.
- RM more powerful than between-subjects because subject variance is removed from error.
- Decision flow: count DVs and IVs, ask between/within, check assumptions → pick test.
Proofs / Algorithms
SS_Total = SS_B + SS_W (exact partition). Decompose each data point: x_ij − x̄ = (x_ij − x̄_j) + (x̄_j − x̄). Square and sum: SS_Total = SS_W + SS_B + 2 Σ_j Σ_i (x_ij − x̄_j)(x̄_j − x̄). Within each group Σ_i (x_ij − x̄_j) = 0, so the cross term vanishes. Result: SS_Total = SS_W + SS_B. QED.
E[MSW] = σ² under any H. MSW is a pooled within-group variance estimate; each group's sample variance s_j² is unbiased for σ², so the pooled value is too. E[MSB] = σ² under H₀, strictly larger under H₁. When all μ_j are equal, group means scatter only by sampling noise → MSB estimates σ². When they differ, MSB also captures the between-means spread Σ_j n_j (μ_j − μ̄)² / (k − 1). Therefore F = MSB/MSW ≈ 1 under H₀ and > 1 on average under H₁ — which is what makes F a test of mean equality.