FWER vs FDR; Bonferroni, Holm, BH
How Chance Sneaks Into Significance
A CEO walks into Maya's lab with a drug. He thinks it improves memory. Maya runs a clean experiment, computes the test, and reports back: p = 0.12. No significant effect on memory.
The CEO is not happy. *"Hmm. Reanalyse the data and see if it improves concentration."*
Maya runs the test again. p = 0.18. Nothing.
*"Reaction time?"* p = 0.21. *"Verbal fluency?"* p = 0.09. *"Spatial reasoning?"* p = 0.34.
After twenty different cognitive measures, on the twenty-first, something pops. p = 0.043. *"Executive control!"* the CEO exclaims. *"It's a miracle executive-control drug. Let's raise 100 crores."*
This is the multiple comparisons problem, and it's one of the most consequential pitfalls in the entire course.
The intuition: why p < 0.05 stops meaning what it should
Recall α = 0.05 means we accept a 5% chance of a Type I error per test.
If you do one test under a true null hypothesis, your false-positive rate is 5%. That's the deal.
If you do twenty tests, all under true null hypotheses, what's the chance of at least one false positive? For independent tests it's 1 − (1 − α)^m = 1 − 0.95^20 ≈ 0.64.
So with 20 independent tests under the null, you have a 64% chance of finding at least one "significant" result by pure chance. Cherry-picking the significant one and reporting it is a recipe for false discoveries.
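A quick sketch of that arithmetic in Python — nothing more than the formula 1 − (1 − α)^m evaluated for a few values of m:

```python
# FWER for m independent tests, each run at alpha = 0.05, all nulls true.
alpha = 0.05
for m in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m        # P(at least one false positive)
    print(f"m = {m:>2}: FWER = {fwer:.3f}")
# m = 20 -> FWER = 0.642
```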
The coin analogue
Take a fair coin and flip it 10 times; you might get 9 heads and 1 tail — would you call it unfair? Maybe. Now clone the coin 19 times and flip each of the 20 coins 10 times. If just one of the 20 shows 9H+1T, do you still call it unfair? No — with 20 coins, one extreme outcome is expected by chance.
The structure is identical to the drug company running 20 tests and reporting only the significant one.
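To make the analogy concrete, here is a small sketch; the "≥ 9 heads" cutoff is my assumption about what counts as extreme. It computes the chance analytically and checks it by simulation:

```python
import numpy as np
from scipy.stats import binom

p_extreme = binom.sf(8, n=10, p=0.5)       # P(>= 9 heads in 10 fair flips) ~ 0.011
p_any_of_20 = 1 - (1 - p_extreme) ** 20    # ~ 0.19: one suspicious coin among 20 is unremarkable

# Simulation check: 100k repeats of "flip 20 fair coins 10 times each"
rng = np.random.default_rng(0)
flips = rng.binomial(n=10, p=0.5, size=(100_000, 20))
print(p_extreme, p_any_of_20, (flips >= 9).any(axis=1).mean())
```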
Family-Wise Error Rate (FWER) and False Discovery Rate (FDR)
Let m = number of tests. Of these, some reach significance — call that number R, the "discoveries"; among the discoveries, some are true positives (TP) and some are false positives (FP).
| | H₀ true | H₀ false |
| --- | --- | --- |
| Reject H₀ | FP | TP |
| Don't reject | TN | FN |
FWER = P(FP ≥ 1) — probability of at least one false positive across all m tests. "Did I make any mistake at all?" Event-based.
FDR = E[FP/R] — expected proportion of false discoveries among rejections (with FP/R taken as 0 when R = 0). "How many of my claims are wrong?" Proportion-based.
Both equal α for a single test. They diverge with multiple tests.
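A toy Monte Carlo makes the difference tangible. The setup below is an assumption chosen for illustration (90 true nulls, 10 real effects, two-sample t-tests at raw α = 0.05), not anything from the lesson:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha, n, n_sims = 0.05, 50, 500
effects = np.r_[np.zeros(90), np.full(10, 0.8)]   # 90 true nulls, 10 real mean shifts
any_fp, fdp = 0, []

for _ in range(n_sims):
    p = np.array([ttest_ind(rng.normal(0, 1, n), rng.normal(d, 1, n)).pvalue
                  for d in effects])
    reject = p < alpha
    fp, r = reject[:90].sum(), reject.sum()       # false positives vs total rejections
    any_fp += fp > 0
    fdp.append(fp / r if r else 0.0)

print("FWER ~", any_fp / n_sims)                  # P(any false positive): close to 1
print("FDR  ~", np.mean(fdp))                     # share of claims that are wrong: far lower
```

Uncorrected testing makes at least one false claim in almost every experiment, yet only a fraction of the individual claims are wrong — exactly the distinction between the two criteria.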
When to control which?
- FWER (Bonferroni, Holm) — when even a single false positive is unacceptable. Confirmatory clinical trials, replication studies. Very conservative.
- FDR (Benjamini-Hochberg) — when you can live with some false positives in exchange for not missing real effects. Exploratory genomics, neuroimaging. More power than FWER.
Quick mnemonic:
- FWER: *"Did I make any mistake?"* Very conservative. Confirmatory.
- FDR: *"How many of my claims are wrong?"* More power. Exploratory.
Bonferroni correction — simple but conservative
The simplest FWER correction: if you do m tests, divide α by m and use that as the per-test criterion: α_per-test = α/m.
*Example:* 50 t-tests at original α = 0.05. Adjusted α = 0.001 per test.
Why does Bonferroni work? Boole's inequality (the union bound): P(A₁ ∪ … ∪ Aₘ) ≤ P(A₁) + … + P(Aₘ), so testing each of m hypotheses at α/m gives FWER ≤ m · (α/m) = α.
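A minimal sketch of the correction itself (plain NumPy; the function name is just illustrative):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    p = np.asarray(pvals, dtype=float)
    m = p.size
    reject = p < alpha / m                   # per-test criterion alpha/m
    p_adjusted = np.minimum(p * m, 1.0)      # equivalent view: inflate p-values, cap at 1
    return reject, p_adjusted

# 50 tests at alpha = 0.05 -> per-test criterion 0.05/50 = 0.001
reject, p_adj = bonferroni(np.random.default_rng(2).uniform(size=50))
print(reject.sum(), "rejections")
```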
Problems with Bonferroni
The exam may ask for these explicitly:
1. Too stringent. Reducing the per-test criterion dramatically increases Type II error (β) — we miss real effects.
2. Treats the tests as if they were independent. In reality, tests are often correlated (adjacent brain regions, related survey items), and Bonferroni then over-corrects.
3. Doesn't depend on the data's structure — only on m.
So Bonferroni is the right answer when m is small or tests are genuinely independent. When tests are correlated, it's overkill.
Holm correction — Bonferroni's smarter cousin
A sequential procedure that controls FWER but is less stringent:
1. Compute p-values for all m tests.
2. Sort them: p(1) ≤ p(2) ≤ … ≤ p(m).
3. Compare the smallest to α/m. If significant, reject.
4. Compare the next smallest to α/(m − 1). If significant, reject.
5. Continue with α/(m − 2), α/(m − 3), … until the first non-significant test. Stop there.
The key idea: as you reject hypotheses, the "effective" number of tests goes down, so the criterion gets less stringent. Holm gives the same FWER protection as Bonferroni but more power. No good reason to use Bonferroni instead.
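A sketch of the step-down procedure (function name illustrative); run on the five p-values from the BH worked example further down, it rejects the same two hypotheses:

```python
import numpy as np

def holm(pvals, alpha=0.05):
    p = np.asarray(pvals, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    for k, idx in enumerate(np.argsort(p)):   # walk through p-values, smallest first
        if p[idx] < alpha / (m - k):          # thresholds alpha/m, alpha/(m-1), ...
            reject[idx] = True
        else:
            break                             # first failure: stop; the rest stay non-significant
    return reject

print(holm([0.001, 0.008, 0.039, 0.041, 0.250]))   # [ True  True False False False]
```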
Benjamini-Hochberg — FDR control
When you care about FDR instead of FWER:
1. Sort the m p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
2. Compute the BH critical value for each rank: (i/m)·Q, where Q is the chosen FDR level (e.g., 0.05).
3. **Find the largest i such that p(i) ≤ (i/m)·Q.**
4. Reject the null for all tests with rank up to and including that i.
Worked example
m = 5, Q = 0.05. p-values sorted to (0.001, 0.008, 0.039, 0.041, 0.250):
| Rank i | p-value | Critical (i/5)·0.05 | Significant? |
| --- | --- | --- | --- |
| 1 | 0.001 | 0.010 | ✓ |
| 2 | 0.008 | 0.020 | ✓ |
| 3 | 0.039 | 0.030 | ✗ |
| 4 | 0.041 | 0.040 | ✗ |
| 5 | 0.250 | 0.050 | ✗ |
Largest passing rank is 2 → reject tests 1 and 2.
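The same worked example as a short sketch (function name illustrative):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    crit = (np.arange(1, m + 1) / m) * q              # (i/m) * Q for i = 1..m
    passing = np.nonzero(p[order] <= crit)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size:                                  # largest i with p(i) <= (i/m) Q ...
        reject[order[: passing[-1] + 1]] = True       # ... reject every rank up to it
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.250]))   # [ True  True False False False]
```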
BH gives more power than Bonferroni when many nulls are actually false. In genomics with thousands of tests where you expect many real effects, BH-style FDR control is standard.
Permutation-based correction — using the data's structure
Neither Bonferroni nor BH uses the correlation structure of the data — both are calibrated as if the tests were independent. In practice, independence is often false:
- Adjacent brain regions in fMRI have correlated activity.
- Related survey questions tap overlapping constructs.
- Multiple measurements from the same participant are dependent.
The general approach: permutation tests.
1. You ran m tests, each producing an "uncorrected" p-value.
2. Randomly permute the data many times (1000+) — preserving structure that isn't of interest (correlation across measurements) but breaking the structure of interest (group assignment).
3. Run all m tests on each permutation. Under randomisation, all m nulls are true by construction.
4. Record the distribution of the most extreme result across permutations. Its 95th percentile is the empirical threshold for 5% FWER.
5. Use that empirical threshold as the criterion for the original data.
The beauty: this automatically accounts for correlation among tests. You recover power without sacrificing FWER control. Standard in neuroimaging.
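A toy sketch of the max-statistic flavour of this idea, under assumed conditions (two groups, a dozen correlated measures, a real effect injected into the first measure only); all names and numbers here are illustrative:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
n_per_group, m, n_perm = 30, 12, 1000
cov = 0.4 * np.eye(m) + 0.6 * np.ones((m, m))            # correlated outcome measures
X = rng.multivariate_normal(np.zeros(m), cov, size=2 * n_per_group)
labels = np.r_[np.zeros(n_per_group, bool), np.ones(n_per_group, bool)]
X[labels, 0] += 1.0                                       # real group effect on measure 0 only

def abs_t(data, lab):                                     # |t| for all m measures at once
    return np.abs(ttest_ind(data[lab], data[~lab], axis=0).statistic)

observed = abs_t(X, labels)
max_null = np.array([abs_t(X, rng.permutation(labels)).max() for _ in range(n_perm)])
threshold = np.quantile(max_null, 0.95)                   # empirical 5% FWER threshold
print("threshold:", round(threshold, 2),
      "significant measures:", np.nonzero(observed > threshold)[0])
```

Because the threshold comes from the maximum statistic across all m measures, the correlation among them is baked into the null distribution automatically.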
Some quick worked numbers for the exam
For α = 0.05 and m tests under all-true-nulls (independent):
| m | FWER |
| --- | --- |
| 2 | 0.0975 (~10%) |
| 3 | 0.143 (~14%) |
| 5 | 0.226 (~23%) |
| 10 | 0.401 (~40%) |
| 20 | 0.642 (~64%) |
| 100 | 0.994 (~99.4%) |
Memorise m = 20 → 64%. Easy exam fodder.
Choosing a correction
| Situation | Method |
| --- | --- |
| m small (< 5), confirmatory | Holm |
| m moderate (5–50), confirmatory | Holm |
| m large (100s+), exploratory | Benjamini-Hochberg |
| Correlated tests (fMRI, related surveys) | Permutation |
Meta-rule: report what you did. The reader needs to know how many tests you ran and which correction you applied. Failing to disclose multiple testing is itself a form of p-hacking.
Why this connects to the replication crisis
Multiple comparisons isn't just a technical detail — it's one of the causes of the replication crisis. Combined with publication bias (Unit 1) and p-hacking, multiple comparisons inflate the apparent rate of "discoveries" in the literature far beyond the true rate. The drug company at the top of this lesson is a fictional version of what happens, in subtler forms, all the time.
Pre-registration is the systemic fix. Lock in hypotheses, design, and analyses before data collection.
What you carry into the exam
- Multiple comparisons inflate the Type I error rate. m = 20, α = 0.05 → FWER ≈ 64%.
- FWER = P(any FP); FDR = E[FP/R]. Different scales.
- Bonferroni: α/m. Simple, conservative, assumes independence.
- Holm: stepwise FWER; uniformly more powerful than Bonferroni.
- Benjamini-Hochberg: stepwise FDR; reject all tests up to the largest i with p(i) ≤ (i/m)·Q.
- Permutation tests for correlated data.
- Use FWER for confirmatory; FDR for exploratory.
- Pre-register to avoid the garden of forking paths.
When you're ready, send "next" and we'll move into non-parametric and categorical tests — Mann-Whitney, Wilcoxon, Kruskal-Wallis, Friedman, χ², McNemar — for when parametric assumptions fail.