FWER vs FDR; Bonferroni, Holm, BH
Intuition
Run m independent tests at α = 0.05 each → P(at least one false positive) = 1 − (1 − α)^m. For m = 20: 64%! Multiple comparisons inflate Type I error far beyond nominal α. Two control strategies: FWER (Family-Wise Error Rate) — P(any FP) ≤ α; conservative; Bonferroni and Holm. FDR (False Discovery Rate) — E[FP/rejections] ≤ Q; less conservative; Benjamini-Hochberg. Use FWER for few costly tests; FDR for many exploratory tests.
Explanation
The drug-company story. A CEO walks into Maya's lab with a memory drug. Maya runs the test: p = 0.12. CEO: 'try concentration'. p = 0.18. 'Reaction time?' p = 0.21. 'Verbal fluency?' p = 0.09. After 20 cognitive measures, the 21st shows p = 0.043 — 'It's an executive-control drug!' This is the multiple-comparisons problem — running many tests inflates false positives.
The math. Recall α = 0.05 → 5% chance of Type I error per test. One test: P(no FP) = 0.95. Twenty *independent* tests: P(no FP on any) = 0.95²⁰ ≈ 0.36. P(at least one FP) ≈ 0.64 — 64% chance of a 'significant' result by pure chance.
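A two-line check of this arithmetic (a sketch; `fwer` is just a helper name):

```python
def fwer(m, alpha=0.05):
    """P(at least one false positive) across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 2, 5, 10, 20, 100):
    print(f"m={m:>3}: FWER = {fwer(m):.3f}")
```

At m = 20 this prints ≈ 0.642, the 64% from the story.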
Coin analogue. Take a fair coin, flip 10 times, get 9H+1T. Would you call it unfair? Maybe. Now clone the coin 19 times, flip each. If just *one* of 20 coins shows 9H+1T, do you call it unfair? No — with 20 coins, one extreme outcome is expected by chance. The structure is identical to the drug-company example.
Cherry-picking is the failure mode. Define a criterion that is unlikely under the null (p < 0.05; 9H out of 10). Repeat the experiment many times. Highlight only the most extreme result. Rare events happen by chance; if you run many tests, rare events become likely. Reporting only the 'successful' result is misleading.
Two error rates to control: FWER and FDR. Let m = number of tests; R = number rejected (claimed significant); TP / FP counts among the R rejections; TN / FN among the m − R unrejected. FWER = P(FP ≥ 1) — probability of *any* false positive across the family. Event-based. FDR = E[FP/R] — expected *proportion* of false discoveries among the rejections. Proportion-based.
FWER vs FDR — when to use which. *FWER (Bonferroni, Holm)* — when even a single false positive is unacceptable. Confirmatory clinical trials, replication studies, expensive follow-up. Very conservative. *FDR (Benjamini-Hochberg)* — when some false positives are tolerable in exchange for not missing real effects. Exploratory genomics, neuroimaging, large-scale screening. More power than FWER.
Quick mnemonic. FWER: 'Did I make any mistake?' — event. FDR: 'How many of my claims are wrong?' — proportion. When all m nulls are true, FWER = FDR. They diverge when some nulls are false (FDR less strict in that case).
Bonferroni correction — simple but conservative. Per-test level: α/m. With 50 tests at α = 0.05: per-test α = 0.05/50 = 0.001. *Equivalent reformulation:* multiply each p-value by m, compare to original α. Why it works (Boole's inequality / union bound): P(any FP) = P(⋃ᵢ Aᵢ) ≤ Σᵢ P(Aᵢ) = m · (α/m) = α. Hence FWER ≤ α.
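A minimal sketch of the procedure in code (the function name is my own):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 for test i iff p_i <= alpha/m (equivalently, m*p_i <= alpha)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

# Three tests at alpha = 0.05: per-test threshold is 0.05/3 ≈ 0.0167
print(bonferroni([0.001, 0.010, 0.030]))  # → [True, True, False]
```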
Problems with Bonferroni. *(1)* Too stringent — slashes the per-test criterion; increases Type II error β; misses real effects. *(2)* Calibrated to independent tests — with correlated tests (correlated brain regions, related survey items) the union bound is far from tight, so the correction is even harsher than needed. *(3)* Depends only on m, ignoring data structure or test correlation.
Holm correction — Bonferroni's smarter cousin. Sequential FWER control. Sort p-values: p₍₁₎ ≤ p₍₂₎ ≤ … ≤ p₍ₘ₎. Compare p₍₁₎ to α/m; if significant, reject. Compare p₍₂₎ to α/(m − 1); if significant, reject. Continue with α/(m − 2), etc., until you hit a non-significant test — then stop. Same FWER protection as Bonferroni but more power. No good reason to use Bonferroni instead of Holm.
Benjamini-Hochberg — FDR control. Sort p-values p₍₁₎ ≤ … ≤ p₍ₘ₎. Compute the BH critical value for each rank: (i/m)·Q, where Q is target FDR (e.g., 0.05). **Find the largest i such that p₍ᵢ₎ ≤ (i/m)·Q.** Reject all tests with rank ≤ that i. (If individual ranks below the cutoff fail, they're still rejected — the *largest passing rank* sets the cutoff.) BH controls FDR at level Q.
Worked BH example. m = 5, Q = 0.05. p-values sorted: 0.001, 0.008, 0.039, 0.041, 0.250. Critical values: 0.01, 0.02, 0.03, 0.04, 0.05. Comparisons: rank 1: 0.001 ≤ 0.01 ✓. Rank 2: 0.008 ≤ 0.02 ✓. Rank 3: 0.039 > 0.03 ✗. Rank 4: 0.041 > 0.04 ✗. Largest passing rank is 2 → reject tests 1 and 2.
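The step-up rule can be sketched directly (a minimal implementation; names are my own):

```python
def benjamini_hochberg(pvals, Q=0.05):
    """BH step-up: reject every test at rank <= the largest i with p_(i) <= (i/m)*Q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices sorted by p-value
    k = 0                                             # largest passing rank
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * Q:
            k = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= k                       # everything up to rank k
    return reject

# Worked example above: reject ranks 1 and 2 only
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.250]))
# → [True, True, False, False, False]
```

Note the step-up character: a failing rank below the largest passing rank is still rejected.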
Permutation-based correction — use the data's structure. Bonferroni is calibrated to a worst case and BH's guarantee assumes independence; in practice, tests are often correlated (fMRI voxels, survey items, repeated measurements), and standard corrections over-correct. The fix: permutation tests. *(1)* Run m tests, get uncorrected p-values. *(2)* Randomly permute group labels many times (1000+). *(3)* For each permutation, run all m tests — under shuffling, all m nulls are true by construction. *(4)* Record the distribution of extreme results. *(5)* Use the empirical threshold (e.g., the 95th percentile of the max statistic) as the criterion for the original data.
Why permutation rocks for correlated tests. Automatically accounts for correlation. If tests are highly correlated → fewer effective independent tests → empirical threshold is more lenient. If tests are independent → threshold matches Bonferroni. You recover power without sacrificing FWER control. Standard in neuroimaging.
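The five-step procedure can be sketched as follows, using a toy test statistic (absolute difference of group means) — all names are my own:

```python
import random

def max_stat_threshold(group_a, group_b, n_perm=1000, alpha=0.05, seed=0):
    """FWER-controlling threshold via the max-statistic permutation method.

    group_a, group_b: lists of subjects; each subject is a list of m measures.
    """
    rng = random.Random(seed)
    m = len(group_a[0])
    pooled = group_a + group_b
    n_a = len(group_a)

    def stats(a, b):
        # one simple statistic per measure: |mean_a - mean_b|
        return [abs(sum(s[j] for s in a) / len(a) - sum(s[j] for s in b) / len(b))
                for j in range(m)]

    maxes = []
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # permute labels: all m nulls true
        maxes.append(max(stats(pooled[:n_a], pooled[n_a:])))
    maxes.sort()
    return maxes[int((1 - alpha) * n_perm)]      # empirical 95th percentile of the max

# Any observed |mean difference| exceeding this threshold is significant at FWER <= alpha.
```

Because the maximum is taken across all m tests per permutation, correlation between tests automatically shrinks the threshold toward the single-test one.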
FWER inflation table — exam fodder. For m independent tests at α = 0.05: m=2 → FWER ≈ 0.0975 (10%). m=3 → 14%. m=5 → 23%. m=10 → 40%. m=20 → 64%. m=100 → 99.4%. Memorise the m = 20 → 64% number.
Multiple comparisons and the replication crisis. Not just a technical detail — one of the causes. Combined with publication bias (Unit 1) and p-hacking, multiple comparisons inflate the apparent rate of 'discoveries' far beyond the true rate. The drug-company story is a fictional version of what happens in subtler forms all the time. The corrections in this unit are the discipline's pushback.
Choosing a correction — decision flow. *m small (< 5) and confirmatory:* Holm. *m moderate (5–50) confirmatory:* Holm (or Bonferroni). *m large (hundreds-thousands) exploratory:* Benjamini-Hochberg. *Correlated tests (fMRI, related surveys):* permutation. Meta-rule: report what you did. Failing to disclose multiple testing is itself p-hacking.
Pre-registration as the systemic fix. Lock in hypotheses, design, analyses, and corrections *before* data collection. Forces transparency; distinguishes confirmatory from exploratory. Combined with registered reports (journals accept based on the pre-registration, regardless of outcome), this addresses both multiple comparisons AND publication bias.
Definitions
- Multiple comparisons problem — Running m tests at α each inflates FWER to 1 − (1 − α)^m under independence. With m = 20, α = 0.05 → ~64% chance of any FP.
- Family-Wise Error Rate (FWER) — P(at least one false positive across all m tests). 'Did I make any mistake?' Conservative.
- False Discovery Rate (FDR) — E[FP/R] — expected proportion of false positives among rejections. 'How many of my claims are wrong?' Less conservative.
- Bonferroni correction — Per-test level α/m for each test. Controls FWER via union bound. Simple but conservative, especially for correlated tests.
- Holm's stepwise correction — Sequential FWER control. Compare p₍ᵢ₎ to α/(m − i + 1) in order. Uniformly more powerful than Bonferroni.
- Benjamini-Hochberg (BH) — Sequential FDR control. Sort p; reject all up to the largest i with p₍ᵢ₎ ≤ (i/m)·Q.
- Permutation test — Empirical null distribution from label-shuffling. Handles correlated tests naturally; standard in fMRI.
- Union bound (Boole's inequality) — P(⋃ᵢ Aᵢ) ≤ Σᵢ P(Aᵢ). Foundation of Bonferroni; conservative when events overlap.
- Garden of forking paths (Gelman) — Implicit multiple comparisons from analytic choices (covariate inclusion, outlier criteria, etc.) made post-hoc. A form of p-hacking.
- Pre-registration — Locking in hypotheses, design, and analysis plan before data collection. The main antidote to multiple-comparisons abuse.
Formulas
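Collected from the derivations below:

```latex
\begin{align*}
\text{FWER, } m \text{ independent tests:} \quad & \mathrm{FWER} = 1 - (1-\alpha)^m \approx m\alpha \ \text{(small } m\alpha\text{)} \\
\text{Bonferroni:} \quad & \text{reject } H_i \text{ if } p_i \le \alpha/m \\
\text{Holm (step } i\text{):} \quad & \text{reject if } p_{(i)} \le \alpha/(m-i+1)\text{, stop at first failure} \\
\text{Benjamini--Hochberg:} \quad & \text{reject ranks } 1..k,\quad k = \max\{\, i : p_{(i)} \le (i/m)\,Q \,\} \\
\text{Union bound:} \quad & P\Big(\textstyle\bigcup_i A_i\Big) \le \sum_i P(A_i)
\end{align*}
```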
Derivations
Bonferroni controls FWER via union bound. Define Aᵢ = 'test i is a false positive'. By Boole's inequality: P(⋃ᵢ Aᵢ) ≤ Σᵢ P(Aᵢ). If each test uses α/m, then P(Aᵢ) ≤ α/m for each, so FWER ≤ m · (α/m) = α. Conservative because the union bound is tight only when the Aᵢ are disjoint; in practice, correlated tests share rejections and the actual FWER is less than α.
Holm is uniformly more powerful than Bonferroni. Both control FWER at α. Bonferroni: every test compared to α/m. Holm: smallest p compared to α/m (same as Bonferroni), but if it rejects, the next is compared to the larger threshold α/(m − 1). Hence Holm rejects at least everything Bonferroni does AND potentially more. Same Type I control, lower Type II rate.
BH controls FDR at Q (Benjamini-Hochberg 1995). Under independence of test statistics: by sorting and using the sequential threshold (i/m)·Q, the expected proportion of false positives among rejections is bounded by (m₀/m)·Q ≤ Q, where m₀ is the number of true nulls. The proof uses the order statistics of uniform random variables on [0, 1] (which the p-values are under H₀).
**Why FWER inflates as 1 − (1 − α)^m under independence.** Each test is independent Bernoulli with success (false positive) probability α. Probability of no false positive across m tests = (1 − α)^m. Hence probability of at least one = 1 − (1 − α)^m. Linear approximation for small α: 1 − (1 − α)^m ≈ mα — Bonferroni's reasoning.
Permutation gives the exact null distribution. Under the null hypothesis of no group difference, group labels are exchangeable. Permuting labels gives a sample from the null distribution of the test statistic. With K permutations, the empirical p-value is (# permutations with statistic ≥ observed) / K. No distributional assumption required. For FWER control: record the maximum statistic across all m tests per permutation; the 95th percentile of these maxima is the FWER-controlled threshold.
Examples
- 20 tests at α = 0.05. 1 − 0.95²⁰ ≈ 0.64 → 64% chance of at least one false positive under all-true-nulls.
- Bonferroni at m = 50. Per-test α = 0.05/50 = 0.001. Only p < 0.001 is significant.
- Holm walkthrough. m = 5; α = 0.05. p sorted: 0.005, 0.012, 0.018, 0.030, 0.080. Compare 0.005 vs 0.05/5 = 0.010 → reject. Next: 0.012 vs 0.05/4 = 0.0125 → reject (0.012 < 0.0125). Next: 0.018 vs 0.05/3 = 0.0167 → 0.018 > 0.0167 → stop. Reject only the first two.
- BH walkthrough. m = 5; Q = 0.05. p sorted: 0.001, 0.008, 0.039, 0.041, 0.250. Critical: 0.01, 0.02, 0.03, 0.04, 0.05. Largest i with p₍ᵢ₎ ≤ (i/m)·Q: rank 2 (0.008 ≤ 0.02) passes; rank 3 fails. Reject ranks 1 and 2.
- Bonferroni in genomics nightmare. 20,000 gene expression tests at α = 0.05. Per-test α = 0.05/20,000 = 2.5 × 10⁻⁶. **Only tests with p < 2.5 × 10⁻⁶ survive** — many real effects with p ≈ 0.001 are missed. Use BH instead.
- Multiple comparisons in fMRI. ~100,000 voxels, each tested. Bonferroni: per-voxel α = 0.05/100,000 = 5 × 10⁻⁷. Most truly active voxels miss the threshold. Cluster-based permutation tests are standard.
- Bonferroni overcorrects correlated tests. Three highly correlated cognitive measures with shared variance. Pretending they're independent gives per-test α/3; actually maybe 1.5 'effective' tests → over-correcting by ~2×.
- Hidden multiple comparisons. Subgroup analyses: 'effect by gender × age × education'. Six subgroups, six tests, each at α = 0.05 — actual FWER ≈ 26%. Always correct.
Diagrams
- FWER inflation curve. Plot of 1 − (1 − α)^m vs m for α = 0.05. Sharp rise: at m = 20, FWER ≈ 64%; at m = 100, ≈ 99%.
- FWER vs FDR table. Cells showing what each controls; conservativeness ordering.
- Bonferroni vs Holm vs BH. Same set of m = 5 p-values; mark which are rejected by each method.
- Permutation distribution. Histogram of max statistic across 1000 permutations; 95th percentile marked as FWER threshold.
- Decision flowchart. Few tests confirmatory → Holm. Many tests exploratory → BH. Correlated tests → permutation.
Edge cases
- Tests not independent — Bonferroni still works (union bound is general) but is even more conservative. BH assumes positive regression dependency for strict FDR control; in practice robust to moderate violations.
- Pre-specified single hypothesis doesn't need correction.
- Garden of forking paths — multiple comparisons can be *implicit* even when only one test is reported. Always pre-register.
- Negative dependency can in principle inflate FDR beyond Q for BH — use Benjamini-Yekutieli for guaranteed control under arbitrary dependence.
- Hierarchical hypotheses — gatekeeping procedures (test A first; only test B if A significant) can avoid full correction.
- Adaptive procedures — Storey's q-value approach uses an estimate of m₀ (the number of true nulls) for less conservative FDR control.
Common mistakes
- Running 20 tests, reporting only the significant one without correction — the drug-company story.
- Applying Bonferroni when m is huge — loses too much power; use BH instead.
- Treating subgroup analyses as 'free' tests — they aren't. Each is a comparison.
- Optional stopping (checking p after each subject) inflates Type I beyond nominal α.
- Switching from BH to Bonferroni to be 'more rigorous' when the study is exploratory — over-corrects, kills real effects.
- Forgetting the union bound is conservative — actual FWER with correlated tests is less than the bound.
- Reporting BH-rejected results as if at α — they are at FDR Q, not FWER α.
- Confusing FWER and FDR. FWER = P(any FP); FDR = expected proportion. Different scales.
Shortcuts
- FWER: P(any FP) ≤ α. FDR: E[FP/R] ≤ Q.
- Bonferroni: α/m for each test. Simple, conservative.
- Holm: sequential, more powerful than Bonferroni. Use it instead.
- BH: sort p; reject up to largest i with p₍ᵢ₎ ≤ (i/m)·Q. FDR-controlling.
- Use FWER for few costly tests (confirmatory). Use FDR for many exploratory tests.
- Permutation tests for correlated data.
- m = 20, α = 0.05 → FWER ≈ 64%. Memorise.
- Pre-register to avoid the garden of forking paths.
Proofs / Algorithms
Bonferroni-FWER bound. Let Aᵢ = event of false positive on test i. P(⋃ᵢ Aᵢ) ≤ Σᵢ P(Aᵢ) by union bound. If each test uses α/m: Σᵢ P(Aᵢ) ≤ m · (α/m) = α. Hence FWER ≤ α. Inequality is tight when events are disjoint; for correlated tests, actual FWER is strictly less than α — Bonferroni is over-conservative.
BH controls FDR ≤ (m₀/m)·Q under independence. Let m₀ = number of true nulls. Under independence of test statistics, the p-values from true nulls are uniform on [0, 1]. Benjamini-Hochberg (1995) showed: define k = largest i such that p₍ᵢ₎ ≤ (i/m)·Q, and reject ranks 1..k. Then FDR ≤ (m₀/m)·Q. Since m₀ ≤ m, FDR is bounded by Q. More powerful than FWER control when many alternatives are true (m₀ < m).
**FWER under independence is exactly 1 − (1 − α)^m.** Each test is independent Bernoulli(α). P(no FP across m tests) = (1 − α)^m. Hence FWER = 1 − (1 − α)^m. Linear approximation: for small α, (1 − α)^m ≈ 1 − mα, so FWER ≈ mα — Bonferroni's intuition. The linear approximation breaks down for larger m (mα can exceed 1, while the exact FWER never does).
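A quick numerical comparison of the exact formula with the linear approximation (a sketch):

```python
alpha = 0.05
for m in (2, 5, 20, 100):
    exact = 1 - (1 - alpha) ** m
    approx = m * alpha            # Bonferroni's linear approximation
    print(f"m={m:>3}: exact={exact:.3f}  m*alpha={approx:.2f}")
```

At m = 2 the two agree closely (0.098 vs 0.10); at m = 20 the approximation already overshoots badly (1.00 vs ≈ 0.64).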