
Behavioral Research: Statistical Methods

CG3.402
Vinoo Alluri · Monsoon 2025-26 · 4 credits

Categorical & Rank-Based Tests


Intuition

When the data refuse to behave — ordinal scale, heavily skewed, unfixable outliers, tiny n — drop the Normality assumption and use non-parametric tests. Each non-parametric test pairs to a parametric cousin: independent t ↔ Mann-Whitney; paired t ↔ Wilcoxon signed-rank; one-way ANOVA ↔ Kruskal-Wallis; repeated-measures ANOVA ↔ Friedman; Pearson r ↔ Spearman ρ. χ² handles purely categorical data; McNemar handles paired proportions; Fisher's exact handles small counts. Non-parametric tests have less power than parametric when parametric assumptions actually hold — use them only when needed.

Explanation

The decision tree — half the answer to half the exam. Memorise this 2×3 grid first:

*Parametric (interval/ratio, normal):* between-subjects → Independent t; within-subjects → Paired t. *Non-parametric for ordinal:* between → Mann-Whitney U; within → Wilcoxon signed-rank. *Non-parametric for categorical:* between → χ² test; within → Binomial sign test (or McNemar for 2×2).

Extensions to more than 2 groups. *More than two independent groups:* one-way ANOVA (parametric) or Kruskal-Wallis (non-parametric). *More than two repeated conditions:* RM-ANOVA (parametric) or Friedman's test (non-parametric). *Two or more categorical variables:* χ² independence; for 3+ variables, log-linear analysis.

Choosing a test — three (plus one) questions. *(1)* Level of measurement of the DV — interval/ratio (continuous, meaningful spacing), ordinal (ranks), or categorical/nominal (categories). *(2)* Number of groups or conditions — two vs more than two. *(3)* Design — between-subjects (independent) vs within-subjects (paired / repeated). *Plus:* Are parametric assumptions met? — Normality, homogeneity of variance, no severe outliers. If assumptions hold + interval/ratio: parametric (more power). Otherwise: non-parametric.

Parametric vs non-parametric trade-off. Parametric tests have more statistical power when their assumptions hold — they detect real effects with fewer participants. Non-parametric tests make fewer assumptions but have less power when parametric assumptions would have held. *'Lose power for robustness'* is the slogan. Cost of using non-parametric unnecessarily: missed real effects. Cost of using parametric inappropriately: invalid inference.

Chi-square goodness-of-fit. Tests whether observed frequencies match an expected distribution. *H₀:* observed matches expected. *H₁:* observed differs. Classic example: M&M colour distribution against the manufacturer's claim (13% brown, 14% yellow, 24% blue, 20% orange, 16% green, 13% red). χ² = Σ (O − E)²/E, df = k − 1 (one DoF spent on the total count being fixed).
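
The goodness-of-fit arithmetic is easy to check by hand. A minimal Python sketch (the observed counts below are hypothetical, not real M&M data; the expected proportions are the manufacturer's claim above):

```python
# Chi-square goodness-of-fit: chi2 = sum((O - E)^2 / E), df = k - 1.
# Hypothetical bag of 100 M&Ms; expected proportions from the claim above.
expected_props = {"brown": 0.13, "yellow": 0.14, "blue": 0.24,
                  "orange": 0.20, "green": 0.16, "red": 0.13}
observed = {"brown": 15, "yellow": 12, "blue": 28,
            "orange": 18, "green": 14, "red": 13}   # made-up counts

n = sum(observed.values())                      # total count (100)
chi2 = sum((observed[col] - n * p) ** 2 / (n * p)
           for col, p in expected_props.items())
df = len(expected_props) - 1                    # k - 1 = 5

print(round(chi2, 2), df)   # -> 1.71 5
```

Compare chi2 to the χ² critical value at df = 5; here 1.71 is far below the 0.05 cutoff (≈ 11.07), so the hypothetical bag is consistent with the claim.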

Chi-square test for independence. Tests whether two categorical variables are related. Data: contingency table. *H₀:* independent. *H₁:* not independent. Expected: E = (row total × column total) / grand total. df = (r − 1)(c − 1). For 2×2: df = 1. For 2×3: df = 2.

Why df = (r − 1)(c − 1). Once you fix the row and column totals (marginals), you can fill in only (r − 1)(c − 1) cells freely; the rest are determined.
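
The expected-count recipe and the resulting statistic, sketched in Python (the 2×2 counts are hypothetical, chosen to echo the dancing example in the Examples section):

```python
# Expected counts under independence: E = row_total * col_total / n,
# then chi2 = sum((O - E)^2 / E) with df = (r - 1)(c - 1).
table = [[10, 40],
         [40, 10]]                        # hypothetical 2x2 counts
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
df = (len(table) - 1) * (len(table[0]) - 1)
print(chi2, df)   # -> 36.0 1
```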

χ² effect sizes. Phi for 2×2 tables. Cramér's V for larger. Both range 0 to 1. Interpretation: 0.1 small, 0.3 moderate, 0.5 large (rough benchmarks).

Chi-square limitations. *(1)* Each observation must fall in exactly one cell — between-subjects only; for paired use McNemar. *(2)* Only frequencies, not means or percentages. *(3)* Each cell should have expected count ≥ 5; below that, the χ² approximation breaks down — use Fisher's exact (for 2×2) or combine categories. *(4)* Indicates presence/absence of association but not strength — report effect size.

Mann-Whitney U — non-parametric independent t. Rank all observations across both groups; sum ranks per group; compute U₁ = n₁n₂ + n₁(n₁ + 1)/2 − R₁ and similarly U₂; report U = min(U₁, U₂). Tests whether one group tends to have stochastically larger values. Use when comparing two independent groups on ordinal data, non-normal continuous data, or small samples with outliers.
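
The ranking-and-counting recipe, as a small Python sketch (the sample values are hypothetical; `midranks` averages ranks over ties):

```python
# Mann-Whitney U: rank everything jointly, take the rank sum of group 1,
# convert to U1 = n1*n2 + n1*(n1+1)/2 - R1, and report U = min(U1, U2).
def midranks(values):
    """Map each distinct value to its midrank (ties share the average rank)."""
    svals = sorted(values)
    ranks = {}
    i = 0
    while i < len(svals):
        j = i
        while j < len(svals) and svals[j] == svals[i]:
            j += 1
        ranks[svals[i]] = (i + 1 + j) / 2    # average of positions i+1 .. j
        i = j
    return ranks

def mann_whitney_u(a, b):
    rank_of = midranks(a + b)
    r1 = sum(rank_of[x] for x in a)          # rank sum of group 1
    n1, n2 = len(a), len(b)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    return min(u1, n1 * n2 - u1)             # U = min(U1, U2)

a = [310, 290, 305, 300]   # hypothetical scores, group A
b = [330, 340, 325, 315]   # hypothetical scores, group B
print(mann_whitney_u(a, b))   # -> 0.0 (complete separation)
```

U = 0 is the extreme case: every value in one group is below every value in the other.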

Wilcoxon signed-rank — non-parametric paired t. For each pair, compute the difference; rank the *absolute* differences; sum the ranks of positive (or negative) differences → W. Tests whether the differences are symmetrically distributed around zero. Use for paired or matched samples with non-normal differences.

Kruskal-Wallis — non-parametric one-way ANOVA. Rank all observations across all groups; compute H from rank sums. Tests whether at least one group has stochastically larger/smaller values. Follow with Dunn's post-hoc (with multiple-comparison correction) to identify which groups differ.
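
A sketch of the H statistic under the no-ties formula H = 12/(N(N + 1)) · Σ Rⱼ²/nⱼ − 3(N + 1) (the scores are hypothetical and untied; real tied data needs a tie correction):

```python
# Kruskal-Wallis H from rank sums (no tie correction; assumes untied data).
def kruskal_wallis_h(groups):
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # untied -> unique ranks
    n_total = len(pooled)
    sum_term = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * sum_term - 3 * (n_total + 1)

groups = [[7, 1, 3], [6, 8, 9], [2, 4, 5]]   # hypothetical scores, 3 groups
print(round(kruskal_wallis_h(groups), 3))    # -> 4.267
```

Compare H to χ² with df = k − 1 = 2; here 4.267 < 5.99, so not significant at α = 0.05.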

Friedman's test — non-parametric RM-ANOVA. Rank within each subject across the k conditions; sum ranks per condition; test whether rank sums differ. Use when same participants are measured under k conditions and the parametric assumptions fail.
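
The within-subject ranking, sketched in Python with the standard statistic χ²_F = 12/(nk(k + 1)) · Σ Rⱼ² − 3n(k + 1) (the data matrix is hypothetical; ties within a row are not handled):

```python
# Friedman: rank conditions within each subject, sum ranks per condition.
def friedman_stat(data):          # data[subject][condition], no ties in a row
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        order = sorted(range(k), key=lambda j: row[j])
        for r, j in enumerate(order, start=1):
            rank_sums[j] += r     # rank r goes to condition j
    return 12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)

data = [[5, 7, 9], [4, 6, 8], [6, 5, 9], [3, 7, 8]]   # 4 subjects x 3 conditions
print(friedman_stat(data))   # -> 6.5
```

Compare to χ² with df = k − 1 = 2; 6.5 > 5.99, so significant at α = 0.05.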

McNemar's test — paired categorical (2×2). When the *same* subjects are measured on two binary outcomes (e.g., before/after intervention; correct/incorrect on two tests). **Compares only the *discordant* cells** (yes→no and no→yes counts). χ² = (b − c)² / (b + c), where b and c are the off-diagonal counts.
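
In code the statistic is one line; here it is applied to the discordant counts from the McNemar example in the Examples section (25 and 5):

```python
# McNemar's chi-square: only the discordant cells b and c enter.
def mcnemar_chi2(b, c):
    return (b - c) ** 2 / (b + c)   # compare to chi-square with df = 1

print(round(mcnemar_chi2(25, 5), 2))   # -> 13.33
```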

Binomial sign test. Simplest paired test: count signs of differences (+ or −); test against Binomial(n, 0.5). Useful when differences are not even rank-comparable (true ordinal pairs with no magnitude information).
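
An exact two-sided version is a few lines of stdlib Python (the difference scores are hypothetical; zero differences are dropped, as is conventional):

```python
# Exact binomial sign test: drop zero differences, count positive signs,
# and compute a two-sided tail probability under Binomial(n, 0.5).
from math import comb

def sign_test_p(diffs):
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    k = sum(d > 0 for d in nonzero)
    def tail(lo, hi):
        return sum(comb(n, i) for i in range(lo, hi + 1)) / 2 ** n
    return min(1.0, 2 * min(tail(0, k), tail(k, n)))

diffs = [1, 2, 1, 3, 1, 1, -1, 2, 1, 1]   # hypothetical paired differences
print(round(sign_test_p(diffs), 4))       # -> 0.0215
```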

Spearman ρ — non-parametric correlation. Pearson r computed on the ranks. Covered in Unit 6. Robust to outliers; captures monotone associations; works for ordinal data.

Fisher's exact test. For 2×2 contingency tables with small expected counts (< 5). Computes the exact probability of the observed table (and more extreme tables) under H₀ using the hypergeometric distribution. Slower than χ² but exact — no approximation. Use when χ² assumptions fail.
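
The hypergeometric computation fits in a short stdlib sketch. The table below is the one from the Fisher example in the Examples section, and the two-sided rule used (sum over all tables no more probable than the observed one) matches R's fisher.test:

```python
# Fisher's exact test for a 2x2 table [[a, b], [c, d]] with fixed margins.
from math import comb

def fisher_exact_2x2(a, b, c, d):
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    def p_table(x):               # P(top-left cell = x | fixed margins)
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = p_table(a)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)   # two-sided

print(round(fisher_exact_2x2(3, 7, 8, 2), 4))   # -> 0.0698
```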

When to drop to non-parametric — checklist. Severely non-normal residuals + small n. Ordinal DV. Unfixable outliers. Rank-based research hypothesis ('does one group tend to rank higher overall?'). Likert data with few scale points and unequal spacing.

Robust alternatives between parametric and non-parametric. Trimmed means (remove extreme % from each tail before computing). Winsorising (replace extreme values with percentile cutoffs). Bootstrap confidence intervals (resample with replacement; no distributional assumption). These keep power closer to parametric while resisting outliers.
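
A percentile-bootstrap CI for the mean, as a minimal sketch (the data vector is hypothetical; `stat` can be swapped for a trimmed mean or median):

```python
# Percentile bootstrap: resample with replacement, then take the empirical
# alpha/2 and 1 - alpha/2 quantiles of the resampled statistic.
import random
from statistics import mean

def bootstrap_ci(data, stat, n_boot=10_000, alpha=0.05, seed=42):
    rng = random.Random(seed)
    boots = sorted(stat([rng.choice(data) for _ in data])
                   for _ in range(n_boot))
    return boots[int(alpha / 2 * n_boot)], boots[int((1 - alpha / 2) * n_boot)]

data = [2, 3, 3, 4, 5, 5, 6, 40]   # hypothetical, one big outlier
lo, hi = bootstrap_ci(data, mean)
print(lo, hi)   # a wide interval, dragged upward by resamples containing 40
```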

Reporting non-parametric results. State the test, the test statistic (U, W, H, χ², τ, ρ), the p-value, and the effect size. *Example:* 'Mann-Whitney U = 40, p = 0.03, r = Z/√N ≈ 0.45 (moderate effect)'. The effect size for rank tests is typically r = Z/√N.

Definitions

  • Non-parametric test — Test that does not assume a specific distribution for the data. Rank-based or count-based.
  • Chi-square goodness-of-fit — Tests whether observed category counts match an expected distribution. df = k − 1.
  • Chi-square test for independence — Tests whether two categorical variables are associated. df = (r − 1)(c − 1).
  • Phi (φ) — Effect size for 2×2 contingency tables. Range [0, 1].
  • Cramér's V — Generalisation of φ to larger tables.
  • Mann-Whitney U — Non-parametric counterpart of independent t. Rank-based; tests stochastic dominance between two independent groups.
  • Wilcoxon signed-rank — Non-parametric counterpart of paired t. Signed-rank-based; tests symmetry of differences around zero.
  • Kruskal-Wallis H — Non-parametric counterpart of one-way ANOVA. Rank-based across k independent groups.
  • Friedman test — Non-parametric counterpart of repeated-measures ANOVA. Rank within subjects across conditions.
  • McNemar's test — Paired binary outcome test. Compares discordant cells b and c in a 2×2. χ² = (b − c)²/(b + c).
  • Fisher's exact test — Exact test for 2×2 contingency with small expected counts (< 5). Uses hypergeometric distribution; no asymptotic approximation.
  • Binomial sign test — Simplest paired test: count signs of differences; test against Binomial(n, 0.5).
  • Stochastic dominance — What rank-based tests actually test: 'one group tends to have larger values than another'. Not the same as comparing means or medians.

Formulas

  • χ² = Σ (O − E)² / E. df: goodness-of-fit k − 1; independence (r − 1)(c − 1). Expected under independence: E = (row total × column total) / grand total.
  • φ = √(χ²/n) for 2×2; Cramér's V = √(χ² / (n(min(r, c) − 1))) for larger tables.
  • Mann-Whitney: U₁ = n₁n₂ + n₁(n₁ + 1)/2 − R₁; U = min(U₁, U₂).
  • Kruskal-Wallis: H = 12/(N(N + 1)) · Σ Rⱼ²/nⱼ − 3(N + 1).
  • McNemar: χ² = (b − c)² / (b + c), df = 1.
  • Rank-test effect size: r = Z / √N.

Derivations

Why expected = row × col / total in χ² independence. Under H₀ (independence): P(row i, col j) = P(row i) · P(col j). Estimate marginal probabilities from data: p̂ᵢ = Rᵢ/n, p̂ⱼ = Cⱼ/n. Multiply by n to get the expected count: Eᵢⱼ = n · (Rᵢ/n)(Cⱼ/n) = RᵢCⱼ/n. The sum of (O − E)² / E is approximately χ² with (r − 1)(c − 1) df by the multivariate CLT applied to multinomial counts.

Why df = (r − 1)(c − 1) in independence. A contingency table has r × c cells, whose counts sum to n, leaving rc − 1 free probabilities. Under H₀ we estimate one probability per row and one per column — (r − 1) + (c − 1) free parameters once each set sums to one. Free cells = (rc − 1) − (r − 1) − (c − 1) = (r − 1)(c − 1).

Mann-Whitney U as area-under-curve interpretation. U/(n₁n₂) is the probability that a randomly chosen value from group 1 exceeds a randomly chosen value from group 2 (with ties counted as 0.5). This is the ROC-AUC interpretation — connects the test to discriminability.

McNemar's test derivation. In a paired 2×2 table with cells a (both yes), b (yes→no), c (no→yes), d (both no), the off-diagonal cells reflect *change*. Under H₀ (no change), each discordant pair is equally likely to land in b or c, so b ~ Binomial(b + c, 0.5). The standardised statistic (b − c)/√(b + c) is approximately Normal, so χ² = (b − c)²/(b + c) with df = 1.

Why non-parametric tests have less power. Parametric tests use the *magnitude* of each observation; rank-based tests use only the *order*. Ranking discards information about how far apart observations are. When parametric assumptions hold (data are well-behaved), this discarded information would have helped detect effects → non-parametric is less powerful (the asymptotic relative efficiency of Wilcoxon vs the t-test under Normal data is 3/π ≈ 0.955, i.e., roughly a 5% loss).

Examples

  • χ² goodness-of-fit. 256 artists' zodiac signs. Under H₀ uniform: expected ≈ 21.3 per sign. Observed counts and (O−E)²/E summed give χ² = 22.4, df = 11, critical at α = 0.05 ≈ 19.7 → reject H₀ → zodiac distribution is non-uniform among artists.
  • χ² independence — dancing comfort × personality. 100 people: 50 extroverts, 50 introverts; 50 comfortable, 50 not. Observed cells: 10/40/40/10. Expected under independence: 25 each. χ² = 4 × (15²/25) = 36, df = 1, p < .001 → strong association. Phi = √(36/100) = 0.6 (large effect).
  • Cramér's V example. 2×3 mental health × gender table, n = 150, χ² = 8.23, df = 2 → p < .05. V = √(χ² / (n(min(r, c) − 1))) = √(8.23/150) ≈ 0.23 (moderate). Report: 'χ²(2, n = 150) = 8.23, p < 0.05, V = 0.23'.
  • Mann-Whitney worked example. Groups n=10 each. Ranked values group A: 2,4,5,7,9,10,11,13,16,18 (sum = 95). Group B: 1,3,6,8,12,14,15,17,19,20 (sum = 115). U₁ = 10·10 + 10·11/2 − 95 = 60; U₂ = 100 + 55 − 115 = 40; U = min(U₁, U₂) = 40. Compare to the critical value for n₁ = n₂ = 10, two-tailed α = 0.05 (critical = 23): 40 > 23 → fail to reject.
  • Wilcoxon signed-rank. Paired differences: −3, +1, +5, +2, −1, +4. Ranks of absolute differences: 4, 1.5, 6, 3, 1.5, 5. Sum of positive ranks W⁺ = 1.5 + 6 + 3 + 5 = 15.5; sum of negative ranks W⁻ = 4 + 1.5 = 5.5. Take the smaller sum, 5.5, and compare to the critical value for n = 6.
  • McNemar example. 100 patients, before vs after intervention (yes/no). Discordant: 25 went yes→no; 5 went no→yes. χ² = (25 − 5)²/(25 + 5) ≈ 13.33, df = 1, p < .001 → significant change.
  • Fisher's exact at small n. 2×2 table: 3 vs 7; 8 vs 2. Expected cells include some < 5. Use Fisher's exact rather than χ². R: fisher.test(matrix(c(3,7,8,2), 2)).

Diagrams

  • Parametric ↔ nonparametric pairing table. Two-column table mapping each parametric test to its non-parametric counterpart with the design (independent / paired / repeated).
  • Decision flowchart. Branch on DV scale (continuous/ordinal/categorical), # groups (2 / 3+), design (independent / paired) → terminal test.
  • χ² 2×2 contingency. Observed vs expected cell counts, with (O−E)²/E shown per cell.
  • Mann-Whitney ranks. Two groups' values, joint ranking on a number line; rank sums per group.
  • Wilcoxon signed-rank. Pre/post values, differences, absolute ranks of differences, signed rank sum.
  • McNemar 2×2. Highlight discordant cells b and c; concordant cells a and d don't enter the statistic.

Edge cases

  • Small expected counts (< 5) in χ² → use Fisher's exact instead. The χ² approximation breaks down.
  • Tied ranks in Mann-Whitney / Wilcoxon → use mid-ranks (average of tied positions); large numbers of ties reduce power. R's wilcox.test handles this automatically.
  • Non-parametric tests have less power when parametric assumptions actually hold — don't use them unnecessarily.
  • McNemar's χ² approximation needs roughly b + c ≥ 25 discordant pairs. Below that, use the exact (binomial) version: McNemar's exact test.
  • A zero expected count in χ² makes (O − E)²/E undefined. Combine categories or use Fisher's exact.
  • Highly skewed continuous data with moderate n (~30+) are often analysable with parametric tests via the CLT. The transition n is fuzzy; report Q-Q plots to justify the choice.
  • Likert with many scale points (e.g., 7-point or visual analogue) is often defensibly treated as interval. Strict ordinal would call for Mann-Whitney / Wilcoxon.
  • Permutation tests can replace many non-parametric tests with greater flexibility; they assume only exchangeability under H₀.

Common mistakes

  • Reporting χ² without checking E ≥ 5 per cell. Approximation invalid; use Fisher's exact.
  • Confusing Mann-Whitney (independent) with Wilcoxon signed-rank (paired). Different designs.
  • Using Mann-Whitney for ordinal data with many ties — power drops.
  • Applying nonparametric tests when parametric assumptions hold — losing power for no reason.
  • Reporting χ² as 'evidence of causation' — only association.
  • Using χ² on paired binary data — use McNemar instead.
  • Reporting Mann-Whitney as 'compares means' — it tests stochastic dominance (medians only under a shift assumption), not means.
  • Forgetting effect size for rank tests — report r = Z/√N.
  • Combining cells post-hoc to make E ≥ 5 without pre-specifying — p-hacking-adjacent.

Shortcuts

  • χ² = Σ(O − E)²/E. df: indep (r−1)(c−1); GoF k−1.
  • Mann-Whitney = independent t in non-parametric form.
  • Wilcoxon signed-rank = paired t in non-parametric form.
  • Kruskal-Wallis = one-way ANOVA non-parametric. Friedman = RM-ANOVA non-parametric.
  • McNemar = paired χ² 2×2 (compares discordant cells only).
  • Fisher's exact for small expected counts in 2×2.
  • Effect size for χ²: φ (2×2), Cramér's V (larger). For rank tests: r = Z/√N.
  • Drop to non-parametric if: ordinal DV, severe non-normality + small n, unfixable outliers.

Proofs / Algorithms

Expected cell counts under independence. Under H₀, pᵢⱼ = pᵢ₊ · p₊ⱼ. Estimating marginal probabilities from data: p̂ᵢ₊ = Rᵢ/n, p̂₊ⱼ = Cⱼ/n. Expected count under H₀: Eᵢⱼ = n · p̂ᵢ₊ · p̂₊ⱼ = RᵢCⱼ/n. The test statistic Σ (O − E)²/E converges in distribution to χ² with (r − 1)(c − 1) df under H₀ as n → ∞ (large-sample multinomial theory).

Mann-Whitney as U / (n₁n₂) = AUC. Define U = Σᵢ Σⱼ [1(Xᵢ > Yⱼ) + ½ · 1(Xᵢ = Yⱼ)] over all n₁n₂ pairs. Then U/(n₁n₂) = P(X > Y) + ½ P(X = Y). This is exactly the area under the ROC curve when treating group as label and value as predictor. Under H₀ of identical distributions, U/(n₁n₂) = ½ — random discrimination.

McNemar from binomial. Under H₀ of no change, each discordant pair is independently 'yes→no' with probability p and 'no→yes' with probability 1 − p, with p = ½ under H₀. So b ~ Binomial(b + c, ½). By the Normal approximation, (b − c)/√(b + c) is approximately N(0, 1). Squared: χ² = (b − c)²/(b + c), which is χ² with 1 df.