
Behavioral Research: Statistical Methods

CG3.402
Vinoo Alluri · Monsoon 2025-26 · 4 credits

Categorical & Rank-Based Tests


Intuition

When the data refuse to behave — ordinal scale, heavily skewed, unfixable outliers, tiny n — drop the Normality assumption and use non-parametric tests. Each non-parametric test pairs to a parametric cousin: independent t ↔ Mann-Whitney; paired t ↔ Wilcoxon signed-rank; one-way ANOVA ↔ Kruskal-Wallis; repeated-measures ANOVA ↔ Friedman; Pearson r ↔ Spearman ρ. χ² handles purely categorical data; McNemar handles paired proportions; Fisher's exact handles small counts. Non-parametric tests have less power than parametric when parametric assumptions actually hold — use them only when needed.

Explanation

The decision tree — half the answer to half the exam. Memorise this 2×3 grid first:

*Parametric (interval/ratio, normal):* between-subjects → Independent t; within-subjects → Paired t. *Non-parametric for ordinal:* between → Mann-Whitney U; within → Wilcoxon signed-rank. *Non-parametric for categorical:* between → χ² test; within → Binomial sign test (or McNemar for 2×2).

Extensions to more than 2 groups. *More than two independent groups:* one-way ANOVA (parametric) or Kruskal-Wallis (non-parametric). *More than two repeated conditions:* RM-ANOVA (parametric) or Friedman's test (non-parametric). *Two or more categorical variables:* χ² independence; for 3+ variables, log-linear analysis.

Choosing a test — three (plus one) questions. *(1)* Level of measurement of the DV — interval/ratio (continuous, meaningful spacing), ordinal (ranks), or categorical/nominal (categories). *(2)* Number of groups or conditions — two vs more than two. *(3)* Design — between-subjects (independent) vs within-subjects (paired / repeated). *Plus:* Are parametric assumptions met? — Normality, homogeneity of variance, no severe outliers. If assumptions hold + interval/ratio: parametric (more power). Otherwise: non-parametric.

Parametric vs non-parametric trade-off. Parametric tests have more statistical power when their assumptions hold — they detect real effects with fewer participants. Non-parametric tests make fewer assumptions but have less power when parametric assumptions would have held. *'Lose power for robustness'* is the slogan. Cost of using non-parametric unnecessarily: missed real effects. Cost of using parametric inappropriately: invalid inference.

Chi-square goodness-of-fit. Tests whether observed frequencies match an expected distribution. *H₀:* observed matches expected. *H₁:* observed differs. Classic example: M&M colour distribution against the manufacturer's claim (13% brown, 14% yellow, 24% blue, 20% orange, 16% green, 13% red). χ² = Σ (O − E)²/E, df = k − 1 (one DoF spent on the total count being fixed).
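
The goodness-of-fit arithmetic is easy to check by hand. A minimal Python sketch (the observed counts below are hypothetical, not real M&M data; the expected proportions are the manufacturer's claim above):

```python
# Chi-square goodness-of-fit: chi2 = sum((O - E)^2 / E), df = k - 1.
# Hypothetical bag of 100 M&Ms; expected proportions from the claim above.
expected_props = {"brown": 0.13, "yellow": 0.14, "blue": 0.24,
                  "orange": 0.20, "green": 0.16, "red": 0.13}
observed = {"brown": 15, "yellow": 12, "blue": 28,
            "orange": 18, "green": 14, "red": 13}   # made-up counts

n = sum(observed.values())                      # total count (100)
chi2 = sum((observed[col] - n * p) ** 2 / (n * p)
           for col, p in expected_props.items())
df = len(expected_props) - 1                    # k - 1 = 5

print(round(chi2, 2), df)   # -> 1.71 5
```

Compare chi2 to the χ² critical value at df = 5; here 1.71 is far below the 0.05 cutoff (≈ 11.07), so the hypothetical bag is consistent with the claim.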

Chi-square test for independence. Tests whether two categorical variables are related. Data: contingency table. *H₀:* independent. *H₁:* not independent. Expected: E = (row total × column total) / grand total. df = (r − 1)(c − 1). For 2×2: df = 1. For 2×3: df = 2.

Why df = (r − 1)(c − 1). Once you fix the row and column totals (marginals), you can fill in only (r − 1)(c − 1) cells freely; the rest are determined.
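
The expected-count recipe and the resulting statistic, sketched in Python (the 2×2 counts are hypothetical, chosen to echo the dancing example in the Examples section):

```python
# Expected counts under independence: E = row_total * col_total / n,
# then chi2 = sum((O - E)^2 / E) with df = (r - 1)(c - 1).
table = [[10, 40],
         [40, 10]]                        # hypothetical 2x2 counts
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
df = (len(table) - 1) * (len(table[0]) - 1)
print(chi2, df)   # -> 36.0 1
```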

χ² effect sizes. Phi for 2×2 tables. Cramér's V for larger. Both range 0 to 1. Interpretation: 0.1 small, 0.3 moderate, 0.5 large (rough benchmarks).

Chi-square limitations. *(1)* Each observation must fall in exactly one cell — between-subjects only; for paired use McNemar. *(2)* Only frequencies, not means or percentages. *(3)* Each cell should have expected count ≥ 5; below that, the χ² approximation breaks down — use Fisher's exact (for 2×2) or combine categories. *(4)* Indicates presence/absence of association but not strength — report effect size.

Mann-Whitney U — non-parametric independent t. Rank all observations across both groups; sum ranks per group; compute U₁ = n₁n₂ + n₁(n₁ + 1)/2 − R₁ and similarly U₂; report U = min(U₁, U₂). Tests whether one group tends to have stochastically larger values. Use when comparing two independent groups on ordinal data, non-normal continuous data, or small samples with outliers.
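
The ranking-and-counting recipe, as a small Python sketch (the sample values are hypothetical; `midranks` averages ranks over ties):

```python
# Mann-Whitney U: rank everything jointly, take the rank sum of group 1,
# convert to U1 = n1*n2 + n1*(n1+1)/2 - R1, and report U = min(U1, U2).
def midranks(values):
    """Map each distinct value to its midrank (ties share the average rank)."""
    svals = sorted(values)
    ranks = {}
    i = 0
    while i < len(svals):
        j = i
        while j < len(svals) and svals[j] == svals[i]:
            j += 1
        ranks[svals[i]] = (i + 1 + j) / 2    # average of positions i+1 .. j
        i = j
    return ranks

def mann_whitney_u(a, b):
    rank_of = midranks(a + b)
    r1 = sum(rank_of[x] for x in a)          # rank sum of group 1
    n1, n2 = len(a), len(b)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    return min(u1, n1 * n2 - u1)             # U = min(U1, U2)

a = [310, 290, 305, 300]   # hypothetical scores, group A
b = [330, 340, 325, 315]   # hypothetical scores, group B
print(mann_whitney_u(a, b))   # -> 0.0 (complete separation)
```

U = 0 is the extreme case: every value in one group is below every value in the other.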

Wilcoxon signed-rank — non-parametric paired t. For each pair, compute the difference; rank the *absolute* differences; sum the ranks of positive (or negative) differences → W. Tests whether the differences are symmetrically distributed around zero. Use for paired or matched samples with non-normal differences.

Kruskal-Wallis — non-parametric one-way ANOVA. Rank all observations across all groups; compute H from rank sums. Tests whether at least one group has stochastically larger/smaller values. Follow with Dunn's post-hoc (with multiple-comparison correction) to identify which groups differ.
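
A sketch of the H statistic under the no-ties formula H = 12/(N(N + 1)) · Σ Rⱼ²/nⱼ − 3(N + 1) (the scores are hypothetical and untied; real tied data needs a tie correction):

```python
# Kruskal-Wallis H from rank sums (no tie correction; assumes untied data).
def kruskal_wallis_h(groups):
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}   # untied -> unique ranks
    n_total = len(pooled)
    sum_term = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * sum_term - 3 * (n_total + 1)

groups = [[7, 1, 3], [6, 8, 9], [2, 4, 5]]   # hypothetical scores, 3 groups
print(round(kruskal_wallis_h(groups), 3))    # -> 4.267
```

Compare H to χ² with df = k − 1 = 2; here 4.267 < 5.99, so not significant at α = 0.05.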

Friedman's test — non-parametric RM-ANOVA. Rank within each subject across the k conditions; sum ranks per condition; test whether rank sums differ. Use when same participants are measured under k conditions and the parametric assumptions fail.
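
The within-subject ranking, sketched in Python with the standard statistic χ²_F = 12/(nk(k + 1)) · Σ Rⱼ² − 3n(k + 1) (the data matrix is hypothetical; ties within a row are not handled):

```python
# Friedman: rank conditions within each subject, sum ranks per condition.
def friedman_stat(data):          # data[subject][condition], no ties in a row
    n, k = len(data), len(data[0])
    rank_sums = [0.0] * k
    for row in data:
        order = sorted(range(k), key=lambda j: row[j])
        for r, j in enumerate(order, start=1):
            rank_sums[j] += r     # rank r goes to condition j
    return 12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)

data = [[5, 7, 9], [4, 6, 8], [6, 5, 9], [3, 7, 8]]   # 4 subjects x 3 conditions
print(friedman_stat(data))   # -> 6.5
```

Compare to χ² with df = k − 1 = 2; 6.5 > 5.99, so significant at α = 0.05.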

McNemar's test — paired categorical (2×2). When the *same* subjects are measured on two binary outcomes (e.g., before/after intervention; correct/incorrect on two tests). **Compares only the *discordant* cells** (yes→no and no→yes counts). χ² = (b − c)² / (b + c), where b and c are the off-diagonal counts.
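
In code the statistic is one line; here it is applied to the discordant counts from the McNemar example in the Examples section (25 and 5):

```python
# McNemar's chi-square: only the discordant cells b and c enter.
def mcnemar_chi2(b, c):
    return (b - c) ** 2 / (b + c)   # compare to chi-square with df = 1

print(round(mcnemar_chi2(25, 5), 2))   # -> 13.33
```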

Binomial sign test. Simplest paired test: count signs of differences (+ or −); test against Binomial(n, 0.5). Useful when differences are not even rank-comparable (true ordinal pairs with no magnitude information).
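
An exact two-sided version is a few lines of stdlib Python (the difference scores are hypothetical; zero differences are dropped, as is conventional):

```python
# Exact binomial sign test: drop zero differences, count positive signs,
# and compute a two-sided tail probability under Binomial(n, 0.5).
from math import comb

def sign_test_p(diffs):
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    k = sum(d > 0 for d in nonzero)
    def tail(lo, hi):
        return sum(comb(n, i) for i in range(lo, hi + 1)) / 2 ** n
    return min(1.0, 2 * min(tail(0, k), tail(k, n)))

diffs = [1, 2, 1, 3, 1, 1, -1, 2, 1, 1]   # hypothetical paired differences
print(round(sign_test_p(diffs), 4))       # -> 0.0215
```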

Spearman ρ — non-parametric correlation. Pearson r computed on the ranks. Covered in Unit 6. Robust to outliers; captures monotone associations; works for ordinal data.

Fisher's exact test. For 2×2 contingency tables with small expected counts (< 5). Computes the exact probability of the observed table (and more extreme tables) under H₀ using the hypergeometric distribution. Slower than χ² but exact — no approximation. Use when χ² assumptions fail.
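
The hypergeometric computation fits in a short stdlib sketch. The table below is the one from the Fisher example in the Examples section, and the two-sided rule used (sum over all tables no more probable than the observed one) matches R's fisher.test:

```python
# Fisher's exact test for a 2x2 table [[a, b], [c, d]] with fixed margins.
from math import comb

def fisher_exact_2x2(a, b, c, d):
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    def p_table(x):               # P(top-left cell = x | fixed margins)
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    p_obs = p_table(a)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)   # two-sided

print(round(fisher_exact_2x2(3, 7, 8, 2), 4))   # -> 0.0698
```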

When to drop to non-parametric — checklist. Severely non-normal residuals + small n. Ordinal DV. Unfixable outliers. Rank-based research hypothesis ('does one group tend to rank higher overall?'). Likert data with few scale points and unequal spacing.

Robust alternatives between parametric and non-parametric. Trimmed means (remove extreme % from each tail before computing). Winsorising (replace extreme values with percentile cutoffs). Bootstrap confidence intervals (resample with replacement; no distributional assumption). These keep power closer to parametric while resisting outliers.
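
A percentile-bootstrap CI for the mean, as a minimal sketch (the data vector is hypothetical; `stat` can be swapped for a trimmed mean or median):

```python
# Percentile bootstrap: resample with replacement, then take the empirical
# alpha/2 and 1 - alpha/2 quantiles of the resampled statistic.
import random
from statistics import mean

def bootstrap_ci(data, stat, n_boot=10_000, alpha=0.05, seed=42):
    rng = random.Random(seed)
    boots = sorted(stat([rng.choice(data) for _ in data])
                   for _ in range(n_boot))
    return boots[int(alpha / 2 * n_boot)], boots[int((1 - alpha / 2) * n_boot)]

data = [2, 3, 3, 4, 5, 5, 6, 40]   # hypothetical, one big outlier
lo, hi = bootstrap_ci(data, mean)
print(lo, hi)   # a wide interval, dragged upward by resamples containing 40
```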

Reporting non-parametric results. State the test, the test statistic (U, W, H, χ², τ, ρ), the p-value, and the effect size. *Example:* 'Mann-Whitney U = 40, p = 0.03, r = Z/√N ≈ 0.45 (moderate effect)'. The effect size for rank tests is typically r = Z/√N.

Definitions

  • Non-parametric test — Test that does not assume a specific distribution for the data. Rank-based or count-based.
  • Chi-square goodness-of-fit — Tests whether observed category counts match an expected distribution. df = k − 1.
  • Chi-square test for independence — Tests whether two categorical variables are associated. df = (r − 1)(c − 1).
  • Phi (φ) — Effect size for 2×2 contingency tables. Range [0, 1].
  • Cramér's V — Generalisation of φ to larger tables.
  • Mann-Whitney U — Non-parametric counterpart of independent t. Rank-based; tests stochastic dominance between two independent groups.
  • Wilcoxon signed-rank — Non-parametric counterpart of paired t. Signed-rank-based; tests symmetry of differences around zero.
  • Kruskal-Wallis H — Non-parametric counterpart of one-way ANOVA. Rank-based across k independent groups.
  • Friedman test — Non-parametric counterpart of repeated-measures ANOVA. Rank within subjects across conditions.
  • McNemar's test — Paired binary outcome test. Compares discordant cells b and c in a 2×2. χ² = (b − c)²/(b + c).
  • Fisher's exact test — Exact test for 2×2 contingency with small expected counts (< 5). Uses hypergeometric distribution; no asymptotic approximation.
  • Binomial sign test — Simplest paired test: count signs of differences; test against Binomial(n, 0.5).
  • Stochastic dominance — What rank-based tests actually test: 'one group tends to have larger values than another'. Not the same as comparing means or medians.

Formulas

  • χ² = Σ (O − E)² / E. df: goodness-of-fit k − 1; independence (r − 1)(c − 1). Expected under independence: E = (row total × column total) / grand total.
  • φ = √(χ²/n) for 2×2; Cramér's V = √(χ² / (n(min(r, c) − 1))) for larger tables.
  • Mann-Whitney: U₁ = n₁n₂ + n₁(n₁ + 1)/2 − R₁; U = min(U₁, U₂).
  • Kruskal-Wallis: H = 12/(N(N + 1)) · Σ Rⱼ²/nⱼ − 3(N + 1).
  • McNemar: χ² = (b − c)² / (b + c), df = 1.
  • Rank-test effect size: r = Z / √N.

Derivations

Why expected = row × col / total in χ² independence. Under H₀ (independence): P(row i, col j) = P(row i) · P(col j). Estimate marginal probabilities from data: p̂ᵢ = Rᵢ/n, p̂ⱼ = Cⱼ/n. Multiply by n to get the expected count: Eᵢⱼ = n · (Rᵢ/n)(Cⱼ/n) = RᵢCⱼ/n. The sum of (O − E)² / E is approximately χ² with (r − 1)(c − 1) df by the multivariate CLT applied to multinomial counts.

Why df = (r − 1)(c − 1) in independence. A contingency table has r × c cells, whose counts sum to n, leaving rc − 1 free probabilities. Under H₀ we estimate one probability per row and one per column — (r − 1) + (c − 1) free parameters once each set sums to one. Free cells = (rc − 1) − (r − 1) − (c − 1) = (r − 1)(c − 1).

Mann-Whitney U as area-under-curve interpretation. U/(n₁n₂) is the probability that a randomly chosen value from group 1 exceeds a randomly chosen value from group 2 (with ties counted as 0.5). This is the ROC-AUC interpretation — connects the test to discriminability.

McNemar's test derivation. In a paired 2×2 table with cells a (both yes), b (yes→no), c (no→yes), d (both no), the off-diagonal cells reflect *change*. Under H₀ (no change), each discordant pair is equally likely to land in b or c, so b ~ Binomial(b + c, 0.5). The standardised statistic (b − c)/√(b + c) is approximately Normal, so χ² = (b − c)²/(b + c) with df = 1.

Why non-parametric tests have less power. Parametric tests use the *magnitude* of each observation; rank-based tests use only the *order*. Ranking discards information about how far apart observations are. When parametric assumptions hold (data are well-behaved), this discarded information would have helped detect effects → non-parametric is less powerful (the asymptotic relative efficiency of Wilcoxon vs the t-test under Normal data is 3/π ≈ 0.955, i.e., roughly a 5% loss).

Examples

  • χ² goodness-of-fit. 256 artists' zodiac signs. Under H₀ uniform: expected ≈ 21.3 per sign. Observed counts and (O−E)²/E summed give χ² = 22.4, df = 11, critical at α = 0.05 ≈ 19.7 → reject H₀ → zodiac distribution is non-uniform among artists.
  • χ² independence — dancing comfort × personality. 100 people: 50 extroverts, 50 introverts; 50 comfortable, 50 not. Observed cells: 10/40/40/10. Expected under independence: 25 each. χ² = 4 × (15²/25) = 36, df = 1, p < .001 → strong association. Phi = √(36/100) = 0.6 (large effect).
  • Cramér's V example. 2×3 mental health × gender table, n = 150, χ² = 8.23, df = 2 → p < .05. V = √(χ² / (n(min(r, c) − 1))) = √(8.23/150) ≈ 0.23 (moderate). Report: 'χ²(2, n = 150) = 8.23, p < 0.05, V = 0.23'.
  • Mann-Whitney worked example. Groups n=10 each. Ranked values group A: 2,4,5,7,9,10,11,13,16,18 (sum = 95). Group B: 1,3,6,8,12,14,15,17,19,20 (sum = 115). U₁ = 10·10 + 10·11/2 − 95 = 60; U₂ = 100 + 55 − 115 = 40; U = min(U₁, U₂) = 40. Compare to the critical value for n₁ = n₂ = 10, two-tailed α = 0.05 (critical = 23): 40 > 23 → fail to reject.
  • Wilcoxon signed-rank. Paired differences: −3, +1, +5, +2, −1, +4. Ranks of absolute differences: 4, 1.5, 6, 3, 1.5, 5. Sum of positive ranks W⁺ = 1.5 + 6 + 3 + 5 = 15.5; sum of negative ranks W⁻ = 4 + 1.5 = 5.5. Take the smaller sum, 5.5, and compare to the critical value for n = 6.
  • McNemar example. 100 patients, before vs after intervention (yes/no). Discordant: 25 went yes→no; 5 went no→yes. χ² = (25 − 5)²/(25 + 5) ≈ 13.33, df = 1, p < .001 → significant change.
  • Fisher's exact at small n. 2×2 table: 3 vs 7; 8 vs 2. Expected cells include some < 5. Use Fisher's exact rather than χ². R: fisher.test(matrix(c(3,7,8,2), 2)).

Diagrams

  • Parametric ↔ nonparametric pairing table. Two-column table mapping each parametric test to its non-parametric counterpart with the design (independent / paired / repeated).
  • Decision flowchart. Branch on DV scale (continuous/ordinal/categorical), # groups (2 / 3+), design (independent / paired) → terminal test.
  • χ² 2×2 contingency. Observed vs expected cell counts, with (O−E)²/E shown per cell.
  • Mann-Whitney ranks. Two groups' values, joint ranking on a number line; rank sums per group.
  • Wilcoxon signed-rank. Pre/post values, differences, absolute ranks of differences, signed rank sum.
  • McNemar 2×2. Highlight discordant cells b and c; concordant cells a and d don't enter the statistic.

Edge cases

  • Small expected counts (< 5) in χ² → use Fisher's exact instead. The χ² approximation breaks down.
  • Tied ranks in Mann-Whitney / Wilcoxon → use mid-ranks (average of tied positions); large numbers of ties reduce power. R's wilcox.test handles this automatically.
  • Non-parametric tests have less power when parametric assumptions actually hold — don't use them unnecessarily.
  • McNemar's χ² approximation needs roughly b + c ≥ 25 discordant pairs. Below that, use the exact (binomial) version: McNemar's exact test.
  • A zero expected count in χ² makes (O − E)²/E undefined. Combine categories or use Fisher's exact.
  • Highly skewed continuous data with moderate n (~30+) are often analysable with parametric tests via the CLT. The transition n is fuzzy; report Q-Q plots to justify the choice.
  • Likert with many scale points (e.g., 7-point or visual analogue) is often defensibly treated as interval. Strict ordinal would call for Mann-Whitney / Wilcoxon.
  • Permutation tests can replace many non-parametric tests with greater flexibility; they assume only exchangeability under H₀.

Common mistakes

  • Reporting χ² without checking E ≥ 5 per cell. Approximation invalid; use Fisher's exact.
  • Confusing Mann-Whitney (independent) with Wilcoxon signed-rank (paired). Different designs.
  • Using Mann-Whitney for ordinal data with many ties — power drops.
  • Applying nonparametric tests when parametric assumptions hold — losing power for no reason.
  • Reporting χ² as 'evidence of causation' — only association.
  • Using χ² on paired binary data — use McNemar instead.
  • Reporting Mann-Whitney as 'compares means' — it tests stochastic dominance (medians only under a shift assumption), not means.
  • Forgetting effect size for rank tests — report r = Z/√N.
  • Combining cells post-hoc to make E ≥ 5 without pre-specifying — p-hacking-adjacent.

Shortcuts

  • χ² = Σ(O − E)²/E. df: indep (r−1)(c−1); GoF k−1.
  • Mann-Whitney = independent t in non-parametric form.
  • Wilcoxon signed-rank = paired t in non-parametric form.
  • Kruskal-Wallis = one-way ANOVA non-parametric. Friedman = RM-ANOVA non-parametric.
  • McNemar = paired χ² 2×2 (compares discordant cells only).
  • Fisher's exact for small expected counts in 2×2.
  • Effect size for χ²: φ (2×2), Cramér's V (larger). For rank tests: r = Z/√N.
  • Drop to non-parametric if: ordinal DV, severe non-normality + small n, unfixable outliers.

Proofs / Algorithms

Expected cell counts under independence. Under H₀, pᵢⱼ = pᵢ₊ · p₊ⱼ. Estimating marginal probabilities from data: p̂ᵢ₊ = Rᵢ/n, p̂₊ⱼ = Cⱼ/n. Expected count under H₀: Eᵢⱼ = n · p̂ᵢ₊ · p̂₊ⱼ = RᵢCⱼ/n. The test statistic Σ (O − E)²/E converges in distribution to χ² with (r − 1)(c − 1) df under H₀ as n → ∞ (large-sample multinomial theory).

Mann-Whitney as U / (n₁n₂) = AUC. Define U = Σᵢ Σⱼ [1(Xᵢ > Yⱼ) + ½ · 1(Xᵢ = Yⱼ)] over all n₁n₂ pairs. Then U/(n₁n₂) = P(X > Y) + ½ P(X = Y). This is exactly the area under the ROC curve when treating group as label and value as predictor. Under H₀ of identical distributions, U/(n₁n₂) = ½ — random discrimination.

McNemar from binomial. Under H₀ of no change, each discordant pair is independently 'yes→no' with probability p and 'no→yes' with probability 1 − p, with p = ½ under H₀. So b ~ Binomial(b + c, ½). By the Normal approximation, (b − c)/√(b + c) is approximately N(0, 1). Squared: χ² = (b − c)²/(b + c), which is χ² with 1 df.