Behavioral Research: Statistical Methods

CG3.402

Vinoo Alluri•Monsoon 2025-26•4 credits

Definitions

Unit 1 — Why Do Statistics? (Biases & Base Rates)

The Case for Statistics — Biases, Base Rates, Bayes

Belief bias: Judging an argument's validity by the believability of its conclusion, not by the logic. Evans, Barston & Pollard (1983).
Confirmation bias: Seeking confirming evidence for a hypothesis rather than evidence that could falsify it. Demonstrated by the Wason card-selection task.
Simpson's paradox: A trend appearing in groups reverses when the groups are combined (or vice versa). UC Berkeley 1973 admissions is the classic example.
Base-rate fallacy: Ignoring the prior probability (base rate) of an event when interpreting a positive test. People confuse sensitivity with PPV.
Bayes' rule: $P (H ∣ D) = P (D ∣ H) P (H) / P (D)$ . Posterior = likelihood × prior / evidence. Formal corrective to base-rate intuition.
PPV (Positive Predictive Value): P(disease | positive test). Depends critically on prevalence — at low prevalence even sensitive tests have low PPV.
Sensitivity / Specificity: P(+ | disease) and P(− | no disease). Properties of the test, distinct from PPV.
Independent / Dependent variable: IV = what you manipulate (predictor). DV = what you measure (outcome). Modern terminology: predictor / outcome.
Between-subjects design: Different participants in different conditions. No carryover; needs more participants to achieve power.
Within-subjects design: Same participants in all conditions. More power but vulnerable to fatigue, practice, carryover effects.
Mixed design: Some factors between-subjects, others within. Common for pre/post + group designs.
Confound: A third variable related to both the predictor and outcome, creating spurious association. Threatens internal validity.
Double-blind: Neither participant nor experimenter knows the condition. Controls both experimenter bias and reactivity.
p-hacking (data mining): Trying many analyses and reporting only the favourable one. Inflates Type I error well beyond nominal α.
HARKing: Hypothesising After Results are Known. Reporting a post-hoc finding as if it were the original hypothesis.
Publication bias: Journals preferentially publish significant findings. Negative results sit in the file drawer; the published literature overestimates effect sizes.
Replication crisis: Empirical finding (OSC 2015 and others) that a large fraction of behavioural-science findings fail to replicate. Partly driven by p-hacking and publication bias.
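
The base-rate, Bayes' rule, and PPV entries above can be tied together with a small worked example. This is a minimal sketch; the test characteristics (99% sensitivity, 95% specificity) and the function name are hypothetical values chosen for illustration, not figures from the course.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) computed with Bayes' rule."""
    true_pos = sensitivity * prevalence                # P(+ | disease) * P(disease)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(+ | no disease) * P(no disease)
    return true_pos / (true_pos + false_pos)           # divide by the evidence P(+)

# Hypothetical test: 99% sensitive, 95% specific, at three prevalences.
for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(0.99, 0.95, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
```

At 1% prevalence the PPV is only about 0.17 even though the test is 99% sensitive, which is the base-rate fallacy in numbers.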

Unit 2 — Research Design & Measurement

Scales, Reliability, Validity

Operational definition: Working definition that specifies how to measure an abstract construct. Necessary for any empirical study.
Nominal scale: Categorical, no order. Eye colour, sex, blood type. Allowable: mode, counts, χ².
Ordinal scale: Ordered categories, intervals not equal. Race position, Likert (strictly). Allowable: median, percentiles, Spearman/Kendall.
Interval scale: Numerical, equal spacing, no true zero. °C, calendar year. Allowable: mean, SD, t, ANOVA, Pearson r. No meaningful ratios.
Ratio scale: Numerical, equal spacing, true zero. Reaction time, weight, height. All operations including ratios meaningful.
Continuous vs discrete: Orthogonal to NOIR. Whether the variable can take any value in a range or only specific values.
Reliability: Consistency / repeatability of a measurement. Four flavours: test-retest, inter-rater, parallel forms, internal consistency.
Test-retest reliability: Same measurement on same units at two times. Quantified by correlation between the two.
Inter-rater reliability: Agreement among different raters on the same items. Cohen's κ (2 raters), Fleiss κ (>2), Kendall W (ordinal), Krippendorff α (general).
Parallel forms reliability: Equivalent versions of the same measurement give similar results. Correlation of two forms.
Internal consistency: Items within a single instrument correlate. Cronbach's α, split-half, KR-20/21.
Cohen's κ: (p_o − p_e)/(1 − p_e). Inter-rater agreement above chance for nominal data. > 0.8 excellent, 0.6–0.8 substantial, 0.4–0.6 moderate, < 0.4 poor.
Cronbach's α: Internal consistency: (k/(k−1))(1 − Σσ²ᵢ/σ²_total). > 0.7 acceptable, > 0.8 good. > 0.95 may indicate redundancy.
Validity: Accuracy of a measurement w.r.t. the construct. Five flavours: internal, external, construct, face, ecological.
Internal validity: Can we attribute DV changes to the IV (no confounds)? Strengthened by random assignment, control groups, double-blind.
External validity: Do findings generalise to other people, settings, times? Strengthened by random sampling, diverse samples, replication.
Construct validity: Does the measure actually capture the construct? Established through convergent (same-construct correlation high) and discriminant (other-construct correlation low) evidence.
Face validity: Does the test superficially look like it taps the construct? Weakest type; matters more for participant buy-in and policymaker acceptance than scientific validity.
Ecological validity: Does the experimental setup resemble real-world conditions? Desirable but not strictly required — lab simplifications often generalise.
Convergent / discriminant validity: Convergent: high correlation with same-construct measures. Discriminant: low correlation with unrelated-construct measures. Both required for construct validity.
Regression to the mean: Extreme scores tend to be followed by less extreme ones. Easily mistaken for a treatment effect (Kahneman pilots example).
Confound: Third variable related to both IV and DV that could itself explain the outcome. Threatens internal validity. Random assignment is the gold-standard fix.
Double-blind: Neither participant nor experimenter knows the condition. Defeats both experimenter bias and reactivity. Standard in clinical trials.
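
The Cohen's κ and Cronbach's α formulas in this unit can be sketched directly from their definitions. The ratings and item scores below are made-up illustration data, and the function names are assumptions; sample variances (n − 1 denominator) are used for α.

```python
from collections import Counter
from statistics import variance

def cohens_kappa(rater_a, rater_b):
    """(p_o - p_e) / (1 - p_e) for two raters assigning nominal labels."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n        # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def cronbach_alpha(items):
    """items: k lists, each holding one item's scores for the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]               # total score per respondent
    item_var_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical data: two raters on 10 items, and a 3-item scale with 4 respondents.
rater_a = ["y", "y", "n", "n", "y", "n", "y", "y", "n", "n"]
rater_b = ["y", "y", "n", "y", "y", "n", "n", "y", "n", "n"]
scale = [[2, 4, 3, 5], [3, 5, 3, 4], [2, 5, 4, 5]]
print(round(cohens_kappa(rater_a, rater_b), 3))
print(round(cronbach_alpha(scale), 3))
```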

Unit 3 — Probability & Distributions

Probability, Distributions, and the CLT

Frequentist probability: Long-run frequency of an event in repeated sampling. Objective but counter-intuitive for one-off events.
Bayesian probability: Degree of subjective belief, updated by evidence. Intuitive for one-off events; depends on priors.
Independent events: $P (A \cap B) = P (A) P (B)$ , equivalently $P (A ∣ B) = P (A)$ . Coin flips are independent; correlated measurements are not.
i.i.d.: Independent AND identically distributed. The bedrock assumption of most inferential tests.
Sample vs population: Population = the full set of interest. Sample = the subset you actually observe. Inference goes sample → population.
Sampling distribution: Distribution of a statistic across many hypothetical samples. The secret heart of inferential statistics — every test compares an observed statistic to this distribution under the null.
PDF / PMF / CDF: Density (continuous) / mass (discrete) / cumulative $F(x) = P(X \leq x)$. For continuous RVs the probability of any single exact value is zero: $P(X = x) = 0$.
Bernoulli(p): Single yes/no trial. $P (X = 1) = p$ . Mean $p$ , variance $p (1 - p)$ .
Binomial(n, p): Sum of $n$ i.i.d. Bernoulli(p) trials. PMF $\binom{n}{k} p^{k} (1 - p)^{n - k}$. Mean $n p$, variance $n p (1 - p)$.
Normal $\mathcal{N}(\mu, \sigma^2)$: Bell-shaped, symmetric, two parameters. 68/95/99.7 rule. Standard Normal is $N (0, 1)$ .
t-distribution: Like Normal with heavier tails; one parameter (df). Use when $σ$ unknown, small samples. → Normal as df → ∞.
Chi-square $\chi^2_k$: Sum of $k$ squared standard Normals. Right-skewed, $\geq 0$ . Mean $k$ , variance $2 k$ . Used in $χ^{2}$ tests.
F-distribution: Ratio of two scaled chi-squares. Right-skewed, $\geq 0$ . Two df parameters. Used in ANOVA / regression.
Central Limit Theorem (CLT): Sampling distribution of $\bar{X}$ → $N(μ, σ^{2} / n)$ as $n$ grows, regardless of population shape (finite variance required).
Law of Large Numbers: Sample mean → population mean as $n \to \infty$ . About convergence of the point estimate.
Standard Error of the Mean (SEM): $σ / \sqrt{n}$, the SD of the sampling distribution of the mean. Measures precision of $\bar{X}$ as an estimate of $μ$.
Empirical rule (68/95/99.7): For Normal data, ~68% within $μ \pm σ$ , ~95% within $μ \pm 2 σ$ , ~99.7% within $μ \pm 3 σ$ .
Sampling with vs without replacement: With replacement is pure i.i.d. Without is dependent in principle but negligibly so when population ≫ sample.
R four-letter pattern: d density / PMF, p cumulative CDF, q quantile (inverse CDF), r random sample. Works for every distribution: norm, binom, t, chisq, f, …
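
The CLT and SEM entries can be checked by simulation. The sketch below is in Python rather than R and uses only the standard library; the population (exponential with rate 1, so mean 1 and SD 1), the sample size, and the number of replications are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)          # fixed seed so the run is reproducible
n, reps = 30, 5000      # sample size and number of simulated samples (arbitrary)

# Population: Exponential(rate=1), strongly right-skewed, mean 1, SD 1.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

theoretical_sem = 1.0 / sqrt(n)   # sigma / sqrt(n)
print(f"mean of sample means: {mean(sample_means):.3f} (population mean 1.0)")
print(f"SD of sample means:   {stdev(sample_means):.3f} (sigma/sqrt(n) = {theoretical_sem:.3f})")
```

Even though the population is skewed, the simulated sample means centre on the population mean and their spread is close to σ/√n, as the CLT and SEM definitions predict.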

Unit 4 — Data Visualization

Plots, Matching, and Common Pitfalls

Anscombe's quartet: Four datasets sharing mean / SD / r / regression line but with wildly different scatter shapes. The slogan: statistics compress, visualisations reveal. Always plot.
Histogram: Bins a continuous variable and shows counts per bin. Reveals distribution shape; sensitive to bin width.
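Boxplot (box-and-whisker): Five-number summary: min, Q1, median, Q3, max. Under the Tukey convention, whiskers extend to the most extreme data points within 1.5 × IQR of Q1 and Q3; points beyond are drawn individually as outliers. Hides bimodality.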
Violin plot: Mirrored KDE density on each side. Communicates summary AND shape. Cousin of the boxplot.
Raincloud plot: Violin + boxplot + individual data points. The gold standard for behavioural data — distribution, summary, every observation in one figure.
Mosaic plot: Grid of rectangles with areas proportional to joint frequencies of categorical variables. For two-way categorical relationships.
Heat map: Grid where colour encodes value. Common for correlation matrices, time × subject data. Use viridis / cividis.
Bar chart: Length encodes value. For counts / means / proportions across discrete categories. Avoid for continuous data shapes.
Pie chart: Wedge angles encode proportions. Use sparingly — angles are perceptually weak. Limit to 3–5 categories.
Tukey outlier rule: $x > Q_{3} + 1.5 \cdot IQR$ or $x < Q_{1} - 1.5 \cdot IQR$ . The boxplot whisker boundary.
IQR: $Q_{3} - Q_{1}$ . Robust spread of the middle 50%.
Skew: Asymmetry of a distribution. Positive (long right tail), negative (long left tail), or symmetric (mean = median).
Bimodal distribution: Distribution with two peaks. Often indicates two subpopulations or strategies.
KDE (Kernel Density Estimate): Smoothed estimate of a continuous distribution. The basis of violin plots.
Data-to-ink ratio (Tufte): Fraction of ink on a chart that encodes data. Higher = better. Strip decoration.
Lie factor (Tufte): Visual change ÷ data change. Should be ~1. Truncated axes inflate it.
Chart junk (Tufte): Decorative elements that don't encode data — drop shadows, 3D effects, gradient backgrounds. Remove.
Colour-blind friendly palette: Palette legible to viewers with red-green deficiency (~8% of men). Examples: viridis, cividis, ColorBrewer safe schemes.
Data transformation: Functional transformation of a variable (log, sqrt, 1/x, Box-Cox) to reduce skew or stabilise variance before parametric tests. Interpret in transformed scale only.
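
The IQR and Tukey outlier rule from this unit can be sketched as follows. Quartile conventions differ between software packages; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and the data are invented for illustration.

```python
from statistics import quantiles

def tukey_outliers(data):
    """Return points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and the fences."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return outliers, (lower, upper)

# Hypothetical data with one extreme value.
data = [2, 3, 4, 5, 6, 7, 8, 9, 30]
outliers, fences = tukey_outliers(data)
print(outliers, fences)
```

With these values Q1 = 4 and Q3 = 8, so the fences are −2 and 14, and only 30 is flagged.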

Unit 5 — Descriptive Statistics

Centre, Spread, Standardisation

Mean (arithmetic average): $\bar{x} = \frac{1}{n} \sum x_{i}$. Centre of mass. Uses all data; sensitive to outliers and skew.
Median: Middle value when data sorted (50th percentile). Robust to outliers and skew. Use for ordinal or skewed data.
Mode: Most-frequent value. Only meaningful central tendency for nominal data.
Range: max − min. Simplest spread; extremely sensitive to outliers.
IQR: Q₃ − Q₁. Spread of the middle 50%. Robust.
Variance: Average squared deviation from the mean. Sample: $s^{2} = \frac{1}{n - 1} \sum (x_{i} - \bar{x})^{2}$ (Bessel). Squared units.
Standard deviation (SD): $s = \sqrt{s^{2}}$. Same units as the data. The standard spread measure for parametric tests.
MAD (Median Absolute Deviation): median $∣ x_{i} - \tilde{x} ∣$ . Robust analog of SD. Under Normal, $σ \approx 1.4826 \cdot MAD$ .
z-score: $(x - μ) / σ$ . Standardised value: SDs above/below the mean. Unit-less; preserves shape.
Coefficient of variation (CV): $s / \bar{x}$. Unitless relative dispersion. Useful when comparing variables with different units.
Bessel's correction: Divide by $n - 1$ in sample variance to remove bias from fitting $\bar{x}$ to the sample. One degree of freedom spent.
Skewness: Asymmetry of the distribution. Positive (right tail), negative (left tail). Pearson: $3 (\bar{x} - \tilde{x}) / s$.
Bimodal distribution: Distribution with two peaks. Often indicates two subpopulations. Central-tendency measures unrepresentative.
Geometric mean: $\sqrt[n]{x_{1} \cdots x_{n}}$. Appropriate for ratio-scale data and rates / growth factors. Always $\leq$ arithmetic mean.
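
The descriptive measures defined in this unit can be computed in a few lines. This is a sketch using Python's standard `statistics` module on an invented data set; the variable names are illustrative only.

```python
from statistics import mean, median, variance, stdev, geometric_mean

data = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical positive, ratio-scale data

x_bar = mean(data)                         # arithmetic mean
med = median(data)                         # 50th percentile
s2 = variance(data)                        # sample variance (divides by n - 1)
s = stdev(data)                            # square root of the sample variance
mad = median([abs(x - med) for x in data]) # median absolute deviation
z_scores = [(x - x_bar) / s for x in data] # standardised values
cv = s / x_bar                             # coefficient of variation
gm = geometric_mean(data)                  # n-th root of the product

print(x_bar, med, round(s2, 3), round(s, 3), mad, round(cv, 3), round(gm, 3))
```

The z-scores sum to zero by construction, and the geometric mean comes out below the arithmetic mean, as the definition above states.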

Unit 6 — Correlation & Reliability Quantified

Pearson, Spearman, Partial, Reliability Metrics

Pearson r: Standardised covariance $cov (X, Y) / (s_{X} s_{Y})$ . Range $[- 1, 1]$ . Captures linear association. Assumes continuous, approximately Normal, no extreme outliers.
Spearman ρ: Pearson r computed on the ranks of X and Y. Captures monotone (not necessarily linear) association. Robust to outliers. Works for ordinal data.
Kendall τ: (Concordant − discordant) / total pairs. Robust ordinal association measure. Smaller magnitude than Spearman; preferred for small n with many ties.
r² (coefficient of determination): Proportion of variance in Y shared with X (and vice versa). Range $[0, 1]$ . Equals R² in simple linear regression.
Partial correlation: Correlation between X and Y after removing the linear effect of Z from both. Tests 'does X relate to Y beyond what Z explains?'
Semi-partial (part) correlation: Correlation between Y (or X) and the residual of X (or Y) after removing Z. Z is stripped from only one side. Asymmetric.
Correlation ≠ causation: Four reasons two variables can correlate: A → B, B → A, C → both, coincidence. Correlation establishes association, not causation.
Cohen's κ: $(p_{o} - p_{e}) / (1 - p_{e})$ . Inter-rater agreement above chance for nominal data with two raters. Negative if worse than chance.
Fleiss' κ: Cohen's κ generalised to more than two raters on nominal data.
Kendall's W: Coefficient of concordance for multiple raters ranking items (ordinal data).
Krippendorff's α: General-purpose reliability coefficient: any number of raters, missing data, all measurement levels.
Intra-rater reliability: Same rater measuring same items at two time points. Pearson r for continuous; Cohen's κ for nominal.
Cronbach's α: $\frac{k}{k - 1} (1 - \sum σ_{i}^{2} / σ_{X}^{2})$ . Internal consistency of a k-item scale. > 0.70 acceptable; > 0.95 suggests redundancy.
Split-half reliability: Split items into two halves; compute the correlation between half scores. Spearman-Brown corrects for the length effect.
Kuder-Richardson 20/21: Internal consistency for binary-item tests. KR-20 for varying item difficulties; KR-21 assumes equal difficulties.
Outlier: Data point unusually distant from the rest. Caused by errors (correct/remove), processing mistakes, or natural variability (respect and investigate).
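
The relationship between Pearson r and Spearman ρ (Pearson computed on ranks) can be made concrete. The sketch below implements both from the definitions above, with average ranks for ties; the function names and the data are assumptions for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson r computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotone but non-linear relationship: y = x squared.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson_r(x, y), 4), round(spearman_rho(x, y), 4))
```

Because the relationship is perfectly monotone but not linear, ρ is 1 while r falls slightly short of 1.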

Unit 7 — Hypothesis Testing & NHST

p-values, Errors, Power, t-tests

Theory vs hypothesis: Theory = general framework. Hypothesis = specific falsifiable prediction. Theories generate hypotheses.
Falsifiability (Popper): A scientific hypothesis must admit some possible observation that would prove it wrong. Science can fail to falsify a hypothesis, but it never proves one.
Null hypothesis (H₀): The 'no effect / no difference' default. We try to reject H₀; never 'accept'.
Alternative hypothesis (H₁): The claim the researcher believes — an effect exists.
One-tailed vs two-tailed: One-tailed: direction pre-specified; opposite treated as null. Two-tailed: any direction matters. Two-tailed is the default.
α (significance level): Threshold p-value for rejecting H₀. Probability of Type I error. Convention: 0.05 in behavioural science.
p-value: P(data this extreme or more | H₀). NOT P(H₀ | data). The most-misinterpreted concept in statistics.
Type I error: Rejecting a true H₀. False positive. Rate = α. Chosen by the researcher.
Type II error: Failing to reject a false H₀. False negative. Rate = β. Determined by n, effect size, α, variance.
Statistical power: 1 − β = P(reject H₀ | H₁ true). Probability of detecting a real effect. Convention: ≥ 0.80.
Cohen's d: Standardised mean difference: $(\bar{X}_{1} - \bar{X}_{2}) / s_{pooled}$. 0.2/0.5/0.8 = small/medium/large.
Effect size: Standardised magnitude of an effect, independent of sample size. Always report alongside p.
Statistical vs practical significance: Statistical = p < α. Practical = effect is meaningful in context. Large n can make trivial effects significant.
One-sample t-test: Tests sample mean against a hypothesised value. $t = (\bar{x} - μ_{0}) / (s / \sqrt{n})$, df = n − 1.
Independent (two-sample) t-test: Compares two unrelated group means. df = n₁ + n₂ − 2. Assumes equal variances (use Welch if not).
Paired t-test: Compares two related measurements on the same units. df = n − 1. More power than independent for same n.
Welch's t-test: Independent t without equal-variance assumption; adjusted df via Welch-Satterthwaite. Modern default.
Power analysis (a priori): Compute the n needed to achieve target power (e.g., 0.80) given expected effect size, α, and test type. Done BEFORE data collection.
Optional stopping: Peeking at data during collection and stopping when p < α. Inflates actual Type I rate above nominal α.
Multiple comparisons problem: Testing many hypotheses at α = 0.05 each — family-wise Type I error climbs. Unit 8 covers corrections.
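
The t statistics and Cohen's d defined in this unit follow directly from their formulas. This is a minimal sketch with invented groups; it computes test statistics and degrees of freedom only, since p-values would need a t-distribution CDF that the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def pooled_sd(g1, g2):
    """Pooled SD, weighting each sample variance by its degrees of freedom."""
    n1, n2 = len(g1), len(g2)
    return sqrt(((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2))

def cohens_d(g1, g2):
    return (mean(g1) - mean(g2)) / pooled_sd(g1, g2)

def independent_t(g1, g2):
    """Equal-variance (Student) two-sample t and its df."""
    n1, n2 = len(g1), len(g2)
    t = (mean(g1) - mean(g2)) / (pooled_sd(g1, g2) * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def one_sample_t(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)) and its df."""
    n = len(sample)
    t = (mean(sample) - mu0) / (sqrt(variance(sample)) / sqrt(n))
    return t, n - 1

# Hypothetical scores for two independent groups.
group1 = [5, 6, 7, 8, 9]
group2 = [3, 4, 5, 6, 7]
print(independent_t(group1, group2), round(cohens_d(group1, group2), 3))
print(one_sample_t(group1, 5))
```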

Unit 8 — Multiple Comparisons (FWER, FDR)

FWER vs FDR; Bonferroni, Holm, BH

Multiple comparisons problem: Running m tests at α each inflates FWER to ≈ $1 - (1 - α)^{m}$ . With m = 20, α = 0.05 → ~64% chance of any FP.
Family-Wise Error Rate (FWER): P(at least one false positive across all m tests). 'Did I make any mistake?' Conservative.
False Discovery Rate (FDR): E[FP/R] — expected proportion of false positives among rejections. 'How many of my claims are wrong?' Less conservative.
Bonferroni correction: Test each hypothesis at $α / m$. Controls FWER via the union bound, so it is valid under any dependence among tests. Simple but conservative, especially when tests are positively correlated.
Holm's stepwise correction: Sequential FWER control. Compare $p_{(i)}$ to $α / (m - i + 1)$ in order. Uniformly more powerful than Bonferroni.
Benjamini-Hochberg (BH): Sequential FDR control. Sort p; reject all $p_{(j)}$ up to the largest i with $p_{(i)} \leq (i / m) Q$ .
Permutation test: Empirical null distribution from label-shuffling. Handles correlated tests naturally; standard in fMRI.
Union bound (Boole's inequality): $P (⋃_{i} A_{i}) \leq \sum_{i} P (A_{i})$ . Foundation of Bonferroni; conservative when events overlap.
Garden of forking paths (Gelman): Implicit multiple comparisons from analytic choices (covariate inclusion, outlier criteria, etc.) made post-hoc. A form of p-hacking.
Pre-registration: Locking in hypotheses, design, and analysis plan before data collection. The main antidote to multiple-comparisons abuse.
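
The three corrections in this unit differ only in the threshold each sorted p-value is compared against. The sketch below implements them from the definitions above; the p-values are invented, and the function names are illustrative rather than taken from any package.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down: compare p_(i) to alpha / (m - i + 1); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha / (m - rank + 1):
            reject[idx] = True
        else:
            break
    return reject

def benjamini_hochberg(pvals, q=0.05):
    """Step-up: find the largest i with p_(i) <= (i / m) * q; reject p_(1) .. p_(i)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    largest = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:
            largest = rank
    reject = [False] * m
    for idx in order[:largest]:
        reject[idx] = True
    return reject

# Hypothetical p-values from five tests (deliberately unsorted).
pvals = [0.03, 0.001, 0.5, 0.011, 0.02]
print(sum(bonferroni(pvals)), sum(holm(pvals)), sum(benjamini_hochberg(pvals)))

# FWER for m = 20 independent tests at alpha = 0.05.
fwer_20 = 1 - (1 - 0.05) ** 20
print(round(fwer_20, 3))
```

On these values Bonferroni rejects one hypothesis, Holm two, and BH four, which illustrates the ordering from most to least conservative.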

Unit 9 — Non-parametric & Categorical Tests

Categorical & Rank-Based Tests

Non-parametric test: Test that does not assume a specific distribution for the data. Rank-based or count-based.
Chi-square goodness-of-fit: Tests whether observed category counts match an expected distribution. df = k − 1.
Chi-square test for independence: Tests whether two categorical variables are associated. df = (r − 1)(c − 1).
Phi (φ): $\sqrt{χ^{2} / n}$. Effect size for 2×2 contingency tables. Range [0, 1].
Cramér's V: $\sqrt{χ^{2} / (n \cdot \min(r - 1, c - 1))}$. Generalisation of φ to larger tables.
Mann-Whitney U: Non-parametric counterpart of independent t. Rank-based; tests stochastic dominance between two independent groups.
Wilcoxon signed-rank: Non-parametric counterpart of paired t. Signed-rank-based; tests symmetry of differences around zero.
Kruskal-Wallis H: Non-parametric counterpart of one-way ANOVA. Rank-based across k independent groups.
Friedman test: Non-parametric counterpart of repeated-measures ANOVA. Rank within subjects across conditions.
McNemar's test: Paired binary outcome test. Compares discordant cells b and c in a 2×2. $χ^{2} = (b - c)^{2} / (b + c)$ .
Fisher's exact test: Exact test for 2×2 contingency with small expected counts (< 5). Uses hypergeometric distribution; no asymptotic approximation.
Binomial sign test: Simplest paired test: count signs of differences; test against $P = 0.5$ .
Stochastic dominance: What rank-based tests actually test: 'one group tends to have larger values than another'. Not the same as means or medians.
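
The chi-square test for independence, Cramér's V, and McNemar's statistic can be sketched from their formulas. The contingency table and discordant counts below are invented; only the statistics are computed, not p-values.

```python
from math import sqrt

def chi_square_independence(table):
    """Pearson chi-square for an r x c table of counts; returns (chi2, df, n)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, n

def cramers_v(table):
    """sqrt(chi2 / (n * min(r - 1, c - 1))); equals phi for a 2x2 table."""
    chi2, _df, n = chi_square_independence(table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return sqrt(chi2 / (n * k))

def mcnemar_chi2(b, c):
    """(b - c)^2 / (b + c) from the two discordant cells of a paired 2x2 table."""
    return (b - c) ** 2 / (b + c)

# Hypothetical 2x2 table: rows = group, columns = outcome.
table = [[30, 10], [20, 40]]
chi2, df, n = chi_square_independence(table)
print(round(chi2, 3), df, round(cramers_v(table), 3))
print(mcnemar_chi2(15, 5))
```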

Unit 10 — Multicollinearity, PCA & Factor Analysis

VIF, PCA, EFA/CFA, Scree Plot

Multicollinearity: High correlation among predictors (not predictor-outcome). Inflates SEs of coefficients; signs can flip.
Variance Inflation Factor (VIF): $1/ (1 - R_{j}^{2})$ where $R_{j}^{2}$ is the R² regressing predictor j on others. > 5–10 is severe.
SMC (Squared Multiple Correlation): Maximal proportion of variance in a predictor explained by the others. $VIF = 1/ (1 - SMC)$ .
Curse of dimensionality: Data needs grow exponentially with the number of variables. Motivates dimensionality reduction.
Factor Analysis (FA): Latent-variable model: observed variables caused by unobserved factors + unique error. Models shared variance only.
Factor loading: $λ_{ij}$ — correlation between variable i and factor j. > 0.4 = strong; cross-loadings < 0.3.
Communality $h^2$: Sum of squared loadings of an item — proportion of variance explained by common factors.
EFA (Exploratory Factor Analysis): Data-driven, no prior structure. Discover how many factors fit.
CFA (Confirmatory Factor Analysis): Theory-driven, pre-specified factor structure. Test fit on independent data.
Principal Component Analysis (PCA): Orthogonal linear combinations of variables maximising variance. No latent model; data reduction.
Eigenvalue: Variance captured by a component / factor. Sum of eigenvalues = total variance.
Scree plot: Eigenvalues vs factor #. Retain factors above the elbow.
Kaiser rule: Retain factors with eigenvalue > 1. Crude; over-extracts in practice.
Parallel analysis: Retain factors whose eigenvalues exceed those of random data of the same shape. Best practice.
KMO (Kaiser-Meyer-Olkin): Sampling adequacy measure; should be > 0.6 (preferably > 0.8) for FA / PCA.
Bartlett's test of sphericity: Test that the correlation matrix is not an identity — should be significant for FA / PCA to be appropriate.
Varimax rotation: Orthogonal rotation; factors stay uncorrelated. Pushes loadings toward simple structure (each item loading mainly on one factor).
Oblimin / Promax rotation: Oblique rotation; factors can correlate. Appropriate when constructs overlap in reality.
Heywood case: Factor loading ≥ 1 (impossible for correlation). Indicates misspecification or too little data.
CFA fit indices: CFI > 0.95, RMSEA < 0.06, SRMR < 0.08, χ²/df < 2-3 for good fit.
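
For the special case of two standardised predictors with correlation r, both VIF and the PCA eigenvalues have closed forms, which makes the definitions above easy to check by hand. The sketch assumes exactly two predictors, so that $R_{j}^{2} = r^{2}$ and the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 − r; the value r = 0.9 is hypothetical.

```python
def vif_two_predictors(r):
    """VIF = 1 / (1 - R_j^2); with two predictors R_j^2 is just r^2."""
    return 1 / (1 - r ** 2)

def eigenvalues_2x2_correlation(r):
    """Eigenvalues of the correlation matrix [[1, r], [r, 1]], largest first."""
    return sorted([1 + r, 1 - r], reverse=True)

r = 0.9                                         # hypothetical predictor correlation
vif = vif_two_predictors(r)
eigenvalues = eigenvalues_2x2_correlation(r)
retained = [ev for ev in eigenvalues if ev > 1] # Kaiser rule: keep eigenvalues > 1

print(round(vif, 2))
print([round(ev, 2) for ev in eigenvalues], len(retained))
```

The eigenvalues sum to 2, the total variance of two standardised variables, and the Kaiser rule keeps only the first component. For more than two variables a numerical eigendecomposition would be needed instead.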

Unit 11 — ANOVA (one-way, RM, two-way)

Partition, F-test, Sphericity, Post-hoc

One-way ANOVA: Omnibus F-test for differences across $k \geq 3$ group means, one IV, between-subjects. Partitions $SS_{Total} = SS_{B} + SS_{W}$ .
F-ratio: $F = MSB / MSW$ . Under H₀ centres near 1; under H₁ exceeds 1. Always one-tailed.
MSB / MSW: Mean squares: SS divided by df. MSB = signal estimate; MSW = noise estimate.
Eta-squared (η²): Effect size = SS_B / SS_Total. Proportion of variance explained by the factor. Bands .01/.06/.14.
Partial η²: $SS_{effect} / (SS_{effect} + SS_{error})$ . Used in factorial / RM ANOVA to isolate one effect's contribution.
Tukey HSD: Post-hoc pairwise comparisons for equal-n one-way ANOVA. Uses the studentized range q. Controls FWER.
Bonferroni post-hoc: Run all pairwise t-tests, compare each p to α/m. Simple, conservative, good for few comparisons.
Games-Howell: Post-hoc for unequal n or unequal variances. Welch-style df adjustment.
Scheffé: Most conservative post-hoc; valid for arbitrary linear contrasts including non-pairwise.
Dunnett: Post-hoc for comparing each group to a single control. More powerful when control comparisons are the focus.
Planned contrast: Pre-specified comparison from theory or prior literature. Few in number, mild Type I cost.
Repeated-measures ANOVA: Same participants in all conditions. SS partition adds SS_Subjects; F = MS_Conditions / MS_Error. More power than between-subjects.
Sphericity: Equality of variances of pairwise differences across all condition pairs in RM-ANOVA. Tested by Mauchly's W.
Mauchly's test: Test of sphericity. H₀: sphericity holds. p < .05 → violated → apply correction.
Greenhouse-Geisser correction: Multiplies df by ε estimate to correct sphericity violation. Recommended when ε < 0.75.
Huynh-Feldt correction: Less conservative sphericity correction. Recommended when ε > 0.75.
Friedman test: Non-parametric counterpart of RM-ANOVA. Ranks within subjects across conditions.
Kruskal-Wallis: Non-parametric counterpart of one-way ANOVA. Ranks all data, compares group rank sums.
Welch's ANOVA: ANOVA variant that doesn't assume equal variances. Default in modern software.
ANCOVA: ANOVA + continuous covariate. Adjusts DV for covariate's linear effect before testing IV. Assumes equal regression slopes across groups.
Factorial ANOVA: Two or more categorical IVs. Tests main effects + interactions.
Main effect: Effect of one IV averaged over the other(s).
Interaction effect: Effect of one IV depends on the level of another. Non-parallel lines in interaction plot.
MANOVA: Multivariate ANOVA — 2+ DVs tested simultaneously. Pillai's trace / Wilks' lambda. Controls Type I across DV set.
Pillai's trace: Most robust MANOVA test statistic. Preferred when homogeneity of covariance matrices is doubtful (significant Box's M test) or group sizes are unequal.
Mixed ANOVA: Combines between-subjects and within-subjects factors. Common for pre/post intervention designs.
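
The one-way ANOVA partition $SS_{Total} = SS_{B} + SS_{W}$, the F-ratio, and η² can be computed by hand for a small example. This is a sketch with invented groups; the function name and the returned dictionary keys are illustrative.

```python
def one_way_anova(groups):
    """Partition the sum of squares and return F and eta-squared."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Between-groups: group size times squared distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-groups: squared distance of each score from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)   # MSB / MSW
    eta_squared = ss_between / (ss_between + ss_within)
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df": (df_between, df_within), "F": f_ratio, "eta_squared": eta_squared}

# Hypothetical scores in three between-subjects conditions.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

Here SS_B = 26 and SS_W = 6, so F = 13 on (2, 6) df and η² = 26/32.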

Unit 12 — Regression (Linear, Multiple)

OLS, Diagnostics, Multiple Regression

OLS (Ordinary Least Squares): Estimator that minimises $\sum (y_{i} - \hat{y}_{i})^{2}$. Closed-form solution; unbiased under regression assumptions.
Intercept (β₀): Predicted Y when all X = 0. Often not directly meaningful, but anchors the line.
Slope / coefficient (β_j): Predicted change in Y per unit change in $X_{j}$ , holding all other predictors constant.
Residual (ε_i): Difference between observed $Y_{i}$ and predicted $\hat{Y}_{i}$ . Used to compute SS_res and to check assumptions.
R² (coefficient of determination): Proportion of variance in Y explained by predictors. $1 - SS_{res} / SS_{tot}$. Never decreases when predictors are added, even useless ones.
Adjusted R²: R² penalised by the number of predictors. Can decrease when a useless predictor is added — honest for model comparison.
Model F-test: Tests whether the model as a whole beats the intercept-only null. $F = MS_{reg} / MS_{res}$ .
Coefficient t-test: Tests $H_{0} : β_{j} = 0$ via $t = \hat{β}_{j} / SE (\hat{β}_{j})$ with df = n − k − 1.
Standardised coefficient (β): Slope after z-scoring X and Y. Allows magnitude comparison across predictors on different scales. For one predictor, equals r.
LINeM assumptions: Linearity, Independence of errors, Normality of residuals, Equal variance (homoscedasticity), no Multicollinearity. Plus exogeneity ( $E [ε ∣ X] = 0$ ).
Linearity in parameters: Coefficients enter linearly even if X enters non-linearly (polynomials, logs, interactions are fine).
Homoscedasticity: Constant residual variance across fitted values. Violation = heteroscedasticity.
Heteroscedasticity: Residual variance changes with X or fitted values. Biases SEs; fix with robust HC SEs or transformations.
Exogeneity: $E [ε ∣ X] = 0$ — predictors are uncorrelated with the unobserved error. Violated by omitted confounders, reverse causality, measurement error.
Multicollinearity: Correlated predictors. Detect via VIF > 5–10. Inflates β SEs, can flip signs.
Dummy variable: 0/1 indicator for a categorical level. For k levels create k − 1 dummies; one is the reference category.
Cook's distance: Influence diagnostic — how much each observation shifts β if removed. > 1 flags influential outliers.
Leverage: How extreme a data point's X-values are. High-leverage + large residual = influential.
AIC / BIC: Information criteria for model comparison. Lower better. AIC penalises 2k; BIC penalises ln(n)·k.
Nested F-test: Compares two models where one is a subset of the other. Tests whether the extra predictors collectively add fit.
Stepwise regression: Automated forward/backward selection by AIC. Heuristic; forward and backward runs can select different models, so do not treat the result as definitive.
Simpson's paradox: Coefficient direction or magnitude flips when an additional variable is included. Sign of confounding.
General Linear Model: Umbrella framework — regression with continuous + categorical predictors. Subsumes t-tests, ANOVA, ANCOVA.
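
For a single predictor, the OLS closed-form solution, the residuals, and R² fit in a few lines. This sketch covers simple linear regression only (multiple regression would need matrix algebra); the data and names are invented for illustration.

```python
def simple_ols(x, y):
    """Closed-form slope and intercept minimising the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """1 - SS_res / SS_tot."""
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = simple_ols(x, y)
r2 = r_squared(x, y, b0, b1)
print(round(b0, 3), round(b1, 3), round(r2, 3))
```

The fitted line is ŷ = 2.2 + 0.6x with R² = 0.6, which equals the squared Pearson correlation between x and y, as the r² entry in Unit 6 states for simple regression.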

Unit 13 — Bayesian Statistics

Priors, Posteriors, Bayes Factors

Prior: Pre-data belief P(H) about a hypothesis or parameter. Quantifies what you know before observing D.
Likelihood: P(D | H) — how well hypothesis H predicts the observed data D. The model.
Posterior: Updated belief P(H | D) after observing data. Proportional to prior × likelihood.
Evidence (marginal likelihood): P(D) = $\sum_{H} P (D ∣ H) P (H)$ (or integral). Normalising constant; doesn't shape the posterior.
Bayes Factor (BF₁₀): Ratio P(D|H₁)/P(D|H₀). Continuous measure of evidence; unlike a p-value, it can support H₀ as well as H₁.
Prior odds: P(H₁)/P(H₀). Belief ratio before data.
Posterior odds: P(H₁|D)/P(H₀|D). Equals prior odds × BF₁₀.
Credible interval: Bayesian interval; parameter has X% posterior probability of being inside. Direct interpretation, unlike CI.
Conjugate prior: Prior + likelihood pairing where the posterior is in the same family as the prior. Beta–binomial is the classic example.
Beta–binomial: Conjugate pair: Beta( $α, β$ ) prior + binomial likelihood → Beta( $α + k, β + n - k$ ) posterior.
MCMC: Markov Chain Monte Carlo — algorithm to sample from posteriors when no closed form exists. Gibbs, Metropolis-Hastings, Hamiltonian.
Optional stopping: Peeking at data and stopping when significant. Fatal for frequentist Type I; legal for Bayesian BF.
Lindley's paradox: At huge n, p-values can reject H₀ while BF strongly supports it. They answer different questions.
Jeffreys scale: Convention for BF interpretation: 1–3 anecdotal, 3–10 moderate, 10–30 strong, 30–100 very strong, > 100 decisive.
BayesFactor (R package): Implements ttestBF, anovaBF, regressionBF, contingencyTableBF with default JZS (Cauchy-based) priors; the default prior scale for ttestBF is r = √2/2 ≈ 0.707.
Likelihood principle: All evidence about a parameter from data is contained in the likelihood. Bayes respects it; frequentism (sampling-distribution-based) doesn't.
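
The Beta–binomial conjugate update and the odds form of Bayes' rule are simple enough to verify by hand. The sketch below uses a flat Beta(1, 1) prior and invented data (7 successes in 10 trials); the Bayes factor of 6 is likewise a hypothetical value.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Beta(alpha, beta) prior + k successes in n trials -> Beta(alpha + k, beta + n - k)."""
    return alpha + k, beta + n - k

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def posterior_odds(prior_odds, bf10):
    """Posterior odds = prior odds x BF10."""
    return prior_odds * bf10

# Flat prior, then observe 7 successes in 10 trials (hypothetical data).
post_alpha, post_beta = beta_binomial_update(1, 1, k=7, n=10)
print(post_alpha, post_beta, round(beta_mean(post_alpha, post_beta), 3))

# Even prior odds combined with a hypothetical BF10 of 6.
print(posterior_odds(1.0, 6.0))
```

The posterior is Beta(8, 4) with mean 2/3, slightly pulled toward 0.5 relative to the raw proportion 0.7 because of the prior.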

Unit 14 — GLMs & Logistic Regression

Logistic Regression and the GLM Framework

GLM (Generalised Linear Model): Framework with three components — distribution of Y, linear predictor η = Xβ, link function g(E[Y]) = η. Encompasses OLS, logistic, Poisson, etc.
Random component: The assumed distribution of Y in a GLM (Normal, Bernoulli, Poisson, Gamma, multinomial).
Systematic component: The linear predictor η = β₀ + β₁X₁ + … + βₖXₖ. Identical in structure to OLS.
Link function (g): Maps E[Y] to η. Identity for OLS, logit for logistic, log for Poisson.
Logit function: $\log (p / (1 - p))$. Maps p ∈ (0, 1) to η ∈ (−∞, ∞). The canonical link for binomial.
Logistic function (sigmoid): $1/ (1 + e^{- η})$ . Inverse of logit. Maps η to p ∈ (0, 1) via the S-curve.
Odds: $p / (1 - p)$ . Ratio of probability of event to probability of non-event.
Log-odds (logit): Logarithm of the odds. Lives on (−∞, +∞).
Odds ratio (OR): $e^{β_{j}}$ . Multiplicative change in odds per unit increase in $X_{j}$ . Standard reporting format.
Maximum Likelihood Estimation (MLE): Estimate β by maximising the likelihood of observed data. Standard for all GLMs. Fit numerically via Newton-Raphson / IRLS.
Deviance: $-2 \ln L$ (measured relative to the saturated model). GLM analog of SS_res. Smaller = better fit. Used in likelihood-ratio tests.
Likelihood Ratio Test (LRT): Compares nested GLMs via the difference in deviance, $ΔD \sim χ_{Δk}^{2}$ under H₀, where $Δk$ is the number of extra parameters. Replaces the F-test of OLS.
AIC / BIC: Information criteria for non-nested comparison. AIC = $-2 \ln L + 2k$; BIC uses the penalty $k \ln n$ in place of $2k$. Lower better.
McFadden pseudo R²: $1 - \ln L_{model} / \ln L_{null}$. Logistic analog of R². Benchmarks differ sharply from OLS R²: 0.2 already counts as excellent.
Confusion matrix: 2×2 table of predicted vs actual class. TP, FP, TN, FN — the basis of accuracy, precision, recall, F1.
Precision: TP / (TP + FP). Of those predicted positive, how many actually are.
Recall (sensitivity): TP / (TP + FN). Of actual positives, how many are caught.
ROC curve: Sensitivity vs 1 − specificity across decision thresholds. Diagonal = chance.
AUC: Area under ROC. Probability that a random positive ranks above a random negative. Threshold-independent.
Perfect separation: A predictor / combination that perfectly classifies the outcome. MLE diverges → infinite β. Use Firth's correction.
Poisson regression: GLM for count data. Log link. $e^{β}$ = rate ratio. Assumes Var = Mean.
Multinomial logistic: GLM for unordered categorical Y > 2 levels. One logit per non-reference category vs reference.
Ordinal logistic (proportional odds): GLM for ordered Y. Cumulative logits with a single slope assumption.
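
The logit and logistic functions, the odds ratio, and the confusion-matrix metrics from this unit can be sketched directly. The coefficient and the cell counts below are hypothetical, and the function names are illustrative.

```python
from math import exp, log

def logit(p):
    """Log-odds: maps a probability in (0, 1) to the real line."""
    return log(p / (1 - p))

def sigmoid(eta):
    """Logistic function: inverse of logit."""
    return 1 / (1 + exp(-eta))

def odds_ratio(beta):
    """Multiplicative change in odds per unit increase in the predictor."""
    return exp(beta)

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

p = 0.8
eta = logit(p)                        # odds of 4 to 1, so eta = ln 4
print(round(eta, 4), round(sigmoid(eta), 4))

# Hypothetical coefficient beta = ln 2: each unit of X doubles the odds.
print(round(odds_ratio(log(2)), 4))

# Hypothetical confusion-matrix cells.
precision, recall = precision_recall(tp=40, fp=10, fn=20)
print(precision, round(recall, 3))
```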

Unit 15 — Rapid Revision & Exam Strategy

Decision Tree, Confusions, Report Checklist

Decision tree: Sequence of four questions (DV scale, IV scale, # groups, between/within) that uniquely picks a test from the BRSM toolkit.
10-point answer framework: (1) question, (2) H₀/H₁, (3) IV/DV/scales, (4) design, (5) test + justification, (6) assumptions, (7) diagnostics, (8) fallback, (9) effect size, (10) reporting sentence. Maximises partial credit.
Reporting template: Test statistic + degrees of freedom + p + effect size + 95% CI. Five slots, all required for full marks.
Effect-size benchmark: Conventional small/medium/large thresholds: d 0.2/0.5/0.8; η² .01/.06/.14; r 0.1/0.3/0.5; OR 1.5/2.5/4.
Assumption-diagnostic pairing: Each parametric test has a fixed set of assumptions and the named diagnostic for each (Shapiro-Wilk, Levene's, Mauchly's, residual plots, VIF, Cook's, Box's M, expected counts).
Interpretation trap: Wrong canonical phrasing of a statistical concept (p-value, CI, non-significance, correlation, normality). The exam routinely tests recognition.
Statistical vs practical significance: Statistical: p < α (detectable). Practical: effect size is large enough to matter. Independent dimensions.
Family-wise error rate (FWER): Probability of any false positive across m tests. $1 - (1 - α)^{m}$ for independent tests. Bonferroni controls it.
False Discovery Rate (FDR): Expected proportion of false positives among rejections. Benjamini-Hochberg controls it. Less conservative than FWER.
Five-step inference checklist: Question → test → assumptions → effect size + CI → practical interpretation.
Open-ended exam question: Scenario with research question. Answer using the 10-point framework. Assumptions and justifications carry as many marks as the test choice.
Pattern recognition: Trained ability to map a 1-sentence scenario to the right test in < 30 seconds. Drill with the example phrasings list.