Saral Shiksha Yojna

Behavioral Research: Statistical Methods

CG3.402
Vinoo Alluri · Monsoon 2025-26 · 4 credits

Definitions

Every term, every chapter.

271 terms

Unit 1 — Why Do Statistics? (Biases & Base Rates)

The Case for Statistics — Biases, Base Rates, Bayes
Belief bias
Judging an argument's validity by the believability of its conclusion, not by the logic. Evans, Barston & Pollard (1983).
Confirmation bias
Seeking confirming evidence for a hypothesis rather than evidence that could falsify it. Demonstrated by the Wason card-selection task.
Simpson's paradox
A trend appearing in groups reverses when the groups are combined (or vice versa). UC Berkeley 1973 admissions is the classic example.
Base-rate fallacy
Ignoring the prior probability (base rate) of an event when interpreting a positive test. People confuse sensitivity with PPV.
Bayes' rule
$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$. Posterior = likelihood × prior / evidence. Formal corrective to base-rate intuition.
PPV (Positive Predictive Value)
P(disease | positive test). Depends critically on prevalence — at low prevalence even sensitive tests have low PPV.
Sensitivity / Specificity
P(+ | disease) and P(− | no disease). Properties of the test, distinct from PPV.
Independent / Dependent variable
IV = what you manipulate (predictor). DV = what you measure (outcome). Modern terminology: predictor / outcome.
Between-subjects design
Different participants in different conditions. No carryover; needs more participants to achieve power.
Within-subjects design
Same participants in all conditions. More power but vulnerable to fatigue, practice, carryover effects.
Mixed design
Some factors between-subjects, others within. Common for pre/post + group designs.
Confound
A third variable related to both the predictor and outcome, creating spurious association. Threatens internal validity.
Double-blind
Neither participant nor experimenter knows the condition. Controls both experimenter bias and reactivity.
p-hacking (data mining)
Trying many analyses and reporting only the favourable one. Inflates Type I error well beyond nominal α.
HARKing
Hypothesising After Results are Known. Reporting a post-hoc finding as if it were the original hypothesis.
Publication bias
Journals preferentially publish significant findings. Negative results sit in the file drawer; the published literature overestimates effect sizes.
Replication crisis
Empirical finding (OSC 2015 and others) that a large fraction of behavioural-science findings fail to replicate. Partly driven by p-hacking and publication bias.
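
As a rough illustration of the base-rate fallacy, Bayes' rule, and PPV entries above, the sketch below computes PPV from sensitivity, specificity, and prevalence. The test characteristics (99% sensitive, 95% specific) are made-up numbers chosen for illustration, and the function name `ppv` is hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prevalence                # P(+ | disease) * P(disease)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(+ | no disease) * P(no disease)
    return true_pos / (true_pos + false_pos)

# Hypothetical test: 99% sensitive, 95% specific (illustrative values only).
for prevalence in (0.001, 0.01, 0.10, 0.50):
    print(f"prevalence={prevalence:.3f}  PPV={ppv(0.99, 0.95, prevalence):.3f}")
```

At 1% prevalence the PPV is only 1/6 even though sensitivity is 99%, which is the gap people miss when they confuse sensitivity with PPV.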

Unit 2 — Research Design & Measurement

Scales, Reliability, Validity
Operational definition
Working definition that specifies *how* to measure an abstract construct. Necessary for any empirical study.
Nominal scale
Categorical, no order. Eye colour, sex, blood type. Allowable: mode, counts, χ².
Ordinal scale
Ordered categories, intervals not equal. Race position, Likert (strictly). Allowable: median, percentiles, Spearman/Kendall.
Interval scale
Numerical, equal spacing, no true zero. °C, calendar year. Allowable: mean, SD, t, ANOVA, Pearson r. No meaningful ratios.
Ratio scale
Numerical, equal spacing, true zero. Reaction time, weight, height. All operations including ratios meaningful.
Continuous vs discrete
Orthogonal to NOIR. Whether the variable can take any value in a range or only specific values.
Reliability
Consistency / repeatability of a measurement. Four flavours: test-retest, inter-rater, parallel forms, internal consistency.
Test-retest reliability
Same measurement on same units at two times. Quantified by correlation between the two.
Inter-rater reliability
Agreement among different raters on the same items. Cohen's κ (2 raters), Fleiss κ (>2), Kendall W (ordinal), Krippendorff α (general).
Parallel forms reliability
Equivalent versions of the same measurement give similar results. Correlation of two forms.
Internal consistency
Items within a single instrument correlate. Cronbach's α, split-half, KR-20/21.
Cohen's κ
(p_o − p_e)/(1 − p_e). Inter-rater agreement above chance for nominal data. > 0.8 excellent, 0.6–0.8 substantial, 0.4–0.6 moderate, < 0.4 poor.
Cronbach's α
Internal consistency: (k/(k−1))(1 − Σσ²ᵢ/σ²_total). > 0.7 acceptable, > 0.8 good. > 0.95 may indicate redundancy.
Validity
Accuracy of a measurement w.r.t. the construct. Five flavours: internal, external, construct, face, ecological.
Internal validity
Can we attribute DV changes to the IV (no confounds)? Strengthened by random assignment, control groups, double-blind.
External validity
Do findings generalise to other people, settings, times? Strengthened by random sampling, diverse samples, replication.
Construct validity
Does the measure actually capture the construct? Established through convergent (same-construct correlation high) and discriminant (other-construct correlation low) evidence.
Face validity
Does the test superficially look like it taps the construct? Weakest type; matters more for participant buy-in and policymaker acceptance than scientific validity.
Ecological validity
Does the experimental setup resemble real-world conditions? Desirable but not strictly required — lab simplifications often generalise.
Convergent / discriminant validity
Convergent: high correlation with same-construct measures. Discriminant: low correlation with unrelated-construct measures. Both required for construct validity.
Regression to the mean
Extreme scores tend to be followed by less extreme ones. Easily mistaken for a treatment effect (Kahneman pilots example).
Confound
Third variable related to both IV and DV that could itself explain the outcome. Threatens internal validity. Random assignment is the gold-standard fix.
Double-blind
Neither participant nor experimenter knows the condition. Defeats both experimenter bias and reactivity. Standard in clinical trials.
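
The Cohen's κ formula above, (p_o − p_e)/(1 − p_e), can be sketched for two raters as follows. The ratings are invented for illustration and `cohens_kappa` is a hypothetical helper, not a library function.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on nominal labels."""
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of each rater's marginal label proportions.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in labels) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Made-up ratings of 10 items by two raters.
a = ["y", "y", "y", "y", "y", "n", "n", "n", "n", "n"]
b = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y"]
print(round(cohens_kappa(a, b), 3))
```

Here the raters agree on 8 of 10 items (p_o = 0.8) but half of that agreement is expected by chance (p_e = 0.5), so κ = 0.6.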

Unit 3 — Probability & Distributions

Probability, Distributions, and the CLT
Frequentist probability
Long-run frequency of an event in repeated sampling. Objective but counter-intuitive for one-off events.
Bayesian probability
Degree of subjective belief, updated by evidence. Intuitive for one-off events; depends on priors.
Independent events
$P(A \cap B) = P(A)\,P(B)$, equivalently $P(A \mid B) = P(A)$. Coin flips are independent; correlated measurements are not.
i.i.d.
Independent AND identically distributed. The bedrock assumption of most inferential tests.
Sample vs population
Population = the full set of interest. Sample = the subset you actually observe. Inference goes sample → population.
Sampling distribution
Distribution of a statistic across many hypothetical samples. The secret heart of inferential statistics — every test compares an observed statistic to this distribution under the null.
PDF / PMF / CDF
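Density (continuous) / mass (discrete) / cumulative $F(x) = P(X \le x)$. For continuous RVs $P(X = x) = 0$; probabilities are areas under the density, $P(a \le X \le b) = \int_a^b f(x)\,dx$.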
Bernoulli(p)
Single yes/no trial. $P(X = 1) = p$, $P(X = 0) = 1 - p$. Mean $p$, variance $p(1 - p)$.
Binomial(n, p)
Sum of $n$ i.i.d. Bernoulli(p) trials. PMF $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$. Mean $np$, variance $np(1 - p)$.
Normal $\mathcal{N}(\mu, \sigma^2)$
Bell-shaped, symmetric, two parameters. 68/95/99.7 rule. Standard Normal is $\mathcal{N}(0, 1)$.
t-distribution
Like Normal with heavier tails; one parameter (df). Use when $\sigma$ is unknown and samples are small. → Normal as df → ∞.
Chi-square $\chi^2_k$
Sum of $k$ squared independent standard Normals. Right-skewed, support $[0, \infty)$. Mean $k$, variance $2k$. Used in $\chi^2$ tests.
F-distribution
Ratio of two scaled chi-squares. Right-skewed, support $[0, \infty)$. Two df parameters. Used in ANOVA / regression.
Central Limit Theorem (CLT)
Sampling distribution of $\bar{X}$ approaches $\mathcal{N}(\mu, \sigma^2/n)$ as $n$ grows, regardless of population shape (finite variance required).
Law of Large Numbers
Sample mean → population mean as $n \to \infty$. About convergence of the point estimate.
Standard Error of the Mean (SEM)
$\mathrm{SEM} = \sigma/\sqrt{n}$ (estimated by $s/\sqrt{n}$): SD of the sampling distribution of the mean. Measures precision of $\bar{X}$ as an estimate of $\mu$.
Empirical rule (68/95/99.7)
For Normal data, ~68% within $\mu \pm 1\sigma$, ~95% within $\mu \pm 2\sigma$, ~99.7% within $\mu \pm 3\sigma$.
Sampling with vs without replacement
With replacement is pure i.i.d. Without is dependent in principle but negligibly so when population ≫ sample.
R four-letter pattern
d density / PMF, p cumulative CDF, q quantile (inverse CDF), r random sample. Works for every distribution: norm, binom, t, chisq, f, …
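
The CLT and SEM entries above can be checked with a small simulation. This is a minimal sketch that assumes an Exponential(1) population (strongly skewed, with $\mu = \sigma = 1$), a sample size of 30, and a fixed seed; all of these are illustrative choices, not part of the course material.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def sample_means(n, reps):
    """Means of `reps` independent samples of size n from Exponential(rate=1)."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

n = 30
means = sample_means(n, reps=5000)
sd_of_means = statistics.stdev(means)    # empirical SD of the sampling distribution
theoretical_sem = 1.0 / math.sqrt(n)     # sigma / sqrt(n), with sigma = 1

print(f"mean of sample means: {statistics.fmean(means):.3f}")
print(f"SD of sample means:   {sd_of_means:.3f}")
print(f"sigma / sqrt(n):      {theoretical_sem:.3f}")
```

The simulated SD of the sample means should land close to $\sigma/\sqrt{n}$, and a histogram of `means` would look roughly Normal even though the population is skewed.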

Unit 4 — Data Visualization

Plots, Matching, and Common Pitfalls
Anscombe's quartet
Four datasets sharing mean / SD / r / regression line but with wildly different scatter shapes. The slogan: statistics compress, visualisations reveal. Always plot.
Histogram
Bins a continuous variable and shows counts per bin. Reveals distribution shape; sensitive to bin width.
Boxplot (box-and-whisker)
Five-number summary: min, Q1, median, Q3, max; whiskers extend to the most extreme points within 1.5 × IQR of the quartiles; outliers beyond that shown as points. Hides bimodality.
Violin plot
Mirrored KDE density on each side. Communicates summary AND shape. Cousin of the boxplot.
Raincloud plot
Violin + boxplot + individual data points. The gold standard for behavioural data — distribution, summary, every observation in one figure.
Mosaic plot
Grid of rectangles with areas proportional to joint frequencies of categorical variables. For two-way categorical relationships.
Heat map
Grid where colour encodes value. Common for correlation matrices, time × subject data. Use viridis / cividis.
Bar chart
Length encodes value. For counts / means / proportions across discrete categories. Avoid for continuous data shapes.
Pie chart
Wedge angles encode proportions. Use sparingly — angles are perceptually weak. Limit to 3–5 categories.
Tukey outlier rule
$x < Q_1 - 1.5 \times \mathrm{IQR}$ or $x > Q_3 + 1.5 \times \mathrm{IQR}$. The boxplot whisker boundary.
IQR
$\mathrm{IQR} = Q_3 - Q_1$. Robust spread of the middle 50%.
Skew
Asymmetry of a distribution. Positive (long right tail), negative (long left tail), or symmetric (mean = median).
Bimodal distribution
Distribution with two peaks. Often indicates two subpopulations or strategies.
KDE (Kernel Density Estimate)
Smoothed estimate of a continuous distribution. The basis of violin plots.
Data-to-ink ratio (Tufte)
Fraction of ink on a chart that encodes data. Higher = better. Strip decoration.
Lie factor (Tufte)
Visual change ÷ data change. Should be ~1. Truncated axes inflate it.
Chart junk (Tufte)
Decorative elements that don't encode data — drop shadows, 3D effects, gradient backgrounds. Remove.
Colour-blind friendly palette
Palette legible to viewers with red-green deficiency (~8% of men). Examples: viridis, cividis, ColorBrewer safe schemes.
Data transformation
Functional transformation of a variable (log, sqrt, 1/x, Box-Cox) to reduce skew or stabilise variance before parametric tests. Interpret in transformed scale only.
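
The IQR and Tukey outlier rule entries above can be sketched with the standard library. The data are made up, `tukey_outliers` is a hypothetical helper, and the "inclusive" quartile method is one of several conventions, so the fences may differ slightly from other software.

```python
import statistics

def tukey_outliers(data):
    """Return points outside the Tukey fences, plus the fences themselves."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return outliers, (lower, upper)

scores = [2, 4, 4, 5, 5, 6, 6, 7, 8, 30]   # made-up data with one extreme value
outliers, fences = tukey_outliers(scores)
print("fences:", fences)
print("outliers:", outliers)
```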

Unit 5 — Descriptive Statistics

Centre, Spread, Standardisation
Mean (arithmetic average)
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Centre of mass. Uses all data; sensitive to outliers and skew.
Median
Middle value when data sorted (50th percentile). Robust to outliers and skew. Use for ordinal or skewed data.
Mode
Most-frequent value. Only meaningful central tendency for nominal data.
Range
max − min. Simplest spread; extremely sensitive to outliers.
IQR
Q₃ − Q₁. Spread of the middle 50%. Robust.
Variance
Average squared deviation from the mean. Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (Bessel). Squared units.
Standard deviation (SD)
$s = \sqrt{s^2}$. Same units as the data. The standard spread measure for parametric tests.
MAD (Median Absolute Deviation)
$\mathrm{MAD} = \mathrm{median}(|x_i - \mathrm{median}(x)|)$. Robust analog of SD. Under Normal, $\sigma \approx 1.4826 \times \mathrm{MAD}$.
z-score
$z = (x - \bar{x})/s$. Standardised value: SDs above/below the mean. Unit-less; preserves shape.
Coefficient of variation (CV)
$\mathrm{CV} = s/\bar{x}$. Unitless relative dispersion. Useful when comparing variables with different units.
Bessel's correction
Divide by $n - 1$ (not $n$) in sample variance to remove the bias from fitting $\bar{x}$ to the sample. One degree of freedom spent.
Skewness
Asymmetry of the distribution. Positive (right tail), negative (left tail). Pearson's median skewness: $3(\bar{x} - \mathrm{median})/s$.
Bimodal distribution
Distribution with two peaks. Often indicates two subpopulations. Central-tendency measures unrepresentative.
Geometric mean
$\left(\prod_{i=1}^{n} x_i\right)^{1/n}$. Appropriate for ratio-scale data and rates / growth factors. Always ≤ arithmetic mean.
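
To make the centre and spread entries above concrete, here is a small sketch using Python's `statistics` module on made-up data. It contrasts the sample variance (divide by n − 1) with the population variance (divide by n) and computes the MAD by hand.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up scores

mean = statistics.fmean(data)
median = statistics.median(data)
var_sample = statistics.variance(data)    # divides by n - 1 (Bessel's correction)
var_pop = statistics.pvariance(data)      # divides by n
sd = statistics.stdev(data)               # square root of the sample variance
mad = statistics.median([abs(x - median) for x in data])
z_scores = [(x - mean) / sd for x in data]
geo_mean = statistics.geometric_mean(data)

print(f"mean={mean}, median={median}, s^2={var_sample:.3f}, sigma^2={var_pop:.3f}")
print(f"SD={sd:.3f}, MAD={mad}, geometric mean={geo_mean:.3f}")
```

For these values the mean is 5, the sample variance is 32/7, the population variance is 4, and the MAD is 0.5.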

Unit 6 — Correlation & Reliability Quantified

Pearson, Spearman, Partial, Reliability Metrics
Pearson r
Standardised covariance $r = \mathrm{cov}(X, Y)/(s_X s_Y)$. Range $[-1, 1]$. Captures linear association. Assumes continuous, approximately Normal, no extreme outliers.
Spearman ρ
Pearson r computed on the ranks of X and Y. Captures monotone (not necessarily linear) association. Robust to outliers. Works for ordinal data.
Kendall τ
(Concordant − discordant) / total pairs. Robust ordinal association measure. Smaller magnitude than Spearman; preferred for small n with many ties.
r² (coefficient of determination)
Proportion of variance in Y shared with X (and vice versa). Range $[0, 1]$. Equals R² in simple linear regression.
Partial correlation
Correlation between X and Y after removing the linear effect of Z from both. Tests 'does X relate to Y beyond what Z explains?'
Semi-partial (part) correlation
Correlation between Y (or X) and the residual of X (or Y) after removing Z. Z is stripped from only one side. Asymmetric.
Correlation ≠ causation
Four reasons two variables can correlate: A → B, B → A, C → both, coincidence. Correlation establishes association, not causation.
Cohen's κ
$\kappa = (p_o - p_e)/(1 - p_e)$. Inter-rater agreement above chance for nominal data with two raters. Negative if worse than chance.
Fleiss' κ
Cohen's κ generalised to more than two raters on nominal data.
Kendall's W
Coefficient of concordance for multiple raters ranking items (ordinal data).
Krippendorff's α
General-purpose reliability coefficient: any number of raters, missing data, all measurement levels.
Intra-rater reliability
Same rater measuring same items at two time points. Pearson r for continuous; Cohen's κ for nominal.
Cronbach's α
$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_{\text{total}}^2}\right)$. Internal consistency of a k-item scale. > 0.70 acceptable; > 0.95 suggests redundancy.
Split-half reliability
Split items into two halves; compute the correlation between half scores. Spearman-Brown corrects for the length effect.
Kuder-Richardson 20/21
Internal consistency for binary-item tests. KR-20 for varying item difficulties; KR-21 assumes equal difficulties.
Outlier
Data point unusually distant from the rest. Caused by errors (correct/remove), processing mistakes, or natural variability (respect and investigate).
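
The difference between Pearson r (linear association) and Spearman ρ (Pearson r on ranks) can be sketched as below. The helpers `pearson_r`, `ranks`, and `spearman_rho` are hypothetical names written for this illustration, and the data (y = x²) are chosen so that the relationship is perfectly monotone but not linear.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their SDs."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Pearson r computed on the ranks of x and y."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]   # monotone but curved
print(f"Pearson r  = {pearson_r(x, y):.4f}")
print(f"Spearman ρ = {spearman_rho(x, y):.4f}")
```

Spearman ρ is exactly 1 here because the ranks agree perfectly, while Pearson r falls slightly short of 1 because the points do not lie on a straight line.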

Unit 7 — Hypothesis Testing & NHST

p-values, Errors, Power, t-tests
Theory vs hypothesis
Theory = general framework. Hypothesis = specific falsifiable prediction. Theories generate hypotheses.
Falsifiability (Popper)
A scientific hypothesis must have a possible observation that would prove it wrong. Science fails to falsify; never proves.
Null hypothesis (H₀)
The 'no effect / no difference' default. We try to reject H₀; never 'accept'.
Alternative hypothesis (H₁)
The claim the researcher believes — an effect exists.
One-tailed vs two-tailed
One-tailed: direction pre-specified; opposite treated as null. Two-tailed: any direction matters. Two-tailed is the default.
α (significance level)
Threshold p-value for rejecting H₀. Probability of Type I error. Convention: 0.05 in behavioural science.
p-value
P(data this extreme or more | H₀). NOT P(H₀ | data). The most-misinterpreted concept in statistics.
Type I error
Rejecting a true H₀. False positive. Rate = α. Chosen by the researcher.
Type II error
Failing to reject a false H₀. False negative. Rate = β. Determined by n, effect size, α, variance.
Statistical power
1 − β = P(reject H₀ | H₁ true). Probability of detecting a real effect. Convention: ≥ 0.80.
Cohen's d
Standardised mean difference: $d = (\bar{x}_1 - \bar{x}_2)/s_{\text{pooled}}$. 0.2/0.5/0.8 = small/medium/large.
Effect size
Standardised magnitude of an effect, independent of sample size. Always report alongside p.
Statistical vs practical significance
Statistical = p < α. Practical = effect is meaningful in context. Large n can make trivial effects significant.
One-sample t-test
Tests sample mean against a hypothesised value $\mu_0$. $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$, df = n − 1.
Independent (two-sample) t-test
Compares two unrelated group means. df = n₁ + n₂ − 2. Assumes equal variances (use Welch if not).
Paired t-test
Compares two related measurements on the same units. df = n − 1. More power than independent for same n.
Welch's t-test
Independent t without equal-variance assumption; adjusted df via Welch-Satterthwaite. Modern default.
Power analysis (a priori)
Compute the n needed to achieve target power (e.g., 0.80) given expected effect size, α, and test type. Done BEFORE data collection.
Optional stopping
Peeking at data during collection and stopping when p < α. Inflates actual Type I rate above nominal α.
Multiple comparisons problem
Testing many hypotheses at α = 0.05 each — family-wise Type I error climbs. Unit 8 covers corrections.
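
The Cohen's d and Welch's t-test entries above can be sketched numerically. The two groups are made-up data, the helper names are hypothetical, and the sketch stops at the test statistic and degrees of freedom because the standard library has no t-distribution CDF for the p-value.

```python
import math
import statistics

def cohens_d(a, b):
    """Mean difference divided by the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a) / len(a), statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

group_a = [5, 6, 7, 8, 9]   # made-up scores
group_b = [3, 4, 5, 6, 7]
t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}, d = {cohens_d(group_a, group_b):.3f}")
```

With equal variances and equal group sizes, as here, the Welch df coincides with the pooled value n₁ + n₂ − 2 = 8; it drops below that as the variances or sizes diverge.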

Unit 8 — Multiple Comparisons (FWER, FDR)

FWER vs FDR; Bonferroni, Holm, BH
Multiple comparisons problem
Running m independent tests at α each inflates FWER to $1 - (1 - \alpha)^m$. With m = 20, α = 0.05 → ~64% chance of any FP.
Family-Wise Error Rate (FWER)
P(at least one false positive across all m tests). 'Did I make any mistake?' Conservative.
False Discovery Rate (FDR)
E[FP/R] — expected proportion of false positives among rejections. 'How many of my claims are wrong?' Less conservative.
Bonferroni correction
$\alpha' = \alpha/m$ for each test. Controls FWER via the union bound, so it holds under any dependence between tests. Simple but conservative, especially when tests are positively correlated.
Holm's stepwise correction
Sequential FWER control. Sort p ascending; compare $p_{(i)}$ to $\alpha/(m - i + 1)$ in order and stop at the first non-rejection. Uniformly more powerful than Bonferroni.
Benjamini-Hochberg (BH)
Sequential FDR control. Sort p ascending; reject all up to the largest i with $p_{(i)} \le (i/m)\,q$, where $q$ is the target FDR.
Permutation test
Empirical null distribution from label-shuffling. Handles correlated tests naturally; standard in fMRI.
Union bound (Boole's inequality)
$P\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$. Foundation of Bonferroni; conservative when events overlap.
Garden of forking paths (Gelman)
Implicit multiple comparisons from analytic choices (covariate inclusion, outlier criteria, etc.) made post-hoc. A form of p-hacking.
Pre-registration
Locking in hypotheses, design, and analysis plan before data collection. The main antidote to multiple-comparisons abuse.
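
The three corrections above (Bonferroni, Holm, Benjamini-Hochberg) can be compared side by side in a short sketch. The p-values are invented, the function names are hypothetical, and each function returns one reject/keep flag per test in the original order.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down: compare sorted p-values to alpha/m, alpha/(m-1), ...; stop at first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order):          # rank = 0 for the smallest p
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

def benjamini_hochberg(pvals, q=0.05):
    """Step-up: reject everything up to the largest rank i with p_(i) <= (i/m) * q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject

pvals = [0.001, 0.0055, 0.012, 0.035, 0.3]   # made-up p-values
print("uncorrected FWER for 20 tests:", round(1 - (1 - 0.05) ** 20, 3))
print("Bonferroni rejections:", sum(bonferroni(pvals)))
print("Holm rejections:      ", sum(holm(pvals)))
print("BH rejections:        ", sum(benjamini_hochberg(pvals)))
```

On these values Bonferroni rejects 2 hypotheses, Holm 3, and BH 4, which matches the ordering from most to least conservative described above.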

Unit 9 — Non-parametric & Categorical Tests

Categorical & Rank-Based Tests
Non-parametric test
Test that does not assume a specific distribution for the data. Rank-based or count-based.
Chi-square goodness-of-fit
Tests whether observed category counts match an expected distribution. df = k − 1.
Chi-square test for independence
Tests whether two categorical variables are associated. df = (r − 1)(c − 1).
Phi (φ)
$\phi = \sqrt{\chi^2/n}$. Effect size for 2×2 contingency tables. Range [0, 1].
Cramér's V
$V = \sqrt{\chi^2 / \big(n \cdot \min(r - 1,\, c - 1)\big)}$. Generalisation of φ to larger tables.
Mann-Whitney U
Non-parametric counterpart of independent t. Rank-based; tests stochastic dominance between two independent groups.
Wilcoxon signed-rank
Non-parametric counterpart of paired t. Signed-rank-based; tests symmetry of differences around zero.
Kruskal-Wallis H
Non-parametric counterpart of one-way ANOVA. Rank-based across k independent groups.
Friedman test
Non-parametric counterpart of repeated-measures ANOVA. Rank within subjects across conditions.
McNemar's test
Paired binary outcome test. Compares discordant cells b and c in a 2×2. $\chi^2 = (b - c)^2/(b + c)$, df = 1.
Fisher's exact test
Exact test for 2×2 contingency with small expected counts (< 5). Uses hypergeometric distribution; no asymptotic approximation.
Binomial sign test
Simplest paired test: count signs of differences; test against Binomial$(n, 0.5)$.
Stochastic dominance
What rank-based tests actually test: 'one group tends to have larger values than another'. Not the same as means or medians.
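
The chi-square test for independence and the Cramér's V entry above can be sketched on a small contingency table. The counts are made up and `chi_square_independence` is a hypothetical helper; it returns the statistic, the degrees of freedom, and Cramér's V (which equals φ for a 2×2 table), but not a p-value.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic, df, and Cramér's V for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    r, c = len(table), len(table[0])
    df = (r - 1) * (c - 1)
    cramers_v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))
    return chi2, df, cramers_v

table = [[30, 10],
         [20, 40]]   # made-up 2x2 counts
chi2, df, v = chi_square_independence(table)
print(f"chi2 = {chi2:.3f}, df = {df}, V = {v:.3f}")
```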

Unit 10 — Multicollinearity, PCA & Factor Analysis

VIF, PCA, EFA/CFA, Scree Plot
Multicollinearity
High correlation among *predictors* (not predictor-outcome). Inflates SEs of coefficients; signs can flip.
Variance Inflation Factor (VIF)
$\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ is the R² from regressing predictor j on the others. > 5–10 is severe.
SMC (Squared Multiple Correlation)
Maximal proportion of variance in a predictor explained by the others. $\mathrm{SMC}_j = R_j^2 = 1 - 1/\mathrm{VIF}_j$.
Curse of dimensionality
Data needs grow exponentially with the number of variables. Motivates dimensionality reduction.
Factor Analysis (FA)
Latent-variable model: observed variables caused by unobserved factors + unique error. Models shared variance only.
Factor loading
$\lambda_{ij}$: correlation between variable i and factor j. > 0.4 = strong; cross-loadings should stay < 0.3.
Communality $h^2$
Sum of squared loadings of an item — proportion of variance explained by common factors.
EFA (Exploratory Factor Analysis)
Data-driven, no prior structure. Discover how many factors fit.
CFA (Confirmatory Factor Analysis)
Theory-driven, pre-specified factor structure. Test fit on independent data.
Principal Component Analysis (PCA)
Orthogonal linear combinations of variables maximising variance. No latent model; data reduction.
Eigenvalue
Variance captured by a component / factor. Sum of eigenvalues = total variance.
Scree plot
Eigenvalues vs factor #. Retain factors *above the elbow*.
Kaiser rule
Retain factors with eigenvalue > 1. Crude; over-extracts in practice.
Parallel analysis
Retain factors whose eigenvalues exceed those of random data of the same shape. Best practice.
KMO (Kaiser-Meyer-Olkin)
Sampling adequacy measure; should be > 0.6 (preferably > 0.8) for FA / PCA.
Bartlett's test of sphericity
Test that the correlation matrix is not an identity — should be significant for FA / PCA to be appropriate.
Varimax rotation
Orthogonal rotation; factors stay uncorrelated; pushes loadings toward simple structure.
Oblimin / Promax rotation
Oblique rotation; factors can correlate. Appropriate when constructs overlap in reality.
Heywood case
Factor loading ≥ 1 (impossible for correlation). Indicates misspecification or too little data.
CFA fit indices
CFI > 0.95, RMSEA < 0.06, SRMR < 0.08, χ²/df < 2-3 for good fit.
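
For the special case of two standardised predictors with correlation r, the VIF and eigenvalue entries above reduce to closed forms: $R_j^2 = r^2$, so VIF = 1/(1 − r²), and the 2×2 correlation matrix has eigenvalues 1 + |r| and 1 − |r|. The sketch below only covers that two-variable case; the function names are hypothetical.

```python
def vif_two_predictors(r):
    """VIF when the only other predictor correlates r with this one (R_j^2 = r^2)."""
    return 1.0 / (1.0 - r ** 2)

def eigenvalues_2x2_corr(r):
    """Eigenvalues of [[1, r], [r, 1]], largest first."""
    return (1 + abs(r), 1 - abs(r))

for r in (0.0, 0.5, 0.9, 0.95):
    first, second = eigenvalues_2x2_corr(r)
    print(f"r={r:.2f}  VIF={vif_two_predictors(r):6.2f}  "
          f"eigenvalues=({first:.2f}, {second:.2f})  "
          f"variance on first component={first / 2:.0%}")
```

The eigenvalues always sum to 2 (the total variance of two standardised variables), and under the Kaiser rule only the first component would be retained whenever r ≠ 0.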

Unit 11 — ANOVA (one-way, RM, two-way)

Partition, F-test, Sphericity, Post-hoc
One-way ANOVA
Omnibus F-test for differences across group means, one IV, between-subjects. Partitions $SS_{\text{Total}} = SS_B + SS_W$.
F-ratio
$F = MS_B / MS_W$. Under H₀ centres near 1; under H₁ exceeds 1. Always one-tailed.
MSB / MSW
Mean squares: SS divided by df. MSB = signal estimate; MSW = noise estimate.
Eta-squared (η²)
Effect size = SS_B / SS_Total. Proportion of variance explained by the factor. Bands .01/.06/.14.
Partial η²
$\eta_p^2 = SS_{\text{effect}}/(SS_{\text{effect}} + SS_{\text{error}})$. Used in factorial / RM ANOVA to isolate one effect's contribution.
Tukey HSD
Post-hoc pairwise comparisons for equal-n one-way ANOVA. Uses the studentized range q. Controls FWER.
Bonferroni post-hoc
Run all pairwise t-tests, compare each p to α/m. Simple, conservative, good for few comparisons.
Games-Howell
Post-hoc for unequal n or unequal variances. Welch-style df adjustment.
Scheffé
Most conservative post-hoc; valid for arbitrary linear contrasts including non-pairwise.
Dunnett
Post-hoc for comparing each group to a single control. More powerful when control comparisons are the focus.
Planned contrast
Pre-specified comparison from theory or prior literature. Few in number, mild Type I cost.
Repeated-measures ANOVA
Same participants in all conditions. SS partition adds SS_Subjects; F = MS_Between / MS_Error. More power than between-subjects.
Sphericity
Equality of variances of pairwise differences across all condition pairs in RM-ANOVA. Tested by Mauchly's W.
Mauchly's test
Test of sphericity. H₀: sphericity holds. p < .05 → violated → apply correction.
Greenhouse-Geisser correction
Multiplies df by ε estimate to correct sphericity violation. Recommended when ε < 0.75.
Huynh-Feldt correction
Less conservative sphericity correction. Recommended when ε > 0.75.
Friedman test
Non-parametric counterpart of RM-ANOVA. Ranks within subjects across conditions.
Kruskal-Wallis
Non-parametric counterpart of one-way ANOVA. Ranks all data, compares group rank sums.
Welch's ANOVA
ANOVA variant that doesn't assume equal variances. Default in modern software.
ANCOVA
ANOVA + continuous covariate. Adjusts DV for covariate's linear effect before testing IV. Assumes equal regression slopes across groups.
Factorial ANOVA
Two or more categorical IVs. Tests main effects + interactions.
Main effect
Effect of one IV averaged over the other(s).
Interaction effect
Effect of one IV depends on the level of another. Non-parallel lines in interaction plot.
MANOVA
Multivariate ANOVA — 2+ DVs tested simultaneously. Pillai's trace / Wilks' lambda. Controls Type I across DV set.
Pillai's trace
Most robust MANOVA test statistic. Preferred when homogeneity of covariance matrices is doubtful (significant Box's M test) or group sizes are unequal.
Mixed ANOVA
Combines between-subjects and within-subjects factors. Common for pre/post intervention designs.
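
The one-way ANOVA partition, F-ratio, and η² entries above can be worked through by hand on a tiny made-up dataset. `one_way_anova` is a hypothetical helper that returns the sums of squares, F, and η²; it does not compute a p-value.

```python
def one_way_anova(groups):
    """Between/within sums of squares, F-ratio, and eta-squared for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)   # MSB / MSW
    eta_squared = ss_between / ss_total
    return ss_between, ss_within, ss_total, f_ratio, eta_squared

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # made-up scores for three conditions
ss_b, ss_w, ss_t, f_ratio, eta_sq = one_way_anova(groups)
print(f"SS_B={ss_b:.1f}, SS_W={ss_w:.1f}, SS_Total={ss_t:.1f}")
print(f"F(2, 6) = {f_ratio:.2f}, eta^2 = {eta_sq:.4f}")
```

The output shows the partition directly: SS_B = 26 and SS_W = 6 add up to SS_Total = 32, giving F = 13 and η² = 0.8125.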

Unit 12 — Regression (Linear, Multiple)

OLS, Diagnostics, Multiple Regression
OLS (Ordinary Least Squares)
Estimator that minimises $\sum_i (y_i - \hat{y}_i)^2$. Closed-form solution; unbiased under regression assumptions.
Intercept (β₀)
Predicted Y when all X = 0. Often not directly meaningful, but anchors the line.
Slope / coefficient (β_j)
Predicted change in Y per unit change in $X_j$, holding all other predictors constant.
Residual (ε_i)
Difference between observed $y_i$ and predicted $\hat{y}_i$. Used to compute SS_res and to check assumptions.
R² (coefficient of determination)
Proportion of variance in Y explained by predictors. $R^2 = 1 - SS_{\text{res}}/SS_{\text{Total}}$. Never decreases when predictors are added.
Adjusted R²
R² penalised by the number of predictors. Can decrease when a useless predictor is added — honest for model comparison.
Model F-test
Tests whether the model as a whole beats the intercept-only null. $F = MS_{\text{model}}/MS_{\text{res}}$ with df = (k, n − k − 1).
Coefficient t-test
Tests $H_0: \beta_j = 0$ via $t = \hat{\beta}_j / SE(\hat{\beta}_j)$ with df = n − k − 1.
Standardised coefficient (β)
Slope after z-scoring X and Y. Allows magnitude comparison across predictors on different scales. For one predictor, equals r.
LINeM assumptions
Linearity, Independence of errors, Normality of residuals, Equal variance (homoscedasticity), no Multicollinearity. Plus exogeneity ($E[\varepsilon \mid X] = 0$).
Linearity in parameters
Coefficients enter linearly even if X enters non-linearly (polynomials, logs, interactions are fine).
Homoscedasticity
Constant residual variance across fitted values. Violation = heteroscedasticity.
Heteroscedasticity
Residual variance changes with X or fitted values. Biases SEs; fix with robust HC SEs or transformations.
Exogeneity
$E[\varepsilon \mid X] = 0$: predictors are uncorrelated with the unobserved error. Violated by omitted confounders, reverse causality, measurement error.
Multicollinearity
Correlated predictors. Detect via VIF > 5–10. Inflates β SEs, can flip signs.
Dummy variable
0/1 indicator for a categorical level. For k levels create k − 1 dummies; one is the reference category.
Cook's distance
Influence diagnostic — how much each observation shifts β if removed. > 1 flags influential outliers.
Leverage
How extreme a data point's X-values are. High-leverage + large residual = influential.
AIC / BIC
Information criteria for model comparison. Lower better. AIC penalises 2k; BIC penalises ln(n)·k.
Nested F-test
Compares two models where one is a subset of the other. Tests whether the extra predictors collectively add fit.
Stepwise regression
Automated forward/backward selection by AIC. Heuristic; can disagree across directions; do not treat as theorem.
Simpson's paradox
Coefficient direction or magnitude flips when an additional variable is included. Sign of confounding.
General Linear Model
Umbrella framework — regression with continuous + categorical predictors. Subsumes t-tests, ANOVA, ANCOVA.
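
For a single predictor, the OLS, residual, and R² entries above have a simple closed form: slope = S_xy / S_xx and intercept = ȳ − slope · x̄. The sketch below applies it to made-up data; `simple_ols` is a hypothetical helper and handles one predictor only.

```python
def simple_ols(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x, plus R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x

    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_total = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1 - ss_res / ss_total
    return intercept, slope, r_squared, residuals

x = [1, 2, 3, 4, 5]   # made-up predictor
y = [2, 4, 5, 4, 5]   # made-up outcome
b0, b1, r2, residuals = simple_ols(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}, R^2 = {r2:.2f}")
```

The residuals sum to zero by construction, and R² here (0.6) equals the squared Pearson correlation between x and y, as the r² entry in Unit 6 states for simple linear regression.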

Unit 13 — Bayesian Statistics

Priors, Posteriors, Bayes Factors
Prior
Pre-data belief P(H) about a hypothesis or parameter. Quantifies what you know before observing D.
Likelihood
P(D | H) — how well hypothesis H predicts the observed data D. The *model*.
Posterior
Updated belief P(H | D) after observing data. Proportional to prior × likelihood.
Evidence (marginal likelihood)
$P(D) = \sum_H P(D \mid H)\,P(H)$ (or the corresponding integral for continuous parameters). Normalising constant; doesn't shape the posterior.
Bayes Factor (BF₁₀)
Ratio P(D|H₁)/P(D|H₀). Continuous measure of evidence; can support either the alternative (BF₁₀ > 1) or the null (BF₁₀ < 1).
Prior odds
P(H₁)/P(H₀). Belief ratio before data.
Posterior odds
P(H₁|D)/P(H₀|D). Equals prior odds × BF₁₀.
Credible interval
Bayesian interval; parameter has X% posterior probability of being inside. Direct interpretation, unlike CI.
Conjugate prior
Prior + likelihood pairing where the posterior is in the same family as the prior. Beta–binomial is the classic example.
Beta–binomial
Conjugate pair: Beta($a$, $b$) prior + binomial likelihood ($k$ successes in $n$ trials) → Beta($a + k$, $b + n - k$) posterior.
MCMC
Markov Chain Monte Carlo — algorithm to sample from posteriors when no closed form exists. Gibbs, Metropolis-Hastings, Hamiltonian.
Optional stopping
Peeking at data and stopping when significant. Fatal for frequentist Type I; legal for Bayesian BF.
Lindley's paradox
At huge n, p-values can reject H₀ while BF strongly supports it. They answer different questions.
Jeffreys scale
Convention for BF interpretation: 1–3 anecdotal, 3–10 moderate, 10–30 strong, 30–100 very strong, > 100 decisive.
BayesFactor (R package)
Implements ttestBF, anovaBF, regressionBF, contingencyTableBF. ttestBF defaults to a JZS Cauchy prior of scale √2/2 ≈ 0.707; the other functions use their own default priors.
Likelihood principle
All evidence about a parameter from data is contained in the likelihood. Bayes respects it; frequentism (sampling-distribution-based) doesn't.
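
The Beta–binomial conjugate update above amounts to adding successes and failures to the prior's two parameters. A minimal sketch, assuming a flat Beta(1, 1) prior and made-up data of 7 successes in 10 trials; the function names are hypothetical.

```python
def beta_binomial_update(a, b, successes, trials):
    """Posterior Beta parameters after observing `successes` out of `trials`."""
    return a + successes, b + (trials - successes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (1, 1)                                  # flat prior, an illustrative assumption
posterior = beta_binomial_update(*prior, successes=7, trials=10)
print("posterior:", posterior)
print(f"prior mean = {beta_mean(*prior):.3f}, posterior mean = {beta_mean(*posterior):.3f}")
```

The posterior Beta(8, 4) has mean 2/3, pulled slightly from the observed proportion 0.7 toward the prior mean 0.5; with more data the prior's influence shrinks.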

Unit 14 — GLMs & Logistic Regression

Logistic Regression and the GLM Framework
GLM (Generalised Linear Model)
Framework with three components — distribution of Y, linear predictor η = Xβ, link function g(E[Y]) = η. Encompasses OLS, logistic, Poisson, etc.
Random component
The assumed distribution of Y in a GLM (Normal, Bernoulli, Poisson, Gamma, multinomial).
Systematic component
The linear predictor η = β₀ + β₁X₁ + … + βₖXₖ. Identical in structure to OLS.
Link function (g)
Maps E[Y] to η. Identity for OLS, logit for logistic, log for Poisson.
Logit function
$\mathrm{logit}(p) = \ln\frac{p}{1 - p}$. Maps p ∈ (0, 1) to η ∈ (−∞, ∞). The canonical link for binomial.
Logistic function (sigmoid)
$p = \frac{1}{1 + e^{-\eta}}$. Inverse of logit. Maps η to p ∈ (0, 1) via the S-curve.
Odds
$\mathrm{odds} = p/(1 - p)$. Ratio of probability of event to probability of non-event.
Log-odds (logit)
Logarithm of the odds. Lives on (−∞, +∞).
Odds ratio (OR)
$\mathrm{OR} = e^{\beta_j}$. Multiplicative change in odds per unit increase in $X_j$. Standard reporting format.
Maximum Likelihood Estimation (MLE)
Estimate β by maximising the likelihood of observed data. Standard for all GLMs. Fit numerically via Newton-Raphson / IRLS.
Deviance
$D = -2\,(\ln L_{\text{model}} - \ln L_{\text{saturated}})$. GLM analog of SS_res. Smaller = better fit. Used in likelihood-ratio tests.
Likelihood Ratio Test (LRT)
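Compares nested GLMs via $\Delta D = D_{\text{reduced}} - D_{\text{full}} \sim \chi^2_{\Delta k}$, where $\Delta k$ is the number of extra parameters. Replaces the F-test of OLS.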
AIC / BIC
Information criteria for non-nested comparison. AIC = $-2\ln L + 2k$; BIC = $-2\ln L + k\ln(n)$, a heavier penalty once n ≥ 8. Lower better.
McFadden pseudo R²
$R^2_{\text{McF}} = 1 - \ln L_{\text{model}}/\ln L_{\text{null}}$. Logistic analog of R². Bands very different: values of 0.2–0.4 already indicate excellent fit.
Confusion matrix
2×2 table of predicted vs actual class. TP, FP, TN, FN — the basis of accuracy, precision, recall, F1.
Precision
TP / (TP + FP). Of those predicted positive, how many actually are.
Recall (sensitivity)
TP / (TP + FN). Of actual positives, how many are caught.
ROC curve
Sensitivity vs 1 − specificity across decision thresholds. Diagonal = chance.
AUC
Area under ROC. Probability that a random positive ranks above a random negative. Threshold-independent.
Perfect separation
A predictor / combination that perfectly classifies the outcome. MLE diverges → infinite β. Use Firth's correction.
Poisson regression
GLM for count data. Log link. $e^{\beta_j}$ = rate ratio. Assumes Var = Mean.
Multinomial logistic
GLM for unordered categorical Y > 2 levels. One logit per non-reference category vs reference.
Ordinal logistic (proportional odds)
GLM for ordered Y. Cumulative logits with a single slope assumption.
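
The logit, logistic, odds-ratio, precision, and recall entries above can be tied together in a few lines. The coefficient value and the confusion-matrix counts below are invented for illustration only.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def logistic(eta):
    """Inverse of logit: maps a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))

# The two functions undo each other.
print(f"logit(0.8) = {logit(0.8):.3f}, logistic(logit(0.8)) = {logistic(logit(0.8)):.3f}")

# Hypothetical fitted coefficient: each unit of X multiplies the odds by exp(beta).
beta_1 = 0.7
odds_ratio = math.exp(beta_1)
print(f"odds ratio for beta = {beta_1}: {odds_ratio:.2f}")

# Made-up confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, accuracy = {accuracy:.2f}")
```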

Unit 15 — Rapid Revision & Exam Strategy

Decision Tree, Confusions, Report Checklist
Decision tree
Sequence of four questions (DV scale, IV scale, # groups, between/within) that uniquely picks a test from the BRSM toolkit.
10-point answer framework
(1) question, (2) H₀/H₁, (3) IV/DV/scales, (4) design, (5) test+justification, (6) assumptions, (7) diagnostics, (8) fallback, (9) effect size, (10) reporting sentence. Maximises partial-credit.
Reporting template
Test statistic + degrees of freedom + p + effect size + 95% CI. Five slots, all required for full marks.
Effect-size benchmark
Conventional small/medium/large thresholds: d 0.2/0.5/0.8; η² .01/.06/.14; r 0.1/0.3/0.5; OR 1.5/2.5/4.
Assumption-diagnostic pairing
Each parametric test has a fixed set of assumptions and the named diagnostic for each (Shapiro-Wilk, Levene's, Mauchly's, residual plots, VIF, Cook's, Box's M, expected counts).
Interpretation trap
Wrong canonical phrasing of a statistical concept (p-value, CI, non-significance, correlation, normality). The exam routinely tests recognition.
Statistical vs practical significance
Statistical: p < α (detectable). Practical: effect size is large enough to matter. Independent dimensions.
Family-wise error rate (FWER)
Probability of any false positive across m tests. $1 - (1 - \alpha)^m$ for independent tests. Bonferroni controls it.
False Discovery Rate (FDR)
Expected proportion of false positives among rejections. Benjamini-Hochberg controls it. Less conservative than FWER.
Five-step inference checklist
Question → test → assumptions → effect size + CI → practical interpretation.
Open-ended exam question
Scenario with research question. Answer using the 10-point framework. Assumptions and justifications carry as many marks as the test choice.
Pattern recognition
Trained ability to map a 1-sentence scenario to the right test in < 30 seconds. Drill with the example phrasings list.