Behavioral Research: Statistical Methods
CG3.402 • Vinoo Alluri • Monsoon 2025-26 • 4 credits
Definitions
Every term, every chapter. 271 terms.
Unit 1 — Why Do Statistics? (Biases & Base Rates)
The Case for Statistics — Biases, Base Rates, Bayes
- Belief bias
- Judging an argument's validity by the believability of its conclusion, not by the logic. Evans, Barston & Pollard (1983).
- Confirmation bias
- Seeking confirming evidence for a hypothesis rather than evidence that could falsify it. Demonstrated by the Wason card-selection task.
- Simpson's paradox
- A trend appearing in groups reverses when the groups are combined (or vice versa). UC Berkeley 1973 admissions is the classic example.
- Base-rate fallacy
- Ignoring the prior probability (base rate) of an event when interpreting a positive test. People confuse sensitivity with PPV.
- Bayes' rule
- $P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$. Posterior = likelihood × prior / evidence. Formal corrective to base-rate intuition.
- PPV (Positive Predictive Value)
- P(disease | positive test). Depends critically on prevalence — at low prevalence even sensitive tests have low PPV.
- Sensitivity / Specificity
- P(+ | disease) and P(− | no disease). Properties of the test, distinct from PPV.
- Independent / Dependent variable
- IV = what you manipulate (predictor). DV = what you measure (outcome). Modern terminology: predictor / outcome.
- Between-subjects design
- Different participants in different conditions. No carryover; needs more participants to achieve power.
- Within-subjects design
- Same participants in all conditions. More power but vulnerable to fatigue, practice, carryover effects.
- Mixed design
- Some factors between-subjects, others within. Common for pre/post + group designs.
- Confound
- A third variable related to both the predictor and outcome, creating spurious association. Threatens internal validity.
- Double-blind
- Neither participant nor experimenter knows the condition. Controls both experimenter bias and reactivity.
- p-hacking (data mining)
- Trying many analyses and reporting only the favourable one. Inflates Type I error well beyond nominal α.
- HARKing
- Hypothesising After Results are Known. Reporting a post-hoc finding as if it were the original hypothesis.
- Publication bias
- Journals preferentially publish significant findings. Negative results sit in the file drawer; the published literature overestimates effect sizes.
- Replication crisis
- Empirical finding (OSC 2015 and others) that a large fraction of behavioural-science findings fail to replicate. Partly driven by p-hacking and publication bias.
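The base-rate fallacy, Bayes' rule and PPV entries above can be tied together with a small calculation. The following is a minimal sketch; the test characteristics (99% sensitivity, 95% specificity) and the two prevalences are made-up illustrative numbers, and `ppv` is a hypothetical helper name.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """P(disease | positive test), computed with Bayes' rule."""
    true_pos = sensitivity * prevalence                    # P(+ | disease) * P(disease)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(+ | no disease) * P(no disease)
    return true_pos / (true_pos + false_pos)               # divide by the evidence P(+)


# Hypothetical test: 99% sensitive, 95% specific (assumed values).
ppv_rare = ppv(0.99, 0.95, prevalence=0.01)
ppv_common = ppv(0.99, 0.95, prevalence=0.30)

print(f"PPV at  1% prevalence: {ppv_rare:.3f}")
print(f"PPV at 30% prevalence: {ppv_common:.3f}")
```

With the same test, the PPV is far lower for the rare condition than for the common one, which is the point of the PPV entry: sensitivity alone says little about what a positive result means.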
Unit 2 — Research Design & Measurement
Scales, Reliability, Validity
- Operational definition
- Working definition that specifies *how* to measure an abstract construct. Necessary for any empirical study.
- Nominal scale
- Categorical, no order. Eye colour, sex, blood type. Allowable: mode, counts, χ².
- Ordinal scale
- Ordered categories, intervals not equal. Race position, Likert (strictly). Allowable: median, percentiles, Spearman/Kendall.
- Interval scale
- Numerical, equal spacing, no true zero. °C, calendar year. Allowable: mean, SD, t, ANOVA, Pearson r. No meaningful ratios.
- Ratio scale
- Numerical, equal spacing, true zero. Reaction time, weight, height. All operations including ratios meaningful.
- Continuous vs discrete
- Orthogonal to NOIR. Whether the variable can take any value in a range or only specific values.
- Reliability
- Consistency / repeatability of a measurement. Four flavours: test-retest, inter-rater, parallel forms, internal consistency.
- Test-retest reliability
- Same measurement on same units at two times. Quantified by correlation between the two.
- Inter-rater reliability
- Agreement among different raters on the same items. Cohen's κ (2 raters), Fleiss κ (>2), Kendall W (ordinal), Krippendorff α (general).
- Parallel forms reliability
- Equivalent versions of the same measurement give similar results. Correlation of two forms.
- Internal consistency
- Items within a single instrument correlate. Cronbach's α, split-half, KR-20/21.
- Cohen's κ
- (p_o − p_e)/(1 − p_e). Inter-rater agreement above chance for nominal data. > 0.8 excellent, 0.6–0.8 substantial, 0.4–0.6 moderate, < 0.4 poor.
- Cronbach's α
- Internal consistency: (k/(k−1))(1 − Σσ²ᵢ/σ²_total). > 0.7 acceptable, > 0.8 good. > 0.95 may indicate redundancy.
- Validity
- Accuracy of a measurement w.r.t. the construct. Five flavours: internal, external, construct, face, ecological.
- Internal validity
- Can we attribute DV changes to the IV (no confounds)? Strengthened by random assignment, control groups, double-blind.
- External validity
- Do findings generalise to other people, settings, times? Strengthened by random sampling, diverse samples, replication.
- Construct validity
- Does the measure actually capture the construct? Established through convergent (same-construct correlation high) and discriminant (other-construct correlation low) evidence.
- Face validity
- Does the test superficially look like it taps the construct? Weakest type; matters more for participant buy-in and policymaker acceptance than scientific validity.
- Ecological validity
- Does the experimental setup resemble real-world conditions? Desirable but not strictly required — lab simplifications often generalise.
- Convergent / discriminant validity
- Convergent: high correlation with same-construct measures. Discriminant: low correlation with unrelated-construct measures. Both required for construct validity.
- Regression to the mean
- Extreme scores tend to be followed by less extreme ones. Easily mistaken for a treatment effect (Kahneman pilots example).
- Confound
- Third variable related to both IV and DV that could itself explain the outcome. Threatens internal validity. Random assignment is the gold-standard fix.
- Double-blind
- Neither participant nor experimenter knows the condition. Defeats both experimenter bias and reactivity. Standard in clinical trials.
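The Cohen's κ and Cronbach's α formulas in this unit can be checked by hand on toy data. A minimal sketch, assuming invented ratings and item scores; the function names are hypothetical, and population variances are used throughout for α (the ratio is the same with sample variances as long as the choice is consistent).

```python
from collections import Counter
from statistics import pvariance


def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two raters on nominal labels."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n       # observed agreement
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2   # chance agreement
    return (p_o - p_e) / (1 - p_e)


def cronbach_alpha(items):
    """items: list of k item-score lists, one score per participant in each."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # total score per participant
    item_var_sum = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))


# Made-up data: two raters coding 10 items, and a 3-item scale with 5 participants.
rater_a = ["y", "y", "n", "n", "y", "n", "y", "n", "y", "y"]
rater_b = ["y", "y", "n", "y", "y", "n", "n", "n", "y", "y"]
scale = [[2, 3, 4, 4, 5], [3, 3, 4, 5, 5], [2, 4, 4, 5, 4]]

kappa = cohens_kappa(rater_a, rater_b)
alpha = cronbach_alpha(scale)
print(f"kappa = {kappa:.3f}, alpha = {alpha:.3f}")
```

Here the raters agree on 8 of 10 items, yet κ is noticeably lower than 0.8 because about half of that agreement is expected by chance.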
Unit 3 — Probability & Distributions
Probability, Distributions, and the CLT
- Frequentist probability
- Long-run frequency of an event in repeated sampling. Objective but counter-intuitive for one-off events.
- Bayesian probability
- Degree of subjective belief, updated by evidence. Intuitive for one-off events; depends on priors.
- Independent events
- $P(A \cap B) = P(A)\,P(B)$, equivalently $P(A \mid B) = P(A)$. Coin flips are independent; correlated measurements are not.
- i.i.d.
- Independent AND identically distributed. The bedrock assumption of most inferential tests.
- Sample vs population
- Population = the full set of interest. Sample = the subset you actually observe. Inference goes sample → population.
- Sampling distribution
- Distribution of a statistic across many hypothetical samples. The secret heart of inferential statistics — every test compares an observed statistic to this distribution under the null.
- PDF / PMF / CDF
- Density $f(x)$ (continuous) / mass $P(X = x)$ (discrete) / cumulative $F(x) = P(X \le x)$. For continuous RVs $P(X = x) = 0$.
- Bernoulli(p)
- Single yes/no trial. $P(X = 1) = p$, $P(X = 0) = 1 - p$. Mean $p$, variance $p(1 - p)$.
- Binomial(n, p)
- Sum of $n$ i.i.d. Bernoulli(p) trials. PMF $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$. Mean $np$, variance $np(1 - p)$.
- Normal $\mathcal{N}(\mu, \sigma^2)$
- Bell-shaped, symmetric, two parameters. 68/95/99.7 rule. Standard Normal is $\mathcal{N}(0, 1)$.
- t-distribution
- Like Normal with heavier tails; one parameter (df). Use when $\sigma$ is unknown and samples are small. → Normal as df → ∞.
- Chi-square $\chi^2_k$
- Sum of $k$ squared standard Normals. Right-skewed, $x \ge 0$. Mean $k$, variance $2k$. Used in $\chi^2$ tests.
- F-distribution
- Ratio of two scaled chi-squares. Right-skewed, $x \ge 0$. Two df parameters. Used in ANOVA / regression.
- Central Limit Theorem (CLT)
- Sampling distribution of $\bar{X}$ → $\mathcal{N}(\mu, \sigma^2/n)$ as $n$ grows, regardless of population shape (finite variance required).
- Law of Large Numbers
- Sample mean → population mean as $n \to \infty$. About convergence of the point estimate.
- Standard Error of the Mean (SEM)
- $\mathrm{SEM} = \sigma/\sqrt{n}$ (estimated by $s/\sqrt{n}$): SD of the sampling distribution of the mean. Measures precision of $\bar{X}$ as an estimate of $\mu$.
- Empirical rule (68/95/99.7)
- For Normal data, ~68% within $\mu \pm 1\sigma$, ~95% within $\mu \pm 2\sigma$, ~99.7% within $\mu \pm 3\sigma$.
- Sampling with vs without replacement
- With replacement is pure i.i.d. Without is dependent in principle but negligibly so when population ≫ sample.
- R four-letter pattern
- d = density / PMF, p = cumulative CDF, q = quantile (inverse CDF), r = random sample (e.g. dnorm, pnorm, qnorm, rnorm). Works for every distribution: norm, binom, t, chisq, f, …
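The CLT and SEM entries above can be checked by simulation. A minimal sketch, assuming an Exponential(1) population (mean 1, SD 1, strongly right-skewed) chosen purely for illustration; `sample_means` is a hypothetical helper and the seed is fixed only for reproducibility.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so repeated runs give the same numbers


def sample_means(n, reps=5000):
    """Means of `reps` independent samples of size n from Exponential(rate=1)."""
    return [mean([random.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]


for n in (5, 30, 100):
    means = sample_means(n)
    # The SD of the simulated means should sit close to sigma / sqrt(n) = 1 / sqrt(n).
    print(f"n={n:3d}  mean of means={mean(means):.3f}  "
          f"SD of means={stdev(means):.3f}  sigma/sqrt(n)={1 / n ** 0.5:.3f}")
```

Plotting a histogram of `means` for each n would also show the skew shrinking as n grows, which is the shape half of the CLT.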
Unit 4 — Data Visualization
Plots, Matching, and Common Pitfalls
- Anscombe's quartet
- Four datasets sharing mean / SD / r / regression line but with wildly different scatter shapes. The slogan: statistics compress, visualisations reveal. Always plot.
- Histogram
- Bins a continuous variable and shows counts per bin. Reveals distribution shape; sensitive to bin width.
- Boxplot (box-and-whisker)
- Five-number summary: min, Q1, median, Q3, max. Whiskers reach the most extreme points within 1.5 × IQR of the quartiles; points beyond are drawn as outliers. Hides bimodality.
- Violin plot
- Mirrored KDE density on each side. Communicates summary AND shape. Cousin of the boxplot.
- Raincloud plot
- Violin + boxplot + individual data points. The gold standard for behavioural data — distribution, summary, every observation in one figure.
- Mosaic plot
- Grid of rectangles with areas proportional to joint frequencies of categorical variables. For two-way categorical relationships.
- Heat map
- Grid where colour encodes value. Common for correlation matrices, time × subject data. Use viridis / cividis.
- Bar chart
- Length encodes value. For counts / means / proportions across discrete categories. Avoid for continuous data shapes.
- Pie chart
- Wedge angles encode proportions. Use sparingly — angles are perceptually weak. Limit to 3–5 categories.
- Tukey outlier rule
- $x < Q_1 - 1.5 \times \mathrm{IQR}$ or $x > Q_3 + 1.5 \times \mathrm{IQR}$. The boxplot whisker boundary.
- IQR
- $\mathrm{IQR} = Q_3 - Q_1$. Robust spread of the middle 50%.
- Skew
- Asymmetry of a distribution. Positive (long right tail), negative (long left tail), or symmetric (mean = median).
- Bimodal distribution
- Distribution with two peaks. Often indicates two subpopulations or strategies.
- KDE (Kernel Density Estimate)
- Smoothed estimate of a continuous distribution. The basis of violin plots.
- Data-to-ink ratio (Tufte)
- Fraction of ink on a chart that encodes data. Higher = better. Strip decoration.
- Lie factor (Tufte)
- Visual change ÷ data change. Should be ~1. Truncated axes inflate it.
- Chart junk (Tufte)
- Decorative elements that don't encode data — drop shadows, 3D effects, gradient backgrounds. Remove.
- Colour-blind friendly palette
- Palette legible to viewers with red-green deficiency (~8% of men). Examples: viridis, cividis, ColorBrewer safe schemes.
- Data transformation
- Functional transformation of a variable (log, sqrt, 1/x, Box-Cox) to reduce skew or stabilise variance before parametric tests. Interpret in transformed scale only.
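The Tukey outlier rule and IQR entries above translate directly into a few lines. A minimal sketch, assuming made-up reaction times in milliseconds; `tukey_outliers` is a hypothetical name, and the "inclusive" quartile method is only one of several conventions, so fences can differ slightly between software packages.

```python
from statistics import quantiles


def tukey_outliers(data):
    """Return (outliers, (lower_fence, upper_fence)) using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper], (lower, upper)


# Hypothetical reaction times (ms) with one slow trial.
rts = [310, 325, 330, 342, 350, 355, 361, 370, 384, 900]
outliers, fences = tukey_outliers(rts)
print("fences:", fences)
print("outliers:", outliers)
```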
Unit 5 — Descriptive Statistics
Centre, Spread, Standardisation
- Mean (arithmetic average)
- $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Centre of mass. Uses all data; sensitive to outliers and skew.
- Median
- Middle value when data sorted (50th percentile). Robust to outliers and skew. Use for ordinal or skewed data.
- Mode
- Most-frequent value. Only meaningful central tendency for nominal data.
- Range
- max − min. Simplest spread; extremely sensitive to outliers.
- IQR
- Q₃ − Q₁. Spread of the middle 50%. Robust.
- Variance
- Average squared deviation from the mean. Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (Bessel). Squared units.
- Standard deviation (SD)
- $s = \sqrt{s^2}$. Same units as the data. The standard spread measure for parametric tests.
- MAD (Median Absolute Deviation)
- $\mathrm{MAD} = \mathrm{median}(|x_i - \mathrm{median}(x)|)$. Robust analog of SD. Under Normal, $\sigma \approx 1.4826 \times \mathrm{MAD}$.
- z-score
- $z = (x - \bar{x})/s$. Standardised value: SDs above/below the mean. Unit-less; preserves shape.
- Coefficient of variation (CV)
- $\mathrm{CV} = s/\bar{x}$. Unitless relative dispersion. Useful when comparing variables with different units.
- Bessel's correction
- Divide by $n - 1$ (not $n$) in sample variance to remove the bias from fitting $\bar{x}$ to the sample. One degree of freedom spent.
- Skewness
- Asymmetry of the distribution. Positive (right tail), negative (left tail). Pearson's median skewness: $3(\bar{x} - \mathrm{median})/s$.
- Bimodal distribution
- Distribution with two peaks. Often indicates two subpopulations. Central-tendency measures unrepresentative.
- Geometric mean
- $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$. Appropriate for ratio-scale data and rates / growth factors. Always ≤ arithmetic mean (for positive data).
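The centre, spread and standardisation measures in this unit can be compared side by side on one small dataset. A minimal sketch, assuming an invented sample with a single large value so the robust and non-robust measures visibly disagree.

```python
from statistics import geometric_mean, mean, median, stdev

# Made-up positive scores with one extreme value (25.0).
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 25.0]

m = mean(data)                                   # pulled upwards by the extreme value
med = median(data)                               # robust centre
s = stdev(data)                                  # sample SD (divides by n - 1, Bessel)
mad = median([abs(x - med) for x in data])       # median absolute deviation
cv = s / m                                       # coefficient of variation
z = [(x - m) / s for x in data]                  # z-scores: mean 0, SD 1
gm = geometric_mean(data)                        # never exceeds the arithmetic mean

print(f"mean={m:.2f} median={med:.2f} SD={s:.2f} MAD={mad:.2f}")
print(f"CV={cv:.2f} geometric mean={gm:.2f} largest z={max(z):.2f}")
```

The single value of 25 drags the mean to 8 while the median stays at 5, and the SD is several times the MAD: the same contrast the Mean, Median and MAD entries describe.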
Unit 6 — Correlation & Reliability Quantified
Pearson, Spearman, Partial, Reliability Metrics
- Pearson r
- Standardised covariance $r = \mathrm{cov}(X, Y)/(s_X s_Y)$. Range $[-1, 1]$. Captures linear association. Assumes continuous, approximately Normal, no extreme outliers.
- Spearman ρ
- Pearson r computed on the ranks of X and Y. Captures monotone (not necessarily linear) association. Robust to outliers. Works for ordinal data.
- Kendall τ
- (Concordant − discordant) / total pairs. Robust ordinal association measure. Smaller magnitude than Spearman; preferred for small n with many ties.
- r² (coefficient of determination)
- Proportion of variance in Y shared with X (and vice versa). Range $[0, 1]$. Equals R² in simple linear regression.
- Partial correlation
- Correlation between X and Y after removing the linear effect of Z from both. Tests 'does X relate to Y beyond what Z explains?'
- Semi-partial (part) correlation
- Correlation between Y (or X) and the residual of X (or Y) after removing Z. Z is stripped from only one side. Asymmetric.
- Correlation ≠ causation
- Four reasons two variables can correlate: A → B, B → A, C → both, coincidence. Correlation establishes association, not causation.
- Cohen's κ
- $\kappa = (p_o - p_e)/(1 - p_e)$. Inter-rater agreement above chance for nominal data with two raters. Negative if worse than chance.
- Fleiss' κ
- Cohen's κ generalised to more than two raters on nominal data.
- Kendall's W
- Coefficient of concordance for multiple raters ranking items (ordinal data).
- Krippendorff's α
- General-purpose reliability coefficient: any number of raters, missing data, all measurement levels.
- Intra-rater reliability
- Same rater measuring same items at two time points. Pearson r for continuous; Cohen's κ for nominal.
- Cronbach's α
- $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_{\text{total}}^2}\right)$. Internal consistency of a k-item scale. > 0.70 acceptable; > 0.95 suggests redundancy.
- Split-half reliability
- Split items into two halves; compute the correlation between half scores. Spearman-Brown corrects for the length effect.
- Kuder-Richardson 20/21
- Internal consistency for binary-item tests. KR-20 for varying item difficulties; KR-21 assumes equal difficulties.
- Outlier
- Data point unusually distant from the rest. Caused by errors (correct/remove), processing mistakes, or natural variability (respect and investigate).
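The Pearson r and Spearman ρ entries above differ only in whether raw values or ranks are correlated. A minimal sketch, assuming a toy monotone but non-linear relationship (y = x³); the helper names are hypothetical.

```python
from statistics import mean


def pearson_r(x, y):
    """Standardised covariance of x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend the run of tied values
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Pearson r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]   # monotone, clearly non-linear

r, rho = pearson_r(x, y), spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the relationship is perfectly monotone, ρ reaches 1 while r stays below 1; squaring r gives the shared-variance figure from the r² entry.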
Unit 7 — Hypothesis Testing & NHST
p-values, Errors, Power, t-tests
- Theory vs hypothesis
- Theory = general framework. Hypothesis = specific falsifiable prediction. Theories generate hypotheses.
- Falsifiability (Popper)
- A scientific hypothesis must have a possible observation that would prove it wrong. Science fails to falsify; never proves.
- Null hypothesis (H₀)
- The 'no effect / no difference' default. We try to reject H₀; never 'accept'.
- Alternative hypothesis (H₁)
- The claim the researcher believes — an effect exists.
- One-tailed vs two-tailed
- One-tailed: direction pre-specified; opposite treated as null. Two-tailed: any direction matters. Two-tailed is the default.
- α (significance level)
- Threshold p-value for rejecting H₀. Probability of Type I error. Convention: 0.05 in behavioural science.
- p-value
- P(data this extreme or more | H₀). NOT P(H₀ | data). The most-misinterpreted concept in statistics.
- Type I error
- Rejecting a true H₀. False positive. Rate = α. Chosen by the researcher.
- Type II error
- Failing to reject a false H₀. False negative. Rate = β. Determined by n, effect size, α, variance.
- Statistical power
- 1 − β = P(reject H₀ | H₁ true). Probability of detecting a real effect. Convention: ≥ 0.80.
- Cohen's d
- Standardised mean difference: $d = (\bar{x}_1 - \bar{x}_2)/s_{\text{pooled}}$. 0.2/0.5/0.8 = small/medium/large.
- Effect size
- Standardised magnitude of an effect, independent of sample size. Always report alongside p.
- Statistical vs practical significance
- Statistical = p < α. Practical = effect is meaningful in context. Large n can make trivial effects significant.
- One-sample t-test
- Tests sample mean against a hypothesised value $\mu_0$. $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$, df = n − 1.
- Independent (two-sample) t-test
- Compares two unrelated group means. df = n₁ + n₂ − 2. Assumes equal variances (use Welch if not).
- Paired t-test
- Compares two related measurements on the same units. df = n − 1. More power than independent for same n.
- Welch's t-test
- Independent t without equal-variance assumption; adjusted df via Welch-Satterthwaite. Modern default.
- Power analysis (a priori)
- Compute the n needed to achieve target power (e.g., 0.80) given expected effect size, α, and test type. Done BEFORE data collection.
- Optional stopping
- Peeking at data during collection and stopping when p < α. Inflates actual Type I rate above nominal α.
- Multiple comparisons problem
- Testing many hypotheses at α = 0.05 each — family-wise Type I error climbs. Unit 8 covers corrections.
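The Welch's t-test and Cohen's d entries above can be worked through on two tiny groups. A minimal sketch, assuming made-up scores; it returns only the t statistic and the Welch-Satterthwaite df, since turning those into a p-value needs the t-distribution CDF, which the Python standard library does not provide. The function names are hypothetical.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)   # squared SEs of each mean
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Standardised mean difference using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Hypothetical scores for two independent groups.
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]

t, df = welch_t(treatment, control)
d = cohens_d(treatment, control)
print(f"t = {t:.2f}, df = {df:.1f}, d = {d:.2f}")
```

With equal group sizes and equal variances, as here, the Welch df coincides with the classic n₁ + n₂ − 2; the two diverge once variances or sample sizes differ.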
Unit 8 — Multiple Comparisons (FWER, FDR)
FWER vs FDR; Bonferroni, Holm, BH
- Multiple comparisons problem
- Running m independent tests at α each inflates FWER to $1 - (1 - \alpha)^m$. With m = 20, α = 0.05 → ~64% chance of any FP.
- Family-Wise Error Rate (FWER)
- P(at least one false positive across all m tests). 'Did I make any mistake?' Conservative.
- False Discovery Rate (FDR)
- E[FP/R] — expected proportion of false positives among rejections. 'How many of my claims are wrong?' Less conservative.
- Bonferroni correction
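- Use $\alpha' = \alpha/m$ for each test. Controls FWER via the union bound, so it is valid under any dependence between tests. Simple but conservative, especially when tests are positively correlated.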
- Holm's stepwise correction
- Sequential FWER control. Sort p ascending; compare $p_{(i)}$ to $\alpha/(m - i + 1)$ in order, stopping at the first non-rejection. Uniformly more powerful than Bonferroni.
- Benjamini-Hochberg (BH)
- Sequential FDR control. Sort p ascending; reject all up to the largest i with $p_{(i)} \le \frac{i}{m}\alpha$.
- Permutation test
- Empirical null distribution from label-shuffling. Handles correlated tests naturally; standard in fMRI.
- Union bound (Boole's inequality)
- $P\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$. Foundation of Bonferroni; conservative when events overlap.
- Garden of forking paths (Gelman)
- Implicit multiple comparisons from analytic choices (covariate inclusion, outlier criteria, etc.) made post-hoc. Forms of p-hacking.
- Pre-registration
- Locking in hypotheses, design, and analysis plan before data collection. The main antidote to multiple-comparisons abuse.
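The three corrections in this unit can be compared on one set of p-values. A minimal sketch, assuming five invented p-values and α = 0.05; each function returns a reject / do-not-reject flag per test in the original order, and the function names are hypothetical.

```python
def fwer_independent(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni(pvals, alpha=0.05):
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):           # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                              # step-down: stop at the first failure
    return reject


def benjamini_hochberg(pvals, alpha=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff = rank                      # remember the largest rank that passes
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff
    return reject


pvals = [0.001, 0.011, 0.020, 0.039, 0.20]     # made-up p-values
print("FWER, 20 tests:", round(fwer_independent(0.05, 20), 3))
print("Bonferroni:", bonferroni(pvals))
print("Holm:      ", holm(pvals))
print("BH:        ", benjamini_hochberg(pvals))
```

On these values Bonferroni rejects one hypothesis, Holm two and BH four, matching the ordering of conservativeness described in the entries above.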
Unit 9 — Non-parametric & Categorical Tests
Categorical & Rank-Based Tests
- Non-parametric test
- Test that does not assume a specific distribution for the data. Rank-based or count-based.
- Chi-square goodness-of-fit
- Tests whether observed category counts match an expected distribution. df = k − 1.
- Chi-square test for independence
- Tests whether two categorical variables are associated. df = (r − 1)(c − 1).
- Phi (φ)
- $\phi = \sqrt{\chi^2/N}$. Effect size for 2×2 contingency tables. Range [0, 1].
- Cramér's V
- $V = \sqrt{\chi^2/(N \cdot \min(r - 1, c - 1))}$. Generalisation of φ to larger tables.
- Mann-Whitney U
- Non-parametric counterpart of independent t. Rank-based; tests stochastic dominance between two independent groups.
- Wilcoxon signed-rank
- Non-parametric counterpart of paired t. Signed-rank-based; tests symmetry of differences around zero.
- Kruskal-Wallis H
- Non-parametric counterpart of one-way ANOVA. Rank-based across k independent groups.
- Friedman test
- Non-parametric counterpart of repeated-measures ANOVA. Rank within subjects across conditions.
- McNemar's test
- Paired binary outcome test. Compares discordant cells b and c in a 2×2. $\chi^2 = (b - c)^2/(b + c)$, df = 1.
- Fisher's exact test
- Exact test for 2×2 contingency with small expected counts (< 5). Uses hypergeometric distribution; no asymptotic approximation.
- Binomial sign test
- Simplest paired test: count signs of differences; test against $p = 0.5$ (Binomial($n$, 0.5)).
- Stochastic dominance
- What rank-based tests actually test: 'one group tends to have larger values than another'. Not the same as means or medians.
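The chi-square test for independence, φ and McNemar's test above all reduce to short arithmetic on a 2×2 table. A minimal sketch, assuming made-up counts and no continuity correction; for df = 1 the upper-tail p-value is available in the standard library as `erfc(sqrt(chi2 / 2))`, because a χ²₁ variable is a squared standard Normal. The function names are hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square, p-value (df = 1) and phi for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals, col_totals = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    p = math.erfc(math.sqrt(chi2 / 2))     # exact upper tail of chi-square with df = 1
    phi = math.sqrt(chi2 / n)
    return chi2, p, phi


def mcnemar(b, c):
    """McNemar chi-square and p-value from the two discordant cell counts."""
    chi2 = (b - c) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: group (rows) by outcome (columns).
chi2, p, phi = chi_square_2x2([[30, 10], [20, 40]])
mc_chi2, mc_p = mcnemar(b=15, c=5)
print(f"independence: chi2 = {chi2:.2f}, p = {p:.5f}, phi = {phi:.2f}")
print(f"McNemar:      chi2 = {mc_chi2:.2f}, p = {mc_p:.4f}")
```

All expected counts here are at least 20; with expected counts below 5 the Fisher's exact entry applies instead.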
Unit 10 — Multicollinearity, PCA & Factor Analysis
VIF, PCA, EFA/CFA, Scree Plot
- Multicollinearity
- High correlation among *predictors* (not predictor-outcome). Inflates SEs of coefficients; signs can flip.
- Variance Inflation Factor (VIF)
- $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ is the R² from regressing predictor j on the others. > 5–10 is severe.
- SMC (Squared Multiple Correlation)
- Maximal proportion of variance in a predictor explained by the others. $\mathrm{SMC}_j = R_j^2 = 1 - 1/\mathrm{VIF}_j$.
- Curse of dimensionality
- Data needs grow exponentially with the number of variables. Motivates dimensionality reduction.
- Factor Analysis (FA)
- Latent-variable model: observed variables caused by unobserved factors + unique error. Models shared variance only.
- Factor loading
- $\lambda_{ij}$: correlation between variable i and factor j. > 0.4 = strong; cross-loadings < 0.3.
- Communality $h^2$
- Sum of squared loadings of an item — proportion of variance explained by common factors.
- EFA (Exploratory Factor Analysis)
- Data-driven, no prior structure. Discover how many factors fit.
- CFA (Confirmatory Factor Analysis)
- Theory-driven, pre-specified factor structure. Test fit on independent data.
- Principal Component Analysis (PCA)
- Orthogonal linear combinations of variables maximising variance. No latent model; data reduction.
- Eigenvalue
- Variance captured by a component / factor. Sum of eigenvalues = total variance.
- Scree plot
- Eigenvalues vs factor #. Retain factors *above the elbow*.
- Kaiser rule
- Retain factors with eigenvalue > 1. Crude; over-extracts in practice.
- Parallel analysis
- Retain factors whose eigenvalues exceed those of random data of the same shape. Best practice.
- KMO (Kaiser-Meyer-Olkin)
- Sampling adequacy measure; should be > 0.6 (preferably > 0.8) for FA / PCA.
- Bartlett's test of sphericity
- Test that the correlation matrix is not an identity — should be significant for FA / PCA to be appropriate.
- Varimax rotation
- Orthogonal rotation; factors stay uncorrelated; pushes loadings towards simple structure (each item loading mainly on one factor).
- Oblimin / Promax rotation
- Oblique rotation; factors can correlate. Appropriate when constructs overlap in reality.
- Heywood case
- Communality ≥ 1, i.e. zero or negative unique variance (a standardised loading of magnitude ≥ 1). Indicates misspecification or too little data.
- CFA fit indices
- CFI > 0.95, RMSEA < 0.06, SRMR < 0.08, χ²/df < 2-3 for good fit.
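For the special case of exactly two standardised variables, the VIF and eigenvalue entries above have closed forms: $R_j^2$ is just $r^2$, and the 2×2 correlation matrix has eigenvalues $1 + |r|$ and $1 - |r|$. A minimal sketch under that two-variable assumption; the function names are hypothetical and the correlations are illustrative.

```python
def vif_two_predictors(r):
    """VIF when there are only two predictors correlated at r (R_j^2 = r^2)."""
    return 1.0 / (1.0 - r ** 2)


def eigenvalues_2x2_corr(r):
    """Eigenvalues of the correlation matrix [[1, r], [r, 1]], largest first."""
    return 1 + abs(r), 1 - abs(r)


for r in (0.3, 0.7, 0.9):
    first, second = eigenvalues_2x2_corr(r)
    print(f"r={r}: VIF={vif_two_predictors(r):.2f}, "
          f"eigenvalues=({first:.2f}, {second:.2f}), "
          f"share of variance on first component={first / 2:.0%}")
```

The eigenvalues always sum to 2, the number of variables, and only the first exceeds 1, so the Kaiser rule would retain a single component whenever the two variables are correlated at all.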
Unit 11 — ANOVA (one-way, RM, two-way)
Partition, F-test, Sphericity, Post-hoc
- One-way ANOVA
- Omnibus F-test for differences across group means, one IV, between-subjects. Partitions $SS_{\text{Total}} = SS_B + SS_W$.
- F-ratio
- $F = MS_B / MS_W$. Under H₀ centres near 1; under H₁ exceeds 1. Always one-tailed.
- MSB / MSW
- Mean squares: SS divided by df. MSB = signal estimate; MSW = noise estimate.
- Eta-squared (η²)
- Effect size = SS_B / SS_Total. Proportion of variance explained by the factor. Bands .01/.06/.14.
- Partial η²
- $\eta_p^2 = SS_{\text{effect}}/(SS_{\text{effect}} + SS_{\text{error}})$. Used in factorial / RM ANOVA to isolate one effect's contribution.
- Tukey HSD
- Post-hoc pairwise comparisons for equal-n one-way ANOVA. Uses the studentized range q. Controls FWER.
- Bonferroni post-hoc
- Run all pairwise t-tests, compare each p to α/m. Simple, conservative, good for few comparisons.
- Games-Howell
- Post-hoc for unequal n or unequal variances. Welch-style df adjustment.
- Scheffé
- Most conservative post-hoc; valid for arbitrary linear contrasts including non-pairwise.
- Dunnett
- Post-hoc for comparing each group to a single control. More powerful when control comparisons are the focus.
- Planned contrast
- Pre-specified comparison from theory or prior literature. Few in number, mild Type I cost.
- Repeated-measures ANOVA
- Same participants in all conditions. SS partition adds SS_Subjects; F = MS_Between / MS_Error. More power than between-subjects.
- Sphericity
- Equality of variances of pairwise differences across all condition pairs in RM-ANOVA. Tested by Mauchly's W.
- Mauchly's test
- Test of sphericity. H₀: sphericity holds. p < .05 → violated → apply correction.
- Greenhouse-Geisser correction
- Multiplies df by ε estimate to correct sphericity violation. Recommended when ε < 0.75.
- Huynh-Feldt correction
- Less conservative sphericity correction. Recommended when ε > 0.75.
- Friedman test
- Non-parametric counterpart of RM-ANOVA. Ranks within subjects across conditions.
- Kruskal-Wallis
- Non-parametric counterpart of one-way ANOVA. Ranks all data, compares group rank sums.
- Welch's ANOVA
- ANOVA variant that doesn't assume equal variances. Default in modern software.
- ANCOVA
- ANOVA + continuous covariate. Adjusts DV for covariate's linear effect before testing IV. Assumes equal regression slopes across groups.
- Factorial ANOVA
- Two or more categorical IVs. Tests main effects + interactions.
- Main effect
- Effect of one IV averaged over the other(s).
- Interaction effect
- Effect of one IV depends on the level of another. Non-parallel lines in interaction plot.
- MANOVA
- Multivariate ANOVA — 2+ DVs tested simultaneously. Pillai's trace / Wilks' lambda. Controls Type I across DV set.
- Pillai's trace
- Most robust MANOVA test statistic. Preferred when homogeneity of covariance matrices (Box's M test) is violated or group sizes are unequal.
- Mixed ANOVA
- Combines between-subjects and within-subjects factors. Common for pre/post intervention designs.
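The one-way ANOVA partition, the F-ratio and η² from this unit fit in one short function. A minimal sketch, assuming three made-up groups of three scores each; `one_way_anova` is a hypothetical name and no p-value is computed, because the F-distribution CDF is not in the Python standard library.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-groups SS: group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups SS: scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)   # MSB / MSW
    eta_squared = ss_between / (ss_between + ss_within)
    return f_ratio, df_between, df_within, eta_squared


groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]   # hypothetical scores
f_ratio, df_b, df_w, eta_sq = one_way_anova(groups)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}, eta squared = {eta_sq:.2f}")
```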
Unit 12 — Regression (Linear, Multiple)
OLS, Diagnostics, Multiple Regression
- OLS (Ordinary Least Squares)
- Estimator that minimises $\sum_i (y_i - \hat{y}_i)^2$. Closed-form solution; unbiased under regression assumptions.
- Intercept (β₀)
- Predicted Y when all X = 0. Often not directly meaningful, but anchors the line.
- Slope / coefficient (β_j)
- Predicted change in Y per unit change in $X_j$, holding all other predictors constant.
- Residual (ε_i)
- Difference between observed $y_i$ and predicted $\hat{y}_i$. Used to compute SS_res and to check assumptions.
- R² (coefficient of determination)
- Proportion of variance in Y explained by predictors. $R^2 = 1 - SS_{\text{res}}/SS_{\text{tot}}$. Never decreases when predictors are added.
- Adjusted R²
- R² penalised by the number of predictors. Can decrease when a useless predictor is added — honest for model comparison.
- Model F-test
- Tests whether the model as a whole beats the intercept-only null. $F = MS_{\text{reg}}/MS_{\text{res}}$ with df $(k, n - k - 1)$.
- Coefficient t-test
- Tests $H_0: \beta_j = 0$ via $t = \hat{\beta}_j/SE(\hat{\beta}_j)$ with df = n − k − 1.
- Standardised coefficient (β)
- Slope after z-scoring X and Y. Allows magnitude comparison across predictors on different scales. For one predictor, equals r.
- LINeM assumptions
- Linearity, Independence of errors, Normality of residuals, Equal variance (homoscedasticity), no Multicollinearity. Plus exogeneity ($E[\varepsilon \mid X] = 0$).
- Linearity in parameters
- Coefficients enter linearly even if X enters non-linearly (polynomials, logs, interactions are fine).
- Homoscedasticity
- Constant residual variance across fitted values. Violation = heteroscedasticity.
- Heteroscedasticity
- Residual variance changes with X or fitted values. Biases SEs; fix with robust HC SEs or transformations.
- Exogeneity
- $E[\varepsilon \mid X] = 0$: predictors are uncorrelated with the unobserved error. Violated by omitted confounders, reverse causality, measurement error.
- Multicollinearity
- Correlated predictors. Detect via VIF > 5–10. Inflates β SEs, can flip signs.
- Dummy variable
- 0/1 indicator for a categorical level. For k levels create k − 1 dummies; one is the reference category.
- Cook's distance
- Influence diagnostic — how much each observation shifts β if removed. > 1 flags influential outliers.
- Leverage
- How extreme a data point's X-values are. High-leverage + large residual = influential.
- AIC / BIC
- Information criteria for model comparison. Lower better. AIC penalises 2k; BIC penalises ln(n)·k.
- Nested F-test
- Compares two models where one is a subset of the other. Tests whether the extra predictors collectively add fit.
- Stepwise regression
- Automated forward/backward selection by AIC. Heuristic; can disagree across directions; do not treat as theorem.
- Simpson's paradox
- Coefficient direction or magnitude flips when an additional variable is included. Sign of confounding.
- General Linear Model
- Umbrella framework — regression with continuous + categorical predictors. Subsumes t-tests, ANOVA, ANCOVA.
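For a single predictor, the OLS, residual and R² entries above have a closed form that is easy to verify by hand. A minimal sketch, assuming five invented (x, y) pairs; `simple_ols` is a hypothetical name, and multiple regression would need matrix algebra instead.

```python
from statistics import mean


def simple_ols(x, y):
    """Closed-form OLS for one predictor: returns (b0, b1, r_squared, residuals)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx                 # slope: change in y per unit change in x
    b0 = my - b1 * mx              # intercept: predicted y at x = 0
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((b - my) ** 2 for b in y)
    return b0, b1, 1 - ss_res / ss_tot, residuals


x = [1, 2, 3, 4, 5]                # made-up predictor values
y = [2, 4, 5, 4, 5]                # made-up outcomes
b0, b1, r_squared, residuals = simple_ols(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x, R squared = {r_squared:.2f}")
print("residuals:", [round(e, 2) for e in residuals])
```

The residuals sum to zero by construction, and plotting them against the fitted values is the usual first check for the linearity and homoscedasticity assumptions.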
Unit 13 — Bayesian Statistics
Priors, Posteriors, Bayes Factors
- Prior
- Pre-data belief P(H) about a hypothesis or parameter. Quantifies what you know before observing D.
- Likelihood
- P(D | H) — how well hypothesis H predicts the observed data D. The *model*.
- Posterior
- Updated belief P(H | D) after observing data. Proportional to prior × likelihood.
- Evidence (marginal likelihood)
- $P(D) = \sum_H P(D \mid H)\,P(H)$ (or integral). Normalising constant; doesn't shape the posterior.
- Bayes Factor (BF₁₀)
- Ratio P(D|H₁)/P(D|H₀). Continuous evidence; can support either the alternative (BF₁₀ > 1) or the null (BF₁₀ < 1).
- Prior odds
- P(H₁)/P(H₀). Belief ratio before data.
- Posterior odds
- P(H₁|D)/P(H₀|D). Equals prior odds × BF₁₀.
- Credible interval
- Bayesian interval; parameter has X% posterior probability of being inside. Direct interpretation, unlike CI.
- Conjugate prior
- Prior + likelihood pairing where the posterior is in the same family as the prior. Beta–binomial is the classic example.
- Beta–binomial
- Conjugate pair: Beta($a$, $b$) prior + binomial likelihood ($k$ successes in $n$ trials) → Beta($a + k$, $b + n - k$) posterior.
- MCMC
- Markov Chain Monte Carlo — algorithm to sample from posteriors when no closed form exists. Gibbs, Metropolis-Hastings, Hamiltonian.
- Optional stopping
- Peeking at data and stopping when significant. Fatal for frequentist Type I; legal for Bayesian BF.
- Lindley's paradox
- At huge n, p-values can reject H₀ while BF strongly supports it. They answer different questions.
- Jeffreys scale
- Convention for BF interpretation: 1–3 anecdotal, 3–10 moderate, 10–30 strong, 30–100 very strong, > 100 decisive.
- BayesFactor (R package)
- Implements ttestBF, anovaBF, regressionBF, contingencyTableBF with default JZS Cauchy prior of width 0.707.
- Likelihood principle
- All evidence about a parameter from data is contained in the likelihood. Bayes respects it; frequentism (sampling-distribution-based) doesn't.
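The Beta–binomial update, the Bayes factor and the posterior-odds identity above can be combined in one small example. A minimal sketch, assuming 7 successes in 10 trials and a flat Beta(1, 1) prior; H₀ is taken as p = 0.5 and H₁ as p ~ Beta(1, 1), under which the marginal likelihood of any k out of n is 1/(n + 1). The function names are hypothetical.

```python
import math


def beta_binomial_update(a, b, successes, trials):
    """Beta(a, b) prior + binomial data -> Beta(a + k, b + n - k) posterior."""
    return a + successes, b + trials - successes


def bf10_uniform_vs_half(successes, trials):
    """BF10 for H1: p ~ Beta(1, 1) against H0: p = 0.5 (an assumed pairing)."""
    marginal_h1 = 1 / (trials + 1)                            # binomial PMF integrated over p
    likelihood_h0 = math.comb(trials, successes) * 0.5 ** trials
    return marginal_h1 / likelihood_h0


k, n = 7, 10                                                  # made-up data
post_a, post_b = beta_binomial_update(1, 1, k, n)
posterior_mean = post_a / (post_a + post_b)
bf10 = bf10_uniform_vs_half(k, n)
posterior_odds = 1.0 * bf10                                   # prior odds of 1 times BF10

print(f"posterior: Beta({post_a}, {post_b}), mean = {posterior_mean:.3f}")
print(f"BF10 = {bf10:.3f}, posterior odds = {posterior_odds:.3f}")
```

BF₁₀ comes out slightly below 1, so 7 out of 10 is, if anything, weak evidence for the null under this prior: a small-scale instance of a Bayes factor supporting H₀.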
Unit 14 — GLMs & Logistic Regression
Logistic Regression and the GLM Framework
- GLM (Generalised Linear Model)
- Framework with three components — distribution of Y, linear predictor η = Xβ, link function g(E[Y]) = η. Encompasses OLS, logistic, Poisson, etc.
- Random component
- The assumed distribution of Y in a GLM (Normal, Bernoulli, Poisson, Gamma, multinomial).
- Systematic component
- The linear predictor η = β₀ + β₁X₁ + … + βₖXₖ. Identical in structure to OLS.
- Link function (g)
- Maps E[Y] to η. Identity for OLS, logit for logistic, log for Poisson.
- Logit function
- $\mathrm{logit}(p) = \ln\frac{p}{1 - p}$. Maps p ∈ (0, 1) to η ∈ (−∞, ∞). The canonical link for binomial.
- Logistic function (sigmoid)
- $p = \frac{1}{1 + e^{-\eta}}$. Inverse of logit. Maps η to p ∈ (0, 1) via the S-curve.
- Odds
- $\mathrm{odds} = p/(1 - p)$. Ratio of probability of event to probability of non-event.
- Log-odds (logit)
- Logarithm of the odds. Lives on (−∞, +∞).
- Odds ratio (OR)
- $\mathrm{OR} = e^{\beta_j}$. Multiplicative change in odds per unit increase in $X_j$. Standard reporting format.
- Maximum Likelihood Estimation (MLE)
- Estimate β by maximising the likelihood of observed data. Standard for all GLMs. Fit numerically via Newton-Raphson / IRLS.
- Deviance
- $D = -2\,(\ln L_{\text{model}} - \ln L_{\text{saturated}})$. GLM analog of SS_res. Smaller = better fit. Used in likelihood-ratio tests.
- Likelihood Ratio Test (LRT)
- Compares nested GLMs via $\Delta D = D_{\text{reduced}} - D_{\text{full}} \sim \chi^2_{\Delta k}$ under H₀. Replaces the F-test of OLS.
- AIC / BIC
- Information criteria for non-nested comparison. AIC = $-2\ln L + 2k$; BIC = $-2\ln L + k\ln(n)$. Lower better.
- McFadden pseudo R²
- $R^2_{\text{McF}} = 1 - \ln L_{\text{model}}/\ln L_{\text{null}}$. Logistic analog of R². Bands very different: 0.2 already counts as excellent.
- Confusion matrix
- 2×2 table of predicted vs actual class. TP, FP, TN, FN — the basis of accuracy, precision, recall, F1.
- Precision
- TP / (TP + FP). Of those predicted positive, how many actually are.
- Recall (sensitivity)
- TP / (TP + FN). Of actual positives, how many are caught.
- ROC curve
- Sensitivity vs 1 − specificity across decision thresholds. Diagonal = chance.
- AUC
- Area under ROC. Probability that a random positive ranks above a random negative. Threshold-independent.
- Perfect separation
- A predictor / combination that perfectly classifies the outcome. MLE diverges → infinite β. Use Firth's correction.
- Poisson regression
- GLM for count data. Log link. $e^{\beta}$ = rate ratio. Assumes Var = Mean.
- Multinomial logistic
- GLM for an unordered categorical Y with more than 2 levels. One logit per non-reference category vs reference.
- Ordinal logistic (proportional odds)
- GLM for ordered Y. Cumulative logits with a single slope assumption.
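The logit, logistic, odds-ratio and confusion-matrix entries above can be exercised without fitting a model. A minimal sketch, assuming hypothetical coefficients (β₀ = −1.0, β₁ = 0.8) and made-up confusion-matrix counts; nothing here is estimated from data.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def sigmoid(eta):
    """Inverse of logit: maps the linear predictor to a probability."""
    return 1 / (1 + math.exp(-eta))


def odds(p):
    return p / (1 - p)


def classification_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, accuracy, f1


b0, b1 = -1.0, 0.8                             # hypothetical logistic coefficients
p_at_1 = sigmoid(b0 + b1 * 1)
p_at_2 = sigmoid(b0 + b1 * 2)
odds_ratio = odds(p_at_2) / odds(p_at_1)       # equals exp(b1) for a one-unit step

print(f"P(Y=1 | x=1) = {p_at_1:.3f}, P(Y=1 | x=2) = {p_at_2:.3f}")
print(f"odds ratio = {odds_ratio:.3f}, exp(b1) = {math.exp(b1):.3f}")

metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)    # made-up counts
print("precision, recall, specificity, accuracy, F1:",
      [round(v, 3) for v in metrics])
```

The probability change from x = 1 to x = 2 depends on where you start on the S-curve, but the odds ratio is the same for every one-unit step, which is why it is the standard reporting format.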
Unit 15 — Rapid Revision & Exam Strategy
Decision Tree, Confusions, Report Checklist
- Decision tree
- Sequence of four questions (DV scale, IV scale, # groups, between/within) that uniquely picks a test from the BRSM toolkit.
- 10-point answer framework
- (1) question, (2) H₀/H₁, (3) IV/DV/scales, (4) design, (5) test+justification, (6) assumptions, (7) diagnostics, (8) fallback, (9) effect size, (10) reporting sentence. Maximises partial-credit.
- Reporting template
- Test statistic + degrees of freedom + p + effect size + 95% CI. Five slots, all required for full marks.
- Effect-size benchmark
- Conventional small/medium/large thresholds: d 0.2/0.5/0.8; η² .01/.06/.14; r 0.1/0.3/0.5; OR 1.5/2.5/4.
- Assumption-diagnostic pairing
- Each parametric test has a fixed set of assumptions and the named diagnostic for each (Shapiro-Wilk, Levene's, Mauchly's, residual plots, VIF, Cook's, Box's M, expected counts).
- Interpretation trap
- Wrong canonical phrasing of a statistical concept (p-value, CI, non-significance, correlation, normality). The exam routinely tests recognition.
- Statistical vs practical significance
- Statistical: p < α (detectable). Practical: effect size is large enough to matter. Independent dimensions.
- Family-wise error rate (FWER)
- Probability of any false positive across m tests. $1 - (1 - \alpha)^m$ for independent tests. Bonferroni controls it.
- False Discovery Rate (FDR)
- Expected proportion of false positives among rejections. Benjamini-Hochberg controls it. Less conservative than FWER.
- Five-step inference checklist
- Question → test → assumptions → effect size + CI → practical interpretation.
- Open-ended exam question
- Scenario with research question. Answer using the 10-point framework. Assumptions and justifications carry as many marks as the test choice.
- Pattern recognition
- Trained ability to map a 1-sentence scenario to the right test in < 30 seconds. Drill with the example phrasings list.