
Behavioral Research: Statistical Methods (CG3.402)

Vinoo Alluri · Monsoon 2025-26 · 4 credits
Unit 15 — Rapid Revision & Exam Strategy

Decision Tree, Confusions, Report Checklist


Intuition

Two weeks before the exam, Maya doesn't need to relearn 14 sessions of content. She needs to retrieve. The exam is mostly *applied*: a scenario, a research question, and 'pick the right test, justify it, name your assumptions, write the report sentence.' If you can run the decision tree ('DV scale? IV scale? #groups? independent or paired?') in under 30 seconds, you've won half the paper. The other half is interpretation traps — p-values, confidence intervals, correlation vs causation, effect size vs significance. Memorise the correct phrasings; never give the wrong one. This unit is the *map* — the whole course on one page.

Explanation

The master decision tree (memorise). Ask four questions in order: (1) How many DVs? One → ANOVA family; multiple → MANOVA. (2) What scale is the DV? Categorical/nominal → χ² family or logistic. Ordinal / non-normal continuous → rank-based (Mann-Whitney, Wilcoxon, Kruskal-Wallis, Friedman, Spearman). Continuous parametric → t / ANOVA / regression. (3) How many groups / conditions? 1 vs 2 vs 3+. (4) Between-subjects or within-subjects (paired)?
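The four questions can be drilled as a lookup table. A minimal Python sketch — the function name and encoding are mine, not the course's:

```python
# Illustrative sketch of the BRSM decision tree as a lookup table.
# 'choose_test' and its argument names are invented for this example.

def choose_test(dv_scale, k_groups, paired):
    """Map (DV scale, number of groups, paired?) to a test name.

    dv_scale: 'parametric', 'ordinal', or 'categorical'
    k_groups: 2 or '3+'
    paired:   True for within-subjects, False for between-subjects
    """
    table = {
        ('parametric', 2, False): 'independent t',
        ('parametric', 2, True): 'paired t',
        ('parametric', '3+', False): 'one-way ANOVA',
        ('parametric', '3+', True): 'RM-ANOVA',
        ('ordinal', 2, False): 'Mann-Whitney U',
        ('ordinal', 2, True): 'Wilcoxon signed-rank',
        ('ordinal', '3+', False): 'Kruskal-Wallis',
        ('ordinal', '3+', True): "Friedman's",
        ('categorical', 2, False): 'chi-square independence',
        ('categorical', 2, True): 'McNemar',
    }
    return table[(dv_scale, k_groups, paired)]

print(choose_test('parametric', '3+', True))   # RM-ANOVA
print(choose_test('ordinal', 2, False))        # Mann-Whitney U
```

Drilling with this table until the mapping is automatic is exactly the under-30-seconds skill the intuition section asks for.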

Path A — Categorical DV. *Goodness-of-fit:* observed vs expected distribution → χ² GoF (df = k−1). *Independence:* two categorical variables → χ² independence (df = (r−1)(c−1); effect size φ for 2×2, Cramér's V for larger). *Paired binary:* McNemar / Binomial Sign test. *Three+ categorical variables:* log-linear analysis. *Binary outcome from predictors:* logistic regression.

Path B — Ordinal or non-normal continuous DV. *2 independent groups:* Mann-Whitney U. *2 paired:* Wilcoxon signed-rank. *3+ independent:* Kruskal-Wallis. *3+ repeated:* Friedman's. *Correlation with outliers/non-normal:* Spearman's ρ.

Path C — Continuous DV, parametric assumptions met. *1 sample vs known μ:* one-sample t. *2 independent:* independent t (Welch if unequal variances). *2 paired:* paired t. *3+ between:* one-way ANOVA. *3+ within:* RM-ANOVA (check sphericity → Greenhouse-Geisser if violated). *With covariate:* ANCOVA. *2+ IVs all between:* factorial ANOVA. *Mixed between/within:* mixed ANOVA. *2+ DVs:* MANOVA. *Continuous predictors:* linear regression. *Two continuous variables:* Pearson r (= r² = R² for simple regression).
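Path C maps directly onto `scipy.stats` calls. A sketch on simulated data — the group sizes and distribution parameters below are arbitrary, chosen only for illustration:

```python
# Path C tests in scipy.stats; the data is simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(50, 10, 30)   # group 1 scores
g2 = rng.normal(55, 15, 30)   # group 2 scores, deliberately unequal variance
g3 = rng.normal(60, 10, 30)

# 2 independent groups, unequal variances -> Welch's t
t_w, p_w = stats.ttest_ind(g1, g2, equal_var=False)

# 2 paired measurements -> paired t
t_p, p_p = stats.ttest_rel(g1, g2)

# 3+ independent groups -> one-way ANOVA
F, p_f = stats.f_oneway(g1, g2, g3)

# two continuous variables -> Pearson r (r**2 equals R**2 in simple regression)
r, p_r = stats.pearsonr(g1, g2)
print(round(r**2, 3))  # proportion of shared variance
```

Note how `equal_var=False` is the one-flag switch from the pooled-variance t to Welch's t, mirroring the fallback rule in the path above.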

The condensed lookup grid. Memorise this 3×4 table:

| DV type | 2 indep | 2 paired | 3+ indep | 3+ repeated |
|---|---|---|---|---|
| Parametric | Indep t | Paired t | One-way ANOVA | RM-ANOVA |
| Ordinal / non-normal | Mann-Whitney U | Wilcoxon signed-rank | Kruskal-Wallis | Friedman's |
| Categorical | χ² | McNemar | χ² extended | (rare) |

Assumption × diagnostic pairings (memorise). t/ANOVA/regression normality → Shapiro-Wilk, Q-Q. t/ANOVA homogeneity of variance → Levene's. RM-ANOVA sphericity → Mauchly's. Regression linearity → residual-vs-fitted. Regression homoscedasticity → residual plot, ncvTest. Regression multicollinearity → VIF (> 5–10 problematic). Regression outlier influence → Cook's distance (> 1). MANOVA covariance homogeneity → Box's M. χ² expected counts → minimum E ≥ 5 per cell.
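Several of these diagnostics are one-liners in `scipy.stats`. A sketch on simulated data — the `vif` helper is hand-rolled for illustration, not a library function:

```python
# Assumption diagnostics sketch; thresholds follow the pairings above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1, g2 = rng.normal(0, 1, 40), rng.normal(0.5, 1, 40)

# Normality -> Shapiro-Wilk (p < .05 suggests non-normality)
W, p_norm = stats.shapiro(g1)

# Homogeneity of variance -> Levene's test
Lstat, p_var = stats.levene(g1, g2)

def vif(xj, others):
    """VIF_j = 1 / (1 - R^2_j): regress predictor xj on the other predictors."""
    X = np.column_stack([np.ones(len(xj))] + list(others))
    beta, *_ = np.linalg.lstsq(X, xj, rcond=None)
    resid = xj - X @ beta
    r2 = 1 - resid.var() / xj.var()
    return 1 / (1 - r2)

x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)  # deliberately collinear
print(vif(x1, [x2]) > 5)  # True: collinearity past the 5-10 danger zone
```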

Interpretation traps — wrong vs right. *p-value:* wrong: '3% chance H₀ is true.' right: 'assuming H₀ true, 3% chance of data this extreme.' *CI:* wrong: '95% chance the parameter is in this interval.' right: '95% of such intervals across hypothetical repetitions would contain the parameter.' *Non-significant p:* wrong: 'no effect.' right: 'insufficient evidence to reject H₀.' *Correlation:* wrong: 'A causes B.' right: 'A and B are associated — causation needs more.' *Big n + tiny p:* wrong: 'huge effect.' right: 'statistically detectable but possibly practically trivial; check effect size.'
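The CI phrasing can be checked by simulation: across repeated hypothetical samples, roughly 95% of the intervals contain the one fixed true mean. A sketch with invented parameters:

```python
# Coverage simulation for the frequentist CI interpretation.
# mu, sigma, n, reps are arbitrary illustration values.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 100.0, 15.0, 50, 2000
hits = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    sem = x.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    lo, hi = x.mean() - 1.96 * sem, x.mean() + 1.96 * sem
    hits += (lo <= mu <= hi)                  # did this interval capture mu?
coverage = hits / reps
print(round(coverage, 3))  # close to 0.95
```

The probability statement attaches to the *procedure* across repetitions, not to any single computed interval — which is exactly the wrong-vs-right distinction above.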

Effect-size benchmarks (every test has one). Cohen's d: 0.2 / 0.5 / 0.8 small/medium/large. η² (and partial η²): 0.01 / 0.06 / 0.14. Pearson r (and standardised β): 0.1 / 0.3 / 0.5. Odds ratio: 1.5 / 2.5 / 4 (rough). Cramér's V: 0.1 / 0.3 / 0.5 (depending on df). Always report alongside the test statistic.
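A minimal sketch of computing and labelling Cohen's d against these benchmarks — the data and helper names are illustrative:

```python
# Cohen's d with pooled SD, labelled by the 0.2/0.5/0.8 conventions.
import numpy as np

def cohens_d(a, b):
    """Cohen's d = (mean_a - mean_b) / pooled SD."""
    na, nb = len(a), len(b)
    sp = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
                 / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / sp

def label(d):
    d = abs(d)
    if d >= 0.8:
        return 'large'
    if d >= 0.5:
        return 'medium'
    return 'small' if d >= 0.2 else 'negligible'

a = np.array([85, 90, 78, 92, 88], float)  # invented scores
b = np.array([80, 82, 75, 85, 79], float)
d = cohens_d(a, b)
print(label(d))  # large
```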

The reporting template. Every test result needs 5 numbers: test statistic, degrees of freedom, p-value, effect size, 95% CI (where applicable). Sample sentence: 'A paired t-test revealed a significant decrease in anxiety after intervention, t(df) = X, p = Y, Cohen's d = Z (95% CI [0.45, 1.08]).' Without effect size, p is half the story.

The 'write your assumptions' question pattern. Slides explicitly say the exam emphasises this. For every test you propose, write: (1) research question, (2) H₀ and H₁ explicitly, (3) IV/DV/scale, (4) between/within, (5) chosen test + justification, (6) assumptions of the test, (7) how to check each, (8) fallback test if assumptions fail, (9) effect size, (10) reporting sentence. All ten earn marks. Even with wrong numbers, the framework scores partial credit.

Formulas to recall under pressure. z = (x − μ)/σ. SEM = s/√n. 95% CI = x̄ ± 1.96 × SEM. Cohen's d = (x̄₁ − x̄₂)/s_pooled. Pythagorean ANOVA: SS_total = SS_between + SS_within. F = MS_between/MS_within. t = (x̄₁ − x̄₂)/SE_diff. χ² = Σ(O − E)²/E with E = (row total × column total)/n, df = (r−1)(c−1). FWER = 1 − (1−α)^m for m independent tests. Bonferroni α′ = α/m. VIF_j = 1/(1 − R²_j). Logit: ln(p/(1−p)) = β₀ + β₁x. OR = e^β. Bayes: P(H|D) = P(D|H)P(H)/P(D).
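A few of these formulas spot-checked numerically — every value below is invented for illustration:

```python
# Numeric spot-checks of SEM/CI, OR, and Bayes' rule; values are illustrative.
import math

# SEM and 95% CI for a sample with mean 100, s = 15, n = 36
s, n, mean = 15.0, 36, 100.0
sem = s / math.sqrt(n)                        # 15/6 = 2.5
ci = (mean - 1.96 * sem, mean + 1.96 * sem)   # (95.1, 104.9)

# Odds ratio: a logit coefficient beta exponentiates to OR = e**beta
beta = math.log(1.5)
odds_ratio = math.exp(beta)                   # recovers 1.5

# Bayes' rule: P(H|D) = P(D|H) P(H) / P(D), with P(D) by total probability
p_d_given_h, p_h = 0.9, 0.1
p_d = p_d_given_h * p_h + 0.2 * (1 - p_h)     # 0.09 + 0.18 = 0.27
p_h_given_d = p_d_given_h * p_h / p_d         # 0.09 / 0.27 = 1/3
print(round(sem, 2), round(p_h_given_d, 2))
```

The Bayes line doubles as a base-rate-fallacy reminder: even with a 90% hit rate, a 10% prior drags the posterior down to one third.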

Pattern recognition — common exam phrasings. 'Are men taller than women?' → 2 groups, continuous DV, indep → independent t (or Mann-Whitney if skewed). 'Same patients pre/post intervention?' → paired t (or Wilcoxon). 'Three schools' exam scores' → one-way ANOVA (or Kruskal-Wallis). 'Same people across 3 caffeine doses' → RM-ANOVA, check sphericity. 'Gender × political affiliation' → χ² independence. 'M&M colours vs advertised' → χ² goodness-of-fit. 'Education predicts income' → Pearson + simple regression. 'Education predicts income controlling for IQ' → multiple regression. 'Click 'buy' yes/no from condition' → logistic regression. 'Therapy reducing depression accounting for baseline' → ANCOVA. 'Treatment effectiveness depending on age' → factorial ANOVA (interaction). 'Same therapy improving depression *and* anxiety' → MANOVA. 'Session type (between) × week (within)' → mixed ANOVA.

Worked open-ended response example. Scenario: drug vs placebo on word recall, with age and education as covariates. *Steps:* (1) Question: does drug improve recall after adjusting for covariates? (2) H₀: drug has no effect on recall, controlling for age and education. (3) IV: drug condition (nominal binary, between). DV: words recalled (ratio, continuous). Covariates: age, education. (4) Between-subjects with covariates. (5) ANCOVA. (6) Assumptions: normality of DV per group, homogeneity of variance, independence, linearity of covariate-DV, homogeneity of regression slopes, covariate measured pre-IV, no severe multicollinearity. (7) Check via Shapiro-Wilk, Levene's, scatter, interaction terms, VIF. (8) If violated: Quade test or robust ANCOVA. (9) Effect: partial η² for drug. (10) Report: 'ANCOVA showed the drug group recalled more words than placebo after controlling for age and education, F(df₁, df₂) = X, p = Y, partial η² = Z.'

Common confusions to drill. *PCA vs FA* — PCA reduces dimensions (no error term; eigendecomposition of correlation matrix). FA models latent constructs (assumes communality + uniqueness). *FWER vs FDR* — FWER = P(any false positive); FDR = expected proportion of false positives. FWER more conservative. *Reliability vs validity* — reliability = consistency; validity = accuracy. *Type I vs Type II* — α (false positive, false alarm) vs β (false negative, miss); power = 1 − β. *One-tailed vs two-tailed* — pre-specified direction vs both directions. Post-hoc switch = p-hacking. *Independent vs paired t* — different vs same participants. *Confidence vs credible interval* — frequentist procedure vs Bayesian posterior probability. *Population vs sample / parameter vs statistic* — μ vs x̄, σ vs s. *Standard deviation vs standard error* — spread of data vs spread of the sampling distribution.

Exam-day rules of thumb. (1) Skim the whole paper first. (2) Easy descriptive questions first — quick marks. (3) Structure open-ended answers around the 10-point framework above. (4) Show your work on calculations — partial credit for method. (5) Define key terms in your own words. (6) State assumptions for every test you propose — slides explicitly emphasise this. (7) If stuck on test choice, fall back on the decision tree. (8) Watch the clock — don't spend 20 minutes on 5-mark questions. (9) For MCQ/short-answer, the four pitfalls (p-misinterpretation, CI-misinterpretation, statistical-vs-practical, correlation-vs-causation) appear constantly — recognise and dodge.

The 30-concept micro-glossary. Operational definition, IV/DV, NOIR scales, reliability (test-retest, inter-rater, parallel-forms, internal consistency), validity (internal, external, construct, face, ecological), confound, random sampling vs random assignment, Simpson's paradox, base-rate fallacy, Bayes' rule, CLT, SEM, sampling distribution, CI, H₀, p-value, Type I/II, power, effect size, statistical vs practical significance, falsifiability, multiple comparisons, FWER/FDR, Bonferroni/Holm/BH, sphericity, multicollinearity (VIF), interaction effect, Bayes Factor — these are the labels you'll see attached to questions.

Mental triage during the exam. If a scenario seems unfamiliar, ask in this order: (1) Is this a *test selection* question or an *interpretation* question? (2) If selection: run the decision tree, name the test, justify with assumptions. (3) If interpretation: am I dealing with a p-value, CI, effect size, or test statistic? Match to the right phrasing. (4) If calculation: identify formula, plug in, show work. (5) If open-ended: use the 10-point framework. Almost every BRSM exam question maps to one of these five buckets.

Definitions

  • Decision tree — Sequence of four questions (DV scale, IV scale, # groups, between/within) that uniquely picks a test from the BRSM toolkit.
  • 10-point answer framework — (1) question, (2) H₀/H₁, (3) IV/DV/scales, (4) design, (5) test + justification, (6) assumptions, (7) diagnostics, (8) fallback, (9) effect size, (10) reporting sentence. Maximises partial credit.
  • Reporting template — Test statistic + degrees of freedom + p + effect size + 95% CI. Five slots, all required for full marks.
  • Effect-size benchmark — Conventional small/medium/large thresholds: d 0.2/0.5/0.8; η² .01/.06/.14; r 0.1/0.3/0.5; OR 1.5/2.5/4.
  • Assumption-diagnostic pairing — Each parametric test has a fixed set of assumptions and the named diagnostic for each (Shapiro-Wilk, Levene's, Mauchly's, residual plots, VIF, Cook's, Box's M, expected counts).
  • Interpretation trap — Wrong canonical phrasing of a statistical concept (p-value, CI, non-significance, correlation, normality). The exam routinely tests recognition.
  • Statistical vs practical significance — Statistical: p < α (detectable). Practical: effect size is large enough to matter. Independent dimensions.
  • Family-wise error rate (FWER) — Probability of any false positive across m tests. FWER = 1 − (1−α)^m for independent tests. Bonferroni controls it.
  • False Discovery Rate (FDR) — Expected proportion of false positives among rejections. Benjamini-Hochberg controls it. Less conservative than FWER.
  • Five-step inference checklist — Question → test → assumptions → effect size + CI → practical interpretation.
  • Open-ended exam question — Scenario with research question. Answer using the 10-point framework. Assumptions and justifications carry as many marks as the test choice.
  • Pattern recognition — Trained ability to map a 1-sentence scenario to the right test in < 30 seconds. Drill with the example phrasings list.

Formulas

Derivations

Why FWER blows up with m tests. Under all-true-nulls and independence, P(no false positive in m tests) = (1 − α)^m. Therefore FWER = 1 − (1 − α)^m. For α = .05 and m = 20, this is 1 − 0.95²⁰ ≈ 0.64 — a 64% chance of at least one false positive. Bonferroni reverses it: set per-test α to α/m to bound family-wise error at roughly α.
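The derivation checks out numerically:

```python
# FWER under independence, and the Bonferroni-corrected family-wise rate.
alpha, m = 0.05, 20

fwer = 1 - (1 - alpha) ** m
print(round(fwer, 3))  # 0.642 -- ~64% chance of at least one false positive

# Bonferroni: per-test alpha/m keeps the family-wise rate just under alpha
bonferroni_fwer = 1 - (1 - alpha / m) ** m
print(round(bonferroni_fwer, 3))  # 0.049 -- bounded below 0.05
```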

Decision-tree justification. Each test is identified by its compatibility with: DV scale (limits which math is meaningful), IV scale (categorical→groups; continuous→regression), # groups (drives 2-group vs k-group choice), independence structure (between → independent test; within → paired/repeated/sphericity-aware). Mapping these four properties to the test grid is the *act of statistical reasoning* the exam wants you to demonstrate.

Examples

  • Decision example 1. 3 anxiety treatments, 60 participants randomly assigned, anxiety scores (continuous). Between, k = 3, continuous DV, normal residuals expected → one-way ANOVA. Significant F → Tukey HSD post-hoc. Effect size η².
  • Decision example 2. Same 30 people rated mood at 4 timepoints. Within, k = 4, continuous DV → repeated-measures ANOVA. Check sphericity (Mauchly's); if violated → Greenhouse-Geisser correction. Effect size partial η².
  • Decision example 3. 200 patients, smoking (yes/no) × lung cancer (yes/no). Two categorical variables → χ² independence on 2×2 table. df = 1. Effect size = φ.
  • Decision example 4. Predict whether 400 applicants are admitted (yes/no) from GRE, GPA, undergrad rank. Binary outcome, continuous + categorical predictors → logistic regression. Report exp(β) as odds ratios.
  • Decision example 5. Reaction time across 3 caffeine doses for the same 20 people, data skewed. Within, k = 3, non-normal → Friedman's test. Effect size: Kendall's W.
  • Decision example 6. Drug vs placebo on word recall, with age as covariate. Between, continuous DV, one IV + covariate → ANCOVA. Effect size partial η².
  • Decision example 7. Therapy effectiveness depending on age group. Between, 2 IVs, continuous DV → factorial ANOVA. Look for main effects + interaction.
  • Worked χ² 2×3 by hand. Smoker × outcome (recovered / partial / unchanged), n = 200. Compute expected counts E = (row total × column total)/n, sum χ² = Σ(O − E)²/E, df = (2−1)(3−1) = 2. Compare to the critical value χ²(2) = 5.99 at α = .05.
  • Computing Cohen's d. d = (x̄₁ − x̄₂)/s_pooled; e.g. x̄₁ = 85, x̄₂ = 80, s_pooled = 10 → d = 0.5 → medium effect.
  • FWER from 5 independent tests at α = .05. 1 − (0.95)⁵ ≈ 0.226 → ~22.6% chance of at least one false positive without correction.
  • Reporting sentence (ANCOVA). 'After adjusting for age and education, the drug group recalled more words than placebo, F(df₁, df₂) = X, p = Y, partial η² = Z.'
  • Reporting sentence (logistic). 'For each additional GRE point, the odds of admission increased by a factor of 1.005 (95% CI: 1.003 to 1.007), p < .001.'
  • Misinterpretation trap. Student writes 'r = 0.7, p = .003 means GPA causes income.' Correction: 'r = 0.7 shows a strong association. Causation requires experimental or careful confound control. As correlation, this allows reverse causality (income → GPA via private tuition) or third-variable explanations (family resources → both).'
  • Practical-significance trap. With n = 100,000, an r of 0.02 gives p < .001 but explains 0.04% of variance — statistically detectable, practically negligible.
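The hand χ² computation above can be checked with `scipy.stats.chi2_contingency` — the 2×3 cell counts below are invented; only the table shape matches the worked example:

```python
# 2x3 chi-square of independence with Cramér's V; counts are invented.
import numpy as np
from scipy.stats import chi2_contingency

# rows: smoker yes/no; cols: recovered / partial / unchanged (n = 200)
obs = np.array([[30, 40, 30],
                [50, 30, 20]])
chi2, p, dof, expected = chi2_contingency(obs)
print(dof)  # (2-1)(3-1) = 2

# Cramér's V is the effect size for tables larger than 2x2
n = obs.sum()
v = np.sqrt(chi2 / (n * (min(obs.shape) - 1)))
print(round(v, 2))  # 0.21 -- a smallish association
```

Note that `chi2_contingency` also returns the expected-count matrix, which is exactly what the E ≥ 5 assumption check needs.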

Diagrams

  • Master decision flowchart: rooted on 'How many DVs?' → branches at 'DV scale' → 'IV scale and # groups' → 'between vs within' → specific test (terminal nodes).
  • Test family chart: t-test → ANOVA → factorial ANOVA → mixed → MANOVA; t → RM-ANOVA; Pearson → simple regression → multiple regression → GLM → logistic / Poisson. Lines show 'special case of' relationships.
  • Assumption-checking checklist: a 2-column table mapping each parametric test to its assumptions and diagnostics.
  • Reporting template: a fill-in-the-blank sentence with [test], [statistic(df)], [p], [effect size], [CI] slots.
  • Five-step inference workflow as a circular feedback loop: Question → Test → Assumptions → Effect size + CI → Practical interpretation → (back to Question for next study).
  • Common confusions Venn-style: PCA ∩ FA, Reliability ∩ Validity, Type I ∩ Type II, FWER ∩ FDR — labeled overlaps where students mistakenly equate them.

Edge cases

  • One-tailed test must be pre-registered. Switching post-hoc to get significance is p-hacking and grounds for rejection.
  • Multiple comparisons — always correct (Bonferroni / Holm / BH) when m > 1.
  • Heavy outliers + small n → nonparametric or robust methods; don't blindly remove.
  • Severe assumption violations invalidate parametric inference — switch tests, don't ignore.
  • Sample size constraint: if n < 10–20 per group, prefer nonparametric or Bayesian framework over asymptotic tests.
  • Huge n caveat: statistical significance becomes trivial; effect size and practical significance dominate.
  • Open-ended questions with ambiguous design — state your interpretation of the question first, then proceed. Examiners reward defensible assumptions.

Common mistakes

  • Reporting p without effect size — half marks lost on every question that asks 'how big is the effect?'
  • Picking a one-tailed test post-hoc — p-hacking; explicit deduction.
  • Ignoring multiple comparisons when reporting several p-values from one study.
  • Confusing reliability (consistency) with validity (accuracy).
  • Treating non-significant p as 'no effect' rather than 'insufficient evidence to reject'.
  • Saying 'p = 0.03 means there's a 3% chance H₀ is true' — wrong direction of conditional.
  • Saying 'there's a 95% probability the parameter is in this CI' — that's a credible interval, not a confidence interval.
  • Reporting correlation as 'X causes Y'.
  • Saying 'the sample is normally distributed' when the relevant claim is about the sampling distribution.
  • Conflating SD and SEM. SD describes the data; SEM describes the *mean's* sampling variability.
  • Forgetting to state assumptions — slides explicitly mark this as exam-critical.
  • Running parametric tests on ordinal Likert data without justification.
  • Using a chi-square with expected cell counts < 5 (need Fisher's exact instead).

Shortcuts

  • Decision tree: DV scale → IV scale → # groups → between/within → test.
  • Always report: statistic, df, p, effect size, 95% CI.
  • Frequentist non-significance ≠ no effect. Use BF₀₁ if you need evidence for the null.
  • Statistical ≠ practical significance. Big n makes trivial effects significant.
  • FWER = 1 − (1−α)^m; Bonferroni α_new = α/m.
  • Effect-size benchmarks: Cohen's d 0.2/0.5/0.8; η² .01/.06/.14; r 0.1/0.3/0.5.
  • Reporting template: 'test, statistic(df) = X, p = Y, effect size = Z, 95% CI [a, b].'
  • 10-point answer framework for open-ended questions (question → H → variables → design → test → assumptions → diagnostics → fallback → effect size → reporting).
  • Memorise the assumption-diagnostic pairings. They appear verbatim on the exam.
  • Watch the clock. 1 mark = 1 minute is a rough budget.

Proofs / Algorithms