p-values, Errors, Power, t-tests
Intuition
NHST is the workhorse of inferential statistics. Set up a null hypothesis ('no effect'); compute a test statistic; ask 'how surprising would these data be if H₀ were true?' (the p-value); reject H₀ if p < α. Type I error (α) = false alarm; Type II error (β) = missed real effect. Power = 1 − β depends on n, effect size, α, variance, design. Falsifiability (Popper) is the philosophical foundation — we can never prove H₀, only fail to reject it. Always report effect size + CI alongside p.
Explanation
The scientific method as the frame. Science doesn't collect facts and see what emerges. It (1) develops a theory (general framework — 'anti-inflammatory compounds reduce throat infections'), (2) derives a hypothesis (specific, falsifiable prediction — 'turmeric milk reduces sore-throat duration vs placebo'), (3) tests it by collecting evidence designed to potentially refute it, (4) modifies the theory or designs new tests, (5) repeats. Theory ≠ hypothesis. Theory is general; hypothesis is specific. The exam loves to drill this distinction.
Falsifiability — the Karl Popper rule. A scientific hypothesis must be falsifiable: there must exist some possible observation that would prove it wrong. Einstein: *'No amount of experimentation can ever prove me right, but a single experiment can prove me wrong.'* You observe 100 white swans; the hypothesis 'all swans are white' is not proven — merely not yet disproven. A single black swan falsifies it. You can never confirm a universal claim by accumulating supporting cases; you can only fail to refute it.
Why scientists never say 'this proves the hypothesis'. They say 'fails to reject' or 'supports'. The asymmetry between confirmation and falsification is built into the language. This is why H₀ exists — we reject H₀, not 'accept' H₁.
Null and alternative hypotheses. H₀: the boring default — 'no effect, no difference, no relationship'. H₁ (or Hₐ): the claim the researcher actually believes. Procedure: try to reject H₀. If data are very unlikely under H₀, conclude H₀ is implausible. If data are consistent with H₀, fail to reject. Never say 'accept H₀' — say 'fail to reject'.
Exam template. Given a research question: identify IV/DV; state H₀ ('no effect / no difference'); state H₁ in the direction of interest. *Example:* 'Does exercise affect anxiety?' → IV: exercise; DV: anxiety. H₀: exercise has no effect on anxiety. H₁: exercise lowers anxiety (or H₁ two-sided: exercise affects anxiety in some direction).
One-tailed vs two-tailed tests. *Two-tailed* (default): effect in either direction is meaningful. H₁: μ ≠ μ₀. *One-tailed*: direction pre-specified; only that direction matters. H₁: μ > μ₀ (or μ < μ₀). With α = 0.05: two-tailed puts 0.025 in each tail; one-tailed puts the full 0.05 in one tail. One-tailed is easier to reject in the specified direction, blind to the opposite.
When one-tailed is appropriate. Direction pre-specified BEFORE data collection AND a result in the opposite direction would be treated as null. *Example:* a drug company tests a cheaper drug to confirm it isn't *less effective* than existing — only 'worse' matters. When one-tailed is NOT appropriate: choosing it to chase significance; switching from two-tailed after seeing the data ('the two-tailed was close — let me try one-tailed'). This is p-hacking.
The significance criterion α. Threshold below which we declare a result 'statistically significant'. It is the probability of Type I error you're willing to accept. Behavioural convention: α = 0.05. Physics: 5σ, i.e. α ≈ 3 × 10⁻⁷ (much stricter). The choice is a *convention*, not a law. *Decision rule:* p < α → reject H₀; p ≥ α → fail to reject.
The p-value — what it IS. *Correct definition:* the probability of observing data at least as extreme as what you got, assuming the null hypothesis is true: p = P(data this extreme or more | H₀). So p = 0.03 means: 'if H₀ were true, there's a 3% chance of seeing data this extreme just from random sampling variation'.
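A minimal simulation sketch of that definition (all numbers hypothetical): draw many samples under H₀ and count how often the sample mean comes out at least as extreme as the one observed.

```python
import numpy as np

# Hypothetical setup: H0 says the population mean is 0 (SD = 1),
# and we observed a sample mean of 0.45 with n = 25.
rng = np.random.default_rng(42)
n, observed_mean = 25, 0.45

# Simulate many samples under H0; the p-value is the fraction of
# simulated means at least as extreme (in absolute value) as ours.
null_means = rng.normal(loc=0.0, scale=1.0, size=(100_000, n)).mean(axis=1)
p_value = np.mean(np.abs(null_means) >= abs(observed_mean))

print(f"simulated two-tailed p ≈ {p_value:.3f}")  # ≈ 0.024 (z = 0.45/(1/√25) = 2.25)
```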
The p-value — common WRONG interpretations. ❌ 'P(H₀ is true)' — frequentist methods don't put probabilities on hypotheses, only on data. ❌ 'P(H₁ is true)' — same problem. ❌ 'p = 0.03 means 97% chance the effect is real' — wrong. ❌ 'The probability that your result is due to chance' — *almost*, but the right phrasing is conditional: *if H₀ were true*, this is the chance of seeing data this extreme. Subtle but critical.
Type I error (α). Rejecting H₀ when it's true. *False positive.* You claim an effect when none exists. You set this rate yourself when you choose α (typically 0.05). Across many independent tests at α = 0.05, you expect 5% false alarms.
Type II error (β). Failing to reject H₀ when H₁ is true. *False negative.* You miss a real effect. β is *not* directly chosen — it depends on n, true effect size, α, and variance. Conventional target: β ≤ 0.20 (so power ≥ 0.80).
The Type I/Type II trade-off. At fixed n and effect size, lowering α (stricter) increases β (you miss more real effects). The only way to reduce *both* simultaneously is: (a) more data, (b) better instruments / lower variance, (c) more sensitive designs (within-subjects).
Causes of Type II errors (exam-probable). Sample size too small; choosing one-tailed when the effect is in the *other* direction; wrong statistical test for the data; noisy measurement.
Statistical Power = 1 − β. Probability your test detects an effect when one truly exists. *High power = low chance of missing real effects.* Convention: power ≥ 0.80. Lower power means too many missed findings.
Power depends on FOUR things. Critical for power analysis (done BEFORE a study): (1) Type of test (independent t, paired t, ANOVA, regression…); (2) α level (usually 0.05); (3) Expected effect size — how big is the effect you're trying to detect? (4) Sample size — how many participants. Three knobs you control + the effect size you specify.
The power relationships. Bigger n → more power. Bigger expected effect → more power (large effects are easier to see). Stricter α → less power (harder to reject). Lower noise / variance → more power. Within-subjects designs → more power than between-subjects with same n. A priori power analysis computes the n needed to achieve target power given expected effect size.
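A sketch of an a priori power analysis in Python, assuming the statsmodels package is available; the numbers line up with the G*Power example in the Examples section below.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many participants per group to detect d = 0.5 at alpha = .05, power = .80?
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                alternative='two-sided')
print(f"n per group ≈ {n_needed:.0f}")  # ≈ 64

# Conversely: what power does n = 30 per group give for d = 0.5?
power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05,
                             alternative='two-sided')
print(f"power at n = 30 ≈ {power:.2f}")  # ≈ 0.48 — underpowered
```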
Effect size — statistical vs practical significance. Suppose a 'group discussion' intervention with n = 10,000 students improves test scores by 1 point (83 → 84). With huge n, p < 0.001 — *statistically significant*. But is a 1-point gain *practically meaningful*? Maybe not. Large samples can detect tiny, uninteresting effects. Effect size quantifies the magnitude on a standardised scale, independent of n.
Cohen's d — standard effect size for two means. d = (x̄₁ − x̄₂) / s_pooled: the difference in *standard deviation units*. Cohen's original benchmarks: 0.2 small, 0.5 medium, 0.8 large. (Some sources give slightly different cutoffs; use what your slides specify.)
Other effect sizes for other tests. η² / partial η² for ANOVA; r and r² for correlation; φ and Cramér's V for chi-square; odds ratio for logistic regression.
Reporting rule. Any significant test must include an effect size. *'Significant, p < 0.001'* alone is insufficient. *'Significant, p < 0.001, d = 0.12'* tells the reader the effect is real but tiny. APA and most journals require this.
Worked example — group-discussion intervention. IV: discussion vs control. DV: post-test score. H₀: no difference. H₁: discussion > control (one-tailed). α = 0.05. n = 1000; control mean 83, discussion mean 84; pooled SD 8. The t-test gives p ≈ 0.001 → significant. But Cohen's d = 1/8 = 0.125 → trivial effect. Maya correctly writes: *'Statistically significant (p = 0.001) but the effect size was small (d = 0.125), suggesting real but modest practical improvement.'* This is the nuanced reporting the course wants.
The t-test family — picking the right one. *One-sample t:* tests whether a sample mean differs from a hypothesised μ₀. t = (x̄ − μ₀) / (s/√n), df = n − 1. *Independent (two-sample) t:* compares means of two unrelated groups. t = (x̄₁ − x̄₂) / (s_p √(1/n₁ + 1/n₂)), df = n₁ + n₂ − 2. *Paired (related-samples) t:* compares two related measurements on the same units. t = d̄ / (s_d/√n), df = n − 1. Paired t has more power than independent t because within-subject variability is removed from the error.
Welch's t-test. Like independent t but doesn't assume equal variances. t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), with adjusted df via the Welch-Satterthwaite formula: df ≈ (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1)]. Modern default when in doubt about equal variances; only slightly conservative when variances are equal.
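A sketch of the whole family with scipy (hypothetical data). Note that `equal_var=False` is all it takes to switch from Student to Welch, and Levene's test (next paragraph) guides that choice.

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration (not the course's numbers).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=78, scale=10, size=30)
group_b = rng.normal(loc=72, scale=12, size=30)
pre = rng.normal(loc=80, scale=8, size=30)
post = pre + rng.normal(loc=4, scale=6, size=30)  # same people, measured again

# One-sample t: does group_a's mean differ from a hypothesised mu0 = 75?
print(stats.ttest_1samp(group_a, popmean=75))

# Independent (Student) t: assumes equal variances, df = n1 + n2 - 2.
print(stats.ttest_ind(group_a, group_b))

# Welch's t: equal_var=False drops the equal-variance assumption.
print(stats.ttest_ind(group_a, group_b, equal_var=False))

# Paired t: two related measurements on the same units, df = n - 1.
print(stats.ttest_rel(pre, post))

# Levene's test for homogeneity of variance (guides Student vs Welch).
print(stats.levene(group_a, group_b))
```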
Assumptions of t-tests. *Independence* of observations (within and between groups). *Approximate normality* of the DV (matters less at large n via CLT). *Homogeneity of variance* for independent t (test with Levene's; use Welch if violated). *Interval/ratio DV.*
Picking the test by design. *One IV, two unrelated groups* → independent t. *One IV, same people in two conditions* → paired t. *More than two groups, different people* → one-way ANOVA. *More than two conditions, same people* → repeated-measures ANOVA. *More than one IV* → factorial / mixed ANOVA.
CIs and hypothesis tests are two faces of the same coin. If the 95% CI for the difference excludes zero, the corresponding two-tailed test rejects H₀ at α = 0.05. If the CI includes zero, the test fails to reject. Equivalent. CIs report where the parameter likely lies; tests report whether a specific value is inconsistent with the data.
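A quick numerical illustration of the equivalence, assuming scipy (hypothetical sample):

```python
import numpy as np
from scipy import stats

# The 95% CI excludes mu0 exactly when the two-tailed test rejects at .05.
rng = np.random.default_rng(1)
x = rng.normal(loc=4.2, scale=1.0, size=40)  # hypothetical sample
mu0 = 5.0                                    # null value to test

mean, se = x.mean(), stats.sem(x)
t_crit = stats.t.ppf(0.975, df=len(x) - 1)
ci = (mean - t_crit * se, mean + t_crit * se)

t_stat, p = stats.ttest_1samp(x, popmean=mu0)
print(f"95% CI = [{ci[0]:.2f}, {ci[1]:.2f}], p = {p:.4f}")
print("reject:", p < 0.05, "| mu0 outside CI:", not (ci[0] <= mu0 <= ci[1]))
# The last two booleans always agree — two faces of the same coin.
```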
What can go wrong — the pitfalls. (1) Misinterpreting p as P(H₀|data) instead of P(data|H₀). (2) p-hacking — running many tests and reporting only the significant one. (3) Optional stopping — peeking at data and stopping when p < 0.05; corrupts Type I rate. (4) Multiple comparisons — testing many hypotheses at α = 0.05 each; FWER climbs fast (Unit 8). (5) HARKing — hypothesising after results are known; invalidates frequentist inference. These practices fuelled the replication crisis.
Definitions
- Theory vs hypothesis — Theory = general framework. Hypothesis = specific falsifiable prediction. Theories generate hypotheses.
- Falsifiability (Popper) — A scientific hypothesis must have a possible observation that would prove it wrong. Science fails to falsify; never proves.
- Null hypothesis (H₀) — The 'no effect / no difference' default. We try to reject H₀; never 'accept'.
- Alternative hypothesis (H₁) — The claim the researcher believes — an effect exists.
- One-tailed vs two-tailed — One-tailed: direction pre-specified; opposite treated as null. Two-tailed: any direction matters. Two-tailed is the default.
- α (significance level) — Threshold p-value for rejecting H₀. Probability of Type I error. Convention: 0.05 in behavioural science.
- p-value — P(data this extreme or more | H₀). NOT P(H₀ | data). The most-misinterpreted concept in statistics.
- Type I error — Rejecting a true H₀. False positive. Rate = α. Chosen by the researcher.
- Type II error — Failing to reject a false H₀. False negative. Rate = β. Determined by n, effect size, α, variance.
- Statistical power — 1 − β = P(reject H₀ | H₁ true). Probability of detecting a real effect. Convention: ≥ 0.80.
- Cohen's d — Standardised mean difference: d = (x̄₁ − x̄₂) / s_pooled. 0.2/0.5/0.8 = small/medium/large.
- Effect size — Standardised magnitude of an effect, independent of sample size. Always report alongside p.
- Statistical vs practical significance — Statistical = p < α. Practical = effect is meaningful in context. Large n can make trivial effects significant.
- One-sample t-test — Tests sample mean against a hypothesised value. t = (x̄ − μ₀) / (s/√n), df = n − 1.
- Independent (two-sample) t-test — Compares two unrelated group means. df = n₁ + n₂ − 2. Assumes equal variances (use Welch if not).
- Paired t-test — Compares two related measurements on the same units. df = n − 1. More power than independent for same n.
- Welch's t-test — Independent t without equal-variance assumption; adjusted df via Welch-Satterthwaite. Modern default.
- Power analysis (a priori) — Compute the n needed to achieve target power (e.g., 0.80) given expected effect size, α, and test type. Done BEFORE data collection.
- Optional stopping — Peeking at data during collection and stopping when p < α. Inflates actual Type I rate above nominal α.
- Multiple comparisons problem — Testing many hypotheses at α = 0.05 each — family-wise Type I error climbs. Unit 8 covers corrections.
Formulas
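Standard forms for the tests in this unit, consistent with the definitions above (LaTeX block for reference):

```latex
% Key formulas for this unit (standard forms).
t_{\text{one-sample}} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}, \qquad df = n - 1

t_{\text{independent}} = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}, \qquad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \qquad df = n_1 + n_2 - 2

t_{\text{paired}} = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad df = n - 1

t_{\text{Welch}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad
df \approx \frac{(s_1^2/n_1 + s_2^2/n_2)^2}
               {\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}

d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}} \ (\text{Cohen's } d), \qquad
\text{power} = 1 - \beta, \qquad
n_{\text{per group}} \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}
```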
Derivations
Why paired t has more power than independent t at the same n. Within-subject correlation reduces noise. For paired data with within-subject correlation ρ: Var(x₁ − x₂) = 2σ²(1 − ρ), so SE(d̄) = σ√(2(1 − ρ)/n). An independent design with n per group has SE = σ√(2/n). Ratio: the paired SE is smaller by a factor √(1 − ρ). With ρ = 0.7: paired SE ≈ 55% of independent SE. Hence larger t for the same effect, more power.
Why one-tailed has more power in the correct direction. Two-tailed critical value t₁₋α/₂ (for large n, 1.96 at α = 0.05). One-tailed critical value t₁₋α (≈ 1.645). A test with a true effect requires t > critical value to reject. Lower critical value → more frequent rejection → higher power. But: the test is blind to the opposite direction; an effect in the other direction is treated as null. Using one-tailed *after seeing the data* corrupts the Type I rate and is p-hacking.
Why CI exclusion ↔ rejection. A two-tailed test rejects when |x̄ − μ₀| / SE > t*, equivalently |x̄ − μ₀| > t*·SE. CI: x̄ ± t*·SE. **The null value μ₀ lies outside the CI iff |x̄ − μ₀| > t*·SE** — exactly the rejection condition. CIs and tests are mathematically equivalent for two-tailed inferences.
Type I and Type II are inversely related at fixed n. Set the rejection threshold at c. Under H₀ (with true mean μ₀), α = P(x̄ > c | μ₀). Under H₁ (with true mean μ₁ > μ₀), power = P(x̄ > c | μ₁). Moving the threshold left (more lenient α) increases both α AND power (decreases β). Moving right reduces α but also reduces power. Only n and effect size can improve both.
The four power knobs interact multiplicatively. Approximately: n per group ≈ 2(z₁₋α/₂ + z₁₋β)² / d² for independent t. Halving d quadruples n. Tightening α from 0.05 to 0.01 increases required n by ~50%. A useful planning formula.
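A quick check of this planning approximation against the G*Power values quoted in the Examples below, assuming scipy:

```python
from scipy import stats

# Planning approximation: n per group ≈ 2 (z_{1-α/2} + z_{1-β})² / d².
def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

print(round(n_per_group(0.5)))               # ≈ 63  (G*Power: 64)
print(round(n_per_group(0.2)))               # ≈ 392 (G*Power: 393) — halving d ~quadruples n
print(round(n_per_group(0.5, alpha=0.01)))   # ≈ 93  — stricter alpha costs ~50% more n
```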
Examples
- Two groups n=30 each. Means 78 vs 72; SDs 10 and 12. Pooled SD ≈ 11.05. SE_pooled ≈ 2.85. t ≈ 2.10. df = 58. Two-tailed p ≈ 0.04 → reject. Cohen's d = 6/11.05 ≈ 0.54 (medium). Report: *'t(58) = 2.10, p = 0.04, d = 0.54.'* (Verified numerically in the sketch after this list.)
- Paired t example. Pre-post on same n = 30. Mean difference d̄ = 4, SD of differences = 6. SE = 6/√30 ≈ 1.10. t = 4/1.10 ≈ 3.64. df = 29. p ≈ 0.001 → reject. Cohen's d_z = 4/6 ≈ 0.67 (medium-large).
- Power planning. Want to detect d = 0.5 at α = 0.05, power = 0.80, independent t. G*Power gives n ≈ 64 per group. At d = 0.2: n ≈ 393 per group. Halving d quadruples sample size — small effects are expensive to detect.
- Statistical vs practical. n = 10,000 students per group; mean discussion 84, mean control 83; SD = 8. t ≈ 8.84, p ≈ 0. Cohen's d = 1/8 = 0.125 — trivial. Significant but practically meaningless.
- Wason cards (Popper revisited). 'Odd → vowel'. Cards A, 2, 7, K. Flip 7 (odd; need vowel on back to confirm rule; consonant would falsify) and K (consonant; need not-odd on back to confirm; odd would falsify). Confirmation bias = flipping A; Popper = flipping 7 and K.
- CIs and tests equivalent. Maya's 95% CI on mean recovery time [3.82, 4.58]. 'No effect' benchmark 5 days lies outside → two-tailed t-test of H₀: μ = 5 against H₁: μ ≠ 5 rejects at α = 0.05. Same conclusion.
- Welch vs Student t. Two groups: n₁ = 20 with variance s₁²; n₂ = 60 with variance ≈ 6s₁². Variances differ ~6×; Levene's test rejects equality. Use Welch's t — adjusted df via Welch-Satterthwaite ≈ 75 (smaller than the df = 78 Student would use).
- One-tailed pitfall. Two-tailed t gives p = 0.07. Researcher 'realises' the effect *had* to be positive and switches to one-tailed: p = 0.035. Reject? No — p-hacking. If direction wasn't pre-specified, must report two-tailed.
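The sketch referenced above — reproducing the first two worked examples from summary statistics alone, assuming scipy:

```python
from scipy import stats

# Independent-groups example: t, p from summary stats; d by hand.
t, p = stats.ttest_ind_from_stats(mean1=78, std1=10, nobs1=30,
                                  mean2=72, std2=12, nobs2=30)
pooled_sd = ((29 * 10**2 + 29 * 12**2) / 58) ** 0.5
d = (78 - 72) / pooled_sd
print(f"t(58) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")  # t(58) = 2.10, p = 0.040, d = 0.54

# Paired example: t from the mean and SD of the difference scores.
t_paired = 4 / (6 / 30 ** 0.5)
p_paired = 2 * stats.t.sf(t_paired, df=29)
print(f"t(29) = {t_paired:.2f}, p = {p_paired:.4f}")  # t(29) = 3.65, p ≈ 0.001
```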
Diagrams
- Type I/II error 2×2 table. Rows: reject / fail-to-reject. Cols: H₀ true / H₀ false. Cells: α (false positive); 1−β = power; 1−α (correct retention); β (false negative).
- Sampling distributions under H₀ vs H₁. Two overlapping distributions; shaded α tail under H₀ (rejection region); shaded β region under H₁ (Type II); power = 1 − β = unshaded area to the right of the rejection threshold under H₁.
- One-tailed vs two-tailed. Same H₀ distribution; one-tailed has full 0.05 in one tail; two-tailed has 0.025 in each tail. Critical value smaller for one-tailed → easier to reject.
- Power as a function of n and d. Curves showing power increasing with n and with effect size; conventional 0.80 horizontal line.
- Type I-Type II trade-off. Move rejection threshold left → α↑, β↓. Right → α↓, β↑. Only changing n shifts both curves apart.
- Wason cards setup. A, 2, 7, K, D, L face-up; mark 7 and K as 'must flip to falsify'; A as 'confirmation-bias trap'.
- CI and test equivalence. Number line with CI; null value μ₀; reject iff μ₀ outside CI.
Edge cases
- Tiny effect, huge n → p ≈ 0. d = 0.05, n = 10,000 → t ≈ 5 → p < 10⁻⁶. Significant but practically trivial. Report effect size.
- Welch's t is the safe default for unequal variances or when in doubt. Loses negligible power vs Student t when variances are equal.
- Non-normality + small n → consider Mann-Whitney (independent) or Wilcoxon (paired).
- Don't 'accept' H₀. *Fail to reject* ≠ *accept*. The data are just insufficient to reject; the null may still be false.
- Post-hoc power (computed from observed effect size and p) is essentially the same information as p — uninformative. Prospective power analysis is the standard.
- One-sample t-tests require known μ₀. If μ₀ is itself estimated from data, use a different test.
- Paired t with small n is robust if differences are normally distributed; check normality of *differences*, not raw values.
- Practice / order effects in paired designs — counterbalance.
Common mistakes
- Saying p = P(H₀ true). Wrong — p is P(data | H₀). The prosecutor's fallacy.
- Accepting H₀ on a non-significant test. Always 'fail to reject'.
- Switching to one-tailed post-hoc to rescue p = 0.06. p-hacking.
- Reporting only p without effect size + CI. An APA violation.
- Ignoring assumptions before applying parametric tests (normality, homogeneity of variance).
- Confusing one-tailed and two-tailed p-values. One-tailed p ≈ two-tailed p / 2 (for the predicted direction).
- Using α = 0.05 for many comparisons without correction (Unit 8).
- Confusing Type I rate α with the false discovery rate (P(H₀ true | rejection), the complement of PPV) — related but different.
- Reporting power post-hoc from observed effect — uninformative; do power analysis a priori.
Shortcuts
- NHST 9-step recipe: state H₀ & H₁ → choose α → pick test → check assumptions → compute statistic → find p → decide → effect size + CI → interpret.
- Power = 1 − β. Four knobs: test, α, effect size, n.
- Cohen's d: 0.2 / 0.5 / 0.8 = small / medium / large.
- Welch's t = independent t without equal-variance assumption — modern default.
- Paired t > independent t for same n if within-subject correlation > 0.
- CI excludes null ↔ test rejects at corresponding α.
- Never 'accept' H₀. Always 'fail to reject'.
- One-tailed ONLY if direction pre-specified AND opposite is treated as null.
- p < α → reject; p ≥ α → fail to reject.
- Always report effect size + CI alongside p.
Proofs / Algorithms
One-sample t-test statistic distribution. Under H₀: x̄ ~ N(μ₀, σ²/n) approximately (by CLT). Standardising with the *estimated* SD produces t = (x̄ − μ₀) / (s/√n). Because s is itself a random variable (not the known σ), the standardised statistic follows a Student t-distribution with n − 1 df, not a Normal. As n → ∞, s → σ and t_{n−1} → N(0, 1).
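A simulation sketch of this fact (numbers hypothetical): at n = 5 the tails of the standardised statistic are noticeably fatter than Normal, and match t with 4 df.

```python
import numpy as np
from scipy import stats

# Standardising with the *estimated* SD fattens the tails at small n.
rng = np.random.default_rng(7)
n, reps = 5, 200_000
samples = rng.normal(0.0, 1.0, size=(reps, n))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Tail probability beyond 1.96: Normal predicts .05; t(4) predicts ~.12.
print(np.mean(np.abs(t_stats) > 1.96))  # ≈ 0.12, matches t with df = 4
print(2 * stats.t.sf(1.96, df=n - 1))   # ≈ 0.121
print(2 * stats.norm.sf(1.96))          # 0.05 — the Normal understates the tails
```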
CIs and two-tailed tests are equivalent at the corresponding α. Confidence interval: x̄ ± t*₁₋α/₂,df · s/√n. Two-tailed test rejection: |t| = |x̄ − μ₀| / (s/√n) > t*₁₋α/₂,df, equivalently |x̄ − μ₀| > t*·s/√n. The null value lies outside the CI iff this rejection condition holds. Hence rejecting at α ⟺ μ₀ outside the (1 − α) CI.
Why paired t can be dramatically more powerful. For a two-group comparison: variance of (x̄₁ − x̄₂) under an independent design = 2σ²/n, assuming equal variances. Under a paired design with within-subject correlation ρ: variance of d̄ = 2σ²(1 − ρ)/n. Ratio: paired/independent variance = 1 − ρ. For ρ = 0.7: paired variance is 30% of independent — roughly the same power with 30% of the participants. This is why repeated-measures designs are favoured when feasible.
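A simulation sketch of the power gap, assuming numpy/scipy (ρ = 0.7, true effect 0.5 SD, n = 30 — hypothetical values):

```python
import numpy as np
from scipy import stats

# Compare paired vs independent power at the same n per condition/group.
rng = np.random.default_rng(3)
n, reps, rho, effect = 30, 10_000, 0.7, 0.5
cov = [[1.0, rho], [rho, 1.0]]

reject_paired = reject_indep = 0
for _ in range(reps):
    # Paired: two correlated measurements on the same n subjects.
    xy = rng.multivariate_normal([0.0, effect], cov, size=n)
    reject_paired += stats.ttest_rel(xy[:, 0], xy[:, 1]).pvalue < 0.05
    # Independent: two unrelated groups of n each, same true effect.
    a, b = rng.normal(0.0, 1.0, n), rng.normal(effect, 1.0, n)
    reject_indep += stats.ttest_ind(a, b).pvalue < 0.05

print(f"paired power ≈ {reject_paired / reps:.2f}")       # ≈ 0.93
print(f"independent power ≈ {reject_indep / reps:.2f}")   # ≈ 0.47
```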