p-values, Errors, Power, t-tests
Maya Confronts the Null
Maya has her sample mean: 4.2 days. She has her confidence interval: [3.82, 4.58]. The CI doesn't include the textbook average of 5 days for untreated sore throats. Informally, this looks like evidence that turmeric milk works.
But "looks like" is not science. A reviewer at a journal would ask: How surprised should we be by your data if turmeric milk does nothing at all? That single question is the engine of hypothesis testing. This unit is the most heavily examined topic in the course. Every later test — t-tests, ANOVA, chi-square, regression — is just a specific application of the machinery we build here.
The scientific method — the framework hypothesis testing sits inside
Science doesn't work by collecting facts and seeing what emerges. It works by formulating ideas, deriving testable predictions, and going looking for evidence that would destroy them:
1. Develop a theory — a general framework. "Anti-inflammatory compounds reduce throat infections."
2. Derive a hypothesis — a specific, falsifiable prediction. "Turmeric milk reduces sore-throat duration vs placebo."
3. Test it — collect evidence designed to potentially refute the hypothesis.
4. Modify the theory or design new tests.
5. Repeat.
The exam loves to drill the distinction between theory and hypothesis:
- A theory is a general framework. The theory of gravity.
- A hypothesis is a specific, falsifiable prediction. "If I throw this ball at 2 m/s at 45°, it will land in 0.3 seconds."
Falsifiability — the Karl Popper rule
A scientific hypothesis must be falsifiable: there must exist some possible observation that would prove it wrong.
*"No amount of experimentation can ever prove me right, but a single experiment can prove me wrong."* — Einstein
The classic illustration: *"All swans are white."* Observing 100 white swans does not prove the hypothesis. It merely *fails to refute* it. The hypothesis would be falsified by observing a single black swan (which, famously, exists in Australia).
You can never confirm a universal claim by accumulating supporting cases; you can only fail to refute it.
This is why scientists never say *"this experiment proves the hypothesis."* They say *"fails to reject"* or *"supports."* The asymmetry between confirmation and falsification is built into the language.
Null and alternative hypotheses
Since we can only ever *fail* to falsify, we set up the question as follows:
The null hypothesis (H₀) is the boring default — "there is no effect, no difference, no relationship." Maya's null: *turmeric milk has no effect on sore-throat duration.*
The alternative hypothesis (H₁ or H_a) is the claim Maya actually believes — *turmeric milk reduces sore-throat duration.*
The procedure: try to reject H₀. If the data are very unlikely under H₀, conclude H₀ is implausible and accept H₁. If the data are consistent with H₀, fail to reject. Never say we "accept H₀" — just that there isn't enough evidence to abandon it.
Exam pattern. Given a research question: identify IV/DV; state H₀ ("no effect"); state H₁ in the direction of interest.
*Does exercise affect anxiety?*
IV: exercise (yes/no). DV: anxiety.
H₀: exercise has no effect on anxiety.
H₁: exercise lowers anxiety.
One-tailed vs two-tailed tests
A subtle but heavily tested distinction.
- Two-tailed test: "Is the effect different from zero, in *either* direction?" H₁: turmeric milk *affects* sore-throat duration (could be longer or shorter).
- One-tailed test: "Is the effect in one *specific* direction?" H₁: turmeric milk *reduces* sore-throat duration.
With α = 0.05: two-tailed puts 0.025 in each tail; one-tailed puts the full 0.05 in one tail. One-tailed is easier to reject in the specified direction, but blind to the opposite.
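A minimal sketch (Python/SciPy, not part of the course materials) of where the cutoffs fall; df = 30 is just an illustrative assumption:

```python
from scipy import stats

alpha, df = 0.05, 30  # df = 30 chosen purely for illustration

# Two-tailed: split alpha across both tails (0.025 in each)
two_tailed_cut = stats.t.ppf(1 - alpha / 2, df)

# One-tailed (predicting a *decrease*): all of alpha in the lower tail
one_tailed_cut = stats.t.ppf(alpha, df)

print(f"Two-tailed: reject H0 if |t| > {two_tailed_cut:.2f}")   # ~2.04
print(f"One-tailed: reject H0 if t < {one_tailed_cut:.2f}")     # ~-1.70
```

With df = 30 the one-tailed cutoff (about 1.70 in absolute value) is less extreme than the two-tailed one (about 2.04), which is exactly why the one-tailed test rejects more easily in the predicted direction and never in the other.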
Use two-tailed by default. Only use one-tailed if you have a strong pre-specified reason from prior literature, accepted theory, or extensive experience.
When one-tailed is NOT appropriate (exam favourite):
- Choosing one-tailed *just to get significance* — p-hacking.
- *Switching* to one-tailed after a two-tailed test fails to reject — academically dishonest.
When one-tailed IS appropriate: when you genuinely don't care about a result in one direction. A drug company tests a cheaper drug only to confirm it isn't *less effective* than the existing one. Whether the new drug is *better* doesn't matter for the business question; only "is it worse" matters.
The significance criterion α
The alpha level (α) is the threshold below which we declare a result statistically significant. It is the probability of Type I error you are willing to accept.
Convention in behavioural science: α = 0.05. Some fields use 0.01 (more stringent — particle physics uses a 5σ criterion, roughly α ≈ 3 × 10⁻⁷). The choice is a *convention*, not a law of nature.
Decision rule: p ≤ α → reject H₀; p > α → fail to reject H₀.
The p-value — what it is, and what it isn't
This is the single most misunderstood concept in statistics.
Correct definition: the p-value is the probability of obtaining data at least as extreme as the data actually observed, *assuming H₀ is true*.
So a p-value of 0.03 means: *"If H₀ were true, there's a 3% chance of seeing data this extreme just from random sampling variation."*
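A small simulation makes the conditional reading concrete. The sample size (n = 10) and SD (1.2) below are illustrative assumptions, not figures from Maya's study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not from Maya's actual study):
# under H0 sore throats last mu0 = 5 days on average with SD 1.2,
# and we observed a sample mean of 4.2 days from n = 10 patients.
mu0, sigma, n, observed_mean = 5.0, 1.2, 10, 4.2

# Simulate 100,000 studies in a world where H0 is true
null_means = rng.normal(mu0, sigma, size=(100_000, n)).mean(axis=1)

# p-value (two-tailed): how often does random sampling alone produce a
# sample mean at least this far from 5?
p_sim = np.mean(np.abs(null_means - mu0) >= abs(observed_mean - mu0))
print(f"Simulated p ≈ {p_sim:.3f}")   # around 0.03-0.04 with these numbers
```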
Common WRONG interpretations:
- ❌ "The p-value is the probability that the null hypothesis is true." (Frequentist methods don't put probabilities on hypotheses.)
- ❌ "p = 0.03 means there's a 97% chance the effect is real."
- ❌ "The probability that your result was due to chance." (Almost — but the right phrasing is *conditional*: if H₀ were true.)
Type I and Type II errors
The 2×2 grid that anchors the rest of inferential statistics:
| | H₀ True | H₀ False |
| --- | --- | --- |
| Reject H₀ | Type I (α) ❌ false positive | ✅ Power (1 − β) |
| Fail to reject | ✅ correct | Type II (β) ❌ false negative |
Type I (α): rejecting H₀ when it's true. *False positive.* You set this rate yourself when you choose α (typically 0.05).
Type II (β): failing to reject H₀ when H₁ is true. *False negative.* You miss a real effect.
The trade-off: lowering α (stricter) increases β (you miss more real effects). The only way to reduce both simultaneously: more data, better instruments, more sensitive designs.
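A rough simulation of the trade-off, using an assumed true effect of d = 0.5 and n = 30 (both invented for illustration): stricter α cuts false positives but misses more real effects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_d, n_sims = 30, 0.5, 5_000   # all illustrative choices

for alpha in (0.05, 0.01):
    false_pos = misses = 0
    for _ in range(n_sims):
        # World A: H0 is true (population mean really is 0)
        if stats.ttest_1samp(rng.normal(0, 1, n), 0).pvalue < alpha:
            false_pos += 1           # Type I error
        # World B: H1 is true (real effect of d = 0.5)
        if stats.ttest_1samp(rng.normal(true_d, 1, n), 0).pvalue >= alpha:
            misses += 1              # Type II error
    print(f"alpha={alpha}: Type I ≈ {false_pos / n_sims:.3f}, "
          f"Type II ≈ {misses / n_sims:.3f}")
```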
Causes of Type II errors the exam will probe:
- Sample size too small.
- Choosing one-tailed when the effect is in the opposite direction.
- Wrong statistical test for the data.
Statistical Power
Power = 1 − β. Probability of detecting an effect when one truly exists. High power = low chance of missing real effects.
Convention: power ≥ 0.80 — an 80% chance of detecting a real effect.
Power depends on FOUR things (critical for power analysis):
1. Type of test — independent t, paired t, ANOVA, regression…
2. α level — usually 0.05.
3. Expected effect size — how big is the effect?
4. Sample size — how many participants.
Bigger n → more power. Bigger effect → more power. Stricter α → less power. Lower noise → more power. Within-subjects designs → more power with same n.
A priori power analysis computes n needed to achieve target power given expected effect size and α. Done *before* data collection.
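A sketch of an a priori power analysis using statsmodels; the expected effect size of d = 0.5 is an assumption chosen for illustration:

```python
from statsmodels.stats.power import TTestIndPower

# How many participants per group for 80% power to detect d = 0.5
# at alpha = 0.05 (two-tailed)? The effect size is an assumed value.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.80,
                                          alternative="two-sided")
print(f"Required n per group ≈ {n_per_group:.0f}")   # roughly 64
```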
Effect size — statistical vs practical significance
Suppose Maya gives a "group discussion" intervention to 10,000 students. Pre-test 83/100, post-test 84/100 — a 1-point improvement. With n = 10,000 this is statistically significant — p microscopic. But is it practically meaningful?
Statistical significance (small p) ≠ practical significance (meaningful real-world effect). Large samples can detect tiny, uninteresting effects.
Cohen's d — standard effect size for two means:
d = (M₁ − M₂) / SD_pooled
The difference expressed in *standard deviation units* — comparable across studies and measures.
Cohen's d interpretation:
| d | Effect |
| --- | --- |
| < 0.1 | trivial |
| 0.1–0.3 | small |
| 0.3–0.5 | moderate |
| > 0.5 | large |
(Cohen's original: 0.2 small, 0.5 medium, 0.8 large.)
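A minimal helper for computing d from raw data (the pooled-SD version sketched above); the sample values are simulated purely for illustration:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Invented sore-throat durations (days), purely for illustration
rng = np.random.default_rng(3)
placebo = rng.normal(5.0, 1.0, 30)
treated = rng.normal(4.2, 1.0, 30)
print(f"d = {cohens_d(placebo, treated):.2f}")
```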
Other effect sizes: η² for ANOVA, r / r² for correlation, φ / Cramér's V for chi-square. Each on its own scale.
Reporting rule: any time you report a significant test, also report effect size. *"Significant, p < 0.001, d = 0.12"* tells the reader the effect is real but tiny.
A worked example
*Research question:* Does a group discussion intervention improve student knowledge?
- IV: intervention (discussion vs no discussion). DV: post-test score.
- H₀: no difference. H₁: discussion group scores higher (one-tailed — only useful if it helps).
- α = 0.05.
- Pre 83 (n = 1000). Post 84. Pooled SD: 8.
- Difference: 1 point.
- t-statistic; p = 0.001 — significant.
- But: Cohen's d = 1/8 = 0.125. Trivial effect.
Maya writes: *"The difference in test scores was statistically significant (p = 0.001) but the effect size was small (d = 0.125), suggesting that while there is real benefit from discussion, the practical improvement is modest."* That's the kind of nuanced reporting the course wants.
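A back-of-envelope check of that arithmetic, assuming two independent groups of n = 1000 each (the notes don't pin down the exact design, so the p-value here needn't match the 0.001 quoted above):

```python
from scipy import stats

# Summary statistics from the worked example, assuming two independent
# groups of n = 1000 each (the design detail is an assumption here).
m_pre, m_post, sd, n = 83.0, 84.0, 8.0, 1000

d = (m_post - m_pre) / sd
print(f"Cohen's d = {d:.3f}")            # 0.125 -> trivial-to-small

res = stats.ttest_ind_from_stats(m_post, sd, n, m_pre, sd, n,
                                 equal_var=True, alternative="greater")
print(f"t = {res.statistic:.2f}, one-tailed p = {res.pvalue:.4f}")
```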
The t-test family — picking the right test
The same effect can be studied with different designs, and each uses a different test:
| Design | Test |
| --- | --- |
| 1 sample vs known μ₀ | One-sample t |
| 2 unrelated groups | Independent t (Welch if unequal var) |
| 2 conditions, same people | Paired t |
| 3+ unrelated groups | One-way ANOVA |
| 3+ conditions, same people | Repeated-measures ANOVA |
| Multiple IVs | Factorial / mixed ANOVA |
One-sample t
t = (x̄ − μ₀) / (s / √n), with df = n − 1.
Independent t
t = (x̄₁ − x̄₂) / SE, where SE = s_pooled · √(1/n₁ + 1/n₂) and df = n₁ + n₂ − 2.
Paired t
t = d̄ / (s_d / √n), where d̄ and s_d are the mean and SD of the paired differences, df = n − 1.
Paired t has more power than independent t because within-subject variability is removed.
Welch's t
t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂), with Welch–Satterthwaite degrees of freedom.
Doesn't assume equal variances; adjusts df. Modern default.
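A quick tour of the family in SciPy; the durations are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated sore-throat durations (days), purely for illustration
treated = rng.normal(4.2, 1.0, 40)
placebo = rng.normal(5.0, 1.0, 40)
before  = rng.normal(5.0, 1.0, 40)
after   = before - rng.normal(0.8, 0.5, 40)   # same people, second measure

# One-sample t: is the treated mean different from the textbook 5 days?
print(stats.ttest_1samp(treated, popmean=5.0))

# Independent-samples t (classic Student version: assumes equal variances)
print(stats.ttest_ind(treated, placebo, equal_var=True))

# Welch's t: drop the equal-variance assumption (sensible default)
print(stats.ttest_ind(treated, placebo, equal_var=False))

# Paired t: two measurements on the same people
print(stats.ttest_rel(after, before))
```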
Confidence intervals and hypothesis tests — two faces of the same coin
If the 95% CI for the difference excludes zero, the corresponding two-tailed test rejects H₀ at α = 0.05. If the CI includes zero, the test fails to reject.
Maya's CI is [3.82, 4.58]; the "no effect" benchmark of 5 days lies outside it → the two-tailed test rejects H₀. The two views are equivalent: CIs report where the parameter plausibly lies; tests report whether one specific value is inconsistent with the data.
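A sketch of the equivalence on simulated change scores (all numbers invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
diff = rng.normal(-0.8, 1.5, 40)   # invented per-person change scores

# Two-tailed one-sample t-test of the mean difference against 0
res = stats.ttest_1samp(diff, popmean=0.0)

# 95% CI for the mean difference, built from the same t distribution
m, se = diff.mean(), stats.sem(diff)
lo, hi = stats.t.interval(0.95, df=len(diff) - 1, loc=m, scale=se)

print(f"95% CI: [{lo:.2f}, {hi:.2f}]   p = {res.pvalue:.4f}")
# The CI excludes 0 exactly when p < 0.05, and includes 0 when p >= 0.05.
```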
What can go wrong — a preview
- Misinterpreting p as P(H₀ | data) instead of P(data | H₀).
- p-hacking — running many tests and reporting only the significant.
- Optional stopping — peeking and stopping when p < 0.05. Inflates the Type I error rate.
- Multiple comparisons — testing many hypotheses at α = 0.05 each (Unit 8).
- HARKing — hypothesising after results are known.
These practices fuelled the replication crisis.
What you carry into the exam
- NHST 9-step recipe: H₀ & H₁ → α → test → assumptions → statistic → p → decide → effect size + CI → interpret.
- Theory vs hypothesis — general framework vs specific falsifiable prediction.
- Falsifiability — Popper's principle; never "prove", only "fail to reject".
- One-tailed vs two-tailed — two-tailed by default; one-tailed only if pre-specified.
- α = 0.05 convention; p-value = P(data | H₀) (not P(H₀ | data)).
- Type I (α) = false positive; Type II (β) = false negative; power = 1 − β.
- Cohen's d: 0.2 / 0.5 / 0.8. Always report alongside p.
- t-test family: one-sample, independent, paired, Welch. Pick by design.
- CI excludes null ↔ test rejects.
- Never 'accept' H₀. Always 'fail to reject'.
When you're ready, send "next" and we'll move into multiple comparisons — when α = 0.05 multiplied across many tests inflates the family-wise error rate, and the Bonferroni / Holm / Benjamini-Hochberg corrections.