Probability, Distributions, and the CLT
The Spell That Makes Everything Work
When we left Maya, she had a research design. Now she has data. Sixty people drank turmeric milk for three days. Some recovered. Some didn't. Some recovered without it. How does she figure out whether the turmeric actually did anything, or whether her result is just noise dressed up as a finding?
To answer that, Maya needs to understand probability. Not the high-school-textbook version, but the working version that statisticians fight about at conferences.
The Fischer-Taimanov problem
In 1971, Bobby Fischer and Mark Taimanov played a Candidates match. Imagine that partway through, the score stands at 3–2–1 (three wins for Fischer, two losses, one draw). You're betting on the next game. What's the probability Fischer wins?
Naïvely: 3/6 = 50%. Fischer has won half the games played so far.
But wait — should the draw count? Should you treat all 6 games as equally informative? Maybe Fischer's early losses were warm-up? Maybe Taimanov was sandbagging? The answer depends on what model of the world you assume.
This is the lesson Maya needs: every statistical method makes assumptions, and the answer depends on them. Get the assumptions right, you get the right answer. Get them wrong, you get a confident but wrong answer.
In reality, Fischer won the match 6–0. So the right model wasn't "fair coin." Statistics is the art of choosing the right model, with humility.
Probability versus statistics — the directions are opposite
Here's a distinction that will save you on the exam.
Probability: you have a model, no data, and you want the probability of a hypothetical event. Given a fair coin, what's P(two tails in a row)? Answer: 0.25. You know the model. You compute the data probabilities.
Statistics (specifically inferential statistics): you have data, no firm model, and you want to figure out which model is true. Fischer won the first 3 games. Given that data, is P(Fischer wins a game) really 0.5, or is it something higher? You have data; you infer the model.
Probability flows from model → data.
Statistics flows from data → model.
The course is mostly about the second direction.
What is "probability"? Two warring tribes
The behavioural sciences are mostly frequentist, but you need to understand both views to answer exam questions on Bayesian methods later (Unit 13).
Frequentist probability
Probability is the long-run frequency of an event in repeated sampling. "P(heads) = 0.5" means: flip the coin a billion times, and the proportion of heads approaches 0.5. Convergence in the infinite limit.
*Pros:* objective. Anyone flipping the same coin under the same conditions gets the same long-run answer.
*Cons:* counter-intuitive when applied to one-off events. "70% chance of rain today" — what does that even mean in frequentist terms? You'd have to say: "Among the infinite class of days similar to this one, it rains on 70%." Try saying that to your friend with an umbrella.
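You can watch the long-run frequency converge for yourself. A minimal sketch in R (the seed and the number of flips are arbitrary choices):

```r
# Long-run frequency of heads in simulated fair-coin flips.
set.seed(42)                           # arbitrary seed, for reproducibility
flips <- rbinom(10000, size = 1, prob = 0.5)
running_prop <- cumsum(flips) / seq_along(flips)
running_prop[c(10, 100, 1000, 10000)]  # wobbles early, settles near 0.5
```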
Bayesian probability
Probability is your degree of subjective belief in a proposition, updated as evidence arrives. "P(Carlsen beats Nepomniachtchi) = 0.7" means you, the observer, are 70% confident in that outcome based on prior knowledge.
*Pros:* applies to non-repeatable events. Intuitive.
*Cons:* not fully objective — depends on priors. Different priors give different answers.
For most of this course, when someone says "the probability," they mean the frequentist version, which is what classical hypothesis testing uses.
The vocabulary you'll need
Independent events. Two events A and B are independent if knowing one happened tells you nothing about the other. Formally: $P(A \cap B) = P(A)\,P(B)$, equivalently $P(A \mid B) = P(A)$. Coin flips are independent of each other; whether you carry an umbrella and whether it's raining are not.
i.i.d. — independently and identically distributed. A sequence of variables is i.i.d. if (a) they're all independent of each other, and (b) they all come from the same distribution. The i.i.d. assumption underlies almost every test in this course. Violations (same participant tested multiple times, neighbouring brain voxels) require special methods.
Sample vs population. The *population* is the full set you care about — all voters in Punjab, all Indian college students. The *sample* is the subset you actually collected — your 1000 exit-poll respondents. You almost never observe the population. You observe a sample and try to infer properties of the population.
Sampling distribution of a statistic. Take a sample. Compute a statistic (the mean, the median, whatever). Take another sample. Compute the same statistic. Do this many times. The distribution of those statistics across samples is called the sampling distribution. This is the secret heart of inferential statistics — almost every test we'll do compares the value of a statistic to its sampling distribution under some null hypothesis.
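This is easier to see by brute force than by definition. A sketch: simulate many samples of 10 fair-die rolls and collect the mean of each (5000 repetitions is an arbitrary choice):

```r
# Build a sampling distribution of the mean by repeated sampling.
set.seed(1)
sample_means <- replicate(5000, mean(sample(1:6, size = 10, replace = TRUE)))
hist(sample_means, breaks = 30)  # the sampling distribution of the mean
mean(sample_means)               # close to 3.5, the population mean of a die
```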
Distributions, formally
A probability distribution assigns probabilities to the possible values of a random variable.
For *discrete* variables (countable values), we use a probability mass function (PMF): $P(X = x)$ for each specific value $x$.
For *continuous* variables (uncountable values), we use a probability density function (PDF), $f(x)$. The probability that X equals an exact value is zero. What's meaningful is the probability of X falling in a range: $P(a \le X \le b) = \int_a^b f(x)\,dx$.
The total area under a PDF is always 1.
The cumulative distribution function (CDF) of X is $F(x) = P(X \le x)$. It goes from 0 to 1 as $x$ goes from $-\infty$ to $+\infty$.
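To make the density/CDF distinction concrete, here is the standard Normal in R (`dnorm` is its density, `pnorm` its CDF):

```r
dnorm(0)                    # density at 0 (~0.399): a height, not a probability
pnorm(0)                    # CDF at 0: P(X ≤ 0) = 0.5
pnorm(1.96) - pnorm(-1.96)  # P(-1.96 ≤ X ≤ 1.96) ≈ 0.95, the area in that range
```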
The cast of distributions you must know
Each test you'll learn relies on one of these.
Bernoulli
Single trial with two outcomes. $P(X = 1) = p$, $P(X = 0) = 1 - p$. A coin flip with bias $p$.
Binomial(n, p)
Sum of $n$ i.i.d. Bernoulli($p$) trials. $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$. Mean $np$, variance $np(1 - p)$.
*Example:* P(exactly 6 heads in 10 tosses, p = 0.7) — in R: dbinom(6, 10, 0.7) returns 0.200.
Normal (Gaussian) $\mathcal{N}(\mu, \sigma^2)$
The famous bell curve. Continuous, symmetric. Defined by mean $\mu$ and SD $\sigma$.
Empirical rule (exam staple): ~68% within $\pm 1\sigma$ of the mean, ~95% within $\pm 2\sigma$, ~99.7% within $\pm 3\sigma$. This is why "outliers beyond 2 or 3 SDs" is a common threshold.
Standard Normal is $\mathcal{N}(0, 1)$. Any Normal can be standardised via the z-score: $z = \frac{x - \mu}{\sigma}$.
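You can verify the empirical rule straight from the Normal CDF, and see a z-score in action (the IQ numbers below are hypothetical):

```r
pnorm(1) - pnorm(-1)   # 0.683: ~68% within 1 SD
pnorm(2) - pnorm(-2)   # 0.954: ~95% within 2 SDs
pnorm(3) - pnorm(-3)   # 0.997: ~99.7% within 3 SDs
z <- (130 - 100) / 15  # hypothetical IQ of 130, with mu = 100, sigma = 15
pnorm(z)               # 0.977: proportion of the population scoring below 130
```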
t-distribution
Looks like a Normal but with heavier tails — more probability in the extremes. Used when sample size is small AND you don't know $\sigma$ (you have to estimate it, which introduces extra uncertainty captured by the heavier tails).
Parameter: degrees of freedom $\nu$, related to sample size ($\nu = n - 1$ for one sample). As $\nu \to \infty$, the t-distribution approaches the standard Normal.
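One way to see the convergence: the 97.5th percentile of the t-distribution slides down toward the Normal's as df grows:

```r
qt(0.975, df = 4)     # 2.78: heavy tails push the critical value out
qt(0.975, df = 29)    # 2.05: closer (this is the one-sample n = 30 case)
qt(0.975, df = 1000)  # 1.96: nearly Normal
qnorm(0.975)          # 1.96: the limiting value
```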
Chi-square ($\chi^2_k$)
Take $k$ independent standard-normal variables, square each, and add them up: $\chi^2_k = Z_1^2 + Z_2^2 + \cdots + Z_k^2$, where each $Z_i \sim \mathcal{N}(0, 1)$.
Right-skewed, always $\geq 0$. Used in the chi-square test for categorical data and inside variance estimates.
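You can build a chi-square by hand from its definition and compare it to R's built-in draws (seed and simulation size are arbitrary):

```r
set.seed(7)
by_hand  <- replicate(10000, sum(rnorm(3)^2))  # sum of 3 squared N(0,1) draws
built_in <- rchisq(10000, df = 3)
mean(by_hand); mean(built_in)  # both close to 3: the mean of chi-square_k is k
```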
F-distribution
The ratio of two scaled chi-squares: $F = \frac{U / d_1}{V / d_2}$, where $U \sim \chi^2_{d_1}$ and $V \sim \chi^2_{d_2}$ are independent. Right-skewed, $\geq 0$. Two df parameters, $d_1$ and $d_2$.
This is what you compare against in ANOVA. The F-test uses only the right tail — it tells you "the groups differ," not "which one is bigger."
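The same build-it-by-hand trick works for F. A sketch with $d_1 = 4$ and $d_2 = 20$ (arbitrary choices):

```r
set.seed(7)
u <- rchisq(10000, df = 4)
v <- rchisq(10000, df = 20)
f_by_hand <- (u / 4) / (v / 20)  # ratio of two scaled chi-squares
quantile(f_by_hand, 0.95)        # close to the theoretical value below
qf(0.95, df1 = 4, df2 = 20)      # 2.87: the cutoff an ANOVA would use
```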
R distribution functions — the four-letter pattern
Every distribution in R has four functions, all built on the same pattern:
- `dxxx` — density / PMF
- `pxxx` — cumulative probability (CDF)
- `qxxx` — quantile (inverse CDF)
- `rxxx` — random sample
Examples:
```r
dbinom(6, 10, 0.7)     # P(X = 6); 0.2001
pbinom(4, 10, 0.7)     # P(X ≤ 4); 0.0473
qbinom(0.04, 10, 0.7)  # 4th percentile; returns 4
rbinom(100, 10, 0.7)   # 100 random draws
```
The Central Limit Theorem — the most important thing in this course
If you remember nothing else from this unit, remember this.
Statement of the Central Limit Theorem (CLT):
Given a sufficiently large sample size, the sampling distribution of the mean approximates a Normal distribution, regardless of the original population's distribution — as long as that population has finite variance.
Read that slowly. The population distribution can be anything — skewed, bimodal, weird. Take random samples of size $n$, compute the mean of each, collect those means, plot them. **The distribution of sample means will look Normal if $n$ is big enough.**
This is the spell. It's why so many statistical tests assume normality — they're assuming it of the sampling distribution of the statistic, not of the raw data. And the CLT guarantees that this is approximately true for almost any data, as long as your sample is big enough.
Why CLT matters in practice
Maya wants to test whether the average sore-throat duration is shorter with turmeric milk than without. Her individual data points (durations) might be right-skewed — not normally distributed at all. But the mean across a sample of 60 people will be approximately normally distributed across hypothetical re-runs of her experiment.
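A sketch of the spell in action. Exponential draws stand in for right-skewed durations (the distribution and its 5-day mean are hypothetical, not Maya's real data):

```r
set.seed(60)
hist(rexp(10000, rate = 1/5), breaks = 50)  # raw durations: right-skewed
sample_means <- replicate(5000, mean(rexp(60, rate = 1/5)))
hist(sample_means, breaks = 30)             # means of n = 60: roughly Normal
```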
Two corollaries you'll use constantly
1. The sample mean is an unbiased estimator of the population mean.
2. As sample size grows, the sampling distribution narrows — bigger samples give more precise estimates.
The standard deviation of the sampling distribution has a special name: the Standard Error of the Mean (SEM): $\mathrm{SEM} = \frac{\sigma}{\sqrt{n}}$
Larger $n$ → smaller SEM → more precise estimates. **The $\sqrt{n}$ in the denominator is famous: to halve your standard error, you need four times as much data.**
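A two-line check of the quadrupling rule, with a hypothetical population SD of 10:

```r
sigma <- 10
sigma / sqrt(25)   # SEM = 2 with n = 25
sigma / sqrt(100)  # SEM = 1 with n = 100: four times the data, half the error
```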
When CLT hasn't kicked in
If your sample is tiny AND your raw data is highly non-normal, the CLT hasn't fully converged. Then you need non-parametric tests (Unit 9). Rough rule of thumb: $n \geq 30$ per group is sufficient for moderately skewed data.
Sampling techniques (the brief tour)
- Simple random sampling — every member of the population has an equal chance. Gold standard.
- Stratified sampling — split the population into subgroups (strata) and sample proportionally from each. Useful when some strata are rare but important.
- Convenience sampling — sample whoever is easy to reach (your friends, your students). High risk of bias.
- Snowball sampling — recruit a few participants, have them recruit others. Used for hard-to-reach groups.
If you know exactly how you sampled and what bias was introduced, advanced methods can correct for it. If you don't, you're in trouble.
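Here's a toy contrast between simple random and stratified sampling (the population frame and strata below are hypothetical):

```r
population <- data.frame(id = 1:1000,
                         stratum = rep(c("urban", "rural"), c(800, 200)))
# Simple random sample of 100: every member has an equal chance
srs <- population[sample(nrow(population), 100), ]
# Stratified: sample proportionally within each stratum (80 urban, 20 rural)
strat <- rbind(
  population[sample(which(population$stratum == "urban"), 80), ],
  population[sample(which(population$stratum == "rural"), 20), ])
table(srs$stratum)    # urban/rural split varies from sample to sample
table(strat$stratum)  # exactly 80/20 by construction
```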
What you carry into the exam
- Probability vs statistics: model → data vs data → model.
- Frequentist = long-run frequency. Bayesian = degree of subjective belief.
- Independent events: $P(A \cap B) = P(A)\,P(B)$. i.i.d. = independent + same distribution.
- Sample vs population: sample is what you have; population is what you want to know about.
- Sampling distribution: distribution of a statistic across many samples.
- PDF/PMF for individual probabilities; CDF for cumulative.
- Distributions to know: Bernoulli (1 trial), Binomial (n trials, mean $np$, variance $np(1-p)$), Normal (bell, $\mu$, $\sigma$), t (heavy-tailed, df), $\chi^2$ (sum of squared normals, df), F (ratio of two chi-squares, two dfs).
- R functions: d / p / q / r. Works for every distribution.
- CLT: sampling distribution of the mean → Normal as n grows. The single most important theorem in this course.
- SEM = $\sigma / \sqrt{n}$. To halve SEM, quadruple n.
- Empirical rule for Normal: 68 / 95 / 99.7 within 1 / 2 / 3 SDs.
When you're ready, send "next" and we'll move into sampling and estimation in depth — confidence intervals, the $n$ vs $n-1$ mystery, and how Maya finally puts a number on her uncertainty about whether turmeric milk works.