Behavioral Research: Statistical Methods

CG3.402
Vinoo Alluri · Monsoon 2025-26 · 4 credits
Revision Notes · Unit 3 — Probability & Distributions

Probability, Distributions, and the CLT


Intuition

Probability is either long-run frequency (frequentist) or degree of belief (Bayesian). PDFs describe continuous variables (probability of any *exact* value = 0); CDFs give cumulative probabilities. The Central Limit Theorem is the spell that makes everything else work: sample means → Normal as $n$ grows, regardless of population shape (provided finite variance). That is why parametric tests apply broadly — they assume normality of the *sampling distribution*, not of the raw data, and the CLT guarantees the former.

Explanation

The Fischer–Taimanov puzzle. 1971: Bobby Fischer vs Mark Taimanov, 6 games played, Fischer up 3–2–1 (three wins, two losses, one draw). Probability Fischer wins game 7? Naïvely 3/6 = 0.5. But should the draw count? Should we treat all 6 games as equally informative? Every statistical method makes assumptions, and the answer depends on them. Fischer ultimately won 6–0 — the fair-coin model was wrong. Statistics is the art of choosing the right model, with humility.

Probability vs Statistics — the directions are opposite. *Probability* flows from model → data: given a fair coin, P(two tails in a row) = 0.25. *Statistics* (inferential) flows from data → model: Fischer won 3/6; given that, is P(Fischer wins) really 0.5 or something higher? Most of this course is the second direction.

Frequentist probability. Probability = long-run frequency of an event in repeated sampling. 'P(heads) = 0.5' means: flip a billion times, proportion of heads → 0.5. *Pros:* objective. *Cons:* counter-intuitive for one-off events. '70% chance of rain today' under frequentism means 'across the infinite class of days similar to this one, it rains 70% of the time' — try saying that to a friend with an umbrella.

Bayesian probability. Probability = degree of subjective belief, updated as evidence arrives. 'P(Carlsen beats Nepomniachtchi) = 0.7' means you, the observer, are 70% confident based on prior knowledge. *Pros:* applies to non-repeatable events. *Cons:* not fully objective — depends on priors. Most of this course uses frequentism; Bayes returns in Unit 13.

Independence. Two events are independent if knowing one tells you nothing about the other: $P(A \cap B) = P(A)\,P(B)$, equivalently $P(A \mid B) = P(A)$. Coin flips are independent; weather and whether you carry an umbrella are not.

i.i.d. — independent and identically distributed. A sequence $X_1, X_2, \ldots, X_n$ is i.i.d. if (a) they're all independent of each other and (b) they all come from the same distribution. i.i.d. is the assumption underlying almost every test in this course. Violations (same participant tested multiple times → repeated measures; neighbouring brain voxels → spatial autocorrelation) require special methods.

Sample vs population. Population = full set you care about (all Indian voters). Sample = subset you actually have (your 1000 respondents). You almost never observe the population; you observe a sample and try to *infer* population properties.

Sampling distribution of a statistic — the secret heart of inferential statistics. Take a sample; compute a statistic (mean, median, t-value); take another sample; compute again. The distribution of those statistics across samples is the sampling distribution. Almost every test compares an observed statistic to its sampling distribution under some null. Distinct from the population distribution AND from a single sample's distribution.

PDF / PMF / CDF formally. Discrete: PMF $p(k) = P(X = k)$ assigns probability to each value. Continuous: PDF $f(x) \ge 0$ with $P(a \le X \le b) = \int_a^b f(x)\,dx$; the probability of *any exact* value is zero. Total area under the PDF = 1. CDF $F(x) = P(X \le x)$ runs from 0 to 1; it's the integral of the PDF.

Bernoulli(p). Simplest discrete: single trial with two outcomes. $P(X = 1) = p$ (success), $P(X = 0) = 1 - p$ (failure). One parameter: $p$.

Binomial(n, p). Sum of $n$ i.i.d. Bernoulli(p) trials. PMF $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. Mean $np$, variance $np(1-p)$. Example: P(exactly 6 heads in 10 flips with $p = 0.7$) = dbinom(6, 10, 0.7) ≈ 0.200.
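
A quick sanity check on the PMF and moments, as a sketch in base R (the dbinom/rbinom calls follow the four-letter pattern described below):

```r
# Binomial PMF by hand vs the built-in, for P(X = 6) with n = 10, p = 0.7
n <- 10; k <- 6; p <- 0.7
by_hand  <- choose(n, k) * p^k * (1 - p)^(n - k)
built_in <- dbinom(k, size = n, prob = p)
c(by_hand = by_hand, built_in = built_in)   # both ~0.200

# Mean and variance by simulation (1e5 draws)
x <- rbinom(1e5, size = n, prob = p)
c(sim_mean = mean(x), theory = n * p)            # ~7
c(sim_var  = var(x),  theory = n * p * (1 - p))  # ~2.1
```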

**Normal $\mathcal{N}(\mu, \sigma^2)$.** Bell-shaped, symmetric. Two parameters: mean $\mu$ and SD $\sigma$. PDF $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$. Empirical rule (exam staple): ~68% within $\mu \pm 1\sigma$, ~95% within $\mu \pm 2\sigma$, ~99.7% within $\mu \pm 3\sigma$. This is why outliers beyond 2–3 SDs are flagged. Standard Normal $\mathcal{N}(0, 1)$; any Normal can be standardised via $Z = (X - \mu)/\sigma$.
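
The empirical rule is just three CDF differences; a one-line sketch in R:

```r
# Probability within ±1, ±2, ±3 SDs of the mean, for any Normal
sapply(1:3, function(k) pnorm(k) - pnorm(-k))
# 0.683 0.954 0.997  ->  the 68/95/99.7 rule
```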

t-distribution. Bell-shaped but with heavier tails than Normal — more probability in extremes. One parameter: degrees of freedom (df). Used when sample size is small and $\sigma$ is unknown (estimating $\sigma$ with $s$ adds uncertainty, captured by heavier tails). As df → ∞, t → Normal; at df = 5 still visibly fatter; by df ≈ 30 nearly indistinguishable. Critical for t-tests (Unit 7) and confidence intervals.
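
To see how much the heavy tails matter in practice, compare two-tailed 5% critical values of t against the Normal (a sketch; qt and qnorm are base R):

```r
# 97.5th percentiles: t at various df vs the standard Normal
round(qt(0.975, df = c(2, 5, 10, 30, 100)), 3)
# 4.303 2.571 2.228 2.042 1.984
qnorm(0.975)   # 1.960 -- the limit as df grows
```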

**Chi-square $\chi^2_k$.** Take $k$ independent standard Normals, square each, sum: $\chi^2_k = \sum_{i=1}^{k} Z_i^2$. Right-skewed, always $\ge 0$. Mean $k$, variance $2k$. Used in $\chi^2$ tests for categorical data (Unit 9) and inside variance estimates.
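
The definition is directly checkable: square and sum $k$ standard Normals and compare with rchisq (a simulation sketch):

```r
set.seed(1)
k <- 3
# chi-square(3) from first principles: sum of 3 squared N(0,1)s
z <- matrix(rnorm(1e5 * k), ncol = k)
q <- rowSums(z^2)
c(mean = mean(q), var = var(q))   # ~3 and ~6, i.e. k and 2k
# and via the built-in generator
q2 <- rchisq(1e5, df = k)
c(mean = mean(q2), var = var(q2))
```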

F-distribution. Ratio of two scaled $\chi^2$s: $F = \dfrac{U/d_1}{V/d_2}$ where $U \sim \chi^2_{d_1}$, $V \sim \chi^2_{d_2}$, independent. Right-skewed, $\ge 0$. Two df parameters for numerator and denominator. Used in ANOVA — F values near 1 indicate similar variances; $F \gg 1$ indicates the numerator chi-square is much larger. F-tests are one-tailed by construction.
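
The F can likewise be assembled from two independent chi-squares and compared against R's rf (a sketch using d1 = 3, d2 = 20, the same df as the diagram below):

```r
set.seed(2)
d1 <- 3; d2 <- 20
# F = (U/d1) / (V/d2) with independent chi-squares U and V
f_manual  <- (rchisq(1e5, d1) / d1) / (rchisq(1e5, d2) / d2)
f_builtin <- rf(1e5, df1 = d1, df2 = d2)
# Near-identical quantiles confirm the construction
round(quantile(f_manual,  c(0.5, 0.95)), 2)
round(quantile(f_builtin, c(0.5, 0.95)), 2)
```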

R's four-letter pattern. For any distribution xxx: dxxx density / PMF; pxxx cumulative CDF $P(X \le x)$; qxxx quantile (inverse CDF); rxxx random sample. Examples: dbinom(6, 10, 0.7) ≈ 0.200; pbinom(4, 10, 0.7) ≈ 0.047; qnorm(0.975) ≈ 1.96; rnorm(100, mean=0, sd=1) for a sample. Warning for discrete CDFs: because discrete vars only take certain values, only some percentiles are achievable — R rounds up to the next reachable quantile.
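
The rounding-up behaviour for discrete quantiles is easy to trigger deliberately (a sketch):

```r
# The Binomial(10, 0.5) CDF moves in jumps; only some percentiles exist
pbinom(3:5, size = 10, prob = 0.5)   # 0.172 0.377 0.623
# No value has CDF exactly 0.40, so qbinom returns the smallest k
# whose CDF is >= 0.40 -- here k = 5
qbinom(0.40, size = 10, prob = 0.5)  # 5
```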

Central Limit Theorem — the most important theorem in the course. *Given a sufficiently large sample size, the sampling distribution of the sample mean approximates a Normal distribution, regardless of the original population's distribution — as long as that population has finite variance.* The population can be anything (skewed, bimodal, weird); the *sampling distribution* of the mean is approximately Normal at large $n$. This is why most parametric tests apply broadly — they assume normality of the test statistic's sampling distribution, which the CLT guarantees, *not* normality of the raw data.
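
A minimal simulation sketch, assuming an Exponential(1) population (strongly right-skewed, mean 1, SD 1):

```r
set.seed(3)
sample_means <- function(n, reps = 1e4) replicate(reps, mean(rexp(n, rate = 1)))
for (n in c(1, 5, 20, 100)) {
  m <- sample_means(n)
  cat(sprintf("n = %3d: mean of means = %.3f, SD = %.3f (theory %.3f)\n",
              n, mean(m), sd(m), 1 / sqrt(n)))
}
# hist(sample_means(100))  # visibly bell-shaped despite the skewed population
```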

Two corollaries. (1) The sample mean is an *unbiased estimator* of the population mean: $E[\bar{X}] = \mu$ across hypothetical re-runs. (2) **SEM = $\sigma/\sqrt{n}$** — the standard deviation of the sampling distribution. Larger $n$ → smaller SEM → more precise estimates. **The $\sqrt{n}$ in the denominator is famous: to halve your SEM, you need *4×* as much data.** Exam fixture.
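
The SEM corollary and the 4× rule can be verified by brute force (a sketch, with the population SD assumed to be 10):

```r
set.seed(4)
sigma <- 10
# Empirical SD of the sample mean (i.e. the SEM) at n = 25 and n = 100
sem_hat <- function(n) sd(replicate(2e4, mean(rnorm(n, 0, sigma))))
c(n25  = sem_hat(25),  theory = sigma / sqrt(25))   # ~2
c(n100 = sem_hat(100), theory = sigma / sqrt(100))  # ~1: 4x data halves SEM
```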

Law of Large Numbers (LLN). As $n \to \infty$, $\bar{X} \to \mu$. LLN is about the *point estimate* converging to the truth. CLT is about the *shape of variability* around that truth. LLN says you're going somewhere; CLT says the path is Normal.
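
A running mean makes the LLN side concrete (a sketch with fair-coin flips):

```r
set.seed(5)
flips <- rbinom(1e4, size = 1, prob = 0.5)
running_mean <- cumsum(flips) / seq_along(flips)
running_mean[c(10, 100, 1000, 10000)]  # wanders early, then hugs 0.5
# plot(running_mean, type = "l"); abline(h = 0.5, lty = 2)
```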

When CLT hasn't kicked in. If your sample is tiny AND the raw data is highly non-Normal, the sampling distribution may also be non-Normal. Then use non-parametric tests (Unit 9). Rule of thumb: $n \ge 30$ per group is sufficient for moderately skewed data.

Sampling with vs without replacement. With replacement: pure i.i.d. Without replacement: strictly not independent (each draw changes what's left), but if the *population* is much larger than the sample, the dependence is negligible — treat as i.i.d. Most methods assume with-replacement; without-replacement gives essentially the same answers for reasonably-sized populations.
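
The negligible-dependence claim can be probed directly, as a sketch comparing both schemes against a large synthetic population:

```r
set.seed(6)
population <- rnorm(1e5, mean = 50, sd = 10)   # population >> sample
draw_mean <- function(repl) mean(sample(population, size = 30, replace = repl))
# SD of the sample mean under each scheme is nearly identical here
c(with_repl    = sd(replicate(5e3, draw_mean(TRUE))),
  without_repl = sd(replicate(5e3, draw_mean(FALSE))),
  theory       = 10 / sqrt(30))
```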

Definitions

  • Frequentist probability: Long-run frequency of an event in repeated sampling. Objective but counter-intuitive for one-off events.
  • Bayesian probability: Degree of subjective belief, updated by evidence. Intuitive for one-off events; depends on priors.
  • Independent events: $P(A \cap B) = P(A)P(B)$, equivalently $P(A \mid B) = P(A)$. Coin flips are independent; correlated measurements are not.
  • i.i.d.: Independent AND identically distributed. The bedrock assumption of most inferential tests.
  • Sample vs population: Population = the full set of interest; sample = the subset you actually observe. Inference goes sample → population.
  • Sampling distribution: Distribution of a statistic across many hypothetical samples. The secret heart of inferential statistics — every test compares an observed statistic to this distribution under the null.
  • PDF / PMF / CDF: Density (continuous) / mass (discrete) / cumulative $F(x) = P(X \le x)$. For continuous RVs $P(X = x) = 0$.
  • Bernoulli(p): Single yes/no trial. $P(X = 1) = p$, $P(X = 0) = 1 - p$. Mean $p$, variance $p(1-p)$.
  • Binomial(n, p): Sum of $n$ i.i.d. Bernoulli(p) trials. PMF $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. Mean $np$, variance $np(1-p)$.
  • Normal $\mathcal{N}(\mu, \sigma^2)$: Bell-shaped, symmetric, two parameters. 68/95/99.7 rule. Standard Normal is $\mathcal{N}(0, 1)$.
  • t-distribution: Like Normal with heavier tails; one parameter (df). Use when $\sigma$ is unknown and samples are small. → Normal as df → ∞.
  • Chi-square $\chi^2_k$: Sum of $k$ squared standard Normals. Right-skewed, $\ge 0$. Mean $k$, variance $2k$. Used in $\chi^2$ tests.
  • F-distribution: Ratio of two scaled chi-squares. Right-skewed, $\ge 0$. Two df parameters. Used in ANOVA / regression.
  • Central Limit Theorem (CLT): Sampling distribution of $\bar{X}$ → Normal as $n$ grows, regardless of population shape (finite variance required).
  • Law of Large Numbers: Sample mean → population mean as $n \to \infty$. About convergence of the point estimate.
  • Standard Error of the Mean (SEM): $\sigma/\sqrt{n}$ — SD of the sampling distribution. Measures precision of $\bar{X}$ as an estimate of $\mu$.
  • Empirical rule (68/95/99.7): For Normal data, ~68% within $\mu \pm 1\sigma$, ~95% within $\mu \pm 2\sigma$, ~99.7% within $\mu \pm 3\sigma$.
  • Sampling with vs without replacement: With replacement is pure i.i.d. Without is dependent in principle but negligibly so when population ≫ sample.
  • R four-letter pattern: d density / PMF, p cumulative CDF, q quantile (inverse CDF), r random sample. Works for every distribution: norm, binom, t, chisq, f, …

Derivations

Binomial mean and variance. Let $X = \sum_{i=1}^{n} B_i$ with $B_i \sim \text{Bernoulli}(p)$ independent. Then $E[B_i] = p$, $\mathrm{Var}(B_i) = p(1-p)$. By linearity of expectation: $E[X] = \sum E[B_i] = np$. By independence (sum of variances): $\mathrm{Var}(X) = \sum \mathrm{Var}(B_i) = np(1-p)$.

**Standardisation: $Z = (X - \mu)/\sigma$.** If $X \sim \mathcal{N}(\mu, \sigma^2)$, then $E[Z] = (E[X] - \mu)/\sigma = 0$ and $\mathrm{Var}(Z) = \mathrm{Var}(X)/\sigma^2 = 1$. Z is Normal because linear transformations of Normals are Normal. This is why tables of standard-Normal quantiles work universally.

Sketch of CLT. Let $X_1, \ldots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$. Define $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $E[\bar{X}_n] = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$. Standardise: $Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$. The characteristic function of $Z_n$ converges to $e^{-t^2/2}$ (the characteristic function of $\mathcal{N}(0,1)$) as $n \to \infty$. By Lévy's continuity theorem, $Z_n \xrightarrow{d} \mathcal{N}(0,1)$, equivalently $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2/n)$ for large $n$.

Why SEM = σ/√n. $\mathrm{Var}(\bar{X}) = \mathrm{Var}\!\big(\tfrac{1}{n}\sum X_i\big) = \tfrac{1}{n^2}\sum \mathrm{Var}(X_i) = \tfrac{n\sigma^2}{n^2} = \tfrac{\sigma^2}{n}$ (using independence). Hence $\mathrm{SD}(\bar{X}) = \sigma/\sqrt{n}$. To halve SEM, increase $n$ by 4× — root-n is famous.

Why χ²_k has mean k. $\chi^2_k = \sum_{i=1}^{k} Z_i^2$ with $Z_i \sim \mathcal{N}(0,1)$. $E[Z_i^2] = \mathrm{Var}(Z_i) + (E[Z_i])^2 = 1 + 0 = 1$. By linearity $E[\chi^2_k] = k$. The variance is $2k$ (uses the fourth moment of Z, $E[Z^4] = 3$, so $\mathrm{Var}(Z_i^2) = E[Z_i^4] - (E[Z_i^2])^2 = 3 - 1 = 2$).

Binomial → Normal approximation. When $n$ is large and $np$, $n(1-p)$ are both large (a common rule: both $\ge 10$), Binomial(n, p) ≈ $\mathcal{N}(np,\, np(1-p))$. Continuity correction: for $P(X \le k)$, use $P\!\left(Z \le \frac{k + 0.5 - np}{\sqrt{np(1-p)}}\right)$. Applies because the Binomial is a sum of i.i.d. Bernoullis — CLT in action.
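
A sketch comparing the exact Binomial CDF with the Normal approximation, with and without the correction (n = 50, p = 0.4, k = 17 assumed for illustration):

```r
n <- 50; p <- 0.4; k <- 17
mu <- n * p; s <- sqrt(n * p * (1 - p))
c(exact     = pbinom(k, n, p),
  normal    = pnorm((k - mu) / s),        # no correction
  corrected = pnorm((k + 0.5 - mu) / s))  # continuity correction
# The corrected value lands much closer to the exact CDF
```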

Examples

  • Binomial PMF. P(exactly 6 heads in 10 flips, $p = 0.7$) = $\binom{10}{6}\, 0.7^6\, 0.3^4 \approx 0.200$. In R: dbinom(6, 10, 0.7).
  • Binomial CDF. P(at most 4 heads in 10 fair flips) = pbinom(4, 10, 0.5) ≈ 0.377.
  • Inverse CDF. What is the 4th percentile of Binomial(10, 0.7)? qbinom(0.04, 10, 0.7) = 4.
  • Standardisation example. SAT score X = 1400, with (say) $\mu = 1050$, $\sigma = 200$. $Z = (1400 - 1050)/200 = 1.75$. From the Normal table, P(Z ≤ 1.75) ≈ 0.96 — 96th percentile.
  • CLT in practice. Population is right-skewed (reaction times: most fast, some slow). Take samples of n = 5; sample means still slightly skewed. n = 20; bell-shaped. n = 100; indistinguishable from Normal. The skew of the population determines how fast convergence happens.
  • SEM trade-off. $\sigma = 20$, $n = 100$: SEM $= 20/\sqrt{100} = 2$. To halve SEM to 1, need $n = 400$ (4×).
  • χ² sampling. Generate 1000 random values by rchisq(1000, df=3). Mean ≈ 3, variance ≈ 6 (= $2k$ with $k = 3$). Distribution is right-skewed with a long tail.
  • Empirical rule. Heights of adult men, $\mu = 175$ cm, $\sigma = 7$ cm. Range $\mu \pm 2\sigma$ = [161, 189] cm captures ~95%. A man at 200 cm is ~3.6σ above the mean — extreme outlier.
  • i.i.d. violation. Same participant tested in two conditions → measurements within participant are correlated (not independent) → use paired t-test, not independent t-test.

Diagrams

  • The four key sampling distributions side-by-side. Normal (bell, symmetric); t with df = 5 (bell, heavier tails); χ²₃ (right-skewed, ≥ 0); F(3, 20) (right-skewed, ≥ 0). Annotate the parameter(s) of each.
  • CLT animation. Strongly right-skewed population. As n grows from 1 → 5 → 20 → 100, the sampling distribution of transforms from the original shape to a Normal. The width also shrinks (SEM ∝ 1/√n).
  • PDF vs CDF. A bell-shaped PDF and its S-shaped CDF integral. Mark P(a ≤ X ≤ b) as an area under the PDF and as a difference of CDF values.
  • Empirical rule on a Normal. ±1σ (68%), ±2σ (95%), ±3σ (99.7%) shaded under the bell.
  • Sampling distribution illustration. Population on the left (whatever shape); many samples drawn; means computed; histogram of means forms an approximately Normal sampling distribution.
  • **Standardisation $Z = (X - \mu)/\sigma$.** Slide the Normal so $\mu = 0$; rescale so $\sigma = 1$. Demonstrate that quantiles are preserved.

Edge cases

  • **$P(X = x) = 0$ for continuous RVs** — only intervals have positive probability. A pitfall when reasoning about ties: discrete Likert responses can tie exactly; truly continuous responses tie with probability zero.
  • CLT requires finite variance. The Cauchy distribution has no finite variance (even its mean is undefined); sample means do NOT concentrate on any value — drawing more samples doesn't help (see the simulation after this list).
  • Binomial → Normal approximation valid when $np$ AND $n(1-p)$ are both large (a common rule: $\ge 10$). Skewed cases (e.g., $p$ close to 0 or 1) need very large n.
  • t at small df is meaningfully different from Normal. $t_5$ has noticeably heavier tails; using Normal critical values at small n is anti-conservative (you reject too often).
  • Heavy skew + small n → CLT hasn't converged. Use non-parametric methods (Mann-Whitney, Wilcoxon, Kruskal-Wallis) or transform the data.
  • Without-replacement at small populations. If sample size approaches population size, the i.i.d. approximation breaks; use finite-population corrections.
  • Uncorrelated ≠ independent in general (independence is the stronger condition), but for jointly Normal variables, uncorrelated implies independent.
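
As promised in the Cauchy bullet above, a simulation sketch: the mean of Cauchy draws never settles, while the Normal mean concentrates:

```r
set.seed(7)
ns <- c(10, 100, 1e3, 1e4, 1e5)
sapply(ns, function(n) mean(rnorm(n)))    # shrinks toward 0 as n grows
sapply(ns, function(n) mean(rcauchy(n)))  # erratic at every n: no finite variance
```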

Common mistakes

  • Confusing PMF (discrete, 0 ≤ P(X=k) ≤ 1) with PDF (continuous, can have $f(x) > 1$ — density, not probability).
  • Saying CLT means *raw data* are Normal at large n. CLT is about the *sampling distribution of the mean*, not raw data.
  • Using Normal critical values at small n with unknown σ — should use t.
  • Stating Binomial variance as $np$ or $p(1-p)$ — the variance is $np(1-p)$.
  • Computing P(X = 4.5) for a continuous Normal — always zero; ask for P(X ≤ 4.5) or P(4 ≤ X ≤ 5).
  • Forgetting the $\sqrt{n}$ in SEM: writing $\sigma/n$ instead of $\sigma/\sqrt{n}$.
  • Treating dependent observations as i.i.d. — same participant measured twice violates i.i.d.; use paired tests.
  • Treating $\bar{X}$ as $\mathcal{N}(\mu, \sigma^2/n)$ even when the *population* is Normal — true only if the data are i.i.d.
  • Confusing the sampling distribution (across hypothetical samples) with the empirical distribution of one sample.

Shortcuts

  • i.i.d. assumption underlies almost every test.
  • Bernoulli sum → Binomial(n, p). Mean $np$, variance $np(1-p)$.
  • CLT: sample-mean distribution → $\mathcal{N}(\mu, \sigma^2/n)$.
  • SEM = σ/√n. To halve, need 4× data.
  • R prefixes: d density / PMF, p CDF, q quantile (inverse CDF), r random.
  • Empirical rule: ±1σ → 68%, ±2σ → 95%, ±3σ → 99.7%.
  • Critical Normal quantiles: $\pm 1.96$ (95%), $\pm 2.58$ (99%).
  • **$\chi^2_k$: mean = k; variance = 2k.**
  • t → Normal as df → ∞. At df ≥ ~30 they're nearly indistinguishable.

Proofs / Algorithms

**$E[\bar{X}] = \mu$ (unbiasedness of sample mean).** $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. By linearity: $E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{n\mu}{n} = \mu$. The sample mean is an unbiased estimator of the population mean, *regardless* of the population distribution (only a finite mean is required).

**$\mathrm{Var}(\bar{X}) = \sigma^2/n$ (variance of sample mean).** $\mathrm{Var}(\bar{X}) = \mathrm{Var}\!\big(\tfrac{1}{n}\sum X_i\big) = \tfrac{1}{n^2}\sum \mathrm{Var}(X_i) = \tfrac{n\sigma^2}{n^2} = \tfrac{\sigma^2}{n}$ — using $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$ and independence (sum of variances). Hence SEM $= \sigma/\sqrt{n}$.

CLT via characteristic functions (sketch). Let $\varphi(t)$ be the characteristic function of $X_i - \mu$, which has mean 0 and variance $\sigma^2$. By Taylor expansion: $\varphi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2)$. For $Z_n = \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^{n}(X_i - \mu)$: each term contributes $\varphi\!\big(\frac{t}{\sigma\sqrt{n}}\big) = 1 - \frac{t^2}{2n} + o(1/n)$. The CF of the sum is the product over the $n$ independent terms: $\big(1 - \frac{t^2}{2n} + o(1/n)\big)^n \to e^{-t^2/2}$ — the characteristic function of $\mathcal{N}(0,1)$. By Lévy's continuity theorem, convergence in distribution follows.