Priors, Posteriors, Bayes Factors
Intuition
Frequentist statistics treats probability as long-run frequency; parameters are fixed unknowns. Bayesian statistics treats probability as degree of belief in propositions — and any belief gets *updated* by evidence. The engine is one equation: posterior = (likelihood × prior) / evidence. Once you internalise that any data analysis is a belief update, every familiar test becomes a special case. The Bayes Factor — ratio of how well two hypotheses predict the data — is the Bayesian counterpart of the p-value, but with two advantages: it can support the null, and it doesn't care if you peeked at your data.
Explanation
Probability as belief, not frequency. Frequentists insist probability is what would happen in infinitely many repetitions. Bayesians say probability is a *quantification of belief* in a proposition, given everything we currently know. The Bayesian view is older (Bayes 1763, Laplace 1812) and arguably more aligned with how humans reason about uncertainty.
Bayes' theorem. P(H|D) = P(D|H) × P(H) / P(D). In words: Posterior = (Likelihood × Prior) / Evidence. Posterior P(H|D) — belief about H after observing D. Likelihood P(D|H) — how well H predicts D. Prior P(H) — belief about H before D. Evidence P(D) — total probability of D across all hypotheses; a normaliser.
Posterior ∝ Likelihood × Prior. P(D) doesn't depend on H, so it doesn't shape the posterior — it only scales it to sum to 1. The *shape* of the posterior is entirely determined by Likelihood × Prior. The standard mnemonic: 'posterior is proportional to prior times likelihood.'
The umbrella example. P(rain) = 0.15 (Hyderabad April prior). Maya is forgetful: P(umbrella | rain) = 0.30; P(umbrella | no rain) = 0.05. You see her with an umbrella. Update: P(rain | umbrella) = (0.30 × 0.15) / (0.30 × 0.15 + 0.05 × 0.85) = 0.045 / 0.0875 ≈ 0.514. The umbrella raised your belief about rain from 15% to ~51% — but didn't pin it to 100%, because the umbrella is imperfect evidence: Maya often forgets it even when it rains.
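A minimal R sketch of this update, with the example's numbers hard-coded (variable names are ours):

```r
# Bayes' rule by hand: P(rain | umbrella) from prior and likelihoods
prior_rain <- 0.15                                  # P(rain)
lik_rain   <- 0.30                                  # P(umbrella | rain)
lik_dry    <- 0.05                                  # P(umbrella | no rain)
evidence   <- lik_rain * prior_rain + lik_dry * (1 - prior_rain)  # P(umbrella)
lik_rain * prior_rain / evidence                    # posterior ≈ 0.514
```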
The three components — make them concrete. Prior is your starting belief (past data, expert opinion, or mathematical neutrality). Likelihood is the *model* — given each hypothesis is true, how probable is what we observed? Posterior is the *update*. You can iterate: today's posterior becomes tomorrow's prior.
Bayes Factor — the Bayesian p-value. BF₁₀ = P(D|H₁) / P(D|H₀) — ratio of how well each hypothesis predicts the observed data. BF₁₀ = 10 means data are 10× more likely under H₁ than H₀. BF₁₀ = 0.1 (equivalently BF₀₁ = 10) means data 10× more likely under H₀ — actively supports the null. BF is *continuous* — no magic threshold.
Posterior odds = Prior odds × Bayes Factor. P(H₁|D) / P(H₀|D) = [P(H₁) / P(H₀)] × BF₁₀. Clean update rule: belief odds after data = belief odds before × evidence ratio.
Why report BF instead of posterior odds. Posterior odds depend on the reader's prior, which differs. BF is data-only — researcher-independent. Convention: report BF; readers plug in their own priors. If priors are flat (prior odds = 1), posterior odds = BF directly.
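A toy sketch of the odds-form update in R; the prior and BF values are illustrative, not from any study:

```r
# Posterior odds = prior odds × BF10
bf10       <- 10                  # data 10x more likely under H1
prior_odds <- 0.2 / 0.8           # a reader whose prior P(H1) is 0.2
post_odds  <- prior_odds * bf10   # = 2.5
post_odds / (1 + post_odds)       # as a probability: ≈ 0.714
```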
Jeffreys' interpretation scale. BF₁₀ in [1, 3] → anecdotal; [3, 10] → moderate; [10, 30] → strong; [30, 100] → very strong; > 100 → decisive. Same scale on BF₀₁ for evidence for the null. No arbitrary 0.05.
Frequentist problem 1 — p-value is hard to interpret. A p-value is P(data at least this extreme | H₀). Not P(H₀ | data). Not P(any hypothesis). The correct interpretation is dense, conditional, and routinely misstated by researchers and textbooks. A Bayesian posterior ('97% probability the effect is real') is far more directly interpretable.
Frequentist problem 2 — confidence intervals are misstated. A 95% CI does *not* mean 'there is a 95% probability the parameter is in this interval' — that's a Bayesian credible interval. A frequentist CI is a procedure that traps the parameter 95% of the time across hypothetical replications. The Bayesian alternative — credible interval — says exactly what people think CIs say.
Frequentist problem 3 — optional stopping is fatal. Peek at the p-value after every subject and stop when p < 0.05. Even under the null, you're almost guaranteed to eventually hit p < 0.05. One unscheduled peek at n = 50 (planning n = 80) takes the Type I error rate from 5% to ~8%. Strict frequentist rules forbid peeking — but it's exactly what real researchers want to do. Bayes Factors are robust to optional stopping because they don't depend on a sampling plan. Peek as often as you like; stop when the evidence is enough. This is the single strongest practical case for Bayesian methods.
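A small simulation sketch of the inflation; the sample sizes, seed, and replication count are arbitrary choices of ours:

```r
# Under H0, peek after every subject from n = 10 onward and count how many
# runs ever cross p < .05 at some interim look
set.seed(1)
n_max <- 100
crossed <- replicate(2000, {
  x <- rnorm(n_max)   # H0 is true: population mean is 0
  any(sapply(10:n_max, function(n) t.test(x[1:n])$p.value < 0.05))
})
mean(crossed)         # realised Type I rate, well above the nominal 5%
```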
Frequentist problem 4 — α = 0.05 is arbitrary. No theory says 0.05 is the right cutoff. Clinical trials use 0.01; particle physics uses 5σ. Bayesian inference scales continuously — the BF is the evidence; you choose what evidence threshold matters for *your* decision.
Frequentist problem 5 — can't quantify evidence FOR the null. A non-significant p just means 'we failed to reject', which is silent about whether H₀ is actually true. BF₀₁ > 10 actively says 'data 10× more likely under H₀' — useful when you genuinely want to show absence of effect.
Credible intervals. Bayesian 95% credible interval: the parameter has *probability 0.95 of lying in this range, given the data and prior*. Direct. Useful. Equal-tailed (2.5%/2.5%) or HPD (highest posterior density) versions.
Conjugate priors. Some prior-likelihood pairings produce posteriors in the *same family* — analytic ease. Beta + binomial → Beta: prior Beta(α, β), observe k successes in n trials, posterior Beta(α + k, β + n − k). Normal + Normal (known variance) → Normal. Used heavily before MCMC made any prior tractable.
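A minimal sketch of the Beta–binomial update in R, using the coin data from the Examples section (8 heads in 10 flips, uniform prior):

```r
# Conjugate update: Beta(a, b) prior + k successes in n trials
a <- 1; b <- 1                        # uniform Beta(1,1) prior
k <- 8; n <- 10
post_a <- a + k                       # 9
post_b <- b + n - k                   # 3
post_a / (post_a + post_b)            # posterior mean = 0.75
1 - pbeta(0.5, post_a, post_b)        # P(theta > 0.5 | data) ≈ 0.97
binom.test(k, n, p = 0.5)$p.value     # frequentist comparison: ≈ 0.11
```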
MCMC — when posteriors aren't closed form. Modern Bayesian computation uses Markov Chain Monte Carlo (Gibbs, Metropolis-Hastings, Hamiltonian) to *sample* from the posterior. Software: Stan, JAGS, brms, PyMC. You don't need a closed form; you need a likelihood + prior.
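To make 'sample from the posterior' concrete, here is a toy random-walk Metropolis sampler in base R for the coin-bias posterior. This posterior is actually closed-form (Beta(9, 3)), so the sampler is purely illustrative; the proposal width and chain length are arbitrary:

```r
# Random-walk Metropolis on theta in (0, 1): likelihood x prior, on the log scale
set.seed(7)
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  dbinom(8, 10, theta, log = TRUE) + dbeta(theta, 1, 1, log = TRUE)
}
theta <- 0.5
draws <- numeric(5000)
for (i in seq_along(draws)) {
  prop <- theta + rnorm(1, sd = 0.1)          # propose a nearby value
  if (log(runif(1)) < log_post(prop) - log_post(theta)) theta <- prop
  draws[i] <- theta                           # keep the current state either way
}
mean(draws)    # close to the analytic posterior mean of 0.75
```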
Bayesian versions of standard tests. Every classical test has a Bayesian counterpart. R's BayesFactor package: ttestBF() for one-sample / independent / paired t-tests; anovaBF() for ANOVA; regressionBF() for regression; contingencyTableBF() for χ². Default prior is Cauchy with width 0.707 — a weakly informative prior on effect size.
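A usage sketch with simulated placeholder data (grp1/grp2 and the effect size are invented for illustration):

```r
# Bayesian two-sample t-test with the default JZS Cauchy prior (width 0.707)
library(BayesFactor)
set.seed(42)
grp1 <- rnorm(30, mean = 0.5)      # placeholder group 1
grp2 <- rnorm(30, mean = 0)        # placeholder group 2
bf <- ttestBF(x = grp1, y = grp2)
bf                                 # prints BF10 alongside the prior used
extractBF(bf)$bf                   # BF10 as a plain number
```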
Lindley's paradox. With huge n, a frequentist test can reject H₀ at p < 0.05 while the Bayes Factor strongly supports H₀. They answer different questions: p asks 'is the effect *exactly* zero?'; BF asks 'how well does H₀ predict the data vs H₁?' For trivially small but non-zero effects at huge n, p screams 'reject' but BF says 'the data look just like H₀.'
Practical recommendation. Report both p-values and Bayes Factors when possible. p for the conservative audience; BF for evidence strength and direction (including for the null). The combination tells the fullest story.
Bayesian thinking in life. Maya updates her belief about whether a friend is sad by combining what she knew this morning (prior) with how the friend is replying to messages (likelihood). Doctors update disease probability with each test result. Investors update beliefs with each data release. Bayes is everywhere; the framework just formalises it.
Definitions
- Prior — Pre-data belief P(H) about a hypothesis or parameter. Quantifies what you know before observing D.
- Likelihood — P(D | H) — how well hypothesis H predicts the observed data D. The *model*.
- Posterior — Updated belief P(H | D) after observing data. Proportional to prior × likelihood.
- Evidence (marginal likelihood) — P(D) = Σᵢ P(D|Hᵢ) P(Hᵢ) (or the corresponding integral over a continuous parameter). Normalising constant; doesn't shape the posterior.
- Bayes Factor (BF₁₀) — Ratio P(D|H₁)/P(D|H₀). Continuous evidence; can support either hypothesis or the null.
- Prior odds — P(H₁)/P(H₀). Belief ratio before data.
- Posterior odds — P(H₁|D)/P(H₀|D). Equals prior odds × BF₁₀.
- Credible interval — Bayesian interval; parameter has X% posterior probability of being inside. Direct interpretation, unlike CI.
- Conjugate prior — Prior + likelihood pairing where the posterior is in the same family as the prior. Beta–binomial is the classic example.
- Beta–binomial — Conjugate pair: Beta(α, β) prior + binomial likelihood (k successes in n trials) → Beta(α + k, β + n − k) posterior.
- MCMC — Markov Chain Monte Carlo — algorithm to sample from posteriors when no closed form exists. Gibbs, Metropolis-Hastings, Hamiltonian.
- Optional stopping — Peeking at data and stopping when significant. Fatal for frequentist Type I; legal for Bayesian BF.
- Lindley's paradox — At huge n, p-values can reject H₀ while BF strongly supports it. They answer different questions.
- Jeffreys scale — Convention for BF interpretation: 1–3 anecdotal, 3–10 moderate, 10–30 strong, 30–100 very strong, > 100 decisive.
- BayesFactor (R package) — Implements ttestBF, anovaBF, regressionBF, contingencyTableBF with default JZS Cauchy prior of width 0.707.
- Likelihood principle — All evidence about a parameter from data is contained in the likelihood. Bayes respects it; frequentism (sampling-distribution-based) doesn't.
Formulas
- Bayes' theorem: P(H|D) = P(D|H) × P(H) / P(D).
- Proportional form: P(H|D) ∝ P(D|H) × P(H).
- Bayes Factor: BF₁₀ = P(D|H₁) / P(D|H₀).
- Odds update: P(H₁|D) / P(H₀|D) = [P(H₁) / P(H₀)] × BF₁₀.
- Beta–binomial update: Beta(α, β) prior + k successes in n trials → Beta(α + k, β + n − k) posterior.
Derivations
Bayes' rule from the definition of conditional probability. By definition, P(H|D) = P(H, D) / P(D) and P(D|H) = P(H, D) / P(H). Equate the two: P(H|D) · P(D) = P(H, D) = P(D|H) · P(H). Divide by P(D): P(H|D) = P(D|H) · P(H) / P(D). QED.
Posterior odds = Prior odds × BF. Bayes' rule for H₁: P(H₁|D) = P(D|H₁) P(H₁) / P(D). For H₀: P(H₀|D) = P(D|H₀) P(H₀) / P(D). Divide: P(D) cancels, leaving P(H₁|D) / P(H₀|D) = [P(D|H₁) / P(D|H₀)] × [P(H₁) / P(H₀)] = BF₁₀ × prior odds. QED.
Beta-binomial conjugacy. Prior p(θ) ∝ θ^(α−1) (1−θ)^(β−1) (Beta). Likelihood L(θ) ∝ θ^k (1−θ)^(n−k) (k successes, n−k failures). Posterior ∝ θ^(α+k−1) (1−θ)^(β+n−k−1) — a Beta(α + k, β + n − k). Same family; parameters updated by counts. QED.
Why optional stopping is fine for Bayes. BF₁₀ = P(D|H₁) / P(D|H₀) depends only on the joint likelihood of observed data under each hypothesis — not on the rule used to *decide* to stop collecting. The likelihood principle says all evidence about θ from D is in the likelihood; the stopping rule is information about the *researcher*, not θ.
Why frequentist inference fails under optional stopping. A p-value is computed under an assumed sampling distribution (e.g., 'fixed n, then look once'). If the actual procedure was 'keep looking until p < 0.05', the realised sampling distribution is different (longer tails, more extreme observations possible) and the nominal p underestimates the true Type I rate.
Examples
- Umbrella update (numerical). Prior P(rain) = 0.15. Likelihoods: P(umb | rain) = 0.30, P(umb | no rain) = 0.05. P(umb) = 0.30·0.15 + 0.05·0.85 = 0.0875. Posterior P(rain | umb) = 0.045/0.0875 ≈ 0.514. From 15% to 51% — major shift, but not certainty.
- Cancer screening (the base-rate fallacy). Disease prevalence 1%. Test sensitivity 90%; false positive rate 7%. Positive test → P(disease | +) = (0.90 × 0.01) / (0.90 × 0.01 + 0.07 × 0.99) = 0.009 / 0.0783 ≈ 0.115. Even a 'positive' result leaves an ~88% chance you don't have the disease — because the prior is so low. (See the sketch after this list.)
- Coin flip — beta-binomial. 10 flips, 8 heads. Uniform Beta(1,1) prior → Beta(9, 3) posterior. Posterior mean = 9/12 = 0.75. Posterior P(p > 0.5 | data) ≈ 0.97 — strong intuitive evidence of bias. The frequentist binomial test gives p ≈ 0.11 (not significant). Bayes paints a clearer picture at small n.
- BF interpretation. BF₁₀ = 8 → moderate evidence for H₁ (3 < 8 < 10). BF₀₁ = 25 → strong evidence for H₀ (10 < 25 < 30). BF₁₀ = 1.2 → anecdotal (barely better than 1).
- Lindley's paradox illustration. n = 100,000, sample mean differs from null by an amount that gives t = 3.5, p < .001. Bayes Factor with default prior may show BF₀₁ ≈ 8 — *moderate evidence for the null*. The effect is real but tiny; t scales with √n while BF accounts for parsimony. (See the sketch after this list.)
- Sequential Bayesian update. Estimate p (coin bias). Start Beta(1,1). After 5 heads and 1 tail in 6 flips: Beta(6, 2), mean = 0.75. Flip 4 more, get 3 heads, 1 tail: posterior becomes Beta(9, 3), mean = 0.75 (same mean; tighter distribution). (See the sketch after this list.)
- R with BayesFactor.
ttestBF(x = grp1, y = grp2) returns the BF for a two-sample t-test. regressionBF(y ~ x1 + x2 + x3, data = d) compares every subset of predictors against the intercept-only null. anovaBF(y ~ group, data = d) for ANOVA.
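For the cancer-screening example, the same arithmetic as a short sketch:

```r
# Base-rate fallacy: P(disease | positive test)
prev <- 0.01; sens <- 0.90; fpr <- 0.07
p_pos <- sens * prev + fpr * (1 - prev)   # P(+) = 0.0783
sens * prev / p_pos                       # ≈ 0.115
```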
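For the Lindley illustration, BayesFactor can compute a BF directly from a t statistic via its ttest.tstat utility; the exact BF₀₁ depends on the prior width, so treat the number as indicative:

```r
# JZS Bayes factor from t = 3.5 at n = 100,000 (one-sample)
library(BayesFactor)
res <- ttest.tstat(t = 3.5, n1 = 1e5, rscale = 0.707)
exp(res$bf)        # BF10 (ttest.tstat returns the log BF)
1 / exp(res$bf)    # BF01: favours the null despite p < .001
```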
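For the sequential-update example, the bookkeeping is just two additions per batch:

```r
# Today's posterior is tomorrow's prior: update the Beta counts in two batches
a <- 1; b <- 1            # start: Beta(1,1)
a <- a + 5; b <- b + 1    # batch 1: 5 heads, 1 tail -> Beta(6, 2)
a <- a + 3; b <- b + 1    # batch 2: 3 heads, 1 tail -> Beta(9, 3)
c(a, b, a / (a + b))      # 9, 3, 0.75: identical to one combined update
```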
Diagrams
- Prior + Likelihood → Posterior. Three curves overlaid: prior (wide/flat), likelihood (tighter/peaked around data), posterior (tighter still, balance of both). As n grows, likelihood dominates and posterior centres on the MLE.
- Bayes Factor scale: a horizontal logarithmic axis with markers at 1 (anecdotal), 3 (moderate), 10 (strong), 30 (very strong), 100 (decisive). Same scale flipped for BF₀₁ (evidence for the null).
- Frequentist vs Bayesian intervals: a CI shown as one of many vertical bars overlaid on a true-parameter line; 95% of bars cross the line. A credible interval shown as the central 95% of a posterior density curve.
- Optional-stopping disaster: simulated p-value trajectories that wander past 0.05 by chance — frequentist fail. Same data with BF: monotonic-ish update, no spurious crossings.
- Beta-binomial sequential update: Beta(1,1), Beta(6, 2), Beta(9, 3) curves on the same axis showing tightening around 0.75.
Edge cases
- Strong / uninformative prior choice changes the posterior substantially at small n. Always do prior sensitivity analysis: refit with two or three priors and report the spread (see the sketch after this list).
- Lindley's paradox — at huge n, p and BF can disagree. Both are correct given their question; report both.
- Improper priors (e.g., flat over the real line) sometimes yield improper posteriors. Check that the posterior integrates to 1.
- Multiple testing — Bayesian methods don't have the FWER/FDR vocabulary of frequentism, but multiplicity still matters; use hierarchical priors that shrink small effects toward zero.
- Model misspecification is just as fatal in Bayesian as frequentist analysis — likelihood is part of the model.
- Bayes Factors are sensitive to prior width in a way posteriors usually aren't. Default-prior BFs (like JZS Cauchy in BayesFactor) are convenient but contestable.
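A minimal sensitivity-analysis sketch, assuming one-sample data and three JZS prior widths (the data and the widths are illustrative):

```r
# Refit the same Bayesian t-test under three prior widths and compare BF10
library(BayesFactor)
set.seed(3)
x <- rnorm(25, mean = 0.3)   # placeholder data
sapply(c(0.5, 0.707, 1), function(r) extractBF(ttestBF(x, rscale = r))$bf)
```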
Common mistakes
- Saying BF = posterior probability. No. BF is the *ratio of likelihoods*; the posterior comes from BF × prior odds (and even then it's odds, not probability directly).
- Treating priors as 'subjective and therefore invalid'. Priors make assumptions explicit and challengeable — that's a strength, not a bug. Frequentist methods have *implicit* priors (typically uniform) you can't even argue with.
- Confusing frequentist CI ('procedure traps the parameter 95% of the time') with Bayesian credible interval ('parameter has 95% probability of being in this range').
- Reporting a non-significant frequentist p as evidence for the null. Use BF₀₁ for that — it's what the question wants.
- Reading BF = 0.5 as 'fail to support H₁'. BF = 0.5 → BF₀₁ = 2 → very mild evidence for H₀. Read BFs as ratios, not as binary 'sig/non-sig'.
- Stopping data collection at BF = 3 and then comparing it to a frequentist α = 0.05. They're different scales answering different questions.
- Forgetting that the likelihood is the *model* — Bayesian inference doesn't fix a bad model.
Shortcuts
- Posterior ∝ Prior × Likelihood. P(D) just normalises.
- BF₁₀ = P(D|H₁)/P(D|H₀). Continuous evidence; ratio scale.
- Posterior odds = Prior odds × BF.
- Jeffreys scale: 1–3 anecdotal, 3–10 moderate, 10–30 strong, 30–100 very strong, > 100 decisive.
- BF₀₁ > 10 = strong evidence FOR the null (impossible with p-values).
- Bayes is robust to optional stopping (frequentist isn't).
- Credible interval ≠ confidence interval. Credible interval has the natural interpretation.
- Conjugate priors: Beta + binomial → Beta; Normal + Normal (known σ) → Normal.
- R: BayesFactor package — ttestBF, anovaBF, regressionBF, contingencyTableBF.
- Report both p and BF for the fullest story.
Proofs / Algorithms
Bayes' theorem from the chain rule. By the symmetric definition of joint probability, P(H, D) = P(H|D) · P(D) and P(H, D) = P(D|H) · P(H). Equate: P(H|D) · P(D) = P(D|H) · P(H). Divide both sides by P(D) (assuming P(D) > 0): P(H|D) = P(D|H) · P(H) / P(D). QED.
Beta-binomial conjugacy. Prior p(θ) = θ^(α−1) (1−θ)^(β−1) / B(α, β). Likelihood P(D|θ) = C(n, k) θ^k (1−θ)^(n−k). Posterior ∝ θ^(α+k−1) (1−θ)^(β+n−k−1) — the kernel of a Beta(α + k, β + n − k). Normalising constant supplies 1/B(α + k, β + n − k). QED.