
Behavioral Research: Statistical Methods

CG3.402
Vinoo Alluri · Monsoon 2025-26 · 4 credits

Priors, Posteriors, Bayes Factors

Unit 13 — Bayesian Statistics

Maya's Alternative Worldview

The semester has been a stately march through the frequentist canon: p-values, confidence intervals, 'reject H₀', Type I, Type II, the strictures of α = 0.05. Maya has learned to speak it fluently. But she has noticed something. Three things, actually.

One. Every time she explains a p-value to a friend in a non-math department — "it's the probability of the data assuming the null is true" — they nod with the polite, glazed expression of someone deciding the conversation isn't worth understanding. *That's because the definition is genuinely contorted.* It's not the probability of the null. It's not the probability of the alternative. It's not "the probability the finding is a fluke." It's: assuming H₀ is *exactly true*, what's the chance of data this extreme or more? Dense. Conditional. Easy to misread.

Two. A 95% confidence interval. Same friends. Same glazed expressions when she has to explain "95% of intervals constructed by this procedure would contain the true parameter, but we don't actually know if *ours* is one of them." That's correct frequentist interpretation, and almost nobody talks that way in real life.

Three. The replication crisis. Some of it is data fabrication. Most of it is statistical practice. p-hacking. Garden of forking paths. *Optional stopping* — where a researcher peeks at the p-value as data come in, and stops collecting the moment they see p < 0.05. Even with no real effect, this practice is almost guaranteed to find something eventually. And almost everyone has done it at some point in their career.

She's beginning to wonder if there's another way.

There is. It's been there the whole time. It's older than frequentist statistics — the core result appears in Reverend Thomas Bayes' essay, published posthumously in 1763. It just got buried in the 20th century when Fisher and Neyman-Pearson rose to academic dominance. Now it's coming back. The session is called Bayesian statistics. Maya turns the page.

---

The Umbrella

*"Imagine you see Maya walking toward you in Hyderabad in April, carrying an umbrella. Do you think it will rain today?"*

That's the entire setup. Maya — and the reader — has a prior belief about whether it will rain. In Hyderabad in April, the historical chance of rain on a given day is around 15%. So before seeing anything:

P(rain) = 0.15,  P(no rain) = 0.85

Now she factors in what she knows about *me* specifically. I'm forgetful. On actually rainy days, I carry an umbrella only 30% of the time (otherwise I forget it in my office). On dry days, I sometimes carry one *just in case*, about 5% of the time:

P(umbrella | rain) = 0.30,  P(umbrella | no rain) = 0.05

These are the likelihoods — how well each hypothesis (rain / no rain) predicts the observed data (umbrella). Note carefully: they don't have to sum to anything sensible. They're conditional on different worlds.

She wants P(rain | umbrella) — the posterior.

She uses the joint-probability accounting:

| Scenario | Probability |
|---|---|
| Rain AND umbrella | 0.30 × 0.15 = 0.045 |
| Rain AND no umbrella | 0.70 × 0.15 = 0.105 |
| No rain AND umbrella | 0.05 × 0.85 = 0.0425 |
| No rain AND no umbrella | 0.95 × 0.85 = 0.8075 |
| Sum | 1.000 ✓ |

She *did* see the umbrella, so she conditions on that. The umbrella scenarios together have probability 0.045 + 0.0425 = 0.0875. Of those, the share where it actually rains is 0.045:

P(rain | umbrella) = 0.045 / 0.0875 ≈ 0.51

From 15% to 51%. A massive update — but not certainty.
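The same arithmetic takes only a few lines of R. This is a quick sketch, with variable names chosen for illustration rather than taken from the course code:

```r
# Umbrella example: posterior probability of rain, given that the umbrella was seen
prior_rain <- 0.15                   # P(rain) in Hyderabad in April
p_umb_rain <- 0.30                   # P(umbrella | rain)
p_umb_dry  <- 0.05                   # P(umbrella | no rain)

evidence <- p_umb_rain * prior_rain + p_umb_dry * (1 - prior_rain)  # P(umbrella) = 0.0875
p_umb_rain * prior_rain / evidence   # P(rain | umbrella) ≈ 0.514
```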

This is what Bayes' theorem does. It takes a prior belief, multiplies by how well each hypothesis predicts the data, and divides by the total probability of the data:

Posterior = (Likelihood × Prior) / Evidence.

Or, since P(D) is just a normalising constant that doesn't depend on H:

P(H | D) ∝ P(D | H) × P(H)

*"Posterior is proportional to prior times likelihood."*

She writes that on a Post-it and sticks it on her laptop. It's the entire framework.

---

The Engine Generalises

She now wants to test hypotheses Bayesian-style. In frequentist statistics she computed p = P(data this extreme | H₀). In Bayesian statistics, she computes how well each *competing* hypothesis predicts the data — and forms the ratio:

BF₁₀ = P(data | H₁) / P(data | H₀)

The Bayes Factor. The Bayesian counterpart to the p-value, but with two crucial differences.

First, BF can support the null. BF₁₀ < 1 means H₀ predicts the data better. BF₀₁ = 1/BF₁₀ > 1 means "evidence for H₀." Frequentist p-values can only "fail to reject" — they can't actively support the null. If you genuinely want to show absence of effect (placebo as good as drug, two methods equivalent), you *need* the BF.

Second, BF is continuous. The Jeffreys interpretation scale gives bands:

| BF₁₀ | Evidence for H₁ |
|---|---|
| 1 – 3 | Anecdotal |
| 3 – 10 | Moderate |
| 10 – 30 | Strong |
| 30 – 100 | Very strong |
| > 100 | Decisive |

No magical 0.05 cutoff. No coin flip across the threshold. BF is a number; you state how strong the evidence is.

The deeper rule that connects everything:

Posterior odds = Prior odds × Bayes Factor.

This is the universal update rule. Today's posterior becomes tomorrow's prior. Evidence accumulates multiplicatively, with each new data point multiplying the running BF.

---

The Big One — Optional Stopping

Of all the practical advantages, this is the one Maya circles in red.

Frequentist statistics requires you to commit to a sample size in advance. You collect that many subjects, run the test, report the result. If you peek midway and stop early because p just crossed 0.05, you have *destroyed* your inference. Your p-value is wrong. Your Type I error is way higher than nominal — even one unscheduled peek converts a 5% rate to ~8%. With unlimited peeks, it approaches 100%. Every published p-value implicitly promises *no peeking*. Most researchers peek anyway.
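A small simulation sketch makes the inflation visible; the peeking schedule here (a look after every 10 subjects, up to 100) is an assumption chosen purely for illustration:

```r
# False-positive rate of a one-sample t-test with optional stopping,
# when H0 is exactly true (no effect at all)
set.seed(1)
peeky_p <- replicate(5000, {
  x <- rnorm(100)                 # 100 observations, no real effect
  looks <- seq(10, 100, by = 10)  # peek after every 10 subjects
  min(sapply(looks, function(n) t.test(x[1:n])$p.value))
})
mean(peeky_p < 0.05)              # far above the nominal 5%
```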

Bayes Factors don't care. Why? Because BF depends only on the joint likelihood of the observed data under each hypothesis. It doesn't depend on *why you stopped collecting*. The "likelihood principle" says all the information about the parameter is in the likelihood; the stopping rule is a fact about the researcher, not the parameter.

So you can:

  • Plan to collect 30 subjects.
  • After every subject, recompute BF.
  • Stop when BF crosses your evidence threshold (say, BF₁₀ > 10 or BF₀₁ > 10) — *whichever direction* — or when you run out of budget.

This is exactly how scientists *want* to do data collection. Bayes lets them.
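A sketch of that loop with the BayesFactor package; `group_a` and `group_b` are hypothetical vectors holding each group's observations in arrival order, and the threshold of 10 corresponds to the "strong" band above:

```r
library(BayesFactor)

threshold <- 10                   # stop at strong evidence, in either direction
for (n in seq(10, 60, by = 2)) {  # re-analyse after every new pair of subjects
  bf10 <- extractBF(ttestBF(x = group_a[1:n], y = group_b[1:n]))$bf
  if (bf10 > threshold || 1 / bf10 > threshold) break
}
c(n_per_group = n, BF10 = bf10)   # sample size at stopping, and the final BF
```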

*"The Bayesian framework doesn't punish you for paying attention to your data."*

---

Bayesian Tests Maya Will Reach For

Every classical test has a Bayesian counterpart, and R's BayesFactor package implements many of them with default priors.

| Frequentist | Bayesian (R) |
|---|---|
| t.test() | ttestBF() |
| aov() | anovaBF() |
| lm() | regressionBF() |
| chisq.test() | contingencyTableBF() |

Each returns a Bayes Factor against the null. For an ANOVA, anovaBF will produce BFs for every combination of main effects and interactions — you pick the best-supported model.
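A minimal sketch of the ANOVA case, assuming a hypothetical data frame `anxiety_df` with a numeric `anxiety` score and a `treatment` grouping column:

```r
library(BayesFactor)

anxiety_df$treatment <- factor(anxiety_df$treatment)  # anovaBF needs factor predictors
anovaBF(anxiety ~ treatment, data = anxiety_df)       # BF10: treatment model vs intercept-only null
```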

She runs a quick Bayesian re-analysis of her anxiety-treatment data:

```r
library(BayesFactor)
ttestBF(x = anxiety_counsel, y = anxiety_both)
# Bayes factor analysis
# [1] Alt., r=0.707 : 6.42 ± 0.01%
```

BF₁₀ = 6.42. Moderate evidence that counselling and combination treatment differ. Her frequentist Tukey HSD had said p = .008 with the same data. The two agree in spirit; the BF tells her the *strength*.

She likes this. She likes reporting both. p tells her whether convention would let her publish; BF tells her how much she'd actually trust the result.

---

A Wrinkle: Lindley's Paradox

She reads ahead. There's a paradox. At very large n, a tiny effect can produce a tiny p-value (because the standard error shrinks as 1/√n) while the Bayes Factor strongly supports the null. They're not contradicting each other — they're answering different questions.

  • p asks: 'Is the parameter *exactly* equal to the null value?' At huge n, even a trivial difference is detectable, so p says reject.
  • BF asks: 'How well does the simpler H₀ predict the data, weighted against H₁'s flexibility?' At huge n with a tiny effect, H₀'s tight prediction is actually competitive.

Maya writes a note: when n is huge, p-values become hyper-sensitive; BFs are more interpretable. When n is small, both can be uninformative.
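A simulated sketch of the paradox; the true effect of 0.008 standard deviations and the sample size of 100,000 are assumptions picked so the two answers pull apart:

```r
library(BayesFactor)
set.seed(1)

# Standardise, then shift: exactly mean 0.008, sd 1, n = 100,000
x <- as.vector(scale(rnorm(1e5))) + 0.008

t.test(x)$p.value  # roughly .01: the frequentist test rejects H0
ttestBF(x = x)     # BF10 < 1, i.e. BF01 > 1: the default Bayes factor favours H0
```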

---

The Quiet Revelation

By the end of the session, something has shifted for Maya. Frequentist statistics gave her tools — a kit of named tests, each with its assumptions and conventions. Bayesian statistics gives her a *framework*: posterior = (likelihood × prior) / evidence. Every analysis is a belief update.

When her friend tells her about a one-off study — say, a small RCT with 24 patients — she'll no longer just look at p. She'll ask: what was the prior plausibility of the effect? How well does the data update that belief? Is the posterior tight enough to act on?

*"Probability isn't a property of the world. It's a property of my beliefs about the world, given what I've seen so far."*

She closes the laptop. Outside, a light drizzle has started. From her window she sees a colleague walking past with an umbrella.

She does the math automatically. Prior, likelihood, posterior. 51%. She smiles, picks up her own umbrella from the rack, and steps out.

The framework follows you home.