The Case for Statistics — Biases, Base Rates, Bayes
Intuition
Statistics exists because humans are wired to deceive themselves about probability. We accept invalid arguments when the conclusion sounds right (belief bias), look only for confirming evidence (confirmation bias), aggregate over confounded subgroups and reverse the truth (Simpson's paradox), and ignore base rates when a 'positive test' arrives (base-rate fallacy). Statistics is the formal corrective: a discipline for figuring out whether what looks true is actually true. Bayes' rule is the central operator — the formula that rescues us from the base-rate fallacy and many others.
Explanation
Why behavioural science is hard. Meet Maya — behavioural science researcher in Hyderabad, notebook, R, and a problem: humans. Humans are *complex* (any behaviour depends on sleep, weather, mood, a fight with their mother, and what they ate), *variable* (same person reacts differently on Tuesday vs Friday), and *reactive* (knowing they're observed changes their behaviour). These three properties make behavioural sciences fundamentally noisier than physics. Maya's first question is innocent: her mother says 'drink milk with turmeric — three days, sore throat gone.' Is that actually true, or is she fooling herself the way humans often do?
The course's central thesis. *Statistics is the discipline of figuring out whether what looks true is actually true, given that our brains are wired to deceive us.* Memorise this — it's the answer to 'Why do statistics?', almost always the opening question on a BRSM exam.
Belief bias (Evans, Barston & Pollard, 1983 — *if you remember one citation, this is it*). People judge arguments by the believability of the conclusion, not the logical validity of the argument. *Example:* 'All Bengaluru engineers wear blue. This person wears blue. Therefore this person is a Bengaluru engineer.' Invalid (affirming the consequent) — but ~70% of participants accept it because the conclusion sounds plausible. Reverse the conclusion to something implausible ('therefore this is a Martian') and they correctly reject it.
Confirmation bias. When testing a hypothesis, people seek confirming evidence rather than evidence that would falsify it. Classic demonstration: Wason card selection task. Rule: 'If a card has an odd number on one side, it has a vowel on the other.' Cards on the table: A, 2, 7, K. Most people flip the A. That's confirmation bias — flipping A could only *confirm* the rule. To actually *test* the rule, you must flip the cards that could *falsify* it: the 7 (if there's a consonant on the back, the rule is broken) and the K (if there's an odd number on the back, the rule is broken again). Falsification, not confirmation, is what science does. This is the Popperian principle at the heart of NHST.
Simpson's paradox. UC Berkeley 1973: aggregate graduate admissions showed bias against women — apparent discrimination. But department-by-department, women were admitted at equal or higher rates than men in most departments. What was happening? Women were applying disproportionately to highly competitive departments (English, Humanities) with low admit rates for *everyone*; men were applying more to less-competitive ones (Engineering). The aggregate average misled. The lesson: an average over groups can reverse the trend seen within each group. Always ask whether a subgroup analysis would change the story.
The base-rate fallacy. Maya's friend gets a positive mammogram. How likely is she to actually have cancer? Most people — including two-thirds of doctors tested — guess 80–90%. Let's do the exam-style math. *Facts:* 0.8% of women getting mammograms actually have cancer (base rate); the test has 90% power (catches 90% of real cancers); the false-positive rate is 7%. Imagine 1,000 women. About 8 have cancer; the test correctly flags ~7 of them. The other 992 don't have cancer, but 7% of them — about 69 women — get false positives. Total positive tests: 7 + 69 = 76. Of those 76 positives, only 7 actually have cancer, so P(cancer | +) = 7/76 ≈ 9%, not 90%.
Bayes' rule — the formal corrective. P(H | E) = P(E | H) P(H) / P(E), where P(E) = P(E | H) P(H) + P(E | ¬H) P(¬H) (total probability). For the mammogram: P(cancer | +) = 0.9 × 0.008 / (0.9 × 0.008 + 0.07 × 0.992) ≈ 0.094. Intuition: when the base rate is low, even an accurate test mostly catches false positives because the false-positive mass (P(+ | ¬cancer) P(¬cancer)) dominates the true-positive mass (P(+ | cancer) P(cancer)). People forget this. Statistics doesn't.
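Both routes to the mammogram answer — counting in a cohort, then the formula — can be checked in a few lines (a minimal sketch using only the rates stated above):

```python
# Mammogram numbers from the text: base rate 0.8%, sensitivity 90%, FP rate 7%.
prevalence = 0.008
sensitivity = 0.90
fp_rate = 0.07  # = 1 - specificity

# Route 1: counting in a cohort of 1,000 women.
n = 1000
true_pos = n * prevalence * sensitivity        # 7.2 women correctly flagged
false_pos = n * (1 - prevalence) * fp_rate     # 69.4 healthy women flagged anyway
ppv_counting = true_pos / (true_pos + false_pos)

# Route 2: Bayes' rule, P(cancer | +) = P(+ | cancer) P(cancer) / P(+).
evidence = sensitivity * prevalence + fp_rate * (1 - prevalence)
ppv_bayes = sensitivity * prevalence / evidence

print(round(ppv_counting, 3), round(ppv_bayes, 3))  # both ≈ 0.094
```

The two routes agree exactly — the 1,000-women table is just Bayes' rule with the normalising constant made visible.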
The four misinterpretations the intro warns about. (1) p-values — people treat p as P(H₀ | data); it is P(data | H₀). (2) Confidence intervals — 'there's a 95% chance μ is in [4.2, 5.8]' is wrong; the *procedure* has 95% long-run coverage, the specific interval either contains μ or not. (3) Statistical power — people confuse a test's sensitivity (P(+ | sick) = 0.9 in the mammogram) with the probability that the *patient* is sick. Power is a property of the test, not the population. (4) Correlation as causation — see Unit 6.
Independent vs Dependent Variables (modern terminology). *Old:* IV is what you manipulate; DV is what you measure. *Modern (regression):* predictors (what you use to guess) and outcomes (what you guess). In Maya's turmeric study: IV/predictor = whether participants got turmeric milk vs plain milk; DV/outcome = how sore-throat severity changes. Experimental research = experimenter controls the predictors → supports causal claims. Observational research = experimenter just measures → supports only association.
Experimental design types. *Between-subjects* (independent samples): different people in different conditions. One group gets turmeric, another gets plain milk. *Within-subjects* (repeated measures / matched pairs): same people in all conditions. Each person tries both, in different weeks. More power because each person is their own control. Downsides: fatigue, longer experiment, carry-over effects (effects of the first condition contaminating the second). *Mixed design*: some factors between, others within. This distinction drives the choice between unrelated vs paired t-tests, one-way vs RM-ANOVA, etc.
Confounds. A variable that is related to both your predictor and your outcome in some systematic way — creates the illusion of a relationship that isn't really there. *Example:* 'Violent video games cause aggression' — comparing gamers to non-gamers using criminal records. Children with many gaming hours may also have less parental supervision, particular socioeconomic backgrounds. Parental support is a confound. Gold-standard fix: random assignment to conditions — over many participants, the confound averages out across groups. Realistic fix: include the confound as a covariate in your model (ANCOVA, multiple regression).
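The confound structure Z → X and Z → Y can be simulated directly (a sketch with made-up numbers; variable names are illustrative): X has no causal effect on Y, yet the raw X–Y correlation is strong, and residualising both on Z — what ANCOVA or multiple regression does with a covariate — makes it vanish.

```python
import random

random.seed(1)
n = 5000

# Z is the confound (e.g. parental supervision); it drives both X and Y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]  # e.g. gaming hours: Z -> X only
y = [zi + random.gauss(0, 1) for zi in z]  # e.g. aggression:   Z -> Y, no X -> Y

def corr(a, b):
    """Pearson correlation, by hand to stay stdlib-only."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = sum((p - ma) ** 2 for p in a) ** 0.5
    sb = sum((q - mb) ** 2 for q in b) ** 0.5
    return cov / (sa * sb)

def residuals(a, ctrl):
    """Remove the least-squares fit of a on ctrl (covariate adjustment)."""
    mc, ma = sum(ctrl) / len(ctrl), sum(a) / len(a)
    slope = (sum((c - mc) * (v - ma) for c, v in zip(ctrl, a))
             / sum((c - mc) ** 2 for c in ctrl))
    return [v - ma - slope * (c - mc) for c, v in zip(ctrl, a)]

raw = corr(x, y)                                   # spurious, driven by Z
adjusted = corr(residuals(x, z), residuals(y, z))  # near zero once Z is held fixed
print(round(raw, 2), round(adjusted, 2))
```

The raw correlation is about 0.5 even though X never touches Y; controlling for Z collapses it to roughly zero.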
Threats to validity — the long list. *History effects* (an event during the study influences results — e.g., surgery on day 5 of hospital stay). *Maturational effects* (natural changes over time — fatigue, waning attention). *Testing/practice effects* (improvement just from doing the test more). *Selection bias* (groups differ systematically before intervention). *Differential attrition* (dropouts non-random). *Non-response bias* (only people who care respond). *Regression to the mean* (extreme scores tend toward the mean on the next measurement). *Experimenter bias* (expectations leak into the data — Clever Hans, Pfungst 1907). *Hawthorne effect* (behaviour changes because of being observed). *Placebo effects*. *Fraud*. *Study misdesigns*. *p-hacking*. *Publication bias*.
p-hacking and post-hoc hypothesising — two of the most exam-important concepts. *Data mining / p-hacking:* run 50 models, report the one that worked. Statistical correction for the other 49 attempts is needed but rarely applied. *Post-hoc hypothesising (HARKing):* your original hypothesis failed; you mine the data, find something else, write it up as if you predicted it. One of the largest causes of false findings in psychology. Antidote: pre-registration.
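The Type I inflation from 'run 50 models, report the winner' can be simulated (a sketch: 50 independent tests on pure-noise data, checking how often at least one comes out 'significant' at nominal α = .05):

```python
import math
import random

random.seed(0)

def fishing_trip(n_analyses=50, n=30):
    """Run n_analyses independent tests on null data; True if any 'works'."""
    for _ in range(n_analyses):
        sample = [random.gauss(0, 1) for _ in range(n)]  # true effect is zero
        mean = sum(sample) / n
        sd = (sum((x - mean) ** 2 for x in sample) / (n - 1)) ** 0.5
        z = mean / (sd / math.sqrt(n))
        if abs(z) > 1.96:   # nominal two-sided alpha = .05
            return True     # the p-hacker reports this analysis
    return False

# Chance that a 50-analysis fishing trip yields a 'finding' on pure noise;
# compare the theoretical 1 - 0.95**50 ≈ 0.92.
fwer = sum(fishing_trip() for _ in range(2000)) / 2000
print(round(fwer, 2))
```

With 50 uncorrected looks, a 'significant' result on pure noise is the expected outcome, not the exception — which is why the correction for the other 49 attempts matters.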
Publication bias. Journals preferentially publish significant, positive findings. Negative results sit in file drawers (the file drawer problem). The literature ends up looking like everything works. This contributes massively to the replication crisis (~36–47% of psychology studies replicated, OSC 2015) and limits what meta-analyses can recover. Reforms: pre-registration, registered reports, registered replications, open data, multi-lab replications.
Double-blind studies defeat both experimenter bias and reactivity. Neither participant nor experimenter knows the condition until the data are analysed. Standard in clinical trials. Combine with placebo control (everyone gets something — either active or inert) to control placebo effects.
Statistics ≠ certainty. Every inferential claim is probabilistic and conditional on assumptions. Different frameworks (frequentist vs Bayesian, parametric vs nonparametric, with vs without a covariate) can produce different conclusions on the *same* data. The course teaches you to state your assumptions, check them, quantify uncertainty (CI, effect size, posterior), and be calibrated — make claims commensurate with the evidence. A confidently wrong statistician is more dangerous than no statistician.
Definitions
- Belief bias — Judging an argument's validity by the believability of its conclusion, not by the logic. Evans, Barston & Pollard (1983).
- Confirmation bias — Seeking confirming evidence for a hypothesis rather than evidence that could falsify it. Demonstrated by the Wason card-selection task.
- Simpson's paradox — A trend appearing in groups reverses when the groups are combined (or vice versa). UC Berkeley 1973 admissions is the classic example.
- Base-rate fallacy — Ignoring the prior probability (base rate) of an event when interpreting a positive test. People confuse sensitivity with PPV.
- Bayes' rule — P(H | E) = P(E | H) P(H) / P(E). Posterior = likelihood × prior / evidence. Formal corrective to base-rate intuition.
- PPV (Positive Predictive Value) — P(disease | positive test). Depends critically on prevalence — at low prevalence even sensitive tests have low PPV.
- Sensitivity / Specificity — P(+ | disease) and P(− | no disease). Properties of the test, distinct from PPV.
- Independent / Dependent variable — IV = what you manipulate (predictor). DV = what you measure (outcome). Modern terminology: predictor / outcome.
- Between-subjects design — Different participants in different conditions. No carryover; needs more participants to achieve power.
- Within-subjects design — Same participants in all conditions. More power but vulnerable to fatigue, practice, carryover effects.
- Mixed design — Some factors between-subjects, others within. Common for pre/post + group designs.
- Confound — A third variable related to both the predictor and outcome, creating spurious association. Threatens internal validity.
- Double-blind — Neither participant nor experimenter knows the condition. Controls both experimenter bias and reactivity.
- p-hacking (data mining) — Trying many analyses and reporting only the favourable one. Inflates Type I error well beyond nominal α.
- HARKing — Hypothesising After Results are Known. Reporting a post-hoc finding as if it were the original hypothesis.
- Publication bias — Journals preferentially publish significant findings. Negative results sit in the file drawer; the published literature overestimates effect sizes.
- Replication crisis — Empirical finding (OSC 2015 and others) that a large fraction of behavioural-science findings fail to replicate. Partly driven by p-hacking and publication bias.
Formulas
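The identities the Derivations and Proofs below rely on:

```latex
% Bayes' rule
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}

% Total probability (the evidence term)
P(E) = P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)

% Positive predictive value at prevalence \pi
\mathrm{PPV}(\pi) = \frac{\mathrm{Sens}\cdot\pi}{\mathrm{Sens}\cdot\pi + (1-\mathrm{Spec})\,(1-\pi)}
```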
Derivations
Bayes from first principles. P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A). Rearrange: P(A | B) = P(B | A) P(A) / P(B). Replace A with the hypothesis H, B with the evidence E: posterior = likelihood × prior / evidence. The evidence P(E) acts as a normalising constant ensuring the posterior sums to 1 over hypotheses.
Why the mammogram answer is ~9% even though sensitivity is 90%. The denominator is dominated by false positives drawn from the huge ¬cancer population: P(+) = P(+ | C) P(C) + P(+ | ¬C) P(¬C) = 0.9 × 0.008 + 0.07 × 0.992 ≈ 0.0766. PPV is small whenever the FP mass ≫ TP mass — typical at low prevalence. PPV is monotone increasing in prevalence: PPV(π) = Sens · π / (Sens · π + (1 − Spec)(1 − π)) for fixed Sens, Spec.
Counting-style derivation. Frame any Bayes problem in absolute counts before computing fractions. *Mammogram in 1000 women:* 8 cancer × 0.9 = 7.2 true positives; 992 no-cancer × 0.07 = 69.4 false positives. P(C | +) = 7.2 / (7.2 + 69.4) ≈ 0.094. The counting view exposes the FP mass intuition more starkly than the formula.
Why Simpson's paradox reverses trends. The aggregate slope is a weighted combination of within-group slopes plus a between-group effect mediated by the lurking variable Z. When Z correlates strongly with both X and Y, the aggregate slope β_agg can flip sign relative to every within-group slope β_k.
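A minimal numeric illustration (made-up numbers): two subgroups whose within-group slope is negative in each, but whose group means are shifted by the lurking variable so the pooled slope comes out positive.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Group A: low X, low Y; within-group trend is negative.
xa, ya = [0, 1, 2], [2, 1, 0]
# Group B: high X, high Y (Z shifts both); also negative within.
xb, yb = [4, 5, 6], [6, 5, 4]

# Within slopes are both -1.0; the pooled slope is ≈ +0.71.
print(slope(xa, ya), slope(xb, yb), slope(xa + xb, ya + yb))
```

The between-group shift (both means move up together) overwhelms the within-group trend, flipping the sign of the aggregate fit.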
Wason cards logical structure. Rule: P → Q (odd → vowel). Falsifier of P → Q is P ∧ ¬Q (odd AND consonant). Cards with visible P (the 7) must be flipped to check for ¬Q on the back. Cards with visible ¬Q (the K, a consonant) must be flipped to check for P on the back. Cards with visible Q (the A) are irrelevant — Q is allowed regardless of what's on the other side. Confirmation bias makes people flip A; logic demands 7 and K.
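The falsification logic can be checked mechanically (a sketch; each card has a number on one side and a letter on the other, and we ask whether any hidden side could reveal an odd-number/consonant pair):

```python
VOWELS = set("AEIOU")

def rule_holds(number_side, letter_side):
    """'If odd then vowel': only odd paired with a consonant breaks it."""
    return not (number_side % 2 == 1 and letter_side not in VOWELS)

def flip_can_falsify(visible):
    """Could flipping this card possibly reveal a rule violation?"""
    if isinstance(visible, int):
        # Hidden side is a letter: a violation needs this number to be odd
        # and the hidden letter to be a consonant.
        return any(not rule_holds(visible, letter) for letter in "AK")
    # Hidden side is a number: a violation needs this letter to be a
    # consonant and the hidden number to be odd.
    return any(not rule_holds(num, visible) for num in (2, 7))

cards = ["A", 2, 7, "K"]
print({c: flip_can_falsify(c) for c in cards})  # only 7 and K can falsify
```

Only the P card (7) and the ¬Q card (K) can ever expose the falsifier P ∧ ¬Q; flipping A or 2 is logically inert.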
Examples
- Belief-bias worked example. Premise 1: All Bengaluru engineers wear blue. Premise 2: This person wears blue. Conclusion: This person is a Bengaluru engineer. Invalid (affirming the consequent) — but plausible-sounding conclusions get accepted by ~70% of participants. Replace the conclusion with 'this person is a unicorn' and acceptance drops to ~10%.
- Wason card task answer. Rule 'odd → vowel'. Cards A, 2, 7, K. Correct flips: the 7 (if back is a consonant, the rule is falsified) AND the K (if back is an odd number, the rule is falsified). Most participants flip A (could only confirm) — the textbook confirmation-bias mistake.
- Simpson's paradox. Within-group: tutoring → ↑ exam scores. Aggregate: tutoring → ↓ exam scores. Reason: students who choose tutoring are those who were struggling — the lurking variable 'prior ability' confounds the aggregate.
- Base-rate fallacy at low prevalence. Sensitivity 99%, specificity 99%, prevalence 0.1%. P(D | +) = 0.99·0.001 / (0.99·0.001 + 0.01·0.999) = 0.00099/0.01098 ≈ 9%. Despite a 99%-accurate test, P(disease | positive) is only 9% because almost no one has the disease.
- Counting version. Of 100,000 women screened: 100 have the disease, 99 correctly flagged; 99,900 don't, 999 false positives. Total positives = 1,098; only 99 are true → P(D | +) = 9%.
- Maya's turmeric question. Mother says turmeric cured her sore throat. Confound: sore throats usually resolve in 3–7 days regardless of treatment. To attribute the cure to turmeric, Maya needs a control group that didn't get turmeric.
- Clever Hans (Pfungst, 1907). Horse appeared to do arithmetic. Pfungst showed Hans was reading subtle unconscious cues from his trainer von Osten. Even though the trainer wasn't trying to cue Hans, his expectation of the right answer changed his body language, and Hans had learned to read it. The foundational example of why double-blind studies matter.
- Hawthorne factory study. Worker productivity rose under almost every lighting condition tested. The 'effect' was the workers' response to being observed, not the lighting itself.
Diagrams
- Bayes counting diagram. 1,000 women → 8 with cancer (× 0.9 → 7 TP, 1 FN) and 992 without cancer (× 0.07 → 69 FP, 923 TN). PPV = TP / (TP + FP) ≈ 9%.
- Wason cards laid out. A, 2, 7, K face-up; mark the 7 and K as 'must flip to falsify'; mark the A as 'confirmation-bias trap'.
- Simpson's paradox scatter. Two clouds of points; within each cloud the slope is negative; combining the clouds gives a positive overall slope. Lurking variable Z explains.
- Validity-threat taxonomy tree. History / maturational / testing / selection / attrition / non-response / regression-to-mean / experimenter / Hawthorne / placebo / fraud / p-hacking / publication bias.
- IV/DV diagram. Manipulated predictor → measured outcome; arrow shows hypothesised causal direction. Experimental designs (between/within/mixed) shown as variations.
- Confound diagram. Z → X and Z → Y, creating a spurious X-Y correlation that survives even when X has no causal effect on Y.
Edge cases
- Rare conditions make any 'positive test' result low-PPV — universal screening can produce more false alarms than detections.
- Aggregation level matters. A medication may help on average and harm specific subgroups (Simpson-style). Always test for interactions.
- Likelihood is not symmetric. P(+ | cancer) ≠ P(cancer | +) — confusing them is the base-rate fallacy. The conditional flips entirely with prevalence.
- Practice / carry-over effects undermine within-subjects designs. Counterbalance condition order.
- Heterogeneous attrition (different dropout rates per group) kills comparisons — homogeneous attrition is annoying but manageable.
- Demand characteristics — participants guessing the hypothesis and behaving accordingly — undermine construct validity. Use cover stories or single/double-blind designs.
- Regression to the mean can create a fake 'effect' of feedback. Children selected for being unusually tall will tend to be shorter on next measurement (closer to average); easily misread as 'punishment works'.
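Regression to the mean in a simulation (a sketch with made-up numbers): scores are true ability plus fresh measurement noise each time, so the top decile on test 1 scores closer to average on test 2 with no intervention at all.

```python
import random

random.seed(7)
n = 10_000

ability = [random.gauss(0, 1) for _ in range(n)]
test1 = [a + random.gauss(0, 1) for a in ability]  # ability + noise
test2 = [a + random.gauss(0, 1) for a in ability]  # same ability, fresh noise

# Select the extreme scorers on test 1 (top 10%).
cutoff = sorted(test1)[int(0.9 * n)]
top = [i for i in range(n) if test1[i] >= cutoff]

mean1 = sum(test1[i] for i in top) / len(top)
mean2 = sum(test2[i] for i in top) / len(top)
# mean2 falls roughly halfway back toward 0 — no treatment involved.
print(round(mean1, 2), round(mean2, 2))
```

Any 'intervention' applied to the extreme group between the two tests would look effective for free — which is exactly the fake 'punishment works' effect described above.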
Common mistakes
- Treating sensitivity as PPV — confusing P(+ | disease) with P(disease | +).
- Saying 'p = .03 means H₀ is 3% likely' — p is P(data | H₀), not P(H₀ | data).
- Trusting aggregate trends without checking subgroups (Simpson's paradox lurks).
- Flipping only confirming cards in Wason-style tests — falsification, not confirmation, is the goal.
- Equating reliability with validity.
- Mistaking IV/DV for X/Y — 'X causes Y' assumes experimental design and random assignment; correlational data can't.
- Switching to a between-subjects analysis when within-subjects data are available — losing power.
- Reporting only a 'positive' subgroup or analysis (p-hacking) without correction.
- Concluding 'no effect' from a non-significant test in an underpowered study.
Shortcuts
- Posterior = Likelihood × Prior / Evidence. Memorise verbatim.
- Low base rate → low PPV regardless of test accuracy — the FP mass dominates.
- Count, then divide. Frame Bayes problems in 1,000-person tables before computing fractions.
- Statistically significant ≠ practically meaningful — always report effect size.
- Wason answer = 7 + K (the falsifiers). The A is a confirmation trap.
- Validity threat memory: History, Maturation, Testing, Selection, Attrition, Non-response, Regression, Experimenter, Hawthorne, Placebo, Fraud, p-Hacking, Publication.
- Double-blind defeats experimenter bias AND reactivity.
- Modern terminology: predictor / outcome (interchangeable with IV / DV).
Proofs / Algorithms
PPV is monotone in prevalence. PPV(π) = Sens · π / (Sens · π + (1 − Spec)(1 − π)), where π is prevalence. Differentiating with respect to π: dPPV/dπ = Sens (1 − Spec) / (Sens · π + (1 − Spec)(1 − π))² > 0. Hence routine screening of *high-risk* populations (high prevalence) yields more meaningful positive results than mass screening of the general population — a core principle of clinical epidemiology.
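The monotonicity is easy to verify numerically (a sketch using the mammogram test's 90% sensitivity and 93% specificity):

```python
def ppv(prevalence, sens=0.90, spec=0.93):
    """P(disease | positive) via Bayes: TP mass over total positive mass."""
    tp = sens * prevalence
    fp = (1 - spec) * (1 - prevalence)
    return tp / (tp + fp)

# PPV climbs steadily with prevalence: screening high-risk groups pays off.
grid = [0.001, 0.008, 0.05, 0.20, 0.50]
ppvs = [round(ppv(p), 3) for p in grid]
print(ppvs)
```

At the text's 0.8% base rate this returns ≈ 0.094, reproducing the mammogram answer; by 20% prevalence the same test's PPV is above 0.75.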
Bayes from joint probability. P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A). Solving for P(A | B): P(A | B) = P(B | A) P(A) / P(B). Combined with total probability P(B) = P(B | A) P(A) + P(B | ¬A) P(¬A), the rule is fully determined by prior, sensitivity, and 1 − specificity.
Evidence requires a likelihood contrast. If hypothesis H makes prediction E, observing E is not evidence for H unless P(E | H) > P(E | ¬H). Confirmation bias = noticing E and concluding H without checking the alternative. Falsification = trying to observe ¬E, which would deterministically rule out H if H → E is a logical implication.
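The 'no evidence unless the likelihoods differ' point in one computation (a sketch): when P(E | H) = P(E | ¬H), Bayes' rule leaves the posterior equal to the prior; evidence only accrues when the likelihood ratio departs from 1.

```python
def posterior(prior, lik_h, lik_not_h):
    """P(H | E) from the prior and the two likelihoods, via Bayes' rule."""
    evidence = lik_h * prior + lik_not_h * (1 - prior)
    return lik_h * prior / evidence

# E is equally likely under H and not-H: observing it teaches nothing.
print(round(posterior(0.3, 0.8, 0.8), 10))  # stays at the prior, 0.3
# E is far more likely under H: now it is genuine evidence for H.
print(round(posterior(0.3, 0.8, 0.1), 3))
```

This is the formal content of the Wason lesson: a confirming observation only counts to the extent the hypothesis made it more likely than the alternatives did.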