The Case for Statistics — Biases, Base Rates, Bayes
Maya's Story — Why Statistics Exists
Meet Maya. She's a behavioural science researcher in Hyderabad. She has a notebook, a laptop with R installed, and a problem: human beings.
Humans are not steel ball bearings. They are *complex* (a person's mood depends on sleep, weather, a fight with their mother, and what they ate), *variable* (the same person reacts differently on Tuesday than on Friday), and *reactive* (the moment you watch them, they start behaving differently). Three properties that make the behavioural sciences hard.
Maya's first scientific question is innocent. Her mother tells her: *"Drink milk with turmeric, it will cure your sore throat. I've tried it. Three days, gone."* Maya wants to know whether this is actually true — or whether her mother is fooling herself the way humans often do.
This is the hook the whole course hangs on. Statistics is not arithmetic. It's the discipline of figuring out whether what looks true is actually true, given that our brains are wired to deceive us. Remember that line — it's the answer to "Why do statistics?", which is almost always the opening question on a BRSM exam.
The biases statistics protects you from
Before Maya can run any experiment, she has to understand the enemy: her own mind.
Belief bias
Humans evaluate the validity of an argument based on whether the conclusion *feels believable*, not whether the logic is sound. Evans, Barston, and Pollard (1983) showed this beautifully: people accept invalid arguments when the conclusion agrees with what they already believed, and reject valid arguments when the conclusion contradicts their beliefs. If you remember one citation from this course, this is a good one.
Try this argument: *All Bengaluru engineers wear blue. This person wears blue. Therefore this person is a Bengaluru engineer.* Logically invalid (affirming the consequent), but most people accept it because the conclusion sounds plausible.
Confirmation bias
When testing a hypothesis, people look for evidence that *confirms* it instead of evidence that would *falsify* it. The classic demonstration is the Wason card-selection task:
Rule: "If a card has an odd number on one side, it has a vowel on the other."
Cards on the table: A, 2, 7, K, D, L. Which cards must you flip to test the rule?
Most people flip the A. That's confirmation bias — whatever sits behind the A, the rule can't be broken (it only constrains odd-numbered cards), so flipping A could only *confirm*. To actually *test* the rule, you must flip the cards that could *falsify* it: the 7 (if there's a consonant on the back, the rule is broken) and the consonants K, D, and L (if any of them has an odd number on the back, broken again).
Falsification, not confirmation, is what science does. This is the Popperian principle baked into NHST.
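To make the falsification logic concrete, here's a minimal R sketch (the helper name `could_falsify` is invented) that checks, for each visible face, whether flipping that card could possibly break the rule:

```r
# Rule: "if a card has an odd number on one side, it has a vowel on the other."
# A flip can falsify only if the hidden side could violate that rule.
could_falsify <- function(face) {
  if (grepl("^[0-9]+$", face)) {
    as.numeric(face) %% 2 == 1   # odd number showing: a consonant behind breaks the rule
  } else {
    !(face %in% c("A", "E", "I", "O", "U"))   # consonant showing: an odd number behind breaks it
  }
}

cards <- c("A", "2", "7", "K", "D", "L")
cards[sapply(cards, could_falsify)]   # "7" "K" "D" "L" (never the A)
```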
Simpson's paradox
This one is famous and your exam will probably feature it. UC Berkeley, 1973 (Bickel, Hammel, & O'Connell, 1975). Looking at overall admissions, women were admitted at a lower rate than men — apparent discrimination. But when researchers broke the data down department by department, in most individual departments women were admitted at the same or a higher rate than men.
What was happening? Women were applying disproportionately to highly competitive departments (English, Humanities) with low admit rates for *everyone*, while men were applying more to less-competitive ones (Engineering). The aggregate average misled.
The lesson: an average over groups can reverse the trend seen within each group. Always ask whether a subgroup analysis would change the story.
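You can reproduce the reversal in a few lines of R. These admission numbers are invented for illustration (not the real Berkeley data), but they show the same pattern: women do as well or better within each department, yet worse in the aggregate.

```r
# Hypothetical admissions data: one easy department, one hard one.
adm <- data.frame(
  dept     = c("Easy", "Easy", "Hard", "Hard"),
  gender   = c("M", "F", "M", "F"),
  applied  = c(800, 100, 200, 900),
  admitted = c(500,  70,  20, 110)
)

# Within each department, women's admit rate matches or beats men's.
transform(adm, rate = admitted / applied)

# Aggregated over departments, women look worse (0.18 vs 0.52).
totals <- aggregate(cbind(admitted, applied) ~ gender, data = adm, FUN = sum)
transform(totals, rate = admitted / applied)
```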
The base-rate fallacy
Maya's friend gets a positive mammogram. How likely is it she has cancer? Most people — *including two-thirds of doctors who were tested on this* — guess something like 80–90%. Let's do the math the way you'll see on the exam.
Facts: 0.8% of women getting mammograms actually have cancer. The test's sensitivity is 90% (it catches 90% of real cancers). The false-positive rate is 7%.
Imagine 1,000 women:
- About 8 have cancer; the test correctly flags ~7 of them.
- The other 992 do *not* have cancer, but 7% of them — about 70 women — get false positives.
- Total positive tests: 7 + 70 = 77.
- Of those 77 positive results, only 7 actually have cancer.
So $P(\text{cancer} \mid \text{positive}) = 7/77 \approx 9\%$, not 90%.
The same calculation done formally is Bayes' rule:

$$P(\text{cancer} \mid \text{positive}) = \frac{P(\text{positive} \mid \text{cancer})\,P(\text{cancer})}{P(\text{positive})}$$

where $P(\text{positive}) = P(\text{positive} \mid \text{cancer})\,P(\text{cancer}) + P(\text{positive} \mid \text{no cancer})\,P(\text{no cancer}) = 0.9 \times 0.008 + 0.07 \times 0.992 \approx 0.0766$. Result: $P(\text{cancer} \mid \text{positive}) = 0.0072 / 0.0766 \approx 0.094$, about 9%.
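In R, the whole calculation is three lines. A minimal sketch using the numbers above (variable names are mine):

```r
prior <- 0.008   # base rate: P(cancer)
sens  <- 0.90    # sensitivity: P(positive | cancer)
fpr   <- 0.07    # false-positive rate: P(positive | no cancer)

p_positive <- sens * prior + fpr * (1 - prior)   # P(positive), by total probability
sens * prior / p_positive                        # P(cancer | positive), about 0.094
```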
The intuition: when the base rate of a condition is very low, even an accurate test mostly catches false positives. People forget this. Statistics doesn't.
So Maya's first lesson: humans systematically misjudge probability. Statistics is the corrective.
Maya designs her first experiment
To check whether turmeric milk cures sore throats, Maya needs a research design. This is where the course gets technical fast. Watch how each piece earns its name.
Independent and dependent variables
Old terminology: Independent Variable (IV) is what you *manipulate*; Dependent Variable (DV) is what you *measure*. Modern terminology used widely in regression: predictors (what you use to make guesses) and outcomes (what you're trying to predict).
In Maya's study: IV = whether she gave participants turmeric milk vs plain milk. DV = how their sore-throat severity changes.
Experimental research controls the predictors (random assignment, controlled conditions). It supports stronger causal claims. Observational research just measures both predictor and outcome and looks at the relationship — supports only association, not causation.
Experimental design types
- Between-subjects (independent samples) — different people in different conditions. Maya gives one group turmeric milk, another group plain milk. Each person is in exactly one condition.
- Within-subjects (repeated measures / matched pairs) — same people in all conditions. Each person tries both, in different weeks. More statistical power because each person is their own control. Downsides: fatigue, longer experiment, *carry-over effects* (the first condition contaminating the second).
- Mixed design — some factors between, others within. Common for pre/post intervention × group designs.
You'll need this distinction when choosing between independent-samples and paired t-tests, and between one-way and repeated-measures ANOVA.
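In R, the design choice shows up as a single argument to `t.test()`. A quick sketch with simulated relief scores (all numbers and names are invented):

```r
set.seed(1)

# Between-subjects: two separate groups of 20 people, one score each.
turmeric <- rnorm(20, mean = 2.0)   # relief scores, turmeric-milk group
plain    <- rnorm(20, mean = 1.5)   # relief scores, plain-milk group
t.test(turmeric, plain)             # independent-samples (Welch) t-test

# Within-subjects: the same 20 people measured in both conditions.
week1 <- rnorm(20, mean = 2.0)
week2 <- week1 + rnorm(20, mean = -0.5, sd = 0.5)
t.test(week1, week2, paired = TRUE)   # paired t-test: each person is their own control
```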
Confounds and the validity threats
A confound is a variable related to both the predictor and the outcome in some systematic way, creating the illusion of a relationship that isn't really there (or hiding one that is).
*Classic exam example:* "Do violent video games cause aggression?" Compare gamers to non-gamers using criminal records. Problem: children who spend hours playing violent games are also more likely to have absent parents, less supervision, particular socioeconomic backgrounds. Those might cause aggression, not the games. Parental support is a confound.
Gold-standard fix: the ideal experiment — random sample, randomly assign people to violent-games vs peaceful-games groups, monitor for years. Randomisation breaks the link between predictor and confound — over many participants, confounds average out across groups.
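Random assignment itself is one line of R: shuffle the condition labels across participants so that no confound can systematically track the predictor.

```r
set.seed(42)
participants <- paste0("P", 1:20)
condition <- sample(rep(c("violent", "peaceful"), each = 10))   # shuffled labels
data.frame(participants, condition)
```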
Realistic fix: include confounds as covariates in your statistical model (ANCOVA, multiple regression). You'll meet covariates throughout the course.
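Here's a minimal simulation of the gaming example (all names and effect sizes invented) showing how adding the confound as a covariate deflates a spurious effect:

```r
set.seed(2)
n <- 200
parental_support <- rnorm(n)
gaming_hours <- 2 - 0.8 * parental_support + rnorm(n)   # confound drives gaming hours...
aggression   <- 1 - 1.0 * parental_support + rnorm(n)   # ...and aggression, not via gaming

coef(lm(aggression ~ gaming_hours))                      # naive model: positive "effect" of gaming
coef(lm(aggression ~ gaming_hours + parental_support))   # covariate added: effect shrinks toward 0
```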
Threats to validity (the long list)
This is the part that loves to appear as MCQs:
- History effects — something happens during the study that influences results. (You measure on days 3 and 7 of a hospital stay; surgery on day 5 changes the day-7 measurement.)
- Maturational effects — natural changes over time, independent of the experiment. (Fatigue, waning attention in long experiments.)
- (Repeated) testing effects — getting better just from doing the test more often.
- Selection bias — the groups you compare differ systematically *before* intervention.
- Differential attrition — participants dropping out, and dropouts are not random.
- Non-response bias — random survey to 1,000 emails; 200 respond; respondents differ from non-respondents.
- Regression to the mean — extreme scores tend to be followed by scores closer to the mean. (On average, children of very tall parents are tall, but less tall than their parents.) The classic mistake: Kahneman & Tversky's (1973) flight instructors thought punishment worked because performance regressed after extreme highs and lows. (A small simulation after this list makes this concrete.)
- Experimenter bias — expectations leak into the data. *Clever Hans*, the horse that appeared to do arithmetic — Pfungst (1907) showed Hans was reading subtle cues from his trainer von Osten. A foundational story.
- Demand and reactivity effects (Hawthorne effect) — participants behave differently because they know they're being observed.
- Placebo effects — the expectation of a positive effect produces a real effect.
- Fraud — data fabrication. retractionwatch.com tracks it.
- Data mining / p-hacking / post-hoc hypothesising — two of the most exam-important concepts. *p-hacking:* run 50 models, report the favourable one. *HARKing:* mine the data, write up the post-hoc finding as if it were predicted. Major causes of false findings in psychology.
- Publication bias — journals preferentially publish significant findings. Negative results sit in file drawers. The literature ends up looking like everything works.
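The regression-to-the-mean item above deserves a demonstration. A minimal simulation: two noisy measurements of the same people, then look at how the top scorers on test 1 fare on test 2.

```r
set.seed(3)
true_skill <- rnorm(1000)
test1 <- true_skill + rnorm(1000)   # measurement 1 = skill + noise
test2 <- true_skill + rnorm(1000)   # measurement 2, with independent noise

top <- test1 > quantile(test1, 0.9)   # the "extreme" performers on test 1
mean(test1[top])                      # very high, partly because of lucky noise
mean(test2[top])                      # lower: the luck doesn't repeat
```

No punishment, no intervention: the drop is pure noise, which is exactly what fooled the flight instructors.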
Solution to experimenter bias and reactivity
Double-blind studies. Neither the participant nor the experimenter knows which condition the participant is in until the data are analysed. Standard in clinical trials. Combine with placebo control to handle placebo effects.
A cheat-sheet for the exam
If your exam asks "List five threats to validity" or "Explain Simpson's paradox with an example" or "Walk through Bayes' rule on the mammogram case", you can now answer those in your sleep. The key memory pegs:
Statistics exists because humans are biased (belief, confirmation, Simpson's, base-rate). Stats corrects for this.
Bayes' rule rescues us from the base-rate fallacy: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$.
Long list of validity threats: history, maturational, testing, selection, attrition, non-response, regression-to-mean, experimenter bias, Hawthorne, placebo, fraud, p-hacking, publication bias.
IV = predictor (manipulated). DV = outcome (measured).
Between-subjects vs within-subjects vs mixed designs.
Double-blind defeats experimenter bias + reactivity.
You've got this. One session at a time.