Saral Shiksha Yojna

Behavioral Research: Statistical Methods

CG3.402
Vinoo Alluri · Monsoon 2025-26 · 4 credits

Scales, Reliability, Validity

Unit 2 — Research Design & Measurement

Maya Designs Her First Experiment

When Maya wants to test whether turmeric milk cures sore throats, she has to define every term of the question rigorously before she can run a single experiment. Statistics teaches you that even the most innocent-sounding question hides a half-dozen invisible decisions about how to measure things, what counts as evidence, and what could go wrong. This unit is the toolkit for those decisions.

Measurement and operational definitions

You cannot study "depression" or "aggression" or "intelligence" until you define them in a way you can actually measure. That definition is called an operational definition — "a working definition of what a researcher is measuring."

If Maya says *"I'll measure depression as the number of times a person hangs out with family and friends in a month,"* that's an operational definition. It might be a bad one (depressed people might still socialise; July behaves differently from November), but it is at least concrete enough to test.

Defining "on target" in a golf study as *"within ±10% of the goal distance"* — same idea. Without operational definitions, science is just opinions in lab coats.

Variable types — the four scales of measurement

This is exam gold. Memorise these in order with examples — questions often ask *"what kind of scale is X?"*

NOIR — increasing in information:
Nominal → Ordinal → Interval → Ratio

Nominal

Categorical, no order. Eye colour, sex, mode of transport. Cannot say one is greater. Cannot average. *"Average eye colour"* is nonsense. Allowable: mode, counts, χ².

Ordinal

Ordered categories, but the spacing between them isn't meaningful. Race position (1st, 2nd, 3rd — but the gap between 1st and 2nd may be huge while 2nd and 3rd are a photo finish). Ranks. *Strongly disagree → strongly agree* on a Likert scale, strictly speaking. Allowable: median, percentiles, Spearman/Kendall.

Interval

Numerical, equal spacing, no true zero. Temperature in Celsius (0°C is just freezing, not "no temperature"). Year of birth. Differences make sense (30°C − 20°C = 10°C, same as 20°C − 10°C), but ratios don't — you cannot say 20°C is "twice as hot" as 10°C.

Ratio

Numerical, equal spacing, true zero. Reaction time, weight, height, count. Ratios make sense — "I'm twice as fast as you" is meaningful when 0 means "no time".

Continuous vs discrete is a separate axis

Independent of scale type. Reaction time: *ratio, continuous*. Year of birth: *interval, discrete*. Temperature: *interval, continuous*. Mode of transport: *nominal, discrete*. Race position: *ordinal, discrete*.
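The scale type dictates which summary statistics are even meaningful. A quick sketch in Python, using toy data invented for illustration:

```python
from statistics import mode, median, mean

# Hypothetical data, one small sample per scale of measurement
eye_colour = ["brown", "brown", "blue", "green", "brown"]   # nominal
race_position = [1, 2, 2, 3, 5]                             # ordinal
temperature_c = [10.0, 20.0, 20.0, 30.0, 25.0]              # interval
reaction_ms = [240.0, 310.0, 310.0, 290.0, 450.0]           # ratio

# Nominal: only the mode (most frequent category) is meaningful.
print(mode(eye_colour))        # "brown"; "average eye colour" would be nonsense

# Ordinal: the median respects order without assuming equal spacing.
print(median(race_position))   # 2

# Interval: differences are meaningful, ratios are not.
print(temperature_c[3] - temperature_c[0])   # 20.0 degrees, interpretable
# but temperature_c[1] / temperature_c[0] == 2.0 does NOT mean "twice as hot"

# Ratio: a true zero makes ratios interpretable.
print(reaction_ms[4] / reaction_ms[0])   # 1.875, genuinely "1.875x slower"
print(mean(reaction_ms))                 # 320.0, means are fine on ratio data
```

Each statistic "allowable" at a given scale is also allowable at every scale to its right in NOIR, which is what "increasing in information" means in practice.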

The Likert wrinkle

Strictly ordinal, but most researchers treat it as interval because participants seem to use the scale roughly evenly. The exam may ask whether this is technically correct — *answer: technically no, in practice yes, depends on the task.* If you treat Likert as interval, you can use t-tests / ANOVA / Pearson; if strictly ordinal, you must use Mann-Whitney / Spearman.
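A rough sketch of the two treatments, with made-up Likert responses. Comparing means treats the scale as interval (what a t-test does); the hand-rolled Mann-Whitney U below uses only rank order:

```python
from statistics import mean

# Hypothetical 5-point Likert responses from two groups (invented numbers)
group_a = [4, 5, 4, 3, 5, 4]
group_b = [2, 3, 2, 4, 3, 2]

# Interval treatment: compare means, as a t-test would.
mean_diff = mean(group_a) - mean(group_b)

# Ordinal treatment: the Mann-Whitney U statistic counts, over all
# cross-group pairs, how often an A response outranks a B response
# (ties count half). Only order matters, not spacing.
def mann_whitney_u(xs, ys):
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

u = mann_whitney_u(group_a, group_b)
n_pairs = len(group_a) * len(group_b)
print(mean_diff)     # 1.5 on the raw scale (interval reading)
print(u / n_pairs)   # ~0.90: an A response outranks a B response ~90% of the time
```

Both summaries point the same way here, which is the usual situation; they diverge when responses pile up at the scale endpoints and the equal-spacing assumption bites.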

Reliability — does the measurement repeat itself?

A measurement that gives a different answer every time you take it is worthless. Reliability is the consistency of a measurement. Four flavours, all of which you should be able to name:

  • Test-retest reliability — consistency over time. Give the same person the IQ test twice, three months apart. Do they get similar scores? They should, if the test is reliable.
  • Inter-rater reliability — consistency across people doing the measuring. Two trained psychologists each diagnose the same 50 patients. Do they agree? Measured by Cohen's Kappa for two raters with nominal data, Fleiss' Kappa for more than two raters, Kendall's coefficient of concordance for ordinal data, Krippendorff's Alpha as a general-purpose option.
  • Parallel forms reliability — consistency across theoretically-equivalent versions of the same measurement. Two different weighing scales should give the same weight.
  • Internal consistency reliability — consistency across the items within a single instrument. If an IQ test has 10 questions all supposedly measuring fluid intelligence, the scores on those questions should correlate. Measured by Cronbach's α, split-half reliability, KR-20/21.
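Both headline coefficients can be computed by hand. A sketch with invented ratings and item scores (pure Python, population variances throughout):

```python
from statistics import pvariance

def cohens_kappa(r1, r2):
    """Inter-rater agreement for two raters on nominal labels."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n
    cats = set(r1) | set(r2)
    # Chance agreement: product of each rater's marginal proportions.
    p_exp = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)

def cronbach_alpha(items):
    """Internal consistency; `items` is a list of per-item score lists."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_vars = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Hypothetical diagnoses of 10 patients by two psychologists
rater1 = ["dep", "dep", "anx", "anx", "dep", "anx", "dep", "dep", "anx", "anx"]
rater2 = ["dep", "anx", "anx", "anx", "dep", "anx", "dep", "dep", "dep", "anx"]
print(cohens_kappa(rater1, rater2))   # 0.6: well above chance agreement

# Hypothetical scores of 4 people on 3 items of one scale
items = [[2, 4, 1, 3], [3, 4, 2, 3], [3, 5, 2, 4]]
print(cronbach_alpha(items))          # ~0.97: the items hang together
```

Note the kappa logic: raw agreement here is 0.8, but two raters using the labels 50/50 would agree 0.5 of the time by pure chance, and kappa rewards only agreement beyond that.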

Threats to reliability

Measurement error, instrumentation changes (your apparatus drifts), practice effects, sampling variability, participant error (mood, time of day), participant bias (faking answers on a mental-health questionnaire from their employer), researcher error (fatigue), researcher bias (subjective interpretation pushed toward the result they want).

Validity — are you measuring what you think you're measuring?

Reliability is repeatability. Validity is accuracy with respect to the target. A bathroom scale that always reads 5 kg too high is reliable but not valid. Five flavours:

Internal validity

Can you actually draw cause-and-effect conclusions from your study? Maya studies the cognitive effects of COVID by comparing govt-hospital patients to healthy controls recruited via online ads. Internal validity is shaky because the two groups differ in many ways besides COVID exposure (socioeconomic status, healthcare access, internet use). Any cognitive difference might be from those confounds, not from COVID.

External validity

Do your findings generalise beyond your specific sample? A study on attitudes toward psychotherapy conducted only on CogSci undergraduates at IIIT-H has poor external validity for "Indians" or "young adults" generally.

Construct validity

Is your operational definition actually capturing the construct? Trying to measure depression prevalence in students by posting a tweet asking depressed students to "like" it — construct validity is terrible. The act of liking a tweet is influenced by who follows you, who is online, who feels comfortable being publicly identified, none of which are depression.

Established through convergent validity (your measure correlates highly with other measures of the same construct — e.g., the new depression scale correlates r = 0.78 with PHQ-9) and discriminant validity (your measure correlates *less* with measures of unrelated constructs — e.g., depression scale correlates only r = 0.10 with extraversion).
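With toy numbers the pattern looks like this (the scores below are invented for illustration; only the correlation structure matters):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for 5 participants
new_depression_scale = [10, 12, 15, 18, 20]
phq9 = [11, 13, 14, 19, 21]    # established measure of the SAME construct
extraversion = [5, 9, 2, 7, 4]  # UNRELATED construct

r_convergent = pearson_r(new_depression_scale, phq9)           # ~0.98: high
r_discriminant = pearson_r(new_depression_scale, extraversion) # ~-0.25: low
print(r_convergent, r_discriminant)
```

High where it should be high, low where it should be low: that asymmetry, not either correlation alone, is the evidence for construct validity.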

Face validity

Does your test look like it does what it claims? Doesn't matter much to scientists. Matters when convincing policymakers or the general public.

Ecological validity

Does your experimental setup resemble real-world conditions? Lab-based eyewitness studies have low ecological validity because real eyewitnessing involves stress, distraction, and time pressure absent from the lab. Lab word-memory experiments also have low ecological validity, but their findings often *do* generalise — so ecological validity is desirable, not strictly required.

Reliability vs validity — the bull's-eye analogy

The classic four-square diagram:

Reliable + valid: tight cluster on the bull's-eye. The goal.
Reliable but invalid: tight cluster off-target. Bathroom scale 5 kg high.
Valid but unreliable: scattered, but centred on the bull's-eye. Right on average, yet no single reading can be trusted.
Neither: scattered everywhere. Useless.

Reliability is necessary but not sufficient for validity. You can be perfectly consistent and still wrong (the stopped clock). But if your measurements vary randomly, they cannot consistently track any target — so you can't be valid without being reliable.
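A small simulation of the top two squares of the diagram, with invented parameters (a true weight of 70 kg, a scale biased 5 kg high, a noisy but unbiased scale):

```python
import random

random.seed(1)
TRUE_WEIGHT = 70.0  # kg: the "bull's-eye" (hypothetical)

# Scale A: reliable but invalid (tight cluster, 5 kg off-target)
scale_a = [TRUE_WEIGHT + 5.0 + random.gauss(0, 0.1) for _ in range(100)]
# Scale B: unreliable but unbiased (scattered around the bull's-eye)
scale_b = [TRUE_WEIGHT + random.gauss(0, 5.0) for _ in range(100)]

def mean(xs):
    return sum(xs) / len(xs)

def spread(xs):  # population standard deviation
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

print(f"A: bias={mean(scale_a) - TRUE_WEIGHT:+.2f}, spread={spread(scale_a):.2f}")
print(f"B: bias={mean(scale_b) - TRUE_WEIGHT:+.2f}, spread={spread(scale_b):.2f}")
# A is consistent (tiny spread) yet always wrong;
# B is right on average but untrustworthy on any single reading.
```

Bias and spread are exactly the two axes of the four-square diagram: validity is about bias, reliability is about spread.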

Confounds and how Maya handles them

A confound is a variable that is related to both your predictor and your outcome in some systematic way, creating the illusion of a relationship that isn't really there (or hiding one that is).

*Classic exam example:* Does playing violent video games cause aggression? You compare gamers to non-gamers using criminal records. Problem: children who spend hours playing violent games are also more likely to have absent parents, less supervision, particular socioeconomic backgrounds. Those might cause aggression, not the games. Parental support is a confound.

The gold-standard fix is the ideal experiment: take a random sample, randomly assign people to violent-games vs peaceful-games groups, monitor them for years, compare outcomes. Randomisation breaks the link between predictor and confound — over many participants, the confound averages out across the groups.

The catch: this is rarely feasible (ethical, expensive, decades long). So instead, you incorporate the confounds as covariates in your statistical model — adjusting the outcome for the parts explained by the confound before estimating the effect of your predictor. You'll meet covariates again in ANCOVA and multiple regression.
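A minimal sketch of covariate adjustment, with deliberately rigged toy numbers: aggression depends only on parental support, yet the naive group comparison finds a big "gaming effect". Residualising both the predictor and the outcome on the covariate (the Frisch-Waugh partialling that multiple regression performs internally) recovers the truth:

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

def residuals(xs, ys):
    """ys with the linear effect of xs removed."""
    b = slope(xs, ys)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return [y - (my + b * (x - mx)) for x, y in zip(xs, ys)]

# Toy data (hypothetical): gamer status, parental support, aggression score.
# Aggression is driven ENTIRELY by support; games have no true effect.
gamer = [1, 1, 0, 0]
support = [0, 1, 2, 3]
aggro = [5, 4, 3, 2]  # exactly 5 - support

# Naive comparison: gamers look 2 points more aggressive.
naive = (5 + 4) / 2 - (3 + 2) / 2

# Adjustment: remove support's effect from BOTH predictor and outcome,
# then relate what is left over.
adjusted = slope(residuals(support, gamer), residuals(support, aggro))
print(naive, adjusted)  # 2.0 vs 0.0: the "effect" was all confound
```

The same residualising logic, run on many covariates at once, is what ANCOVA and multiple regression do for you.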

The threats to validity

This part of the chapter loves to appear as MCQs. Each is a specific way your study can go wrong:

  • History effects — something external happens during the study that influences the results. E.g., a patient has surgery on day 5, in between your hospital-stay measurements on days 3 and 7.
  • Maturational effects — natural changes over time, independent of the experiment. Fatigue, waning attention in long psych experiments.
  • (Repeated) testing effects — getting better just from doing the test more often.
  • Selection bias — the groups you compare differ systematically *before* intervention.
  • Differential attrition — participants dropping out, and the dropouts are not random.
  • Non-response bias — random survey to 1,000 emails; 200 respond; respondents differ from non-respondents.
  • Regression to the mean — selecting on extreme scores, then observing reversion. *Classic mistake:* Kahneman & Tversky's flight instructors thought punishment worked because performance regressed after extreme highs and lows.
  • Experimenter bias — expectations leak into the data. Clever Hans, the horse who appeared to do arithmetic — Pfungst (1907) showed Hans was reading subtle, unconscious cues from his trainer von Osten.
  • Demand and reactivity effects (Hawthorne effect) — participants behave differently because they know they're being studied.
  • Placebo effects — the expectation of a positive effect produces a real effect, even from an inert intervention.

Solution to experimenter bias and reactivity: the double-blind study

Neither the participant nor the experimenter knows which condition the participant is in until the data are analysed. Combined with placebo control (control group gets an inert intervention) and random assignment, the double-blind randomised controlled trial (RCT) is the gold standard of causal evidence.

Exam cheat-sheet

NOIR scales: Nominal → Ordinal → Interval → Ratio. Continuous/discrete is a separate axis.
Reliability = repeatability. Validity = accuracy. Cannot be valid without being reliable.
Reliability flavours: test-retest (time), inter-rater (people), parallel forms (versions), internal consistency (items).
Validity flavours: internal (causal), external (generalise), construct (right thing), face (looks right), ecological (real-world).
Cohen's κ for inter-rater agreement; Cronbach's α for internal consistency.
Convergent + discriminant validity → construct validity.
Confounds threaten internal validity. Fix: random assignment or covariate adjustment.
Double-blind defeats both experimenter bias and reactivity.

If your exam asks "list five threats to validity" or "distinguish reliability from validity" or "explain construct validity with an example", you can now answer those in your sleep.

When you're ready, send "next" and we open Session 2: probability and distributions — frequentists, Bayesians, the Central Limit Theorem (the most important single result in the entire course), and the family of distributions that underlies every test you'll learn later.