Scales, Reliability, Validity
Intuition
Before any test, you must (1) operationally define the construct (how do you measure 'depression'?), (2) decide what scale it lives on (NOIR: Nominal/Ordinal/Interval/Ratio), and (3) establish that the measurement is both reliable (consistent) and valid (captures the construct). The two are distinct: a bathroom scale that always reads 5 kg high is *reliable but not valid*; so is a stopped clock. You can be reliable without being valid, but you cannot be valid without being reliable. These distinctions drive every test choice in the course.
Explanation
Operational definitions. You cannot study 'depression' or 'aggression' or 'intelligence' until you define them in a way you can actually measure. An operational definition is a working definition of what a researcher is measuring. If Maya says *'I'll measure depression as the number of times a person hangs out with family and friends in a month'*, that's an operational definition. It might be a bad one (depressed people might still socialise; July behaves differently from November), but it is at least concrete enough to test. Similarly, defining 'on target' in a golf study as 'within ±10% of the goal distance' — same idea. Without operational definitions, science is just opinions in lab coats.
The four scales of measurement — NOIR. This is exam gold. Memorise in order with examples — questions often ask 'what kind of scale is X?'
Nominal. Categorical, no order. Eye colour, sex, mode of transport, blood type, country of origin. Cannot say one is greater than another. Cannot average. *'Average eye colour' is nonsense.* Allowable statistics: mode, counts, frequencies, χ² tests.
Ordinal. Ordered categories, but the spacing between them isn't meaningful. Race position (1st, 2nd, 3rd — but the gap between 1st and 2nd may be huge while 2nd and 3rd were a photo finish). Ranks. Strongly disagree → strongly agree on a Likert scale, *strictly speaking*. Allowable: median, percentiles, rank-based correlations (Spearman, Kendall).
Interval. Numerical, equal spacing, no true zero. Temperature in Celsius (0°C is just freezing, not 'no temperature'), calendar year. Differences make sense (30°C − 20°C = 10°C = 20°C − 10°C), but ratios don't — you cannot say 20°C is 'twice as hot' as 10°C. Allowable: mean, SD, addition/subtraction, Pearson r, t-tests, ANOVA.
Ratio. Numerical, equal spacing, true zero. Reaction time, weight, height, count, age. Ratios *do* make sense — *'I'm twice as fast as you'* is meaningful when 0 means 'no time' / 'no weight'. Allowable: everything above plus geometric mean, multiplicative comparisons, coefficient of variation.
Continuous vs discrete is orthogonal to NOIR. Reaction time: ratio, continuous. Year of birth: interval, discrete. Temperature: interval, continuous. Mode of transport: nominal, discrete. Race position: ordinal, discrete. The two classifications cut the data differently and you should be able to apply both at once.
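To make the two classifications concrete in code, here is a minimal sketch (illustrative data, pandas assumed) mapping each scale to a dtype and to the summaries it licenses:

```python
import pandas as pd

df = pd.DataFrame({
    # Nominal: unordered categories -- only counts/mode are licensed
    "transport": pd.Categorical(["bus", "car", "bus", "bike"]),
    # Ordinal: ordered categories -- order comparisons and median licensed
    "position": pd.Categorical([1, 3, 2, 1], categories=[1, 2, 3], ordered=True),
    # Interval: numeric, no true zero -- mean/SD licensed, ratios not
    "temp_c": [30.0, 20.0, 25.0, 10.0],
    # Ratio: numeric, true zero -- everything, including ratios
    "rt_ms": [412.0, 389.0, 501.0, 377.0],
})

print(df["transport"].mode()[0])  # nominal: mode
print(df["position"].max())       # ordinal: max/min defined because ordered=True
print(df["temp_c"].mean())        # interval: mean OK; '2x as hot' is not
print(df["rt_ms"].mean())         # ratio: mean OK and ratios are meaningful
```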
The Likert wrinkle. Strictly ordinal, but most researchers treat it as interval because participants seem to use the scale roughly evenly. The exam may ask whether this is technically correct — *answer: technically no, in practice yes, depends on the task*. Likert data analysed as interval typically uses t-tests / ANOVA / Pearson; analysed strictly as ordinal uses Mann-Whitney / Kruskal-Wallis / Spearman.
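A minimal sketch of the two routes on the same simulated Likert data: treating the 1-5 points as interval (t-test) versus strictly ordinal (Mann-Whitney). The data and effect size here are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated 1-5 Likert responses for two conditions (hypothetical effect in group b)
a = rng.integers(1, 6, size=40)
b = np.clip(rng.integers(1, 6, size=40) + 1, 1, 5)

t_stat, p_interval = stats.ttest_ind(a, b)    # Likert treated as interval
u_stat, p_ordinal = stats.mannwhitneyu(a, b)  # Likert treated as ordinal

print(f"t-test p = {p_interval:.4f}; Mann-Whitney p = {p_ordinal:.4f}")
# At moderate n the two routes usually lead to the same conclusion.
```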
Reliability — does the measurement repeat itself? A measurement that gives a different answer every time is worthless. Reliability is the consistency of a measurement. Four flavours, all of which you should be able to name on the exam.
Test-retest reliability — consistency over time. Give the same person the IQ test twice, three months apart. Do they get similar scores? They should, if the test is reliable. Quantified by the correlation between test and retest.
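A quick simulated sketch of the quantification (all numbers assumed): a stable trait measured twice with independent error, reliability read off as the Pearson correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trait = rng.normal(100, 15, size=50)        # stable underlying trait (e.g., IQ)
test = trait + rng.normal(0, 5, size=50)    # first administration + error
retest = trait + rng.normal(0, 5, size=50)  # second administration + fresh error

r, _ = stats.pearsonr(test, retest)
print(f"test-retest reliability: r = {r:.2f}")  # high r = consistent over time
```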
Inter-rater reliability — consistency across people doing the measuring. Two trained psychologists diagnose the same 50 patients. Do they agree? Quantified by Cohen's Kappa for two raters on nominal data; Fleiss' Kappa for more than two raters; Kendall's coefficient of concordance for ordinal data; Krippendorff's Alpha as a general-purpose option. Cohen's κ = (p_o − p_e)/(1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance.
Parallel forms reliability — consistency across theoretically-equivalent versions of the same measurement. Two different weighing scales should give the same weight. Two different forms of an English vocabulary test should rank students similarly. Often quantified by the correlation between forms.
Internal consistency reliability — consistency across the items *within a single instrument*. If an IQ test has 10 questions all supposedly measuring fluid intelligence, scores on those questions should correlate. Quantified by Cronbach's α (most common), split-half reliability, Kuder-Richardson 20/21 (KR-20/21).
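A minimal sketch computing Cronbach's α directly from the formula given in the Definitions below; the questionnaire data are simulated for illustration.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = (k/(k-1)) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(2)
construct = rng.normal(0, 1, size=(100, 1))          # one shared construct
items = construct + rng.normal(0, 1, size=(100, 5))  # 5 items = signal + noise
print(f"alpha = {cronbach_alpha(items):.2f}")
```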
Threats to reliability. Random measurement error; instrumentation changes (apparatus drifts); practice effects; sampling variability; participant mood/time-of-day; participant bias (faking on a mental-health questionnaire from their employer); researcher fatigue; researcher bias (subjective scoring pushed toward expected result).
Validity — are you measuring what you think you're measuring? Reliability is repeatability. Validity is accuracy with respect to the target. A bathroom scale that always reads 5 kg too high is reliable but not valid. Five flavours.
Internal validity — can you actually draw cause-and-effect conclusions from your study? Maya studies cognitive effects of COVID by comparing govt-hospital patients to healthy controls recruited via online ads. Internal validity is shaky — the groups differ on many variables besides COVID exposure (SES, healthcare access, internet use). Any cognitive difference might be from those confounds. Random assignment is the gold-standard remedy.
External validity — do findings generalise beyond your sample? A study on attitudes toward psychotherapy conducted only on CogSci undergraduates at IIIT-H has poor external validity for 'Indians' or 'young adults' generally.
Construct validity — is your operational definition actually capturing the construct? Trying to measure depression prevalence in students by posting a tweet asking depressed students to 'like' it — construct validity is terrible. Liking a tweet is influenced by who follows you, who is online, who feels comfortable being publicly identified — none of which are depression. Construct validity is established through *convergent* (correlates with other measures of the same construct, e.g., a new depression scale correlates r > 0.7 with PHQ-9) and *discriminant* (low correlation with measures of unrelated constructs, e.g., r ≈ 0.1 with extraversion) validity.
Face validity — does your test look like it does what it claims? Doesn't matter much to scientists (a measure can have low face validity but high construct validity). Matters for convincing policymakers or the general public — and for participant compliance.
Ecological validity — does your experimental setup resemble real-world conditions? Lab-based eyewitness studies have low ecological validity because real eyewitnessing involves stress, distraction, time pressure, none of which are present in the lab. Lab word-memory experiments also have low ecological validity, but their findings often *do* generalise — so ecological validity is desirable, not strictly required.
Reliability and validity can come apart. A bathroom scale 5 kg high is reliable (gives the same answer each time) but invalid (off-target). A stopped clock is reliable (always 3:00) but invalid as a time measure (only right twice a day, by accident). You can be reliable without being valid. You cannot be valid without being reliable — if every measurement gives a different answer, the measure cannot consistently capture the construct.
Regression to the mean — the most subtle of the validity threats. When you select participants based on an extreme score, their *next* measurement tends to be less extreme, simply because the first one was unusual. Classic mistake (Kahneman & Tversky, 1973): an old study claimed people learn better from negative feedback than from positive feedback. Flight instructors observed that pilots praised after a good landing did *worse* next time, while pilots scolded after a bad landing did *better*. Conclusion: 'punishment works'. Actually it was regression to the mean — feedback was given when performance was extreme, and the next performance was naturally closer to average regardless of feedback. It is the classic example of a real statistical phenomenon mistaken for a causal effect.
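A simulation sketch of the pilots story under assumed numbers: performance is pure luck around a stable mean and feedback has no effect, yet selecting on extremes manufactures the 'punishment works' pattern.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
landing1 = rng.normal(0, 1, size=n)  # first landing: stable skill + luck
landing2 = rng.normal(0, 1, size=n)  # second landing: feedback has NO effect

praised = landing1 > 1.5             # praised after an unusually good landing
scolded = landing1 < -1.5            # scolded after an unusually bad landing

print(f"praised: {landing1[praised].mean():+.2f} -> {landing2[praised].mean():+.2f}")
print(f"scolded: {landing1[scolded].mean():+.2f} -> {landing2[scolded].mean():+.2f}")
# Both groups move toward the mean on landing 2 -- no feedback effect needed.
```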
Confounds and the double-blind solution. A confound varies systematically with the IV and could itself explain the change in DV — threatens internal validity. Double-blind studies (neither participant nor experimenter knows the assignment) control both *experimenter bias* (cueing, biased scoring, selectively reporting) and *placebo / reactivity* effects. Combined with placebo control (control group gets an inert intervention) and randomisation (random assignment to conditions), the double-blind RCT is the gold standard of causal evidence.
Definitions
- Operational definition — Working definition that specifies *how* to measure an abstract construct. Necessary for any empirical study.
- Nominal scale — Categorical, no order. Eye colour, sex, blood type. Allowable: mode, counts, χ².
- Ordinal scale — Ordered categories, intervals not equal. Race position, Likert (strictly). Allowable: median, percentiles, Spearman/Kendall.
- Interval scale — Numerical, equal spacing, no true zero. °C, calendar year. Allowable: mean, SD, t, ANOVA, Pearson r. No meaningful ratios.
- Ratio scale — Numerical, equal spacing, true zero. Reaction time, weight, height. All operations including ratios meaningful.
- Continuous vs discrete — Orthogonal to NOIR. Whether the variable can take any value in a range or only specific values.
- Reliability — Consistency / repeatability of a measurement. Four flavours: test-retest, inter-rater, parallel forms, internal consistency.
- Test-retest reliability — Same measurement on same units at two times. Quantified by correlation between the two.
- Inter-rater reliability — Agreement among different raters on the same items. Cohen's κ (2 raters), Fleiss κ (>2), Kendall W (ordinal), Krippendorff α (general).
- Parallel forms reliability — Equivalent versions of the same measurement give similar results. Correlation of two forms.
- Internal consistency — Items within a single instrument correlate. Cronbach's α, split-half, KR-20/21.
- Cohen's κ — (p_o − p_e)/(1 − p_e). Inter-rater agreement above chance for nominal data. > 0.8 excellent, 0.6–0.8 substantial, 0.4–0.6 moderate, < 0.4 poor.
- Cronbach's α — Internal consistency: (k/(k−1))(1 − Σσ²ᵢ/σ²_total). > 0.7 acceptable, > 0.8 good. > 0.95 may indicate redundancy.
- Validity — Accuracy of a measurement w.r.t. the construct. Five flavours: internal, external, construct, face, ecological.
- Internal validity — Can we attribute DV changes to the IV (no confounds)? Strengthened by random assignment, control groups, double-blind.
- External validity — Do findings generalise to other people, settings, times? Strengthened by random sampling, diverse samples, replication.
- Construct validity — Does the measure actually capture the construct? Established through convergent (same-construct correlation high) and discriminant (other-construct correlation low) evidence.
- Face validity — Does the test superficially look like it taps the construct? Weakest type; matters more for participant buy-in and policymaker acceptance than scientific validity.
- Ecological validity — Does the experimental setup resemble real-world conditions? Desirable but not strictly required — lab simplifications often generalise.
- Convergent / discriminant validity — Convergent: high correlation with same-construct measures. Discriminant: low correlation with unrelated-construct measures. Both required for construct validity.
- Regression to the mean — Extreme scores tend to be followed by less extreme ones. Easily mistaken for a treatment effect (Kahneman pilots example).
- Confound — Third variable related to both IV and DV that could itself explain the outcome. Threatens internal validity. Random assignment is the gold-standard fix.
- Double-blind — Neither participant nor experimenter knows the condition. Defeats both experimenter bias and reactivity. Standard in clinical trials.
Formulas
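- Cohen's κ = (p_o − p_e)/(1 − p_e), where p_o = observed agreement and p_e = agreement expected by chance from the raters' marginals.
- Cronbach's α = (k/(k−1))(1 − Σσ²ᵢ/σ²_total), for an instrument of k items.
- Spearman-Brown: lengthening a test by a factor n takes reliability r to r′ = n·r / (1 + (n − 1)·r).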
Derivations
Why a stopped clock is reliable but invalid. Reliability = consistency: the clock gives the same answer each time (3:00, every time). Validity = accuracy: it captures 'true time' only twice a day, by accident. Hence reliability is necessary but not sufficient for validity — you can be perfectly consistent and still wrong.
Why you cannot be valid without being reliable. Validity requires that the measurement track the true construct. If measurements vary randomly each time, they cannot consistently track anything. Hence: reliability is a *necessary precondition* for validity. (Reliability is not sufficient — a reliable measurement can still be off-target.)
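A simulated illustration of both directions (all numbers assumed): a pure-noise measure has zero reliability and cannot track the construct, while a consistently biased scale is perfectly reliable yet still invalid.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
truth = rng.normal(70, 10, size=200)       # true weights (the construct)
pure_noise = rng.normal(70, 10, size=200)  # zero-reliability 'measure'
biased_scale = truth + 5                   # perfectly reliable, 5 kg high

r_noise = stats.pearsonr(pure_noise, truth)[0]
r_biased = stats.pearsonr(biased_scale, truth)[0]
print(f"noise vs truth:  r = {r_noise:+.2f}")  # ~0: tracks nothing
print(f"biased vs truth: r = {r_biased:+.2f}, "
      f"mean error = {np.mean(biased_scale - truth):+.1f} kg")  # tracks, but off-target
```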
Cohen's κ derivation. Two raters classify items into nominal categories. p_o = proportion of items both raters placed in the same category (observed agreement). p_e = agreement expected by chance from each rater's marginals. κ = (p_o − p_e)/(1 − p_e) rescales agreement to [−1, 1], where 0 = chance, 1 = perfect, < 0 = worse than chance. Benchmarks (Landis & Koch): < 0.4 poor, 0.4–0.6 moderate, 0.6–0.8 substantial, > 0.8 almost perfect.
Cronbach's α intuition. Items measuring the same construct should covary — so the variance of their sum should be larger than the sum of their variances. α takes the ratio: α = (k/(k−1))(1 − Σσ²ᵢ/σ²_total). High α (close to 1) means items move together; α near 0 means they don't. Cut-offs: > 0.7 acceptable, > 0.8 good, > 0.95 may indicate redundancy.
Why Likert can be 'almost interval' in practice. Although strictly ordinal, if participants use the scale roughly evenly (1-2 gap ≈ 4-5 gap), the data behave like interval data for parametric tests. Empirical studies show t-tests / ANOVAs on Likert data give very similar conclusions to ordinal-only tests at moderate-large n. At small n, or with clear floor/ceiling effects, fall back to nonparametric tests.
Examples
- Operational definition. Depression measured by score on PHQ-9 (validated 9-item questionnaire). Aggression measured by # punches thrown in 5-min sparring session. 'On-target' in a golf study = ball lands within ±10% of target distance.
- NOIR worked classification. Reaction time → ratio, continuous. Year of birth → interval, discrete. Temperature in °C → interval, continuous. Mode of transport → nominal, discrete. Race finishing position → ordinal, discrete. Likert agreement 1–5 → ordinal (strictly), often treated as interval. Height in cm → ratio, continuous.
- Cohen's κ example. Two clinicians rate 100 patients as anxious / not. Observed agreement: 85 patients agreed on, 15 disagreed, so p_o = 0.85. Chance agreement: if clinician 1 says anxious 60% of the time and clinician 2 says anxious 65%: p_e = 0.60 × 0.65 + 0.40 × 0.35 = 0.53. κ = (0.85 − 0.53)/(1 − 0.53) ≈ 0.68 — substantial agreement (verified in the sketch after this list).
- Cronbach's α example. PHQ-9 (9 items) typically has α ≈ 0.86 — good internal consistency, items measure the same depression construct.
- Convergent + discriminant validity. New depression scale correlates r = 0.78 with PHQ-9 → strong convergent. Correlates r = 0.10 with extraversion → strong discriminant. Both together support construct validity.
- Practice effects. Participants tested 4 times on the same Stroop task get faster across sessions regardless of any manipulation. Counterbalance order; include practice trials before measurement.
- Regression to the mean — Kahneman pilots. Flight instructors observed pilots praised after a good landing did worse next time, scolded after bad landing did better. Concluded 'punishment works'. Actually: praise was given when performance was extreme-good (next time naturally closer to average — looks worse), scolding when extreme-bad (next time naturally closer to average — looks better). No causal effect of feedback.
- Selection bias. A study on aggression compares men and women but recruits men from a college boxing club and women from a college book club. Pre-existing differences in aggression unrelated to gender confound the comparison.
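The Cohen's κ arithmetic from the worked example above, checked in a few lines (same numbers):

```python
p_o = 0.85                       # observed agreement
p_e = 0.60 * 0.65 + 0.40 * 0.35  # chance agreement from the marginals
kappa = (p_o - p_e) / (1 - p_e)
print(f"p_e = {p_e:.2f}, kappa = {kappa:.2f}")  # p_e = 0.53, kappa = 0.68
```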
Diagrams
- NOIR table. Rows: Nominal / Ordinal / Interval / Ratio. Columns: Order? Equal intervals? True zero? Example. Allowable statistics. Mark with ✓/✗.
- Reliability vs Validity 2×2. Bull's-eye shooting analogy. Top-left: tight cluster off-target (reliable, not valid). Top-right: tight cluster on bull's-eye (both). Bottom-left: scattered everywhere (neither). Bottom-right: scattered around centre (valid, not reliable — impossible in the long run).
- Validity types fan-out. Internal / External / Construct / Face / Ecological — with one-sentence definition each.
- Convergent + discriminant validity scatter. Target measure at centre; high-correlation cluster on one side (same-construct measures), low-correlation cluster on the other (different-construct measures).
- Regression-to-the-mean illustration. Scatterplot of test 1 vs test 2. Pick the extreme-high on test 1; their test 2 score is closer to the mean. Mark the regression line vs the identity line.
- Cronbach's α nomogram. As # of items and average inter-item correlation increase, α increases. Useful for designing scale length.
Edge cases
- Likert scales strictly ordinal but commonly treated as interval — defensible at moderate-large n if scale points are used roughly evenly. Strict ordinal analysis (Mann-Whitney, Spearman) avoids the assumption.
- Practice effects confound within-subjects designs — counterbalance order or include practice trials.
- Demand characteristics (participants guessing the hypothesis) threaten construct validity. Use cover stories or single/double-blind designs.
- Hawthorne effect — behaviour changes due to being observed. Hard to fully eliminate; minimise by being unobtrusive and using ecologically valid measurements.
- Differential attrition in longitudinal studies — older / sicker / less-engaged participants drop out at higher rates, biasing the surviving sample.
- Reactivity in self-report. Asking about sensitive behaviours (drug use, mental health) is biased by social desirability — use anonymous reporting or implicit measures.
- Cronbach's α > 0.95 may indicate redundant items — α high because items are near-paraphrases rather than independent measurements of the construct.
- Cohen's κ paradox — high agreement with very low κ when one category dominates. Always interpret κ alongside marginal frequencies.
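A numeric sketch of the κ paradox in the last edge case (marginals assumed for illustration):

```python
# 100 patients; 'not anxious' dominates both raters' marginals
p_o = (90 + 1) / 100             # both said 'not' on 90, both 'anxious' on 1
p_e = 0.94 * 0.95 + 0.06 * 0.05  # rater marginals: 94%/6% and 95%/5%
kappa = (p_o - p_e) / (1 - p_e)
print(f"raw agreement = {p_o:.0%}, kappa = {kappa:.2f}")  # 91% agreement, kappa = 0.13
# Nearly all of the raw agreement is what chance alone would produce.
```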
Common mistakes
- Confusing reliability and validity. Bathroom scale 5 kg high: reliable, not valid. Reliability is necessary but not sufficient for validity.
- Treating Likert as ratio. You cannot say 'agreement of 4 is twice 2' — Likert is strictly ordinal, often defensibly treated as interval, never ratio.
- Computing a mean for nominal data. 'Average eye colour' is meaningless. Use mode and frequencies.
- Treating 0°C as 'no temperature'. Celsius is interval, not ratio. Kelvin is ratio.
- 'High Cronbach's α proves construct validity'. α measures *internal consistency* only. It says items move together — it does *not* say they measure the right thing. A scale with α = 0.9 measuring 'caffeine intake' would be internally consistent but invalid for depression.
- Confusing convergent and discriminant validity. Convergent = high correlation with same-construct measures. Discriminant = *low* correlation with unrelated-construct measures. You need both.
- Forgetting that face validity is the weakest type — a measure can have low face validity but high construct validity, and vice versa.
- Using ecological validity to dismiss any lab study. Lab settings are intentional simplifications; many findings generalise. Ecological validity is desirable, not strictly required.
- Mistaking regression to the mean for a treatment effect. Always include a control group; never select on extreme scores without expecting reversion.
Shortcuts
- NOIR = Nominal → Ordinal → Interval → Ratio, increasing in information. Each level inherits operations of all lower levels.
- Allowable statistics escalate: mode/counts → + median → + mean/SD → + ratios.
- Reliability = consistency; Validity = accuracy. Memorise verbatim.
- Reliable ≠ valid; valid implies reliable.
- Reliability flavours: test-retest (time), inter-rater (people), parallel forms (versions), internal consistency (items).
- Validity flavours: internal (causal), external (generalise), construct (right thing), face (looks right), ecological (real-world).
- Inter-rater stats: Cohen's κ (2 raters, nominal) / Fleiss κ (>2 raters) / Kendall W (ordinal) / Krippendorff α (general).
- Internal consistency: Cronbach's α (most common) / split-half / KR-20/21.
- Double-blind = experimenter bias + reactivity defeated. Standard for clinical trials.
Proofs / Algorithms
Reliability is necessary for validity. Suppose a measurement has zero reliability — repeated measurements on the same unit produce uncorrelated values. Then the measurement value is essentially random noise, and *no transformation* of it can systematically track the true construct (which is, by assumption, a stable property of the unit). Hence reliability ≈ 0 implies validity ≈ 0. Contrapositive: validity > 0 requires reliability > 0. Reliability is *necessary* (but not sufficient — a 5 kg-high scale has perfect reliability and zero validity).
Cohen's κ rescales chance to zero. Without correction, observed agreement confuses true rater concordance with chance agreement. Subtracting p_e (chance agreement from marginal frequencies) and rescaling by 1 − p_e produces a coefficient that is 0 when raters agree only at chance level and 1 when they agree perfectly. Negative κ indicates worse-than-chance agreement (raters systematically disagree).
Spearman-Brown lengthening. If a k-item test has reliability r, lengthening it to nk items raises reliability to r′ = n·r / (1 + (n − 1)·r). Used to: (i) estimate full-length reliability from split-half, by setting n = 2; (ii) plan how many items a scale needs to reach a target reliability. Diminishing returns: each additional item adds less reliability than the last.
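A minimal sketch of the Spearman-Brown formula in code (reliability values assumed for illustration):

```python
def spearman_brown(r: float, n: float) -> float:
    """Reliability after lengthening a test by factor n, given current reliability r."""
    return n * r / (1 + (n - 1) * r)

# (i) split-half: halves correlate at 0.70 -> full-length estimate with n = 2
print(f"{spearman_brown(0.70, 2):.2f}")         # 0.82

# (ii) diminishing returns when planning scale length
for n in (1, 2, 4, 8):
    print(n, round(spearman_brown(0.50, n), 2))  # 0.5, 0.67, 0.8, 0.89
```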