Pearson, Spearman, Partial, Reliability Metrics
Maya Measures Together-ness
Maya's next study isn't about turmeric milk any more. She's interested in whether students who spend more hours studying actually get better exam scores. Two continuous variables: study hours, exam scores. No experimental manipulation, just observed pairs. The natural question: do these two variables move together?
That question has a number attached to it, and the number is called a correlation coefficient. This unit covers correlation, its cousins (partial and semi-partial), the eternal warning that correlation is not causation, and then the deeper machinery of reliability — Cohen's κ, Cronbach's α, and how to handle outliers that wreck your analyses.
Correlation is NOT causation
Before any math, the warning. If you remember nothing else from this unit, remember this.
Two variables can be correlated for any of four reasons:
1. A causes B. Smoking causes lung cancer.
2. B causes A. Lung cancer causes people to smoke more for relief.
3. A third variable C causes both. Ice cream sales and drowning deaths are correlated — but neither causes the other. Hot weather causes both.
4. Coincidence / sampling noise. Especially in small samples, two genuinely unrelated variables can show a high correlation just by chance.
Showing a correlation establishes that two things move together. It does not establish why. To claim causation you need either an experiment (with manipulation and randomisation) or extremely careful causal-inference methods that rule out confounders. This is one of the most heavily examined statements in BRSM — there will almost certainly be a question that gives you a correlation and asks if you can conclude causation. The answer is always: *not from correlation alone.*
Pearson's r — the standard correlation coefficient
When both variables are continuous and approximately normally distributed, the standard measure of linear association is **Pearson's correlation coefficient**:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Equivalently, $r$ is the average product of z-scores: $r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}$.
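A quick way to convince yourself the two formulations agree is a minimal NumPy sketch (the study-hours and exam-score numbers here are made up for illustration):

```python
import numpy as np

# Hypothetical data: study hours and exam scores for 10 students.
hours = np.array([2, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=float)
scores = np.array([51, 55, 60, 58, 65, 70, 68, 75, 80, 83], dtype=float)

# z-score each variable using the sample standard deviation (ddof=1).
zx = (hours - hours.mean()) / hours.std(ddof=1)
zy = (scores - scores.mean()) / scores.std(ddof=1)

# Pearson's r as the average product of z-scores.
r = np.sum(zx * zy) / (len(hours) - 1)

# Cross-check against NumPy's built-in correlation matrix.
print(r, np.corrcoef(hours, scores)[0, 1])  # the two values match
```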
Interpretation rules
- $r$ ranges from −1 to +1.
- $r = +1$: perfect positive linear relationship.
- $r = 0$: no linear relationship.
- $r = -1$: perfect negative linear relationship.
- Sign = direction; magnitude = strength.
Strength benchmarks (common exam table)
| $|r|$ | Interpretation |
| --- | --- |
| < 0.1 | negligible |
| 0.1–0.3 | small / weak |
| 0.3–0.5 | moderate |
| ≥ 0.5 | large / strong |
The coefficient of determination, r²
The square of $r$: the proportion of variance shared between X and Y.
If $r = 0.7$, then $r^2 = 0.49$ → *"49% of the variance in Y is explained by its linear relationship with X."* A very common reporting style.
**For simple linear regression with one predictor, $r^2$ equals the regression $R^2$** — we'll revisit this in Unit 12.
What Pearson's r does NOT measure
- Non-linear relationships. A perfect parabola has $r \approx 0$ — Pearson is blind to curvature. Always look at the scatter plot.
- Heteroscedastic spreads. Pearson r still gives one number even when y-spread depends on x.
- Causation. Already covered.
Pearson's r is sensitive to outliers
A single extreme point can drag $r$ dramatically. Two scatter plots can both yield the same high $r$ — one because the data really hugs a line, and the other because two extreme points anchor the slope while the rest of the data is a noisy cloud.
This is exactly why Anscombe's quartet (Unit 4) matters: identical r, four totally different stories.
Spearman's ρ — the rank-based alternative
When the data are ordinal, when the relationship is monotonic but not linear, or when outliers are wrecking Pearson's r, use Spearman's ρ.
Spearman's ρ is just Pearson's r computed on the ranks of the data, not the raw values. Replace each x with its rank within x's; each y with its rank within y's; compute Pearson r on the ranks.
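Since the definition is literally "Pearson on ranks", it is easy to verify in code. A minimal sketch with hypothetical data, using scipy's `rankdata` and `spearmanr`:

```python
import numpy as np
from scipy import stats

# Hypothetical data with one extreme value to show rank compression.
x = np.array([1, 2, 3, 4, 5, 6, 7, 100], dtype=float)
y = np.array([2, 1, 4, 3, 6, 5, 8, 7], dtype=float)

# Spearman's rho by hand: rank each variable, then Pearson on the ranks.
rho_manual = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]

# scipy's direct implementation agrees.
rho_direct = stats.spearmanr(x, y)[0]

print(rho_manual, rho_direct)  # identical up to floating-point noise
```

Note that `rankdata` assigns tied values their average rank, which is the standard convention for Spearman's ρ.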
Properties:
- Ranges from $-1$ to $+1$, same as Pearson.
- Monotonicity, not linearity. $\rho = 1$ when each variable is a perfectly increasing function of the other (linear or not).
- Robust to outliers because ranks compress extreme values to merely the highest or lowest rank.
- Works for ordinal data — no interval-scale assumption.
Pearson vs Spearman — when to use which
The exam loves this comparison:
| Situation | Use |
| --- | --- |
| Both vars interval/ratio + Normal + no outliers + linear | Pearson r |
| Ordinal OR non-Normal OR outliers OR monotonic-non-linear | Spearman ρ |
If both are valid, they give similar values; Pearson is slightly more powerful when its assumptions hold.
A short-answer favourite: *"Why is Spearman's ρ less sensitive to outliers than Pearson's r?"* Because Spearman uses ranks; the largest value, no matter how extreme, becomes just rank $n$. The numerical extremity gets compressed away.
Significance of correlation — strength vs significance
You compute a correlation $r$ from your sample. Is it statistically significant? It depends on sample size.
Statistical significance tests $H_0: \rho = 0$ (no linear relationship in the population). Test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

with $n - 2$ degrees of freedom.
Practical consequences:
- With small $n$, you can get a large $r$ just from noise. Apparently strong correlations from tiny samples are unreliable.
- With large $n$, an $r$ of 0.10 — barely visible to the eye — can be statistically significant. The effect is real but tiny.
Two things to memorise:
1. Strength is shown by the *magnitude* of $r$ — close to ±1 is strong.
2. Statistical significance is shown by the *p-value*, which depends on both $r$ AND $n$.
When n is low, the odds are high that a "good-looking" correlation occurred by chance. When n is high, even meaningless correlations may pass the significance bar. Same "statistical vs practical significance" lesson as always.
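Both consequences are easy to see numerically. A minimal sketch using the t-statistic above; the helper names `r_to_t` and `r_pvalue` are invented for illustration:

```python
import numpy as np
from scipy import stats

def r_to_t(r, n):
    """Test statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

def r_pvalue(r, n):
    """Two-sided p-value for a sample correlation r at sample size n."""
    t = r_to_t(r, n)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# A "strong" r in a tiny sample vs a "weak" r in a huge sample.
print(r_pvalue(0.80, n=5))       # large r, n = 5: p > 0.05, not convincing
print(r_pvalue(0.10, n=10_000))  # tiny r, n = 10,000: highly significant
```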
Partial correlation
Maya cares about the relationship between test scores and IQ, but she worries that study hours might be inflating their correlation. People who study more both have higher IQs *and* score higher on tests. So part of the apparent IQ–score relationship might be "just" the effect of study time on both.
To isolate the relationship while controlling for study hours, she computes a partial correlation.
Definition: the correlation between X and Y after removing the linear effect of Z from both X and Y.
Mechanically:
1. Regress X on Z; compute residuals.
2. Regress Y on Z; compute residuals.
3. Compute the Pearson correlation between the two residual sets.
Formula (one control variable):

$$r_{XY \cdot Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$$
A worked exam-style example
Given: $r_{\text{GPA,IQ}} = 0.75$, $r_{\text{GPA,hours}} = 0.5$, $r_{\text{IQ,hours}} = 0.5$.

Partial correlation between GPA and IQ controlling for study hours:

$$r_{\text{GPA,IQ} \cdot \text{hours}} = \frac{0.75 - (0.5)(0.5)}{\sqrt{(1 - 0.5^2)(1 - 0.5^2)}} = \frac{0.5}{0.75} \approx 0.67$$
So the GPA–IQ relationship is still strong (0.67) even after accounting for study hours. Study hours explained some of the original 0.75 correlation, but most of it remains. Maya can argue IQ has a relationship with GPA above and beyond what is explained by study hours.
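Both routes, the residual method and the closed-form formula, can be checked against each other. A minimal NumPy sketch on simulated data (the variable names and effect sizes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical data: study hours (z) drive both an IQ-like x and a GPA-like y.
z = rng.normal(size=n)                      # study hours (control variable)
x = 0.5 * z + rng.normal(size=n)            # IQ
y = 0.5 * z + 0.6 * x + rng.normal(size=n)  # GPA

def residuals(a, b):
    """Residuals of a after removing its linear dependence on b."""
    slope, intercept = np.polyfit(b, a, deg=1)
    return a - (slope * b + intercept)

# Method 1: correlate the residuals of x|z and y|z.
r_resid = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

# Method 2: the closed-form formula from the pairwise correlations.
rxy = np.corrcoef(x, y)[0, 1]
rxz = np.corrcoef(x, z)[0, 1]
ryz = np.corrcoef(y, z)[0, 1]
r_formula = (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

print(r_resid, r_formula)  # the two methods agree
```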
Semi-partial correlation
A subtler cousin. Partial correlation removes Z from both X and Y. Semi-partial removes Z from only one of them.
For instance, Maya wants the relationship between tutoring and exam scores, removing the confounding effect of study time only on tutoring (because students who get tutored may have less study time, and she wants tutoring's *unique* contribution to exam scores).
Mechanically:
1. Regress tutoring on study time; compute residuals.
2. Compute the Pearson correlation between those residuals and the raw exam scores.
The result: the relationship between the *unique* part of tutoring (independent of study time) and exam scores in their original form.
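A minimal sketch of the semi-partial recipe, again on simulated data (names and effect sizes invented):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Hypothetical data: study time affects tutoring uptake; both affect scores.
study = rng.normal(size=n)
tutoring = -0.4 * study + rng.normal(size=n)
scores = 0.5 * study + 0.3 * tutoring + rng.normal(size=n)

# Residualise tutoring on study time only (exam scores stay raw).
slope, intercept = np.polyfit(study, tutoring, deg=1)
tutoring_unique = tutoring - (slope * study + intercept)

# Semi-partial correlation: unique part of tutoring vs raw exam scores.
sr = np.corrcoef(tutoring_unique, scores)[0, 1]
print(sr)
```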
Semi-partial correlations are central to interpreting multiple regression — the unique contribution of each predictor beyond the others (Unit 12).
Reliability deepened — quantifying agreement
Back in Unit 2 we listed the kinds of reliability. Now we'll see how each is measured.
Recall the four kinds (plus one more):
- Test-retest — same instrument, same people, two time points.
- Inter-rater — same instrument, different raters, same targets.
- Parallel forms — two equivalent versions of the same instrument.
- Internal consistency — different parts of the same instrument should agree.
- Intra-rater — same rater rating the same things at two different time points.
Each has a quantitative measure. Memorise the pairings.
Cohen's Kappa — two raters, nominal data
Two raters rate items into nominal categories. Just looking at how often they agree isn't enough — they might agree a lot by chance. Cohen's κ corrects for chance agreement.
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ = observed agreement, $p_e$ = expected agreement by chance.
Worked example: 50 images rated yes/no by two raters.
| | R2: Yes | R2: No |
| --- | --- | --- |
| R1: Yes | 4 | 16 |
| R1: No | 15 | 15 |
- Observed agreement: $p_o = (4 + 15)/50 = 0.38$.
- Rater 1 said Yes: $20/50 = 0.40$. Rater 2 said Yes: $19/50 = 0.38$.
- Expected by chance: both Yes = $0.40 \times 0.38 = 0.152$; both No = $0.60 \times 0.62 = 0.372$. Total $p_e = 0.524$.
- $\kappa = \frac{0.38 - 0.524}{1 - 0.524} \approx -0.30$.
Negative κ means less agreement than chance — quite bad.
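The worked example translates directly to code. A minimal NumPy sketch that reproduces the κ above from the 2×2 table:

```python
import numpy as np

# Confusion matrix from the worked example (rows: Rater 1, cols: Rater 2).
table = np.array([[4, 16],
                  [15, 15]], dtype=float)
n = table.sum()

# Observed agreement: proportion of items on the diagonal.
p_o = np.trace(table) / n

# Expected-by-chance agreement from the marginal proportions.
row_marginals = table.sum(axis=1) / n   # Rater 1's Yes/No rates
col_marginals = table.sum(axis=0) / n   # Rater 2's Yes/No rates
p_e = np.sum(row_marginals * col_marginals)

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # ≈ -0.30, worse than chance
```

In practice, `sklearn.metrics.cohen_kappa_score` computes the same quantity directly from two vectors of raw labels.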
Cohen's κ interpretation (Landis & Koch)
| κ | Agreement |
| --- | --- |
| < 0 | worse than chance |
| 0–0.20 | slight |
| 0.20–0.40 | fair |
| 0.40–0.60 | moderate |
| 0.60–0.80 | substantial |
| 0.80–1.00 | almost perfect |
The cousins
- Fleiss' κ — generalisation of Cohen's κ for more than two raters rating into nominal categories.
- Kendall's W (coefficient of concordance) — multiple raters ranking items rather than placing them in nominal categories. Range 0 to 1.
- Krippendorff's α — Swiss army knife: any number of raters, missing data, any measurement level.
- Intra-rater reliability with Pearson — same rater at two time points, continuous data → Pearson r.
Cronbach's alpha — internal consistency
Most common measure of internal consistency — do all items on a test measure the same underlying construct?
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_T^2}\right)$$

where $k$ = number of items, $\sigma_i^2$ = variance of item $i$, $\sigma_T^2$ = variance of total scores.
Intuition: if items are highly intercorrelated, the variance of the total score is much larger than the sum of individual item variances → α high.
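A minimal sketch of the formula on simulated questionnaire data (the `cronbach_alpha` helper and the data-generating choices are mine, not a standard library function):

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_respondents, k_items) array of item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # per-item variances
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-item questionnaire: every item reflects one latent trait.
rng = np.random.default_rng(1)
trait = rng.normal(size=(200, 1))
items = trait + 0.8 * rng.normal(size=(200, 5))  # each item = trait + noise

print(cronbach_alpha(items))  # high alpha, since items share the trait
```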
Interpretation:
| α | Quality |
| --- | --- |
| ≥ 0.9 | excellent (sometimes too high — items may be redundant) |
| 0.8–0.9 | good |
| 0.7–0.8 | acceptable |
| 0.6–0.7 | questionable |
| < 0.6 | poor |
Split-half reliability is a simpler historical alternative. Kuder-Richardson 20/21 is the related index for binary-item tests.
Outliers — when to worry, when to keep
An outlier is a data point unusually distant from the rest. The decision to keep, remove, or transform it is one of the most consequential choices in analysis, and there is no universal rule.
Why outliers happen
1. Measurement or execution errors — instrument malfunction, improper handling, extraction errors.
2. Data entry errors / missing-data codes — someone entered "999" for missing.
3. Data processing errors — accidental scaling, misaligned columns.
4. Sampling errors — mixed data from different sources.
5. Natural variability / novelties — sometimes data is just genuinely unusual. A real IQ prodigy. A miraculous treatment responder.
Errors should be corrected or removed; natural outliers should be respected and investigated.
Detecting outliers
Graphical: scatter plots, box plots (1.5 × IQR rule), histograms.
Numerical: 1.5 × IQR rule, 2 or 3 SDs from mean, Grubbs' test, Tietjen-Moore.
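A minimal sketch of the 1.5 × IQR rule (the data, including the suspicious 999, are made up):

```python
import numpy as np

# Hypothetical sample with one suspicious value (a "999" missing-data code?).
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 999], dtype=float)

# 1.5 x IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [999.] -- flagged for investigation, not silent removal
```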
Dealing with outliers — five strategies
1. Omit — only with strong evidence of error. Document and justify.
2. Replace — winsorise to the nearest non-outlier value (see the sketch below).
3. Use different analysis methods — non-parametric tests (Mann-Whitney, Wilcoxon, Spearman).
4. Value the outliers — keep them and investigate. Sometimes the most interesting data.
5. Transform the data — log / sqrt / Box-Cox to pull in extremes.
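And a sketch of winsorising with SciPy, reusing the same hypothetical data: the extreme value is clipped to its nearest neighbour rather than deleted.

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 999], dtype=float)

# Winsorise the top 10%: the most extreme high value is replaced by
# the next-largest observation instead of being dropped.
clipped = winsorize(data, limits=(0, 0.10))
print(np.asarray(clipped))  # 999 becomes 21; everything else is untouched
```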
The default mistake students make: silently dropping outliers because they look weird. The exam will mark you down. Detect, investigate, decide based on cause, document.
What you carry into the exam
- Correlation is NOT causation — four reasons. Memorise verbatim.
- Pearson r — linear association on continuous Normal data, $-1 \le r \le +1$, sensitive to outliers and blind to non-linearity.
- Spearman ρ — Pearson r on ranks. Monotone, robust, ordinal-friendly.
- r² = proportion of variance shared.
- r is strength; p depends on r AND n. Apparently 'strong' r in small n is suspicious; weak r in large n is real but trivial.
- Partial r removes Z from both X and Y; semi-partial removes from one side only.
- Cohen's κ — chance-corrected nominal agreement, two raters. Fleiss κ for >2; Kendall W for ordinal; Krippendorff α general; Pearson r for intra-rater continuous.
- Cronbach's α — internal consistency. > 0.7 acceptable; > 0.95 may indicate redundancy.
- Outliers — detect, investigate, decide; don't silently drop.
When you're ready, send "next" and we'll move into hypothesis testing — NHST, p-values, Type I and II errors, power, Cohen's d, and the t-test family.