Pearson, Spearman, Partial, Reliability Metrics
Intuition
Correlation is not causation — memorise this once and for all. Two variables can correlate because (1) A causes B, (2) B causes A, (3) a third variable C causes both, or (4) coincidence. Showing co-movement establishes association, not causation. Pearson r measures linear association in $[-1, 1]$; Spearman ρ uses ranks and captures monotone (possibly nonlinear) associations and is robust to outliers; partial correlation isolates the X-Y association after removing a third variable Z. Reliability is the *quantified* version of consistency from Unit 2 — Cohen's κ for inter-rater agreement on nominal data, Cronbach's α for internal consistency.
Explanation
Correlation is NOT causation — the four reasons. Two variables can correlate because: (1) A causes B — smoking → lung cancer. (2) B causes A — lung cancer drives more smoking-for-relief (less plausible here, but for many correlations the direction is real). (3) A third variable C causes both — *ice cream sales* and *drowning deaths* are correlated, but neither causes the other; hot weather causes both. (4) Coincidence / sampling noise — especially in small samples, unrelated variables can show high correlation by chance. Showing a correlation establishes that two things move together. It does not establish why. To claim causation: experiment with randomisation, OR careful causal-inference methods ruling out confounders.
Pearson r — the standard correlation. When both variables are continuous and approximately Normal: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$. Equivalently, $r$ is the average product of z-scores: $r = \frac{1}{n}\sum_i z_{x,i}\, z_{y,i}$. Range −1 to +1. $r = +1$ is perfect positive linear; $r = 0$ is no linear association; $r = -1$ is perfect negative linear. Sign = direction; magnitude = strength.
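As a sketch, Pearson r can be computed straight from the z-score definition and checked against numpy's built-in (`pearson_r` is my own helper; data are made up):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r as the mean product of z-scores (population SDs, ddof=0)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 8]
r = pearson_r(x, y)
print(round(r, 4))   # same value as np.corrcoef(x, y)[0, 1]
```

The normalising constants cancel, so the z-score average and the covariance ratio give the same number.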
Strength interpretation (memorise the cut-offs). $|r| < 0.1$ — negligible. $0.1 \le |r| < 0.3$ — small / weak. $0.3 \le |r| < 0.5$ — moderate. $|r| \ge 0.5$ — large / strong. (Different sources give slightly different cutoffs; the *order* is what matters.)
Coefficient of determination, r². The square of Pearson r. Proportion of variance in Y that is shared with X (and vice versa). $r = 0.7 \Rightarrow r^2 = 0.49$ → '49% of the variance in Y is explained by its linear relationship with X'. Range 0 to 1. **For simple linear regression with one predictor, $R^2$ equals $r^2$ from the correlation** (revisited in Unit 12).
What Pearson r does NOT measure. *Non-linear relationships:* a perfect parabola $y = x^2$ with x symmetric about 0 gives $r = 0$ — the linear fit is flat. Pearson r is blind to curvature. Always look at the scatter plot. *Heteroscedasticity:* Pearson r gives one number even when the spread of y depends on x — misleading.
Pearson r is sensitive to outliers. A single extreme point can drag r dramatically. Two scatter plots can both yield the same $r$ — one because the data hugs a line, the other because two extreme points anchor the slope while the rest is noise. Anscombe's quartet (Unit 4) is the manifesto: identical r, four totally different stories.
Spearman ρ — rank-based alternative. When data are *ordinal*, when the relationship is *monotonic but not linear*, or when *outliers* are wrecking Pearson r, use Spearman ρ. Spearman ρ is just Pearson r computed on the ranks of the data, not the raw values. Replace each x with its rank within x's; each y with its rank within y's; compute Pearson r.
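A minimal sketch of the rank trick, assuming tie-free data (`spearman_rho` is a hypothetical helper, not a library function — scipy's `spearmanr` handles ties properly):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rho = Pearson r computed on ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x)) + 1   # ranks 1..n
    ry = np.argsort(np.argsort(y)) + 1
    return np.corrcoef(rx, ry)[0, 1]

x = np.array([1, 2, 3, 4, 5, 100])   # 100 is an extreme outlier
y = np.array([2, 4, 5, 7, 9, 12])    # strictly increasing with x
print(np.corrcoef(x, y)[0, 1])       # Pearson: dragged well below 1 by the outlier
print(spearman_rho(x, y))            # Spearman: exactly 1.0 — the relation is monotone
```

The double `argsort` converts values to ranks; the outlier 100 becomes simply rank 6.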
Spearman ρ properties. Range $[-1, 1]$, same as Pearson. Monotonicity, not linearity: $\rho = 1$ when each variable is a perfectly increasing function of the other (linear or not); $\rho = -1$ for perfectly decreasing. Robust to outliers — ranks compress extreme values to 'highest' or 'lowest'. Works for ordinal data — no interval-scale assumption.
Kendall τ. Based on concordant vs discordant *pairs*: for each pair $(i, j)$, concordant if x and y move in the same direction, discordant otherwise. $\tau = \frac{C - D}{\binom{n}{2}}$. Even more robust than Spearman; magnitude is typically smaller; favoured at very small n or with many ties.
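The pair-counting definition can be sketched directly (this is the tau-a variant, assuming no ties; scipy's `kendalltau` uses a tie-corrected variant):

```python
from itertools import combinations

def kendall_tau(x, y):
    """tau-a: (concordant - discordant) / C(n, 2); assumes no ties."""
    c = d = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            c += 1    # pair moves in the same direction
        elif s < 0:
            d += 1    # pair moves in opposite directions
    n = len(x)
    return (c - d) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # 0.6 (8 concordant, 2 discordant)
```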
Pearson vs Spearman — when to use which. *Both interval/ratio + approximately Normal + no outliers + linear relationship* → Pearson r. *Ordinal data OR non-normal OR outliers OR monotonic-non-linear* → Spearman ρ. If both are valid, they give similar values; Pearson is slightly more powerful when its assumptions hold. Exam classic: 'why is Spearman less sensitive to outliers?' — because ranks compress extreme values to merely the highest or lowest rank.
Statistical significance of correlation. $H_0: \rho = 0$. Test statistic $t = r\sqrt{\frac{n-2}{1-r^2}}$, with $df = n - 2$. Larger n → more df → more power. *Practical consequences:* with $n = 10$, you can get $r \approx 0.5$ just from noise. With $n = 10{,}000$, $r = 0.03$ — barely visible to the eye — can be statistically significant.
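A sketch of the t-test for a correlation, using `scipy.stats.t` for the p-value (`corr_t_test` is my own helper; the two cases illustrate the n-dependence):

```python
import math
from scipy import stats

def corr_t_test(r, n):
    """t = r * sqrt((n - 2) / (1 - r**2)), df = n - 2, two-tailed p."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

print(corr_t_test(0.5, 10))      # r = 0.5 with n = 10: p > 0.05, not significant
print(corr_t_test(0.03, 10000))  # r = 0.03 with n = 10,000: p < 0.01, significant
```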
Strength vs significance — two separate things. Strength = magnitude of $r$. Significance = p-value, which depends on $r$ AND $n$. Apparently 'strong' correlations in small samples may not be significant; weak correlations in large samples may be highly significant. This is the same statistical-vs-practical distinction from Unit 7. Always report effect size (here: $r$) alongside p.
Critical r values. For given n and α, there's a critical r above which the correlation is significant. Example: n = 30, two-tailed α = 0.05 → critical r ≈ 0.36. One-tailed → critical r ≈ 0.31. Tables and R's cor.test() provide these.
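The critical values can be recovered by inverting the t formula: solving $t = r\sqrt{(n-2)/(1-r^2)}$ for r gives $r_{\text{crit}} = t_{\text{crit}}/\sqrt{n - 2 + t_{\text{crit}}^2}$. A sketch (`critical_r` is my own helper):

```python
import math
from scipy import stats

def critical_r(n, alpha=0.05, two_tailed=True):
    """Smallest |r| that reaches significance for sample size n."""
    q = 1 - alpha / 2 if two_tailed else 1 - alpha
    t_crit = stats.t.ppf(q, df=n - 2)
    return t_crit / math.sqrt(n - 2 + t_crit ** 2)

print(round(critical_r(30), 2))                    # 0.36, matching the table value
print(round(critical_r(30, two_tailed=False), 2))  # 0.31
```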
**Partial correlation $r_{XY\cdot Z}$.** The correlation between X and Y *after removing the linear effect of Z from both*. Tests: 'does X relate to Y *beyond* what Z explains?' Mechanically: (1) regress X on Z, get residuals $e_X$; (2) regress Y on Z, get residuals $e_Y$; (3) Pearson correlation of $e_X$ and $e_Y$. Formula (one control): $r_{XY\cdot Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$.
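Both routes — residualising and the closed form — can be sketched on synthetic data where X and Y are related only through Z; the two answers agree to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)   # X and Y share only the confounder Z
y = 0.7 * z + rng.normal(size=n)

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

# Route 1: residualise both variables on Z, correlate the leftovers
e_x = x - np.polyval(np.polyfit(z, x, 1), z)
e_y = y - np.polyval(np.polyfit(z, y, 1), z)
partial_resid = r(e_x, e_y)

# Route 2: closed form from the three pairwise correlations
rxy, rxz, ryz = r(x, y), r(x, z), r(y, z)
partial_formula = (rxy - rxz * ryz) / np.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

print(rxy)                             # raw correlation: clearly positive
print(partial_resid, partial_formula)  # both near zero once Z is removed
```

The raw correlation is spurious in this setup (driven entirely by Z), so the partial correlation collapses toward zero.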
Partial correlation assumptions. All pairs have linear relationships. Independent observations. Bivariate Normal pairs (each variable approximately Normal). If violated, use partial Spearman.
Semi-partial (part) correlation. Removes the effect of Z from *only one* of X and Y. $r_{Y(X\cdot Z)}$ = correlation between Y and the residual of X after removing Z. Used in multiple regression to interpret the unique contribution of each predictor beyond the others (Unit 12). Asymmetric — different from partial.
Tutoring vs exam scores — worked example. Maya wants the relationship between *tutoring* (X) and *exam scores* (Y) while removing the confounding effect of *study time* (Z) from tutoring only. Regress tutoring on study time; correlate the tutoring-residuals with raw exam scores. **Result: tutoring's *unique* contribution to exam scores, independent of study time.**
Reliability deepened — Cohen's κ for nominal data. Two raters rate the same items into nominal categories. Just looking at observed agreement isn't enough — they might agree a lot by chance. Cohen's κ corrects for chance agreement: $\kappa = \frac{p_o - p_e}{1 - p_e}$. $p_o$ = proportion of observed agreement; $p_e$ = proportion expected by chance from the rater marginals.
Cohen's κ interpretation (Landis & Koch). $\kappa < 0$: worse than chance. $0.00$–$0.20$: slight. $0.21$–$0.40$: fair. $0.41$–$0.60$: moderate. $0.61$–$0.80$: substantial. $0.81$–$1.00$: almost perfect. Used when: two raters, nominal data; OR same rater across time points (intra-rater).
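A sketch of κ from a contingency table (counts are made up for illustration; `cohens_kappa` is my own helper):

```python
import numpy as np

def cohens_kappa(table):
    """kappa from a square contingency table (rows = rater 1, cols = rater 2)."""
    table = np.asarray(table, float)
    n = table.sum()
    p_o = np.trace(table) / n                       # observed agreement (diagonal)
    p_e = (table.sum(1) / n) @ (table.sum(0) / n)   # chance agreement from marginals
    return (p_o - p_e) / (1 - p_e)

# hypothetical counts: 40 both-yes, 5 + 5 disagreements, 50 both-no
print(round(cohens_kappa([[40, 5], [5, 50]]), 3))   # 0.798: 'substantial' on Landis-Koch
```

Note how raw agreement here is 90%, yet κ is only ≈ 0.80 — the chance correction matters.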
Fleiss' κ. Generalisation of Cohen's κ for *more than two raters* on nominal data. Same idea: observed agreement corrected for chance.
Kendall's W (coefficient of concordance). Multiple raters *ranking* items rather than placing them into nominal categories. Range 0 (no agreement) to 1 (perfect agreement). Use for ordinal data with multiple raters.
Krippendorff's α. General-purpose reliability coefficient handling any number of raters, missing data, and all four measurement levels (nominal/ordinal/interval/ratio). The Swiss army knife of inter-rater reliability.
Intra-rater reliability with Pearson. Same rater measures same items at two time points; for continuous data, use Pearson r between the two sets of ratings. Same approach works for parallel-forms reliability with continuous measures.
Cronbach's α — internal consistency. Most common measure: do all items on a test measure the same underlying construct? Formula: $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_T^2}\right)$ where $k$ = number of items, $\sigma_i^2$ = variance of item $i$, $\sigma_T^2$ = variance of total scores. Intuition: if items are highly intercorrelated, the variance of the total is much larger than the sum of item variances → α is high.
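A sketch of α from a respondents × items score matrix, on synthetic data where every item loads on one shared construct (`cronbach_alpha` is my own helper):

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_respondents, k_items). alpha = k/(k-1) * (1 - sum item var / total var)."""
    scores = np.asarray(scores, float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)          # per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
trait = rng.normal(size=300)                        # the shared construct
# five items = construct plus item-specific noise -> strongly intercorrelated
items = np.column_stack([trait + 0.5 * rng.normal(size=300) for _ in range(5)])
print(round(cronbach_alpha(items), 2))              # high alpha: items hang together
```

Feeding the same function pure noise (uncorrelated items) drives α toward 0, matching the intuition above.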
Cronbach's α interpretation. $\alpha \ge 0.9$: excellent (sometimes too high — items may be redundant). $0.8 \le \alpha < 0.9$: good. $0.7 \le \alpha < 0.8$: acceptable. $0.6 \le \alpha < 0.7$: questionable. $\alpha < 0.6$: poor. Split-half and Kuder-Richardson 20/21 are related alternatives (KR for binary-item tests).
Outliers — detection and treatment. Causes: (1) measurement / execution errors; (2) data-entry errors (999 for missing); (3) data processing errors; (4) sampling errors (mixed sources); (5) natural variability / novelties. Detection: scatter plots, boxplots (1.5 × IQR rule), histograms, 2–3 SDs from mean, Grubbs' test, Tietjen-Moore. Treatment: *omit* (with strong evidence of error and documentation), *replace* (winsorise / impute), *use robust methods* (nonparametric, MAD), *value the outlier* (sometimes the most interesting datum), *transform*. The default mistake students make: silently dropping outliers because they look weird. Detect, investigate, decide based on cause, document.
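The 1.5 × IQR boxplot rule from the detection list above can be sketched in a few lines (the 999 value is a deliberately planted missing-value code, the classic data-entry artefact):

```python
import numpy as np

def iqr_outliers(x):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's boxplot rule)."""
    x = np.asarray(x, float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lo) | (x > hi)]

data = [12, 14, 14, 15, 16, 17, 18, 19, 999]   # 999: missing-value code, not a measurement
print(iqr_outliers(data))                      # [999.]
```

Detection is the easy part; the rule flags the point, but deciding what to do with it still requires investigating its cause.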
Definitions
- Pearson r — Standardised covariance: $r = \frac{\operatorname{cov}(X, Y)}{s_X s_Y}$. Range $[-1, 1]$. Captures linear association. Assumes continuous, approximately Normal, no extreme outliers.
- Spearman ρ — Pearson r computed on the ranks of X and Y. Captures monotone (not necessarily linear) association. Robust to outliers. Works for ordinal data.
- Kendall τ — (Concordant − discordant) / total pairs. Robust ordinal association measure. Smaller magnitude than Spearman; preferred for small n with many ties.
- r² (coefficient of determination) — Proportion of variance in Y shared with X (and vice versa). Range $[0, 1]$. Equals R² in simple linear regression.
- Partial correlation — Correlation between X and Y after removing the linear effect of Z from both. Tests 'does X relate to Y beyond what Z explains?'
- Semi-partial (part) correlation — Correlation between Y (or X) and the residual of X (or Y) after removing Z. Z is stripped from only one side. Asymmetric.
- Correlation ≠ causation — Four reasons two variables can correlate: A → B, B → A, C → both, coincidence. Correlation establishes association, not causation.
- Cohen's κ — $\kappa = \frac{p_o - p_e}{1 - p_e}$. Inter-rater agreement above chance for nominal data with two raters. Negative if worse than chance.
- Fleiss' κ — Cohen's κ generalised to more than two raters on nominal data.
- Kendall's W — Coefficient of concordance for multiple raters ranking items (ordinal data).
- Krippendorff's α — General-purpose reliability coefficient: any number of raters, missing data, all measurement levels.
- Intra-rater reliability — Same rater measuring same items at two time points. Pearson r for continuous; Cohen's κ for nominal.
- Cronbach's α — $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)$. Internal consistency of a k-item scale. > 0.70 acceptable; > 0.95 suggests redundancy.
- Split-half reliability — Split items into two halves; compute the correlation between half scores. Spearman-Brown corrects for the length effect.
- Kuder-Richardson 20/21 — Internal consistency for binary-item tests. KR-20 for varying item difficulties; KR-21 assumes equal difficulties.
- Outlier — Data point unusually distant from the rest. Caused by errors (correct/remove), processing mistakes, or natural variability (respect and investigate).
Formulas
Derivations
Pearson r as the average product of z-scores. $r = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\, z_{y,i}$ where $z_{x,i} = \frac{x_i - \bar{x}}{s_x}$ and similarly for y. Multiplying out: $r = \frac{1}{n}\sum_i \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} = \frac{\operatorname{cov}(x, y)}{s_x s_y}$. Hence r is the standardised covariance — unitless because z-scores are unitless. Range $[-1, 1]$ by the Cauchy-Schwarz inequality.
Why r² = proportion of variance shared. In simple linear regression $\hat{y}_i = a + b x_i$, $R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$. With $b = r\frac{s_y}{s_x}$, plugging in gives $SS_{\text{res}} = (1 - r^2)\,SS_{\text{tot}}$, so $R^2 = r^2$. The square of r is the proportion of Y's variance explained by X (and vice versa, by symmetry).
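The identity is easy to check numerically — fit OLS on synthetic data, compute $R^2$ from the residuals, and compare with the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)    # linear signal plus noise

b, a = np.polyfit(x, y, 1)            # OLS slope and intercept
y_hat = a + b * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
R2 = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(R2, r ** 2)                     # identical (up to float noise) for one predictor
```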
Partial correlation via residuals. Regress X on Z: $\hat{x}_i = a_1 + b_1 z_i$. Regress Y on Z: $\hat{y}_i = a_2 + b_2 z_i$. The residuals $e_{X,i} = x_i - \hat{x}_i$ and $e_{Y,i} = y_i - \hat{y}_i$ are 'the part of X (Y) not explained by Z'. **Partial correlation $r_{XY\cdot Z} = r(e_X, e_Y)$.** Algebra produces the closed form: $r_{XY\cdot Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$.
Why Spearman is robust to outliers. Replacing $x_i$ with its rank maps the largest value to rank $n$, regardless of how extreme. An outlier at, say, $x = 10^6$ becomes rank $n$ — the same as if it were 'slightly above the next-largest value'. The numerical extremity is discarded; only the ordinal information remains. Hence one extreme point can shift Pearson r dramatically but barely moves Spearman ρ.
Cohen's κ chance-corrects observed agreement. Without correction, observed agreement confuses true rater concordance with chance agreement. If rater 1 says yes 60% of the time and rater 2 says yes 65%, they'll agree 'yes' by chance with probability $0.60 \times 0.65 = 0.39$, plus 'no' agreement by chance $0.40 \times 0.35 = 0.14$, total $p_e = 0.53$. Subtracting and rescaling: $\kappa = \frac{p_o - p_e}{1 - p_e}$ gives 0 when agreement = chance and 1 when perfect.
Cronbach's α via item variances. If all $k$ items perfectly measure the same construct (equal variances $\sigma_i^2$, perfect correlation), item scores covary maximally, so $\sigma_T^2 = k^2 \sigma_i^2$ while $\sum_i \sigma_i^2 = k\sigma_i^2$, giving $\alpha = \frac{k}{k-1}\left(1 - \frac{1}{k}\right) = 1$. As items become uncorrelated (each measuring something different), $\sigma_T^2 \to \sum_i \sigma_i^2$ (by independence), so $\alpha \to 0$. Hence α approaches 0 when items are uncorrelated and 1 when perfectly correlated. The factor $\frac{k}{k-1}$ adjusts the scale so that α lands on [0, 1].
Examples
- r = 0.5 → $r^2 = 0.25$ → '25% of variance in Y explained by X'. Moderate strength.
- Pearson r misleads. Anscombe Set II has $r \approx 0.82$ but the relationship is a perfect parabola — Pearson reports 'strong linear' incorrectly.
- Outlier impact. 10 data points tightly along an increasing line (r ≈ 0.99). Add one outlier at (100, 0) → Pearson r collapses, dragged by the single extreme point (it can even flip sign). Spearman ρ is hit far less — the outlier contributes just one maximally extreme rank pair, not a dominating lever.
- Spearman ρ with monotone non-linear. Data: $x = 1, \dots, 10$, $y = x^3$ — a perfect monotone curve. Pearson r ≈ 0.93 (still high but not 1). Spearman ρ = 1 exactly — perfectly monotonic.
- Significance vs strength. $r = 0.85$, $n = 6$: t ≈ 3.2, p ≈ 0.03 — significant but n too small to trust. $r = 0.032$, $n = 10{,}000$: t ≈ 3.2, p < 0.01 — significant but trivial effect.
- Partial correlation worked example. $r_{\text{GPA,IQ}} = 0.75$, $r_{\text{GPA,study}} = 0.50$, $r_{\text{IQ,study}} = 0.50$. Partial $r_{\text{GPA,IQ}\cdot\text{study}} = \frac{0.75 - 0.50 \times 0.50}{\sqrt{(1 - 0.25)(1 - 0.25)}} = \frac{0.50}{0.75} \approx 0.67$. The GPA-IQ relationship is still strong (0.67) after controlling for study hours — study time accounted for some of the original 0.75, but most remains.
- Spurious correlation. *Ice cream sales × drowning deaths*: correlated, but both driven by *summer*. Controlling for temperature (partial) wipes out the apparent relationship.
- Cohen's κ worked example. 50 images rated yes/no by two raters. Contingency: 4 both-yes, 16 R1-yes-R2-no, 15 R1-no-R2-yes, 15 both-no. $p_o = (4 + 15)/50 = 0.38$. R1 yes proportion = 20/50 = 0.40; R2 yes = 19/50 = 0.38. $p_e = 0.40 \times 0.38 + 0.60 \times 0.62 = 0.152 + 0.372 = 0.524$. $\kappa = \frac{0.38 - 0.524}{1 - 0.524} \approx -0.30$. Negative κ = worse than chance — these raters systematically disagree.
- Cronbach's α typical. PHQ-9 (9 depression items): α ≈ 0.86 — good internal consistency. Scale items measure the same construct.
- Cronbach's α too high. A 10-item scale with α = 0.97 — likely redundant items (paraphrases of each other). Drop near-duplicate items.
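The κ and partial-correlation worked examples above can be replayed numerically (for the partial example I take both control correlations as 0.50, which reproduces the quoted 0.67):

```python
import math

# kappa example: 50 images; 4 both-yes, 16 + 15 disagreements, 15 both-no
p_o = (4 + 15) / 50                                   # observed agreement = 0.38
p_e = (20 / 50) * (19 / 50) + (30 / 50) * (31 / 50)   # chance agreement from marginals
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))     # -0.3: worse than chance

# partial correlation example: r_xy = 0.75, r_xz = r_yz = 0.50 (assumed controls)
partial = (0.75 - 0.50 * 0.50) / math.sqrt((1 - 0.50 ** 2) * (1 - 0.50 ** 2))
print(round(partial, 2))   # 0.67
```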
Diagrams
- Pearson r examples. Scatterplots labelled r = 0, 0.3, 0.5, 0.7, 0.9, −0.9. Show the visual progression.
- Pearson misleads. Two scatterplots with identical $r$: one is a clean line, the other is a noisy cloud with two extreme leverage points. Lesson: plot before trusting r.
- r = 0 ≠ independence. Perfect parabola scatter with r ≈ 0. Strong relationship, zero linear association.
- Spearman vs Pearson under outliers. Data along plus one outlier; Pearson drops from 0.99 to 0.5; Spearman stays near 0.99.
- Partial correlation via residuals. Three-step diagram: regress X on Z; regress Y on Z; correlate the two residual sets.
- Cohen's κ contingency table. 2×2 with marginals. Show as sum of diagonal / total; as product of marginals.
- Cronbach's α illustration. Items correlate strongly → high α. Items uncorrelated → α near 0. Animation of α as inter-item correlation increases.
Edge cases
- r = 0 ≠ independence. Only no *linear* association. $y = x^2$ on a symmetric x-range has r ≈ 0 with perfect functional dependence.
- Outliers wildly inflate or deflate Pearson r — always check with Spearman or visualise.
- Ceiling / floor effects truncate distributions and shrink r.
- Range restriction (sampling only narrow X) reduces observable r. Classic: SAT-GPA correlation looks weaker among admitted students than population.
- Heteroscedasticity — Pearson r still computes but is misleading when y-spread depends on x.
- Small n — high r can occur by chance; require both magnitude and significance.
- Negative κ — raters disagree more than chance; investigate definitions.
- High Cronbach's α with low construct validity — items are internally consistent but measuring the wrong construct (e.g., reading-comprehension scale items all heavily correlate with vocabulary, not comprehension).
- Multicollinearity in partial correlation — when Z is nearly perfectly correlated with X or Y, the denominator goes to zero, partial r becomes unstable.
Common mistakes
- Inferring causation from correlation. The single most-flagged error.
- Computing Pearson r on heavy outliers — use Spearman or remove the outlier with justification.
- Treating r = 0 as independence — only no *linear* association.
- Confusing partial and semi-partial correlations — partial residualises both X and Y; semi-partial only one. Different denominators in the formulas.
- Reporting a high r without checking n — small n + high r is suspicious.
- Reporting only p without effect size r — significance ≠ strength.
- Misreporting r² as a percentage without converting — r² is already a proportion (0 to 1); multiply by 100 only if reporting it as a percentage.
- Treating Cohen's κ as observed agreement — κ corrects for chance, so it may be much smaller than the raw $p_o$.
- Reporting Cronbach's α = 0.95 as 'excellent' — likely indicates item redundancy; the items aren't independently measuring different facets.
- Silently dropping outliers — detect, investigate cause, document, then decide.
Shortcuts
- r² = shared variance between X and Y.
- Pearson for continuous + Normal + linear; Spearman / Kendall for ordinal, monotone-non-linear, or outlier-prone.
- Partial: strip Z from both X and Y. Semi-partial: strip Z from one side only.
- Cohen's κ > 0.75 ≈ excellent; κ > 0.40 is acceptable for many domains.
- Cronbach's α > 0.70 acceptable; > 0.95 suggests redundancy.
- Correlation ≠ causation — four reasons: A→B, B→A, C→both, coincidence.
- r is strength; p is significance — both depend on n in different ways.
Proofs / Algorithms
**Pearson r is bounded in $[-1, 1]$.** By the Cauchy-Schwarz inequality, $\left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$. Applying with $a_i = x_i - \bar{x}$, $b_i = y_i - \bar{y}$: the numerator of r is bounded in absolute value by the denominator. Hence $|r| \le 1$. Equality holds iff $y_i - \bar{y} = c\,(x_i - \bar{x})$ for some constant c — i.e., y is a perfect linear function of x.
**$R^2 = r^2$ for simple linear regression.** With $\hat{y}_i = a + b x_i$ and $b = r\,\frac{s_y}{s_x}$ (OLS), the fitted value at $x_i$ is $\hat{y}_i = \bar{y} + r\frac{s_y}{s_x}(x_i - \bar{x})$. Residual SS: $SS_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2 = (1 - r^2)\sum_i (y_i - \bar{y})^2 = (1 - r^2)\,SS_{\text{tot}}$. Hence $R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = r^2$.
Partial correlation formula derivation. Regress X on Z: $\hat{x}_i = a_1 + b_1 z_i$ with residuals $e_{X,i} = x_i - \hat{x}_i$. Regress Y on Z: $\hat{y}_i = a_2 + b_2 z_i$ with residuals $e_{Y,i} = y_i - \hat{y}_i$. The correlation of residuals is $r_{XY\cdot Z} = \frac{\operatorname{cov}(e_X, e_Y)}{s_{e_X} s_{e_Y}}$. After algebra (substituting $b_1 = r_{XZ}\frac{s_X}{s_Z}$ and $b_2 = r_{YZ}\frac{s_Y}{s_Z}$): $r_{XY\cdot Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$. The denominator terms $1 - r_{XZ}^2$ and $1 - r_{YZ}^2$ are the fractions of variance of X and Y that remain after Z is regressed out.