Pearson, Spearman, Partial, Reliability Metrics
Intuition
Correlation is not causation — memorise this once and for all. Two variables can correlate because (1) A causes B, (2) B causes A, (3) a third variable C causes both, or (4) coincidence. Showing co-movement establishes association, not causation. Pearson r measures linear association in $[-1, 1]$; Spearman ρ uses ranks and captures monotone (possibly nonlinear) associations and is robust to outliers; partial correlation isolates the X-Y association after removing a third variable Z. Reliability is the *quantified* version of consistency from Unit 2 — Cohen's κ for inter-rater agreement on nominal data, Cronbach's α for internal consistency.
Explanation
Correlation is NOT causation — the four reasons. Two variables can correlate because: (1) A causes B — smoking → lung cancer. (2) B causes A — lung cancer drives more smoking-for-relief (less plausible here, but for many correlations the direction is real). (3) A third variable C causes both — *ice cream sales* and *drowning deaths* are correlated, but neither causes the other; hot weather causes both. (4) Coincidence / sampling noise — especially in small samples, unrelated variables can show high correlation by chance. Showing a correlation establishes that two things move together. It does not establish why. To claim causation: experiment with randomisation, OR careful causal-inference methods ruling out confounders.
Pearson r — the standard correlation. When both variables are continuous and approximately Normal: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$. Equivalently, $r$ is the average product of z-scores: $r = \frac{1}{n}\sum_i z_{x,i}\, z_{y,i}$. Range −1 to +1. $r = +1$ is perfect positive linear; $r = 0$ is no linear association; $r = -1$ is perfect negative linear. Sign = direction; magnitude = strength.
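As a sketch, Pearson r can be computed straight from the z-score definition and checked against numpy's built-in (`pearson_r` is my own helper; data are made up):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r as the mean product of z-scores (population SDs, ddof=0)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 8]
r = pearson_r(x, y)
print(round(r, 4))   # same value as np.corrcoef(x, y)[0, 1]
```

The normalising constants cancel, so the z-score average and the covariance ratio give the same number.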
Strength interpretation (memorise the cut-offs). $|r| < 0.1$ — negligible. $0.1 \le |r| < 0.3$ — small / weak. $0.3 \le |r| < 0.5$ — moderate. $|r| \ge 0.5$ — large / strong. (Different sources give slightly different cutoffs; the *order* is what matters.)
Coefficient of determination, r². The square of Pearson r. Proportion of variance in Y that is shared with X (and vice versa). $r = 0.7 \Rightarrow r^2 = 0.49$ → '49% of the variance in Y is explained by its linear relationship with X'. Range 0 to 1. **For simple linear regression with one predictor, $R^2$ equals $r^2$ from the correlation** (revisited in Unit 12).
What Pearson r does NOT measure. *Non-linear relationships:* a perfect parabola $y = x^2$ with x symmetric about 0 gives $r = 0$ — the linear fit is flat. Pearson r is blind to curvature. Always look at the scatter plot. *Heteroscedasticity:* Pearson r gives one number even when the spread of y depends on x — misleading.
Pearson r is sensitive to outliers. A single extreme point can drag r dramatically. Two scatter plots can both yield the same $r$ — one because the data hugs a line, the other because two extreme points anchor the slope while the rest is noise. Anscombe's quartet (Unit 4) is the manifesto: identical r, four totally different stories.
Spearman ρ — rank-based alternative. When data are *ordinal*, when the relationship is *monotonic but not linear*, or when *outliers* are wrecking Pearson r, use Spearman ρ. Spearman ρ is just Pearson r computed on the ranks of the data, not the raw values. Replace each x with its rank within x's; each y with its rank within y's; compute Pearson r.
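A minimal sketch of the rank trick, assuming tie-free data (`spearman_rho` is a hypothetical helper, not a library function — scipy's `spearmanr` handles ties properly):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rho = Pearson r computed on ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x)) + 1   # ranks 1..n
    ry = np.argsort(np.argsort(y)) + 1
    return np.corrcoef(rx, ry)[0, 1]

x = np.array([1, 2, 3, 4, 5, 100])   # 100 is an extreme outlier
y = np.array([2, 4, 5, 7, 9, 12])    # strictly increasing with x
print(np.corrcoef(x, y)[0, 1])       # Pearson: dragged well below 1 by the outlier
print(spearman_rho(x, y))            # Spearman: exactly 1.0 — the relation is monotone
```

The double `argsort` converts values to ranks; the outlier 100 becomes simply rank 6.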
Spearman ρ properties. Range $[-1, 1]$, same as Pearson. Monotonicity, not linearity: $\rho = 1$ when each variable is a perfectly increasing function of the other (linear or not); $\rho = -1$ for perfectly decreasing. Robust to outliers — ranks compress extreme values to 'highest' or 'lowest'. Works for ordinal data — no interval-scale assumption.
Kendall τ. Based on concordant vs discordant *pairs*: for each pair $(i, j)$, concordant if x and y move in the same direction, discordant otherwise. $\tau = \frac{C - D}{\binom{n}{2}}$. Even more robust than Spearman; magnitude is typically smaller; favoured at very small n or with many ties.
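The pair-counting definition can be sketched directly (this is the tau-a variant, assuming no ties; scipy's `kendalltau` uses a tie-corrected variant):

```python
from itertools import combinations

def kendall_tau(x, y):
    """tau-a: (concordant - discordant) / C(n, 2); assumes no ties."""
    c = d = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            c += 1    # pair moves in the same direction
        elif s < 0:
            d += 1    # pair moves in opposite directions
    n = len(x)
    return (c - d) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # 0.6 (8 concordant, 2 discordant)
```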
Pearson vs Spearman — when to use which. *Both interval/ratio + approximately Normal + no outliers + linear relationship* → Pearson r. *Ordinal data OR non-normal OR outliers OR monotonic-non-linear* → Spearman ρ. If both are valid, they give similar values; Pearson is slightly more powerful when its assumptions hold. Exam classic: 'why is Spearman less sensitive to outliers?' — because ranks compress extreme values to merely the highest or lowest rank.
Statistical significance of correlation. $H_0: \rho = 0$. Test statistic $t = r\sqrt{\frac{n-2}{1-r^2}}$, with $df = n - 2$. Larger n → more df → more power. *Practical consequences:* with $n = 10$, you can get $r \approx 0.5$ just from noise. With $n = 10{,}000$, $r = 0.03$ — barely visible to the eye — can be statistically significant.
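A sketch of the t-test for a correlation, using `scipy.stats.t` for the p-value (`corr_t_test` is my own helper; the two cases illustrate the n-dependence):

```python
import math
from scipy import stats

def corr_t_test(r, n):
    """t = r * sqrt((n - 2) / (1 - r**2)), df = n - 2, two-tailed p."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

print(corr_t_test(0.5, 10))      # r = 0.5 with n = 10: p > 0.05, not significant
print(corr_t_test(0.03, 10000))  # r = 0.03 with n = 10,000: p < 0.01, significant
```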
Strength vs significance — two separate things. Strength = magnitude of $r$. Significance = p-value, which depends on $r$ AND $n$. Apparently 'strong' correlations in small samples may not be significant; weak correlations in large samples may be highly significant. This is the same statistical-vs-practical distinction from Unit 7. Always report effect size (here: $r$) alongside p.
Critical r values. For given n and α, there's a critical r above which the correlation is significant. Example: n = 30, two-tailed α = 0.05 → critical r ≈ 0.36. One-tailed → critical r ≈ 0.31. Tables and R's cor.test() provide these.
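The critical values can be recovered by inverting the t formula: solving $t = r\sqrt{(n-2)/(1-r^2)}$ for r gives $r_{\text{crit}} = t_{\text{crit}}/\sqrt{n - 2 + t_{\text{crit}}^2}$. A sketch (`critical_r` is my own helper):

```python
import math
from scipy import stats

def critical_r(n, alpha=0.05, two_tailed=True):
    """Smallest |r| that reaches significance for sample size n."""
    q = 1 - alpha / 2 if two_tailed else 1 - alpha
    t_crit = stats.t.ppf(q, df=n - 2)
    return t_crit / math.sqrt(n - 2 + t_crit ** 2)

print(round(critical_r(30), 2))                    # 0.36, matching the table value
print(round(critical_r(30, two_tailed=False), 2))  # 0.31
```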
**Partial correlation $r_{XY\cdot Z}$.** The correlation between X and Y *after removing the linear effect of Z from both*. Tests: 'does X relate to Y *beyond* what Z explains?' Mechanically: (1) regress X on Z, get residuals $e_X$; (2) regress Y on Z, get residuals $e_Y$; (3) Pearson correlation of $e_X$ and $e_Y$. Formula (one control): $r_{XY\cdot Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$.
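Both routes — residualising and the closed form — can be sketched on synthetic data where X and Y are related only through Z; the two answers agree to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)   # X and Y share only the confounder Z
y = 0.7 * z + rng.normal(size=n)

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

# Route 1: residualise both variables on Z, correlate the leftovers
e_x = x - np.polyval(np.polyfit(z, x, 1), z)
e_y = y - np.polyval(np.polyfit(z, y, 1), z)
partial_resid = r(e_x, e_y)

# Route 2: closed form from the three pairwise correlations
rxy, rxz, ryz = r(x, y), r(x, z), r(y, z)
partial_formula = (rxy - rxz * ryz) / np.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

print(rxy)                             # raw correlation: clearly positive
print(partial_resid, partial_formula)  # both near zero once Z is removed
```

The raw correlation is spurious in this setup (driven entirely by Z), so the partial correlation collapses toward zero.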
Partial correlation assumptions. All pairs have linear relationships. Independent observations. Bivariate Normal pairs (each variable approximately Normal). If violated, use partial Spearman.
Semi-partial (part) correlation. Removes the effect of Z from *only one* of X and Y. $r_{Y(X\cdot Z)}$ = correlation between Y and the residual of X after removing Z. Used in multiple regression to interpret the unique contribution of each predictor beyond the others (Unit 12). Asymmetric — different from partial.
Tutoring vs exam scores — worked example. Maya wants the relationship between *tutoring* (X) and *exam scores* (Y) while removing the confounding effect of *study time* (Z) from tutoring only. Regress tutoring on study time; correlate the tutoring-residuals with raw exam scores. **Result: tutoring's *unique* contribution to exam scores, independent of study time.**
Reliability deepened — Cohen's κ for nominal data. Two raters rate the same items into nominal categories. Just looking at observed agreement isn't enough — they might agree a lot by chance. Cohen's κ corrects for chance agreement: $\kappa = \frac{p_o - p_e}{1 - p_e}$. $p_o$ = proportion of observed agreement; $p_e$ = proportion expected by chance from the rater marginals.
Cohen's κ interpretation (Landis & Koch). $\kappa < 0$: worse than chance. $0.00$–$0.20$: slight. $0.21$–$0.40$: fair. $0.41$–$0.60$: moderate. $0.61$–$0.80$: substantial. $0.81$–$1.00$: almost perfect. Used when: two raters, nominal data; OR same rater across time points (intra-rater).
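A sketch of κ from a contingency table (counts are made up for illustration; `cohens_kappa` is my own helper):

```python
import numpy as np

def cohens_kappa(table):
    """kappa from a square contingency table (rows = rater 1, cols = rater 2)."""
    table = np.asarray(table, float)
    n = table.sum()
    p_o = np.trace(table) / n                       # observed agreement (diagonal)
    p_e = (table.sum(1) / n) @ (table.sum(0) / n)   # chance agreement from marginals
    return (p_o - p_e) / (1 - p_e)

# hypothetical counts: 40 both-yes, 5 + 5 disagreements, 50 both-no
print(round(cohens_kappa([[40, 5], [5, 50]]), 3))   # 0.798: 'substantial' on Landis-Koch
```

Note how raw agreement here is 90%, yet κ is only ≈ 0.80 — the chance correction matters.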
Fleiss' κ. Generalisation of Cohen's κ for *more than two raters* on nominal data. Same idea: observed agreement corrected for chance.
Kendall's W (coefficient of concordance). Multiple raters *ranking* items rather than placing them into nominal categories. Range 0 (no agreement) to 1 (perfect agreement). Use for ordinal data with multiple raters.
Krippendorff's α. General-purpose reliability coefficient handling any number of raters, missing data, and all four measurement levels (nominal/ordinal/interval/ratio). The Swiss army knife of inter-rater reliability.
Intra-rater reliability with Pearson. Same rater measures same items at two time points; for continuous data, use Pearson r between the two sets of ratings. Same approach works for parallel-forms reliability with continuous measures.
Cronbach's α — internal consistency. Most common measure: do all items on a test measure the same underlying construct? Formula: $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_T^2}\right)$ where $k$ = number of items, $\sigma_i^2$ = variance of item $i$, $\sigma_T^2$ = variance of total scores. Intuition: if items are highly intercorrelated, the variance of the total is much larger than the sum of item variances → α is high.
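A sketch of α from a respondents × items score matrix, on synthetic data where every item loads on one shared construct (`cronbach_alpha` is my own helper):

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_respondents, k_items). alpha = k/(k-1) * (1 - sum item var / total var)."""
    scores = np.asarray(scores, float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)          # per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
trait = rng.normal(size=300)                        # the shared construct
# five items = construct plus item-specific noise -> strongly intercorrelated
items = np.column_stack([trait + 0.5 * rng.normal(size=300) for _ in range(5)])
print(round(cronbach_alpha(items), 2))              # high alpha: items hang together
```

Feeding the same function pure noise (uncorrelated items) drives α toward 0, matching the intuition above.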
Cronbach's α interpretation. $\alpha \ge 0.9$: excellent (sometimes too high — items may be redundant). $0.8 \le \alpha < 0.9$: good. $0.7 \le \alpha < 0.8$: acceptable. $0.6 \le \alpha < 0.7$: questionable. $\alpha < 0.6$: poor. Split-half and Kuder-Richardson 20/21 are related alternatives (KR for binary-item tests).
Outliers — detection and treatment. Causes: (1) measurement / execution errors; (2) data-entry errors (999 for missing); (3) data processing errors; (4) sampling errors (mixed sources); (5) natural variability / novelties. Detection: scatter plots, boxplots (1.5 × IQR rule), histograms, 2–3 SDs from mean, Grubbs' test, Tietjen-Moore. Treatment: *omit* (with strong evidence of error and documentation), *replace* (winsorise / impute), *use robust methods* (nonparametric, MAD), *value the outlier* (sometimes the most interesting datum), *transform*. The default mistake students make: silently dropping outliers because they look weird. Detect, investigate, decide based on cause, document.
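The 1.5 × IQR boxplot rule from the detection list above can be sketched in a few lines (the 999 value is a deliberately planted missing-value code, the classic data-entry artefact):

```python
import numpy as np

def iqr_outliers(x):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's boxplot rule)."""
    x = np.asarray(x, float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return x[(x < lo) | (x > hi)]

data = [12, 14, 14, 15, 16, 17, 18, 19, 999]   # 999: missing-value code, not a measurement
print(iqr_outliers(data))                      # [999.]
```

Detection is the easy part; the rule flags the point, but deciding what to do with it still requires investigating its cause.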
Definitions
- Pearson r — Standardised covariance: $r = \frac{\operatorname{cov}(X, Y)}{s_X s_Y}$. Range $[-1, 1]$. Captures linear association. Assumes continuous, approximately Normal, no extreme outliers.
- Spearman ρ — Pearson r computed on the ranks of X and Y. Captures monotone (not necessarily linear) association. Robust to outliers. Works for ordinal data.
- Kendall τ — (Concordant − discordant) / total pairs. Robust ordinal association measure. Smaller magnitude than Spearman; preferred for small n with many ties.
- r² (coefficient of determination) — Proportion of variance in Y shared with X (and vice versa). Range $[0, 1]$. Equals R² in simple linear regression.
- Partial correlation — Correlation between X and Y after removing the linear effect of Z from both. Tests 'does X relate to Y beyond what Z explains?'
- Semi-partial (part) correlation — Correlation between Y (or X) and the residual of X (or Y) after removing Z. Z is stripped from only one side. Asymmetric.
- Correlation ≠ causation — Four reasons two variables can correlate: A → B, B → A, C → both, coincidence. Correlation establishes association, not causation.
- Cohen's κ — $\kappa = \frac{p_o - p_e}{1 - p_e}$. Inter-rater agreement above chance for nominal data with two raters. Negative if worse than chance.
- Fleiss' κ — Cohen's κ generalised to more than two raters on nominal data.
- Kendall's W — Coefficient of concordance for multiple raters ranking items (ordinal data).
- Krippendorff's α — General-purpose reliability coefficient: any number of raters, missing data, all measurement levels.
- Intra-rater reliability — Same rater measuring same items at two time points. Pearson r for continuous; Cohen's κ for nominal.
- Cronbach's α — $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)$. Internal consistency of a k-item scale. > 0.70 acceptable; > 0.95 suggests redundancy.
- Split-half reliability — Split items into two halves; compute the correlation between half scores. Spearman-Brown corrects for the length effect.
- Kuder-Richardson 20/21 — Internal consistency for binary-item tests. KR-20 for varying item difficulties; KR-21 assumes equal difficulties.
- Outlier — Data point unusually distant from the rest. Caused by errors (correct/remove), processing mistakes, or natural variability (respect and investigate).
Formulas
Derivations
Pearson r as the average product of z-scores. $r = \frac{1}{n}\sum_{i=1}^{n} z_{x,i}\, z_{y,i}$ where $z_{x,i} = \frac{x_i - \bar{x}}{s_x}$ and similarly for y. Multiplying out: $r = \frac{1}{n}\sum_i \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} = \frac{\operatorname{cov}(x, y)}{s_x s_y}$. Hence r is the standardised covariance — unitless because z-scores are unitless. Range $[-1, 1]$ by the Cauchy-Schwarz inequality.
Why r² = proportion of variance shared. In simple linear regression $\hat{y}_i = a + b x_i$, $R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$. With $b = r\frac{s_y}{s_x}$, plugging in gives $SS_{\text{res}} = (1 - r^2)\,SS_{\text{tot}}$, so $R^2 = r^2$. The square of r is the proportion of Y's variance explained by X (and vice versa, by symmetry).
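The identity is easy to check numerically — fit OLS on synthetic data, compute $R^2$ from the residuals, and compare with the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)    # linear signal plus noise

b, a = np.polyfit(x, y, 1)            # OLS slope and intercept
y_hat = a + b * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
R2 = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(R2, r ** 2)                     # identical (up to float noise) for one predictor
```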
Partial correlation via residuals. Regress X on Z: $\hat{x}_i = a_1 + b_1 z_i$. Regress Y on Z: $\hat{y}_i = a_2 + b_2 z_i$. The residuals $e_{X,i} = x_i - \hat{x}_i$ and $e_{Y,i} = y_i - \hat{y}_i$ are 'the part of X (Y) not explained by Z'. **Partial correlation $r_{XY\cdot Z} = r(e_X, e_Y)$.** Algebra produces the closed form: $r_{XY\cdot Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$.
Why Spearman is robust to outliers. Replacing $x_i$ with its rank maps the largest value to rank $n$, regardless of how extreme. An outlier at, say, $x = 10^6$ becomes rank $n$ — the same as if it were 'slightly above the next-largest value'. The numerical extremity is discarded; only the ordinal information remains. Hence one extreme point can shift Pearson r dramatically but barely moves Spearman ρ.
Cohen's κ chance-corrects observed agreement. Without correction, observed agreement confuses true rater concordance with chance agreement. If rater 1 says yes 60% of the time and rater 2 says yes 65%, they'll agree 'yes' by chance with probability $0.60 \times 0.65 = 0.39$, plus 'no' agreement by chance $0.40 \times 0.35 = 0.14$, total $p_e = 0.53$. Subtracting and rescaling: $\kappa = \frac{p_o - p_e}{1 - p_e}$ gives 0 when agreement = chance and 1 when perfect.
Cronbach's α via item variances. If all $k$ items perfectly measure the same construct (equal variances $\sigma_i^2$, perfect correlation), item scores covary maximally, so $\sigma_T^2 = k^2 \sigma_i^2$ while $\sum_i \sigma_i^2 = k\sigma_i^2$, giving $\alpha = \frac{k}{k-1}\left(1 - \frac{1}{k}\right) = 1$. As items become uncorrelated (each measuring something different), $\sigma_T^2 \to \sum_i \sigma_i^2$ (by independence), so $\alpha \to 0$. Hence α approaches 0 when items are uncorrelated and 1 when perfectly correlated. The factor $\frac{k}{k-1}$ adjusts the scale so that α lands on [0, 1].
Examples
- r = 0.5 → $r^2 = 0.25$ → '25% of variance in Y explained by X'. Moderate strength.
- Pearson r misleads. Anscombe Set II has $r \approx 0.82$ but the relationship is a perfect parabola — Pearson reports 'strong linear' incorrectly.
- Outlier impact. 10 data points tightly along an increasing line (r ≈ 0.99). Add one outlier at (100, 0) → Pearson r collapses, dragged by the single extreme point (it can even flip sign). Spearman ρ is hit far less — the outlier contributes just one maximally extreme rank pair, not a dominating lever.
- Spearman ρ with monotone non-linear. Data: $x = 1, \dots, 10$, $y = x^3$ — a perfect monotone curve. Pearson r ≈ 0.93 (still high but not 1). Spearman ρ = 1 exactly — perfectly monotonic.
- Significance vs strength. $r = 0.85$, $n = 6$: t ≈ 3.2, p ≈ 0.03 — significant but n too small to trust. $r = 0.032$, $n = 10{,}000$: t ≈ 3.2, p < 0.01 — significant but trivial effect.
- Partial correlation worked example. $r_{\text{GPA,IQ}} = 0.75$, $r_{\text{GPA,study}} = 0.50$, $r_{\text{IQ,study}} = 0.50$. Partial $r_{\text{GPA,IQ}\cdot\text{study}} = \frac{0.75 - 0.50 \times 0.50}{\sqrt{(1 - 0.25)(1 - 0.25)}} = \frac{0.50}{0.75} \approx 0.67$. The GPA-IQ relationship is still strong (0.67) after controlling for study hours — study time accounted for some of the original 0.75, but most remains.
- Spurious correlation. *Ice cream sales × drowning deaths*: correlated, but both driven by *summer*. Controlling for temperature (partial) wipes out the apparent relationship.
- Cohen's κ worked example. 50 images rated yes/no by two raters. Contingency: 4 both-yes, 16 R1-yes-R2-no, 15 R1-no-R2-yes, 15 both-no. $p_o = (4 + 15)/50 = 0.38$. R1 yes proportion = 20/50 = 0.40; R2 yes = 19/50 = 0.38. $p_e = 0.40 \times 0.38 + 0.60 \times 0.62 = 0.152 + 0.372 = 0.524$. $\kappa = \frac{0.38 - 0.524}{1 - 0.524} \approx -0.30$. Negative κ = worse than chance — these raters systematically disagree.
- Cronbach's α typical. PHQ-9 (9 depression items): α ≈ 0.86 — good internal consistency. Scale items measure the same construct.
- Cronbach's α too high. A 10-item scale with α = 0.97 — likely redundant items (paraphrases of each other). Drop near-duplicate items.
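The κ and partial-correlation worked examples above can be replayed numerically (for the partial example I take both control correlations as 0.50, which reproduces the quoted 0.67):

```python
import math

# kappa example: 50 images; 4 both-yes, 16 + 15 disagreements, 15 both-no
p_o = (4 + 15) / 50                                   # observed agreement = 0.38
p_e = (20 / 50) * (19 / 50) + (30 / 50) * (31 / 50)   # chance agreement from marginals
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))     # -0.3: worse than chance

# partial correlation example: r_xy = 0.75, r_xz = r_yz = 0.50 (assumed controls)
partial = (0.75 - 0.50 * 0.50) / math.sqrt((1 - 0.50 ** 2) * (1 - 0.50 ** 2))
print(round(partial, 2))   # 0.67
```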
Diagrams
- Pearson r examples. Scatterplots labelled r = 0, 0.3, 0.5, 0.7, 0.9, −0.9. Show the visual progression.
- Pearson misleads. Two scatterplots with identical $r$: one is a clean line, the other is a noisy cloud with two extreme leverage points. Lesson: plot before trusting r.
- r = 0 ≠ independence. Perfect parabola scatter with r ≈ 0. Strong relationship, zero linear association.
- Spearman vs Pearson under outliers. Data along plus one outlier; Pearson drops from 0.99 to 0.5; Spearman stays near 0.99.
- Partial correlation via residuals. Three-step diagram: regress X on Z; regress Y on Z; correlate the two residual sets.
- Cohen's κ contingency table. 2×2 with marginals. Show as sum of diagonal / total; as product of marginals.
- Cronbach's α illustration. Items correlate strongly → high α. Items uncorrelated → α near 0. Animation of α as inter-item correlation increases.
Edge cases
- r = 0 ≠ independence. Only no *linear* association. $y = x^2$ on a symmetric x-range has r ≈ 0 with perfect functional dependence.
- Outliers wildly inflate or deflate Pearson r — always check with Spearman or visualise.
- Ceiling / floor effects truncate distributions and shrink r.
- Range restriction (sampling only narrow X) reduces observable r. Classic: SAT-GPA correlation looks weaker among admitted students than population.
- Heteroscedasticity — Pearson r still computes but is misleading when y-spread depends on x.
- Small n — high r can occur by chance; require both magnitude and significance.
- Negative κ — raters disagree more than chance; investigate definitions.
- High Cronbach's α with low construct validity — items are internally consistent but measuring the wrong construct (e.g., reading-comprehension scale items all heavily correlate with vocabulary, not comprehension).
- Multicollinearity in partial correlation — when Z is nearly perfectly correlated with X or Y, the denominator goes to zero, partial r becomes unstable.
Common mistakes
- Inferring causation from correlation. The single most-flagged error.
- Computing Pearson r on heavy outliers — use Spearman or remove the outlier with justification.
- Treating r = 0 as independence — only no *linear* association.
- Confusing partial and semi-partial correlations — partial residualises both X and Y; semi-partial only one. Different denominators in the formulas.
- Reporting a high r without checking n — small n + high r is suspicious.
- Reporting only p without effect size r — significance ≠ strength.
- Misreporting r² as a percentage without converting — r² is already a proportion (0 to 1); multiply by 100 only if reporting it as a percentage.
- Treating Cohen's κ as observed agreement — κ corrects for chance, so it may be much smaller than the raw $p_o$.
- Reporting Cronbach's α = 0.95 as 'excellent' — likely indicates item redundancy; the items aren't independently measuring different facets.
- Silently dropping outliers — detect, investigate cause, document, then decide.
Shortcuts
- r² = shared variance between X and Y.
- Pearson for continuous + Normal + linear; Spearman / Kendall for ordinal, monotone-non-linear, or outlier-prone.
- Partial: strip Z from both X and Y. Semi-partial: strip Z from one side only.
- Cohen's κ > 0.75 ≈ excellent; κ > 0.40 is acceptable for many domains.
- Cronbach's α > 0.70 acceptable; > 0.95 suggests redundancy.
- Correlation ≠ causation — four reasons: A→B, B→A, C→both, coincidence.
- r is strength; p is significance — both depend on n in different ways.
Proofs / Algorithms
**Pearson r is bounded in $[-1, 1]$.** By the Cauchy-Schwarz inequality, $\left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$. Applying with $a_i = x_i - \bar{x}$, $b_i = y_i - \bar{y}$: the numerator of r is bounded in absolute value by the denominator. Hence $|r| \le 1$. Equality holds iff $y_i - \bar{y} = c\,(x_i - \bar{x})$ for some constant c — i.e., y is a perfect linear function of x.
**$R^2 = r^2$ for simple linear regression.** With $\hat{y}_i = a + b x_i$ and $b = r\,\frac{s_y}{s_x}$ (OLS), the fitted value at $x_i$ is $\hat{y}_i = \bar{y} + r\frac{s_y}{s_x}(x_i - \bar{x})$. Residual SS: $SS_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2 = (1 - r^2)\sum_i (y_i - \bar{y})^2 = (1 - r^2)\,SS_{\text{tot}}$. Hence $R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = r^2$.
Partial correlation formula derivation. Regress X on Z: $\hat{x}_i = a_1 + b_1 z_i$ with residuals $e_{X,i} = x_i - \hat{x}_i$. Regress Y on Z: $\hat{y}_i = a_2 + b_2 z_i$ with residuals $e_{Y,i} = y_i - \hat{y}_i$. The correlation of residuals is $r_{XY\cdot Z} = \frac{\operatorname{cov}(e_X, e_Y)}{s_{e_X} s_{e_Y}}$. After algebra (substituting $b_1 = r_{XZ}\frac{s_X}{s_Z}$ and $b_2 = r_{YZ}\frac{s_Y}{s_Z}$): $r_{XY\cdot Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$. The denominator terms $1 - r_{XZ}^2$ and $1 - r_{YZ}^2$ are the fractions of variance of X and Y that remain after Z is regressed out.