Centre, Spread, Standardisation
Intuition
Three measures of centre: mean (centre of mass, sensitive), median (50th percentile, robust), mode (most-frequent, for nominal). Three measures of spread: range (fragile), IQR / MAD (robust), variance / SD (parametric standard, sensitive). Standardise with z-scores to compare across scales. Choice of measure is constrained by the variable scale (NOIR) and the distribution shape (skew, outliers, bimodality).
Explanation
Mean — the arithmetic average: x̄ = (1/n) Σ xᵢ. The centre of mass — physically, the balance point of the data on a number line. Uses *all* data points; the most sensitive and precise measure when data are roughly symmetric. Sensitive to extreme values: a single billionaire can drag up the mean income of a small town.
Median — the middle value. Sort the data; if n is odd, take the middle observation; if n is even, average the two middle observations. Equivalently the 50th percentile. Robust to outliers — one billionaire moves the median by zero. Use for skewed distributions like income, reaction times, house prices.
Mode — the most-frequent value. The only meaningful central tendency for nominal data ('average eye colour' is nonsense; mode = the most common eye colour). For bimodal distributions, reporting both modes is informative — a single 'mode' obscures subpopulation structure.
Which measure by variable type — exam-fill-in. *Nominal* → mode only (mean and median meaningless for unordered categories). *Ordinal* → median and mode (mean is technically inappropriate — the spacing between ranks isn't meaningful, though Likert data often gets a mean in practice). *Interval* → mean, median, mode all valid. *Ratio* → all valid plus the geometric mean for specialised cases.
When central-tendency measures fail. *Highly skewed distributions* — mean is pulled toward the tail, no longer 'typical'. Use median. *Bimodal* — no single value is typical; report both modes. *Outliers dominate* — use median and IQR. *Small samples* — every measure becomes unstable.
Mean advantages and disadvantages. *Advantage:* most sensitive and exact measure; basis of significance testing and ANOVA; allows estimation of population parameters from sample data. *Disadvantage:* a single extreme value can seriously distort it.
Median advantages and disadvantages. *Advantage:* not susceptible to extreme values. *Disadvantage:* can be unrepresentative in small samples (small samples have flickery medians).
Mode advantages and disadvantages. *Advantage:* indicates the most typical value; unaffected by extreme scores; sometimes more informative than the mean. *Disadvantage:* not useful when several values occur equally frequently in a small dataset.
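The mean-vs-median sensitivity contrast can be checked directly. A minimal sketch using Python's `statistics` module; the salary figures (₹k/yr) are illustrative:

```python
# Sensitivity check: one extreme value drags the mean, not the median.
from statistics import mean, median

salaries = [30, 32, 35, 38]
with_outlier = salaries + [1500]   # add one extreme earner

print(mean(salaries), median(salaries))           # 33.75 33.5
print(mean(with_outlier), median(with_outlier))   # 327 35
```

The mean jumps by nearly an order of magnitude; the median barely moves.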
Range = max − min. The simplest dispersion measure. Extremely sensitive to outliers — one typo of 999 turns your range into nonsense. Fragile.
IQR = Q3 − Q1. Width of the middle 50% of the data. Robust to outliers (drops the top and bottom 25%). Used in boxplots and Tukey's outlier rule (anything beyond Q3 + 1.5·IQR or below Q1 − 1.5·IQR is flagged).
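Tukey's rule is a one-liner once the quartiles are in hand. A sketch using `statistics.quantiles` (its default "exclusive" method; other quartile conventions give slightly different fences):

```python
# Tukey's outlier rule: flag anything beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR.
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
q1, _, q3 = quantiles(data, n=4)       # quartiles
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lo or x > hi]
print(iqr, outliers)                   # 5.5 [100]
```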
Variance — the mean squared deviation from the mean. Population: σ² = (1/N) Σ(xᵢ − μ)². Sample (estimating population): s² = Σ(xᵢ − x̄)² / (n − 1) — Bessel's correction (n−1) makes the estimator unbiased. Why squared, not absolute? Two reasons: (1) squaring keeps positive and negative deviations from cancelling; (2) squaring gives nice algebraic properties (additivity for independent variables, smooth derivatives) that make variance the basis of significance testing and ANOVA. *Downside:* variance is in *squared units* — interpret with care.
**Standard deviation s = √s².** Square root of variance — back in the original units. The most common measure of spread.
Variance / SD advantages and disadvantages. *Advantage:* fundamental to significance testing and ANOVA. Allows population parameters to be estimated from samples. *Disadvantage:* distorted by extreme values (squared deviations amplify outliers). No information about distribution shape (same SD can come from unimodal, bimodal, or skewed).
MAD — Median Absolute Deviation. MAD = median(|xᵢ − median(x)|). Robust analogue of SD: uses the median twice. For Normal data, σ ≈ 1.4826 · MAD.
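The double-median computation is short enough to show in full. A stdlib sketch with illustrative data containing one outlier:

```python
# MAD vs SD under an outlier: MAD barely notices, SD explodes.
from statistics import median, stdev

data = [1, 2, 3, 4, 100]
m = median(data)
mad = median(abs(x - m) for x in data)   # median of absolute deviations
robust_sd = 1.4826 * mad                 # ~ sigma if the bulk were Normal

print(mad, round(robust_sd, 2), round(stdev(data), 1))  # 1 1.48 43.6
```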
z-score (standardisation). z = (x − μ)/σ. Standardised value — how many SDs above/below the mean. Allows comparison across scales / units. Under Normal, P(|z| ≤ 1.96) ≈ 0.95. z-scores are crucial for combining variables on different scales (e.g., standardising IQ + GPA before averaging).
Coefficient of variation. CV = s / x̄. Unitless relative dispersion. Useful when comparing variables with different units (e.g., SD of weight in kg vs SD of height in cm).
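Both standardisation and CV are a few lines of stdlib code. A sketch with illustrative score values (not from any real scale):

```python
# z-scores have mean 0 and SD 1; CV is unitless relative spread.
from statistics import mean, stdev

scores = [10, 12, 14, 16, 18]
mu, s = mean(scores), stdev(scores)
z = [(x - mu) / s for x in scores]   # shape is unchanged, only location/scale
cv = s / mu

print(round(mean(z), 10), round(stdev(z), 10), round(cv, 3))  # 0.0 1.0 0.226
```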
Mean-median-mode relationship and skew. Symmetric distribution: mean = median = mode. *Positive (right) skew:* mean > median > mode (long tail pulls mean right). *Negative (left) skew:* mean < median < mode. This is the diagnostic for skew if you can't plot.
The Normal distribution as a descriptive shape. Bell-shaped, symmetric around the mean. Mean = median = mode. 68 / 95 / 99.7 rule. Almost all values within 3 SDs. Many parametric tests assume the data — or more precisely, the residuals or sampling distributions — are approximately Normal. The CLT rescues us when the raw data aren't.
Skewed distributions in practice. *Right-skewed:* reaction times, income, house prices. *Left-skewed:* test accuracy at a ceiling, lifespan. *Bimodal:* completion times of marathon runners (serious + casual); class exam scores when half studied. For each: use median instead of mean, IQR instead of SD, consider a transformation (log, sqrt, reciprocal) to symmetrise.
Data transformations. *Log:* most common for right-skewed positive data; pulls in the right tail. *Square root:* milder than log. *Reciprocal 1/x:* dramatic for very right-skewed data. *Box-Cox:* general family; finds the optimal λ. Caveat: once transformed, interpret in the transformed scale only — 'log income differed' not 'income differed by $X'.
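The log transform's effect can be diagnosed with Pearson's skew ≈ 3(mean − median)/SD. A sketch with illustrative right-skewed "income" values, not a real sample:

```python
# Log transform pulls in a long right tail; skew drops toward zero.
from math import log
from statistics import mean, median, stdev

def pearson_skew(xs):
    return 3 * (mean(xs) - median(xs)) / stdev(xs)

incomes = [1, 2, 3, 5, 8, 13, 22, 36, 60, 100]
logged = [log(x) for x in incomes]

print(round(pearson_skew(incomes), 2))  # strongly positive (right skew)
print(round(pearson_skew(logged), 2))   # near zero after the transform
```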
The full descriptive table for a behavioural variable. Report: n, mean, median, SD, IQR, min, max, skewness, kurtosis, count of missing. This is what Maya assembles before running any inferential test.
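The descriptive table is easy to automate. A sketch that assumes missing values are encoded as `None` (kurtosis omitted to stay stdlib-only; the column set otherwise follows the list above):

```python
# Descriptive summary of one numeric variable, missing values as None.
from statistics import mean, median, stdev, quantiles

def describe(xs):
    clean = [x for x in xs if x is not None]
    q1, _, q3 = quantiles(clean, n=4)
    return {
        "n": len(clean),
        "missing": len(xs) - len(clean),
        "mean": mean(clean),
        "median": median(clean),
        "sd": stdev(clean),
        "iqr": q3 - q1,
        "min": min(clean),
        "max": max(clean),
        "pearson_skew": 3 * (mean(clean) - median(clean)) / stdev(clean),
    }

print(describe([1, 2, 3, 4, None, 5]))
```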
Definitions
- Mean (arithmetic average) — x̄ = (1/n) Σ xᵢ. Centre of mass. Uses all data; sensitive to outliers and skew.
- Median — Middle value when data sorted (50th percentile). Robust to outliers and skew. Use for ordinal or skewed data.
- Mode — Most-frequent value. Only meaningful central tendency for nominal data.
- Range — max − min. Simplest spread; extremely sensitive to outliers.
- IQR — Q₃ − Q₁. Spread of the middle 50%. Robust.
- Variance — Average squared deviation from the mean. Sample: s² = Σ(xᵢ − x̄)² / (n − 1) (Bessel). Squared units.
- Standard deviation (SD) — s = √s². Same units as the data. The standard spread measure for parametric tests.
- MAD (Median Absolute Deviation) — median(|xᵢ − median(x)|). Robust analogue of SD. Under Normal, σ ≈ 1.4826·MAD.
- z-score — z = (x − μ)/σ. Standardised value: SDs above/below the mean. Unit-less; preserves shape.
- Coefficient of variation (CV) — CV = s / x̄. Unitless relative dispersion. Useful when comparing variables with different units.
- Bessel's correction — Divide by n − 1 in sample variance to remove the bias from fitting x̄ to the sample. One degree of freedom spent.
- Skewness — Asymmetry of the distribution. Positive (right tail), negative (left tail). Pearson: skew ≈ 3(mean − median) / SD.
- Bimodal distribution — Distribution with two peaks. Often indicates two subpopulations. Central-tendency measures unrepresentative.
- Geometric mean — GM = (x₁ · x₂ · … · xₙ)^(1/n). Appropriate for ratio-scale data and rates / growth factors. Always ≤ the arithmetic mean (AM–GM inequality).
Formulas
- Mean: x̄ = (1/n) Σ xᵢ
- Population variance: σ² = (1/N) Σ(xᵢ − μ)²
- Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
- Standard deviation: s = √s²
- IQR: Q₃ − Q₁
- MAD: median(|xᵢ − median(x)|); σ̂ ≈ 1.4826·MAD under Normal
- z-score: z = (x − μ)/σ
- Coefficient of variation: CV = s / x̄
- Pearson skew: ≈ 3(mean − median) / SD
- Geometric mean: GM = (x₁ · … · xₙ)^(1/n)
Derivations
Why squared deviations and not absolute deviations? Three reasons. (1) Squaring keeps positive and negative deviations from cancelling — like absolute value but smooth. (2) The sum of squared deviations Σ(xᵢ − c)² has a unique minimum at c = x̄, derivable from d/dc Σ(xᵢ − c)² = −2 Σ(xᵢ − c) = 0. (3) Variance is *additive for independent variables*: Var(X + Y) = Var(X) + Var(Y) — fundamental to ANOVA's SS partition. Mean absolute deviation lacks both the smoothness and the additivity.
Bessel's correction unbiases s². s² = Σ(xᵢ − x̄)² / (n − 1). Proof sketch: Σ(xᵢ − x̄)² = Σ(xᵢ − μ)² − n(x̄ − μ)². Taking expectations: E[Σ(xᵢ − μ)²] = nσ² and E[n(x̄ − μ)²] = σ², so E[Σ(xᵢ − x̄)²] = (n − 1)σ². Dividing by (n−1) gives E[s²] = σ². One degree of freedom is 'spent' estimating x̄, leaving (n−1) free pieces.
Mean-median-mode relation under skew. For continuous symmetric distributions, mean = median by symmetry, and equal mode if unimodal. Under right skew, the long right tail pulls the mean further right than the median (median is less affected by tail mass), and the mode is at the peak (left of both). Hence mean > median > mode. Reverses for left skew.
z-score preserves shape, changes location and scale. z = (x − μ)/σ. Linear transformation: doesn't change skewness, kurtosis, or any shape feature. Sets mean 0, SD 1. Hence z-scores are *unit-less* and on a common scale — useful for combining variables of different units or for assessing extremity (|z| > 2 is outlier-ish under Normal).
Why MAD has the 1.4826 factor for Normal data. For X ~ N(μ, σ²), MAD = median(|X − μ|) = σ·Φ⁻¹(0.75) ≈ 0.6745σ. To estimate σ, we scale MAD by 1/0.6745 ≈ 1.4826. So σ̂ = 1.4826·MAD — a robust SD estimate that downweights outliers.
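The factor can be checked by simulation. A sketch with a fixed seed; the σ = 2.5 choice is arbitrary:

```python
# For Normal(0, sigma) samples, 1.4826 * MAD should recover sigma.
import random
from statistics import median

random.seed(1)
sigma = 2.5
xs = [random.gauss(0, sigma) for _ in range(100_000)]
m = median(xs)
mad = median(abs(x - m) for x in xs)
print(round(1.4826 * mad, 2))  # close to 2.5
```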
Examples
- Mean vs median under skew. Salaries (₹k/yr): 30, 32, 35, 38, 1500. Mean ≈ 327; median = 35. Mean is dominated by one extreme; median represents the typical worker.
- z-score example. SAT score x = 675, with μ = 500, σ = 100. z = (675 − 500)/100 = 1.75. Looking up the Normal table: Φ(1.75) ≈ 0.96 → 96th percentile.
- Bessel worked example. Sample {1, 2, 4, 6, 6}. x̄ = 3.8. Deviations: −2.8, −1.8, 0.2, 2.2, 2.2. Squared: 7.84, 3.24, 0.04, 4.84, 4.84. Sum = 20.8. With Bessel: s² = 20.8 / 4 = 5.2. Without (biased): 20.8 / 5 = 4.16 — underestimates.
- MAD example. Data {1, 2, 3, 4, 100}. Median = 3. Deviations from median: −2, −1, 0, 1, 97; absolute values 2, 1, 0, 1, 97. MAD = median{0, 1, 1, 2, 97} = 1. SD ≈ 43.6 (dragged up by the outlier 100). MAD is much more representative of typical spread.
- Skew diagnosis. Reaction times: mean 850 ms, median 700 ms, mode ≈ 650 ms. Mean > median > mode → positive (right) skew. Standard for RT data.
- Coefficient of variation. Weight: x̄ = 70 kg, s = 10 kg → CV = 0.143. Height: x̄ = 170 cm, s = 7 cm → CV = 0.041. Weight is *relatively* more variable than height in this sample.
- Bimodal distribution. Marathon completion times: serious runners cluster around 3:30, casual around 5:00. Mean ≈ 4:15 (between modes), but no one finishes in 4:15. Report both modes and the bimodal nature.
- Log transform. Income data, positively skewed. After log: roughly symmetric. Now t-test / ANOVA applies. Report 'log income differed' — not 'income differed by X rupees'.
- Likert mean (controversial). 5-point agreement scale, n = 100. Mean = 3.2. Strictly inappropriate (ordinal data), but commonly reported in practice if scale-points are roughly equally spaced. Median (= 3) is the safer choice.
Diagrams
- Mean-median-mode under three skews. Symmetric Normal (all three coincide at the peak); right-skewed (mean rightmost, then median, then mode at peak); left-skewed (mirror).
- Mean is a balance point. Number line with data points as weights; the mean is where the line balances. Adding a far outlier shifts the balance dramatically.
- z-score on a Normal curve. −3 to +3 σ markers with 68/95/99.7 areas shaded.
- Boxplot with Bessel's correction annotated. Q1, median, Q3, whiskers, outliers, IQR; small inset reminding that the SD used in z-score calculations uses (n−1).
- MAD vs SD under outlier. Same data with and without an outlier; show SD jumps dramatically while MAD barely moves.
- Bimodal distribution. Two peaks; mean and median both fall in the trough between modes — not representative of any actual observation.
Edge cases
- Bimodal distributions — mean/median fall between modes, misleading 'central' summary. Always plot.
- Ordinal data — mean technically inappropriate; use median. Likert is often treated as interval in practice but strictly ordinal.
- Nominal data — only mode and counts are meaningful. 'Average eye colour' is nonsense.
- z-scores assume Normality for percentile interpretation. For skewed data, use empirical percentiles or transform first.
- Geometric mean ≠ arithmetic mean for ratio data. For growth rates and ratios, use the geometric mean: GM = (x₁ · … · xₙ)^(1/n).
- Small samples have flickery medians. Adding or removing one data point can shift the median noticeably. Means are more stable at small n.
- Floor and ceiling effects truncate the distribution — e.g., accuracy with a ceiling near 100% creates left skew artificially. Choose measurement scales to avoid these.
Common mistakes
- Reporting mean ± SD for heavily skewed data — use median + IQR instead.
- Dividing by n (not n − 1) for sample variance — biased low estimator.
- Computing mean of nominal data — 'average eye colour' is nonsense.
- Computing mean of strict Likert/ordinal without acknowledging the critique. Median is safer.
- Forgetting Bessel's correction in pen-and-paper computations.
- Treating SD as a robust spread measure — it's *not* (squared deviations amplify outliers). Use MAD or IQR for robustness.
- Interpreting log-transformed results in original units — only the transformed scale is meaningful.
- Reporting CV when units don't make ratios meaningful — CV requires ratio-scale data (true zero).
- Confusing the sample mean with the population expected value — the sample mean estimates E[X], but in heavy-tailed cases the sample mean can sit far from the population mean even at large n.
Shortcuts
- Robust trio: median + IQR + MAD. Non-robust trio: mean + SD + range.
- Mean > median > mode ⇒ positive (right) skew. Mean < median < mode ⇒ negative (left) skew.
- Bessel: divide by , not — one DoF spent on x̄.
- z = (x − μ)/σ — units of standard deviations; unit-less; preserves shape.
- Variable type → measure: nominal → mode; ordinal → median/mode; interval/ratio → all valid.
- Variance is squared units; SD is original units. Always report SD for interpretation.
- MAD × 1.4826 ≈ σ under Normal — robust SD estimate.
- Skewness ≈ 3(mean − median) / SD (Pearson) — quick estimate without computing higher moments.
- Geometric mean for ratios / rates. Arithmetic mean for additive quantities.
Proofs / Algorithms
Bessel's correction makes s² unbiased. Let x₁, …, xₙ be i.i.d. with mean μ and variance σ². Define s² = Σ(xᵢ − x̄)² / (n − 1). We compute E[Σ(xᵢ − x̄)²]. Expanding: Σ(xᵢ − x̄)² = Σ(xᵢ − μ)² − n(x̄ − μ)². Taking expectations: first term is nσ²; second is n·Var(x̄) = n·(σ²/n) = σ². So E[Σ(xᵢ − x̄)²] = nσ² − σ² = (n − 1)σ². Therefore E[s²] = σ² — unbiased with the (n−1) factor.
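The proof can be confirmed by Monte Carlo. A sketch with a fixed seed and illustrative sizes: dividing by (n−1) averages to σ², dividing by n averages to (n−1)/n · σ²:

```python
# With sigma^2 = 1 and n = 5: biased estimator averages ~0.8, Bessel ~1.0.
import random

random.seed(0)
n, trials = 5, 200_000
biased_sum = unbiased_sum = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased_sum += ss / n          # divide by n
    unbiased_sum += ss / (n - 1)  # Bessel
print(round(biased_sum / trials, 2), round(unbiased_sum / trials, 2))
```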
The mean minimises the sum of squared deviations. Let f(c) = Σ(xᵢ − c)². Differentiating: f′(c) = −2 Σ(xᵢ − c). Setting to zero: Σxᵢ − nc = 0, so c = x̄. Second derivative f″(c) = 2n > 0, confirming a minimum. Hence the mean is the unique point that minimises squared deviations — this is why OLS regression centres on means.
The median minimises the sum of absolute deviations. Let g(c) = Σ|xᵢ − c|. The subgradient is Σ sign(c − xᵢ) = #{xᵢ < c} − #{xᵢ > c}, which equals zero when half the data is below and half above — i.e., at the median. Hence the median minimises the sum of absolute deviations (equivalently, the L1-loss centre). The mean is the L2 centre; the median is the L1 centre. This is why the median is robust: extreme values are penalised linearly (L1), not quadratically (L2).
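Both minimiser results can be verified numerically by grid search. A sketch whose data deliberately include one outlier:

```python
# Grid-search the L2 and L1 loss minimisers; compare with mean and median.
from statistics import mean, median

xs = [1, 2, 3, 4, 100]
grid = [c / 10 for c in range(0, 1001)]   # candidates 0.0 ... 100.0

l2_best = min(grid, key=lambda c: sum((x - c) ** 2 for x in xs))
l1_best = min(grid, key=lambda c: sum(abs(x - c) for x in xs))

print(l2_best, mean(xs))    # L2 minimiser equals the mean (22) - chases the outlier
print(l1_best, median(xs))  # L1 minimiser equals the median (3) - ignores it
```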