Revision Notes/Unit 4 — Data Visualization/Plots, Matching, and Common Pitfalls

Plots, Matching, and Common Pitfalls

Intuition

Before you compute anything, plot the data. *Anscombe's quartet* (four datasets with identical mean / SD / r / regression line but wildly different scatter shapes) is the manifesto: statistics compress data; visualisations reveal structure. Match plots to scales of the variables (bar for categorical, scatter for continuous-continuous, histogram / boxplot / violin for distributions), and avoid the cosmetic traps (truncated axes, rainbow colormaps, 3D pie, dual-y) that mislead viewers.

Explanation

Anscombe's story (1973). Statistician Francis Anscombe assembled four small datasets, each with eleven $(x, y)$ pairs. By every summary statistic — mean of x and y, variance of x and y, correlation (≈ 0.816), best-fit regression line — the four are indistinguishable. Then he plotted them: *Set 1* is a clean linear cloud; *Set 2* is a smooth parabola; *Set 3* is a perfect line with one extreme outlier; *Set 4* is a vertical column of points at one x-value plus a single isolated point that defines the entire correlation. Statistics compress data; visualisations reveal structure.
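The claim is easy to check numerically. The sketch below uses the quartet's published values and a hypothetical helper, `summarise`, to compute the summary statistics from scratch (sample variances with n − 1). It is an illustration, not a replacement for plotting the four sets.

```python
from math import sqrt

# Anscombe's (1973) four datasets; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Means, sample variances, Pearson r and the OLS line for paired data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "var_x": sxx / (n - 1), "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope, "intercept": my - slope * mx,
    }


summaries = {name: summarise(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

To two decimal places the four rows agree, which is exactly why the numbers alone cannot tell the sets apart.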

The two preparatory steps before any plot. (1) *Identify the variable and its type.* Nominal / ordinal / interval / ratio. Continuous or discrete. Variable type constrains which summaries are meaningful. (2) *Identify the unit of analysis.* Trial / subject / group? Behavioural data is often nested (trial → subject → group). Aggregating at the wrong level can hide or fabricate effects.

Two warnings about summaries. Means hide distribution shape — a mean of 50 could be everyone scoring 50, or half scoring 0 and half scoring 100. Aggregation hides individual differences — average across all participants might show 'no effect', while half improved dramatically and half got worse (strong but opposite patterns).
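A minimal sketch of the aggregation warning, using made-up change scores (post minus pre) for ten hypothetical participants. The group mean is zero even though every individual changed by roughly ten points.

```python
from statistics import mean

# Hypothetical change scores: five participants improve, five get worse.
improvers = [12, 9, 11, 10, 8]
decliners = [-12, -9, -11, -10, -8]
changes = improvers + decliners

group_mean = mean(changes)                       # opposite effects cancel out
mean_abs_change = mean(abs(c) for c in changes)  # typical size of an individual change

print(group_mean, mean_abs_change)
```

Plotting the individual change scores, rather than only their average, would show the two opposite subgroups immediately.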

Why visualise — the five reasons. (1) Check assumptions (normality, skew, multimodality). (2) Detect structure (learning curves, fatigue, strategy shifts). (3) Identify outliers and data errors. (4) Understand variability (within- vs between-subject). (5) Decide appropriate statistical models (parametric vs nonparametric, linear vs nonlinear).

Histogram. Splits a continuous variable into bins and shows counts. For: distribution of a single continuous variable — normality, skew, multimodality. Catch: visual impression depends on bin width. Too few bins over-smooth; too many show noise. Try several bin widths.
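A minimal sketch of the bin-width catch, using a hypothetical ten-value sample and a hypothetical helper `bin_counts` (neither comes from the notes). One wide bin makes the sample look like a single lump; 100-unit bins expose the empty gap between two clusters.

```python
def bin_counts(data, start, width, n_bins):
    """Count values falling in n_bins equal-width bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        idx = int((x - start) // width)
        if 0 <= idx < n_bins:  # values outside the binned range are ignored
            counts[idx] += 1
    return counts


# Hypothetical sample with two clusters, near 400 and near 800.
sample = [380, 390, 400, 410, 420, 780, 790, 800, 810, 820]

one_wide_bin = bin_counts(sample, start=300, width=600, n_bins=1)
narrow_bins = bin_counts(sample, start=300, width=100, n_bins=6)

print(one_wide_bin)  # a single bar
print(narrow_bins)   # two clusters separated by empty bins
```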

Boxplot (Tukey's box-and-whisker). Shows five numbers visually: median (line inside box), Q1 and Q3 (box edges), and two whisker ends (typically the most extreme data points lying within 1.5 × IQR of the box edges). Points beyond the whiskers are flagged as outliers. For: comparing spread and central tendency across groups side-by-side. Catch: does NOT show bimodality. Two distributions, one unimodal and one bimodal, can produce identical boxes, and you'd never know.

Scatter plot. Each data point as a dot at $(x, y)$ . For: the relationship between two continuous variables — correlations, linearity, outliers, clusters. If you're going to do regression or correlation, you must see the scatter plot first.

Bar chart. Length of each bar proportional to the value. For: comparing values across discrete categories — counts, means, proportions. Catch (exam favourite): bar plots for *continuous data* (e.g., reaction times) hide the distribution. They show only the mean ± SE bar; nothing about skew, multimodality, outliers. Bad practice for behavioural data — prefer violin or raincloud.

Pie chart. A circle split into wedges, each representing a category's proportion of the whole. Use sparingly — angles are harder for human vision to compare than lengths. Only when: 3–5 categories, visually differentiable slices, proportions add to 100%, no reliance on many colours. For more than a handful of categories, a bar chart is almost always better.

Violin plot. A more honest cousin of the boxplot. Each 'violin' is a mirrored density estimate; width at any height = density at that value. For: comparing full distributions across conditions. Shows skew, multimodality, tails — everything a boxplot hides.

Raincloud / raindrop plot. Combination plot: violin (distribution density) + box (summary stats) + actual individual data points scattered alongside. Shows distribution, summary, AND every data point. The gold standard for showing behavioural data with respect.

Mosaic plot. Grid of rectangles whose areas are proportional to the joint frequencies of two (or more) categorical variables. Rows might be 'introvert vs extrovert', columns 'comfortable vs not comfortable dancing'. For: relationships between two or more categorical variables.

Heat map. Grid of cells coloured by value. For: correlation matrices, 2-D density, time × condition plots. Colour intensity must be interpretable — use a perceptually uniform scale (viridis, plasma), not rainbow.

Line plot. Continuous line connecting $(x, y)$ points, usually with $x$ as time. For: change over time — learning curves, fatigue, neural time courses. Important refinement: show individual trajectories alongside the group average. Group means can hide huge individual variation.

Specialty plots to recognise. *Spider / radar:* multiple axes radiating from a centre; area depends on axis order (manipulable). *Funnel plot:* meta-analysis precision visualisation. *Treemap:* nested hierarchical proportions. *Waffle:* friendlier alternative to pie. *Streamgraph:* time-varying proportions. *Circos:* circular relationships (genomics, chord diagrams).

Tufte's Graphical Theory (memorise the four principles). (1) Maximise the data-to-ink ratio: most ink should encode data, not decoration. Strip out unnecessary gridlines, 3D effects, colour, ornament. (2) Keep the lie factor close to 1: the visual representation should be proportional to the underlying numbers. Truncated y-axes and 3D-tilted bars inflate the lie factor. (3) Minimise chart junk: drop decorative elements that don't encode data (gradient backgrounds, drop shadows, clip art). (4) Use proper scales and labelling: honest axes, clear units, readable text.
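The lie factor can be made concrete with a small sketch. It assumes Tufte's usual definition (relative change shown in the graphic divided by relative change in the data) applied to two bars. The function names and the baseline of 95 are illustrative choices, not part of the notes.

```python
def relative_change(first, second):
    """Change from first to second, as a fraction of first."""
    return (second - first) / first


def lie_factor(first, second, baseline):
    """Lie factor for two bars drawn upward from a given y-axis baseline."""
    shown = relative_change(first - baseline, second - baseline)  # change in bar height
    actual = relative_change(first, second)                       # change in the data
    return shown / actual


honest = lie_factor(100, 105, baseline=0)      # bars start at zero
truncated = lie_factor(100, 105, baseline=95)  # axis starts at 95

print(round(honest, 2), round(truncated, 2))
```

With a zero baseline the bars grow by the same 5% as the data. Starting the axis at 95 turns that 5% rise into a bar that doubles in height, a lie factor of about 20. If the baseline equals the first value, the first bar has zero height and the ratio is undefined.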

What makes a visualisation bad. Bar plots for continuous data; truncated y-axes (small differences → visual cliff); inconsistent or unjustified non-linear axes; overplotting without transparency (thousands of dots → black blob); decorative 3D effects (3D pie particularly); too many variables in one image; wrong chart type (pie for 20 categories); bad colour choices — low-contrast, rainbow, red-green.

Colour-blind friendly palettes. ~8% of men and 0.5% of women have some form of colour vision deficiency. Rules: use viridis / cividis / ColorBrewer's colour-blind-safe schemes; avoid red with green (most common deficiency); avoid red with brown; avoid rainbow colormaps (not perceptually uniform). Use symbols / line styles / direct annotations *in addition* to colour, so the chart remains legible even in greyscale. Test with Color Oracle or similar simulator.

Initial vs final visualisation. *Initial / exploratory:* for *you*, to understand data before analysis. Histograms, boxplots, scatter plots. Doesn't need to be polished — quick checks for distributions, outliers, missing data, weird coding. *Final / publication:* for the *reader*, to communicate findings. Cleaner, well-labelled, often more specialised (forest plots, funnel plots, raincloud, mosaic). The same data may deserve different plots at these two stages.

Choosing the right plot — by research question. *Distribution / shape of single variable* → histogram, boxplot, violin, raincloud. *Proportions / part-to-whole* → bar, pie (sparingly), treemap, waffle, mosaic. *Temporal change* → line plot. *Group differences* → side-by-side box / violin / raincloud. *Two continuous variables* → scatter, bubble, hexbin. *Two categoricals* → mosaic, heat map. *Correlation matrix* → heat map. *Geographical* → choropleth. There is no universally best plot — only the best plot for the question.

Detecting outliers with the Tukey rule. $x$ is an outlier if $x > Q_{3} + 1.5 \cdot IQR$ or $x < Q_{1} - 1.5 \cdot IQR$, where $IQR = Q_{3} - Q_{1}$. A stricter multiplier of 3 × IQR flags only extreme outliers. Alternatives: more than 2 or 3 SDs from the mean (assumes Normal), Grubbs' test, Tietjen-Moore for multiple outliers.
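The rule translates directly into code. In this sketch the helper names `tukey_fences` and `classify` are hypothetical, and the quartiles are supplied by hand so the arithmetic is easy to follow. In practice the quartiles come from the data, and different quantile conventions can shift them slightly.

```python
def tukey_fences(q1, q3, k=1.5):
    """Lower and upper fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(x, q1, q3):
    """Label a value using the 1.5 x IQR (inner) and 3 x IQR (outer) fences."""
    inner_lo, inner_hi = tukey_fences(q1, q3, k=1.5)
    outer_lo, outer_hi = tukey_fences(q1, q3, k=3.0)
    if x < outer_lo or x > outer_hi:
        return "extreme outlier"
    if x < inner_lo or x > inner_hi:
        return "outlier"
    return "not flagged"


# Hypothetical quartiles: Q1 = 10, Q3 = 20, so IQR = 10.
print(tukey_fences(10, 20))  # inner fences
for value in (30, 38, 50, 55):
    print(value, classify(value, 10, 20))
```

A value sitting exactly on a fence is not beyond it, so 50 is labelled an ordinary outlier here and only values above 50 count as extreme.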

Skewed distributions. *Right-skewed (positive skew)*: long tail to the right; typically mean > median > mode. Classic: reaction times (floor at zero), income, house prices. *Left-skewed (negative skew)*: long tail to the left; typically mean < median < mode. Classic: test accuracy with ceiling effect; lifespan. *Bimodal*: two peaks indicating subpopulations (male + female heights; serious vs casual marathon runners).

Data transformations. When data are heavily skewed and you need normality for parametric tests, transformations can help. *Log:* most common for right-skewed positive data (incomes, RTs). *Square root:* milder than log. *Reciprocal 1/x:* dramatic for very right-skewed data. *Box-Cox:* general power family $(x^{λ} - 1) / λ$, with $\log x$ as the $λ = 0$ case; the optimal $λ$ is the one that makes the transformed data as close to Normal as possible (usually found by maximum likelihood). Caveat: once you transform, you can only interpret in terms of the *transformed* variable, not the original. 'Log RTs differed by group', not 'RTs differed by X seconds'.
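A minimal sketch of the Box-Cox transform for a single value, assuming the standard one-parameter form $(x^{λ} - 1) / λ$ with the log as the $λ = 0$ case. The function name `box_cox` is illustrative. Choosing $λ$ from data is a separate optimisation step and is not shown.

```python
from math import log


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return log(x)  # limiting case as lam -> 0
    return (x ** lam - 1) / lam


print(box_cox(4, 0.5))  # square-root-like: (2 - 1) / 0.5
print(box_cox(5, 1))    # lam = 1 only shifts the data by -1
print(box_cox(10, 0))   # natural log of 10
```

Values of $λ$ near 0 behave almost like the log, $λ = 0.5$ like the square root, and $λ = -1$ like the reciprocal, so the family covers the transformations listed above.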

Definitions

Anscombe's quartet — Four datasets sharing mean / SD / r / regression line but with wildly different scatter shapes. The slogan: statistics compress, visualisations reveal. Always plot.
Histogram — Bins a continuous variable and shows counts per bin. Reveals distribution shape; sensitive to bin width.
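Boxplot (box-and-whisker) — Visual five-number summary: median line, box from Q1 to Q3, whiskers to the most extreme points within 1.5 × IQR of the box (the min and max when nothing lies beyond), outliers as individual points. Hides bimodality.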
Violin plot — Mirrored KDE density on each side. Communicates summary AND shape. Cousin of the boxplot.
Raincloud plot — Violin + boxplot + individual data points. The gold standard for behavioural data — distribution, summary, every observation in one figure.
Mosaic plot — Grid of rectangles with areas proportional to joint frequencies of categorical variables. For two-way categorical relationships.
Heat map — Grid where colour encodes value. Common for correlation matrices, time × subject data. Use viridis / cividis.
Bar chart — Length encodes value. For counts / means / proportions across discrete categories. Avoid for continuous data shapes.
Pie chart — Wedge angles encode proportions. Use sparingly — angles are perceptually weak. Limit to 3–5 categories.
Tukey outlier rule — $x > Q_{3} + 1.5 \cdot IQR$ or $x < Q_{1} - 1.5 \cdot IQR$ . The boxplot whisker boundary.
IQR — $Q_{3} - Q_{1}$ . Robust spread of the middle 50%.
Skew — Asymmetry of a distribution. Positive (long right tail), negative (long left tail), or symmetric (mean = median).
Bimodal distribution — Distribution with two peaks. Often indicates two subpopulations or strategies.
KDE (Kernel Density Estimate) — Smoothed estimate of a continuous distribution. The basis of violin plots.
Data-to-ink ratio (Tufte) — Fraction of ink on a chart that encodes data. Higher = better. Strip decoration.
Lie factor (Tufte) — Visual change ÷ data change. Should be ~1. Truncated axes inflate it.
Chart junk (Tufte) — Decorative elements that don't encode data — drop shadows, 3D effects, gradient backgrounds. Remove.
Colour-blind friendly palette — Palette legible to viewers with red-green deficiency (~8% of men). Examples: viridis, cividis, ColorBrewer safe schemes.
Data transformation — Functional transformation of a variable (log, sqrt, 1/x, Box-Cox) to reduce skew or stabilise variance before parametric tests. Interpret in transformed scale only.

Formulas

$IQR = Q_{3} - Q_{1}$
$\text{Tukey outlier: } x > Q_{3} + 1.5 \cdot IQR \text{ or } x < Q_{1} - 1.5 \cdot IQR$
$\text{Mean} > \text{Median} > \text{Mode} \Rightarrow \text{positive skew (right tail)}$
$\text{Mean} < \text{Median} < \text{Mode} \Rightarrow \text{negative skew (left tail)}$
$\text{Symmetric} \Rightarrow \text{Mean} = \text{Median} = \text{Mode}$
$\text{Skewness}_{\text{Pearson}} = \frac{3 ( \bar{x} - \text{median} )}{s} \text{ (rough estimate)}$
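The last formula is simple enough to check by hand or in code. The sketch below applies it to a made-up set of reaction times; the data and the function name `pearson_skewness` are illustrative only.

```python
from statistics import mean, median, stdev


def pearson_skewness(data):
    """Rough skewness estimate: 3 * (mean - median) / sample SD."""
    return 3 * (mean(data) - median(data)) / stdev(data)


# Hypothetical reaction times in ms with a long right tail.
rts = [420, 450, 470, 480, 500, 520, 550, 600, 750, 1200]

print(mean(rts), median(rts))  # mean is pulled above the median by the tail
print(round(pearson_skewness(rts), 2))
```

The sign is what matters for a quick check: positive when the mean sits above the median (right tail), negative when it sits below (left tail), and zero for perfectly symmetric data.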

Derivations

Why bar plots hide distributions. A bar plot encodes only one number per category — typically the mean. Two distributions with the same mean (e.g., one symmetric, one bimodal with peaks at the extremes) produce identical bar plots. Information lost = everything except the mean (and possibly an error bar). Violin plots or raincloud plots restore the shape information.
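A small sketch of the argument, with two made-up samples of reaction times in ms. Both would produce the same bar, yet one sample clusters around its mean and the other has no observations anywhere near it. The helper `near_mean` is hypothetical.

```python
from statistics import mean

# Hypothetical samples with the same mean but different shapes.
unimodal = [560, 580, 590, 600, 600, 610, 620, 640]
bimodal = [380, 400, 400, 420, 780, 800, 800, 820]


def near_mean(data, window=100):
    """Number of observations within `window` of the sample mean."""
    m = mean(data)
    return sum(1 for x in data if abs(x - m) <= window)


print(mean(unimodal), mean(bimodal))            # identical bar heights
print(near_mean(unimodal), near_mean(bimodal))  # very different shapes
```

In the bimodal sample the mean describes nobody: every observation lies more than 200 ms away from it.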

Why pie charts are hard to read. Human vision compares *lengths* more accurately than *angles*. Cleveland and McGill (1984) ranked perceptual tasks by accuracy: position-along-aligned-axis (best) > position-along-non-aligned-axis > length > angle/area > volume/colour (worst). Bar charts use position; pies use angle and area, which rank well below position and length.

Anscombe's regression line is identical across all four shapes. OLS minimises $\sum (y_{i} - \hat{y}_{i})^{2}$. Any two datasets with the same means of $x$ and $y$, the same $Var (x)$, and the same $cov (x, y)$ produce the same slope $\hat{β} = cov (x, y) / Var (x)$ and the same intercept $\hat{α} = \bar{y} - \hat{β} \bar{x}$. Anscombe constructed all four sets to share these moments, hence the matching regression coefficients despite radically different scatter.

Why log helps with right-skewed positive data. If $X$ is log-Normal (so $\log X$ is Normal), $X$ itself has a long right tail (high values are rare but very large). Taking $\log X$ pulls in the right tail and produces a symmetric distribution. Caveat: only works for strictly positive data; for data with zeros, use $\log (x + 1)$ or a Box-Cox shift.
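A simulation sketch of this argument. It draws a log-Normal sample with Python's `random.lognormvariate` (the parameters 6.0 and 0.5 and the seed are arbitrary choices) and compares a moment-based skewness estimate before and after taking logs. The helper `skewness` is illustrative.

```python
import random
from math import log
from statistics import mean, pstdev


def skewness(data):
    """Moment-based sample skewness: mean of cubed z-scores."""
    m, s = mean(data), pstdev(data)
    return mean(((x - m) / s) ** 3 for x in data)


random.seed(0)  # fixed seed so the run is repeatable
raw = [random.lognormvariate(6.0, 0.5) for _ in range(5000)]
logged = [log(x) for x in raw]

skew_raw = skewness(raw)     # clearly positive: long right tail
skew_log = skewness(logged)  # close to zero: roughly symmetric

print(round(skew_raw, 2), round(skew_log, 2))
```

Real data are rarely exactly log-Normal, so the log-transformed values should still be checked with a histogram or Q-Q plot rather than assumed to be symmetric.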

Examples

Anscombe set II has a perfect parabola — Pearson r = 0.816 looks like a strong linear association but the relationship is *quadratic*. A linear regression on this dataset is fundamentally misleading.
Anscombe set III is a perfect linear cloud with one wild outlier; the outlier drags the regression line off the line that fits the other 10 points. *Lesson:* outliers can dominate regression unless you check.
Anscombe set IV is a vertical line of 10 points at x = 8 plus one isolated point at (19, 12.5). The entire 'correlation' is anchored by that single point. *Lesson:* a single point can drive your slope.
Boxplot Tukey example. Q1 = 10, Q3 = 20; IQR = 10; upper fence = 20 + 1.5 × 10 = 35. A data point at 38 is flagged as an outlier; a point at 55 lies beyond Q3 + 3 × IQR = 50 and is an extreme outlier.
Bar plot disaster. Two groups, both with mean RT ≈ 600 ms. Bar plot: identical bars + tiny error bars → 'no difference'. Violin plot: group 1 is unimodal around 600 ms; group 2 is bimodal at 400 and 800 ms (two subgroups using different strategies). Conclusion reverses.
Truncated y-axis. Sales rose from 100 to 105 over a year. A y-axis from 100 to 110 makes this look like a dramatic spike; a y-axis from 0 to 200 makes it look flat. Honest practice: zero baseline for absolute quantities; a truncated axis may be defensible for quantities without a meaningful zero baseline (e.g., temperature anomalies).
Pie chart disaster. Survey results in 12 categories, each 6–10%. The pie has 12 nearly-equal slices that are visually indistinguishable. A horizontal bar chart sorted by frequency communicates immediately.
Log transformation in practice. Reaction time data range 200–10,000 ms with strong right skew. Log-transforming compresses the tail; Shapiro-Wilk on $\log (RT)$ no longer rejects normality. Now t-test / ANOVA is appropriate — but report 'log RT differed' not 'RT differed by X ms'.

Diagrams

Anscombe's quartet. Four side-by-side scatterplots. Below each, the identical summary table: mean(x), mean(y), var(x), var(y), r, regression equation.
Plot-decision flowchart. Start node: 'how many variables?'. Branches: one continuous → histogram / boxplot / violin / raincloud; one categorical → bar / pie; two continuous → scatter / bubble / hexbin; two categorical → mosaic / heatmap; continuous × categorical → grouped boxplot / violin / raincloud; time series → line.
Boxplot anatomy. Median line in box, Q1 / Q3 box edges, whiskers to the most extreme points within 1.5 × IQR of the box edges, outliers as individual dots. Annotate IQR.
Violin vs box for same data. Two side-by-side panels. Same dataset where the box looks identical for two groups but the violin reveals one group is bimodal.
Tufte's data-to-ink demonstration. Cluttered chart (gridlines, drop shadow, 3D bars, gradient background) vs same chart stripped to data-encoding ink only.
Lie-factor examples. Truncated y-axis exaggerating differences. 3D pie distorting angles. Bar tilted in 3D appearing taller.
Colour-blind palette comparison. Same data in (a) rainbow, (b) viridis. Rainbow has perceptual cliffs; viridis is monotonic.
Tim Cook's iPad chart. Sales 2008–2013. As shown to investors (tight y-axis) vs from-zero baseline (modest growth).

Edge cases

Bimodal distributions are hidden by boxplots and bar plots — only revealed by histograms, violins, or rainclouds.
Heaped data (preferred values like 0, 5, 10) confounds smooth density estimation. Inspect with histogram before applying KDE.
Heat maps with rainbow palette introduce illusory category boundaries — prefer viridis or single-hue gradients.
Box plots show medians, not means. If your text reports means, ensure they match what the figure shows.
Pie charts with ~equal slices are essentially unreadable — humans can't accurately compare 8% vs 10% via angle.
Skewness near 0 doesn't guarantee normality. A symmetric platykurtic or leptokurtic distribution has zero skewness but is still non-Normal — check Q-Q plots.
Log transform on zeros. Use $\log (x + 1)$, $\log (x + 0.5)$, or Box-Cox shift; or, for many zeros, consider zero-inflated models instead of transforming.
Survivorship bias in line plots over time. If subjects drop out, the late-time-point trajectory only shows survivors — group means appear to converge.

Common mistakes

Showing summary statistics without raw data. Anscombe's lesson. Always plot.
Pie charts with > 5 slices or similar-sized slices — angles unreadable.
Truncated y-axis to exaggerate differences (or compressed y-axis to hide them).
Dual y-axes on the same plot — creates false co-movement and is rarely justified.
Rainbow / red-green palettes; using colour as the only encoding.
Bar plot for continuous data (RT, accuracy as %, EEG amplitude). Hides shape. Use violin or raincloud.
3D pie / 3D bar / 3D anything — tilt distorts angles and areas.
Forgetting to label axes / units / variables — readers must understand without external context.
Silently dropping missing data instead of explicitly visualising it (use 'NA' marker or separate panel).
Reporting log-transformed results in original units — only the transformed scale is interpretable.

Shortcuts

Always plot raw data alongside any summary statistic.
Categorical → bar; continuous-continuous → scatter; distribution → histogram + boxplot or violin / raincloud.
Tukey outlier rule: flag points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
Mean > median > mode → positive skew (right tail).
Avoid: 3D pies, dual-y, rainbow palettes, red+green together, truncated y-axes (for absolute quantities).
Tufte's principles: data-to-ink ratio, lie factor, chart junk, proper scales — four principles, memorise.
Viridis for colour-blind-safe perceptually uniform palettes.
Initial visualisation = for you; final visualisation = for the reader.
Raincloud = box + violin + dots — gold standard for behavioural data.

Proofs / Algorithms

Tukey's outlier rule under Normality. Under $N (μ, σ^{2})$ , $Q_{1} \approx μ - 0.674 σ$ , $Q_{3} \approx μ + 0.674 σ$ , $IQR \approx 1.349 σ$ . Upper fence $= Q_{3} + 1.5 \cdot IQR \approx μ + 2.7 σ$ . Hence Tukey flags points beyond roughly $\pm 2.7 σ$ as outliers — slightly more aggressive than the 'beyond ±3 σ' rule, and applicable without assuming Normality.
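The constants above can be reproduced with the standard library's `statistics.NormalDist` (Python 3.8+). This is a check on a standard Normal only; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()              # standard Normal: mu = 0, sigma = 1
q3 = z.inv_cdf(0.75)          # about 0.674; Q1 = -Q3 by symmetry
iqr = 2 * q3                  # about 1.349
upper_fence = q3 + 1.5 * iqr  # about 2.70 (in units of sigma)

p_flagged = 2 * (1 - z.cdf(upper_fence))  # share of Normal data beyond either fence
p_3sd = 2 * (1 - z.cdf(3))                # share beyond +/- 3 sigma

print(round(upper_fence, 3), round(p_flagged, 4), round(p_3sd, 4))
```

So for truly Normal data the Tukey rule flags roughly 0.7% of observations, compared with about 0.27% for the ±3σ rule.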

OLS regression is invariant under same-moment data. For OLS, $\hat{β}_{1} = cov (x, y) / Var (x)$ and $\hat{β}_{0} = \bar{y} - \hat{β}_{1} \bar{x}$. Any two datasets with identical means, variances, and covariance produce identical OLS regression lines, regardless of shape. Anscombe constructed his quartet to exploit this — moments alone do not constrain higher-order shape.

Why bar plots have lower information capacity than violins. A bar encodes one number (mean) plus optionally one number (SE). A violin encodes the full density function — infinitely many numbers — at the same plotting area. For continuous data where shape matters, the violin is strictly more informative; the bar discards shape, multimodality, tails, and individual observations.


Behavioral Research: Statistical Methods