Plots, Matching, and Common Pitfalls
Maya Learns to Look Before She Computes
Before Maya runs a single test, she has to do something more humble: look at her data. Plot it. Stare at it. See if it has the shape she thinks it has. This sounds obvious, but it is regularly skipped, and the consequences are catastrophic. This unit is about the discipline of *seeing before computing* — and about the small set of numbers that summarise what you see.
It's also a section your exam will draw heavily from with visual-interpretation questions: *"Which plot is best for this data?"* *"What's wrong with this visualisation?"* *"Compare X and Y plots."*
Why visualise? The Anscombe story
In 1973, the statistician Francis Anscombe assembled four small datasets, each containing eleven (*x*, *y*) pairs. He computed for each:
- The mean of *x* → identical across all four datasets.
- The mean of *y* → identical.
- The variance of *x* → identical.
- The variance of *y* → identical.
- The correlation between *x* and *y* → identical (about 0.816).
- The best-fit regression line → identical.
By every summary statistic, the four datasets are indistinguishable. Then Anscombe plotted them.
- Dataset 1 is a clean linear cloud — exactly what the statistics suggest.
- Dataset 2 is a smooth curve — the relationship is *non-linear*, but the linear fit doesn't notice.
- Dataset 3 is a perfect line with one wild outlier tilting the regression line away from the otherwise exact fit.
- Dataset 4 is a vertical column of points at one *x*-value plus one isolated point — the entire correlation is being driven by one observation, with no relationship in the rest of the data.
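You can verify the "identical statistics" claim yourself. A minimal sketch using only Python's `statistics` module; the data values are Anscombe's published quartet:

```python
from statistics import mean, stdev

# Anscombe's quartet (1973): four x/y datasets of eleven pairs each.
# Datasets 1-3 share the same x values; dataset 4 has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x123, x123, x123, x4]

def pearson_r(x, y):
    """Pearson correlation from raw sums (no external libraries)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / ((len(x) - 1) * stdev(x) * stdev(y))

for i, (x, y) in enumerate(zip(xs, ys), start=1):
    print(f"dataset {i}: mean_y={mean(y):.2f}, r={pearson_r(x, y):.3f}")
```

All four print the same mean (~7.50) and the same correlation (~0.816) — only a plot reveals how different they are.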
This is the most important figure in the course. The slogan attached to it:
Statistics compress data; visualisations reveal structure.
You should visualise before choosing a statistical test, not after.
The lesson generalises: any summary you compute — a mean, a correlation, a regression line — has already thrown away information. The decision to use that summary is itself a choice. Visualisation lets you check whether the choice was reasonable.
Summarisation is already a choice
Before any plot, two preparatory steps:
Step 1: Identify the variable and its type. Nominal, ordinal, interval, ratio (Unit 2). Continuous or discrete. The variable type constrains which summaries are meaningful.
Step 2: Identify the unit of analysis. Trial-level, subject-level, group-level? In behavioural data, you often have *trial → subject → group* nested structure. Aggregating at the wrong level can hide or fabricate effects.
Any summary — a mean, a median, a percentage — has already decided what matters in the data. Visualisation helps you check whether that decision was reasonable. Two warnings worth memorising:
- Means hide distribution shape. A mean of 50 could be everyone scoring around 50, or half the class scoring 0 and half scoring 100. Wildly different stories, identical mean.
- Aggregation can hide individual differences. Average across all participants and you might see "no effect of training." Look at individuals and you might see half improving dramatically and half getting worse — strong but opposite patterns.
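The first warning takes two lines to demonstrate (the scores here are illustrative, not course data):

```python
from statistics import mean

# Two classes with the identical mean of 50 but opposite stories
uniform_class = [48, 49, 50, 50, 51, 52]   # everyone near 50
split_class   = [0, 0, 0, 100, 100, 100]   # half at 0, half at 100

print(mean(uniform_class), mean(split_class))  # same summary, hidden shape
```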
This is also a regular exam question: *"What can go wrong if we don't visualise before analysing?"* Hidden distribution structure, outlier-driven effects, opposite patterns in subgroups, data quality issues like missing values or coding errors.
Why visualise? The five reasons
To memorise for the inevitable short-answer question:
1. Check assumptions about the distribution of the data — normality, skew, multimodality.
2. Detect structure — learning curves, fatigue, strategy shifts.
3. Identify outliers and data errors — reaction-time artefacts, missing trials, coding mistakes.
4. Understand variability — within-subject vs between-subject.
5. Decide appropriate statistical models — parametric vs non-parametric, linear vs non-linear.
The plot catalogue
Histogram
Splits a continuous variable into bins and shows the count in each bin. For: distribution of a single continuous variable — normality, skew, multimodality.
Catch: the visual impression depends heavily on bin width. Too few bins and you over-smooth structure away; too many and noise dominates. Try several bin widths.
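The bin-width catch is easy to see numerically. A toy sketch with a hand-made bimodal sample (the values and bin widths are illustrative):

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values into bins of the given width (left edge inclusive)."""
    return Counter(int(v // bin_width) * bin_width for v in values)

# Bimodal sample: two clusters, one near 20 and one near 80
data = [18, 19, 20, 21, 22, 78, 79, 80, 81, 82]

print(sorted(histogram(data, 10).items()))   # two clear clusters survive
print(sorted(histogram(data, 100).items()))  # one giant bin: bimodality gone
```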
Boxplot (Tukey box-and-whisker)
Shows five numbers visually: median (line inside box), Q1 and Q3 (box edges), and whiskers (typically extending to the most extreme data points within 1.5 × IQR of the box). Points beyond the whiskers are plotted as outliers.
For: comparing spread and central tendency across groups side by side.
Catch: box plots encode spread well but do not show bimodality or clusters. Two distributions can have identical boxes — one unimodal, one bimodal — and you'd never know.
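The Tukey outlier rule behind the whiskers can be sketched with the standard library; the reaction-time values are made up, and `statistics.quantiles` uses the exclusive quartile method by default:

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Flag points beyond k * IQR from the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)  # exclusive method by default
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

rts = [310, 325, 340, 350, 360, 375, 390, 2400]  # one walk-away trial
print(tukey_outliers(rts))  # [2400]
```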
Scatter plot
Each data point as a dot at (*x*, *y*). For: the relationship between two continuous variables.
This is the plot for inspecting correlations, linearity, outliers, and clusters. If you're going to do regression or correlation, you must see the scatter plot first.
Pie chart
A circle split into wedges, each representing a category's proportion of the whole. Use sparingly. Pie charts are difficult because comparing angles is harder than comparing lengths. Use only when there's a small number of categories (3–5), slices are visually differentiable, and proportions add to 100%.
Bar chart
Length of each bar proportional to the value. For: comparing values across discrete categories — counts, means, proportions.
Catch (this is an exam favourite): bar plots for continuous data (e.g., reaction times) hide the distribution. They show only the mean and maybe an error bar. They tell you nothing about skew, multimodality, or outliers. Bad practice for behavioural data.
Violin plot
A more honest cousin of the box plot. Each "violin" is a mirrored density estimate of the variable's distribution. For: comparing full distributions across conditions — reaction times, confidence ratings, continuous outcomes. Shows skew, multimodality, tails — everything a box plot hides.
Raincloud plot
A combination: a violin (distribution density) plus a box (summary statistics) plus the actual individual data points scattered next to them. Shows distribution, summary, and every data point.
This is the gold standard for showing behavioural data with respect.
Mosaic plot
A grid of rectangles whose areas are proportional to the joint frequencies of two (or more) categorical variables. For: relationships between two or more categorical variables.
Heat map
A grid of cells coloured by value. For: correlation matrices, two-dimensional density, time-by-condition plots.
Colour intensity must be interpretable — use a perceptually uniform scale (viridis, plasma) rather than rainbow.
Line plot
A continuous line connecting points, usually with *x* as time. For: change over time — learning curves, fatigue, neural time courses.
Important refinement: show individual trajectories alongside the group average. Group means can hide huge individual variation. Show both.
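The "show both" advice can be made concrete with hypothetical training scores: two subjects improve, two get worse, and the group average is flat.

```python
from statistics import mean

# Hypothetical scores for four subjects over five sessions (made up):
# two improve steeply, two decline steeply
subjects = [
    [10, 20, 30, 40, 50],
    [12, 22, 33, 41, 52],
    [50, 40, 30, 20, 10],
    [52, 41, 33, 22, 12],
]

group_mean = [mean(session) for session in zip(*subjects)]
print(group_mean)  # roughly flat: the "average subject" shows no learning
```

Plotting only the group-mean line would report "no effect of training"; the individual trajectories tell the real story.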
The specialty plots
- Spider / radar: multiple axes radiating from a centre. Looks impressive, but usually hard to interpret because the perceived area depends on axis order. Avoid unless you have a strong reason.
- Funnel plot: meta-analysis precision visualisation.
- Tree map / waffle / streamgraph / circos: specialised plots for specific data structures; recognise on sight.
What makes a visualisation good?
The criteria, condensed from Tufte and the practical guidelines:
1. Reduce cognitive load. Don't make the reader work harder than necessary.
2. Simplicity — less is more.
3. Relevancy — every element serves the question.
4. Storytelling — guide the reader's eye through the chart.
5. Convince — the chart should support a clear interpretation.
6. Show the data, not just summaries. Where possible, plot individual points alongside means.
7. Use scales honestly. Don't truncate the y-axis to make small differences look big.
8. Make assumptions visible. Say what transformations / aggregations were used.
9. Colour consistency across charts. Use the same colours for the same things across a paper.
10. Labelling — axes labelled, units clear, title concise and descriptive.
Tufte's Graphical Theory
Edward Tufte, the patron saint of clear data graphics, articulated four principles you should be able to name:
1. Maximise the data-ink ratio. Most ink on a chart should encode data, not decoration. Strip out unnecessary gridlines, 3D effects, colour, and ornament.
2. Minimise the lie factor (equivalently: maximise graphical integrity). The visual representation should be proportional to the underlying numbers.
3. Minimise chart junk. Decorative elements that don't encode data — gradient backgrounds, drop shadows, clip art.
4. Use proper scales and labelling. Honest axes, clear units, readable text.
What makes a visualisation bad
The exam will show you bad charts and ask what's wrong. The catalogue of sins:
- Bar plots for continuous data like reaction times — hides skew and multimodality.
- Truncated y-axes — turns a tiny difference into a visual cliff.
- Inconsistent or non-linear axes when not justified.
- Overplotting without transparency — thousands of dots stacked into a black blob.
- Decorative or misleading 3D effects — 3D pie charts particularly.
- Too many variables in one image.
- Wrong chart type for the data — pie chart for 20 categories.
- Bad colour choices — low-contrast palettes, rainbow colormaps, traffic-light colours (red/green) for non-categorical data.
Tim Cook's famous chart of cumulative iPad sales (2010–2013) is the canonical *"good for the storyteller, bad for the truth"* example — even when not actively misleading, a chart designed to sell a narrative often fails as honest visualisation.
Colour-blind friendly palettes
About 8% of men and 0.5% of women have some form of colour vision deficiency. If your chart relies on red vs green, a significant fraction of your audience can't read it. Rules:
- Use colour-blind-friendly palettes for all scientific publications (viridis, cividis, ColorBrewer's colour-blind-safe schemes).
- Avoid red with green specifically. Avoid red with brown.
- Avoid rainbow colormaps — they are not perceptually uniform; equal numerical jumps look like unequal colour jumps.
- Use symbols, line styles, and direct annotations in addition to colour, so the chart remains legible in greyscale.
- Test with a colour-blind-simulation tool (Color Oracle is the named one).
Initial vs final visualisation
A distinction worth knowing:
- Initial / exploratory visualisation — for *you*, to understand your data before analysis. Histograms, box plots, scatter plots. Check distributions, outliers, missing data, weird coding. Doesn't need to be polished.
- Final / publication visualisation — for the *reader*, to communicate findings. Cleaner, well-labelled, often more specialised (forest plots, funnel plots, raincloud plots, mosaic plots).
The same data may deserve different plots at these two stages.
Choosing the right plot — by research question
A useful decision tree:
| Question | Plot |
| --- | --- |
| Shape of a single continuous variable | histogram, boxplot, violin, raincloud |
| Proportions / part-to-whole | bar, pie (sparingly), tree map, waffle, mosaic |
| Temporal change | line plot |
| Group differences | side-by-side box / violin / raincloud |
| Two continuous variables | scatter, bubble, hexbin |
| Two categorical variables | mosaic, heat map |
| Correlation matrix | heat map |
| Geographical | choropleth |
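For exam revision, the decision table reduces to a lookup. A toy sketch — the category names are paraphrases of the table's rows, not a real library API:

```python
# Question category -> candidate plots (illustrative lookup, not a library)
PLOT_FOR = {
    "single continuous shape": ["histogram", "boxplot", "violin", "raincloud"],
    "part-to-whole": ["bar", "pie (sparingly)", "tree map", "waffle", "mosaic"],
    "temporal change": ["line plot"],
    "group differences": ["side-by-side box", "violin", "raincloud"],
    "two continuous": ["scatter", "bubble", "hexbin"],
    "two categorical": ["mosaic", "heat map"],
    "correlation matrix": ["heat map"],
    "geographical": ["choropleth"],
}

print(PLOT_FOR["two continuous"])  # ['scatter', 'bubble', 'hexbin']
```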
There is no universally best plot — only the best plot for the question being asked.
Putting it together — Maya's data inspection
Before Maya runs a t-test on her turmeric milk data, she does this:
1. Variable types. Sore-throat duration is a ratio variable, continuous.
2. Histogram of each group's durations — checks shape, looks for skew, outliers.
3. Box plots side by side for turmeric vs no-turmeric — quick visual comparison.
4. A violin or raincloud plot for a more honest comparison if distributions aren't symmetric.
5. Descriptive statistics table: n, mean, median, SD, IQR, min, max, count of missing values for each group.
6. Spot-check the data for impossible values (negative durations, 999 codes for missing data).
Now she's ready to test. If the distributions are heavily skewed, she'll either transform or use non-parametric methods. If everything looks roughly normal, she can use a t-test.
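Maya's descriptive-table and spot-check steps can be sketched in a few lines. The duration bounds, the 999 missing-data code, and the sample values are all illustrative assumptions:

```python
from statistics import mean, median, stdev

def describe(name, values, lo=0.0, hi=60.0):
    """Descriptives plus a spot-check for impossible durations (in days).
    lo/hi bounds and the 999 missing code are illustrative assumptions."""
    clean = [v for v in values if v != 999]          # drop missing codes
    bad = [v for v in clean if not lo <= v <= hi]    # impossible values
    return {
        "group": name, "n": len(clean),
        "mean": round(mean(clean), 2), "median": median(clean),
        "sd": round(stdev(clean), 2),
        "min": min(clean), "max": max(clean),
        "missing": len(values) - len(clean), "impossible": bad,
    }

turmeric = [3, 4, 4, 5, 5, 6, 999, 7]   # made-up sore-throat durations
print(describe("turmeric", turmeric))
```

One such row per group gives the descriptive table in step 5, and a non-empty `impossible` list flags step 6 before any test is run.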
What you carry into the exam
- Anscombe's quartet — same stats, different shapes; always plot.
- Five reasons to visualise: assumptions, structure, outliers, variability, model choice.
- Plot catalogue: histogram / boxplot / scatter / bar / pie / violin / raincloud / mosaic / heatmap / line.
- Tukey outlier rule: ±1.5 × IQR from quartiles.
- Tufte's four principles: data-ink ratio, lie factor, chart junk, proper scales.
- Colour-blind friendly: viridis / cividis; avoid red+green and rainbow.
- Initial vs final visualisation as distinct uses.
- Plot-decision flow: match the plot to the question and the scales.
When you're ready, send "next" and we'll move into descriptive statistics — the small set of numbers that summarise what your plots showed, and the rules for when each is appropriate.