Contrastive SSL — SimCLR / MoCo / BYOL / CLIP
Learning Without a Teacher
Every method you've seen so far has needed labels. Detection needs annotated bounding boxes. Segmentation needs pixel-level masks. Even ImageNet classification needs a human to write "this is a golden retriever." That labelling is expensive — annotating a single dataset like ImageNet costs millions of dollars of human time — and there are only so many golden-retriever-tagged photos in the world.
Then somebody asked: what if the supervision could come from the data itself?
You hand the network an image. You don't tell it what the image is. But you create a task that the image can answer with its own structure — *"predict what colour this grayscale patch should be"*, *"unscramble these jigsaw pieces"*, *"figure out which two crops came from the same image"*. The network learns features by solving these synthetic puzzles — and those features turn out to be useful for downstream tasks.
This is self-supervised learning (SSL), and over the past five years it has become the dominant pretraining recipe in vision. Every modern model — CLIP, DINO, MAE, the vision encoders inside PaliGemma and Qwen2-VL — is SSL-pretrained. No labels. Just structure.
Where the supervision comes from
The lecture opens by listing the structural signals SSL exploits.
In language: grammar (predict the next word), fill-in-the-blanks (BERT), sentence ordering.
In images — the pretext-task era:
- Colorisation — predict the colour version of a grayscale image.
- Jigsaw puzzles — given 9 shuffled patches in a grid, predict the original arrangement.
- Neighbourhood proximity — given two patches, predict their spatial relationship.
These were clever, but the features they produced were only OK. The link between *"solving jigsaws"* and *"classifying golden retrievers"* was indirect.
What replaced them is a more direct approach: contrastive learning.
The taxonomy your lecture gives you
The lecture explicitly classifies vision SSL into four families. Memorise this:
1. Old-school SSL — jigsaw, colorisation, autoencoders.
2. Contrastive — SimCLR, MoCo, DINO.
3. Language-image contrastive — CLIP, SigLIP.
4. Generative — masked autoencoders (MAE).
This unit is about families 2 and 3.
The Gelato Bet
Alyosha Efros's *Gelato Bet*: that by a certain deadline, a single self-supervised model would match supervised ImageNet pretraining on a comprehensive benchmark. The bet was won — SSL caught up around 2020–2021. Vision-research folklore worth knowing.
The contrastive recipe
All contrastive methods rest on one principle:
For each image, generate two augmented "views." The two views are a positive pair (they came from the same image). All other images in the batch are negatives. Train the network so positive pairs land close together in embedding space and each view is pushed away from all negatives.
That's it. What changes between methods is *what augmentations*, *where the negatives come from*, and *how the loss is structured*.
SimCLR — the simplest contrastive framework
SimCLR (Chen et al., ICML 2020) is the cleanest demonstration of the recipe. *"A Simple Framework for Contrastive Learning of Visual Representations."* The title is honest.
Four components — name them on the exam
1. Augmentation pipeline — generates positive pairs.
2. **Encoder $f(\cdot)$** — a ResNet — produces representations $h = f(x)$.
3. **Projection head $g(\cdot)$** — a 2-layer MLP — maps $h$ to a contrastive space: $z = g(h)$. **The loss is on $z$, not $h$. After pretraining, $g$ is discarded** and only $f$ is used downstream.
4. NT-Xent / InfoNCE loss.
Aggressive augmentations
SimCLR's headline finding: aggressive augmentation is essential. Random crop + colour jitter is the dominant pair; both must be strong, and the combination matters more than either individually. Crops teach scale/translation invariance, colour jitter teaches colour invariance, together they prevent the network from shortcutting on easy cues.
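To make "aggressive" concrete, here is a minimal sketch of a SimCLR-style view-generation pipeline in torchvision. The crop scale, jitter strengths, and blur kernel are illustrative values loosely following the paper's defaults, not an exact replication.

```python
import torchvision.transforms as T
from PIL import Image

# Illustrative SimCLR-style augmentation pipeline: random crop + strong
# colour jitter are the two workhorses; grayscale and blur add robustness.
simclr_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

img = Image.new("RGB", (256, 256))                 # stand-in for a real photo
view1, view2 = simclr_aug(img), simclr_aug(img)    # one positive pair
```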
NT-Xent (= InfoNCE)
Given a batch of $N$ images, after augmentation you have $2N$ views. For one positive pair $(i, j)$:

$$\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$

with $\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}$ (cosine similarity) and $\tau$ a temperature hyperparameter.
The connection — exam gold: NT-Xent is *softmax cross-entropy*. The "logits" are similarity scores; the "true label" is the index of the positive partner. Same equation, two names.
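The equivalence is easiest to see in code. Below is a minimal PyTorch sketch (the function name and batch layout are my own choices): each row of the similarity matrix is a "logit" vector, and the positive partner's index is the "label".

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: [N, d] projections of the two views; row i of z1 and
    row i of z2 form the positive pair."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, d]
    sim = z @ z.T / tau                                  # [2N, 2N] "logits"
    sim.fill_diagonal_(float("-inf"))                    # drop self-similarity
    N = z1.size(0)
    # the "label" of row i is the index of its positive partner
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)                 # plain softmax-CE

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```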
What SimCLR taught the field
- Bigger batches are dramatically better — the denominator covers more negatives. SimCLR scaled to batches of 4096–8192.
- The projection head matters — putting the loss on $z = g(h)$ rather than directly on $h$ gives 5–10% better downstream accuracy.
- Aggressive augmentations are essential.
But a batch of 8192 means SimCLR effectively only runs on TPU pods. The next paper solved that.
MoCo — Momentum Contrast
MoCo (He et al., CVPR 2020) keeps the contrastive recipe but decouples the number of negatives from the batch size.
Two key ideas
Idea 1 — a memory queue. Maintain a large FIFO queue of past representations (e.g. $K = 65{,}536$ keys in MoCo v1). When you process a new batch, the negatives come from this queue, not the current batch. Huge queue → many negatives → strong learning signal, *without* needing a huge batch.
Idea 2 — a momentum encoder. Maintain a separate slowly-updated encoder $f_{\theta_k}$ whose parameters are an EMA of the online encoder $f_{\theta_q}$:

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q, \qquad m \approx 0.999$$

The momentum encoder produces all keys (positives + the negatives put into the queue). Because $\theta_k$ changes slowly, the queue stays *consistent* — all entries look like they came from approximately the same encoder.
The loop
For each batch: query $q = f_{\theta_q}(x^q)$; positive key $k^+ = f_{\theta_k}(x^k)$; negative keys = the queue. InfoNCE: $q$ tries to match $k^+$ against the queue entries. Backprop only updates $\theta_q$. Update $\theta_k$ by EMA of $\theta_q$. Enqueue the current batch's keys, dequeue the oldest.
Trade-off to memorise: SimCLR needs huge batches because every batch is the negative pool. MoCo decouples — small batches, large queue.
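A minimal sketch of one MoCo-style step, assuming `f_q` / `f_k` are the online and momentum encoders and `queue` is a [K, d] tensor of past normalised keys (the names are mine, not the paper's code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def moco_step(f_q, f_k, queue, x_q, x_k, tau=0.07):
    q = F.normalize(f_q(x_q), dim=1)              # queries       [N, d]
    with torch.no_grad():
        k = F.normalize(f_k(x_k), dim=1)          # positive keys [N, d]
    l_pos = (q * k).sum(dim=1, keepdim=True)      # [N, 1]
    l_neg = q @ queue.T                           # [N, K] vs the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)        # positive sits at index 0
    return loss, k   # after the step: enqueue k, dequeue oldest, EMA-update f_k
```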
DINO's teacher EMA (next unit) is directly borrowed from MoCo's momentum encoder idea.
BYOL — no negatives at all
BYOL (Grill et al., NeurIPS 2020) — *"Bootstrap Your Own Latent"* — asked the most heretical question of the contrastive era: what if we just dropped the negatives entirely?
The contrastive intuition was: without negatives, the network will collapse — it'll output the same vector for every image. BYOL showed *this isn't true if the architecture is right*.
The setup
Online network: $f_\theta \to g_\theta \to q_\theta$ — encoder + projector + predictor.
Target network: $f_\xi \to g_\xi$ — encoder + projector (no predictor). Parameters $\xi$ are an EMA of $\theta$.
Loss: $\mathcal{L} = \big\|\overline{q_\theta(z_\theta)} - \overline{z'_\xi}\big\|_2^2 = 2 - 2\cdot\frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\|q_\theta(z_\theta)\|_2\,\|z'_\xi\|_2}$ — a normalised MSE between the online prediction and the target projection (bars denote L2-normalisation), symmetrised over the two views.
Three key architectural elements:
1. **Predictor $q_\theta$** — an extra MLP on the online branch only. Asymmetric. *Critical.*
2. Stop-gradient on the target branch — no backprop through $f_\xi$, $g_\xi$.
3. **EMA update of $\xi$ from $\theta$** — same idea as MoCo's momentum encoder.
No negatives anywhere.
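Under the setup above, here is a minimal sketch of the BYOL objective, with `online`, `target`, and `predictor` as stand-in modules (my names). The stop-gradient is just `torch.no_grad()` on the target branch; $\xi$ would be refreshed with the same EMA rule as MoCo's momentum encoder.

```python
import torch
import torch.nn.functional as F

def byol_loss(online, target, predictor, v1, v2):
    """online/target: encoder+projector stacks; v1, v2: two views of a batch."""
    p1 = F.normalize(predictor(online(v1)), dim=1)
    p2 = F.normalize(predictor(online(v2)), dim=1)
    with torch.no_grad():                        # stop-gradient: target branch
        t1 = F.normalize(target(v1), dim=1)
        t2 = F.normalize(target(v2), dim=1)
    # normalised MSE == 2 - 2 * cosine similarity, symmetrised over the views
    loss = (2 - 2 * (p1 * t2).sum(dim=1)) + (2 - 2 * (p2 * t1).sum(dim=1))
    return loss.mean()
```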
Why doesn't it collapse?
This is one of the most-discussed papers of 2020. The intuition:
- The predictor breaks symmetry — only the online branch carries the extra MLP that must predict the target's output, so "output the same constant everywhere" is no longer a trivially optimal solution.
- The EMA delay means the target is always *slightly behind* the online — chasing a moving target prevents lock-up.
Empirically, it just works. BYOL matched SimCLR on ImageNet without using a single negative. This was a turning point that led directly to DINO.
SwAV — contrastive clustering
SwAV (Caron et al., NeurIPS 2020) — *"Swapping Assignments between Views"* — takes yet another approach: online clustering.
Learn $K$ prototype vectors $C = [c_1, \dots, c_K]$ — a small codebook (e.g. $K = 3000$ in the paper's ImageNet setup).
For each augmented view $x_t$:
1. Compute features $z_t = f_\theta(x_t)$.
2. Compute a soft cluster assignment ("code") $q_t$ over the prototypes via Sinkhorn-Knopp — this enforces that across a batch each prototype is used approximately equally (prevents collapse to one prototype).
3. The swap: predict $q_s$ (the other view's cluster assignment) from $z_t$, and vice versa.
Two views of the same image should have the same cluster assignment. SwAV enforces this without ever directly comparing pairs of images — just comparing each view's features to the shared prototypes.
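A minimal sketch of the Sinkhorn-Knopp step, following the structure of SwAV's published implementation (tensor layout and defaults here are illustrative): alternating row and column normalisations push the batch toward equal prototype usage.

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """scores: [B, K] dot products between features and prototypes.
    Returns [B, K] soft assignments with roughly equal prototype usage."""
    Q = torch.exp(scores / eps).T        # [K, B]
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # rows: each prototype -> 1/K
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # cols: each sample    -> 1/B
    return (Q * B).T                     # [B, K] codes, rows sum to 1

# usage sketch: codes = sinkhorn(z @ C.T) for features z and prototypes C
```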
The family in one slide
| Method | Year | Negatives? | Key innovation |
| --- | --- | --- | --- |
| SimCLR | 2020 | Yes (in-batch) | Aggressive aug + projection head |
| MoCo | 2020 | Yes (queue) | Memory bank + momentum encoder |
| BYOL | 2020 | No | Predictor + EMA + stop-grad |
| SwAV | 2020 | Implicit (clusters) | Online clustering + Sinkhorn |
| DINO | 2021 | No | Self-distillation + multi-crop |
CLIP — when language joined the party
Everything above used images only — augmentations of the same image as the source of supervision. CLIP (Radford et al., 2021) does something different: it uses paired text as the supervision signal.
Why language matters as supervision
Three limitations of category-label supervision:
1. Hard to scale. ImageNet has 22k classes. Beyond ~10k, getting good per-class labels is infeasible.
2. Limited descriptive potential. *"White shirt"* vs *"blue shirt"* — separate classes? You'd need 10k just for clothing colours.
3. Not compositional. *"Laptop on top of a table"* combines two objects and a spatial relationship.
Natural language has none of these limitations, and the internet has billions of image-text pairs for free.
A piece of history — DeViSE (2013)
DeViSE (Frome et al., NeurIPS 2013) — *Deep Visual-Semantic Embedding* — was the earliest attempt to learn a joint visual-textual embedding space. CLIP came 8 years later and made the idea work at scale.
The CLIP setup
Two encoders, one shared embedding space:
**Image encoder** — ResNet or ViT — outputs $I_f \in \mathbb{R}^{d_i}$.
**Text encoder** — Transformer — outputs $T_f \in \mathbb{R}^{d_t}$.
Both are projected (by learned matrices $W_i$, $W_t$) into a shared $d_e$-dimensional space and L2-normalised.
The pseudocode (memorise)
```python
# CLIP training objective, as pseudocode (Radford et al., 2021)
I_f = image_encoder(I)                   # [n, d_i]
T_f = text_encoder(T)                    # [n, d_t]
I_e = l2_normalize(I_f @ W_i, axis=1)    # [n, d_e]
T_e = l2_normalize(T_f @ W_t, axis=1)    # [n, d_e]
logits = (I_e @ T_e.T) * exp(t)          # [n, n] pairwise similarities
labels = arange(n)                       # diagonal is positive
loss_i = cross_entropy(logits, labels, axis=0)  # image-to-text
loss_t = cross_entropy(logits, labels, axis=1)  # text-to-image
loss = (loss_i + loss_t) / 2             # symmetric
```
It's symmetric InfoNCE in matrix form. The logit matrix has:
- Diagonal entries = positive pairs (image $i$ ↔ text $i$).
- Off-diagonal = negative pairs.
- Cross-entropy along rows = image-to-text retrieval. Along columns = text-to-image.
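The same computation as the pseudocode, written as a runnable PyTorch sketch (`clip_loss` is my name). Transposing the logit matrix is the runnable equivalent of "cross-entropy along the other axis":

```python
import torch
import torch.nn.functional as F

def clip_loss(I_e, T_e, log_t):
    """I_e, T_e: L2-normalised image/text embeddings, both [n, d_e];
    log_t: learnable log-temperature (scalar tensor)."""
    logits = I_e @ T_e.T * log_t.exp()              # [n, n]
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, labels)        # rows: image -> text
    loss_t = F.cross_entropy(logits.T, labels)      # cols: text -> image
    return (loss_i + loss_t) / 2
```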
SigLIP later replaced this softmax with independent sigmoids; CLIP is the softmax original.
The secret sauce — data, not architecture
CLIP's secret was the WIT (WebImageText) dataset:
- 400 million pairs scraped from the public internet.
- Constructed using 500k search queries, with up to 20k pairs per query.
- Word count comparable to GPT-2's training corpus.
CLIP's architectural contribution was modest. Its data contribution was unprecedented. Memorise: *why does CLIP work? → scale of natural-language supervision, 400M pairs.*
Zero-shot classification — the magic trick
Given a new image classification task, no fine-tuning needed:
1. Embed each class name as text: *"a photo of a {class}"*.
2. Encode all text prompts → text embeddings.
3. Encode the input image.
4. Compute cosine similarity between the image and each text embedding.
5. Predict the class with the highest similarity.
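As a concrete sketch of the recipe using OpenAI's released `clip` package (assuming it is installed; the image path and class list are placeholders):

```python
import torch
import clip
from PIL import Image

classes = ["golden retriever", "tabby cat", "pickup truck"]  # placeholder labels

model, preprocess = clip.load("ViT-B/32", device="cpu")
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)       # placeholder file
prompts = clip.tokenize([f"a photo of a {c}" for c in classes])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(prompts)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # L2-normalise so
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)   # dot = cosine sim
    sims = (img_emb @ txt_emb.T).squeeze(0)                  # [len(classes)]

print(classes[sims.argmax().item()])                         # highest similarity wins
```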
CLIP zero-shots ImageNet at ~76% top-1 — competitive with a fully-supervised ResNet-50 that was trained on ImageNet.
CLIP's limitations — the "Why?" slide
- Not aware of fine-grained details — colours, exact part attributes.
- Not compositional — *"person on horse"* vs *"horse on person"* — CLIP scores them similarly because it matches global representations, not structure.
- Matching global representations is insufficient for spatial relationships.
- Needs hard negatives for compositional reasoning.
Two papers diagnose:
- Winoground (Thrush et al., CVPR 2022) — captions differ only in word order; CLIP performs near chance.
- CLIP Association Bias (Yamada et al., EMNLP 2023, *"When are Lemons Purple?"*) — CLIP sometimes associates "lemons" with "purple" because of textual co-occurrence in training data.
CLIP's impact
- CLIP's visual encoder is one of the most-used vision encoders today.
- LAION-5B — an open-source 5-billion-pair replication of WIT.
- BLIP, InstructBLIP — CLIP-style ideas with added captioning + instruction-tuning.
- Stable Diffusion — uses CLIP's *text encoder* to condition image generation. Every Stable Diffusion image you see was generated by a model that uses CLIP inside.
- SigLIP — directly succeeds CLIP, swaps softmax for sigmoid.
CLIP is, in many ways, the single most consequential vision paper of 2021 — and the bridge between vision and language.
What you carry into the exam
- The four families of SSL. The Gelato Bet.
- SimCLR's four components and the aggressive-augmentation finding. The NT-Xent ≡ softmax-CE connection.
- MoCo's queue + momentum encoder + the trade-off vs SimCLR.
- BYOL's predictor + stop-grad + EMA, and why it doesn't collapse.
- SwAV's Sinkhorn-Knopp equipartition.
- The three motivations for language supervision. DeViSE as historical predecessor.
- The CLIP pseudocode end-to-end, especially the *symmetric* loss. WIT's 400M-pair scale. The zero-shot recipe.
- CLIP's compositional failures and the Winoground / lemons-purple results. The downstream impact list.
Every multimodal model in your course flows downstream from these ideas.