Contrastive SSL — SimCLR / MoCo / BYOL / CLIP
Learning Without a Teacher
Every method you've seen so far has needed labels. Detection needs annotated bounding boxes. Segmentation needs pixel-level masks. Even ImageNet classification needs a human to write "this is a golden retriever." That labelling is expensive — annotating a single dataset like ImageNet costs millions of dollars of human time — and there are only so many golden-retriever-tagged photos in the world.
Then somebody asked: what if the supervision could come from the data itself?
You hand the network an image. You don't tell it what the image is. But you create a task that the image can answer with its own structure — *"predict what colour this grayscale patch should be"*, *"unscramble these jigsaw pieces"*, *"figure out which two crops came from the same image"*. The network learns features by solving these synthetic puzzles — and those features turn out to be useful for downstream tasks.
This is self-supervised learning (SSL), and over the past five years it has become the dominant pretraining recipe in vision. Every modern model — CLIP, DINO, MAE, the vision encoders inside PaliGemma and Qwen2-VL — is SSL-pretrained. No labels. Just structure.
Where the supervision comes from
The lecture opens by listing the structural signals SSL exploits.
In language: grammar (predict the next word), fill-in-the-blanks (BERT), sentence ordering.
In images — the pretext-task era:
- Colorisation — predict the colour version of a grayscale image.
- Jigsaw puzzles — given 9 shuffled patches in a grid, predict the original arrangement.
- Neighbourhood proximity — given two patches, predict their spatial relationship.
These were clever, but the features they produced were only OK. The link between *"solving jigsaws"* and *"classifying golden retrievers"* was indirect.
What replaced them is a more direct approach: contrastive learning.
The taxonomy your lecture gives you
The lecture explicitly classifies vision SSL into four families. Memorise this:
1. Old-school SSL — jigsaw, colorisation, autoencoders.
2. Contrastive — SimCLR, MoCo, DINO.
3. Language-image contrastive — CLIP, SigLIP.
4. Generative — masked autoencoders (MAE).
This unit is about families 2 and 3.
The Gelato Bet
Alyosha Efros's *Gelato Bet*: that by a certain deadline, a single self-supervised model would match supervised ImageNet pretraining on a comprehensive benchmark. The bet was won — SSL caught up around 2020–2021. Vision-research folklore worth knowing.
The contrastive recipe
All contrastive methods rest on one principle:
For each image, generate two augmented "views." The two views are a positive pair (they came from the same image). All other images in the batch are negatives. Train the network so positive pairs land close together in embedding space and each view is pushed away from all negatives.
That's it. What changes between methods is *what augmentations*, *where the negatives come from*, and *how the loss is structured*.
SimCLR — the simplest contrastive framework
SimCLR (Chen et al., ICML 2020) is the cleanest demonstration of the recipe. *"A Simple Framework for Contrastive Learning of Visual Representations."* The title is honest.
Four components — name them on the exam
1. Augmentation pipeline — generates positive pairs.
2. **Encoder $f(\cdot)$** — a ResNet — produces representations $h = f(x)$.
3. **Projection head $g(\cdot)$** — a 2-layer MLP — maps $h$ to a contrastive space: $z = g(h)$. **The loss is on $z$, not $h$. After pretraining, $g$ is discarded** and only $f$ is used downstream.
4. NT-Xent / InfoNCE loss.
Aggressive augmentations
SimCLR's headline finding: aggressive augmentation is essential. Random crop + colour jitter is the dominant pair; both must be strong, and the combination matters more than either individually. Crops teach scale/translation invariance, colour jitter teaches colour invariance, together they prevent the network from shortcutting on easy cues.
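To make "aggressive" concrete, here is a minimal sketch of a SimCLR-style view-generation pipeline in torchvision. The crop scale, jitter strengths, and blur kernel are illustrative values loosely following the paper's defaults, not an exact replication.

```python
import torchvision.transforms as T
from PIL import Image

# Illustrative SimCLR-style augmentation pipeline: random crop + strong
# colour jitter are the two workhorses; grayscale and blur add robustness.
simclr_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

img = Image.new("RGB", (256, 256))                 # stand-in for a real photo
view1, view2 = simclr_aug(img), simclr_aug(img)    # one positive pair
```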
NT-Xent (= InfoNCE)
Given a batch of $N$ images, after augmentation you have $2N$ views. For one positive pair $(i, j)$:

$$\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$

with $\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}$ (cosine similarity) and $\tau$ a temperature hyperparameter.
The connection — exam gold: NT-Xent is *softmax cross-entropy*. The "logits" are similarity scores; the "true label" is the index of the positive partner. Same equation, two names.
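The equivalence is easiest to see in code. Below is a minimal PyTorch sketch (the function name and batch layout are my own choices): each row of the similarity matrix is a "logit" vector, and the positive partner's index is the "label".

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: [N, d] projections of the two views; row i of z1 and
    row i of z2 form the positive pair."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, d]
    sim = z @ z.T / tau                                  # [2N, 2N] "logits"
    sim.fill_diagonal_(float("-inf"))                    # drop self-similarity
    N = z1.size(0)
    # the "label" of row i is the index of its positive partner
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)]).to(z.device)
    return F.cross_entropy(sim, targets)                 # plain softmax-CE

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```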
What SimCLR taught the field
- Bigger batches are dramatically better — the denominator covers more negatives. SimCLR scaled to batches of 4096–8192.
- The projection head matters — putting the loss on $z = g(h)$ rather than directly on $h$ gives 5–10% better downstream accuracy.
- Aggressive augmentations are essential.
But a batch of 8192 means SimCLR effectively only runs on TPU pods. The next paper solved that.
MoCo — Momentum Contrast
MoCo (He et al., CVPR 2020) keeps the contrastive recipe but decouples the number of negatives from the batch size.
Two key ideas
Idea 1 — a memory queue. Maintain a large FIFO queue of past representations (e.g. $K = 65{,}536$ keys in MoCo v1). When you process a new batch, the negatives come from this queue, not the current batch. Huge queue → many negatives → strong learning signal, *without* needing a huge batch.
Idea 2 — a momentum encoder. Maintain a separate slowly-updated encoder $f_{\theta_k}$ whose parameters are an EMA of the online encoder $f_{\theta_q}$:

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q, \qquad m \approx 0.999$$

The momentum encoder produces all keys (positives + the negatives put into the queue). Because $\theta_k$ changes slowly, the queue stays *consistent* — all entries look like they came from approximately the same encoder.
The loop
For each batch: query $q = f_{\theta_q}(x^q)$; positive key $k^+ = f_{\theta_k}(x^k)$; negative keys = the queue. InfoNCE: $q$ tries to match $k^+$ against the queue entries. Backprop only updates $\theta_q$. Update $\theta_k$ by EMA of $\theta_q$. Enqueue the current batch's keys, dequeue the oldest.
Trade-off to memorise: SimCLR needs huge batches because every batch is the negative pool. MoCo decouples — small batches, large queue.
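A minimal sketch of one MoCo-style step, assuming `f_q` / `f_k` are the online and momentum encoders and `queue` is a [K, d] tensor of past normalised keys (the names are mine, not the paper's code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def moco_step(f_q, f_k, queue, x_q, x_k, tau=0.07):
    q = F.normalize(f_q(x_q), dim=1)              # queries       [N, d]
    with torch.no_grad():
        k = F.normalize(f_k(x_k), dim=1)          # positive keys [N, d]
    l_pos = (q * k).sum(dim=1, keepdim=True)      # [N, 1]
    l_neg = q @ queue.T                           # [N, K] vs the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)        # positive sits at index 0
    return loss, k   # after the step: enqueue k, dequeue oldest, EMA-update f_k
```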
DINO's teacher EMA (next unit) is directly borrowed from MoCo's momentum encoder idea.
BYOL — no negatives at all
BYOL (Grill et al., NeurIPS 2020) — *"Bootstrap Your Own Latent"* — asked the most heretical question of the contrastive era: what if we just dropped the negatives entirely?
The contrastive intuition was: without negatives, the network will collapse — it'll output the same vector for every image. BYOL showed *this isn't true if the architecture is right*.
The setup
Online network: $f_\theta \to g_\theta \to q_\theta$ — encoder + projector + predictor.
Target network: $f_\xi \to g_\xi$ — encoder + projector (no predictor). Parameters $\xi$ are an EMA of $\theta$.
Loss: $\mathcal{L} = \big\|\overline{q_\theta(z_\theta)} - \overline{z'_\xi}\big\|_2^2 = 2 - 2\cdot\frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\|q_\theta(z_\theta)\|_2\,\|z'_\xi\|_2}$ — a normalised MSE between the online prediction and the target projection (bars denote L2-normalisation), symmetrised over the two views.
Three key architectural elements:
1. **Predictor $q_\theta$** — an extra MLP on the online branch only. Asymmetric. *Critical.*
2. Stop-gradient on the target branch — no backprop through $f_\xi$, $g_\xi$.
3. **EMA update of $\xi$ from $\theta$** — same idea as MoCo's momentum encoder.
No negatives anywhere.
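Under the setup above, here is a minimal sketch of the BYOL objective, with `online`, `target`, and `predictor` as stand-in modules (my names). The stop-gradient is just `torch.no_grad()` on the target branch; $\xi$ would be refreshed with the same EMA rule as MoCo's momentum encoder.

```python
import torch
import torch.nn.functional as F

def byol_loss(online, target, predictor, v1, v2):
    """online/target: encoder+projector stacks; v1, v2: two views of a batch."""
    p1 = F.normalize(predictor(online(v1)), dim=1)
    p2 = F.normalize(predictor(online(v2)), dim=1)
    with torch.no_grad():                        # stop-gradient: target branch
        t1 = F.normalize(target(v1), dim=1)
        t2 = F.normalize(target(v2), dim=1)
    # normalised MSE == 2 - 2 * cosine similarity, symmetrised over the views
    loss = (2 - 2 * (p1 * t2).sum(dim=1)) + (2 - 2 * (p2 * t1).sum(dim=1))
    return loss.mean()
```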
Why doesn't it collapse?
This is one of the most-discussed papers of 2020. The intuition:
- The predictor breaks symmetry — only the online branch carries the extra MLP that must predict the target's output, so "output the same constant everywhere" is no longer a trivially optimal solution.
- The EMA delay means the target is always *slightly behind* the online — chasing a moving target prevents lock-up.
Empirically, it just works. BYOL matched SimCLR on ImageNet without using a single negative. This was a turning point that led directly to DINO.
SwAV — contrastive clustering
SwAV (Caron et al., NeurIPS 2020) — *"Swapping Assignments between Views"* — takes yet another approach: online clustering.
Learn $K$ prototype vectors $C = [c_1, \dots, c_K]$ — a small codebook (e.g. $K = 3000$ in the paper's ImageNet setup).
For each augmented view $x_t$:
1. Compute features $z_t = f_\theta(x_t)$.
2. Compute a soft cluster assignment ("code") $q_t$ over the prototypes via Sinkhorn-Knopp — this enforces that across a batch each prototype is used approximately equally (prevents collapse to one prototype).
3. The swap: predict $q_s$ (the other view's cluster assignment) from $z_t$, and vice versa.
Two views of the same image should have the same cluster assignment. SwAV enforces this without ever directly comparing pairs of images — just comparing each view's features to the shared prototypes.
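A minimal sketch of the Sinkhorn-Knopp step, following the structure of SwAV's published implementation (tensor layout and defaults here are illustrative): alternating row and column normalisations push the batch toward equal prototype usage.

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """scores: [B, K] dot products between features and prototypes.
    Returns [B, K] soft assignments with roughly equal prototype usage."""
    Q = torch.exp(scores / eps).T        # [K, B]
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # rows: each prototype -> 1/K
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # cols: each sample    -> 1/B
    return (Q * B).T                     # [B, K] codes, rows sum to 1

# usage sketch: codes = sinkhorn(z @ C.T) for features z and prototypes C
```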
The family in one slide
| Method | Year | Negatives? | Key innovation |
| --- | --- | --- | --- |
| SimCLR | 2020 | Yes (in-batch) | Aggressive aug + projection head |
| MoCo | 2020 | Yes (queue) | Memory bank + momentum encoder |
| BYOL | 2020 | No | Predictor + EMA + stop-grad |
| SwAV | 2020 | Implicit (clusters) | Online clustering + Sinkhorn |
| DINO | 2021 | No | Self-distillation + multi-crop |
CLIP — when language joined the party
Everything above used images only — augmentations of the same image as the source of supervision. CLIP (Radford et al., 2021) does something different: it uses paired text as the supervision signal.
Why language matters as supervision
Three limitations of category-label supervision:
1. Hard to scale. ImageNet has 22k classes. Beyond ~10k, getting good per-class labels is infeasible.
2. Limited descriptive potential. *"White shirt"* vs *"blue shirt"* — separate classes? You'd need 10k just for clothing colours.
3. Not compositional. *"Laptop on top of a table"* combines two objects and a spatial relationship.
Natural language has none of these limitations, and the internet has billions of image-text pairs for free.
A piece of history — DeViSE (2013)
DeViSE (Frome et al., NeurIPS 2013) — *Deep Visual-Semantic Embedding* — was the earliest attempt to learn a joint visual-textual embedding space. CLIP came 8 years later and made the idea work at scale.
The CLIP setup
Two encoders, one shared embedding space:
**Image encoder** — ResNet or ViT — outputs $I_f \in \mathbb{R}^{d_i}$.
**Text encoder** — Transformer — outputs $T_f \in \mathbb{R}^{d_t}$.
Both are projected (by learned matrices $W_i$, $W_t$) into a shared $d_e$-dimensional space and L2-normalised.
The pseudocode (memorise)
```python
# CLIP training objective, as pseudocode (Radford et al., 2021)
I_f = image_encoder(I)                   # [n, d_i]
T_f = text_encoder(T)                    # [n, d_t]
I_e = l2_normalize(I_f @ W_i, axis=1)    # [n, d_e]
T_e = l2_normalize(T_f @ W_t, axis=1)    # [n, d_e]
logits = (I_e @ T_e.T) * exp(t)          # [n, n] pairwise similarities
labels = arange(n)                       # diagonal is positive
loss_i = cross_entropy(logits, labels, axis=0)  # image-to-text
loss_t = cross_entropy(logits, labels, axis=1)  # text-to-image
loss = (loss_i + loss_t) / 2             # symmetric
```
It's symmetric InfoNCE in matrix form. The logit matrix has:
- Diagonal entries = positive pairs (image $i$ ↔ text $i$).
- Off-diagonal = negative pairs.
- Cross-entropy along rows = image-to-text retrieval. Along columns = text-to-image.
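The same computation as the pseudocode, written as a runnable PyTorch sketch (`clip_loss` is my name). Transposing the logit matrix is the runnable equivalent of "cross-entropy along the other axis":

```python
import torch
import torch.nn.functional as F

def clip_loss(I_e, T_e, log_t):
    """I_e, T_e: L2-normalised image/text embeddings, both [n, d_e];
    log_t: learnable log-temperature (scalar tensor)."""
    logits = I_e @ T_e.T * log_t.exp()              # [n, n]
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, labels)        # rows: image -> text
    loss_t = F.cross_entropy(logits.T, labels)      # cols: text -> image
    return (loss_i + loss_t) / 2
```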
SigLIP later replaced this softmax with independent sigmoids; CLIP is the softmax original.
The secret sauce — data, not architecture
CLIP's secret was the WIT (WebImageText) dataset:
- 400 million pairs scraped from the public internet.
- Constructed using 500k search queries, with up to 20k pairs per query.
- Word count comparable to GPT-2's training corpus.
CLIP's architectural contribution was modest. Its data contribution was unprecedented. Memorise: *why does CLIP work? → scale of natural-language supervision, 400M pairs.*
Zero-shot classification — the magic trick
Given a new image classification task, no fine-tuning needed:
1. Embed each class name as text: *"a photo of a {class}"*.
2. Encode all text prompts → text embeddings.
3. Encode the input image.
4. Compute cosine similarity between the image and each text embedding.
5. Predict the class with the highest similarity.
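As a concrete sketch of the recipe using OpenAI's released `clip` package (assuming it is installed; the image path and class list are placeholders):

```python
import torch
import clip
from PIL import Image

classes = ["golden retriever", "tabby cat", "pickup truck"]  # placeholder labels

model, preprocess = clip.load("ViT-B/32", device="cpu")
image = preprocess(Image.open("dog.jpg")).unsqueeze(0)       # placeholder file
prompts = clip.tokenize([f"a photo of a {c}" for c in classes])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(prompts)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # L2-normalise so
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)   # dot = cosine sim
    sims = (img_emb @ txt_emb.T).squeeze(0)                  # [len(classes)]

print(classes[sims.argmax().item()])                         # highest similarity wins
```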
CLIP zero-shots ImageNet at ~76% top-1 — competitive with a fully-supervised ResNet-50 that was trained on ImageNet.
CLIP's limitations — the "Why?" slide
- Not aware of fine-grained details — colours, exact part attributes.
- Not compositional — *"person on horse"* vs *"horse on person"* — CLIP scores them similarly because it matches global representations, not structure.
- Matching global representations is insufficient for spatial relationships.
- Needs hard negatives for compositional reasoning.
Two papers diagnose:
- Winoground (Thrush et al., CVPR 2022) — captions differ only in word order; CLIP performs near chance.
- CLIP Association Bias (Yamada et al., EMNLP 2023, *"When are Lemons Purple?"*) — CLIP sometimes associates "lemons" with "purple" because of textual co-occurrence in training data.
CLIP's impact
- CLIP's visual encoder is one of the most-used vision encoders today.
- LAION-5B — an open-source 5-billion-pair replication of WIT.
- BLIP, InstructBLIP — CLIP-style ideas with added captioning + instruction-tuning.
- Stable Diffusion — uses CLIP's *text encoder* to condition image generation. Every Stable Diffusion image you see was generated by a model that uses CLIP inside.
- SigLIP — directly succeeds CLIP, swaps softmax for sigmoid.
CLIP is, in many ways, the single most consequential vision paper of 2021 — and the bridge between vision and language.
What you carry into the exam
- The four families of SSL. The Gelato Bet.
- SimCLR's four components and the aggressive-augmentation finding. The NT-Xent ≡ softmax-CE connection.
- MoCo's queue + momentum encoder + the trade-off vs SimCLR.
- BYOL's predictor + stop-grad + EMA, and why it doesn't collapse.
- SwAV's Sinkhorn-Knopp equipartition.
- The three motivations for language supervision. DeViSE as historical predecessor.
- The CLIP pseudocode end-to-end, especially the *symmetric* loss. WIT's 400M-pair scale. The zero-shot recipe.
- CLIP's compositional failures and the Winoground / lemons-purple results. The downstream impact list.
Every multimodal model in your course flows downstream from these ideas.