Contrastive SSL — SimCLR / MoCo / BYOL / CLIP
Intuition
Self-supervised learning invents the supervision signal: 'this random crop and that random crop of the same image should look similar; everything else in the batch should look different.' That's the entirety of SimCLR. Variants then attack the dependency on huge batches.
Explanation
SimCLR: take an image, apply two strong random augmentations → x₁, x₂. Pass through a shared encoder, then a projection MLP g(·), to get z₁, z₂. In a batch of N images you have 2N views — 2N − 2 negatives per anchor. Optimise the NT-Xent (InfoNCE) loss to pull z₁, z₂ together and push them away from negatives. Needs LARGE batches (4096+) for enough negatives.
InfoNCE / NT-Xent: L_i = −log [ exp(sim(z_i, z_i⁺)/τ) / Σ_j exp(sim(z_i, z_j)/τ) ], where the sum runs over the positive and every negative in the batch (the anchor itself is excluded). Mathematically identical to softmax cross-entropy with one positive and many negatives. τ = temperature: lower τ sharpens the softmax, up-weighting hard negatives.
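The loss is short enough to sketch in NumPy. This is a toy illustration, not SimCLR's implementation; the convention that rows 2k and 2k+1 form the positive pair is an assumption made here for simplicity:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N views, where rows 2k and 2k+1 are assumed to be
    the two augmentations of image k (illustrative pairing convention)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                      # pairwise cosine sims as logits
    np.fill_diagonal(sim, -np.inf)           # an anchor never contrasts with itself
    pos = np.arange(len(z)) ^ 1              # partner index: 0<->1, 2<->3, ...
    # softmax cross-entropy with the partner view as the correct class
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), pos].mean()
```

Real training runs this inside an autograd framework and backpropagates through both views; the sketch only computes the scalar loss.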
Why the projection head g: the contrastive loss is aggressive — because it enforces invariance to the augmentations, it destroys the information they vary (colour, position). Putting g between the encoder f and the loss lets f preserve broad image information, while g learns a contrastive-specific subspace where the invariances are enforced. Downstream: discard g, use f. Big empirical improvement.
MoCo (Momentum Contrast) decouples negatives from batch size: maintain a queue of past keys (e.g., 65 k entries) and use a momentum-updated key encoder θ_k ← m·θ_k + (1−m)·θ_q with m ≈ 0.999. Queue gives many negatives without needing them in the current batch; momentum keeps the queue's keys consistent. Trains well with batch size ~256.
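MoCo's two mechanisms (EMA key encoder, FIFO queue of keys) are simple enough to sketch; the names below are illustrative, not taken from the paper's code:

```python
import numpy as np

def momentum_update(theta_q, theta_k, m=0.999):
    """EMA key-encoder update: theta_k <- m*theta_k + (1-m)*theta_q."""
    return [m * k + (1 - m) * q for q, k in zip(theta_q, theta_k)]

class KeyQueue:
    """Fixed-size FIFO of past keys, reused as negatives across batches."""
    def __init__(self, size, dim):
        self.keys = np.zeros((size, dim))
        self.ptr = 0

    def enqueue(self, batch_keys):
        idx = (self.ptr + np.arange(len(batch_keys))) % len(self.keys)
        self.keys[idx] = batch_keys          # overwrite the oldest slots
        self.ptr = (self.ptr + len(batch_keys)) % len(self.keys)
```

With m ≈ 0.999 the key encoder drifts slowly, so keys enqueued hundreds of steps ago are still produced by a nearly-identical encoder — that consistency is what makes the queue usable.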
BYOL goes further — no negatives at all. Online network (encoder + projector + predictor) and target network (encoder + projector, EMA of online). Train the online predictor to match the target's projection of the other augmented view; use stop-gradient on the target. Avoids collapse via the asymmetric architecture and momentum updates.
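A toy linear version of one BYOL step, with single matrices standing in for the networks (purely illustrative; the point is that the loss gradient updates only the online branch, while the target moves by EMA):

```python
import numpy as np

def byol_step(W_o, P, W_t, x1, x2, lr=0.01, m=0.99):
    """One asymmetric BYOL step with linear 'networks'.
    Only the online branch (encoder W_o, predictor P) receives loss
    gradients; the target W_t moves purely by EMA (the stop-gradient)."""
    pred = P @ (W_o @ x1)                    # online: encode view 1, then predict
    target = W_t @ x2                        # target: encode view 2, no gradient
    err = pred - target                      # gradient of 0.5*||err||^2 w.r.t. pred
    P_new = P - lr * np.outer(err, W_o @ x1)
    W_o_new = W_o - lr * np.outer(P.T @ err, x1)
    W_t_new = m * W_t + (1 - m) * W_o_new    # EMA, not gradient descent
    return W_o_new, P_new, W_t_new
```

If the target also received gradients (no stop-gradient), both branches could minimise the loss by mapping everything to the same constant — the collapse described below.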
CLIP: 400 M (image, caption) pairs from the web. Two encoders (image, text) project into a shared embedding space. In each batch of N pairs, the N × N similarity matrix has the diagonal as positives. Symmetric cross-entropy: for image i, the correct match is text i; vice versa. Zero-shot classification: for each class name, encode 'a photo of a {class}' via text encoder → class embedding; embed image; predict via argmax of cosine similarities.
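Zero-shot prediction reduces to a cosine-similarity argmax once the embeddings exist. A sketch with precomputed vectors standing in for CLIP's encoder outputs (hypothetical inputs, not the real model):

```python
import numpy as np

def zero_shot_predict(image_emb, text_embs, class_names):
    """Pick the class whose prompt embedding ('a photo of a {class}')
    is most cosine-similar to the image embedding. The embeddings here
    are stand-ins for encoder outputs."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return class_names[int(np.argmax(txt @ img))]
```

Because normalisation happens inside, the text embeddings can be computed once per class and cached; classification then costs one matrix-vector product per image.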
CLIP failure modes: (1) compositional weakness — 'a horse riding a man' and 'a man riding a horse' are nearly indistinguishable (global representation, not relational); (2) no hard negatives during training — random batch negatives are too easy, so fine-grained discrimination doesn't emerge; (3) Concept Association Bias — words treated as a bag, attributes attach to wrong objects.
Definitions
- InfoNCE — Contrastive loss — softmax CE with a positive logit and many negative logits, scaled by temperature τ.
- NT-Xent — Normalized Temperature-scaled cross-entropy — SimCLR's specific form of InfoNCE.
- Projection head — Small MLP between encoder and contrastive loss; discarded for downstream use; lets the encoder preserve broad features.
- Momentum encoder — Slowly-updated copy of the online encoder via EMA; used as the key/target side in MoCo/BYOL.
- Zero-shot classification (CLIP) — Encode 'a photo of a {class}' for every class; predict argmax cosine similarity with the image embedding.
Formulas
L_{i} = -\log \frac{\exp(\mathrm{sim}(z_i, z_i^+)/\tau)}{\sum_j \exp(\mathrm{sim}(z_i, z_j)/\tau)}

\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q,\ \ m \approx 0.999\ \ \text{(MoCo)}

L_{\text{CLIP}} = \tfrac{1}{2}(L_{i\to t} + L_{t\to i}),\ \ L_{i\to t,i} = -\log \frac{\exp(z_i^I \cdot z_i^T / \tau)}{\sum_j \exp(z_i^I \cdot z_j^T / \tau)}

\text{Zero-shot:}\ \hat y = \arg\max_c \cos(z^I,\ z^T_{\text{`a photo of a `} + c})
Derivations
InfoNCE ≡ cross-entropy: with one positive logit s⁺ and K − 1 negative logits {sⱼ}, the softmax CE for the positive class is −log(e^{s⁺}/Σⱼ e^{sⱼ}) = −s⁺ + log Σⱼ e^{sⱼ}. NT-Xent is exactly this with sim/τ as the logit and the positive picked by augmentation pairing.
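The identity can be confirmed numerically in a couple of lines (arbitrary random logits; s⁺ is logits[0] by convention here):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=5)              # logits[0] plays s+, the rest are negatives
# InfoNCE form: -s+ + log-sum-exp over all logits
info_nce = -logits[0] + np.log(np.exp(logits).sum())
# softmax cross-entropy form for the positive class
softmax_ce = -np.log(np.exp(logits[0]) / np.exp(logits).sum())
# the two expressions agree to floating-point precision
```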
Examples
- SimCLR augmentation pipeline that worked: random crop + colour jitter + Gaussian blur. Removing colour jitter degraded performance more than removing any other single augmentation.
- CLIP zero-shot on ImageNet via 'a photo of a {class}': ~76% top-1 zero-shot (no fine-tuning), comparable to a supervised ResNet-50.
- BYOL collapse without stop-gradient: the online network just copies the target (gradient flows symmetrically) — representations collapse to a constant.
Diagrams
- SimCLR: two augmented views of x → shared encoder → projection g → NT-Xent loss against other batch items as negatives.
- MoCo: query encoder + momentum key encoder + queue of past keys (negatives).
- CLIP training: N×N similarity matrix; diagonal positives; symmetric CE over rows and columns.
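The symmetric CE over the N×N matrix in the CLIP diagram can be sketched directly (a minimal NumPy version of the training objective, not OpenAI's code):

```python
import numpy as np

def clip_loss(z_img, z_txt, tau=0.07):
    """Symmetric cross-entropy over the NxN cosine-similarity matrix;
    the diagonal (matched image-text pairs) holds the positives."""
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    logits = z_img @ z_txt.T / tau
    diag = np.arange(len(logits))
    rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # image -> text
    cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))  # text -> image
    return -0.5 * (rows[diag, diag].mean() + cols[diag, diag].mean())
```

Row-wise softmax scores each image against all captions; column-wise scores each caption against all images; averaging the two gives the symmetric loss.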
Edge cases
- SimCLR with batch size 256 underperforms supervised — needs 4 k+ for enough negatives.
- BYOL without stop-gradient collapses to a constant representation.
- CLIP fine-grained classification (dog breeds): zero-shot performance drops; prompt engineering helps but limited.
Common mistakes
- Stating SimCLR needs no negatives — it does (BYOL is the one without).
- Discarding the encoder f and keeping the projection g for downstream — opposite of what you should do.
- Treating CLIP's loss as one-directional CE — it's symmetric over rows AND columns.
- Calling CLIP's zero-shot 'few-shot' — no labelled examples of the target classes are seen.
Shortcuts
- Projection head g present at training, discarded at downstream.
- MoCo: queue + momentum. BYOL: predictor + stop-gradient. SimCLR: just big batches.
- CLIP zero-shot template: 'a photo of a {class}'.