DINO, MAE, JEPA — Modern SSL Beyond Contrastive
Intuition
DINO trains a student to match an EMA teacher's output distribution. MAE masks 75% of patches and reconstructs pixels. JEPA predicts target REPRESENTATIONS instead of pixels. All three avoid negatives entirely.
Explanation
DINO = self-DIstillation with NO labels. Two networks with identical architecture: a student s and a teacher t. Different augmented crops of the same image pass through both. The student is trained to match the teacher's softmax distribution via cross-entropy L = −Σ p_t · log p_s. The teacher is NOT trained by gradients — its weights are an EMA of the student: θ_t ← λ · θ_t + (1 − λ) · θ_s, with λ on a cosine schedule from 0.996 → 1.
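A minimal PyTorch sketch of the EMA teacher update (module handling is illustrative; `lam` would follow the cosine schedule above):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, lam):
    # Teacher is never trained by gradients: its weights track an
    # exponential moving average of the student's weights.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(lam).add_(p_s.detach(), alpha=1 - lam)
```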
Multi-crop in DINO: 2 GLOBAL views (>50% of image area, 224²) → teacher AND student; 6-10 LOCAL views (<50%, 96²) → student only. The student must predict the teacher's global representation from limited local information — enforces local-to-global consistency.
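A hedged sketch of how multi-crop views are paired in the loss, assuming a `dino_loss(teacher_out, student_out)` callable (see the centering/sharpening sketch after the next paragraph); only the teacher's global views serve as targets:

```python
def multicrop_loss(teacher_global_out, student_all_out, dino_loss):
    # teacher_global_out: outputs for the 2 global views (teacher only)
    # student_all_out: outputs for all views (2 global + 6-10 local)
    total, n_pairs = 0.0, 0
    for t_idx, t_out in enumerate(teacher_global_out):
        for s_idx, s_out in enumerate(student_all_out):
            if s_idx == t_idx:  # skip matching a global view to itself
                continue
            total += dino_loss(t_out, s_out)
            n_pairs += 1
    return total / n_pairs
```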
Why naive self-distillation collapses: (a) constant output — all images map to the same vector (L = 0 trivially); (b) one-hot output dominated by one dimension. DINO's two prevention tricks must work together: Centering — subtract a running mean from teacher logits before softmax (prevents any one dimension from dominating); Sharpening — divide teacher logits by very low temperature τ_t ≈ 0.04 (peakier target). Pure centering → uniform; pure sharpening → one-hot; combined they balance.
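A minimal sketch of the DINO loss with both anti-collapse tricks, assuming raw logits from the student/teacher projection heads; τ_s = 0.1 and center momentum m = 0.9 are typical values, not prescriptions:

```python
import torch
import torch.nn.functional as F

class DINOLoss(torch.nn.Module):
    def __init__(self, out_dim=65536, tau_s=0.1, tau_t=0.04, m=0.9):
        super().__init__()
        self.tau_s, self.tau_t, self.m = tau_s, tau_t, m
        self.register_buffer("center", torch.zeros(1, out_dim))

    def forward(self, teacher_logits, student_logits):
        # Centering (subtract running mean) + sharpening (low tau_t).
        p_t = F.softmax((teacher_logits - self.center) / self.tau_t,
                        dim=-1).detach()
        log_p_s = F.log_softmax(student_logits / self.tau_s, dim=-1)
        loss = -(p_t * log_p_s).sum(dim=-1).mean()
        # Update the center as an EMA of the teacher's batch mean.
        with torch.no_grad():
            batch_mean = teacher_logits.mean(dim=0, keepdim=True)
            self.center.mul_(self.m).add_(batch_mean, alpha=1 - self.m)
        return loss
```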
DINO's emergent properties: k-NN on raw frozen features reaches 78.3% top-1 on ImageNet with a small ViT (no linear probe needed). [CLS] attention maps of the trained ViT yield clean, object-centric segmentation masks WITHOUT any segmentation labels, which is surprising and powerful. Output dim is 65,536 (large, to discourage collapse-to-one-dim).
DINOv2/v3 refinements: Sinkhorn-Knopp centering (replaces the simple running mean), KoLeo regularizer (uniformity in feature space), registers, and patch-level losses (iBOT). 'Registers' = extra learnable tokens prepended to the sequence with no positional meaning; the model uses them as a scratchpad for global information, freeing the real patch tokens from that role → cleaner attention maps.
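An illustrative sketch of register tokens (the wrapper and its interface are assumptions; real implementations add registers inside the ViT itself):

```python
import torch

class WithRegisters(torch.nn.Module):
    def __init__(self, backbone, dim, n_registers=4):
        super().__init__()
        self.backbone = backbone  # assumed to map (B, N, dim) -> (B, N, dim)
        # Learnable register tokens; note: no positional embedding.
        self.registers = torch.nn.Parameter(torch.zeros(1, n_registers, dim))

    def forward(self, patch_tokens):  # (B, N, dim)
        B = patch_tokens.shape[0]
        reg = self.registers.expand(B, -1, -1)
        x = torch.cat([reg, patch_tokens], dim=1)  # prepend registers
        x = self.backbone(x)
        return x[:, reg.shape[1]:]  # discard registers at the output
```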
MAE = Masked Autoencoder. Patchify image (16×16); randomly mask 75% of patches; encoder operates only on the visible 25% (huge speedup); insert learnable mask tokens at masked positions; a small decoder processes the full sequence and reconstructs the masked patches' pixel values; loss = MSE on reconstructed pixels (only at masked positions). 75% mask ratio (vs BERT's 15%) is essential: image patches are highly spatially redundant — at 15% the model can interpolate from neighbours; at 75% it must learn semantic representations.
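A sketch of MAE's per-sample random masking and masked-only MSE loss (shapes and names are illustrative):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (B, N, D) patch embeddings; keep a random 25% per sample.
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)        # lowest-noise patches are kept
    ids_restore = ids_shuffle.argsort(dim=1)  # to re-insert mask tokens later
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)           # 1 = masked, 0 = visible
    return visible, mask, ids_restore

def mae_loss(pred, target, mask):
    # MSE per patch, averaged over masked positions only.
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum()
```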
JEPA = Joint Embedding Predictive Architecture (Yann LeCun's program). 'Representations instead of pixels.' Context encoder + Target encoder (EMA, frozen) + Predictor. Given an image, encode the context regions; predict the target regions' REPRESENTATIONS in latent space; loss = L₂ between predicted and target features. Variants: I-JEPA (image), V-JEPA (video), VL-JEPA (vision-language). Rationale: pixel prediction wastes capacity on low-level texture; latent prediction focuses on semantically predictable structure.
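A minimal sketch of the JEPA objective under assumed module interfaces; real I-JEPA additionally conditions the predictor on the positions of the target blocks:

```python
import torch
import torch.nn.functional as F

def jepa_loss(context_encoder, target_encoder, predictor, ctx, tgt):
    z_ctx = context_encoder(ctx)       # encode visible context regions
    with torch.no_grad():
        z_tgt = target_encoder(tgt)    # EMA target encoder, no gradients
    z_pred = predictor(z_ctx)          # predict target REPRESENTATIONS
    return F.mse_loss(z_pred, z_tgt)   # L2 in latent space, not pixel space
```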
Definitions
- Self-distillation — Student trained to match teacher's output distribution; teacher and student share architecture; teacher updated via EMA of student.
- Centering — Subtract running mean of teacher logits before softmax; spreads target distribution; prevents collapse-to-one-dim.
- Sharpening — Divide teacher logits by very low temperature; produces peaky target; gives clear gradient to student.
- Multi-crop — DINO augmentation: 2 global (>50% area) + 6-10 local (<50%) views; teacher sees only global; student sees all.
- Mask ratio (MAE) — Fraction of patches masked; 75% in MAE vs 15% in BERT — images have more spatial redundancy.
- JEPA — Joint Embedding Predictive Architecture; predicts target features (EMA target encoder) rather than pixels.
Formulas
\theta_t \leftarrow \lambda \theta_t + (1-\lambda)\theta_s,\ \ \lambda: 0.996 \to 1\ \text{(cosine)}
p_t = \text{softmax}\!\left(\tfrac{g_t(x) - c}{\tau_t}\right),\ \ c \leftarrow m \cdot c + (1-m)\,\mathbb{E}[g_t(x)]
L_{\text{DINO}} = -\sum_x p_t(x) \log p_s(x)
L_{\text{MAE}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \|x_i - \hat x_i\|_2^2
L_{\text{JEPA}} = \|f_\phi(\text{ctx}) - \bar f(\text{target})\|_2^2\ \text{(latent space)}
Derivations
Why MAE masks 75% and BERT only 15%: language tokens carry distinct semantic content; masking 15% leaves enough redundancy for the task to be non-trivial but not impossible. Image patches are far more redundant spatially — neighbouring patches are nearly identical. At 15% mask the model can essentially copy nearby patches; at 75% interpolation fails and only semantic understanding can complete the reconstruction.
Examples
- DINO ViT-S/16: trains in ~3 days on two 8-GPU servers; the 78.3% k-NN top-1 on ImageNet (no labels) is reached by the smaller-patch ViT-S/8.
- MAE encoder cost: at 75% masking the encoder sees only N/4 tokens → roughly 4× fewer encoder tokens, with even larger savings in the quadratic attention term; this is where MAE's pre-training speed-up comes from.
- Registers in DINOv2: 4 extra learnable tokens with no positional embedding; they absorb the high-norm 'attention on blank sky' artefact tokens, cleaning up the attention maps.
Diagrams
- DINO architecture: student + EMA teacher with multi-crop (2 global + many local); centering + sharpening on teacher output; CE loss.
- MAE asymmetric encoder-decoder: encoder sees only 25%, decoder fills in mask tokens and reconstructs pixels.
- JEPA: context encoder + EMA target encoder + predictor; loss in latent space.
Edge cases
- Without centering OR without sharpening DINO collapses — both are essential.
- MAE's pixel loss can be dominated by low-level texture; some variants use a perceptual or feature-space loss instead.
- JEPA's latent collapse: the predictor can output a constant; combat with the EMA target encoder and asymmetric predictor design, plus uniformity regularisers such as KoLeo.
Common mistakes
- Stating DINO uses negatives — it does not (only self-distillation).
- Confusing DINO's EMA teacher with MoCo's momentum encoder: the EMA update is mathematically the same, but the roles differ (MoCo's momentum encoder produces keys for a contrastive loss with negatives; DINO's teacher supplies distillation targets with none).
- Saying MAE uses BERT-like 15% mask — it's 75%.
- Treating JEPA as 'MAE in latent space' — JEPA explicitly predicts EMA-target features, not reconstructed pixels.
Shortcuts
- DINO anti-collapse pair = centering + sharpening (with low τ_t ≈ 0.04).
- MAE encoder sees 25%, decoder sees 100% (mask tokens + visible features). Asymmetry is the point.
- Output dim DINO = 65,536 — anti-collapse via large softmax space.