Key idea

The model invents labels from the input itself. Hide half a sentence and predict the rest. Crop two pieces of one image and learn to recognise they're related. Mask 75% of an image's patches and rebuild them. The supervision is free because the structure is already in the data, and the resulting representations transfer well to downstream tasks.

Watch a contrastive loss pull positive pairs together and push negative pairs apart in the embedding space

In the demo, 8 "images" each get 2 augmented views (16 points total). Same-image views are positive pairs: the contrastive loss wants their embeddings close together. Different-image views are negative pairs, which it pushes apart. Over successive steps the 8 pairs cluster while the clusters spread away from each other. Dashed lines connect positives; same-coloured pairs are augmentations of the same source.

Masked prediction. Hide part of the input, predict it back. BERT masks ~15% of tokens. Masked autoencoders (MAE) mask ~75% of image patches. The model learns rich representations because it has to understand the input to fill in the gaps.
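
A minimal sketch of the masking step for tokens, under simplifying assumptions: every chosen position is replaced by a single mask id (BERT's actual recipe also swaps some chosen tokens for random ones or leaves them unchanged), and the names `mask_tokens`, `masked_lm_loss` and `mask_id` are illustrative rather than taken from any library.

```python
import torch
import torch.nn.functional as F

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Return (corrupted inputs, labels). Labels are -100 wherever nothing was masked."""
    chosen = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    inputs = token_ids.masked_fill(chosen, mask_id)   # hide the chosen tokens
    labels = token_ids.masked_fill(~chosen, -100)     # only hidden tokens are scored
    return inputs, labels

def masked_lm_loss(logits, labels):
    # logits: (B, T, V) from the model run on `inputs`; labels: (B, T)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,                            # skip unmasked positions
    )
```

The loss is computed only at the hidden positions, so the model gets no credit for copying tokens it can already see.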

Next-token prediction. Predict the next token given the previous ones. Trains GPT and friends. Simpler than masked prediction; scales spectacularly.
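
The whole objective fits in a few lines. The sketch below assumes a causal model has already produced `logits` of shape (B, T, V) for a batch of token ids; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    # logits: (B, T, V) from a causal model; token_ids: (B, T)
    pred = logits[:, :-1]        # predictions made at positions 0..T-2
    target = token_ids[:, 1:]    # the token that actually came next
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
```

The labels are just the input shifted by one position, which is why no annotation is needed.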

Contrastive learning. Take two augmented "views" of the same input. Pull their embeddings together; push them apart from views of different inputs. SimCLR (Chen et al. 2020) and MoCo (He et al. 2020) are landmark methods; DINO (Caron et al. 2021) builds on the same multi-view setup but drops the negatives.

Non-contrastive methods. BYOL (Grill et al. 2020) and SimSiam (Chen & He 2021) — just predict one view's embedding from the other, with a stop-gradient. Surprisingly, no negatives needed.

The product is the encoder. The pretext task is a means to an end. After pre-training, you keep the encoder and discard the head. The encoder is then either frozen (linear probe) or fine-tuned for downstream tasks.
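
As a rough illustration of the frozen (linear-probe) option, assuming the encoder maps inputs to flat feature vectors of size `feat_dim`; the helper names and the toy `nn.Linear` encoder are placeholders, not part of any library.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_linear_probe(encoder, feat_dim, num_classes):
    """Freeze the pre-trained encoder and return a fresh trainable linear head."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    encoder.eval()                      # fix dropout / batch-norm behaviour
    return nn.Linear(feat_dim, num_classes)

def probe_logits(encoder, head, x):
    with torch.no_grad():               # the encoder is only a feature extractor
        feats = encoder(x)
    return head(feats)

# Toy usage: a linear layer stands in for a real pre-trained encoder
encoder = nn.Linear(10, 4)
head = make_linear_probe(encoder, feat_dim=4, num_classes=3)
logits = probe_logits(encoder, head, torch.randn(5, 10))
loss = F.cross_entropy(logits, torch.randint(0, 3, (5,)))
loss.backward()                         # gradients reach the head only
```

For fine-tuning you would skip the freezing step and pass both the encoder's and the head's parameters to the optimiser.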

Reach for it when

  • You have huge unlabelled data and small labelled data
  • You want general-purpose representations (foundation models)
  • The supervised baseline is data-bottlenecked
  • You need to share an encoder across many tasks

Limits

  • Pre-training is computationally expensive — millions of GPU-hours at scale
  • Designing the pretext task is its own research problem
  • Linear-probe vs fine-tune trade-off is task-dependent
  • Some downstream tasks need very different features than the pretext task selected for
import torch
import torch.nn.functional as F

# SimCLR — info-NCE contrastive loss
def info_nce(z1, z2, temperature=0.07):
    # z1, z2: (B, d) — paired views of the same images
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)
    sim = z @ z.t() / temperature                    # (2B, 2B)
    sim.fill_diagonal_(-1e9)                         # mask self-similarity
    # Positives: index i and i+B (and vice versa)
    labels = torch.cat([torch.arange(B, 2 * B),
                         torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, labels)

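# Typical training step (x, augment, encoder and opt are assumed to be defined elsewhere)
view1, view2 = augment(x), augment(x)   # two independent random augmentations
z1, z2 = encoder(view1), encoder(view2)
loss = info_nce(z1, z2)
opt.zero_grad()                         # clear gradients from the previous step
loss.backward()
opt.step()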
InfoNCE loss
$$ \mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_i^+)/\tau)}{\sum_{j} \exp(\mathrm{sim}(z_i, z_j)/\tau)} $$
  • $z_i^+$: the positive for $z_i$ (a different view of the same image)
  • $\tau$: temperature; controls how peaked the similarity distribution is
  • Minimising it maximises a lower bound on the mutual information between views

Alignment and uniformity. Wang & Isola (2020) showed contrastive learning optimises two terms: alignment (positives close) and uniformity (representations spread out on the hypersphere). A well-trained encoder has both — collapse to a constant is the failure mode.
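
The two terms can be measured directly on a batch of embeddings. The sketch below follows the definitions in Wang & Isola (2020) and assumes the inputs are already L2-normalised; the default exponents `alpha=2` and `t=2` are common choices, not requirements.

```python
import torch
import torch.nn.functional as F

def align_loss(x, y, alpha=2):
    # x, y: (B, d) normalised embeddings of positive pairs; lower = better aligned
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # x: (B, d) normalised embeddings; lower = more evenly spread on the sphere
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```

A collapsed encoder scores perfectly on alignment but has the worst possible uniformity value of 0, which is why both numbers are worth tracking.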

The role of negatives. Negatives prevent collapse. The bigger the batch (more negatives), the harder the pretext task — SimCLR uses huge batch sizes. MoCo decouples the negative pool from the batch via a memory queue. Both work.
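
A simplified sketch of the queue idea, assuming keys come from a separate momentum encoder (not shown). The real MoCo implementation uses a fixed-size ring buffer with a write pointer; the concatenate-and-truncate `enqueue` here is a shortcut for readability, and both function names are illustrative.

```python
import torch
import torch.nn.functional as F

def moco_loss(q, k, queue, temperature=0.07):
    # q: (B, d) queries; k: (B, d) keys for the same images; queue: (K, d) past keys
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1).detach()            # no gradient through the keys
    l_pos = (q * k).sum(dim=-1, keepdim=True)      # (B, 1) positive logits
    l_neg = q @ queue.t()                          # (B, K) negatives from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive is column 0
    return F.cross_entropy(logits, labels)

def enqueue(queue, k, max_size):
    # Newest keys go first; the oldest fall off the end
    return torch.cat([F.normalize(k, dim=-1).detach(), queue], dim=0)[:max_size]
```

The number of negatives is now `K`, the queue length, regardless of batch size.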

Non-contrastive methods (BYOL, SimSiam). Surprisingly, you can avoid negatives. BYOL combines a momentum (EMA) target encoder, a stop-gradient and a predictor head; SimSiam drops the momentum encoder and shows that the stop-gradient and predictor are enough. The asymmetry breaks the collapse equilibrium. The community still doesn't fully agree on why this works.
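
A minimal sketch of a SimSiam-style symmetric loss, assuming the projector outputs `z1`, `z2` and the predictor outputs `p1`, `p2` for the two views have already been computed; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    # p*: predictor outputs, z*: projector outputs, all of shape (B, d)
    def neg_cosine(p, z):
        # z.detach() is the stop-gradient: the target branch receives no gradient
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    # Each view predicts the other, and the two directions are averaged
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

There is no term that pushes anything apart; the loss only ever pulls predictions towards detached targets.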

Augmentation matters. Strong augmentations (random crops, colour jitter, blur, cutout) are essential. The encoder learns to be invariant to whatever you augment over — choose augmentations that preserve what you care about and destroy what you don't.
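
To make "two views" concrete, here is a deliberately small augmentation written in plain PyTorch: random crop, horizontal flip and brightness jitter. It assumes images are float tensors in [0, 1]; the function name, crop size and jitter range are hypothetical choices, and real pipelines use stronger, per-image augmentations.

```python
import torch

def random_view(x, crop=24):
    # x: (B, C, H, W) images with values in [0, 1]
    B, C, H, W = x.shape
    top = torch.randint(0, H - crop + 1, (1,)).item()
    left = torch.randint(0, W - crop + 1, (1,)).item()
    v = x[:, :, top:top + crop, left:left + crop]        # one crop window shared by the batch
    flip = torch.rand(B, device=x.device) < 0.5          # flip each image with probability 0.5
    v = torch.where(flip[:, None, None, None], v.flip(-1), v)
    scale = 0.6 + 0.8 * torch.rand(B, 1, 1, 1, device=x.device)  # per-image brightness in [0.6, 1.4)
    return (v * scale).clamp(0, 1)

# Two independent calls give the two views of each image
x = torch.rand(4, 3, 32, 32)
view1, view2 = random_view(x), random_view(x)
```

An encoder trained on these views is pushed to ignore position, mirror orientation and brightness, so none of those should matter for the downstream task.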

Masked image modelling. MAE (He et al. 2022): mask 75% of an image's patches and reconstruct them from the remaining 25%. The asymmetric design, where the encoder sees only the visible patches and a lightweight decoder rebuilds the rest, is key for compute efficiency. A strong approach for vision representation learning at scale.

Multi-modal SSL. CLIP (Radford et al. 2021) pairs an image and its caption — text and image encoders trained jointly with contrastive loss. The resulting embeddings align modalities, enabling zero-shot classification by similarity to text prompts.
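
A sketch of the symmetric image-text loss and the zero-shot step, assuming both encoders have already produced embeddings of the same dimension. CLIP learns its temperature during training; the fixed `temperature=0.07` and the function names here are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (B, d); row i of each comes from the same image-caption pair
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (B, B); the diagonal holds true pairs
    targets = torch.arange(img.size(0), device=img.device)
    # Average the image-to-text and text-to-image directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def zero_shot_classify(image_emb, prompt_emb):
    # prompt_emb: (num_classes, d), one embedded text prompt per class
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(prompt_emb, dim=-1)
    return (img @ txt.t()).argmax(dim=-1)                # index of the most similar prompt
```

Zero-shot classification needs no extra training: adding a class only means embedding one more prompt.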

import torch
import torch.nn as nn
import torch.nn.functional as F

# Masked Autoencoder (MAE) — encoder sees only unmasked patches
class MAE(nn.Module):
    def __init__(self, encoder, decoder, mask_ratio=0.75):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.mask_ratio = mask_ratio

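    def forward(self, x_patches):
        # x_patches: (B, N, patch_dim)
        N = x_patches.size(1)
        keep = int(N * (1 - self.mask_ratio))
        # One random mask shared by the whole batch, for brevity;
        # the MAE paper samples a separate mask per image
        perm = torch.randperm(N, device=x_patches.device)
        ids_keep, ids_mask = perm[:keep], perm[keep:]

        z = self.encoder(x_patches[:, ids_keep])     # only visible patches
        # The decoder is assumed to place mask tokens at ids_mask and return (B, N, patch_dim)
        x_hat = self.decoder(z, ids_keep, ids_mask)
        return F.mse_loss(x_hat[:, ids_mask], x_patches[:, ids_mask])  # loss on masked patches only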
Joint-embedding predictive architecture (JEPA)
$$ \mathcal{L} = \big\lVert \mathrm{pred}\big(s_x(\text{context})\big) - s_y(\text{target}) \big\rVert^2 $$
  • Predict the target's embedding, not the target itself
  • No pixel-level reconstruction → no wasted capacity on irrelevant detail
  • LeCun's preferred frame for self-supervised learning at scale

Distillation-based SSL (DINO). In DINO (Caron et al. 2021), teacher and student encoders see different crops; the student matches the teacher's output distribution, and the teacher is an EMA of the student. No negatives, no labels, surprising stability. The features it learns are remarkably semantic: the self-attention maps of a DINO-trained ViT segment objects without supervision.

JEPA family. Predict embeddings, not raw inputs. I-JEPA (Assran et al. 2023) extends MAE-style masking to embedding space. Avoids the "pixel perfection" trap where the model wastes capacity on details that don't matter.
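
In code, the embedding-space objective looks roughly like this. It is a sketch rather than I-JEPA itself: the three modules are toy linear layers, and in I-JEPA the target encoder is an EMA copy of the context encoder, which is not shown here.

```python
import torch
import torch.nn as nn

def jepa_loss(context_enc, target_enc, predictor, context, target):
    with torch.no_grad():
        s_y = target_enc(target)              # target embedding; no gradient flows here
    s_x = context_enc(context)                # context embedding
    # Squared L2 distance in embedding space, averaged over the batch
    return (predictor(s_x) - s_y).pow(2).sum(dim=-1).mean()

# Toy usage with linear layers standing in for real encoders
context_enc, target_enc, predictor = nn.Linear(10, 6), nn.Linear(10, 6), nn.Linear(6, 6)
loss = jepa_loss(context_enc, target_enc, predictor, torch.randn(4, 10), torch.randn(4, 10))
loss.backward()
```

Nothing in the loss refers to pixels, so the model is never penalised for ignoring detail that the target encoder has already discarded.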

Scaling laws. SSL pretraining benefits enormously from scale. Empirical evidence (Goyal et al. 2022, OpenCLIP 2022): doubling data ⇒ predictable improvement on downstream linear probe. Foundation models exploit this directly.

The semantic richness of CLIP. Contrastive image-text pre-training produces embeddings where similarity ~ semantic alignment. The zero-shot recipe ("classify by similarity to text prompts") works surprisingly well even without any fine-tuning.

SSL evaluation. Linear probe (freeze the encoder, train a logistic-regression head) is the standard. A k-NN probe is faster and often correlates with linear-probe accuracy. Fine-tuning is slower but the more practically relevant test, since it reflects the operating mode you'd deploy in.
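
A k-NN probe needs no training at all. The sketch below assumes features have already been extracted with the frozen encoder and uses an unweighted majority vote over cosine-similarity neighbours; published evaluations often weight the votes by similarity, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def knn_predict(train_feats, train_labels, test_feats, k=5):
    train = F.normalize(train_feats, dim=-1)
    test = F.normalize(test_feats, dim=-1)
    sim = test @ train.t()                         # cosine similarity, (N_test, N_train)
    idx = sim.topk(k, dim=-1).indices              # k nearest training points per test point
    neighbour_labels = train_labels[idx]           # (N_test, k)
    return neighbour_labels.mode(dim=-1).values    # majority vote

# Toy usage: two well-separated clusters
train_feats = torch.tensor([[1.0, 0.0], [0.9, 0.1], [1.0, 0.2],
                            [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]])
train_labels = torch.tensor([0, 0, 0, 1, 1, 1])
test_feats = torch.tensor([[1.0, 0.1], [0.1, 1.0]])
preds = knn_predict(train_feats, train_labels, test_feats, k=3)
```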

Why does SSL work so well? The bet is that pretext tasks like "fill in the blank" force the model to learn the actual structure of the data — what's likely and unlikely. With enough data, this structure is exactly what downstream tasks need. It's not magic; it's information-rich supervision.

Limits. SSL features encode what the pretext task incentivises — fine-grained semantics, but rarely numeric reasoning, planning, or counterfactual abilities. The choice of pretext task is itself an inductive bias.

import torch
import torch.nn.functional as F

# DINO loss: the student matches the teacher's distribution (no negatives)
def dino_loss(student_logits, teacher_logits, temp_s=0.1, temp_t=0.04, center=0.0):
    # The teacher output is centred, sharpened and detached (it gets no gradient).
    # In DINO `center` is a running mean of teacher logits; 0.0 disables centring.
    teacher_probs = F.softmax((teacher_logits - center) / temp_t, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / temp_s, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Teacher EMA update: teacher = m * teacher + (1 - m) * student
@torch.no_grad()
def update_teacher(student, teacher, m=0.996):
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1 - m)