Key idea

The unlabelled data tells you about the input distribution. Even without labels, you can see that the data has clusters, a manifold, low-density boundaries. Those structural signals tell a model "the decision boundary probably runs here, not there." Pair that with a small labelled set and you can do real work.

Same dataset, four strategies — only a few points are labelled; watch how each method uses the rest

8 labelled points (filled, indigo or orange), many unlabelled (grey). Supervised-only draws a boundary from the labelled points alone, which is often crooked. Pseudo-label classifies the unlabelled points with the supervised model, then retrains on everything. Graph propagation spreads labels along k-NN edges weighted by similarity. Self-training iterates: confidence-thresholded pseudo-labels are added each round.

Self-training / pseudo-labels. Train on the labelled; predict on unlabelled; add the confident predictions to the training set; repeat. Simple, works well when the supervised model is already decent.

Consistency regularization. Augment an unlabelled example two ways; punish the model for predicting different labels. Forces invariances; underpins FixMatch (Sohn et al. 2020) and friends.

Graph-based methods. Build a k-NN graph; let labels propagate along high-weight edges until they converge to a smooth function on the graph. Classical (Zhu et al. 2003); still useful on structured data.
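The propagation step can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions of our own (a binary, symmetrised k-NN graph, a fixed number of iterations, hard clamping of the known labels); the function name `propagate_labels` and the toy data are hypothetical, not taken from any library.

```python
import numpy as np

def propagate_labels(X, y, k=3, n_iter=50):
    """Iterative label propagation on a symmetric k-NN graph; y uses -1 for unlabelled."""
    n = len(X)
    classes = np.unique(y[y >= 0])
    # Pairwise squared distances; a point is never its own neighbour.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    # Binary adjacency matrix, symmetrised so edges are undirected.
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)
    P = W / W.sum(axis=1, keepdims=True)      # row-normalised transition matrix
    # One-hot scores for labelled rows; unlabelled rows start at zero.
    labelled = y >= 0
    F0 = np.zeros((n, len(classes)))
    F0[labelled] = (y[labelled, None] == classes[None, :]).astype(float)
    F = F0.copy()
    for _ in range(n_iter):
        F = P @ F                             # each node averages its neighbours' scores
        F[labelled] = F0[labelled]            # clamp the known labels
    return classes[F.argmax(axis=1)]

# Toy data: two tight clusters, one labelled point in each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
              [5.0, 5.0], [5.1, 5.0], [5.2, 5.1], [5.1, 5.2]])
y = np.array([0, -1, -1, -1, 1, -1, -1, -1])
pred = propagate_labels(X, y, k=2)
```

Because the two clusters are disconnected in the graph, each unlabelled point can only inherit the label of the single labelled point in its own cluster.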

Pre-training + fine-tuning. The dominant modern recipe — see Self-Supervised Learning. Pre-train on all the unlabelled data; fine-tune on the labelled. Conceptually a different family but functionally the same goal: do more with little label data.

Reach for it when

  • Labelled data is expensive but unlabelled is plentiful
  • The data has cluster / manifold structure you can exploit
  • Active learning has been tried and you still need more
  • You can iterate cheaply on a pseudo-labelling pipeline

Watch out

  • Pseudo-label errors compound — start with confident predictions only
  • The cluster assumption can be false — clusters may not align with classes
  • Consistency regularization needs careful augmentation choices
  • Graph methods scale badly to huge data
from sklearn.semi_supervised import (
    SelfTrainingClassifier, LabelPropagation, LabelSpreading,
)
from sklearn.linear_model import LogisticRegression

# Self-training: y uses -1 for "no label". The wrapper repeatedly predicts on
# the unlabelled rows and adds predictions whose probability exceeds threshold.
# X_mixed / y_mixed hold the labelled and unlabelled rows together.
clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.9, max_iter=10)
clf.fit(X_mixed, y_mixed)

# Label propagation: k-NN graph, labels spread along edges
lp = LabelPropagation(kernel="knn", n_neighbors=7)
lp.fit(X_mixed, y_mixed)
predicted = lp.transduction_          # one label per row of X_mixed, labelled rows included
Want FixMatch, MixMatch, & the cluster assumption?
Consistency loss $$ \mathcal{L}_{\text{cons}} = \mathbb{E}_{x_u}\,\big\lVert f_\theta(\alpha_1(x_u)) - f_\theta(\alpha_2(x_u)) \big\rVert^2 $$
  • α1, α2: two random augmentations of an unlabelled example x_u
  • Encourage the same prediction under augmentation — the cluster assumption operationalised
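As a rough numerical illustration of the loss above, the sketch below evaluates it for a toy softmax model. Everything here is an assumption made for the example: the linear model `W`, the Gaussian `jitter` standing in for a real augmentation, and the batch of random inputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(predict, x_u, aug1, aug2):
    """Mean squared distance between predictions on two augmented views."""
    p1 = predict(aug1(x_u))
    p2 = predict(aug2(x_u))
    return np.mean(np.sum((p1 - p2) ** 2, axis=-1))

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))                   # hypothetical model: 2 features, 3 classes
predict = lambda x: softmax(x @ W)
jitter = lambda x: x + rng.normal(scale=0.1, size=x.shape)   # stand-in augmentation
identity = lambda x: x

x_u = rng.normal(size=(16, 2))                # unlabelled batch
loss_same = consistency_loss(predict, x_u, identity, identity)  # identical views
loss_aug = consistency_loss(predict, x_u, jitter, jitter)       # two random views
```

Identical views give a loss of exactly zero; two independently jittered views give a positive loss that training would push down, which is what makes the model invariant to the augmentation.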

The cluster assumption. Examples in the same cluster have the same label. Pseudo-labelling, graph propagation, and consistency methods all bet on this. When it fails — clusters that span multiple classes — semi-supervised methods can hurt rather than help.

The smoothness assumption. The decision function should be locally constant in regions of high data density. Manifold / graph-based methods optimise exactly this.

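FixMatch. Sohn et al. (2020). Apply a weak augmentation and take the model's prediction as a pseudo-label; apply a strong augmentation to the same example and train the model to predict that pseudo-label as if it were ground truth. Keep only pseudo-labels whose confidence exceeds a threshold. Sets the modern standard for vision semi-supervised learning.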

MixMatch / ReMixMatch / FlexMatch. Relatives of FixMatch that smooth pseudo-labels by averaging across augmentations and mix up labelled and unlabelled examples (MixMatch, ReMixMatch), or dynamically adjust the confidence threshold per class (FlexMatch). MixMatch and ReMixMatch predate FixMatch; FlexMatch builds on it and reports modest gains on standard benchmarks.

Mean teacher. Maintain a moving-average copy of the model (the "teacher") and use it to generate pseudo-labels. The student matches the teacher; the teacher is an EMA of the student. Less noisy than vanilla pseudo-labelling.
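To get a feel for how slowly the teacher moves, the EMA update can be worked through on a single scalar. This is only a sketch: the helper names are hypothetical, and the momentum 0.999 is used as an example value.

```python
import math

def ema_update(teacher, student, m):
    """One EMA step on plain floats: teacher <- m * teacher + (1 - m) * student."""
    return m * teacher + (1.0 - m) * student

def ema_half_life(m):
    """Number of updates after which an old teacher value has lost half its weight."""
    return math.log(0.5) / math.log(m)

half_life = ema_half_life(0.999)

# The student weight jumps from 0.0 to 1.0; the teacher follows gradually.
teacher = 0.0
for _ in range(693):
    teacher = ema_update(teacher, 1.0, m=0.999)
```

With a momentum of 0.999 the teacher needs roughly 693 updates to cover half the distance to the student, which is why its targets change more smoothly than the student's own predictions.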

Connection to self-supervised. Pretrain self-supervised, fine-tune on the labels. With foundation models, this is often the best semi-supervised method by a wide margin — the encoder already encodes the data manifold, the labels just specialise.

import copy

import torch
import torch.nn.functional as F

# FixMatch: weak and strong augmentation
def fixmatch_step(model, x_l, y_l, x_u, weak, strong, tau=0.95, lam=1.0):
    # Supervised loss on the labelled set
    l_sup = F.cross_entropy(model(x_l), y_l)
    # Confident pseudo-labels from the weakly-augmented version (no gradient)
    with torch.no_grad():
        probs = F.softmax(model(weak(x_u)), dim=-1)
        max_p, pl = probs.max(dim=-1)
        mask = (max_p >= tau).float()
    # Consistency loss: prediction on strong-augmented matches pseudo-label;
    # low-confidence examples are masked out but still count in the mean
    l_cons = (F.cross_entropy(model(strong(x_u)), pl, reduction="none") * mask).mean()
    return l_sup + lam * l_cons

# Mean Teacher
class MeanTeacher:
    def __init__(self, model, m=0.999):
        self.student = model
        # load_state_dict() does not return the module, so copy the student instead
        self.teacher = copy.deepcopy(model)
        for p in self.teacher.parameters():
            p.requires_grad_(False)       # the teacher is updated by EMA only
        self.m = m
    @torch.no_grad()
    def update(self):
        # teacher <- m * teacher + (1 - m) * student
        for p_t, p_s in zip(self.teacher.parameters(), self.student.parameters()):
            p_t.mul_(self.m).add_(p_s, alpha=1 - self.m)
Want noisy student, MPL, and theoretical bounds?
Risk bound with unlabelled data $$ R(\hat f) \leq \hat R_l(\hat f) + \Omega(\hat f; X_u) + O\!\left(\sqrt{\tfrac{\mathrm{VC}(\mathcal{H})}{n_l}}\right) $$
  • R̂_l(f̂): labelled empirical risk
  • Ω(f̂; X_u): complexity penalty informed by unlabelled data (cluster / smoothness)
  • Unlabelled data effectively shrinks the hypothesis class

Noisy Student (Xie et al. 2020). Train a teacher on the labelled data; pseudo-label the unlabelled data; train a larger student with noise injection (dropout, augmentation, stochastic depth) on both. Iterate, with the student becoming the next teacher. State of the art on ImageNet when published; the recipe extends to many vision tasks.

Meta Pseudo Labels (Pham et al. 2021). The teacher is also learned: its pseudo-labelling is adjusted according to how the student, after training on those pseudo-labels, performs on labelled data. Closes a feedback loop that vanilla self-training lacks.

Co-training. Two models trained on different feature views of the data; each labels examples for the other. Works when the views are conditionally independent given the label.
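A minimal co-training loop might look like the sketch below. It rests on several assumptions of our own: the two "views" are simply two halves of a synthetic feature vector (which does not guarantee conditional independence), both models share one growing label pool, and `co_train`, `n_rounds` and `per_round` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def co_train(X1, X2, y, n_rounds=5, per_round=10):
    """Co-training sketch: X1, X2 are the two views; y uses -1 for unlabelled."""
    y = y.copy()
    for _ in range(n_rounds):
        lab = y != -1
        if lab.all():
            break
        m1 = LogisticRegression(max_iter=1000).fit(X1[lab], y[lab])
        m2 = LogisticRegression(max_iter=1000).fit(X2[lab], y[lab])
        # Each model labels its most confident unlabelled points; the shared
        # pool means those labels train the other model in the next round.
        for model, X_view in ((m1, X1), (m2, X2)):
            unl = np.flatnonzero(y == -1)
            if unl.size == 0:
                break
            probs = model.predict_proba(X_view[unl])
            top = np.argsort(-probs.max(axis=1))[:per_round]
            y[unl[top]] = model.classes_[probs[top].argmax(axis=1)]
    lab = y != -1
    m1 = LogisticRegression(max_iter=1000).fit(X1[lab], y[lab])
    m2 = LogisticRegression(max_iter=1000).fit(X2[lab], y[lab])
    return m1, m2, y

X, y_true = make_classification(n_samples=200, n_features=6, n_informative=6,
                                n_redundant=0, random_state=0)
y = np.full(200, -1)
for c in (0, 1):
    y[np.flatnonzero(y_true == c)[:10]] = c       # 10 labelled points per class

m1, m2, y_out = co_train(X[:, :3], X[:, 3:], y)   # views: first and last 3 features
```

Taking a fixed number of most-confident points per round, rather than thresholding, keeps the pool growing at a predictable rate; a confidence threshold is an equally reasonable choice.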

Transductive vs inductive SSL. Transductive: predict labels for the specific unlabelled set you have (label propagation does this). Inductive: produce a function that generalises to new examples (most modern methods).
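The distinction shows up directly in how a fitted model is used. The sketch below contrasts the two with scikit-learn on synthetic blobs; the data, the three labels per class, and the two new query points are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier

X, y_true = make_blobs(n_samples=120, centers=[[-3, 0], [3, 0]],
                       cluster_std=0.5, random_state=0)
y = np.full(len(X), -1)
for c in (0, 1):
    y[np.flatnonzero(y_true == c)[:3]] = c       # 3 labelled points per class

# Transductive use: read off labels for exactly the points passed to fit().
ls = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
transduced = ls.transduction_                    # one label per row of X

# Inductive use: a fitted function applied to points never seen in training.
st = SelfTrainingClassifier(LogisticRegression(), threshold=0.9).fit(X, y)
X_new = np.array([[-3.0, 0.2], [3.0, -0.2]])
new_pred = st.predict(X_new)
```

scikit-learn's graph estimators also expose `predict`, so in practice the split is about how the result is consumed rather than a hard property of the class.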

Theoretical guarantees. Several frameworks (Niyogi 2008, Ben-David et al. 2008) give conditions under which unlabelled data provably helps. The assumptions are strong; in practice the empirical record is more important than the bounds.

When SSL hurts. If the cluster assumption fails (e.g. classes overlap, or clusters in input space don't align with label boundaries), self-training amplifies the supervised model's errors. Always keep a labelled-only baseline.
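One way to keep that baseline honest is to score both models on the same held-out labelled split. The following is a sketch under assumed settings (two-moons data, 10 labels per class, a 0.9 threshold); which model scores higher depends on the data, so no outcome is claimed here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_moons(n_samples=600, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Hide all but 10 labels per class; -1 marks unlabelled rows.
y_semi = np.full(len(y_train), -1)
for c in (0, 1):
    y_semi[np.flatnonzero(y_train == c)[:10]] = c
lab = y_semi != -1

# Same estimator, with and without the unlabelled rows.
baseline = LogisticRegression().fit(X_train[lab], y_train[lab])
semi = SelfTrainingClassifier(LogisticRegression(), threshold=0.9).fit(X_train, y_semi)

acc_baseline = baseline.score(X_test, y_test)
acc_semi = semi.score(X_test, y_test)
print(f"labelled-only: {acc_baseline:.3f}   self-training: {acc_semi:.3f}")
```

If the semi-supervised score is not clearly above the baseline across a few random seeds, the unlabelled data is not helping and the simpler model is the safer choice.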

from sklearn.linear_model import LogisticRegression
import numpy as np

# Noisy Student-style iterative self-training (simplified: no noise injection,
# and the student has the same capacity as the teacher)
def noisy_student(X_l, y_l, X_u, n_iter=5, threshold=0.9):
    teacher = LogisticRegression().fit(X_l, y_l)
    for _ in range(n_iter):
        probs = teacher.predict_proba(X_u)
        conf  = probs.max(axis=1)
        pl    = teacher.classes_[probs.argmax(axis=1)]   # column index -> class label
        mask  = conf > threshold
        # Train the student on labelled + confidently pseudo-labelled data
        X_all = np.concatenate([X_l, X_u[mask]])
        y_all = np.concatenate([y_l, pl[mask]])
        student = LogisticRegression().fit(X_all, y_all)
        teacher = student                # the student becomes the next teacher
    return teacher