Key idea

The paradigm is defined by what the model gets to see. Supervised has (x, y) pairs. Unsupervised has just x. Self-supervised invents y from x. Reinforcement learning only sees a reward signal. Each tells a model "here's the kind of feedback you'll get" — and that shapes everything else.

The same underlying dataset is presented five ways. Notice how much is "given" to the learner under each paradigm — full labels, partial labels, structure that lets you make up labels (predict the right half from the left half), nothing but the points themselves, or only a numerical reward for an action.

Supervised learning. Each training example is a pair (x, y). The model learns a function that maps x to y. Classification (discrete y) and regression (continuous y) are the two main task types. Most "applied ML" problems start here because labels, though expensive, can usually be obtained.

Unsupervised learning. Only x — no labels. The model has to find structure on its own: clusters, lower-dimensional manifolds, densities, anomalies. Useful when labels are unavailable or when you want to understand the data before predicting.

Self-supervised learning. Labels are invented from the input itself: predict the next token from past tokens, predict the masked patch from the visible ones, contrast augmented views of the same image. Powers most modern foundation models: labels are essentially free, and the resulting representations transfer well to downstream tasks.
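To make "labels invented from the input" concrete, here is a minimal sketch in plain Python. The function names (next_token_pairs, masked_pairs) and the "[MASK]" placeholder are illustrative assumptions, not taken from any particular library.

```python
def next_token_pairs(tokens, context_size):
    """Turn one unlabelled sequence into (context, target) training pairs."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])  # the input x
        target = tokens[i]                           # the invented label y
        pairs.append((context, target))
    return pairs


def masked_pairs(tokens, mask_token="[MASK]"):
    """Hide each position in turn; the hidden token becomes the label."""
    pairs = []
    for i, tok in enumerate(tokens):
        masked = list(tokens)
        masked[i] = mask_token
        pairs.append((tuple(masked), tok))
    return pairs


tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(tokens, context_size=2)
masked = masked_pairs(tokens)
```

No human annotated anything here: every (x, y) pair comes from slicing or masking the raw sequence, which is why the supply of labels scales with the size of the corpus.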

Semi-supervised learning. Mostly unlabelled x, with a small labelled subset. Often the realistic setting in industry: labels are expensive, unlabelled data is everywhere. Pseudo-labelling, consistency training, and pre-training-then-fine-tuning are the dominant strategies.

Reinforcement learning. An agent acts in an environment and receives rewards. No labels — just feedback on whether its actions are working. Used for control (robotics), strategy (games), and increasingly for aligning language models to human preferences.

When each shines

  • Supervised: clean labels, the prediction is the product
  • Self-supervised: huge unlabelled corpora, foundation models
  • Semi-supervised: small labelled budget, large unlabelled pool
  • Unsupervised: exploration, segmentation, anomalies
  • RL: sequential decisions, evaluative feedback only

Where each struggles

  • Supervised — needs lots of clean labels
  • Self-supervised — pre-training is expensive, distillation is hard
  • Semi-supervised — pseudo-label errors can compound
  • Unsupervised — evaluation is awkward (what's "right"?)
  • RL — sample-inefficient, unstable training, reward hacking

# The five paradigms in one breath.
# Sketch only: model, base_classifier, env, policy, T and the data arrays are placeholders.
import torch.nn.functional as F

# Supervised: labelled (X, y)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Unsupervised: only X
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=3).fit_predict(X)

# Semi-supervised: mostly unlabelled, a few labels
from sklearn.semi_supervised import SelfTrainingClassifier
clf = SelfTrainingClassifier(base_classifier)
clf.fit(X_mixed, y_mixed)         # -1 marks unlabelled

# Self-supervised: labels from the input itself (next-token prediction)
logits = model(tokens[:, :-1])                    # (batch, seq-1, vocab)
loss = F.cross_entropy(logits.flatten(0, 1),      # (batch*(seq-1), vocab)
                       tokens[:, 1:].flatten())   # (batch*(seq-1),)

# Reinforcement learning: reward-driven (Gymnasium-style env)
state, info = env.reset()
for t in range(T):
    action = policy(state)
    next_state, reward, terminated, truncated, info = env.step(action)
    policy.update(state, action, reward)   # credit the state the action was taken in
    state = next_state
    if terminated or truncated:
        break

Formalisms and modern blends

Empirical risk minimisation

$$ \hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \ell\!\big(f_\theta(x_i),\, y_i\big) $$
  • The unifying formalism of supervised learning
  • y can be a true label (supervised), invented from x (self-supervised), or partially observed (semi-supervised)
  • RL has no per-example target y: it maximises expected return instead, and the Bellman equation characterises the value function
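
As a worked instance of empirical risk minimisation, the sketch below fits a one-parameter model f_θ(x) = θ·x under squared loss. The model family, the toy data, and the grid search standing in for the arg min are all assumptions chosen for readability; real training would use gradient descent or a closed form.

```python
def squared_loss(pred, y):
    return (pred - y) ** 2


def empirical_risk(theta, xs, ys, loss):
    """(1/n) * sum of loss(f_theta(x_i), y_i), with f_theta(x) = theta * x."""
    return sum(loss(theta * x, y) for x, y in zip(xs, ys)) / len(xs)


# Toy dataset generated by y = 2x (an assumption for illustration)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Crude arg min: evaluate the risk on a grid of candidate parameters
candidates = [i / 10 for i in range(0, 41)]  # 0.0, 0.1, ..., 4.0
theta_hat = min(candidates, key=lambda t: empirical_risk(t, xs, ys, squared_loss))
```

Swapping where ys comes from (human labels, targets carved out of x, or labels for only a subset) changes the paradigm without changing this objective.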

Beyond the textbook categories. The boundaries are fuzzy and mostly historical. A self-supervised model is "supervised" once you invent the labels; a clustering model is "supervised" if you have one labelled example per cluster; pre-training-then-fine-tuning blurs every category. The useful question is what feedback the model gets, and when.

Foundation models & pre-training. A modern recipe: pre-train a huge model on huge data with a self-supervised objective (next-token prediction, masked image modelling, contrastive learning), then fine-tune on a smaller labelled dataset for a specific task. The pre-training distils general structure; the fine-tuning specialises. Drove the GPT / BERT / DINO families.

Multi-task and meta-learning. Multi-task: one model, many tasks; share the early layers, specialise the heads. Meta-learning: "learn to learn" — the training data is a distribution over tasks, and the model learns to adapt quickly to a new one (MAML, ProtoNets). Useful when labels are scarce per task but many similar tasks exist.

Transfer learning & domain adaptation. Train on one distribution, deploy on another. Many successes in deep learning rely on this: ImageNet pre-training, BERT pre-training, etc. The risk is distribution shift: features useful in the source domain may not transfer.

Active learning. The model chooses what to label next — query the most informative example, get a human to label it, repeat. Useful when labels are expensive (medical imaging, expert annotation). Choosing the right "informativeness" criterion is the art.
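
One common informativeness criterion is uncertainty sampling: query the example the current model is least sure about. The sketch below assumes a binary classifier that outputs P(y = 1) for each unlabelled example; the function names and the probabilities are made up for illustration.

```python
def uncertainty(p):
    """Binary-classifier uncertainty: 1 at p = 0.5, 0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p - 1.0)


def select_query(pool_probs):
    """Index of the unlabelled example whose predicted P(y=1) is least certain."""
    return max(range(len(pool_probs)), key=lambda i: uncertainty(pool_probs[i]))


# Hypothetical model outputs on four unlabelled examples
pool_probs = [0.95, 0.10, 0.55, 0.80]
query_index = select_query(pool_probs)  # the example to send to the annotator
```

Other criteria (expected model change, disagreement within a committee of models) slot into the same loop by replacing the uncertainty function.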

Imitation learning. Learn a policy from expert demonstrations. Avoids RL's exploration problem but requires expert data. Behavior cloning is the simplest form; DAgger and inverse RL are more sophisticated, and offline RL is a close relative.
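
Behavior cloning is just supervised learning on (state, action) pairs. A minimal tabular sketch, with hypothetical traffic-light demonstrations standing in for expert data:

```python
from collections import Counter, defaultdict


def behavior_cloning(demos):
    """Fit a tabular policy: in each state, copy the expert's most frequent action."""
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


# Hypothetical expert demonstrations as (state, action) pairs
demos = [
    ("red", "stop"),
    ("green", "go"),
    ("red", "stop"),
    ("yellow", "slow"),
    ("green", "go"),
]
policy = behavior_cloning(demos)
unseen = policy.get("flashing")  # no demonstration covers this state
```

The lookup for an undemonstrated state returns nothing, which is the weakness in miniature: the cloned policy has no guidance once it drifts away from states the expert visited. DAgger addresses this by asking the expert to label the states the learner actually reaches.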

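import torch.nn as nn
import torch.nn.functional as F

# Pre-train, then fine-tune: the canonical modern recipe.
# Sketch only: backbone_layers, SSLHead, contrastive_loss, positives, the
# optimisers, d_model, n_classes, x and y are placeholders defined elsewhere.
encoder = nn.Sequential(*backbone_layers)
ssl_head = SSLHead()

# Phase 1: self-supervised pre-training (one step shown; loop over batches in practice)
ssl_opt.zero_grad()
ssl_loss = contrastive_loss(ssl_head(encoder(x)), positives)
ssl_loss.backward(); ssl_opt.step()

# Phase 2: supervised fine-tuning (ssl_head is dropped; the encoder stays trainable)
encoder.requires_grad_(True)
classifier = nn.Linear(d_model, n_classes)
# ft_opt is assumed to cover both encoder and classifier parameters
ft_opt.zero_grad()
loss = F.cross_entropy(classifier(encoder(x)), y)
loss.backward(); ft_opt.step()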

The MDP formalism, offline RL, and weak supervision

Bellman equation (RL)

$$ V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\,V^\pi(s') \right] $$
  • V^π: value of state s under policy π
  • γ: discount factor (the closer to 1, the more the agent cares about the far future)
  • The optimisation problem is over policies, not pointwise predictions

The MDP formalism. RL studies Markov Decision Processes: states S, actions A, transitions P(s' | s, a), rewards R(s, a), discount γ. The goal is a policy π(a | s) that maximises expected discounted reward. Value iteration / policy iteration are the textbook algorithms; modern deep RL approximates the value function or policy with a neural network.
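
Value iteration can be sketched in a few lines for a tabular MDP. It applies the optimality form of the Bellman backup (a max over actions rather than an expectation under a fixed policy). The two-state MDP below, with its "stay" and "go" actions, is a made-up example; P maps each state and action to a dictionary of next-state probabilities.

```python
def backup(s, a, P, R, gamma, V):
    """One-step lookahead: R(s, a) + gamma * E[V(s')]."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())


def value_iteration(states, actions, P, R, gamma, tol=1e-10):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(backup(s, a, P, R, gamma, V) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(states, actions, P, R, gamma, V):
    """Pick, in every state, the action with the highest one-step lookahead."""
    return {s: max(actions, key=lambda a: backup(s, a, P, R, gamma, V)) for s in states}


# Hypothetical MDP: staying in state 1 pays reward 1, everything else pays 0
states, actions = [0, 1], ["stay", "go"]
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 1.0}}}
R = {0: {"stay": 0.0, "go": 0.0},
     1: {"stay": 1.0, "go": 0.0}}

V = value_iteration(states, actions, P, R, gamma=0.5)
policy = greedy_policy(states, actions, P, R, gamma=0.5, V=V)
```

With γ = 0.5 the values can be checked by hand: staying in state 1 forever is worth 1 / (1 − 0.5) = 2, and state 0 is one discounted step away from that, so it is worth 1.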

On-policy vs off-policy. On-policy (PPO, A2C): learn from data collected by the current policy. Off-policy (Q-learning, SAC, DDPG): learn from any policy's data, typically stored in a replay buffer. Off-policy methods are usually more sample-efficient because they reuse old data; on-policy methods are often more stable.
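
To illustrate the off-policy idea, here is a minimal tabular Q-learning sketch that learns only from a fixed buffer of logged transitions, never from its own behaviour. The buffer contents, state and action indices, and hyperparameters are assumptions for illustration.

```python
import random


def q_learning_from_buffer(buffer, n_states, n_actions,
                           gamma=0.9, lr=0.5, steps=2000, seed=0):
    """Tabular Q-learning on transitions sampled uniformly from a replay buffer."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(steps):
        s, a, r, s_next, done = rng.choice(buffer)
        # The target uses the greedy action in s_next, whatever policy logged the data
        target = r if done else r + gamma * max(Q[s_next])
        Q[s][a] += lr * (target - Q[s][a])
    return Q


# Hypothetical transitions (state, action, reward, next_state, done)
buffer = [
    (0, 0, 0.0, 0, False),  # state 0, action 0: stay put, no reward
    (0, 1, 0.0, 1, False),  # state 0, action 1: move to state 1
    (1, 0, 1.0, 1, True),   # state 1, action 0: reward 1, episode ends
    (1, 1, 0.0, 0, False),  # state 1, action 1: move back to state 0
]
Q = q_learning_from_buffer(buffer, n_states=2, n_actions=2)
```

An on-policy method could not reuse this buffer once its policy changed, which is the root of the sample-efficiency gap.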

RLHF. Reinforcement Learning from Human Feedback. Train a reward model from human preferences, then optimise a language model against that reward with PPO. Helped make instruction-following LLMs practical. The classic three-phase recipe: pre-train → SFT → RLHF.
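
The reward model is commonly trained with a pairwise (Bradley-Terry style) loss: the response humans preferred should score higher than the one they rejected. A minimal sketch on scalar scores, assuming the reward model has already produced r_chosen and r_rejected for one comparison:

```python
import math


def preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), written to avoid overflow for large |m|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


loss_agree = preference_loss(3.0, 0.0)     # reward model agrees with the human
loss_tie = preference_loss(1.0, 1.0)       # reward model is indifferent
loss_disagree = preference_loss(0.0, 3.0)  # reward model prefers the rejected answer
```

The loss shrinks as the margin in favour of the chosen response grows, and equals log 2 when the two scores tie.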

Offline RL. Train an RL agent from a fixed dataset without environment access. Hard because the agent cannot try out-of-distribution actions to discover that their estimated values are too high, so those errors go uncorrected. Conservative Q-Learning (CQL), Implicit Q-Learning (IQL), and Behavior-Regularised Actor-Critic (BRAC) are strong approaches.

Weak supervision. Instead of clean labels, use multiple noisy, conflicting, or partial labelling functions. Snorkel-style systems combine them via a generative model. Useful when expert labels are unaffordable but heuristics exist.

Self-training & pseudo-labels. Train on labelled data, predict on unlabelled, add confident predictions to the training set, iterate. Works surprisingly well in practice — but errors compound, so confidence thresholds and consistency checks are essential.
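
The loop can be sketched with a deliberately simple base learner: a 1-D nearest-centroid classifier whose confidence is the gap between the distances to the two closest centroids. The classifier, the margin-based confidence, and the toy data are assumptions chosen to keep the example self-contained; it needs at least two classes.

```python
def centroids(labelled):
    """Mean of x for each class in a list of (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_margin(x, cents):
    """Nearest-centroid label plus a confidence margin."""
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    margin = abs(x - cents[ranked[1]]) - abs(x - cents[ranked[0]])
    return ranked[0], margin


def self_train(labelled, unlabelled, threshold, max_rounds=10):
    """Fit, pseudo-label the confident points, refit; stop when none qualify."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        cents = centroids(labelled)
        confident, rest = [], []
        for x in pool:
            y, margin = predict_with_margin(x, cents)
            (confident if margin >= threshold else rest).append((x, y))
        if not confident:
            break
        labelled += confident          # accept confident pseudo-labels
        pool = [x for x, _ in rest]    # keep the ambiguous points unlabelled
    return centroids(labelled), pool


labelled = [(0.0, "a"), (10.0, "b")]
unlabelled = [1.0, 2.0, 9.0, 8.0, 5.2]
final_centroids, still_unlabelled = self_train(labelled, unlabelled, threshold=2.0)
```

The point at 5.2 sits almost midway between the classes, so the threshold keeps it out of the training set. Lowering the threshold would admit it with a possibly wrong label, which is how errors start to compound.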

Curriculum learning. Order training examples from easy to hard. Sometimes accelerates convergence dramatically; sometimes does nothing. The "easy first" intuition is solid; the specific curriculum is often domain-specific art.

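import torch
import torch.nn.functional as F

# Conservative Q-Learning for offline RL: penalise OOD actions.
# Sketch for discrete actions: q_net(s) and target_net(s) return Q-values of
# shape (batch, n_actions); a holds integer action indices, done is a 0/1 float tensor.
def cql_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                   # (batch, n_actions)
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for dataset actions
    with torch.no_grad():
        q_target = r + (1 - done) * gamma * target_net(s_next).max(-1).values
    td_loss = F.mse_loss(q_sa, q_target)

    # CQL term: pull down Q across all actions (logsumexp), push up for dataset actions
    cql_term = (q_all.logsumexp(-1) - q_sa).mean()

    return td_loss + alpha * cql_term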

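# Snorkel-style weak supervision: multiple labelling functions vote.
# Simplest baseline only; Snorkel itself combines the votes with a generative label model.
def majority_vote(lf_outputs):
    # lf_outputs: (N, K) NumPy array, K = number of labelling functions,
    # each entry 0 or 1, or -1 to abstain
    valid = lf_outputs != -1                          # -1 means abstain
    votes = (lf_outputs * valid).sum(axis=1)          # number of 1-votes per row
    # Fraction of non-abstaining functions voting 1 (a soft label);
    # threshold at 0.5 for a hard majority vote. All-abstain rows give 0.
    return votes / valid.sum(axis=1).clip(min=1)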