Key idea

Generative modelling learns P(data) rather than P(label | data). Once you've learned the data distribution, you can sample new data, score the likelihood of new data, infill missing parts, or condition on side-information for controlled generation. Different families make different trade-offs between sample quality, likelihood, speed, and ease of training.

Compare the five major families on a simple 2D distribution: same data, very different generative behaviours

A simple 2D target distribution (a ring) shown five ways. Each family is good at different things. VAE: fast, blurry. GAN: sharp samples, mode collapse risk, no likelihood. Flow: exact likelihood, restricted architecture. Diffusion: high quality, slow sampling. Autoregressive: exact likelihood, sequential sampling.

VAE (Variational Autoencoder). Encode to a Gaussian latent; decode back. Train with reconstruction + KL-to-prior. Fast sampling (one forward pass); usually blurry; no exact likelihood, only a lower bound on it (the ELBO). See Autoencoders & VAEs.

GAN (Generative Adversarial Network). Generator and discriminator in a minimax game. Sharp samples; no likelihood; mode collapse common; finicky training. See GANs.
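
The two sides of the game can be sketched as a pair of losses. This is a minimal sketch, not a full training loop: the helper name `gan_losses` is hypothetical, `D` is assumed to return raw logits, and the generator uses the common non-saturating loss rather than the literal minimax objective.

```python
import torch
import torch.nn.functional as F

def gan_losses(D, real, fake):
    """Return (discriminator loss, generator loss) for one batch.

    Assumes D maps a batch of samples to raw logits (no sigmoid applied).
    """
    # Discriminator: push real samples towards label 1 and fakes towards 0.
    # fake.detach() stops this loss from sending gradients into the generator.
    real_logits = D(real)
    fake_logits = D(fake.detach())
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )

    # Generator (non-saturating variant): make D label the fakes as real.
    gen_logits = D(fake)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```

In practice the two losses are stepped with separate optimisers, one for the discriminator and one for the generator.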

Normalising flows. Invertible transformations from a simple base distribution to the data distribution. Exact likelihood and exact sampling; architecture restricted to invertible maps. Useful when you need likelihoods (anomaly detection, density estimation).
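
To make "exact likelihood" concrete, here is a minimal sketch of the change-of-variables rule for the simplest possible flow: a 1D affine map x = scale * z + shift with z ~ N(0, 1). The function names are illustrative, and real flows stack many learned invertible layers, but the bookkeeping is the same: invert the map, score under the base distribution, add the log-determinant.

```python
import math
import random

def standard_normal_logpdf(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(x, scale, shift):
    """Exact log p(x) for x = scale * z + shift with z ~ N(0, 1)."""
    z = (x - shift) / scale                    # inverse map: data -> base
    log_det_inverse = -math.log(abs(scale))    # log |dz/dx|
    return standard_normal_logpdf(z) + log_det_inverse

def affine_flow_sample(scale, shift, rng):
    """Exact sampling: draw from the base, push through the forward map."""
    return scale * rng.gauss(0.0, 1.0) + shift
```

For this toy flow the result coincides with the log-density of a Gaussian with mean `shift` and standard deviation `|scale|`.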

Diffusion. Iteratively denoise from Gaussian noise. Slow sampling (10s–1000s of steps), highest sample quality, stable training. State of the art for images. See Diffusion Models.
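
The "iteratively denoise" loop can be sketched as follows. This assumes a DDPM-style ancestral sampler with per-step variance equal to beta_t and a `model(x, t)` that predicts the added noise; the function name `ddpm_sample` and the way the schedule is passed in are illustrative choices, not a reference implementation.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Ancestral sampling: start from Gaussian noise and denoise step by step.

    model(x, t) is assumed to return the predicted noise, same shape as x.
    betas is a 1D tensor holding the noise schedule, one entry per timestep.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)
        # Posterior mean: remove the predicted noise, then rescale.
        mean = (x - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        # Add fresh noise at every step except the last one.
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x
```

Each loop iteration costs one network call, which is where the slow sampling comes from.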

Autoregressive. Model P(x) = ∏_i P(x_i | x_{<i}). Exact likelihood; sequential (slow) sampling; the factorisation behind modern LLMs and PixelRNN/PixelCNN.
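
A toy illustration of the chain-rule factorisation, assuming a hypothetical two-token model whose conditional depends only on the previous token (the tables `FIRST` and `NEXT` are made up for the example). The likelihood is an exact sum of conditional log-probabilities, while sampling has to proceed one position at a time.

```python
import math
import random

# Hypothetical toy model over tokens {0, 1}.
FIRST = [0.5, 0.5]                  # P(x_0)
NEXT = [[0.9, 0.1], [0.2, 0.8]]     # P(x_i | x_{i-1})

def conditional(prefix):
    """P(x_i | x_<i) as a list of probabilities over the next token."""
    return FIRST if not prefix else NEXT[prefix[-1]]

def log_likelihood(seq):
    """Exact log P(x): sum of the conditional log-probabilities."""
    return sum(math.log(conditional(seq[:i])[tok]) for i, tok in enumerate(seq))

def sample(length, rng):
    """Sequential sampling: each token needs all previous tokens first."""
    seq = []
    for _ in range(length):
        probs = conditional(seq)
        seq.append(rng.choices(range(len(probs)), weights=probs)[0])
    return seq
```

A real model replaces `conditional` with a neural network, but the two loops keep the same shape.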

Pick by what you need

  • Sample quality: diffusion > GAN ≈ AR > flow > VAE
  • Sample speed: GAN ≈ VAE ≈ flow > AR > diffusion (on parallel hardware AR falls behind diffusion, since each diffusion step updates every position at once)
  • Exact likelihood: flow = AR >> VAE (lower bound only) >> GAN (none)
  • Mode coverage: AR ≈ diffusion > flow > VAE > GAN
  • Training stability: AR > diffusion > flow ≈ VAE > GAN

None is a free lunch

  • Diffusion: slow at inference unless distilled
  • GAN: unstable, mode collapse, no likelihood
  • Flow: architecture constraints hurt quality
  • VAE: blurry; high-quality VAE needs a lot of capacity
  • AR: sequential, slow for long sequences
# Each family in one breath
import torch, torch.nn as nn, torch.nn.functional as F

# VAE — reconstruct + KL
mu, logvar = encoder(x)
z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
rec = decoder(z)
# Sum both terms so the reconstruction and KL parts share the same scale
loss = F.mse_loss(rec, x, reduction="sum") - 0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum()

# GAN — discriminator tries to tell real from fake
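fake = generator(torch.randn(B, z_dim))
real_logits, fake_logits = D(real), D(fake.detach())   # detach: no generator gradients here
d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
       + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))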

# Diffusion — predict noise added at random timestep
t = torch.randint(0, T, (B,))
noise = torch.randn_like(x)
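shape = (-1,) + (1,) * (x.dim() - 1)   # broadcast per-example coefficients over data dims
xt = sqrt_alpha_bar[t].view(shape) * x + sqrt_one_minus[t].view(shape) * noise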
loss = F.mse_loss(model(xt, t), noise)

# Autoregressive — next-token cross-entropy
logits = model(x[:, :-1])
loss   = F.cross_entropy(logits.flatten(end_dim=-2), x[:, 1:].flatten())

EBMs, score-based models, and conditional generation

Score matching $$ \mathcal{L} = \mathbb{E}_{x \sim p_{\text{data}}}\big\lVert \nabla_x \log p_\theta(x) - \nabla_x \log p_{\text{data}}(x) \big\rVert^2 $$
  • Match the gradient of log-density, not the density itself
  • Avoids the partition function
  • The foundation of score-based and diffusion models

Energy-based models. Define p(x) ∝ exp(-Eθ(x)). Maximum likelihood requires the partition function (intractable). Trained via contrastive divergence, score matching, or variational methods. Conceptually powerful; tricky in practice.

Score-based models. Learn ∇_x log p(x) (the "score"). Sample via Langevin dynamics or related SDEs. Closely related to diffusion: a denoising network is essentially a score estimator at multiple noise levels. Unified by Song et al. 2021's SDE formulation.
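
Sampling from a score can be sketched with unadjusted Langevin dynamics. In this minimal example the analytic score of a 1D Gaussian stands in for a learned score network; the function names, target parameters, and step size are all illustrative assumptions.

```python
import math
import random

def gaussian_score(x, mean=3.0, std=1.0):
    """d/dx log p(x) for a 1D Gaussian; stands in for a learned score network."""
    return -(x - mean) / (std ** 2)

def langevin_sample(score, x_init, step_size=0.1, num_steps=200, rng=None):
    """Unadjusted Langevin dynamics:
    x <- x + (step_size / 2) * score(x) + sqrt(step_size) * noise.
    """
    rng = rng or random.Random()
    x = x_init
    for _ in range(num_steps):
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return x
```

Only the score is ever evaluated, never the density itself, which is why the normalising constant does not matter. With a finite step size the chain samples the target only approximately.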

Conditional generation. p(x | y) instead of p(x). Class-conditional images, text-to-image, image-to-image. Classifier-free guidance (Ho & Salimans 2021) trains a single model on conditional and unconditional examples; trade off quality and diversity at sampling time by mixing the two.
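
The training side of classifier-free guidance amounts to randomly hiding the condition. A minimal sketch, assuming integer class labels and a reserved "null" label index that stands for "no condition" (both the helper name `drop_condition` and the null-label convention are assumptions for illustration):

```python
import torch

def drop_condition(y, null_label, p_uncond=0.1):
    """Replace each label with null_label with probability p_uncond.

    Training on the result teaches one network both the conditional
    and the unconditional prediction.
    """
    drop = torch.rand(y.shape[0], device=y.device) < p_uncond
    return torch.where(drop, torch.full_like(y, null_label), y)
```

The drop probability is a tunable hyperparameter; the default here is only a placeholder.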

Latent diffusion. Train a VAE to compress images to a latent space; train a diffusion model in that space. Stable Diffusion is exactly this. Faster sampling (smaller latents) without losing much quality.

Likelihood vs sample quality. Not the same! A model can have great likelihood and ugly samples (over-smooth average) or beautiful samples and terrible likelihood (mode-collapsed). Pick the metric that matches what you care about.

Evaluation. No single metric is right. FID (Fréchet Inception Distance) compares feature statistics; IS (Inception Score) measures diversity + classifiability; PR (Precision/Recall, computed on feature embeddings) decouples mode coverage from sample quality; CLIP score for text-image alignment. All have known failure modes.
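
The "compares feature statistics" part of FID is a Fréchet distance between two Gaussians fitted to feature vectors. The sketch below shows only that distance, on arbitrary (N, D) feature tensors; the real metric additionally requires extracting the features with a specific Inception network, which is omitted here, and the function name is illustrative.

```python
import torch

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    feats_a, feats_b = feats_a.double(), feats_b.double()
    mu_a, mu_b = feats_a.mean(dim=0), feats_b.mean(dim=0)
    cov_a, cov_b = torch.cov(feats_a.T), torch.cov(feats_b.T)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative up to
    # numerical error (hence the clamp).
    eigvals = torch.linalg.eigvals(cov_a @ cov_b).real.clamp(min=0.0)
    trace_sqrt = eigvals.sqrt().sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return mean_term + torch.trace(cov_a) + torch.trace(cov_b) - 2.0 * trace_sqrt
```

Because only means and covariances enter, two sample sets with matching first and second moments score as identical even if they look different, which is one of the known failure modes.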

import torch, torch.nn as nn

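# Classifier-free guidance at sampling time. The model is assumed to have been
# trained with the condition randomly dropped, so model(x, t, None) is valid.
# `shape` and `denoise_step` are supplied by the caller (sampler-specific).
def sample_cfg(model, y, num_steps, shape, denoise_step, guidance=7.5):
    x = torch.randn(shape)                      # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        noise_cond   = model(x, t, y)           # noise prediction with the condition
        noise_uncond = model(x, t, None)        # noise prediction without it
        # Extrapolate from the unconditional prediction towards the conditional one
        noise = noise_uncond + guidance * (noise_cond - noise_uncond)
        x = denoise_step(x, noise, t)           # one reverse-diffusion update
    return x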

SDE formulation, consistency models, and ImageNet scaling

Score SDE $$ dx = \big[\, f(x, t) - g(t)^2 \nabla_x \log p_t(x)\, \big]\, dt + g(t)\, d\bar W $$
  • Reverse-time SDE for sampling; needs the score ∇_x log p_t(x)
  • Unifies diffusion, score-based, and Langevin samplers
  • Score networks trained at multiple noise levels approximate ∇_x log p_t(x)

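Consistency models. Song et al. (2023). Train a network that maps any noisy point on a diffusion trajectory directly to its clean endpoint, either by distilling a pretrained diffusion model or by training from scratch. Sample in 1 to 4 steps instead of 50–1000. Trade-off: somewhat lower quality than a multi-step diffusion sampler; orders of magnitude faster.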

Flow matching & rectified flow. Lipman et al. (2023), Liu et al. (2022). Train a vector field that transports noise to data; sample by solving the ODE. Often faster than diffusion at comparable quality, because straighter paths need fewer solver steps. An active line of work on making diffusion-style sampling cheaper.
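
"Sample by solving the ODE" can be as simple as fixed-step Euler integration. A minimal sketch, assuming the convention that t = 0 is noise and t = 1 is data, and a `velocity_model(x, t)` that returns dx/dt; the function name and step count are illustrative, and higher-order solvers are a drop-in replacement.

```python
import torch

@torch.no_grad()
def euler_sample(velocity_model, x0, num_steps=50):
    """Integrate dx/dt = velocity_model(x, t) from t = 0 (noise) to t = 1 (data)."""
    x = x0.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.size(0),), i * dt)   # current time, one entry per example
        x = x + dt * velocity_model(x, t)      # explicit Euler step
    return x
```

If the learned paths were perfectly straight, a single Euler step would already be exact, which is the intuition behind needing few steps.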

Discrete diffusion & masked language models. The diffusion framework generalises beyond Gaussian noise — for discrete data (text, tokens), use absorbing-state diffusion or masked modelling. Mask-and-predict objectives are the conceptual cousin.
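
A minimal sketch of the absorbing-state idea: the forward process replaces tokens with a special mask token, and the model is trained to recover the originals at the masked positions. The linear schedule (masking probability equal to t), the helper names, and the `mask_id` convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def absorb_mask(tokens, t, mask_id):
    """Forward corruption: each token independently becomes mask_id with probability t."""
    masked = torch.rand(tokens.shape, device=tokens.device) < t
    corrupted = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    return corrupted, masked

def masked_prediction_loss(logits, tokens, masked):
    """Cross-entropy on the masked positions only.

    logits: (B, L, V) predictions from the corrupted sequence.
    tokens: (B, L) original tokens; masked: (B, L) boolean mask.
    Assumes at least one position is masked.
    """
    return F.cross_entropy(logits[masked], tokens[masked])
```

At t = 1 every token is masked, which plays the role that pure Gaussian noise plays in continuous diffusion.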

Autoregressive scaling. Most modern foundation models are autoregressive over discrete tokens, including over images (Parti, ImageGPT). Tokenise the image with a VQ-VAE or similar, then run an autoregressive transformer over the tokens. Slower at inference than parallel diffusion but conceptually simpler. MaskGIT uses the same kind of tokens but fills them in with parallel masked decoding rather than strictly left to right.
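
The tokenisation step boils down to a nearest-neighbour lookup in a learned codebook. A minimal sketch of that lookup alone, assuming the encoder has already produced latent vectors and the codebook is given; the function name `quantize` is illustrative.

```python
import torch

def quantize(latents, codebook):
    """Map each latent vector (N, D) to the index of its nearest codebook row (K, D)."""
    distances = torch.cdist(latents, codebook)   # (N, K) Euclidean distances
    return distances.argmin(dim=1)               # (N,) discrete token ids
```

The resulting integer ids, flattened into a sequence, are what the transformer models with next-token cross-entropy.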

Likelihood-based vs adversarial. Likelihood-based models (AR, diffusion, flows, VAEs) tend to cover all modes, sometimes at the cost of blurry or over-smoothed samples (most visibly in VAEs). Adversarial models (GANs) produce sharp samples but can miss modes. Hybrids (VAE-GAN, diffusion-GAN) try to combine the two.

Controllable generation. ControlNet, T2I-Adapter, LoRA, IP-Adapter, image conditioning, depth-conditioning, prompt-engineering — the modern stack for steering diffusion models. Less a "model family" than a meta-pattern of adding more conditioning signals.

Evaluation difficulties. FID is the de facto standard but can disagree with human judgement. CLIP-IQA, DINOv2-FID, and human evaluation (preference models) are alternative measures. Always look at samples, not just numbers.

import torch
import torch.nn.functional as F

# Consistency-model-style one-step generation
def one_step_sample(consistency_model, noise):
    return consistency_model(noise, t=1.0)   # one forward pass

# Flow matching — train a velocity field
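def flow_loss(model, x_data, sigma=1.0):
    t = torch.rand(x_data.size(0), device=x_data.device)   # one time per example, in [0, 1)
    t_b = t.view(-1, *([1] * (x_data.dim() - 1)))           # reshape to broadcast over data dims
    x0 = sigma * torch.randn_like(x_data)                   # noise endpoint (t = 0)
    x_t = (1 - t_b) * x0 + t_b * x_data                     # point on the straight path
    target_velocity = x_data - x0                           # d x_t / dt along that path
    return F.mse_loss(model(x_t, t), target_velocity)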