Diffusion Models — ML Resources Hub

Key idea

Add noise to your data, then learn to remove it. Take a real image, add a little Gaussian noise, then more, then more, until it's pure noise. Now train a network to reverse the process: given a noisy image, predict the noise that was added. Once trained, start from pure noise and run the denoiser many times. What pops out is a new sample.

[Interactive demo: a slider moves the image between clean (t = 0) and pure noise along the cosine schedule.]

The middle panel is what the model sees at timestep t. The right panel is what a "perfect denoiser" would predict for x₀, and that prediction sharpens as you move backward through time. The equation under the images is the forward formula: x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε. Training a diffusion model is just teaching a network to predict ε given x_t and t; everything else falls out of that.

The miracle: you only ever train on the inverse of a process you fully control (adding Gaussian noise). The forward process is just a scaled-down x plus noise at varying strengths. The model only has to learn one thing: "given this noisy image and this noise level, what noise was added?" That alone is enough to generate completely new samples from pure noise at inference.

Diffusion now powers most modern image generators (Stable Diffusion, DALL-E 3, Imagen, Midjourney), video models (Sora, Veo), and even some 3D and audio generators.

Reach for it when

High-quality image / video / audio generation
Text-conditioned generation with strong control via classifier-free guidance
You can afford 20–50 forward passes at inference (or use distillation)
Sample diversity matters more than inference speed

Skip it when

Single-step inference is required (use a GAN or distilled diffusion)
You need likelihoods (use a normalizing flow)
You don't have compute for both training and inference
Domain doesn't have enough training data to learn the noise-to-signal mapping

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("A serene mountain lake at sunset", num_inference_steps=30).images[0]
image.save("output.png")

Want the forward/reverse process math?

Forward and reverse processes
$$ q(\mathbf{x}_t \mid \mathbf{x}_0) \;=\; \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\,\mathbf{x}_0,\; (1 - \bar\alpha_t)\,\mathbf{I}\big), \qquad p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \;=\; \mathcal{N}(\boldsymbol{\mu}_\theta, \boldsymbol{\Sigma}_\theta) $$

q(x_t | x₀): forward process, a closed-form Gaussian with no learning needed
p_θ(x_{t−1} | x_t): reverse process, learned by the model
ᾱ_t: cumulative noise schedule; controls how much signal remains at step t

Forward process. A Markov chain that adds Gaussian noise over T steps (typically 1000). After T steps the data is indistinguishable from pure noise. Crucially, you can jump directly to any timestep — the closed-form formula above lets you compute x_t from x₀ in one step.
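As a rough illustration of the jump-to-any-timestep property, the sketch below builds ᾱ_t with the cosine schedule of Nichol & Dhariwal (2021) and applies the closed-form formula to a tiny list of pixel values. The function names, the offset s = 0.008, and the 0.999 clip on β_t are assumptions borrowed from that paper, not something this page specifies; a real pipeline would do the same arithmetic on tensors.

```python
import math
import random

def cosine_alphas_cumprod(T=1000, s=0.008, max_beta=0.999):
    """Cumulative signal fraction alpha_bar_t for t = 0..T-1 (cosine schedule)."""
    def f(step):
        return math.cos((step / T + s) / (1 + s) * math.pi / 2) ** 2

    alphas_cumprod, running = [], 1.0
    for t in range(T):
        # Per-step noise variance, clipped so the last steps stay well-behaved
        beta = min(1.0 - f(t + 1) / f(t), max_beta)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod

def q_sample(x0, a_bar, rng=random):
    """Jump straight from x_0 to x_t for a flat list of pixel values."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * e
           for v, e in zip(x0, noise)]
    return x_t, noise

alphas_cumprod = cosine_alphas_cumprod()
x0 = [0.5, -0.25, 1.0]                      # hypothetical pixel values
x_mid, _ = q_sample(x0, alphas_cumprod[500])  # no loop over the first 500 steps
for t in (0, 500, 999):
    print(t, math.sqrt(alphas_cumprod[t]), math.sqrt(1.0 - alphas_cumprod[t]))
```

The printed pairs are the signal and noise coefficients; their squares sum to one at every t, which is why the variance of x_t stays roughly constant as the signal fades.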

Training objective. Sample a data point x₀, sample a timestep t, sample noise ε, compute the noisy version x_t, then train the model to predict ε. The loss is a simple MSE between predicted and true noise (a reweighted form of the variational bound). That's it: no adversarial training, no explicit ELBO terms to evaluate, and far fewer stability problems than a GAN.

Sampling. Start from pure noise x_T, then iteratively predict and remove noise to get x_{T−1}, x_{T−2}, …, x₀. Each step uses the model's noise prediction at the current step. DDPM typically uses 1000 steps; DDIM (Song et al., 2021) cuts this to around 50 or fewer with deterministic sampling.
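The iterative loop described above might look roughly like the following for plain DDPM ancestral sampling. This is a minimal sketch, assuming a model called as `model(x, t)` that returns predicted noise, a 1-D tensor `betas` of per-step noise variances, and the σ_t² = β_t variance choice (one of the options discussed in Ho et al., 2020); the function name is made up for illustration.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Ancestral sampling: one model call per timestep, from t = T-1 down to 0."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape, device=betas.device)  # x_T: pure noise

    for t in reversed(range(betas.size(0))):
        t_batch = torch.full((shape[0],), t, device=betas.device, dtype=torch.long)
        eps = model(x, t_batch)  # predicted noise at this step

        # Posterior mean: remove the predicted noise, then rescale
        mean = (x - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()

        # Fresh noise at every step except the final one
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z
    return x

# Smoke test with a stand-in "model" that always predicts zero noise
betas = torch.linspace(1e-4, 0.02, 10)
sample = ddpm_sample(lambda x, t: torch.zeros_like(x), (2, 3, 8, 8), betas)
print(sample.shape)
```

The fresh noise `z` injected at each step is what makes this sampler stochastic; DDIM drops it.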

Classifier-free guidance. Train the model both conditionally (given a text prompt) and unconditionally (the prompt is replaced by a null prompt with some probability). At inference, blend: ε̂ = ε_uncond + s · (ε_cond − ε_uncond). The guidance scale s controls how strongly the prompt influences the result; raising it trades diversity for prompt fidelity.
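The blend can be sketched in a few lines. The three-argument call `model(x_t, t, cond)` and the name `guided_noise` are hypothetical, and the default scale of 7.5 is only an illustrative value; real pipelines usually batch the two forward passes together.

```python
import torch

def guided_noise(model, x_t, t, cond, null_cond, scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one by a factor of `scale`."""
    eps_uncond = model(x_t, t, null_cond)
    eps_cond = model(x_t, t, cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy stand-in model: its "noise prediction" is just the conditioning value
def toy_model(x_t, t, cond):
    return torch.full_like(x_t, cond)

x_t = torch.zeros(1, 3, 4, 4)
t = torch.tensor([10])
no_guidance = guided_noise(toy_model, x_t, t, cond=1.0, null_cond=0.0, scale=1.0)
strong = guided_noise(toy_model, x_t, t, cond=1.0, null_cond=0.0, scale=7.5)
```

With s = 1 the result is exactly the conditional prediction and with s = 0 it is the unconditional one; values above 1 push past the conditional prediction.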

Architecture. A U-Net (or now, a transformer) that takes (x_t, t) and predicts noise. The timestep is encoded with sinusoidal embeddings. Text conditioning is added via cross-attention.
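One common way to build the sinusoidal timestep encoding is shown below. It is a sketch that assumes an even embedding width and the transformer-style base of 10000; the function name and defaults are illustrative rather than taken from a specific library.

```python
import math
import torch

def timestep_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps of shape (B,) to sinusoidal features of shape (B, dim).

    Assumes `dim` is even: the first half holds cosines, the second half sines.
    """
    half = dim // 2
    # Geometric ladder of frequencies from 1 down to roughly 1 / max_period
    freqs = torch.exp(
        -math.log(max_period)
        * torch.arange(half, dtype=torch.float32, device=t.device) / half
    )
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 999]), dim=8)
print(emb.shape)
```

In practice this vector is usually passed through a small MLP before being added to, or used to modulate, the network's feature maps.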

Reach for it when

State-of-the-art quality matters more than speed
Conditional generation (text, class, image-to-image)
You need diversity across samples
You're willing to use 20–50 inference steps

Skip it when

Strict latency budget — use a single-step distilled model
Likelihoods are required for downstream use
Small dataset and no pretrained model — diffusion needs scale
You're doing density estimation (the model predicts noise, not log-density)

import torch
import torch.nn.functional as F

# DDPM training loop, distilled to its essence
def diffusion_loss(model, x_0, alphas_cumprod):
    B = x_0.size(0)
    T = alphas_cumprod.size(0)
    t = torch.randint(0, T, (B,), device=x_0.device)
    noise = torch.randn_like(x_0)

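    # Closed-form noisy version: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise
    # (the schedule is moved to x_0's device so indexing with t also works on GPU;
    #  the view assumes image batches shaped (B, C, H, W))
    a_bar = alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)
    x_t   = a_bar.sqrt() * x_0 + (1 - a_bar).sqrt() * noise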

    # Model predicts the noise; loss is just MSE
    pred = model(x_t, t)
    return F.mse_loss(pred, noise)

Want the SDE view, score matching, and modern accelerations?

Score-based view (SDE)
$$ \mathrm{d}\mathbf{x} \;=\; \boldsymbol{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}, \qquad \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \approx \mathbf{s}_\theta(\mathbf{x}, t) $$

f, g: drift and diffusion coefficients of the noising SDE
s_θ: learned score function, the gradient of the log-density at time t
Sampling = solve the reverse-time SDE, which only needs the score

SDE formulation (Song et al., 2021). The discrete-time DDPM is a special case of a continuous stochastic differential equation. The forward process is an SDE adding noise; the reverse process is another SDE that depends on the score ∇ log p_t(x). Training is equivalent to learning the score at every noise level — denoising score matching.
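The link between noise prediction and the score can be made explicit. Assuming the forward kernel q(x_t | x₀) = N(√ᾱ_t x₀, (1 − ᾱ_t) I) used earlier on this page, and writing ε_θ for the noise-prediction network (a symbol introduced here for illustration), differentiating the Gaussian log-density gives:

$$ \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0) \;=\; -\frac{\mathbf{x}_t - \sqrt{\bar\alpha_t}\,\mathbf{x}_0}{1 - \bar\alpha_t} \;=\; -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar\alpha_t}}, \qquad \text{so} \qquad \mathbf{s}_\theta(\mathbf{x}_t, t) \;\approx\; -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar\alpha_t}} $$

The second equality substitutes x_t = √ᾱ_t x₀ + √(1 − ᾱ_t) ε. Up to this time-dependent scaling, predicting the noise and predicting the score are the same task.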

DDIM vs. DDPM sampling. DDPM is stochastic (it adds noise at every step). DDIM (with η = 0) is deterministic: the same noise vector always produces the same image. That makes interpolation in latent space meaningful, and it is faster: around 50 DDIM steps give quality close to 1000 DDPM steps.

Latent diffusion. Train a VAE to compress images into a much smaller latent space (Stable Diffusion downsamples 8× along each spatial side, so a 512×512×3 image becomes a 64×64×4 latent), then train the diffusion model in that latent space. This massively reduces compute. Stable Diffusion is exactly this. The VAE decoder generates the final pixels. The VAE is task-agnostic: it just removes perceptual redundancy.
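To get a feel for the savings, here is the arithmetic for one configuration. The numbers assume Stable Diffusion v1's autoencoder (8× downsampling per spatial side, 4 latent channels); other latent diffusion models use different factors, and the helper name is made up for this example.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent the diffusion model actually operates on."""
    return (latent_channels, height // downsample, width // downsample)

pixel_values = 3 * 512 * 512           # RGB image
c, h, w = latent_shape(512, 512)
latent_values = c * h * w
ratio = pixel_values / latent_values   # how many times fewer values per sample
print((c, h, w), ratio)
```

Every denoising step runs on the smaller tensor, so the reduction applies to each of the 20 to 50 forward passes, not just once.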

Consistency models & distillation. Train a separate "student" model to map noise directly to data in one or a few steps, using the trained diffusion model as supervision. Consistency models (Song et al., 2023), Progressive Distillation (Salimans & Ho, 2022), and Adversarial Diffusion Distillation (the method behind SDXL Turbo) all attack the inference-speed problem.

Diffusion transformers (DiT). Replace the U-Net with a transformer over latent patches. DiTs scale better than U-Nets with compute and data and are the backbone of Stable Diffusion 3 and Sora. The architecture is ViT-style, with timestep + class conditioning via AdaLN.
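The AdaLN conditioning can be sketched as a layer norm whose scale and shift are predicted from the conditioning vector rather than learned as fixed parameters. This is a simplified illustration: the class name and dimensions are made up, and it omits the extra gating and zero-initialization of the adaLN-Zero variant used in the DiT paper.

```python
import torch
import torch.nn as nn

class AdaLayerNorm(nn.Module):
    """LayerNorm whose scale and shift come from a conditioning vector."""

    def __init__(self, hidden_dim, cond_dim):
        super().__init__()
        # No learned affine here: the conditioning supplies scale and shift
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_shift_scale = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, tokens, cond):
        # tokens: (B, N, hidden_dim) patch tokens; cond: (B, cond_dim),
        # e.g. timestep embedding plus class embedding
        shift, scale = self.to_shift_scale(cond).chunk(2, dim=-1)
        return self.norm(tokens) * (1 + scale[:, None, :]) + shift[:, None, :]

layer = AdaLayerNorm(hidden_dim=32, cond_dim=8)
tokens = torch.randn(2, 16, 32)   # 2 images, 16 latent patches each
cond = torch.randn(2, 8)
out = layer(tokens, cond)
print(out.shape)
```

Because the same shift and scale are broadcast across all patch tokens of an image, the conditioning acts globally without adding any tokens to the sequence.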

Why diffusion beat GANs. Stable training, far less mode collapse, easy conditioning (classifier-free guidance), strong scores on likelihood-free metrics such as FID, and a sampling process you can stop or modify midway. The trade-off, inference cost, has been steadily narrowed.

Reach for it when

DiT in latent space: text-to-image / video at scale
Score matching: density estimation and out-of-distribution detection
Consistency model: when latency matters
Conditional generation with controllable guidance strength

Skip it when

Extreme real-time latency: even few-step consistency models can be slower than a compact single-step GAN
You need exact likelihoods: the standard training objective gives a variational bound, not an exact value
You can't afford the U-Net / transformer at inference time
Small data — diffusion still needs significant data to generalize

import torch

# DDIM sampling — deterministic, faster than DDPM
@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, n_steps=50, device="cuda"):
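    # Use a coarse subset of timesteps; the schedule is moved to the sampling
    # device so it can be indexed with the timestep tensors below
    alphas_cumprod = alphas_cumprod.to(device)
    T  = alphas_cumprod.size(0)
    timesteps = torch.linspace(T - 1, 0, n_steps + 1).long().to(device)
    x = torch.randn(shape, device=device)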

    for i in range(n_steps):
        t      = timesteps[i].expand(shape[0])
        t_next = timesteps[i + 1].expand(shape[0])
        a_t    = alphas_cumprod[t].view(-1, 1, 1, 1)
        a_next = alphas_cumprod[t_next].view(-1, 1, 1, 1)

        # Predict the noise, derive x_0, then advance to next timestep
        noise = model(x, t)
        x0    = (x - (1 - a_t).sqrt() * noise) / a_t.sqrt()
        x     = a_next.sqrt() * x0 + (1 - a_next).sqrt() * noise

    return x


Where to learn more

Lilian Weng: What are Diffusion Models? The clearest unified treatment of DDPM, score matching, and SDE views. The single best resource.
Ho et al. (2020): DDPM. The paper that made diffusion competitive. Establishes the noise-prediction objective and the U-Net backbone.
Yang Song: Generative Modeling by Estimating Gradients. The unified SDE / score-matching view from one of the field's main authors. The right pivot for understanding the modern theory.
Hugging Face Diffusers. The library most modern diffusion work runs on. Excellent tutorials, model zoo, training scripts, and schedulers.
Rombach et al. (2022): Latent Diffusion. Stable Diffusion's paper. Shows how decoupling perceptual compression and diffusion makes high-res generation tractable.