Iterative denoising — the current SOTA for image, audio, and video generation.
Key idea
Add noise to your data, then learn to remove it. Take a real image, add a little Gaussian noise, then more, then more, until it's pure noise. Now train a network to reverse the process: given a noisy image, predict the noise. Once trained, start from pure noise and run the denoiser many times. What pops out is a new sample.
Interactive demo: a slider (with Forward / Reverse controls) walks an image between clean (t = 0) and pure noise along the cosine schedule.
The middle panel is what the model sees at timestep t. The right panel is what a "perfect denoiser" would predict for x₀, and you can watch that prediction sharpen as you move backward through time. The equation under the images shows the forward formula explicitly: x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε. Training a diffusion model is just teaching a network to predict ε given x_t and t; everything else falls out of that.
The miracle: you only ever train on the inverse of a process you fully control (adding Gaussian noise). The forward process is just a scaled-down x plus Gaussian noise at varying strengths. The model only has to learn one thing ("given this noisy image and this noise level, what noise was added?"), and that's enough to generate completely new samples from pure noise at inference.
Diffusion now powers most modern image generators (Stable Diffusion, DALL-E 3, Imagen, Midjourney), video models (Sora, Veo), and even some 3D and audio generators.
Reach for it when
High-quality image / video / audio generation
Text-conditioned generation with strong control via classifier-free guidance
You can afford 20–50 forward passes at inference (or use distillation)
Sample diversity matters more than inference speed
Skip it when
Single-step inference is required (use a GAN or distilled diffusion)
You need likelihoods (use a normalizing flow)
You don't have compute for both training and inference
Domain doesn't have enough training data to learn the noise-to-signal mapping
from diffusers import StableDiffusionPipeline
import torch

# Load pretrained weights in half precision and move the pipeline to the GPU
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Run 30 denoising steps; .images is a list of PIL images
image = pipe("A serene mountain lake at sunset", num_inference_steps=30).images[0]
image.save("output.png")
q(x_t | x₀): forward process, a closed-form Gaussian with no learning needed
p_θ(x_{t−1} | x_t): reverse process, learned by the model
ᾱ_t: cumulative noise schedule, controls how much signal remains at step t
Forward process. A Markov chain that adds Gaussian noise over T steps (typically 1000). After T steps the data is indistinguishable from pure noise. Crucially, you can jump directly to any timestep: because q(x_t | x₀) is Gaussian in closed form, you can compute x_t from x₀ in one step as x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε.
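The one-step jump can be sketched as follows. This is a minimal sketch, assuming the cosine schedule of Nichol & Dhariwal (2021); the helper names `cosine_alphas_cumprod` and `q_sample` are illustrative and not taken from any library.

```python
import math
import torch

def cosine_alphas_cumprod(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Cumulative signal level alpha_bar_t for t = 0..T-1 (cosine schedule)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    # Per-step betas, clipped so the last step does not divide by ~0
    betas = (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)
    return torch.cumprod(1 - betas, dim=0).float()

def q_sample(x0, t, alphas_cumprod, noise=None):
    """Jump straight to x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x0)
    # Reshape to (B, 1, 1, ...) so the scalar per sample broadcasts over x0
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

alphas_cumprod = cosine_alphas_cumprod()
x0 = torch.randn(4, 3, 32, 32)          # stand-in for a batch of images
t = torch.tensor([0, 250, 500, 999])    # one timestep per sample
xt = q_sample(x0, t, alphas_cumprod)    # no loop over intermediate steps
```

Each sample in the batch gets its own noise level, which is exactly what training needs.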
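Training objective. Sample a data point x₀, sample a timestep t, sample noise ε, compute the noisy version x_t, then train the model to predict ε. The loss is a simple MSE between predicted and true noise. That's it: no adversarial training, no explicit ELBO terms to balance (the simple loss is a reweighted variational bound; Ho et al., 2020), and far more stable optimization than a GAN.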
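One training step might look like the sketch below. `TinyDenoiser` is a hypothetical toy stand-in for a real U-Net, and the linear beta schedule (1e-4 to 0.02 over 1000 steps) is assumed here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Toy stand-in for a U-Net (hypothetical, for illustration only)."""
    def __init__(self, channels: int = 3, T: int = 1000):
        super().__init__()
        self.T = T
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # Feed the normalized timestep as one extra constant input channel
        t_map = (t.float() / self.T).view(-1, 1, 1, 1).expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, t_map], dim=1))

def training_step(model, x0, alphas_cumprod, optimizer):
    # Sample a random timestep and noise for every item in the batch
    t = torch.randint(0, alphas_cumprod.size(0), (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # Plain MSE between predicted and true noise
    loss = F.mse_loss(model(xt, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = training_step(model, torch.randn(8, 3, 16, 16), alphas_cumprod, optimizer)
```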
Sampling. Start from pure noise x_T, then iteratively predict and remove noise to get x_{T−1}, x_{T−2}, …, x₀. Each step uses the model's noise prediction at the current step. DDPM uses 1000 steps; DDIM (Song et al., 2021) reduces this to 50 or fewer with deterministic sampling.
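For contrast with DDIM, here is a rough sketch of the stochastic DDPM sampler, assuming the common variance choice σ_t² = β_t. The `dummy_model` is a hypothetical placeholder that always predicts zero noise, so the output is meaningless; only the loop structure matters.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    alphas = 1 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # start from pure noise x_T
    for step in reversed(range(betas.size(0))):
        t = torch.full((shape[0],), step, dtype=torch.long)
        eps = model(x, t)
        # Posterior mean: remove the predicted noise, then rescale
        mean = (x - betas[step] / (1 - alphas_cumprod[step]).sqrt() * eps) / alphas[step].sqrt()
        # Fresh noise at every step except the last one
        z = torch.randn_like(x) if step > 0 else torch.zeros_like(x)
        x = mean + betas[step].sqrt() * z
    return x

betas = torch.linspace(1e-4, 0.02, 50)           # short toy schedule
dummy_model = lambda x, t: torch.zeros_like(x)   # hypothetical placeholder
sample = ddpm_sample(dummy_model, (2, 3, 8, 8), betas)
```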
Classifier-free guidance. Train one model both conditionally (given a text prompt) and unconditionally (the prompt is replaced by a null prompt with some probability). At inference, blend the two predictions: ε̂ = ε_uncond + s · (ε_cond − ε_uncond). The guidance scale s controls how strongly the prompt influences the result: higher s trades diversity for prompt fidelity.
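Both halves of the trick fit in a few lines. This is a sketch under assumptions: the model is taken to accept `(x, t, cond)`, the function names are made up for illustration, and the 10% drop rate and scale of 7.5 are just commonly used values.

```python
import torch

def drop_condition(cond, null_cond, p_uncond=0.1):
    """Training time: swap each sample's condition for the null one with prob p_uncond."""
    drop = torch.rand(cond.size(0), device=cond.device) < p_uncond
    drop = drop.view(-1, *([1] * (cond.dim() - 1)))
    return torch.where(drop, null_cond, cond)

def guided_noise(model, x, t, cond, null_cond, scale=7.5):
    """Inference time: extrapolate from the unconditional toward the conditional prediction."""
    eps_uncond = model(x, t, null_cond)
    eps_cond = model(x, t, cond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy check with a hypothetical model whose output depends on the conditioning
toy_model = lambda x, t, c: x + c.mean(dim=(1, 2)).view(-1, 1, 1, 1)
x = torch.randn(2, 3, 8, 8)
t = torch.tensor([10, 10])
cond = torch.randn(2, 5, 16)          # stand-in for text embeddings
null_cond = torch.zeros_like(cond)    # stand-in for the empty-prompt embedding
eps_hat = guided_noise(toy_model, x, t, cond, null_cond, scale=7.5)
```

With `scale=1` this reduces to the plain conditional prediction, and with `scale=0` to the unconditional one. Real pipelines usually batch the two forward passes into one call.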
Architecture. A U-Net (or, increasingly, a transformer) that takes (x_t, t) and predicts the noise. The timestep is encoded with sinusoidal embeddings. Text conditioning is added via cross-attention.
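A sinusoidal timestep embedding can be sketched like this. The function name and the `max_period` default are assumptions borrowed from common practice, and `dim` is assumed to be even.

```python
import math
import torch

def timestep_embedding(t: torch.Tensor, dim: int, max_period: float = 10000.0) -> torch.Tensor:
    """Map integer timesteps of shape (B,) to (B, dim) sinusoidal features."""
    half = dim // 2
    # Geometric ladder of frequencies from 1 down to roughly 1 / max_period
    freqs = torch.exp(
        -math.log(max_period) * torch.arange(half, dtype=torch.float32, device=t.device) / half
    )
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 10, 999]), dim=128)
```

In a typical model this vector goes through a small MLP and is then injected into every block.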
f, g: drift and diffusion coefficients of the noising SDE
s_θ: learned score function, the gradient of the log-density at time t
Sampling = solve the reverse-time SDE, which only needs the score
SDE formulation (Song et al., 2021). The discrete-time DDPM is a special case of a continuous stochastic differential equation. The forward process is an SDE adding noise; the reverse process is another SDE that depends on the score ∇_x log p_t(x). Training is equivalent to learning the score at every noise level, i.e. denoising score matching.
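The link between noise prediction and the score can be checked numerically. The sketch below uses autograd to differentiate the Gaussian log-density log q(x_t | x₀) and compares it with −ε / √(1 − ᾱ_t); the value 0.3 for ᾱ_t is arbitrary.

```python
import torch

a_bar = torch.tensor(0.3)     # arbitrary signal level, for illustration only
x0 = torch.randn(5)
eps = torch.randn(5)
xt = (a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps).requires_grad_(True)

# log q(x_t | x_0): Gaussian with mean sqrt(a_bar) * x0 and variance (1 - a_bar),
# dropping the additive constant that does not depend on x_t
log_q = -((xt - a_bar.sqrt() * x0) ** 2).sum() / (2 * (1 - a_bar))
(score,) = torch.autograd.grad(log_q, xt)

# The same quantity written in terms of the noise that was added
score_from_eps = -eps / (1 - a_bar).sqrt()
```

So a network that predicts ε is, up to a known scale factor, a network that predicts the score.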
DDIM vs. DDPM sampling. DDPM is stochastic (it adds fresh noise at every step). DDIM is deterministic: the same initial noise vector always produces the same image. That makes interpolation in the noise (latent) space meaningful, and it is faster: around 50 DDIM steps give quality roughly comparable to 1000 DDPM steps.
Latent diffusion. Train a VAE to compress images into a much smaller latent space (Stable Diffusion downsamples 8× per side, so a 512×512×3 image becomes a 64×64×4 latent), then train the diffusion model in that latent space. This massively reduces compute. Stable Diffusion is exactly this design; the VAE decoder generates the final pixels. The VAE is task-agnostic: it just removes perceptual redundancy.
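To get a feel for the savings, here is the arithmetic, assuming the 8× per-side downsampling and 4 latent channels used by Stable Diffusion v1 (other latent diffusion models use different factors). `latent_shape` is a made-up helper.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent for an image of the given size."""
    return (latent_channels, height // downsample, width // downsample)

pixels = 3 * 512 * 512                # values in a 512x512 RGB image
c, h, w = latent_shape(512, 512)      # (4, 64, 64)
latents = c * h * w                   # values the diffusion model actually sees
ratio = pixels / latents              # 48.0
```

The denoiser runs on 48× fewer values per step, and it runs 20 to 50 times per image, so the saving compounds.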
Consistency models & distillation. Train a separate "student" model to map noise to data in one or a few steps, using the trained diffusion model as supervision. Consistency models (Song et al., 2023), Progressive Distillation (Salimans & Ho, 2022), and Adversarial Diffusion Distillation (used for SDXL Turbo) all attack the inference-speed problem.
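The flavor of step distillation can be sketched as follows. This is a heavily simplified take on the progressive-distillation idea, not the exact loss or parameterization from Salimans & Ho: the student is asked to land, in one DDIM step, where the frozen teacher lands in two. `ToyEps`, `ddim_step`, and `distill_loss` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEps(nn.Module):
    """Hypothetical noise predictor that ignores t (illustration only)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x, t):
        return self.conv(x)

def ddim_step(model, x, a_t, a_next, t):
    eps = model(x, t)
    x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps

def distill_loss(student, teacher, x_t, alphas_cumprod, t, t_mid, t_next):
    a_t, a_mid, a_next = alphas_cumprod[t], alphas_cumprod[t_mid], alphas_cumprod[t_next]
    steps = lambda s: torch.full((x_t.size(0),), s, dtype=torch.long)
    with torch.no_grad():
        # Teacher: two small steps, t -> t_mid -> t_next
        x_mid = ddim_step(teacher, x_t, a_t, a_mid, steps(t))
        target = ddim_step(teacher, x_mid, a_mid, a_next, steps(t_mid))
    # Student: one big step, t -> t_next
    pred = ddim_step(student, x_t, a_t, a_next, steps(t))
    return F.mse_loss(pred, target)

alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
teacher, student = ToyEps(), ToyEps()
x_t = torch.randn(4, 3, 8, 8)
loss = distill_loss(student, teacher, x_t, alphas_cumprod, t=500, t_mid=490, t_next=480)
loss.backward()   # gradients flow into the student only
```

Repeating this and halving the step count each round is what takes a sampler from many steps down to a handful.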
Diffusion transformers (DiT). Replace the U-Net with a transformer over latent patches. It tends to scale better than U-Nets with compute and data, and it is the backbone of Stable Diffusion 3 and Sora. ViT-style architecture with timestep + class conditioning via adaLN.
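A DiT-style block with adaptive layer norm might look roughly like this. It is a sketch of the adaLN-Zero idea, not the reference implementation: the conditioning vector (timestep plus class embedding) produces a shift, scale, and gate for each sub-layer, and the zero-initialized projection makes the block start out as the identity. `AdaLNBlock` is a hypothetical name.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        # LayerNorms carry no learned affine; scale and shift come from the condition
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        # Zero init: all gates start at 0, so the block is initially the identity
        nn.init.zeros_(self.modulation[1].weight)
        nn.init.zeros_(self.modulation[1].bias)

    def forward(self, x, cond):
        # x: (B, N, dim) patch tokens; cond: (B, dim) conditioning vector
        mod = self.modulation(cond).unsqueeze(1)
        shift1, scale1, gate1, shift2, scale2, gate2 = mod.chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + gate1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        return x + gate2 * self.mlp(h)

block = AdaLNBlock(dim=64)
x = torch.randn(2, 16, 64)     # 16 latent patches per sample
cond = torch.randn(2, 64)      # stand-in for timestep + class embedding
out = block(x, cond)
```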
Why diffusion beat GANs. Stable training, far less mode collapse, easy conditioning (classifier-free guidance), strong sample-quality metrics such as FID, and a sampling process you can stop or modify midway. The trade-off, inference cost, has been shrinking steadily.
Reach for it when
DiT in latent space: text-to-image / video at scale
Score matching: density estimation and out-of-distribution detection
Consistency model: when latency matters
Conditional generation with controllable guidance strength
Skip it when
Extreme real-time latency: even few-step consistency models are typically slower than a single-step GAN
You need exact likelihoods: standard diffusion training gives a variational bound, not an exact value
You can't afford the U-Net / transformer at inference time
Small data — diffusion still needs significant data to generalize
import torch

# DDIM sampling: deterministic, faster than DDPM
@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, n_steps=50, device="cuda"):
    # Keep the schedule on the same device as the index tensors below
    alphas_cumprod = alphas_cumprod.to(device)
    # Use a coarse subset of timesteps, from T - 1 down to 0
    T = alphas_cumprod.size(0)
    timesteps = torch.linspace(T - 1, 0, n_steps + 1).long().to(device)
    x = torch.randn(shape, device=device)
    for i in range(n_steps):
        t = timesteps[i].expand(shape[0])
        t_next = timesteps[i + 1].expand(shape[0])
        a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
        a_next = alphas_cumprod[t_next].view(-1, 1, 1, 1)
        # Predict the noise, derive x_0, then advance to the next timestep
        noise = model(x, t)
        x0 = (x - (1 - a_t).sqrt() * noise) / a_t.sqrt()
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * noise
    return x