Encoder-decoder networks for representation learning, dimensionality reduction, and generation.
Key idea
Squeeze data through a narrow channel, then reconstruct it. An encoder compresses input to a small "code"; a decoder tries to reconstruct the original from just that code. The bottleneck forces the network to learn what matters.
In the interactive demo, a 12×12 image is encoded into just two numbers, shown as the position of an orange dot in the latent plane. The decoder takes those two numbers and tries to reconstruct the image. Because the latent space is continuous, points between prototypes give blends; that's why a VAE can generate "new" faces by walking around in latent space.
Picture an image being shrunk to a tiny vector of, say, 32 numbers, then expanded back to a full image. The reconstruction won't be perfect — but if you train the encoder and decoder together to minimize reconstruction error, the 32 numbers end up capturing the most important features.
That compressed "code" is useful: as a learned embedding for downstream tasks, for dimensionality reduction (like a non-linear PCA), and — with the right twist — as a starting point for a VAE that can generate new images by sampling new codes.
qφ(z|x): encoder; outputs a distribution over codes, not a single point
pθ(x|z): decoder; likelihood of reconstructing x from z
p(z): prior, typically 𝒩(0, I)
Training maximizes the evidence lower bound (ELBO), a lower bound on log p(x)
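The bound itself is not written out in the text. In the notation of the three definitions above, the standard form of the ELBO is usually given as follows; the symbol ℒ(θ, φ; x) is a label chosen here for illustration.

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

The first term rewards good reconstruction; the second keeps the encoder's distribution close to the prior.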
Plain autoencoder. Loss is just reconstruction error (MSE for continuous data, cross-entropy for binary data). Nothing in the objective imposes structure on the latent space, so there's no reason a midpoint between two codes corresponds to anything meaningful.
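A minimal sketch of a plain autoencoder in PyTorch may make this concrete. The class name `AutoEncoder`, the layer sizes, and the random stand-in batch are all assumptions for illustration; the 144-dimensional input simply matches a flattened 12×12 image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    """Hypothetical fully connected autoencoder: 144 -> 2 -> 144."""
    def __init__(self, in_dim=144, hidden=64, code_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, code_dim))
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # the bottleneck "code"
        return self.decoder(z), z    # reconstruction and code

torch.manual_seed(0)
model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.rand(32, 144)              # stand-in batch, not real images

losses = []
for _ in range(200):
    x_hat, z = model(x)
    loss = F.mse_loss(x_hat, x)      # reconstruction error only
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Encoder and decoder are trained jointly by a single optimizer; nothing in the loss constrains where the codes land in latent space.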
Denoising autoencoder. Add noise to the input, train to reconstruct the clean version. Forces the network to learn structure rather than memorize the identity function. Often used as a pretraining objective.
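The only change from a plain autoencoder is which tensor goes in and which is the target. A hedged sketch follows; the helper name `denoising_step`, the Gaussian noise model, and `noise_std=0.3` are illustrative assumptions rather than values from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def denoising_step(model, optimizer, x_clean, noise_std=0.3):
    """One training step: corrupt the input, reconstruct the clean version."""
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    x_hat = model(x_noisy)
    loss = F.mse_loss(x_hat, x_clean)   # target is the clean input, not the noisy one
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
# Stand-in model; any encoder-decoder module with matching input/output size works.
model = nn.Sequential(nn.Linear(144, 32), nn.ReLU(), nn.Linear(32, 144))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
x_clean = torch.rand(64, 144)           # stand-in data
losses = [denoising_step(model, optimizer, x_clean) for _ in range(100)]
```

Because fresh noise is drawn at every step, the network cannot succeed by copying its input through.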
Variational autoencoder (VAE). The encoder produces a distribution over codes — typically 𝒩(μ(x), σ(x)²) — and the decoder reconstructs from samples. Training enforces (a) good reconstruction and (b) the encoder's distribution stays close to a fixed prior, usually a unit Gaussian. The KL term in the loss does the "stays close" work.
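For a diagonal Gaussian encoder and a unit Gaussian prior, the KL term has a closed form. The sketch below assumes the encoder outputs `mu` and `logvar` (log of σ²), a common convention that the text does not state explicitly, and checks the formula against `torch.distributions`.

```python
import torch
from torch.distributions import Normal, kl_divergence

def gaussian_kl(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims, one value per sample."""
    return -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)

torch.manual_seed(0)
mu = torch.randn(4, 3)        # stand-in encoder means
logvar = torch.randn(4, 3)    # stand-in encoder log-variances

closed_form = gaussian_kl(mu, logvar)

# Reference value from torch.distributions for comparison
q = Normal(mu, (0.5 * logvar).exp())
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
reference = kl_divergence(q, p).sum(dim=-1)
```

The KL is zero only when μ = 0 and σ = 1, which is what "stays close to the prior" means in practice.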
The reparameterization trick. Sampling from 𝒩(μ, σ²) is non-differentiable, so we rewrite the sample as z = μ + σ · ε where ε ~ 𝒩(0, I). Gradients flow through μ and σ; the randomness is offloaded to ε.
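The trick is only a few lines of code. This sketch again assumes the encoder emits `mu` and `logvar`; the function name `reparameterize` is illustrative.

```python
import torch

def reparameterize(mu, logvar):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, I)."""
    std = (0.5 * logvar).exp()       # sigma = exp(logvar / 2)
    eps = torch.randn_like(std)      # all randomness lives here; no gradient needed
    return mu + std * eps

mu = torch.zeros(5, 2, requires_grad=True)
logvar = torch.zeros(5, 2, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                   # gradients reach mu and logvar through the sample
```

After the backward pass, `mu.grad` is all ones (since dz/dμ = 1) and `logvar.grad` depends on the drawn ε.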
Why VAEs work for generation. Because training pulls the encoder's codes toward a known distribution (a unit Gaussian), you can sample new codes from that prior, run them through the decoder, and get plausible new samples. The trade-off: VAE samples are usually blurrier than GAN or diffusion samples, because a pixel-wise likelihood rewards averaging over all the outputs a code could plausibly decode to.
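Generation and interpolation then reduce to decoder calls. In the sketch below the decoder is an untrained stand-in and the helper names `generate` and `interpolate` are hypothetical; with a trained VAE decoder the same calls would yield samples and blends.

```python
import torch
import torch.nn as nn

latent_dim = 2
# Stand-in decoder (untrained); in practice this is the trained VAE decoder.
decoder = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 144), nn.Sigmoid())

@torch.no_grad()
def generate(decoder, n, latent_dim):
    """Sample n codes from the unit Gaussian prior and decode them."""
    z = torch.randn(n, latent_dim)
    return decoder(z).view(n, 12, 12)

@torch.no_grad()
def interpolate(decoder, z_a, z_b, steps=5):
    """Decode evenly spaced points on the line from z_a to z_b."""
    t = torch.linspace(0, 1, steps).unsqueeze(1)   # shape (steps, 1)
    z = (1 - t) * z_a + t * z_b                    # shape (steps, latent_dim)
    return decoder(z).view(steps, 12, 12)

samples = generate(decoder, 8, latent_dim)
path = interpolate(decoder, torch.tensor([-1.0, 0.0]), torch.tensor([1.0, 0.0]))
```

Generation is a single forward pass per sample, which is where the speed advantage over iterative samplers comes from.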
Reach for it when
You need a continuous, navigable latent space (interpolation, arithmetic)
Anomaly detection: score inputs by reconstruction error or by the ELBO, an approximate likelihood under the model
Representation learning with unlabelled data
You want a fast, single-step generator (much faster than diffusion)
Skip it when
Highest possible sample quality matters — GANs / diffusion win
You only need embeddings — use a contrastive method instead
Sharp generation (no blur) is required for the application
You need exact likelihoods (normalizing flows give those)
Maximizing the ELBO simultaneously fits the data and makes q close to the true posterior p(z|x)
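The reason both things happen at once is a standard identity, written here in the same notation as the encoder, decoder, and prior definitions; ℒ is used as a label for the ELBO.

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
      - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\mathcal{L}\ (\text{ELBO})}
  \;+\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)
```

The last KL term is non-negative, so ℒ never exceeds log p(x). Raising ℒ with respect to θ pushes up the data likelihood, and raising it with respect to φ shrinks the gap between q and the true posterior.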
β-VAE. Re-weight the KL term: ℒ = E[log p(x|z)] − β · KL(q‖p), where β = 1 recovers the standard VAE objective. With β > 1, the model is pushed harder toward the prior, encouraging "disentangled" latents where each dimension corresponds to one factor of variation. With β < 1, reconstructions get sharper but the codes match the prior less well, so latent structure suffers.
Posterior collapse. When the decoder is too powerful (for example, an autoregressive decoder), the model learns to ignore z entirely: the encoder's output collapses to the prior for every input, the KL term goes to zero, and the decoder models x on its own. Fixes: KL annealing (start with KL weight 0, ramp up), free bits (don't penalize KL below a threshold), or weaken the decoder.
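VQ-VAE. Replace the Gaussian latent with a discrete code from a learned codebook. This forces the encoder to commit to specific codes and sidesteps KL-driven posterior collapse, because the KL term is constant under a uniform prior over codes (unused codebook entries are a separate failure mode). It is the basis of modern image tokenizers; closely related discrete autoencoders are used in DALL-E v1, Parti, and latent diffusion.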
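The quantization step can be sketched as a nearest-neighbour lookup with a straight-through gradient. The function name `vector_quantize`, the tensor shapes, and the commitment weight of 0.25 are assumptions for illustration, not details given in the text.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, commitment_cost=0.25):
    """z_e: (batch, dim) encoder outputs; codebook: (K, dim) learned codes."""
    # Squared Euclidean distance from every encoder output to every code: (batch, K)
    dists = (z_e.unsqueeze(1) - codebook.unsqueeze(0)).pow(2).sum(dim=-1)
    indices = dists.argmin(dim=1)                 # discrete code per input
    z_q = codebook[indices]                       # quantized vectors

    codebook_loss = F.mse_loss(z_q, z_e.detach())  # moves codes toward encoder outputs
    commit_loss = F.mse_loss(z_e, z_q.detach())    # keeps encoder outputs near their code
    # Straight-through estimator: forward value is z_q, gradient flows to z_e
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, indices, codebook_loss + commitment_cost * commit_loss

torch.manual_seed(0)
codebook = torch.randn(16, 4, requires_grad=True)
z_e = torch.randn(8, 4, requires_grad=True)
z_q, indices, vq_loss = vector_quantize(z_e, codebook)
z_q.sum().backward()   # gradient passes straight through to z_e
```

The `argmin` has no gradient, so the straight-through copy is what lets the reconstruction loss train the encoder; the codebook itself is updated only through the auxiliary loss.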
Latent diffusion (Rombach et al., 2022). Train a VAE to encode images to a compact latent space, then train a diffusion model in that latent space. Stable Diffusion is exactly this. Decouples the perceptual compression (VAE) from the generative process (diffusion).
Normalizing flows. A different direction — invertible neural networks that map between the data and a simple distribution exactly. Likelihood is computable in closed form (change of variables), but architectures are restricted by invertibility. Examples: RealNVP, Glow, FFJORD.
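A single affine coupling layer, in the style of RealNVP, illustrates both the exact likelihood and the architectural restriction. This is a simplified sketch: the class name `AffineCoupling`, the layer sizes, and the `tanh` bound on the log-scale are assumptions made here, not details from the cited models.

```python
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Transforms the second half of x conditioned on the first half (kept unchanged)."""
    def __init__(self, dim=4, hidden=16):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def _scale_shift(self, x1):
        log_s, t = self.net(x1).chunk(2, dim=1)
        return torch.tanh(log_s), t               # bounded log-scale for stability

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self._scale_shift(x1)
        y2 = x2 * log_s.exp() + t
        # Jacobian is triangular, so log|det| is just the sum of log-scales
        return torch.cat([x1, y2], dim=1), log_s.sum(dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self._scale_shift(y1)
        return torch.cat([y1, (y2 - t) * (-log_s).exp()], dim=1)

def log_prob(flow, x):
    """Exact log p(x) via change of variables with a unit Gaussian base."""
    z, log_det = flow(x)
    log_pz = -0.5 * (z.pow(2).sum(dim=1) + z.size(1) * math.log(2 * math.pi))
    return log_pz + log_det

torch.manual_seed(0)
flow = AffineCoupling()
x = torch.randn(6, 4)
y, log_det = flow(x)
x_rec = flow.inverse(y)
lp = log_prob(flow, x)
```

The inverse never has to invert the inner network, only re-evaluate it on the unchanged half; that is the invertibility constraint shaping the architecture. Real flows stack many such layers and alternate which half is transformed.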
Disentanglement. The hope that each latent dimension corresponds to a single semantic factor (pose, lighting, identity, …). Locatello et al. (2019) proved that unsupervised disentanglement is impossible without inductive biases — but specific architectural choices (β-VAE, FactorVAE) can encourage it in practice.
Reach for it when
VQ-VAE: tokenizers for multi-modal models, discrete codes
Latent diffusion: high-resolution generation with controllable compute
β-VAE: studying disentangled representations
Flows: when you need exact likelihoods
Skip it when
Sharp samples matter most — diffusion wins
You don't need a latent space — contrastive embeddings are simpler and often better
The likelihood is the goal but the prior is wrong
Architecture is too constrained — flows have strict invertibility limits
import torch
import torch.nn as nn
import torch.nn.functional as F

# β-VAE with KL annealing and free bits.
# Assumes a plain `VAE` module (encoder -> mu, logvar; decoder -> x_hat) is already defined.
class BetaVAE(VAE):
    pass  # same module; the loss is what changes

def beta_vae_loss(x_hat, x, mu, logvar, beta=4.0, free_bits=0.05):
    # Reconstruction term: squared error summed over the whole batch
    rec = F.mse_loss(x_hat, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) per sample and per latent dim, shape (batch, latent_dim)
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
    # Free bits: KL below the threshold contributes nothing to the loss
    kl = (kl_per_dim - free_bits).clamp(min=0).sum()
    return rec + beta * kl

# KL annealing: ramp beta linearly from 0 to `target` over `warmup` steps,
# then pass the result as `beta` to beta_vae_loss
def beta_schedule(step, warmup=5000, target=4.0):
    return min(1.0, step / warmup) * target