A generator and a discriminator in adversarial training — the architecture that started the modern image-generation era.
Key idea
Two networks playing against each other. A generator tries to make fake samples that look real. A discriminator tries to tell real from fake. They train together — each gets better as the other improves, like a counterfeiter and a detective sharpening each other's skills.
Click Train and watch the orange (fake) particles slide toward the indigo (real) distribution as the discriminator chases them.
The orange wash in the background shows where the discriminator currently thinks "real" samples live; the indigo wash is "fake". As D learns, the boundary tightens around the real points — and each fake particle climbs the gradient of D toward higher-orange regions. Try the Two clusters preset to see mode collapse, where the generator pours all its mass into a single mode.
You give the generator a vector of random noise, and it produces an image. Initially the images are garbage. The discriminator sees real images from your dataset and fake images from the generator, and learns to tell them apart. The generator's loss is "did I fool the discriminator?" — so it updates to make better fakes. They escalate.
At equilibrium (in theory), the generator produces samples indistinguishable from real data and the discriminator can only guess 50/50. In practice training is finicky, but when it works the samples are beautiful.
Reach for it when
High-quality image generation with fast inference
Image-to-image translation (pix2pix, CycleGAN)
Super-resolution
You need a single-step generator (faster than diffusion)
Skip it when
You want maximum sample diversity — GANs are prone to mode collapse
You need likelihoods (GANs don't have them)
You don't want to tune training — diffusion is more forgiving
You want controllable generation — diffusion + guidance is easier
D maximizes its accuracy; G minimizes D's accuracy on fakes.
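Assuming the standard formulation, the objective described here can be written as the minimax value function below. The symbols p_data (data distribution) and p_z (noise distribution) are labels chosen for this sketch; D(x) is the discriminator's probability that x is real.

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The first term rewards D for scoring real samples high; the second rewards D for scoring fakes low, and it is the only term G can influence.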
Training dynamics. Alternate one gradient step on D (maximize), then one on G (minimize). In practice the generator's gradient saturates early — fakes are obviously fake, so log(1 − D(G(z))) is near zero. The standard fix is the non-saturating loss: minimize −log D(G(z)) instead, which gives stronger gradients when D is winning.
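The alternating update with the non-saturating generator loss might look like the following minimal sketch. The toy 2-D data, the tiny MLPs, the layer sizes, and the optimizer settings are all illustrative assumptions, not a recommended configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy setup: 2-D data and small MLPs (sizes are illustrative).
noise_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))  # logits
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))


def train_step(real):
    """One D step followed by one G step; returns both losses as floats."""
    batch = real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # D step: maximizing log D(x) + log(1 - D(G(z))) equals minimizing this BCE.
    fake = G(torch.randn(batch, noise_dim)).detach()  # no gradient into G here
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
              + F.binary_cross_entropy_with_logits(D(fake), zeros))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # G step (non-saturating): minimize -log D(G(z)), i.e. BCE against "real" labels.
    fake = G(torch.randn(batch, noise_dim))
    g_loss = F.binary_cross_entropy_with_logits(D(fake), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()


real_batch = torch.randn(64, data_dim) * 0.5 + 2.0  # stand-in for real data
d_loss, g_loss = train_step(real_batch)
```

Detaching the fakes in the D step keeps the discriminator update from leaking gradients into the generator; the G step then draws fresh noise and backpropagates through D into G.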
Mode collapse. The generator finds a few outputs that fool the discriminator and produces only those, ignoring the rest of the data distribution. Symptom: lots of training time, but samples look very similar to each other. Diagnostic: visualize generator output diversity over training.
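One simple way to put a number on output diversity, alongside looking at sample grids, is the mean pairwise distance within a batch of generated samples. This is a rough sketch: the helper name is hypothetical, and raw Euclidean distance is an assumption that suits low-dimensional toy data better than images.

```python
import torch


def mean_pairwise_distance(samples):
    """Average Euclidean distance between distinct pairs in a batch."""
    flat = samples.flatten(start_dim=1)
    dists = torch.cdist(flat, flat)  # (N, N) matrix, zeros on the diagonal
    n = flat.size(0)
    return dists.sum().item() / (n * (n - 1))


torch.manual_seed(0)
diverse = torch.randn(128, 2)  # stand-in for a healthy generator
collapsed = torch.randn(1, 2).repeat(128, 1) + 0.01 * torch.randn(128, 2)  # one mode

diverse_score = mean_pairwise_distance(diverse)
collapsed_score = mean_pairwise_distance(collapsed)
```

Logging this value every few hundred steps on a fixed set of noise vectors gives a curve; a sharp drop toward zero is a hint that the generator is collapsing.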
DCGAN (Radford et al., 2015). Established the convolutional GAN recipe: strided convolutions instead of pooling, batch norm in both networks (but not in G's output or D's input), ReLU in G, LeakyReLU in D. Made stable image-GAN training reproducible.
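A generator following this recipe could be sketched as below. The class name, channel widths, and the 32×32 output size are assumptions made for the example; the Tanh on the output layer follows the DCGAN paper.

```python
import torch
import torch.nn as nn


class DCGANGenerator(nn.Module):
    """Transposed-conv generator: noise vector -> 32x32 image in [-1, 1]."""

    def __init__(self, noise_dim=100, base=64, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # (B, noise_dim, 1, 1) -> (B, base*4, 4, 4)
            nn.ConvTranspose2d(noise_dim, base * 4, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base * 4), nn.ReLU(inplace=True),
            # -> (B, base*2, 8, 8)
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 2), nn.ReLU(inplace=True),
            # -> (B, base, 16, 16)
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base), nn.ReLU(inplace=True),
            # -> (B, img_channels, 32, 32); no batch norm on the output layer
            nn.ConvTranspose2d(base, img_channels, 4, 2, 1),
            nn.Tanh(),
        )

    def forward(self, z):
        # Treat the noise vector as a 1x1 feature map with noise_dim channels.
        return self.net(z.view(z.size(0), -1, 1, 1))


generator = DCGANGenerator()
images = generator(torch.randn(4, 100))
```

Each stride-2 transposed convolution doubles the spatial size, which is the upsampling counterpart of the strided convolutions used in D.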
Conditional GAN. Concatenate a label (or a learned class embedding) to both the generator input and the discriminator input. Now G generates class-conditional samples, and D evaluates them given the class. Foundation for everything from pix2pix to BigGAN to StyleGAN.
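The concatenation can be sketched on toy vectors as follows. The class names, the embedding size, and the MLP widths are hypothetical; image models usually inject the label in more elaborate ways, so this only illustrates the basic idea.

```python
import torch
import torch.nn as nn


class CondGenerator(nn.Module):
    def __init__(self, noise_dim=16, num_classes=10, embed_dim=8, data_dim=2):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)  # learned class embedding
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 64), nn.ReLU(),
            nn.Linear(64, data_dim),
        )

    def forward(self, z, labels):
        # Concatenate the noise with the class embedding before generating.
        return self.net(torch.cat([z, self.embed(labels)], dim=1))


class CondDiscriminator(nn.Module):
    def __init__(self, num_classes=10, embed_dim=8, data_dim=2):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(data_dim + embed_dim, 64), nn.LeakyReLU(0.2),
            nn.Linear(64, 1),
        )

    def forward(self, x, labels):
        # D judges the sample together with the class it claims to belong to.
        return self.net(torch.cat([x, self.embed(labels)], dim=1))


g, d = CondGenerator(), CondDiscriminator()
labels = torch.randint(0, 10, (5,))
samples = g(torch.randn(5, 16), labels)
scores = d(samples, labels)
```

Because D sees the label, a realistic sample of the wrong class can still be rejected, which is what pushes G to respect the condition.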
Common failure modes. Discriminator dominates (generator's gradients vanish), generator dominates (discriminator can't keep up), oscillation (both networks fight, never converge). Monitor both losses; instability often shows up as one collapsing first.
𝒟_{1-Lip}: the set of 1-Lipschitz functions; D must have bounded gradient
Approximates the Wasserstein-1 distance between real and generated distributions
Gradient doesn't vanish when distributions don't overlap (unlike JS-divergence)
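WGAN (Arjovsky et al., 2017). With an optimal discriminator, the vanilla GAN objective reduces (up to constants) to the Jensen-Shannon divergence between the real and generated distributions. When the two distributions don't overlap, which is typical early in training, JS is locally constant, so the generator receives zero gradient. The Wasserstein distance keeps changing as the distributions move closer or further apart, so it gives meaningful gradients everywhere. Implementing it requires keeping D 1-Lipschitz.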
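For reference, the Wasserstein-1 distance in the dual form that WGAN optimizes can be written as below, assuming the standard Kantorovich-Rubinstein formulation. Here p_g denotes the generator's distribution and 𝒟_{1-Lip} the set of 1-Lipschitz critics; both names are notation chosen for this sketch.

```latex
W_1(p_{\text{data}}, p_g)
  = \sup_{D \in \mathcal{D}_{1\text{-Lip}}}
    \mathbb{E}_{x \sim p_{\text{data}}}\big[D(x)\big]
  - \mathbb{E}_{x \sim p_g}\big[D(x)\big]
\qquad\Longrightarrow\qquad
\min_G \max_{D \in \mathcal{D}_{1\text{-Lip}}}
    \mathbb{E}_{x \sim p_{\text{data}}}\big[D(x)\big]
  - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big]
```

Note that D here outputs an unbounded score rather than a probability, which is why it is often called a critic.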
Lipschitz enforcement. Original WGAN used weight clipping (crude). WGAN-GP (Gulrajani et al., 2017) added a gradient penalty: penalize the discriminator if its gradient norm deviates from 1 on samples between real and fake. Spectral normalization (Miyato et al., 2018) bounds the largest singular value of each weight matrix — the cleanest modern approach.
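The gradient penalty can be sketched as follows. The function name and the toy linear critic are assumptions for illustration; the WGAN-GP paper weights this term by a coefficient (10 in its experiments) before adding it to the critic loss.

```python
import torch
import torch.nn as nn


def gradient_penalty(critic, real, fake):
    """Penalize deviation of the critic's gradient norm from 1 on interpolates."""
    batch = real.size(0)
    # One interpolation coefficient per sample, broadcast over the other dims.
    eps = torch.rand(batch, *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)
    scores = critic(x_hat)
    # create_graph=True keeps the penalty differentiable w.r.t. critic weights.
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                 create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()


torch.manual_seed(0)
critic = nn.Linear(2, 1)  # toy critic; any module mapping samples to scores works
real = torch.randn(8, 2)
fake = torch.randn(8, 2)
penalty = gradient_penalty(critic, real, fake)
```

For a linear critic the input gradient is just the weight vector, so the penalty is zero exactly when that vector has unit norm, which makes the toy case easy to check by hand.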
Progressive growing & StyleGAN. Karras et al. trained generators that start at 4×4 and double the resolution progressively. StyleGAN starts synthesis from a learned constant instead of the latent vector; the latent is mapped to a style code that is injected at every layer, which gives more controllable, disentangled generation. StyleGAN2 fixed characteristic artifacts; StyleGAN3 made the model equivariant to translation and rotation.
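Style injection in the original StyleGAN was done with adaptive instance normalization (AdaIN); StyleGAN2 later replaced it with weight demodulation. A minimal sketch of the AdaIN variant is shown below, where the class names, the mapping-network depth, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    """Maps a latent z to an intermediate style code w (a small MLP here)."""

    def __init__(self, latent_dim=64, style_dim=64, depth=3):
        super().__init__()
        layers, dim = [], latent_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, style_dim), nn.LeakyReLU(0.2)]
            dim = style_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)


class AdaIN(nn.Module):
    """Normalize each feature map, then scale and shift it using the style."""

    def __init__(self, style_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale_bias = nn.Linear(style_dim, channels * 2)

    def forward(self, x, w):
        scale, bias = self.to_scale_bias(w).chunk(2, dim=1)
        scale = scale[:, :, None, None]  # broadcast over height and width
        bias = bias[:, :, None, None]
        return (1 + scale) * self.norm(x) + bias


mapping = MappingNetwork()
adain = AdaIN(style_dim=64, channels=32)
w = mapping(torch.randn(4, 64))       # style code per sample
features = torch.randn(4, 32, 8, 8)   # stand-in for one layer's activations
styled = adain(features, w)
```

Applying a separate AdaIN after each synthesis layer is what lets different layers receive different styles, for example coarse layers from one latent and fine layers from another.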
BigGAN (Brock et al., 2018). Scale up — large batch sizes (2048), large models, class-conditional everything. Reached then-SOTA on ImageNet. Highlighted that GANs respond well to scale, but require careful hyperparameter tuning.
Why GANs lost to diffusion (for now). Mode coverage, sample diversity, training stability, and controllability all favour diffusion. GANs win on inference speed (one forward pass vs. many denoising steps). Recent work on consistency models and distillation tries to close the inference-speed gap — see the Diffusion page.
Evaluation. No likelihoods, so we use proxies: FID (Fréchet Inception Distance), which compares Inception feature statistics of real and generated images; IS (Inception Score), which rewards recognizable and varied samples; KID (Kernel Inception Distance); and Precision/Recall, computed in feature space rather than on raw pixels. All have failure modes; FID is the de facto standard.
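The statistic behind FID is the Fréchet distance between two Gaussians fitted to feature vectors. The sketch below computes it from two feature matrices; the function name is hypothetical, and the random features stand in for the Inception activations a real FID computation would extract from images.

```python
import torch


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(dim=0), feats_b.mean(dim=0)
    cov_a, cov_b = torch.cov(feats_a.T), torch.cov(feats_b.T)
    # tr(sqrt(cov_a @ cov_b)) is the sum of square roots of the product's
    # eigenvalues, which are real and non-negative for PSD covariances.
    eigvals = torch.linalg.eigvals(cov_a @ cov_b).real.clamp(min=0)
    tr_sqrt = eigvals.sqrt().sum()
    dist = (mu_a - mu_b).pow(2).sum() + cov_a.trace() + cov_b.trace() - 2 * tr_sqrt
    return dist.item()


torch.manual_seed(0)
real_feats = torch.randn(500, 4, dtype=torch.float64)       # stand-in features
fake_feats = torch.randn(500, 4, dtype=torch.float64) + 0.5  # shifted distribution
score = frechet_distance(real_feats, fake_feats)
```

Double precision is used because the covariance product can be poorly conditioned; with few samples relative to the feature dimension the estimate becomes biased, which is one of the failure modes mentioned above.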
Reach for it when
StyleGAN: face / object synthesis with explicit style control
CycleGAN: unpaired image-to-image translation
Conditional GANs: real-time generation for interactive applications
Inference speed matters more than absolute sample quality
Skip it when
Text-to-image — diffusion has won this fight decisively
You need diversity coverage of a complex distribution
You want straightforward training without convergence drama
You need probabilistic outputs or controllable inference
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


# Spectral-norm discriminator for stable adversarial training
class SNDiscriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        # Three stride-2 convs halve the resolution each time, so inputs
        # must be at least 32x32 for the final 4x4 conv to fit.
        self.net = nn.Sequential(
            spectral_norm(nn.Conv2d(img_channels, 64, 4, 2, 1)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(128, 256, 4, 2, 1)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(256, 1, 4, 1, 0)),
        )

    def forward(self, x):
        # Average the score map so each image gets one scalar, shape (B,)
        return self.net(x).view(x.size(0), -1).mean(dim=1)


# Hinge loss (common modern choice alongside spectral norm)
def hinge_loss(d_real, d_fake):
    # D is pushed to score real samples above +1 and fakes below -1
    d_loss = torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
    return d_loss


def hinge_g_loss(d_fake):
    # G simply tries to raise D's score on its fakes
    return -d_fake.mean()