Key idea

Two networks playing against each other. A generator tries to make fake samples that look real. A discriminator tries to tell real from fake. They train together — each gets better as the other improves, like a counterfeiter and a detective sharpening each other's skills.

Click Train and watch the orange (fake) particles slide toward the indigo (real) distribution as the discriminator chases them.

The orange wash in the background shows where the discriminator currently thinks "real" samples live; the indigo wash is "fake". As D learns, the boundary tightens around the real points — and each fake particle climbs the gradient of D toward higher-orange regions. Try the Two clusters preset to see mode collapse, where the generator pours all its mass into a single mode.

You give the generator a vector of random noise, and it produces an image. Initially the images are garbage. The discriminator sees real images from your dataset and fake images from the generator, and learns to tell them apart. The generator's loss is "did I fool the discriminator?" — so it updates to make better fakes. They escalate.

At equilibrium (in theory), the generator produces samples indistinguishable from real data and the discriminator can only guess 50/50. In practice training is finicky, but when it works the samples are beautiful.

Reach for it when

  • High-quality image generation with fast inference
  • Image-to-image translation (pix2pix, CycleGAN)
  • Super-resolution
  • You need a single-step generator (faster than diffusion)

Skip it when

  • You want maximum sample diversity — GANs are prone to mode collapse
  • You need likelihoods (GANs don't have them)
  • You don't want to tune training — diffusion is more forgiving
  • You want controllable generation — diffusion + guidance is easier
import torch
import torch.nn as nn

# Generator: maps a noise vector z to a flattened image with values in [-1, 1]
class G(nn.Module):
    def __init__(self, z_dim=100, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),  # tanh keeps pixels in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

# Discriminator: maps a flattened image to the probability that it is real
class D(nn.Module):
    def __init__(self, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),  # sigmoid gives a probability in (0, 1)
        )

    def forward(self, x):
        return self.net(x)
Want the minimax game and mode collapse?
The minimax game
$$ \min_G \max_D \;\; \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[\log D(\mathbf{x})] \;+\; \mathbb{E}_{\mathbf{z}\sim p_z}[\log(1 - D(G(\mathbf{z})))] $$
  • D(x): discriminator's probability that x is real
  • G(z): generator's output given noise z
  • D maximizes its accuracy; G minimizes D's accuracy on fakes
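
For a fixed generator, the inner maximization over D has a well-known closed form, D*(x) = p_data(x) / (p_data(x) + p_g(x)), where p_g is the density of generated samples. The sketch below evaluates it for 1-D Gaussians. The densities, helper names, and sample points are illustrative assumptions, not part of any library.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def optimal_discriminator(x, p_data, p_gen):
    """Closed-form best response D*(x) = p_data / (p_data + p_gen)."""
    pd, pg = p_data(x), p_gen(x)
    return pd / (pd + pg)

# Hypothetical densities: real data at N(0, 1), an early generator at N(4, 1)
p_data = lambda x: gaussian_pdf(x, 0.0, 1.0)
p_far = lambda x: gaussian_pdf(x, 4.0, 1.0)

xs = [-1.0, 0.0, 2.0, 4.0]
# Generator far from the data: D* is confident almost everywhere
early = [optimal_discriminator(x, p_data, p_far) for x in xs]
# Generator matches the data exactly: D* can only guess
matched = [optimal_discriminator(x, p_data, p_data) for x in xs]

print("early:  ", [round(v, 4) for v in early])
print("matched:", [round(v, 4) for v in matched])
```

When the two densities coincide, D* is 0.5 at every point, which is the 50/50 equilibrium the objective aims for.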

Training dynamics. Alternate one gradient step on D (maximize), then one on G (minimize). In practice the generator's gradient saturates early: while fakes are obviously fake, D(G(z)) is close to 0 and log(1 − D(G(z))) is nearly flat in D's pre-sigmoid output, so almost no gradient reaches G. The standard fix is the non-saturating loss: minimize −log D(G(z)) instead, which gives stronger gradients when D is winning.
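
To make the saturation concrete, here is a small numeric check of both generator losses as functions of D's pre-sigmoid output (its logit). The logit value of −6 and the helper names are assumptions chosen for illustration; the gradients are estimated with central differences rather than autograd so the sketch needs only the standard library.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_loss(logit):
    # Original minimax generator loss: log(1 - D(G(z)))
    return math.log(1.0 - sigmoid(logit))

def non_saturating_loss(logit):
    # Non-saturating alternative: -log D(G(z))
    return -math.log(sigmoid(logit))

def numeric_grad(f, a, eps=1e-6):
    # Central-difference estimate of df/da
    return (f(a + eps) - f(a - eps)) / (2 * eps)

# D is confident the sample is fake: D(G(z)) = sigmoid(-6), roughly 0.0025
logit = -6.0
grad_saturating = numeric_grad(saturating_loss, logit)
grad_non_saturating = numeric_grad(non_saturating_loss, logit)

print("saturating:     ", grad_saturating)
print("non-saturating: ", grad_non_saturating)
```

Analytically the two gradients are −D and −(1 − D) with respect to the logit, so when D is near 0 the first nearly vanishes while the second stays close to −1.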

Mode collapse. The generator finds a few outputs that fool the discriminator and produces only those, ignoring the rest of the data distribution. Symptom: training keeps running, but samples look nearly identical to each other. Diagnostic: track the diversity of generator outputs over training, for example by viewing samples from a fixed set of noise vectors at regular intervals.
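
One rough way to put a number on output diversity is the mean pairwise distance within a batch of generated samples. This is a sketch of one possible diagnostic, not a standard metric: the function name and the synthetic "diverse" and "collapsed" batches below are assumptions standing in for real generator outputs.

```python
import torch

def mean_pairwise_distance(samples: torch.Tensor) -> float:
    """Average Euclidean distance between all pairs in a batch of samples."""
    flat = samples.flatten(start_dim=1)   # (B, features)
    return torch.pdist(flat).mean().item()

torch.manual_seed(0)
# Stand-in for a healthy generator: every sample is different
diverse = torch.randn(64, 784)
# Stand-in for a collapsed generator: one output plus tiny jitter
collapsed = torch.randn(1, 784).repeat(64, 1) + 0.01 * torch.randn(64, 784)

diverse_score = mean_pairwise_distance(diverse)
collapsed_score = mean_pairwise_distance(collapsed)
print(f"diverse: {diverse_score:.2f}  collapsed: {collapsed_score:.2f}")
```

Logging this value every few hundred steps, computed on G's outputs for a fixed batch of noise vectors, gives a curve that drops sharply if the generator collapses. Distances in pixel space are crude, so treat it as a warning signal rather than a quality measure.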

DCGAN (Radford et al., 2015). Established the convolutional GAN recipe: strided convolutions instead of pooling, batch norm in both networks (but not in G's output or D's input), ReLU in G, LeakyReLU in D. Made stable image-GAN training reproducible.
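
A minimal generator in the spirit of that recipe might look like the sketch below. The channel widths, the 32×32 output size, and the class name are assumptions made for brevity, not the configuration from the paper.

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """DCGAN-style generator sketch: strided transposed convs, BN + ReLU, tanh output."""
    def __init__(self, z_dim=100, base=64, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # (B, z_dim, 1, 1) -> (B, base*4, 4, 4)
            nn.ConvTranspose2d(z_dim, base * 4, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base * 4), nn.ReLU(True),
            # 4x4 -> 8x8
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 2), nn.ReLU(True),
            # 8x8 -> 16x16
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base), nn.ReLU(True),
            # 16x16 -> 32x32; no batch norm on the output layer, tanh instead of ReLU
            nn.ConvTranspose2d(base, img_channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        # Reshape flat noise (B, z_dim) into a 1x1 feature map
        return self.net(z.view(z.size(0), -1, 1, 1))

gen = DCGANGenerator()
imgs = gen(torch.randn(2, 100))
print(imgs.shape)
```

Upsampling happens entirely through stride-2 transposed convolutions, with no pooling or fixed interpolation layers. The discriminator would mirror this with stride-2 convolutions and LeakyReLU.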

Conditional GAN. Concatenate a label (or a learned class embedding) to both the generator input and the discriminator input. Now G generates class-conditional samples, and D evaluates them given the class. Foundation for everything from pix2pix to BigGAN to StyleGAN.
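
The concatenation can be sketched with small MLPs and a learned class embedding. The class names `CondG` and `CondD`, the embedding size, and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CondG(nn.Module):
    """Generator that receives noise plus a learned class embedding."""
    def __init__(self, z_dim=100, n_classes=10, emb_dim=16, img_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(z_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),
        )

    def forward(self, z, y):
        # Concatenate the label embedding to the noise vector
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class CondD(nn.Module):
    """Discriminator that judges an image given its claimed class."""
    def __init__(self, n_classes=10, emb_dim=16, img_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(img_dim + emb_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, y):
        # Concatenate the label embedding to the flattened image
        return self.net(torch.cat([x, self.embed(y)], dim=1))

z = torch.randn(4, 100)
y = torch.tensor([0, 3, 3, 9])      # requested classes
cond_g, cond_d = CondG(), CondD()
fake = cond_g(z, y)
score = cond_d(fake, y)
print(fake.shape, score.shape)
```

Because D sees the label too, a realistic image of the wrong class can still be rejected, which is what pushes G to respect the condition.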

Common failure modes. Discriminator dominates (generator's gradients vanish), generator dominates (discriminator can't keep up), oscillation (both networks fight, never converge). Monitor both losses; instability often shows up as one collapsing first.

Reach for it when

  • Real-time image generation — single forward pass
  • Image-to-image translation tasks (CycleGAN, pix2pix)
  • Domain adaptation
  • StyleGAN-style controllable face / object synthesis

Skip it when

  • Mode coverage is critical — diffusion captures more diversity
  • You need text-to-image at high quality — modern diffusion wins
  • You can't afford training instability
  • You need likelihoods or controllable inference at sample time
import torch
import torch.nn.functional as F

# Standard GAN training step with the non-saturating generator loss
def train_step(real_x, G, D, g_opt, d_opt, z_dim=100):
    B = real_x.size(0)

    # --- Train D ---
    z      = torch.randn(B, z_dim, device=real_x.device)
    fake_x = G(z).detach()
    d_real = D(real_x)
    d_fake = D(fake_x)
    # BCE for both real (label=1) and fake (label=0)
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- Train G with non-saturating loss ---
    z      = torch.randn(B, z_dim, device=real_x.device)
    fake_x = G(z)
    d_fake = D(fake_x)
    # minimize -log D(G(z))   ==  BCE with target=1
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    return d_loss.item(), g_loss.item()
Want WGAN, spectral normalization, and StyleGAN?
Wasserstein GAN objective $$ \min_G \max_{D \in \mathcal{D}_{1\text{-Lip}}} \;\; \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}[D(\mathbf{x})] \;-\; \mathbb{E}_{\mathbf{z}\sim p_z}[D(G(\mathbf{z}))] $$
  • 𝒟_{1-Lip}: 1-Lipschitz functions; D must have bounded gradient
  • Approximates the Wasserstein-1 distance between real and generated distributions
  • Gradient doesn't vanish when distributions don't overlap (unlike JS-divergence)

WGAN (Arjovsky et al., 2017). With an optimal discriminator, the vanilla GAN objective reduces (up to constants) to the Jensen-Shannon divergence between the real and generated distributions. When those distributions don't overlap (early in training), JS is constant at log 2, so the generator receives zero gradient. The Wasserstein distance gives meaningful gradients everywhere. Implementing it requires keeping D 1-Lipschitz.
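
A toy case shows the difference: two point masses on a 1-D grid, one fixed and one shifted. The grid, the helper names, and the shifts are assumptions for illustration; the 1-D Wasserstein-1 distance is computed as the area between the two CDFs.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1(p, q, spacing=1.0):
    """1-D Wasserstein-1 distance: area between the two CDFs."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q) * spacing
    return total

def point_mass(index, size=10):
    return [1.0 if i == index else 0.0 for i in range(size)]

real = point_mass(0)
results = {}
for shift in (1, 3, 6):
    fake = point_mass(shift)
    results[shift] = (js_divergence(real, fake), wasserstein_1(real, fake))
    print(shift, results[shift])
```

JS stays at log 2 no matter how far the fake mass sits from the real one, so it gives no signal about which direction to move. The Wasserstein distance grows with the shift, and that slope is what the generator can follow.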

Lipschitz enforcement. Original WGAN used weight clipping (crude). WGAN-GP (Gulrajani et al., 2017) added a gradient penalty: penalize the discriminator if its gradient norm deviates from 1 on samples between real and fake. Spectral normalization (Miyato et al., 2018) bounds the largest singular value of each weight matrix — the cleanest modern approach.
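
The gradient penalty can be sketched as follows. The function name and the toy linear critic are assumptions; the critic here is any module that returns one unbounded score per sample (no sigmoid).

```python
import torch

def gradient_penalty(critic, real_x, fake_x):
    """WGAN-GP style penalty: push the critic's gradient norm toward 1
    at random points on the lines between real and fake samples."""
    B = real_x.size(0)
    # One interpolation coefficient per sample, broadcast over remaining dims
    eps = torch.rand(B, *([1] * (real_x.dim() - 1)), device=real_x.device)
    x_hat = (eps * real_x + (1 - eps) * fake_x).detach().requires_grad_(True)
    scores = critic(x_hat)
    # create_graph=True so the penalty itself can be backpropagated
    grads, = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

# Toy check with a linear critic whose gradient norm is exactly 3 everywhere
critic = torch.nn.Linear(4, 1, bias=False)
with torch.no_grad():
    critic.weight.copy_(torch.tensor([[3.0, 0.0, 0.0, 0.0]]))

torch.manual_seed(0)
real_x, fake_x = torch.randn(8, 4), torch.randn(8, 4)
gp = gradient_penalty(critic, real_x, fake_x)
print(gp.item())   # (3 - 1)^2 = 4 for this critic
```

In training, this term is added to the critic loss with a weighting coefficient (the WGAN-GP paper used 10). Pass `fake_x` detached from the generator so the penalty only updates the critic.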

Progressive growing & StyleGAN. Karras et al. trained generators that start at 4×4 and double resolution progressively. StyleGAN replaced the input vector with a learned style code injected at every layer — gives controllable, disentangled generation. StyleGAN2 fixed artifacts; StyleGAN3 made the model equivariant to translation.

BigGAN (Brock et al., 2018). Scale up — large batch sizes (2048), large models, class-conditional everything. Reached then-SOTA on ImageNet. Highlighted that GANs respond well to scale, but require careful hyperparameter tuning.

Why GANs lost to diffusion (for now). Mode coverage, sample diversity, training stability, and controllability all favour diffusion. GANs win on inference speed (one forward pass vs. many denoising steps). Recent work on consistency models and distillation tries to close the inference-speed gap — see the Diffusion page.

Evaluation. No likelihoods, so we use proxies: FID (Fréchet Inception Distance), which compares the mean and covariance of Inception features for real and generated images; IS (Inception Score), which rewards recognizable and class-diverse samples; KID (Kernel Inception Distance); and Precision/Recall computed in feature space. All have failure modes; FID is the de facto standard.
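
The Fréchet distance at the core of FID fits two Gaussians to the feature sets and compares them: ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). Below is a minimal sketch under a loud assumption: random tensors stand in for the Inception features that real FID uses, and the function name is hypothetical. The trace of the matrix square root is taken from the eigenvalues of Σ₁Σ₂.

```python
import torch

def frechet_distance(feats_a: torch.Tensor, feats_b: torch.Tensor) -> float:
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(dim=0), feats_b.mean(dim=0)
    cov_a, cov_b = torch.cov(feats_a.T), torch.cov(feats_b.T)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the product's eigenvalues
    eigvals = torch.linalg.eigvals(cov_a @ cov_b).real.clamp(min=0)
    trace_sqrt = eigvals.sqrt().sum()
    dist = (mu_a - mu_b).pow(2).sum() + torch.trace(cov_a) + torch.trace(cov_b) - 2 * trace_sqrt
    return dist.item()

torch.manual_seed(0)
# Placeholder features; a real FID pipeline would extract these from an Inception network
feats_real = torch.randn(500, 8, dtype=torch.float64)

same = frechet_distance(feats_real, feats_real)
shifted = frechet_distance(feats_real, feats_real + 3.0)
print(same, shifted)
```

Identical feature sets score about 0. Shifting every one of the 8 dimensions by 3 leaves the covariance unchanged and adds 8 × 3² = 72 from the mean term.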

Reach for it when

  • StyleGAN: face / object synthesis with explicit style control
  • CycleGAN: unpaired image-to-image translation
  • Conditional GANs: real-time generation for interactive applications
  • Inference speed matters more than absolute sample quality

Skip it when

  • Text-to-image — diffusion has won this fight decisively
  • You need diversity coverage of a complex distribution
  • You want straightforward training without convergence drama
  • You need probabilistic outputs or controllable inference
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Spectral-norm discriminator for stable adversarial training
class SNDiscriminator(nn.Module):
    def __init__(self, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Conv2d(img_channels, 64, 4, 2, 1)),  nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(64, 128, 4, 2, 1)),           nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(128, 256, 4, 2, 1)),          nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(256, 1, 4, 1, 0)),
        )

    def forward(self, x):
        return self.net(x).view(x.size(0), -1).mean(dim=1)

# Hinge loss (common modern choice alongside spectral norm)
def hinge_loss(d_real, d_fake):
    d_loss = torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
    return d_loss

def hinge_g_loss(d_fake):
    return -d_fake.mean()