Key idea

Step in the direction that decreases the loss. Keep stepping. You're standing on a hilly landscape in fog. You can feel which way is downhill at your feet. You step that way. Repeat until the ground is flat. The "landscape" is the loss as a function of model parameters; gradient descent is the foot-feeling-its-way algorithm.

Click anywhere on the surface to start a new descent · adjust η to change step size

The surface above is a scaled Rosenbrock function — minimum at the white marker at (1, 1), inside a curved banana-shaped valley. Try clicking different starting points: the path along the valley floor is slow even at well-chosen step sizes. Crank η up: the trajectory starts oscillating or diverges. This is exactly why momentum and adaptive optimisers exist.
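
To see the same behaviour without the widget, here is a minimal sketch. It assumes the standard (unscaled) Rosenbrock function, since the scaling used for the surface above isn't stated; the helper names `rosenbrock_grad` and `descend` and the two step sizes are illustrative choices, not taken from the demo.

```python
def rosenbrock(x, y):
    """Standard Rosenbrock function; minimum of 0 at (1, 1)."""
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2

def rosenbrock_grad(x, y):
    """Analytic gradient of the function above."""
    dx = -2 * (1 - x) - 400 * x * (y - x * x)
    dy = 200 * (y - x * x)
    return dx, dy

def descend(start, lr, steps, limit=1e6):
    """Plain gradient descent; returns (final_point, diverged_flag)."""
    x, y = start
    for _ in range(steps):
        if abs(x) > limit or abs(y) > limit:
            return (x, y), True          # trajectory has blown up
        gx, gy = rosenbrock_grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return (x, y), False

start = (-1.0, 1.0)
for lr in (1e-3, 5e-3):
    point, diverged = descend(start, lr, steps=1000)
    status = "diverged" if diverged else f"loss = {rosenbrock(*point):.4f}"
    print(f"lr = {lr}: {status}")
```

With the smaller step size the loss falls but the point is still far from (1, 1) after 1000 steps; with the larger one the iterates leave the valley within a handful of steps.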

Every weight in a neural network — billions of them in modern models — is tuned by gradient descent. So is logistic regression's β, linear regression's β (it has a closed form too, but GD also works), SVM's w via SGD, and almost every other "trainable" model.

The recipe is short: compute the gradient (how loss changes if you nudge each parameter), subtract a small fraction of it from each parameter, repeat. The "small fraction" is the learning rate — too small and you'll be there forever; too large and you'll bounce around or diverge.

Reach for it when

  • Training any neural network (no real alternative)
  • Large datasets where closed-form solutions are too expensive
  • Streaming / online learning
  • Any model where the loss is differentiable

Skip it when

  • The problem has a closed-form solution AND the data fits (e.g. linear regression on a small dataset)
  • Loss is not differentiable (use coordinate descent, EM, or specialized methods)
  • You need a global guarantee: on non-convex problems GD can only be expected to reach a local minimum (or stall near a saddle point), not the global one
  • Discrete optimization (combinatorial, integer programming)
# Tiny gradient descent for f(x) = (x - 3)^2.  Minimum at x = 3.
x  = 0.0
lr = 0.1
for step in range(50):
    grad = 2 * (x - 3)
    x   -= lr * grad
print(x)   # → ~3.0
Want the formula and the optimizer zoo?
The update rule $$ \boldsymbol{\theta}_{t+1} \;=\; \boldsymbol{\theta}_t \;-\; \eta \,\nabla_{\!\boldsymbol{\theta}}\, \mathcal{L}(\boldsymbol{\theta}_t) $$
  • θ: parameters being optimised
  • η: learning rate (step size)
  • ∇L: gradient of the loss w.r.t. parameters

Batch vs SGD vs mini-batch. Full-batch GD computes the gradient over the entire dataset every step: exact but expensive. Stochastic gradient descent (SGD) uses a single example per step: noisy but cheap. Mini-batch SGD uses a small batch, typically ~32–256 examples, and is the practical default. The sampling noise is widely thought to help generalisation: it can knock the iterate out of sharp minima that hurt held-out performance.
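
The three variants differ only in how many examples feed each gradient estimate. The sketch below fits a one-parameter linear model to synthetic data; the dataset, the `train` helper, the learning rate and the batch size of 32 are all assumptions made for illustration.

```python
import random

rng = random.Random(0)
# Synthetic data: y = 2x + small Gaussian noise, so the true slope is 2.0
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in xs]

def grad(w, batch):
    """Gradient of the mean squared error (w*x - y)^2 over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(batch_size, epochs=100, lr=0.1, seed=1):
    """Run `epochs` passes over the data; returns (w, number_of_updates)."""
    shuffler = random.Random(seed)
    points = list(data)
    w, steps = 0.0, 0
    for _ in range(epochs):
        shuffler.shuffle(points)
        for i in range(0, len(points), batch_size):
            w -= lr * grad(w, points[i:i + batch_size])
            steps += 1
    return w, steps

variants = [("full-batch", len(data)), ("sgd", 1), ("mini-batch", 32)]
results = {name: train(bs) for name, bs in variants}
for name, (w, steps) in results.items():
    print(f"{name:>10}: w = {w:.3f} after {steps} updates")
```

All three land near the true slope. For the same number of passes over the data, full-batch GD makes one update per epoch while SGD makes one per example, which is the cost trade-off described above.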

Learning rate. Usually the single most important hyperparameter. Too small → painfully slow convergence. Too large → oscillation or divergence. Reasonable starting points are 1e-3 with Adam and 1e-1 with SGD. Use a learning-rate schedule (decay, cosine, warmup) for serious training.
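
The three regimes are easy to reproduce on f(x) = (x − 3)², where each step multiplies the error x − 3 by (1 − 2η). The specific rates below are illustrative picks, not recommendations.

```python
def run(lr, steps=50, x=0.0):
    """Gradient descent on f(x) = (x - 3)^2 starting from x."""
    for _ in range(steps):
        x -= lr * 2 * (x - 3)
    return x

# 0.001: error shrinks by 0.998 per step  -> barely moves
# 0.1:   error shrinks by 0.8 per step    -> converges quickly
# 0.95:  error factor is -0.9             -> overshoots each step but still converges
# 1.1:   error factor is -1.2             -> each overshoot is larger; diverges
for lr in (0.001, 0.1, 0.95, 1.1):
    print(f"lr = {lr:<5}  x after 50 steps = {run(lr):.4g}")
```

On this function anything above η = 1 diverges, because the error factor |1 − 2η| exceeds 1.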

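Momentum. Accumulate an exponentially decaying sum of past gradients: v ← βv + ∇L, then θ ← θ − η·v. This damps oscillations across steep directions and accelerates along directions where the gradient is consistent. A typical setting for classical momentum is β ≈ 0.9, which amplifies the effective step along a consistent direction by up to 1/(1 − β) = 10×.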
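
A small sketch of that update on a stretched quadratic bowl, 0.5·(x² + 100y²). The test function, start point and step count are assumptions chosen to make the effect visible; β = 0 recovers plain gradient descent.

```python
def loss(x, y):
    return 0.5 * (x * x + 100.0 * y * y)

def grad(x, y):
    return x, 100.0 * y

def run(beta, lr=0.01, steps=200, start=(10.0, 1.0)):
    """Momentum update: v <- beta*v + grad, theta <- theta - lr*v."""
    x, y = start
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx, vy = beta * vx + gx, beta * vy + gy
        x, y = x - lr * vx, y - lr * vy
    return loss(x, y)

plain = run(beta=0.0)   # vanilla GD
heavy = run(beta=0.9)   # classical momentum
print(f"loss after 200 steps: plain GD = {plain:.3g}, momentum = {heavy:.3g}")
```

The learning rate is capped by the steep y direction, so plain GD crawls along the shallow x direction; momentum builds up speed there and ends several orders of magnitude lower in loss.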

Adam. Per-parameter adaptive learning rate. Tracks exponential moving averages of the gradient (first moment, i.e. momentum) and of the squared gradient (second moment, an uncentred variance), bias-corrects both, then divides the first by the square root of the second. Works well out of the box on a huge range of problems and is the default choice in most deep-learning codebases.
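
Written out by hand, the update looks roughly like this. It is a from-scratch sketch of the standard Adam rule with the usual default β₁, β₂ and ε; the `adam_step` helper, the `state` dictionary layout and the quadratic test function are illustrative, not any framework's API.

```python
import math

def adam_step(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over a list of scalar parameters."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g       # first moment
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g   # second moment
        m_hat = state["m"][i] / (1 - beta1 ** t)                      # bias correction
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

def loss(params):
    x, y = params
    return 0.5 * (x * x + 100.0 * y * y)

def grad(params):
    x, y = params
    return [x, 100.0 * y]

params = [10.0, 1.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
first = adam_step(params, grad(params), state)
print("after 1 step:", first)

final = first
for _ in range(499):
    final = adam_step(final, grad(final), state)
print("loss after 500 steps:", loss(final))
```

The first step moves both coordinates by almost exactly `lr`, even though one gradient is 10 and the other 100: the normalisation removes the gradient's scale, which is the per-parameter adaptivity described above.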

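AdamW. Adam + decoupled weight decay: the decay is applied directly to the weights instead of being added to the gradient, so it is not rescaled by Adam's normaliser. It usually matches or beats Adam with L2 regularisation and is the standard choice for training modern LLMs. Reach for AdamW unless you have a specific reason not to.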
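
What "decoupled" means is easiest to see on one scalar weight with the loss gradient set to zero, so that only the decay term acts. This is a hand-rolled sketch; `adam_update` and its `decoupled` flag are hypothetical names, and the hyperparameters are arbitrary.

```python
import math

def adam_update(w, loss_grad, m, v, t, lr=0.1, wd=0.01, decoupled=True,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step with weight decay applied in one of two ways."""
    if decoupled:
        w = w * (1 - lr * wd)      # AdamW: shrink the weight directly
        g = loss_grad
    else:
        g = loss_grad + wd * w     # Adam + L2: decay enters through the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w_l2, _, _ = adam_update(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
w_adamw, _, _ = adam_update(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
print(f"Adam + L2: w = {w_l2:.4f}   AdamW: w = {w_adamw:.4f}")
```

With L2 folded into the gradient, Adam's normaliser turns a tiny decay gradient into a full-size step of about `lr`; with decoupled decay the weight shrinks by the intended factor (1 − lr·wd).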

Reach for it when

  • Training any deep model (AdamW with a warmup-cosine schedule is the modern default)
  • Online or streaming learning with very large data
  • Fine-tuning pretrained models
  • You can afford hyperparameter tuning on at least the learning rate

Skip it when

  • Closed form exists and is cheap to compute (OLS regression, ridge regression with small p)
  • Non-differentiable losses (use coordinate descent or specialised solvers)
  • Tiny problems where Newton's method converges in <10 iterations
  • You need provable global optimality on non-convex problems
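import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins so the loop runs end to end; swap in your own model, data and loss.
model = nn.Linear(10, 1)
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
criterion = nn.MSELoss()
epochs = 5

opt = AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = CosineAnnealingLR(opt, T_max=epochs)

for epoch in range(epochs):
    for xb, yb in loader:
        opt.zero_grad()           # clear gradients left over from the previous step
        loss = criterion(model(xb), yb)
        loss.backward()           # autodiff fills in .grad on each parameter
        opt.step()                # θ ← θ − η · update_rule(grad)
    scheduler.step()              # once per epoch, matching T_max=epochs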
Want convergence rates, conditioning, and second-order methods?
Convergence (convex, L-smooth) $$ \mathcal{L}(\boldsymbol{\theta}_T) - \mathcal{L}(\boldsymbol{\theta}^*) \;\le\; \frac{\|\boldsymbol{\theta}_0 - \boldsymbol{\theta}^*\|^2}{2\eta T} \quad \text{for}\; \eta \le 1/L $$
  • L: Lipschitz constant of the gradient; bounds curvature
  • Convergence rate is O(1/T) — slow, but better than nothing on non-strongly-convex problems

Convex landscapes. If the loss is convex and L-smooth, vanilla GD converges as O(1/T); strongly convex losses converge as O(c^T) for some c < 1. Nesterov's accelerated gradient achieves O(1/T²) on convex losses, which is optimal among first-order methods. Practical SGD doesn't quite hit these rates because of noise but tracks them up to a noise floor.
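
Both rates can be checked numerically. The sketch below assumes a simple quadratic, 0.5·(x² + 10y²), which is 10-smooth and also strongly convex, so the O(1/T) bound must hold and the actual gap should shrink geometrically; the start point and horizon are arbitrary.

```python
def loss(x, y):
    return 0.5 * (x * x + 10.0 * y * y)   # minimum value 0 at the origin

L = 10.0                 # largest curvature = smoothness constant
eta = 1.0 / L            # largest step size the bound allows
x0, y0 = 5.0, 5.0
dist_sq = x0 * x0 + y0 * y0               # ||theta_0 - theta*||^2

x, y = x0, y0
gaps, bounds = [], []
for T in range(1, 101):
    x, y = x - eta * x, y - eta * 10.0 * y    # one GD step
    gaps.append(loss(x, y))                   # L(theta_T) - L(theta*)
    bounds.append(dist_sq / (2 * eta * T))    # right-hand side of the bound

for T in (1, 10, 100):
    print(f"T = {T:>3}: gap = {gaps[T - 1]:.3e}  bound = {bounds[T - 1]:.3e}")
```

The gap sits far below the bound because this loss is strongly convex: the O(1/T) guarantee is a worst case over all convex L-smooth functions.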

Conditioning. The hardness of GD depends on the condition number κ = λmax / λmin of the Hessian: the ratio of largest to smallest curvature. High κ = stretched valley = GD zigzags across it while crawling along it. Mitigations: preconditioning (Adam approximates this with a diagonal rescaling), batch normalisation (keeps activations on similar scales), proper initialisation.
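
To get a feel for how κ drives the iteration count, this sketch runs GD on 0.5·(x² + κy²) with step size 1/κ (the safe choice 1/L) and counts steps to a fixed tolerance. The function, tolerance and helper name are assumptions for illustration.

```python
def steps_to_converge(kappa, tol=1e-6, max_steps=1_000_000):
    """GD steps until 0.5*(x^2 + kappa*y^2) drops below tol, starting at (1, 1)."""
    x, y = 1.0, 1.0
    lr = 1.0 / kappa                     # step size set by the steepest direction
    for step in range(1, max_steps + 1):
        x, y = x - lr * x, y - lr * kappa * y
        if 0.5 * (x * x + kappa * y * y) < tol:
            return step
    return max_steps

counts = {kappa: steps_to_converge(kappa) for kappa in (1.0, 10.0, 100.0, 1000.0)}
for kappa, n in counts.items():
    print(f"kappa = {kappa:>6}: {n} steps")
```

The steep direction is solved almost immediately; the step count is dominated by the shallow direction, whose error shrinks by only (1 − 1/κ) per step, so the count grows roughly in proportion to κ.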

Non-convex. Neural network loss landscapes are non-convex with many local minima and saddle points. Surprisingly, SGD reliably finds good minima in practice. Two commonly cited explanations: (1) overparameterised networks have many global minima; (2) SGD's noise biases it toward flat minima, which tend to generalise better than sharp ones (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017).

Second-order methods. Newton's method uses the Hessian: θ ← θ − H⁻¹∇L. Converges in fewer iterations, but each iteration costs O(p³) time and O(p²) memory for p parameters, which is infeasible for deep networks. Approximations: L-BFGS (limited-memory quasi-Newton, good for ≤ millions of params), K-FAC (Kronecker-factored Fisher), Shampoo (block-diagonal preconditioner). Rarely beat AdamW in practice at scale, but seeing some renewed interest.
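
On a quadratic loss the Newton update lands on the minimum in a single step, which makes the contrast with GD concrete. The 2×2 Hessian, the vector `b` and the start point below are made-up values; the loss is 0.5·θᵀHθ − bᵀθ.

```python
H = ((3.0, 1.0), (1.0, 2.0))    # constant Hessian of the quadratic
b = (1.0, 1.0)

def grad(theta):
    """Gradient H @ theta - b."""
    return (H[0][0] * theta[0] + H[0][1] * theta[1] - b[0],
            H[1][0] * theta[0] + H[1][1] * theta[1] - b[1])

def newton_step(theta):
    """theta - H^{-1} grad, using the closed-form inverse of a 2x2 matrix."""
    g = grad(theta)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    dx = (H[1][1] * g[0] - H[0][1] * g[1]) / det
    dy = (-H[1][0] * g[0] + H[0][0] * g[1]) / det
    return (theta[0] - dx, theta[1] - dy)

def gd_step(theta, lr=0.1):
    g = grad(theta)
    return (theta[0] - lr * g[0], theta[1] - lr * g[1])

start = (4.0, -3.0)
after_newton = newton_step(start)
after_gd = gd_step(start)
print("Newton, 1 step:", after_newton, "grad =", grad(after_newton))
print("GD,     1 step:", after_gd, "grad =", grad(after_gd))
```

The one-step result only holds because the loss is exactly quadratic; on a general loss Newton's method still needs several iterations, and forming and inverting H is the part that does not scale.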

Schedules. Warmup → cosine decay is the modern default for large-batch training. Linear warmup of η for ~1000 steps avoids early divergence on large models; cosine decay smoothly anneals toward zero. Cyclic LR (Smith, 2017) and 1cycle schedules also work well in many regimes.

Per-layer LR. Different parameter groups often want different learning rates — biases vs. weights, transformer attention vs. MLP, classifier head vs. backbone during fine-tuning. PyTorch makes this trivial with parameter groups; the gains can be substantial.

Reach for it when

  • Large-batch training — proper warmup-cosine schedules are essential
  • Distributed training — synchronous SGD with LARS / LAMB for very large batches
  • Long-context LLM training — gradient clipping + cosine decay is the recipe
  • Fine-tuning — per-parameter-group learning rates matter

Skip it when

  • Problem is small enough for closed-form or quasi-Newton
  • The objective contains a non-differentiable operator and you need exact gradients through it, not a surrogate or subgradient
  • Reinforcement learning with very high-variance gradient estimates — use natural gradient or TRPO-style methods
  • Combinatorial / discrete optimisation
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Assumes `model` (with .backbone and .classifier submodules), `loader`
# and `criterion` are already defined.

# Per-parameter-group setup: lower LR for backbone, higher for classifier head
opt = AdamW([
    {"params": model.backbone.parameters(),   "lr": 1e-5},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], weight_decay=1e-4)

# Warmup → cosine decay; returns a multiplier applied to each group's base LR
def lr_lambda(step, warmup=1000, total=100_000):
    if step < warmup:
        return step / warmup
    # Clamp progress at 1 so the LR stays at zero instead of rising again past `total`
    p = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * (1 + math.cos(math.pi * p))

scheduler = LambdaLR(opt, lr_lambda)

for step, (xb, yb) in enumerate(loader):
    opt.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm
    opt.step()
    scheduler.step()   # stepped per optimisation step, not per epoch