Iteratively step in the direction that decreases the loss. The algorithm at the heart of training almost every modern model.
Key idea
Step in the direction that decreases the loss. Keep stepping. You're standing on a hilly landscape in fog. You can feel which way is downhill at your feet. You step that way. Repeat until the ground is flat. The "landscape" is the loss as a function of model parameters; gradient descent is the foot-feeling-its-way algorithm.
Click anywhere on the surface to start a new descent · adjust η to change step size
The surface above is a scaled Rosenbrock function — minimum at the white marker at (1, 1), inside a curved banana-shaped valley. Try clicking different starting points: the path along the valley floor is slow even at well-chosen step sizes. Crank η up: the trajectory starts oscillating or diverges. This is exactly why momentum and adaptive optimisers exist.
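The behaviour described above can be reproduced without the widget. The sketch below assumes the standard, unscaled Rosenbrock function f(x, y) = (1 − x)² + 100(y − x²)², since the exact scaling of the plotted surface is not given here. The start point, the two step sizes, and the helper names are illustrative choices, not values taken from the page.

```python
def rosenbrock(x, y):
    """Standard (unscaled) Rosenbrock function; minimum at (1, 1)."""
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2


def rosenbrock_grad(x, y):
    """Analytic gradient of the function above."""
    dx = -2 * (1 - x) - 400 * x * (y - x * x)
    dy = 200 * (y - x * x)
    return dx, dy


def descend(x, y, lr, steps):
    """Run plain gradient descent; return (x, y, diverged)."""
    for _ in range(steps):
        dx, dy = rosenbrock_grad(x, y)
        x, y = x - lr * dx, y - lr * dy
        # Stop early once the iterate has clearly left the region of interest.
        if not (abs(x) < 1e6 and abs(y) < 1e6):
            return x, y, True
    return x, y, False


start = (-1.0, 1.0)
small = descend(*start, lr=1e-3, steps=5000)  # stable, but slow along the valley floor
large = descend(*start, lr=1e-2, steps=5000)  # too large for the steep valley walls

print("lr=1e-3 ends at", small[:2], "loss", rosenbrock(small[0], small[1]))
print("lr=1e-2 diverged:", large[2])
```

The early-exit guard is only there so the diverging run stops cleanly instead of overflowing.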
Every weight in a neural network — billions of them in modern models — is tuned by gradient descent. So is logistic regression's β, linear regression's β (it has a closed form too, but GD also works), SVM's w via SGD, and almost every other "trainable" model.
The recipe is short: compute the gradient (how loss changes if you nudge each parameter), subtract a small fraction of it from each parameter, repeat. The "small fraction" is the learning rate — too small and you'll be there forever; too large and you'll bounce around or diverge.
Reach for it when
Training any neural network (no real alternative)
Large datasets where closed-form solutions are too expensive
Streaming / online learning
Any model where the loss is differentiable
Skip it when
The problem has a closed-form solution AND the data fits (e.g. linear regression on a small dataset)
Loss is not differentiable (use coordinate descent, EM, or specialized methods)
You need a global guarantee: on non-convex problems GD is only guaranteed to reach a stationary point (a local minimum or a saddle), not the global minimum
# Tiny gradient descent for f(x) = (x - 3)^2. Minimum at x = 3.
x = 0.0
lr = 0.1
for step in range(50):
    grad = 2 * (x - 3)  # derivative of (x - 3)^2
    x -= lr * grad      # step against the gradient
print(x)  # → ~3.0
Batch vs SGD vs mini-batch. Full-batch GD computes the gradient over the entire dataset every step: exact but expensive. Stochastic gradient descent (SGD) uses just one example per step: noisy but cheap. Mini-batch SGD uses ~32–256 examples per step and is the practical default. The sampling noise is widely thought to help SGD generalise, because it can knock the iterate out of sharp minima that hurt held-out performance.
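The three variants differ only in how many examples feed each gradient estimate. A minimal sketch, assuming a synthetic 1-D regression problem (y ≈ 2x + 1 plus noise) and a hypothetical `fit` helper; the dataset, learning rate, and epoch count are illustrative, not recommendations.

```python
import random


def make_data(n=512, seed=0):
    """Synthetic data: y = 2x + 1 plus a little Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


def fit(xs, ys, batch_size, lr=0.1, epochs=100, seed=0):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new sampling order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            gw = gb = 0.0
            for i in batch:
                err = w * xs[i] + b - ys[i]
                gw += 2 * err * xs[i]
                gb += 2 * err
            # Average the gradient over the batch, then step.
            w -= lr * gw / len(batch)
            b -= lr * gb / len(batch)
    return w, b


xs, ys = make_data()
results = {
    "full-batch": fit(xs, ys, batch_size=len(xs)),  # one exact step per epoch
    "mini-batch": fit(xs, ys, batch_size=32),       # 16 noisy steps per epoch
    "sgd": fit(xs, ys, batch_size=1),               # 512 very noisy steps per epoch
}
for name, (w, b) in results.items():
    print(f"{name}: w={w:.3f}, b={b:.3f}")
```

All three should land near w = 2, b = 1; the smaller the batch, the more parameter updates per pass over the data and the more the final values jitter.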
Learning rate. The single most important hyperparameter. Too small → painfully slow convergence. Too large → divergence or oscillation. Try 1e-3 with Adam or 1e-1 with SGD as a starting point. Use a learning-rate schedule (decay, cosine, warmup) for serious training.
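Both failure modes show up even on the toy quadratic f(x) = (x − 3)². A small sketch; the `run` helper and the three learning rates are chosen purely for illustration.

```python
def run(lr, steps=50, x=0.0):
    """Gradient descent on f(x) = (x - 3)^2 with a fixed learning rate."""
    for _ in range(steps):
        x -= lr * 2 * (x - 3)
    return x


for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: x after 50 steps = {run(lr):.4g}")
```

Each step multiplies the distance to the minimum by (1 − 2·lr). With lr = 0.001 that factor is 0.998, so 50 steps barely move; with lr = 0.1 it is 0.8 and the run converges; with lr = 1.1 it is −1.2, so the iterate overshoots further on every step.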
Momentum. Accumulate an exponentially decaying sum of past gradients: v ← βv + ∇L, then θ ← θ − η·v. This "rolls through" small oscillations and accelerates along consistent directions. Classical momentum uses β ≈ 0.9, which amplifies a steady gradient by up to 1/(1 − β) = 10×.
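A minimal sketch of that update rule next to plain GD, assuming an ill-conditioned quadratic f(x, y) = ½(x² + 100y²) as a stand-in for a stretched valley. The function, start point, and step count are illustrative assumptions.

```python
def loss(x, y):
    return 0.5 * (x * x + 100.0 * y * y)


def grad(x, y):
    return x, 100.0 * y


def plain_gd(lr=0.01, steps=200):
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return loss(x, y)


def momentum_gd(lr=0.01, beta=0.9, steps=200):
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx, vy = beta * vx + gx, beta * vy + gy  # v ← βv + ∇L
        x, y = x - lr * vx, y - lr * vy          # θ ← θ − η·v
    return loss(x, y)


print("plain GD loss:", plain_gd())
print("momentum loss:", momentum_gd())
```

With the same learning rate, the momentum run makes much faster progress along the shallow x direction, where plain GD only shrinks the error by 1% per step.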
Adam. Per-parameter adaptive learning rate. Tracks exponential moving averages of both the first moment (the gradient itself, as in momentum) and the second moment (the squared gradient, an uncentred variance estimate), then divides the first by the square root of the second. Works well out of the box on a huge range of problems and is a common default choice in deep-learning codebases.
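The update can be written in a few lines. This is a from-scratch sketch of the usual Adam rule with bias correction, using the commonly quoted defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the `adam_minimise` helper and the test quadratic are assumptions for illustration, not a framework's implementation.

```python
import math


def adam_minimise(grad_fn, theta, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Minimise a function given its gradient, using the Adam update."""
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment estimate (moving average of g)
    v = [0.0] * len(theta)  # second-moment estimate (moving average of g^2)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1 - beta1 ** t)  # correct the bias from zero init
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Same stretched quadratic idea: gradient of 0.5 * (x^2 + 100 * y^2).
def quad_grad(theta):
    return [theta[0], 100.0 * theta[1]]


final = adam_minimise(quad_grad, [1.0, 1.0])
print(final)
```

Because each coordinate is divided by its own gradient scale, the steep y direction and the shallow x direction move at nearly the same pace.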
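AdamW. Adam + decoupled weight decay: the decay is applied directly to the weights instead of being folded into the gradient as an L2 penalty. Usually a better choice than Adam with L2 regularisation, and the standard optimiser in most modern LLM training recipes. Reach for AdamW unless you have a specific reason not to.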
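What "decoupled" changes is easiest to see on a single step with zero data gradient. A sketch for one scalar parameter; the `adam_step` helper and the numbers are hypothetical, chosen to make the difference obvious.

```python
import math


def adam_step(theta, grad, m, v, t, lr, weight_decay, decoupled,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with either L2-in-gradient or decoupled weight decay."""
    decay = lr * weight_decay * theta if decoupled else 0.0
    if not decoupled:
        grad = grad + weight_decay * theta  # L2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled decay is subtracted outside the adaptive normaliser.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - decay
    return theta, m, v


# One step from theta = 1 with no data gradient at all.
l2_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                           weight_decay=0.1, decoupled=False)
adamw_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                              weight_decay=0.1, decoupled=True)
print("Adam + L2:", l2_theta)
print("AdamW:    ", adamw_theta)
```

In the L2 variant the penalty passes through the normaliser, so the weight moves by roughly a full lr regardless of how small the decay coefficient is. In the decoupled variant the weight shrinks by exactly lr · weight_decay · θ.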
Reach for it when
Training any deep model (AdamW with a warmup-cosine schedule is the modern default)
Online or streaming learning with very large data
Fine-tuning pretrained models
You can afford hyperparameter tuning on at least the learning rate
Skip it when
Closed form exists and is cheap to compute (OLS regression, ridge regression with small p)
Non-differentiable losses (use coordinate descent or specialised solvers)
Tiny problems where Newton's method converges in <10 iterations
You need provable global optimality on non-convex problems
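import torch
from torch.optim import SGD, Adam, AdamW  # SGD and Adam are drop-in alternatives
from torch.optim.lr_scheduler import CosineAnnealingLR

model = ...  # your network
# loader, criterion, and epochs are placeholders assumed to be defined elsewhere
opt = AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = CosineAnnealingLR(opt, T_max=epochs)  # one cosine cycle across all epochs

for epoch in range(epochs):
    for xb, yb in loader:
        loss = criterion(model(xb), yb)
        opt.zero_grad()  # clear gradients left over from the previous step
        loss.backward()  # autodiff fills in .grad on each parameter
        opt.step()       # θ ← θ − η · update_rule(grad)
    scheduler.step()     # advance the schedule once per epoch, matching T_max=epochs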
Want convergence rates, conditioning, and second-order methods?
L: Lipschitz constant of the gradient; bounds curvature
Convergence rate is O(1/T) — slow, but better than nothing on non-strongly-convex problems
Convex landscapes. If the loss is convex and L-smooth, vanilla GD with step size η ≤ 1/L converges as O(1/T); strongly convex losses converge as O(c^T) for some c < 1. Nesterov's accelerated gradient achieves O(1/T²) on convex losses, which is optimal among first-order methods. Practical SGD doesn't quite hit these rates because of noise, but tracks them down to a noise floor.
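For reference, the rates above correspond to the following textbook-style bounds. The notation is an assumption of this sketch: step size η = 1/L, θ* a minimiser, and μ the strong-convexity constant. The constant in the accelerated bound depends on the exact variant of Nesterov's method.

```latex
% Assumes step size \eta = 1/L, minimiser \theta^*, strong-convexity constant \mu
\begin{align*}
\text{convex, } L\text{-smooth (GD):} \quad
  & f(\theta_T) - f(\theta^*) \le \frac{L\,\lVert \theta_0 - \theta^* \rVert^2}{2T} \\
\mu\text{-strongly convex (GD):} \quad
  & \lVert \theta_T - \theta^* \rVert^2 \le \left(1 - \frac{\mu}{L}\right)^{T} \lVert \theta_0 - \theta^* \rVert^2 \\
\text{convex, } L\text{-smooth (Nesterov):} \quad
  & f(\theta_T) - f(\theta^*) \le \frac{2L\,\lVert \theta_0 - \theta^* \rVert^2}{(T+1)^2}
\end{align*}
```

In the strongly convex case the contraction factor is c = 1 − μ/L, which ties the rate directly to the ratio L/μ.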
Conditioning. The hardness of GD depends on the condition number κ = λmax/λmin of the Hessian — the ratio of biggest to smallest curvature. High κ = stretched valley = SGD zigzags. Mitigations: preconditioning (Adam approximates this), batch normalisation (keeps activations on similar scales), proper initialisation.
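The dependence on κ can be measured directly. A minimal sketch, assuming a diagonal quadratic f(x, y) = ½(κx² + y²) whose Hessian has eigenvalues κ and 1, and the step size 1/λmax; the `steps_to_converge` helper and tolerance are illustrative.

```python
def steps_to_converge(kappa, tol=1e-6, max_steps=1_000_000):
    """GD on f(x, y) = 0.5 * (kappa * x^2 + y^2) with lr = 1 / lambda_max."""
    lam_max, lam_min = float(kappa), 1.0
    lr = 1.0 / lam_max
    x, y = 1.0, 1.0
    for step in range(1, max_steps + 1):
        x -= lr * lam_max * x  # steep direction: essentially solved in one step
        y -= lr * lam_min * y  # shallow direction: shrinks by (1 - 1/kappa) per step
        if abs(x) < tol and abs(y) < tol:
            return step
    return max_steps


for kappa in (1, 10, 100, 1000):
    print(f"kappa={kappa}: {steps_to_converge(kappa)} steps")
```

The step size is capped by the steepest direction, so the shallow direction only contracts by (1 − 1/κ) per step and the step count grows roughly in proportion to κ.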
Non-convex. Neural network loss landscapes are non-convex with many local minima and saddle points. Surprisingly, SGD reliably finds good minima in practice. Two commonly cited explanations: (1) overparameterised networks have many global minima; (2) SGD's noise appears to favour flat minima, which are argued to generalise better than sharp ones (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017).
Second-order methods. Newton's method uses the Hessian: θ ← θ − H⁻¹∇L. It converges in fewer iterations, but each iteration costs O(p³) for p parameters, which is infeasible for deep networks. Approximations: L-BFGS (limited-memory quasi-Newton, good for ≤ millions of params), K-FAC (Kronecker-factored Fisher), Shampoo (block-diagonal preconditioner). These rarely beat AdamW in practice at scale, but are seeing some renewed interest.
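The iteration-count gap is visible even in one dimension, where the Hessian is just the second derivative and "inverting" it is a division. A sketch on the convex function f(x) = x² + eˣ, chosen here only as an example; the helper names, tolerance, and learning rate are assumptions.

```python
import math


def f_prime(x):
    """First derivative of f(x) = x^2 + exp(x)."""
    return 2 * x + math.exp(x)


def f_second(x):
    """Second derivative (the 1-D Hessian)."""
    return 2 + math.exp(x)


def newton(x=0.0, tol=1e-10, max_iter=50):
    for i in range(1, max_iter + 1):
        x -= f_prime(x) / f_second(x)  # θ ← θ − H⁻¹∇L in one dimension
        if abs(f_prime(x)) < tol:
            return x, i
    return x, max_iter


def gradient_descent(x=0.0, lr=0.1, tol=1e-10, max_iter=10_000):
    for i in range(1, max_iter + 1):
        x -= lr * f_prime(x)
        if abs(f_prime(x)) < tol:
            return x, i
    return x, max_iter


x_newton, n_newton = newton()
x_gd, n_gd = gradient_descent()
print(f"Newton: x={x_newton:.8f} in {n_newton} iterations")
print(f"GD:     x={x_gd:.8f} in {n_gd} iterations")
```

In p dimensions the division becomes a linear solve against a p × p Hessian, which is where the cubic cost comes from.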
Schedules. Warmup → cosine decay is the modern default for large-batch training. Linear warmup of η for ~1000 steps avoids early divergence on large models; cosine decay smoothly anneals toward zero. Cyclic LR (Smith, 2017) and 1cycle schedules also work well in many regimes.
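A warmup-then-cosine schedule is a short function of the step index. The sketch below is one common formulation, not a specific library's scheduler; the function name, the `min_lr` floor, and the example values (base rate 3e-4, 1000 warmup steps, 10,000 total steps) are illustrative assumptions.

```python
import math


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past the end
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 500, 999, 1000, 5500, 10000):
    print(step, warmup_cosine_lr(step, 3e-4, 1000, 10000))
```

In a training loop the returned value would be written into the optimiser's learning rate before each step.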
Per-layer LR. Different parameter groups often want different learning rates — biases vs. weights, transformer attention vs. MLP, classifier head vs. backbone during fine-tuning. PyTorch makes this trivial with parameter groups; the gains can be substantial.
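In PyTorch, parameter groups are passed to the optimiser as a list of dicts, each with its own overrides. A minimal sketch for the fine-tuning case; the two-part `model` is a hypothetical stand-in for a real backbone and head, and the learning rates are placeholders rather than tuned values.

```python
import torch
from torch import nn
from torch.optim import AdamW

# Hypothetical two-part model: a "backbone" and a classifier "head".
model = nn.ModuleDict({
    "backbone": nn.Linear(16, 8),
    "head": nn.Linear(8, 2),
})

opt = AdamW(
    [
        # Pretrained layers: override the default with a much smaller rate.
        {"params": model["backbone"].parameters(), "lr": 1e-5},
        # Fresh head: no override, so it uses the default lr below.
        {"params": model["head"].parameters()},
    ],
    lr=1e-3,
    weight_decay=1e-2,
)

for group in opt.param_groups:
    print(group["lr"], len(group["params"]))
```

Any option not set inside a group falls back to the keyword default, so `weight_decay` here applies to both groups.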
Reach for it when
Large-batch training — proper warmup-cosine schedules are essential
Distributed training — synchronous SGD with LARS / LAMB for very large batches
Long-context LLM training — gradient clipping + cosine decay is the recipe
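Gradient clipping, mentioned in the last item, usually means rescaling the whole gradient when its global norm exceeds a threshold. A plain-Python sketch of that idea on a flat list of gradient values; the helper name is hypothetical, and in PyTorch the equivalent job is done by `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads), total
    scale = max_norm / total
    return [g * scale for g in grads], total


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm)

untouched, small_norm = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(untouched, small_norm)
```

Clipping preserves the direction of the gradient and only limits its length, so one unusually large batch gradient cannot throw the weights far off course.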