Distributed debugging. First rule: reproduce on 1 GPU. Roughly 80% of "distributed bugs" are single-GPU bugs that happen to surface in the distributed setting. When the bug genuinely needs > 1 GPU: log per-rank metrics, find the first step and rank where a value diverges, then bisect on what is different for that rank (data shard, seed, code path).
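Finding the diverging rank can be mechanical. A minimal sketch, assuming each rank has logged a metric that should agree across ranks (for example the parameter norm after each optimizer step). The function name, the tolerance, and the log values are hypothetical.

```python
import statistics
from typing import Dict, List, Optional, Tuple


def first_divergence(
    per_rank: Dict[int, List[float]], rel_tol: float = 1e-3
) -> Optional[Tuple[int, int]]:
    """Return (step, rank) of the first value that strays from the per-step median.

    The median only singles out an outlier when there are at least 3 ranks.
    """
    num_steps = min(len(values) for values in per_rank.values())
    for step in range(num_steps):
        row = {rank: values[step] for rank, values in per_rank.items()}
        median = statistics.median(row.values())
        scale = max(abs(median), 1e-12)  # avoid dividing by zero
        for rank in sorted(row):
            if abs(row[rank] - median) / scale > rel_tol:
                return step, rank
    return None


# Hypothetical logs: rank 2 drifts away at step 2.
logs = {
    0: [1.00, 0.90, 0.80],
    1: [1.00, 0.90, 0.80],
    2: [1.00, 0.90, 1.70],
    3: [1.00, 0.90, 0.80],
}
print(first_divergence(logs))
```

The step tells you when to start looking, the rank tells you where; the bisect on "what's different" starts from that pair.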
Gradient sync bugs. Symptom: all ranks converge slowly or to a worse minimum than the single-GPU run. Causes: a cloned copy of the model that never gets the DDP wrapper; ranks computing different losses (e.g., normalising by the local rather than the global batch); custom gradient hooks that skip the allreduce.
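The normalisation cause is easy to see with arithmetic. DDP averages gradients across ranks, so the effective loss is the mean of the per-rank losses; when ranks hold different numbers of unmasked tokens, that is not the same as the global mean. A small sketch with made-up per-token losses (the helper names are hypothetical):

```python
from typing import List


def mean_of_rank_means(per_rank_losses: List[List[float]]) -> float:
    """Each rank averages locally, then the ranks are averaged (DDP-style)."""
    rank_means = [sum(losses) / len(losses) for losses in per_rank_losses]
    return sum(rank_means) / len(rank_means)


def global_mean(per_rank_losses: List[List[float]]) -> float:
    """Average over every token, as a single-GPU run on the full batch would."""
    all_losses = [loss for losses in per_rank_losses for loss in losses]
    return sum(all_losses) / len(all_losses)


# Hypothetical batch: rank 0 has 4 unmasked tokens, rank 1 has only 1.
per_rank_losses = [[1.0, 1.0, 1.0, 1.0], [6.0]]

ddp_style = mean_of_rank_means(per_rank_losses)  # (1.0 + 6.0) / 2
single_gpu = global_mean(per_rank_losses)        # 10.0 / 5
print(ddp_style, single_gpu)
```

The single token on rank 1 gets four times the weight it would have on one GPU. One common remedy is to divide by the global token count rather than the local one.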
OOM forensics. torch.cuda.memory_summary() returns a readable report of the CUDA allocator's state (allocated vs. reserved memory). Memory growing over time → leak (tensors kept alive by a Python reference, often inside a logging dictionary, e.g. storing the loss tensor instead of loss.item()). Memory high but stable → no leak; use a smaller batch or more aggressive activation checkpointing.
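Growing vs. stable can be decided from a logged trace rather than by eye. A rough sketch, assuming you record allocated memory in MB every N steps (in a real run the samples could come from torch.cuda.memory_allocated()). The function names, threshold, and traces are hypothetical.

```python
from typing import Sequence


def memory_slope(samples_mb: Sequence[float]) -> float:
    """Least-squares slope of memory (MB) against sample index."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def classify(samples_mb: Sequence[float], leak_threshold_mb: float = 0.5) -> str:
    """'growing' suggests a leak; 'stable' suggests the model simply needs the memory."""
    # The threshold is an arbitrary assumption; tune it to your logging interval.
    return "growing" if memory_slope(samples_mb) > leak_threshold_mb else "stable"


leaky = [1000 + 5 * i for i in range(50)]     # climbs 5 MB per sample
steady = [4000 + (i % 3) for i in range(50)]  # high, but only jitters
print(classify(leaky), classify(steady))
```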
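Subtle correctness bugs. Detached graphs (gradient doesn't flow). Frozen layers (parameters that should train but don't). Off-by-one in masks. Softmax over the wrong dimension. Reduction over the wrong axis. The 1-batch overfit (train on a single fixed batch until the loss is near zero) catches most of these: if the model cannot memorise one batch, something in the forward or backward pass is broken.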
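A toy version of the 1-batch overfit, in plain Python instead of a real PyTorch model so it runs anywhere. It fits a two-parameter linear model to one fixed batch, once normally and once with a simulated frozen-parameter bug. The data, learning rate, and function name are illustrative assumptions.

```python
from typing import List


def overfit_one_batch(freeze_weight: bool, steps: int = 2000, lr: float = 0.05) -> float:
    """Fit y = w*x + b to a single fixed batch with gradient descent; return the final MSE."""
    xs: List[float] = [1.0, 2.0, 3.0, 4.0]
    ys: List[float] = [3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1, so zero loss is reachable
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        if not freeze_weight:  # with freeze_weight=True, w never receives an update
            w -= lr * grad_w
        b -= lr * grad_b
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


healthy_loss = overfit_one_batch(freeze_weight=False)
buggy_loss = overfit_one_batch(freeze_weight=True)
print(f"healthy: {healthy_loss:.2e}  frozen w: {buggy_loss:.2f}")
```

The healthy run drives the loss to essentially zero; the frozen run plateaus well above it. The same plateau is what a detached graph or a bad mask tends to produce on a real model.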
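Profiler + debugger. When training is slow, profile. When it's wrong, debug. Mixing the two wastes time. The PyTorch profiler (torch.profiler, with profile_memory=True to track memory as well as time) is the canonical tool for the first; it will not tell you why results are wrong. For that, use a debugger and correctness checks such as torch.autograd.set_detect_anomaly(True).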
Catastrophic-forgetting variant. Fine-tuning destroys an earlier capability. Symptom: a model that was strong on one task is bad at it after training on another. Mitigations: lower lr, fewer epochs, freeze earlier layers, EWC-style (elastic weight consolidation) regularization, which penalises moving the weights that mattered most for the old task.
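The EWC-style term is a quadratic penalty that anchors each parameter to its pre-fine-tuning value, weighted by an importance estimate (usually a diagonal Fisher information approximation). A minimal sketch of the penalty as it is commonly written; the function name and all numbers are hypothetical.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    importance: Sequence[float],
    strength: float,
) -> float:
    """(strength / 2) * sum_i importance_i * (param_i - anchor_i)^2."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, importance)
    )


anchor = [1.0, 1.0]       # weights after training on the old task
importance = [10.0, 0.5]  # assumed importance: the first weight matters far more

# Moving the unimportant weight by 1.0 is cheap...
cheap = ewc_penalty([1.0, 2.0], anchor, importance, strength=4.0)
# ...moving the important weight by the same amount is not.
costly = ewc_penalty([2.0, 1.0], anchor, importance, strength=4.0)
print(cheap, costly)
```

The penalty is added to the new task's loss, so the optimiser is steered toward changes in the weights the old task did not rely on.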
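The "git bisect" trick. If a model used to train fine and now diverges, bisect through git history to find the commit that broke it: about log2(N) training runs for N commits, which beats reading every diff. Fix the seed and use a short run that reproduces the divergence so each good/bad verdict is reliable; git bisect run can automate the loop with a script.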
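What the bisect does under the hood is a binary search over an ordered commit list. A sketch of that logic, assuming the oldest commit is known good, the newest is known bad, and the breakage persists once introduced. The commit names and the diverges() check are stand-ins for a real short training run.

```python
from typing import Callable, List, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """commits are ordered oldest to newest; commits[0] is good, commits[-1] is bad."""
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the breakage is at mid or earlier
        else:
            good = mid  # the breakage is after mid
    return commits[bad]


commits = [f"c{i}" for i in range(16)]  # hypothetical history, c0 oldest
tested: List[str] = []


def diverges(commit: str) -> bool:
    """Stand-in for 'check out this commit, train briefly, see if the loss blows up'."""
    tested.append(commit)
    return int(commit[1:]) >= 11  # pretend c11 introduced the bug


culprit = first_bad_commit(commits, diverges)
print(culprit, "found after", len(tested), "runs")
```

Sixteen commits need four runs here, not sixteen.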