Log every run — params, metrics, artefacts — and make them searchable. The single highest-leverage engineering habit in ML.
Key idea
If a run isn't logged, it didn't happen. Treat every training run as an experiment with a unique ID, all its hyperparameters captured, all its metrics streamed, all its artefacts saved. You'll thank yourself in three weeks when you're trying to find "the one where I tried dropout 0.3 with cosine schedule."
Without tracking: spreadsheets, dated folders, and the eternal regret of "which checkpoint was the good one?". With tracking: a searchable dashboard of every run, plottable side-by-side, with every config and artefact one click away.
What to log. Hyperparameters, the git commit, the dataset version, training loss and val metrics (per step), system metrics (GPU util), and final artefacts (checkpoints, predictions, plots). For deep learning, also gradient norms and learning rate.
Tools
W&B (wandb): hosted, free for individuals, nicest UI. Default in many shops.
MLflow: open-source, self-hosted. Solid for production-adjacent workflows.
Comet: similar to W&B, hosted.
TensorBoard: supported in PyTorch via torch.utils.tensorboard, fine for one-off projects.
Neptune, ClearML, Aim: niche but loyal fanbases.
What to log
All hyperparameters, automatically (Hydra integration is nice)
Train loss + every val metric, per step
The git commit hash and any uncommitted diffs
The dataset version / hash
Final checkpoints and prediction artefacts
Learning rate, gradient norms, system utilisation
import wandb
import torch

# Assumes model, scheduler, num_steps, train_step() and validate() are defined elsewhere.
run = wandb.init(
    project="my-project",
    config={
        "lr": 1e-3,
        "batch": 64,
        "model": "resnet18",
        "dataset": "cifar10",
    },
    tags=["baseline"],
)

for step in range(num_steps):
    loss = train_step()
    metrics = {"train/loss": loss.item(), "lr": scheduler.get_last_lr()[0]}
    # Only add validation metrics on the steps where they are computed
    if step % 500 == 0:
        metrics["val/acc"] = validate()
    wandb.log(metrics, step=step)

# Save the final model and upload it with the run
torch.save(model.state_dict(), "final.pt")
wandb.save("final.pt")
run.finish()
A run is a row, a metric is a column
$$ \text{Run}: (\text{config}, \text{git}, \text{data version}) \;\to\; (\text{metrics}_{t}, \text{artefacts}) $$
Every run uniquely identified
Configs flatten to columns; metrics are time series; artefacts are pointers
Compare across runs by querying the dashboard
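To make "configs flatten to columns" concrete, here is a minimal sketch that turns a nested config into dotted column names. The helper name flatten_config and the example config are hypothetical, chosen for illustration; real trackers do their own flattening and may name columns differently.

```python
from typing import Any, Dict


def flatten_config(cfg: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten a nested dict into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in cfg.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-sections, carrying the parent key as a prefix
            flat.update(flatten_config(value, prefix=name, sep=sep))
        else:
            flat[name] = value
    return flat


# Hypothetical nested config, for illustration only
config = {
    "optim": {"lr": 1e-3, "schedule": "cosine"},
    "model": {"name": "resnet18", "dropout": 0.3},
}
row = flatten_config(config)
print(row)
```

Each key in row is one column and each run contributes one row, which is what makes a query like "dropout 0.3 with cosine schedule" a simple filter.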
Logging discipline. Decide on a schema upfront and stick to it: train/loss, val/loss, val/acc, val/precision. Don't have loss in one run and training_loss in another — the dashboard can't compare them.
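One way to enforce such a schema is to check metric keys before they are logged. The sketch below assumes a "<split>/<name>" convention plus a small set of un-prefixed extras; the names ALLOWED_SPLITS, EXTRA_KEYS and check_metric_keys are hypothetical, not part of any tracker.

```python
import re

# Hypothetical schema: "<split>/<metric>" with a fixed set of splits,
# plus a few explicitly allowed keys without a split prefix.
ALLOWED_SPLITS = ("train", "val", "test")
EXTRA_KEYS = {"lr"}
KEY_PATTERN = re.compile(r"^(%s)/[a-z0-9_]+$" % "|".join(ALLOWED_SPLITS))


def check_metric_keys(metrics: dict) -> list:
    """Return the keys that do not follow the agreed naming convention."""
    return [
        key for key in metrics
        if key not in EXTRA_KEYS and not KEY_PATTERN.match(key)
    ]


good = {"train/loss": 0.42, "val/acc": 0.91, "lr": 1e-3}
bad = {"training_loss": 0.42, "val/acc": 0.91}
print(check_metric_keys(good))
print(check_metric_keys(bad))
```

Calling a check like this inside the logging wrapper, and raising on a non-empty result, catches naming drift on the first step rather than weeks later in the dashboard.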
Logging frequency. Train loss per batch is too noisy and expensive; per N batches (50–500) is fine. Val metrics per epoch (or every K steps). Gradient norms periodically — they're often the first sign of trouble.
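A simple way to get "per N batches" logging without throwing information away is to average the loss over each window and emit one point per window. This is a minimal sketch; the class name EveryN is hypothetical, and the append to a list stands in for the tracker's log call.

```python
class EveryN:
    """Accumulate a scalar each step and emit its mean every `n` steps."""

    def __init__(self, n: int):
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self._sum = 0.0
        self._count = 0

    def update(self, value: float):
        """Return the window mean when the window is full, otherwise None."""
        self._sum += value
        self._count += 1
        if self._count < self.n:
            return None
        mean = self._sum / self._count
        self._sum, self._count = 0.0, 0
        return mean


smoother = EveryN(n=4)
logged = []
losses = [4.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 1.0]
for step, loss in enumerate(losses, start=1):
    mean = smoother.update(loss)
    if mean is not None:
        # In a real loop this would be the tracker's log call
        logged.append((step, mean))
print(logged)
```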
Tagging and grouping. Add tags ("baseline", "ablation-dropout", "phase-2") to make runs filterable. Group runs by sweep ID so you can compare "all hyperparameter search trials from yesterday".
Versioning artefacts. Version models, prediction dumps, and evaluation reports, not just code. W&B's Artifacts and MLflow's Model Registry both handle this: promote a version to "staging" and then "production", keeping an explicit record of which run produced it.
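The git + diff trick. Save the commit hash AND the uncommitted diff (git diff HEAD). Now you can reconstruct the tracked code of any run, even if it was launched from a dirty working tree. Note that git diff HEAD does not cover untracked files, so new files that were never added need to be saved separately.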
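The capture step can be sketched as a small helper that shells out to git and degrades gracefully when the script is not run inside a repository. The function name git_state and the returned dict layout are assumptions made for this example.

```python
import subprocess


def git_state(repo_dir="."):
    """Return commit, tracked diff and untracked file list, or None outside a git repo."""

    def git(*args):
        result = subprocess.run(
            ["git", *args],
            cwd=repo_dir,
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            check=True,
        )
        return result.stdout

    try:
        return {
            "commit": git("rev-parse", "HEAD").strip(),
            "diff": git("diff", "HEAD"),
            # git diff HEAD skips untracked files, so list them separately
            "untracked": git("ls-files", "--others", "--exclude-standard").splitlines(),
        }
    except (OSError, subprocess.CalledProcessError):
        # git is missing, this is not a repository, or there are no commits yet
        return None


state = git_state()
print("not a git repo" if state is None else state["commit"])
```

The returned dict can be written to a file and attached to the run like any other artefact.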
Hydra + W&B. Hydra composes the YAML config into a DictConfig; convert it with OmegaConf.to_container(cfg, resolve=True) and pass the resulting dict to wandb.init(config=...). Now every Hydra override automatically shows up as a column in the dashboard.
import subprocess

import hydra
import wandb
from omegaconf import DictConfig, OmegaConf


@hydra.main(config_path="configs", config_name="train")
def main(cfg: DictConfig):
    # Capture the git state for reproducibility
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    git_diff = subprocess.check_output(["git", "diff", "HEAD"]).decode()

    run = wandb.init(
        project=cfg.project,
        # Convert the DictConfig to plain Python containers before logging
        config=OmegaConf.to_container(cfg, resolve=True),
        tags=list(cfg.get("tags", [])),
        notes=f"git: {git_sha[:8]}",
    )

    if git_diff:
        # Save the uncommitted diff alongside the run
        with open("uncommitted.diff", "w") as f:
            f.write(git_diff)
        wandb.save("uncommitted.diff")

    train(cfg)  # train() is assumed to be defined elsewhere
    run.finish()


if __name__ == "__main__":
    main()
A complete run record
$$ \text{record} = (\text{code}, \text{config}, \text{data}, \text{env}, \text{seeds}, \text{metrics}, \text{artefacts}) $$
Code: git commit + uncommitted diff
Env: lockfile or container hash
Data: dataset hash / version pointer
Seeds: random, numpy, torch, cuda
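As a rough sketch of how the non-metric parts of such a record might be assembled, the example below hashes a dataset file, records the interpreter and platform, and seeds Python's random module. It uses only the standard library; the function names file_sha256 and make_record and the record layout are hypothetical, and the code (git) component is left out here for brevity.

```python
import hashlib
import json
import platform
import random
import sys
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_record(config, data_path, seed):
    """Assemble config, data, env and seed information as a JSON-serialisable dict."""
    # numpy / torch / cuda seeds would also be set here if those libraries are in use
    random.seed(seed)
    return {
        "config": config,
        "data": {"path": str(data_path), "sha256": file_sha256(data_path)},
        "env": {"python": sys.version.split()[0], "platform": platform.platform()},
        "seeds": {"random": seed},
    }


# Tiny stand-in dataset so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"
    data_file.write_bytes(b"x,y\n1,2\n")
    record = make_record({"lr": 1e-3}, data_file, seed=0)

print(json.dumps(record, indent=2))
```

A Python version string is a weak stand-in for the environment; a lockfile or container hash, as listed above, pins it far more tightly.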
Offline runs. When training on a cluster without internet, log to local files first (WANDB_MODE=offline or MLflow's local backend) and sync after. On fully isolated networks, a self-hosted tracking server is the usual alternative.
Distributed logging. Only rank 0 should log scalar metrics. Otherwise every rank writes them, and you end up with N duplicate runs or N copies of every point. Rank 0 broadcasts the run ID; other ranks can save per-rank artefacts (gradient histograms) under separate keys.
Team workflows. Shared projects with role-based access (W&B Teams, MLflow's auth). Naming conventions (project_phase_owner). Tagging discipline (baseline, candidate, production). Reports/Notebooks for sharing findings.
Cost tracking. GPU-hours and dollar cost per run can be logged alongside the other metrics. Useful for blameless retros and for spotting runs that take 80% of the budget for 5% of the gain.
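The arithmetic behind cost tracking is simple enough to compute at the end of each run. The sketch below assumes a flat hourly price per GPU; the function names, the run names and the 2.0 USD per GPU-hour rate are hypothetical, not real pricing.

```python
def run_cost(wall_hours: float, num_gpus: int, usd_per_gpu_hour: float) -> dict:
    """GPU-hours and dollar cost for one run, assuming a flat hourly rate."""
    gpu_hours = wall_hours * num_gpus
    return {"gpu_hours": gpu_hours, "cost_usd": gpu_hours * usd_per_gpu_hour}


def budget_share(costs: dict) -> dict:
    """Fraction of the total spend attributable to each run."""
    total = sum(c["cost_usd"] for c in costs.values())
    return {name: c["cost_usd"] / total for name, c in costs.items()}


# Hypothetical runs and price, for illustration only
PRICE = 2.0  # USD per GPU-hour
costs = {
    "baseline": run_cost(wall_hours=2.0, num_gpus=1, usd_per_gpu_hour=PRICE),
    "big-sweep": run_cost(wall_hours=6.0, num_gpus=8, usd_per_gpu_hour=PRICE),
}
shares = budget_share(costs)
print(costs)
print(shares)
```

Logging gpu_hours and cost_usd as summary values next to the final validation metric makes "cost per point of accuracy" a sortable column.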
Integration with experiment platforms. Optuna trials can be logged to W&B through Optuna's W&B callback. Lightning's Trainer logs to whichever logger you pass it. AzureML, Vertex, SageMaker all bridge to common trackers.
The "experiment journal" pattern. Per project, maintain a short markdown "what I tried, what happened, why" file alongside the code. Tracking dashboards are noisy; a few hand-written sentences per branch is what you'll actually re-read.
import os

import torch.distributed as dist
import wandb


def setup_logging(cfg):
    distributed = dist.is_available() and dist.is_initialized()
    is_rank0 = not distributed or dist.get_rank() == 0
    run_id = [None]
    if is_rank0:
        run = wandb.init(project=cfg.project, config=cfg)
        run_id[0] = run.id
    if distributed:
        # Broadcast the run ID from rank 0 so every rank can reference the same run
        dist.broadcast_object_list(run_id, src=0)
    os.environ["WANDB_RUN_ID"] = run_id[0]
    return is_rank0


def log_metrics(metrics, step, is_rank0):
    # Only rank 0 writes scalar metrics
    if not is_rank0:
        return
    wandb.log(metrics, step=step)


# Offline-then-sync workflow for clusters without internet
# Run:   WANDB_MODE=offline python train.py
# Later: wandb sync wandb/offline-run-...