Project Structure

Setting Up a Project

  • Repository Layout
    Where to put your data, models, configs, and code so future-you doesn't curse past-you.
  • Configuration Management
    Hydra, YAML, dotenv — keeping hyperparameters and paths out of your code.
  • Reproducibility
    Seeds, deterministic ops, locking dependencies — making "it worked on my machine" actually portable.
  • Environments & Dependencies
    uv, conda, Docker — choosing the right level of isolation for your workflow.
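
As a small illustration of the "Seeds" part of the Reproducibility entry, here is a minimal sketch using only the Python standard library. The helper names `seed_everything` and `sample_run` are hypothetical, chosen for this example; a real project would also seed NumPy and PyTorch through their own calls.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the stdlib RNG (hypothetical helper; extend for other libraries)."""
    random.seed(seed)
    # Only affects Python subprocesses started after this point,
    # not the hash seed of the already-running interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # NumPy and PyTorch keep separate RNG state; they would need
    # np.random.seed(seed) and torch.manual_seed(seed) respectively.


def sample_run(seed: int) -> list:
    """Stand-in for an experiment: draws three random numbers after seeding."""
    seed_everything(seed)
    return [random.random() for _ in range(3)]
```

Calling `sample_run` twice with the same seed should give identical draws, which is the property a seeded training run relies on.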

Development Loop

Iterating Fast

  • Experiment Tracking
    W&B, MLflow — log every run so you can find "the good one" three weeks later.
  • Hyperparameter Search
    Grid, random, Bayesian, Hyperband — pick by trial cost.
  • Testing ML Code
    Shape, range, and gradient-flow checks, plus the 1-batch overfit. The test pyramid that catches most bugs.
  • Logging & Debugging
    Structured logs, NaN forensics, the 5-minute checklist for any training failure.
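
The "1-batch overfit" named under Testing ML Code can be sketched without any framework: train on a single tiny batch and check that the loss goes to nearly zero. The sketch below assumes a one-variable linear model and plain gradient descent; the function name, learning rate, and data are illustrative choices, not taken from the guide.

```python
def overfit_one_batch(xs, ys, lr=0.1, steps=500):
    """Fit y = w*x + b to one small batch with gradient descent on MSE.

    Returns the final weights and the loss recorded at every step.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        errs = [(w * x + b) - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errs) / n)
        # Gradients of mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = 2 * sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


# One batch generated from y = 2x + 1; a healthy training loop should memorise it.
BATCH_X = [0.0, 1.0, 2.0, 3.0]
BATCH_Y = [1.0, 3.0, 5.0, 7.0]
```

If the final loss does not drop close to zero on a batch this small, the problem is more likely in the model or the update step than in the data.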

CI / CD

Automation

  • CI for ML
    GitHub Actions workflows for an ML repo — fast on PR, slow on schedule.
  • Data Validation
    Schemas, ranges, drift — catching bad data before it poisons your model.
  • Automated Retraining
    When to retrain (calendar / drift / new data), promotion gates, rollback.
  • Model Registries
    One source of truth for every model version — MLflow, W&B Artifacts.
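
For the Data Validation entry, a schema-and-range check can be sketched in a few lines of plain Python. The `SCHEMA` layout (column name mapped to type, minimum, maximum) and the function `validate_rows` are assumptions made for this example; dedicated validation libraries offer far more than this.

```python
def validate_rows(rows, schema):
    """Return (row_index, column, problem) tuples; an empty list means all rows passed."""
    problems = []
    for i, row in enumerate(rows):
        for col, (expected_type, lo, hi) in schema.items():
            if col not in row:
                problems.append((i, col, "missing"))
                continue
            value = row[col]
            if not isinstance(value, expected_type):
                problems.append((i, col, "wrong type"))
            elif not (lo <= value <= hi):
                # Also catches NaN, since every comparison with NaN is False.
                problems.append((i, col, "out of range"))
    return problems


# Hypothetical schema: column -> (type, min, max).
SCHEMA = {"age": (int, 0, 120), "income": (float, 0.0, 1e7)}
```

Running a check like this before training lets a pipeline fail loudly on bad rows rather than train on them silently.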

Debugging & Profiling

Diagnostics

  • Training Debugging
    NaNs, divergence, gradient pathologies — the small checklist that catches almost everything.
  • Loss Curve Forensics
    Six canonical patterns and what each one diagnoses.
  • Profiling
    PyTorch profiler, memory, data-loader bottlenecks — find before you optimise.
  • Distributed Training Pitfalls
    DDP, FSDP, ZeRO, tensor / pipeline parallel — and the silent failure modes.
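
One item that tends to appear on any NaN or divergence checklist is locating the first step where the loss stopped being finite. A minimal sketch, assuming the losses were logged as a plain list of floats (the helper name `first_non_finite` is hypothetical):

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None
```

Knowing the exact step narrows the search to the batch, learning rate, and gradient norms just before it.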

Production

Deployment

  • Serving
    FastAPI, BentoML, Triton, vLLM — the four standard serving stacks.
  • Monitoring
    System, data, prediction, performance — what to watch and when to alert.
  • Quantization & Distillation
    int8 / int4, knowledge distillation, pruning — make models smaller and faster.
  • A/B Testing
    Sample size, sequential testing, CUPED — comparing models in production.
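
The "Sample size" part of the A/B Testing entry can be illustrated with the standard normal-approximation formula for comparing two proportions, n = (z_{1-α/2} + z_{power})² · (p₁(1−p₁) + p₂(1−p₂)) / (p₁ − p₂)² per arm. The sketch below assumes a two-sided test and equal arm sizes; the function name and default values are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect a shift from rate p1 to rate p2."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)
```

For example, `sample_size_per_arm(0.10, 0.12)` estimates the traffic needed to detect a two-point lift over a 10% baseline; smaller expected lifts push the requirement up quadratically.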

Time-Savers

Patterns & Tools

  • Notebook → Script
    Moving from exploration to reproducible code. Jupytext, modules, CLI.
  • Training Scaffolding
    Lightning, Accelerate, callbacks — stop rewriting the training loop.
  • CLI Patterns
    Typer, click, Hydra — clean command lines + config files.
  • Useful Libraries
    A curated list of the small libraries that punch above their weight.