ML Engineering - ML Resources Hub

Project Structure

Setting Up a Project

Repository Layout
Where to put your data, models, configs, and code so future-you doesn't curse past-you.
Configuration Management
Hydra, YAML, dotenv — keeping hyperparameters and paths out of your code.
Reproducibility
Seeds, deterministic ops, locking dependencies — making "it worked on my machine" actually portable.
Environments & Dependencies
uv, conda, Docker — choosing the right level of isolation for your workflow.
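
As a taste of the reproducibility topic above, here is a minimal sketch of a seeding helper. The name `seed_everything` and its `deterministic` flag are illustrative assumptions, not part of any linked guide. NumPy and PyTorch are seeded only if they happen to be installed.

```python
import os
import random


def seed_everything(seed: int = 0, deterministic: bool = False) -> int:
    """Seed every RNG available in this environment (hypothetical helper)."""
    random.seed(seed)
    # Affects only subprocesses started later; the current interpreter's
    # hash randomization was fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)

    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy RNG
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)
        if deterministic:
            # Raise on ops that have no deterministic implementation.
            torch.use_deterministic_algorithms(True)
    except ImportError:
        pass

    return seed
```

Seeding alone does not guarantee identical runs across machines; pinned dependencies and deterministic ops cover the rest.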

Development Loop

Iterating Fast

Experiment Tracking
W&B, MLflow — log every run so you can find "the good one" three weeks later.
Hyperparameter Search
Grid, random, Bayesian, Hyperband — pick by trial cost.
Testing ML Code
Shape, range, and gradient-flow checks, plus the 1-batch overfit test. The pyramid that catches 95% of bugs.
Logging & Debugging
Structured logs, NaN forensics, the 5-minute checklist for any training failure.
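
The 1-batch overfit idea mentioned above can be sketched without any framework. This is a toy illustration under stated assumptions: a two-parameter linear model, hand-written gradients, and a hypothetical function name `overfit_one_batch`. A real test would run your actual model and optimizer on one fixed batch.

```python
import random


def overfit_one_batch(steps: int = 2000, lr: float = 0.2) -> tuple[float, float]:
    """Fit y = w*x + b to one fixed batch; return (initial_loss, final_loss)."""
    rng = random.Random(0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    ys = [3.0 * x - 0.5 for x in xs]  # noiseless targets, so loss can reach ~0

    w, b = 0.0, 0.0

    def mse() -> float:
        return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    initial = mse()
    for _ in range(steps):
        # Analytic gradients of the mean squared error.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return initial, mse()


initial_loss, final_loss = overfit_one_batch()
# If the loss on a single memorizable batch does not collapse,
# suspect the model, the loss, or the update step before anything else.
assert final_loss < 1e-6 < initial_loss
```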

CI / CD

Automation

CI for ML
GitHub Actions workflows for an ML repo — fast on PR, slow on schedule.
Data Validation
Schemas, ranges, drift — catching bad data before it poisons your model.
Automated Retraining
When to retrain (calendar / drift / new data), promotion gates, rollback.
Model Registries
One source of truth for every model version — MLflow, W&B Artifacts.
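
A promotion gate, as mentioned under automated retraining, can be as small as one comparison function. The sketch below is hypothetical throughout: `EvalReport`, `should_promote`, and the thresholds are placeholders chosen for illustration, not values from any linked guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Hypothetical evaluation summary for one model version."""
    accuracy: float
    p95_latency_ms: float


def should_promote(
    candidate: EvalReport,
    production: EvalReport,
    min_gain: float = 0.005,
    latency_budget_ms: float = 100.0,
) -> bool:
    """Promote only if the candidate is clearly better and within budget."""
    better = candidate.accuracy >= production.accuracy + min_gain
    fast_enough = candidate.p95_latency_ms <= latency_budget_ms
    return better and fast_enough
```

Keeping the previous production version registered makes rollback a matter of pointing traffic back at it.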

Debugging & Profiling

Diagnostics

Training Debugging
NaNs, divergence, gradient pathologies — the small checklist that catches almost everything.
Loss Curve Forensics
Six canonical patterns and what each one indicates.
Profiling
PyTorch profiler, memory, data-loader bottlenecks: find the slow part before you optimise.
Distributed Training Pitfalls
DDP, FSDP, ZeRO, tensor / pipeline parallel — and the silent failure modes.

Production

Deployment

Serving
FastAPI, BentoML, Triton, vLLM: four common serving stacks.
Monitoring
System, data, prediction, performance — what to watch and when to alert.
Quantization & Distillation
int8 / int4, knowledge distillation, pruning — make models smaller and faster.
A/B Testing
Sample size, sequential testing, CUPED — comparing models in production.
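
For the sample-size question raised under A/B testing, a rough starting point is the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test with equal-sized arms. The function name `samples_per_arm` is hypothetical, and sequential testing or CUPED would change the numbers.

```python
import math
from statistics import NormalDist


def samples_per_arm(
    p_base: float, p_new: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    if p_base == p_new:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2)
```

Halving the detectable lift roughly quadruples the required traffic, which is why the minimum effect worth detecting should be fixed before the experiment starts.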

Time-Savers

Patterns & Tools

Notebook → Script
Moving from exploration to reproducible code. Jupytext, modules, CLI.
Training Scaffolding
Lightning, Accelerate, callbacks — stop rewriting the training loop.
CLI Patterns
Typer, Click, Hydra: clean command lines + config files.
Useful Libraries
A curated list of the small libraries that punch above their weight.