Useful Libraries — ML Resources Hub

Key idea

Most "I just need to..." problems have a library. Knowing the right small library beats writing 200 lines of bespoke code. This page is a curated list: not exhaustive, but it covers the libraries that come up in nearly every ML project.

The set below isn't a tier list; every library here is worth a 30-minute investment. If you haven't used four of them, that's four places where you've probably been writing more code than you needed to.

Data & configs

polars: faster pandas. Eager + lazy API. The new default for tabular work above ~1M rows.
pydantic: typed data models with runtime validation. Standard for configs and API schemas.
hydra: composable YAML configs with CLI overrides.
jsonargparse: argparse + dataclasses + YAML. Less heavy than Hydra.
great_expectations: declarative data validation.
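
To make the config use case concrete, here is a minimal sketch of validating a training config with pydantic. The `TrainConfig` class and its field names are hypothetical, chosen only for illustration; the sketch assumes pydantic is installed.

```python
from pydantic import BaseModel, Field, ValidationError


class TrainConfig(BaseModel):
    """Hypothetical training config; field names are illustrative only."""

    lr: float = Field(default=3e-4, gt=0)      # must be strictly positive
    batch_size: int = Field(default=32, ge=1)  # must be at least 1
    backbone: str = "resnet50"


# Values are validated (and coerced where safe) at construction time.
cfg = TrainConfig(lr="0.001", batch_size=64)

# Invalid values raise a ValidationError up front instead of failing mid-run.
try:
    TrainConfig(lr=-1.0)
    bad_lr_rejected = False
except ValidationError:
    bad_lr_rejected = True
```

The point of the design is that a typo or an out-of-range value surfaces when the config is loaded, not an hour into training.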

Modelling & training

lightning: training-loop scaffolding
accelerate: distributed + mixed precision for your custom loop
einops: tensor reshaping with named axes, still readable months later
timm: every vision backbone, pretrained
transformers: every NLP / multimodal model, pretrained
peft: LoRA, adapters, prompt tuning

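# einops: readable tensor reshaping (the inputs below are illustrative random tensors)
import torch
from einops import rearrange, repeat, reduce, einsum

x = torch.randn(2, 8, 16, 64)         # batch, heads, seq, dim
v = torch.randn(64)                   # a single vector
features = torch.randn(2, 3, 32, 32)  # batch, channels, height, width
q = torch.randn(2, 8, 16, 64)         # queries: batch, heads, n, dim
k = torch.randn(2, 8, 10, 64)         # keys:    batch, heads, m, dim

# Batch × Heads × Seq × Dim → Batch × Seq × (Heads · Dim)
out = rearrange(x, "b h n d -> b n (h d)")           # shape (2, 16, 512)

# Tile a vector across the batch
batched = repeat(v, "d -> b d", b=32)                # shape (32, 64)

# Reduce over the spatial axes
pooled = reduce(features, "b c h w -> b c", "mean")  # shape (2, 3)

# Explicit einsum (available in recent einops releases; the pattern goes last)
attn = einsum(q, k, "b h n d, b h m d -> b h n m")   # shape (2, 8, 16, 10)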

Experiment tooling, dev quality of life & the production stack

The "always reach for" set: uv, ruff, pre-commit, loguru, rich, pytest.

These six are the "every project, no excuse" baseline.
Together they cover dependency management, linting, git hooks, logging, pretty output, and tests.

Experiment tracking. wandb, mlflow, aim, comet, neptune. Pick one and stick with it. See Experiment Tracking.

Dev quality of life. uv (fast Python package manager, replaces pip/poetry/venv). ruff (linter + formatter; replaces black + flake8 + isort + many more). pre-commit (run hooks on git commit). rich (pretty terminal output, progress bars). loguru (logging that just works). pytest (the testing framework).

Tabular & data. polars (faster pandas). duckdb (SQL on parquet, blazingly fast for analytics). pyarrow (the columnar foundation). fsspec (uniform interface to local/S3/GCS).

Modelling. scikit-learn (still the right answer for tabular ML). xgboost / lightgbm / catboost (gradient boosting). statsmodels (classical statistics). scipy (everything else).
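
As a rough illustration of why scikit-learn remains a sensible first stop for tabular problems, the sketch below builds a cross-validated baseline in a few lines. The synthetic dataset stands in for real data, and the choice of logistic regression is an assumption for the example, not a recommendation from this page.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic tabular data, standing in for a real dataset.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
# and never sees the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping in xgboost, lightgbm, or catboost is usually a one-line change, since their scikit-learn wrappers follow the same estimator interface.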

NLP. sentence-transformers for embeddings. tiktoken for OpenAI-compatible tokenisation. spacy for classical NLP. llama-cpp-python for local LLMs.

Inference / serving. onnxruntime, vllm, tgi, bentoml, fastapi. See Serving.

Visualisation. matplotlib + seaborn (the classics). plotly (interactive). altair (declarative). holoviews for time series.

# uv — fast dependency management. Replaces pip + venv + pip-tools.
# pip install uv
# uv venv && source .venv/bin/activate
# uv pip install torch transformers wandb hydra-core polars
# uv pip compile pyproject.toml -o requirements.lock      # lockfile

# ruff — lint + format. Replaces black + flake8 + isort.
# pip install ruff
# ruff check .              # lint
# ruff format .             # format
# Add to pre-commit:
# - repo: https://github.com/astral-sh/ruff-pre-commit
#   rev: v0.4.0
#   hooks:
#     - id: ruff
#     - id: ruff-format

# rich: pretty terminal output
from rich.console import Console
from rich.progress import track

console = Console()
items = range(100)          # placeholder workload

def process(item):          # placeholder for the real per-item work
    pass

for item in track(items, description="Training..."):
    process(item)
console.print("[bold green]Done![/]")

Niche-but-saves-a-day libraries

"I wish I'd known about that" $$ \text{specialised library} \;\gg\; \text{your hand-rolled version} $$

Niche libraries usually beat hand-rolled code by 5–50×
Maintainers have already hit the edge cases you'll hit later

Optimisation. optuna for hyperparameter search. scipy.optimize for classical optimisation. pyomo / cvxpy for math programming.
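
For the classical-optimisation case, a minimal sketch with scipy.optimize might look like the following. The Rosenbrock test function and the starting point are illustrative choices, not something this page prescribes.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Rosenbrock function: a standard test problem with its minimum at (1, 1).
x0 = np.array([-1.2, 1.0])

# Passing the analytic gradient (jac) avoids finite-difference approximations.
result = minimize(rosen, x0, jac=rosen_der, method="BFGS")

print("argmin:", result.x)
print("min value:", result.fun)
print("iterations:", result.nit)
```

Hand-rolling a quasi-Newton method with a robust line search is exactly the kind of 200-line detour the library saves.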

Geometry & signal. scikit-image for image processing. librosa for audio. shapely + geopandas for geospatial. biopython for bio.

Profiling. py-spy (sampling profiler, attach to running process). scalene (CPU + GPU + memory). line_profiler (per-line CPU). memray (per-allocation memory).

Storage. safetensors (safe alternative to pickle for model weights). zarr (chunked array storage). lmdb (memory-mapped key-value store). webdataset (tar-based streaming for vision).

Reproducibility. dvc (git for data). nbstripout (strip notebook outputs before commit). mlflow for experiment + model versioning.

Causality & uncertainty. dowhy, EconML for causal inference. NumPyro / PyMC for Bayesian modelling. arviz for posterior diagnostics.

Speed-ups. numba (JIT for numpy). cython (compile-to-C for hot loops). cupy (numpy on GPU). jax (numpy + autodiff + JIT + vmap — the underrated alternative to PyTorch for research).
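
To illustrate the "numpy + autodiff + JIT + vmap" combination, here is a small sketch in JAX. The `loss` function and the input values are made up for the example.

```python
import jax
import jax.numpy as jnp


def loss(w, x):
    """Sum of squared projections of x onto w (illustrative only)."""
    return jnp.sum((x @ w) ** 2)


w = jnp.array([1.0, -2.0, 0.5])
x = jnp.arange(12.0).reshape(4, 3)

# Autodiff + JIT: gradient of loss with respect to w, compiled on first call.
grad_fn = jax.jit(jax.grad(loss))
g = grad_fn(w, x)                                      # shape (3,)

# vmap: evaluate the same function per row of x without writing a loop.
per_example = jax.vmap(loss, in_axes=(None, 0))(w, x)  # shape (4,)
```

The transformations compose, so the same `loss` definition serves the batched, differentiated, and compiled versions.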

Concurrency. joblib (parallel for-loops). dask (pandas / numpy at scale). ray (distributed Python, including Ray Train + Ray Serve).
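
The "parallel for-loops" idea in joblib can be sketched as below. The `slow_square` function is a hypothetical stand-in for real per-item work.

```python
from joblib import Parallel, delayed


def slow_square(n: int) -> int:
    """Stand-in for an expensive, independent per-item computation."""
    return n * n


# Equivalent to [slow_square(n) for n in range(8)], run across 2 workers.
# Results come back in input order.
results = Parallel(n_jobs=2)(delayed(slow_square)(n) for n in range(8))
```

This pattern pays off when each call is expensive relative to the cost of shipping its arguments to a worker; for tiny tasks the overhead can outweigh the gain.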

The "I write this in every project" set. A function to fix all seeds. A function to log GPU memory. A function to count parameters. Put them in utils/. Steal from your last project.

# A few utility functions worth copying between projects

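import os
import random

import numpy as np
import torch

def fix_seeds(seed=42):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    # PYTHONHASHSEED is read at interpreter start-up, so setting it here only
    # affects subprocesses; export it before launching Python to fix hashing.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def count_params(model):
    """Return (total, trainable) parameter counts for a torch.nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

def gpu_memory_summary():
    """Print allocated and reserved memory on the current CUDA device, in GB."""
    if not torch.cuda.is_available():
        return
    alloc = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"alloc {alloc:.2f} GB  reserved {reserved:.2f} GB")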
