Key idea

Every script you'll run more than twice deserves a CLI. A clean command line plus a config file makes a run reproducible, schedulable, and shareable at almost no extra cost. Modern Python has good tools (Typer, Hydra) that make this nearly trivial.

Four reasonable choices. typer: type-hint-driven, modern, great defaults. click: classic, mature, used everywhere. argparse: stdlib; fine for tiny scripts. fire: zero-config; useful for prototypes. Skip hand-rolled sys.argv parsing; you'll regret it.

The pattern. One CLI per script. Each command takes a config (YAML / Hydra). Hyperparameters in the config; flags for things that vary per-run (output dir, debug mode). The CLI just dispatches; the work is in modules.

What goes where

  • CLI flags: run-specific (out dir, debug, dry-run)
  • Config file: hyperparameters, paths, model architecture
  • Env vars: secrets, API keys, runtime config
  • Code: never put paths or hyperparameters here

Common mistakes

  • 15 positional arguments — make them named
  • Hard-coded paths in the CLI defaults
  • One giant script with subcommands for unrelated things
  • No --help text: users (you, in 3 weeks) will hate you

import typer
from pathlib import Path
import yaml

app = typer.Typer(no_args_is_help=True)

@app.command()
def train(
    config: Path = typer.Argument(..., help="Path to YAML config"),
    out:    Path = typer.Option("runs/", help="Output directory"),
    debug:  bool = typer.Option(False, help="Quick smoke run"),
    seed:   int  = typer.Option(0,    help="Random seed"),
):
    """Train a model from a YAML config."""
    cfg = yaml.safe_load(config.read_text())
    if debug:
        cfg["max_steps"] = 10
    # run_training is assumed to be your own function, imported from a training module (not shown).
    run_training(cfg, out=out, seed=seed)

@app.command()
def evaluate(
    checkpoint: Path,
    test_data:  Path,
    threshold:  float = 0.5,
):
    """Evaluate a checkpoint on a held-out test set."""
    ...

if __name__ == "__main__":
    app()

Config + CLI interaction

$$ \text{values} \;=\; \text{defaults} \;\triangleleft\; \text{config file} \;\triangleleft\; \text{env vars} \;\triangleleft\; \text{CLI flags} $$
  • ◁ = "overridden by"
  • Each layer wins over the previous
  • Standard precedence; matches what users expect
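The precedence chain above can be sketched with plain dictionaries. This is a minimal illustration, not a full config loader: the function name resolve, the keys, and the sample values are all hypothetical, and a real tool would fill the three override layers from a parsed config file, os.environ, and the parsed flags.

```python
DEFAULTS = {"lr": 1e-3, "batch_size": 32, "out": "runs"}

def resolve(defaults, file_cfg, env_cfg, cli_cfg):
    """Merge layers left to right; each later layer overrides the earlier ones."""
    resolved = dict(defaults)
    for layer in (file_cfg, env_cfg, cli_cfg):
        # Skip None so a flag that was not given does not erase a lower layer's value.
        resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved

values = resolve(
    DEFAULTS,
    file_cfg={"lr": 3e-4, "batch_size": 64},   # e.g. parsed from a YAML file
    env_cfg={"batch_size": 128},                # e.g. read from MYAPP_BATCH_SIZE
    cli_cfg={"lr": None, "out": "runs/exp1"},   # flags; None means "not given"
)
print(values)  # {'lr': 0.0003, 'batch_size': 128, 'out': 'runs/exp1'}
```

Here lr comes from the config file, batch_size from the environment, and out from the flag, which is exactly the order the formula describes.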

Hydra. Meta's (formerly Facebook's) open-source config framework. YAML configs, overrides from the CLI (python train.py model.lr=1e-3), composable configs (defaults: [model: resnet, data: cifar10]), multi-runs / sweeps. Many ML projects that outgrow a single config file end up on Hydra.

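Pydantic + CLI. Define configs as Pydantic models to get validation, type coercion, and defaults. Typer has no built-in Pydantic support, but the two combine easily: parse the flags with Typer, then validate the merged values with the model. Useful for strict schemas; pairs nicely with Pydantic-everywhere codebases.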

Subcommands. my-tool train ..., my-tool evaluate ..., my-tool deploy .... Typer's decorator pattern. Better than one huge script with a --mode flag.

Help is documentation. Every flag gets a one-line description. --help output is what you'll read in 3 weeks; make it good. Examples in the docstring are nicer than the user manual.
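As a rough sketch of help text that doubles as documentation, the example below uses the stdlib argparse rather than Typer so that it runs without dependencies. The tool name my-tool and the flags mirror the ones used elsewhere in this section; the example invocations in the epilog are illustrative only.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="my-tool",
        description="Train a model from a config file.",
        epilog="Examples:\n"
               "  my-tool config.yaml\n"
               "  my-tool config.yaml --out runs/exp1 --debug",
        # Keep the line breaks in the epilog instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    # Every argument gets a one-line description.
    parser.add_argument("config", help="Path to the config file")
    parser.add_argument("--out", default="runs", help="Output directory")
    parser.add_argument("--debug", action="store_true", help="Quick smoke run")
    return parser

help_text = build_parser().format_help()
print(help_text)
```

With Typer the same effect comes from the help= arguments and the command's docstring, as in the train command shown earlier.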

Dry-run flag. --dry-run prints what would happen without doing it. Useful for destructive operations (training that overwrites, deployments, data writes).
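One way to wire up such a flag is to route every destructive step through a single function that either acts or only reports. The sketch below is an assumption-laden example: clean_outputs, the *.ckpt pattern, and the temporary directory exist purely for the demonstration, and it uses argparse to stay dependency-free.

```python
import argparse
import tempfile
from pathlib import Path

def clean_outputs(out_dir: Path, dry_run: bool) -> list:
    """Delete *.ckpt files under out_dir, or only report them when dry_run is set."""
    actions = []
    for path in sorted(out_dir.glob("*.ckpt")):
        if dry_run:
            actions.append(f"would delete {path.name}")
        else:
            path.unlink()
            actions.append(f"deleted {path.name}")
    return actions

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Remove old checkpoints.")
    parser.add_argument("out_dir", type=Path, help="Directory holding *.ckpt files")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print what would be deleted without deleting it")
    args = parser.parse_args(argv)
    for line in clean_outputs(args.out_dir, args.dry_run):
        print(line)
    return 0

# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "old.ckpt"
    ckpt.touch()
    planned = clean_outputs(Path(tmp), dry_run=True)   # nothing is removed
    still_there = ckpt.exists()
    exit_code = main([tmp])                            # real run: file is removed
    gone = not ckpt.exists()
```

Keeping the dry-run branch next to the real action makes it hard for the two to drift apart.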

Config logging. The script writes the fully-resolved config to the run's output directory. Every flag override, every default, every env var — recorded. Reproducibility starts here.
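A minimal sketch of this step, assuming the config has already been merged into a plain dict. It writes JSON instead of YAML only to avoid a third-party dependency; the file name resolved_config.json and the helper name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_resolved_config(cfg: dict, out_dir: Path) -> Path:
    """Write the final, merged config next to the run's other outputs."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "resolved_config.json"
    # sort_keys gives stable diffs between runs; default=str covers Path values.
    path.write_text(json.dumps(cfg, indent=2, sort_keys=True, default=str))
    return path

# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    cfg = {"lr": 3e-4, "seed": 0, "out": Path(tmp) / "exp1"}
    saved = save_resolved_config(cfg, cfg["out"])
    reloaded = json.loads(saved.read_text())
```

Calling this once at the start of the run, before any training happens, means the record exists even if the run crashes later.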

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="configs", config_name="train")
def main(cfg: DictConfig):
    # Hydra creates a per-run output dir and saves the composed config
    # as .hydra/config.yaml inside it.
    print(OmegaConf.to_yaml(cfg, resolve=True))   # print the resolved config
    run_training(cfg)                             # your own training entry point (not shown)

if __name__ == "__main__":
    main()

# Usage:
# python train.py                                 # defaults
# python train.py model.lr=1e-2 data=cifar100      # override individual values
# python train.py --multirun model.lr=1e-2,1e-3,1e-4  # sweep

The CLI contract

$$ \text{stdin} \to \text{exit code, stdout, stderr} $$
  • Exit code 0 on success, non-zero on failure (matters for CI / shells)
  • stdout for data / results; stderr for logs / progress
  • Pipeable, scriptable, testable
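The contract can be illustrated with a small command whose main function returns the exit code instead of calling sys.exit itself, which also makes it easy to test. The command (averaging numbers) is a made-up example; only the stdout / stderr / exit-code split is the point.

```python
import argparse
import contextlib
import io
import sys

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Print the mean of the given numbers.")
    parser.add_argument("values", nargs="*", help="Numbers to average")
    args = parser.parse_args(argv)
    print(f"received {len(args.values)} values", file=sys.stderr)  # log -> stderr
    try:
        numbers = [float(v) for v in args.values]
        mean = sum(numbers) / len(numbers)
    except (ValueError, ZeroDivisionError) as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1          # non-zero tells the shell / CI that the run failed
    print(mean)           # result -> stdout, so it can be piped
    return 0

# Capture both streams to show what goes where.
out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    ok_code = main(["1", "2", "3"])
    bad_code = main(["1", "oops"])

# In a real script the entry point would be: sys.exit(main())
```

After the two calls, stdout holds only the result of the successful run, while both log lines and the error message went to stderr.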

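Shell completion. Typer and Click can both generate completion scripts for bash / zsh / fish. Typer apps get a flag for it: my-tool --install-completion. Click generates the script through an environment variable, e.g. _MY_TOOL_COMPLETE=bash_source my-tool. Cheap UX win; pays dividends every day.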

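Plugins. Click-based CLIs (Typer included, since it builds on Click) can register commands discovered through package entry points, either with importlib.metadata or a helper package such as click-plugins. Useful for very large CLIs (a "platform" CLI with sub-tools). Most ML projects don't need this, but it's there when you do.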

Long-running CLIs. A train command might run for days. Print structured logs to stderr, write metrics to a logger / file, write checkpoints. Support graceful shutdown on Ctrl+C — save state and exit cleanly.
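A minimal sketch of graceful shutdown, assuming a loop-based trainer. Ctrl+C raises KeyboardInterrupt in the main thread, so catching it around the loop is enough for a simple case. The names train and fake_step, the JSON state file, and the choice of exit code 130 (a common shell convention for an interrupted process) are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

def train(total_steps: int, ckpt_path: Path, step_fn) -> int:
    """Run step_fn for each step; on Ctrl+C save progress and return 130."""
    step = 0
    try:
        for step in range(1, total_steps + 1):
            step_fn(step)
    except KeyboardInterrupt:
        completed = step - 1   # the interrupted step did not finish
        ckpt_path.write_text(json.dumps({"completed_steps": completed}))
        return 130
    ckpt_path.write_text(json.dumps({"completed_steps": total_steps}))
    return 0

def fake_step(step):
    # Stand-in for real work; pretends the user pressed Ctrl+C during step 3.
    if step == 3:
        raise KeyboardInterrupt

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "state.json"
    exit_code = train(10, ckpt, fake_step)
    state = json.loads(ckpt.read_text())
```

A process manager usually sends SIGTERM rather than SIGINT, so a production version would also install a handler with signal.signal for that case.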

Daemon mode. Some CLIs spawn long-lived services (serving, monitoring). Run them under systemd unit files, Docker containers, or supervisord. Under a supervisor the CLI should stay in the foreground and leave restarts to the supervisor; it should fork-and-detach only when nothing else manages the process.

Testable CLIs. Click and Typer both ship CliRunner for invoking commands programmatically. Asserts exit code + stdout. Same as testing any function.

Environment variable conventions. Pydantic Settings + dotenv for secrets. Prefix env vars (MYAPP_LOG_LEVEL) to avoid collision. Document them.
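The prefix convention can be sketched with nothing but os.environ. This is a plain-stdlib stand-in, not Pydantic Settings: it does no type coercion or validation, and the helper name load_env_settings and the sample variables are hypothetical.

```python
import os

def load_env_settings(prefix: str, environ=None) -> dict:
    """Collect PREFIX_* variables into a dict with lowercase, unprefixed keys."""
    environ = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }

# A fake environment keeps the example deterministic.
fake_env = {"MYAPP_LOG_LEVEL": "debug", "MYAPP_API_KEY": "s3cr3t", "PATH": "/usr/bin"}
settings = load_env_settings("MYAPP_", fake_env)
print(sorted(settings))  # ['api_key', 'log_level']
```

All values arrive as strings, which is the main reason to move to a validating layer once the settings grow.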

Versioning. Every CLI has --version. Helps debugging "which version is the CI runner using" mysteries.

import typer
from typer.testing import CliRunner

app = typer.Typer()

@app.command()
def add(a: int, b: int):
    """Add two numbers."""
    typer.echo(a + b)

# Test the CLI as a unit
def test_add():
    runner = CliRunner()
    result = runner.invoke(app, ["3", "4"])
    assert result.exit_code == 0
    assert result.stdout.strip() == "7"