Key idea

Hyperparameter search is itself an optimisation problem with a budget. Random search usually beats grid search once there are more than about 3 dimensions. Bayesian optimisation (Optuna, scikit-optimize) pays off when each trial is expensive. Hyperband / ASHA aggressively prune unpromising trials. The trick is matching the search strategy to the cost of a trial.

Grid search vs random search — same number of trials, very different coverage of the important dimensions

An objective f(lr, wd) with an "important" dimension (left-right) and an "unimportant" one (up-down). Grid wastes most of its budget on the unimportant axis; random hits more unique values of the important one; Bayesian quickly finds the high-value region after a few exploratory probes.

Grid search. The default beginner choice. Wasteful — most hyperparameters aren't equally important, but grid spends the same effort on each axis. Useful only with ≤ 3 hyperparameters.

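Random search. Sample N points independently from the search distributions (uniform or log-uniform per parameter). Bergstra & Bengio (2012) showed that, for the same budget, this is more efficient than grid search when only a few hyperparameters really matter, because every trial tries a fresh value on each axis instead of repeating the same few. Surprisingly hard to beat in practice when N is large enough.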
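
The coverage argument can be checked with a small count. The sketch below is illustrative only: it assumes a two-dimensional unit square where the first coordinate is the "important" one, and the helper names `grid_points` and `random_points` are made up for this example.

```python
import random

def grid_points(n_per_axis, lo=0.0, hi=1.0):
    """Full 2-D grid: n_per_axis ** 2 trials in total."""
    step = (hi - lo) / (n_per_axis - 1)
    axis = [lo + i * step for i in range(n_per_axis)]
    return [(x, y) for x in axis for y in axis]

def random_points(n_trials, rng, lo=0.0, hi=1.0):
    """Independent uniform samples on both axes."""
    return [(rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(n_trials)]

rng = random.Random(0)
grid = grid_points(4)           # 16 trials
rand = random_points(16, rng)   # 16 trials, same budget

# How many distinct values of the important (first) coordinate did each try?
grid_unique = len({x for x, _ in grid})
rand_unique = len({x for x, _ in rand})
print(grid_unique, rand_unique)
```

With the same 16 trials, the grid only ever tries 4 distinct values of the important coordinate, while random sampling tries a new one on essentially every trial.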

Bayesian optimisation. Build a surrogate model (Gaussian process or random forest) of the objective; pick the next trial to maximise expected improvement. Wins when each trial is expensive. Tools: Optuna, scikit-optimize, BoTorch.

Hyperband / ASHA. Allocate many trials, but kill the bad ones early. Train at low compute; promote the top-k to more compute; iterate. Often the most cost-efficient strategy for deep learning.
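
A minimal sketch of the "train cheap, promote the top-k" loop (synchronous successive halving, the building block of Hyperband) might look like this. The `toy_loss` function and the `quality` field are hypothetical stand-ins for real training; only the promotion logic is the point.

```python
import random

def toy_loss(config, budget):
    # Hypothetical stand-in for training: the loss approaches the config's
    # intrinsic "quality" as the budget grows.
    return config["quality"] + 1.0 / budget

def successive_halving(configs, min_budget=1, eta=3):
    """Keep the best 1/eta of the configs at each rung, giving survivors eta x the budget."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configs evaluated at that budget)
    while len(survivors) > 1:
        history.append((budget, len(survivors)))
        ranked = sorted(survivors, key=lambda c: toy_loss(c, budget))
        keep = max(1, len(survivors) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0], history

rng = random.Random(0)
configs = [{"id": i, "quality": rng.random()} for i in range(27)]
best, history = successive_halving(configs)
print(best["id"], history)
```

Starting from 27 configurations with eta=3, the rungs evaluate 27, 9, and 3 configurations at budgets 1, 3, and 9, so most of the compute goes to the few that survive.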

Population-based training (PBT). Run many models in parallel; periodically copy the best ones' weights and perturb their hyperparameters. Adapts hyperparameters during training.

Pick by trial cost

  • Cheap trial (< minutes): random search
  • Moderate trial (~hour): Bayesian via Optuna
  • Expensive trial (~day): Hyperband / ASHA
  • Continuous training: PBT
  • ≤ 3 hyperparameters with discrete values: grid is fine

Pitfalls

  • Searching too many parameters — most don't matter; profile importances first
  • Search range too narrow — you can't find what you don't include
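  • Comparing trials that differ in both hyperparameters and seed: seed-to-seed variance can swamp the difference you are trying to measure
  • Picking by val score from a single seed: re-run the top candidates on several seeds before choosing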
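
import optuna

N_EPOCHS = 20  # assumed epoch budget; set this to match your training run

def objective(trial):
    lr  = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    wd  = trial.suggest_float("wd", 1e-6, 1e-2, log=True)
    bs  = trial.suggest_categorical("batch_size", [32, 64, 128])
    act = trial.suggest_categorical("activation", ["relu", "gelu", "swish"])

    val_loss = float("inf")
    for epoch in range(N_EPOCHS):
        # train_one_epoch is a user-supplied placeholder: it trains for one
        # more epoch and returns the current validation loss.
        val_loss = train_one_epoch(lr, wd, bs, act)

        # Report intermediate values so the pruner can stop weak trials early
        trial.report(val_loss, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return val_loss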

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.HyperbandPruner(),
)
study.optimize(objective, n_trials=50, timeout=3600 * 6)
print(study.best_params, study.best_value)

The math behind Bayesian optimisation and ASHA scheduling

Expected Improvement

$$ \mathrm{EI}(x) = \mathbb{E}\!\left[\max(0,\, f^* - f(x))\right] $$

  • f*: best value seen so far (the lowest, since the objective is minimised here)
  • Expectation over the surrogate's posterior — high for promising unexplored regions
  • Acquisition function that drives Bayesian optimisation
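
When the surrogate's posterior at x is Gaussian with mean μ and standard deviation σ (an assumption that holds for a Gaussian process but not for every surrogate), the expectation above has a closed form: EI = (f* − μ)·Φ(z) + σ·φ(z) with z = (f* − μ)/σ. A small sketch for the minimisation case, using only the standard library:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimisation under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(0.0, f_best - mu)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (f_best - mu) * cdf + sigma * pdf

# Same predicted mean (slightly worse than the incumbent), different uncertainty
ei_certain = expected_improvement(mu=1.2, sigma=0.1, f_best=1.0)
ei_uncertain = expected_improvement(mu=1.2, sigma=0.5, f_best=1.0)
print(ei_certain, ei_uncertain)
```

The second call scores higher even though both points have the same predicted mean, which is how EI rewards unexplored regions.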

Surrogate models. Bayesian optimisation builds a probabilistic model of f(hyperparams). Gaussian processes are the textbook choice — good with few trials, slow when many. Tree-structured Parzen Estimator (TPE, used by Optuna by default) scales better and handles categoricals.

Acquisition functions. Expected Improvement, Upper Confidence Bound, Probability of Improvement, Knowledge Gradient. EI is the standard; UCB is parameterised by an exploration knob. Most libraries default to EI; pick UCB when you want explicit exploration control.

Asynchronous Successive Halving (ASHA). Li et al. (2020). Promote the top 1/η fraction of trials at each rung; trials that survive rung r get η× the compute. Being asynchronous (no synchronisation barrier between rungs) lets it scale to hundreds of workers.
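
The rung arithmetic and the promotion check can be sketched as follows. This is a simplified illustration, not the full scheduler from the paper: the worker loop is omitted, and the names `rung_budgets` and `promotable` are invented for this example.

```python
def rung_budgets(min_resource, max_resource, eta):
    """Resource given at each rung: min_resource * eta**k, capped at max_resource."""
    budgets = []
    r = min_resource
    while r <= max_resource:
        budgets.append(r)
        r *= eta
    return budgets

def promotable(rung_results, already_promoted, eta, minimize=True):
    """Trials in the top 1/eta of the results recorded so far in a rung
    that have not been promoted yet. No waiting for the rung to fill up."""
    ranked = sorted(rung_results, key=rung_results.get, reverse=not minimize)
    k = len(ranked) // eta
    return [t for t in ranked[:k] if t not in already_promoted]

budgets = rung_budgets(min_resource=1, max_resource=100, eta=3)

# Validation losses recorded so far at the lowest rung (hypothetical values)
rung0 = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.7, "e": 0.3, "f": 0.8}
first = promotable(rung0, already_promoted=set(), eta=3)
later = promotable(rung0, already_promoted={"b"}, eta=3)
print(budgets, first, later)
```

A free worker would call `promotable` on the highest rung first and, if nothing qualifies anywhere, start a new trial at the lowest rung. That decision needs only the results recorded so far, which is why no barrier is required.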

Multi-fidelity methods. Use cheap proxies (subsampled data, fewer epochs) to estimate the expensive metric. Hyperband and ASHA are themselves multi-fidelity methods; BOHB combines Hyperband's early stopping with Bayesian optimisation of which configurations to try. Often among the strongest general-purpose approaches for deep learning.

Pruning. Median pruner: kill a trial if its intermediate value is worse than the median of earlier trials at the same step. Patient pruner: wraps another pruner and only lets it fire once the trial has gone a set number of steps without improving. Hyperband pruner: structured promotion schedule.
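
The median rule is simple enough to write out. The sketch below is a simplified version for illustration: "worse than the median" means higher when minimising, and details that library implementations add (such as comparing the trial's best value so far, or requiring a minimum number of finished trials) are left out. The function name and arguments are assumptions of this example.

```python
from statistics import median

def should_prune_median(current_value, completed_histories, step,
                        n_warmup_steps=0, minimize=True):
    """Prune if the current value is worse than the median of what
    earlier trials reported at the same step."""
    if step < n_warmup_steps:
        return False  # grace period: never prune this early
    peers = [h[step] for h in completed_histories if len(h) > step]
    if not peers:
        return False  # nothing to compare against yet
    m = median(peers)
    return current_value > m if minimize else current_value < m

# Validation loss per step for three finished trials (hypothetical values)
histories = [[1.0, 0.8, 0.6], [0.9, 0.7, 0.5], [1.2, 1.1, 1.0]]
print(should_prune_median(0.95, histories, step=1))
print(should_prune_median(0.75, histories, step=1))
```

At step 1 the peers reported 0.8, 0.7, and 1.1, so the median is 0.8: a trial at 0.95 is pruned and a trial at 0.75 continues.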

Search space design. Use log-uniform for learning rate and weight decay. Use uniform for dropout and integer ranges for layer counts. Use categorical for activation choice and optimizer name. Conditioning ("if optimizer == sgd, also tune momentum") is supported in Optuna by calling trial.suggest_* inside if-blocks.
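
To make the distinction between the distribution types concrete, here is a library-free sketch of sampling from such a space. The parameter names and ranges are illustrative assumptions, and `log_uniform` and `sample_config` are helpers defined only for this example.

```python
import math
import random

def log_uniform(rng, low, high):
    """Uniform in log space: each decade between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_config(rng):
    cfg = {
        "lr": log_uniform(rng, 1e-5, 1e-1),            # spans four decades
        "weight_decay": log_uniform(rng, 1e-6, 1e-2),
        "dropout": rng.uniform(0.0, 0.5),              # plain uniform
        "n_layers": rng.randint(2, 8),                 # integer range, inclusive
        "optimizer": rng.choice(["sgd", "adam"]),      # categorical
    }
    # Conditional parameter: momentum only exists when the optimizer is SGD
    if cfg["optimizer"] == "sgd":
        cfg["momentum"] = rng.uniform(0.0, 0.99)
    return cfg

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(1000)]
share_lowest_decade = sum(c["lr"] < 1e-4 for c in configs) / len(configs)
print(share_lowest_decade)
```

With log-uniform sampling roughly a quarter of the learning rates land in each of the four decades. A plain uniform draw over the same range would put about 90% of them above 1e-2.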

import optuna

# Multi-fidelity with HyperbandPruner — train cheap, promote winners
study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(multivariate=True),
    pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource=100, reduction_factor=3),
)

def objective(trial):
    cfg = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-1, log=True),
        "model": trial.suggest_categorical("model", ["resnet18", "resnet50", "vit"]),
    }
    for epoch in range(100):
        val_loss = train_one_epoch(cfg)
        trial.report(val_loss, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return val_loss

study.optimize(objective, n_trials=200, n_jobs=4)

PBT, neural architecture search, and multi-objective optimisation

Population-Based Training

$$ \theta_i^{(t+1)} \leftarrow \begin{cases} \theta_j^{(t)} \;\text{(copy from a top performer } j\text{)} & \text{if } i \text{ ranks in the bottom } k \\ \theta_i^{(t)} \;\text{(continue)} & \text{otherwise} \end{cases} $$
  • Bottom performers copy top performers' weights
  • … then perturb the hyperparameters
  • Adapts hyperparameters during training — useful when the optimal LR schedule isn't known a priori

PBT. Jaderberg et al. (2017). Run N agents in parallel with random hyperparameters; periodically exploit (bad agents copy good agents' weights) and explore (perturb hyperparameters). The hyperparameter trajectory becomes a "schedule" rather than a fixed value. Used at DeepMind for many production-grade hyperparameter sweeps.
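
One exploit/explore round can be sketched in a few lines. This is a toy illustration under stated assumptions: each population member is a dict with a `score` (higher is better), a single hyperparameter `lr`, and a `weights` field standing in for model parameters. The 25% cut-off and the 0.8 / 1.2 perturbation factors are example values, not prescribed settings.

```python
import copy
import random

def pbt_step(population, rng, frac=0.25, perturb=(0.8, 1.2)):
    """One exploit/explore round: the bottom members copy a top member, then perturb."""
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    k = max(1, int(len(ranked) * frac))
    top, bottom = ranked[:k], ranked[-k:]
    for member in bottom:
        donor = rng.choice(top)
        member["weights"] = copy.deepcopy(donor["weights"])  # exploit: copy weights
        member["lr"] = donor["lr"] * rng.choice(perturb)     # explore: perturb the copied lr
    return population

rng = random.Random(0)
population = [
    {"score": float(i), "lr": 1e-3 * (i + 1), "weights": [i]}
    for i in range(8)
]
pbt_step(population, rng)
for member in population:
    print(member["score"], member["lr"], member["weights"])
```

In a real run every member would train for a while between calls and have its score refreshed. Repeating the step is what turns each member's learning rate into a schedule over time.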

Neural Architecture Search (NAS). Hyperparameter search for architectures. Once popular; now mostly subsumed by foundation-model fine-tuning. Differentiable, gradient-based methods such as DARTS are the modern face.

Multi-objective optimisation. Trade off accuracy vs latency, accuracy vs cost, etc. Pareto fronts; NSGA-II for evolutionary, qNEHVI for Bayesian. Optuna has built-in support.
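
A Pareto front is the set of trials that no other trial beats on every objective at once. A small library-free sketch for the accuracy-vs-latency case, with made-up (accuracy, latency in ms) pairs:

```python
def dominates(a, b):
    """a dominates b if it is at least as accurate and at least as fast,
    and strictly better on one of the two."""
    acc_a, lat_a = a
    acc_b, lat_b = b
    return (acc_a >= acc_b and lat_a <= lat_b
            and (acc_a > acc_b or lat_a < lat_b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (accuracy, latency_ms) for five hypothetical trials
trials = [(0.90, 12.0), (0.92, 20.0), (0.88, 15.0), (0.95, 40.0), (0.91, 25.0)]
front = pareto_front(trials)
print(front)
```

Here (0.88, 15.0) drops out because (0.90, 12.0) is both more accurate and faster, and (0.91, 25.0) drops out for the same reason against (0.92, 20.0). The remaining three are the trade-offs worth choosing between.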

Cost-aware search. Each trial has a known cost; the budget is constrained. Cost-EI, Cost-LCB. Useful when trials vary enormously in compute (batch size, model size).

Continual / online hyperparameter tuning. Production models drift; what was the best LR a year ago may not be now. Periodic re-tuning, possibly with PBT-style methods running on production.

Beware overfitting the search. If your search trains on training data and selects on val, you're fine. If you keep tuning against the validation set itself (e.g., picking which features to engineer based on val performance), you're overfitting your search procedure to it. Hold out a final test set.

Reporting. Always report your search budget (number of trials, total GPU-hours). Without this, a "we found 95% accuracy" is meaningless — was that 5 trials or 50 000?

import optuna

# Multi-objective — maximise accuracy AND minimise inference latency
study = optuna.create_study(
    directions=["maximize", "minimize"],
    sampler=optuna.samplers.NSGAIISampler(),
)

def objective(trial):
    width = trial.suggest_int("width", 32, 512, log=True)
    depth = trial.suggest_int("depth", 2, 8)
    acc, latency_ms = train_and_benchmark(width, depth)
    return acc, latency_ms

study.optimize(objective, n_trials=100)

# Inspect the Pareto front
for trial in study.best_trials:
    print(trial.values, trial.params)