Key idea

Instead of one guess, give me a range. A Gaussian process predicts both a value and how confident it is about that value. Where you have lots of nearby data, it's confident; where you don't, it widens its uncertainty.

Click to add observations · shift+click to remove · slide ℓ to change how far each point's influence reaches
ℓ = 0.25 σₙ = 0.05

The bold line is the posterior mean; the shaded band is the ±2σ uncertainty. The thin lines are four sample functions drawn from the posterior — they show what kinds of functions are still consistent with your data. Clear all observations and you're seeing the prior — the GP's belief before any data. Add a point and watch the band collapse to it (within σₙ); the uncertainty inflates the further you go.

Most regression models give you one number per prediction. A Gaussian process gives you a number and an error bar — and the error bar is data-aware: it shrinks near your training points and grows when you predict far away from anything you've seen.

That uncertainty is the whole point. GPs are slow and don't scale to big data, but when you have a few hundred points and you care about knowing what you don't know — design of experiments, Bayesian optimization, robotics — they're the right tool.

Reach for it when

  • You have small data (≲1000 points)
  • Calibrated uncertainty matters
  • You want a non-parametric model that adapts to the data
  • You're doing Bayesian optimization

Skip it when

  • You have a lot of data (GPs are O(N³))
  • You only need point predictions, not uncertainty
  • The signal is well-behaved and a simpler model would do
  • You're working with images or text (use neural nets)
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X_train, y_train)

# Predict with uncertainty
y_mean, y_std = gp.predict(X_test, return_std=True)
Want to see what's actually happening under the hood?
Key idea $$ f(\mathbf{x}) \;\sim\; \mathcal{GP}\!\left(m(\mathbf{x}),\; k(\mathbf{x}, \mathbf{x}')\right) $$
  • m(x)mean function (often zero)
  • k(x,x')kernel — encodes how similar outputs should be when inputs are similar

A GP is a distribution over functions such that any finite set of function values is jointly Gaussian. The kernel determines what kind of functions are likely — smooth, periodic, rough, etc.

The kernel k is the heart of the model. The most common one — the RBF kernel k(x, x') = exp(−‖x − x'‖² / 2ℓ²) — assumes smooth functions; nearby inputs produce similar outputs, with "nearby" controlled by the length scale ℓ.

Given training data, the posterior over f(x*) at a test point is also Gaussian, in closed form:

μ* = k*T(K + σ²I)−1y     σ²* = k(x*,x*) − k*T(K + σ²I)−1k*

That matrix inverse is the source of the O(N³) cost — it's what kills GPs on large data.

Reach for it when

  • Small to medium data with strong prior beliefs about smoothness
  • You want principled hyperparameter selection via marginal likelihood
  • Active learning / Bayesian optimization (need uncertainty to pick next point)
  • Time-series with seasonality (composable kernels)

Skip it when

  • N > a few thousand — use sparse approximations or switch models
  • Functions are non-stationary in ways the kernel can't capture
  • You have categorical features (need custom kernels)
  • Output is high-dimensional (vector-valued GPs are awkward)
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C

# Composable kernel: amplitude * RBF + noise
kernel = C(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

gp = GaussianProcessRegressor(
    kernel=kernel,
    n_restarts_optimizer=10,   # random restarts for marginal-likelihood optimization
    normalize_y=True,
).fit(X_train, y_train)

print("Fitted kernel:", gp.kernel_)
print("Log-marginal likelihood:", gp.log_marginal_likelihood(gp.kernel_.theta))

y_mean, y_std = gp.predict(X_test, return_std=True)
Want sparse approximations and the kernel-engineering tradeoffs?
Marginal likelihood (hyperparameter learning) $$ \log p(\mathbf{y} \mid X, \boldsymbol{\theta}) \;=\; -\tfrac{1}{2}\mathbf{y}^\top K_\theta^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\!|K_\theta| \;-\; \tfrac{n}{2}\log 2\pi $$
  • Kθcovariance matrix induced by kernel hyperparameters θ
  • fit term — yTK-1y penalizes poor data fit
  • complexity term — log|K| penalizes overly-flexible kernels

The log marginal likelihood automatically trades off fit vs. complexity — a built-in Occam's razor. Optimize it w.r.t. kernel hyperparameters to set length scales, amplitudes, and noise simultaneously.

Scaling. The cubic cost of inverting the N×N kernel matrix is the headline limitation. Sparse approximations represent the GP via m ≪ N inducing points and reduce cost to O(Nm²). FITC and DTC are early variants; SVGP (variational, Hensman et al.) is the modern workhorse and scales to millions of points via stochastic optimization.

Kernel composition. Sums and products of valid kernels are valid kernels. This lets you encode structure: RBF + Periodic for time series with seasonality, RBF × Linear for trends, etc. Automatic kernel discovery (Duvenaud et al.) tries to compose these structures from data.

Connection to neural networks. An infinitely-wide neural network with random Gaussian weights converges to a GP (Neal, 1996). The NTK (neural tangent kernel) describes the GP that trained networks effectively act as in the infinite-width limit — connecting GPs to deep learning theory.

Likelihood freedom. The closed-form posterior requires Gaussian likelihood. For classification or other non-Gaussian likelihoods, approximate inference is needed: Laplace, EP (expectation propagation), or variational methods.

Reach for it when

  • Bayesian optimization / experimental design
  • Spatial statistics (kriging is a GP)
  • You want interpretable hyperparameters (length scale = "how far is similar?")
  • Multi-fidelity modelling — cheap and expensive evaluations

Skip it when

  • You can't choose a sensible kernel (no inductive bias)
  • You need extrapolation — RBF kernels decay to the prior mean
  • Strict inference latency — sparse GPs are still slower than NNs
  • Heavy-tailed noise — Gaussian likelihood is too sensitive to outliers
import gpytorch, torch

class ExactGP(gpytorch.models.ExactGP):
    def __init__(self, X, y, likelihood):
        super().__init__(X, y, likelihood)
        self.mean   = gpytorch.means.ConstantMean()
        self.cov    = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean(x), self.cov(x))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model      = ExactGP(X_train, y_train, likelihood)
mll        = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

# Optimize marginal likelihood via Adam
model.train(); likelihood.train()
opt = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = -mll(model(X_train), y_train)
    loss.backward()
    opt.step()
Too dense?