Scaling, encoding, normalising — making the data match what the model can actually digest.
Key idea
Most models can't consume raw data directly. Linear models trained with regularisation or gradient descent suffer when features are on wildly different scales; trees don't mind. Distance-based models (k-NN, k-means) are dominated by whichever feature has the biggest numbers. Neural nets train faster with normalised inputs. Preprocessing covers all of this: encoding categories, handling missing values, normalising scales.
Interactive demo: toggling the scaler moves the decision boundary from useless to correct once the features are put on the same footing.
Same dataset, three transforms. On the raw data, feature 1 ranges over [0, 1000] while feature 2 is in [0, 1]. A k-NN classifier effectively ignores feature 2, because every neighbour decision is dominated by feature 1. After standardisation (subtract mean, divide by std), both features get equal say. Min-max scaling maps each feature to [0, 1]; robust scaling uses the median and IQR so outliers don't blow up the scale.
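The dominance effect described above can be reproduced with a few lines of NumPy. This is a minimal sketch on made-up points (the values, the `nearest` helper and the query are illustrative assumptions, not the dataset from the demo): it finds the single nearest neighbour of a query before and after standardising.

```python
import numpy as np

# Made-up training points: feature 1 is on a ~[0, 1000] scale, feature 2 on [0, 1].
X = np.array([
    [140.0, 0.1],
    [200.0, 0.9],
    [800.0, 0.2],
    [900.0, 0.8],
])
query = np.array([150.0, 0.9])

def nearest(points, q):
    """Index of the row of `points` closest to `q` in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

# Raw features: the distance is almost entirely the feature-1 difference.
nearest_raw = nearest(X, query)

# Standardise with statistics computed from the training points only.
mean, std = X.mean(axis=0), X.std(axis=0)
nearest_scaled = nearest((X - mean) / std, (query - mean) / std)

print(nearest_raw, nearest_scaled)
```

On this toy data the raw neighbour is row 0, which is close on feature 1 but far away on feature 2; after standardising, the neighbour becomes row 1, which matches the query on feature 2 as well.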
Standardisation (z-score): subtract the mean, divide by the standard deviation. Each feature has mean 0, std 1. Default for most linear models and neural networks.
Min-max scaling: linearly squash to [0, 1] (or [-1, 1]). Sensitive to outliers; useful for image pixels or where bounded input is required.
Robust scaling: subtract the median, divide by the IQR. Doesn't care about outliers. Reach for it when the distribution has heavy tails or known anomalies.
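Categorical encoding: one-hot for unordered (nominal) categories, ordinal for ordered ones, target encoding (mean target value per category) for high-cardinality categories, used carefully to avoid leakage.
Log / Box-Cox transforms: when the target or a feature is skewed (prices, counts, durations), a log transform often makes it Gaussian-ish. This helps linear models, distance-based models and neural nets; tree-based models are unaffected by monotone transforms of a feature, though transforming a skewed target can still help them.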
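To make the scaler definitions above concrete, here is a small sketch that applies each formula by hand to one made-up column containing a single outlier. The data and variable names are assumptions for illustration; in practice the scikit-learn scalers would normally be used instead.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up column with one outlier

# Standardisation: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Min-max scaling: map the observed range onto [0, 1].
mm = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the interquartile range.
q1, med, q3 = np.percentile(x, [25, 50, 75])
rb = (x - med) / (q3 - q1)

# Log transform (log1p handles zeros): compresses the long right tail.
logged = np.log1p(x)

print(mm)  # the four ordinary values are squeezed near 0 by the outlier
print(rb)  # the ordinary values keep a sensible spread; the outlier stays large
```

The min-max output shows why that scaler is called outlier-sensitive: one extreme value defines the range, so everything else lands within about 0.03 of zero. The robust version leaves the bulk of the data spread over roughly [-1, 0.5].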
Must preprocess for
k-NN, k-means, SVMs (especially with RBF kernels): all distance-based
Linear and logistic regression with regularization
Neural networks — converges faster, more stably
PCA / dimensionality reduction
Can usually skip for
Decision trees, random forests, gradient boosting
Naive Bayes (depending on the variant)
Models with built-in normalisation (BatchNorm, LayerNorm)
When all features are already on the same scale (pixels)
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    OneHotEncoder, OrdinalEncoder, PowerTransformer,
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# A real preprocessing pipeline, fit on TRAIN ONLY.
# X_train, y_train and X_test are assumed to come from an existing train/test split.
numerical_cols = ["age", "income", "balance"]
categorical_cols = ["country", "occupation"]

# Scale the numeric columns, one-hot encode the categorical ones.
prep = ColumnTransformer([
    ("num", StandardScaler(), numerical_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([("prep", prep), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)   # scaler and encoder fitted on TRAIN
pipe.predict(X_test)         # fitted transforms applied to TEST
Want target encoding, robust scalers, and quantile transforms?
Features on very different scales ⇒ ill-conditioned loss surface (Hessian) ⇒ slow / unstable gradient-descent convergence
Standardising puts all features at σ ≈ 1, which usually improves conditioning dramatically (strongly correlated features can still leave it poor)
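The conditioning claim can be checked numerically. For a least-squares loss on centred features, the Hessian is proportional to XᵀX, and its condition number governs how fast gradient descent converges. The sketch below uses synthetic, independent features (an assumption chosen to mirror the [0, 1000] versus [0, 1] example) and compares the condition number before and after standardising.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic, independent features on very different scales.
X = np.column_stack([
    rng.uniform(0, 1000, n),
    rng.uniform(0, 1, n),
])

def hessian_condition(features):
    """Condition number of the least-squares Hessian (X^T X / n) on centred features."""
    centred = features - features.mean(axis=0)
    hessian = centred.T @ centred / len(centred)
    return np.linalg.cond(hessian)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = hessian_condition(X)      # roughly the squared ratio of the scales
cond_std = hessian_condition(X_std)  # close to 1 when features are uncorrelated

print(cond_raw, cond_std)
```

Because the two features here are independent, standardising brings the condition number close to 1. With strongly correlated features it would stay well above 1, which is where whitening or PCA comes in.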
Target encoding for high-cardinality categories. Replace each category with its mean target value (e.g. mean of y per ZIP code). Captures the signal in a single feature — but if you compute it on the whole dataset before splitting, that's data leakage. Always compute target encodings within CV folds.
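A minimal sketch of out-of-fold target encoding follows, in plain Python. The function name, the interleaved fold assignment (which assumes the rows are already shuffled) and the additive smoothing toward the global mean are illustrative choices; they are not necessarily what any particular library does.

```python
from collections import defaultdict

def target_encode_oof(categories, y, n_folds=5, smoothing=10.0):
    """Out-of-fold mean-target encoding with additive smoothing toward the prior.

    Each row is encoded using target statistics from the OTHER folds only,
    so a row's own label never contributes to its encoded value.
    Assumes smoothing > 0 and that rows are already shuffled.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        fit_idx = [i for i in range(n) if i % n_folds != fold]
        prior = sum(y[i] for i in fit_idx) / len(fit_idx)

        sums, counts = defaultdict(float), defaultdict(int)
        for i in fit_idx:
            sums[categories[i]] += y[i]
            counts[categories[i]] += 1

        for i in held_out:
            c = categories[i]
            # Rare categories shrink toward the prior; unseen ones get the prior itself.
            encoded[i] = (sums[c] + smoothing * prior) / (counts[c] + smoothing)
    return encoded

zips = ["a", "a", "b", "b", "a", "b"]
labels = [1, 1, 0, 0, 1, 0]
print(target_encode_oof(zips, labels, n_folds=2, smoothing=1.0))
```

At prediction time the encoder would be refitted on the full training set and applied to the test rows, which never contributed labels.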
Quantile transforms. Map each feature to its rank, then optionally to a Gaussian. Robust to outliers, makes the feature uniform or normal. Useful for skewed distributions or when downstream models assume Gaussian inputs.
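The rank-then-Gaussian idea can be sketched with the standard library alone. This is a simplified illustration (the `rank_gauss` name and the mid-rank convention are assumptions, ties are broken by input order, and scikit-learn's `QuantileTransformer` interpolates between quantiles rather than working exactly like this).

```python
from statistics import NormalDist

def rank_gauss(values):
    """Map values to ranks, then to uniform (0, 1), then to standard-normal quantiles."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()
    out = [0.0] * n
    for rank, i in enumerate(order):
        u = (rank + 0.5) / n                 # mid-rank keeps u strictly inside (0, 1)
        out[i] = standard_normal.inv_cdf(u)  # uniform -> Gaussian
    return out

# Only the ordering matters, so an extreme outlier has no extra influence.
print(rank_gauss([1.0, 10.0, 1000.0]))
print(rank_gauss([1.0, 10.0, 1e9]))
```

Both calls print the same values, which is the robustness property in action: the transform sees only ranks, not magnitudes.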
Power transforms (Box-Cox, Yeo-Johnson). Parametric families of monotone transforms whose parameter λ is fitted so that the output is as close to Gaussian as possible. Yeo-Johnson handles zero and negative values; Box-Cox needs strictly positive inputs. Both stabilise variance, which matters for regression with heteroscedastic noise.
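For reference, the two families can be written out explicitly. These are the standard textbook definitions, with λ the parameter that is fitted (typically by maximum likelihood) for each feature:

$$ \text{Box-Cox } (x > 0): \quad x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log x, & \lambda = 0 \end{cases} $$

$$ \text{Yeo-Johnson}: \quad x^{(\lambda)} = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda}, & x \ge 0,\ \lambda \neq 0 \\ \log(x+1), & x \ge 0,\ \lambda = 0 \\ -\dfrac{(1-x)^{2-\lambda} - 1}{2-\lambda}, & x < 0,\ \lambda \neq 2 \\ -\log(1-x), & x < 0,\ \lambda = 2 \end{cases} $$

Setting λ = 1 leaves the data unchanged up to a shift, and λ = 0 recovers the plain log transform, so the log is one member of the family rather than a separate technique.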
Encoders for ordered categories. Ordinal encoding when there's a natural order ("low" < "medium" < "high"). One-hot when there isn't. Trees handle ordinal encoding fine; linear models often need one-hot to avoid imposing a false ordering.
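The difference between the two encodings is easy to see by hand. The sketch below uses made-up category values and plain Python rather than the scikit-learn encoders, purely to show what each representation looks like.

```python
# Ordered category: encode by position in an explicitly stated order.
sizes = ["low", "high", "medium", "low"]
order = ["low", "medium", "high"]
rank = {level: i for i, level in enumerate(order)}
ordinal = [rank[s] for s in sizes]

# Unordered category: one indicator column per distinct value.
colours = ["red", "green", "blue", "green"]
vocab = sorted(set(colours))  # column order: blue, green, red
one_hot = [[int(c == v) for v in vocab] for c in colours]

print(ordinal)   # [0, 2, 1, 0]
print(one_hot)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Feeding `colours` through the ordinal route would tell a linear model that "red" is somehow larger than "blue", which is the false ordering the paragraph above warns about.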
Sparse handling. One-hot encoding with thousands of categories blows up memory if stored densely. Sparse matrices, the hashing trick, and learnable embeddings are all common alternatives; feature crosses, which multiply cardinality further, are usually combined with hashing.
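As one illustration of these alternatives, the hashing trick maps every category to one of a fixed number of columns without storing a vocabulary. The sketch below is a hand-rolled version with assumed helper names and an MD5-based hash chosen only because it is deterministic across runs; scikit-learn's `FeatureHasher` is the usual production route.

```python
import hashlib

def hash_bucket(value, n_buckets=16):
    """Deterministically map a string category to a column index in [0, n_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hashed_one_hot(value, n_buckets=16):
    """Sparse row as {column_index: 1.0}; only the active column is stored."""
    return {hash_bucket(value, n_buckets): 1.0}

# Memory is bounded by n_buckets no matter how many distinct categories appear.
for zip_code in ["94110", "10001", "60614"]:
    print(zip_code, hashed_one_hot(zip_code))
```

The trade-off is collisions: two categories can share a column, so the number of buckets is a tuning choice between memory and fidelity.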
Preprocess inside the CV loop. The single most common preprocessing mistake is fitting the scaler / encoder on the full dataset before splitting. The validation fold's mean / std then leaks into the training fold's "scaled" features. Pipelines exist to enforce the right order.
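The effect of the leak is easiest to see with an exaggerated case. In this sketch (synthetic data, with one deliberately extreme row placed in the validation fold as an assumption), the scaler statistics are computed both the wrong way, on all rows, and the right way, on the training fold only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))
X[-1, 0] = 10_000.0  # extreme row that falls in the validation fold

train, valid = X[:80], X[80:]

# Wrong: statistics from ALL rows, so the validation fold shapes the training features.
leaky_mean, leaky_std = X.mean(axis=0), X.std(axis=0)
train_leaky = (train - leaky_mean) / leaky_std

# Right: statistics from the training fold only, then applied to both folds.
mean, std = train.mean(axis=0), train.std(axis=0)
train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std

print(train_leaky.std(), train_scaled.std())
```

With the leaky statistics the training features are crushed to a tiny spread by a row the model should never have seen; with fold-local statistics they have unit spread and the extreme row simply shows up as a large value at validation time.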
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from category_encoders import TargetEncoder  # third-party package

# X_train, X_test, y_train and X_train_numeric are assumed to exist already.

# Yeo-Johnson power transform for skewed features
# (method="box-cox" is the alternative for strictly positive data)
pt = PowerTransformer(method="yeo-johnson")
X_pt = pt.fit_transform(X_train_numeric)

# Rank-Gaussian transform for very skewed data
qt = QuantileTransformer(output_distribution="normal")
X_qt = qt.fit_transform(X_train_numeric)

# Target encoding with smoothing, for high-cardinality categories
te = TargetEncoder(smoothing=10.0)
X_te = te.fit_transform(X_train[["zip"]], y_train)

# Fit on train only; reuse the fitted encoder on test
X_te_test = te.transform(X_test[["zip"]])
Want feature stores, online preprocessing, and lookup tables?
Pre-processing as a learned function
$$ \mathbf{x}_{\text{enc}} = f_\phi(\mathbf{x}_{\text{raw}}) \quad \text{(differentiable, jointly trained)} $$
Embeddings, BatchNorm, LayerNorm — preprocessing as part of the model
Updated jointly with the rest of the parameters
Generalises classical scalers — but harder to enforce no-leakage by construction
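Feature stores. Production ML systems share preprocessed features across training and online inference. A feature store (Feast, Tecton, Vertex AI Feature Store, etc.) is essentially a database of precomputed features keyed by entity ID. Shared, versioned feature definitions prevent train/serve skew, and time-travel (point-in-time) queries let training read each feature as it was when the label was observed, which prevents leakage from the future.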
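A time-travel query boils down to "the latest value known at or before a given timestamp". The sketch below shows that lookup for a single entity with a hypothetical in-memory log; the data layout and the `value_as_of` name are assumptions for illustration and do not correspond to any feature store's actual API.

```python
from bisect import bisect_right

# Hypothetical feature log for one entity: (timestamp, value), sorted by timestamp.
feature_log = [(1, 10.0), (5, 12.0), (9, 20.0)]

def value_as_of(log, ts):
    """Latest feature value with timestamp <= ts, or None if nothing was known yet."""
    times = [t for t, _ in log]
    i = bisect_right(times, ts)
    return log[i - 1][1] if i else None

# A label observed at time 6 must be joined with the value known at time 6,
# not with the later value written at time 9.
print(value_as_of(feature_log, 6))
```

Joining every training label this way means the model only ever sees feature values that would have been available at prediction time.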
Train / serve skew. The classic deployment bug: training pipeline computes feature X one way, the online serving system computes it differently. Hard to detect until production data hits. Mitigation: share the same code, or enforce schema and statistic checks at both endpoints.
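One simple form of the statistic check mentioned above compares per-feature summary statistics recorded at training time with those observed at serving time. The function name, the tolerance and the numbers below are all hypothetical; real systems typically use richer tests, but the shape of the check is the same.

```python
def skewed_features(train_stats, serve_stats, tol=0.25):
    """Names of features missing at serving time, or whose serving mean
    moved by more than `tol` training standard deviations."""
    flagged = []
    for name, (mean, std) in train_stats.items():
        if name not in serve_stats:
            flagged.append(name)
            continue
        serve_mean, _ = serve_stats[name]
        if abs(serve_mean - mean) > tol * std:
            flagged.append(name)
    return flagged

# Hypothetical (mean, std) summaries; serving computes income in thousands by mistake.
train_stats = {"age": (40.0, 12.0), "income": (52_000.0, 18_000.0)}
serve_stats = {"age": (41.0, 12.5), "income": (52.0, 18.0)}

print(skewed_features(train_stats, serve_stats))  # ['income']
```

A unit mismatch like this one is invisible to schema checks, since the column name and type are unchanged, which is why statistic checks are worth running at both endpoints.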
BatchNorm and friends as preprocessing. BatchNorm normalises activations by their mini-batch statistics — essentially "scale, but inside the model". LayerNorm and RMSNorm do similar things along different axes. The interaction with explicit weight decay is subtle (see Regularization).
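The "different axes" point can be shown directly in NumPy. This sketch covers only the normalisation step on a made-up activation matrix; it deliberately omits the learnable scale and shift parameters and the running statistics that the real layers maintain.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up activations: 4 examples (rows) x 3 features (columns) on different scales.
h = rng.normal(size=(4, 3)) * np.array([1.0, 10.0, 100.0])
eps = 1e-5

# BatchNorm-style: normalise each FEATURE across the mini-batch (axis 0).
batch_norm = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

# LayerNorm-style: normalise each EXAMPLE across its features (axis 1).
layer_norm = (h - h.mean(axis=1, keepdims=True)) / np.sqrt(
    h.var(axis=1, keepdims=True) + eps
)

print(batch_norm.mean(axis=0))  # ~0 per feature
print(layer_norm.mean(axis=1))  # ~0 per example
```

The batch-wise version is the one that resembles a classical scaler, since it equalises feature scales; the layer-wise version does not depend on the other examples in the batch.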
Embedding tables for categories. Instead of one-hot, learn a dense vector per category. Reduces dimensionality and captures similarity automatically. Standard for high-cardinality categorical inputs in deep learning (users, items, tokens).
Streaming statistics. When data is streamed and you can't fit it all in memory, use online formulas (Welford's algorithm) to compute mean and variance incrementally. Same idea behind StandardScaler.partial_fit in scikit-learn.
Preprocessing for fairness. Some preprocessing is itself a fairness intervention — re-weighting examples, removing protected attributes, learning fair representations. See the Fairness page for details.
import torch
import torch.nn as nn

# Embedding table: learnable preprocessing for high-cardinality categories
class CategoricalEmbed(nn.Module):
    def __init__(self, n_categories, d=8):
        super().__init__()
        self.emb = nn.Embedding(n_categories, d)

    def forward(self, x):        # x: (B,) long tensor of category indices
        return self.emb(x)       # (B, d) dense vector per category

# Online mean/std using Welford's algorithm
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.M2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; returns 0.0 until at least two values are seen
        return (self.M2 / max(1, self.n - 1)) ** 0.5