Missing Data — ML Resources Hub

Key idea

Missing isn't random. People skip survey questions for reasons; sensors fail in patterns; missingness itself carries information. The question is whether you can ignore the why (drop), pretend it's noise (impute), or model it explicitly (use missingness as a feature).

A 2D dataset with 30% of one column's values missing. Mean imputation pulls the missing rows to the column mean — visibly distorts the distribution. Median is more robust to outliers. k-NN imputes from nearby (in the observed dimensions) rows — preserves local structure. Iterative (MICE) regresses each missing column on the others and iterates — captures correlations. Adding a missingness flag lets the model learn from the fact something was missing.

Drop rows. The simplest answer. Fine when missingness is rare (< 1–2% of rows) and unrelated to the target. Wasteful when data is precious.

Drop columns. If a column is > ~50% missing, often easier to drop. But beware: a column with a high missing rate and strong signal in the observed rows is exactly the kind of column you should keep + flag.

Mean / median / mode imputation. Fast, simple, available in every library. Mean preserves the mean; median is more robust; mode for categorical. All three shrink the variance of the imputed column — downstream models think it's less variable than it is.

k-NN imputation. Fill each missing cell from the average of its k nearest neighbours (computed on the observed columns). Preserves local structure; can be slow on large datasets.
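As a minimal sketch of the idea, assuming scikit-learn's KNNImputer with its default NaN-aware Euclidean distance: the toy array X_toy below is made up for illustration, and the second column is twice the first. The k-NN fill follows the local trend, while the column mean does not.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: column 1 is 2 * column 0, and the last value is missing.
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, np.nan],
])

# Distances are computed on the observed column only (column 0 here),
# so the two nearest rows are those with column 0 equal to 4 and 3.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X_toy)

knn_fill = X_knn[4, 1]                  # average of 8 and 6
mean_fill = np.nanmean(X_toy[:, 1])     # what mean imputation would use
print(knn_fill, mean_fill)
```

With k = 2 the fill is 7.0 against a column mean of 5.0. Neither recovers the underlying value of 10, because averaging neighbours cannot extrapolate beyond the observed range.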

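Iterative imputation (MICE). Regress each column with missing values on all the others, fill in the predictions, and repeat until the fills stabilise. A strong general-purpose default on tabular data. scikit-learn implements it as sklearn.impute.IterativeImputer, which is still experimental and needs the enable_iterative_imputer import first.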

Add a missing-flag. Add a binary column "was column X missing for this row?". Lets the model see the missingness pattern itself — often more useful than the imputed value.
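A small pandas sketch of the flag-then-impute pattern. The DataFrame df_flag and the column name income are hypothetical. In a real project the fill value would be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df_flag = pd.DataFrame({"income": [52.0, np.nan, 61.0, np.nan, 48.0]})

# Record the missingness BEFORE imputing, otherwise the information is lost.
df_flag["income_missing"] = df_flag["income"].isna().astype(int)

# Then fill the gaps (median of the observed values 52, 61, 48 is 52).
df_flag["income"] = df_flag["income"].fillna(df_flag["income"].median())
print(df_flag)
```

The order matters: once the column is filled, nothing distinguishes an imputed 52 from a real one unless the flag was created first.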

Pick your strategy

< 1% missing → drop rows
Moderate missing, tabular → MICE + missingness flag
Sequential / time-series → forward-fill or interpolate, then flag
Trees (XGBoost, LightGBM) → handle missing natively, often best

Common mistakes

Imputing on the full dataset before splitting — leakage
Mean imputation everywhere — shrinks variance, biases models
Treating "missing" as a magic value like -999 — trees treat it as just another number
Forgetting that missingness itself can be predictive

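from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Candidate imputers
mean   = SimpleImputer(strategy="mean")
median = SimpleImputer(strategy="median")
knn    = KNNImputer(n_neighbors=5, weights="distance")
mice   = IterativeImputer(max_iter=10, random_state=0)

# Indicator columns: append a 0/1 flag for each feature that had missing values at fit time
indicator = SimpleImputer(add_indicator=True, strategy="median")

# Always inside a Pipeline so the imputer is fitted on TRAIN ONLY
# (X_train, y_train and X_test are assumed to exist already)
pipe = Pipeline([("impute", indicator), ("model", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

# XGBoost / LightGBM handle NaN natively, often the best answer for tabular data
import xgboost as xgb
xgb.XGBClassifier().fit(X_train, y_train)   # NaNs OK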

Want MAR/MCAR/MNAR theory and multiple imputation?

The Little & Rubin taxonomy $$ \text{MCAR: } P(M\mid X, Y) = P(M)\quad\text{MAR: } P(M\mid X, Y) = P(M\mid X)\quad\text{MNAR: depends on } Y $$

M: indicator of missingness (X: observed values, Y: the values that may be missing)
MCAR: completely random (rare in practice)
MAR: depends only on observed values (the most common assumption)
MNAR: depends on the missing value itself (the hard case)

MCAR vs MAR vs MNAR. Missing Completely At Random: the missingness pattern is independent of everything (essentially never true). Missing At Random: depends only on observed variables (the assumption behind most imputation methods). Missing Not At Random: depends on the missing value itself (e.g., people with high income don't disclose income). MNAR is the hard case — needs domain knowledge or modelled missingness mechanism.
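The three mechanisms can be simulated to see what each does to a simple estimate. This is a sketch on synthetic data: the variables age and income, the logistic missingness probabilities, and all coefficients are assumptions chosen for illustration. It compares the naive mean of the observed incomes with a regression-adjusted mean that uses the always-observed age.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.normal(40, 10, n)                      # always observed
income = 30 + 0.5 * age + rng.normal(0, 5, n)    # the column that goes missing

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Probability that income is missing under each mechanism
mechanisms = {
    "MCAR": np.full(n, 0.3),               # constant, unrelated to anything
    "MAR": sigmoid((age - 40) / 5),        # depends on observed age only
    "MNAR": sigmoid((income - 50) / 5),    # depends on income itself
}

true_mean = income.mean()
naive_means, adjusted_means = {}, {}
for name, p_missing in mechanisms.items():
    observed = rng.random(n) >= p_missing
    naive_means[name] = income[observed].mean()
    # Regression adjustment: fit income ~ age on observed rows, predict for everyone
    slope, intercept = np.polyfit(age[observed], income[observed], 1)
    adjusted_means[name] = (intercept + slope * age).mean()

print(true_mean, naive_means, adjusted_means)
```

In this setup the naive mean is fine under MCAR and biased under MAR and MNAR. Conditioning on age repairs the MAR case but not the MNAR case, which is why MNAR needs extra assumptions about the mechanism.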

Multiple imputation. Don't impute once and pretend the gap was filled. Impute multiple times (drawing from the predictive distribution), fit your model on each completed dataset, and pool the results. statsmodels.imputation.mice.MICEData and the fancyimpute / miceforest libraries implement this. Critical for valid inference (confidence intervals) on partially missing data.
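A sketch of the impute, fit, pool loop with Rubin's rules for a single regression slope. It assumes scikit-learn's IterativeImputer with sample_posterior=True as the imputation engine (one possible choice, not the only one). The synthetic data, the helper ols_slope_and_variance, and m = 5 are illustrative assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)          # true slope is 2.0
data = np.column_stack([x, y])
data[rng.random(n) < 0.3, 0] = np.nan     # 30% of x missing (MCAR in this toy case)

def ols_slope_and_variance(x_col, y_col):
    """Slope of y on x (with intercept) and its estimated sampling variance."""
    design = np.column_stack([np.ones_like(x_col), x_col])
    beta, *_ = np.linalg.lstsq(design, y_col, rcond=None)
    resid = y_col - design @ beta
    sigma2 = resid @ resid / (len(y_col) - 2)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta[1], cov[1, 1]

m = 5
estimates, variances = [], []
for seed in range(m):
    # sample_posterior=True draws each fill from a predictive distribution,
    # so different seeds give genuinely different completed datasets.
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(data)
    slope, var = ols_slope_and_variance(completed[:, 0], completed[:, 1])
    estimates.append(slope)
    variances.append(var)

# Rubin's rules: pooled estimate, within- and between-imputation variance
q_bar = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between
pooled_se = np.sqrt(total_var)
print(q_bar, pooled_se)
```

The between-imputation term is what a single imputation silently drops; it is the part of the standard error that reflects not knowing the missing values.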

Why mean imputation hurts. Replacing a fraction p of a column with its observed mean multiplies the column's variance by 1 - p, so the variance shrinks by exactly the fraction missing. Correlations with other columns are attenuated, regression coefficients can be biased, standard errors come out too small, and any downstream test will be over-confident. Multiple imputation fixes this; missingness flags partially help.
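The variance shrinkage is easy to check numerically. A sketch on synthetic data (the normal distribution and the 30% missing rate are arbitrary choices), using the population variance (ddof=0):

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=3.0, size=10_000)
missing = rng.random(values.size) < 0.3

observed = values[~missing]
# Mean imputation: every missing cell gets the mean of the observed cells
imputed = np.where(missing, observed.mean(), values)

frac_missing = missing.mean()
var_ratio = imputed.var() / observed.var()   # population variance (ddof=0)
print(frac_missing, var_ratio)
```

The imputed cells sit exactly at the mean and add nothing to the sum of squared deviations, while the row count grows, so var_ratio equals 1 - frac_missing up to floating-point error.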

Tree-based models handle NaN. XGBoost, LightGBM, CatBoost, and modern scikit-learn HistGradientBoosting all have built-in NaN handling — they learn a default direction at each split. Often outperforms any imputation strategy on tabular data.

Time-series patterns. Forward-fill ("carry the last observed value forward") and linear interpolation are common. Both assume slow-varying signals; for fast-varying or irregular sampling, fit a model (Kalman filter, Gaussian process) and impute from the posterior.
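The two simple fills look like this in pandas. The daily series ts is a made-up example; the flag is recorded before filling so the model can still see where the gaps were.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps
ts = pd.Series(
    [1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

was_missing = ts.isna().astype(int)           # keep the flag before filling
ffilled = ts.ffill()                          # carry the last observation forward
interpolated = ts.interpolate(method="linear")  # straight line between neighbours

print(pd.DataFrame({"raw": ts, "ffill": ffilled, "interp": interpolated, "flag": was_missing}))
```

Forward-fill only uses the past, so it is safe in a forecasting setup. Linear interpolation uses the next observed value as well, which leaks future information if the filled series feeds a model that predicts forward in time.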

Sensitivity analysis. Try multiple imputation methods and compare. Big differences mean missingness is doing real work; choose the method whose assumptions match your domain.

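import miceforest as mf
import numpy as np
import pandas as pd

# Multiple imputation by chained equations (MICE) with random-forest steps.
# df is assumed to be a feature DataFrame containing NaNs. Argument names follow
# the original snippet and may differ between miceforest versions.
kernel = mf.ImputationKernel(df, num_datasets=5, save_all_iterations=True)
kernel.mice(iterations=10)
df_imputed_list = [kernel.complete_data(dataset=i) for i in range(5)]

# Fit the model on each completed dataset and average the predictions
# (the point-estimate half of Rubin's rules).
# model, y and a fully observed X_test are assumed to exist.
preds = [model.fit(d, y).predict(X_test) for d in df_imputed_list]
final = np.mean(preds, axis=0)    # pooled prediction
spread = np.var(preds, axis=0)    # variance across imputations = imputation uncertainty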

Want missingness as a feature, MNAR models, and selection-bias correction?

Imputation with the observed likelihood $$ \hat\theta = \arg\max_\theta \int p(X_\text{obs}, X_\text{miss} \mid \theta)\, dX_\text{miss} $$

EM is the canonical algorithm for missing-data MLE
Treats imputation and parameter estimation as joint optimisation
Requires a (correct) model for the data, and also for the missingness unless the mechanism is ignorable (MAR)
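A compact EM sketch for one concrete case: a bivariate Gaussian where x is always observed and y is missing with a probability that depends on x. It assumes MAR, so the missingness mechanism is ignorable and only the data model is fitted. The synthetic data, the 200 iterations, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)            # true mean of y is 1.0
missing = rng.random(n) < 1 / (1 + np.exp(-2 * x))     # MAR: depends on observed x
y_obs = np.where(missing, np.nan, y)

# Initialise from the complete cases
mu = np.array([x.mean(), np.nanmean(y_obs)])
cov = np.cov(x[~missing], y_obs[~missing])
naive_mean_y = mu[1]

for _ in range(200):
    # E-step: expected y and y^2 for missing rows, given x and current parameters
    slope = cov[0, 1] / cov[0, 0]
    resid_var = cov[1, 1] - slope * cov[0, 1]
    cond_mean = mu[1] + slope * (x - mu[0])
    ey = np.where(missing, cond_mean, y_obs)
    ey2 = np.where(missing, cond_mean**2 + resid_var, y_obs**2)
    # M-step: maximum-likelihood updates from the expected sufficient statistics
    mu = np.array([x.mean(), ey.mean()])
    cxy = (x * ey).mean() - mu[0] * mu[1]
    cyy = ey2.mean() - mu[1] ** 2
    cov = np.array([[x.var(), cxy], [cxy, cyy]])

em_mean_y = mu[1]
print(naive_mean_y, em_mean_y)
```

The E-step adds resid_var to the expected square, which is what separates EM from plugging in the conditional mean: it keeps the variance of the filled values honest.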

MNAR modelling. When missingness depends on the unobserved value (income, drug-use, sensitive variables), no purely observational fix is unbiased. Two general approaches: selection models (model P(missing | X, Y) explicitly, e.g. Heckman correction) and pattern-mixture models (model the distribution conditional on missingness pattern). Both require strong assumptions that you should be transparent about.

Missingness as a feature. In some domains the missingness is the signal. EMR data: a test wasn't ordered because the doctor didn't think it was necessary — the missingness encodes information. Add binary flags or use models that natively support missingness.

Generative imputation. Modern approaches use GAIN (Generative Adversarial Imputation Nets), normalising flows, or VAEs to impute. They can be competitive on heterogeneous tabular data, but are often overkill when MICE already works.

Selection bias correction. When whether a row is observed is related to the target, propensity-score weighting or doubly-robust estimation can correct biased estimates, but only if you correctly model the selection mechanism from observed covariates. Domain expertise is irreplaceable here.
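A sketch of inverse-propensity weighting for a mean, under the assumption that selection depends only on an observed covariate x. The synthetic data, the logistic selection model, and all coefficients are made up for illustration. If selection also depended on unobserved quantities, this correction would not remove the bias.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)            # true mean of y is 1.0
p_obs = 1 / (1 + np.exp(-(0.5 - 1.5 * x)))        # high-x rows are rarely observed
observed = rng.random(n) < p_obs

# Step 1: estimate each row's probability of being observed (the propensity)
propensity_model = LogisticRegression().fit(x.reshape(-1, 1), observed)
p_hat = propensity_model.predict_proba(x.reshape(-1, 1))[:, 1]

# Step 2: weight observed rows by 1 / propensity and take a normalised mean
weights = 1.0 / p_hat[observed]
naive_mean = y[observed].mean()
ipw_mean = np.sum(weights * y[observed]) / np.sum(weights)
print(naive_mean, ipw_mean)
```

Rows that were unlikely to be observed get large weights and stand in for the similar rows that went missing. Very small propensities make the estimate noisy, which is one reason doubly-robust estimators are often preferred.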

Out-of-distribution missingness. Train-time and serve-time missingness patterns often differ. Build robustness into both the model and the monitoring — flag prediction-time inputs where the missingness pattern is unusual.
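One simple way to monitor this, sketched with pandas: record the set of row-level missingness patterns seen in training, flag serve-time rows whose pattern never occurred, and track per-column missing rates. The frames train and serve and the helper missing_patterns are hypothetical.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})
serve = pd.DataFrame({
    "a": [np.nan, 2.0, np.nan],
    "b": [np.nan, 2.0, 1.0],
    "c": [1.0, np.nan, 3.0],
})

def missing_patterns(frame):
    """Encode each row's missingness pattern as a tuple of booleans."""
    return [tuple(row) for row in frame.isna().to_numpy()]

# Patterns the model has seen during training
seen_patterns = set(missing_patterns(train))

# Flag serve-time rows with a pattern that never appeared in training
unseen_pattern = [p not in seen_patterns for p in missing_patterns(serve)]

# Per-column drift in missing rate (serve minus train)
rate_shift = serve.isna().mean() - train.isna().mean()
print(unseen_pattern, rate_shift.to_dict())
```

Here the first two serve rows are flagged: one has a and b missing together, the other has c missing, and neither pattern occurred in training.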

Right-censoring. A special case: events that haven't happened yet are "missing" in time-to-event data. Survival analysis (Kaplan-Meier, Cox PH) handles this without imputing.
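The Kaplan-Meier estimator is short enough to write by hand in NumPy. The function kaplan_meier and the toy durations below are illustrative. A value of 0 in event_observed marks a censored subject, who still counts as at risk up to the censoring time.

```python
import numpy as np

def kaplan_meier(durations, event_observed):
    """Return distinct event times and the survival estimate just after each."""
    durations = np.asarray(durations, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)
    event_times = np.unique(durations[event_observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                    # still under observation at t
        events = np.sum((durations == t) & event_observed)  # events exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

durations = [2, 3, 3, 5, 8, 8, 9]
event_observed = [1, 0, 1, 1, 0, 1, 0]   # 0 = censored, 1 = event seen
times, surv = kaplan_meier(durations, event_observed)
print(times, surv)
```

Censored subjects never get an imputed event time. They shrink the at-risk set after they leave, which is how the estimator uses their partial information.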

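import numpy as np
import statsmodels.api as sm
from scipy.stats import norm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Iterative imputer with a non-linear estimator: captures non-linear relationships
# between columns (X is assumed to be a numeric feature matrix containing NaNs)
mice = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=10, random_state=0,
)
X_imp = mice.fit_transform(X)

# Heckman two-step correction for selection bias (income-disclosure example)
# Assumed inputs: disclosed (boolean array), income, and design matrices X_sel and
# X_out that already include a constant column
# Step 1: model the probability of disclosure as probit
sel = sm.Probit(disclosed.astype(int), X_sel).fit()
z   = np.asarray(X_sel) @ np.asarray(sel.params)   # probit linear predictor
# Step 2: include the inverse Mills ratio phi(z) / Phi(z) as a regressor in the outcome model
imr = norm.pdf(z) / norm.cdf(z)
out = sm.OLS(income[disclosed], np.c_[X_out[disclosed], imr[disclosed]]).fit()
# Note: the plain OLS standard errors ignore that imr is itself estimated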
