Compress high-dimensional data to a smaller representation that preserves what matters.
Key idea
Most high-dim data lives on a lower-dim "shape". A photo has millions of pixels, but only a few hundred numbers can describe most of what's in it. Dimensionality reduction finds those few numbers.
Interactive demo: slide θ to rotate the projection axis and watch the 1D histogram widen and narrow; "Find PC1" snaps to the max-variance angle.
PCA is exactly this game played analytically: find the axis that maximises projected variance. The faint indigo cross-hairs are the principal axes the model found (PC1 and PC2); the orange axis is yours to spin. When yours matches PC1, the histogram is at its widest. The number under it is what people mean by "captured variance": the fraction of the data's total variance that survives the projection. For the tilted dataset, one component captures ~90% of it.
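The spinning-axis game can be reproduced numerically. The sketch below is a minimal illustration under stated assumptions: the tilted cloud is synthetic (a 3:1 Gaussian rotated by 30 degrees, chosen here for illustration and not taken from the demo), and the sweep uses a 0.5 degree grid.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical tilted 2D cloud: long axis rotated 30 degrees from the x-axis
n = 2000
long_axis = rng.normal(scale=3.0, size=n)
short_axis = rng.normal(scale=1.0, size=n)
tilt = np.deg2rad(30)
R = np.array([[np.cos(tilt), -np.sin(tilt)],
              [np.sin(tilt),  np.cos(tilt)]])
X = np.column_stack([long_axis, short_axis]) @ R.T
X = X - X.mean(axis=0)

# Sweep the projection angle and record the variance of each 1D projection
thetas = np.deg2rad(np.arange(0, 180, 0.5))
directions = np.column_stack([np.cos(thetas), np.sin(thetas)])  # unit vectors
proj_var = (X @ directions.T).var(axis=0, ddof=1)

best_theta = np.rad2deg(thetas[np.argmax(proj_var)])
captured = proj_var.max() / X.var(axis=0, ddof=1).sum()

# PCA's analytic answer for comparison (the sign of a component is arbitrary)
pca = PCA(n_components=2).fit(X)
pc1 = pca.components_[0]
pca_theta = np.rad2deg(np.arctan2(pc1[1], pc1[0])) % 180

print(f"sweep: {best_theta:.1f} deg, PCA: {pca_theta:.1f} deg")
print(f"captured variance: {captured:.3f} vs {pca.explained_variance_ratio_[0]:.3f}")
```

The brute-force sweep and PCA should agree to within the grid resolution; PCA simply gets there in closed form instead of by search.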
Two reasons to do this. First, visualization: you can't plot 50 dimensions, but you can plot 2. Algorithms like t-SNE and UMAP project high-dim data into 2D in a way that preserves which points are near each other.
Second, compression and denoising: dropping low-variance directions, which are often mostly noise, can make downstream models faster and sometimes more accurate. PCA is the classical workhorse here: it finds the directions of greatest variance and projects onto them.
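To make the denoising claim concrete, here is a small sketch on synthetic data. Everything about the dataset is an assumption for illustration: 50 observed dimensions, a true 3-dimensional signal, and Gaussian noise with standard deviation 0.5.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples in 50 dims that really live in a 3-dim subspace
n, d, k = 500, 50, 3
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X_clean = latent @ mixing
X_noisy = X_clean + rng.normal(scale=0.5, size=(n, d))

pca = PCA(n_components=k).fit(X_noisy)
Z = pca.transform(X_noisy)               # compressed representation, shape (500, 3)
X_denoised = pca.inverse_transform(Z)    # mapped back into the original 50-dim space

err_noisy = np.mean((X_noisy - X_clean) ** 2)
err_denoised = np.mean((X_denoised - X_clean) ** 2)
print(f"MSE vs clean signal: noisy={err_noisy:.3f}, denoised={err_denoised:.3f}")
```

Projecting onto the top components discards the noise that falls outside the signal subspace, which is why the round trip lands closer to the clean data than the noisy input does. On real data the true dimensionality is unknown, so the gain depends on how k is chosen.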
Reach for it when
Visualizing high-dim data in 2D / 3D
Speeding up downstream models on high-dim features
Removing redundant correlated features
Feature engineering for tabular ML
Skip it when
You need every feature to be interpretable — projected dimensions are mixes
Each feature already carries unique signal (no redundancy)
You're modelling images / text — let the model learn the embedding
You need to reconstruct the original space exactly
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize first — PCA is variance-based, scale-sensitive
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)
X_2d = pca.transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Cumulative: {pca.explained_variance_ratio_.cumsum()}")
Want PCA's math and the t-SNE / UMAP story?
PCA via SVD
$$ X_{\text{centered}} \;=\; U \Sigma V^\top, \qquad \text{principal components} = V_{:,1{:}k} $$
V: right singular vectors, the principal directions in feature space
Σ: singular values (magnitudes); squared, they are proportional to the variance along each direction
The top k columns of V give the best k-dim linear approximation of the centered data in the least-squares (L2) sense
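The SVD recipe translates almost line for line into NumPy. This is a minimal sketch on random correlated data (the data and the choice k = 2 are assumptions for illustration), with scikit-learn's PCA fitted alongside for comparison.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # hypothetical correlated data
k = 2

Xc = X - X.mean(axis=0)                        # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:k]                            # rows of V^T: top-k principal directions
scores = Xc @ components.T                     # projected data, equal to U[:, :k] * S[:k]
explained_var = S**2 / (X.shape[0] - 1)        # squared singular values become variances
ratio = explained_var / explained_var.sum()

pca = PCA(n_components=k).fit(X)
print(ratio[:k], pca.explained_variance_ratio_)
```

The directions match scikit-learn's up to sign, since a singular vector and its negation describe the same axis.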
PCA finds the linear projection that preserves the most variance. It's optimal among linear projections under squared-error loss, fast (closed form via SVD), interpretable (each component is a weighted combination of original features), and gives you reconstruction errors for free. Downside: linear only, so it can't unfold curved manifolds.
t-SNE models pairwise similarities in high-dim space as Gaussian, in low-dim space as Student-t (heavier tails). Minimizes KL divergence between the two. Preserves local neighborhood structure beautifully but distorts global distances. Stochastic — different runs give different layouts. Don't use for downstream modelling, only visualization.
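The objective is easier to see in code than in words. The sketch below computes the two similarity matrices and their KL divergence for two candidate layouts. It is deliberately simplified: it uses one shared Gaussian bandwidth (an assumption made here), whereas real t-SNE tunes a per-point bandwidth to match the perplexity, and it only evaluates the loss without optimizing it.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    sq = np.sum(A**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0)

def gaussian_affinities(X, sigma=1.0):
    """High-dim similarities P. Simplification: one shared bandwidth sigma."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def student_t_affinities(Y):
    """Low-dim similarities Q from a Student-t kernel with one degree of freedom."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the quantity t-SNE minimizes over the layout Y."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)

# Two hypothetical well-separated clusters in 10 dimensions
X = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(8, 1, (20, 10))])
P = gaussian_affinities(X, sigma=3.0)

# One layout keeps the clusters apart, the other scatters all points at random
Y_good = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
Y_bad = rng.normal(0, 5, (40, 2))

kl_good = kl_divergence(P, student_t_affinities(Y_good))
kl_bad = kl_divergence(P, student_t_affinities(Y_bad))
print(f"KL for cluster-preserving layout: {kl_good:.2f}")
print(f"KL for scrambled layout:          {kl_bad:.2f}")
```

Because P puts almost all its mass on within-cluster pairs, the loss mostly punishes layouts that pull true neighbors apart. That is the mechanism behind good local structure and unreliable global distances.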
UMAP is t-SNE's faster successor. Builds a fuzzy topological graph (in effect a weighted nearest-neighbor graph), then optimizes a low-dim embedding to match it. Faster than t-SNE, often better at preserving global structure, and reproducible when you fix a seed. Has become a default visualization method for embeddings.
ICA (Independent Component Analysis) finds projections that are statistically independent, not just uncorrelated. Recovers the source signals, up to order, sign, and scale, when they are linearly mixed and at most one of them is Gaussian. The classic application is blind source separation (the "cocktail party problem").
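A toy cocktail party shows the idea. This sketch assumes two made-up sources (a sine and a square wave) and an arbitrary 2×2 mixing matrix, and uses scikit-learn's `FastICA` with default settings.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two hypothetical non-Gaussian sources: a sine wave and a square wave
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
S += 0.05 * rng.normal(size=S.shape)

# Each "microphone" records a different linear mix of the two sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)

# Order, sign, and scale are not identifiable, so compare by |correlation|
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(corr.round(2))
```

Each row of `corr` should have one entry close to 1, meaning every true source is matched by one recovered component. PCA on the same mixtures would only decorrelate them and would generally not recover the sources.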
Reach for it when
PCA: denoising, decorrelating features for downstream models
t-SNE / UMAP: 2D visualization of high-dim embeddings
Skip it when
You need to preserve global distances exactly (t-SNE / UMAP distort them)
Non-linear structure with no global trend (PCA fails)
Features are sparse — PCA densifies and destroys sparsity benefits
Embeddings will be used as features downstream — t-SNE / UMAP outputs aren't designed for that
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package (pip install umap-learn)
# Reduce to 50 dims with PCA first — t-SNE / UMAP are slow on raw high-dim data
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)
# Then non-linear embedding to 2D
X_tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X_pca)
X_umap = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X_pca)
# PCA -> t-SNE / UMAP is the standard recipe for visualizing image / text embeddings
Want the manifold view, autoencoders, and pitfalls?
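Probabilistic PCA
$$ x \;=\; W z + \mu + \varepsilon, \qquad z \sim \mathcal{N}(0, I), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I) $$
PCA = max-likelihood limit (σ² → 0) of this latent-variable model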
Generalizes to factor analysis, mixture-of-PPCA, and variational autoencoders
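The probabilistic PCA reading has a practical side: scikit-learn's `PCA` exposes the fitted noise variance as `noise_variance_` and the model log-likelihood via `score`. The sketch below samples from a linear-Gaussian latent-variable model with assumed, illustrative settings (10 dimensions, 2 latent factors, σ² = 0.25) and checks that the noise estimate is the mean of the discarded eigenvalues.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data drawn from the model itself: x = W z + noise, true sigma^2 = 0.25
n, d, k = 5000, 10, 2
W = rng.normal(size=(d, k))
X = rng.normal(size=(n, k)) @ W.T + rng.normal(scale=0.5, size=(n, d))

pca_k = PCA(n_components=k, svd_solver="full").fit(X)
pca_all = PCA(n_components=d, svd_solver="full").fit(X)

# The noise variance estimate is the mean of the eigenvalues PCA throws away
sigma2_est = pca_all.explained_variance_[k:].mean()
print(f"sigma^2 estimate: {sigma2_est:.3f} (sklearn: {pca_k.noise_variance_:.3f})")

# Average per-sample log-likelihood under the probabilistic PCA model
print(f"mean log-likelihood: {pca_k.score(X):.3f}")
```

The log-likelihood gives a principled way to compare different values of k on held-out data, something plain reconstruction error on the training set cannot do.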
PCA pitfalls. (1) Scale matters — components are dominated by high-variance features unless you standardize. (2) Linear only — fails on curved manifolds like the Swiss roll. (3) Outliers can hijack the top components — use robust PCA (decomposition into low-rank + sparse) if outliers are a concern.
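Pitfall (1) is easy to demonstrate. In this sketch the two features are synthetic and equally informative, but the second is recorded in units 1000 times larger (an assumed setup for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical correlated features, the second in much larger units
a = rng.normal(size=1000)
b = a + 0.5 * rng.normal(size=1000)
X = np.column_stack([a, 1000.0 * b])

pc1_raw = PCA(n_components=1).fit(X).components_[0]
X_scaled = StandardScaler().fit_transform(X)
pc1_scaled = PCA(n_components=1).fit(X_scaled).components_[0]

print("raw PC1 loadings:   ", np.abs(pc1_raw).round(3))     # weight piles onto feature 2
print("scaled PC1 loadings:", np.abs(pc1_scaled).round(3))  # both features weighted equally
```

Without scaling, PC1 is essentially just the large-unit feature. After standardizing, it becomes the shared direction of the two features, which is usually what you wanted.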
Choosing k. Cumulative explained variance to a threshold (e.g. 95%). Scree-plot elbow. Cross-validated reconstruction error. For downstream classification, choose k via held-out task performance — the variance criterion isn't task-aligned.
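The variance-threshold rule can be sketched in a few lines. The example uses scikit-learn's bundled digits dataset purely as a stand-in, and the 95% threshold is the illustrative value from the text, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)  # 1797 samples, 64 pixel features

# Fit all components once, then read k off the cumulative curve
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k_95 = int(np.searchsorted(cumulative, 0.95) + 1)

# scikit-learn shortcut: a float in (0, 1) requests that fraction of variance
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
print(k_95, pca_95.n_components_)
```

Plotting `cumulative` against the component index gives the curve whose elbow the scree-plot heuristic looks for.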
Kernel PCA uses the kernel trick to do PCA in a feature space implicitly defined by a kernel. Solves the linearity limitation but loses the interpretability of axes.
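Concentric rings are the standard picture for why the kernel version helps. The sketch below compares linear PCA with scikit-learn's `KernelPCA`; the RBF kernel and `gamma=10` are illustrative choices, and a logistic regression's training accuracy is used only as a rough probe of linear separability.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric rings: no straight line separates them in the original space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)
# gamma=10 is an illustrative setting, not a tuned value
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Training accuracy of a linear classifier, used only as a separability probe
acc_lin = LogisticRegression().fit(Z_lin, y).score(Z_lin, y)
acc_rbf = LogisticRegression().fit(Z_rbf, y).score(Z_rbf, y)
print(f"linear PCA: {acc_lin:.2f}, RBF kernel PCA: {acc_rbf:.2f}")
```

Linear PCA can only rotate the rings, so they stay entangled. The kernel components are functions of distance to the training points and no longer correspond to any combination of the original axes, which is the interpretability cost mentioned above.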
Manifold learning. Isomap (geodesic distances + classical MDS), LLE (locally linear embedding), Laplacian eigenmaps. Each makes a different assumption about local structure. t-SNE and UMAP dominate in practice, but the older methods have stronger theoretical guarantees for specific manifold types.
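Isomap on the Swiss roll is the textbook case. This sketch uses scikit-learn's `make_swiss_roll` and `Isomap`; the sample count and `n_neighbors=10` are assumptions chosen for illustration, and the check is simply whether any embedding axis tracks the position along the rolled-up sheet.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# t is the position along the rolled-up sheet, the coordinate we hope to recover
X, t = make_swiss_roll(n_samples=1500, random_state=0)

Z_pca = PCA(n_components=2).fit_transform(X)
# n_neighbors=10 is illustrative; too large a value lets the graph jump between layers
Z_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

def best_abs_corr(Z, target):
    """Largest |correlation| between any embedding axis and the target."""
    return max(abs(np.corrcoef(Z[:, j], target)[0, 1]) for j in range(Z.shape[1]))

corr_pca = best_abs_corr(Z_pca, t)
corr_iso = best_abs_corr(Z_iso, t)
print(f"PCA: {corr_pca:.2f}, Isomap: {corr_iso:.2f}")
```

Isomap measures distance along the neighbor graph instead of straight through space, so it can unroll the sheet where PCA only sees the spiral from the side.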
Autoencoders are the deep-learning generalization. Encoder maps x → z, decoder maps z → x̂; train to minimize reconstruction error. Bottleneck dimension = embedding dim. VAEs add a probabilistic prior on z, giving you a generative model.
t-SNE caveats. Distances between clusters in a t-SNE plot are not reliably meaningful. Cluster sizes are not meaningful. The apparent number of clusters depends on perplexity as much as on the data. Always re-run with several perplexity values to check stability; see Wattenberg et al., "How to Use t-SNE Effectively".
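A perplexity sweep takes only a short loop. The sketch below is one possible setup, assuming a 300-sample slice of scikit-learn's digits dataset and three arbitrary perplexity values; in practice you would plot each layout side by side.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = load_digits().data[:300]                      # small subset to keep runs quick
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

layouts = {}
for perplexity in (5, 30, 50):                    # must stay below the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    layouts[perplexity] = tsne.fit_transform(X_pca)

for perplexity, Y in layouts.items():
    print(perplexity, Y.shape, Y.std(axis=0).round(1))
```

Structure that shows up at every perplexity is more likely to reflect the data; structure that appears at only one setting deserves suspicion.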
Reach for it when
Robust PCA — large outliers expected
Kernel PCA — known kernel inductive bias and small / moderate data
Variational autoencoders — generative model + embedding in one
Topological data analysis — UMAP's preserved structure feeds persistent homology
Skip it when
Embedding must be stable under small data changes (t-SNE / UMAP are sensitive)
You need to invert the embedding back to original space (most manifold methods have no inverse map)
Out-of-sample extension matters — t-SNE doesn't natively transform new points
You care about preserved geodesic distances — only Isomap claims that
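import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in, d_latent):
        super().__init__()
        # Encoder compresses d_in features down to the d_latent bottleneck
        self.enc = nn.Sequential(
            nn.Linear(d_in, 256), nn.ReLU(),
            nn.Linear(256, d_latent),
        )
        # Decoder mirrors the encoder back up to d_in
        self.dec = nn.Sequential(
            nn.Linear(d_latent, 256), nn.ReLU(),
            nn.Linear(256, d_in),
        )

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

# X is assumed to be a 2D array of shape (n_samples, n_features)
X = torch.as_tensor(X, dtype=torch.float32)
model = Autoencoder(d_in=X.shape[1], d_latent=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Full-batch training; switch to mini-batches for larger datasets
for epoch in range(50):
    x_hat, _ = model(X)
    loss = loss_fn(x_hat, X)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Embedding is model.enc(X), computed without tracking gradients
with torch.no_grad():
    Z = model.enc(X)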