Key idea

The hardest problem is often the easiest one in the right space. Linear models can't separate concentric circles in (x, y) — but they can in (x², y², xy). Decision trees can't see the day-of-week pattern unless you extract a "day" feature. Most of classical ML's success is feature engineering pretending to be modelling.

Consider a concentric-rings dataset. In raw (x, y) coordinates, no straight line separates the classes. Add x² and y² as features and a hyperplane in the higher-dimensional space cuts cleanly between them. The simplest possible engineered feature, r = √(x² + y²), makes the problem one-dimensional. The model didn't change; the inputs did.
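
A minimal sketch of the rings example, assuming scikit-learn's `make_circles` as a stand-in dataset (the `factor`, `noise`, and seed values are illustrative choices, not from the original). The same logistic regression is fit twice, once on raw coordinates and once on the radius alone; accuracies are measured on the training data for simplicity.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Two concentric rings: the inner ring is class 1, the outer ring is class 0
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

# Same model, raw (x, y) inputs: no straight line separates the rings
raw_acc = LogisticRegression().fit(X, y).score(X, y)

# Same model, one engineered input: the radius r = sqrt(x^2 + y^2)
r = np.sqrt((X ** 2).sum(axis=1)).reshape(-1, 1)
radius_acc = LogisticRegression().fit(r, y).score(r, y)

print(f"raw (x, y): {raw_acc:.2f}   radius only: {radius_acc:.2f}")
```

The raw-coordinate model should sit near chance while the radius model is close to perfect, which is the point of the section: only the inputs changed.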

Domain features. The highest-value features encode knowledge the model can't figure out by itself: hour-of-day, day-of-week, distance to a known landmark, account-age-in-days. Trees and linear models both benefit; even deep models often work much better with a handful of well-chosen engineered features.

Polynomial features. Replace (x, y) with (x, y, x², xy, y²). A linear model in this space is a quadratic in the original inputs, so it captures pairwise interactions automatically. The number of terms grows combinatorially with the number of input features, so combine with regularization or restrict the degree.
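
To put a number on the growth: with n inputs and maximum degree d, the count of polynomial terms (excluding the bias) is C(n + d, d) − 1. The sketch below checks that formula against scikit-learn's `PolynomialFeatures`; the helper name `n_poly_features` is made up for illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_inputs: int, degree: int) -> int:
    """Number of monomials of degree 1..degree in n_inputs variables (no bias)."""
    return comb(n_inputs + degree, degree) - 1


for n in (2, 10, 100):
    X_dummy = np.zeros((1, n))  # only the column count matters here
    poly = PolynomialFeatures(degree=2, include_bias=False).fit(X_dummy)
    # The closed-form count should match what scikit-learn generates
    assert poly.n_output_features_ == n_poly_features(n, 2)
    print(f"{n} inputs: {poly.n_output_features_} terms at degree 2, "
          f"{n_poly_features(n, 3)} at degree 3")
```

Two inputs give the five terms listed above; a hundred inputs already give thousands at degree 2, which is why restricting the degree or regularizing matters.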

Interaction features. Multiply or combine pairs (or triples) of features. Captures "this matters more when that is high". Essential for many tabular problems where a single feature is uninformative but the pair is predictive.
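
A small synthetic illustration of "uninformative alone, predictive as a pair", assuming an XOR-like label built from two random signs (the data here is invented for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.choice([-1.0, 1.0], size=2000)
b = rng.choice([-1.0, 1.0], size=2000)
y = (a * b > 0).astype(float)   # label is 1 when the two signs agree

# Each feature on its own is nearly uncorrelated with the label...
corr_a = np.corrcoef(a, y)[0, 1]
corr_b = np.corrcoef(b, y)[0, 1]

# ...but the product feature determines the label exactly
corr_ab = np.corrcoef(a * b, y)[0, 1]

print(f"corr(a, y)={corr_a:.3f}  corr(b, y)={corr_b:.3f}  corr(a*b, y)={corr_ab:.3f}")
```

A linear model given only a and b has nothing to work with here; given a·b as a third column, it solves the problem with a single weight.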

Binning & discretisation. Convert continuous features into categories. Makes non-linear thresholds trivial for linear models; can hurt with trees, which already discover thresholds. Useful with target encoding (mean target per bin).

Time features. Hour, day, month, weekend, holiday, time-since-event, rolling means. Most of forecasting is feature engineering on the time axis.
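
A hedged pandas sketch of two of the less obvious items in that list, time-since-event and a rolling mean. The tiny `df` with columns `ts` and `amount` is made up for the example; `closed="left"` is used so each row's rolling feature only sees strictly earlier rows.

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 17:30",
                          "2024-01-03 08:15", "2024-01-05 12:00"]),
    "amount": [10.0, 20.0, 30.0, 40.0],
}).sort_values("ts")

# Time since the previous event, in hours (NaN for the first row)
df["hours_since_prev"] = df["ts"].diff().dt.total_seconds() / 3600

# Mean amount over the trailing 3 days, excluding the current row
df["amount_mean_3d"] = (df.set_index("ts")["amount"]
                          .rolling("3D", closed="left").mean()
                          .to_numpy())

print(df)
```

Excluding the current row is a deliberate choice: if the feature at time t includes the value observed at t, a forecasting model can end up reading its own target.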

Worth investing in

  • Tabular data with strong domain context (finance, healthcare, supply chain)
  • Small datasets — a well-chosen feature can substitute for many training examples
  • Linear models or shallow trees — they can't synthesise features themselves
  • Forecasting — time features dominate

Less critical when

  • You have huge data and a deep network can synthesise its own features
  • The input is already a useful representation (image pixels, embeddings)
  • You're using gradient boosting — handles interactions well already
  • Domain knowledge is unavailable or untrustworthy

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer

# Polynomial features: degree 2, including squares and pairwise interactions
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(X)         # for inputs (x, y): x, y, x², xy, y²

# Quantile binning + one-hot — useful with linear models
disc = KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile")
X_bin = disc.fit_transform(X[["income"]])

# Time features from a datetime column
df["hour"]    = df["ts"].dt.hour
df["dow"]     = df["ts"].dt.dayofweek
df["weekend"] = df["dow"].isin([5, 6]).astype(int)
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)   # periodic encoding

The kernel trick, learned embeddings, and feature selection

The kernel trick: $$ K(x, x') = \langle \phi(x), \phi(x') \rangle $$
  • φ: implicit feature map into a high-dimensional space
  • Working with K directly avoids computing φ — feature engineering for free, if K is well-chosen
  • SVMs, GPs, kernel ridge regression all exploit this

The kernel trick. Many classical algorithms (SVM, ridge regression, PCA) only need inner products between data points — never the individual vectors. Replace the inner product with a kernel K(x, x') and you've implicitly engineered features in a (possibly infinite-dimensional) space. The RBF kernel is the most common; polynomial and string kernels are useful for structured data.
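
A small numerical check of that identity for one concrete case, the degree-2 polynomial kernel K(x, z) = (x · z)² in two dimensions, whose explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²). The function names are illustrative.

```python
import numpy as np


def phi(v):
    """Explicit feature map for K(x, z) = (x . z)^2 in two dimensions."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly_kernel(x, z):
    """The same inner product, computed without ever building phi."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # inner product in the 3-D feature space
implicit = poly_kernel(x, z)        # one dot product in the original 2-D space

print(explicit, implicit)
```

Both routes give the same number. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel route is available.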

Learned embeddings. Deep networks learn features. The first layers of a CNN find edges; the middle layers find textures; the last find object parts. Modern foundation models (CLIP, ViT, BERT) are reusable feature extractors — embed your data once with their backbone and run classical models on top.

Feature selection. Filter (correlation, mutual information), wrapper (forward / backward selection), embedded (LASSO, tree importances), and permutation-importance methods. Useful when you have hundreds of candidate features and need to pick the few that matter.
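
As one example of the embedded family, here is a sketch using scikit-learn's `Lasso` on synthetic data where only the first three of twenty columns carry signal. The data, coefficients, and `alpha=0.1` are assumptions made for the illustration; in practice alpha would be tuned, for instance by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
# Only columns 0, 1, 2 influence the target; the other 17 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)

# The L1 penalty drives uninformative coefficients to exactly zero
kept = np.flatnonzero(lasso.coef_)
print("features with non-zero coefficients:", kept)
```

The selection falls out of fitting the model itself, which is what distinguishes embedded methods from filters and wrappers.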

Target encoding properly. Replace categories with the mean target value, but do it inside CV folds with smoothing. Without those, target encoding leaks target information and overstates training performance dramatically.
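
A minimal sketch of out-of-fold target encoding with smoothing. The helper `target_encode_oof`, its parameters, and the shrinkage formula (sum + m·prior) / (count + m) are one common choice assumed for illustration, not a prescribed implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_oof(cat: pd.Series, y: pd.Series, n_splits: int = 5,
                      smoothing: float = 10.0, seed: int = 0) -> pd.Series:
    """Encode each row using target statistics from the other folds only."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        prior = y.iloc[fit_idx].mean()
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        # Shrink each category mean toward the prior; rare categories shrink most
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the fitting folds fall back to the prior
        encoded.iloc[enc_idx] = (cat.iloc[enc_idx].map(smooth)
                                 .fillna(prior).to_numpy())
    return encoded


# Toy data: three categories and a random binary target
cat = pd.Series(list("aabbbbccab") * 10)
y = pd.Series(np.random.default_rng(0).integers(0, 2, size=100))
enc = target_encode_oof(cat, y)
print(enc.head())
```

No row's encoding ever sees that row's own target, which is what removes the leak. At prediction time the encoding would be refit on the full training set and applied to new data.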

Periodic features. Hour-of-day shouldn't be encoded as 0–23 (23 isn't "far from" 0). Use sin(2πh/24), cos(2πh/24) — wraps continuously. Same for day-of-year, day-of-week.
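
A quick check of the wrap-around property, using hypothetical helpers `encode_hour` and `dist`: in the (sin, cos) encoding, 23:00 and 00:00 are exactly as close as 11:00 and 12:00.

```python
import numpy as np


def encode_hour(h):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * np.pi * np.asarray(h) / 24
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


def dist(h1, h2):
    """Euclidean distance between two hours in the encoded space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))


d_23_0 = dist(23, 0)     # one hour apart, across midnight
d_11_12 = dist(11, 12)   # one hour apart, at midday
d_0_12 = dist(0, 12)     # opposite sides of the clock

print(d_23_0, d_11_12, d_0_12)
```

Both components are needed: sin alone cannot distinguish 01:00 from 11:00, and cos alone cannot distinguish 06:00 from 18:00.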

Aggregations and joins. For tabular problems with related tables (transactions per user, clicks per ad), the right features are usually aggregates: count, mean, max, last-N-day rolling sum. SQL is feature engineering at scale.

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Mutual information feature selection — works with non-linear relationships
mi = SelectKBest(mutual_info_classif, k=20)
X_top = mi.fit_transform(X, y)
print(X.columns[mi.get_support()])     # which features were kept

# Aggregations per user (e.g. for fraud detection)
agg = (df.groupby("user_id")["amount"]
         .agg(["count", "mean", "std", "max"])
         .add_suffix("_amount"))
df = df.merge(agg, on="user_id")

Automated feature engineering, deep feature synthesis, and AutoML

Deep Feature Synthesis: $$ \mathcal{F} = \{\,g \circ \text{Aggr}(R, f) : f \in \mathcal{F}_{\text{prim}}, R \in \mathcal{R},\, g \in \mathcal{G}\,\} $$
  • Stack primitive transforms (count, mean, max, …) over relationships (R) to enumerate features automatically
  • Featuretools' algorithm — replaces a lot of hand feature-engineering for relational data

Featuretools and Deep Feature Synthesis. Algorithmic feature engineering for relational data. Given the schema of your tables, DFS automatically constructs features by stacking aggregations and transformations across relationships. Useful for fast prototyping, especially when you don't yet know which features matter.

AutoML feature pipelines. AutoGluon, H2O, TPOT, and others bundle preprocessing + feature engineering + model selection into a single search. Saves time; the resulting pipelines are often opaque but competitive on tabular benchmarks.

tsfresh and time-series feature libraries. Hundreds of statistical features automatically extracted from each time series — autocorrelation, spectral entropy, change points. Combined with feature selection, often beats hand-tuned features.
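
To make the idea concrete without relying on any library's API, here is a hand-rolled NumPy sketch of the kind of per-series statistics such tools compute. The function names and the two synthetic series are invented for the example; this is not tsfresh code.

```python
import numpy as np


def autocorr(x, lag: int) -> float:
    """Sample autocorrelation of a series at the given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


def extract_features(x) -> dict:
    """A handful of summary statistics for one time series."""
    return {
        "mean": float(np.mean(x)),
        "std": float(np.std(x)),
        "autocorr_lag1": autocorr(x, 1),
        "autocorr_lag12": autocorr(x, 12),
    }


t = np.arange(240)
rng = np.random.default_rng(0)
seasonal = np.sin(2 * np.pi * t / 12) + 0.1 * rng.normal(size=t.size)  # period 12
noise = rng.normal(size=t.size)                                         # no structure

feats_seasonal = extract_features(seasonal)
feats_noise = extract_features(noise)
print(feats_seasonal)
print(feats_noise)
```

The lag-12 autocorrelation separates the two series cleanly, so a downstream classifier could tell seasonal from non-seasonal series from that one column. Feature libraries generate hundreds of such columns and leave the choice to feature selection.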

Pretrained embeddings as features. Take a column of free text and run it through a sentence transformer. The resulting fixed-length vector (768 dimensions for many models) can be concatenated with your tabular features and dropped into a downstream model. Same for product descriptions, URLs, addresses.

Causal features. When the goal is to estimate a treatment effect, the features that matter are different — they should make the treatment exogenous conditional on them. See the causal inference literature for the right framework (DAGs, propensity scores, doubly-robust estimation).

Adversarial feature engineering. Feature importance reveals what the model depends on, and therefore what an adversary might attack. Robust features (harder to manipulate) sometimes deserve preference over more predictive ones.

import featuretools as ft

# DFS over a relational dataset (users + transactions)
es = ft.EntitySet("payments")
es.add_dataframe(dataframe=users,    dataframe_name="users",    index="user_id")
es.add_dataframe(dataframe=transactions, dataframe_name="txns", index="txn_id",
                 time_index="ts")
es.add_relationship(parent_dataframe_name="users",
                    parent_column_name="user_id",
                    child_dataframe_name="txns",
                    child_column_name="user_id")

features, defs = ft.dfs(entityset=es, target_dataframe_name="users",
                        agg_primitives=["count", "mean", "max", "trend"],
                        trans_primitives=["hour", "day", "month"],
                        max_depth=2)
# features: a pandas DataFrame with auto-generated columns like
#   MEAN(txns.amount), MAX(txns.amount), TREND(txns.amount, ts), …