The simplest, most interpretable models — and still the right answer surprisingly often.
Key idea
Predict the output as a weighted sum of the inputs. Linear regression does this for continuous outputs (predicting a number). Logistic regression squishes the output through an S-curve to predict probabilities (yes/no). Both are interpretable, fast, and surprisingly hard to beat on simple problems.
The fit minimises the sum of squared vertical distances (the faint orange lines). Try the Sine preset, then crank the degree up — degree 9 will pass through almost every point but wiggle wildly in between. That's overfitting in two clicks.
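The overfitting effect described above can be reproduced in code. The following is a minimal sketch, assuming noisy samples from a sine curve; the data, noise level, and seed are illustrative choices, not taken from the demo. It fits polynomials of degree 1, 3, and 9 by least squares and compares the error on the training points with the error on a dense grid of the noise-free curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 noisy samples of one sine period (illustrative only).
x_train = np.linspace(0.0, 1.0, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)

# Dense grid of the noise-free curve, to measure error between the training points.
x_grid = np.linspace(0.0, 1.0, 200)
y_grid = np.sin(2 * np.pi * x_grid)

def fit_and_score(degree):
    """Least-squares polynomial fit; returns (train MSE, grid MSE)."""
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    grid_mse = np.mean((poly(x_grid) - y_grid) ** 2)
    return train_mse, grid_mse

results = {d: fit_and_score(d) for d in (1, 3, 9)}
for d, (tr, gr) in results.items():
    print(f"degree {d}: train MSE = {tr:.2e}, grid MSE = {gr:.2e}")
```

With 10 points, the degree-9 polynomial interpolates the training data, so its training error is essentially zero. The grid error is the number to watch: it measures what happens in between the points.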
Linear regression: ŷ = β₀ + β₁·x₁ + β₂·x₂ + …. Each coefficient βᵢ tells you how much the prediction changes when feature xᵢ increases by one unit and the other features stay fixed. A big positive coefficient on "square footage" means more square footage → higher predicted price. A negative coefficient on "age" means older → lower price.
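To make the weighted sum concrete, here is a small sketch with made-up coefficients for the house-price example. The numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical coefficients for a house-price model (made-up numbers).
intercept = 50_000.0                   # β₀
coefs = np.array([150.0, -1_000.0])    # β₁ per square foot, β₂ per year of age

def predict(features):
    """ŷ = β₀ + β₁·x₁ + β₂·x₂ for one row (or a 2-D array of rows)."""
    return intercept + np.asarray(features, dtype=float) @ coefs

base = predict([1_500, 20])      # 1,500 sq ft, 20 years old
bigger = predict([1_501, 20])    # one extra square foot
older = predict([1_500, 21])     # one extra year of age

print(base)            # 255000.0
print(bigger - base)   # 150.0   (equals β₁)
print(older - base)    # -1000.0 (equals β₂)
```

Changing one feature by one unit moves the prediction by exactly that feature's coefficient, which is what makes the model readable.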
Logistic regression is the same idea but for classification. Take the same weighted sum, then squash it through a sigmoid (S-shaped function) to get a probability between 0 and 1. The coefficients now tell you how much each feature moves the predicted log-odds.
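The squashing step and the log-odds reading of the coefficients can be sketched in a few lines. The intercept and coefficient below are hypothetical values picked for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

# Hypothetical model with one feature (made-up values).
beta0, beta1 = -1.0, 0.8

x = np.array([0.0, 1.0, 2.0])
z = beta0 + beta1 * x    # weighted sum = predicted log-odds
p = sigmoid(z)           # predicted probabilities

print(p.round(3))
# Each unit increase in x adds beta1 to the log-odds,
# i.e. multiplies the odds p / (1 - p) by exp(beta1).
print(np.diff(log_odds(p)))
```

The probabilities do not change by a constant amount per unit of x, but the log-odds do, which is why coefficients are read on the log-odds scale.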
Both are the workhorse baselines you should try before anything fancier. If a linear model gets 90% accuracy, a deep neural net adding 1% probably isn't worth it.
Reach for it when
Linear (or roughly linear) relationships
Interpretability matters — coefficients are readable
Few hundred to few thousand data points
You want a fast, hard-to-beat baseline before going complex
Skip it when
Strongly non-linear relationships (use trees / neural nets)
Feature interactions matter and you can't engineer them in
Very high-dimensional sparse data: needs careful regularization
The output is a sequence, image, or other structured object
X: design matrix (N × p; usually includes a column of ones for the intercept)
β: coefficient vector; a closed-form solution exists
Ordinary Least Squares (OLS). Linear regression with a squared-error loss has a closed-form solution, β̂ = (XᵀX)⁻¹Xᵀy (provided XᵀX is invertible), so no iteration is needed. Under the classical assumptions the estimator is unbiased and has minimum variance among linear unbiased estimators (the Gauss-Markov theorem).
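The closed-form OLS solution can be checked numerically. This is a minimal sketch on synthetic data; the true coefficients, noise level, and seed are assumptions made for the example. It solves the normal equations directly and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2 + 3*x1 - 1*x2 + noise (coefficients are made up).
N = 200
features = rng.normal(size=(N, 2))
true_beta = np.array([2.0, 3.0, -1.0])
X = np.column_stack([np.ones(N), features])   # column of ones for the intercept
y = X @ true_beta + rng.normal(scale=0.1, size=N)

# Normal equations: solve (XᵀX) β = Xᵀy rather than forming the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem via SVD and is
# more robust when XᵀX is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal.round(3))
print(beta_lstsq.round(3))
```

Both routes give the same estimate here. In practice the SVD-based solver is usually the safer default, because explicitly working with XᵀX squares the condition number.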
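Logistic regression. Replace the squared loss with cross-entropy and the linear output with a sigmoid: p̂ = σ(Xβ). There is no closed form, so solve it iteratively with gradient descent or Newton's method (IRLS). The loss is convex, so any minimum the optimizer reaches is the global one. The one caveat is perfectly separable data, where the unregularized coefficients grow without bound and a penalty is needed to keep them finite.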
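As an illustration of the iterative fit, here is a sketch of Newton's method (the IRLS update) for logistic regression in plain NumPy. The data are synthetic and deliberately not separable; the true coefficients, sample size, and stopping tolerance are assumptions for the example, not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical, non-separable data drawn from a known logistic model.
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
true_beta = np.array([-0.5, 1.5, -1.0])
y = (rng.random(N) < sigmoid(X @ true_beta)).astype(float)

def cross_entropy(beta):
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

beta = np.zeros(X.shape[1])
losses = [cross_entropy(beta)]
for _ in range(25):
    p = sigmoid(X @ beta)
    gradient = X.T @ (p - y)              # gradient of the summed loss
    W = p * (1 - p)                       # per-row IRLS weights
    hessian = X.T @ (X * W[:, None])      # Xᵀ diag(W) X
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step                    # Newton update
    losses.append(cross_entropy(beta))
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta.round(3), f"({len(losses) - 1} iterations)")
```

On well-behaved data like this, Newton's method typically needs only a handful of iterations. Plain gradient descent reaches the same optimum but usually takes many more, cheaper, steps.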
Regularization. When features are highly correlated, the variance of the OLS coefficients explodes; when you have more features than data points, XᵀX is singular and the OLS solution is not even unique. Add a penalty on the coefficients:
Ridge (L2): penalty = λ‖β‖². Shrinks all coefficients toward zero but rarely makes them exactly zero. Closed form: β̂ = (XᵀX + λI)⁻¹Xᵀy.
Lasso (L1): penalty = λ‖β‖₁. Drives many coefficients exactly to zero — performs feature selection. No closed form; solve via coordinate descent.
Elastic Net: mix of both. Often the right default when you're not sure.
Cross-validated λ. Sweep over a grid of penalty strengths and pick the one that minimises held-out error. LassoCV, RidgeCV, and ElasticNetCV do this automatically (scikit-learn calls the penalty strength alpha).
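The ridge closed form and its shrinkage effect can be sketched directly. The example below assumes two nearly identical synthetic features, which is the situation where OLS is unstable. The data and the grid of λ values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with two highly correlated features.
N = 100
x1 = rng.normal(size=N)
x2 = x1 + rng.normal(scale=0.01, size=N)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=N)

# Center so no intercept is needed (the intercept is normally left unpenalized).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge(lam):
    """Closed-form ridge estimate: solve (XᵀX + λI) β = Xᵀy."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    beta = ridge(lam)
    norms.append(np.linalg.norm(beta))
    print(f"λ = {lam:>5}: β = {beta.round(3)}, ‖β‖ = {norms[-1]:.3f}")
```

At λ = 0 this is plain OLS, and the two coefficients can land far from the true values of 1 and 1 because the features are almost indistinguishable. As λ grows, the coefficient norm shrinks and the estimates settle down, but neither coefficient is driven exactly to zero.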
Reach for it when
Feature engineering can capture the structure (polynomials, interactions, splines)
Many features, modest data — regularization wins
Sparse models needed (Lasso for feature selection)
You need calibrated probability estimates (logistic regression is usually well calibrated out of the box, though strong regularization or a misspecified model can degrade this)
Skip it when
Strong feature interactions that aren't engineered in
Discontinuous or sharply non-linear relationships
You want hierarchical / multi-output models — switch to Bayesian or GLM frameworks
The data is overwhelmingly non-tabular
from sklearn.linear_model import RidgeCV, LassoCV, LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

# Assumes X_train, X_test, y_train, y_test already exist (e.g. from train_test_split).

# Regression with cross-validated regularization strength
ridge = Pipeline([
    ("scale", StandardScaler()),
    ("reg", RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5)),
]).fit(X_train, y_train)
print(f"Chosen α: {ridge.named_steps['reg'].alpha_:.4f}")
print(f"R² test: {ridge.score(X_test, y_test):.3f}")

# Lasso for sparse coefficients
lasso = Pipeline([
    ("scale", StandardScaler()),
    ("reg", LassoCV(cv=5, max_iter=10000)),
]).fit(X_train, y_train)
selected = (lasso.named_steps["reg"].coef_ != 0).sum()
print(f"Lasso selected {selected} of {X_train.shape[1]} features")

# Classification: same pattern, but y_train must hold class labels here.
# L2 is shown; an L1 penalty also needs a compatible solver ("liblinear" or "saga").
clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=1000).fit(X_train, y_train)
print(f"Best C: {clf.C_}")
Want the GLM family, MLE derivation, and the assumptions?
Generalized Linear Models
$$ g(\mathbb{E}[y \mid x]) \;=\; \boldsymbol{\beta}^\top \mathbf{x}, \qquad y \sim \text{exponential family} $$
g: link function, e.g. identity (linear), logit (logistic), log (Poisson)
Linear regression, logistic, Poisson, gamma — all the same recipe with different link & likelihood
The GLM viewpoint. Linear regression assumes y | x ~ N(βᵀx, σ²). Logistic regression assumes y | x ~ Bernoulli(σ(βᵀx)). Both fit by maximum likelihood. Generalize: pick any distribution from the exponential family plus a link function g that maps the distribution's mean to the linear predictor (its inverse maps the linear predictor back to the mean). Poisson regression for count data, gamma for positive continuous, multinomial for multi-class.
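The maximum-likelihood claim can be made explicit for the two cases named above. The derivation below is a sketch under the stated distributional assumptions, with N independent observations (xᵢ, yᵢ) and pᵢ = σ(βᵀxᵢ).

```latex
% Gaussian case: y_i | x_i ~ N(beta^T x_i, sigma^2)
\log L(\boldsymbol{\beta}, \sigma^2)
  = -\frac{N}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - \boldsymbol{\beta}^\top \mathbf{x}_i\bigr)^2

% Only the second term depends on beta, so maximizing the likelihood
% is the same as minimizing the sum of squared errors:
\hat{\boldsymbol{\beta}}_{\text{MLE}}
  = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{N}\bigl(y_i - \boldsymbol{\beta}^\top \mathbf{x}_i\bigr)^2

% Bernoulli case: y_i | x_i ~ Bernoulli(p_i), p_i = sigma(beta^T x_i)
\log L(\boldsymbol{\beta})
  = \sum_{i=1}^{N}\bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]
```

The negative of the Bernoulli log-likelihood is the cross-entropy loss, so squared error and cross-entropy are the same recipe applied to two different likelihoods.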
OLS assumptions. Strictly speaking, OLS is unbiased and minimum-variance (among linear unbiased estimators) under: (1) linearity, (2) errors with zero mean given the features (exogeneity), (3) independent observations, (4) homoscedasticity (constant variance), (5) no perfect multicollinearity. Normality of residuals is needed only for exact small-sample inference (CIs, p-values), not for the point estimates.
Multicollinearity. When features are highly correlated, OLS coefficients become unstable — sign and magnitude can flip with small data changes. Diagnose with VIF (variance inflation factor). Fix with ridge regularization or by dropping / combining correlated features.
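The VIF diagnostic is simple enough to compute by hand. The sketch below uses the definition VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on the remaining features. The synthetic features are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features: x2 is almost a copy of x1, x3 is independent.
N = 500
x1 = rng.normal(size=N)
x2 = x1 + rng.normal(scale=0.1, size=N)
x3 = rng.normal(size=N)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R²_j), with R²_j from regressing column j on the others."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])   # include an intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residuals = target - A @ coef
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(float(v), 1) for v in vifs])
```

The two near-duplicate features get very large VIFs, while the independent one stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a warning sign. statsmodels provides the same calculation as `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.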
Regularization paths. As you increase λ from 0 to ∞, coefficients shrink (ridge) or hit zero one by one (lasso). Plotting the path is informative — shows you the order in which features matter.
Coordinate descent & LARS. The standard fast algorithms for lasso. LARS (Least Angle Regression) builds the full path in one pass — useful when you want to see all λ values, not just one.
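To show how coefficients hit zero as the penalty grows, here is a bare-bones sketch of cyclic coordinate descent for the lasso. It minimizes (1 / 2N)·‖y − Xβ‖² + λ‖β‖₁, the same scaling scikit-learn's Lasso uses. The data, the λ grid, and the fixed number of sweeps are assumptions for the example; real implementations add convergence checks and warm starts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 features, only the first 3 carry signal.
N, p = 200, 10
X = rng.normal(size=(N, p))
true_beta = np.zeros(p)
true_beta[:3] = [3.0, -2.0, 1.5]
y = X @ true_beta + rng.normal(scale=0.5, size=N)

# Standardize features and center the target, so no intercept is needed.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything within [-lam, lam] becomes 0."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1 / 2N)·‖y − Xβ‖² + lam·‖β‖₁."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta

counts = {}
for lam in (0.01, 0.1, 0.5, 5.0):
    beta = lasso_cd(X, y, lam)
    counts[lam] = int(np.count_nonzero(beta))
    print(f"λ = {lam}: {counts[lam]} non-zero coefficients")
```

The soft-threshold step is what produces exact zeros: a feature whose correlation with the current residual is smaller than λ is dropped entirely. Rerunning over a grid of λ values traces out a coarse version of the regularization path.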
Beyond linear. Generalized Additive Models (GAMs) keep additivity but let each term be a flexible function: ŷ = f₁(x₁) + f₂(x₂) + …. They stay interpretable like GLMs, because each fⱼ can be plotted on its own, while each term can capture a non-linear effect of its feature. Use pyGAM or statsmodels for these.
Reach for it when
You want interpretable coefficients with confidence intervals — GLM in statsmodels
Bayesian linear regression with priors on coefficients
GAMs when you want non-linear effects but additive structure
Skip it when
You need top-tier predictive accuracy with non-additive interactions
Sequential or spatial structure dominates
Deep representations are doing most of the work
The link function and likelihood don't match the data — better to use a flexible regressor
import statsmodels.api as sm

# Assumes X_train is a pandas DataFrame and y_count / y_binary are matching targets.
# GLM with explicit family / link: proper inference for coefficients
X_train_const = sm.add_constant(X_train)

# Poisson regression for count data (log link by default)
model = sm.GLM(y_count, X_train_const, family=sm.families.Poisson()).fit()
print(model.summary())  # coefficients, std errors, p-values, deviance

# Logistic regression with full inference (Binomial family, logit link by default)
logit = sm.GLM(y_binary, X_train_const, family=sm.families.Binomial()).fit()
print(f"Coefs: {logit.params.round(3).to_dict()}")
print(f"95% CI: {logit.conf_int().values.round(3)}")