Stacked layers of weighted sums and nonlinearities — the foundation everything else builds on.
Key idea
Stack simple operations until something complex emerges. Each layer does a weighted sum of its inputs, then bends the result through a nonlinear function. Stack enough of these and you can approximate a very wide range of functions, which is what makes tasks like image classification, language translation, and game playing learnable.
Click any heatmap to set the input point — see how each neuron responds
Each hidden neuron shows its activation across the entire input space — that's the heatmap inside the circle. Different neurons learn to detect different regions of the input. The output neuron on the right combines them into the final decision (indigo = class 0, orange = class 1). Drag the marker to see how the network responds to different points. Try the XOR-like pattern — it's the canonical "this can't be done with a single linear layer" example.
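To make the XOR claim concrete, here is a minimal NumPy sketch. The weights `W1`, `b1`, `w2` are hand-picked for illustration (they are assumptions, not what a trained network would necessarily learn). It compares a two-unit ReLU network against the best least-squares fit a single linear layer can manage.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Hand-picked weights (illustrative, not learned).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])        # shape (inputs, hidden)
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

hidden = relu(X @ W1 + b1)         # h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
xor_pred = hidden @ w2             # y = h1 - 2 * h2

# Best a single linear layer can do: least squares with a bias column.
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ coef

print("two-layer ReLU net:", xor_pred)
print("best linear fit:   ", linear_pred)
```

The ReLU network reproduces the labels exactly, while the linear fit can only hedge with the same value for all four points.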
A neural network is just a chain of layers. Each layer takes a vector in, multiplies it by a matrix of weights, adds a bias, and runs the result through a "nonlinearity" (like ReLU: max(0, x)). The nonlinearity is essential — without it, no matter how many layers you stack, the whole thing collapses to a single linear function.
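The collapse without a nonlinearity can be checked numerically. The sketch below uses random weights with arbitrarily chosen shapes and a fixed seed (all assumptions for illustration) to show that two stacked linear layers equal one linear layer, and that inserting ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                     # batch of 5 inputs, 3 features

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two layers with NO nonlinearity in between.
two_linear = (x @ W1 + b1) @ W2 + b2

# The same map written as one layer: W = W1 W2, b = b1 W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
one_linear = x @ W + b

# With ReLU in between, the equivalence no longer holds.
with_relu = np.maximum(0.0, x @ W1 + b1) @ W2 + b2

collapses = np.allclose(two_linear, one_linear)
relu_differs = not np.allclose(with_relu, one_linear)
print(collapses, relu_differs)
```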
Training works by gradient descent: nudge each weight in the direction that reduces the prediction error, computed via the chain rule (a.k.a. backpropagation).
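As a rough illustration of what backpropagation computes, the sketch below writes out the chain rule by hand for a tiny one-hidden-layer network, checks one gradient entry against a finite difference, and takes a single gradient descent step. The data, shapes, and learning rate are made up, and tanh is used instead of ReLU only so the finite-difference check is smooth.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))           # 8 examples, 3 features
y = rng.normal(size=8)                # regression targets
W1 = rng.normal(size=(3, 4)) * 0.5    # input -> hidden weights
w2 = rng.normal(size=4) * 0.5         # hidden -> output weights

def loss_fn(W1, w2):
    h = np.tanh(X @ W1)
    pred = h @ w2
    return np.mean((pred - y) ** 2)

def backprop(W1, w2):
    # Forward pass, keeping intermediates for the backward pass.
    z = X @ W1
    h = np.tanh(z)
    pred = h @ w2
    # Backward pass: chain rule, from the loss back to each weight.
    d_pred = 2.0 * (pred - y) / len(y)    # dL/dpred
    d_w2 = h.T @ d_pred                   # dL/dw2
    d_h = np.outer(d_pred, w2)            # dL/dh
    d_z = d_h * (1.0 - h ** 2)            # tanh'(z) = 1 - tanh(z)^2
    d_W1 = X.T @ d_z                      # dL/dW1
    return d_W1, d_w2

d_W1, d_w2 = backprop(W1, w2)

# Finite-difference check on a single weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(W1_plus, w2) - loss_fn(W1_minus, w2)) / (2 * eps)

# One gradient descent step: nudge every weight against its gradient.
lr = 0.01
loss_before = loss_fn(W1, w2)
loss_after = loss_fn(W1 - lr * d_W1, w2 - lr * d_w2)
print(numeric, d_W1[0, 0], loss_before, loss_after)
```

In practice an autograd framework such as PyTorch does the backward pass for you; the manual version is only meant to show what is being computed.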
Reach for it when
Tabular data and you want more flexibility than a linear model
You have plenty of data and compute
You'll use this as a building block in a bigger model
You want a starting point before reaching for CNNs / transformers
Skip it when
Small tabular data — gradient boosting usually wins
Data has obvious structure (images, sequences, graphs) — use a specialized architecture
You need interpretability of individual predictions
You're short on training data
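import torch.nn as nn

# Placeholder sizes (assumed for illustration): set these from your dataset
n_features, n_classes = 20, 3

model = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, n_classes),   # raw logits; pair with nn.CrossEntropyLoss
)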
Training. Define a loss ℒ(θ), take its gradient with respect to all parameters via backpropagation (chain rule applied layer-by-layer), and step in the negative-gradient direction. SGD, Adam, AdamW are the optimizers people actually use.
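To show what an optimizer step looks like under the hood, here is a minimal NumPy sketch of plain gradient descent next to the standard Adam update, both minimizing a toy quadratic. The loss, the `target` vector, the step counts, and the learning rates are assumptions chosen for illustration. Real training would use `torch.optim` rather than these hand-rolled loops.

```python
import numpy as np

def sgd_minimize(grad_fn, theta, lr=0.1, steps=200):
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)      # step against the gradient
    return theta

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    m = np.zeros_like(theta)                     # running mean of gradients
    v = np.zeros_like(theta)                     # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss L(theta) = ||theta - target||^2, with gradient 2 * (theta - target).
target = np.array([3.0, -2.0])
grad = lambda theta: 2.0 * (theta - target)

theta_sgd = sgd_minimize(grad, np.zeros(2))
theta_adam = adam_minimize(grad, np.zeros(2))
print(theta_sgd, theta_adam)
```

AdamW differs from Adam in applying weight decay directly to the weights rather than through the gradient; that detail is omitted here.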
Activations. ReLU is the default: cheap, and it avoids the saturating-gradient problem of tanh / sigmoid. GELU is smoother and dominates in transformers. Sigmoid / tanh now appear mostly inside gates (LSTM, GRU) and at output layers, where bounded outputs matter.
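The saturation point is easy to see from the derivatives themselves. This sketch evaluates the derivatives of sigmoid, tanh, and ReLU at a few hand-picked inputs, and includes the exact (erf-based) GELU for reference; the sample inputs are arbitrary.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2         # peaks at 1.0 when z = 0

def d_relu(z):
    return (z > 0).astype(float)         # exactly 1 for any positive input

def gelu(z):
    # Exact GELU for a scalar: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

z = np.array([-10.0, -1.0, 0.5, 10.0])
print("sigmoid'", d_sigmoid(z))
print("tanh'   ", d_tanh(z))
print("relu'   ", d_relu(z))
print("gelu    ", [gelu(float(v)) for v in z])
```

For large positive inputs the sigmoid and tanh derivatives are close to zero, while the ReLU derivative stays at 1.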
Initialization. Random Gaussian weights need to be scaled carefully — too big and activations blow up, too small and they vanish. Xavier/Glorot for tanh, He/Kaiming for ReLU. Most frameworks default to a sensible scheme.
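A quick way to see why the scale matters is to push random inputs through a stack of ReLU layers and watch the activation size. The sketch below is a NumPy experiment under assumed settings (width 256, depth 20, fixed seed, no biases); `final_rms` is a helper name introduced here. It compares a too-small standard deviation, a too-large one, and He/Kaiming scaling, `sqrt(2 / fan_in)`.

```python
import numpy as np

def final_rms(weight_std_fn, width=256, depth=20, batch=512, seed=0):
    """RMS activation after `depth` ReLU layers with Gaussian weights."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))              # unit-scale inputs
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * weight_std_fn(width)
        h = np.maximum(0.0, h @ W)                   # linear layer + ReLU
    return float(np.sqrt(np.mean(h ** 2)))

too_small = final_rms(lambda fan_in: 0.01)                # activations vanish
too_big = final_rms(lambda fan_in: 1.0)                   # activations blow up
he = final_rms(lambda fan_in: np.sqrt(2.0 / fan_in))      # He / Kaiming scaling

print(f"std=0.01: {too_small:.3e}")
print(f"std=1.0:  {too_big:.3e}")
print(f"He init:  {he:.3e}")
```

In PyTorch the corresponding initializers live in `torch.nn.init` (for example `kaiming_normal_` and `xavier_uniform_`).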
Regularization. Dropout (randomly zero some activations during training), weight decay (L2 penalty), early stopping, data augmentation. Modern nets often need less explicit regularization than older ones — the implicit regularization of SGD does a lot of work.
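Dropout is simple enough to write out directly. The following is a minimal sketch of the common "inverted dropout" formulation, in which surviving activations are rescaled by `1 / (1 - p)` during training so that nothing changes at evaluation time. The function name and the toy input are assumptions for illustration.

```python
import numpy as np

def dropout(h, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p, rescale the rest."""
    if not training or p == 0.0:
        return h                                  # identity at evaluation time
    mask = rng.random(h.shape) >= p               # keep with probability 1 - p
    return h * mask / (1.0 - p)                   # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
h = np.ones((1000, 100))

train_out = dropout(h, p=0.5, rng=rng, training=True)
eval_out = dropout(h, p=0.5, rng=rng, training=False)

print("train mean:", train_out.mean())            # close to 1.0 on average
print("eval mean: ", eval_out.mean())             # exactly the input
```

This is the behavior that `model.train()` and `model.eval()` toggle for `nn.Dropout` in PyTorch.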
Reach for it when
You need a flexible function approximator with no structural prior
Embedding components in a larger architecture (the MLP head everywhere)
You can afford the data + compute
You want batch / online training over a long-running stream
Skip it when
The data has structure you can exploit — use it
You need calibrated probabilities — DNNs are over-confident without post-hoc calibration
Single-CPU inference and strict latency budget
You need certified robustness — current MLPs are easy to attack
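On the calibration point in the list above: one common post-hoc fix is temperature scaling, which divides the logits by a scalar T before the softmax. The sketch below only shows the effect of a fixed, made-up temperature on made-up logits. In practice T would be fit on a held-out validation set, for example by minimizing negative log-likelihood.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)         # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[4.0, 1.0, 0.0]])              # illustrative logits for one example

raw = softmax(logits)                             # uncalibrated probabilities
cooled = softmax(logits, temperature=2.0)         # T > 1 softens the distribution

print("T=1:", raw.round(3))
print("T=2:", cooled.round(3))
```

The predicted class does not change; only the confidence attached to it does.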
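import torch, torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

class MLP(nn.Module):
    def __init__(self, d_in, d_out, hidden=(256, 128), dropout=0.1):
        super().__init__()
        layers, prev = [], d_in
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.GELU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, d_out))   # output logits, no activation
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Placeholder random data so the loop runs (an assumption); swap in your own dataset
X, y = torch.randn(512, 20), torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = MLP(d_in=20, d_out=10)
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()   # enables dropout during training
for epoch in range(50):
    for xb, yb in loader:
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()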
Want depth-vs-width, universal approximation, and the failure modes?
A single hidden layer with enough units can approximate any continuous f on a compact set K
The catch: "enough" can be astronomically large; depth makes this tractable
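Stated a little more formally (one standard form of the theorem, for a continuous, non-polynomial activation $\sigma$; the symbols $N$, $a_i$, $w_i$, $b_i$ are notation introduced here):

```latex
% Universal approximation, single hidden layer.
% K \subset \mathbb{R}^d compact, f : K \to \mathbb{R} continuous,
% \sigma continuous and not a polynomial.
\forall \varepsilon > 0 \;\; \exists\, N \in \mathbb{N},\;
a_i, b_i \in \mathbb{R},\; w_i \in \mathbb{R}^d \;\; \text{such that} \quad
\sup_{x \in K} \left| \, f(x) - \sum_{i=1}^{N} a_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
```

The theorem guarantees that such an $N$ exists but gives no useful bound on its size, and says nothing about whether gradient descent will find the weights.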
Depth vs. width. The universal approximation theorem says one hidden layer suffices in principle, but it says nothing about how many units are needed or whether training will find them. In practice, for the same parameter count, deeper networks often fit and generalize better than a single very wide layer, because depth lets the model compose features hierarchically. Modern intuition: width gives capacity, depth gives composition.
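"Same parameter count" is worth pinning down when comparing shapes. The helper below (a hypothetical function, with layer sizes picked only so the totals come out nearly equal) counts weights and biases for a fully connected stack, so a shallow wide network and a deep narrow one can be compared on equal footing.

```python
def mlp_param_count(d_in, hidden, d_out):
    """Weights plus biases for a fully connected net with the given hidden sizes."""
    sizes = [d_in, *hidden, d_out]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

# One hidden layer of 1726 units vs. four hidden layers of 128 units.
shallow_wide = mlp_param_count(20, [1726], 10)
deep_narrow = mlp_param_count(20, [128] * 4, 10)

print("shallow wide:", shallow_wide)
print("deep narrow: ", deep_narrow)
```

Both budgets are within a couple of parameters of each other, so any difference in accuracy between the two comes from the shape, not the size.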
Vanishing / exploding gradients. In deep networks, repeated multiplication of small or large Jacobian factors makes gradients shrink or explode through the chain rule. Modern fixes: ReLU (constant gradient on positive inputs), batch / layer normalization (keep activations on a sensible scale), residual connections (gradient highways).
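The effect of repeated Jacobian factors shows up even in a scalar toy model. This sketch (an illustrative simplification with no weights, using a sigmoid at every step) multiplies the local derivatives along a 30-step chain, once for a plain stack and once with a residual connection `h + sigmoid(h)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradient(depth, residual, x0=0.5):
    """d(output)/d(input) through `depth` scalar sigmoid layers."""
    h, grad = x0, 1.0
    for _ in range(depth):
        s = sigmoid(h)
        local = s * (1.0 - s)          # sigmoid'(h), never more than 0.25
        if residual:
            h = h + s                  # skip connection: identity plus layer
            grad *= 1.0 + local        # local factor is at least 1
        else:
            h = s
            grad *= local              # local factor is at most 0.25
    return grad

plain = chain_gradient(30, residual=False)
resid = chain_gradient(30, residual=True)
print(f"plain: {plain:.3e}   residual: {resid:.3e}")
```

The plain chain's gradient is bounded above by 0.25 to the power of the depth, while the residual chain's gradient cannot drop below 1.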
The optimization landscape is non-convex, with many local minima and saddle points. SGD tends to settle in flatter minima, which empirically often generalize better than sharp ones, though the link between flatness and generalization is still debated. This is part of why batch size and learning rate matter for generalization, not just for training speed.
Dead ReLUs. If a unit's pre-activation goes strongly negative and stays there, its gradient is zero and it never updates again. Mitigations: smaller learning rate, Leaky ReLU / GELU, careful initialization, batch norm.
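A dead unit can be reproduced in a few lines. In this sketch a single unit is given a strongly negative bias (the value -100 and the random data are assumptions for illustration), so its pre-activation is negative on every example. The ReLU weight gradient is then exactly zero, while a Leaky ReLU with slope 0.01 still passes some signal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # 1000 examples, 5 features
w = rng.normal(size=5)
b = -100.0                              # strongly negative bias: the unit is "dead"

z = X @ w + b                           # pre-activation, negative for every example
upstream = np.ones_like(z)              # pretend dL/d(activation) = 1 everywhere

# Gradient of the loss w.r.t. w is X^T (upstream * activation'(z)).
relu_grad_w = X.T @ (upstream * (z > 0))
leaky_grad_w = X.T @ (upstream * np.where(z > 0, 1.0, 0.01))

print("all pre-activations negative:", bool(np.all(z < 0)))
print("ReLU grad:      ", relu_grad_w)
print("Leaky ReLU grad:", leaky_grad_w)
```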
Double descent. Past the interpolation threshold (where the network can fit the training set perfectly), test error often peaks and then decreases again as you grow the model further. This runs against the classical bias-variance picture, which predicts that test error keeps rising once the model starts to overfit.
Reach for it when
You can afford to scale (more data, more parameters)
End-to-end differentiability matters — embed in any pipeline
You have a budget for hyperparameter search (LR, depth, width, regularization)
Pre-trained embeddings exist for your domain
Skip it when
Very small data — gradient boosting and probabilistic models win
You can't engineer reasonable hyperparameters and don't want to AutoML them
Adversarial robustness is a hard requirement
Causal inference / counterfactual reasoning is the goal
import torch, torch.nn as nn

# Residual MLP block: mitigates vanishing gradients, enables deeper models
class ResBlock(nn.Module):
    def __init__(self, d, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.fc = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(4 * d, d),
        )

    def forward(self, x):
        # Pre-norm residual: normalize, transform, then add back the input
        return x + self.fc(self.norm(x))

# A "modern" deep MLP: pre-norm + residual + GELU expansion (the MLP-Mixer pattern)
class DeepMLP(nn.Module):
    def __init__(self, d_in, d_out, hidden=384, n_blocks=6):
        super().__init__()
        self.embed = nn.Linear(d_in, hidden)
        self.blocks = nn.ModuleList([ResBlock(hidden) for _ in range(n_blocks)])
        self.head = nn.Linear(hidden, d_out)

    def forward(self, x):
        x = self.embed(x)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)