Key idea

Stack simple operations until something complex emerges. Each layer does a weighted sum of its inputs, then bends the result through a nonlinear function. Stack enough of these and you can approximate just about any function — image classification, language translation, game playing.

In the accompanying visualization, each hidden neuron shows its activation across the entire input space as a heatmap inside its circle. Different neurons learn to detect different regions of the input. The output neuron on the right combines them into the final decision (indigo = class 0, orange = class 1). The XOR-like pattern is the canonical example of a problem that a single linear layer cannot solve.
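To make the XOR claim concrete, here is a minimal sketch with hand-picked (not learned) weights: two ReLU hidden units are enough to compute XOR on binary inputs. The function name `xor_net` and the specific weights are illustrative assumptions, not values taken from the visualization.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    # Hidden layer: two ReLU units read the same sum with different biases.
    h1 = relu(x1 + x2)        # positive when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # positive only when both inputs are on
    # Output layer: a plain weighted sum of the hidden units.
    return h1 - 2.0 * h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

A single linear layer cannot reproduce this table: any function of the form w1*x1 + w2*x2 + b satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), so it cannot be low on both diagonal corners and high on the other two.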

A neural network is a chain of layers. Each layer takes a vector in, multiplies it by a matrix of weights, adds a bias, and runs the result through a "nonlinearity" (like ReLU: max(0, x)). The nonlinearity is essential: without it, no matter how many layers you stack, the whole thing collapses to a single linear (affine) map, because a composition of linear maps is itself linear.
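The collapse is easy to check numerically. The sketch below uses plain Python lists and made-up weights (`W1`, `b1`, `W2`, `b2` are illustrative): two stacked layers without a nonlinearity equal one layer with W = W2·W1 and b = W2·b1 + b2, while inserting a ReLU between them breaks the equivalence.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# Illustrative weights for a 2 -> 2 -> 1 network.
W1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.5]

def two_linear_layers(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same map folded into a single layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W, x), b)

def with_relu(x):
    return add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

x = [2.0, -1.0]
print(two_linear_layers(x), one_linear_layer(x))  # identical
print(with_relu(x))                               # differs: no longer linear in x
```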

Training works by gradient descent: nudge each weight in the direction that reduces the prediction error. That direction is the negative gradient of the error, computed with the chain rule (a.k.a. backpropagation).
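As a minimal sketch of that loop, the snippet below fits a single weight `w` in the model y = w * x by gradient descent on mean squared error. The toy data and the learning rate are illustrative assumptions.

```python
# Toy data generated by y = 3 * x (illustrative).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    # Chain rule: d/dw mean (w*x - y)^2 = mean 2 * (w*x - y) * x
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w, lr = 0.0, 0.05
history = [loss(w)]
for step in range(50):
    w -= lr * grad(w)          # step against the gradient
    history.append(loss(w))

print(round(w, 4), history[0], history[-1])
```

A real network does the same thing with millions of weights at once; only the gradient computation gets more involved.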

Reach for it when

  • You have tabular data and want more flexibility than a linear model
  • You have plenty of data and compute
  • You'll use this as a building block in a bigger model
  • You want a starting point before reaching for CNNs / transformers

Skip it when

  • Small tabular data — gradient boosting usually wins
  • Data has obvious structure (images, sequences, graphs) — use a specialized architecture
  • You need interpretability of individual predictions
  • You're short on training data
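
import torch.nn as nn

n_features, n_classes = 20, 10   # placeholder sizes; set these from your data

# Two hidden layers with ReLU; the last layer outputs raw class scores (logits)
model = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(),
    nn.Linear(128, 64),          nn.ReLU(),
    nn.Linear(64, n_classes),
)
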
Want the forward / backward math?
Forward pass

$$ \mathbf{h}^{(\ell)} \;=\; \sigma\!\left(W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right) $$
  • h(ℓ): activations of layer ℓ, with h(0) = x
  • W(ℓ), b(ℓ): weight matrix and bias of layer ℓ
  • σ: elementwise nonlinearity (ReLU, GELU, tanh, …)

Training. Define a loss ℒ(θ), take its gradient with respect to all parameters via backpropagation (chain rule applied layer-by-layer), and step in the negative-gradient direction. SGD, Adam, AdamW are the optimizers people actually use.
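Here is a hedged sketch of backpropagation written out by hand for the smallest interesting case: one tanh hidden unit, one linear output, squared-error loss. The parameter values and function names are illustrative; the finite-difference check is there only to confirm the chain-rule gradients.

```python
import math

def forward(params, x):
    w1, b1, w2, b2 = params
    z = w1 * x + b1          # pre-activation
    h = math.tanh(z)         # hidden activation
    y_hat = w2 * h + b2      # linear output
    return z, h, y_hat

def loss(params, x, y):
    return (forward(params, x)[2] - y) ** 2

def backprop(params, x, y):
    w1, b1, w2, b2 = params
    z, h, y_hat = forward(params, x)
    d_yhat = 2.0 * (y_hat - y)          # dL/dy_hat
    d_w2, d_b2 = d_yhat * h, d_yhat     # output layer
    d_h = d_yhat * w2                   # back through the output weight
    d_z = d_h * (1.0 - h * h)           # tanh'(z) = 1 - tanh(z)^2
    d_w1, d_b1 = d_z * x, d_z           # hidden layer
    return [d_w1, d_b1, d_w2, d_b2]

def numeric_grad(params, x, y, eps=1e-6):
    grads = []
    for i in range(len(params)):
        up = list(params)
        dn = list(params)
        up[i] += eps
        dn[i] -= eps
        grads.append((loss(up, x, y) - loss(dn, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.3, 1.2, 0.1]          # illustrative values
analytic = backprop(params, x=0.8, y=1.0)
numeric = numeric_grad(params, x=0.8, y=1.0)

# One plain SGD step: move every parameter against its gradient.
lr = 0.1
new_params = [p - lr * g for p, g in zip(params, analytic)]
print(loss(params, 0.8, 1.0), loss(new_params, 0.8, 1.0))
```

Frameworks automate exactly this bookkeeping; optimizers such as Adam change how the step is scaled, not how the gradient is obtained.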

Activations. ReLU is the default: cheap, and it avoids the saturating-gradient problem of tanh / sigmoid on positive inputs. GELU is smoother and is the usual choice in transformers. Sigmoid / tanh now appear mainly inside gates (LSTM, GRU) and at output layers, where bounded outputs matter.
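The saturation point is easier to see with numbers. This is a small standard-library sketch (the helper names are assumptions) that evaluates the derivatives of sigmoid, tanh, and ReLU at a few inputs, plus GELU in its exact erf form.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # Exact form: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def d_relu(x):
    return 1.0 if x > 0 else 0.0

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"x={x:+.1f}  relu'={d_relu(x):.1f}  "
          f"sigmoid'={d_sigmoid(x):.5f}  tanh'={d_tanh(x):.5f}  gelu={gelu(x):+.4f}")
```

For large |x| the sigmoid and tanh derivatives shrink toward zero, while the ReLU derivative stays at 1 on the positive side.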

Initialization. Random Gaussian weights need to be scaled carefully: too big and activations blow up, too small and they vanish. The standard recipes scale the weight variance by layer width: Xavier/Glorot for tanh, He/Kaiming for ReLU. Most frameworks default to a sensible scheme.
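A quick experiment makes the scaling argument tangible. The sketch below (pure Python; the width, depth, and seed are arbitrary assumptions) pushes one random vector through a stack of ReLU layers and reports the size of the last pre-activation for three weight scales, one of them the He/Kaiming choice std = sqrt(2 / fan_in).

```python
import math
import random

def final_rms(weight_std, width=64, depth=10, seed=0):
    """RMS of the last layer's pre-activations after `depth` random ReLU layers."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    z = h
    for _ in range(depth):
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        z = [sum(w * a for w, a in zip(row, h)) for row in W]   # pre-activations
        h = [max(0.0, v) for v in z]                            # ReLU
    return math.sqrt(sum(v * v for v in z) / width)

width = 64
too_small = final_rms(0.01)
too_big = final_rms(1.0)
he = final_rms(math.sqrt(2.0 / width))   # He/Kaiming: std = sqrt(2 / fan_in)
print(too_small, he, too_big)
```

With the He scale the signal stays within an order of magnitude of where it started; the other two settings shrink or grow it by many orders of magnitude over just ten layers.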

Regularization. The usual tools are dropout (randomly zero some activations during training), weight decay (an L2 penalty on the weights), early stopping, and data augmentation. Modern nets often need less explicit regularization than older ones; the implicit regularization of SGD is thought to do a lot of the work.
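Two of these are short enough to sketch directly. The code below is an illustrative, framework-free version of inverted dropout and of an SGD step with an L2 penalty; the function names are assumptions. For plain SGD the L2 penalty and weight decay coincide, whereas AdamW applies the decay separately from the gradient.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout: zero each unit with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)        # identity at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """The L2 penalty (weight_decay / 2) * w^2 adds weight_decay * w to the gradient."""
    return w - lr * (grad + weight_decay * w)

rng = random.Random(0)
acts = [1.0] * 10_000
dropped = dropout(acts, p=0.1, training=True, rng=rng)

zero_fraction = sum(1 for a in dropped if a == 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(zero_fraction, mean_after)
```

The rescaling keeps the expected activation unchanged, which is why nothing needs to be adjusted when dropout is switched off at inference time.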

Reach for it when

  • You need a flexible function approximator with no structural prior
  • Embedding components in a larger architecture (the MLP head everywhere)
  • You can afford the data + compute
  • You want batch / online training over a long-running stream

Skip it when

  • The data has structure you can exploit — use it
  • You need calibrated probabilities: DNNs are often over-confident without post-hoc calibration
  • Single-CPU inference and strict latency budget
  • You need certified robustness — current MLPs are easy to attack
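
import torch, torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

class MLP(nn.Module):
    def __init__(self, d_in, d_out, hidden=(256, 128), dropout=0.1):
        super().__init__()
        layers, prev = [], d_in
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.GELU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, d_out))   # output layer: raw logits
        self.net = nn.Sequential(*layers)
    def forward(self, x): return self.net(x)

# Placeholder random data so the loop runs; swap in your own Dataset
X, y = torch.randn(512, 20), torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = MLP(d_in=20, d_out=10)
opt   = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()   # expects logits and integer class labels

model.train()                     # enables dropout
for epoch in range(50):
    for xb, yb in loader:
        loss = loss_fn(model(xb), yb)
        opt.zero_grad(); loss.backward(); opt.step()
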
Want depth-vs-width, universal approximation, and the failure modes?
Universal approximation

$$ \forall \,\varepsilon{>}0,\; \exists\, n,\, \{w_i, b_i, c_i\}: \quad \sup_{\mathbf{x}\in K} \left|\, f(\mathbf{x}) - \sum_{i=1}^{n} c_i\, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i)\,\right| < \varepsilon $$
  • A single hidden layer with enough units can approximate any continuous f on a compact set K, provided the activation σ is not a polynomial
  • The catch: "enough" can be astronomically large; depth makes this tractable
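For a one-dimensional feel of the theorem, the sketch below builds the sum of c_i · σ(w_i x + b_i) explicitly with σ = ReLU, so that it interpolates a target function at evenly spaced knots on K = [0, 1]. The target `f`, the construction, and the helper names are illustrative assumptions; nothing here is trained.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_approximator(f, n):
    """Return (c, w, b) so that sum_i c[i]*relu(w[i]*x + b[i]) interpolates f at n+1 knots on [0, 1]."""
    step = 1.0 / n
    knots = [i * step for i in range(n + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(n)]
    # A constant unit (w=0, b=1) carries f(0); each further unit adds the
    # change in slope at its knot.
    c, w, b = [f(0.0)], [0.0], [1.0]
    prev = 0.0
    for i in range(n):
        c.append(slopes[i] - prev)
        w.append(1.0)
        b.append(-knots[i])
        prev = slopes[i]
    return c, w, b

def evaluate(params, x):
    c, w, b = params
    return sum(ci * relu(wi * x + bi) for ci, wi, bi in zip(c, w, b))

def sup_error(f, params, grid=2001):
    """Largest absolute error over an evenly spaced grid on [0, 1]."""
    points = [i / (grid - 1) for i in range(grid)]
    return max(abs(f(x) - evaluate(params, x)) for x in points)

def f(x):
    return math.sin(2.0 * math.pi * x)   # illustrative target

for n in (4, 16, 64):
    print(n, sup_error(f, build_relu_approximator(f, n)))
```

The error falls as units are added, but only as fast as piecewise-linear interpolation allows, which hints at why width alone becomes expensive in higher dimensions.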

Depth vs. width. The universal approximation theorem says one hidden layer suffices in principle. In practice, for the same parameter count, deeper networks often fit and generalize better than shallow wide ones, because depth lets the model compose features hierarchically. Modern intuition: width gives capacity, depth gives composition.
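Comparing the two shapes fairly means matching their parameter budgets first. This small helper (the name and the layer sizes are assumptions) counts weights and biases for a fully connected net. It does not show which shape generalizes better, only how to set up a like-for-like comparison.

```python
def mlp_param_count(sizes):
    """Weights plus biases of a fully connected net with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

shallow_wide = [20, 512, 10]               # one hidden layer
deep_narrow = [20, 64, 64, 64, 64, 10]     # four hidden layers

print(mlp_param_count(shallow_wide))   # 15882
print(mlp_param_count(deep_narrow))    # 14474
```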

Vanishing / exploding gradients. In deep networks, repeated multiplication of small or large Jacobian factors makes gradients shrink or explode through the chain rule. Modern fixes: ReLU (constant gradient on positive inputs), batch / layer normalization (keep activations on a sensible scale), residual connections (gradient highways).
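The shrinking product can be reproduced with a scalar toy chain. In this sketch (the function name and default values are assumptions) each layer is h ← sigmoid(w·h), and the optional residual variant is h ← h + sigmoid(w·h). The returned value is the derivative of the final output with respect to the input, accumulated factor by factor as the chain rule prescribes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, w=1.0, x=0.5, residual=False):
    """d(output)/d(input) for a scalar chain of sigmoid layers."""
    h, grad = x, 1.0
    for _ in range(depth):
        s = sigmoid(w * h)
        local = w * s * (1.0 - s)                   # derivative of sigmoid(w*h) w.r.t. h
        grad *= (1.0 + local) if residual else local
        h = h + s if residual else s
    return grad

for depth in (1, 5, 20):
    print(depth, chain_gradient(depth), chain_gradient(depth, residual=True))
```

Each plain sigmoid factor is at most 0.25 when w = 1, so twenty layers multiply the gradient by less than 0.25^20. The residual path adds 1 to every factor, so the product can never fall below 1 in this toy setting.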

Loss landscape. The optimization landscape is non-convex, with many local minima and saddle points. Empirically, SGD tends to settle in flat minima, which tend to generalize better than sharp ones. This is part of why batch size and learning rate matter for generalization, not just for training speed.

Dead ReLUs. If a unit's pre-activation is negative for every training input, its output and its gradient are both zero, so its weights stop updating and it can stay dead for the rest of training. Mitigations: smaller learning rate, Leaky ReLU / GELU, careful initialization, batch norm.
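A simple diagnostic is to count units that never activate on a batch. The sketch below is an illustrative check in plain Python; the weights, biases, and the function name are assumptions chosen so that two of the four units are dead.

```python
import random

def dead_unit_fraction(W, b, inputs):
    """Fraction of ReLU units whose pre-activation is <= 0 for every input in the batch."""
    dead = 0
    for row, bias in zip(W, b):
        if all(sum(w * x for w, x in zip(row, xi)) + bias <= 0.0 for xi in inputs):
            dead += 1
    return dead / len(W)

rng = random.Random(0)
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]

W = [[1.0, 0.0, 0.0],
     [0.5, -0.5, 0.2],
     [0.1, 0.1, 0.1],
     [1.0, 1.0, 1.0]]
b = [0.0, 0.0, -5.0, -10.0]   # the last two biases are far too negative

print(dead_unit_fraction(W, b, inputs))
```

With inputs confined to [-1, 1], the last two units can never reach a positive pre-activation, so no gradient flows to their weights.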

Double descent. Test error typically peaks near the interpolation threshold (where the network can just fit the training set perfectly) and then often decreases again as you grow the model further. The classical bias-variance picture predicts the opposite: test error should keep rising once the model is large enough to overfit.

Reach for it when

  • You can afford to scale (more data, more parameters)
  • End-to-end differentiability matters — embed in any pipeline
  • You have a budget for hyperparameter search (LR, depth, width, regularization)
  • Pre-trained embeddings exist for your domain

Skip it when

  • Very small data — gradient boosting and probabilistic models win
  • You can't hand-tune reasonable hyperparameters and don't want to hand the search to AutoML
  • Adversarial robustness is a hard requirement
  • Causal inference / counterfactual reasoning is the goal

import torch, torch.nn as nn

# Residual MLP block — mitigates vanishing gradients, enables deeper models
class ResBlock(nn.Module):
    def __init__(self, d, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.fc   = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(4 * d, d),
        )
    def forward(self, x):
        return x + self.fc(self.norm(x))

# A "modern" deep MLP: pre-norm + residual + GELU expansion (the pattern used in transformer feed-forward and MLP-Mixer blocks)
class DeepMLP(nn.Module):
    def __init__(self, d_in, d_out, hidden=384, n_blocks=6):
        super().__init__()
        self.embed  = nn.Linear(d_in, hidden)
        self.blocks = nn.ModuleList([ResBlock(hidden) for _ in range(n_blocks)])
        self.head   = nn.Linear(hidden, d_out)
    def forward(self, x):
        x = self.embed(x)
        for blk in self.blocks: x = blk(x)
        return self.head(x)