Key idea

A neural network is a flexible function you teach by example. It's built from millions of little "neurons" — each just a weighted sum followed by a small bend. Adjust all those weights via gradient descent, and the whole thing learns to map inputs (images, text, sensor data) to outputs (labels, predictions, the next word).

Forget the brain metaphor for a moment. An artificial neuron is a tiny computation: take several numbers in, multiply each by a learnable weight, add a bias, and pass the result through a nonlinear function like ReLU. Stack a few of these into a "layer". Stack a few layers. That's a neural network.

The magic is in the training: an algorithm called backpropagation figures out, for every weight in the network, which direction to nudge it to make the predictions slightly better, and gradient descent takes that small step. Repeat that for millions of examples and the network gradually picks up the patterns present in the data.

If this is your first network, the natural next step is the Deep Neural Network page — the simplest concrete architecture.

Reach for it when

  • The pattern is too complex for a linear model
  • You have plenty of data — neural nets are data-hungry
  • End-to-end learning matters (no hand-engineered features)
  • The data has structure (images, text, audio) — use a specialized architecture

Skip it when

  • Small data — gradient boosting or a simple model wins
  • Interpretability of individual predictions matters
  • You need calibrated probabilities without post-hoc adjustment
  • The task is well-served by a known closed-form solution
import torch
import torch.nn as nn

# A 3-layer network: 20 inputs → 64 hidden → 32 hidden → 10 outputs
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),
)

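# Train it. `loader` is assumed to be an iterable of (inputs, labels) batches,
# e.g. a torch.utils.data.DataLoader: inputs of shape (batch, 20), labels as
# integer class indices in [0, 10).
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()    # expects raw logits, so the model has no softmax
for xb, yb in loader:
    loss = loss_fn(model(xb), yb)  # forward pass and loss
    opt.zero_grad()                # clear gradients left over from the previous step
    loss.backward()                # backpropagate
    opt.step()                     # update the weights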
Want to see what's actually happening?
One neuron $$ y \;=\; \sigma\!\left(\sum_{i=1}^{n} w_i\, x_i \;+\; b\right) $$
  • x_i: inputs
  • w_i, b: learnable weights and bias
  • σ: nonlinear activation, e.g. ReLU, sigmoid, tanh
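
As a rough illustration of the formula, here is one neuron in plain Python. The function names `relu` and `neuron` and all the numbers are made up for this sketch; nothing here comes from a library.

```python
def relu(z):
    """Rectified linear unit: pass positives through, clamp negatives to zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """Compute activation(sum_i w_i * x_i + b) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

x = [1.0, 2.0, 3.0]      # inputs x_i
w = [0.5, -1.0, 0.25]    # weights w_i (these would be learned in a real network)
b = 1.0                  # bias

y = neuron(x, w, b)      # weighted sum is 0.5 - 2.0 + 0.75 + 1.0 = 0.25
print(y)
```

With a bias of -1.0 instead, the weighted sum goes negative and the ReLU clamps the output to zero.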

A neural network is just a graph of these little computations. The "neural" metaphor is mostly decorative.

Layers, not neurons. In practice we don't write individual neurons — we write layers. A fully-connected layer takes a vector x and produces σ(Wx + b), where W is a matrix of weights (one row per output neuron). Stacking layers gives a deep network.
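
A minimal sketch of that layer formula, written with plain lists so the matrix product stays visible. The helper `dense_layer` and the weights are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense_layer(W, b, x, activation=relu):
    """Return activation(W x + b), where W has one row per output neuron."""
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# Hypothetical first layer: 3 inputs, 2 output neurons.
W1 = [[1.0, 0.0, -1.0],
      [0.5, 0.5, 0.5]]
b1 = [0.0, -1.0]

# Hypothetical second layer: 2 inputs, 1 output neuron, no activation.
W2 = [[2.0, 1.0]]
b2 = [0.5]

x = [2.0, 1.0, 4.0]
h = dense_layer(W1, b1, x)                        # hidden activations
y = dense_layer(W2, b2, h, activation=identity)   # stacking: feed h into the next layer
print(h, y)
```

Stacking is nothing more than feeding one layer's output vector into the next layer.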

Why nonlinearity is essential. A composition of linear functions is still linear. Without the σ, no matter how many layers you stack, the whole network collapses to a single linear function — same expressive power as ordinary linear regression. The nonlinearity is what makes depth meaningful.
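
The collapse can be checked numerically. The sketch below uses two small made-up layers and compares them with the single layer W = W2 W1, b = W2 b1 + b2; all helper names are illustrative.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu_vec(v):
    return [max(0.0, a) for a in v]

# Two hypothetical layers; the weights are chosen only for illustration.
W1, b1 = [[1.0, 2.0], [3.0, 4.0]], [1.0, -1.0]
W2, b2 = [[0.0, 1.0], [1.0, 1.0]], [2.0, 0.0]

# Collapse the two linear layers into one.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def two_linear_layers(x):
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

def one_linear_layer(x):
    return add(matvec(W, x), b)

def two_layers_with_relu(x):
    return add(matvec(W2, relu_vec(add(matvec(W1, x), b1))), b2)

x = [-1.0, 0.0]
print(two_linear_layers(x), one_linear_layer(x))   # the same vector twice
print(two_layers_with_relu(x))                     # different: the ReLU breaks the collapse
```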

Training: gradient descent. Define a loss ℒ(θ) measuring how wrong the network is on the training data, then update each parameter θ in the direction that decreases ℒ: θ ← θ − η ∂ℒ/∂θ. The learning rate η controls how big each step is.
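
To see the update rule in action, here is a sketch on a one-parameter toy loss, ℒ(θ) = (θ − 3)², which is an assumption made purely for illustration. It also shows what happens when η is too large.

```python
def gradient_descent(grad, theta, lr, steps):
    """Repeat theta <- theta - lr * grad(theta) and return every visited value."""
    path = [theta]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        path.append(theta)
    return path

def toy_grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

good = gradient_descent(toy_grad, theta=0.0, lr=0.1, steps=50)   # shrinks the error each step
bad = gradient_descent(toy_grad, theta=0.0, lr=1.1, steps=50)    # overshoots further each step

print(good[-1])   # close to the minimum at 3
print(bad[-1])    # far away: the step size was too large
```

On this loss each step multiplies the distance to the minimum by (1 − 2η), so η = 0.1 contracts it and η = 1.1 makes it grow.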

Backpropagation is just the chain rule applied to the loss layer-by-layer. Compute the loss, then propagate gradients backward through the layers — each layer turns the gradient from above into a gradient w.r.t. its own weights and inputs. PyTorch / JAX do this automatically via autodiff.
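
Here is the chain rule written out by hand for the smallest possible case: one ReLU neuron with a squared-error loss (both are assumptions for this sketch). The result is compared against a finite-difference estimate as a sanity check.

```python
def forward(w, b, x, t):
    z = w * x + b          # pre-activation
    y = max(0.0, z)        # ReLU
    loss = (y - t) ** 2    # squared error against target t
    return z, y, loss

def backward(w, b, x, t):
    """Chain rule: dloss/dw = dloss/dy * dy/dz * dz/dw, and likewise for b."""
    z, y, _ = forward(w, b, x, t)
    dloss_dy = 2.0 * (y - t)
    dy_dz = 1.0 if z > 0 else 0.0
    dloss_dz = dloss_dy * dy_dz
    return dloss_dz * x, dloss_dz   # dz/dw = x, dz/db = 1

w, b, x, t = 0.5, 0.1, 2.0, 1.0
grad_w, grad_b = backward(w, b, x, t)

# Numerical check: nudge w a little in both directions and measure the slope.
eps = 1e-6
numeric_w = (forward(w + eps, b, x, t)[2] - forward(w - eps, b, x, t)[2]) / (2 * eps)
print(grad_w, numeric_w)
```

The two numbers should agree to several decimal places; autodiff libraries do the same bookkeeping for every operation in the network.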

The training loop. Repeat: sample a mini-batch, forward pass, compute loss, backward pass, update weights. One full pass through the data is an epoch; over many epochs the training loss should fall as the model fits the data.
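
The loop can be sketched without any framework on a deliberately tiny problem: fitting y = 2x + 1 with a two-parameter linear model. The dataset, the function names and the hyperparameters are all assumptions picked so the example converges quickly.

```python
import random

def make_batches(data, batch_size, rng):
    """Shuffle the data and cut it into mini-batches."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

def train_linear(data, epochs=50, lr=0.1, batch_size=8, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):
        for batch in make_batches(data, batch_size, rng):
            # Forward pass plus gradient of the mean squared error on this batch.
            grad_w = grad_b = 0.0
            for x, t in batch:
                err = (w * x + b) - t
                grad_w += 2.0 * err * x / len(batch)
                grad_b += 2.0 * err / len(batch)
            # Update step.
            w -= lr * grad_w
            b -= lr * grad_b
        # Track the loss on the full dataset once per epoch.
        history.append(sum(((w * x + b) - t) ** 2 for x, t in data) / len(data))
    return w, b, history

data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(-20, 21)]   # noise-free line
w, b, history = train_linear(data)
print(w, b, history[0], history[-1])
```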

Reach for it when

  • You need a flexible function approximator
  • The dataset is large enough to support millions of parameters
  • You're going to build on top of this with a specialized architecture
  • End-to-end differentiability matters for your pipeline

Skip it when

  • The data is tiny or noisy — overfitting risk is high
  • A simpler model already bakes in the domain prior you need (e.g. linearity)
  • Inference latency is tight and you can't distill or quantize
  • Strict probabilistic guarantees are needed
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in, d_out, hidden=(128, 64), dropout=0.1):
        super().__init__()
        layers, prev = [], d_in
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, d_out))
        self.net = nn.Sequential(*layers)
    def forward(self, x): return self.net(x)

# Training loop sketch. `loader` is assumed to yield (inputs, integer class labels) batches.
def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()                 # make sure dropout is active during training
    for epoch in range(epochs):
        for xb, yb in loader:
            loss = loss_fn(model(xb), yb)
            opt.zero_grad()
            loss.backward()       # autodiff fills in gradients
            opt.step()            # update every parameter
Want universal approximation and the math behind backprop?
Backprop in one equation $$ \frac{\partial \mathcal{L}}{\partial W^{(\ell)}} \;=\; \delta^{(\ell)} \big(\mathbf{h}^{(\ell-1)}\big)^\top, \qquad \delta^{(\ell)} \;=\; \big(W^{(\ell+1)}\big)^\top \delta^{(\ell+1)} \,\odot\, \sigma'\!\big(\mathbf{z}^{(\ell)}\big) $$
  • δ^(ℓ): error signal at layer ℓ, propagated backward from the loss
  • h^(ℓ-1): activations from the previous layer (the "incoming" signal)
  • z^(ℓ): pre-activation; σ' is the derivative of the activation
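
A sketch of those two equations for a small two-layer network, using the column-vector convention z = W h + b. The sizes, the weights, the linear output layer and the squared-error loss are all assumptions for this example; the gradient for one weight is checked against a finite difference.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def forward(W1, b1, W2, b2, x):
    z1 = [z + b for z, b in zip(matvec(W1, x), b1)]
    h1 = [max(0.0, z) for z in z1]                    # ReLU hidden layer
    y = [z + b for z, b in zip(matvec(W2, h1), b2)]   # linear output layer
    return z1, h1, y

def loss_fn(y, t):
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))

def backprop(W1, b1, W2, b2, x, t):
    z1, h1, y = forward(W1, b1, W2, b2, x)
    delta2 = [yi - ti for yi, ti in zip(y, t)]        # output error for this loss
    back = matvec(transpose(W2), delta2)              # (W2)^T delta2
    delta1 = [g * (1.0 if z > 0 else 0.0) for g, z in zip(back, z1)]   # times relu'(z1)
    dW1 = outer(delta1, x)                            # delta1 (h0)^T, with h0 = x
    dW2 = outer(delta2, h1)                           # delta2 (h1)^T
    return dW1, dW2

W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.1, -1.5, 0.2]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.0]
x, t = [1.0, 2.0], [0.45]

dW1, dW2 = backprop(W1, b1, W2, b2, x, t)

def loss_with_w1_entry(value):
    """Loss as a function of the single weight W1[0][1]."""
    shifted = [row[:] for row in W1]
    shifted[0][1] = value
    return loss_fn(forward(shifted, b1, W2, b2, x)[2], t)

eps = 1e-6
numeric = (loss_with_w1_entry(W1[0][1] + eps) - loss_with_w1_entry(W1[0][1] - eps)) / (2 * eps)
print(dW1[0][1], numeric)
```

The second hidden unit has a negative pre-activation here, so its row of dW1 is all zeros: the ReLU derivative blocks the error signal.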

Universal approximation (Hornik et al., 1989). A feedforward network with a single hidden layer and enough units can approximate any continuous function on a compact domain to arbitrary accuracy. The catch: "enough" can be astronomically large, and the theorem says nothing about whether training will find those weights. Depth lets the network compose features hierarchically, which for many functions is far more parameter-efficient.
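
A one-dimensional illustration, not a proof: the sketch below sets the weights of a single-hidden-layer ReLU network by hand so that it interpolates sin(x) on [0, π], then measures how the worst-case error shrinks as hidden units are added. The construction and the name `build_relu_interpolator` are assumptions for this example. The weights are computed directly rather than trained, and the classic theorem is stated for sigmoid-like units rather than ReLU.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolator(f, a, b, n_hidden):
    """Return a one-hidden-layer ReLU net matching f at n_hidden + 1 evenly spaced knots."""
    knots = [a + (b - a) * k / n_hidden for k in range(n_hidden + 1)]
    slopes = [(f(knots[k + 1]) - f(knots[k])) / (knots[k + 1] - knots[k])
              for k in range(n_hidden)]
    # Output weight k is the change in slope at knot k; hidden unit k switches on at knot k.
    out_weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    hidden_biases = [-knots[k] for k in range(n_hidden)]
    out_bias = f(a)

    def net(x):
        hidden = [relu(x + hb) for hb in hidden_biases]   # every input weight is 1
        return out_bias + sum(c * h for c, h in zip(out_weights, hidden))
    return net

def max_error(f, net, a, b, samples=1000):
    points = (a + (b - a) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - net(x)) for x in points)

errors = {}
for n in (4, 16, 64):
    net = build_relu_interpolator(math.sin, 0.0, math.pi, n)
    errors[n] = max_error(math.sin, net, 0.0, math.pi)
print(errors)
```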

The computational graph. A neural network is a directed acyclic graph of tensor operations. Each node knows how to compute its forward output and how to compute the gradient w.r.t. its inputs given the gradient w.r.t. its output. Autodiff just walks this graph backward, multiplying local Jacobians — the chain rule in matrix form.
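
To make the graph idea concrete, here is a toy reverse-mode autodiff for scalars. It is a teaching sketch only: the class name `Value` and its methods are invented for this example and are far simpler than what PyTorch or JAX do internally.

```python
class Value:
    """A scalar node that remembers its parents and how to pass gradient back to them."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(a + b)/da = 1
            other.grad += out.grad      # d(a + b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a * b)/da = b
            other.grad += self.data * out.grad   # d(a * b)/db = a
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))
        def _backward():
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        """Walk the graph in reverse topological order, applying each local rule."""
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

x, w, b = Value(2.0), Value(-3.0), Value(7.0)
y = (w * x + b).relu()   # relu(-6 + 7) = 1
loss = y * y             # 1
loss.backward()
print(w.grad, x.grad, b.grad)
```

Each operation only knows its own local derivative; the backward walk multiplies and accumulates them, which is the chain rule.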

The optimization landscape is non-convex, with many local minima and saddle points. In practice, SGD with momentum (or Adam) usually finds minima that generalize well anyway. Two surprising observations, both empirical rather than guaranteed: (1) in overparameterized networks the loss surface contains many equivalent global minima reachable from random init; (2) SGD's stochasticity acts as implicit regularization, and is widely believed to bias it toward flatter minima.

Initialization matters. If weights are too small, activations and gradients vanish through the layers. If too large, they explode. Modern initializations, He (for ReLU) and Xavier/Glorot (for tanh), scale weights so that variance is roughly preserved layer-to-layer. With a poor init, even a well-designed deep network can fail to train.
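
A small experiment that shows the effect, assuming a stack of 10 ReLU layers of width 64 with zero biases (sizes chosen arbitrarily). Weights are drawn with standard deviation gain × sqrt(2 / fan_in), so gain = 1.0 corresponds to He initialization; the function name and the `gain` knob are inventions for this sketch.

```python
import math
import random

def relu_stack_mean_square(gain, width=64, depth=10, seed=0):
    """Push one random input through `depth` ReLU layers and return the
    mean squared activation of the last layer."""
    rng = random.Random(seed)   # same seed, so every gain sees the same underlying draws
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = gain * math.sqrt(2.0 / width)
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        h = [max(0.0, sum(w * a for w, a in zip(row, h))) for row in W]
    return sum(a * a for a in h) / width

too_small = relu_stack_mean_square(gain=0.5)   # signal shrinks at every layer
he = relu_stack_mean_square(gain=1.0)          # signal stays at a similar scale
too_large = relu_stack_mean_square(gain=2.0)   # signal grows at every layer
print(too_small, he, too_large)
```

Halving or doubling the scale changes the mean squared activation by a factor of 4 per layer, so after 10 layers the three results differ by roughly six orders of magnitude.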

Activations. ReLU is the default: cheap and non-saturating on positive inputs, but it has the "dead ReLU" failure mode, where a unit outputs zero for every input and stops receiving gradient. GELU is smoother and is the usual choice in transformers. Other modern variants: Swish (also called SiLU) and Mish. For output layers: sigmoid for binary, softmax for multi-class, linear for regression.
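
For reference, the functions named above can be written in a few lines of standard-library Python. These are the textbook scalar definitions, meant as a sketch and not as a replacement for the vectorized versions in a framework.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    """Exact GELU: z times the standard normal CDF at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    """SiLU, i.e. Swish with beta = 1."""
    return z * sigmoid(z)

def mish(z):
    """Mish: z * tanh(softplus(z))."""
    return z * math.tanh(math.log1p(math.exp(z)))

def softmax(zs):
    """Turn a list of scores into probabilities; subtracting the max avoids overflow."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

for name, fn in [("relu", relu), ("gelu", gelu), ("silu", silu), ("mish", mish)]:
    print(name, [round(fn(z), 3) for z in (-2.0, 0.0, 2.0)])

probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

Unlike ReLU, the smooth variants let a small negative value through for moderately negative inputs.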

Regularization in modern nets. Less of it is explicit than in older networks. The explicit tools are weight decay (L2), dropout, early stopping, and data augmentation. The implicit regularization of SGD does a lot of the work: changing the batch size or learning rate changes generalization behaviour even with no other regularizer.
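
Two of the explicit tools are easy to sketch by hand. Below, weight decay is written as an extra term in a plain SGD step, and dropout uses the common "inverted" form that rescales the surviving units. Both function names are hypothetical; frameworks provide their own versions.

```python
import random

def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step on loss + (weight_decay / 2) * ||w||^2."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p and scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# With zero gradient, weight decay alone pulls every weight toward zero.
shrunk = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.5)
print(shrunk)

rng = random.Random(0)
ones = [1.0] * 10000
dropped = dropout(ones, p=0.5, rng=rng)
print(sum(dropped) / len(dropped))   # stays near 1 on average thanks to the rescaling
print(dropout(ones[:3], p=0.5, rng=rng, training=False))   # unchanged at evaluation time
```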

Reach for it when

  • You can scale data and parameters — the modern win condition
  • You'll embed this as part of a bigger model (an encoder, a head)
  • You want to integrate with any differentiable component
  • Pretrained embeddings exist for your domain

Skip it when

  • Small data and no transfer-learning option
  • Adversarial robustness is mission-critical
  • Causal interpretation or counterfactual analysis needed
  • Compute budget can't support hyperparameter tuning
import torch
import torch.nn as nn

# Explicit parameters and a hand-written forward pass, to see what autograd differentiates
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(4, 8) * (2 / 4) ** 0.5)   # He init
        self.b1 = nn.Parameter(torch.zeros(8))
        self.W2 = nn.Parameter(torch.randn(8, 2) * (2 / 8) ** 0.5)
        self.b2 = nn.Parameter(torch.zeros(2))

    def forward(self, x):
        z1 = x @ self.W1 + self.b1
        a1 = torch.relu(z1)
        z2 = a1 @ self.W2 + self.b2
        return z2

# Autograd will compute the gradient of the loss w.r.t. every Parameter automatically.
# Internally, it does what the backprop equation says. Assuming a softmax
# cross-entropy loss, δ at the output layer is (softmax_pred − target); going
# backward, each layer multiplies by its weight matrix (transposed relative to the
# forward pass) and by the activation's derivative.