Universal approximation (Hornik et al., 1989). A feedforward network with a single hidden layer, a suitable nonlinear activation, and enough units can approximate any continuous function on a compact domain to arbitrary accuracy. The catch: "enough" can be astronomically large, and the theorem says nothing about whether training will actually find those weights. Depth lets the network compose features hierarchically, which for many function classes is far more parameter-efficient than adding width.
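To make the "enough units" point concrete, here is a minimal sketch, assuming ReLU units and evenly spaced knots on a 1-D interval. The weights are set by hand (piecewise-linear interpolation), not learned, and the names build_relu_interpolant and max_error are illustrative, not from any library.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(f, a, b, n_units):
    """Return a one-hidden-layer ReLU net that matches f at n_units + 1 evenly spaced knots."""
    knots = [a + (b - a) * i / n_units for i in range(n_units + 1)]
    values = [f(x) for x in knots]
    slopes = [(values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
              for i in range(n_units)]
    # The output weight of hidden unit i is the change in slope at knot i.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    bias = values[0]

    def net(x):
        # Hidden unit i computes relu(x - knots[i]); zip drops the unused last knot.
        return bias + sum(w * relu(x - k) for w, k in zip(out_weights, knots))

    return net

def max_error(f, net, a, b, samples=2001):
    """Largest absolute error of net against f on an evenly spaced grid."""
    xs = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - net(x)) for x in xs)

# Approximate sin on [0, pi] with 8 and then 64 hidden units.
err_8 = max_error(math.sin, build_relu_interpolant(math.sin, 0.0, math.pi, 8), 0.0, math.pi)
err_64 = max_error(math.sin, build_relu_interpolant(math.sin, 0.0, math.pi, 64), 0.0, math.pi)
print(err_8, err_64)
```

The error shrinks as units are added, but only quadratically in this construction, which hints at why width alone can be an expensive way to buy accuracy.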
The computational graph. A neural network is a directed acyclic graph of tensor operations. Each node knows how to compute its forward output and, given the gradient of the loss with respect to that output, the gradient with respect to its inputs. Reverse-mode autodiff walks this graph backward, multiplying the incoming gradient by each local Jacobian: the chain rule in matrix form.
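The backward walk can be sketched in a few lines. This is a toy scalar version, assuming each node stores its parents and the local partial derivatives computed during the forward pass; Node, add, mul, tanh, and backward are illustrative names, not a real framework's API.

```python
import math

class Node:
    """One scalar node: forward value, parents, and d(self)/d(parent) for each parent."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents
        self.local_grads = local_grads
        self.grad = 0.0  # d(output)/d(self), filled in by backward()

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def tanh(a):
    t = math.tanh(a.value)
    return Node(t, (a,), (1.0 - t * t,))

def backward(output):
    """Visit nodes in reverse topological order, applying the chain rule at each edge."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            # Gradients accumulate when a node feeds more than one consumer.
            parent.grad += local * node.grad

# f(x, y) = x * y + tanh(x)
x, y = Node(0.5), Node(-2.0)
out = add(mul(x, y), tanh(x))
backward(out)
print(x.grad, y.grad)
```

In a tensor framework the scalar partials become Jacobians and the multiplication becomes a vector-Jacobian product, but the traversal is the same.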
The optimization landscape is non-convex, with many local minima and saddle points. In practice, SGD with momentum (or Adam) usually finds minima that generalize well, often in flat regions of the loss surface. Two surprising observations: (1) the loss surface contains many equivalent global minima reachable from random init; (2) SGD's stochasticity appears to act as implicit regularization, biasing it toward flatter minima.
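A toy illustration of point (1), assuming a 1-D loss f(w) = (w^2 - 1)^2 with two equivalent global minima at w = +1 and w = -1, and Gaussian noise added to the gradient as a stand-in for minibatch noise. The function names and hyperparameters are illustrative choices, not recommendations.

```python
import random

def grad(w):
    """Gradient of the toy non-convex loss f(w) = (w**2 - 1)**2."""
    return 4.0 * w * (w * w - 1.0)

def sgd_momentum(w0, lr=0.01, momentum=0.9, noise_std=0.1, steps=2000, seed=0):
    """Run SGD with (heavy-ball) momentum from w0 and return the final weight."""
    rng = random.Random(seed)
    w, velocity = w0, 0.0
    for _ in range(steps):
        noisy_grad = grad(w) + rng.gauss(0.0, noise_std)  # stand-in for minibatch noise
        velocity = momentum * velocity - lr * noisy_grad
        w += velocity
    return w

# Different starting points end near one of the two equivalent minima.
starts = [-1.8, -0.3, 0.2, 1.5]
finals = [sgd_momentum(w0, seed=i) for i, w0 in enumerate(starts)]
print(finals)
```

Which minimum a run lands in depends on the start and the noise, but the loss value reached is essentially the same. A 1-D example cannot show saddle points or flatness; it only illustrates the update rule and the multiplicity of minima.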
Initialization matters. If weights are too small, activations and gradients vanish as they pass through the layers. If too large, they explode. Modern initializations scale the weights so that variance is roughly preserved from layer to layer: He (for ReLU) uses weight variance 2/fan_in, and Xavier/Glorot (for tanh) uses 2/(fan_in + fan_out). Without a sensible init, even a well-designed deep network can fail to train.
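The vanish/explode behaviour is easy to reproduce. The sketch below assumes a plain stack of bias-free ReLU layers with Gaussian weights, sampled on the fly; mean_square_after is a hypothetical helper, and the width, depth, and scale factors are arbitrary choices for illustration.

```python
import math
import random

def mean_square_after(depth, width, weight_std, rng):
    """Push a random vector through `depth` ReLU layers; return the final mean squared activation."""
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit draws a fresh weight row from N(0, weight_std**2).
        h = [
            max(0.0, sum(rng.gauss(0.0, weight_std) * x for x in h))
            for _ in range(width)
        ]
    return sum(x * x for x in h) / width

rng = random.Random(0)
width, depth = 64, 20
he_std = math.sqrt(2.0 / width)  # He init: variance 2 / fan_in

results = {
    "too small (0.5 x He)": mean_square_after(depth, width, 0.5 * he_std, rng),
    "He": mean_square_after(depth, width, he_std, rng),
    "too large (2 x He)": mean_square_after(depth, width, 2.0 * he_std, rng),
}
for name, value in results.items():
    print(f"{name}: {value:.3e}")
```

Halving or doubling the standard deviation multiplies the activation variance by about 1/4 or 4 per layer, so after 20 layers the signal is many orders of magnitude too small or too large, while the He scale keeps it near its starting magnitude.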
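Activations. ReLU is the default: cheap and non-saturating on positive inputs, but with the "dead ReLU" failure mode, where a unit whose pre-activation stays negative outputs zero and receives no gradient. GELU is smoother and dominates in transformers. Other smooth variants: Swish (identical to SiLU when its β parameter is 1) and Mish. For output layers: sigmoid for binary classification, softmax for multi-class, linear for regression.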
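For reference, the activations named above can be written out directly. This is a scalar sketch using the standard definitions (exact GELU via the error function, SiLU as Swish with β = 1), with numerically stable forms of sigmoid, softplus, and softmax; frameworks ship their own vectorized versions.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    # Branch on sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU, i.e. Swish with beta = 1."""
    return x * sigmoid(x)

def softplus(x):
    # log(1 + exp(x)) written to avoid overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    return x * math.tanh(softplus(x))

def softmax(logits):
    """Subtract the max logit before exponentiating for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

for name, fn in [("relu", relu), ("gelu", gelu), ("silu", silu), ("mish", mish)]:
    print(name, [round(fn(x), 4) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)])
```

Evaluating at negative inputs shows the difference in behaviour: ReLU is exactly zero there, while GELU, SiLU, and Mish dip slightly below zero and still pass a small gradient.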
Regularization in modern nets. Explicit regularization plays a smaller role than in older networks; the usual tools are weight decay (equivalent to an L2 penalty under plain SGD), dropout, early stopping, and data augmentation. The implicit regularization of SGD does a lot of work: changing the batch size or learning rate changes generalization behaviour even with no other regularizer.
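Two of the explicit regularizers are simple enough to sketch on scalars. The first pair of functions shows that, for plain SGD, an L2 penalty of (lam / 2) * w^2 and decoupled weight decay give the same update (this equivalence does not carry over to adaptive optimizers such as Adam). The third is inverted dropout. All function names here are illustrative.

```python
import random

def sgd_step_l2(w, g, lr, lam):
    """SGD step on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (g + lam * w)

def sgd_step_weight_decay(w, g, lr, lam):
    """SGD step with decoupled weight decay: shrink the weight, then apply the gradient."""
    return (1.0 - lr * lam) * w - lr * g

def inverted_dropout(xs, p, rng):
    """Zero each activation with probability p and rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]

w_l2 = sgd_step_l2(0.7, 0.2, lr=0.1, lam=0.01)
w_wd = sgd_step_weight_decay(0.7, 0.2, lr=0.1, lam=0.01)
print(w_l2, w_wd)  # the two updates agree up to floating-point rounding

dropped = inverted_dropout([1.0] * 1000, p=0.5, rng=random.Random(0))
print(sum(dropped) / len(dropped))  # close to 1.0: rescaling preserves the expected activation
```

The rescaling in inverted dropout is what lets the network run unchanged at inference time, when no units are dropped.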