Learn by doing — no labels, only rewards. The framework that powers game-playing agents, robotics, and RLHF.
Key idea
An agent acts in an environment; the environment gives reward. No labels. No supervised target. The agent has to figure out, from sparse and often delayed reward signals, what good behaviour looks like. The classical loop: observe state, choose action, receive reward and new state, repeat.
Watch tabular Q-learning learn a grid-world policy — Q-values flood backward from the goal as episodes accumulate
A 5×5 grid world. The agent (indigo dot) starts top-left and gets +1 for reaching the goal (orange square) and -1 for stepping on a wall (grey). Each cell shows the max Q-value across actions; colour intensity indicates how good that state is. Watch the "warmth" flood backward from the goal as more episodes play out. Drop ε toward 0 (greedy, almost no exploration) and the agent tends to get stuck.
The RL loop. State s → action a → reward r → next state s'. Repeat. The agent learns a policy π(a | s) that maximises expected discounted future reward.
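The "expected discounted future reward" is the expectation of the return G = r_0 + γ r_1 + γ² r_2 + …. A minimal sketch of computing that return for one recorded episode follows; the function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the original.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    g = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: no reward for three steps, then +1 at the goal.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # about 0.729, i.e. 0.9 ** 3
print(discounted_return(rewards, gamma=1.0))  # 1.0, no discounting
```

A smaller γ shrinks the value of the same delayed reward, which is why the discount controls how far ahead the agent effectively plans.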
Q-learning. Learn a value Q(s, a) = "expected discounted reward if I take action a in state s and act greedily afterwards." Update rule: Q(s, a) ← Q(s, a) + α · (r + γ max_{a'} Q(s', a') − Q(s, a)), where α is the learning rate and γ the discount factor.
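To make the update rule concrete, here is a sketch of a single update on a tiny table. The helper `q_update`, the `terminal` flag, and the 2-state table are assumptions made for illustration.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # A terminal next state has no future, so its bootstrap value is zero.
    best_next = 0.0 if terminal else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# Hypothetical table with 2 states and 2 actions.
Q = [[0.0, 0.0], [0.5, 1.0]]
td = q_update(Q, s=0, a=1, r=0.0, s_next=1, alpha=0.1, gamma=0.9)
print(td, Q[0][1])  # TD error 0.9; Q[0][1] moves 10% of the way, to about 0.09
```

Even with zero immediate reward, Q[0][1] rises because the next state already has value; this is the backward "flood" of values from the goal.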
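Exploration vs exploitation. ε-greedy: with probability ε pick a random action, otherwise pick the action with the highest Q-value. Too much ε → behaviour stays close to random and returns never improve. Too little → the agent locks onto the first workable path and gets stuck in local optima.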
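A minimal sketch of ε-greedy action selection, assuming Q-values for one state arrive as a plain list. The function name `epsilon_greedy` and the random tie-breaking are illustrative choices, not something the original specifies.

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, else a greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so an all-zero table does not always pick action 0.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

rng = random.Random(0)
picks = [epsilon_greedy([0.1, 0.7, 0.2], eps=0.2, rng=rng) for _ in range(10_000)]
share_greedy = picks.count(1) / len(picks)
print(share_greedy)  # close to 0.8 + 0.2 / 3, since random picks also hit action 1
```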
Policy gradient. Parameterise the policy directly (π_θ(a | s)) and follow the gradient of expected reward. REINFORCE, A2C, PPO. Works when actions are continuous or the action space is huge.
Modern deep RL. Replace the Q-table with a neural network → DQN. Add policy networks → A2C, PPO. Add a value-network critic + replay buffer → SAC. The pieces are old; the engineering is what makes them work.
Reach for it when
Sequential decisions with delayed reward
Game-playing, robotics, control
RLHF — aligning models to human preferences
No labelled data, but a simulator or rollout mechanism
Limits
Sample-inefficient — usually needs millions of rollouts
Unstable training — exploration / exploitation balance is fragile
Reward design is hard — the agent will exploit whatever you wrote
Sim-to-real gap — policies that work in simulation often fail on hardware
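import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.95, 0.2

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore with probability eps, otherwise exploit
        if np.random.random() < eps:
            a = env.action_space.sample()
        else:
            # Break ties at random so the all-zero table does not always pick action 0
            a = int(np.random.choice(np.flatnonzero(Q[s] == Q[s].max())))
        # Gymnasium reports termination and time-limit truncation separately
        s_next, r, terminated, truncated, _ = env.step(a)
        # Bootstrap only from non-terminal next states
        target = r + gamma * Q[s_next].max() * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        done = terminated or truncated

# Greedy policy
policy = Q.argmax(axis=1)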
Want MDPs, Bellman equations, and actor-critic?
Bellman optimality
$$ Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[\, R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \,\right] $$
Q*: optimal action-value function
γ: discount factor (0 = myopic; 1 = far-sighted)
Q-learning bootstraps this equation iteratively
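When the transition model is known, the same Bellman optimality backup can be applied directly to every state-action pair; that procedure is usually called value iteration. The sketch below runs it on a hypothetical 3-state chain. The MDP, the dictionary layout `P`, and the function name `q_value_iteration` are all assumptions for illustration.

```python
GAMMA = 0.9
TERMINAL = 2
ACTIONS = (0, 1)  # 0 = left, 1 = right

# P[(s, a)] lists (probability, next_state, reward) outcomes.
P = {
    (0, 0): [(1.0, 0, 0.0)],   # bump into the left wall
    (0, 1): [(1.0, 1, 0.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 2, 1.0)],   # reaching state 2 pays +1 and ends the episode
}

def q_value_iteration(P, gamma, n_sweeps=100):
    """Repeat the Bellman optimality backup until Q stops changing."""
    Q = {sa: 0.0 for sa in P}
    for _ in range(n_sweeps):
        new_Q = {}
        for (s, a), outcomes in P.items():
            total = 0.0
            for prob, s_next, reward in outcomes:
                # Terminal states have no actions, so their future value is 0.
                future = 0.0 if s_next == TERMINAL else max(Q[(s_next, b)] for b in ACTIONS)
                total += prob * (reward + gamma * future)
            new_Q[(s, a)] = total
        Q = new_Q
    return Q

Q_star = q_value_iteration(P, GAMMA)
print(Q_star)  # right from state 1 is worth 1.0; right from state 0 is worth 0.9
```

Q-learning aims at the same fixed point but replaces the expectation over s' with one sampled transition per update, so it needs no access to `P`.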
MDPs. States S, actions A, transitions P(s' | s, a), rewards R(s, a, s'), discount γ. The agent's job is a policy π that maximises expected discounted return.
Value functions. V^π(s) = expected return from s following π. Q^π(s, a) = expected return from s, taking action a, then following π. Both satisfy Bellman equations; both can be learned.
Model-free vs model-based. Model-free: learn V or Q directly from experience (Q-learning, SARSA, DQN). Model-based: learn the transition model and plan (Dyna-Q, MuZero). Model-based is more sample-efficient when the model is accurate; less so when it isn't.
Policy gradient methods. Parameterise π_θ and take steps in the direction of the gradient of expected return: ∇_θ J = E[Σ_t ∇_θ log π_θ(a_t | s_t) · A(s_t, a_t)], where A is the advantage. REINFORCE is the simplest form (it uses the sampled return in place of A); A2C, A3C, PPO add stability.
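The gradient formula can be tried on the smallest possible problem: a two-armed bandit with a softmax policy. This is a sketch under assumptions: the bandit, the running-mean baseline standing in for the advantage, and the names `softmax` and `reinforce_bandit` are illustrative and not taken from the original.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(mean_rewards, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)   # one logit per action
    baseline = 0.0                      # running mean reward, used as a crude baseline
    for t in range(1, steps + 1):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        r = mean_rewards[a] + rng.gauss(0.0, 0.1)   # noisy reward
        advantage = r - baseline
        # For a softmax policy, d log pi(a) / d theta_k = 1[k == a] - pi(k).
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * advantage * grad_log_pi
        baseline += (r - baseline) / t
    return softmax(theta)

probs = reinforce_bandit([0.2, 0.8])
print(probs)  # most of the probability mass should end up on the second arm
```

Subtracting the baseline does not change the expected gradient; it only reduces its variance, which is the role the advantage plays in the formula.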
Actor-critic. Combine policy gradient (actor) with a learned value function (critic). The critic's value estimates reduce the variance of the gradient estimate; the actor uses them to decide which actions to make more likely. Modern RL is mostly actor-critic in some form.
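A tabular one-step actor-critic shows the division of labour. The sketch assumes a hypothetical 4-state corridor, one logit per state for the actor, and the TD error as the advantage estimate; names such as `step`, `p_right`, and `train` are invented for this example.

```python
import math
import random

N_STATES, GOAL = 4, 3  # hypothetical corridor: start at 0, +1 on reaching state 3

def step(s, a):
    """Action 1 moves right, action 0 moves left (clamped at state 0)."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done

def p_right(theta, s):
    return 1.0 / (1.0 + math.exp(-theta[s]))  # two actions, so one logit per state

def train(episodes=500, gamma=0.95, lr_actor=0.2, lr_critic=0.2, max_steps=200, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * N_STATES   # actor parameters
    V = [0.0] * N_STATES       # critic: state-value estimates
    for _episode in range(episodes):
        s = 0
        for _t in range(max_steps):
            p = p_right(theta, s)
            a = 1 if rng.random() < p else 0
            s_next, r, done = step(s, a)
            # TD error: the critic's one-sample estimate of the advantage.
            target = r if done else r + gamma * V[s_next]
            delta = target - V[s]
            V[s] += lr_critic * delta                # critic update
            theta[s] += lr_actor * delta * (a - p)   # actor: d log pi / d theta = a - p
            if done:
                break
            s = s_next
    return theta, V

theta, V = train()
print([round(p_right(theta, s), 2) for s in range(GOAL)])
```

The actor never sees a full return; it only sees the critic's TD error, which is why this variant can update after every step rather than at the end of an episode.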
On-policy vs off-policy. On-policy (PPO, A2C): learn from data the current policy collected. Stable but sample-inefficient. Off-policy (DQN, SAC, DDPG): learn from any data (replay buffer). More sample-efficient; trickier to make stable.
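Off-policy methods can reuse old transitions because they store them in a replay buffer and sample from it later. A minimal sketch using only the standard library follows; the class name `ReplayBuffer` and its method names are assumptions, not an API from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporal correlation.
        batch = self.rng.sample(list(self.buffer), batch_size)
        # Transpose a list of transitions into one tuple per field.
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push(t, 0, 0.0, t + 1, False)
states, actions, rewards, next_states, dones = buf.sample(2)
print(len(buf), states)  # only the 3 newest transitions (states 2, 3, 4) remain
```

An on-policy method would instead discard its rollouts after each update, since data from an older policy no longer matches the current one.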
import torch, torch.nn as nn, torch.nn.functional as F

# DQN: Q-function approximated by a neural net, replay buffer, target network
class DQN(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

# One training step: returns the loss for a sampled batch of transitions
def dqn_step(net, target_net, batch, gamma=0.99):
    # s, s_next: float [B, state_dim]; a: int64 [B]; r, done: float [B]
    s, a, r, s_next, done = batch
    # Q(s, a) for the actions that were actually taken
    q = net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap from the target network; no gradient flows through it
        q_next = target_net(s_next).max(dim=1).values
        # (1 - done) zeroes the bootstrap term at terminal transitions
        target = r + gamma * (1 - done) * q_next
    return F.smooth_l1_loss(q, target)
Want PPO, soft actor-critic, RLHF, and the exploration zoo?
PPO clipped objective: the clip prevents large policy updates that destabilise training. PPO is the default actor-critic for the practical RL world.
PPO. Schulman et al. (2017). The de facto default for continuous control. Clipped ratio objective prevents large policy updates; multiple epochs over the same rollouts; advantage estimation via GAE. Practical, simple, robust.
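The two ingredients named here, the clipped ratio objective and GAE, are short enough to sketch in plain Python. This is an illustration under assumptions: it works on lists of per-step numbers rather than tensors, the function names are invented, and `clip_eps=0.2`, `gamma=0.99`, and `lam=0.95` are commonly used values rather than values stated in this text.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximised) over a batch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)   # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # The min makes this a pessimistic bound: no extra credit for moving
        # the ratio outside [1 - eps, 1 + eps] in the favourable direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one rollout segment.

    last_value is the value estimate of the state after the final step
    (use 0.0 if the segment ended in a terminal state).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages

# Ratio 1.5 with a positive advantage is capped at 1 + eps = 1.2.
print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
# Ratio 1.5 with a negative advantage is not capped: the penalty stays at -1.5.
print(ppo_clip_objective([math.log(1.5)], [0.0], [-1.0]))
```

The asymmetry in the two printed cases is the point of the clip: the objective stops rewarding large steps but still penalises steps that make things worse.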
SAC. Haarnoja et al. (2018). Off-policy actor-critic with entropy regularization — encourages exploration by maximising "expected return + policy entropy". Sample-efficient; the right choice for many continuous-action benchmarks.
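The "expected return + policy entropy" trade-off can be seen in a single state with a few discrete actions. This is a simplified illustration, not SAC itself (which uses neural networks and usually continuous actions); the temperature name `alpha`, the sample Q-values, and the function names are assumptions.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_objective(probs, q_values, alpha):
    """Expected Q plus alpha times policy entropy, for one state."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)

def soft_optimal_policy(q_values, alpha):
    """The maximiser of soft_objective: a softmax over Q / alpha."""
    m = max(q_values)
    exps = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 0.9, 0.0]          # hypothetical action values for one state
greedy = [1.0, 0.0, 0.0]     # deterministic policy: zero entropy
soft = soft_optimal_policy(q, alpha=0.2)
print([round(p, 3) for p in soft])
print(soft_objective(greedy, q, 0.2), soft_objective(soft, q, 0.2))
```

Under this objective the stochastic policy scores higher than the greedy one, because it keeps probability on the nearly-as-good second action while almost ignoring the bad third one.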
MuZero. Schrittwieser et al. (2020). Learn the dynamics model in a latent space and plan with Monte-Carlo tree search. Matches AlphaZero-level play in Go, chess, and shogi without being given the game rules or a simulator. A clean synthesis of model-based planning and model-free learning.
Offline RL. Learn from a fixed dataset without environment access. Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) avoid overestimating the value of actions the dataset does not cover; Decision Transformer reformulates the problem as supervised sequence modelling. Hard because OOD actions can't be evaluated.
RLHF. Reinforcement Learning from Human Feedback. Train a reward model from human preferences over pairs of outputs; optimise an LLM against it, typically with PPO plus a KL penalty that keeps the policy close to the SFT model. A key ingredient in making instruction-following LLMs practical. The pretrain → SFT → RLHF pipeline is now standard for assistant-style models.
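Reward models trained on pairwise preferences are commonly fitted with a Bradley-Terry style loss: the negative log of the sigmoid of the reward gap between the preferred and the rejected output. The sketch below computes that loss for scalar rewards; the function name `preference_loss` and the example numbers are assumptions, and a real reward model would produce these scores from a neural network.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen output beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(preference_loss(0.0, 0.0))   # log(2): the model has no opinion yet
print(preference_loss(2.0, 0.0))   # small loss: ranking agrees with the human label
print(preference_loss(0.0, 2.0))   # large loss: ranking contradicts the label
```

Only the difference between the two rewards matters, so the reward model's absolute scale is not pinned down by the preference data.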
Exploration. ε-greedy is the floor. Better: entropy bonuses (SAC), intrinsic motivation (curiosity-driven, ICM), Thompson sampling on the Q-distribution (Bootstrapped DQN), random network distillation (RND). The right method depends on the problem's reward sparsity.
The deep-RL stability cottage industry. Modern deep RL is a list of stability tricks: target networks, replay buffers, layer normalization, clipped gradients, learning-rate annealing, normalised observations, advantage normalization, … None alone is magic; together they make things barely work.
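Several of these tricks are only a few lines each. The sketch below shows three of them on plain Python lists: advantage normalisation, gradient clipping by global norm, and a soft (Polyak) target-network update, which is one common variant of the target-network idea. Function names and default values are illustrative assumptions; deep-learning libraries provide tensor versions of the same operations.

```python
import math

def normalise(xs, eps=1e-8):
    """Shift and scale a batch to zero mean and (roughly) unit standard deviation."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / (math.sqrt(var) + eps) for x in xs]

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat gradient list so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

def soft_update(target, online, tau=0.005):
    """Move each target weight a fraction tau of the way toward the online weight."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

print(normalise([1.0, 2.0, 3.0]))            # symmetric around zero
print(clip_by_global_norm([3.0, 4.0], 1.0))  # norm 5 rescaled to norm 1
print(soft_update([0.0], [1.0], tau=0.5))    # [0.5]
```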