Key idea

An agent acts in an environment; the environment gives reward. No labels. No supervised target. The agent has to figure out, from sparse and often delayed reward signals, what good behaviour looks like. The classical loop: observe state, choose action, receive reward and new state, repeat.

Watch tabular Q-learning learn a grid-world policy — Q-values flood backward from the goal as episodes accumulate

A 5×5 grid world. The agent (indigo dot) starts top-left and gets +1 for reaching the goal (orange square) and -1 for stepping on a wall (grey). Each cell shows the max Q-value across actions; colour intensity indicates how good that state is. Watch the "warmth" flood backward from the goal as more episodes play out. Drop ε toward 0 (greedy, almost no exploration) and the agent gets stuck.

The RL loop. State s → action a → reward r → next state s'. Repeat. The agent learns a policy π(a | s) that maximises expected discounted future reward.
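
The loop can be sketched in a few lines of plain Python. `CorridorEnv` is a hypothetical toy environment invented for this illustration (it is not part of any library), and a policy here is simply a function from state to action.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: start in cell 0, reward +1 on reaching the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = left, 1 = right; the ends of the corridor clamp the position
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def run_episode(env, policy, gamma=0.95, max_steps=100):
    """Observe state, choose action, receive reward and next state, repeat."""
    s = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)
        s, r, done = env.step(a)
        discounted_return += discount * r
        discount *= gamma
        if done:
            break
    return discounted_return

def always_right(s):
    return 1

def random_policy(s):
    return random.randrange(2)

g_right = run_episode(CorridorEnv(), always_right)    # reaches the goal in 4 steps
g_random = run_episode(CorridorEnv(), random_policy)  # usually later, sometimes never
```

The only reward arrives on the final step, so the return of `always_right` is the reward discounted by three steps (γ³). A slower policy earns a smaller return, which is exactly the signal the agent has to learn from.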

Q-learning. Learn a value Q(s, a) = "expected discounted reward if I take action a in state s and act greedily afterwards." Update rule: Q(s, a) ← Q(s, a) + α · (r + γ · max_{a'} Q(s', a') − Q(s, a)).
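
A worked single update may make the rule concrete. The numbers below (current estimate 0.5, reward 1.0, best next-state value 0.8, α = 0.1, γ = 0.9) are made up for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for a single (s, a) pair."""
    td_target = reward + gamma * max_q_next   # r + γ · max_{a'} Q(s', a')
    td_error = td_target - q_sa               # how far off the current estimate is
    return q_sa + alpha * td_error            # move a fraction α toward the target

# 0.5 + 0.1 * (1.0 + 0.9 * 0.8 - 0.5) = 0.622
new_q = q_update(q_sa=0.5, reward=1.0, max_q_next=0.8)
```

The estimate moves only a tenth of the way toward the target, which is why values spread backward from the goal gradually over many episodes.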

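Exploration vs exploitation. ε-greedy: with probability ε pick a random action, otherwise pick the action with the best Q. Too much ε → the agent keeps acting randomly and never settles on what it has learned. Too little → it commits to the first adequate path and gets stuck in local optima.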
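
A minimal sketch of ε-greedy action selection in plain Python. The random tie-breaking is an added design choice, not something the text above prescribes: without it, an all-zero Q-table makes the greedy branch return the same action every time.

```python
import random

def epsilon_greedy(q_row, eps, rng=random):
    """Pick an action index from one row of Q-values."""
    if rng.random() < eps:
        return rng.randrange(len(q_row))           # explore: uniform random action
    best = max(q_row)
    ties = [i for i, q in enumerate(q_row) if q == best]
    return rng.choice(ties)                        # exploit, breaking ties at random

rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], eps=0.0, rng=rng)
# With an untrained (all-zero) row, every action is tied for best
picks = [epsilon_greedy([0.0, 0.0, 0.0, 0.0], eps=0.0, rng=rng) for _ in range(200)]
```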

Policy gradient. Parametrize the policy directly (πθ(a | s)) and follow the gradient of expected reward. REINFORCE, A2C, PPO. Works when actions are continuous or the action space is huge.

Modern deep RL. Replace the Q-table with a neural network → DQN. Add policy networks → A2C, PPO. Add a value-network critic + replay buffer → SAC. The pieces are old; the engineering is what makes them work.

Reach for it when

  • Sequential decisions with delayed reward
  • Game-playing, robotics, control
  • RLHF — aligning models to human preferences
  • No labelled data, but a simulator or rollout mechanism

Limits

  • Sample-inefficient — usually needs millions of rollouts
  • Unstable training — exploration / exploitation balance is fragile
  • Reward design is hard — the agent will exploit whatever you wrote
  • Sim-to-real gap — policies that work in simulation often fail on hardware
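import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.95, 0.2

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # ε-greedy: explore with probability eps, otherwise act greedily
        if np.random.random() < eps:
            a = env.action_space.sample()
        else:
            # Break ties at random so the all-zero initial table
            # does not always pick action 0
            best = np.flatnonzero(Q[s] == Q[s].max())
            a = int(np.random.choice(best))
        # Gymnasium's step() returns five values; either flag ends the episode
        s_next, r, terminated, truncated, _ = env.step(a)
        # Bootstrap only from non-terminal next states
        target = r + gamma * Q[s_next].max() * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        done = terminated or truncated

# Greedy policy
policy = Q.argmax(axis=1)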
Want MDPs, Bellman equations, and actor-critic?
Bellman optimality $$ Q^*(s, a) = \mathbb{E}_{s'}\!\left[\, R(s, a) + \gamma \max_{a'} Q^*(s', a') \,\right] $$
  • Q*: optimal action-value function
  • γ: discount factor (0 = myopic; 1 = far-sighted)
  • Q-learning bootstraps this equation iteratively
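
When the transition model is known, the Bellman optimality equation can be applied directly as a repeated backup. The sketch below does this on a hypothetical three-state chain invented for the example; the dynamics are deterministic, so the expectation over s' collapses to a single next state.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2        # state 2 is the terminal goal
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    s_next = min(s + 1, N_STATES - 1) if a == RIGHT else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
for _ in range(50):                      # repeat the backup until values stop changing
    for s in range(N_STATES - 1):        # the terminal state keeps Q = 0
        for a in range(N_ACTIONS):
            s_next, r = step(s, a)
            Q[s][a] = r + GAMMA * max(Q[s_next])   # Bellman optimality backup
```

The fixed point is Q*(1, right) = 1, Q*(0, right) = 0.9, and 0.81 for both left moves: each extra step away from the goal costs one factor of γ. Q-learning aims at the same fixed point, but from sampled transitions rather than a known `step` function.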

MDPs. States S, actions A, transitions P(s' | s, a), rewards R(s, a, s'), discount γ. The agent's job is a policy π that maximises expected discounted return.

Value functions. Vπ(s) = expected return from s following π. Qπ(s, a) = expected return from s, taking action a, then following π. Both satisfy Bellman equations; both can be learned.
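
"Expected return" is an average over the discounted return of individual episodes. A small sketch of computing that return for every step of one episode, with a made-up reward sequence:

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # work backward: G_t = r_t + γ·G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A single reward of 1 at the end, seen from three steps away
G = discounted_returns([0.0, 0.0, 1.0])   # approximately [0.81, 0.9, 1.0]
```

Averaging `G[0]` over many episodes that start in s and follow π gives a Monte Carlo estimate of Vπ(s); restricting to episodes that begin with action a gives an estimate of Qπ(s, a).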

Model-free vs model-based. Model-free: learn V or Q directly from experience (Q-learning, SARSA, DQN). Model-based: learn the transition model and plan (Dyna-Q, MuZero). Model-based is more sample-efficient when the model is accurate; less so when it isn't.

Policy gradient methods. Parameterise π_θ and take steps along the gradient of expected return: ∇_θ J(θ) = E[Σ_t ∇_θ log π_θ(a_t | s_t) · A(s_t, a_t)], where A is the advantage. REINFORCE is the simplest form; A2C, A3C, PPO add stability.
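
For a softmax policy the term ∇ log π has a closed form, which makes a single policy-gradient step easy to sketch. The setup below is an assumption made for illustration: a stateless policy whose parameters are just the action logits, and a hand-picked advantage value.

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(logits, action):
    """Gradient of log π(action) w.r.t. the logits: one_hot(action) - π."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient_step(logits, action, advantage, lr=0.1):
    """One gradient-ascent step on log π(action) · advantage."""
    grad = grad_log_softmax(logits, action)
    return [x + lr * advantage * g for x, g in zip(logits, grad)]

theta = [0.0, 0.0, 0.0]
before = softmax(theta)[1]
theta = policy_gradient_step(theta, action=1, advantage=2.0)
after = softmax(theta)[1]     # action 1 had positive advantage, so it becomes more likely
```

A negative advantage flips the sign of the step and makes the sampled action less likely.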

Actor-critic. Combine policy gradient (actor) with a learned value function (critic). The critic reduces variance; the actor exploits it. Modern RL is mostly actor-critic in some form.

On-policy vs off-policy. On-policy (PPO, A2C): learn from data the current policy collected. Stable but sample-inefficient. Off-policy (DQN, SAC, DDPG): learn from any data (replay buffer). More sample-efficient; trickier to make stable.
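
The replay buffer that off-policy methods rely on can be sketched with the standard library alone. The class name and method names here are illustrative, not taken from any particular framework.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal correlation
        # between consecutive transitions
        batch = self.rng.sample(list(self.buffer), batch_size)
        return tuple(zip(*batch))                 # one tuple per field

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push(t, 0, 0.0, t + 1, False)             # transitions 0 and 1 get evicted
states, actions, rewards, next_states, dones = buf.sample(2)
```

Because sampled transitions may come from much older versions of the policy, only off-policy algorithms can learn from them directly.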

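import torch
import torch.nn as nn
import torch.nn.functional as F

# DQN: Q-function approximated by a neural net, replay buffer, target network
class DQN(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)  # shape (batch, n_actions): one Q-value per action

# One training step: returns the loss; the caller runs backward() and the optimizer
def dqn_step(net, target_net, batch, gamma=0.99):
    # s, s_next: float (B, state_dim); a: integer (B,); r: float (B,); done: bool or 0/1 (B,)
    s, a, r, s_next, done = batch
    # Q(s, a) for the actions actually taken
    q = net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap from the frozen target network; no bootstrap past terminal states
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done.float()) * q_next
    return F.smooth_l1_loss(q, target)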
Want PPO, soft actor-critic, RLHF, and the exploration zoo?
PPO clipped objective $$ \mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \min\!\Big(r_t(\theta) \hat A_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat A_t\Big) $$
  • r_t(θ): importance ratio between new and old policy
  • Clip prevents large policy updates that destabilise training
  • Default actor-critic for the practical RL world

PPO. Schulman et al. (2017). The de facto default for continuous control. Clipped ratio objective prevents large policy updates; multiple epochs over the same rollouts; advantage estimation via GAE. Practical, simple, robust.

SAC. Haarnoja et al. (2018). Off-policy actor-critic with entropy regularization — encourages exploration by maximising "expected return + policy entropy". Sample-efficient; the right choice for many continuous-action benchmarks.

MuZero. Schrittwieser et al. (2020). Learn the dynamics model in a latent space and plan with Monte-Carlo tree search. Reaches AlphaZero-level play in Go, chess, and shogi without being given the rules of the game. Beautiful theoretical synthesis.

Offline RL. Learn from a fixed dataset without environment access. Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) keep the value-learning framing but avoid trusting Q-values for actions absent from the data; Decision Transformer reformulates the problem as supervised sequence modelling. Hard because out-of-distribution (OOD) actions can't be evaluated.

RLHF. Reinforcement Learning from Human Feedback. Train a reward model from human preferences over pairs of outputs; optimise an LLM against it with PPO. Made instruction-following LLMs possible. The pretrain → SFT → RLHF pipeline is now standard for assistant-style models.
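
The reward-model step is commonly trained with a pairwise logistic (Bradley-Terry) loss. That choice is an assumption here, since the text above does not name a loss. The sketch takes two scalar reward-model scores, one for the output the human preferred and one for the output they rejected.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """-log σ(r_chosen - r_rejected), written in a numerically stable form."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff))
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

tie = preference_loss(0.0, 0.0)        # model has no opinion: loss = log 2
agree = preference_loss(2.0, -1.0)     # model ranks the pair like the human: small loss
disagree = preference_loss(-1.0, 2.0)  # model ranks the pair the wrong way: large loss
```

Only the difference between the two scores matters, so the learned reward has no fixed scale or offset. This is one reason the subsequent policy optimisation is usually regularised against drifting too far from the SFT model.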

Exploration. ε-greedy is the floor. Better: entropy bonuses (SAC), intrinsic motivation (curiosity-driven, ICM), Thompson sampling on the Q-distribution (Bootstrapped DQN), random network distillation (RND). The right method depends on the problem's reward sparsity.

The deep-RL stability cottage industry. Modern deep RL is a list of stability tricks: target networks, replay buffers, layer normalization, clipped gradients, learning-rate annealing, normalised observations, advantage normalization, … None alone is magic; together they make things barely work.
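
Most of these tricks are only a few lines each. As one example, advantage normalization within a batch might look like the sketch below; the `eps` guard against a zero standard deviation is a conventional addition, not something the list above specifies.

```python
import math

def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation within one batch."""
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]

# Raw advantages on an arbitrary scale, with one outlier
normed = normalize_advantages([10.0, 12.0, 8.0, 30.0])
```

The ordering of the advantages is preserved; only their scale changes, which keeps the effective step size of the policy update roughly constant from batch to batch.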

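import torch

# PPO clipped surrogate loss (negated, because optimizers minimise)
def ppo_loss(logp_new, logp_old, advantages, eps=0.2):
    # logp_old and advantages come from the rollout and should carry no gradient
    ratio = (logp_new - logp_old).exp()
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(surr1, surr2).mean()

# Generalised Advantage Estimation over one rollout of length T
# rewards, dones: float tensors of shape (T,); values: V(s_t), shape (T,)
# last_value: V(s_T), used to bootstrap if the rollout was cut off mid-episode
def gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(rewards) else last_value
        not_done = 1.0 - dones[t]  # zero stops values leaking across episode boundaries
        delta = rewards[t] + gamma * next_v * not_done - values[t]
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    return advantages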