Big transformers, trained on most of the internet, that you fine-tune or just prompt for everything.
Key idea
Train one huge model on huge data with a simple objective; reuse it everywhere. Next-token prediction on trillions of tokens. The resulting model has absorbed so much structure that you can prompt it to do almost any text task — without retraining. Fine-tune for specific behaviour; RLHF for alignment. The economics of ML changed when this started working.
Figure: capability emerging with scale (synthetic abilities swept across model size on a log scale).
Stylised capability curves vs model scale. Memorisation and basic syntax emerge early. Arithmetic, multi-step reasoning, and tool use emerge much later, often with sharp transitions ("emergent abilities"). These curves are caricatures of real benchmark scaling — but the shape is qualitatively right.
The recipe. Pre-train a decoder-only transformer on ~trillions of tokens. Tokenise with BPE. Train with next-token cross-entropy. Use AdamW, cosine LR schedule, large batch size. Scale parameters, data, and compute together (Chinchilla scaling).
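Two pieces of the recipe above, the next-token cross-entropy objective and the cosine learning-rate schedule, can be sketched in plain Python. This is a minimal illustration rather than a training loop: the function names, the default learning rates, and the linear warmup phase are assumptions added here, not details from the text.

```python
import math

def next_token_cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target token at each position.

    logits: one list of vocabulary scores per position.
    targets: the id of the true next token at each position.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)

def cosine_lr(step, max_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=100):
    """Linear warmup (an assumption here), then cosine decay from peak_lr to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# A model that is uniformly unsure over 4 tokens has loss log(4)
loss = next_token_cross_entropy([[0.0] * 4, [0.0] * 4], [1, 3])
```

In a real setup the same quantities would come from a framework's cross-entropy loss and scheduler; the point is only that the objective and the schedule are each a few lines of arithmetic.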
Then specialise. Supervised fine-tune (SFT) on (instruction, response) pairs. Optionally RLHF or DPO with human preference data. Now you have an instruction-following assistant rather than a raw text predictor.
In-context learning. Sufficiently large pre-trained models can do new tasks from a few examples in the prompt — no gradient updates. The mechanism is still actively researched; the practical fact is that prompt design is the dominant way teams adapt LLMs.
Retrieval-augmented generation. Embed a knowledge base; at query time, retrieve relevant chunks; concatenate into the prompt. Cheaper than fine-tuning, factually grounded, updatable. Standard architecture for "chat-with-your-docs" products.
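The embed, retrieve, concatenate flow can be sketched without any external service. The "embedding" below is a toy bag-of-words count standing in for a real embedding model, and the documents, function names, and prompt wording are all hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lower-cased bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Concatenate the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Europe takes 5 to 7 business days.",
]
prompt = build_prompt("how many days is the refund window", docs, k=1)
```

Updating the knowledge base means editing `docs`, with no retraining, which is the property the paragraph above highlights.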
Tool use & agents. LLMs can call external functions — search, code execution, API calls. Loop: prompt → tool call → result → next prompt. Powers everything from web-browsing assistants to coding agents.
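The prompt, tool call, result, next prompt loop can be sketched with a scripted stand-in for the model. Everything here is illustrative: `fake_model`, the `TOOLS` registry, and the message format are hypothetical and do not correspond to any particular vendor's tool-calling API.

```python
import json

# Hypothetical tool registry; real systems expose search, code execution, APIs, etc.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}

def fake_model(messages):
    """Stand-in for an LLM call: requests a tool first, then answers from its result."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "multiply", "args": {"a": 6, "b": 7}}
    return {"answer": f"The result is {tool_results[-1]['content']}."}

def run_agent(question, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = model(messages)
        if "answer" in reply:  # the model decided it is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # execute the requested tool
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")

answer = run_agent("What is 6 times 7?")
```

The `max_steps` cap is a deliberate design choice: a loop driven by model output needs a hard stop.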
Prompt injection — adversarial inputs can hijack instructions
Cost & latency — non-trivial for high-volume real-time use
from anthropic import Anthropic

client = Anthropic()

# Few-shot in-context learning
prompt = """
Classify the sentiment of each sentence as POSITIVE, NEGATIVE, or NEUTRAL.
Q: "The movie was breathtaking and emotional."
A: POSITIVE
Q: "It was okay, nothing special."
A: NEUTRAL
Q: "I want my two hours back."
A:
""".strip()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=10,
    messages=[{"role": "user", "content": prompt}],
)
print(resp.content[0].text)
Hoffmann et al. (2022): optimal N and D grow together
~20 tokens per parameter is the canonical "Chinchilla optimal" ratio
Scaling laws. Kaplan et al. (2020) showed that cross-entropy loss follows a predictable power law in compute, parameters, and data. Hoffmann et al. (2022) corrected the optimal ratio: for a fixed compute budget, train a smaller model on more data than GPT-3 did. The Chinchilla ratio remains the standard rule of thumb for compute-optimal training, although many deployed models deliberately train on far more tokens per parameter to reduce inference cost.
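As a worked example of the rule of thumb, the sketch below splits a compute budget into parameters N and tokens D. It assumes the common approximation C ≈ 6·N·D for training FLOPs (not stated in the text above) together with the ~20 tokens-per-parameter ratio; the function name is hypothetical.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens) using C ≈ 6*N*D and D = r*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget chosen so the answer lands near Chinchilla itself
n, d = chinchilla_optimal(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # roughly 7e10 parameters, 1.4e12 tokens
```

Because N scales with the square root of C under these assumptions, a 100× larger budget implies roughly 10× more parameters and 10× more tokens.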
Emergence. Wei et al. (2022) — some abilities (3-digit arithmetic, chain-of-thought reasoning) appear sharply at specific scales. Schaeffer et al. (2023) — emergence is partly an artefact of the metrics used; smoother curves under different metrics. Both perspectives are true; the practical takeaway is to plot multiple scales when reporting capability.
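The metric argument can be illustrated with a toy calculation. Assume, purely for illustration, that per-token accuracy improves smoothly with scale and that token errors are independent; an all-or-nothing metric such as exact match on a 10-token answer then looks like a sharp transition.

```python
def exact_match(per_token_accuracy, answer_length):
    """Probability that every token is right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

# Smooth improvement in per-token accuracy, abrupt-looking improvement in exact match
for p in (0.5, 0.7, 0.9, 0.99):
    print(f"per-token {p:.2f} -> exact match on 10 tokens {exact_match(p, 10):.3f}")
```

Exact match stays below 0.001 at 0.5 per-token accuracy, reaches about 0.35 at 0.9, and exceeds 0.9 at 0.99, even though the underlying quantity never jumps.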
RLHF and alternatives. Ouyang et al. (2022). Collect human preference pairs over model outputs; train a reward model; optimise the LLM against the reward via PPO. Modern alternatives — DPO (Rafailov et al. 2023), KTO, ORPO — skip the reward model and optimise directly on preferences. Simpler, often comparable.
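The reward-model step is usually trained with a pairwise (Bradley-Terry style) loss: the reward of the preferred response should exceed the reward of the rejected one. A minimal sketch on plain floats, with a hypothetical function name:

```python
import math

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), averaged over pairs."""
    losses = [
        math.log1p(math.exp(-(rc - rr)))  # equals -log(sigmoid(rc - rr))
        for rc, rr in zip(reward_chosen, reward_rejected)
    ]
    return sum(losses) / len(losses)

undecided = reward_model_loss([0.0], [0.0])   # log(2): the model cannot tell them apart
confident = reward_model_loss([3.0], [-1.0])  # small loss: correct ordering, wide margin
```

DPO-style methods apply the same sigmoid-of-a-difference form directly to policy log-probability ratios, which is how they avoid training a separate reward model.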
Mixture of Experts. Instead of every token going through every parameter, route each token to a few "experts". Sparse activation means, for example, 8× more parameters at roughly 1× the per-token compute. Used in Mixtral, Switch Transformer, GShard, GPT-4 (rumoured).
Context length. Vanilla attention is O(n²) in sequence length. Modern frontier models support 100k–2M token contexts via FlashAttention (a kernel-level optimisation that cuts memory traffic without changing the maths), sparse / linear attention, position interpolation for RoPE, ring attention (the sequence sharded across devices), and non-attention architectures such as Mamba.
Inference optimisation. Quantisation (4-bit, 8-bit), speculative decoding (a small model drafts, the large model verifies), KV-cache, FlashAttention. The difference between a model that runs at 50 ms/token and one that runs at 5 s/token is mostly here.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated repo: accept the licence on the Hub first

# Load a 4-bit quantised model (bitsandbytes) so it fits on a consumer GPU
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
tok = AutoTokenizer.from_pretrained(MODEL_ID)

# Chat template
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain attention to a 10-year-old."},
]
# Tokenise through the template directly so special tokens are not added twice
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# temperature only takes effect when sampling is enabled
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
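To show what quantisation does to the weights themselves, here is a minimal sketch of symmetric "absmax" 8-bit quantisation on a plain list of floats. It is a simplified illustration, not the scheme bitsandbytes uses for 4-bit loading, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 16 or 32, at the cost of a rounding error bounded by half the scale; production schemes typically quantise in small blocks so that one outlier does not inflate the scale for every weight.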
Want alignment, MoE routing, long context, and agentic systems?
y_w, y_l: "winning" and "losing" responses from human preference data
No separate reward model and no RL loop: a simple classification-style loss on preference pairs
Simpler, more stable than RLHF for many applications
Alignment. Make the model do what users (and developers) want: follow instructions and be safe, honest, and helpful. The toolkit: instruction tuning (SFT), preference optimisation (RLHF / DPO), constitutional AI (AI feedback guided by written principles), red-teaming, system prompts. Active research area; the right combination is task-dependent.
Mixture-of-Experts routing. Each token gets routed to top-k experts (out of N) based on a learned router. Tradeoffs: load balancing (avoid degenerate routing), expert capacity (how many tokens fit per expert), router stability. Switch Transformer (Fedus et al. 2022) and Mixtral demonstrate the recipe at scale.
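Top-k routing for a single token can be sketched in a few lines. The experts below are toy scalar functions, the router logits are supplied by hand rather than learned, and load balancing and capacity limits are left out, so treat this as an illustration of the selection step only.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their softmax weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exp = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exp)
    return [(i, e / z) for i, e in zip(top, exp)]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy experts; only two are evaluated for this token
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
y = moe_forward(10.0, [0.0, 1.0, 1.0, -1.0], experts, k=2)
```

With N experts and top-k selection, per-token compute scales with k while parameter count scales with N, which is the trade the section describes.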
Long context. Position encoding extensions (RoPE scaling, ALiBi), attention modifications (Flash Attention 2/3, ring attention), state-space models (Mamba). Reading 1M-token context is technically feasible now; whether the model uses it well is another matter ("needle in haystack" evals).
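RoPE scaling by linear position interpolation amounts to dividing position indices by a scale factor, so that a longer sequence is squeezed into the position range seen during training. A minimal sketch of the rotation angles, with a hypothetical function name and the commonly used base of 10000:

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle per frequency pair; scale > 1 compresses positions."""
    pos = position / scale
    return [pos / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# With scale=4, position 8192 gets the same angles as position 2048 without scaling
scaled = rope_angles(8192, head_dim=8, scale=4.0)
unscaled = rope_angles(2048, head_dim=8)
```

This only shows how positions map to angles; models usually still need some fine-tuning at the longer length to make good use of the interpolated positions.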
Agentic systems. LLMs as the reasoning core of a multi-step process: plan → call tools → observe → iterate. ReAct, Tree of Thoughts, AutoGPT, BabyAGI. The frontier is making these reliable enough for production — currently a brittle research area but improving fast.
Multi-modal LLMs. Image, audio, video tokens fed into the same model. CLIP-style encoders for input; sometimes separate decoders for output. GPT-4V, Gemini, Claude 3+ all support image input; some support image / audio output.
Synthetic data. Frontier models are increasingly trained on data generated by other LLMs — distillation, self-instruction, synthetic Q-A. Risks: model collapse from training on slop, contamination of benchmarks. Powerful when curated carefully.
Watermarking and provenance. Active research into making model outputs identifiable as machine-generated (statistical fingerprints in the sampling distribution). Important for misinformation, attribution, and benchmark integrity.
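One family of statistical fingerprints partitions the vocabulary into a pseudo-random "green list" that depends on the previous token, nudges sampling toward green tokens, and detects the watermark by counting how many tokens are green. The sketch below shows only the detection side; the hashing scheme and function names are assumptions for illustration and do not reproduce any specific published implementation.

```python
import hashlib
import math

def is_green(prev_token, token, gamma=0.5):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 256.0 < gamma

def watermark_z_score(tokens, gamma=0.5):
    """z-score of the green-token count against the unwatermarked expectation gamma * T."""
    t = len(tokens) - 1  # number of (previous, current) pairs scored
    green = sum(is_green(a, b, gamma) for a, b in zip(tokens, tokens[1:]))
    return (green - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
```

Unwatermarked text should score near zero on average, while text sampled with a green-list bias scores many standard deviations higher; the detector needs the hashing key but not the model.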
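import torch
import torch.nn.functional as F

# Direct Preference Optimization (DPO): preference learning without RL
def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Inputs are summed per-sequence log-probabilities, shape (batch,)
    pi_logratio = policy_logp_chosen - policy_logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

# Speculative decoding (greedy variant): draft, then verify
@torch.no_grad()
def speculative_generate(draft_model, target_model, ids, n_draft=4, max_length=64):
    """Draft proposes `n_draft` tokens; target verifies them in one pass.

    Assumes each model maps token ids of shape (1, T) to logits of shape (1, T, V).
    The sampling variant accepts tokens probabilistically and resamples at the
    rejection point; this greedy sketch simply takes the target's own token there.
    """
    while ids.shape[1] < max_length:
        draft = ids
        for _ in range(n_draft):  # draft model proposes tokens one at a time
            nxt = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=1)
        start = ids.shape[1]
        # Target's greedy choice for every drafted position, plus one extra position
        target_pred = target_model(draft)[:, start - 1:].argmax(-1)
        proposed = draft[:, start:]
        # Accept the longest prefix where target agrees, then append target's token
        n_ok = int((target_pred[:, :n_draft] == proposed).int().cumprod(-1).sum())
        ids = torch.cat([ids, proposed[:, :n_ok], target_pred[:, n_ok:n_ok + 1]], dim=1)
    return ids[:, :max_length]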