Foundation Models & LLMs — ML Resources Hub

Key idea

Train one huge model on huge data with a simple objective; reuse it everywhere. Next-token prediction on trillions of tokens. The resulting model has absorbed so much structure that you can prompt it to do almost any text task — without retraining. Fine-tune for specific behaviour; RLHF for alignment. The economics of ML changed when this started working.

Figure: capability emerging with scale. Synthetic abilities are swept across model size (parameters, log scale).

Stylised capability curves vs model scale. Memorisation and basic syntax emerge early. Arithmetic, multi-step reasoning, and tool use emerge much later, often with sharp transitions ("emergent abilities"). These curves are caricatures of real benchmark scaling — but the shape is qualitatively right.

The recipe. Pre-train a decoder-only transformer on ~trillions of tokens. Tokenise with BPE. Train with next-token cross-entropy. Use AdamW, cosine LR schedule, large batch size. Scale parameters, data, and compute together (Chinchilla scaling).
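To make two pieces of the recipe concrete, here is a minimal pure-Python sketch of the next-token cross-entropy loss and of a cosine learning-rate schedule with linear warmup. The function names, the warmup length, and the learning-rate values are illustrative assumptions, not settings taken from any particular model.

```python
import math

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per position
    token_ids: the token ids of the same sequence
    """
    losses = []
    # Position t predicts token t+1, so the last position has no target.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # stable log-sum-exp
        losses.append(log_z - scores[target])
    return sum(losses) / len(losses)

def cosine_lr(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=3e-5):
    """Linear warmup, then cosine decay from peak_lr to min_lr (illustrative values)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Uniform scores over a 4-token vocabulary give a loss of ln(4).
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(next_token_loss(uniform, [1, 2, 3]))
print(cosine_lr(0, 1000), cosine_lr(99, 1000), cosine_lr(1000, 1000))
```

In a real training run the same shift-by-one logic is applied to batched tensors, and the schedule is usually supplied by the training framework.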

Then specialise. Supervised fine-tune (SFT) on (instruction, response) pairs. Optionally RLHF or DPO with human preference data. Now you have an instruction-following assistant rather than a raw text predictor.

In-context learning. Sufficiently large pre-trained models can do new tasks from a few examples in the prompt — no gradient updates. The mechanism is still actively researched; the practical fact is that prompt design is the dominant way teams adapt LLMs.

Retrieval-augmented generation. Embed a knowledge base; at query time, retrieve relevant chunks; concatenate into the prompt. Cheaper than fine-tuning, easy to update, and grounded in the retrieved source text, although the model can still misread or ignore that text. Standard architecture for "chat-with-your-docs" products.
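The retrieve-then-concatenate flow can be sketched in a few lines. This is a toy: `embed` is a bag-of-words stand-in for a real embedding model, and the documents, function names, and prompt wording are all made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Concatenate the retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "A refund is processed within 5 business days.",
    "Our office is closed on public holidays.",
    "To request a refund, email support with your order number.",
]
print(build_prompt("How do I get a refund?", docs))
```

A production system would swap `embed` for a learned embedding model and the linear scan for a vector index, but the prompt-assembly step looks much the same.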

Tool use & agents. LLMs can call external functions — search, code execution, API calls. Loop: prompt → tool call → result → next prompt. Powers everything from web-browsing assistants to coding agents.
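The prompt → tool call → result loop might look like the sketch below. `fake_model` is a hypothetical stand-in for a real LLM call, and the message and tool-call shapes are invented for illustration; real provider APIs define their own formats.

```python
import json

# Hypothetical tool registry; real systems expose search, code execution, and so on.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def fake_model(messages):
    """Stand-in for an LLM: requests a tool once, then answers from the result."""
    last = messages[-1]
    if last["role"] == "user":
        return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "text": f"The answer is {last['content']}."}

def run_agent(user_msg, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):                          # cap the loop so it always terminates
        reply = model(messages)
        if reply["type"] == "final":
            return reply["text"]
        result = TOOLS[reply["name"]](**reply["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "Stopped: step limit reached."

print(run_agent("What is 2 + 3?"))
```

The step cap matters in practice: without it, a model that keeps requesting tools would loop indefinitely.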

What LLMs do well

Summarisation, translation, classification, extraction
Code generation, refactoring, explanation
Open-ended dialogue, drafting, brainstorming
Structured outputs (JSON, code, function calls) with the right APIs
Few-shot in-context learning on unseen tasks

Common failure modes

Hallucination — confident statements that are simply wrong
Stale knowledge: training cutoffs mean recent facts may be missing
Reasoning gaps — multi-step logic, arithmetic, planning
Prompt injection — adversarial inputs can hijack instructions
Cost & latency — non-trivial for high-volume real-time use

from anthropic import Anthropic

client = Anthropic()

# Few-shot in-context learning
prompt = """
Classify the sentiment of each sentence as POSITIVE, NEGATIVE, or NEUTRAL.

Q: "The movie was breathtaking and emotional."
A: POSITIVE

Q: "It was okay, nothing special."
A: NEUTRAL

Q: "I want my two hours back."
A:
""".strip()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=10,
    messages=[{"role": "user", "content": prompt}],
)
print(resp.content[0].text)

Scaling laws, RLHF, MoE, and the modern stack

Chinchilla scaling law $$ L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta} $$

N: parameters, D: training tokens
Hoffmann et al. (2022): optimal N and D grow together
~20 tokens per parameter is the canonical "Chinchilla optimal" ratio

Scaling laws. Kaplan et al. (2020) showed cross-entropy loss is predictable as a power law in compute, params, and data. Hoffmann et al. (2022) corrected the optimal ratio: for a fixed compute budget, train a smaller model on more data than GPT-3 did. The Chinchilla ratio remains the standard reference point, though many recent models deliberately train on far more tokens per parameter to make inference cheaper.
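As a worked example of the ~20 tokens-per-parameter rule, the sketch below splits a compute budget between parameters and tokens. It assumes the common approximation C ≈ 6·N·D for training FLOPs, which is not stated above; the function name is illustrative.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget using C ~= 6*N*D with D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):
    n, d = chinchilla_optimal(budget)
    print(f"C={budget:.0e} FLOPs -> N~{n:.2e} params, D~{d:.2e} tokens")

# Chinchilla itself: about 70B parameters on about 1.4T tokens.
print(chinchilla_optimal(6 * 70e9 * 1.4e12))
```

Because both N and D scale with the square root of compute under this rule, a 100× larger budget buys roughly a 10× larger model trained on 10× more data.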

Emergence. Wei et al. (2022) — some abilities (3-digit arithmetic, chain-of-thought reasoning) appear sharply at specific scales. Schaeffer et al. (2023) — emergence is partly an artefact of the metrics used; smoother curves under different metrics. Both perspectives are true; the practical takeaway is to plot multiple scales when reporting capability.
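One way to see the metric argument: if per-token accuracy improves smoothly but the benchmark scores exact match over a multi-token answer, the reported curve can look like a sharp jump. The sketch below assumes token errors are independent and uses made-up accuracy values.

```python
def exact_match(per_token_acc, answer_len):
    """Probability that every token of the answer is right, assuming independent errors."""
    return per_token_acc ** answer_len

# Hypothetical per-token accuracies that improve steadily with scale.
for acc in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"per-token {acc:.2f} -> exact match on 10 tokens {exact_match(acc, 10):.3f}")
```

The per-token column rises gradually while the exact-match column stays near zero and then climbs steeply, which is the pattern the smoother-metrics critique points at.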

RLHF and alternatives. Ouyang et al. (2022). Collect human preference pairs over model outputs; train a reward model; optimise the LLM against the reward via PPO. Modern alternatives — DPO (Rafailov et al. 2023), KTO, ORPO — skip the reward model and optimise directly on preferences. Simpler, often comparable.
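The reward-model step is typically trained with a pairwise logistic (Bradley-Terry) loss on the preference pairs. A minimal sketch on scalar rewards follows; the function name is illustrative, and a real reward model would produce these scores from a neural network.

```python
import math

def reward_pair_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): penalises ranking the rejected response higher."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)), arranged to stay numerically stable for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

print(reward_pair_loss(2.0, 0.0))   # reward model agrees with the human label
print(reward_pair_loss(0.0, 0.0))   # no preference either way
print(reward_pair_loss(0.0, 2.0))   # reward model disagrees with the human label
```

The loss only depends on the difference between the two rewards, so the reward scale is arbitrary up to an additive constant.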

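Mixture of Experts. Instead of every token going through every parameter, route tokens to a few "experts". Sparse activation: total parameter count grows with the number of experts, while per-token compute grows only with the few experts actually used (for example, roughly 8× the expert parameters at about 1× the compute with top-1 routing over 8 experts). Used in Mixtral, Switch Transformer, GShard, GPT-4 (rumoured).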

Context length. Vanilla attention is O(n²) in sequence length. Modern frontier models support 100k–2M token contexts via FlashAttention (a kernel-level optimisation that avoids materialising the full score matrix), sparse / linear attention, RoPE position interpolation, ring attention (sharding the sequence across devices), and alternative architectures such as Mamba.
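A quick back-of-the-envelope calculation shows why the quadratic term bites. The sketch assumes 32 attention heads and 2-byte (fp16/bf16) scores; both numbers are illustrative assumptions rather than the configuration of any specific model.

```python
def attention_scores_bytes(seq_len, n_heads=32, bytes_per_elem=2):
    """Memory needed to materialise the full seq_len x seq_len score matrix for every head in one layer."""
    return seq_len * seq_len * n_heads * bytes_per_elem

for n in (2_048, 32_768, 131_072):
    gib = attention_scores_bytes(n) / 2**30
    print(f"n={n:>7}: {gib:,.2f} GiB per layer")
```

Going from 2k to 128k tokens multiplies the sequence length by 64 but the score-matrix memory by 4,096, which is why long-context kernels avoid ever storing that matrix in full.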

Inference optimisation. Quantisation (4-bit, 8-bit), speculative decoding (draft + verify with a small model), KV-cache, FlashAttention. The difference between a model that's 50ms/token and 5s/token is mostly here.
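To illustrate what quantisation does to a weight vector, here is a minimal sketch of symmetric absmax 8-bit quantisation. It is one simple scheme among several; the 4-bit formats used in practice are more elaborate, and the weights and names below are made up.

```python
def quantise_int8(weights):
    """Symmetric absmax quantisation: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0   # fall back to 1.0 for all-zero input
    return [round(w / scale) for w in weights], scale

def dequantise(q, scale):
    """Recover approximate floats from the stored integers and one shared scale."""
    return [v * scale for v in q]

w = [0.02, -0.51, 0.33, 1.27, -0.08]
q, scale = quantise_int8(w)
w_hat = dequantise(q, scale)
print(q, scale)
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # rounding error is at most scale / 2
```

Storage drops from 16 or 32 bits per weight to 8 bits plus one scale per group, at the cost of a bounded rounding error.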

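from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Hub id of the Llama 3 8B instruct checkpoint (gated: accept the licence first)
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load a 4-bit quantised model (bitsandbytes) so it fits on a consumer GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                       # bitsandbytes 4-bit quantisation
        bnb_4bit_compute_dtype=torch.bfloat16,   # run matmuls in bf16
    ),
)
tok = AutoTokenizer.from_pretrained(model_id)

# Chat template: tokenise directly so special tokens are not added twice
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Explain attention to a 10-year-old."},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# temperature only takes effect when sampling is enabled
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))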

Alignment, MoE routing, long context, and agentic systems

DPO objective $$ \mathcal{L}_{\text{DPO}} = -\mathbb{E}\!\left[\log \sigma\!\big(\beta \log \tfrac{\pi_\theta(y_w | x)}{\pi_\text{ref}(y_w | x)} - \beta \log \tfrac{\pi_\theta(y_l | x)}{\pi_\text{ref}(y_l | x)}\big)\right] $$

y_w, y_l: "winning" and "losing" responses from human preference data
No separate reward model, no RL: a supervised loss derived from the closed-form optimum of the RLHF objective
Simpler, more stable than RLHF for many applications

Alignment. Make the model do what users (and developers) want: follow instructions and be safe, honest, and helpful. The toolkit: instruction tuning (SFT), preference optimisation (RLHF / DPO), constitutional AI (AI feedback guided by a written set of principles), red-teaming, system prompts. Active research area; the right combination is task-dependent.

Mixture-of-Experts routing. Each token gets routed to top-k experts (out of N) based on a learned router. Tradeoffs: load balancing (avoid degenerate routing), expert capacity (how many tokens fit per expert), router stability. Switch Transformer (Fedus et al. 2022) and Mixtral demonstrate the recipe at scale.
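Top-k routing for a single token can be sketched as follows. The gate normalisation here (a softmax over only the selected experts' logits) is one common choice, not the only one, and the toy experts and router logits are invented for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts are never run."""
    return sum(gate * experts[i](x) for i, gate in top_k_route(router_logits, k))

# Four toy "experts" that simply scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]            # hypothetical router output for one token
print(top_k_route(logits))
print(moe_forward(10.0, logits, experts))
```

In a real layer the router logits come from a learned linear projection of the token's hidden state, and an auxiliary loss is usually added to keep the load spread across experts.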

Long context. Position encoding extensions (RoPE scaling, ALiBi), attention modifications (FlashAttention-2/3, ring attention), state-space models (Mamba). Reading a 1M-token context is technically feasible now; whether the model uses it well is another matter ("needle in a haystack" evals).
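RoPE scaling in its simplest form, linear position interpolation, divides positions by a constant so that a longer sequence maps back into the position range seen during training. The sketch below uses a tiny head dimension and hypothetical context lengths purely for illustration.

```python
def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Rotation angle for each pair of dimensions at a given position.

    scale > 1 applies linear position interpolation: positions are divided by scale.
    """
    pos = position / scale
    return [pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Hypothetical setup: trained on 2,048 positions, extended 4x to 8,192.
print(rope_angles(8192, scale=4.0)[:2])
print(rope_angles(2048)[:2])
```

With a scale of 4, position 8,192 receives the same rotation angles that position 2,048 had during training, so the model never sees angles outside its trained range; the trade-off is that neighbouring positions become harder to tell apart.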

Agentic systems. LLMs as the reasoning core of a multi-step process: plan → call tools → observe → iterate. Examples include ReAct, Tree of Thoughts, AutoGPT, and BabyAGI. The frontier is making these reliable enough for production; they remain brittle but are improving fast.

Multi-modal LLMs. Image, audio, video tokens fed into the same model. CLIP-style encoders for input; sometimes separate decoders for output. GPT-4V, Gemini, Claude 3+ all support image input; some support image / audio output.

Synthetic data. Frontier models are increasingly trained on data generated by other LLMs — distillation, self-instruction, synthetic Q-A. Risks: model collapse from training on slop, contamination of benchmarks. Powerful when curated carefully.

Watermarking and provenance. Active research into making model outputs identifiable as machine-generated (statistical fingerprints in the sampling distribution). Important for misinformation, attribution, and benchmark integrity.
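One published family of schemes (for example Kirchenbauer et al., 2023) biases sampling toward a pseudo-random "green" subset of the vocabulary and later tests whether a text contains more green tokens than chance would predict. The sketch below is a toy version of that idea: the hashing rule, vocabulary size, and generator are assumptions made for illustration, not the published algorithm's exact details.

```python
import hashlib
import math

VOCAB_SIZE = 1000      # hypothetical vocabulary size
GAMMA = 0.5            # fraction of the vocabulary treated as "green" at each step

def is_green(prev_token, token):
    """Pseudo-random green/red split of the vocabulary, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return digest[0] < 256 * GAMMA

def green_z_score(tokens):
    """z-score of the green-token count against the no-watermark expectation."""
    t = len(tokens) - 1                       # the first token has no predecessor
    greens = sum(is_green(p, c) for p, c in zip(tokens, tokens[1:]))
    return (greens - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))

def watermarked_sequence(length, start=0):
    """Toy generator that always emits the smallest green token id."""
    tokens = [start]
    while len(tokens) < length:
        tokens.append(next(tok for tok in range(VOCAB_SIZE) if is_green(tokens[-1], tok)))
    return tokens

print(green_z_score(watermarked_sequence(51)))   # 50 scored tokens, all green
```

Ordinary text should score near zero on average, while text generated with the green bias scores far above typical detection thresholds; a real scheme nudges probabilities softly rather than forcing every token to be green.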

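import torch
import torch.nn.functional as F

# Direct Preference Optimization (DPO): preference learning without RL.
# Inputs are summed token log-probs of each full response, shape (batch,).
def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen,    ref_logp_rejected, beta=0.1):
    pi_logratio  = policy_logp_chosen - policy_logp_rejected
    ref_logratio = ref_logp_chosen    - ref_logp_rejected
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

# Speculative decoding, greedy variant: draft + verify.
# Assumes each model is a callable mapping token ids of shape (1, T) to
# logits of shape (1, T, vocab), and that `prompt` is a non-empty (1, T) tensor.
@torch.no_grad()
def speculative_generate(draft_model, target_model, prompt, n_draft=4, max_length=64):
    """Draft generates `n_draft` tokens; target verifies them in one pass."""
    ids = prompt
    while ids.shape[1] < max_length:
        start = ids.shape[1]
        draft = ids
        for _ in range(n_draft):                        # cheap autoregressive drafting
            nxt = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=1)
        # One target pass: logits at position i predict token i + 1
        target_pred = target_model(draft)[:, start - 1:].argmax(-1)   # (1, n_draft + 1)
        proposed = draft[:, start:]                                    # (1, n_draft)
        agree = (target_pred[:, :n_draft] == proposed)[0].long()
        n_ok = int(agree.cumprod(0).sum())              # length of the accepted prefix
        # Keep accepted tokens, then the target's own token at the first mismatch
        # (or one bonus token if every draft token was accepted)
        ids = torch.cat([draft[:, :start + n_ok], target_pred[:, n_ok:n_ok + 1]], dim=1)
    return ids[:, :max_length]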
