Key idea

Words become vectors; sentences become sequences of vectors; models work on those. The whole stack — tokenisation, embedding, attention, decoding — is about turning fuzzy human-written language into something a model can compute on, then back. Modern NLP is almost entirely transformer-based now, but the older pieces still show up everywhere.

Type a sentence — watch byte-pair tokenisation chunk it and embed each token as a vector

A toy tokeniser splits your text into sub-word chunks (some common words stay whole, rarer ones break up). Each token becomes a vector — visualised here as a coloured bar pattern. Real models use 768- or 4096-dim vectors; this is a sketch. The structural point is that tokens, not characters or whole words, are what models work with.

Tokenisation. Split text into reusable chunks. Modern: byte-pair encoding (BPE), WordPiece, SentencePiece — vocabularies of ~30k–100k sub-word units. Common words stay whole; rare words decompose. Crucial detail: tokens, not characters, are the model's atoms.

Embeddings. Each token → a learned vector. Same dimension for everything. Vectors with similar meanings end up geometrically close — but the magic is in relative structure: "king − man + woman ≈ queen" works in word2vec.

Transformers. The current standard. Self-attention lets each token attend to every other, no recurrence. Scales to long sequences (with the right tricks) and trains in parallel. See the Transformer page for details.

Decoder, encoder, or both. BERT is encoder-only (good for classification, NER). GPT is decoder-only (good for generation). T5 and BART are encoder-decoder (good for translation, summarisation). Modern frontier is mostly decoder-only.

The fine-tuning stack. Pre-train on huge unlabeled text → supervised fine-tune on instruction-following pairs → RLHF for alignment. The recipe that turned GPT-3 into ChatGPT.

Classical NLP tasks

  • Classification: sentiment, topic, spam — fine-tune a BERT-like
  • NER: tag spans (people, places, products)
  • Translation / summarisation: T5, BART, or just GPT-4
  • Question answering: retrieve relevant docs, then read with an LLM
  • Embedding for search: sentence-transformers, BGE, E5

Watch out

  • Tokenisation quirks bite you in unexpected ways (numbers, code, multi-byte chars)
  • Context windows are finite — long-doc reading needs chunking
  • LLM hallucination is real — pair with retrieval for factual tasks
  • Cross-lingual transfer is uneven; resource-poor languages still suffer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Tokeniser handles every detail of text → token IDs
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = tok("the quick brown fox", return_tensors="pt", padding=True, truncation=True)
print(enc.input_ids)            # tensor of subword IDs
print(tok.convert_ids_to_tokens(enc.input_ids[0]))

# Fine-tune for classification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
out   = model(**enc, labels=torch.tensor([1]))
out.loss.backward()
Want word embeddings, sub-word tokenisation, and modern architectures?
Cross-entropy on a vocabulary $$ \mathcal{L} = -\sum_{t=1}^{T} \log p(w_t \mid w_{1:t-1}) $$
  • p(wt | …)softmax over the vocabulary at position t
  • Next-token prediction — the universal NLP objective
  • Pre-training is exactly this over trillions of tokens

Word embeddings (word2vec, GloVe). Classical: learn one vector per word from co-occurrence statistics or a shallow predictive task. Famous for the analogy structure (king − man + woman ≈ queen). Now subsumed by contextual embeddings from transformers, but the geometric idea persists.

Contextual embeddings. ELMo, BERT, RoBERTa — the same word gets different vectors depending on context. "Bank" in "river bank" vs "savings bank" gets different embeddings. Pre-training task: masked token prediction or next-token prediction.

Subword tokenisation. BPE (used by GPT) merges most-frequent byte pairs iteratively. WordPiece (BERT) does similar but with a likelihood criterion. SentencePiece treats text as a raw byte stream — language-agnostic. Vocabulary typically 32k–100k. Affects model performance, training speed, and downstream task design.

Sequence-to-sequence. Encoder reads input; decoder generates output token-by-token, attending to the encoder's output. Translation, summarisation, dialogue, code generation — all variants.

Causal vs masked attention. Causal (decoder-only, GPT): each position only attends to past positions. Masked (encoder-only, BERT): every position sees every other. Different trade-offs for generation vs comprehension; modern frontier favours causal.

Decoding strategies. Greedy: pick the argmax. Beam search: keep top-k hypotheses. Sampling: temperature, top-k, top-p (nucleus). Each gives different generation quality; sampling with low temperature is the modern default for assistant-style models.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok   = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The future of AI is"
ids = tok(prompt, return_tensors="pt").input_ids

# Generate with sampling — top-p (nucleus) decoding
out = model.generate(
    ids,
    do_sample=True,
    temperature=0.8,
    top_p=0.92,
    max_new_tokens=50,
)
print(tok.decode(out[0]))
Want RAG, long-context tricks, multilingual, and structured outputs?
Retrieval-augmented generation $$ p(y \mid x) = \sum_{z \in \mathrm{top}_k(x)} p_\eta(z \mid x) \, p_\theta(y \mid x, z) $$
  • zretrieved documents; pη retriever
  • pθgenerator conditioned on retrieved context
  • Decouples knowledge from parameters; fresh facts via a vector store

Retrieval-augmented generation (RAG). Embed your knowledge base; at query time retrieve top-k relevant chunks; concatenate with the prompt; let the LLM answer. Cheaper than fine-tuning, factually grounded, and updatable. The standard production architecture for "chat-with-your-docs" applications.

Long-context architectures. Vanilla attention is O(n²) in sequence length — prohibitive past ~8k tokens. Modern approaches: sparse attention (Longformer, BigBird), linear attention (Linear Transformer, Performer), state-space models (Mamba), retrieval-augmented memory (Memorising Transformer). Most production long-context models combine several tricks.

Tool use. LLMs that can call external functions: search, code execution, calculators, web browsers, retrieval. The agent loop: prompt → function call → result → prompt with result. Modern frameworks (LangChain, LlamaIndex, OpenAI function calling) standardise this.

Instruction tuning & RLHF. Fine-tune a pre-trained LLM on (instruction, response) pairs (SFT), then optimise against a reward model trained from human preferences (RLHF) or against rule-based rewards (RLAIF, Constitutional AI). The reason GPT-3 was a curiosity and GPT-3.5 became a product.

Multilingual. Most modern foundation models are heavily English-skewed but include some multilingual capability via the tokeniser's byte fallback. Performance drops sharply on low-resource languages; specialised models (mT5, NLLB, BLOOM) help.

Structured outputs. Force the LLM to produce valid JSON, code, or other formats by constraining the decoding step (grammar-constrained generation, JSON mode, JSONSchema). Crucial for production reliability; OpenAI, Anthropic, and others ship native support.

from sentence_transformers import SentenceTransformer
import numpy as np

# Build a simple RAG retriever
encoder = SentenceTransformer("all-MiniLM-L6-v2")
docs    = ["Paris is the capital of France.",
           "The Eiffel Tower was built in 1889.",
           "Croissants are flaky French pastries.",
           ...]
doc_emb = encoder.encode(docs)

def retrieve(query, k=3):
    q = encoder.encode(query)
    sims = doc_emb @ q
    return [docs[i] for i in sims.argsort()[::-1][:k]]

# Then call the LLM with retrieved context:
context = "\n".join(retrieve("when was the Eiffel Tower built?"))
prompt  = f"Use this context:\n{context}\nQ: when was the Eiffel Tower built?\nA:"
Too dense?