Tokenisation. Split text into reusable chunks. Modern: byte-pair encoding (BPE), WordPiece, SentencePiece — vocabularies of ~30k–100k sub-word units. Common words stay whole; rare words decompose. Crucial detail: tokens, not characters, are the model's atoms.
Embeddings. Each token → a learned vector. Same dimension for everything. Vectors with similar meanings end up geometrically close — but the magic is in relative structure: "king − man + woman ≈ queen" works in word2vec.
Transformers. The current standard. Self-attention lets each token attend to every other, no recurrence. Scales to long sequences (with the right tricks) and trains in parallel. See the Transformer page for details.
Decoder, encoder, or both. BERT is encoder-only (good for classification, NER). GPT is decoder-only (good for generation). T5 and BART are encoder-decoder (good for translation, summarisation). Modern frontier is mostly decoder-only.
The fine-tuning stack. Pre-train on huge unlabeled text → supervised fine-tune on instruction-following pairs → RLHF for alignment. The recipe that turned GPT-3 into ChatGPT.