1.1 What is a Large Language Model?

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Neural networks → Transformers → LLMs · Parameters · Training vs Inference

Official Docs: OpenAI — What are LLMs? · Hugging Face NLP Course


What is a Large Language Model?

A Large Language Model (LLM) is a neural network trained on large amounts of text data. Its training objective is simple: predict the next token in a sequence. All capabilities — answering questions, writing code, summarising documents — emerge from learning this single task at scale.
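
To make the next-token objective concrete, here is a minimal sketch of the loss it usually corresponds to: the average negative log-probability the model assigns to the token that actually came next. The function name `next_token_loss`, the 3-token vocabulary, and the probability values are made up for illustration; real training code computes this over batches with a deep-learning framework.

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Average negative log-likelihood of the correct next tokens.

    predicted_probs[t] is a probability distribution over the vocabulary
    at position t; target_ids[t] is the token that actually came next.
    """
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Hypothetical 3-token vocabulary, two prediction steps.
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
uncertain = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
targets = [0, 1]

loss_confident = next_token_loss(confident, targets)
loss_uncertain = next_token_loss(uncertain, targets)
```

A model that puts more probability on the correct next token gets a lower loss, which is what training pushes towards.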

Input text  →  Tokenizer  →  Embedding  →  Transformer Layers  →  Output probabilities  →  Next token
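
The pipeline above can be sketched end to end in plain Python. Everything here is a toy assumption: the six-word vocabulary, the whitespace tokenizer, the random (untrained) weights, and especially `transformer_layers`, which is only a placeholder that averages the token vectors. The point is the shape of the data flow, not the quality of the prediction.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

vocab = ["<unk>", "the", "cat", "sat", "on", "mat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
dim = 4

# Untrained random weights: stand-ins for learned parameters.
embedding = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
lm_head = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def tokenize(text):
    # Toy whitespace tokenizer; unknown words map to id 0 (<unk>).
    return [token_to_id.get(word, 0) for word in text.lower().split()]

def transformer_layers(vectors):
    # Placeholder for the real layer stack: average the token vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(text):
    ids = tokenize(text)                      # Input text -> token IDs
    vectors = [embedding[i] for i in ids]     # Embedding lookup
    hidden = transformer_layers(vectors)      # Transformer layers (stubbed)
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]
    probs = softmax(logits)                   # Output probabilities
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs                 # Next token (greedy pick)

token, probs = next_token("the cat sat")
```

Because the weights are random, the chosen token is arbitrary; training is what makes the output probabilities meaningful.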

The Transformer Architecture

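Modern LLMs are built on the Transformer architecture. The key innovation is self-attention, which lets each token draw directly on other tokens in the sequence, regardless of how far apart they are. In the decoder-only models used for most LLMs, a causal mask restricts each token to itself and the tokens before it.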

Component                  Role
Embedding layer            Converts token IDs to dense vectors
Positional encoding        Adds position information to each token
Multi-head self-attention  Each token attends to all other tokens
Feed-forward network       Per-token non-linear transformation
Layer norm + residuals     Stabilises training
LM head                    Projects to vocabulary → next-token probabilities
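
As a rough illustration of the self-attention row, the sketch below implements single-head scaled dot-product attention with a causal mask, as used in decoder-only models. It is simplified on purpose: queries, keys and values are the input vectors themselves, whereas a real layer first applies learned projection matrices and runs several heads in parallel. The function name and the example vectors are made up.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Single-head scaled dot-product attention with a causal mask.

    x is a list of token vectors. Queries, keys and values are the
    inputs themselves (no learned projections in this sketch).
    """
    dim = len(x[0])
    outputs, all_weights = [], []
    for i, query in enumerate(x):
        # Token i may only look at positions 0..i (itself and earlier).
        scores = [
            sum(q * k for q, k in zip(query, x[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted mix of the visible vectors.
        mixed = [
            sum(w * x[j][d] for j, w in enumerate(weights))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens)
```

The first token can only attend to itself, so its output equals its input; the last token mixes information from all three positions in a single step, however far apart they are.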

Parameters

A parameter is a learnable number (weight) stored in the network. Models range from millions to hundreds of billions of parameters. Parameters encode patterns learned during training — grammar, facts, reasoning styles.

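📌 Rule of thumb: a model with N billion parameters needs roughly 2N GB of GPU memory just to hold its weights at 16-bit precision (2 bytes per parameter). Activations and the KV cache need additional memory on top of that.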
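
The rule of thumb is just parameters × bytes per parameter. A small sketch, assuming 1 GB = 10^9 bytes and counting weights only (the helper name `weight_memory_gb` is made up for illustration):

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate memory for the weights alone, with 1 GB = 10**9 bytes.

    N billion parameters x B bytes each = N * B GB.
    """
    return params_billions * bytes_per_param

fp16_13b = weight_memory_gb(13)      # 16-bit: 2 bytes per parameter
fp32_13b = weight_memory_gb(13, 4)   # 32-bit: 4 bytes per parameter
```

Here `fp16_13b` is 26 and `fp32_13b` is 52, so halving the precision halves the memory needed for the weights.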


Training vs Inference

┌──────────────────────────────────────────────┐
│  PRE-TRAINING                                │
│  Objective: predict next token               │
│  Data: large text corpora                    │
│  Cost: very high (weeks, large GPU clusters) │
├──────────────────────────────────────────────┤
│  FINE-TUNING / ALIGNMENT                     │
│  Supervised training on instruction pairs    │
│  Alignment with human feedback               │
├──────────────────────────────────────────────┤
│  INFERENCE                                   │
│  Single forward pass per token               │
│  Cost: low (milliseconds per token)          │
└──────────────────────────────────────────────┘
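
The split between training and inference can be illustrated with a deliberately tiny stand-in for an LLM: a bigram model that counts which word follows which. This is not a Transformer and not how real LLMs are trained; the corpus and the function names `train` and `generate` are made up. It only shows the pattern of learning once up front, then predicting one token per step.

```python
from collections import Counter, defaultdict

def train(corpus_tokens):
    """'Training': count which token follows which (done once, up front)."""
    counts = defaultdict(Counter)
    for current, following in zip(corpus_tokens, corpus_tokens[1:]):
        counts[current][following] += 1
    return counts

def generate(model, prompt_token, max_new_tokens):
    """'Inference': one prediction step per generated token."""
    output = [prompt_token]
    for _ in range(max_new_tokens):
        followers = model.get(output[-1])
        if not followers:
            break  # no known continuation for this token
        # Greedy choice: the most frequent follower seen in training.
        output.append(followers.most_common(1)[0][0])
    return output

corpus = "the cat sat on the mat and the cat sat on it".split()
model = train(corpus)
generated = generate(model, "the", 3)
```

Each new token is appended to the sequence and becomes input for the next step, which is why generation cost grows with the number of tokens produced.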

Key Takeaways

  • LLMs are next-token predictors — all capabilities emerge from this objective
  • The Transformer (self-attention + feed-forward) is the universal building block
  • Parameters store learned patterns; data quality matters as much as size
  • Pre-training is expensive; inference is cheap

Common Misconceptions

Common Mistakes
  1. "Bigger is always better": A 7B model fine-tuned on domain data can outperform a 70B general model on a narrow task.
  2. "LLMs understand language": LLMs statistically predict tokens. They have no grounding in the world; all "understanding" is pattern matching.
  3. "Temperature = creativity": Temperature rescales the model's scores before they are turned into probabilities, so low values sharpen the distribution and high values flatten it. Setting temperature=0 doesn't make the model "think harder"; it amounts to always picking the most likely token, which makes the output deterministic in principle.
  4. Confusing training with inference: Pre-training happens once at enormous cost. Inference is cheap by comparison and happens per request.
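
To see what temperature actually does in point 3, the sketch below divides a set of made-up scores (logits) by the temperature before applying softmax. The helper name and the example values are illustrative assumptions, and treating temperature=0 as a plain argmax reflects common practice rather than any particular library's behaviour.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1)
    # the resulting distribution. Temperature must be greater than 0.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens

sharp = softmax_with_temperature(logits, 0.1)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 10.0)

# Temperature 0 cannot be plugged into the formula (division by zero);
# it is usually handled as greedy decoding: pick the highest score.
greedy_choice = max(range(len(logits)), key=logits.__getitem__)
```

The ranking of the tokens never changes; only how much probability the top token takes from the others.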

Quick Quiz

Test Your Understanding

Q1. What is the training objective of every LLM?
A1. Predict the next token in a sequence.

Q2. What component of the Transformer architecture allows every token to relate to every other token?
A2. Multi-head self-attention.

Q3. A model has 13 billion parameters. Approximately how much GPU memory (fp16) does it need at inference?
A3. ~26 GB for the weights alone (13 billion parameters × 2 bytes = 26 GB at 16-bit precision).

Q4. What is the difference between pre-training and fine-tuning?
A4. Pre-training learns general language patterns from massive text corpora (predict next token). Fine-tuning adapts the model to a specific task using labelled instruction/response pairs.


Student Exercise

Exercise 1.1 — Explore the Transformer interactively
Visit BertViz or the Transformer Explainer and trace how attention weights change for different sentences. Write a 5-sentence summary of what you observe.

Exercise 1.2 — Count parameters
Open Hugging Face Model Hub and find the parameter counts of: GPT-2 (small), LLaMA 3.1 8B, and Mistral 7B. Calculate the fp16 VRAM requirement for each.

