1.1 What is a Large Language Model?
Key Concepts: Neural networks → Transformers → LLMs · Parameters · Training vs Inference
Official Docs: OpenAI — What are LLMs? · Hugging Face NLP Course
What is a Large Language Model?
A Large Language Model (LLM) is a neural network trained on large amounts of text data. Its training objective is simple: predict the next token in a sequence. All capabilities — answering questions, writing code, summarising documents — emerge from learning this single task at scale.
Input text → Tokenizer → Embedding → Transformer Layers → Output probabilities → Next token
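To make this pipeline concrete, here is a minimal sketch of a single next-token prediction step. It assumes the Hugging Face `transformers` library and the small public `gpt2` checkpoint; both are illustrative choices, not part of this course.

```python
# Minimal sketch of next-token prediction, assuming the Hugging Face
# `transformers` library and the publicly available "gpt2" checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")      # text -> token IDs

with torch.no_grad():
    logits = model(**inputs).logits                # one forward pass

# Probability distribution over the vocabulary for the *next* token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>10}  {prob.item():.3f}")
```

Running this prints the five most likely continuations and their probabilities; picking one and appending it to the input is exactly the loop an LLM repeats to generate text.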
The Transformer Architecture
Modern LLMs are built on the Transformer architecture. The key innovation is self-attention, which allows every token to directly relate to every other token in the sequence, regardless of distance.
| Component | Role |
|---|---|
| Embedding layer | Converts token IDs to dense vectors |
| Positional encoding | Adds position information to each token |
| Multi-head self-attention | Each token attends to all other tokens |
| Feed-forward network | Per-token non-linear transformation |
| Layer norm + residuals | Stabilises training |
| LM head | Projects to vocabulary → next-token probabilities |
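As an illustration of the self-attention row in the table, the toy snippet below computes single-head scaled dot-product attention over a short sequence of random embeddings. The dimensions and weight matrices are made up for the example, not taken from any real model.

```python
# Toy scaled dot-product self-attention (single head): every token
# attends to every other token, regardless of distance.
import torch

seq_len, d_model = 4, 8                      # 4 tokens, 8-dim embeddings
x = torch.randn(seq_len, d_model)            # token embeddings (one sequence)

# Learned projections (random here; trained in a real model)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / d_model ** 0.5            # (seq_len, seq_len) token-to-token scores
weights = torch.softmax(scores, dim=-1)      # each row sums to 1
output = weights @ V                         # weighted mix of all token values

print(weights)  # row i shows how strongly token i attends to each token
```

In a real Transformer this computation runs in parallel across multiple heads and is followed by the per-token feed-forward network, layer norm, and residual connections listed above.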
Parameters
A parameter is a learnable number (weight) stored in the network. Models range from millions to hundreds of billions of parameters. Parameters encode patterns learned during training — grammar, facts, reasoning styles.
📌 Rule of thumb: a model with N billion parameters needs roughly 2N GB of GPU memory at 16-bit precision (2 bytes per parameter), counting the weights alone; activations and the KV cache add more on top.
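A quick back-of-the-envelope check of this rule of thumb (weights only, 2 bytes per parameter):

```python
# At 16-bit (2-byte) precision, N billion parameters occupy roughly 2N GB
# of memory for the weights alone.
def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    return n_params_billion * 1e9 * bytes_per_param / 1e9

for n in (7, 13, 70):
    print(f"{n}B params @ fp16 ≈ {weight_memory_gb(n):.0f} GB")
# 7B ≈ 14 GB, 13B ≈ 26 GB, 70B ≈ 140 GB
```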
Training vs Inference
┌──────────────────────────────────────────────┐
│ PRE-TRAINING │
│ Objective: predict next token │
│ Data: large text corpora │
│ Cost: very high (weeks, large GPU clusters) │
├──────────────────────────────────────────────┤
│ FINE-TUNING / ALIGNMENT │
│ Supervised training on instruction pairs │
│ Alignment with human feedback │
├──────────────────────────────────────────────┤
│ INFERENCE │
│ Single forward pass per token │
│ Cost: low (milliseconds per token) │
└──────────────────────────────────────────────┘
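The sketch below contrasts the two regimes: a training step scores every next-token prediction in a sequence at once, while inference generates one token per forward pass. The `model(token_ids) -> logits` callable is a hypothetical stand-in, not a specific library API.

```python
# Training objective vs. inference loop, sketched with a hypothetical
# `model(token_ids) -> logits` callable.
import torch
import torch.nn.functional as F

# --- Training step: score every next-token prediction in one pass ---
def training_loss(model, token_ids):                 # token_ids: (batch, seq_len)
    logits = model(token_ids[:, :-1])                # predictions for positions 1..seq_len-1
    targets = token_ids[:, 1:]                       # the actual next tokens
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

# --- Inference: one forward pass per generated token ---
def generate(model, token_ids, max_new_tokens=20):
    for _ in range(max_new_tokens):
        logits = model(token_ids)                            # forward pass over the context
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True) # greedy choice
        token_ids = torch.cat([token_ids, next_id], dim=-1)  # append and repeat
    return token_ids
```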
Key Takeaways
- LLMs are next-token predictors — all capabilities emerge from this objective
- The Transformer (self-attention + feed-forward) is the universal building block
- Parameters store learned patterns; data quality matters as much as size
- Pre-training is expensive; inference is comparatively cheap per token