
1.1 What is a Large Language Model?

Key Concepts: Neural networks → Transformers → LLMs · Parameters · Training vs Inference

Official Docs: OpenAI — What are LLMs? · Hugging Face NLP Course


What is a Large Language Model?

A Large Language Model (LLM) is a neural network trained on large amounts of text data. Its training objective is simple: predict the next token in a sequence. All capabilities — answering questions, writing code, summarising documents — emerge from learning this single task at scale.

Input text  →  Tokenizer  →  Embedding  →  Transformer Layers  →  Output probabilities  →  Next token
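To make the pipeline concrete, here is a minimal sketch of one next-token prediction step, assuming the Hugging Face transformers library with GPT-2 as an illustrative model (this course section does not prescribe a specific model or library; the names below are just for demonstration):

```python
# Minimal sketch of the pipeline: text -> tokens -> transformer -> probabilities -> next token.
# Assumes the Hugging Face `transformers` library and GPT-2 purely as an example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")          # Tokenizer: text -> token IDs

with torch.no_grad():
    logits = model(**inputs).logits                    # Transformer layers -> scores over the vocabulary

next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # Output probabilities for the next token
next_token_id = torch.argmax(next_token_probs).item()     # Greedy pick of the most likely token
print(tokenizer.decode(next_token_id))                     # e.g. " Paris"
```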

The Transformer Architecture

Modern LLMs are built on the Transformer architecture. The key innovation is self-attention, which allows every token to directly relate to every other token in the sequence, regardless of distance.

Component                    Role
Embedding layer              Converts token IDs to dense vectors
Positional encoding          Adds position information to each token
Multi-head self-attention    Each token attends to all other tokens
Feed-forward network         Per-token non-linear transformation
Layer norm + residuals       Stabilises training
LM head                      Projects to vocabulary → next-token probabilities
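The self-attention step can be sketched in a few lines. The snippet below shows a single attention head with no masking, in PyTorch; the shapes and weight names are assumptions for illustration, not taken from any particular model implementation:

```python
# Illustrative single-head self-attention (scaled dot-product), PyTorch.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) token embeddings
    q = x @ w_q                               # queries
    k = x @ w_k                               # keys
    v = x @ w_v                               # values
    scores = q @ k.T / k.shape[-1] ** 0.5     # every token scores every other token
    weights = F.softmax(scores, dim=-1)       # attention weights sum to 1 per token
    return weights @ v                        # weighted mix of value vectors

d_model = 8
x = torch.randn(5, d_model)                                   # 5 tokens
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                        # (5, 8): one output vector per token
```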

Parameters

A parameter is a learnable number (weight) stored in the network. Models range from millions to hundreds of billions of parameters. Parameters encode patterns learned during training — grammar, facts, reasoning styles.

📌 Rule of thumb: a model with N billion parameters needs roughly 2N GB of GPU memory at 16-bit precision.
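As a back-of-the-envelope check of that rule (using a hypothetical 7-billion-parameter model, not a measurement of any specific one):

```python
# 16-bit precision stores each parameter in 2 bytes.
params = 7e9                     # hypothetical 7B-parameter model
bytes_per_param = 2              # fp16 / bf16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB for the weights alone")   # ~14 GB, before activations and KV cache
```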


Training vs Inference

┌──────────────────────────────────────────────┐
│ PRE-TRAINING                                 │
│ Objective: predict next token                │
│ Data: large text corpora                     │
│ Cost: very high (weeks, large GPU clusters)  │
├──────────────────────────────────────────────┤
│ FINE-TUNING / ALIGNMENT                      │
│ Supervised training on instruction pairs     │
│ Alignment with human feedback                │
├──────────────────────────────────────────────┤
│ INFERENCE                                    │
│ Single forward pass per token                │
│ Cost: low (milliseconds per token)           │
└──────────────────────────────────────────────┘
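The contrast can be sketched in code. Below, `model` is a placeholder for a generic PyTorch causal language model that maps token IDs to logits; it stands in for whatever architecture is being trained and is not part of this section's material:

```python
# Training step: each position predicts the next token; backpropagate cross-entropy.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    logits = model(token_ids[:, :-1])                 # predict token t+1 from tokens <= t
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # (batch * seq, vocab)
        token_ids[:, 1:].reshape(-1),                 # the "next token" targets
    )
    loss.backward()                                   # expensive: gradients for every parameter
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Inference: one forward pass per generated token, no gradients needed.
@torch.no_grad()
def generate(model, token_ids, max_new_tokens=20):
    for _ in range(max_new_tokens):
        logits = model(token_ids)                     # forward pass only
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=-1)
    return token_ids
```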

Key Takeaways

  • LLMs are next-token predictors — all capabilities emerge from this objective
  • The Transformer (self-attention + feed-forward) is the universal building block
  • Parameters store learned patterns; data quality matters as much as size
  • Pre-training is expensive; inference is cheap

Next → 1.2 Tokenization & Context Windows