1.2 Tokenization & Context Windows
Key Concepts: Tokens · BPE · Context length limits · Context explosion
Official Docs: OpenAI Tokenizer · Hugging Face Tokenizers
What is a Token?
LLMs do not process characters or words; they process tokens. A token is a chunk of text produced by a tokenizer, averaging roughly 3–4 characters in English.
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Large Language Models are fascinating!")
print(tokens) # list of integer IDs
print(len(tokens)) # number of tokens; the exact count depends on the encoding
🔢 Rough rule: 1 token ≈ 0.75 English words. 1,000 tokens ≈ 750 words.
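The rule of thumb above is simple arithmetic, and a small helper makes it easy to apply when sizing a prompt. This is only a rough sketch: the function names are made up for illustration, and the 0.75 ratio holds for typical English prose only.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; typical English prose only


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens an English text of `word_count` words uses."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Estimate how many English words fit in `token_count` tokens."""
    return round(token_count * WORDS_PER_TOKEN)


print(estimate_tokens(750))  # 1000
print(estimate_words(1000))  # 750
```

For anything that affects billing or a hard limit, count with the model's real tokenizer instead of estimating.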
Byte-Pair Encoding (BPE)
Most modern LLMs use BPE tokenization:
- Start with individual bytes as base tokens
- Repeatedly merge the most frequent adjacent pair into a new token
- Repeat until the target vocabulary size is reached
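The three steps above can be sketched as a toy trainer. This is a minimal illustration, not how production tokenizers are implemented: it works on one raw byte stream, skips the word-level pre-splitting that real tokenizers usually apply, and takes a merge count in place of a target vocabulary size. The name `train_bpe` is hypothetical.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merge rules from `text`, starting from single bytes."""
    # Step 1: every byte is its own symbol.
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Step 2: find the most frequent adjacent pair of symbols.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append((left, right))
        # Replace each occurrence of the pair with one merged symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Step 3: the loop ends when the merge budget is spent or nothing repeats.
    return merges


merges = train_bpe("aaabdaaabac", num_merges=3)
print(merges[0])  # (b'a', b'a')
```

Each learned merge becomes a new vocabulary entry, which is why frequent substrings end up as single tokens and rare ones fall back to smaller pieces.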
"tokenization" → ["token", "ization"] # 2 tokens
"tokenize" → ["token", "ize"] # 2 tokens
"tok3nize" → ["tok", "3", "n", "ize"] # 4 tokens ← unusual text costs more
Why it matters:
- Non-English text, code, and numbers tokenize less efficiently
- API costs are per token — token awareness saves money
- Prompt formatting affects total token count
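Because billing is per token, a cost estimate is one multiplication. The sketch below uses $0.15 per 1M input tokens purely as an example rate (the gpt-4o-mini figure quoted in the exercise later in this lesson). Prices change, so treat the rate as an assumption and check the provider's pricing page.

```python
def estimate_cost_usd(token_count: int, usd_per_million_tokens: float) -> float:
    """Return the estimated cost in USD for `token_count` tokens at the given rate."""
    return token_count / 1_000_000 * usd_per_million_tokens


# Example: an 8,000-token prompt at an assumed $0.15 per 1M input tokens.
print(f"${estimate_cost_usd(8_000, 0.15):.4f}")  # $0.0012
```

Input and output tokens are often priced differently, so a fuller estimate would call this once per direction.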
Context Window
The context window is the maximum number of tokens the model can process at once (prompt + response combined). Most APIs reject a request that exceeds it with an error, though some SDKs truncate the input instead.
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Gemini 1.5 Pro | 2,000,000 tokens |
| LLaMA 3.1 / 3.2 | 128,000 tokens |
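
Since the prompt and the response share one window, it helps to check the budget before sending a request. The helpers below are a sketch with hypothetical names. They assume the only constraint is the combined limit, although some providers also cap output length separately.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved response space fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def remaining_output_budget(prompt_tokens: int, context_window: int) -> int:
    """Tokens left for the response once the prompt is counted (never negative)."""
    return max(context_window - prompt_tokens, 0)


WINDOW = 128_000  # e.g. the GPT-4o figure from the table above

print(fits_in_context(120_000, 4_000, WINDOW))   # True
print(fits_in_context(126_000, 4_000, WINDOW))   # False
print(remaining_output_budget(126_000, WINDOW))  # 2000
```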
Quality Degrades at High Fill Levels
Even within the context window, models tend to perform better on information near the beginning and end of the context. Information buried in the middle of a very long context is sometimes overlooked. Keep retrieved context concise.
For RAG, aim to keep retrieved context under roughly 4,000 tokens per query where possible.
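One way to apply both points is to cap retrieved chunks at a token budget and then arrange them so the strongest ones sit at the edges of the prompt. The sketch below is an illustration under stated assumptions, not a method taken from the source: the function names are hypothetical, the chunks are assumed to arrive ranked best-first, and `rough_token_count` is a crude word-based stand-in that should be replaced with the model's real tokenizer.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Crude stand-in: about 1 token per 0.75 words. Swap in a real tokenizer."""
    return round(len(text.split()) / 0.75)


def select_chunks(
    ranked_chunks: list[str],
    budget: int = 4_000,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted best-first by the retriever
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected


def order_for_edges(ranked_chunks: list[str]) -> list[str]:
    """Put the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


print(order_for_edges(["a", "b", "c", "d", "e"]))  # ['a', 'c', 'e', 'd', 'b']
```

Passing `count_tokens` as a parameter keeps the budgeting logic independent of any one tokenizer library.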
Counting Tokens in Code
# OpenAI models — use tiktoken
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))
print(count_tokens("Hello, world!")) # 4
# Hugging Face models — use the model's own tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer("Hello, world!", return_tensors="pt") # "pt" requires PyTorch
print(tokens["input_ids"].shape) # (1, n): n token IDs, including any special tokens the tokenizer adds
Key Takeaways
- Tokens ≠ words. English averages ~0.75 words per token.
- BPE merges frequent byte pairs to build a vocabulary — unusual text costs more tokens.
- The context window is the hard limit on how much text the model can process at once.
- Models tend to perform worse on information in the middle of very long contexts ("lost in the middle" problem).
- API costs are charged per token — always count before sending.
Common Mistakes
- Ignoring token costs — sending full documents when only excerpts are needed wastes tokens and money.
- Assuming 1 word = 1 token — code, numbers, and non-English text tokenize less efficiently.
- Exceeding the context window silently — some SDKs truncate automatically; others raise an error. Always check.
- Mixing tokenizers — using tiktoken (OpenAI) to count tokens for a LLaMA model will give wrong counts.
Quick Quiz
Q1. Roughly how many tokens is 750 English words?
A1. ~1,000 tokens (1 token ≈ 0.75 words).
Q2. Why does the string "tok3nize" produce more tokens than "tokenize"?
A2. The digit 3 is an unusual character in the middle of a word; the BPE vocabulary doesn't have a merge for that pattern, so it splits into more pieces.
Q3. What is the context window of GPT-4o?
A3. 128,000 tokens (as of the current release — always verify at platform.openai.com/docs/models).
Q4. What tool does OpenAI provide to tokenize text interactively?
A4. The OpenAI Tokenizer playground.
Student Exercise
Exercise 1.2 — Observe BPE in action
Go to platform.openai.com/tokenizer. Paste the same sentence in English, Arabic, and Python code. Record the token counts and explain why they differ.
Exercise 1.3 — Build a token counter
Write a Python function using tiktoken that accepts a list of messages (OpenAI format) and prints the total tokens, cost at $0.15/1M input tokens (gpt-4o-mini pricing), and a warning if the total exceeds 8,000 tokens.
Further Reading
- 🌐 OpenAI Tokenizer playground
- 📘 Hugging Face Tokenizers docs
- 📄 Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2016 — BPE paper)
- 📄 Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)
Next → 1.3 How LLMs Generate Text
---
## Further Reading
- 🛠️ [OpenAI Tokenizer Playground](https://platform.openai.com/tokenizer)
- 📦 [tiktoken on GitHub](https://github.com/openai/tiktoken)
- 📘 [Hugging Face Tokenizers Docs](https://huggingface.co/docs/tokenizers/index)
**Next →** [1.3 Text Generation & Sampling](text-generation)