1.2 Tokenization & Context Windows
Key Concepts: Tokens · BPE · Context length limits · Token counting
Official Docs: OpenAI Tokenizer · Hugging Face Tokenizers
What is a Token?
LLMs do not process characters or words — they process tokens. A token is a chunk of text produced by a tokenizer, typically 3–4 characters on average in English.
```python
import tiktoken

# Load the tokenizer that matches a specific model
enc = tiktoken.encoding_for_model("gpt-4o")

tokens = enc.encode("Large Language Models are fascinating!")
print(tokens)       # list of integer token IDs
print(len(tokens))  # token count (the exact number depends on the encoding)
```
🔢 Rough rule: 1 token ≈ 0.75 English words. 1,000 tokens ≈ 750 words.
Byte-Pair Encoding (BPE)
Most modern LLMs use BPE tokenization:
- Start from individual bytes as the base vocabulary
- Count all adjacent token pairs and merge the most frequent pair into a new token
- Repeat the merge step until the target vocabulary size is reached
"tokenization" → ["token", "ization"] # 2 tokens
"tokenize" → ["token", "ize"] # 2 tokens
"tok3nize" → ["tok", "3", "n", "ize"] # 4 tokens ← unusual text costs more
Why it matters:
- Non-English text, code, and numbers tokenize less efficiently
- API costs are billed per token, so token awareness saves money (see the cost sketch after this list)
- Prompt formatting affects total token count
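As a quick illustration of per-token billing, the sketch below estimates the input cost of a prompt. The price constant is a made-up placeholder, not a real quote; substitute your provider's current rates.

```python
import tiktoken

# Placeholder price for illustration only; check your provider's pricing page
USD_PER_1M_INPUT_TOKENS = 2.50

def estimate_input_cost(prompt: str, model: str = "gpt-4o") -> float:
    enc = tiktoken.encoding_for_model(model)
    n_tokens = len(enc.encode(prompt))
    return n_tokens / 1_000_000 * USD_PER_1M_INPUT_TOKENS

print(f"${estimate_input_cost('Summarize this report. ' * 500):.4f}")
```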
Context Window
The context window is the maximum number of tokens the model can process at once (prompt + response combined). Exceeding it typically causes an API error, though some serving stacks silently truncate the input instead.
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Gemini 1.5 Pro | 2,000,000 tokens |
| LLaMA 3.1 / 3.2 | 128,000 tokens |
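Because the limit covers the prompt and the response together, it pays to budget explicitly. The sketch below hard-truncates text to a fixed token budget with tiktoken; the budget numbers are arbitrary examples, and real applications usually truncate at sentence or chunk boundaries instead.

```python
import tiktoken

def truncate_to_budget(text: str, max_tokens: int, model: str = "gpt-4o") -> str:
    """Hard-truncate text so it encodes to at most max_tokens tokens."""
    enc = tiktoken.encoding_for_model(model)
    ids = enc.encode(text)
    if len(ids) <= max_tokens:
        return text
    return enc.decode(ids[:max_tokens])

# Example budget: 128,000-token window minus 4,000 tokens reserved for the reply
PROMPT_BUDGET = 128_000 - 4_000
long_document = "lorem ipsum " * 200_000  # stand-in for a huge input
prompt = truncate_to_budget(long_document, PROMPT_BUDGET)
```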
Quality Degrades at High Fill Levels
Even within the context window, models tend to perform better on information near the beginning and end of the context; information buried in the middle of a very long context is sometimes overlooked (the "lost in the middle" effect). Keep retrieved context concise.
For RAG, keep retrieved context under 4,000 tokens per query where possible.
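One way to enforce such a cap is to pack retrieved chunks greedily until the budget runs out. This is a minimal sketch assuming the chunks arrive already ranked best-first by the retriever; the 4,000-token default mirrors the guideline above.

```python
import tiktoken

def pack_context(chunks: list[str], budget: int = 4_000, model: str = "gpt-4o") -> str:
    """Keep the highest-ranked chunks that fit within the token budget."""
    enc = tiktoken.encoding_for_model(model)
    kept, used = [], 0
    for chunk in chunks:  # assumed sorted best-first by the retriever
        n = len(enc.encode(chunk))
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)
```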
Counting Tokens in Code
```python
# OpenAI models: use tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("Hello, world!"))  # 4
```

```python
# Hugging Face models: use the model's own tokenizer
from transformers import AutoTokenizer

# Note: this checkpoint is gated on the Hub and requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer("Hello, world!", return_tensors="pt")
print(tokens["input_ids"].shape)  # torch.Size([1, number_of_tokens])
```
Key Takeaways
- Tokens ≠ words; always count tokens when estimating cost and context usage
- BPE merges common subwords; unusual text uses more tokens per word
- Context window = hard limit; plan for overflow before it happens