1.2 Tokenization & Context Windows

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.
Key Concepts: Tokens · BPE · Context length limits · Context explosion

Official Docs: OpenAI Tokenizer · Hugging Face Tokenizers


What is a Token?

LLMs do not process characters or words — they process tokens. A token is a chunk of text produced by a tokenizer, typically 3–4 characters on average in English.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Large Language Models are fascinating!")
print(tokens)       # list of integer IDs
print(len(tokens))  # number of tokens

🔢 Rough rule: 1 token ≈ 0.75 English words. 1,000 tokens ≈ 750 words.


Byte-Pair Encoding (BPE)

Most modern LLMs use BPE tokenization:

  1. Start with individual bytes as base tokens
  2. Merge the most frequent adjacent pair into a new token
  3. Repeat until the target vocabulary size is reached
"tokenization"  →  ["token", "ization"]      # 2 tokens
"tokenize"      →  ["token", "ize"]          # 2 tokens
"tok3nize"      →  ["tok", "3", "n", "ize"]  # 4 tokens ← unusual text costs more
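The merge loop above can be sketched in a few lines. This is a toy trainer on a tiny corpus with characters as base tokens (real tokenizers start from bytes and train on huge corpora):

```python
from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus (characters as base tokens)."""
    words = Counter(tuple(w) for w in corpus)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Note how frequent fragments ("low") quickly become single tokens while rare suffixes stay split — exactly why "tokenize" is cheap and "tok3nize" is not.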

Why it matters:

  • Non-English text, code, and numbers tokenize less efficiently
  • API costs are per token — token awareness saves money
  • Prompt formatting affects total token count
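Because billing is per token, a small helper makes the cost impact concrete. The prices below are purely illustrative, not actual rates for any model:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Estimate API cost in dollars, given per-1M-token prices."""
    return (prompt_tokens * input_price
            + completion_tokens * output_price) / 1_000_000

# Hypothetical prices: $2.50 per 1M input tokens, $10.00 per 1M output tokens
cost = estimate_cost(prompt_tokens=50_000, completion_tokens=2_000,
                     input_price=2.50, output_price=10.00)
print(f"${cost:.4f}")  # $0.1450
```

Trimming a verbose prompt from 50,000 to 10,000 tokens cuts that input cost by 80% — token awareness compounds quickly at scale.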

Context Window

The context window is the maximum number of tokens the model can process at once (prompt + response combined). Exceeding it raises an error.

Model               | Context Window
--------------------|------------------
GPT-4o              | 128,000 tokens
Claude 3.5 Sonnet   | 200,000 tokens
Gemini 1.5 Pro      | 2,000,000 tokens
LLaMA 3.1 / 3.2     | 128,000 tokens
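Since the limit covers prompt and response together, a pre-flight check should reserve room for the planned response. A minimal sketch, using GPT-4o's 128,000-token window as the default:

```python
def fits_context(prompt_tokens: int, max_response_tokens: int,
                 context_window: int = 128_000) -> bool:
    """Check that prompt plus planned response fit within the context window."""
    return prompt_tokens + max_response_tokens <= context_window

print(fits_context(120_000, 4_000))  # True:  124,000 <= 128,000
print(fits_context(126_000, 4_000))  # False: 130,000 >  128,000
```

Running this check before every API call (and truncating or summarizing the prompt when it fails) is cheaper than catching the overflow error afterwards.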

Quality Degrades in Long Contexts

Even within the context window, models perform best on information near the beginning and end of the context; information buried in the middle of a very long context is often overlooked (the "lost in the middle" effect). Keep retrieved context concise.

Production Advice

For RAG, keep retrieved context under 4,000 tokens per query where possible.
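One way to enforce such a budget is to greedily keep retrieved chunks, in ranked order, until the budget is spent. This sketch takes the token counter as a parameter; the whitespace-splitting `approx` counter is a crude stand-in — in practice you would pass a real tokenizer's count function:

```python
from typing import Callable

def pack_chunks(chunks: list[str], budget: int,
                count: Callable[[str], int]) -> list[str]:
    """Keep retrieved chunks (highest-ranked first) until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        n = count(chunk)
        if used + n > budget:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk)
        used += n
    return kept

# Crude whitespace "tokenizer" for illustration only
approx = lambda s: len(s.split())
docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(pack_chunks(docs, budget=5, count=approx))  # ['alpha beta gamma', 'delta epsilon']
```

Stopping at the first overflowing chunk (rather than skipping it and trying later ones) preserves the retriever's ranking: lower-ranked chunks never displace higher-ranked ones.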


Counting Tokens in Code

# OpenAI models — use tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("Hello, world!"))  # 4

# Hugging Face models — use the model's own tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer("Hello, world!", return_tensors="pt")
print(tokens["input_ids"].shape)

Key Takeaways

  • Tokens ≠ words; always count tokens when estimating cost and context usage
  • BPE merges common subwords; unusual text uses more tokens per word
  • Context window = hard limit; plan for overflow before it happens

Next → 1.3 Text Generation & Sampling