1.2 Tokenization & Context Windows
Key Concepts: Tokens · BPE · Context length limits · Token counting
Official Docs: OpenAI Tokenizer · Hugging Face Tokenizers
What is a Token?
LLMs do not process characters or words — they process tokens. A token is a chunk of text produced by a tokenizer, typically 3–4 characters on average in English.
```python
import tiktoken

# Load the tokenizer that matches a specific model
enc = tiktoken.encoding_for_model("gpt-4o")

tokens = enc.encode("Large Language Models are fascinating!")
print(tokens)       # list of integer token IDs
print(len(tokens))  # token count (the exact number depends on the encoding)
```
🔢 Rough rule: 1 token ≈ 0.75 English words. 1,000 tokens ≈ 750 words.
Byte-Pair Encoding (BPE)
Most modern LLMs use BPE tokenization:
- Start from individual bytes as the base vocabulary
- Count all adjacent token pairs and merge the most frequent pair into a new token
- Repeat the merge step until the target vocabulary size is reached
"tokenization" → ["token", "ization"] # 2 tokens
"tokenize" → ["token", "ize"] # 2 tokens
"tok3nize" → ["tok", "3", "n", "ize"] # 4 tokens ← unusual text costs more
Why it matters:
- Non-English text, code, and numbers tokenize less efficiently
- API costs are billed per token, so token awareness saves money (see the cost sketch after this list)
- Prompt formatting affects total token count
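As a quick illustration of per-token billing, the sketch below estimates the input cost of a prompt. The price constant is a made-up placeholder, not a real quote; substitute your provider's current rates.

```python
import tiktoken

# Placeholder price for illustration only; check your provider's pricing page
USD_PER_1M_INPUT_TOKENS = 2.50

def estimate_input_cost(prompt: str, model: str = "gpt-4o") -> float:
    enc = tiktoken.encoding_for_model(model)
    n_tokens = len(enc.encode(prompt))
    return n_tokens / 1_000_000 * USD_PER_1M_INPUT_TOKENS

print(f"${estimate_input_cost('Summarize this report. ' * 500):.4f}")
```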
Context Window
The context window is the maximum number of tokens the model can process at once (prompt + response combined). Exceeding it typically causes an API error, though some serving stacks silently truncate the input instead.
| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Gemini 1.5 Pro | 2,000,000 tokens |
| LLaMA 3.1 / 3.2 | 128,000 tokens |
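Because the limit covers the prompt and the response together, it pays to budget explicitly. The sketch below hard-truncates text to a fixed token budget with tiktoken; the budget numbers are arbitrary examples, and real applications usually truncate at sentence or chunk boundaries instead.

```python
import tiktoken

def truncate_to_budget(text: str, max_tokens: int, model: str = "gpt-4o") -> str:
    """Hard-truncate text so it encodes to at most max_tokens tokens."""
    enc = tiktoken.encoding_for_model(model)
    ids = enc.encode(text)
    if len(ids) <= max_tokens:
        return text
    return enc.decode(ids[:max_tokens])

# Example budget: 128,000-token window minus 4,000 tokens reserved for the reply
PROMPT_BUDGET = 128_000 - 4_000
long_document = "lorem ipsum " * 200_000  # stand-in for a huge input
prompt = truncate_to_budget(long_document, PROMPT_BUDGET)
```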
Quality Degrades at High Fill Levels
Even within the context window, models tend to perform better on information near the beginning and end of the context; information buried in the middle of a very long context is sometimes overlooked (the "lost in the middle" effect). Keep retrieved context concise.
For RAG, keep retrieved context under 4,000 tokens per query where possible.
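One way to enforce such a cap is to pack retrieved chunks greedily until the budget runs out. This is a minimal sketch assuming the chunks arrive already ranked best-first by the retriever; the 4,000-token default mirrors the guideline above.

```python
import tiktoken

def pack_context(chunks: list[str], budget: int = 4_000, model: str = "gpt-4o") -> str:
    """Keep the highest-ranked chunks that fit within the token budget."""
    enc = tiktoken.encoding_for_model(model)
    kept, used = [], 0
    for chunk in chunks:  # assumed sorted best-first by the retriever
        n = len(enc.encode(chunk))
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)
```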
Counting Tokens in Code
```python
# OpenAI models: use tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("Hello, world!"))  # 4
```

```python
# Hugging Face models: use the model's own tokenizer
from transformers import AutoTokenizer

# Note: this checkpoint is gated on the Hub and requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer("Hello, world!", return_tensors="pt")
print(tokens["input_ids"].shape)  # torch.Size([1, number_of_tokens])
```
Key Takeaways
- Tokens ≠ words; always count tokens when estimating cost and context usage
- BPE merges common subwords; unusual text uses more tokens per word
- Context window = hard limit; plan for overflow before it happens