1.2 Tokenization & Context Windows

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Tokens · BPE · Context length limits · Context explosion

Official Docs: OpenAI Tokenizer · Hugging Face Tokenizers


What is a Token?

LLMs do not process characters or words; they process tokens. A token is a chunk of text produced by a tokenizer, averaging roughly 3–4 characters in English.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Large Language Models are fascinating!")
print(tokens) # list of integer IDs
print(len(tokens)) # number of tokens; the exact count depends on the encoding

🔢 Rough rule: 1 token ≈ 0.75 English words. 1,000 tokens ≈ 750 words.
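The rule of thumb above is simple arithmetic, and a minimal sketch makes it explicit. The function name and the default ratio parameter are illustrative choices; the result is only a planning estimate, not a replacement for running a real tokenizer.

```python
def estimate_tokens_from_words(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate for English text (heuristic, not a tokenizer)."""
    return round(word_count / words_per_token)

print(estimate_tokens_from_words(750))   # 1000
print(estimate_tokens_from_words(3000))  # 4000
```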


Byte-Pair Encoding (BPE)

Most modern LLMs use BPE tokenization:

  1. Start with individual bytes as base tokens
  2. Repeatedly merge the most frequent adjacent pair into a new token
  3. Repeat until the target vocabulary size is reached

Illustrative splits (exact boundaries depend on the tokenizer's vocabulary):

"tokenization" → ["token", "ization"]       # 2 tokens
"tokenize"     → ["token", "ize"]           # 2 tokens
"tok3nize"     → ["tok", "3", "n", "ize"]   # 4 tokens ← unusual text costs more
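
The three steps above can be sketched as a toy trainer. This is a simplified illustration under stated assumptions: it works on characters rather than raw bytes, uses a tiny made-up corpus, and the names `train_bpe`, `merge_pair`, and `encode` are hypothetical helpers, not part of tiktoken or Hugging Face Tokenizers.

```python
from collections import Counter

def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Step 1: every word starts as a sequence of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    # Step 3: the loop ends when the merge budget (vocabulary size) is used up.
    return merges

def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = train_bpe(["abab", "abab", "ab"], num_merges=2)
print(merges)                    # [('a', 'b'), ('ab', 'ab')]
print(encode("ababab", merges))  # ['abab', 'ab']
print(encode("ba", merges))      # ['b', 'a']
```

Text that matches no learned merge, like "ba" here, stays split into its base symbols, which is the same effect that makes unusual strings cost more tokens.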

Why it matters:

  • Non-English text, code, and numbers tokenize less efficiently
  • API costs are per token — token awareness saves money
  • Prompt formatting affects total token count
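
One contributing factor behind the first bullet can be seen without any tokenizer. Byte-level BPE starts from UTF-8 bytes, and non-Latin scripts need more bytes per character, so they begin with more base tokens before any merges apply. The sketch below uses only the standard library; the sample strings and the `utf8_length` helper are illustrative choices, and byte counts are not token counts.

```python
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Japanese": "こんにちは",
}

def utf8_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

for language, text in samples.items():
    print(f"{language}: {len(text)} characters, {utf8_length(text)} bytes")
# English: 5 characters, 5 bytes
# Arabic: 5 characters, 10 bytes
# Japanese: 5 characters, 15 bytes
```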

Context Window

The context window is the maximum number of tokens the model can process at once (prompt + response combined). A request that exceeds it is typically rejected with an error, although some SDKs truncate the input instead.

| Model | Context Window |
|---|---|
| GPT-4o | 128,000 tokens |
| Claude 3.5 Sonnet | 200,000 tokens |
| Gemini 1.5 Pro | 2,000,000 tokens |
| LLaMA 3.1 / 3.2 | 128,000 tokens |
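
Because the limit covers prompt and response together, a pre-flight check can be sketched as below. The function names and the example token counts are hypothetical; the 128,000 figure is the GPT-4o value listed above and should be verified against current provider documentation.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved response budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window

def remaining_output_budget(prompt_tokens: int, context_window: int) -> int:
    """Tokens left for the response after the prompt is accounted for."""
    return max(0, context_window - prompt_tokens)

GPT_4O_WINDOW = 128_000  # from the list above; verify before relying on it

print(fits_in_context(120_000, 4_000, GPT_4O_WINDOW))   # True
print(fits_in_context(126_000, 4_000, GPT_4O_WINDOW))   # False
print(remaining_output_budget(126_000, GPT_4O_WINDOW))  # 2000
```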

Quality Degrades at High Fill Levels

Even within the context window, models tend to perform better on information near the beginning and end of the context. Information buried in the middle of a very long context is sometimes overlooked. Keep retrieved context concise.

Production Advice

For RAG, keep retrieved context under 4,000 tokens per query where possible.
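
One way to enforce such a budget is to add ranked chunks until the next one would overflow. This is a minimal sketch, not a prescribed method: `select_chunks` and `approx_token_count` are hypothetical helpers, and the default counter assumes about 4 characters per token, so in practice you would pass in a counter backed by the model's real tokenizer.

```python
from typing import Callable

def approx_token_count(text: str) -> int:
    """Crude estimate assuming ~4 characters per token (assumption only)."""
    return max(1, len(text) // 4)

def select_chunks(
    chunks: list[str],
    budget: int = 4000,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected

# Three hypothetical chunks of about 2,000 estimated tokens each.
ranked = ["a" * 8000, "b" * 8000, "c" * 8000]
print(len(select_chunks(ranked)))  # 2
```

Stopping at the first chunk that does not fit preserves the retriever's ranking; skipping it and trying smaller chunks further down is an alternative design with a different trade-off.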


Counting Tokens in Code

# OpenAI models: use tiktoken
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    # Look up the encoding that matches the model, then count the encoded IDs
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("Hello, world!")) # 4

# Hugging Face models: use the model's own tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer("Hello, world!", return_tensors="pt") # returns PyTorch tensors, so torch must be installed
print(tokens["input_ids"].shape) # (1, n_tokens); n_tokens includes any special tokens the tokenizer adds

Key Takeaways

  • Tokens ≠ words. English averages ~0.75 words per token.
  • BPE merges frequent byte pairs to build a vocabulary — unusual text costs more tokens.
  • The context window is the hard limit on how much text the model can process at once.
  • Models perform worse on information in the middle of very long contexts ("lost in the middle" problem).
  • API costs are charged per token — always count before sending.

Common Mistakes

  1. Ignoring token costs — sending full documents when only excerpts are needed wastes tokens and money.
  2. Assuming 1 word = 1 token — code, numbers, and non-English text tokenize less efficiently.
  3. Exceeding the context window silently — some SDKs truncate automatically; others raise an error. Always check.
  4. Mixing tokenizers — using tiktoken (OpenAI) to count tokens for a LLaMA model will give wrong counts.
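
The first mistake is easy to quantify with per-token arithmetic. In this sketch the function name and both token counts are hypothetical, and the price is the gpt-4o-mini input rate quoted in this section's exercise, which should be checked against current pricing.

```python
def input_cost_usd(num_tokens: int, usd_per_million: float) -> float:
    """Input cost for a request, given a price per one million tokens."""
    return num_tokens * usd_per_million / 1_000_000

PRICE_PER_MILLION = 0.15  # USD; figure quoted in the exercise, verify before use

full_document_tokens = 50_000  # hypothetical: the whole document
excerpt_tokens = 2_000         # hypothetical: only the relevant excerpt

print(f"{input_cost_usd(full_document_tokens, PRICE_PER_MILLION):.4f}")  # 0.0075
print(f"{input_cost_usd(excerpt_tokens, PRICE_PER_MILLION):.4f}")        # 0.0003
```

The per-request difference looks small, but it scales linearly with request volume.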

Quick Quiz

Test Your Understanding

Q1. Roughly how many tokens is 750 English words?
A1. ~1,000 tokens (1 token ≈ 0.75 words).

Q2. Why does the string "tok3nize" produce more tokens than "tokenize"?
A2. The digit 3 is an unusual character in the middle of a word; the BPE vocabulary doesn't have a merge for that pattern, so it splits into more pieces.

Q3. What is the context window of GPT-4o?
A3. 128,000 tokens (as of the current release — always verify at platform.openai.com/docs/models).

Q4. What tool does OpenAI provide to tokenise text interactively?
A4. The OpenAI Tokenizer playground.


Student Exercise

Exercise 1.2 — Observe BPE in action
Go to platform.openai.com/tokenizer. Paste the same sentence in English, Arabic, and Python code. Record the token counts and explain why they differ.

Exercise 1.3 — Build a token counter
Write a Python function using tiktoken that accepts a list of messages (OpenAI format) and prints the total tokens, cost at $0.15/1M input tokens (gpt-4o-mini pricing), and a warning if the total exceeds 8,000 tokens.


---

## Key Takeaways

- Tokens ≠ words; always count tokens when estimating cost and context usage
- BPE merges common subwords; unusual text uses more tokens per word
- Context window = hard limit; plan for overflow before it happens

---

## Further Reading

- 🛠️ [OpenAI Tokenizer Playground](https://platform.openai.com/tokenizer)
- 📦 [tiktoken on GitHub](https://github.com/openai/tiktoken)
- 📘 [Hugging Face Tokenizers Docs](https://huggingface.co/docs/tokenizers/index)

**Next →** [1.3 Text Generation & Sampling](text-generation)