3.4 Cost & Token Management
AI-generated content may contain errors. Always verify against official sources.
Key Concepts: Token counting · Cost estimation · Batching · Caching strategies
Official Docs: OpenAI Pricing · Anthropic Pricing · OpenAI Prompt Caching
Understanding Token Costs
LLM APIs charge separately for input tokens (your prompt) and output tokens (the model’s response). Output tokens typically cost several times more than input tokens: 4× for the OpenAI models priced in this section.
Always verify current prices at the official provider pricing pages — prices change frequently.
Counting Tokens Before Sending
import tiktoken


def count_tokens(messages: list[dict], model: str = "gpt-4o-mini") -> int:
    """Estimate input tokens for an OpenAI chat messages array."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to the encoding used by the gpt-4o family
        enc = tiktoken.get_encoding("o200k_base")
    tokens = 0
    for msg in messages:
        tokens += 3  # approximate per-message overhead for recent chat models
        for value in msg.values():
            tokens += len(enc.encode(str(value)))
    tokens += 3  # reply priming
    return tokens


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gradient descent in 3 sentences."},
]
print(f"Estimated input tokens: {count_tokens(messages)}")
Cost Estimation Utility
from openai import OpenAI

client = OpenAI()

# Example prices in USD per 1M tokens (verify at openai.com/api/pricing)
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    # Batch API rate for gpt-4o-mini; a lookup key, not a model name the API accepts
    "gpt-4o-mini-batch": {"input": 0.075, "output": 0.30},
}


def call_and_cost(messages: list[dict], model: str = "gpt-4o-mini") -> tuple[str, float]:
    """Send one chat request and return (reply text, cost in USD)."""
    if model not in PRICING:
        # Fail loudly instead of silently reporting a cost of $0
        raise ValueError(f"No pricing entry for model {model!r}")
    rates = PRICING[model]
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    usage = resp.usage  # token counts as billed by the API
    cost = (
        usage.prompt_tokens / 1_000_000 * rates["input"]
        + usage.completion_tokens / 1_000_000 * rates["output"]
    )
    return resp.choices[0].message.content, cost


answer, usd = call_and_cost(
    [{"role": "user", "content": "What is the capital of France?"}]
)
print(answer)
print(f"Cost: ${usd:.6f}")
Prompt Caching
OpenAI automatically caches the prefix of long prompts. Repeated requests that share the same prefix (e.g., a large system prompt) pay a reduced rate for cached tokens.
from pathlib import Path

# Long, stable system prompt: eligible for caching on repeat requests
SYSTEM_PROMPT = Path("large_knowledge_base.txt").read_text(encoding="utf-8")  # e.g. 8000 tokens
user_query = "Summarise the key points."  # placeholder question

# First call: full input price
# Subsequent calls with the same prefix: cached tokens are billed at a discount
# (~50% for this model; verify on the pricing page)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ],
)
print(resp.usage.prompt_tokens_details)  # shows cached_tokens
See OpenAI Prompt Caching docs for eligibility rules.
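To make the saving concrete, here is a minimal sketch of the input-side arithmetic. The helper `input_cost_with_cache` and its `cached_discount` parameter are hypothetical names introduced for illustration, the 50% discount is an assumption taken from the figure quoted above, and the token counts are made up.

```python
def input_cost_with_cache(
    prompt_tokens: int,
    cached_tokens: int,
    input_rate: float,
    cached_discount: float = 0.5,
) -> float:
    """USD cost of the input side of one request.

    input_rate is USD per 1M tokens. cached_discount is an assumed fraction:
    0.5 means cached tokens cost half the normal input rate.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    cached_rate = input_rate * (1 - cached_discount)
    return (uncached_tokens * input_rate + cached_tokens * cached_rate) / 1_000_000


# Made-up usage figures: an 8,200-token prompt of which 8,000 tokens hit the cache.
# In a real call these would come from resp.usage.prompt_tokens and
# resp.usage.prompt_tokens_details.cached_tokens.
cold = input_cost_with_cache(8_200, 0, input_rate=0.15)
warm = input_cost_with_cache(8_200, 8_000, input_rate=0.15)
print(f"cold: ${cold:.6f}  warm: ${warm:.6f}")
```

Only the cached portion of the prompt is discounted, so the saving approaches the full discount only when the shared prefix dominates the request.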
Batch API (50% Discount)
For non-real-time tasks (dataset processing, bulk analysis), the OpenAI Batch API offers a 50% cost reduction in exchange for asynchronous processing: results are returned within a 24-hour window rather than immediately.
import json

# Build one request per input text; custom_id lets you match results to inputs later
batch_requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": text}],
            "max_tokens": 128,
        },
    }
    for i, text in enumerate(["Summarise: ...", "Classify: ..."])
]

# Write the requests as JSONL (one JSON object per line)
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit (client is the OpenAI() instance created earlier)
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # poll later with client.batches.retrieve(batch.id)
Cost Reduction Strategies
| Strategy | Typical Saving | Notes |
|---|---|---|
| Use mini models | 10–20× cheaper | gpt-4o-mini vs gpt-4o |
| Prompt caching | ~50% on cached input tokens | Stable prompt prefix ≥ 1,024 tokens |
| Batch API | 50% | Non-real-time tasks only |
| Trim retrieved context | Variable | Keep RAG context ≤ 4k tokens |
| Reduce max_tokens | Direct | Set tight limits for short-answer tasks |
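As a rough illustration of the first and third rows, the sketch below prices the same workload three ways. The prices are the example figures used in this section and may be out of date; the workload (10,000 requests of 1,000 input and 200 output tokens each) and the helper `workload_cost` are made up for the example.

```python
# Example prices in USD per 1M tokens, copied from this section; verify before use.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o-mini (Batch API)": {"input": 0.075, "output": 0.30},
}


def workload_cost(n_requests: int, input_tokens: int, output_tokens: int, rates: dict) -> float:
    """Total USD cost for n_requests identical requests at the given rates."""
    total_in = n_requests * input_tokens
    total_out = n_requests * output_tokens
    return (total_in * rates["input"] + total_out * rates["output"]) / 1_000_000


# Hypothetical workload: 10,000 requests, 1,000 input + 200 output tokens each
costs = {name: workload_cost(10_000, 1_000, 200, rates) for name, rates in PRICES.items()}
for name, usd in costs.items():
    print(f"{name:<26} ${usd:>7.2f}")
```

At these example prices the workload costs about $45 on gpt-4o, $2.70 on gpt-4o-mini (roughly 17× cheaper), and $1.35 through the Batch API.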
Common Mistakes
- Not monitoring token usage: resending an ever-growing conversation history without a cap makes each request more expensive than the last, so cumulative cost grows roughly quadratically with the number of turns.
- Using gpt-4o for simple tasks: classification and short summaries don’t need the flagship model. Use gpt-4o-mini.
- Ignoring the Batch API: if you’re processing thousands of documents offline, the Batch API halves your cost. The only change is submitting requests as a JSONL file and collecting the results later.
- Putting sensitive data in cached prompts: caching is meant for stable, non-sensitive prefixes. Don’t inject PII into a system prompt just to make it cacheable.
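One way to guard against the first mistake is to cap the history at a token budget before every request. This is a minimal sketch, not a recommended policy: `trim_history` is a hypothetical helper, it assumes system messages sit at the start of the list, and the demo uses a crude word count in place of a real tokenizer so that it runs without extra dependencies.

```python
from typing import Callable


def trim_history(
    messages: list[dict],
    max_tokens: int,
    count_tokens: Callable[[list[dict]], int],
) -> list[dict]:
    """Drop the oldest non-system messages until the list fits within max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and count_tokens(system + rest) > max_tokens:
        rest.pop(0)  # discard the oldest turn first
    return system + rest


def word_count(msgs: list[dict]) -> int:
    """Stand-in counter for the demo; swap in a tokenizer-based counter in practice."""
    return sum(len(m["content"].split()) for m in msgs)


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "first answer here"},
    {"role": "user", "content": "second question here"},
]
trimmed = trim_history(history, max_tokens=11, count_tokens=word_count)
print([m["content"] for m in trimmed])
```

Dropping whole messages is the simplest policy. Summarising older turns is a common alternative when earlier context still matters.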
Quick Quiz
Q1. Why are output tokens more expensive than input tokens?
A1. Output generation requires a full forward pass per token (autoregressive), while input tokens are processed in a single parallel pass.
Q2. What discount does the OpenAI Batch API offer, and what is the trade-off?
A2. 50% discount. Trade-off: up to 24-hour turnaround — not suitable for real-time applications.
Q3. What minimum prefix length is required for OpenAI prompt caching to activate?
A3. 1,024 tokens (as per OpenAI docs).
Student Exercise
Exercise 3.4 — Cost calculator
Build a Python function that processes a list of 100 text samples through gpt-4o-mini, tracks total token usage, and prints: total input tokens, total output tokens, total cost in USD, and average cost per sample.