3.4 Cost & Token Management

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Token counting · Cost estimation · Batching · Caching strategies

Official Docs: OpenAI Pricing · Anthropic Pricing · OpenAI Prompt Caching


Understanding Token Costs

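LLM APIs charge separately for input tokens (your prompt) and output tokens (the model’s response). Output tokens typically cost several times more than input tokens: 4× for the gpt-4o and gpt-4o-mini prices listed in this section, with the exact ratio varying by provider and model.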

Always verify current prices at the official provider pricing pages — prices change frequently.
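As a quick illustration of how the two rates combine, the sketch below computes the cost of a single request from its token counts. The function name `request_cost` is hypothetical, and the rates are the gpt-4o-mini figures quoted in this section, which may be out of date by the time you read this.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Return the USD cost of one request; rates are USD per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Assumed rates (USD per 1M tokens); verify against the provider's pricing page.
INPUT_RATE, OUTPUT_RATE = 0.15, 0.60

# Example request: 2,000 prompt tokens and 500 response tokens.
cost = request_cost(2_000, 500, INPUT_RATE, OUTPUT_RATE)
print(f"Cost: ${cost:.6f}")
```

With these assumed rates, the 500 output tokens account for half of the bill even though they are only a fifth of the total tokens, which is why limiting response length matters as much as trimming the prompt.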


Counting Tokens Before Sending

import tiktoken

def count_tokens(messages: list[dict], model: str = "gpt-4o-mini") -> int:
    """Estimate input tokens for an OpenAI messages array (approximate)."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Model name unknown to this tiktoken version: fall back to the GPT-4o encoding
        enc = tiktoken.get_encoding("o200k_base")
    tokens = 0
    for msg in messages:
        tokens += 3  # approximate per-message formatting overhead
        for value in msg.values():
            tokens += len(enc.encode(str(value)))
    tokens += 3  # the reply is primed with an assistant header
    return tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gradient descent in 3 sentences."},
]
# The authoritative count is usage.prompt_tokens in the API response
print(f"Estimated input tokens: {count_tokens(messages)}")

Cost Estimation Utility

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# USD per 1M tokens (verify at openai.com/api/pricing)
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini-batch": {"input": 0.075, "output": 0.30},
}

def call_and_cost(messages: list[dict], model: str = "gpt-4o-mini") -> tuple[str, float]:
    """Send one chat request and return (answer, cost in USD)."""
    if model not in PRICING:
        # Fail loudly rather than silently reporting a cost of $0
        raise ValueError(f"No pricing configured for model {model!r}")
    rates = PRICING[model]
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    usage = resp.usage  # token counts as billed by the API
    cost = (
        usage.prompt_tokens / 1_000_000 * rates["input"]
        + usage.completion_tokens / 1_000_000 * rates["output"]
    )
    return resp.choices[0].message.content, cost

answer, usd = call_and_cost(
    [{"role": "user", "content": "What is the capital of France?"}]
)
print(answer)
print(f"Cost: ${usd:.6f}")

Prompt Caching

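OpenAI automatically caches the prefix of long prompts (1,024 tokens or more). Repeated requests that share the same prefix (e.g., a large system prompt) pay a reduced rate for the cached tokens. Because the match is on the exact prefix, put stable content first and variable content, such as the user query, last.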

from pathlib import Path

# Long, stable system prompt: eligible for caching on repeat requests
SYSTEM_PROMPT = Path("large_knowledge_base.txt").read_text()  # e.g. 8000 tokens
user_query = "Summarise the key points."  # placeholder; varies per request

# First call: full input price
# Subsequent calls with the same system prompt: cached price (~50% discount)
# Reuses `client` from the cost estimation example
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # stable prefix first
        {"role": "user", "content": user_query},       # variable part last
    ],
)
print(resp.usage.prompt_tokens_details)  # shows cached_tokens

See OpenAI Prompt Caching docs for eligibility rules.
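To see what the discount means in practice, here is a minimal sketch that prices the input side of a request given how many prompt tokens were served from the cache. The helper name `input_cost_with_cache` is hypothetical, the 0.5 discount is taken from the ~50% figure above rather than from a current price list, and the token counts are made-up illustrative values.

```python
def input_cost_with_cache(prompt_tokens: int, cached_tokens: int,
                          input_rate: float, cached_discount: float = 0.5) -> float:
    """USD cost of the input side of one request.

    prompt_tokens is the total input count and includes cached_tokens.
    input_rate is USD per 1M tokens. cached_discount=0.5 is an assumption
    based on the ~50% figure quoted in this section.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    billed = uncached + cached_tokens * (1 - cached_discount)
    return billed * input_rate / 1_000_000


RATE = 0.15  # assumed gpt-4o-mini input rate, USD per 1M tokens

first = input_cost_with_cache(8_100, 0, RATE)        # cold cache
repeat = input_cost_with_cache(8_100, 7_936, RATE)   # most of the prefix cached
print(f"first call:  ${first:.6f}")
print(f"repeat call: ${repeat:.6f}")
```

In a real application the two token counts would come from `resp.usage.prompt_tokens` and `resp.usage.prompt_tokens_details.cached_tokens`. The saving never quite reaches 50% of the input bill, because the variable tail of the prompt is always charged at the full rate.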


Batch API (50% Discount)

For non-real-time tasks (dataset processing, bulk analysis), the OpenAI Batch API offers a 50% cost reduction. In exchange, requests run asynchronously and results arrive within a 24-hour completion window rather than immediately.

import json

# Build one request per input text; custom_id lets you match results to inputs
batch_requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": text}],
            "max_tokens": 128,
        },
    }
    for i, text in enumerate(["Summarise: ...", "Classify: ..."])
]

# Write the requests as JSONL (one JSON object per line)
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit (reuses `client` from the cost estimation example)
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # poll later with client.batches.retrieve(batch.id)
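Once the batch has finished, its results come back as another JSONL file, referenced by the batch's output file ID. The sketch below shows one way to map those lines back to the original requests by `custom_id`. The helper `parse_batch_output` is hypothetical, the record shape (`custom_id`, `response.status_code`, `response.body`, `error`) is my understanding of the output format and should be checked against the Batch API docs, and the sample lines are hand-written rather than real API output.

```python
import json


def parse_batch_output(jsonl_text: str) -> tuple[dict, dict]:
    """Split batch output lines into (answers, failures), keyed by custom_id."""
    answers, failures = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failures[custom_id] = record.get("error") or response
            continue
        answers[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return answers, failures


def ok_line(custom_id: str, content: str) -> str:
    """Build a hand-written success record in the assumed output shape."""
    body = {"choices": [{"message": {"role": "assistant", "content": content}}]}
    return json.dumps({"custom_id": custom_id, "error": None,
                       "response": {"status_code": 200, "body": body}})


# Results are deliberately out of order: match on custom_id, not position.
sample = "\n".join([
    ok_line("req-1", "Label: spam"),
    ok_line("req-0", "A short summary."),
    json.dumps({"custom_id": "req-2", "response": None,
                "error": {"code": "example_error", "message": "illustrative failure"}}),
])

answers, failures = parse_batch_output(sample)
print(answers)
print(sorted(failures))
```

Keying on `custom_id` rather than line position is the safer choice, since the output order is not guaranteed to match the input order.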

Cost Reduction Strategies

| Strategy | Typical saving | Notes |
|---|---|---|
| Use mini models | ~17× cheaper at the prices listed above | gpt-4o-mini vs gpt-4o |
| Prompt caching | ~50% on cached input tokens | Stable system prompt ≥ 1024 tokens |
| Batch API | 50% | Non-real-time tasks only |
| Trim retrieved context | Variable | Keep RAG context ≤ 4k tokens |
| Reduce max_tokens | Caps output cost directly | Set tight limits for short-answer tasks |

Common Mistakes

  1. Not monitoring token usage: resending an uncapped, growing conversation history makes every request larger than the last, so cumulative input cost grows roughly quadratically with the number of turns.
  2. Using gpt-4o for simple tasks: classification and short summaries don’t need the flagship model. Use gpt-4o-mini.
  3. Ignoring the Batch API: if you’re processing thousands of documents offline, the Batch API halves your cost in exchange for a small workflow change (write requests to a JSONL file, submit the batch, collect the results later).
  4. Putting sensitive data in cached prompts: caching is meant for stable, non-sensitive prefixes. Don’t inject PII into a system prompt you expect to be cached.
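The first mistake is usually addressed by capping the history before each request. The following is a minimal sketch of one such cap, not a definitive implementation: `trim_history` and `rough_token_count` are hypothetical names, and the default counter uses the common rule of thumb of about 4 characters per token, so swap in a real tokenizer-based counter for accurate budgets.

```python
from typing import Callable


def rough_token_count(message: dict) -> int:
    """Crude stand-in: ~4 characters per token plus a small per-message overhead."""
    return len(str(message.get("content", ""))) // 4 + 4


def trim_history(messages: list[dict], max_tokens: int,
                 count: Callable[[dict], int] = rough_token_count) -> list[dict]:
    """Keep all system messages plus the most recent turns that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m) for m in system)
    kept = []
    for msg in reversed(others):  # walk from newest to oldest
        cost = count(msg)
        if cost > budget:
            break  # stop at the first turn that no longer fits
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))  # restore chronological order


history = [{"role": "system", "content": "You are a helpful assistant."}] + [
    {"role": "user" if i % 2 == 0 else "assistant",
     "content": f"turn {i}: " + "x" * 32}
    for i in range(6)
]

trimmed = trim_history(history, max_tokens=40)
print([m["content"][:6] for m in trimmed[1:]])
```

Dropping whole turns from the oldest end keeps the logic simple. Summarising older turns instead of discarding them is a common alternative when earlier context still matters.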

Quick Quiz

Test Your Understanding

Q1. Why are output tokens more expensive than input tokens?
A1. Output generation requires a full forward pass per token (autoregressive), while input tokens are processed in a single parallel pass.

Q2. What discount does the OpenAI Batch API offer, and what is the trade-off?
A2. 50% discount. Trade-off: up to 24-hour turnaround — not suitable for real-time applications.

Q3. What minimum prefix length is required for OpenAI prompt caching to activate?
A3. 1,024 tokens (as per OpenAI docs).


Student Exercise

Exercise 3.4 — Cost calculator
Build a Python function that processes a list of 100 text samples through gpt-4o-mini, tracks total token usage, and prints: total input tokens, total output tokens, total cost in USD, and average cost per sample.

