10.5 Cost Optimization
AI-generated content may contain errors. Always verify against official sources.
Key Concepts: Caching · Prompt compression · Model routing (small vs large)
Official Docs: OpenAI Pricing · OpenAI Prompt Caching
Cost Drivers
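LLM API costs scale with the number of input tokens plus the number of output tokens, each billed at its own per-token rate (output tokens are usually the more expensive of the two). Any strategy that reduces either count, or moves traffic to a cheaper rate, lowers the bill directly.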
| Strategy | Typical savings | Effort |
|---|---|---|
| Prompt caching | 50% on cached prefix | Low |
| Model routing | 60–80% | Medium |
| Response caching | 100% on repeated queries | Medium |
| Prompt compression | 20–40% | Medium |
| Output length control | 10–40% | Low |
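The arithmetic behind the table can be sketched in a few lines. This is a minimal estimator under stated assumptions, not an official pricing tool: `Price`, `PRICES`, and `estimate_cost` are hypothetical names, the input prices are the figures quoted in the routing example of this section, and the output prices are assumed placeholders that should be replaced with current values from the pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Prices in USD per 1M tokens."""
    input_per_1m: float
    output_per_1m: float


# Illustrative figures only: verify against the provider's pricing page.
PRICES = {
    "gpt-4o-mini": Price(input_per_1m=0.15, output_per_1m=0.60),  # output price assumed
    "gpt-4o": Price(input_per_1m=5.00, output_per_1m=15.00),      # output price assumed
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    price = PRICES[model]
    return (
        input_tokens * price.input_per_1m + output_tokens * price.output_per_1m
    ) / 1_000_000


if __name__ == "__main__":
    # Same request on both models: 2,000 input tokens, 500 output tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 2_000, 500):.6f}")
```

Because input and output are priced separately, trimming a long answer can matter as much as trimming a long prompt.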
Strategy 1 — OpenAI Prompt Caching
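OpenAI automatically caches prompt prefixes that are identical across requests; no code change is needed to opt in. Caching only applies to prompts of 1,024 tokens or more, so a short system prompt will never produce a cache hit. Cached input tokens are billed at a discount (50% less for the models used here; check the pricing page for current rates).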
from openai import OpenAI

client = OpenAI()

# Put stable content at the START (system prompt, long context)
# Put dynamic content at the END (user question)
SYSTEM_PROMPT = """You are an expert Python tutor.
[... 2000 tokens of stable curriculum content ...]"""


def ask_tutor(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # CACHED on repeat
            {"role": "user", "content": question},  # Dynamic
        ],
    )
    # Check how many prompt tokens were served from the cache.
    # prompt_tokens_details may be missing or None, so read it defensively.
    details = getattr(resp.usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    print(f"Cached tokens: {cached} (50% cheaper)")
    return resp.choices[0].message.content
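The discount applies only to the cached part of the prompt, so the saving on a whole request is smaller than the headline figure. The sketch below works through that arithmetic; `input_cost_with_cache` and `cache_savings_fraction` are hypothetical helpers, and the default `cached_discount=0.5` simply mirrors the 50% figure used in this section rather than any guaranteed rate.

```python
def input_cost_with_cache(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_1m: float,
    cached_discount: float = 0.5,
) -> float:
    """Estimated USD input cost when part of the prompt is a cache hit.

    cached_discount is the fraction taken off cached tokens (0.5 = half price).
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    # Cached tokens count as a fraction of a full-price token.
    billable_tokens = uncached_tokens + cached_tokens * (1 - cached_discount)
    return billable_tokens * price_per_1m / 1_000_000


def cache_savings_fraction(
    prompt_tokens: int, cached_tokens: int, cached_discount: float = 0.5
) -> float:
    """Fraction of the input cost saved on this request (0.0 to 1.0)."""
    if prompt_tokens == 0:
        return 0.0
    return cached_tokens * cached_discount / prompt_tokens


if __name__ == "__main__":
    # 2,000-token prompt, of which 1,500 tokens were a cache hit.
    print(cache_savings_fraction(2_000, 1_500))
```

In this example three quarters of the prompt is cached, so the input cost falls by 37.5%, not 50%. The larger the stable prefix relative to the dynamic tail, the closer the saving gets to the full discount.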
Strategy 2 — Response Caching
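For queries that repeat verbatim, cache the response and skip the API call entirely. The cache below matches on exact content only: a query that is merely similar hashes to a different key and is treated as a miss.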
import hashlib
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()


class LLMCache:
    def __init__(self, cache_file: str = "llm_cache.json"):
        self.cache_file = Path(cache_file)
        # Load the existing cache from disk, or start with an empty one.
        self.cache: dict = (
            json.loads(self.cache_file.read_text()) if self.cache_file.exists() else {}
        )

    def _key(self, messages: list, model: str) -> str:
        # Hash the full request so identical requests map to the same key.
        content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, messages: list, model: str) -> str | None:
        return self.cache.get(self._key(messages, model))

    def set(self, messages: list, model: str, response: str):
        self.cache[self._key(messages, model)] = response
        # Rewrites the whole file on every insert: fine for a demo, slow at scale.
        self.cache_file.write_text(json.dumps(self.cache))


cache = LLMCache()


def cached_completion(messages: list, model: str = "gpt-4o-mini") -> str:
    cached = cache.get(messages, model)
    # Compare against None so that an empty-string response still counts as a hit.
    if cached is not None:
        print("[CACHE HIT] Returning cached response")
        return cached
    resp = client.chat.completions.create(model=model, messages=messages)
    result = resp.choices[0].message.content
    cache.set(messages, model, result)
    return result
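A cache that never expires will keep serving answers after they go stale. One common remedy is a time-to-live (TTL) on each entry. The following is a minimal in-memory sketch, not part of the original example: `TTLCache` and its `clock` parameter are hypothetical, and it reuses the same SHA-256 key scheme as `LLMCache`.

```python
import hashlib
import json
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory response cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict = {}  # key -> (stored_at, response)

    @staticmethod
    def _key(messages: list, model: str) -> str:
        payload = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, messages: list, model: str) -> Optional[str]:
        key = self._key(messages, model)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl_seconds:
            # Entry is too old: drop it and report a miss.
            del self._store[key]
            return None
        return response

    def set(self, messages: list, model: str, response: str) -> None:
        self._store[self._key(messages, model)] = (self.clock(), response)
```

A short TTL suits content that changes often; a long one suits stable reference answers. The right value depends on the application and is not something this sketch can decide.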
Strategy 3 — Model Routing
Route simple queries to cheap small models; only use expensive models for complex tasks:
from openai import OpenAI

client = OpenAI()


def classify_complexity(query: str) -> str:
    """Classify query complexity using the cheapest model."""
    prompt = (
        "Is this query simple (factual lookup, yes/no, single-step)\n"
        "or complex (multi-step reasoning, creative writing, code)?\n"
        f"Query: {query}\n"
        "Answer with just: simple or complex"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()


def smart_completion(query: str) -> str:
    complexity = classify_complexity(query)
    # Anything other than an exact "simple" falls through to the larger model.
    model = "gpt-4o-mini" if complexity == "simple" else "gpt-4o"
    print(f"Routing to: {model} (complexity: {complexity})")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content


# Simple query → gpt-4o-mini ($0.15 per 1M input tokens)
print(smart_completion("What is the capital of France?"))
# Complex query → gpt-4o ($5 per 1M input tokens)
print(smart_completion("Design a scalable microservices architecture for a banking app."))
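The router compares the classifier's raw text against `"simple"`, so a reply such as `Simple.` would be routed to the large model. A small normalisation step makes the decision more predictable. The helper below is a hypothetical sketch (`parse_complexity` is not part of the original example); it deliberately falls back to `"complex"` whenever the label is unclear, trading some cost for quality.

```python
import re
from typing import Optional


def parse_complexity(raw: Optional[str]) -> str:
    """Map raw classifier output to 'simple' or 'complex'.

    Returns 'simple' only when the output contains that word and nothing
    else (ignoring case and punctuation). Everything else, including empty
    or ambiguous output, is treated as 'complex'.
    """
    if not raw:
        return "complex"
    words = set(re.findall(r"[a-z]+", raw.lower()))
    return "simple" if words == {"simple"} else "complex"


if __name__ == "__main__":
    for sample in ["simple", "Simple.", "simple or complex", "", "Complex"]:
        print(repr(sample), "->", parse_complexity(sample))
```

Wrapping the classifier's return value in a function like this keeps a malformed label from silently changing the routing decision.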
Common Mistakes
- Using GPT-4o for every task: at the input prices quoted above ($0.15 vs $5 per 1M tokens), GPT-4o-mini is about 33× cheaper and handles most routine tasks. Always benchmark quality on small models first.
- Long system prompts not at the start: prompt caching only works if the stable, cacheable prefix sits at the beginning of the prompt. Any dynamic content placed before it changes the prefix and breaks caching.
- No output length control: always set max_tokens appropriate to the task. Open-ended generation without limits wastes tokens.
- Caching unsafe content: don't cache responses to queries that contain personal data or time-sensitive information.
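For the output-length point, one way to avoid forgetting the limit is to derive it from the task type in a single place. This is a minimal sketch with made-up budgets: `MAX_TOKENS_BY_TASK`, `DEFAULT_MAX_TOKENS`, and `build_request` are hypothetical names, and the numbers are placeholders to tune against real outputs.

```python
# Hypothetical per-task output budgets; tune these against real responses.
MAX_TOKENS_BY_TASK = {
    "classification": 5,
    "short_answer": 150,
    "summary": 300,
}
DEFAULT_MAX_TOKENS = 500  # fallback for task types not listed above


def build_request(task: str, messages: list, model: str = "gpt-4o-mini") -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": MAX_TOKENS_BY_TASK.get(task, DEFAULT_MAX_TOKENS),
    }


if __name__ == "__main__":
    request = build_request("summary", [{"role": "user", "content": "Summarise this."}])
    print(request["max_tokens"])
```

The result can be passed on as `client.chat.completions.create(**request)`. If a budget is too tight the response is cut off and its `finish_reason` is `"length"`, which is a useful signal that a limit needs raising.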
Quick Quiz
Q1. How much do cached input tokens cost compared to regular input tokens in OpenAI’s API?
A1. 50% less (half price).
Q2. What is model routing and what does it optimise for?
A2. Automatically selecting the cheapest model that can handle a given query’s complexity, optimising for cost without sacrificing quality.
Q3. What must be true for prompt caching to be effective?
A3. The cached prefix (system prompt, context) must be identical across requests and placed at the beginning of the prompt. Dynamic content must come at the end.
Student Exercise
Exercise 11.5 — Cost comparison
Build the model router. Test 20 diverse queries. Track which model each query was routed to and the total estimated cost. Compare with the cost if you had used GPT-4o for all 20 queries. Calculate the savings percentage.
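As a starting point for the final calculation, the savings percentage is (baseline cost − routed cost) / baseline cost × 100. The sketch below applies that formula to a routing log. It is a simplification under stated assumptions: `INPUT_PRICE_PER_1M` and `savings_percentage` are hypothetical names, the prices are the input figures quoted in this section, and output tokens and the classifier's own calls are ignored.

```python
# Assumed input prices in USD per 1M tokens (figures quoted in this section).
INPUT_PRICE_PER_1M = {"gpt-4o-mini": 0.15, "gpt-4o": 5.00}


def savings_percentage(routing_log: list, baseline_model: str = "gpt-4o") -> float:
    """Percentage saved by routing, relative to sending everything to baseline_model.

    routing_log holds one (model, input_tokens) pair per query.
    """
    routed = sum(tokens * INPUT_PRICE_PER_1M[model] for model, tokens in routing_log)
    baseline = sum(tokens * INPUT_PRICE_PER_1M[baseline_model] for _, tokens in routing_log)
    if baseline == 0:
        return 0.0
    return (baseline - routed) / baseline * 100


if __name__ == "__main__":
    # Three queries routed to the small model, one to the large model.
    log = [("gpt-4o-mini", 1_000)] * 3 + [("gpt-4o", 1_000)]
    print(f"{savings_percentage(log):.2f}% saved")
```

A fuller answer to the exercise would also record output tokens from `resp.usage` and add the cost of each classification call to the routed total.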