10.5 Cost Optimization

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Caching · Prompt compression · Model routing (small vs large)

Official Docs: OpenAI Pricing · OpenAI Prompt Caching


Cost Drivers

LLM API costs are driven by two quantities: input tokens and output tokens. Any strategy that reduces either one lowers cost directly.

| Strategy | Typical savings | Effort |
| --- | --- | --- |
| Prompt caching | 50% on cached prefix | Low |
| Model routing | 60–80% | Medium |
| Response caching | 100% on repeated queries | Medium |
| Prompt compression | 20–40% | Medium |
| Output length control | 10–40% | Low |
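
To make the cost arithmetic concrete, here is a minimal sketch of estimating the cost of one request from its token counts. The `Price` class, the `request_cost` helper, and the example prices are illustrative assumptions, not values taken from an official price list; substitute current figures from the pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens. Example values only (assumption), not current list prices."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Return the estimated cost in USD of a single request."""
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


# Hypothetical prices for a small model
example_price = Price(input_per_million=0.15, output_per_million=0.60)
cost = request_cost(input_tokens=2_000, output_tokens=500, price=example_price)
print(f"Estimated cost: ${cost:.6f}")
```

Because input and output tokens are priced separately, shortening a long prompt and capping a long answer are independent levers.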

Strategy 1 — OpenAI Prompt Caching

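OpenAI automatically caches the prefix of a prompt when the same prefix repeats across requests. Caching applies only to prompts of 1,024 tokens or more. Cached input tokens cost 50% less on the models used here; discounts can differ by model, so check the pricing page.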

from openai import OpenAI

client = OpenAI()

# Put stable content at the START (system prompt, long context)
# Put dynamic content at the END (user question)
SYSTEM_PROMPT = """You are an expert Python tutor.
[... 2000 tokens of stable curriculum content ...]"""


def ask_tutor(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # CACHED on repeat
            {"role": "user", "content": question},  # Dynamic
        ],
    )
    # Check how many prompt tokens were served from the cache.
    # prompt_tokens_details may be absent or None, so guard both cases.
    details = getattr(resp.usage, "prompt_tokens_details", None)
    if details is not None:
        print(f"Cached tokens: {details.cached_tokens} (50% cheaper)")
    return resp.choices[0].message.content
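
The effect of the discount on a bill can be sketched as follows. This is a minimal illustration that assumes cached tokens are counted as part of the prompt tokens and uses the 50% discount quoted above; the `input_cost` helper, the token counts, and the price are hypothetical.

```python
def input_cost(prompt_tokens: int, cached_tokens: int,
               price_per_million: float, cached_discount: float = 0.5) -> float:
    """Estimated input cost in USD when part of the prompt is a cache hit.

    cached_discount=0.5 mirrors the 50% figure above (assumption: it may
    differ by model).
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    # Cached tokens are billed at a fraction of the normal rate
    billed_tokens = uncached + cached_tokens * (1 - cached_discount)
    return billed_tokens * price_per_million / 1_000_000


# Hypothetical request: 2,100 prompt tokens, 2,048 of them served from cache
with_cache = input_cost(2_100, 2_048, price_per_million=0.15)
without_cache = input_cost(2_100, 0, price_per_million=0.15)
saving = 1 - with_cache / without_cache
print(f"Input cost saving: {saving:.1%}")
```

The saving approaches the full discount only when almost all of the prompt is a stable, cached prefix, which is why the dynamic part belongs at the end.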

Strategy 2 — Response Caching

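For queries that repeat exactly, cache the response and skip the API call entirely. The cache below keys on a hash of the messages and the model, so it only hits on exact matches; a query that is merely similar still goes to the API.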

import hashlib
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()


class LLMCache:
    """Exact-match response cache persisted to a JSON file."""

    def __init__(self, cache_file: str = "llm_cache.json"):
        self.cache_file = Path(cache_file)
        self.cache: dict = (
            json.loads(self.cache_file.read_text()) if self.cache_file.exists() else {}
        )

    def _key(self, messages: list, model: str) -> str:
        # Hash the full request so any change to messages or model is a miss
        content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, messages: list, model: str) -> str | None:
        return self.cache.get(self._key(messages, model))

    def set(self, messages: list, model: str, response: str):
        self.cache[self._key(messages, model)] = response
        # Rewrites the whole file on every update; fine for small caches
        self.cache_file.write_text(json.dumps(self.cache))


cache = LLMCache()


def cached_completion(messages: list, model: str = "gpt-4o-mini") -> str:
    cached = cache.get(messages, model)
    if cached is not None:  # an empty string is still a valid cached response
        print("[CACHE HIT] Returning cached response")
        return cached

    resp = client.chat.completions.create(model=model, messages=messages)
    result = resp.choices[0].message.content
    cache.set(messages, model, result)
    return result
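
One limitation of the file-backed cache above is that entries never expire. A minimal sketch of an in-memory variant with a time-to-live (TTL) is shown below. The `TTLCache` class and its injectable `clock` parameter are hypothetical names introduced for illustration, and the right TTL depends on how quickly answers go stale in your application.

```python
import hashlib
import json
import time
from typing import Callable, Optional


class TTLCache:
    """Exact-match response cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}    # key -> (stored_at, response)

    @staticmethod
    def _key(messages: list, model: str) -> str:
        content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, messages: list, model: str) -> Optional[str]:
        key = self._key(messages, model)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # drop the stale entry
            return None
        return response

    def set(self, messages: list, model: str, response: str) -> None:
        self._store[self._key(messages, model)] = (self.clock(), response)


# Usage with a fake clock to show expiry
fake_now = [0.0]
ttl_cache = TTLCache(ttl_seconds=10, clock=lambda: fake_now[0])
msgs = [{"role": "user", "content": "What is a list comprehension?"}]
ttl_cache.set(msgs, "gpt-4o-mini", "A compact way to build lists.")
fresh = ttl_cache.get(msgs, "gpt-4o-mini")    # still within the TTL
fake_now[0] = 11.0
expired = ttl_cache.get(msgs, "gpt-4o-mini")  # past the TTL, treated as a miss
```

A short TTL trades some cache hits for fresher answers; for content that never changes, the non-expiring cache is simpler.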

Strategy 3 — Model Routing

Route simple queries to cheap small models; only use expensive models for complex tasks:

from openai import OpenAI

client = OpenAI()


def classify_complexity(query: str) -> str:
    """Classify query complexity using the cheapest model."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Is this query simple (factual lookup, yes/no, single-step)
or complex (multi-step reasoning, creative writing, code)?
Query: {query}
Answer with just: simple or complex"""
        }],
        max_tokens=5,
        temperature=0,
    )
    answer = (resp.choices[0].message.content or "").strip().lower()
    # Accept variants such as "Simple." and fall back to "complex" for any
    # unexpected reply, so a misclassification never downgrades quality.
    return "simple" if answer.startswith("simple") else "complex"


def smart_completion(query: str) -> str:
    complexity = classify_complexity(query)
    model = "gpt-4o-mini" if complexity == "simple" else "gpt-4o"
    print(f"Routing to: {model} (complexity: {complexity})")

    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content


# Simple query → gpt-4o-mini ($0.15/1M input tokens)
print(smart_completion("What is the capital of France?"))

# Complex query → gpt-4o ($5/1M input tokens; prices change, check the pricing page)
print(smart_completion("Design a scalable microservices architecture for a banking app."))
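
How much routing saves depends on the share of traffic the small model can absorb. The following back-of-the-envelope sketch assumes every query uses the same number of tokens and compares a blended per-token price against sending everything to the large model. The `routing_savings` helper and its `classifier_overhead` parameter are hypothetical; the prices are the ones quoted in the comments above.

```python
def routing_savings(small_fraction: float, small_price: float,
                    large_price: float, classifier_overhead: float = 0.0) -> float:
    """Fraction of cost saved versus sending every query to the large model.

    Prices are in the same unit (e.g. USD per 1M tokens). classifier_overhead
    is the extra per-query price of the classification call in that unit
    (assumption: zero unless measured).
    """
    if not 0 <= small_fraction <= 1:
        raise ValueError("small_fraction must be between 0 and 1")
    blended = (small_fraction * small_price
               + (1 - small_fraction) * large_price
               + classifier_overhead)
    return 1 - blended / large_price


# If 80% of queries are routed to the small model
savings = routing_savings(small_fraction=0.8, small_price=0.15, large_price=5.0)
print(f"Estimated savings: {savings:.1%}")
```

With these inputs the estimate lands near the top of the 60–80% range in the table; a lower share of simple queries or a noticeable classification overhead pulls it down.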

Common Mistakes

  1. Using GPT-4o for every task: at the input prices quoted above, GPT-4o-mini is roughly 33× cheaper and handles many routine tasks well. Always benchmark quality on the small model first.
  2. Putting dynamic content before a long system prompt: prompt caching matches only from the beginning of the prompt, so the stable prefix must come first. Any dynamic content placed ahead of it changes the prefix and prevents a cache hit.
  3. No output length control: set max_tokens to a value appropriate to the task. Open-ended generation without a limit wastes output tokens. Because max_tokens truncates the answer rather than shortening it, also ask for a concise answer in the prompt.
  4. Caching unsafe content: don’t cache responses to queries that contain personal data or time-sensitive information.
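
For the third mistake, one possible approach is to keep a per-task output budget in one place and build request arguments from it. This is a sketch only: the task names, the budgets, and the `build_request` helper are hypothetical and should be tuned against real outputs.

```python
# Hypothetical output budgets per task type (assumption: tune for your workload)
MAX_TOKENS_BY_TASK = {
    "classification": 5,
    "short_answer": 150,
    "summary": 400,
}
DEFAULT_MAX_TOKENS = 300


def build_request(task: str, messages: list, model: str = "gpt-4o-mini") -> dict:
    """Return keyword arguments for a chat completion with a task-specific cap."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": MAX_TOKENS_BY_TASK.get(task, DEFAULT_MAX_TOKENS),
    }


request = build_request(
    "short_answer",
    [{"role": "user", "content": "In one sentence, what is a Python decorator?"}],
)
print(request["max_tokens"])
```

The resulting dictionary can be passed as `client.chat.completions.create(**request)`. If a response comes back with `finish_reason` equal to `"length"`, the cap cut the answer off and the budget for that task is probably too low.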

Quick Quiz

Test Your Understanding

Q1. How much do cached input tokens cost compared to regular input tokens in OpenAI’s API?
A1. 50% less (half price).

Q2. What is model routing and what does it optimise for?
A2. Automatically selecting the cheapest model that can handle a given query’s complexity, optimising for cost without sacrificing quality.

Q3. What must be true for prompt caching to be effective?
A3. The cached prefix (system prompt, context) must be identical across requests and placed at the beginning of the prompt. Dynamic content must come at the end.


Student Exercise

Exercise 11.5 — Cost comparison
Build the model router. Test 20 diverse queries. Track which model each query was routed to and the total estimated cost. Compare with the cost if you had used GPT-4o for all 20 queries. Calculate the savings percentage.

