10.5 Cost Optimization

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Caching · Prompt compression · Model routing (small vs large)

Official Docs: OpenAI Pricing · OpenAI Prompt Caching

Cost Drivers

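LLM API costs scale with the number of input tokens plus output tokens, each billed at its own per-token rate. Any strategy that reduces either count, or lowers the rate applied to it, saves money directly.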

Strategy	Typical savings	Effort
Prompt caching	50% on cached prefix	Low
Model routing	60–80%	Medium
Response caching	100% on repeated queries	Medium
Prompt compression	20–40%	Medium
Output length control	10–40%	Low
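
To make the table concrete, the arithmetic behind a per-request estimate might be sketched as follows. The function name `estimate_cost` and the prices in the example call are illustrative assumptions, not official figures; substitute the values from the pricing page.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Return the estimated cost in dollars for one request.

    Prices are expressed in dollars per 1M tokens.
    """
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical prices (dollars per 1M tokens), for illustration only
cost = estimate_cost(2_000, 500, input_price_per_m=0.15, output_price_per_m=0.60)
print(f"Estimated cost: ${cost:.6f}")
```

Because input and output tokens are priced separately, the same function shows which side of the request dominates the bill for a given workload.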

Strategy 1 — OpenAI Prompt Caching

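OpenAI automatically caches prompt prefixes that are identical across requests; no code changes are needed. Caching applies only to prompts of 1,024 tokens or more, and cached input tokens are billed at a discount (50% for the models used here; the discount varies by model, so check the pricing page).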

from openai import OpenAI
client = OpenAI()

# Put stable content at the START (system prompt, long context)
# Put dynamic content at the END (user question)
SYSTEM_PROMPT = """You are an expert Python tutor.
[... 2000 tokens of stable curriculum content ...]"""

def ask_tutor(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # CACHED on repeat
            {"role": "user",   "content": question},       # Dynamic
        ],
    )
    usage = resp.usage
    # Check how many prompt tokens were served from the cache.
    # prompt_tokens_details can be missing or None, so guard before reading it.
    details = getattr(usage, "prompt_tokens_details", None)
    if details is not None:
        cached = details.cached_tokens
        print(f"Cached tokens: {cached} (50% cheaper)")
    return resp.choices[0].message.content
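
The effect of the cached-token discount on the input bill can be worked out with a small helper. This is a sketch under assumptions: `effective_input_cost` is a hypothetical name, the 50% discount is taken from the text above, and the price in the example is illustrative.

```python
def effective_input_cost(
    prompt_tokens: int,
    cached_tokens: int,
    price_per_m: float,
    cached_discount: float = 0.5,
) -> float:
    """Return the input cost in dollars when part of the prompt is cached.

    cached_discount is the fraction taken off cached tokens (0.5 = 50% off).
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    # Cached tokens are billed at a reduced rate; uncached tokens at full rate
    billed_tokens = uncached + cached_tokens * (1 - cached_discount)
    return billed_tokens * price_per_m / 1_000_000


# 2,100-token prompt of which 2,048 tokens hit the cache (illustrative price)
with_cache = effective_input_cost(2_100, 2_048, price_per_m=0.15)
without_cache = effective_input_cost(2_100, 0, price_per_m=0.15)
print(f"Input cost saving: {1 - with_cache / without_cache:.1%}")
```

The saving approaches the full discount only when the stable prefix makes up most of the prompt, which is why long dynamic suffixes dilute the benefit.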

Strategy 2 — Response Caching

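For queries that repeat exactly, cache the full response so the API is not called at all. The cache below matches on the exact messages and model, so a similar but differently worded query is still a miss: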

import hashlib
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

class LLMCache:
    def __init__(self, cache_file: str = "llm_cache.json"):
        self.cache_file = Path(cache_file)
        self.cache: dict = json.loads(self.cache_file.read_text()) if self.cache_file.exists() else {}
    
    def _key(self, messages: list, model: str) -> str:
        content = json.dumps({"messages": messages, "model": model}, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()
    
    def get(self, messages: list, model: str) -> str | None:
        return self.cache.get(self._key(messages, model))
    
    def set(self, messages: list, model: str, response: str):
        self.cache[self._key(messages, model)] = response
        self.cache_file.write_text(json.dumps(self.cache))

cache = LLMCache()

def cached_completion(messages: list, model: str = "gpt-4o-mini") -> str:
    cached = cache.get(messages, model)
    # Compare against None so an empty-string response still counts as a hit
    if cached is not None:
        print("[CACHE HIT] Returning cached response")
        return cached
    
    resp = client.chat.completions.create(model=model, messages=messages)
    result = resp.choices[0].message.content
    cache.set(messages, model, result)
    return result
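
An exact-match cache like the one above never expires entries on its own. One possible extension, sketched here under assumptions, is a time-to-live (TTL) so that stale answers are eventually dropped. `TTLCache` is a hypothetical helper, not part of any library; it keeps entries in memory only, and the `clock` parameter exists so the expiry logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # maps key -> (time stored, cached response)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self.clock(), value)


# Usage: keep answers for at most one hour
ttl_cache = TTLCache(ttl_seconds=3600)
ttl_cache.set("example-key", "example response")
print(ttl_cache.get("example-key"))
```

The key could be the same SHA-256 digest that `LLMCache._key` produces; the right TTL depends on how quickly the underlying answers go out of date.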

Strategy 3 — Model Routing

Route simple queries to cheap small models; only use expensive models for complex tasks:

from openai import OpenAI

client = OpenAI()

def classify_complexity(query: str) -> str:
    """Classify query complexity using the cheapest model."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Is this query simple (factual lookup, yes/no, single-step) 
            or complex (multi-step reasoning, creative writing, code)?
            Query: {query}
            Answer with just: simple or complex"""
        }],
        max_tokens=5,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

def smart_completion(query: str) -> str:
    complexity = classify_complexity(query)
    model = "gpt-4o-mini" if complexity == "simple" else "gpt-4o"
    print(f"Routing to: {model} (complexity: {complexity})")
    
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

# Simple query → gpt-4o-mini ($0.15 per 1M input tokens as quoted here; check current pricing)
print(smart_completion("What is the capital of France?"))

# Complex query → gpt-4o ($5 per 1M input tokens as quoted here; check current pricing)
print(smart_completion("Design a scalable microservices architecture for a banking app."))
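
To see what routing saves, the routed cost can be compared with the cost of sending everything to the large model. The sketch below is a simplification under stated assumptions: it counts input tokens only, uses the input prices quoted in this section (verify them before relying on the result), ignores the extra classification call, and `routing_savings` is a hypothetical helper name.

```python
# Dollars per 1M input tokens, as quoted in this section; verify before use
PRICE_PER_M = {"gpt-4o-mini": 0.15, "gpt-4o": 5.00}


def routing_savings(routed: list, baseline_model: str = "gpt-4o") -> float:
    """Return the fraction saved by routing versus always using baseline_model.

    routed is a list of (model, input_tokens) pairs, one per query.
    """
    routed_cost = sum(PRICE_PER_M[model] * tokens for model, tokens in routed)
    baseline_cost = sum(PRICE_PER_M[baseline_model] * tokens for _, tokens in routed)
    if baseline_cost == 0:
        raise ValueError("no tokens to compare")
    return 1 - routed_cost / baseline_cost


# Three simple queries routed to the small model, one complex query to the large one
log = [("gpt-4o-mini", 1_000)] * 3 + [("gpt-4o", 1_000)]
print(f"Savings from routing: {routing_savings(log):.1%}")
```

In practice the classifier call adds a small cost to every query, so the measured savings will be somewhat lower than this estimate.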

Common Mistakes

Using GPT-4o for every task: at the input prices quoted in this section, GPT-4o-mini is about 33× cheaper and handles many routine tasks well. Always benchmark quality on the smaller model first.
Placing the long system prompt anywhere but the start: prompt caching matches only an identical prefix at the beginning of the prompt, so dynamic content placed before the stable content prevents a cache hit.
No output length control: set max_tokens to a value appropriate to the task, and ask for concise answers in the prompt, because max_tokens truncates output rather than shortening it. Open-ended generation without limits wastes tokens.
Caching unsafe content: don't cache responses to queries that contain personal data or time-sensitive information.

Quick Quiz

Test Your Understanding

Q1. How much do cached input tokens cost compared to regular input tokens in OpenAI’s API?
A1. 50% less (half price).

Q2. What is model routing and what does it optimise for?
A2. Automatically selecting the cheapest model that can handle a given query’s complexity, optimising for cost without sacrificing quality.

Q3. What must be true for prompt caching to be effective?
A3. The cached prefix (system prompt, context) must be identical across requests and placed at the beginning of the prompt. Dynamic content must come at the end.

Student Exercise

Exercise 10.5 — Cost comparison
Build the model router. Test 20 diverse queries. Track which model each query was routed to and the total estimated cost. Compare with the cost if you had used GPT-4o for all 20 queries. Calculate the savings percentage.
