3.4 Cost & Token Management

Key Concepts: Token counting · Cost estimation · Batching · Caching strategies

Official Docs: OpenAI Pricing · Anthropic Pricing · OpenAI Prompt Caching

Understanding Token Costs

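LLM APIs charge separately for input tokens (your prompt) and output tokens (the model’s response). Output tokens usually cost several times more than input tokens: 4× for both gpt-4o and gpt-4o-mini at the prices used later in this section, and the ratio varies by provider and model.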

Always verify current prices at the official provider pricing pages — prices change frequently.
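To make the arithmetic concrete, here is a minimal sketch of the per-request cost formula: tokens times the per-million rate, summed over input and output. The helper name `estimate_cost` and the token counts are hypothetical, and the rates are illustrative values only.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Return the USD cost of one request; rates are USD per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Illustrative rates only (USD per 1M tokens); check the provider's pricing page.
cost = estimate_cost(input_tokens=2_000, output_tokens=500,
                     input_rate=0.15, output_rate=0.60)
print(f"${cost:.6f}")   # prints $0.000600
```

In this example the 500 output tokens cost as much as the 2,000 input tokens, which is why response length often matters more than prompt length.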

Counting Tokens Before Sending

import tiktoken

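def count_tokens(messages: list[dict], model: str = "gpt-4o-mini") -> int:
    """Estimate input tokens for an OpenAI chat messages array.

    The overhead constants follow the OpenAI Cookbook heuristic for recent
    chat models; treat the result as an approximation, not a billing figure.
    """
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to the encoding used by the gpt-4o family
        enc = tiktoken.get_encoding("o200k_base")
    tokens = 0
    for msg in messages:
        tokens += 3   # per-message framing overhead
        for key, value in msg.items():
            tokens += len(enc.encode(str(value)))
    tokens += 3   # every reply is primed with an assistant header
    return tokens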

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Explain gradient descent in 3 sentences."},
]
print(f"Estimated input tokens: {count_tokens(messages)}")

Cost Estimation Utility

from openai import OpenAI

client = OpenAI()

# Prices in USD per 1M tokens (verify at openai.com/api/pricing)
PRICING = {
    "gpt-4o-mini":       {"input": 0.15,  "output": 0.60},
    "gpt-4o":            {"input": 2.50,  "output": 10.00},
    "gpt-4o-mini-batch": {"input": 0.075, "output": 0.30},   # Batch API rate, not a model name
}

def call_and_cost(messages: list[dict], model: str = "gpt-4o-mini") -> tuple[str, float]:
    """Send one chat request and return (answer, cost in USD)."""
    if model not in PRICING:
        # Fail loudly rather than silently reporting a cost of 0
        raise ValueError(f"No pricing configured for model {model!r}")
    rates = PRICING[model]
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    usage = resp.usage
    cost = (
        usage.prompt_tokens     / 1_000_000 * rates["input"]
      + usage.completion_tokens / 1_000_000 * rates["output"]
    )
    return resp.choices[0].message.content, cost

answer, usd = call_and_cost(
    [{"role": "user", "content": "What is the capital of France?"}]
)
print(answer)
print(f"Cost: ${usd:.6f}")

Prompt Caching

OpenAI automatically caches the prefix of long prompts (1,024 tokens or more). Repeated requests that share the same prefix (e.g., a large system prompt) pay a reduced rate for the cached tokens, so put stable content first and the part that varies last.

from pathlib import Path

# Long, stable system prompt: eligible for caching on repeat requests
SYSTEM_PROMPT = Path("large_knowledge_base.txt").read_text()   # e.g. 8000 tokens
user_query = "Summarise the key points of the knowledge base."   # placeholder question

# First call: full input price
# Subsequent calls with the same system prompt: cached price
# (~50% discount; verify the current rate for your model)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},   # stable prefix goes first
        {"role": "user",   "content": user_query},      # varying part goes last
    ],
)
print(resp.usage.prompt_tokens_details)   # shows cached_tokens

See OpenAI Prompt Caching docs for eligibility rules.
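To see what the cached rate means for a bill, the sketch below splits the input side of a request into cached and uncached tokens. The function name, the token counts, and the default 50% discount are assumptions for illustration; in a real response the cached count would come from `resp.usage.prompt_tokens_details.cached_tokens`.

```python
def input_cost_with_cache(prompt_tokens: int, cached_tokens: int,
                          input_rate: float, cached_discount: float = 0.5) -> float:
    """USD cost of the input side of one request.

    input_rate is USD per 1M tokens; cached_discount is the assumed fraction
    taken off the input rate for cached tokens (0.5 means 50% off).
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    cached_rate = input_rate * (1 - cached_discount)
    return (uncached * input_rate + cached_tokens * cached_rate) / 1_000_000

# Hypothetical numbers: an 8,200-token prompt, nothing cached on the first call,
# 7,936 tokens served from cache on a repeat call.
first = input_cost_with_cache(8_200, 0, input_rate=0.15)
repeat = input_cost_with_cache(8_200, 7_936, input_rate=0.15)
print(f"first: ${first:.7f}  repeat: ${repeat:.7f}")
```

Only the input side is discounted; output tokens are billed at the normal rate whether or not the prompt was cached.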

Batch API (50% Discount)

For non-real-time tasks (dataset processing, bulk analysis), the OpenAI Batch API offers a 50% cost reduction in exchange for asynchronous processing: results are returned within a 24-hour completion window rather than immediately.

import json

# Create a JSONL file of requests
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": text}],
            "max_tokens": 128,
        },
    }
    for i, text in enumerate(["Summarise: ...", "Classify: ..."])
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit (the with-block closes the file handle after upload)
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)   # poll later with client.batches.retrieve(batch.id)
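Once a batch completes, its results arrive as another JSONL file, which can be fetched with something like `client.files.content(batch.output_file_id).text`. The sketch below parses such lines and totals token usage. The helper name `summarise_batch_output` is hypothetical, the sample records are hand-written, and the field layout (`custom_id`, `response.status_code`, `response.body`, `error`) is assumed from the Batch API documentation, so check it against a real output file.

```python
import json

def summarise_batch_output(lines):
    """Map custom_id -> answer text and total the token usage of successful requests."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "failed": 0}
    answers = {}
    for line in lines:
        if not line.strip():
            continue   # skip blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            totals["failed"] += 1
            continue
        body = response["body"]
        totals["prompt_tokens"] += body["usage"]["prompt_tokens"]
        totals["completion_tokens"] += body["usage"]["completion_tokens"]
        answers[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return answers, totals

# Hand-written sample records standing in for a downloaded output file
sample_lines = [
    json.dumps({
        "custom_id": "req-0",
        "response": {"status_code": 200, "body": {
            "choices": [{"message": {"content": "A short summary."}}],
            "usage": {"prompt_tokens": 12, "completion_tokens": 5},
        }},
        "error": None,
    }),
    json.dumps({"custom_id": "req-1", "response": None,
                "error": {"message": "example failure"}}),
]
answers, totals = summarise_batch_output(sample_lines)
print(answers)
print(totals)
```

Results are keyed by `custom_id` rather than by position, since output lines should not be assumed to arrive in the same order as the input.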

Cost Reduction Strategies

Strategy	Typical Saving	Notes
Use mini models	~17× cheaper at the prices listed above	`gpt-4o-mini` vs `gpt-4o`
Prompt caching	~50% on input	Stable system prompt ≥ 1024 tokens
Batch API	50%	Non-real-time tasks only
Trim retrieved context	Variable	Keep RAG context ≤ 4k tokens
Reduce `max_tokens`	Direct	Set tight limits for short-answer tasks
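The "trim retrieved context" row can be sketched as a greedy selection under a token budget. The function name `select_chunks` and the sample chunks are hypothetical; the chunks are assumed to be sorted most relevant first, and the whitespace word counter is only a stand-in for a real tokenizer such as the tiktoken-based counter shown earlier.

```python
from typing import Callable

def select_chunks(chunks: list[str], budget: int,
                  count_tokens: Callable[[str], int]) -> list[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break   # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected

def word_count(text: str) -> int:
    """Stand-in counter: whitespace-separated words, not real tokens."""
    return len(text.split())

ranked_chunks = ["a b c", "d e", "f g h i", "j"]   # assumed sorted by relevance
context = select_chunks(ranked_chunks, budget=5, count_tokens=word_count)
print(context)   # ['a b c', 'd e']
```

Stopping at the first chunk that does not fit keeps the ranking intact; skipping it and trying smaller, lower-ranked chunks is an alternative that fills the budget more tightly.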

Common Mistakes

Not monitoring token usage: resending an ever-growing conversation history without a cap makes every turn more expensive than the last, so cumulative input cost grows roughly quadratically with the number of turns.
Using gpt-4o for simple tasks: classification and short summaries rarely need the flagship model. Try gpt-4o-mini first.
Ignoring the Batch API: if you’re processing thousands of documents offline, the Batch API halves your cost. It does require a different submission flow (JSONL upload, then polling for results), but no changes to the prompts themselves.
Putting sensitive data in cached prefixes: reserve caching for stable, non-sensitive prefixes, and keep PII out of cached system prompts.
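One way to cap a growing conversation history, as a minimal sketch: keep the system messages, then keep the most recent turns that still fit a token budget. The function name `trim_history` and the sample conversation are hypothetical, and the whitespace word counter is a stand-in for a real tokenizer.

```python
from typing import Callable

def trim_history(messages: list[dict], budget: int,
                 count_tokens: Callable[[str], int]) -> list[dict]:
    """Keep system messages plus the newest other messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):           # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break                        # everything older is dropped
        kept.append(msg)
        used += cost
    return system + kept[::-1]           # restore chronological order

def word_count(text: str) -> int:
    """Stand-in counter: whitespace-separated words, not real tokens."""
    return len(text.split())

history = [
    {"role": "system",    "content": "Be brief."},
    {"role": "user",      "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user",      "content": "six seven eight nine"},
]
trimmed = trim_history(history, budget=8, count_tokens=word_count)
print([m["content"] for m in trimmed])
```

With a budget of 8 words, the oldest user turn is dropped while the system message and the two newest turns are kept. Summarising the dropped turns instead of discarding them is a common refinement.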

Quick Quiz

Test Your Understanding

Q1. Why are output tokens more expensive than input tokens?
A1. Output generation requires a full forward pass per token (autoregressive), while input tokens are processed in a single parallel pass.

Q2. What discount does the OpenAI Batch API offer, and what is the trade-off?
A2. 50% discount. Trade-off: up to 24-hour turnaround — not suitable for real-time applications.

Q3. What minimum prefix length is required for OpenAI prompt caching to activate?
A3. 1,024 tokens (as per OpenAI docs).

Student Exercise

Exercise 3.4 — Cost calculator
Build a Python function that processes a list of 100 text samples through gpt-4o-mini, tracks total token usage, and prints: total input tokens, total output tokens, total cost in USD, and average cost per sample.
