3.5 Error Handling & Retries
Key Concepts: Rate limits · Timeouts · Exponential backoff · Fallback models
Official Docs: OpenAI Error Codes · OpenAI Rate Limits
Common API Errors
| Error | HTTP Code | Cause | Fix |
|---|---|---|---|
| RateLimitError | 429 | Too many requests or tokens per minute/day | Retry with exponential backoff |
| AuthenticationError | 401 | Invalid or missing API key | Check the OPENAI_API_KEY env var |
| BadRequestError | 400 | Invalid request (malformed JSON, bad params) | Fix request structure |
| NotFoundError | 404 | Model doesn’t exist or wrong endpoint | Check model name |
| InternalServerError | 5xx | Provider-side error or outage | Retry with backoff |
| APITimeoutError | n/a (raised client-side) | Request took too long | Increase timeout or retry |
| BadRequestError (error code context_length_exceeded) | 400 | Prompt + response exceeds context limit | Reduce prompt length |
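The table splits into two groups: errors that may succeed if the same request is sent again, and errors that need a fix first. The sketch below encodes that split as a plain function over HTTP status codes. The name `is_retryable` and the exact set of codes are illustrative assumptions, not part of the OpenAI SDK.

```python
from typing import Optional

# Status codes treated as transient in this sketch (an assumption based on
# the table above: rate limits, plus any provider-side 5xx error below).
RETRYABLE_STATUS = {429}


def is_retryable(status_code: Optional[int]) -> bool:
    """Return True if a failed request is worth retrying unchanged.

    `status_code` is None when no HTTP response arrived at all,
    for example on a client-side timeout.
    """
    if status_code is None:
        return True  # timeout or dropped connection: may succeed next time
    if status_code in RETRYABLE_STATUS:
        return True  # rate limited: back off, then retry
    if 500 <= status_code < 600:
        return True  # provider-side failure
    return False  # 400, 401, 404, ...: fix the request or credentials first


for code in (429, 500, 400, 401, 404, None):
    print(code, "retry" if is_retryable(code) else "fix and resend")
```

Keeping this decision in one place makes it harder to retry a 400 or 401 by accident.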
Automatic Retries with the OpenAI SDK
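The OpenAI SDK has built-in retry logic. By default it retries connection errors, timeouts, and 408, 409, 429, and 5xx responses up to 2 times, with a short exponential backoff between attempts:

    from openai import OpenAI

    # Raise the retry limit from the default of 2 to 3
    client = OpenAI(max_retries=3)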
Manual Exponential Backoff
For more control, implement your own retry helper:

    import random
    import time

    from openai import OpenAI, RateLimitError, InternalServerError

    client = OpenAI(max_retries=0)  # disable auto-retries to use our own

    def with_backoff(fn, *args, max_retries: int = 5, base_delay: float = 1.0, **kwargs):
        """Call fn with exponential backoff on rate-limit and server errors.

        max_retries is the total number of attempts, including the first call.
        """
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except (RateLimitError, InternalServerError) as e:
                if attempt == max_retries - 1:
                    raise  # out of attempts: surface the last error
                # 1s, 2s, 4s, ... plus up to 0.5s of random jitter
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                print(f"Attempt {attempt + 1} failed ({type(e).__name__}). Retrying in {delay:.1f}s...")
                time.sleep(delay)

    response = with_backoff(
        client.chat.completions.create,
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the speed of light?"}],
    )
    print(response.choices[0].message.content)
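The delay in `with_backoff` doubles without an upper bound and adds only a small fixed jitter. A common variant, described as "full jitter" in the AWS article listed under Further Reading, caps the delay and draws it uniformly between zero and the capped value. The following is a minimal sketch of that variant; the names `backoff_ceiling` and `backoff_delay` and the 30-second cap are assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_ceiling(attempt: int, base_delay: float = 1.0, cap: float = 30.0) -> float:
    """Largest delay allowed before retry number `attempt` (0-based)."""
    return min(cap, base_delay * (2 ** attempt))


def backoff_delay(attempt: int, base_delay: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a uniform random delay between 0 and the capped ceiling."""
    rng = rng or random.Random()
    return rng.uniform(0.0, backoff_ceiling(attempt, base_delay, cap))


# The ceilings are deterministic; the actual delays are random below them.
print([backoff_ceiling(a) for a in range(6)])  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
print([round(backoff_delay(a), 2) for a in range(6)])
```

Passing a seeded `random.Random` as `rng` makes the delays reproducible in tests.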
Using tenacity for Retry Logic
tenacity is a general-purpose retry library that expresses the retry policy as a decorator. Install it with:

    pip install tenacity

Then wrap the API call:

    from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
    from openai import OpenAI, RateLimitError

    # Disable the SDK's own retries so the two retry layers don't multiply
    client = OpenAI(max_retries=0)

    @retry(
        retry=retry_if_exception_type(RateLimitError),       # retry only on 429s
        wait=wait_exponential(multiplier=1, min=2, max=60),  # exponential wait, clamped to 2-60s
        stop=stop_after_attempt(6),                          # then raise tenacity.RetryError
    )
    def chat(messages: list[dict]) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        return resp.choices[0].message.content

    print(chat([{"role": "user", "content": "Hello!"}]))
Fallback Models
Degrade gracefully to a cheaper or more available model if the primary model is unavailable:

    from openai import OpenAI, RateLimitError, InternalServerError

    client = OpenAI()

    # Ordered from most to least preferred
    MODEL_FALLBACKS = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]

    def chat_with_fallback(messages: list[dict]) -> str:
        for model in MODEL_FALLBACKS:
            try:
                resp = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    timeout=30,  # per-request timeout in seconds
                )
                return resp.choices[0].message.content
            except (RateLimitError, InternalServerError) as e:
                print(f"{model} unavailable: {e}. Trying fallback...")
        # Reached only if every model in the list failed
        raise RuntimeError("All models failed.")
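Fallback logic is easier to test when the model call is passed in as a function, so the chain can run without network access. The sketch below shows the same try-in-order pattern with a fake backend. `TransientError`, `first_successful`, and `fake_call` are hypothetical names; in real code the `except` clause would catch the SDK errors used above.

```python
from typing import Callable, List, Sequence, Tuple


class TransientError(Exception):
    """Stand-in for RateLimitError / InternalServerError in this sketch."""


def first_successful(models: Sequence[str],
                     call: Callable[[str], str]) -> Tuple[str, str]:
    """Try each model in order; return (model, reply) from the first that works."""
    errors: List[str] = []
    for model in models:
        try:
            return model, call(model)
        except TransientError as exc:
            errors.append(f"{model}: {exc}")  # remember why each model failed
    raise RuntimeError("All models failed: " + "; ".join(errors))


def fake_call(model: str) -> str:
    """Hypothetical backend in which only gpt-4o-mini is currently healthy."""
    if model != "gpt-4o-mini":
        raise TransientError("simulated 429")
    return f"hello from {model}"


used, reply = first_successful(["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"], fake_call)
print(used, "->", reply)  # gpt-4o-mini -> hello from gpt-4o-mini
```

Returning the model name along with the reply lets callers record which model actually served the request.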
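Setting Timeouts
Set a client-wide timeout so that a stalled request fails fast instead of hanging. The SDK default is 10 minutes.

    from openai import OpenAI

    client = OpenAI(
        timeout=30.0,   # 30-second timeout, applied to each attempt separately
        max_retries=2,  # timed-out requests are retried as well
    )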
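Because the timeout applies to each attempt, retries multiply the worst-case wait. The helper below is simple arithmetic for that upper bound. `worst_case_seconds` is a hypothetical name, and the backoff delays are passed in as illustrative values rather than taken from the SDK.

```python
from typing import Sequence


def worst_case_seconds(timeout: float, max_retries: int,
                       delays: Sequence[float]) -> float:
    """Upper bound on wall-clock time if every attempt times out.

    There are max_retries + 1 attempts in total, with one backoff delay
    between each pair of consecutive attempts.
    """
    if len(delays) != max_retries:
        raise ValueError("expected exactly one delay per retry")
    return timeout * (max_retries + 1) + sum(delays)


# timeout=30.0 and max_retries=2 match the client above; the 1s and 2s
# delays are assumed values, not the SDK's actual backoff schedule.
print(worst_case_seconds(30.0, 2, [1.0, 2.0]))  # 93.0
```

A caller that needs an answer within a fixed budget should choose the timeout and retry count together, not independently.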
Common Mistakes
- Infinite retry loops: always set a maximum retry count. Infinite loops burn API credits and can cascade into larger outages.
- Retrying BadRequestError (400): 400 errors indicate a malformed request. Retrying won’t fix them; fix the request instead.
- No jitter in backoff: without random jitter, all clients retry simultaneously after an outage, creating a “thundering herd” that re-triggers the rate limit.
- Not logging failures: always log the error type, attempt number, and delay so you can debug production issues.
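The four points above can be combined in one small decorator: a hard attempt limit, retries only for transient errors, jittered delays, and a log line per failure. This is a sketch using only the standard library. `TransientError`, `retry`, and `flaky` are hypothetical names, and the injectable `sleep` parameter is an assumption added to make testing easy.

```python
import functools
import logging
import random
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("retry")


class TransientError(Exception):
    """Stand-in for a retryable error such as a 429 or 5xx response."""


def retry(max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Decorator: retry TransientError only, with a hard attempt limit."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TransientError as exc:
                    if attempt == max_attempts:
                        logger.error("giving up after %d attempts: %s", attempt, exc)
                        raise
                    # Random delay below an exponentially growing ceiling
                    delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
                    logger.warning("attempt %d failed (%s); retrying in %.2fs",
                                   attempt, type(exc).__name__, delay)
                    sleep(delay)
        return wrapper
    return decorator


calls = {"n": 0}


@retry(max_attempts=4, base_delay=0.01)
def flaky() -> str:
    """Hypothetical call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated 429")
    return "ok"


print(flaky(), "after", calls["n"], "attempts")  # ok after 3 attempts
```

Any other exception, such as one standing in for a 400 response, is not caught and propagates on the first attempt.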
Quick Quiz
Q1. What does HTTP 429 mean in the context of LLM APIs?
A1. Rate limit exceeded — too many requests per minute (or tokens per minute) for your tier.
Q2. Why should exponential backoff include random jitter?
A2. Without jitter, all clients retry at the same time after a rate-limit window resets, causing another burst that triggers the rate limit again.
Q3. Should you retry a BadRequestError (400)?
A3. No — a 400 error means your request is malformed. Retrying without changing the request will always fail.
Q4. What is the purpose of a fallback model strategy?
A4. To degrade gracefully to a cheaper or different model when the primary model is unavailable, ensuring service continuity.
Student Exercise
Exercise 3.5 — Resilient API client
Build a ResilientClient class that wraps the OpenAI SDK. It should: retry on rate-limit errors (max 5 attempts, exponential backoff with jitter), fall back to gpt-4o-mini if gpt-4o fails, log each retry attempt, and raise after all retries are exhausted.
Further Reading
- 📘 OpenAI — Error Codes
- 📘 OpenAI — Rate Limits Guide
- 📘 tenacity library docs
- 📄 Exponential Backoff and Jitter — AWS Blog