
3.5 Error Handling & Retries

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Rate limits · Timeouts · Exponential backoff · Fallback models

Official Docs: OpenAI Error Codes · OpenAI Rate Limits


Common API Errors

| Error | HTTP Code | Cause | Fix |
| --- | --- | --- | --- |
| RateLimitError | 429 | Too many requests or tokens per minute/day | Retry with exponential backoff |
| AuthenticationError | 401 | Invalid or missing API key | Check OPENAI_API_KEY env var |
| BadRequestError | 400 | Invalid request (malformed JSON, bad params) | Fix request structure |
| NotFoundError | 404 | Model doesn't exist or wrong endpoint | Check model name |
| InternalServerError | 5xx | Provider-side outage | Retry with backoff |
| APITimeoutError | (none) | Request took too long; no HTTP response was received | Increase timeout or retry |
| ContextWindowExceededError (in the OpenAI SDK this surfaces as BadRequestError) | 400 | Prompt + response exceeds context limit | Reduce prompt length |
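
The table splits into errors worth retrying (429, 5xx, timeouts) and errors that need a fix in your code or configuration (400, 401, 404). A minimal sketch of that decision, assuming you only have the HTTP status code on hand. The helper name `is_retryable` is hypothetical and not part of the OpenAI SDK:

```python
from typing import Optional

# Status codes from the table above that indicate a transient condition.
RETRYABLE_STATUS = {429}


def is_retryable(status_code: Optional[int]) -> bool:
    """Return True if a failed request is worth retrying.

    `None` stands for a timeout, where no HTTP response was received.
    """
    if status_code is None:          # timeout: no response at all
        return True
    if status_code in RETRYABLE_STATUS:
        return True
    return 500 <= status_code < 600  # provider-side errors


for code in (429, 401, 400, 404, 500, None):
    print(code, "->", "retry" if is_retryable(code) else "fix the request")
```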

Automatic Retries with the OpenAI SDK

The OpenAI Python SDK retries certain failures automatically (twice by default). Set max_retries to change the limit:

from openai import OpenAI

# Retries connection errors, timeouts, 429s, and 5xx responses up to 3 times
client = OpenAI(max_retries=3)

Manual Exponential Backoff

For more control, implement your own retry wrapper:

import time
import random
from openai import OpenAI, RateLimitError, InternalServerError

client = OpenAI(max_retries=0) # disable auto-retries to use our own

def with_backoff(fn, *args, max_retries: int = 5, base_delay: float = 1.0, **kwargs):
    """Call fn with exponential backoff on rate-limit and server errors."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except (RateLimitError, InternalServerError) as e:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the original error
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 0.5s of jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({type(e).__name__}). Retrying in {delay:.1f}s...")
            time.sleep(delay)

response = with_backoff(
    client.chat.completions.create,
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the speed of light?"}],
)
print(response.choices[0].message.content)
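
To make the delay formula concrete, the sketch below computes the sleep schedule that `base_delay * (2 ** attempt) + jitter` produces. The function name `backoff_delays` and the `max_jitter` and `rng` parameters are illustrative assumptions, not part of the code above; no API calls are made.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int = 5, base_delay: float = 1.0,
                   max_jitter: float = 0.5,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delays slept between attempts: base_delay * 2**attempt + jitter.

    There is no sleep after the final attempt, so the list has
    max_retries - 1 entries.
    """
    rng = rng or random.Random()
    return [base_delay * (2 ** attempt) + rng.uniform(0, max_jitter)
            for attempt in range(max_retries - 1)]


# With jitter switched off the schedule is exactly 1, 2, 4, 8 seconds.
no_jitter = backoff_delays(max_jitter=0.0)
print(no_jitter)        # [1.0, 2.0, 4.0, 8.0]
print(sum(no_jitter))   # 15.0
```

With the defaults, a call that fails every time spends roughly 15 seconds sleeping before the final error is raised, which is worth keeping in mind when choosing max_retries for an interactive application.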

Using tenacity for Retry Logic

tenacity is a widely used retry library that lets you declare the same policy with a decorator:

pip install tenacity

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import OpenAI, RateLimitError

client = OpenAI(max_retries=0)  # disable SDK retries so tenacity is the only retry layer

@retry(
    retry=retry_if_exception_type(RateLimitError),
    # Exponential wait clamped to 2-60s; use wait_random_exponential instead to add jitter
    wait=wait_exponential(multiplier=1, min=2, max=60),
    stop=stop_after_attempt(6),
    reraise=True,  # re-raise the original error instead of tenacity.RetryError
)
def chat(messages: list[dict]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return resp.choices[0].message.content

print(chat([{"role": "user", "content": "Hello!"}]))

Fallback Models

Degrade gracefully to a cheaper/available model if the primary model is unavailable:

from openai import OpenAI, RateLimitError, InternalServerError

client = OpenAI()
MODEL_FALLBACKS = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]

def chat_with_fallback(messages: list[dict]) -> str:
    last_error = None
    for model in MODEL_FALLBACKS:  # tried in order, most capable first
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=30,  # per-request timeout in seconds
            )
            return resp.choices[0].message.content
        except (RateLimitError, InternalServerError) as e:
            last_error = e
            print(f"{model} unavailable: {e}")
    # Chain the last failure so the traceback shows why the final model failed
    raise RuntimeError("All models failed.") from last_error
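
The control flow of the fallback loop can be exercised without network access. The sketch below is a stand-in under stated assumptions: `ModelUnavailable`, `first_success`, and `fake_call` are hypothetical names, and the fake call simulates an outage of the primary model only.

```python
from typing import Callable, List, Sequence


class ModelUnavailable(Exception):
    """Hypothetical stand-in for RateLimitError / InternalServerError."""


def first_success(models: Sequence[str], call: Callable[[str], str]) -> str:
    """Try each model in order and return the first successful result."""
    errors: List[str] = []
    for model in models:
        try:
            return call(model)
        except ModelUnavailable as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))


def fake_call(model: str) -> str:
    # Simulate an outage of the primary model only.
    if model == "gpt-4o":
        raise ModelUnavailable("simulated outage")
    return f"answer from {model}"


print(first_success(["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"], fake_call))
# prints: answer from gpt-4o-mini
```

Collecting the per-model errors into the final exception message is a design choice that makes a total failure easier to diagnose than a bare "All models failed."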

Setting Timeouts

from openai import OpenAI

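client = OpenAI(
    timeout=30.0,   # seconds, applied to each attempt (the SDK default is 10 minutes)
    max_retries=2,  # a request that times out is retried, so total wait can exceed 30s
)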
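
Because a timed-out request is retried, the timeout and the retry count multiply. A rough estimate of the worst case, assuming each attempt gives up after about `timeout` seconds; the function name and the `backoff_total` parameter (time spent sleeping between attempts, supplied by the caller) are illustrative assumptions:

```python
def worst_case_seconds(timeout: float, max_retries: int,
                       backoff_total: float = 0.0) -> float:
    """Rough worst-case wall-clock time if every attempt times out.

    One initial attempt plus `max_retries` retries, each allowed to run
    for about `timeout` seconds, plus whatever time is spent sleeping
    between attempts (`backoff_total`, an assumed figure).
    """
    attempts = 1 + max_retries
    return attempts * timeout + backoff_total


print(worst_case_seconds(timeout=30.0, max_retries=2))  # 90.0
```

With the settings above, a user could wait around 90 seconds plus backoff before seeing an error, so choose the two values together rather than independently.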

Common Mistakes

  1. Infinite retry loops — always set a maximum retry count. Infinite loops burn API credits and can cascade into larger outages.
  2. Retrying BadRequestError (400) — 400 errors indicate a malformed request. Retrying won’t fix them; fix the request instead.
  3. No jitter in backoff — without random jitter, all clients retry simultaneously after an outage, creating a “thundering herd” that re-triggers the rate limit.
  4. Not logging failures — always log the error type, attempt number, and delay so you can debug production issues.
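
The four points above can be combined in one small wrapper: a hard attempt limit, retries only for transient errors, jittered delays, and a log line per failure. This is a sketch using only the standard library; `TransientError`, `retry_logged`, and the injectable `sleep` parameter are hypothetical names chosen for illustration, and the flaky function simulates two failures before succeeding.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("retry")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable API error (429 or 5xx)."""


def retry_logged(fn: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Bounded retries with jittered exponential backoff and logging."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as exc:  # other exceptions propagate at once
            if attempt == max_attempts:
                logger.error("giving up after %d attempts: %s",
                             attempt, type(exc).__name__)
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt, type(exc).__name__, delay)
            sleep(delay)
    # Only reached when the loop body never ran.
    raise ValueError("max_attempts must be at least 1")


calls = {"n": 0}


def flaky() -> str:
    # Fail on the first two calls, then succeed.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated 429")
    return "ok"


logging.basicConfig(level=logging.INFO)
# A no-op sleep keeps the demo instant; omit it to use time.sleep.
print(retry_logged(flaky, sleep=lambda _seconds: None))  # prints: ok
```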

Quick Quiz

Test Your Understanding

Q1. What does HTTP 429 mean in the context of LLM APIs?
A1. Rate limit exceeded — too many requests per minute (or tokens per minute) for your tier.

Q2. Why should exponential backoff include random jitter?
A2. Without jitter, all clients retry at the same time after a rate-limit window resets, causing another burst that triggers the rate limit again.

Q3. Should you retry a BadRequestError (400)?
A3. No — a 400 error means your request is malformed. Retrying without changing the request will always fail.

Q4. What is the purpose of a fallback model strategy?
A4. To degrade gracefully to a cheaper or different model when the primary model is unavailable, ensuring service continuity.


Student Exercise

Exercise 3.5 — Resilient API client
Build a ResilientClient class that wraps the OpenAI SDK. It should: retry on rate-limit errors (max 5 attempts, exponential backoff with jitter), fall back to gpt-4o-mini if gpt-4o fails, log each retry attempt, and raise after all retries are exhausted.

