10.2 API Gateway & Rate Limiting
Key Concepts: FastAPI wrapper · Request queuing · Token budgets
Official Docs: FastAPI · slowapi
Why You Need an API Gateway
Exposing an LLM directly to users without a gateway risks:
- Abuse — one user can exhaust your entire API quota
- Cost explosions — no per-user token limits
- Security gaps — no authentication or input validation
- No observability — no logging of who asked what
Basic FastAPI Gateway
pip install fastapi uvicorn slowapi openai python-dotenv
import time

from fastapi import FastAPI, HTTPException, Request
from openai import AsyncOpenAI
from pydantic import BaseModel
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Async client, so the upstream call does not block the event loop.
# The API key is read from the OPENAI_API_KEY environment variable.
client = AsyncOpenAI()
app = FastAPI(title="LLM API Gateway")

# Rate-limit by client IP address
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


class ChatRequest(BaseModel):
    message: str
    max_tokens: int = 500


class ChatResponse(BaseModel):
    response: str
    tokens_used: int
    latency_ms: float


@app.post("/chat", response_model=ChatResponse)
@limiter.limit("10/minute")  # Max 10 requests per IP per minute
async def chat(request: Request, body: ChatRequest):
    # slowapi needs the `request` argument to identify the caller

    # Token budget guard
    if body.max_tokens > 1000:
        raise HTTPException(400, "max_tokens exceeds limit of 1000")

    # Input length guard
    if len(body.message) > 2000:
        raise HTTPException(400, "Message too long (max 2000 characters)")

    start = time.perf_counter()  # monotonic clock, suited to measuring latency
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": body.message}],
        max_tokens=body.max_tokens,
    )
    latency = (time.perf_counter() - start) * 1000

    return ChatResponse(
        response=resp.choices[0].message.content or "",
        tokens_used=resp.usage.total_tokens,
        latency_ms=round(latency, 1),
    )


@app.get("/health")
async def health():
    return {"status": "ok"}

# Run: uvicorn main:app --host 0.0.0.0 --port 8080
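To make the "10/minute" limit concrete, here is a minimal sliding-window limiter in plain Python. It is only a sketch of the general idea and is not a description of how slowapi works internally; the class name `SlidingWindowLimiter` and its `clock` parameter are hypothetical names chosen for this illustration.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_requests` per key within the last `window_seconds`."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        # key (e.g. an IP address) -> timestamps of recently accepted requests
        self._hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # a gateway would answer with HTTP 429 here
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=10, window_seconds=60)
    results = [limiter.allow("203.0.113.7") for _ in range(12)]
    print(results.count(True), "allowed,", results.count(False), "rejected")
```

Each key has its own window, so one noisy client does not use up the allowance of another.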
Per-User Token Budget
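from collections import defaultdict

from fastapi import HTTPException

# In-memory token tracker (use Redis in production).
# Counts are lost on restart and are never reset here; a real daily
# budget also needs a reset at the start of each day.
user_token_counts: dict[str, int] = defaultdict(int)
DAILY_TOKEN_LIMIT = 50_000


def check_token_budget(user_id: str, requested_tokens: int):
    """Call before the LLM request; rejects it if the budget would be exceeded."""
    current = user_token_counts[user_id]
    if current + requested_tokens > DAILY_TOKEN_LIMIT:
        raise HTTPException(
            429,
            f"Daily token limit exceeded. Used: {current}/{DAILY_TOKEN_LIMIT}"
        )


def record_token_usage(user_id: str, tokens_used: int):
    """Call after the LLM request with the tokens actually consumed."""
    user_token_counts[user_id] += tokens_used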
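The tracker above never resets its counts, so the limit is not really "daily" yet. One possible way to add a reset, sketched under the assumption that a UTC calendar day is the budget period: the names `DailyTokenBudget`, `BudgetExceeded` and the `today` parameter are hypothetical, and a plain exception stands in for `HTTPException` so the example runs without FastAPI.

```python
from collections import defaultdict
from datetime import date, datetime, timezone
from typing import Callable

DAILY_TOKEN_LIMIT = 50_000  # same value as in the tracker above


class BudgetExceeded(Exception):
    """Raised when a request would push a user past the daily limit."""


def _utc_today() -> date:
    return datetime.now(timezone.utc).date()


class DailyTokenBudget:
    def __init__(
        self,
        limit: int = DAILY_TOKEN_LIMIT,
        today: Callable[[], date] = _utc_today,  # injectable for testing
    ):
        self.limit = limit
        self._today = today
        self._day = today()
        self._used = defaultdict(int)  # user_id -> tokens used on self._day

    def _roll_over(self) -> None:
        # Clear all counters the first time we are called on a new day
        current = self._today()
        if current != self._day:
            self._day = current
            self._used.clear()

    def check(self, user_id: str, requested_tokens: int) -> None:
        """Call before the LLM request, e.g. with the request's max_tokens."""
        self._roll_over()
        used = self._used[user_id]
        if used + requested_tokens > self.limit:
            raise BudgetExceeded(
                f"Daily token limit exceeded. Used: {used}/{self.limit}"
            )

    def record(self, user_id: str, tokens_used: int) -> None:
        """Call after the LLM request with the tokens actually consumed."""
        self._roll_over()
        self._used[user_id] += tokens_used

    def used(self, user_id: str) -> int:
        self._roll_over()
        return self._used[user_id]
```

In the gateway, `check` would run before the OpenAI call and `record` after it with `resp.usage.total_tokens`; a handler could translate `BudgetExceeded` into an HTTP 429 response. The state is still in-memory and per-process, so the advice to use Redis in production applies here too.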
Common Mistakes
- No rate limiting at all — without rate limits, a single malicious or buggy client can drain your entire OpenAI credit in minutes.
- Rate limiting by IP only — users can bypass IP-based limits using VPNs. Add user-based rate limiting with API keys.
- No request logging — without logging, you can’t debug incidents or track usage patterns.
- Exposing raw API keys to clients — clients should call your gateway with your own API key scheme. Never pass your OpenAI key to the frontend.
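For the "rate limiting by IP only" mistake, one option is to key the limit on the caller's API key and fall back to the IP address when no key is sent. The function below is a sketch under assumptions: `rate_limit_key` is a hypothetical name, the `X-API-Key` header is the one used in the exercise at the end of this section, and the request object is assumed to expose `headers` and `client.host` the way a Starlette/FastAPI `Request` does.

```python
import hashlib


def rate_limit_key(request) -> str:
    """Return a per-caller key: hashed API key if present, else client IP."""
    api_key = request.headers.get("X-API-Key")
    if api_key:
        # Hash the key so the raw secret is never stored in limiter state or logs
        digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
        return f"key:{digest[:16]}"
    # `client` can be missing (for example behind some test clients or proxies)
    client = getattr(request, "client", None)
    host = client.host if client else "unknown"
    return f"ip:{host}"
```

In the gateway above, this function could be passed as `Limiter(key_func=rate_limit_key)` in place of `get_remote_address`. It only identifies the caller; checking that the API key is actually valid is a separate step.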
Quick Quiz
Q1. What is the purpose of rate limiting in an LLM gateway?
A1. To prevent any single user or IP from consuming excessive resources, protecting against abuse and unexpected cost spikes.
Q2. Why is per-user token budgeting important beyond request rate limiting?
A2. A user could send only a few requests with very large prompts, staying under the request-count limit while still running up a large token bill.
Q3. What is the difference between HTTP 429 and 400 error codes?
A3. 429 = Too Many Requests (rate limit exceeded). 400 = Bad Request (invalid input such as message too long).
Student Exercise
Exercise 10.2 — Build a gateway
Build the FastAPI gateway above. Add: (1) a simple API key header check (X-API-Key), (2) logging of every request to a CSV file (timestamp, IP, tokens_used, latency_ms), (3) a GET /usage endpoint that returns total tokens used today.