10.2 API Gateway & Rate Limiting

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: FastAPI wrapper · Request queuing · Token budgets

Official Docs: FastAPI · slowapi

Why You Need an API Gateway

Exposing an LLM directly to users without a gateway risks:

Abuse: one user can exhaust your entire API quota
Cost explosions: with no per-user token limits, a few heavy users can run up the whole bill
Security gaps: no authentication or input validation
No observability: no record of who asked what

Basic FastAPI Gateway

pip install fastapi uvicorn slowapi openai python-dotenv

from fastapi import FastAPI, HTTPException, Request, Depends
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from pydantic import BaseModel
from openai import OpenAI
import time

client = OpenAI()
app = FastAPI(title="LLM API Gateway")
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

class ChatRequest(BaseModel):
    message: str
    max_tokens: int = 500

class ChatResponse(BaseModel):
    response: str
    tokens_used: int
    latency_ms: float

@app.post("/chat", response_model=ChatResponse)
@limiter.limit("10/minute")   # Max 10 requests per IP per minute
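def chat(request: Request, body: ChatRequest):
    # Plain `def` on purpose: the OpenAI client call below is blocking, and FastAPI
    # runs sync endpoints in a threadpool, so the event loop is not stalled.
    # Token budget guard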
    if body.max_tokens > 1000:
        raise HTTPException(400, "max_tokens exceeds limit of 1000")
    
    # Input length guard
    if len(body.message) > 2000:
        raise HTTPException(400, "Message too long (max 2000 characters)")
    
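    # perf_counter() is monotonic, so the measured latency cannot be skewed by clock adjustments
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": body.message}],
        max_tokens=body.max_tokens,
    )
    latency = (time.perf_counter() - start) * 1000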
    
    return ChatResponse(
        response=resp.choices[0].message.content,
        tokens_used=resp.usage.total_tokens,
        latency_ms=round(latency, 1),
    )

@app.get("/health")
async def health():
    return {"status": "ok"}

# Run: uvicorn main:app --host 0.0.0.0 --port 8080
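
To make the "10/minute" limit above more concrete, here is a minimal sketch of a fixed-window counter written with the standard library only. It is a simplified illustration of the idea, not slowapi's actual implementation; the class name `FixedWindowLimiter` and its `clock` parameter are hypothetical and exist only for this example.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each `window_s`-second window."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        # key -> (window_start, requests_seen_in_window)
        self._windows: dict[str, tuple[float, int]] = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_s:
            # The previous window has expired: start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            return False  # a gateway would answer with HTTP 429 here
        self._windows[key] = (start, count + 1)
        return True


# Roughly equivalent in spirit to "10/minute", keyed by client IP.
limiter_sketch = FixedWindowLimiter(limit=10, window_s=60)
```

Each key (an IP address in the gateway above) gets its own counter, so one noisy client does not consume another client's allowance. A fixed window can let through a burst of up to twice the limit across a window boundary, which is one reason production limiters often offer other strategies.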

Per-User Token Budget

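from collections import defaultdict

from fastapi import HTTPException

# In-memory token tracker (use Redis in production). Counts live in this process only:
# they are lost on restart and are not shared between workers.
# Nothing here resets the counts at midnight; a real daily budget needs a reset
# or a date-keyed store.
user_token_counts: dict[str, int] = defaultdict(int)
DAILY_TOKEN_LIMIT = 50_000

def check_token_budget(user_id: str, requested_tokens: int):
    """Raise HTTP 429 if this request could push the user past the daily limit."""
    current = user_token_counts[user_id]
    if current + requested_tokens > DAILY_TOKEN_LIMIT:
        raise HTTPException(
            429,
            f"Daily token limit exceeded. Used: {current}/{DAILY_TOKEN_LIMIT}"
        )

def record_token_usage(user_id: str, tokens_used: int):
    """Add the tokens actually consumed (e.g. resp.usage.total_tokens) to the user's total."""
    user_token_counts[user_id] += tokens_used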
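
The two helpers above are meant to bracket the model call: check before, record after. The sketch below shows one way this might be wired together, with usage keyed by date so the budget starts fresh each day. It uses only the standard library; `BudgetExceeded`, `DailyTokenBudget`, `guarded_call`, and the `call_model` callback are hypothetical names introduced for illustration, and a real gateway would translate `BudgetExceeded` into an HTTP 429 response.

```python
from collections import defaultdict
from datetime import date


class BudgetExceeded(Exception):
    """Hypothetical error; a gateway would map this to HTTP 429."""


class DailyTokenBudget:
    def __init__(self, daily_limit: int = 50_000, today=date.today):
        self.daily_limit = daily_limit
        self.today = today  # injectable so tests can simulate a new day
        # (user_id, day) -> tokens used; old days are never pruned in this sketch
        self._used: dict[tuple[str, date], int] = defaultdict(int)

    def used_today(self, user_id: str) -> int:
        return self._used.get((user_id, self.today()), 0)

    def check(self, user_id: str, requested_tokens: int) -> None:
        used = self.used_today(user_id)
        if used + requested_tokens > self.daily_limit:
            raise BudgetExceeded(
                f"Daily token limit exceeded. Used: {used}/{self.daily_limit}"
            )

    def record(self, user_id: str, tokens_used: int) -> None:
        self._used[(user_id, self.today())] += tokens_used


def guarded_call(budget: DailyTokenBudget, user_id: str, max_tokens: int, call_model):
    """Check the budget, run the model call, then charge the actual usage.

    `call_model` is assumed to return (text, tokens_used).
    """
    # max_tokens bounds only the completion, so this check is an approximation:
    # prompt tokens are charged afterwards via tokens_used.
    budget.check(user_id, max_tokens)
    text, tokens_used = call_model()
    budget.record(user_id, tokens_used)
    return text
```

Because the check happens before the call and the charge after it, two concurrent requests from the same user can both pass the check; a shared store with atomic increments (the Redis option mentioned above) is the usual way to close that gap.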

Common Mistakes

No rate limiting at all: without rate limits, a single malicious or buggy client can drain your entire OpenAI credit in minutes.
Rate limiting by IP only: users can bypass IP-based limits using VPNs, and users who share an IP (office NAT, proxy) are throttled together. Add user-based rate limiting with API keys.
No request logging: without logging, you can't debug incidents or track usage patterns.
Exposing raw API keys to clients: clients should call your gateway with your own API key scheme. Never pass your OpenAI key to the frontend.
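
As a rough illustration of user-based rate limiting, the sketch below picks the rate-limit key from the caller's API key when one is present and falls back to the client IP otherwise. The function name `rate_limit_key` is hypothetical, and the `X-API-Key` header name is taken from the exercise below. The function only assumes a request object with `.headers` and `.client.host`, as Starlette requests provide; the `SimpleNamespace` object is a stand-in for demonstration.

```python
from types import SimpleNamespace


def rate_limit_key(request) -> str:
    """Prefer the caller's API key; fall back to the client IP."""
    api_key = request.headers.get("X-API-Key")
    if api_key:
        return f"key:{api_key}"
    # request.client can be None (for example in some test setups)
    client = request.client
    host = client.host if client else "unknown"
    return f"ip:{host}"


# Stand-in for a real request object, only for illustration.
demo = SimpleNamespace(
    headers={"X-API-Key": "abc123"},
    client=SimpleNamespace(host="203.0.113.7"),
)
print(rate_limit_key(demo))  # key:abc123
```

Since slowapi's `Limiter` takes a `key_func` callable, a function like this could be passed in place of `get_remote_address` in the gateway above. The API key should still be validated separately, because a rate-limit key only decides whose counter is incremented, not whether the caller is allowed in.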

Quick Quiz

Test Your Understanding

Q1. What is the purpose of rate limiting in an LLM gateway?
A1. To prevent any single user or IP from consuming excessive resources, protecting against abuse and unexpected cost spikes.

Q2. Why is per-user token budgeting important beyond request rate limiting?
A2. A user could send a small number of requests with very large prompts or high max_tokens values, staying under the request-count limit while still running up large token costs.

Q3. What is the difference between HTTP 429 and 400 error codes?
A3. 429 = Too Many Requests (rate limit exceeded). 400 = Bad Request (invalid input such as message too long).

Student Exercise

Exercise 10.2 — Build a gateway
Build the FastAPI gateway above. Add: (1) a simple API key header check (X-API-Key), (2) logging of every request to a CSV file (timestamp, IP, tokens_used, latency_ms), (3) a GET /usage endpoint that returns total tokens used today.
