10.4 Monitoring & Observability
Key Concepts: LangSmith · Logging prompts/responses · Drift detection
Official Docs: LangSmith · OpenTelemetry
What to Monitor
| Metric | Why it matters | Alert threshold |
|---|---|---|
| Latency (p50, p95, p99) | User experience | p95 > 5s |
| Error rate | Availability | > 1% |
| Token usage | Cost | > 120% of baseline |
| Cost per request | Budget | > budget threshold |
| Output quality score | Quality drift | < baseline - 0.1 |
| Rate limit hits | Capacity | > 5% of requests |
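As a rough illustration of how the thresholds in the table could be checked, the sketch below computes nearest-rank percentiles and returns the names of metrics that breach their limits. The function names, the metric labels, and the sample data are hypothetical, and the thresholds are copied from the table rather than recommended values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(latencies_s, errors, total_requests, rate_limit_hits):
    """Return the names of metrics that breach the thresholds from the table."""
    if total_requests <= 0:
        return []
    alerts = []
    if percentile(latencies_s, 95) > 5.0:        # p95 > 5s
        alerts.append("latency_p95")
    if errors / total_requests > 0.01:           # error rate > 1%
        alerts.append("error_rate")
    if rate_limit_hits / total_requests > 0.05:  # rate limit hits > 5%
        alerts.append("rate_limit_hits")
    return alerts


# Hypothetical sample: 100 requests, 94 fast and 6 slow
latencies = [0.8] * 94 + [6.0] * 6
alerts = check_alerts(latencies, errors=2, total_requests=100, rate_limit_hits=3)
print(percentile(latencies, 50), percentile(latencies, 95), alerts)
```

In this sample the median is 0.8 s while p95 is 6.0 s, so the latency alert fires even though most requests are fast.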
Option 1 — LangSmith (LangChain)
LangSmith traces LangChain chain and agent calls automatically once the tracing environment variables are set:

```bash
pip install langsmith langchain-openai
```

```python
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Set LangSmith env vars before the first call
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."  # from smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"

# ChatOpenAI also needs OPENAI_API_KEY in the environment
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm

# Every call is now traced automatically in LangSmith
result = chain.invoke({"question": "What is the capital of France?"})
print(result.content)
```
The LangSmith dashboard then shows latency, token usage, cost, inputs and outputs, and errors for each traced call.
Option 2 — Custom Structured Logging
For any LLM framework, add structured logging around every LLM call:
```python
import json
import logging
import time
from datetime import datetime, timezone
from hashlib import sha256

from openai import OpenAI

# Structured JSON logger: one JSON object per line, with no extra prefix
logger = logging.getLogger("llm.monitor")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s"))
logger.handlers = [handler]
logger.propagate = False  # avoid duplicate lines via the root logger

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed gpt-4o-mini prices in USD per token; verify against current pricing
INPUT_PRICE = 0.15 / 1_000_000
OUTPUT_PRICE = 0.60 / 1_000_000


def monitored_completion(
    messages: list,
    model: str = "gpt-4o-mini",
    session_id: str = "default",
    **kwargs,
) -> str:
    """OpenAI call wrapper that logs latency, tokens, cost, and errors."""
    start = time.perf_counter()
    error = None
    prompt_tokens = completion_tokens = 0
    try:
        resp = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
        if resp.usage is not None:
            prompt_tokens = resp.usage.prompt_tokens
            completion_tokens = resp.usage.completion_tokens
        return resp.choices[0].message.content or ""
    except Exception as e:
        error = str(e)
        raise
    finally:
        # Runs on success and on failure, so every call produces one log line
        latency_ms = round((time.perf_counter() - start) * 1000, 1)
        cost = prompt_tokens * INPUT_PRICE + completion_tokens * OUTPUT_PRICE
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,
            "model": model,
            "latency_ms": latency_ms,
            "tokens": prompt_tokens + completion_tokens,
            "cost_usd": round(cost, 6),
            "error": error,
            # Log a hash rather than the raw prompt text
            "input_hash": sha256(json.dumps(messages).encode()).hexdigest()[:16],
        }
        logger.info(json.dumps(log_entry))


# Usage
response = monitored_completion(
    messages=[{"role": "user", "content": "What is Python?"}],
    session_id="session_001",
)
```
Quality Drift Detection
Monitor output quality over time to catch model drift:
```python
from collections import deque
from typing import Optional


class QualityMonitor:
    """Track a rolling average quality score and alert on drift."""

    def __init__(self, window: int = 100, alert_threshold: float = 0.1):
        self.scores = deque(maxlen=window)  # rolling window of recent scores
        self.baseline: Optional[float] = None
        self.alert_threshold = alert_threshold

    def record(self, score: float) -> None:
        self.scores.append(score)
        # Freeze the baseline once, from the first full window
        if self.baseline is None and len(self.scores) == self.scores.maxlen:
            self.baseline = sum(self.scores) / len(self.scores)
            print(f"Baseline established: {self.baseline:.3f}")

    def check_drift(self) -> bool:
        # The baseline exists only after one full window has been recorded
        if self.baseline is None:
            return False
        current = sum(self.scores) / len(self.scores)
        drift = self.baseline - current
        if drift > self.alert_threshold:
            print(f"⚠️ Quality drift detected: {current:.3f} vs baseline {self.baseline:.3f}")
            return True
        return False


# Call monitor.record(score) after each scored response, then monitor.check_drift()
monitor = QualityMonitor(window=50, alert_threshold=0.1)
```
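To see when the rule used by QualityMonitor would fire, the following sketch applies the same logic (a baseline frozen from the first full window, then a rolling mean compared against it) to a list of scores. The helper name `first_drift_index` and the simulated scores are hypothetical; how real quality scores are produced is outside the scope of this sketch.

```python
from collections import deque


def first_drift_index(scores, window=50, alert_threshold=0.1):
    """Return the index of the first score at which drift is flagged, or None."""
    recent = deque(maxlen=window)
    baseline = None
    for i, score in enumerate(scores):
        recent.append(score)
        if len(recent) < window:
            continue  # not enough data yet
        mean = sum(recent) / window
        if baseline is None:
            baseline = mean  # frozen from the first full window
        elif baseline - mean > alert_threshold:
            return i
    return None


# Simulated data: 50 good scores, then a sharp quality drop
scores = [1.0] * 50 + [0.25] * 50
drift_at = first_drift_index(scores, window=50, alert_threshold=0.1)
print(drift_at)
```

With these simulated scores the alert fires at index 56, seven responses after the drop begins. The rolling window needs time to fill with low scores, so a smaller window reacts faster but is noisier.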
Common Mistakes
- Logging in development but not in production: set up monitoring from day one. Production incidents are very hard to debug without logs.
- Storing raw prompt/response text in logs: this may expose sensitive user data. Log hashes and metadata; keep full text in a secure, access-controlled store.
- No alerting on cost: a loop bug can generate thousands of unexpected LLM calls. Set budget alerts or spending limits with your LLM provider or in your cloud billing console.
- Ignoring p99 latency: average latency can look fine while 1% of requests take 30+ seconds or time out. Monitor percentile latency.
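Billing alerts usually arrive after the money is spent, so an in-process limit can complement them. The sketch below is one possible guard against the runaway-loop case described above. `BudgetGuard`, `BudgetExceeded`, and the limits are hypothetical names and values, and the per-call cost is assumed to be estimated by the caller.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the call count or spend limit is exceeded."""


class BudgetGuard:
    def __init__(self, max_calls: int, max_cost_usd: float):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one call; raise once either limit is exceeded."""
        self.calls += 1
        self.cost_usd += cost_usd
        if self.calls > self.max_calls or self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"{self.calls} calls, ${self.cost_usd:.2f} spent"
            )


guard = BudgetGuard(max_calls=3, max_cost_usd=1.00)
completed = 0
try:
    while True:             # simulated runaway loop
        guard.charge(0.01)  # call before each LLM request
        completed += 1      # stands in for the actual LLM call
except BudgetExceeded as exc:
    print(f"Stopped after {completed} calls: {exc}")
```

Here the fourth attempt raises, so only three simulated calls go through.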
Quick Quiz
Q1. What is the difference between p50 and p99 latency?
A1. p50 is the median response time (50% of requests are faster). p99 is the 99th percentile: 99% of requests are faster, and only the slowest 1% take longer. p99 reveals near-worst-case user experience.
Q2. What is quality drift and why is it dangerous?
A2. A gradual decline in output quality over time, potentially caused by model updates or input distribution shifts. Dangerous because it can go unnoticed without automated monitoring.
Q3. What does LangSmith automatically trace?
A3. Every LangChain chain and agent call: input, output, latency, token usage, cost, and errors. Once tracing is enabled through environment variables, no manual instrumentation is needed.
Student Exercise
Exercise 11.4 — Monitoring dashboard
Add the monitored_completion wrapper to your previous project. Log 50 calls. Compute: average latency, p95 latency, total cost, and error rate. Display as a summary table.