10.4 Monitoring & Observability

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: LangSmith · Logging prompts/responses · Drift detection

Official Docs: LangSmith · OpenTelemetry


What to Monitor

Metric | Why it matters | Alert threshold
--- | --- | ---
Latency (p50, p95, p99) | User experience | p95 > 5 s
Error rate | Availability | > 1%
Token usage | Cost | > 120% of baseline
Cost per request | Budget | > budget threshold
Output quality score | Quality drift | < baseline - 0.1
Rate limit hits | Capacity | > 5% of requests
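
The thresholds in the table above can be checked mechanically. The following is a minimal sketch, assuming latencies are collected in seconds and that the counters come from your own metrics store; `percentile` and `check_alerts` are hypothetical helper names, and the nearest-rank method is one of several common percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(latencies_s, errors, total_requests, tokens, baseline_tokens, rate_limited):
    """Return the names of the metrics that breach the thresholds from the table."""
    alerts = []
    if percentile(latencies_s, 95) > 5.0:          # p95 > 5 s
        alerts.append("latency_p95")
    if errors / total_requests > 0.01:             # error rate > 1%
        alerts.append("error_rate")
    if tokens > 1.2 * baseline_tokens:             # > 120% of baseline
        alerts.append("token_usage")
    if rate_limited / total_requests > 0.05:       # > 5% of requests
        alerts.append("rate_limit_hits")
    return alerts


# Illustrative data: 94 fast requests and 6 slow ones
latencies = [0.8] * 94 + [6.0] * 6
alerts = check_alerts(
    latencies_s=latencies,
    errors=2,
    total_requests=100,
    tokens=130_000,
    baseline_tokens=100_000,
    rate_limited=3,
)
print(percentile(latencies, 50), percentile(latencies, 95), alerts)
```

In this sample the median stays at 0.8 s while p95 is 6.0 s, which is why the table alerts on a percentile rather than on the average.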

Option 1 — LangSmith (LangChain)

Once the tracing environment variables are set, LangSmith records every LangChain chain and agent call without further instrumentation:

pip install langsmith langchain-openai

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Set LangSmith env vars
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..." # from smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"

llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm

# Every call is now traced automatically in LangSmith
result = chain.invoke({"question": "What is the capital of France?"})
print(result.content)

The LangSmith dashboard shows latency, token usage, cost, inputs and outputs, and errors for each traced run.


Option 2 — Custom Structured Logging

For any LLM framework, add structured logging around every LLM call:

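import json
import logging
import time
from datetime import datetime, timezone
from hashlib import sha256

from openai import OpenAI

# Structured JSON logger: one JSON object per line, no extra prefix
logger = logging.getLogger("llm.monitor")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s"))
logger.handlers = [handler]
logger.propagate = False  # avoid duplicate lines via the root logger

# Assumed gpt-4o-mini list prices in USD per token; verify against current pricing
INPUT_USD_PER_TOKEN = 0.15 / 1_000_000
OUTPUT_USD_PER_TOKEN = 0.60 / 1_000_000

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def monitored_completion(
    messages: list,
    model: str = "gpt-4o-mini",
    session_id: str = "default",
    **kwargs,
) -> str:
    """OpenAI call wrapper that logs latency, tokens, cost, and errors."""
    start = time.perf_counter()
    error = None
    prompt_tokens = 0
    completion_tokens = 0

    try:
        resp = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
        if resp.usage is not None:
            prompt_tokens = resp.usage.prompt_tokens
            completion_tokens = resp.usage.completion_tokens
        # content can be None (for example, when the model returns a tool call)
        return resp.choices[0].message.content or ""
    except Exception as e:
        error = str(e)
        raise
    finally:
        # Runs on success and on failure, so every call produces a log line
        latency_ms = round((time.perf_counter() - start) * 1000, 1)
        cost_usd = (
            prompt_tokens * INPUT_USD_PER_TOKEN
            + completion_tokens * OUTPUT_USD_PER_TOKEN
        )
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,
            "model": model,
            "latency_ms": latency_ms,
            "tokens": prompt_tokens + completion_tokens,
            "cost_usd": round(cost_usd, 6),
            "error": error,
            # Hash instead of raw text so the log carries no user content
            "input_hash": sha256(json.dumps(messages).encode()).hexdigest()[:16],
        }
        logger.info(json.dumps(log_entry))


# Usage
response = monitored_completion(
    messages=[{"role": "user", "content": "What is Python?"}],
    session_id="session_001",
)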
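
As an alternative to calling `json.dumps` at every log site, the serialization can live in the logging layer. The sketch below is one possible approach, not part of the wrapper above: `JsonFormatter`, the logger name `llm.monitor.demo`, and the `llm` key passed through `extra` are all hypothetical names chosen for illustration. It writes to an in-memory stream so it runs without an API key.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"llm": {...}} become record.llm
        payload.update(getattr(record, "llm", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout or a log file
demo_handler = logging.StreamHandler(stream)
demo_handler.setFormatter(JsonFormatter())

demo_logger = logging.getLogger("llm.monitor.demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(demo_handler)
demo_logger.propagate = False  # keep the root logger from printing a second copy

demo_logger.info(
    "llm_call",
    extra={"llm": {"model": "gpt-4o-mini", "latency_ms": 812.4, "tokens": 57}},
)

line = stream.getvalue().strip()
parsed = json.loads(line)
print(parsed["message"], parsed["latency_ms"])
```

Keeping one JSON object per line makes the output straightforward to ship to a log aggregator or to parse later with `json.loads`.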

Quality Drift Detection

Monitor output quality over time to catch model drift:

from collections import deque

class QualityMonitor:
    """Track average quality score and alert on drift."""

    def __init__(self, window: int = 100, alert_threshold: float = 0.1):
        self.scores = deque(maxlen=window)
        self.baseline = None
        self.alert_threshold = alert_threshold

    def record(self, score: float):
        self.scores.append(score)
        if len(self.scores) == self.scores.maxlen and self.baseline is None:
            self.baseline = sum(self.scores) / len(self.scores)
            print(f"Baseline established: {self.baseline:.3f}")

    def check_drift(self) -> bool:
        if self.baseline is None or len(self.scores) < 20:
            return False
        current = sum(self.scores) / len(self.scores)
        drift = self.baseline - current
        if drift > self.alert_threshold:
            print(f"⚠️ Quality drift detected: {current:.3f} vs baseline {self.baseline:.3f}")
            return True
        return False

monitor = QualityMonitor(window=50, alert_threshold=0.1)
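
To see how a frozen baseline and a rolling window interact, the same idea can be run on a short simulated score stream. This is a self-contained sketch with a deliberately small window; `first_drift_index` is a hypothetical helper, and the scores are made-up values standing in for whatever quality metric you use.

```python
from collections import deque


def first_drift_index(scores, window=5, alert_threshold=0.1):
    """Return the index at which the rolling mean first falls more than
    alert_threshold below the baseline, or None if it never does."""
    recent = deque(maxlen=window)
    baseline = None
    for i, score in enumerate(scores):
        recent.append(score)
        if len(recent) < window:
            continue  # not enough samples yet
        current = sum(recent) / window
        if baseline is None:
            baseline = current  # frozen once the first window fills
            continue
        if baseline - current > alert_threshold:
            return i
    return None


# Five good scores establish the baseline, then quality drops
degrading = [0.9] * 5 + [0.6] * 5
stable = [0.9] * 10

drift_at = first_drift_index(degrading)
no_drift = first_drift_index(stable)
print(drift_at, no_drift)
```

The alert fires on the second low score rather than the first, because a single bad sample only moves the five-sample mean from 0.90 to 0.84. A larger window smooths noise further but delays detection.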

Common Mistakes

  1. Logging in development but not in production: set up monitoring from day one. A production incident without logs is very hard to debug.
  2. Storing raw prompt/response text in logs: this may expose sensitive user data. Log hashes and metadata, and keep full text in a secure, access-controlled store.
  3. No alerting on cost: a loop bug can generate thousands of unexpected LLM calls. Always set budget alerts in your provider or cloud billing console.
  4. Ignoring p99 latency: average latency can look fine while 1% of users experience 30+ second timeouts. Monitor percentile latency, not only the mean.
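
For the second mistake in the list, the log record can be built from metadata alone. The following is a minimal sketch, assuming chat-style messages with `role` and `content` keys; `safe_log_fields` is a hypothetical helper name. A plain truncated SHA-256 is used here for correlation only; for short or guessable inputs, a keyed hash (HMAC) would be a safer choice.

```python
import hashlib
import json


def safe_log_fields(messages):
    """Describe a prompt for logging without including any of its text."""
    serialized = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return {
        # Stable fingerprint: identical prompts produce identical hashes
        "input_hash": hashlib.sha256(serialized.encode("utf-8")).hexdigest()[:16],
        "input_chars": sum(len(m.get("content", "")) for m in messages),
        "message_count": len(messages),
        "roles": [m.get("role") for m in messages],
    }


sensitive = [{"role": "user", "content": "My SSN is 123-45-6789"}]
fields = safe_log_fields(sensitive)
log_line = json.dumps(fields)
print(log_line)
```

The hash lets you group repeated prompts and correlate a log line with the full text held in the access-controlled store, without the text itself appearing in the log.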

Quick Quiz

Test Your Understanding

Q1. What is the difference between p50 and p99 latency?
A1. p50 is the median response time: half of all requests complete faster. p99 is the 99th percentile: 99% of requests complete faster, and only the slowest 1% take longer. p99 reveals near-worst-case user experience.

Q2. What is quality drift and why is it dangerous?
A2. A gradual decline in output quality over time, potentially caused by model updates or input distribution shifts. Dangerous because it can go unnoticed without automated monitoring.

Q3. What does LangSmith automatically trace?
A3. Every LangChain chain and agent call: input, output, latency, token usage, cost, and errors. Once the tracing environment variables are set, no manual instrumentation is needed.


Student Exercise

Exercise 11.4 — Monitoring dashboard
Add the monitored_completion wrapper to your previous project. Log 50 calls. Compute: average latency, p95 latency, total cost, and error rate. Display as a summary table.

