10.4 Monitoring & Observability

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: LangSmith · Logging prompts/responses · Drift detection

Official Docs: LangSmith · OpenTelemetry

What to Monitor

Metric	Why it matters	Alert threshold
Latency (p50, p95, p99)	User experience	p95 > 5s
Error rate	Availability	> 1%
Token usage	Cost	> 120% of baseline
Cost per request	Budget	> budget threshold
Output quality score	Quality drift	< baseline - 0.1
Rate limit hits	Capacity	> 5% of requests
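
As a rough illustration of how some of the thresholds in the table could be evaluated, the sketch below computes nearest-rank percentiles over a batch of request records and reports which alerts fire. The function names (percentile, check_alerts), the record fields, and the sample data are assumptions made for this example and do not come from any library.

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_alerts(records: list[dict], baseline_tokens: float) -> list[str]:
    """Return the names of the table's alerts that fire for this batch of records."""
    alerts = []
    latencies = [r["latency_s"] for r in records]
    if percentile(latencies, 95) > 5.0:            # p95 > 5s
        alerts.append("latency_p95")
    error_rate = sum(1 for r in records if r["error"]) / len(records)
    if error_rate > 0.01:                          # > 1% errors
        alerts.append("error_rate")
    avg_tokens = sum(r["tokens"] for r in records) / len(records)
    if avg_tokens > 1.2 * baseline_tokens:         # > 120% of baseline
        alerts.append("token_usage")
    return alerts

# Hypothetical batch: 100 requests, 6 of them slow and 2 of them failed
records = [{"latency_s": 1.0, "tokens": 200, "error": False} for _ in range(92)]
records += [{"latency_s": 8.0, "tokens": 200, "error": False} for _ in range(6)]
records += [{"latency_s": 1.0, "tokens": 200, "error": True} for _ in range(2)]

print(check_alerts(records, baseline_tokens=190))  # ['latency_p95', 'error_rate']
```

Note that the median latency in this sample is still 1.0 s, so an alert based only on the average or median would miss the slow requests.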

Option 1 — LangSmith (LangChain)

Once the tracing environment variables are set, LangSmith records every LangChain chain and agent call without further instrumentation:

pip install langsmith langchain-openai

import os
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Set LangSmith env vars
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."  # from smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "my-llm-app"

# ChatOpenAI reads the OpenAI key from the OPENAI_API_KEY environment variable
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = ChatPromptTemplate.from_template("Answer: {question}")
chain = prompt | llm

# Every call is now traced automatically in LangSmith
result = chain.invoke({"question": "What is the capital of France?"})
print(result.content)

The LangSmith dashboard then shows latency, token usage, cost, inputs and outputs, and errors for each traced run.

Option 2 — Custom Structured Logging

For any LLM framework, add structured logging around every LLM call:

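import json
import logging
import time
from datetime import datetime, timezone
from hashlib import sha256

from openai import OpenAI

# Structured JSON logger: one JSON object per line
logger = logging.getLogger("llm.monitor")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(message)s"))
logger.handlers = [handler]
logger.propagate = False  # avoid duplicate lines if the root logger also has handlers

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed gpt-4o-mini rates in USD per token; check the provider's current price list
INPUT_USD_PER_TOKEN = 0.15 / 1_000_000
OUTPUT_USD_PER_TOKEN = 0.60 / 1_000_000

def monitored_completion(
    messages: list,
    model: str = "gpt-4o-mini",
    session_id: str = "default",
    **kwargs
) -> str:
    """OpenAI call wrapper that logs latency, tokens, estimated cost, and errors."""
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    error = None
    prompt_tokens = 0
    completion_tokens = 0

    try:
        resp = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
        if resp.usage is not None:
            prompt_tokens = resp.usage.prompt_tokens
            completion_tokens = resp.usage.completion_tokens
        # content can be None (for example, when the model returns a tool call)
        return resp.choices[0].message.content or ""
    except Exception as e:
        error = str(e)
        raise
    finally:
        # Runs on success and on failure, so every call yields one log line
        latency_ms = round((time.perf_counter() - start) * 1000, 1)
        cost_usd = (prompt_tokens * INPUT_USD_PER_TOKEN
                    + completion_tokens * OUTPUT_USD_PER_TOKEN)
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "session_id": session_id,
            "model": model,
            "latency_ms": latency_ms,
            "tokens": prompt_tokens + completion_tokens,
            "cost_usd": round(cost_usd, 6),
            "error": error,
            # Hash instead of raw text so prompts do not end up in the logs
            "input_hash": sha256(json.dumps(messages).encode()).hexdigest()[:16],
        }
        logger.info(json.dumps(log_entry))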

# Usage
response = monitored_completion(
    messages=[{"role": "user", "content": "What is Python?"}],
    session_id="session_001",
)

Quality Drift Detection

Monitor output quality over time to catch model drift:

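from collections import deque
from typing import Optional

class QualityMonitor:
    """Track a rolling average quality score and flag drops below a frozen baseline."""

    def __init__(self, window: int = 100, alert_threshold: float = 0.1):
        self.scores = deque(maxlen=window)  # keeps only the most recent `window` scores
        self.baseline: Optional[float] = None
        self.alert_threshold = alert_threshold

    def record(self, score: float) -> None:
        self.scores.append(score)
        # Freeze the baseline the first time the window fills up
        if self.baseline is None and len(self.scores) == self.scores.maxlen:
            self.baseline = sum(self.scores) / len(self.scores)
            print(f"Baseline established: {self.baseline:.3f}")

    def check_drift(self) -> bool:
        if self.baseline is None:
            return False  # still collecting the first window
        current = sum(self.scores) / len(self.scores)
        drift = self.baseline - current  # positive when quality has dropped
        if drift > self.alert_threshold:
            print(f"⚠️ Quality drift detected: {current:.3f} vs baseline {self.baseline:.3f}")
            return True
        return False

monitor = QualityMonitor(window=50, alert_threshold=0.1)

# Usage: record one score per response, then check for drift
for score in [0.9] * 50 + [0.7] * 50:
    monitor.record(score)
print(monitor.check_drift())  # True: the rolling mean fell from about 0.9 to about 0.7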
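
The monitor needs one numeric score per response, and this section leaves open where that score comes from. Purely as an illustration, the sketch below scores a response as the fraction of a few cheap checks it passes. The function name heuristic_quality_score, the checks themselves, and all default values are assumptions for this example; a real score would more likely come from human ratings or an evaluation model.

```python
def heuristic_quality_score(
    response: str,
    min_chars: int = 20,
    max_chars: int = 2000,
    refusal_markers: tuple[str, ...] = ("i cannot", "i can't", "as an ai"),
) -> float:
    """Score a response from 0.0 to 1.0 as the fraction of simple checks it passes."""
    text = response.strip()
    lowered = text.lower()
    checks = [
        len(text) > 0,                                   # not empty
        min_chars <= len(text) <= max_chars,             # plausible length
        not any(m in lowered for m in refusal_markers),  # no refusal boilerplate
        text.endswith((".", "!", "?", "`")),             # not obviously cut off
    ]
    return sum(checks) / len(checks)

good = "Python is a high-level, general-purpose programming language."
bad = "As an AI, I cannot"

print(heuristic_quality_score(good))  # 1.0
print(heuristic_quality_score(bad))   # 0.25
```

Each score can then be passed to QualityMonitor.record. Substring checks like these are crude and will misfire on some legitimate answers, so they are best treated as a smoke signal and not as a measure of correctness.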

Common Mistakes

Logging in development but not in production: set up monitoring from day one. Production incidents without logs are extremely hard to debug.
Storing raw prompt/response text in logs: this can expose sensitive user data. Log hashes and metadata, and keep full text only in a secure, access-controlled store.
No alerting on cost: a loop bug can generate thousands of unexpected LLM calls. Set budget alerts or spending limits in your provider's billing dashboard.
Ignoring p99 latency: average latency can look fine while 1% of users wait 30 seconds or more, or hit timeouts. Monitor percentile latency, not just the mean.
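
Billing alerts usually arrive after the money is spent, so an in-process cap can complement them. The sketch below shows one possible guard that tracks estimated spend per time window and refuses further calls once a cap is reached. BudgetGuard, BudgetExceeded, the cap, and the per-call cost are all hypothetical names and values chosen for this example.

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the cap for the current window."""

class BudgetGuard:
    """Track estimated spend in a fixed time window and refuse charges over the cap."""

    def __init__(self, max_usd: float, window_s: float = 3600.0, clock=time.monotonic):
        self.max_usd = max_usd
        self.window_s = window_s
        self._clock = clock              # injectable so the guard can be tested
        self._window_start = clock()
        self.spent = 0.0

    def charge(self, cost_usd: float) -> None:
        """Register an estimated cost, or raise BudgetExceeded without recording it."""
        now = self._clock()
        if now - self._window_start >= self.window_s:
            # A new window has started: reset the running total
            self._window_start = now
            self.spent = 0.0
        if self.spent + cost_usd > self.max_usd:
            raise BudgetExceeded(
                f"cap of ${self.max_usd:.2f} reached (spent ${self.spent:.2f})"
            )
        self.spent += cost_usd

guard = BudgetGuard(max_usd=0.05)
calls_made = 0
try:
    for _ in range(1000):        # stands in for a runaway loop
        guard.charge(0.02)       # hypothetical per-call cost estimate
        calls_made += 1          # the LLM call itself would go here
except BudgetExceeded as exc:
    print(f"Stopped after {calls_made} calls: {exc}")
```

Calling charge before each LLM request means a loop bug fails fast in the application instead of surfacing later on an invoice. The guard shown here is per process and not thread-safe; a shared deployment would need a central counter.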

Quick Quiz

Test Your Understanding

Q1. What is the difference between p50 and p99 latency?
A1. p50 is the median response time: half of all requests complete faster. p99 is the 99th percentile: 99% of requests complete faster, and only the slowest 1% take longer. p99 reveals the near-worst-case user experience.

Q2. What is quality drift and why is it dangerous?
A2. A gradual decline in output quality over time, potentially caused by model updates or input distribution shifts. Dangerous because it can go unnoticed without automated monitoring.

Q3. What does LangSmith automatically trace?
A3. Once tracing is enabled through the environment variables, every LangChain chain and agent call: input, output, latency, token usage, cost, and errors, with no manual instrumentation.

Student Exercise

Exercise 10.4 — Monitoring dashboard
Add the monitored_completion wrapper to your previous project. Log 50 calls. Compute: average latency, p95 latency, total cost, and error rate. Display as a summary table.
