6.6 Building a Validation Pipeline
Key Concepts: Rule engine + LLM judge + Ensemble → combined validation gate
The Full Validation Stack
Each layer catches different types of errors:
User Query
↓
[Layer 1] Input Filtering ← Moderation API, PII check (free, fast)
↓
[Layer 2] LLM Generation
↓
[Layer 3] Rule-Based Checks ← Schema, regex, policy (free, fast)
↓ FAIL → retry
[Layer 4] LLM-as-Judge ← Quality + faithfulness (~$0.001)
↓ FAIL → retry or escalate
[Layer 5] Claim Decomposition ← Atomic fact check (expensive, on-demand)
↓ PASS
Return to User
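The cheap-first ordering in the stack can be sketched as a list of checks that stops at the first failure, so expensive layers only run on text that survived the free ones. This is a minimal illustration, not the section's implementation: `Layer`, `run_layers`, and the toy checks are hypothetical names, and the cost figures are placeholders that echo the diagram rather than measured prices.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Layer:
    name: str
    check: Callable[[str], bool]  # returns True when the text passes
    cost_usd: float               # illustrative per-call cost (assumption)


def run_layers(text: str, layers: list[Layer]) -> tuple[Optional[str], float]:
    """Run layers in order; return (first failing layer name or None, cost spent)."""
    spent = 0.0
    for layer in layers:
        spent += layer.cost_usd
        if not layer.check(text):
            return layer.name, spent  # short-circuit: later layers are never paid for
    return None, spent


# Toy checks standing in for the real rule engine and LLM judge.
layers = [
    Layer("rules", lambda t: len(t.split()) >= 3, 0.0),
    Layer("judge", lambda t: "maybe" not in t.lower(), 0.001),
]

failed, spent = run_layers("Too short", layers)
print(failed, spent)
```

Because the rule layer rejects the two-word input, the judge stand-in is never called and nothing is spent on it.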
Complete Implementation
import json
import re
from dataclasses import dataclass
from enum import Enum

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment
class ValidationStatus(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"

@dataclass
class ValidationReport:
    status: ValidationStatus
    layer_results: dict
    final_answer: str = ""
    escalation_reason: str = ""
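# --- Layer 1: Input Filtering ---
# Only moderation is checked here; the PII check shown in the diagram is not implemented.
def check_input(user_input: str) -> bool:
    """Return True when the moderation endpoint does not flag the input."""
    resp = client.moderations.create(input=user_input)
    return not resp.results[0].flagged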
# --- Layer 3: Rule-Based Checks ---
def rule_check(answer: str, min_words: int = 10, max_words: int = 500) -> list[str]:
    """Return a list of rule violations; an empty list means the answer passed."""
    errors = []
    # Length policy: count word tokens
    words = len(re.findall(r'\b\w+\b', answer))
    if not (min_words <= words <= max_words):
        errors.append(f"Word count {words} not in [{min_words}, {max_words}]")
    # Output policy: the answer must not leak an email address
    if re.search(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', answer):
        errors.append("Contains email address")
    return errors
# --- Layer 4: LLM-as-Judge ---
def llm_judge(question: str, answer: str) -> dict:
    """Ask a stronger model to grade the answer; expected shape: {"score": int, "issues": [...]}."""
    prompt = (
        "Rate this answer 1-5.\n"
        f"Q: {question}\nA: {answer}\n"
        'JSON: {"score": int, "issues": [list]}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},  # JSON mode; the prompt must mention JSON
    )
    return json.loads(resp.choices[0].message.content)
# --- Full Pipeline ---
def validate_pipeline(
    question: str,
    max_retries: int = 2,
    enable_claim_check: bool = False,  # Reserved for Layer 5; not used in this implementation
) -> ValidationReport:
    report = ValidationReport(
        status=ValidationStatus.FAIL,
        layer_results={},
    )

    # Layer 1: Input
    if not check_input(question):
        report.layer_results["input"] = "BLOCKED"
        return report
    report.layer_results["input"] = "PASS"

    # Generate + validate with retries
    feedback = ""  # Issues from the previous attempt, appended to the prompt on retry
    for attempt in range(max_retries + 1):
        # Layer 2: Generate
        gen_resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question + feedback}],
            # Deterministic on retry; the feedback changes the prompt, so
            # temperature-0 retries do not simply repeat the same answer
            temperature=0 if attempt > 0 else 0.7,
        )
        answer = gen_resp.choices[0].message.content or ""

        # Layer 3: Rule checks
        rule_errors = rule_check(answer)
        report.layer_results[f"rules_attempt_{attempt}"] = rule_errors or "PASS"
        if rule_errors:
            feedback = f"\n\nA previous answer was rejected for: {rule_errors}. Avoid these issues."
            continue  # Retry

        # Layer 4: LLM judge
        judgement = llm_judge(question, answer)
        report.layer_results[f"llm_judge_attempt_{attempt}"] = judgement
        score = judgement.get("score", 0)  # Treat a missing score as a failure
        if score >= 4:
            report.status = ValidationStatus.PASS
            report.final_answer = answer
            return report
        if score == 3 and attempt == max_retries:
            report.status = ValidationStatus.ESCALATE
            report.final_answer = answer
            report.escalation_reason = f"Score 3/5 after {max_retries+1} attempts"
            return report
        feedback = f"\n\nA previous answer was rejected for: {judgement.get('issues', [])}. Avoid these issues."

    # Every attempt failed a rule check or scored too low: status stays FAIL
    return report
# Run it
if __name__ == "__main__":
    result = validate_pipeline("What is the speed of light?")
    print(f"Status: {result.status.value}")
    if result.status == ValidationStatus.PASS:
        print(f"Answer: {result.final_answer}")
    elif result.status == ValidationStatus.ESCALATE:
        print(f"Escalation reason: {result.escalation_reason}")
    else:
        # FAIL: show which layer rejected each attempt
        print(f"Layer results: {result.layer_results}")
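Layer 5 (claim decomposition) appears in the stack but is not part of the implementation above. A rough sketch of the idea follows, under loud assumptions: `split_claims` is a naive sentence splitter standing in for LLM-based decomposition, `claim_check` and `TRUSTED` are hypothetical names, and the verifier is passed in as a plain function so the example runs without any API call. A real Layer 5 would likely use a model or a retrieval step for both decomposition and verification.

```python
import re
from typing import Callable


def split_claims(answer: str) -> list[str]:
    """Naive stand-in for claim decomposition: one claim per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def claim_check(answer: str, verify: Callable[[str], bool]) -> dict:
    """Verify each claim independently and report the unsupported ones."""
    claims = split_claims(answer)
    unsupported = [c for c in claims if not verify(c)]
    return {"claims": claims, "unsupported": unsupported, "passed": not unsupported}


# Toy verifier: exact lookup against a small set of trusted statements.
TRUSTED = {"The speed of light is 299,792,458 m/s."}

result = claim_check(
    "The speed of light is 299,792,458 m/s. It was first measured in 1999.",
    verify=lambda claim: claim in TRUSTED,
)
print(result["unsupported"])
```

Checking claims one at a time is what makes this layer expensive: the cost grows with the number of claims, which is why the stack runs it on demand only.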
Common Mistakes
- Validating everything at Layer 4 — not every query needs LLM-as-judge. Use rule checks as the primary gate and only use the expensive LLM judge for borderline cases.
- Infinite retry loop — always cap retries at 2–3. More retries rarely improve quality and increase cost linearly.
- No escalation path — some queries will always be uncertain. Build a human escalation queue rather than returning potentially incorrect answers.
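The last two points can be combined in one small helper: a hard cap on attempts, and a queue that receives whatever could not be validated. This is a sketch under assumptions, not a prescribed design. `answer_with_cap` and `escalation_queue` are hypothetical names, the in-memory `deque` stands in for a real review queue (a ticketing system or database table), and `generate` and `is_acceptable` are injected so the example runs without an API.

```python
from collections import deque
from typing import Callable, Optional

escalation_queue: deque = deque()  # stand-in for a human review queue


def answer_with_cap(
    question: str,
    generate: Callable[[str, int], str],
    is_acceptable: Callable[[str], bool],
    max_retries: int = 2,
) -> Optional[str]:
    """Try at most max_retries + 1 times; escalate instead of looping forever."""
    answer = ""
    for attempt in range(max_retries + 1):  # bounded: cost is at most max_retries + 1 calls
        answer = generate(question, attempt)
        if is_acceptable(answer):
            return answer
    # Nothing passed validation: hand the case to a human rather than return a doubtful answer
    escalation_queue.append(
        {"question": question, "last_answer": answer, "attempts": max_retries + 1}
    )
    return None


calls = []

def always_bad(question: str, attempt: int) -> str:
    calls.append(attempt)
    return "not sure"

outcome = answer_with_cap("Who won yesterday's match?", always_bad, lambda a: a != "not sure")
print(outcome, len(calls), len(escalation_queue))
```

The caller receives `None` for the escalated case, so it has to decide what to show the user while the question waits for review.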
Quick Quiz
Q1. In the validation stack, which check on the generated answer is most cost-effective to run first?
A1. Rule-based checks: they are free, instant, and catch structural and formatting failures before any money is spent on LLM judge calls.
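Q2. What determines when to escalate vs fail vs pass?
A2. Confidence thresholds on the judge score: high confidence in quality = PASS (a score of 4 or 5 in the implementation above), high confidence of failure = FAIL, and the uncertain middle ground (a score of 3 once retries are exhausted) = ESCALATE to human review.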
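Q3. How does the pipeline handle an answer that is still not good enough after max retries?
A3. It never returns the answer as a PASS. In the implementation above, a borderline score of 3 on the final attempt sets ESCALATE, which is meant to send the case to a human reviewer queue. If every attempt fails a rule check or scores below 3, the status stays FAIL and no answer is returned.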
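The threshold logic behind these answers can be isolated into one pure function, which makes it easy to test without calling a model. The sketch below mirrors the cut-offs used in the implementation (pass at 4, borderline at 3); `Decision` and `decide` are hypothetical names introduced for illustration, and the thresholds are parameters because the right values depend on the application.

```python
from enum import Enum


class Decision(Enum):
    PASS = "pass"
    RETRY = "retry"
    ESCALATE = "escalate"
    FAIL = "fail"


def decide(
    score: int,
    attempt: int,
    max_retries: int,
    pass_at: int = 4,
    borderline: int = 3,
) -> Decision:
    """Map a judge score and the attempt index (0-based) to the next action."""
    if score >= pass_at:
        return Decision.PASS
    if attempt < max_retries:
        return Decision.RETRY  # attempts remain, so try again before deciding
    # Final attempt: the uncertain middle goes to a human, clear failures are rejected
    return Decision.ESCALATE if score == borderline else Decision.FAIL


print(decide(score=3, attempt=2, max_retries=2))
```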
Student Exercise
Exercise 12.6 — Build the full pipeline
Implement the complete validation pipeline (Layers 1 to 4; Layer 5 is optional). Test it on: (1) a question the model answers well, (2) a question about very recent events (likely uncertain), (3) a question containing PII. Observe which layer catches each issue.
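One thing to watch for in case (3): the implementation's `check_input` only calls the moderation endpoint, and `rule_check` looks for email addresses in the answer, not the question. A possible starting point for an input-side PII check is sketched below. The names `PII_PATTERNS` and `find_pii` are hypothetical, and the two regular expressions are deliberately simple assumptions (email, and a US-style ten-digit phone number); they will miss many real PII formats.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text: str) -> list[str]:
    """Return the names of the PII patterns found in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]


test_questions = [
    "What is the speed of light?",
    "My email is jane@example.com, can you reset my account?",
    "Call me on 555-123-4567 about my order.",
]
for q in test_questions:
    print(find_pii(q) or "no PII found", "|", q)
```

A non-empty result from `find_pii` could be treated like a moderation flag in Layer 1, blocking the query before any generation cost is incurred.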