6.6 Building a Validation Pipeline

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Rule engine + LLM judge + Ensemble → combined validation gate


The Full Validation Stack

Each layer catches different types of errors:

User Query
    ↓
[Layer 1] Input Filtering      ← Moderation API, PII check (free, fast)
    ↓
[Layer 2] LLM Generation
    ↓
[Layer 3] Rule-Based Checks    ← Schema, regex, policy (free, fast)
    ↓ FAIL → retry
[Layer 4] LLM-as-Judge         ← Quality + faithfulness (~$0.001)
    ↓ FAIL → retry or escalate
[Layer 5] Claim Decomposition  ← Atomic fact check (expensive, on-demand)
    ↓ PASS
Return to User
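
The ordering in the stack, cheapest checks first with a stop at the first failure, can be sketched without any model API. In this minimal sketch the `run_layers` helper and the two stand-in checks are hypothetical names used only for illustration; each check returns a list of error strings, and an empty list means the layer passed.

```python
from typing import Callable

# A check takes the candidate answer and returns a list of error strings.
Check = Callable[[str], list[str]]

def run_layers(answer: str, layers: list[tuple[str, Check]]) -> tuple[bool, dict]:
    """Run checks in order and stop at the first layer that reports errors."""
    results: dict = {}
    for name, check in layers:
        errors = check(answer)
        results[name] = errors or "PASS"
        if errors:
            return False, results  # later (more expensive) layers never run
    return True, results

# Hypothetical stand-in checks, ordered from cheapest to most expensive.
def length_check(answer: str) -> list[str]:
    return [] if len(answer.split()) >= 3 else ["too short"]

def fake_judge(answer: str) -> list[str]:
    # Placeholder for an LLM judge call; assumed here to reject hedged answers.
    return ["judge: low confidence"] if "maybe" in answer.lower() else []

layers = [("rules", length_check), ("judge", fake_judge)]
ok_short, report_short = run_layers("Too short", layers)
ok_good, report_good = run_layers("Light travels at about 299,792 km/s.", layers)
```

The report records only the layers that actually ran, which makes it easy to see which layer rejected an answer and that the costlier judge was skipped.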

Complete Implementation

from openai import OpenAI
from pydantic import BaseModel, ValidationError
import json
import re
from dataclasses import dataclass
from enum import Enum

client = OpenAI()

class ValidationStatus(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"

@dataclass
class ValidationReport:
    status: ValidationStatus
    layer_results: dict
    final_answer: str = ""
    escalation_reason: str = ""

# --- Layer 1: Input Filtering ---
def check_input(user_input: str) -> bool:
    resp = client.moderations.create(input=user_input)
    return not resp.results[0].flagged

# --- Layer 3: Rule-Based Checks ---
def rule_check(answer: str, min_words: int = 10, max_words: int = 500) -> list[str]:
    errors = []
    words = len(re.findall(r'\b\w+\b', answer))
    if not (min_words <= words <= max_words):
        errors.append(f"Word count {words} not in [{min_words}, {max_words}]")
    if re.search(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', answer):
        errors.append("Contains email address")
    return errors

# --- Layer 4: LLM-as-Judge ---
def llm_judge(question: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""Rate this answer 1-5.
Q: {question}\nA: {answer}
JSON: {{"score": int, "issues": [list]}}"""}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# --- Full Pipeline ---
def validate_pipeline(
    question: str,
    max_retries: int = 2,
    enable_claim_check: bool = False,  # Reserved for Layer 5; not used in this listing
) -> ValidationReport:
    # Start from FAIL so every early exit is a failure unless proven otherwise
    report = ValidationReport(
        status=ValidationStatus.FAIL,
        layer_results={},
    )

    # Layer 1: Input
    if not check_input(question):
        report.layer_results["input"] = "BLOCKED"
        return report
    report.layer_results["input"] = "PASS"

    # Generate + validate with retries
    for attempt in range(max_retries + 1):
        # Layer 2: Generate
        gen_resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
            temperature=0 if attempt > 0 else 0.7,  # Deterministic on retry
        )
        answer = gen_resp.choices[0].message.content or ""

        # Layer 3: Rule checks
        rule_errors = rule_check(answer)
        report.layer_results[f"rules_attempt_{attempt}"] = rule_errors or "PASS"
        if rule_errors:
            continue  # Retry

        # Layer 4: LLM judge
        judgement = llm_judge(question, answer)
        report.layer_results[f"llm_judge_attempt_{attempt}"] = judgement

        score = judgement.get("score", 0)  # Treat a missing score as a failure
        if score >= 4:
            report.status = ValidationStatus.PASS
            report.final_answer = answer
            return report
        elif score == 3 and attempt == max_retries:
            report.status = ValidationStatus.ESCALATE
            report.final_answer = answer
            report.escalation_reason = f"Score 3/5 after {max_retries+1} attempts"
            return report

    # Every attempt failed the rule checks or scored too low: status stays FAIL
    return report

# Run it
result = validate_pipeline("What is the speed of light?")
print(f"Status: {result.status.value}")
if result.status == ValidationStatus.PASS:
    print(f"Answer: {result.final_answer}")
elif result.status == ValidationStatus.ESCALATE:
    print(f"Escalation reason: {result.escalation_reason}")
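
Layer 5 appears in the stack, but the listing above does not implement it: the `enable_claim_check` flag is never read. The following is a minimal sketch of what an on-demand claim check could look like, under stated assumptions. Claims are approximated by splitting on sentence boundaries, and `verify_claim` is a hypothetical callable supplied by the caller (in practice it might wrap an LLM call or a retrieval lookup). Neither `split_claims` nor `claim_check` comes from the original code.

```python
import re
from typing import Callable

def split_claims(answer: str) -> list[str]:
    """Naive decomposition: treat each sentence as one atomic claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]

def claim_check(answer: str, verify_claim: Callable[[str], bool]) -> dict:
    """Verify each claim separately and report the ones that fail."""
    claims = split_claims(answer)
    unsupported = [c for c in claims if not verify_claim(c)]
    return {
        "total": len(claims),
        "unsupported": unsupported,
        "passed": not unsupported,
    }

# Hypothetical verifier: a tiny allow-list standing in for a real fact source.
KNOWN_FACTS = {"The speed of light is about 299,792 km/s."}

result = claim_check(
    "The speed of light is about 299,792 km/s. It was first measured in 1200.",
    verify_claim=lambda claim: claim in KNOWN_FACTS,
)
```

Checking claims one at a time means a single unsupported sentence is flagged even when the rest of the answer is sound, which a whole-answer judge score can hide.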

Common Mistakes

  1. Validating everything at Layer 4 — not every query needs LLM-as-judge. Use rule checks as the primary gate and only use the expensive LLM judge for borderline cases.
  2. Infinite retry loop — always cap retries at 2–3. More retries rarely improve quality and increase cost linearly.
  3. No escalation path — some queries will always be uncertain. Build a human escalation queue rather than returning potentially incorrect answers.
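
Mistakes 2 and 3 can be addressed together: cap the number of attempts, and once the cap is reached hand the case to a queue instead of returning a doubtful answer. A minimal sketch, assuming an in-memory `queue.Queue` as the escalation queue and caller-supplied `generate` and `is_acceptable` callables. All of these names are hypothetical, and a production system would more likely use a persistent queue or a ticketing system.

```python
import queue
from typing import Callable, Optional

# In-memory stand-in for a human review queue (assumption for illustration).
escalation_queue = queue.Queue()

def answer_with_cap(
    question: str,
    generate: Callable[[str, int], str],
    is_acceptable: Callable[[str], bool],
    max_retries: int = 2,
) -> Optional[str]:
    """Try at most max_retries + 1 times, then escalate instead of guessing."""
    last_answer = ""
    for attempt in range(max_retries + 1):
        last_answer = generate(question, attempt)
        if is_acceptable(last_answer):
            return last_answer
    # Cap reached: queue the case for a human reviewer and return nothing.
    escalation_queue.put({
        "question": question,
        "last_answer": last_answer,
        "attempts": max_retries + 1,
    })
    return None

# Stub generator that records each call and never satisfies the validator.
calls = []
def stub_generate(question: str, attempt: int) -> str:
    calls.append(attempt)
    return f"draft {attempt}"

outcome = answer_with_cap("Who won yesterday's match?", stub_generate, lambda a: False)
```

With the default cap, the stub is called exactly three times, the function returns `None`, and one item lands in the queue for review.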

Quick Quiz

Test Your Understanding

Q1. In the validation stack, which layer is most cost-effective to run first?
A1. Rule-based checks — they are free, instant, and catch structural/formatting failures before spending on LLM judge calls.

Q2. What determines when to escalate vs fail vs pass?
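A2. Confidence thresholds: high confidence in quality = PASS, high confidence of failure = FAIL, uncertain middle ground = ESCALATE to human review. In the implementation above, a judge score of 4 or 5 passes, a score of 3 on the final attempt escalates, and anything lower fails.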

Q3. How does the pipeline handle a consistently low-quality answer after max retries?
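A3. In the implementation above, the outcome depends on the final judge score. A borderline 3/5 on the last attempt returns ESCALATE, which should route the case to a human reviewer queue. Lower scores, or answers that keep failing the rule checks, return FAIL with no answer. In neither case is a bad answer returned to the user as if it had passed.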
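
The pass, retry, escalate, and fail decision reduces to a small mapping from judge score and attempt number to an outcome. A minimal sketch, assuming the thresholds used in the implementation above (pass at 4 or higher, escalate on a 3 only once retries are exhausted). The `decide` function and its string return values are illustrative names, not part of the original code.

```python
def decide(score: int, attempt: int, max_retries: int = 2) -> str:
    """Map a 1-5 judge score to pass / retry / escalate / fail."""
    if score >= 4:
        return "pass"
    if attempt < max_retries:
        return "retry"      # attempts remain, so generate again
    if score == 3:
        return "escalate"   # borderline after the final attempt
    return "fail"           # clearly poor after the final attempt

outcomes = [decide(5, 0), decide(3, 0), decide(3, 2), decide(1, 2)]
```

Keeping the thresholds in one function makes them easy to tune and to unit-test separately from the API calls.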


Student Exercise

Exercise 12.6 — Build the full pipeline
Implement the complete 4-layer validation pipeline. Test it on: (1) a question the model answers well, (2) a question about very recent events (likely uncertain), (3) a question containing PII. Observe which layer catches each issue.

