6.6 Building a Validation Pipeline

Key Concepts: Rule engine + LLM judge + Ensemble → combined validation gate

The Full Validation Stack

Each layer catches different types of errors:

User Query
    ↓
[Layer 1] Input Filtering     ← Moderation API, PII check (free, fast)
    ↓
[Layer 2] LLM Generation
    ↓
[Layer 3] Rule-Based Checks   ← Schema, regex, policy (free, fast)
    ↓ FAIL → retry
[Layer 4] LLM-as-Judge        ← Quality + faithfulness (~$0.001)
    ↓ FAIL → retry or escalate
[Layer 5] Claim Decomposition ← Atomic fact check (expensive, on-demand)
    ↓ PASS
Return to User

Complete Implementation

from openai import OpenAI
from pydantic import BaseModel, ValidationError
import json
import re
from dataclasses import dataclass
from enum import Enum

client = OpenAI()

class ValidationStatus(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"

@dataclass
class ValidationReport:
    status: ValidationStatus
    layer_results: dict
    final_answer: str = ""
    escalation_reason: str = ""

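# --- Layer 1: Input Filtering ---
# Moderation only: the PII check shown in the stack diagram is not implemented here.
def check_input(user_input: str) -> bool:
    """Return True when the moderation endpoint does not flag the input."""
    resp = client.moderations.create(input=user_input)
    return not resp.results[0].flagged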

# --- Layer 3: Rule-Based Checks ---
def rule_check(answer: str, min_words: int = 10, max_words: int = 500) -> list[str]:
    errors = []
    words = len(re.findall(r'\b\w+\b', answer))
    if not (min_words <= words <= max_words):
        errors.append(f"Word count {words} not in [{min_words}, {max_words}]")
    if re.search(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', answer):
        errors.append("Contains email address")
    return errors

# --- Layer 4: LLM-as-Judge ---
def llm_judge(question: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"""Rate this answer 1-5.
Q: {question}\nA: {answer}
JSON: {{"score": int, "issues": [list]}}"""}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# --- Full Pipeline ---
def validate_pipeline(
    question: str,
    max_retries: int = 2,
    enable_claim_check: bool = False,  # Reserved for Layer 5; not used in this implementation
) -> ValidationReport:
    report = ValidationReport(
        status=ValidationStatus.FAIL,
        layer_results={},
    )
    
    # Layer 1: Input
    if not check_input(question):
        report.layer_results["input"] = "BLOCKED"
        return report
    report.layer_results["input"] = "PASS"
    
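    # Generate + validate with retries
    feedback = ""  # Issues found on the previous attempt, fed into the retry prompt
    for attempt in range(max_retries + 1):
        # Layer 2: Generate. On retry, tell the model what to fix; resending the
        # same prompt at temperature 0 would tend to reproduce the same answer.
        prompt = question
        if feedback:
            prompt += f"\n\nA previous answer was rejected for: {feedback}. Fix these issues."
        gen_resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0 if attempt > 0 else 0.7,  # Lower variance on retry
        )
        answer = gen_resp.choices[0].message.content or ""

        # Layer 3: Rule checks
        rule_errors = rule_check(answer)
        report.layer_results[f"rules_attempt_{attempt}"] = rule_errors or "PASS"
        if rule_errors:
            feedback = "; ".join(rule_errors)
            continue  # Retry

        # Layer 4: LLM judge
        judgement = llm_judge(question, answer)
        report.layer_results[f"llm_judge_attempt_{attempt}"] = judgement
        score = judgement.get("score", 0)  # Treat a missing score as a failure

        if score >= 4:
            report.status = ValidationStatus.PASS
            report.final_answer = answer
            return report
        elif score == 3 and attempt == max_retries:
            report.status = ValidationStatus.ESCALATE
            report.final_answer = answer
            report.escalation_reason = f"Score 3/5 after {max_retries+1} attempts"
            return report
        feedback = "; ".join(map(str, judgement.get("issues", []))) or f"judge score {score}/5"

    # Every attempt failed a rule check, or the last attempt scored below 3
    return report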

# Run it
result = validate_pipeline("What is the speed of light?")
print(f"Status: {result.status.value}")
if result.status == ValidationStatus.PASS:
    print(f"Answer: {result.final_answer}")
elif result.status == ValidationStatus.ESCALATE:
    print(f"Escalation reason: {result.escalation_reason}")
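
Layer 5 appears in the stack diagram but not in the implementation above, and `enable_claim_check` is accepted without being used. One possible shape for an on-demand claim check is sketched below: split the answer into atomic claims, then verify each one. This is a sketch under assumptions, not the course's reference implementation. The names `split_into_claims`, `claim_check`, `verify_claim`, and `KNOWN_FACTS` are hypothetical, sentence splitting is a crude stand-in for LLM-based decomposition, and the verifier is a toy lookup where a real system would call a retrieval or fact-checking step.

```python
import re
from typing import Callable

def split_into_claims(answer: str) -> list[str]:
    """Naive decomposition: treat each sentence as one atomic claim."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if s]

def claim_check(answer: str, verify_claim: Callable[[str], bool]) -> dict:
    """Run the verifier on every claim and collect the ones it rejects."""
    claims = split_into_claims(answer)
    unsupported = [c for c in claims if not verify_claim(c)]
    return {
        "claims": len(claims),
        "unsupported": unsupported,
        "passed": not unsupported,
    }

# Toy verifier (assumption): a claim counts as supported only if it is in this set.
KNOWN_FACTS = {"Light travels at 299,792,458 m/s in a vacuum."}

result = claim_check(
    "Light travels at 299,792,458 m/s in a vacuum. It slows down on weekends.",
    verify_claim=lambda claim: claim in KNOWN_FACTS,
)
print(result)
```

Passing the verifier in as a callable keeps the expensive part swappable, so the same gate can run against a cheap lookup in tests and a slower checker in production.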

Common Mistakes

Validating everything at Layer 4: not every answer needs an LLM judge. Use rule checks as the primary gate and reserve the more expensive judge call for answers the rules cannot settle.
Infinite retry loop: always cap retries at 2 or 3. Further retries rarely improve quality, and cost grows linearly with each one.
No escalation path: some queries will always be uncertain. Build a human escalation queue rather than returning potentially incorrect answers.
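
The escalation path from the last point can start out very small. The sketch below is one possible in-memory version, assuming a first-in, first-out review order; `EscalationTicket` and `EscalationQueue` are hypothetical names, and a real deployment would likely back this with a database or ticketing system rather than a `deque`.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class EscalationTicket:
    question: str
    draft_answer: str
    reason: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

class EscalationQueue:
    """In-memory FIFO queue of cases waiting for a human reviewer."""

    def __init__(self) -> None:
        self._tickets: deque = deque()

    def submit(self, ticket: EscalationTicket) -> None:
        self._tickets.append(ticket)

    def next_for_review(self) -> Optional[EscalationTicket]:
        # Oldest ticket first; None when nothing is waiting.
        return self._tickets.popleft() if self._tickets else None

    def __len__(self) -> int:
        return len(self._tickets)

queue = EscalationQueue()
queue.submit(EscalationTicket("Q1", "draft 1", "Score 3/5 after 3 attempts"))
queue.submit(EscalationTicket("Q2", "draft 2", "Score 3/5 after 3 attempts"))
first = queue.next_for_review()
print(first.question, len(queue))
```

A pipeline could submit a ticket whenever it returns an ESCALATE status, passing the draft answer and the escalation reason along for the reviewer.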

Quick Quiz

Test Your Understanding

Q1. Among the output validation layers, which is the most cost-effective to run first?
A1. Rule-based checks. They are free, near-instant, and catch structural and formatting failures before any money is spent on LLM judge calls.

Q2. What determines when to escalate vs fail vs pass?
A2. Confidence thresholds: high confidence in quality = PASS, high confidence of failure = FAIL, uncertain middle ground = ESCALATE to human review. In the implementation above, a judge score of 4 or 5 passes, a score of 3 on the final attempt escalates, and anything else ends as FAIL.

Q3. How does the pipeline handle a consistently low-quality answer after max retries?
A3. It does not pass the answer through. If the final attempt scores 3, the status is ESCALATE and the case goes to a human reviewer queue. If it scores below 3 or fails the rule checks, the status stays FAIL and final_answer is left empty.
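
The threshold logic behind Q2 and Q3 can be isolated into a single function, which makes it easy to test without calling any model. The sketch below mirrors the score cut-offs used in the implementation (4 to pass, 3 to escalate on the last attempt); the `Decision` enum, the `decide` function, and the explicit RETRY state are hypothetical additions for illustration.

```python
from enum import Enum

class Decision(Enum):
    PASS = "pass"
    RETRY = "retry"
    ESCALATE = "escalate"
    FAIL = "fail"

def decide(
    score: int,
    attempt: int,
    max_retries: int,
    pass_at: int = 4,
    escalate_at: int = 3,
) -> Decision:
    """Map a judge score and attempt index to the next pipeline action."""
    if score >= pass_at:
        return Decision.PASS
    if attempt < max_retries:
        return Decision.RETRY      # Attempts remain, so try again
    if score >= escalate_at:
        return Decision.ESCALATE   # Borderline on the last attempt
    return Decision.FAIL           # Clearly bad on the last attempt

for score, attempt in [(5, 0), (3, 0), (3, 2), (2, 2)]:
    print(score, attempt, decide(score, attempt, max_retries=2).value)
```

Keeping the thresholds as parameters lets a stricter domain raise `pass_at` without touching the rest of the pipeline.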

Student Exercise

Exercise 12.6 — Build the full pipeline
Implement the complete 4-layer validation pipeline. Test it on: (1) a question the model answers well, (2) a question about very recent events (likely uncertain), (3) a question containing PII. Observe which layer catches each issue.
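
For test case (3), note that Layer 1 in the implementation only calls the moderation endpoint, so a question carrying PII may not be blocked at the input stage. A possible starting point for an input-side PII check is sketched below. The `PII_PATTERNS` table and `find_pii` function are hypothetical, the email pattern is the one used in `rule_check`, and the phone and SSN patterns are simplified US-style assumptions that will miss many real formats.

```python
import re

# Simplified patterns (assumptions): real PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}

def find_pii(text: str) -> list[str]:
    """Return the names of all PII patterns that match the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

questions = [
    "What is the speed of light?",
    "My email is jane@example.com, can you look up my plan?",
    "Call me back at 555-123-4567 about ticket 42.",
]
for q in questions:
    print(find_pii(q) or "clean", "|", q)
```

A non-empty result could be recorded in `layer_results["input"]` and treated the same way as a moderation flag.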
