6.2 Rule-Based Validation

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Regex · Schema checks · Policy engines · Boundary assertions

Official Docs: Pydantic · jsonschema

Why Rule-Based Validation First?

Rule-based checks are:

Fast: microseconds, versus seconds for LLM checks
Deterministic: the same input always produces the same output
Free: no API costs
Auditable: easy to explain why something failed

Always run rule-based checks before LLM-based checks.
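The ordering can be sketched as a small two-stage pipeline. This is a minimal illustration, not a prescribed design: the check functions, the validate helper, and fake_judge (an offline stand-in for a real LLM judge call) are all hypothetical names introduced here.

```python
import re
from typing import Callable, Optional

# Hypothetical cheap checks; each returns an error string, or None if it passes.
def check_not_empty(text: str) -> Optional[str]:
    return None if text.strip() else "output is empty"

def check_no_urls(text: str) -> Optional[str]:
    return "output contains a URL" if re.search(r"https?://\S+", text) else None

RULE_CHECKS = [check_not_empty, check_no_urls]

def validate(text: str, llm_judge: Callable[[str], bool]) -> dict:
    """Run rule-based checks first; call the expensive judge only if they pass."""
    errors = [err for check in RULE_CHECKS if (err := check(text)) is not None]
    if errors:
        # Fail fast: the LLM judge is never called for this output.
        return {"passed": False, "stage": "rules", "errors": errors}
    if not llm_judge(text):
        return {"passed": False, "stage": "llm", "errors": ["LLM judge rejected output"]}
    return {"passed": True, "stage": "llm", "errors": []}

# Offline stand-in for a real LLM judge; it records every call it receives.
judge_calls = []

def fake_judge(text: str) -> bool:
    judge_calls.append(text)
    return True

bad = validate("See https://example.com", fake_judge)
good = validate("The laptop is fast and well built.", fake_judge)
print(bad["stage"], good["stage"], len(judge_calls))
```

Only the second output reaches the judge, so the first failure costs nothing beyond a regex search.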

Pattern 1 — Pydantic Schema Validation

Force LLM outputs to conform to a Pydantic model:

from pydantic import BaseModel, Field, field_validator
from openai import OpenAI
import json

client = OpenAI()

class ProductReview(BaseModel):
    rating: int = Field(ge=1, le=5, description="Rating from 1 to 5")
    sentiment: str = Field(description="positive, neutral, or negative")
    summary: str = Field(min_length=10, max_length=200)
    topics: list[str] = Field(min_length=1, max_length=5)
    
    @field_validator('sentiment')
    @classmethod
    def validate_sentiment(cls, v):
        allowed = {'positive', 'neutral', 'negative'}
        if v.lower() not in allowed:
            raise ValueError(f'sentiment must be one of {allowed}')
        return v.lower()

def extract_review_data(review_text: str) -> ProductReview:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Analyse this review and return JSON with:
- rating (1-5 integer)
- sentiment (positive/neutral/negative)
- summary (10-200 chars)
- topics (1-5 strings)

Review: {review_text}"""
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    raw = json.loads(resp.choices[0].message.content)
    return ProductReview(**raw)  # Validates or raises ValidationError

try:
    result = extract_review_data("This laptop is fantastic! Fast, well-built, great display.")
    print(result.model_dump())
except Exception as e:
    print(f"Validation failed: {e}")
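To see what a validation failure looks like without calling the API, a hand-written payload can be validated directly. The sketch below assumes Pydantic v2 (the version that provides field_validator) and uses a trimmed-down, hypothetical Review model rather than the full ProductReview above.

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical, trimmed-down model used only for this offline demonstration.
class Review(BaseModel):
    rating: int = Field(ge=1, le=5)
    summary: str = Field(min_length=10, max_length=200)

# Hand-written payload standing in for a malformed model response:
# the rating is out of range and the summary is only 9 characters long.
bad_payload = {"rating": 7, "summary": "Too short"}

failed_fields = []
try:
    Review.model_validate(bad_payload)
except ValidationError as exc:
    # exc.errors() returns one dict per failed constraint.
    for err in exc.errors():
        print(err["loc"], err["msg"])
    failed_fields = sorted(err["loc"][0] for err in exc.errors())

print(failed_fields)
```

Pydantic reports every failed field in a single ValidationError, which is convenient when the errors are fed back into a retry prompt.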

Pattern 2 — Regex Assertions

import re

# Define output rules as regex patterns
VALIDATION_RULES = {
    "no_urls": {
        "pattern": r'https?://[\S]+',
        "should_match": False,
        "message": "Output must not contain URLs",
    },
    "no_pii_email": {
        "pattern": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        "should_match": False,
        "message": "Output must not contain email addresses",
    },
    "word_count": {
        "pattern": r'\b\w+\b',
        "should_match": True,
        "min_count": 10,
        "max_count": 500,
        "message": "Output word count must be between 10 and 500",
    },
}

def validate_output(text: str) -> list[str]:
    """Returns a list of validation errors. Empty = passed."""
    errors = []
    for rule_name, rule in VALIDATION_RULES.items():
        matches = re.findall(rule["pattern"], text)
        if "min_count" in rule:
            count = len(matches)
            if not (rule["min_count"] <= count <= rule["max_count"]):
                errors.append(f"{rule_name}: {rule['message']} (got {count})")
        elif rule["should_match"] and not matches:
            errors.append(f"{rule_name}: {rule['message']}")
        elif not rule["should_match"] and matches:
            errors.append(f"{rule_name}: {rule['message']}")
    return errors

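# Test: this input trips two rules, because it contains a URL and has only 7 words
errors = validate_output("Visit https://example.com for more info.")
print(errors)
# ['no_urls: Output must not contain URLs',
#  'word_count: Output word count must be between 10 and 500 (got 7)']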

Pattern 3 — Content Policy Engine

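import re
from typing import Optional

FORBIDDEN_PATTERNS = [
    r'\b(password|secret|api.?key)\b',  # Credential-related keywords
    r'\b\d{16}\b',  # 16 consecutive digits (unformatted card numbers only)
    r'\b\d{3}-\d{2}-\d{4}\b',  # SSN format
]

# Phrase that must appear in the output for each regulated domain
REQUIRED_DISCLAIMERS = {
    "medical": "consult a doctor",
    "legal": "consult a lawyer",
    "financial": "consult a financial advisor",
}

def policy_check(text: str, domain: Optional[str] = None) -> dict: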
    violations = []
    
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            violations.append(f"Forbidden content: {pattern}")
    
    if domain and domain in REQUIRED_DISCLAIMERS:
        required = REQUIRED_DISCLAIMERS[domain]
        if required.lower() not in text.lower():
            violations.append(f"Missing required disclaimer: '{required}'")
    
    return {"passed": len(violations) == 0, "violations": violations}
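One possible variation, assuming you want violation messages that name the rule instead of echoing the raw regex: key the patterns by a readable name. The rule names and the forbidden_rule_hits helper below are hypothetical; the patterns mirror FORBIDDEN_PATTERNS.

```python
import re

# Hypothetical rule names; the patterns mirror FORBIDDEN_PATTERNS.
NAMED_FORBIDDEN = {
    "credential_keyword": re.compile(r"\b(password|secret|api.?key)\b", re.IGNORECASE),
    "sixteen_digit_number": re.compile(r"\b\d{16}\b"),
    "ssn_format": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def forbidden_rule_hits(text: str) -> list[str]:
    """Return the name of every forbidden rule that matches the text."""
    return [name for name, pattern in NAMED_FORBIDDEN.items() if pattern.search(text)]

hits = forbidden_rule_hits("My SSN is 123-45-6789 and my API key is abc.")
clean = forbidden_rule_hits("All clear here.")
print(hits)   # ['credential_keyword', 'ssn_format']
print(clean)  # []
```

Named rules make audit logs easier to read and let you change a pattern without changing the message that downstream code sees.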

Common Mistakes

Skipping schema validation: if you ask for JSON and don’t validate it against a schema, the model may return slightly wrong field names or types that break downstream code.
Over-specific regex: overly strict patterns reject valid outputs (false positives). Test your rules on 100+ real outputs before deploying.
Not retrying on validation failure: if schema validation fails, retry the LLM call 1–2 times before giving up. Models often self-correct on retry.
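The retry advice can be sketched as a loop that appends the validation errors to the next prompt. This is a generic illustration under stated assumptions: validate_review is a minimal stand-in validator (not Pydantic), and call_with_retry and fake_model are hypothetical names, with fake_model replaying canned replies so the sketch runs offline.

```python
import json
from typing import Callable

def validate_review(data: dict) -> list[str]:
    """Minimal stand-in validator; returns a list of error messages."""
    errors = []
    if not isinstance(data.get("rating"), int) or not 1 <= data["rating"] <= 5:
        errors.append("rating must be an integer from 1 to 5")
    if data.get("sentiment") not in {"positive", "neutral", "negative"}:
        errors.append("sentiment must be positive, neutral, or negative")
    return errors

def call_with_retry(call_model: Callable[[str], str], prompt: str, max_retries: int = 2) -> dict:
    """Call the model, validate the reply, and retry with the errors in the prompt."""
    current_prompt = prompt
    errors: list[str] = []
    for _ in range(max_retries + 1):
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict):
                errors = validate_review(data)
            else:
                errors = ["top-level JSON must be an object"]
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc}"]
        if not errors:
            return data
        # Show the model exactly what was wrong with its previous answer.
        feedback = "\n- ".join(errors)
        current_prompt = (
            f"{prompt}\n\nYour previous answer was rejected:\n- {feedback}\n"
            "Return corrected JSON only."
        )
    raise ValueError(f"validation failed after {max_retries + 1} attempts: {errors}")

# Offline stand-in for an LLM: the first reply is invalid, the second is valid.
replies = iter([
    '{"rating": 9, "sentiment": "great"}',
    '{"rating": 5, "sentiment": "positive"}',
])
prompts_seen = []

def fake_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)

result = call_with_retry(fake_model, "Analyse this review and return JSON.")
print(result)
print(len(prompts_seen))
```

Capping the retries keeps cost bounded, and raising after the last attempt makes persistent failures visible instead of silently passing bad data downstream.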

Quick Quiz

Test Your Understanding

Q1. Why should rule-based validation run before LLM-based validation?
A1. Rule-based checks run in microseconds and cost nothing. Running them first catches obvious failures without spending money on LLM judge calls.

Q2. What does Pydantic’s field_validator allow you to do beyond basic type checking?
A2. Write custom validation logic — e.g., checking that a string value is from an allowed set, or that a date is in the future.

Q3. What should you do when a Pydantic validation fails on an LLM output?
A3. Retry the LLM call 1–2 times (optionally including the error message in the retry prompt). Models often produce valid output on retry.

Student Exercise

Exercise 12.2 — Validation harness
Build a product review extractor using Pydantic validation. Deliberately break 3 rules (wrong sentiment value, summary too short, rating out of range). Implement a retry mechanism that shows the Pydantic error to the model in the retry prompt.
