6.2 Rule-Based Validation
Key Concepts: Regex · Schema checks · Policy engines · Boundary assertions
Official Docs: Pydantic · jsonschema
Why Rule-Based Validation First?
Rule-based checks are:
- Fast — microseconds vs seconds for LLM checks
- Deterministic — same input always same output
- Free — no API costs
- Auditable — easy to explain why something failed
Always run rule-based checks before LLM-based checks.
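As a minimal sketch of that ordering, the snippet below runs cheap rule checks first and only invokes an LLM judge when they pass. The names `rule_checks`, `validate`, and `stub_judge` are hypothetical, and the judge is a stub standing in for a real (paid) API call.

```python
import re
from typing import Callable


def rule_checks(text: str) -> list[str]:
    """Cheap deterministic checks; returns a list of failure messages."""
    errors = []
    if not text.strip():
        errors.append("output is empty")
    if re.search(r"https?://\S+", text):
        errors.append("output contains a URL")
    return errors


def validate(text: str, llm_judge: Callable[[str], bool]) -> dict:
    """Run rule checks first; call the expensive judge only if they pass."""
    errors = rule_checks(text)
    if errors:
        # Short-circuit: no LLM call is made for obviously bad output.
        return {"passed": False, "stage": "rules", "errors": errors}
    if not llm_judge(text):
        return {"passed": False, "stage": "llm_judge", "errors": ["judge rejected output"]}
    return {"passed": True, "stage": "llm_judge", "errors": []}


# Stub judge that records each call (assumption: a real judge would wrap an LLM API call).
judge_calls = []


def stub_judge(text: str) -> bool:
    judge_calls.append(text)
    return True


bad = validate("See https://example.com", stub_judge)
good = validate("The battery lasts all day.", stub_judge)
print(bad["stage"], good["stage"], len(judge_calls))
```

Only the second input reaches the judge, so the failing output costs nothing beyond a regex search.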
Pattern 1 — Pydantic Schema Validation
Validate LLM outputs against a Pydantic model so that malformed responses fail immediately:
from pydantic import BaseModel, Field, field_validator
from openai import OpenAI
import json

client = OpenAI()


class ProductReview(BaseModel):
    rating: int = Field(ge=1, le=5, description="Rating from 1 to 5")
    sentiment: str = Field(description="positive, neutral, or negative")
    summary: str = Field(min_length=10, max_length=200)
    topics: list[str] = Field(min_length=1, max_length=5)

    @field_validator('sentiment')
    @classmethod
    def validate_sentiment(cls, v):
        allowed = {'positive', 'neutral', 'negative'}
        if v.lower() not in allowed:
            raise ValueError(f'sentiment must be one of {allowed}')
        return v.lower()


def extract_review_data(review_text: str) -> ProductReview:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Analyse this review and return JSON with:
- rating (1-5 integer)
- sentiment (positive/neutral/negative)
- summary (10-200 chars)
- topics (1-5 strings)
Review: {review_text}"""
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    raw = json.loads(resp.choices[0].message.content)
    return ProductReview(**raw)  # Validates or raises ValidationError


try:
    result = extract_review_data("This laptop is fantastic! Fast, well-built, great display.")
    print(result.model_dump())
except Exception as e:
    print(f"Validation failed: {e}")
Pattern 2 — Regex Assertions
import re

# Define output rules as regex patterns
VALIDATION_RULES = {
    "no_urls": {
        "pattern": r'https?://\S+',
        "should_match": False,
        "message": "Output must not contain URLs",
    },
    "no_pii_email": {
        "pattern": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        "should_match": False,
        "message": "Output must not contain email addresses",
    },
    "word_count": {
        "pattern": r'\b\w+\b',
        "should_match": True,
        "min_count": 10,
        "max_count": 500,
        "message": "Output word count must be between 10 and 500",
    },
}


def validate_output(text: str) -> list[str]:
    """Returns a list of validation errors. Empty = passed."""
    errors = []
    for rule_name, rule in VALIDATION_RULES.items():
        matches = re.findall(rule["pattern"], text)
        if "min_count" in rule:
            # Count-based rule: the number of matches must fall within the range
            count = len(matches)
            if not (rule["min_count"] <= count <= rule["max_count"]):
                errors.append(f"{rule_name}: {rule['message']} (got {count})")
        elif rule["should_match"] and not matches:
            errors.append(f"{rule_name}: {rule['message']}")
        elif not rule["should_match"] and matches:
            errors.append(f"{rule_name}: {rule['message']}")
    return errors


# Test: this input contains a URL and has only 7 words, so two rules fail
errors = validate_output("Visit https://example.com for more info.")
print(errors)
# ['no_urls: Output must not contain URLs',
#  'word_count: Output word count must be between 10 and 500 (got 7)']
Pattern 3 — Content Policy Engine
import re
from typing import Optional

FORBIDDEN_PATTERNS = [
    r'\b(password|secret|api.?key)\b',
    r'\b\d{16}\b',             # Credit card numbers (16 consecutive digits)
    r'\b\d{3}-\d{2}-\d{4}\b',  # SSN format
]

REQUIRED_DISCLAIMERS = {
    "medical": "consult a doctor",
    "legal": "consult a lawyer",
    "financial": "consult a financial advisor",
}


def policy_check(text: str, domain: Optional[str] = None) -> dict:
    violations = []
    # Any forbidden pattern anywhere in the text is a violation
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, text, re.IGNORECASE):
            violations.append(f"Forbidden content: {pattern}")
    # Domain-specific outputs must include the matching disclaimer
    if domain and domain in REQUIRED_DISCLAIMERS:
        required = REQUIRED_DISCLAIMERS[domain]
        if required.lower() not in text.lower():
            violations.append(f"Missing required disclaimer: '{required}'")
    return {"passed": len(violations) == 0, "violations": violations}
Common Mistakes
- Skipping schema validation: if you ask for JSON but don’t validate it against a schema, the model may return slightly wrong field names or types that break downstream code.
- Over-specific regex: overly strict patterns reject valid outputs (false failures). Test your rules on 100+ real outputs before deploying.
- Not retrying on validation failure: if schema validation fails, retry the LLM call 1–2 times before giving up. Models often self-correct on retry.
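One way to test rules before deploying is to run them over a labelled sample of real outputs and count how often known-good outputs are rejected. The sketch below is illustrative only: the `rejection_report` helper, the two patterns, and the four sample texts are hypothetical, and a real run would use 100+ labelled outputs.

```python
import re


def rejection_report(rule: re.Pattern, samples: list[tuple[str, bool]]) -> dict:
    """samples holds (text, is_acceptable) pairs labelled by a human reviewer.
    A text passes the rule when the pattern matches the whole text."""
    false_rejects = [t for t, ok in samples if ok and not rule.fullmatch(t)]
    false_accepts = [t for t, ok in samples if not ok and rule.fullmatch(t)]
    good_total = sum(1 for _, ok in samples if ok)
    return {
        "false_rejects": false_rejects,
        "false_accepts": false_accepts,
        "false_reject_rate": len(false_rejects) / good_total if good_total else 0.0,
    }


# Over-strict rule: only letters, spaces, and commas, ending in a period.
strict = re.compile(r"[A-Z][A-Za-z ,]+\.")
# Looser rule: any non-empty text that ends in sentence punctuation.
loose = re.compile(r"\S.*[.!?]", re.DOTALL)

samples = [
    ("Great laptop, fast and quiet.", True),
    ("Battery lasts 10 hours!", True),   # digits and "!" trip the strict rule
    ("It's well-built.", True),          # apostrophe and hyphen trip it too
    ("no punctuation at all", False),
]

strict_report = rejection_report(strict, samples)
loose_report = rejection_report(loose, samples)
print(strict_report["false_rejects"])
print(loose_report["false_reject_rate"])
```

Here the strict pattern rejects two of the three acceptable texts, while the looser one rejects none and still catches the bad sample.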
Quick Quiz
Q1. Why should rule-based validation run before LLM-based validation?
A1. Rule-based checks run in microseconds and cost nothing. Running them first catches obvious failures without spending money on LLM judge calls.
Q2. What does Pydantic’s field_validator allow you to do beyond basic type checking?
A2. Write custom validation logic — e.g., checking that a string value is from an allowed set, or that a date is in the future.
Q3. What should you do when a Pydantic validation fails on an LLM output?
A3. Retry the LLM call 1–2 times (optionally including the error message in the retry prompt). Models often produce valid output on retry.
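The retry-with-feedback idea in A3 can be sketched without any particular library. In the snippet below, `generate_with_retry`, `stub_model`, and `check_rating` are hypothetical names; the stub stands in for a real LLM call, and the validator returns a list of error strings (with Pydantic you would presumably catch `ValidationError` and pass `str(e)` instead).

```python
from typing import Callable


def generate_with_retry(
    generate: Callable[[str], str],
    validate: Callable[[str], list[str]],
    prompt: str,
    max_retries: int = 2,
) -> str:
    """Call generate, validate the output, and retry with the errors appended."""
    current_prompt = prompt
    errors: list[str] = []
    for _ in range(max_retries + 1):
        output = generate(current_prompt)
        errors = validate(output)
        if not errors:
            return output
        # Feed the validation errors back so the model can correct itself.
        current_prompt = (
            f"{prompt}\n\nYour previous answer was invalid:\n"
            + "\n".join(f"- {e}" for e in errors)
            + "\nReturn a corrected answer."
        )
    raise ValueError(f"Output still invalid after {max_retries} retries: {errors}")


def stub_model(prompt: str) -> str:
    # Stand-in for an LLM: answers badly first, then correctly once it sees the feedback.
    return "rating: 4" if "previous answer was invalid" in prompt else "rating: 9"


def check_rating(output: str) -> list[str]:
    value = int(output.split(":")[1])
    return [] if 1 <= value <= 5 else [f"rating must be between 1 and 5 (got {value})"]


result = generate_with_retry(stub_model, check_rating, "Rate this review from 1 to 5.")
print(result)
```

Capping the retries keeps cost bounded, and raising after the last attempt lets the caller decide how to handle output that never validates.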
Student Exercise
Exercise 12.2 — Validation harness
Build a product review extractor using Pydantic validation. Deliberately break 3 rules (wrong sentiment value, summary too short, rating out of range). Implement a retry mechanism that shows the Pydantic error to the model in the retry prompt.