6.4 Multi-Pass Validation (Ensemble)
AI-generated content may contain errors. Always verify against official sources.
6.4 Multi-Pass Validation (Ensemble)
Key Concepts: Triple-validation · Consensus voting · Confidence thresholds
Why Multiple Passes?
Single-model validation is unreliable because:
- Models are biased towards their own outputs
- A single call can give an inconsistent judgement
- No way to express uncertainty
Multiple independent validation passes allow consensus voting and confidence scoring.
Pattern 1 — Triple Validation with Consensus
from openai import OpenAI
import json
client = OpenAI()
def single_validation_pass(question: str, answer: str, pass_id: int) -> dict:
"""One validation pass — deliberately uses slight variation in framing."""
framings = [
f"Is this answer accurate and complete? Question: {question}\nAnswer: {answer}",
f"Review this response for factual errors: Q: {question}\nA: {answer}",
f"Would you trust this answer? Q: {question}\nResponse: {answer}",
]
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": framings[pass_id] + "\nJSON: {\"valid\": bool, \"confidence\": 0.0-1.0}"
}],
temperature=0,
response_format={"type": "json_object"},
)
result = json.loads(resp.choices[0].message.content)
result["pass_id"] = pass_id
return result
def triple_validate(question: str, answer: str) -> dict:
"""Run 3 validation passes and take the majority decision."""
results = [single_validation_pass(question, answer, i) for i in range(3)]
votes_valid = sum(1 for r in results if r["valid"])
avg_confidence = sum(r["confidence"] for r in results) / 3
consensus = votes_valid >= 2 # Majority vote
return {
"consensus_valid": consensus,
"votes": f"{votes_valid}/3 valid",
"avg_confidence": round(avg_confidence, 3),
"individual_results": results,
}
result = triple_validate(
"What is Python's GIL?",
"The GIL is Python's Global Interpreter Lock that prevents multiple threads from executing Python bytecode simultaneously."
)
print(f"Valid: {result['consensus_valid']} | Votes: {result['votes']} | Confidence: {result['avg_confidence']}")
Pattern 2 — Temperature Sampling for Uncertainty
Run the same validation prompt at different temperatures to estimate uncertainty:
def uncertainty_validate(question: str, answer: str, n_samples: int = 5) -> dict:
"""Sample multiple judgements to estimate confidence."""
results = []
for _ in range(n_samples):
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"Is this factually correct? Answer yes or no.\nQ: {question}\nA: {answer}"
}],
temperature=0.5, # Non-zero for sampling
max_tokens=3,
)
answer_text = resp.choices[0].message.content.lower().strip()
results.append("yes" in answer_text)
valid_count = sum(results)
confidence = valid_count / n_samples
return {
"valid": confidence >= 0.6, # Pass if 60%+ say yes
"confidence": confidence,
"votes": f"{valid_count}/{n_samples}",
}
When to Escalate
from enum import Enum
class ValidationResult(Enum):
PASS = "pass"
FAIL = "fail"
ESCALATE = "escalate" # Uncertain — needs human review
def validate_with_escalation(question: str, answer: str) -> ValidationResult:
result = triple_validate(question, answer)
if result["consensus_valid"] and result["avg_confidence"] >= 0.8:
return ValidationResult.PASS
elif not result["consensus_valid"] and result["avg_confidence"] <= 0.3:
return ValidationResult.FAIL
else:
return ValidationResult.ESCALATE # Uncertain — send to human reviewer
Common Mistakes
- Running all passes sequentially — run validation passes in parallel using
asyncioto reduce latency. Three sequential calls triple your latency. - Using the same framing for all passes — if all passes use identical prompts, you get correlated errors. Vary the framing slightly.
- No escalation path — don’t treat every uncertain result as FAIL. Build an ESCALATE path for human review.
Quick Quiz
Q1. Why is majority voting (2/3) more reliable than a single validation call?
A1. Independent passes reduce the impact of individual model errors. If one pass gives a wrong judgement, the other two can overrule it.
Q2. What does the ESCALATE state represent?
A2. An uncertain case where the model is not confident enough to auto-pass or auto-fail — requiring human review before returning to the user.
Q3. Why should you vary the framing between validation passes?
A3. Identical prompts produce correlated errors. Varied framings provide genuinely independent perspectives, making the consensus more meaningful.
Student Exercise
Exercise 12.4 — Parallel multi-pass
Convert the triple_validate function to use asyncio so all 3 passes run in parallel. Measure the latency difference between sequential and parallel execution on 10 test cases.
Further Reading
Next → 12.5 Claim Decomposition