Skip to main content

6.4 Multi-Pass Validation (Ensemble)

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

6.4 Multi-Pass Validation (Ensemble)

Key Concepts: Triple-validation · Consensus voting · Confidence thresholds


Why Multiple Passes?

Single-model validation is unreliable because:

  • Models are biased towards their own outputs
  • A single call can give an inconsistent judgement
  • No way to express uncertainty

Multiple independent validation passes allow consensus voting and confidence scoring.


Pattern 1 — Triple Validation with Consensus

from openai import OpenAI
import json

client = OpenAI()

def single_validation_pass(question: str, answer: str, pass_id: int) -> dict:
"""One validation pass — deliberately uses slight variation in framing."""
framings = [
f"Is this answer accurate and complete? Question: {question}\nAnswer: {answer}",
f"Review this response for factual errors: Q: {question}\nA: {answer}",
f"Would you trust this answer? Q: {question}\nResponse: {answer}",
]

resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": framings[pass_id] + "\nJSON: {\"valid\": bool, \"confidence\": 0.0-1.0}"
}],
temperature=0,
response_format={"type": "json_object"},
)
result = json.loads(resp.choices[0].message.content)
result["pass_id"] = pass_id
return result

def triple_validate(question: str, answer: str) -> dict:
"""Run 3 validation passes and take the majority decision."""
results = [single_validation_pass(question, answer, i) for i in range(3)]

votes_valid = sum(1 for r in results if r["valid"])
avg_confidence = sum(r["confidence"] for r in results) / 3
consensus = votes_valid >= 2 # Majority vote

return {
"consensus_valid": consensus,
"votes": f"{votes_valid}/3 valid",
"avg_confidence": round(avg_confidence, 3),
"individual_results": results,
}

result = triple_validate(
"What is Python's GIL?",
"The GIL is Python's Global Interpreter Lock that prevents multiple threads from executing Python bytecode simultaneously."
)
print(f"Valid: {result['consensus_valid']} | Votes: {result['votes']} | Confidence: {result['avg_confidence']}")

Pattern 2 — Temperature Sampling for Uncertainty

Run the same validation prompt at different temperatures to estimate uncertainty:

def uncertainty_validate(question: str, answer: str, n_samples: int = 5) -> dict:
"""Sample multiple judgements to estimate confidence."""
results = []
for _ in range(n_samples):
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"Is this factually correct? Answer yes or no.\nQ: {question}\nA: {answer}"
}],
temperature=0.5, # Non-zero for sampling
max_tokens=3,
)
answer_text = resp.choices[0].message.content.lower().strip()
results.append("yes" in answer_text)

valid_count = sum(results)
confidence = valid_count / n_samples

return {
"valid": confidence >= 0.6, # Pass if 60%+ say yes
"confidence": confidence,
"votes": f"{valid_count}/{n_samples}",
}

When to Escalate

from enum import Enum

class ValidationResult(Enum):
PASS = "pass"
FAIL = "fail"
ESCALATE = "escalate" # Uncertain — needs human review

def validate_with_escalation(question: str, answer: str) -> ValidationResult:
result = triple_validate(question, answer)

if result["consensus_valid"] and result["avg_confidence"] >= 0.8:
return ValidationResult.PASS
elif not result["consensus_valid"] and result["avg_confidence"] <= 0.3:
return ValidationResult.FAIL
else:
return ValidationResult.ESCALATE # Uncertain — send to human reviewer

Common Mistakes

Common Mistakes
  1. Running all passes sequentially — run validation passes in parallel using asyncio to reduce latency. Three sequential calls triple your latency.
  2. Using the same framing for all passes — if all passes use identical prompts, you get correlated errors. Vary the framing slightly.
  3. No escalation path — don’t treat every uncertain result as FAIL. Build an ESCALATE path for human review.

Quick Quiz

Test Your Understanding

Q1. Why is majority voting (2/3) more reliable than a single validation call?
A1. Independent passes reduce the impact of individual model errors. If one pass gives a wrong judgement, the other two can overrule it.

Q2. What does the ESCALATE state represent?
A2. An uncertain case where the model is not confident enough to auto-pass or auto-fail — requiring human review before returning to the user.

Q3. Why should you vary the framing between validation passes?
A3. Identical prompts produce correlated errors. Varied framings provide genuinely independent perspectives, making the consensus more meaningful.


Student Exercise

Exercise 12.4 — Parallel multi-pass
Convert the triple_validate function to use asyncio so all 3 passes run in parallel. Measure the latency difference between sequential and parallel execution on 10 test cases.


Further Reading

Next → 12.5 Claim Decomposition