6.3 LLM-as-Judge
Key Concepts: Self-evaluation · Cross-model verification · Scoring rubrics
Key Paper: Self-RAG (Asai et al., 2023)
LLM-as-Judge for Validation vs Evaluation
In Chapter 10, LLM-as-judge was used for evaluation (benchmarking model quality). Here it is used for runtime validation — checking each individual output before returning it to the user.
| Use case | When | Purpose |
|---|---|---|
| Evaluation | Batch testing | Measure overall quality |
| Validation | Per-request | Gate individual responses |
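The two rows of the table can be contrasted in a short sketch. The `Judge` type, `toy_judge`, the 0.8 threshold, and the fallback message are all illustrative assumptions; in a real system the judge would wrap an LLM call such as the ones in the patterns that follow.

```python
from typing import Callable

# Hypothetical judge signature: (question, answer) -> score in [0, 1].
Judge = Callable[[str, str], float]

def validate_response(question: str, answer: str, judge: Judge,
                      threshold: float = 0.8) -> str:
    """Runtime validation: gate ONE response before it reaches the user."""
    if judge(question, answer) >= threshold:
        return answer
    return "Sorry, I could not produce a reliable answer."

def evaluate_batch(pairs: list[tuple[str, str]], judge: Judge) -> float:
    """Offline evaluation: average judge score over a whole test set."""
    if not pairs:
        return 0.0
    return sum(judge(q, a) for q, a in pairs) / len(pairs)

def toy_judge(question: str, answer: str) -> float:
    # Stand-in for an LLM call: non-empty answers score 0.9, empty ones 0.0.
    return 0.9 if answer.strip() else 0.0
```

The same judge serves both roles; what differs is whether its score blocks a single response or is aggregated into a quality metric.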
Pattern 1 — Self-Evaluation
Ask the same model to critique its own output:
from openai import OpenAI
import json

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_and_validate(question: str) -> dict:
    # Step 1: Generate answer
    gen_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
    )
    answer = gen_resp.choices[0].message.content

    # Step 2: Self-evaluate (temperature=0 for a more stable verdict)
    eval_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Question: {question}
Answer: {answer}
Evaluate this answer:
1. Is it factually accurate? (true/false)
2. Does it fully address the question? (true/false)
3. Confidence the answer is correct (0.0-1.0)
Return JSON: {{"accurate": bool, "complete": bool, "confidence": float, "issues": [list of issues]}}"""
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    evaluation = json.loads(eval_resp.choices[0].message.content)

    # Use .get() so a missing field fails the check instead of raising KeyError
    return {
        "answer": answer,
        "evaluation": evaluation,
        "passed": bool(evaluation.get("accurate")) and evaluation.get("confidence", 0.0) >= 0.8,
    }

result = generate_and_validate("What year did World War 2 end?")
print(f"Answer: {result['answer']}")
print(f"Passed: {result['passed']} | Confidence: {result['evaluation'].get('confidence')}")
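Judge output is itself model-generated, so it may be malformed or missing fields. The following is a minimal sketch of a defensive parser that fails closed. `parse_evaluation` is a hypothetical helper, not part of any library, and requiring `complete` in addition to `accurate` is a stricter choice than the pass rule used above.

```python
import json

def parse_evaluation(raw, threshold: float = 0.8) -> dict:
    """Parse judge output; treat anything malformed as a failed check."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"passed": False, "reason": "judge output was not valid JSON"}
    if not isinstance(data, dict):
        return {"passed": False, "reason": "judge output was not a JSON object"}

    # Only a literal JSON true counts; strings like "true" do not.
    accurate = data.get("accurate") is True
    complete = data.get("complete") is True

    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]

    passed = accurate and complete and confidence >= threshold
    return {
        "passed": passed,
        "confidence": confidence,
        "issues": data.get("issues", []),
    }
```

Passing `eval_resp.choices[0].message.content` through a parser like this keeps a bad judge response from crashing the request or being read as a pass.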
Pattern 2 — Cross-Model Verification
Use a different (stronger) model to verify the output:
def cross_model_verify(question: str, answer: str) -> dict:
    """Use GPT-4o to verify a GPT-4o-mini answer."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # Stronger judge model
        messages=[{
            "role": "user",
            "content": f"""Verify this answer for accuracy and completeness.
Question: {question}
Answer: {answer}
JSON: {{"verdict": "correct|incorrect|partial", "corrections": [list of corrections if any], "confidence": 0.0-1.0}}"""
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

verification = cross_model_verify(
    "What is the boiling point of water at sea level?",
    "Water boils at 100 degrees Celsius at sea level."
)
print(verification)
# Example output (model-dependent):
# {'verdict': 'correct', 'corrections': [], 'confidence': 1.0}
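A verdict only helps if the application acts on it. One possible mapping from the judge's JSON to an action is sketched below. `apply_verdict`, the action names, and the 0.8 threshold are assumptions made for illustration.

```python
def apply_verdict(answer: str, verification: dict,
                  min_confidence: float = 0.8) -> dict:
    """Map a cross-model verdict to an action: return, revise, or reject."""
    verdict = verification.get("verdict")
    confidence = verification.get("confidence")
    if not isinstance(confidence, (int, float)):
        confidence = 0.0  # missing or non-numeric confidence counts as zero
    corrections = verification.get("corrections") or []

    if verdict == "correct" and confidence >= min_confidence:
        return {"action": "return", "answer": answer}
    if verdict == "partial" and corrections:
        # Hand the corrections to a regeneration step (not shown here).
        return {"action": "revise", "answer": answer, "corrections": corrections}
    # Incorrect, low-confidence, or unrecognized verdicts are all rejected.
    return {"action": "reject", "answer": None}
```

Rejecting unrecognized verdicts by default means a judge that drifts from the requested schema cannot accidentally approve an answer.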
Pattern 3 — Faithfulness Check (for RAG)
Verify that a RAG response is grounded in the retrieved context:
def check_faithfulness(question: str, context: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Is the answer fully supported by the context?
Do not use any external knowledge; only check what's in the context.
Context: {context}
Question: {question}
Answer: {answer}
JSON: {{"faithful": bool, "unsupported_claims": [list of claims in the answer NOT in the context]}}"""
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
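To use the faithfulness result as a runtime gate, the application needs a rule for what to return when the check fails. The sketch below shows one such rule. `gate_rag_answer` and the `FALLBACK` message are hypothetical, and treating a non-empty `unsupported_claims` list as a failure even when `faithful` is true is a deliberately conservative assumption.

```python
FALLBACK = "I could not find enough support for an answer in the retrieved documents."

def gate_rag_answer(answer: str, check: dict) -> dict:
    """Decide whether to return a RAG answer given a faithfulness verdict."""
    unsupported = check.get("unsupported_claims") or []
    # Fail closed: require an explicit True AND an empty unsupported list.
    if check.get("faithful") is True and not unsupported:
        return {"answer": answer, "blocked": False, "unsupported_claims": []}
    return {
        "answer": FALLBACK,
        "blocked": True,
        "unsupported_claims": list(unsupported),  # keep for logging
    }
```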
Common Mistakes
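- Relying solely on self-evaluation: models tend to be over-confident in their own answers, so combine self-evaluation with other signals such as cross-model verification or a faithfulness check.
- Not caching validation results: running LLM-as-judge on every request adds at least one extra API call per response, which can roughly double cost when the judge is the same model as the generator and costs more when the judge is a stronger model. Cache validation results for identical question-answer pairs.
- Confusing validation with evaluation: runtime validation gates individual responses; evaluation benchmarks overall system quality.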
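The caching advice above can be sketched with an in-memory dictionary keyed by a hash of the question-answer pair. `ValidationCache` is a hypothetical class, and the judge is passed in as a plain callable so the example runs without an API key. A production system would more likely use a shared store with an expiry policy.

```python
import hashlib
import json
from typing import Callable

class ValidationCache:
    """Memoize judge verdicts for identical (question, answer) pairs."""

    def __init__(self, judge: Callable[[str, str], dict]):
        self._judge = judge
        self._store: dict[str, dict] = {}
        self.judge_calls = 0  # counts real judge invocations

    @staticmethod
    def _key(question: str, answer: str) -> str:
        # JSON-encode the pair so ("ab", "c") and ("a", "bc") hash differently.
        payload = json.dumps([question, answer], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def validate(self, question: str, answer: str) -> dict:
        key = self._key(question, answer)
        if key not in self._store:
            self.judge_calls += 1
            self._store[key] = self._judge(question, answer)
        return self._store[key]

def stub_judge(question: str, answer: str) -> dict:
    # Stand-in for an LLM judge call.
    return {"verdict": "correct", "confidence": 1.0}
```

Because generation at a non-zero temperature rarely repeats an answer verbatim, a cache like this helps most when answers are themselves cached or generated deterministically.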
Quick Quiz
Q1. What is the main limitation of self-evaluation?
A1. Models tend to be over-confident in their own outputs. They may confirm incorrect answers with high confidence.
Q2. Why is cross-model verification more reliable than self-evaluation?
A2. Using a different (typically stronger) model as the judge reduces the self-confidence bias and provides a more independent assessment, although the judge can still make errors of its own.
Q3. In a RAG system, what does faithfulness validation check?
A3. Whether the generated answer is grounded in the retrieved context — i.e., it doesn’t contain claims that aren’t supported by the retrieved documents.
Student Exercise
Exercise 12.3 — Validation pipeline
Build a QA system that: (1) generates an answer with GPT-4o-mini, (2) self-evaluates with confidence score, (3) if confidence < 0.8, escalates to GPT-4o for cross-model verification. Log which path each query took.