
6.3 LLM-as-Judge

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.


Key Concepts: Self-evaluation · Cross-model verification · Scoring rubrics

Key Paper: Self-RAG (Asai et al., 2023)


LLM-as-Judge for Validation vs Evaluation

In Chapter 10, LLM-as-judge was used for evaluation (benchmarking model quality). Here it is used for runtime validation — checking each individual output before returning it to the user.

Use case    | When          | Purpose
Evaluation  | Batch testing | Measure overall quality
Validation  | Per-request   | Gate individual responses
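
The distinction in the table can be sketched in code. This is a minimal illustration, not a prescribed design: the `Judge` signature, the function names, and `toy_judge` are all assumptions made for the example, and a real judge would call an LLM rather than inspect the string.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (question, answer) -> score in [0, 1].
Judge = Callable[[str, str], float]

def evaluate_batch(judge: Judge, samples: List[Tuple[str, str]]) -> float:
    """Evaluation: aggregate judge scores over a test set into one metric."""
    scores = [judge(q, a) for q, a in samples]
    return sum(scores) / len(scores) if scores else 0.0

def validate_response(judge: Judge, question: str, answer: str,
                      threshold: float = 0.8) -> bool:
    """Validation: decide whether one specific response may be returned."""
    return judge(question, answer) >= threshold

def toy_judge(question: str, answer: str) -> float:
    # Stand-in for an LLM call: scores any non-empty answer as 1.0.
    return 1.0 if answer.strip() else 0.0
```

The same judge can serve both roles; what changes is whether its scores are averaged offline or compared against a threshold on each request.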

Pattern 1 — Self-Evaluation

Ask the same model to critique its own output:

from openai import OpenAI
import json

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def generate_and_validate(question: str) -> dict:
    # Step 1: Generate the answer
    gen_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
    )
    answer = gen_resp.choices[0].message.content

    # Step 2: Ask the same model to critique that answer
    eval_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Question: {question}

Answer: {answer}

Evaluate this answer:
1. Is it factually accurate? (true/false)
2. Does it fully address the question? (true/false)
3. Confidence the answer is correct (0.0-1.0)

Return JSON: {{"accurate": bool, "complete": bool, "confidence": float, "issues": [list of issues]}}"""
        }],
        temperature=0,  # Keep the judge as deterministic as possible
        response_format={"type": "json_object"},
    )
    evaluation = json.loads(eval_resp.choices[0].message.content)

    # Use .get() so a missing key fails the check instead of raising KeyError
    passed = bool(evaluation.get("accurate")) and evaluation.get("confidence", 0.0) >= 0.8
    return {
        "answer": answer,
        "evaluation": evaluation,
        "passed": passed,
    }

result = generate_and_validate("What year did World War 2 end?")
print(f"Answer: {result['answer']}")
print(f"Passed: {result['passed']} | Confidence: {result['evaluation'].get('confidence')}")
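
Because the gate depends on the judge returning well-formed JSON with the expected keys, it can help to parse that reply defensively and fail closed. The sketch below is one way to do that; `parse_self_evaluation` is a hypothetical helper, not part of any library, and it mirrors the pass rule used above (accurate and confidence of at least 0.8).

```python
import json
from typing import Optional

def parse_self_evaluation(raw: Optional[str], min_confidence: float = 0.8) -> dict:
    """Parse the judge's JSON reply; anything malformed counts as a failed check."""
    failed = {"accurate": False, "complete": False, "confidence": 0.0,
              "issues": ["unparseable judge output"], "passed": False}
    try:
        data = json.loads(raw or "")
    except json.JSONDecodeError:
        return failed
    if not isinstance(data, dict):
        return failed

    # Only a JSON boolean true counts; the string "true" does not.
    accurate = data.get("accurate") is True
    complete = data.get("complete") is True
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(max(confidence, 0.0), 1.0)  # Clamp to [0, 1]
    issues = data.get("issues")
    if not isinstance(issues, list):
        issues = []

    return {"accurate": accurate, "complete": complete,
            "confidence": confidence, "issues": issues,
            "passed": accurate and confidence >= min_confidence}
```

Failing closed means a broken judge reply blocks the response instead of letting it through unchecked, which is usually the safer default for a runtime gate.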

Pattern 2 — Cross-Model Verification

Use a different (stronger) model to verify the output:

def cross_model_verify(question: str, answer: str) -> dict:
    """Use GPT-4o to verify a GPT-4o-mini answer."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # Stronger judge model
        messages=[{
            "role": "user",
            "content": f"""Verify this answer for accuracy and completeness.

Question: {question}
Answer: {answer}

JSON: {{"verdict": "correct|incorrect|partial", "corrections": [list of corrections if any], "confidence": 0.0-1.0}}"""
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

verification = cross_model_verify(
    "What is the boiling point of water at sea level?",
    "Water boils at 100 degrees Celsius at sea level."
)
print(verification)
# Example output (model-dependent, may vary):
# {'verdict': 'correct', 'corrections': [], 'confidence': 1.0}
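
The verdict still has to be turned into a decision. A minimal sketch of that mapping follows; the function name and the action labels ("accept", "revise", "reject") are assumptions chosen for illustration, and the 0.8 threshold simply reuses the value from Pattern 1.

```python
def decide_action(verification: dict, min_confidence: float = 0.8) -> str:
    """Map a cross-model verdict to a hypothetical action label."""
    verdict = str(verification.get("verdict", "")).strip().lower()
    try:
        confidence = float(verification.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0

    if verdict == "correct" and confidence >= min_confidence:
        return "accept"   # Return the answer unchanged
    if verdict == "partial" and verification.get("corrections"):
        return "revise"   # Apply the listed corrections or regenerate
    return "reject"       # Incorrect, low confidence, or unrecognised verdict

print(decide_action({"verdict": "correct", "corrections": [], "confidence": 1.0}))
# accept
```

Anything the function does not recognise falls through to "reject", so an unexpected judge reply never passes by accident.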

Pattern 3 — Faithfulness Check (for RAG)

Verify that a RAG response is grounded in the retrieved context:

def check_faithfulness(question: str, context: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Is the answer fully supported by the context?
Do not use any external knowledge — only check what’s in the context.

Context: {context}
Question: {question}
Answer: {answer}

JSON: {{"faithful": bool, "unsupported_claims": [list of claims in the answer NOT in the context]}}"""
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
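
As a sketch of how the result might gate a RAG response: the helper below takes the dictionary returned by the faithfulness check and returns either the answer or a fallback. `gate_rag_answer` and the fallback wording are hypothetical, and the sample check result is hand-written for illustration, not real model output.

```python
def gate_rag_answer(
    answer: str,
    check: dict,
    fallback: str = "I could not find that in the provided documents.",
) -> str:
    """Return the answer only if the faithfulness check found it fully grounded."""
    unsupported = check.get("unsupported_claims") or []
    if check.get("faithful") is True and not unsupported:
        return answer
    return fallback

# Hand-written example of a check result that flags an unsupported claim.
sample_check = {
    "faithful": False,
    "unsupported_claims": ["The warranty covers accidental damage."],
}
print(gate_rag_answer("The warranty lasts two years and covers accidental damage.",
                      sample_check))
# I could not find that in the provided documents.
```

Requiring both `faithful` to be true and the claim list to be empty guards against a judge reply that contradicts itself.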

Common Mistakes

  1. Relying solely on self-evaluation: models tend to be over-confident in their own answers. Combine self-evaluation with other signals, such as cross-model verification or a faithfulness check.
  2. Not caching validation results: every judged request adds at least one extra model call, which roughly doubles API cost when the judge is the same model and costs more with a stronger judge. Cache the verdict for identical question and answer pairs.
  3. Confusing validation with evaluation: runtime validation gates individual responses; evaluation benchmarks overall system quality.
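
The caching advice in the list above can be sketched as a small wrapper. This is an in-memory illustration under stated assumptions: `make_cached_judge` is a hypothetical helper, the judge is any callable taking a question and an answer, and a production system would likely want a shared store with expiry. The key covers both the question and the answer, because generation at a non-zero temperature can produce different answers to the same question.

```python
import hashlib
import json
from typing import Callable, Dict

def make_cached_judge(judge: Callable[[str, str], dict]) -> Callable[[str, str], dict]:
    """Wrap a judge so identical (question, answer) pairs are judged only once."""
    cache: Dict[str, dict] = {}

    def cached(question: str, answer: str) -> dict:
        # JSON-encode the pair so ("ab", "c") and ("a", "bc") get distinct keys.
        payload = json.dumps([question, answer]).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        if key not in cache:
            cache[key] = judge(question, answer)
        return cache[key]

    return cached
```

Caching is only safe when the judge is configured to behave deterministically enough (for example `temperature=0`) that reusing an earlier verdict is acceptable.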

Quick Quiz

Test Your Understanding

Q1. What is the main limitation of self-evaluation?
A1. Models tend to be over-confident in their own outputs. They may confirm incorrect answers with high confidence.

Q2. Why is cross-model verification more reliable than self-evaluation?
A2. Using a different (typically stronger) model as the judge provides an independent assessment and reduces the bias a model shows toward its own output. The judge can still be wrong, so its verdict is a signal and not a guarantee.

Q3. In a RAG system, what does faithfulness validation check?
A3. Whether the generated answer is grounded in the retrieved context — i.e., it doesn’t contain claims that aren’t supported by the retrieved documents.


Student Exercise

Exercise 12.3 — Validation pipeline
Build a QA system that: (1) generates an answer with GPT-4o-mini, (2) self-evaluates with confidence score, (3) if confidence < 0.8, escalates to GPT-4o for cross-model verification. Log which path each query took.

