6.3 LLM-as-Judge

Key Concepts: Self-evaluation · Cross-model verification · Scoring rubrics

Key Paper: Self-RAG (Asai et al., 2023)

LLM-as-Judge for Validation vs Evaluation

In Chapter 10, LLM-as-judge was used for evaluation (benchmarking model quality). Here it is used for runtime validation — checking each individual output before returning it to the user.

Use case	When	Purpose
Evaluation	Batch testing	Measure overall quality
Validation	Per-request	Gate individual responses
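The two rows of the table can be sketched as one judge used in two ways. This is a minimal sketch, not a definitive implementation: `stub_judge` is a hypothetical stand-in (a keyword check rather than a model call) so the example runs offline, and `evaluate_batch`, `validate_response`, and `FALLBACK` are illustrative names.

```python
from typing import Callable

Judge = Callable[[str, str], bool]

# Hypothetical judge: returns True when the answer looks acceptable.
# A real judge would call a model; this stub keeps the sketch offline.
def stub_judge(question: str, answer: str) -> bool:
    return bool(answer.strip()) and "i don't know" not in answer.lower()

def evaluate_batch(pairs: list[tuple[str, str]], judge: Judge) -> float:
    """Evaluation: score a whole test set and report one aggregate pass rate."""
    if not pairs:
        return 0.0
    passed = sum(1 for question, answer in pairs if judge(question, answer))
    return passed / len(pairs)

FALLBACK = "Sorry, I could not produce a reliable answer."

def validate_response(question: str, answer: str, judge: Judge) -> str:
    """Validation: gate a single response at request time."""
    return answer if judge(question, answer) else FALLBACK

test_set = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Peru?", "I don't know"),
]
print(evaluate_batch(test_set, stub_judge))
print(validate_response("What is the capital of France?", "Paris", stub_judge))
```

Evaluation produces a number you track over time; validation produces a decision about one response, so it needs a defined fallback for the failing case.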

Pattern 1 — Self-Evaluation

Ask the same model to critique its own output:

from openai import OpenAI
import json

client = OpenAI()

def generate_and_validate(question: str) -> dict:
    # Step 1: Generate answer
    gen_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
    )
    answer = gen_resp.choices[0].message.content
    
    # Step 2: Self-evaluate
    eval_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Question: {question}

Answer: {answer}

Evaluate this answer:
1. Is it factually accurate? (true/false)
2. Does it fully address the question? (true/false)
3. Confidence the answer is correct (0.0-1.0)

Return JSON: {{"accurate": bool, "complete": bool, "confidence": float, "issues": [list of issues]}}"""
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    evaluation = json.loads(eval_resp.choices[0].message.content)
    
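    # Fail closed: a missing or malformed field counts as a failed check
    confidence = evaluation.get("confidence")
    passed = (
        evaluation.get("accurate") is True
        and isinstance(confidence, (int, float))
        and confidence >= 0.8
    )

    return {
        "answer": answer,
        "evaluation": evaluation,
        "passed": passed,
    }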

result = generate_and_validate("What year did World War 2 end?")
print(f"Answer: {result['answer']}")
print(f"Passed: {result['passed']} | Confidence: {result['evaluation']['confidence']}")
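JSON mode makes the judge return syntactically valid JSON, but it does not guarantee that the expected keys are present or that the values have the expected types. The sketch below shows one way to parse the judge's reply defensively and fail closed. The helper name `parse_self_eval` and the clamping of `confidence` to the range 0.0 to 1.0 are assumptions made for illustration; the pass rule (accurate and confidence of at least 0.8) mirrors the code above.

```python
import json

def parse_self_eval(raw, threshold: float = 0.8) -> dict:
    """Parse a judge reply and decide pass/fail, failing closed on bad input."""
    failed = {"passed": False, "confidence": 0.0,
              "issues": ["unparseable judge output"]}
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return failed
    if not isinstance(data, dict):
        return failed

    # Only a real JSON boolean `true` counts; the string "true" does not.
    accurate = data.get("accurate") is True

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        confidence = 0.0
    confidence = min(max(float(confidence), 0.0), 1.0)

    issues = data.get("issues")
    if not isinstance(issues, list):
        issues = []

    return {
        "passed": accurate and confidence >= threshold,
        "confidence": confidence,
        "issues": issues,
    }

reply = '{"accurate": true, "complete": true, "confidence": 0.95, "issues": []}'
print(parse_self_eval(reply))
print(parse_self_eval("not json"))
```

Treating any malformed reply as a failed check means a judge error blocks or escalates a response instead of letting it through unchecked.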

Pattern 2 — Cross-Model Verification

Use a different (stronger) model to verify the output:

def cross_model_verify(question: str, answer: str) -> dict:
    """Use GPT-4o to verify a GPT-4o-mini answer."""
    resp = client.chat.completions.create(
        model="gpt-4o",       # Stronger judge model
        messages=[{
            "role": "user",
            "content": f"""Verify this answer for accuracy and completeness.

Question: {question}
Answer: {answer}

JSON: {{"verdict": "correct|incorrect|partial", "corrections": [list of corrections if any], "confidence": 0.0-1.0}}"""
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

verification = cross_model_verify(
    "What is the boiling point of water at sea level?",
    "Water boils at 100 degrees Celsius at sea level."
)
print(verification)
# Example output (varies by run): {'verdict': 'correct', 'corrections': [], 'confidence': 1.0}
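A verdict is only useful once it is mapped to an action. The following sketch shows one possible routing policy for the dictionary returned by `cross_model_verify`. The function name `apply_verdict`, the action labels, and the 0.8 threshold are assumptions chosen for illustration, not part of any library.

```python
def apply_verdict(answer: str, verification: dict,
                  min_confidence: float = 0.8) -> dict:
    """Map a judge verdict to an action: return, return with corrections, or block."""
    verdict = verification.get("verdict")

    corrections = verification.get("corrections")
    if not isinstance(corrections, list):
        corrections = []

    confidence = verification.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        confidence = 0.0  # missing or malformed confidence counts as zero

    if verdict == "correct" and confidence >= min_confidence:
        return {"action": "return", "answer": answer, "corrections": []}
    if verdict == "partial" and corrections:
        return {"action": "return_with_corrections", "answer": answer,
                "corrections": corrections}
    # Anything else (incorrect, unknown verdict, low confidence) is blocked.
    return {"action": "block", "answer": None, "corrections": corrections}

print(apply_verdict(
    "Water boils at 100 degrees Celsius at sea level.",
    {"verdict": "correct", "corrections": [], "confidence": 1.0},
))
```

Blocking on anything that is not explicitly recognised keeps an unexpected judge reply from being treated as approval.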

Pattern 3 — Faithfulness Check (for RAG)

Verify that a RAG response is grounded in the retrieved context:

def check_faithfulness(question: str, context: str, answer: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Is the answer fully supported by the context?
Do not use any external knowledge; only check what is in the context.

Context: {context}
Question: {question}
Answer: {answer}

JSON: {{"faithful": bool, "unsupported_claims": [list of claims in the answer NOT in the context]}}"""
        }],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
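The judge's two fields can disagree, for example `faithful` set to true alongside a non-empty `unsupported_claims` list. The sketch below shows one way to turn the result of `check_faithfulness` into a gating decision that requires both fields to agree. The name `enforce_faithfulness` and the fallback message are hypothetical, chosen for illustration.

```python
FALLBACK = "I could not find support for that in the retrieved documents."

def enforce_faithfulness(answer: str, result: dict) -> dict:
    """Return the answer only if the judge reports it as fully grounded."""
    claims = result.get("unsupported_claims")
    well_formed = isinstance(claims, list)

    # Require both signals to agree; a missing field or a contradiction fails closed.
    faithful = result.get("faithful") is True and well_formed and not claims

    return {
        "answer": answer if faithful else FALLBACK,
        "faithful": faithful,
        "unsupported_claims": claims if well_formed else [],
    }

# Illustrative judge output for an answer that adds a claim not in the context
judge_result = {"faithful": False,
                "unsupported_claims": ["The warranty covers water damage."]}
print(enforce_faithfulness("The warranty lasts two years and covers water damage.",
                           judge_result))
```

Returning the list of unsupported claims alongside the decision makes it possible to log why a response was blocked.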

Common Mistakes

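Relying solely on self-evaluation: models tend to be over-confident in their own answers, so a self-assigned confidence score is weak evidence on its own. Combine self-evaluation with other signals, such as a cross-model check or a faithfulness check.
Not caching validation results: running LLM-as-judge on every request adds at least one extra model call, which roughly doubles API cost and adds latency (more if the judge is a larger model). Cache validation results for identical inputs.
Confusing validation with evaluation: runtime validation gates individual responses; evaluation benchmarks overall system quality.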
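The caching advice can be sketched as a thin wrapper around any judge function. This is a minimal in-memory sketch under stated assumptions: `CachedJudge` is a hypothetical class, `stub_judge` stands in for a real model call, and the cache key is a hash of the (question, answer) pair. Keying on the pair rather than the question alone matters because a generator sampled at a non-zero temperature can produce different answers to the same question.

```python
import hashlib
import json
from typing import Callable

class CachedJudge:
    """Wrap a judge so identical (question, answer) pairs are judged only once."""

    def __init__(self, judge: Callable[[str, str], dict]):
        self._judge = judge
        self._cache: dict[str, dict] = {}
        self.calls = 0  # number of times the underlying judge actually ran

    @staticmethod
    def _key(question: str, answer: str) -> str:
        # Serialise as a JSON list so ("ab", "c") and ("a", "bc") cannot collide.
        payload = json.dumps([question, answer], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def __call__(self, question: str, answer: str) -> dict:
        key = self._key(question, answer)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._judge(question, answer)
        return self._cache[key]

# Hypothetical judge used only to make the sketch runnable offline.
def stub_judge(question: str, answer: str) -> dict:
    return {"verdict": "correct" if answer else "incorrect"}

judge = CachedJudge(stub_judge)
judge("What year did World War 2 end?", "1945")
judge("What year did World War 2 end?", "1945")  # served from the cache
print(judge.calls)
```

A production cache would also need an eviction policy and invalidation when the judge prompt or judge model changes; both are left out here.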

Quick Quiz

Test Your Understanding

Q1. What is the main limitation of self-evaluation?
A1. Models tend to be over-confident in their own outputs. They may confirm incorrect answers with high confidence.

Q2. Why is cross-model verification more reliable than self-evaluation?
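A2. A different (typically stronger) judge model does not share the generator's bias toward its own output, so it provides a more independent assessment. The judge can still be wrong, so its verdict is one signal rather than ground truth.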

Q3. In a RAG system, what does faithfulness validation check?
A3. Whether the generated answer is grounded in the retrieved context — i.e., it doesn’t contain claims that aren’t supported by the retrieved documents.

Student Exercise

Exercise 12.3 — Validation pipeline
Build a QA system that: (1) generates an answer with GPT-4o-mini, (2) self-evaluates the answer and returns a confidence score, (3) escalates to GPT-4o for cross-model verification if confidence < 0.8. Log which path each query took.
