8.3 LLM-as-Judge Evaluation

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Rubric design · Pairwise comparison · Calibration

Official Docs: OpenAI Evals Framework · Prometheus-2 Model

Key Paper: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)

What is LLM-as-Judge?

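Instead of relying on human annotators, use a strong LLM (typically GPT-4o) to evaluate another model’s outputs. Zheng et al. (2023) report that GPT-4 judgements agree with human preferences more than 80% of the time on MT-Bench, about the same rate at which human annotators agree with each other.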

Pattern 1 — Absolute Scoring (Rubric-Based)

from openai import OpenAI
client = OpenAI()

def evaluate_with_rubric(
    question: str, 
    model_answer: str, 
    reference_answer: str
) -> dict:
    prompt = f"""You are an expert evaluator for AI-generated text.

Evaluate the model answer against the reference answer using this rubric:

**Scoring (1-5):**
- 5: Completely correct, comprehensive, well-explained
- 4: Mostly correct with minor gaps
- 3: Partially correct, missing key aspects
- 2: Mostly incorrect but contains some relevant info
- 1: Completely incorrect or irrelevant

**Question:** {question}

**Reference Answer:** {reference_answer}

**Model Answer:** {model_answer}

Provide your evaluation as JSON:
{{"score": <1-5>, "reasoning": "<brief explanation>", "missing": "<what is missing if any>"}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    import json
    return json.loads(resp.choices[0].message.content)

# Example
result = evaluate_with_rubric(
    question="What is the transformer architecture?",
    model_answer="Transformers use attention mechanisms to process text in parallel.",
    reference_answer="The Transformer uses multi-head self-attention and feed-forward layers with residual connections to process sequences in parallel without recurrence.",
)
print(result)
# Example output (exact values will vary between runs and model versions):
# {'score': 3, 'reasoning': 'Correct but too shallow', 'missing': 'Feed-forward layers, residual connections'}
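
The function above returns whatever JSON the judge produces, without checking it. A minimal sketch of a validation step follows, assuming the reply uses the three keys requested in the prompt. The helper name `validate_rubric_result` is hypothetical and not part of any library.

```python
def validate_rubric_result(result: dict) -> dict:
    """Check a parsed judge reply before it is aggregated (hypothetical helper)."""
    score = result.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    # Missing text fields default to an empty string rather than failing.
    return {
        "score": score,
        "reasoning": str(result.get("reasoning", "")),
        "missing": str(result.get("missing", "")),
    }
```

One way to use it is to wrap the call, as in `validate_rubric_result(evaluate_with_rubric(...))`, and log or retry any item that raises `ValueError` instead of silently averaging a malformed score.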

Pattern 2 — Pairwise Comparison (A/B Testing)

Compare two model outputs to decide which is better:

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> dict:
    prompt = f"""Compare two AI model responses to the same question.

**Question:** {question}

**Response A:** {answer_a}

**Response B:** {answer_b}

Which response is better overall? Consider accuracy, completeness, and clarity.

Respond with JSON: {{"winner": "A" or "B" or "tie", "reasoning": "<explanation>"}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    import json
    return json.loads(resp.choices[0].message.content)
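
A single call to a pairwise judge is sensitive to the order in which the responses are shown. One possible mitigation, sketched below, runs the judge in both orders and keeps a verdict only when the two runs agree. The wrapper name `debiased_pairwise` and its `judge` parameter are assumptions made for illustration. Any function with the same signature and return shape as `pairwise_judge` should work.

```python
from typing import Callable


def debiased_pairwise(
    question: str,
    answer_a: str,
    answer_b: str,
    judge: Callable[[str, str, str], dict],
) -> str:
    """Run the judge in both orders and return "A", "B", or "tie"."""
    first = judge(question, answer_a, answer_b)["winner"]
    second = judge(question, answer_b, answer_a)["winner"]
    # In the second call the labels are swapped: "A" there refers to answer_b.
    unswap = {"A": "B", "B": "A"}
    second = unswap.get(second, "tie")
    # Keep the verdict only if both orderings picked the same answer.
    if first == second and first in ("A", "B"):
        return first
    return "tie"
```

Calling `debiased_pairwise(question, answer_a, answer_b, judge=pairwise_judge)` doubles the number of API calls, which is the cost of removing order effects from the result.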

Pattern 3 — Multi-Criteria Evaluation

def multi_criteria_eval(question: str, answer: str) -> dict:
    prompt = f"""Evaluate this AI response on 4 criteria.

**Question:** {question}
**Response:** {answer}

Rate each criterion 1–5 and provide a brief note.

Return JSON:
{{
  "accuracy": {{"score": <1-5>, "note": "<why>"}},
  "completeness": {{"score": <1-5>, "note": "<why>"}},
  "clarity": {{"score": <1-5>, "note": "<why>"}},
  "helpfulness": {{"score": <1-5>, "note": "<why>"}},
  "overall": <average score>
}}"""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    import json
    return json.loads(resp.choices[0].message.content)
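
The prompt above asks the model to compute the "overall" average itself, and language models can make arithmetic mistakes. A minimal sketch of recomputing the average locally follows, assuming the reply contains the four criteria keys from the prompt. `recompute_overall` and `CRITERIA` are hypothetical names introduced here.

```python
from statistics import mean

# The four criteria keys requested in the multi-criteria prompt.
CRITERIA = ("accuracy", "completeness", "clarity", "helpfulness")


def recompute_overall(result: dict) -> dict:
    """Return a copy of the judge reply with "overall" recomputed locally."""
    # A missing criterion raises KeyError, which surfaces a malformed reply early.
    scores = [result[name]["score"] for name in CRITERIA]
    return {**result, "overall": mean(scores)}
```

With this in place, the "overall" field in the model's reply can be treated as advisory and overwritten, for example `recompute_overall(multi_criteria_eval(question, answer))`.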

Calibration: Does the Judge Agree with Humans?

Before trusting LLM-as-judge at scale, calibrate it against a small human-annotated set:

from scipy.stats import spearmanr

human_scores = [4, 5, 2, 3, 4, 1, 5, 3]
llm_scores   = [4, 5, 3, 3, 4, 2, 5, 2]

corr, p_value = spearmanr(human_scores, llm_scores)
print(f"Spearman rank correlation: {corr:.3f} (p={p_value:.4f})")
# Target: corr > 0.8 for reliable LLM judge
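
Spearman correlation only compares rankings, so a judge that scores every item one point too high would still correlate perfectly with the human scores. As a complementary check, the sketch below computes exact agreement, agreement within one point, and the mean score difference for the same two lists. The helper name `agreement_stats` is an assumption made for illustration.

```python
def agreement_stats(human: list[int], judge: list[int]) -> dict:
    """Summarise absolute agreement between human and judge scores."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [j - h for h, j in zip(human, judge)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        # A positive value means the judge scores higher than humans on average.
        "mean_diff": sum(diffs) / n,
    }


human_scores = [4, 5, 2, 3, 4, 1, 5, 3]
llm_scores = [4, 5, 3, 3, 4, 2, 5, 2]
print(agreement_stats(human_scores, llm_scores))
# {'exact': 0.625, 'within_one': 1.0, 'mean_diff': 0.125}
```

Here the judge matches the human score exactly on 5 of 8 items and is never off by more than one point, with a slight tendency to score higher.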

Common Mistakes

Position bias: LLM judges often prefer whichever response appears first, regardless of quality. Run each comparison in both orders and count a win only when the two verdicts agree; otherwise treat it as a tie.
Self-preference bias: LLM judges tend to favour answers from their own model family, so GPT-4 tends to prefer GPT-4 answers. When comparing GPT-4 vs Claude, use an independent judge.
Vague rubrics: "Rate quality 1-5" gives inconsistent results. Define exactly what each score means.
Using a weaker model as judge: the judge should be at least as capable as the model being evaluated. Don’t use GPT-4o-mini to judge GPT-4o outputs.

Quick Quiz

Test Your Understanding

Q1. What is position bias in pairwise LLM evaluation?
A1. The tendency for the judge model to prefer the first response shown, regardless of quality. Mitigation: evaluate with both orderings (A vs B, then B vs A) and count a win only when both verdicts agree.

Q2. What is a good Spearman correlation target for an LLM judge?
A2. Above 0.8 indicates strong agreement with human judgements and a reliable automated judge.

Q3. Why should you not use GPT-4o-mini to judge GPT-4o outputs?
A3. The judge must be capable of recognising high-quality responses. A weaker model cannot reliably evaluate outputs from a stronger model.

Student Exercise

Exercise 10.3 — Build an evaluator
Create an LLM-as-judge that scores responses 1–5 for accuracy and clarity. Evaluate 10 model responses. Then manually score the same 10 responses. Compute the Spearman correlation between your scores and the LLM judge scores.
