
8.1 Why Evaluation Is Hard

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.


Key Concepts: Non-determinism · Subjective quality · Benchmark gaming

Official Docs: HuggingFace Evaluate · LMSYS Chatbot Arena


The Core Problems

1. Non-Determinism

LLMs are stochastic — the same prompt may give different outputs on each run.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "In one sentence, what is machine learning?"

# Same prompt sent three times: with temperature=0.7 the outputs usually differ
for i in range(3):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    print(f"Run {i+1}: {resp.choices[0].message.content}")

# Mitigation: use temperature=0 for evaluation runs. This makes outputs far more
# repeatable, although hosted APIs do not guarantee identical results on every call.
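
When sampling cannot be switched off, or some residual variation remains, an alternative is to score several runs and report the spread alongside the mean. The following is a minimal sketch, assuming a per-run score from some metric is already available. `summarise_runs` is a hypothetical helper, and the scores are illustrative placeholders, not real results.

```python
import statistics

def summarise_runs(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder scores for five runs of the same evaluation (not real data)
run_scores = [0.70, 0.74, 0.72, 0.71, 0.73]
mean, spread = summarise_runs(run_scores)
print(f"score = {mean:.3f} +/- {spread:.3f} over {len(run_scores)} runs")
```

A difference between two models that is smaller than this run-to-run spread is probably not meaningful.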

2. Subjective Quality

There is often no single correct answer:

  • "Write a good poem about autumn" — what is good?
  • "Summarise this article" — what is the right length and depth?

3. Benchmark Gaming / Contamination

Public benchmarks such as MMLU and HellaSwag are published on the web, so their questions and answers may have ended up in a model's training data. High benchmark scores can then reflect memorisation rather than generalisation.
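
One rough way to probe for contamination is to check how many of a benchmark item's word n-grams appear verbatim in a sample of training text. The sketch below assumes you have access to such text, which is often not the case for closed models. The function names, the window size `n=8`, and the example strings are all illustrative assumptions.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item, corpus_text, n=8):
    """Fraction of the item's n-grams that also occur in corpus_text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)

# Made-up strings for illustration only
item = "the quick brown fox jumps over the lazy dog near the river bank"
leaked = "forum post: the quick brown fox jumps over the lazy dog near the river bank lol"
clean = "a completely unrelated sentence about evaluating language models on private data"

print(overlap_ratio(item, leaked))  # high ratio: item appears verbatim
print(overlap_ratio(item, clean))   # low ratio: no shared 8-grams
```

A high ratio is a warning sign rather than proof, and a low ratio does not rule out paraphrased leakage.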

4. Task-Specific Metrics Needed

Different tasks need different evaluation strategies:

Task              Appropriate metric
Summarisation     ROUGE, human eval
QA with facts     Exact match, F1
Code generation   Pass@k (unit tests)
Open-ended chat   LLM-as-judge, human eval
Classification    Accuracy, F1
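
To make the "QA with facts" row concrete, here is a minimal sketch of exact match and token-level F1. It uses a deliberately simplified normalisation (lowercasing and whitespace splitting); established QA evaluation scripts typically also strip punctuation and articles. The function names are illustrative.

```python
from collections import Counter

def normalise(text):
    """Simplified normalisation: lowercase and split on whitespace."""
    return text.lower().strip().split()

def exact_match(prediction, reference):
    """1.0 if the normalised token sequences are identical, else 0.0."""
    return float(normalise(prediction) == normalise(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalise(prediction), normalise(reference)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))           # 1.0
print(token_f1("the city of Paris", "Paris"))  # 0.4
```

The second call shows why F1 is reported alongside exact match: a verbose but correct answer fails exact match yet still earns partial credit.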

Evaluation Pyramid

            ▲            Human Evaluation         (most reliable, most expensive)
           ╱ ╲
          ╱   ╲          LLM-as-Judge             (scalable, good correlation with human ratings)
         ╱     ╲
        ╱       ╲        Automated Metrics        (fast, reproducible, limited)
       ╱         ╲
      ╱───────────╲      Unit Tests / Assertions  (fastest, binary)

Use all layers together. Unit tests catch regressions; automated metrics track trends; LLM-as-judge catches quality problems that metrics miss; human eval provides ground truth.
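
The bottom layer of the pyramid can be as simple as a handful of assertions on each output. The sketch below assumes a hypothetical task in which the model is asked to reply with a JSON object containing a non-empty string field named `answer`. Both the output format and the `check_output` helper are assumptions made for illustration.

```python
import json

def check_output(raw):
    """Return a list of failed checks for one model output (empty list = pass)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    failures = []
    if "answer" not in data:
        failures.append("missing 'answer' field")
    elif not isinstance(data["answer"], str) or not data["answer"].strip():
        failures.append("'answer' is empty or not a string")
    return failures

print(check_output('{"answer": "42"}'))  # []
print(check_output("Sure! The answer is 42."))
```

Checks like these say nothing about whether the answer is good, only whether it is usable, which is why the higher layers are still needed.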


Common Mistakes

  1. Using only one metric: no single metric captures overall quality. Always use a portfolio of metrics.
  2. Evaluating with uncontrolled sampling: for reproducible evaluation, use temperature=0 where the metric allows it. Results should then be nearly identical across runs, though exact repeatability is not guaranteed on hosted APIs. Metrics that rely on sampling, such as Pass@k, need temperature > 0; in that case fix the number of samples and average over runs.
  3. Trusting public benchmarks uncritically: if a model was trained on web data that includes benchmark answers, its scores are inflated. Use held-out private test sets.
  4. No baseline comparison: a ROUGE score of 0.72 is meaningless without a baseline. Always compare: does this beat the previous model?
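
For the baseline comparison in point 4, it helps to ask whether the improvement is larger than the noise. One common approach is a paired bootstrap over per-example scores, sketched below. `bootstrap_mean_diff` is a hypothetical helper, the resample count and seed are arbitrary choices, and the scores are placeholders rather than real results.

```python
import random

def bootstrap_mean_diff(candidate, baseline, n_resamples=2000, seed=0):
    """Paired bootstrap: mean of (candidate - baseline) with a 95% interval."""
    if len(candidate) != len(baseline):
        raise ValueError("scores must be paired per example")
    diffs = [c - b for c, b in zip(candidate, baseline)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / len(diffs), low, high

# Placeholder per-example scores on the same six test items (not real data)
baseline = [0.60, 0.65, 0.70, 0.55, 0.62, 0.68]
candidate = [0.66, 0.70, 0.74, 0.61, 0.69, 0.72]

delta, low, high = bootstrap_mean_diff(candidate, baseline)
print(f"mean improvement = {delta:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the candidate is plausibly better than the baseline on this test set. Six examples are far too few for a real evaluation and are used here only to keep the sketch short.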

Quick Quiz

Test Your Understanding

Q1. Why is non-determinism a problem for evaluation?
A1. The same prompt can produce different outputs on each run, so a single run is a noisy estimate of quality. Comparisons between models are only reliable if you reduce the randomness (temperature=0) or average across many runs.

Q2. What is benchmark contamination?
A2. When benchmark questions and answers are in the model’s training data, so high scores reflect memorisation rather than true capability.

Q3. For evaluating a code generation model, what is Pass@k?
A3. The percentage of problems for which at least one of k generated solutions passes all unit tests.
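
In practice, Pass@k is often estimated by generating n >= k samples per problem, counting the c samples that pass, and using the unbiased estimator 1 - C(n-c, k) / C(n, k), a formula popularised by the HumanEval code benchmark. A minimal sketch for a single problem follows; the function name and the example numbers are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct solution
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples generated, 3 pass the unit tests
print(pass_at_k(10, 3, 1))  # chance that a single sample passes
print(pass_at_k(10, 3, 5))  # chance that at least one of 5 samples passes
```

The benchmark-level score is the mean of this value over all problems.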


Student Exercise

Exercise 10.1 — Measure non-determinism
Ask GPT-4o-mini the same question 10 times with temperature=0.7. Measure the character length variation and ROUGE score between consecutive answers. Repeat with temperature=0. Discuss the difference.

