8.1 Why Evaluation Is Hard
Key Concepts: Non-determinism · Subjective quality · Benchmark gaming
Official Docs: HuggingFace Evaluate · LMSYS Chatbot Arena
The Core Problems
1. Non-Determinism
LLM decoding is usually stochastic: with a temperature above 0, the same prompt can produce a different output on each run.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "In one sentence, what is machine learning?"

# Same prompt, sampled three times: the wording usually differs between calls
for i in range(3):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    print(f"Run {i+1}: {resp.choices[0].message.content}")

# Mitigation: set temperature=0 for evaluation runs. This greatly reduces
# run-to-run variation, though outputs are not guaranteed to be identical.
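To put a number on how much repeated runs disagree, one option is to score every pair of outputs for similarity and average the result. The sketch below is only an illustration: the `pairwise_similarity` helper and the sample strings are hypothetical, and character-level similarity from the standard library's `difflib` is a stand-in for whatever text metric you prefer.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def pairwise_similarity(outputs: list[str]) -> float:
    """Mean character-level similarity (0..1) over all pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to disagree with
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


# Hypothetical outputs standing in for three API responses.
runs = [
    "Machine learning lets computers learn patterns from data.",
    "Machine learning lets computers learn patterns from data.",
    "Machine learning is a way for software to improve from examples.",
]
print(f"Mean pairwise similarity: {pairwise_similarity(runs):.2f}")
```

A value close to 1.0 means the runs are nearly identical; lower values indicate more variation between runs.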
2. Subjective Quality
There is often no single correct answer:
- "Write a good poem about autumn" — what is good?
- "Summarise this article" — what is the right length and depth?
3. Benchmark Gaming / Contamination
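Many public benchmarks (such as MMLU and HellaSwag) are published on the open web, so their questions and answers have likely ended up in the training data of many models. A high benchmark score can therefore reflect memorisation rather than generalisation.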
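If you have access to a sample of the training corpus, a rough contamination check is to look for long word sequences that a benchmark item shares with it. The following is a minimal sketch, not a standard tool: the function names are hypothetical, and the default window of 8 words is an arbitrary choice for illustration.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive lowercase, whitespace-separated tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in corpus_text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Hypothetical example with a short window so the toy strings qualify.
item = "the capital of france is paris"
corpus = "quiz answers: the capital of france is paris according to the key"
print(overlap_fraction(item, corpus, n=3))
```

A high fraction suggests the item may have been seen during training. A low fraction does not prove the item is clean, because paraphrased or translated copies would not match.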
4. Task-Specific Metrics Needed
Different tasks need different evaluation strategies:
| Task | Appropriate metric |
|---|---|
| Summarisation | ROUGE, human eval |
| QA with facts | Exact match, F1 |
| Code generation | Pass@k (unit tests) |
| Open-ended chat | LLM-as-judge, human eval |
| Classification | Accuracy, F1 |
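Two of the metrics in the table, exact match and F1 for factual QA, are simple enough to sketch directly. The version below is an illustration under stated assumptions: the normalisation (lowercasing, stripping punctuation and English articles) resembles a convention commonly used for SQuAD-style scoring, and the helper names are hypothetical rather than taken from a specific library.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both answers are identical after normalisation."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(token_f1("Paris, France", "Paris"), 3))
```

Exact match is strict and rewards only fully correct answers, while token F1 gives partial credit when the prediction contains the reference plus extra words.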
Evaluation Pyramid
From the top of the pyramid to its base:
- Human Evaluation (most reliable, most expensive)
- LLM-as-Judge (scalable, good correlation with human ratings)
- Automated Metrics (fast, reproducible, limited)
- Unit Tests / Assertions (fastest, binary)
Use all layers together. Unit tests catch regressions; automated metrics track trends; LLM-as-judge catches quality issues that metrics miss; human eval provides ground truth.
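The base layer of the pyramid can be as simple as a handful of deterministic checks on each output. Here is a minimal sketch, assuming a task where the model must return a JSON object; the `check_output` helper, the required keys, and the length limit are all hypothetical choices for illustration.

```python
import json


def check_output(raw: str, required_keys: set[str], max_chars: int = 500) -> list[str]:
    """Return a list of failed checks; an empty list means the output passed."""
    failures = []
    if len(raw) > max_chars:
        failures.append(f"longer than {max_chars} characters")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return failures + ["not valid JSON"]
    if not isinstance(data, dict):
        return failures + ["top-level value is not an object"]
    missing = required_keys - set(data)
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    return failures


# Hypothetical model outputs for a task that expects "answer" and "sources".
print(check_output('{"answer": "42", "sources": []}', {"answer", "sources"}))
print(check_output('{"answer": "42"}', {"answer", "sources"}))
```

Checks like these say nothing about whether the answer is good, only whether it is well formed, which is why they sit at the bottom of the pyramid and run on every change.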
Common Mistakes
- Using only one metric: no single metric captures overall quality. Use a portfolio of metrics.
- Evaluating with temperature > 0: for reproducible evaluation, use temperature=0. Results should then be nearly identical from run to run; if they still vary, average over several runs.
- Trusting public benchmarks: if a model was trained on web data that includes benchmark answers, its scores are inflated. Use held-out private test sets.
- No baseline comparison: a ROUGE score of 0.72 is meaningless without a baseline. Always ask whether it beats the previous model.
Quick Quiz
Q1. Why is non-determinism a problem for evaluation?
A1. The same prompt can produce different outputs on each run, so a score difference between two models may be noise rather than a real gap. Setting temperature=0 or averaging across many runs makes comparisons more reliable.
Q2. What is benchmark contamination?
A2. It occurs when benchmark questions and answers appear in the model’s training data, so that high scores reflect memorisation rather than true capability.
Q3. For evaluating a code generation model, what is Pass@k?
A3. The percentage of problems for which at least one of k generated solutions passes all unit tests.
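For A3, Pass@k is usually estimated rather than measured by drawing exactly k samples. The sketch below uses the unbiased estimator described by Chen et al. (2021) in the Codex paper, which is not part of this lesson: generate n ≥ k samples per problem, count the c samples that pass, and compute 1 − C(n−c, k) / C(n, k). The function name and argument checks are illustrative choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples per problem, 3 of which pass the tests.
print(round(pass_at_k(n=10, c=3, k=1), 3))
print(round(pass_at_k(n=10, c=3, k=5), 3))
```

The benchmark-level score is the mean of this per-problem estimate over all problems.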
Student Exercise
Exercise 10.1 — Measure non-determinism
Ask GPT-4o-mini the same question 10 times with temperature=0.7. Measure the character length variation and ROUGE score between consecutive answers. Repeat with temperature=0. Discuss the difference.
Further Reading
- 📘 LMSYS Chatbot Arena — human preference evaluation at scale
- 📄 Judging LLM-as-a-Judge with MT-Bench (Zheng et al., 2023)