8.4 Human Evaluation Design
Key Concepts: Annotation guidelines · Inter-rater reliability · Stratified sampling
Key Paper: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)
When Human Evaluation Is Needed
Human evaluation is the gold standard, but it is expensive. Use it when:
✅ You are launching a new product and need ground truth
✅ Automated metrics are inconclusive
✅ You are calibrating your LLM-as-judge
✅ You are evaluating nuanced qualities (empathy, creativity, professionalism)
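For the LLM-as-judge case in the list above, one simple way to calibrate is to score the same responses with humans and with the judge, then compare. The sketch below is illustrative only: the function name `judge_vs_human`, the three summary statistics, and the sample scores are assumptions, not a method prescribed by this chapter.

```python
def judge_vs_human(human_scores, judge_scores):
    """Compare LLM-judge scores with human scores on the same responses."""
    if len(human_scores) != len(judge_scores) or not human_scores:
        raise ValueError("Score lists must be non-empty and the same length")
    n = len(human_scores)
    # Signed difference per response: positive means the judge scored higher
    diffs = [j - h for h, j in zip(human_scores, judge_scores)]
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_one": sum(abs(d) <= 1 for d in diffs) / n,
        "mean_bias": sum(diffs) / n,
    }

# Hypothetical 1-5 scores for eight responses, for illustration only
human = [5, 4, 2, 3, 1, 4, 5, 2]
judge = [5, 5, 3, 3, 2, 4, 5, 4]

report = judge_vs_human(human, judge)
print(report)
```

A positive `mean_bias` would suggest the judge is more lenient than the human raters, which is a cue to adjust the judge prompt before relying on its scores.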
Writing Good Annotation Guidelines
Annotators need clear, specific instructions. Vague guidelines produce unreliable results.
Bad guideline:
"Rate how helpful the response is on a scale of 1–5."
Good guideline:
Helpfulness Rating Guide:
5 - Fully addresses the question. Complete, accurate, clearly explained.
Example: Asks for Python sort code → gets correct, well-commented code with explanation.
4 - Mostly addresses the question. Minor gaps or slight ambiguity.
Example: Gets correct code but no explanation.
3 - Partially addresses the question. Missing key component or contains a minor error.
Example: Sorts ascending but question asked for descending.
2 - Mostly unhelpful. Multiple errors or major missing components.
Example: Code has a bug that would cause a runtime error.
1 - Completely unhelpful. Wrong task, empty, or dangerous content.
Example: Provides a sorting function in the wrong language entirely.
If unsure between two scores, choose the higher one.
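A guideline like the one above can also be kept as data, so that the text shown to annotators and the check applied to submitted ratings come from the same source. This is a minimal sketch; the names `RUBRIC`, `render_rubric`, and `validate_rating` are hypothetical, and the per-level examples are omitted for brevity.

```python
# Hypothetical structure: score levels from the guide above, stored as data
RUBRIC = {
    5: "Fully addresses the question. Complete, accurate, clearly explained.",
    4: "Mostly addresses the question. Minor gaps or slight ambiguity.",
    3: "Partially addresses the question. Missing key component or contains a minor error.",
    2: "Mostly unhelpful. Multiple errors or major missing components.",
    1: "Completely unhelpful. Wrong task, empty, or dangerous content.",
}

def render_rubric(rubric):
    """Return the rubric as text, highest score first."""
    ordered = sorted(rubric.items(), reverse=True)
    return "\n".join(f"{score} - {text}" for score, text in ordered)

def validate_rating(rating, rubric=RUBRIC):
    """Reject anything that is not one of the defined integer score levels."""
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(rating, bool) or not isinstance(rating, int) or rating not in rubric:
        raise ValueError(f"Rating must be one of {sorted(rubric)}, got {rating!r}")
    return rating

print(render_rubric(RUBRIC))
```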
Measuring Inter-Rater Reliability
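Always have multiple annotators rate the same examples and measure their agreement. Cohen's kappa compares exactly two annotators; quadratic weighting suits an ordinal 1–5 scale because it penalises large disagreements more heavily than near-misses: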
from sklearn.metrics import cohen_kappa_score
# Ratings from two annotators on the same 20 responses
annotator_1 = [4, 5, 2, 3, 4, 1, 5, 3, 4, 2, 5, 3, 4, 2, 5, 1, 3, 4, 5, 2]
annotator_2 = [4, 5, 2, 4, 4, 1, 5, 3, 3, 2, 5, 3, 4, 2, 4, 1, 3, 4, 5, 3]
kappa = cohen_kappa_score(annotator_1, annotator_2, weights="quadratic")
print(f"Cohen's Kappa (quadratic): {kappa:.3f}")
# Interpretation:
# > 0.80: Near-perfect agreement ✓
# 0.60–0.80: Substantial agreement ✓
# 0.40–0.60: Moderate agreement ⚠️
# < 0.40: Poor agreement — revisit guidelines ❌
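To see what "corrected for chance" means, it can help to compute the unweighted statistic by hand: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected if each annotator assigned labels independently at their own label frequencies. The sketch below uses only the standard library; the function name `cohen_kappa` is an assumption, and because it is unweighted its value will differ from the quadratic-weighted score above.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two annotators rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Rating lists must be non-empty and the same length")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same label by both
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # label frequencies, summed over all labels
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label throughout: kappa is undefined
        return float("nan")
    return (p_o - p_e) / (1 - p_e)

# Same ratings as in the example above
annotator_1 = [4, 5, 2, 3, 4, 1, 5, 3, 4, 2, 5, 3, 4, 2, 5, 1, 3, 4, 5, 2]
annotator_2 = [4, 5, 2, 4, 4, 1, 5, 3, 3, 2, 5, 3, 4, 2, 4, 1, 3, 4, 5, 3]

kappa_unweighted = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's Kappa (unweighted): {kappa_unweighted:.3f}")
```

With three or more annotators, one option is to average this statistic over all annotator pairs; statistics designed for multiple raters, such as Fleiss' kappa or Krippendorff's alpha, are alternatives.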
Stratified Sampling
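Don’t draw one random sample from the pooled outputs: small categories can end up with few or no examples. Sample within each category so that every category is represented: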
import random

# Your outputs categorised by type. Each [...] stands for a list of dicts,
# one dict per model output.
outputs_by_category = {
    "factual_qa": [...],        # 200 examples
    "summarisation": [...],     # 100 examples
    "code_generation": [...],   # 150 examples
    "creative_writing": [...],  # 50 examples
}

# Stratified sample: up to 10 from each category (equal allocation)
eval_set = []
for category, examples in outputs_by_category.items():
    sample = random.sample(examples, min(10, len(examples)))
    for ex in sample:
        # Copy each example so the source data is not modified
        eval_set.append({**ex, "category": category})

print(f"Eval set: {len(eval_set)} examples across {len(outputs_by_category)} categories")
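The loop above takes the same number of examples from every category. An alternative is proportional allocation, where each category's share of the eval set matches its share of the data, with a floor so that small categories are never dropped. The sketch below is one way to do this; the function name `stratified_sample`, its parameters, and the placeholder records are assumptions for illustration.

```python
import random

def stratified_sample(items_by_category, total, seed=0, min_per_category=1):
    """Sample roughly `total` items, allocated in proportion to category size."""
    population = sum(len(items) for items in items_by_category.values())
    if population == 0:
        raise ValueError("No items to sample from")
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    sample = []
    for category, items in items_by_category.items():
        # Proportional quota, with a floor, capped at the category size.
        # Rounding and the floor mean the final size can differ slightly from `total`.
        quota = max(min_per_category, round(total * len(items) / population))
        quota = min(quota, len(items))
        for item in rng.sample(items, quota):
            # Copy each item so the source data is not modified
            sample.append({**item, "category": category})
    return sample

# Hypothetical placeholder records, using the category sizes from the example above
sizes = {"factual_qa": 200, "summarisation": 100, "code_generation": 150, "creative_writing": 50}
outputs_by_category = {
    name: [{"id": f"{name}-{i}"} for i in range(count)]
    for name, count in sizes.items()
}

eval_set = stratified_sample(outputs_by_category, total=40)
print(f"Eval set: {len(eval_set)} examples")
```

Equal allocation gives more precise per-category scores for rare categories, while proportional allocation gives an overall score that reflects the real traffic mix; which one fits depends on what the evaluation is meant to report.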
Annotation Platform Options
| Platform | Best for | Cost |
|---|---|---|
| Label Studio (open source) | Teams with technical setup | Free |
| Scale AI | Production-grade, large datasets | High |
| Prolific | Academic/research studies | Medium |
| Spreadsheet + Forms | Small teams, < 200 examples | Free |
Common Mistakes
- Single annotator — one person’s opinion is not ground truth. Always use at least 2–3 annotators and measure agreement.
- Vague rating criteria — annotators with vague guidelines will interpret the scale differently. Define every score level with examples.
- No annotator training — always run annotators through a calibration round on 10–20 pre-rated examples before the real annotation task.
- Convenience sampling — evaluating only on easy or typical examples overestimates quality. Include edge cases, unusual inputs, and challenging questions.
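The calibration round mentioned in the list above can be scored with a few lines of code: compare each annotator's ratings on the pre-rated examples with the reference ratings and flag anyone who diverges too often. This is a sketch under assumptions: the function name `calibration_report`, the 0.7 pass threshold, and the sample ratings are all hypothetical.

```python
def calibration_report(gold, annotator_scores, min_exact=0.7):
    """Score each annotator against pre-rated (gold) examples."""
    if not gold:
        raise ValueError("Gold ratings must not be empty")
    report = {}
    for name, scores in annotator_scores.items():
        if len(scores) != len(gold):
            raise ValueError(f"{name} rated {len(scores)} items, expected {len(gold)}")
        n = len(gold)
        exact = sum(s == g for s, g in zip(scores, gold)) / n
        mean_abs_diff = sum(abs(s - g) for s, g in zip(scores, gold)) / n
        report[name] = {
            "exact": exact,
            "mean_abs_diff": mean_abs_diff,
            # The threshold is an assumed cut-off, not an established standard
            "passed": exact >= min_exact,
        }
    return report

# Hypothetical calibration data: 10 pre-rated examples, two annotators
gold = [5, 3, 1, 4, 2, 5, 3, 4, 2, 1]
annotator_scores = {
    "annotator_a": [5, 3, 1, 4, 2, 5, 3, 3, 2, 1],
    "annotator_b": [4, 4, 2, 4, 3, 5, 4, 5, 3, 1],
}

results = calibration_report(gold, annotator_scores)
for name, stats in results.items():
    print(name, stats)
```

Annotators who do not pass can be walked through the examples they missed before starting the real task.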
Quick Quiz
Q1. What does Cohen’s Kappa measure?
A1. Agreement between two annotators, corrected for the agreement expected by chance. A kappa above 0.8 indicates near-perfect agreement.
Q2. Why is stratified sampling better than random sampling for evaluation?
A2. Sampling within each category guarantees that every important category, including rare ones, appears in the eval set. A simple random sample can under-represent small categories, so the overall score mostly reflects the largest ones.
Q3. What is a calibration round and why is it important?
A3. A set of pre-rated examples that annotators score before the real task. It aligns their understanding of the rating scale and improves consistency.
Student Exercise
Exercise 10.5 — Annotation study
Generate 20 chatbot responses of varying quality. Have two classmates independently rate them 1–5 using your annotation guidelines. Compute Cohen’s Kappa. If kappa < 0.6, revise the guidelines and repeat.