8.4 Human Evaluation Design

Key Concepts: Annotation guidelines · Inter-rater reliability · Stratified sampling

Key Paper: Judging LLM-as-a-Judge with MT-Bench (Zheng et al., 2023)

When Human Evaluation Is Needed

Human evaluation is the gold standard but expensive. Use it when:

✅ You are launching a new product and need ground truth
✅ Automated metrics are inconclusive
✅ You are calibrating your LLM-as-judge
✅ You are evaluating nuanced qualities (empathy, creativity, professionalism)
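
For the LLM-as-judge case, one simple starting point is to check how often the judge gives the same score as a human and whether it tends to score higher or lower. The sketch below is illustrative only: the score lists are hypothetical placeholders, and `judge_calibration` is a made-up helper name, not part of any library.

```python
from statistics import mean

# Hypothetical 1-5 scores on the same 8 responses; not real data.
human_scores = [4, 2, 5, 3, 1, 4, 3, 5]
judge_scores = [4, 3, 5, 4, 2, 4, 3, 5]


def judge_calibration(human, judge):
    """Return (exact agreement rate, mean signed difference judge - human)."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    pairs = list(zip(human, judge))
    agreement = mean(1 if h == j else 0 for h, j in pairs)
    # A positive bias means the judge scores higher than the humans on average.
    bias = mean(j - h for h, j in pairs)
    return agreement, bias


agreement, bias = judge_calibration(human_scores, judge_scores)
print(f"Exact agreement: {agreement:.2f}, judge bias: {bias:+.2f}")
```

A large positive or negative bias suggests the judge prompt needs adjusting before its scores can stand in for human ratings.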

Writing Good Annotation Guidelines

Annotators need clear, specific instructions. Vague guidelines produce unreliable results.

Bad guideline:

"Rate how helpful the response is on a scale of 1–5."

Good guideline:

Helpfulness Rating Guide:

5 - Fully addresses the question. Complete, accurate, clearly explained.
    Example: Asks for Python sort code → gets correct, well-commented code with explanation.

4 - Mostly addresses the question. Minor gaps or slight ambiguity.
    Example: Gets correct code but no explanation.

3 - Partially addresses the question. Missing key component or contains a minor error.
    Example: Sorts ascending but question asked for descending.

2 - Mostly unhelpful. Multiple errors or major missing components.
    Example: Code has a bug that would cause a runtime error.

1 - Completely unhelpful. Wrong task, empty, or dangerous content.
    Example: Provides a sorting function in the wrong language entirely.

If unsure between two scores, choose the higher one.

Measuring Inter-Rater Reliability

Always have multiple annotators rate the same examples and measure agreement:

from sklearn.metrics import cohen_kappa_score

# Ratings from two annotators on the same 20 responses
annotator_1 = [4, 5, 2, 3, 4, 1, 5, 3, 4, 2, 5, 3, 4, 2, 5, 1, 3, 4, 5, 2]
annotator_2 = [4, 5, 2, 4, 4, 1, 5, 3, 3, 2, 5, 3, 4, 2, 4, 1, 3, 4, 5, 3]

kappa = cohen_kappa_score(annotator_1, annotator_2, weights="quadratic")
print(f"Cohen's Kappa (quadratic): {kappa:.3f}")

# Interpretation:
# > 0.80: Near-perfect agreement ✓
# 0.60–0.80: Substantial agreement ✓
# 0.40–0.60: Moderate agreement ⚠️
# < 0.40: Poor agreement — revisit guidelines ❌
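
To make "corrected for chance" concrete, here is a minimal from-scratch sketch of the unweighted form of Cohen's kappa, (p_o - p_e) / (1 - p_e), on the same two rating lists. It is unweighted, so every disagreement counts the same and its value will differ from the quadratic-weighted score printed above. The function name `cohen_kappa_unweighted` is an illustrative choice, not a library API.

```python
from collections import Counter

annotator_1 = [4, 5, 2, 3, 4, 1, 5, 3, 4, 2, 5, 3, 4, 2, 5, 1, 3, 4, 5, 2]
annotator_2 = [4, 5, 2, 4, 4, 1, 5, 3, 3, 2, 5, 3, 4, 2, 4, 1, 3, 4, 5, 3]


def cohen_kappa_unweighted(r1, r2):
    """Return (kappa, p_o, p_e) for two equal-length lists of labels."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("rating lists must be non-empty and the same length")
    n = len(r1)
    # p_o: observed proportion of items where both annotators agree exactly
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # p_e: agreement expected by chance, from each annotator's label frequencies
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (p_o - p_e) / (1 - p_e), p_o, p_e


kappa_unweighted, p_o, p_e = cohen_kappa_unweighted(annotator_1, annotator_2)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa_unweighted:.3f}")
```

On these ratings the annotators agree on 16 of 20 items (p_o = 0.800), chance alone would give p_e = 0.215, and the unweighted kappa comes out at about 0.745.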

Stratified Sampling

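Don’t rely on simple random sampling alone, because small categories can end up with few or no examples. Sample from each category separately so that every category is covered in your eval set: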

import random

# Your outputs categorised by type
outputs_by_category = {
    "factual_qa": [...],         # 200 examples
    "summarisation": [...],      # 100 examples
    "code_generation": [...],    # 150 examples
    "creative_writing": [...],   # 50 examples
}

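# Stratified sample: up to 10 from each category (equal allocation).
# Each example is assumed to be a dict; replace the [...] placeholders with real data.
rng = random.Random(42)  # fixed seed so the eval set is reproducible
eval_set = []
for category, examples in outputs_by_category.items():
    sample = rng.sample(examples, min(10, len(examples)))
    # Copy each example so the source data is not mutated
    eval_set.extend({**ex, "category": category} for ex in sample)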

print(f"Eval set: {len(eval_set)} examples across {len(outputs_by_category)} categories")
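
The snippet above draws the same number of examples from every category. If you would rather have the eval set mirror the category mix of your data, proportional allocation is one alternative. The sketch below only computes how many examples to draw per category, using the category sizes from the comments above. `proportional_allocation` is a hypothetical helper, and the largest-remainder rounding is one reasonable choice among several.

```python
def proportional_allocation(category_sizes, total):
    """Split `total` samples across categories in proportion to their size."""
    population = sum(category_sizes.values())
    if total < 0 or total > population:
        raise ValueError("total must be between 0 and the population size")
    # Exact (possibly fractional) share for each category
    quotas = {c: total * n / population for c, n in category_sizes.items()}
    allocation = {c: int(q) for c, q in quotas.items()}
    leftover = total - sum(allocation.values())
    # Give remaining slots to the categories with the largest fractional parts
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - allocation[c], reverse=True)
    for c in by_remainder[:leftover]:
        allocation[c] += 1
    return allocation


sizes = {
    "factual_qa": 200,
    "summarisation": 100,
    "code_generation": 150,
    "creative_writing": 50,
}
allocation = proportional_allocation(sizes, 50)
print(allocation)
```

With a budget of 50, this gives 20, 10, 15, and 5 examples respectively. Equal allocation is often the better choice when you want comparable per-category scores. Proportional allocation suits a single overall score that reflects the real mix.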

Annotation Platform Options

Platform	Best for	Cost
Label Studio (open source)	Teams with technical setup	Free
Scale AI	Production-grade, large datasets	High
Prolific	Academic/research studies	Medium
Spreadsheet + Forms	Small teams, < 200 examples	Free

Common Mistakes

Single annotator — one person’s opinion is not ground truth. Always use at least 2–3 annotators and measure agreement.
Vague rating criteria — annotators with vague guidelines will interpret the scale differently. Define every score level with examples.
No annotator training — always run annotators through a calibration round on 10–20 pre-rated examples before the real annotation task.
Convenience sampling — evaluating only on easy or typical examples overestimates quality. Include edge cases, unusual inputs, and challenging questions.
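
A calibration round can be scored with very little code. The sketch below compares each annotator's ratings on the pre-rated examples with the reference scores and flags anyone who drifts too far. All ratings are hypothetical, `calibration_report` is an illustrative helper name, and the pass thresholds (`min_exact`, `max_mad`) are assumptions you should tune for your own scale.

```python
from statistics import mean


def calibration_report(gold, annotator_ratings, min_exact=0.7, max_mad=0.5):
    """Compare each annotator's calibration ratings with the pre-rated gold scores."""
    report = {}
    for name, ratings in annotator_ratings.items():
        if len(ratings) != len(gold):
            raise ValueError(f"{name} rated {len(ratings)} items, expected {len(gold)}")
        pairs = list(zip(ratings, gold))
        exact = mean(1 if r == g else 0 for r, g in pairs)  # exact-match rate
        mad = mean(abs(r - g) for r, g in pairs)            # mean absolute deviation
        report[name] = {
            "exact": exact,
            "mad": mad,
            "passed": exact >= min_exact and mad <= max_mad,
        }
    return report


# Hypothetical calibration data: 10 pre-rated examples, two annotators.
gold = [5, 3, 1, 4, 2, 4, 3, 5, 2, 1]
ratings = {
    "annotator_a": [5, 3, 1, 4, 2, 4, 3, 4, 2, 1],
    "annotator_b": [4, 4, 2, 5, 3, 4, 4, 5, 3, 2],
}

report = calibration_report(gold, ratings)
for name, stats in report.items():
    status = "ok" if stats["passed"] else "needs retraining"
    print(f"{name}: exact={stats['exact']:.2f}, mad={stats['mad']:.2f} -> {status}")
```

Here `annotator_a` matches the reference on 9 of 10 items and passes, while `annotator_b` rates about one point too high on most items and is flagged for another pass through the guidelines.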

Quick Quiz

Test Your Understanding

Q1. What does Cohen’s Kappa measure?
A1. Inter-rater agreement between annotators, corrected for chance agreement. A kappa of 0.8+ indicates near-perfect agreement.

Q2. Why is stratified sampling better than simple random sampling for evaluation?
A2. Stratified sampling guarantees that every important category is represented in the eval set, so the overall score is not dominated by large or easy categories and weaknesses in small categories are not missed.

Q3. What is a calibration round and why is it important?
A3. A set of pre-rated examples that annotators score before the real task. It aligns their understanding of the rating scale and improves consistency.

Student Exercise

Exercise 10.5 — Annotation study
Generate 20 chatbot responses of varying quality. Have two classmates independently rate them 1–5 using your annotation guidelines. Compute Cohen’s Kappa. If kappa < 0.6, revise the guidelines and repeat.
