8.2 Automated Metrics
Key Concepts: BLEU · ROUGE · BERTScore · RAGAS (faithfulness, relevance)
Official Docs: HuggingFace Evaluate · RAGAS Docs
Installing the Evaluate Library
pip install evaluate rouge-score bert-score sacrebleu
ROUGE — For Summarisation
ROUGE measures lexical overlap between generated and reference text, counted over n-grams (ROUGE-1, ROUGE-2) or the longest common subsequence (ROUGE-L).
| Metric | Measures |
|---|---|
| ROUGE-1 | Unigram (word) overlap |
| ROUGE-2 | Bigram overlap |
| ROUGE-L | Longest common subsequence |
import evaluate

# Load the ROUGE metric (backed by the rouge-score package installed above).
rouge = evaluate.load("rouge")

# Model outputs and their human-written references, aligned by index.
predictions = [
    "The model processes text using attention mechanisms.",
    "Paris is the capital city of France.",
]
references = [
    "The model uses attention to process sequences of text.",
    "The capital of France is Paris.",
]

# compute() returns one aggregated score per ROUGE variant.
results = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {results['rouge1']:.3f}")
print(f"ROUGE-2: {results['rouge2']:.3f}")
print(f"ROUGE-L: {results['rougeL']:.3f}")
Limitation: ROUGE ignores semantics. "The cat sat on the mat" and "A feline rested on the rug" mean nearly the same thing, yet they score low on ROUGE because the only words they share are "on" and "the".
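To make the overlap counts concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L as F1 scores. It assumes lowercased whitespace tokens and a single reference; the `rouge-score` package does its own tokenisation and optional stemming, so its numbers can differ. The helper names (`ngrams`, `f1`, `rouge_n`, `lcs_length`, `rouge_l`) are illustrative, not part of any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(matches, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F1 from clipped n-gram counts."""
    pred = Counter(ngrams(prediction.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    matches = sum((pred & ref).values())  # each n-gram counts at most as often as in both
    return f1(matches, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1 from the longest common subsequence of tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


pairs = [
    ("paris is the capital city of france", "the capital of france is paris"),
    ("the cat sat on the mat", "a feline rested on the rug"),
]
for pred, ref in pairs:
    print(f"{rouge_n(pred, ref, 1):.3f} {rouge_n(pred, ref, 2):.3f} {rouge_l(pred, ref):.3f}")
# prints:
# 0.923 0.364 0.615
# 0.333 0.200 0.333
```

The second pair is the paraphrase from the limitation above: the shared words "on" and "the" are the only source of overlap, so every score stays low even though the meaning is close.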
BERTScore — Semantic Similarity
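BERTScore compares the contextual BERT embeddings of the two texts, token by token, so it measures semantic similarity rather than exact word overlap. It reports precision, recall, and F1 for each prediction-reference pair.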
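# Reuses `evaluate`, `predictions`, and `references` from the ROUGE example.
bertscorer = evaluate.load("bertscore")
results = bertscorer.compute(
    predictions=predictions,
    references=references,
    lang="en",  # selects a default English model
)
# Scores are returned per example, as parallel lists.
for i, (p, r, f) in enumerate(zip(results["precision"], results["recall"], results["f1"])):
    print(f"Example {i+1}: Precision={p:.3f}, Recall={r:.3f}, F1={f:.3f}")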
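The idea behind those three numbers can be sketched without a model. BERTScore greedily matches each token to its most similar token on the other side by cosine similarity, then averages. The sketch below assumes hand-picked 2-D vectors in place of real BERT embeddings (`toy_embeddings` and `greedy_match_score` are hypothetical names), and it leaves out the optional IDF weighting and baseline rescaling of the real metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_score(pred_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy max-cosine matching."""
    # Precision: match each predicted token to its most similar reference token.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: match each reference token to its most similar predicted token.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical 2-D "embeddings", chosen by hand for illustration only.
toy_embeddings = {
    "cat": (1.0, 0.0),
    "feline": (0.8, 0.6),
    "sat": (0.0, 1.0),
    "rested": (0.6, 0.8),
}

pred_vecs = [toy_embeddings[t] for t in ("feline", "rested")]
ref_vecs = [toy_embeddings[t] for t in ("cat", "sat")]
precision, recall, f1 = greedy_match_score(pred_vecs, ref_vecs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # prints P=0.800 R=0.800 F1=0.800
```

The two token lists share no words, so any n-gram metric would give them zero overlap, yet the matching still produces a high score because the vectors for "feline" and "cat" point in similar directions.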
RAGAS — For RAG Evaluation
RAGAS is designed specifically for evaluating RAG pipelines.
pip install ragas
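# Alias the import so it does not shadow the Hugging Face `evaluate` module used earlier.
from ragas import evaluate as ragas_evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

# Your RAG pipeline outputs.
# Column names follow the schema shown here; they can differ between RAGAS
# releases, so check the RAGAS docs for your installed version.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France is a country in Western Europe. Its capital is Paris."]],
    "ground_truth": ["Paris"],
}
dataset = Dataset.from_dict(data)

results = ragas_evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(results)
# Illustrative output; actual scores depend on the judge LLM:
# {"faithfulness": 1.0, "answer_relevancy": 0.95, "context_recall": 1.0}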
| RAGAS Metric | What it measures |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? |
| Answer Relevancy | Does the answer actually address the question? |
| Context Recall | Does the retrieved context contain the necessary info? |
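As a rough illustration of the faithfulness row, the score can be read as the fraction of claims in the answer that the retrieved context supports. The sketch below assumes the claims have already been split out by hand and uses a naive keyword check (`keyword_judge`, a hypothetical stand-in) where RAGAS would ask an LLM to extract and verify each claim. It shows the arithmetic only, not how RAGAS is implemented.

```python
from typing import Callable, Sequence


def faithfulness_score(
    claims: Sequence[str],
    contexts: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of answer claims that the judge marks as supported by the contexts."""
    if not claims:
        return 0.0  # arbitrary convention for an empty answer in this sketch
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def keyword_judge(claim: str, contexts: Sequence[str]) -> bool:
    """Naive stand-in for an LLM judge: every word of the claim must occur in the contexts."""
    context_words = set(" ".join(contexts).lower().replace(".", "").split())
    return all(word in context_words for word in claim.lower().replace(".", "").split())


contexts = ["France is a country in Western Europe. Its capital is Paris."]
claims = [
    "Paris is its capital",  # every word occurs in the context
    "Paris is in Germany",   # "germany" never occurs in the context
]
print(faithfulness_score(claims, contexts, keyword_judge))  # prints 0.5
```

A bag-of-words check like this accepts scrambled claims that an LLM judge would reject, which is one reason RAGAS relies on a model for the verification step.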
Common Mistakes
- Using BLEU for long-form generation: BLEU was designed for machine translation, where outputs are short and closely track the reference. Use ROUGE for summarisation and BERTScore for semantic tasks.
- Treating automated metrics as ground truth: a model can achieve high ROUGE recall by generating verbose text that repeats many of the reference words. Always sanity-check the scores by reading a sample of outputs yourself.
- Evaluating on too few examples: run metrics on your entire test set (50–200 examples), not just 3–5 examples.
- Forgetting that RAGAS needs an LLM: the RAGAS faithfulness and relevancy metrics call an LLM internally to judge quality. The default is an OpenAI model, so set the OPENAI_API_KEY environment variable before running, or configure a different LLM.
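The verbosity problem is easy to reproduce. The sketch below computes unigram precision and recall for a concise answer and a padded one, assuming lowercased whitespace tokens. The function name and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision_recall(prediction, reference):
    """Clipped unigram overlap, reported as (precision, recall)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((pred & ref).values())
    return matches / sum(pred.values()), matches / sum(ref.values())


reference = "the capital of france is paris"
concise = "paris is the capital of france"
verbose = "the capital of france is paris and paris is a city in france that is the capital"

for name, text in [("concise", concise), ("verbose", verbose)]:
    p, r = unigram_precision_recall(text, reference)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
# prints:
# concise: precision=1.00 recall=1.00
# verbose: precision=0.35 recall=1.00
```

Recall alone cannot tell the two answers apart, so report precision or F1 alongside it and read a sample of outputs before trusting the number.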
Quick Quiz
Q1. What is the main limitation of ROUGE compared to BERTScore?
A1. ROUGE measures word overlap and cannot detect semantic similarity — synonyms and paraphrases are penalised even if semantically equivalent.
Q2. What does RAGAS faithfulness measure?
A2. Whether the generated answer is factually grounded in the retrieved context, not hallucinated.
Q3. Why is BERTScore better than ROUGE for evaluating creative or paraphrased outputs?
A3. BERTScore uses contextual BERT embeddings to compare meaning, so semantically similar sentences score high even if they share few words.
Student Exercise
Exercise 10.2 — Metrics comparison
Generate 10 model answers for a simple QA task. Compute ROUGE-1, ROUGE-L, and BERTScore for each. Find two examples where ROUGE and BERTScore disagree significantly. Explain why.
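One possible starting point for spotting disagreements is to compare ranks rather than raw values, since BERTScore values often sit in a narrow high band while ROUGE spreads across the whole 0 to 1 range. The helpers below are a hypothetical scaffold, and the score lists are placeholders to replace with your own results.

```python
def ranks(scores):
    """Rank position of each score (0 = lowest); ties keep their original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def largest_disagreements(rouge_scores, bert_scores, k=2):
    """Indices of the k examples whose ROUGE rank and BERTScore rank differ most."""
    rouge_ranks, bert_ranks = ranks(rouge_scores), ranks(bert_scores)
    gaps = [abs(a - b) for a, b in zip(rouge_ranks, bert_ranks)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:k]


# Placeholder numbers, not measured results: replace with your own scores.
rouge_l_scores = [0.90, 0.20, 0.55, 0.10, 0.70]
bert_f1_scores = [0.95, 0.93, 0.88, 0.86, 0.87]
print(largest_disagreements(rouge_l_scores, bert_f1_scores))  # prints [1, 4]
```

The returned indices point at the answers worth reading closely; explaining why the two metrics rank them differently is still the exercise.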
Further Reading
- 📘 HuggingFace Evaluate
- 📘 RAGAS Documentation
- 📄 BERTScore: Evaluating Text Generation with BERT (Zhang et al., 2020)
Next → 10.3 LLM-as-Judge