
8.2 Automated Metrics

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: BLEU · ROUGE · BERTScore · RAGAS (faithfulness, relevance)

Official Docs: HuggingFace Evaluate · RAGAS Docs


Installing the Evaluate Library

pip install evaluate rouge-score bert-score sacrebleu

ROUGE — For Summarisation

ROUGE measures lexical overlap (shared n-grams or shared subsequences) between generated and reference text.

| Metric | Measures |
| --- | --- |
| ROUGE-1 | Unigram (word) overlap |
| ROUGE-2 | Bigram overlap |
| ROUGE-L | Longest common subsequence |

import evaluate

rouge = evaluate.load("rouge")

predictions = [
    "The model processes text using attention mechanisms.",
    "Paris is the capital city of France.",
]
references = [
    "The model uses attention to process sequences of text.",
    "The capital of France is Paris.",
]

results = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {results['rouge1']:.3f}")
print(f"ROUGE-2: {results['rouge2']:.3f}")
print(f"ROUGE-L: {results['rougeL']:.3f}")
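To make the longest-common-subsequence idea behind ROUGE-L concrete, here is a minimal pure-Python sketch. The helper names (`tokenize`, `lcs_length`, `rouge_l`) and the simplified tokenizer are assumptions made for illustration; the library's own tokenization and aggregation may differ, so treat the numbers as indicative only.

```python
import re


def tokenize(text):
    # Lowercase and keep alphanumeric runs; a simplification of real ROUGE tokenizers.
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(prediction, reference):
    # Precision and recall are the LCS length divided by each side's token count.
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge_l("Paris is the capital city of France.", "The capital of France is Paris.")
print(f"precision={p:.3f}, recall={r:.3f}, F1={f:.3f}")
```

Here the longest common subsequence is "the capital of france" (4 tokens): word order matters, so "Paris" and "is" do not count even though both sentences contain them.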

Limitation: ROUGE ignores semantics. "The cat sat on the mat" vs "A feline rested on the rug" scores low on ROUGE (the only shared words are "on" and "the") even though the two sentences mean nearly the same thing.
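A quick way to see how little lexical overlap the paraphrase pair has is to count shared unigrams directly. This is a rough sketch with a simplified tokenizer and a hypothetical helper `rouge1_f1`, not the library's implementation.

```python
import re
from collections import Counter


def unigrams(text):
    # Lowercased alphanumeric tokens with their counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rouge1_f1(prediction, reference):
    pred, ref = unigrams(prediction), unigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prediction = "The cat sat on the mat"
reference = "A feline rested on the rug"

shared = sorted((unigrams(prediction) & unigrams(reference)).keys())
score = rouge1_f1(prediction, reference)
print(f"shared words: {shared}, ROUGE-1 F1: {score:.3f}")
```

The only matches are function words, so whatever score the pair gets comes from words that carry almost none of the meaning.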


BERTScore — Semantic Similarity

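BERTScore compares contextual embeddings (from BERT or a similar model) of the tokens in the generated and reference text, so it rewards semantic similarity rather than exact word overlap.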

bertscorer = evaluate.load("bertscore")

results = bertscorer.compute(
    predictions=predictions,
    references=references,
    lang="en",
)

# Scores are returned per example, one entry per prediction/reference pair.
for i, (p, r, f) in enumerate(zip(results["precision"], results["recall"], results["f1"])):
    print(f"Example {i+1}: Precision={p:.3f}, Recall={r:.3f}, F1={f:.3f}")
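The precision, recall, and F1 values come from greedily matching token embeddings by cosine similarity. The sketch below illustrates that matching step with hand-made 2-D vectors; the vectors and the helper `greedy_match_scores` are invented for illustration, and the real metric uses model embeddings and optional extras (such as IDF weighting) that are omitted here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_vecs, ref_vecs):
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy "embeddings": related words are given nearby vectors.
toy = {
    "cat": (0.9, 0.1),
    "feline": (0.85, 0.2),
    "mat": (0.1, 0.9),
    "rug": (0.2, 0.85),
}

candidate = [toy["cat"], toy["mat"]]
reference = [toy["feline"], toy["rug"]]

p, r, f = greedy_match_scores(candidate, reference)
print(f"precision={p:.3f}, recall={r:.3f}, F1={f:.3f}")
```

Because "cat" sits close to "feline" and "mat" close to "rug" in this toy space, the pair scores high despite sharing no words, which is the behaviour that exact-match metrics cannot provide.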

RAGAS — For RAG Evaluation

RAGAS is designed specifically for evaluating RAG pipelines.

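pip install ragas

# Note: this import shadows the Hugging Face `evaluate` module if both are used in the same script.
from ragas import evaluate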
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

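# Your RAG pipeline outputs.
# The column names below follow the schema used in this example; newer RAGAS
# releases may expect different names, so check the RAGAS docs for your version.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France is a country in Western Europe. Its capital is Paris."]],
    "ground_truth": ["Paris"],
}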

dataset = Dataset.from_dict(data)

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(results)
# Illustrative output; actual scores depend on the judge LLM:
# {"faithfulness": 1.0, "answer_relevancy": 0.95, "context_recall": 1.0}

| RAGAS Metric | What it measures |
| --- | --- |
| Faithfulness | Is the answer grounded in the retrieved context? |
| Answer Relevancy | Does the answer actually address the question? |
| Context Recall | Does the retrieved context contain the necessary info? |
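As a rough mental model, faithfulness can be thought of as the fraction of claims in the answer that the retrieved context supports. The sketch below shows only that final ratio: the function name `faithfulness_ratio` and the hand-written verdicts are hypothetical, and in RAGAS the claim extraction and support judgements are made by an LLM rather than supplied by hand.

```python
def faithfulness_ratio(claim_verdicts):
    """Return the share of claims marked as supported by the context.

    claim_verdicts maps each claim (str) to True if the context supports it.
    """
    if not claim_verdicts:
        raise ValueError("need at least one claim to score")
    supported = sum(1 for ok in claim_verdicts.values() if ok)
    return supported / len(claim_verdicts)


# Context: "France is a country in Western Europe. Its capital is Paris."
# Hypothetical verdicts for an answer containing two claims.
verdicts = {
    "Paris is the capital of France.": True,    # stated in the context
    "France is in Southern Europe.": False,     # contradicts the context
}

print(f"faithfulness = {faithfulness_ratio(verdicts):.2f}")
```

An answer whose every claim is supported would score 1.0 under this model, while each unsupported claim pulls the score down proportionally.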

Common Mistakes

  1. BLEU for long-form generation: BLEU was designed for machine translation, where outputs are short and closely tied to a reference. Use ROUGE for summarisation and BERTScore for tasks where meaning matters more than wording.
  2. Treating automated metrics as ground truth: a model can achieve high ROUGE recall by generating verbose text that repeats many of the reference words. Always sanity-check scores by reading a sample of outputs yourself.
  3. Not using batch evaluation: run metrics on your entire test set (50–200 examples), not just 3–5 examples.
  4. Forgetting RAGAS needs an LLM: RAGAS metrics such as faithfulness and answer relevancy call an LLM internally to judge quality. By default this is an OpenAI model, so set the OPENAI_API_KEY environment variable (or configure a different LLM) before running.
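The verbosity problem in mistake 2 is easy to reproduce by looking at unigram precision and recall separately. This is a simplified sketch with a hypothetical helper `rouge1_precision_recall` and made-up example sentences, not the library's scorer.

```python
import re
from collections import Counter


def rouge1_precision_recall(prediction, reference):
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    if not pred or not ref:
        return 0.0, 0.0
    overlap = sum((pred & ref).values())  # clipped count of shared words
    return overlap / sum(pred.values()), overlap / sum(ref.values())


reference = "The capital of France is Paris."
concise = "Paris is the capital of France."
verbose = (
    "The capital of France is Paris, and Paris is a city in France that many "
    "people say is the capital of the country called France."
)

for name, text in [("concise", concise), ("verbose", verbose)]:
    precision, recall = rouge1_precision_recall(text, reference)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

Both answers cover every reference word, so recall alone cannot tell them apart; only precision (and therefore F1) penalises the padding. Checking which variant of the score you are reporting is a cheap safeguard.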

Quick Quiz

Test Your Understanding

Q1. What is the main limitation of ROUGE compared to BERTScore?
A1. ROUGE measures word overlap and cannot detect semantic similarity — synonyms and paraphrases are penalised even if semantically equivalent.

Q2. What does RAGAS faithfulness measure?
A2. Whether the generated answer is factually grounded in the retrieved context, not hallucinated.

Q3. Why is BERTScore better than ROUGE for evaluating creative or paraphrased outputs?
A3. BERTScore uses contextual BERT embeddings to compare meaning, so semantically similar sentences score high even if they share few words.


Student Exercise

Exercise 10.2 — Metrics comparison
Generate 10 model answers for a simple QA task. Compute ROUGE-1, ROUGE-L, and BERTScore for each. Find two examples where ROUGE and BERTScore disagree significantly. Explain why.

