8.2 Automated Metrics

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: BLEU · ROUGE · BERTScore · RAGAS (faithfulness, relevance)

Official Docs: HuggingFace Evaluate · RAGAS Docs

Installing the Evaluate Library

pip install evaluate rouge-score bert-score sacrebleu

ROUGE — For Summarisation

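ROUGE scores a generated text by how much it overlaps with a reference text, either in shared n-grams (ROUGE-1, ROUGE-2) or in the longest common subsequence of words (ROUGE-L).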

Metric	Measures
ROUGE-1	Unigram (word) overlap
ROUGE-2	Bigram overlap
ROUGE-L	Longest common subsequence

import evaluate

rouge = evaluate.load("rouge")

predictions = [
    "The model processes text using attention mechanisms.",
    "Paris is the capital city of France.",
]
references = [
    "The model uses attention to process sequences of text.",
    "The capital of France is Paris.",
]

# By default the returned scores are F-measures aggregated over all prediction/reference pairs
results = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {results['rouge1']:.3f}")
print(f"ROUGE-2: {results['rouge2']:.3f}")
print(f"ROUGE-L: {results['rougeL']:.3f}")

Limitation: ROUGE ignores semantics. "The cat sat on the mat" vs "A feline rested on the rug" scores low on ROUGE (the only shared words are "on" and "the") despite the two sentences meaning nearly the same thing.
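The example above can be checked by hand. The following is a minimal pure-Python sketch of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, assuming lowercased whitespace tokenisation and no stemming. The helper names (`ngram_f1`, `lcs_length`, `rouge_l_f1`) are illustrative and not part of the `evaluate` library, whose own tokenisation may give slightly different numbers.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Combine overlap counts into an F1 score; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def ngram_f1(prediction, reference, n):
    """ROUGE-N style F1 from clipped n-gram overlap between two strings."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(prediction, reference):
    """ROUGE-L style F1 based on the longest common subsequence."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "The cat sat on the mat"
reference = "A feline rested on the rug"

rouge1 = ngram_f1(prediction, reference, 1)
rouge2 = ngram_f1(prediction, reference, 2)
rouge_l = rouge_l_f1(prediction, reference)
print(f"ROUGE-1: {rouge1:.3f}")  # 0.333
print(f"ROUGE-2: {rouge2:.3f}")  # 0.200
print(f"ROUGE-L: {rouge_l:.3f}")  # 0.333
```

Every match here comes from the function words "on" and "the". None of the content words (cat/feline, sat/rested, mat/rug) contribute, which is exactly the blind spot of overlap-based metrics.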

BERTScore — Semantic Similarity

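BERTScore measures semantic similarity rather than word overlap: it embeds every token with a BERT-style model, matches each token in one text to its most similar token in the other by cosine similarity, and reports precision, recall, and F1 per example.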

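# Reuses `evaluate`, `predictions`, and `references` from the ROUGE example above.
# The first run downloads a pretrained model, so it needs network access.
bertscorer = evaluate.load("bertscore")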

results = bertscorer.compute(
    predictions=predictions,
    references=references,
    lang="en",
)

for i, (p, r, f) in enumerate(zip(results["precision"], results["recall"], results["f1"])):
    print(f"Example {i+1}: Precision={p:.3f}, Recall={r:.3f}, F1={f:.3f}")
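To see what those three numbers mean, here is a toy sketch of the greedy matching at the core of BERTScore. It assumes hand-made 2-dimensional vectors in place of real contextual embeddings (the `toy_embeddings` table and `greedy_match` helper are invented for illustration) and leaves out the optional IDF weighting and baseline rescaling.

```python
import math

# Hypothetical 2-d "embeddings"; a real run would use contextual BERT vectors.
toy_embeddings = {
    "cat": (1.0, 0.1),
    "feline": (0.9, 0.2),
    "sat": (0.1, 1.0),
    "rested": (0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_match(source_tokens, target_tokens):
    """Average, over source tokens, of the best similarity to any target token."""
    best = [
        max(cosine(toy_embeddings[s], toy_embeddings[t]) for t in target_tokens)
        for s in source_tokens
    ]
    return sum(best) / len(best)


candidate = ["cat", "sat"]
reference = ["feline", "rested"]

precision = greedy_match(candidate, reference)  # each candidate token vs. the reference
recall = greedy_match(reference, candidate)     # each reference token vs. the candidate
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
```

The candidate and reference share no words at all, yet the score is high because each token has a close neighbour in embedding space. An n-gram metric would give this pair zero.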

RAGAS — For RAG Evaluation

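RAGAS is designed specifically for evaluating RAG pipelines. It scores both halves of the pipeline: whether retrieval found the right context, and whether the generated answer stays faithful to that context and addresses the question.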

pip install ragas

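# Imported under an alias so it does not shadow the Hugging Face `evaluate` module used earlier
from ragas import evaluate as ragas_evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

# Your RAG pipeline outputs, one list entry per example.
# Column names can differ between RAGAS releases; check the RAGAS docs for your installed version.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France is a country in Western Europe. Its capital is Paris."]],
    "ground_truth": ["Paris"],
}

dataset = Dataset.from_dict(data)

# These metrics call an LLM judge, so an API key must be configured before this runs
results = ragas_evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall],
)
print(results)
# Illustrative output; actual scores vary with the judge LLM and RAGAS version:
# {"faithfulness": 1.0, "answer_relevancy": 0.95, "context_recall": 1.0}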

RAGAS Metric	What it measures
Faithfulness	Is the answer grounded in the retrieved context?
Answer Relevancy	Does the answer actually address the question?
Context Recall	Does the retrieved context contain the necessary info?
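As a rough mental model, faithfulness and context recall both reduce to a ratio of supported statements. The sketch below assumes the statement-level verdicts are already available (in RAGAS an LLM judge produces them). The verdict lists and the `supported_fraction` helper are hypothetical and not part of the RAGAS API.

```python
def supported_fraction(verdicts):
    """Share of statements judged as supported; verdicts is a list of booleans."""
    if not verdicts:
        raise ValueError("need at least one statement to score")
    return sum(verdicts) / len(verdicts)


# Hypothetical judge output for an answer split into three claims:
# two can be inferred from the retrieved context, one cannot.
answer_claims_supported = [True, True, False]

# Hypothetical judge output for a ground truth split into two statements,
# both of which can be attributed to the retrieved context.
ground_truth_statements_found = [True, True]

faithfulness_score = supported_fraction(answer_claims_supported)
context_recall_score = supported_fraction(ground_truth_statements_found)
print(f"faithfulness={faithfulness_score:.2f}, context_recall={context_recall_score:.2f}")
# prints: faithfulness=0.67, context_recall=1.00
```

Read this way, a faithfulness score below 1.0 means some claims in the answer could not be traced back to the context, and a context recall below 1.0 means retrieval missed information the reference answer needs.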

Common Mistakes

BLEU for long-form generation: BLEU was designed for machine translation with short outputs. Use ROUGE for summarisation and BERTScore for semantic tasks.
Treating automated metrics as ground truth: overlap metrics reward surface matches, so a verbose answer that repeats many of the reference words can score high on ROUGE recall while being a poor summary. Always sanity-check the scores by reading a handful of examples yourself.
Not using batch evaluation: run metrics on your entire test set (50–200 examples), not just 3–5 examples.
Forgetting RAGAS needs an LLM: RAGAS metrics such as faithfulness and answer relevancy call an LLM internally to judge quality. The default setup uses OpenAI models, so set the OPENAI_API_KEY environment variable (or configure a different LLM) before running.
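The verbosity problem described above can be reproduced in a few lines. This sketch assumes whitespace tokenisation and non-empty inputs, and computes unigram precision and recall by hand (the `unigram_scores` helper is illustrative, not a library function). It compares a concise answer with a padded one against the same reference.

```python
from collections import Counter


def unigram_scores(prediction, reference):
    """Return (precision, recall) of clipped unigram overlap for non-empty texts."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # each reference word counts at most once per occurrence
    return overlap / sum(pred.values()), overlap / sum(ref.values())


reference = "paris is the capital of france"
concise = "the capital is paris"
verbose = "paris is the capital of france and france is a country and paris is a city"

concise_precision, concise_recall = unigram_scores(concise, reference)
verbose_precision, verbose_recall = unigram_scores(verbose, reference)

print(f"concise: precision={concise_precision:.2f}, recall={concise_recall:.2f}")
print(f"verbose: precision={verbose_precision:.2f}, recall={verbose_recall:.2f}")
```

The padded answer reaches perfect recall simply by containing every reference word, while its precision falls. Looking at recall alone would rank it above the concise answer, which is why reporting precision or F1 alongside recall, and reading a sample of outputs, matters.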

Quick Quiz

Test Your Understanding

Q1. What is the main limitation of ROUGE compared to BERTScore?
A1. ROUGE measures word overlap and cannot detect semantic similarity — synonyms and paraphrases are penalised even if semantically equivalent.

Q2. What does RAGAS faithfulness measure?
A2. Whether the generated answer is factually grounded in the retrieved context, not hallucinated.

Q3. Why is BERTScore better than ROUGE for evaluating creative or paraphrased outputs?
A3. BERTScore uses contextual BERT embeddings to compare meaning, so semantically similar sentences score high even if they share few words.

Student Exercise

Exercise 10.2 — Metrics comparison
Generate 10 model answers for a simple QA task. Compute ROUGE-1, ROUGE-L, and BERTScore for each. Find two examples where ROUGE and BERTScore disagree significantly. Explain why.
