9.5 Evaluation After Fine-Tuning

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Before/after comparison · Catastrophic forgetting · Overfitting signals

Official Docs: HuggingFace Evaluate · EleutherAI LM Evaluation Harness


Why Evaluation Is Necessary

Fine-tuning can go wrong in two directions:

  1. Underfitting: the model did not learn the new behaviour.
  2. Overfitting or catastrophic forgetting: the model memorised the training examples, lost general capabilities it had before fine-tuning, or both.

Always evaluate on both your task and general benchmarks.


Step 1 — Task-Specific Evaluation

Compare model outputs before and after fine-tuning on a held-out test set:

from unsloth import FastLanguageModel
import json

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    "./fine-tuned-model",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

# Load the held-out set: one JSON object per line, each with a "messages" list
with open("val.jsonl") as f:
    test_data = [json.loads(line) for line in f if line.strip()]

def run_inference(messages: list) -> str:
    """Generate a reply for a chat given as a list of message dicts."""
    # Drop the trailing reference answer, if there is one, so the model must produce it
    prompt = messages[:-1] if messages[-1]["role"] == "assistant" else messages
    text = tokenizer.apply_chat_template(
        prompt,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    # Greedy decoding keeps runs reproducible; temperature only applies when sampling
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Compare on the first 5 examples
for example in test_data[:5]:
    messages = example["messages"]
    expected = messages[-1]["content"]
    predicted = run_inference(messages)
    print(f"Expected: {expected[:100]}")
    print(f"Predicted: {predicted[:100]}")
    print("-" * 40)
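The loop above prints only the fine-tuned model's outputs. A before/after comparison needs the same prompts run through the base model as well. The sketch below is one possible shape for that, not a prescribed method: `compare_models`, `fake_base`, and `fake_tuned` are hypothetical names, and the two stand-in generators would be replaced by functions that call `run_inference` against the base and fine-tuned models. It assumes each example ends with a user turn followed by the reference assistant turn.

```python
from typing import Callable, Dict, List

Messages = List[Dict[str, str]]


def compare_models(
    examples: List[Dict[str, Messages]],
    generate_before: Callable[[Messages], str],
    generate_after: Callable[[Messages], str],
) -> List[Dict[str, str]]:
    """Run both models on the same examples and return side-by-side rows."""
    rows = []
    for example in examples:
        messages = example["messages"]
        rows.append(
            {
                "prompt": messages[-2]["content"],    # last user turn (assumed)
                "expected": messages[-1]["content"],  # reference answer
                "before": generate_before(messages),
                "after": generate_after(messages),
            }
        )
    return rows


# Stand-in generators so the sketch runs without a GPU (hypothetical).
def fake_base(messages: Messages) -> str:
    return "I am not sure."


def fake_tuned(messages: Messages) -> str:
    return "You can reset your password from the login page."


demo = [
    {
        "messages": [
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Use the reset link on the login page."},
        ]
    }
]

for row in compare_models(demo, fake_base, fake_tuned):
    for key in ("prompt", "expected", "before", "after"):
        print(f"{key:>8}: {row[key]}")
```

Keeping the rows as dictionaries makes it easy to dump them to JSONL and review the two columns next to each other.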

Step 2 — Training Curve Analysis

import matplotlib.pyplot as plt

# `trainer` is the trainer object from your training run.
# trainer.state.log_history is a list of dicts holding every logged metric.
logs = trainer.state.log_history

train_loss = [(l["step"], l["loss"]) for l in logs if "loss" in l]
eval_loss = [(l["step"], l["eval_loss"]) for l in logs if "eval_loss" in l]

fig, ax = plt.subplots(figsize=(10, 4))
if train_loss:
    ax.plot(*zip(*train_loss), label="Train Loss")
if eval_loss:
    ax.plot(*zip(*eval_loss), label="Validation Loss", linestyle="--")
ax.set_xlabel("Step")
ax.set_ylabel("Loss")
ax.set_title("Training & Validation Loss")
ax.legend()
plt.savefig("training_curve.png")

# Warning signs:
# - Validation loss rising while train loss falls = overfitting
# - Train loss not decreasing = learning rate too low, or data issues
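The first warning sign can also be checked programmatically instead of by eye. The sketch below is a rough heuristic, not a standard API: `first_overfitting_step` and its `patience` parameter are hypothetical, and the sample logs are made up to mimic the shape of `trainer.state.log_history`. It looks only at validation loss and reports the best step once several later evaluations fail to improve on it.

```python
from typing import Dict, List, Optional


def first_overfitting_step(
    logs: List[Dict[str, float]], patience: int = 2
) -> Optional[int]:
    """Return the step with the lowest validation loss once `patience`
    consecutive later evaluations have failed to improve on it.
    Return None if that never happens."""
    evals = [(entry["step"], entry["eval_loss"]) for entry in logs if "eval_loss" in entry]
    best_step, best_loss = None, float("inf")
    worse_in_a_row = 0
    for step, loss in evals:
        if loss < best_loss:
            best_step, best_loss = step, loss
            worse_in_a_row = 0
        else:
            worse_in_a_row += 1
            if worse_in_a_row >= patience:
                return best_step
    return None


# Made-up logs: train loss keeps falling while validation loss turns upward.
sample_logs = [
    {"step": 10, "loss": 2.1},
    {"step": 10, "eval_loss": 2.0},
    {"step": 20, "loss": 1.6},
    {"step": 20, "eval_loss": 1.7},
    {"step": 30, "loss": 1.2},
    {"step": 30, "eval_loss": 1.8},
    {"step": 40, "loss": 0.9},
    {"step": 40, "eval_loss": 1.95},
]

print(first_overfitting_step(sample_logs))  # 20
```

The returned step is a reasonable candidate for the checkpoint to keep, provided checkpoints were saved at evaluation steps.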

Step 3 — Catastrophic Forgetting Check

Test if the model still handles general tasks:

general_tests = [
"What is the capital of France?",
"Translate 'Hello, how are you?' to Spanish.",
"Write a Python function that sorts a list.",
"What is 2 + 2 * 3?",
]

print("=== Catastrophic Forgetting Check ===")
for question in general_tests:
    messages = [{"role": "user", "content": question}]
    answer = run_inference(messages)
    print(f"Q: {question}")
    print(f"A: {answer[:150]}")
    print()
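Reading the printed answers works for four questions, but a crude automatic score makes it easier to compare the base and fine-tuned models over a larger set. The sketch below assumes a keyword check is good enough for a smoke test: `EXPECTED_KEYWORDS`, `general_pass_rate`, and the two canned answer functions are hypothetical, and substring matching can give false positives (for example, "8" also matches "18").

```python
from typing import Callable, Dict, List

# Hypothetical expected substrings for three of the general questions.
EXPECTED_KEYWORDS: Dict[str, List[str]] = {
    "What is the capital of France?": ["paris"],
    "Translate 'Hello, how are you?' to Spanish.": ["hola"],
    "What is 2 + 2 * 3?": ["8"],
}


def general_pass_rate(answer_fn: Callable[[str], str]) -> float:
    """Fraction of questions whose answer contains every expected keyword."""
    passed = 0
    for question, keywords in EXPECTED_KEYWORDS.items():
        answer = answer_fn(question).lower()
        if all(keyword in answer for keyword in keywords):
            passed += 1
    return passed / len(EXPECTED_KEYWORDS)


# Canned answers standing in for real model calls (hypothetical).
def base_answers(question: str) -> str:
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "Translate 'Hello, how are you?' to Spanish.": "Hola, ¿cómo estás?",
        "What is 2 + 2 * 3?": "2 + 2 * 3 = 8",
    }
    return canned[question]


def tuned_answers(question: str) -> str:
    if "France" in question:
        return "Paris."
    return "Thanks for contacting support! How can I help?"


print(f"base:  {general_pass_rate(base_answers):.2f}")   # 1.00
print(f"tuned: {general_pass_rate(tuned_answers):.2f}")  # 0.33
```

A large drop between the two rates is a signal to inspect the answers by hand; it is not proof of forgetting on its own.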

Step 4 — Automated Metrics

import evaluate

rouge = evaluate.load("rouge")  # needs the rouge_score package installed

# Collect predictions and references
predictions = []
references = []

for example in test_data[:50]:
    messages = example["messages"]
    expected = messages[-1]["content"]
    predicted = run_inference(messages)
    predictions.append(predicted)
    references.append(expected)

results = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-1: {results['rouge1']:.3f}")
print(f"ROUGE-2: {results['rouge2']:.3f}")
print(f"ROUGE-L: {results['rougeL']:.3f}")
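ROUGE scores reward shared words, which is worth seeing on a toy case before trusting the numbers. The sketch below is a simplified stand-in for ROUGE-1, not the `evaluate` implementation: `unigram_f1` is a hypothetical helper that tokenises on whitespace and skips stemming, so its values will not match the library exactly.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over overlapping lowercase whitespace tokens (a simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the capital of france is paris"
wrong_but_similar = "the capital of france is lyon"
right_but_terse = "paris"

print(f"{unigram_f1(wrong_but_similar, reference):.2f}")  # 0.83
print(f"{unigram_f1(right_but_terse, reference):.2f}")    # 0.29
```

The factually wrong answer scores far higher than the correct one-word answer, which is why overlap metrics should be paired with human or LLM-as-judge review.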

Common Mistakes

  1. Evaluating only on training data — perfect training-set performance means nothing. Always evaluate on a held-out test set.
  2. Ignoring catastrophic forgetting — fine-tuning on a narrow task often degrades general capabilities. Always test general tasks.
  3. ROUGE as the only metric — ROUGE measures word overlap, not semantic quality. Always combine automated metrics with human or LLM-as-judge evaluation.
  4. Not testing edge cases — test with unusual inputs, very long inputs, and adversarial inputs before deploying.
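Avoiding the first mistake starts before training, by setting examples aside that the model never sees. The sketch below shows one simple way to do that, assuming the dataset fits in memory as a list of records; `split_holdout`, the 20% fraction, and the seed are illustrative choices rather than recommendations from this chapter.

```python
import random
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def split_holdout(
    records: List[T], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `records` with a fixed seed and split off a test set.

    Returns (train, test). The input list is left unchanged.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = [{"id": i} for i in range(50)]
train_set, test_set = split_holdout(examples)
print(len(train_set), len(test_set))  # 40 10
```

Fixing the seed means the same examples stay in the test set across runs, so scores from different fine-tuning attempts remain comparable.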

Quick Quiz

Test Your Understanding

Q1. What is catastrophic forgetting in the context of fine-tuning?
A1. The loss of capabilities the base model had before fine-tuning: as the weights adapt to a narrow fine-tuning task, general skills such as basic arithmetic or translation can degrade.

Q2. What does a rising validation loss with a falling training loss indicate?
A2. Overfitting — the model is memorising the training examples rather than learning generalisable patterns. Training should be stopped earlier.

Q3. Why is ROUGE not sufficient as the sole evaluation metric?
A3. ROUGE measures word overlap, not semantic correctness or quality. A model can score low ROUGE while giving a semantically correct answer, or high ROUGE while giving a wrong answer.


Student Exercise

Exercise 9.5 — Pre/post comparison
Fine-tune a small model (e.g., Llama 3.2 1B) on a 50-example customer support dataset. Run the same 10 held-out test questions before and after fine-tuning and compare the outputs. Also run the 4 general-knowledge questions from Step 3, before and after, to check for catastrophic forgetting.

