8.5 Building an Eval Pipeline

AI-Generated Content

AI-generated content may contain errors. Always verify against official sources.

Key Concepts: Dataset creation · Test harness · CI/CD for LLM quality

Official Docs: OpenAI Evals · LangSmith


What is an Evaluation Pipeline?

An evaluation pipeline is a repeatable, automated system that:

  1. Loads a fixed test dataset
  2. Runs all test cases through your LLM system
  3. Scores each output using automated metrics and, where string checks fall short, an LLM-as-judge
  4. Produces a summary report
  5. Fails the CI/CD build if quality drops below a threshold
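The LLM-as-judge part of step 3 can be sketched as follows. This is a minimal illustration, not a standard API: the prompt wording, the 1 to 5 scale, and the names `JUDGE_PROMPT` and `judge_score` are assumptions. The `complete` argument is any function that takes a prompt and returns the model's text, such as the `run_model` function defined in Step 2.

```python
import re
from typing import Callable

# Hypothetical grading prompt; adjust the rubric to your own task.
JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {answer}
Rate how well the candidate matches the reference on a scale of 1 to 5.
Reply with the number only."""


def judge_score(
    question: str,
    reference: str,
    answer: str,
    complete: Callable[[str], str],
) -> float:
    """Ask a judge model for a 1-5 rating and map it onto 0.0-1.0.

    Replies that contain no standalone digit from 1 to 5 score 0.0,
    so a misbehaving judge shows up as a failure rather than a pass.
    """
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer
    )
    reply = complete(prompt)
    match = re.search(r"\b[1-5]\b", reply)
    if match is None:
        return 0.0
    return (int(match.group()) - 1) / 4
```

Passing the model call in as a parameter keeps the judge testable without network access: a unit test can supply a fake `complete` that returns a fixed rating.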

Step 1 — Create a Test Dataset

import json
from pathlib import Path

# Golden test cases: representative, diverse, and including edge cases
test_cases = [
    {
        "id": "tc001",
        "input": "What is the capital of France?",
        "expected": "Paris",
        "category": "factual",
    },
    {
        "id": "tc002",
        "input": "Summarise the transformer architecture in 2 sentences.",
        "expected": "The Transformer uses multi-head self-attention to process sequences in parallel. It replaces recurrence with attention mechanisms and residual connections.",
        "category": "summarisation",
    },
    {
        "id": "tc003",
        "input": "Write a Python function to reverse a string.",
        "expected_contains": ["def ", "[::-1]"],  # Code should contain these
        "category": "code",
    },
]

# One JSON object per line (JSONL)
Path("eval_dataset.jsonl").write_text(
    "\n".join(json.dumps(tc) for tc in test_cases)
)
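Because every later step trusts this file, it can help to validate it when loading. The sketch below is one possible loader, assuming the field names used above (`id`, `input`, `expected`, `expected_contains`); the function name `load_dataset` is hypothetical.

```python
import json
from pathlib import Path


def load_dataset(path: str) -> list[dict]:
    """Read a JSONL dataset and check each case has the fields the harness needs."""
    cases = []
    seen_ids = set()
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        for field in ("id", "input"):
            if field not in case:
                raise ValueError(f"line {line_no}: missing '{field}'")
        if "expected" not in case and "expected_contains" not in case:
            raise ValueError(
                f"line {line_no}: needs 'expected' or 'expected_contains'"
            )
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases
```

Failing fast on a malformed case avoids spending API calls on a run whose scores would be meaningless.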

Step 2 — The Evaluation Harness

import json
from datetime import datetime, timezone
from pathlib import Path

from openai import OpenAI

MODEL = "gpt-4o-mini"
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_model(prompt: str) -> str:
    """Your model/pipeline call."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # content may be None, so fall back to an empty string
    return resp.choices[0].message.content or ""


def score_response(test_case: dict, response: str) -> dict:
    """Score a response based on which expectation fields the test case has."""
    score = {"id": test_case["id"], "passed": False, "score": 0.0}

    if "expected" in test_case:
        # Case-insensitive substring check (not an exact match)
        score["passed"] = test_case["expected"].lower() in response.lower()
        score["score"] = 1.0 if score["passed"] else 0.0

    if "expected_contains" in test_case:
        # Fraction of required strings present; passes only if all are
        hits = [s in response for s in test_case["expected_contains"]]
        score["score"] = sum(hits) / len(hits)
        score["passed"] = all(hits)

    return score


# Run evaluation
lines = Path("eval_dataset.jsonl").read_text().splitlines()
test_cases = [json.loads(line) for line in lines if line.strip()]
results = []

for tc in test_cases:
    response = run_model(tc["input"])
    score = score_response(tc, response)
    score["response"] = response
    results.append(score)
    print(f"[{'PASS' if score['passed'] else 'FAIL'}] {tc['id']}: {score['score']:.2f}")

# Summary report
passed = sum(1 for r in results if r["passed"])
overall = sum(r["score"] for r in results) / len(results)
print(f"\nOverall: {passed}/{len(results)} passed | Avg score: {overall:.3f}")

# Save report (datetime.utcnow() is deprecated since Python 3.12)
now = datetime.now(timezone.utc)
report = {
    "timestamp": now.isoformat(),
    "model": MODEL,
    "total": len(results),
    "passed": passed,
    "overall_score": overall,
    "results": results,
}
Path(f"eval_report_{now.strftime('%Y%m%d_%H%M%S')}.json").write_text(
    json.dumps(report, indent=2)
)
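Note that the substring check in `score_response` suits short factual answers such as tc001, but a free-text case like tc002 will almost never contain its two-sentence reference verbatim. One option is a graded overlap score. The sketch below is a plain-Python token-overlap F1, a rough stand-in for a unigram metric such as ROUGE-1; the name `token_f1` and the regex tokenisation are assumptions, and any pass cutoff you apply to it is a choice to tune on your own data.

```python
import re
from collections import Counter


def token_f1(expected: str, response: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    exp_tokens = re.findall(r"\w+", expected.lower())
    resp_tokens = re.findall(r"\w+", response.lower())
    if not exp_tokens or not resp_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(exp_tokens) & Counter(resp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token overlap rewards shared vocabulary, not correctness, so it works best alongside the other checks instead of replacing them.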

Step 3 — CI/CD Integration

Run evaluation on every PR using GitHub Actions:

# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'chains/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install openai evaluate rouge-score
      # The job fails if this step exits with a non-zero code
      - run: python run_eval.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

# At the end of run_eval.py: fail if quality drops below the threshold
import sys

THRESHOLD = 0.80  # minimum average score (use passed / len(results) to gate on pass rate instead)

if overall < THRESHOLD:
    print(f"\n❌ EVAL FAILED: {overall:.3f} < {THRESHOLD} threshold")
    sys.exit(1)
else:
    print(f"\n✅ EVAL PASSED: {overall:.3f} >= {THRESHOLD} threshold")
    sys.exit(0)
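A single overall threshold can hide a category that has collapsed while the others stay healthy. The sketch below groups scores by the `category` field from the Step 1 dataset, using the `test_cases` and `results` lists from Step 2; the function name `failing_categories` and the reuse of 0.80 as the per-category cutoff are assumptions.

```python
from collections import defaultdict


def failing_categories(test_cases, results, threshold=0.80):
    """Return {category: average score} for categories below the threshold."""
    # Results carry only the test id, so look the category up per id.
    category_of = {
        tc["id"]: tc.get("category", "uncategorised") for tc in test_cases
    }
    scores = defaultdict(list)
    for r in results:
        scores[category_of[r["id"]]].append(r["score"])
    averages = {cat: sum(vals) / len(vals) for cat, vals in scores.items()}
    return {cat: avg for cat, avg in averages.items() if avg < threshold}
```

In `run_eval.py` this could sit next to the overall check, for example by exiting with code 1 when the returned dictionary is non-empty.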

Common Mistakes

  1. Static threshold that never gets updated: as your system improves, raise the threshold so it keeps catching regressions. For example, once scores sit reliably above 0.80, move the gate to 0.85.
  2. Test dataset too small: fewer than 20 test cases give unreliable pass rates, because a single flipped case moves the pass rate by more than 5 percentage points. Aim for 50+ covering diverse categories and edge cases.
  3. Not versioning the eval dataset: if test cases change, scores are no longer comparable across runs. Track test dataset versions in git.
  4. Eval in prod only: run evals before deploying, not just after incidents.
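Mistakes 2 and 3 can be made concrete with two small helpers. Both are illustrative sketches with hypothetical names. `pass_rate_margin` uses the normal approximation to the binomial, which is crude for small datasets, and `dataset_fingerprint` is a content hash you could store in each report alongside the git history.

```python
import hashlib
import math
from pathlib import Path


def pass_rate_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n cases."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


def dataset_fingerprint(path: str) -> str:
    """Short SHA-256 digest of the dataset file contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]
```

For a measured pass rate of 0.80, the margin is roughly ±0.18 with 20 cases and ±0.11 with 50, which is why small datasets cannot distinguish 0.80 from 0.85. Two reports are directly comparable only when their fingerprints match.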

Quick Quiz

Test Your Understanding

Q1. What is the key property that makes an eval pipeline “repeatable”?
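A1. A fixed test dataset (versioned in git), the same scoring logic every run, and temperature=0 to keep outputs as close to deterministic as the API allows. Temperature 0 reduces run-to-run variation but does not guarantee identical outputs.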

Q2. Why should eval be integrated into CI/CD?
A2. To automatically catch quality regressions when prompts, chains, or model parameters change, before they reach production.

Q3. What is the purpose of sys.exit(1) in the eval script?
A3. A non-zero exit code signals failure to the CI/CD runner (GitHub Actions, GitLab CI), which blocks the merge/deployment.


Student Exercise

Exercise 10.4 — Build a mini eval pipeline
Create a 10-test-case dataset for a QA bot. Build the harness, run it, and produce a JSON report. Then modify a prompt to intentionally degrade quality — verify the pipeline detects the regression.
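One way to verify the regression is to compare the report from before the prompt change with the one from after. The helper below is a sketch that assumes the report layout written in Step 2 (a `results` list whose entries have `id` and `passed`); the name `find_regressions` is hypothetical.

```python
def find_regressions(baseline: dict, candidate: dict) -> list[str]:
    """IDs of test cases that passed in the baseline but fail in the candidate."""
    passed_before = {r["id"] for r in baseline["results"] if r["passed"]}
    return sorted(
        r["id"]
        for r in candidate["results"]
        if r["id"] in passed_before and not r["passed"]
    )
```

Load each report with `json.loads(Path(report_path).read_text())` before passing it in. An empty list after a deliberately degraded prompt suggests the test cases are not sensitive enough to the change.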

