8.5 Building an Eval Pipeline

Key Concepts: Dataset creation · Test harness · CI/CD for LLM quality

Official Docs: OpenAI Evals · LangSmith

What is an Evaluation Pipeline?

An evaluation pipeline is a repeatable, automated system that:

Loads a fixed test dataset
Runs all test cases through your LLM system
Scores each output using automated metrics + LLM-as-judge
Produces a summary report
Fails the CI/CD build if quality drops below a threshold
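
The LLM-as-judge part of the scoring step is not shown in the harness in Step 2, so here is a minimal sketch of one way it could look. The prompt wording, the 1 to 5 scale, and the names `JUDGE_TEMPLATE`, `judge_score`, and `ask_judge` are all assumptions made for illustration. The judge is passed in as a plain callable so the parsing logic can be tested without an API call.

```python
import re
from typing import Callable

# Hypothetical grading prompt; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = """You are grading an answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {answer}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""


def judge_score(question: str, reference: str, answer: str,
                ask_judge: Callable[[str], str]) -> float:
    """Ask a judge model for a 1-5 rating and map it onto 0.0-1.0."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer
    )
    reply = ask_judge(prompt)
    # Take the first standalone digit between 1 and 5 in the reply.
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        return 0.0  # an unparseable reply is treated as a failure
    return (int(match.group(1)) - 1) / 4


# Example with a stub judge that always answers "4".
print(judge_score("What is 2+2?", "4", "four", lambda prompt: "4"))
```

In a real run, `ask_judge` could wrap the same kind of chat completion call that `run_model` makes in Step 2, ideally with a different or stronger model than the one under test.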

Step 1 — Create a Test Dataset

import json
from pathlib import Path

# Golden test cases — representative, diverse, includes edge cases
test_cases = [
    {
        "id": "tc001",
        "input": "What is the capital of France?",
        "expected": "Paris",
        "category": "factual",
    },
    {
        "id": "tc002",
        "input": "Summarise the transformer architecture in 2 sentences.",
        "expected": "The Transformer uses multi-head self-attention to process sequences in parallel. It replaces recurrence with attention mechanisms and residual connections.",
        "category": "summarisation",
    },
    {
        "id": "tc003",
        "input": "Write a Python function to reverse a string.",
        "expected_contains": ["def ", "[::-1]"],  # Code should contain these
        "category": "code",
    },
]

Path("eval_dataset.jsonl").write_text(
    "\n".join(json.dumps(tc) for tc in test_cases)
)
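
A malformed dataset can silently skew results, so it may help to validate the file when loading it. The sketch below assumes the field names used in the test cases above (`id`, `input`, `category`, and one of `expected` or `expected_contains`). The function name `load_dataset` and the specific checks are illustrative choices, not a fixed convention.

```python
import json
from pathlib import Path


def load_dataset(path: str) -> list[dict]:
    """Load a JSONL eval dataset and check a few basic invariants."""
    cases = []
    seen_ids = set()
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        for field in ("id", "input", "category"):
            if field not in case:
                raise ValueError(f"line {line_no}: missing field {field!r}")
        if "expected" not in case and "expected_contains" not in case:
            raise ValueError(
                f"line {line_no}: needs 'expected' or 'expected_contains'"
            )
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases
```

Calling `load_dataset("eval_dataset.jsonl")` at the start of the harness would then fail fast on a broken dataset rather than partway through a run.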

Step 2 — The Evaluation Harness

from openai import OpenAI
import json
from pathlib import Path
from datetime import datetime

client = OpenAI()

def run_model(prompt: str) -> str:
    """Your model/pipeline call."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content or ""  # content can be None

def score_response(test_case: dict, response: str) -> dict:
    """Score a response based on test case type."""
    score = {"id": test_case["id"], "passed": False, "score": 0.0}
    
    if "expected" in test_case:
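        # Case-insensitive substring check: passes if the expected text
        # appears anywhere in the response (not strict equality)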
        score["passed"] = test_case["expected"].lower() in response.lower()
        score["score"] = 1.0 if score["passed"] else 0.0
    
    if "expected_contains" in test_case:
        # All expected strings present?
        hits = [s in response for s in test_case["expected_contains"]]
        score["score"] = sum(hits) / len(hits)
        score["passed"] = all(hits)
    
    return score

# Run evaluation
test_cases = [
    json.loads(line)
    for line in Path("eval_dataset.jsonl").read_text().splitlines()
    if line.strip()  # skip blank lines
]
results = []

for tc in test_cases:
    response = run_model(tc["input"])
    score = score_response(tc, response)
    score["response"] = response
    results.append(score)
    print(f"[{'PASS' if score['passed'] else 'FAIL'}] {tc['id']}: {score['score']:.2f}")

# Summary report
passed = sum(1 for r in results if r["passed"])
overall = sum(r["score"] for r in results) / len(results)
print(f"\nOverall: {passed}/{len(results)} passed | Avg score: {overall:.3f}")

# Save report (one timestamp is reused for the report body and the file name)
now = datetime.now().astimezone()  # timezone-aware local time
report = {
    "timestamp": now.isoformat(),
    "model": "gpt-4o-mini",
    "total": len(results),
    "passed": passed,
    "overall_score": overall,
    "results": results,
}
Path(f"eval_report_{now.strftime('%Y%m%d_%H%M%S')}.json").write_text(
    json.dumps(report, indent=2)
)
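
The substring check in `score_response` only passes a free-text case such as tc002 if the model reproduces the reference almost verbatim. One common alternative for such cases is a token-overlap F1 score, sketched below. This metric is swapped in for illustration and is not part of the harness above; the function name `token_f1` and the simple `\w+` tokenisation are assumptions.

```python
import re
from collections import Counter


def token_f1(expected: str, response: str) -> float:
    """Token-level F1 between a reference and a response (0.0 to 1.0)."""
    exp_tokens = re.findall(r"\w+", expected.lower())
    got_tokens = re.findall(r"\w+", response.lower())
    if not exp_tokens or not got_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(exp_tokens) & Counter(got_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(got_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat"))
```

A harness could route cases in the summarisation category to `token_f1` and treat a score above some cutoff as a pass. That cutoff is a judgment call and would need tuning against examples you have graded by hand.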

Step 3 — CI/CD Integration

Run evaluation on every PR using GitHub Actions:

# .github/workflows/llm-eval.yml
name: LLM Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'chains/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install openai evaluate rouge-score
      - run: python run_eval.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        # Fails the build if exit code != 0

# At the end of run_eval.py, fail if quality drops below the threshold
import sys

THRESHOLD = 0.80  # minimum average score; `overall` is the mean score, so partial credit counts

if overall < THRESHOLD:
    print(f"\n❌ EVAL FAILED: {overall:.3f} < {THRESHOLD} threshold")
    sys.exit(1)
else:
    print(f"\n✅ EVAL PASSED: {overall:.3f} >= {THRESHOLD} threshold")
    sys.exit(0)
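
A fixed threshold catches large failures but can miss a gradual decline that stays above it. A possible extension, sketched here under assumptions, is to also compare against the score from a previously saved report. The names `gate`, `load_baseline`, and `max_drop`, and the 0.02 tolerance, are hypothetical. The `overall_score` key matches the report written in Step 2.

```python
import json
from pathlib import Path
from typing import Optional


def load_baseline(path: str) -> Optional[float]:
    """Read overall_score from an earlier report, or None if it is absent."""
    report_path = Path(path)
    if not report_path.exists():
        return None
    return json.loads(report_path.read_text())["overall_score"]


def gate(overall: float, threshold: float,
         baseline: Optional[float] = None, max_drop: float = 0.02) -> int:
    """Return a process exit code: 0 means pass, 1 means fail."""
    if overall < threshold:
        return 1  # below the absolute quality floor
    if baseline is not None and overall < baseline - max_drop:
        return 1  # regressed too far relative to the baseline run
    return 0


print(gate(0.85, 0.80))                  # no baseline available
print(gate(0.85, 0.80, baseline=0.95))   # large drop from baseline
```

The script could then end with `sys.exit(gate(overall, THRESHOLD, load_baseline("baseline_report.json")))`, where the baseline file name is a placeholder for wherever the last accepted report is stored.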

Common Mistakes

Static threshold that never gets updated: as your model improves, raise the threshold so it keeps tracking current quality. For example, move from 0.80 to 0.85 once scores sit consistently above that level.
Test dataset too small: fewer than 20 test cases gives unreliable pass rates, because with 20 cases a single failure moves the pass rate by 5 percentage points. Aim for 50+ covering diverse categories and edge cases.
Not versioning the eval dataset: if test cases change, scores are not comparable between runs. Track test dataset versions in git.
Eval in prod only: run evals before deploying, not just after incidents.
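
To make the point about small datasets concrete, the sketch below computes a Wilson score interval for a pass rate. The choice of the Wilson interval and the function name `wilson_interval` are assumptions made for illustration; `z=1.96` corresponds to roughly 95% confidence.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passed/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return centre - half_width, centre + half_width


# Same 80% pass rate, different dataset sizes.
print(wilson_interval(16, 20))
print(wilson_interval(160, 200))
```

With 16 of 20 cases passing, the interval spans roughly 0.58 to 0.92, so an observed 80% pass rate says little about whether the true rate is above a 0.80 threshold. With 160 of 200 it is considerably narrower.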

Quick Quiz

Test Your Understanding

Q1. What is the key property that makes an eval pipeline “repeatable”?
A1. A fixed test dataset (versioned in git), the same scoring logic on every run, and temperature=0 to minimise output variation. Note that temperature=0 reduces randomness but does not guarantee identical outputs across runs.

Q2. Why should eval be integrated into CI/CD?
A2. To automatically catch quality regressions when prompts, chains, or model parameters change, before they reach production.

Q3. What is the purpose of sys.exit(1) in the eval script?
A3. A non-zero exit code signals failure to the CI/CD runner (GitHub Actions, GitLab CI), which blocks the merge/deployment.

Student Exercise

Exercise 10.4 — Build a mini eval pipeline
Create a 10-test-case dataset for a QA bot. Build the harness, run it, and produce a JSON report. Then modify a prompt to intentionally degrade quality — verify the pipeline detects the regression.
