8.5 Building an Eval Pipeline
Key Concepts: Dataset creation · Test harness · CI/CD for LLM quality
Official Docs: OpenAI Evals · LangSmith
What is an Evaluation Pipeline?
An evaluation pipeline is a repeatable, automated system that:
- Loads a fixed test dataset
- Runs all test cases through your LLM system
- Scores each output using automated metrics and, where string checks are not enough, an LLM-as-judge
- Produces a summary report
- Fails the CI/CD build if quality drops below a threshold
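The harness later in this section scores with string checks only, so here is a minimal sketch of what the LLM-as-judge part of the scoring step could look like. The prompt wording, the 1 to 5 scale, and the names `JUDGE_PROMPT`, `judge_score`, and `judge_fn` are illustrative assumptions, not part of any library. `judge_fn` is any callable that sends a prompt to a model and returns its text reply, which keeps the sketch testable without an API key.

```python
import re
from typing import Callable

# Hypothetical grading prompt; adjust the rubric to your own task.
JUDGE_PROMPT = """You are grading an answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate the candidate from 1 (wrong) to 5 (fully correct and complete).
Reply with the number only."""


def judge_score(
    question: str,
    reference: str,
    candidate: str,
    judge_fn: Callable[[str], str],
) -> float:
    """Ask a judge model for a 1-5 rating and map it onto 0.0-1.0."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    reply = judge_fn(prompt)
    # Accept the first standalone digit between 1 and 5 in the reply.
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        return 0.0  # unparseable reply counts as a failure
    return (int(match.group(1)) - 1) / 4
```

In the harness, `judge_fn` could simply be the section's `run_model` function, ideally pointed at a different model from the one under test.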
Step 1 — Create a Test Dataset
import json
from pathlib import Path

# Golden test cases: representative, diverse, and including edge cases
test_cases = [
    {
        "id": "tc001",
        "input": "What is the capital of France?",
        "expected": "Paris",
        "category": "factual",
    },
    {
        "id": "tc002",
        "input": "Summarise the transformer architecture in 2 sentences.",
        "expected": "The Transformer uses multi-head self-attention to process sequences in parallel. It replaces recurrence with attention mechanisms and residual connections.",
        "category": "summarisation",
    },
    {
        "id": "tc003",
        "input": "Write a Python function to reverse a string.",
        "expected_contains": ["def ", "[::-1]"],  # Code should contain these
        "category": "code",
    },
]

# Write one JSON object per line (JSONL)
Path("eval_dataset.jsonl").write_text(
    "\n".join(json.dumps(tc) for tc in test_cases)
)
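Before a dataset like this is committed, it can help to check it for structural mistakes and to record a fingerprint of its contents, so that two reports can be confirmed to come from the same test cases. The following is a minimal sketch under the assumption that every case needs an `id`, an `input`, and either `expected` or `expected_contains`, as in the cases above. The helper names `validate_cases` and `dataset_fingerprint` are hypothetical.

```python
import hashlib
import json


def validate_cases(cases: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    seen_ids = set()
    for index, tc in enumerate(cases):
        label = tc.get("id", f"<index {index}>")
        if "id" not in tc:
            problems.append(f"{label}: missing 'id'")
        elif tc["id"] in seen_ids:
            problems.append(f"{label}: duplicate id")
        else:
            seen_ids.add(tc["id"])
        if "input" not in tc:
            problems.append(f"{label}: missing 'input'")
        if "expected" not in tc and "expected_contains" not in tc:
            problems.append(f"{label}: needs 'expected' or 'expected_contains'")
    return problems


def dataset_fingerprint(cases: list[dict]) -> str:
    """Short SHA-256 digest of the dataset, independent of key order."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Storing the fingerprint in each report is one way to see at a glance whether two scores are comparable.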
Step 2 — The Evaluation Harness
import json
from datetime import datetime, timezone
from pathlib import Path

from openai import OpenAI

MODEL = "gpt-4o-mini"
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def run_model(prompt: str) -> str:
    """Your model/pipeline call."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # message.content is optional in the API response, so fall back to ""
    return resp.choices[0].message.content or ""


def score_response(test_case: dict, response: str) -> dict:
    """Score a response based on test case type."""
    score = {"id": test_case["id"], "passed": False, "score": 0.0}
    if "expected" in test_case:
        # Case-insensitive substring check: the expected text must appear
        # in the response. This suits short factual answers; free-form
        # outputs such as summaries need a similarity metric or an LLM judge.
        score["passed"] = test_case["expected"].lower() in response.lower()
        score["score"] = 1.0 if score["passed"] else 0.0
    if "expected_contains" in test_case:
        # All expected strings present?
        hits = [s in response for s in test_case["expected_contains"]]
        score["score"] = sum(hits) / len(hits)
        score["passed"] = all(hits)
    return score


# Run evaluation (skip blank lines so a trailing newline does not break parsing)
lines = Path("eval_dataset.jsonl").read_text().splitlines()
test_cases = [json.loads(line) for line in lines if line.strip()]
results = []
for tc in test_cases:
    response = run_model(tc["input"])
    score = score_response(tc, response)
    score["response"] = response
    results.append(score)
    print(f"[{'PASS' if score['passed'] else 'FAIL'}] {tc['id']}: {score['score']:.2f}")

# Summary report
passed = sum(1 for r in results if r["passed"])
overall = sum(r["score"] for r in results) / len(results)
print(f"\nOverall: {passed}/{len(results)} passed | Avg score: {overall:.3f}")

# Save report; one UTC timestamp is used for both the field and the filename
# (datetime.utcnow() is deprecated since Python 3.12)
run_time = datetime.now(timezone.utc)
report = {
    "timestamp": run_time.isoformat(),
    "model": MODEL,
    "total": len(results),
    "passed": passed,
    "overall_score": overall,
    "results": results,
}
Path(f"eval_report_{run_time.strftime('%Y%m%d_%H%M%S')}.json").write_text(
    json.dumps(report, indent=2)
)
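A single overall score can hide a category that is failing badly. As a possible extension of the summary report, the sketch below groups results by the `category` field of the test cases. The function name `summarise_by_category` and the `"uncategorised"` fallback label are assumptions made for illustration; the input shapes follow the `test_cases` and `results` lists used in the harness.

```python
from collections import defaultdict


def summarise_by_category(test_cases: list[dict], results: list[dict]) -> dict:
    """Aggregate pass counts and average scores per test-case category."""
    # Results carry only the id, so look the category up from the dataset.
    category_of = {
        tc["id"]: tc.get("category", "uncategorised") for tc in test_cases
    }
    buckets = defaultdict(list)
    for r in results:
        buckets[category_of.get(r["id"], "uncategorised")].append(r)

    summary = {}
    for category, rows in sorted(buckets.items()):
        summary[category] = {
            "total": len(rows),
            "passed": sum(1 for r in rows if r["passed"]),
            "avg_score": sum(r["score"] for r in rows) / len(rows),
        }
    return summary
```

The returned dictionary can be added to the report under a key such as `"by_category"` before it is written to disk.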
Step 3 — CI/CD Integration
Run the evaluation with GitHub Actions on every pull request that changes files under prompts/ or chains/:
# .github/workflows/llm-eval.yml
name: LLM Evaluation
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'chains/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install openai evaluate rouge-score
      - run: python run_eval.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        # Fails the build if exit code != 0
# In run_eval.py, fail if quality drops below threshold
import sys

# Minimum average score required. `overall` is the mean score, not the
# pass rate; compare passed / len(results) instead to gate on pass rate.
THRESHOLD = 0.80

if overall < THRESHOLD:
    print(f"\n❌ EVAL FAILED: {overall:.3f} < {THRESHOLD} threshold")
    sys.exit(1)
else:
    print(f"\n✅ EVAL PASSED: {overall:.3f} >= {THRESHOLD} threshold")
    sys.exit(0)
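A fixed threshold catches large failures but not a gradual slide that stays just above it. One possible refinement, sketched below, also compares the score with a stored baseline (for example, the `overall_score` from the last report on the main branch). The function name `gate`, the `baseline` argument, and the `max_drop` tolerance of 0.02 are illustrative assumptions, not values from this section.

```python
from typing import Optional


def gate(
    overall: float,
    threshold: float = 0.80,
    baseline: Optional[float] = None,
    max_drop: float = 0.02,
) -> tuple[int, str]:
    """Return (exit_code, message) for the CI step."""
    # Absolute floor: same check as the fixed threshold.
    if overall < threshold:
        return 1, f"EVAL FAILED: {overall:.3f} < {threshold} threshold"
    # Relative check: fail if the score fell too far below the baseline.
    if baseline is not None and baseline - overall > max_drop:
        return 1, (
            f"EVAL FAILED: {overall:.3f} is more than {max_drop} "
            f"below baseline {baseline:.3f}"
        )
    return 0, f"EVAL PASSED: {overall:.3f}"
```

In `run_eval.py` this would be used as `code, message = gate(overall, baseline=previous_score)`, followed by `print(message)` and `sys.exit(code)`, where `previous_score` is whatever baseline you have stored.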
Common Mistakes
- Static threshold that never gets updated: as your system improves, raise the threshold to match (for example, from 0.80 to 0.85) so that gains cannot silently regress.
- Test dataset too small: with fewer than 20 test cases, a single failure moves the pass rate by more than 5 percentage points, so results are noisy. Aim for 50+ cases covering diverse categories and edge cases.
- Not versioning the eval dataset: if test cases change between runs, the scores are no longer comparable. Track dataset versions in git.
- Evaluating only in production: run evals before deploying, not just after incidents.
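To make the dataset-size point concrete, a confidence interval shows how uncertain a pass rate is for a given number of test cases. The sketch below uses the Wilson score interval, a standard choice for binomial proportions; it is an added illustration rather than part of the pipeline above, and it assumes test cases are independent. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Same 80% pass rate, very different certainty:
print(wilson_interval(8, 10))    # roughly (0.49, 0.94)
print(wilson_interval(80, 100))  # roughly (0.71, 0.87)
```

With 10 cases, an observed 80% pass rate is compatible with a true rate anywhere from about 49% to 94%, which is why a threshold of 0.80 on such a small dataset says little.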
Quick Quiz
Q1. What is the key property that makes an eval pipeline “repeatable”?
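A1. Holding everything except the system under test constant: a fixed test dataset (versioned in git), the same scoring logic every run, and temperature=0 to minimise output variance. Note that hosted models are not guaranteed to be fully deterministic even at temperature=0.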
Q2. Why should eval be integrated into CI/CD?
A2. To automatically catch quality regressions when prompts, chains, or model parameters change, before they reach production.
Q3. What is the purpose of sys.exit(1) in the eval script?
A3. A non-zero exit code signals failure to the CI/CD runner (GitHub Actions, GitLab CI), which blocks the merge/deployment.
Student Exercise
Exercise 10.4 — Build a mini eval pipeline
Create a 10-test-case dataset for a QA bot. Build the harness, run it, and produce a JSON report. Then modify a prompt to intentionally degrade quality — verify the pipeline detects the regression.