Chapter 8: Evaluation & Benchmarking | Mohammed Sarfaraz

8.1 Why It's Hard

Non-determinism, subjective quality, and benchmark gaming in LLM evaluation.

BLEU, ROUGE, BERTScore, and RAGAS (faithfulness, relevance) for LLM evaluation.

Rubric design, pairwise comparison, and calibration for LLM-as-judge evaluation.

Annotation guidelines, inter-rater reliability, and stratified sampling for LLM evaluation.

Dataset creation, test harness, and CI/CD for LLM output quality assurance.