📄️ 8.1 Why It's Hard
Non-determinism, subjective quality, and benchmark gaming in LLM evaluation.
📄️ 8.2 Automated Metrics
BLEU, ROUGE, BERTScore, and RAGAS (faithfulness, relevance) for LLM evaluation.
📄️ 8.3 LLM-as-Judge Eval
Rubric design, pairwise comparison, and calibration for LLM-as-judge evaluation.
📄️ 8.4 Human Evaluation
Annotation guidelines, inter-rater reliability, and stratified sampling for LLM evaluation.
📄️ 8.5 Eval Pipeline
Dataset creation, test harness, and CI/CD for LLM output quality assurance.