8.4 Human Evaluation Design
Annotation guidelines, inter-rater reliability, and stratified sampling for LLM evaluation.
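
As a concrete illustration of inter-rater reliability, the following minimal sketch computes Cohen's kappa for two annotators who labeled the same items. Cohen's kappa is one common agreement statistic and is chosen here only as an example; the section summary does not say which metric is used. The function name `cohens_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items on which both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each annotator's marginal label counts.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        # Kappa is undefined when both annotators use one identical label
        # throughout; returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of four model outputs by two annotators.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is lower than the plain fraction of matching labels whenever the annotators' label distributions overlap.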