8.1 Why Evaluation Is Hard
Non-determinism, subjective quality, and benchmark gaming in LLM evaluation.