On Evaluating Methods vs. Evaluating Models (pdf)
Olawale Elijah Salaudeen, Florian E. Dorner, and Peter Hase
Evaluating the Evolving LLM Lifecycle Workshop at NeurIPS 2025 (Oral)
We distinguish between two often-conflated uses of datasets: evaluating methods, which compares algorithms under controlled conditions, and evaluating models, which assesses the capabilities of a fixed model. Method evaluation emphasizes relative rankings and is robust to model-independent distortions, while model evaluation requires valid absolute scores but can be biased by contamination or task mismatch. Using simple formulations and synthetic experiments, we demonstrate how this conflation can reverse method rankings and misrepresent capabilities, leading to misleading leaderboards and a flawed understanding of a model's true abilities. We conclude with recommendations for designing and interpreting state-of-the-art evaluations, grounded in the critical distinction between methods and models.
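As a toy illustration of the first point, here is a minimal sketch (with made-up method names and scores, not taken from the paper) of how a model-independent distortion, such as a uniformly harder test split, can shift every absolute score while leaving the method ranking intact: method evaluation survives, but the scores no longer reflect true capability.

```python
import numpy as np

# Hypothetical true accuracies of three methods on the intended task.
true_scores = {"method_A": 0.72, "method_B": 0.68, "method_C": 0.61}

# A model-independent distortion (e.g., a uniformly harder benchmark)
# that subtracts the same amount from every method's measured score.
distortion = 0.10
measured = {m: s - distortion for m, s in true_scores.items()}

# Method evaluation: the relative ranking is preserved under this distortion.
rank_true = sorted(true_scores, key=true_scores.get, reverse=True)
rank_measured = sorted(measured, key=measured.get, reverse=True)
print("ranking preserved:", rank_true == rank_measured)  # True

# Model evaluation: the absolute scores are now biased estimates of
# capability, even though relative comparisons are untouched.
for m in rank_measured:
    print(f"{m}: measured={measured[m]:.2f}, true={true_scores[m]:.2f}")
```

Contamination works the other way around: if it inflates scores by different amounts for different methods or models, rankings themselves can flip, which is the failure mode the paper's synthetic experiments highlight.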