Limits to scalable evaluation at the frontier: LLM as Judge won’t beat twice the data
Model evaluations based on LLM judgements are often biased. Even after debiasing, the gains from access to an LLM judge are limited: at best, they match what twice the ground-truth data would provide.
Florian E. Dorner, Vivian Y. Nastl, and Moritz Hardt
arXiv preprint