On Evaluating Methods vs. Evaluating Models
Olawale Elijah Salaudeen, Florian E. Dorner, and Peter Hase
Evaluating the Evolving LLM Lifecycle Workshop at NeurIPS 2025 (Oral)
Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, and Samira Samadi
arXiv preprint
ROC-n-reroll: How verifier imperfection affects test-time scaling
Florian E. Dorner, Yatong Chen, André F. Cruz, and Fanny Yang
arXiv preprint
How Benchmark Prediction from Fewer Data Misses the Mark
Guanhua Zhang, Florian E. Dorner, and Moritz Hardt
NeurIPS 2025
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner, Vivian Y. Nastl, and Moritz Hardt
ICLR 2025 (Oral)
Training on the Test Task Confounds Evaluation and Emergence
Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt
ICLR 2025 (Oral)
Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback
Emilia Agis Lerner, Florian E. Dorner, Elliott Ash, and Naman Goel
Annual Meeting of the Association for Computational Linguistics 2024
Don't Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
Florian E. Dorner and Moritz Hardt
ICML 2024
Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
Florian E. Dorner, Nikola Konstantinov, Georgi Pashaliev, Martin Vechev
NeurIPS 2023
Do Personality Tests Generalize to Large Language Models?
Florian E. Dorner, Tom Sühr, Samira Samadi, Augustin Kelava (Equal contribution)
Socially Responsible Language Modelling Research Workshop (at NeurIPS 2023)
Human-Guided Fair Classification for Natural Language Processing
Florian E. Dorner, Momchil Peychev, Nikola Konstantinov, Naman Goel, Elliott Ash, and Martin Vechev
ICLR 2023 (Top 25% Spotlight)
Forecasting AI progress: A research agenda
Ross Gruetzemacher, Florian E. Dorner, Niko Bernaola-Alvarez, Charlie Giattino, David Manheim
Technological Forecasting and Social Change 170, 120909 (2021)
Algorithmic collusion: A critical review
Florian E. Dorner
arXiv preprint
Measuring Progress in Deep Reinforcement Learning Sample Efficiency
Florian E. Dorner
arXiv preprint