Sitemap
Pages
Publications
Measuring Progress in Deep Reinforcement Learning Sample Efficiency
How has sample efficiency in deep reinforcement learning improved over time?
Florian E. Dorner
arXiv preprint
Algorithmic collusion: A critical review
How realistic is the prospect of pricing algorithms learning to collude?
Florian E. Dorner
arXiv preprint
Forecasting AI progress: A research agenda
A survey of experts’ opinions on forecasting AI progress.
Ross Gruetzemacher, Florian E. Dorner, Niko Bernaola-Alvarez, Charlie Giattino, David Manheim
Technological Forecasting and Social Change 170, 120909 (2021)
Human-Guided Fair Classification for Natural Language Processing
We use LLMs and humans to generate valid constraints for individual fairness in text classification.
Florian E. Dorner, Momchil Peychev, Nikola Konstantinov, Naman Goel, Elliott Ash, and Martin Vechev
ICLR 2023 (Top 25% Spotlight)
Do Personality Tests Generalize to Large Language Models?
Language models’ answers to personality tests markedly deviate from typical human responses.
Florian E. Dorner, Tom Sühr, Samira Samadi, Augustin Kelava (Equal contribution)
Socially Responsible Language Modelling Research Workshop (at NeurIPS 2023)
Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
We show how to use peer-prediction mechanisms to prevent rational clients from adversarially manipulating updates in federated learning.
Florian E. Dorner, Nikola Konstantinov, Georgi Pashaliev, Martin Vechev
NeurIPS 2023
Don’t Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget
When building a test set for binary classification from noisy labels, how many labels should be collected per data point? Surprisingly, under a simple budget constraint, the answer is a single label.
Florian E. Dorner, Moritz Hardt
ICML 2024
Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback
Human preferences about judgments in text classification are not invariant across demographic groups.
Emilia Agis Lerner, Florian E. Dorner, Elliott Ash, and Naman Goel
ACL 2024
Training on the Test Task Confounds Evaluation and Emergence
Recent improvements in LLM performance beyond what compute scaling predicts appear to be fully explained by training on benchmark-specific data.
Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt
ICLR 2025 (Oral)
Limits to scalable evaluation at the frontier: LLM as Judge won’t beat twice the data
Model evaluations based on LLM judgments are often biased. After debiasing, the gains achievable from access to an LLM judge are limited.
Florian E. Dorner, Vivian Y. Nastl, and Moritz Hardt
ICLR 2025 (Oral)
Tools
All the single labels
How many labels per instance are needed to compare two binary classifiers? Accompanying tool to the paper Don’t Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget.