MWAHAHA Competition

Live Leaderboard

Rank	System	Rating	95% CI	Votes

Info

This web page shows the live arena leaderboard for the systems submitted by participants to the 2025-2026 MWAHAHA competition on Humor Generation. In this competition, participants submit systems that generate humorous outputs given some context (e.g., a headline). The competition has two phases in which participants can submit systems: Evaluation Trial (until Dec 15, 2025) and Evaluation (Jan 10-31, 2026). This page shows the live evaluation results for both phases, one at a time (currently the Evaluation Trial). The results are not final: they are partial and still being computed.

Frequently Asked Questions (FAQ)

1. What are these systems?
These are the names of the systems participants submitted to this competition. These systems generate jokes given a prompt (e.g., a headline). The names of the systems are composed of the participant name, a hyphen, and the submission number, obtained from our CodaBench competition website.

2. Why do some systems have similar names?
Participants are allowed to make multiple submissions. So the participant name can appear multiple times, with a different submission number.

3. What's baseline?
It's the name of a system provided by the competition organizers as a baseline.

4. How are the systems evaluated?
We use this annotation web page to let anyone on the Internet help us decide which system is funnier in 1-on-1 arena-style battles, partially inspired by LMArena. From all the annotations, we compute an Elo-like rating for each system. A higher rating indicates a system that is more likely to generate outputs perceived as humorous. This kind of rating is used by LMArena and also in games such as chess. More specifically, we fit a Bradley-Terry model to compute stable ratings, and apply bootstrapping to compute 95% confidence intervals. Note that, in some edge cases, the confidence intervals and the final rating values may not line up exactly. See this blog post from LMSYS Org for more info.
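As a rough illustration of the approach described above, here is a minimal Python sketch of fitting a Bradley-Terry model to pairwise votes and bootstrapping 95% confidence intervals. It is not the organizers' actual code. The function names, the vote format `(system_a, system_b, outcome)`, the treatment of a tie as half a win for each side, the small `prior` pseudo-count, and the Elo-like scale (base 1000, 400 points per factor of 10 in strength) are all assumptions made for this example.

```python
import math
import random


def bradley_terry(battles, systems, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths with a simple iterative (MM) update.

    battles: list of (a, b, outcome), outcome in {"a", "b", "tie"} (assumed format).
    prior: small pseudo-count per ordered pair (an assumption) so that every
    system keeps a finite rating, even in a bootstrap resample where it never wins.
    """
    wins = {(i, j): prior for i in systems for j in systems if i != j}
    for a, b, outcome in battles:
        if a == b:
            continue  # a system is never compared against itself
        if outcome == "a":
            wins[(a, b)] += 1.0
        elif outcome == "b":
            wins[(b, a)] += 1.0
        else:  # anything else counts as a tie: half a win for each side
            wins[(a, b)] += 0.5
            wins[(b, a)] += 0.5

    p = {s: 1.0 for s in systems}
    for _ in range(iters):
        new_p = {}
        for i in systems:
            total_wins = sum(wins[(i, j)] for j in systems if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])
                for j in systems
                if j != i
            )
            new_p[i] = total_wins / denom
        # Normalize so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in new_p.values()) / len(systems)
        p = {s: v / math.exp(log_mean) for s, v in new_p.items()}

    # Map strengths to an Elo-like scale centered at 1000 (assumed convention).
    return {s: 1000.0 + 400.0 * math.log10(p[s]) for s in systems}


def bootstrap_ci(battles, systems, rounds=200, seed=0):
    """Percentile 95% intervals from refitting on resampled votes."""
    rng = random.Random(seed)
    samples = {s: [] for s in systems}
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))
        for s, rating in bradley_terry(resample, systems, iters=50).items():
            samples[s].append(rating)
    intervals = {}
    for s, values in samples.items():
        values.sort()
        lower = values[int(0.025 * (rounds - 1))]
        upper = values[int(0.975 * (rounds - 1))]
        intervals[s] = (lower, upper)
    return intervals


# Toy votes for three hypothetical systems.
battles = (
    [("A", "B", "a")] * 30 + [("A", "B", "b")] * 10
    + [("B", "C", "a")] * 30 + [("B", "C", "b")] * 10
    + [("A", "C", "tie")] * 5
)
systems = ["A", "B", "C"]
ratings = bradley_terry(battles, systems)
intervals = bootstrap_ci(battles, systems)
for s in systems:
    print(s, round(ratings[s]), tuple(round(x) for x in intervals[s]))
```

Because the point rating comes from all the votes while the interval comes from resampled votes, the rating is not guaranteed to sit at the center of its interval, which is one way the two can disagree in edge cases.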

5. Why do some systems have the same rank?
Some systems have the same rank because we can't differentiate them in a statistically significant way, even when their ratings are different. Note that ties aren't transitive. For example, we may be unable to tell whether A or B is better, or whether B or C is better, and still be able to tell with statistical significance that A is better than C. That's why a system may rank below another one without a statistically significant difference between them: some other system that shares the higher rank can be differentiated from the lower-ranked one.
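One ranking rule that behaves as described above, shown here as an assumption rather than as the leaderboard's confirmed method, is to rank each system as one plus the number of systems that are significantly better, where "significantly better" means the other system's lower confidence bound lies above this system's upper bound. The function name and the interval values below are hypothetical.

```python
def ranks_from_intervals(intervals):
    """Rank = 1 + number of systems whose lower bound beats this system's upper bound."""
    ranks = {}
    for system, (_, upper) in intervals.items():
        significantly_better = sum(
            1
            for other, (other_lower, _) in intervals.items()
            if other != system and other_lower > upper
        )
        ranks[system] = 1 + significantly_better
    return ranks


# Hypothetical 95% intervals: A overlaps B, B overlaps C, but A is clearly above C.
intervals = {"A": (1040, 1100), "B": (990, 1060), "C": (950, 1030)}
print(ranks_from_intervals(intervals))  # {'A': 1, 'B': 1, 'C': 2}
```

Here C ranks below B even though their intervals overlap, because A, which shares rank 1 with B, is significantly better than C.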

6. Why do some systems have fewer votes than others?
Some systems were submitted more recently. Also, some systems have a lot of votes because we automatically record a tie for any pair of systems that produce the same output. This typically happens for different versions of the same system.

7. When is the leaderboard updated?
We try to update it every hour, but issues may occur, so we can't guarantee it. In any case, the last update time appears at the bottom of the tables.

8. Can I evaluate systems?
Yes! Everyone is welcome! Visit the annotation web page and have fun rating the funnier systems! Note that you can't both submit a system and evaluate systems. In other words, only non-participants can evaluate systems. If you're considering participating in the competition, please refrain from evaluating systems.

9. Can I participate in this competition?
Of course! Everyone is welcome! Visit the 2025-2026 MWAHAHA competition website for more info. You have time to submit your system's jokes until the Evaluation phase ends (late Jan, 2026). Note that, if you are considering participating, you cannot vote/annotate (i.e., you can't both submit a system and evaluate systems).