Ranking Large Language Models with LMArena

Summary

This assignment teaches students how to evaluate large language models (LLMs) using pairwise comparison data from LMArena. Students start by analyzing battle distributions, computing win rates, and applying the Bradley–Terry model to build leaderboards, then explore model performance across different prompt types. In the second part, they examine stylistic confounds such as formatting and length, using logistic regression to measure how these factors influence outcomes.
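As a rough illustration of the Part 1 pipeline, the sketch below computes per-model win rates and fits Bradley–Terry ratings by logistic regression on pairwise outcomes. It assumes a battles DataFrame with columns model_a, model_b, and winner taking the values "model_a" or "model_b" (ties removed); the column names, helper names, and Elo-style rescaling constants are illustrative rather than taken from the released notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def bradley_terry_ratings(battles: pd.DataFrame,
                          scale: float = 400.0,
                          base: float = 1000.0) -> pd.Series:
    """Fit Bradley–Terry scores via logistic regression on model indicators."""
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}

    # Design matrix: +1 for the model in position A, -1 for the model in position B.
    X = np.zeros((len(battles), len(models)))
    for row, (a, b) in enumerate(zip(battles["model_a"], battles["model_b"])):
        X[row, idx[a]] = 1.0
        X[row, idx[b]] = -1.0

    # Target: 1 if model_a won, 0 if model_b won (ties dropped beforehand).
    y = (battles["winner"] == "model_a").astype(int).to_numpy()

    # Very weak regularization keeps the fit close to the unpenalized MLE.
    lr = LogisticRegression(fit_intercept=False, C=1e6)
    lr.fit(X, y)

    # Rescale coefficients to an Elo-like scale for readability.
    ratings = scale * lr.coef_[0] / np.log(10) + base
    return pd.Series(ratings, index=models).sort_values(ascending=False)


def win_rates(battles: pd.DataFrame) -> pd.Series:
    """Fraction of battles each model wins, out of all battles it appears in."""
    wins = pd.concat([
        battles.loc[battles["winner"] == "model_a", "model_a"],
        battles.loc[battles["winner"] == "model_b", "model_b"],
    ]).value_counts()
    appearances = pd.concat([battles["model_a"], battles["model_b"]]).value_counts()
    return (wins / appearances).fillna(0.0).sort_values(ascending=False)
```

Part 2's style analysis follows the same recipe: append features such as the difference in response length or the presence of formatting (e.g., markdown headers) to the design matrix and inspect their fitted coefficients alongside the model terms.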

Topics

Machine Learning, Natural Language Processing, Model Evaluation, Interpretability, k-Means Clustering, Logistic Regression, Bradley–Terry Model, Large Language Models

Audience

Introduction to AI/ML

Difficulty

Moderate; ~6–10 hours for students familiar with Python, pandas, and basic ML concepts

Strengths

• Uses real-world crowdsourced LLM battle data

• Integrates statistical reasoning with visualization

• Highlights bias/interpretability issues in model evaluation

• Modular design allows incremental progress

Weaknesses

• Requires relatively large datasets

• Computationally heavier than small toy ML assignments

• Focuses on implementation over theoretical proof

Dependencies

Knowledge: Python, pandas, logistic regression, confidence intervals

Software: Python 3.10+, numpy, pandas, scikit-learn, plotly, otter-grader. No specific OS requirements

Variants

The assignment could be expanded with deeper exploratory analysis, such as comparing the embedding or tokenization strategies used by one model series (e.g., GPT) against others and examining how these differences affect rankings. The analysis could also be extended to multiple languages, since the current notebook focuses only on English.

Resources

We provide Otter autograder support for both Part 1 and Part 2; the autograder configurations can be uploaded to Gradescope for automated evaluation.
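For instructors unfamiliar with Otter, the student-facing workflow is a notebook with check cells like the sketch below; the question name "q1" is a placeholder, not one taken from this assignment.

```python
# Typical Otter check cells inside an assignment notebook; "q1" is a
# placeholder question name, not necessarily one used in this assignment.
import otter
grader = otter.Notebook()

# ... student solution code goes in the cells above ...

grader.check("q1")    # run the public tests for one question
grader.check_all()    # run every check before exporting and submitting
```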

Intro Paper Overview:

Part 1 code (folder: rank_llm_pt_1):

Part 2 code (folder: rank_llm_pt_2):