Ranking Large Language Models with LMArena

Summary

This assignment teaches students how to evaluate large language models (LLMs) using pairwise comparison data from LMArena. Students start by analyzing battle distributions, computing win rates, and applying the Bradley–Terry model to build leaderboards, then explore model performance across different prompt types. In the second part, they examine stylistic confounds such as formatting and length, using logistic regression to measure how these factors influence outcomes.
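As a rough illustration of the Part 1 pipeline, the sketch below computes per-model win rates and fits Bradley–Terry ratings by logistic regression on pairwise outcomes. It assumes a battles DataFrame with columns model_a, model_b, and winner taking the values "model_a" or "model_b" (ties removed); the column names, helper names, and Elo-style rescaling constants are illustrative rather than taken from the released notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def bradley_terry_ratings(battles: pd.DataFrame,
                          scale: float = 400.0,
                          base: float = 1000.0) -> pd.Series:
    """Fit Bradley–Terry scores via logistic regression on model indicators."""
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}

    # Design matrix: +1 for the model in position A, -1 for the model in position B.
    X = np.zeros((len(battles), len(models)))
    for row, (a, b) in enumerate(zip(battles["model_a"], battles["model_b"])):
        X[row, idx[a]] = 1.0
        X[row, idx[b]] = -1.0

    # Target: 1 if model_a won, 0 if model_b won (ties dropped beforehand).
    y = (battles["winner"] == "model_a").astype(int).to_numpy()

    # Very weak regularization keeps the fit close to the unpenalized MLE.
    lr = LogisticRegression(fit_intercept=False, C=1e6)
    lr.fit(X, y)

    # Rescale coefficients to an Elo-like scale for readability.
    ratings = scale * lr.coef_[0] / np.log(10) + base
    return pd.Series(ratings, index=models).sort_values(ascending=False)


def win_rates(battles: pd.DataFrame) -> pd.Series:
    """Fraction of battles each model wins, out of all battles it appears in."""
    wins = pd.concat([
        battles.loc[battles["winner"] == "model_a", "model_a"],
        battles.loc[battles["winner"] == "model_b", "model_b"],
    ]).value_counts()
    appearances = pd.concat([battles["model_a"], battles["model_b"]]).value_counts()
    return (wins / appearances).fillna(0.0).sort_values(ascending=False)
```

Part 2's style analysis follows the same recipe: append features such as the difference in response length or the presence of formatting (e.g., markdown headers) to the design matrix and inspect their fitted coefficients alongside the model terms.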

Topics

Machine Learning, Natural Language Processing, Model Evaluation, Interpretability, k-Means Clustering, Logistic Regression, Bradley–Terry Model, Large Language Models

Audience

Introduction to AI/ML

Difficulty

Moderate; ~6–10 hours for students familiar with Python, pandas, and basic ML concepts

Strengths

• Uses real-world crowdsourced LLM battle data

• Integrates statistical reasoning with visualization

• Highlights bias/interpretability issues in model evaluation

• Modular design allows incremental progress

Weaknesses

• Requires relatively large datasets

• Computationally heavier than small toy ML assignments

• Focuses on implementation over theoretical proof

Dependencies

Knowledge: Python, pandas, logistic regression, confidence intervals

Software: Python 3.10+, numpy, pandas, scikit-learn, plotly, otter-grader. No specific OS requirements

Variants

The assignment could be expanded with deeper exploratory analysis, such as comparing the embedding or tokenization strategies used by one model series (e.g., GPT) against others and examining how these differences affect rankings. The analysis could also be extended to multiple languages, since the current notebook focuses only on English.

Resources

We provide Otter autograder support for both Part 1 and Part 2; the autograder configurations can be uploaded to Gradescope for automated evaluation.
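For instructors unfamiliar with Otter, the student-facing workflow is a notebook with check cells like the sketch below; the question name "q1" is a placeholder, not one taken from this assignment.

```python
# Typical Otter check cells inside an assignment notebook; "q1" is a
# placeholder question name, not necessarily one used in this assignment.
import otter
grader = otter.Notebook()

# ... student solution code goes in the cells above ...

grader.check("q1")    # run the public tests for one question
grader.check_all()    # run every check before exporting and submitting
```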

Intro Paper Overview:

Part 1 code (folder: rank_llm_pt_1):

Part 2 code (folder: rank_llm_pt_2):