Summary
This assignment teaches students how to evaluate large language models (LLMs) using pairwise comparison data from LMArena. Students start by analyzing battle distributions, computing win rates, and applying the Bradley–Terry model to build leaderboards, then explore model performance across different prompt types. In the second part, they examine stylistic confounds such as formatting and length, using logistic regression to measure how these factors influence outcomes.
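For orientation, here is a minimal sketch of the kind of Bradley–Terry fit the first part calls for, phrased as logistic regression on the pairwise battle outcomes. The column names model_a, model_b, and winner are assumptions about the battle DataFrame, not the assignment's actual schema, and this is an illustrative sketch rather than the assignment's reference solution.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def bradley_terry_ratings(battles: pd.DataFrame) -> pd.Series:
        # Keep decisive battles only (drop ties); assumes a "winner" column
        # whose values include "model_a" and "model_b".
        decisive = battles[battles["winner"].isin(["model_a", "model_b"])]
        models = pd.unique(pd.concat([decisive["model_a"], decisive["model_b"]]))
        idx = {m: k for k, m in enumerate(models)}

        # One row per battle: +1 for the model in the "model_a" slot, -1 for "model_b".
        X = np.zeros((len(decisive), len(models)))
        for row, (a, b) in enumerate(zip(decisive["model_a"], decisive["model_b"])):
            X[row, idx[a]] = 1.0
            X[row, idx[b]] = -1.0
        y = (decisive["winner"] == "model_a").to_numpy(dtype=int)

        # Logistic regression with no intercept recovers Bradley-Terry strengths
        # on the log scale (identified only up to an additive constant). A very
        # weak L2 penalty (large C) approximates the unpenalized fit.
        clf = LogisticRegression(fit_intercept=False, C=1e6, max_iter=1000)
        clf.fit(X, y)
        return pd.Series(clf.coef_[0], index=models).sort_values(ascending=False)

The same design-matrix idea extends to the second part: appending style features such as response length or formatting indicators as extra columns lets students measure how much those confounds shift the fitted ratings.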
|
Topics
Machine Learning, Natural Language Processing, Model Evaluation, Interpretability, k-Means Clustering, Logistic Regression, Bradley–Terry Model, Large Language Models
|
Audience
Introduction to AI/ML
|
Difficulty
Moderate; ~6–10 hours for students familiar with Python, pandas, and basic ML concepts
|
Strengths
• Uses real-world crowdsourced LLM battle data
• Integrates statistical reasoning with visualization
• Highlights bias/interpretability issues in model evaluation
• Modular design allows incremental progress
|
Weaknesses
• Requires relatively large datasets
• Computationally heavier than small toy ML assignments
• Focuses on implementation over theoretical proofs
|
Dependencies
Knowledge: Python, pandas, logistic regression, confidence intervals
Software: Python 3.10+, numpy, pandas, scikit-learn, plotly, otter-grader. No specific OS requirements.
|
Variants
The assignment could be expanded to include deeper exploratory analysis, such as examining the embedding or tokenization strategies used by one model series (e.g., GPT) compared to others, and how those differences affect rankings. The analysis could also be extended to cover multiple languages, since the current notebook focuses only on English.
We provide Otter autograder support for both Part 1 and Part 2; the autograders can be uploaded to Gradescope for easy automated evaluation.
Intro Paper Overview:
Part 1 code (folder: rank_llm_pt_1):
Part 2 code (folder: rank_llm_pt_2):