Exploring Unfairness and Bias in Data

Overview

It is natural to assume that a model built from “real-world” data will inherently represent the world at large. However, recent instances of models behaving in unexpectedly biased ways show that this is not the case. We believe that one long-term solution to this problem is to build a curriculum that inspires students to actively think about their data and its potential to be biased. Here, we present a Jupyter Notebook-based assignment exploring how bias can be introduced into a model, using an example of gender bias in credit-history data used to predict creditworthiness. We use an unbalanced data set (n = 500; 77 women and 423 men) to demonstrate how model evaluation methods like classification accuracy can be misleadingly high even with standard procedures like dataset splitting and cross-validation, allowing bias to remain undetected. We then consider whether ignoring gender altogether will help our predictions. We conclude by prompting students to discuss whether these methods can solve unfairness in AI and to contribute their own ideas about how to tackle this problem.
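
To make this failure mode concrete, the sketch below (not part of the assignment notebook itself) trains a scikit-learn logistic regression on synthetic data with the same 77/423 gender imbalance as the lab's data set. The data-generating rule is purely hypothetical: the relationship between a single feature and the label is reversed for the minority group, so a single linear model can only fit the majority group well. Overall test accuracy then looks respectable while accuracy for women is far worse, and dropping the gender column does not close the gap.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Group imbalance mirroring the lab's data set: 77 women, 423 men.
    gender = np.array([0] * 77 + [1] * 423)   # 0 = woman, 1 = man
    x = rng.normal(size=gender.size)

    # Hypothetical label rule that is reversed between groups, so a single
    # linear model can only fit the majority group well.
    y = np.where(gender == 1, x > 0, x < 0).astype(int)

    def accuracies(features):
        # Split, fit, and report overall and per-group test accuracy.
        X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
            features, y, gender, test_size=0.3, random_state=0, stratify=gender)
        correct = LogisticRegression().fit(X_tr, y_tr).predict(X_te) == y_te
        return correct.mean(), correct[g_te == 1].mean(), correct[g_te == 0].mean()

    for name, feats in [("with gender", np.column_stack([x, gender])),
                        ("without gender", x.reshape(-1, 1))]:
        overall, men, women = accuracies(feats)
        print("%s: overall=%.2f  men=%.2f  women=%.2f" % (name, overall, men, women))

Per-group evaluation, rather than a single aggregate accuracy, is what surfaces the disparity here.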

Meta Information

Summary

This lab is designed to introduce the challenge of fairness in AI-related domains to students studying introductory data science.

Topics

Fairness. Bias through Data.

Audience

Early-career undergraduate students in CS (majors and minors). Familiarity with a high-level programming language (gained from an Introduction to Programming course) is required.

Difficulty

Introductory to intermediate level. The lab is designed for students to complete in a 1-2 hour lab session with an instructor or teaching assistants present to provide help.

Strengths

This assignment provides an easy introduction to how biased, unfair models can arise from data, accomplished through a simple demonstration. It can be used either as a one-off exercise or as the starting point for a larger, instructor-led discussion on the topic of fairness.

Weaknesses

The lab does not aim to provide a comprehensive discussion of fairness. It is also intended to be used during an "introduction to data science" style course and assumes familiarity with Python data-processing tools (pandas, NumPy, and scikit-learn) as well as with logistic regression.

Dependencies

Coding is done in Python 3.6 (using numpy, matplotlib, and scikit-learn). We recommend installing Python via the Anaconda distribution. The lending data used in the lab is provided.

Variants

This lab can be used as a one-off assignment or as an introduction to a more in-depth discussion of the topic. It can be extended to give a more complete treatment of how to handle unbalanced data (one possible direction is sketched below). It can also be adapted to focus on types of bias beyond unbalanced data.
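
As one concrete direction for the unbalanced-data extension, the minimal sketch below reweights training examples by group frequency. The names X_train, y_train, and gender_train are placeholders for whatever split a notebook produces, and this is only one of several possible mitigations; resampling the minority group and per-group evaluation are natural companions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_group_weighted(X_train, y_train, gender_train):
        # Weight each example inversely to its group's frequency so that
        # women and men contribute equal total weight to the training loss.
        # This mirrors scikit-learn's 'balanced' class-weight heuristic,
        # applied to the gender column instead of the label.
        counts = np.bincount(gender_train)
        weights = (len(gender_train) / (len(counts) * counts))[gender_train]
        return LogisticRegression().fit(X_train, y_train, sample_weight=weights)

Note that reweighting addresses sample imbalance but cannot by itself make a model class expressive enough to capture group-specific relationships, so it should be paired with per-group evaluation.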

Instructions and Starter Code

We provide both a student-ready assignment notebook and an instructor solution notebook (available upon request). The Jupyter Notebook runtime is installed by default with any Anaconda Python distribution. Navigate to this directory in your terminal and execute jupyter notebook to start. All of the other package dependencies required by the notebook should also be provided by default with an Anaconda installation.