# Lab 12: Exploring Unfairness in Data ⚖️

## Learning Objectives
* Understanding and Applying Logistic Regression
* Data Exploration
* Practice ML Workflow: Training, Testing, and Evaluation

## Outline

1. [Unfairness](#1.-Unfairness)
2. [Exploring Loan Approval Data](#2.-Exploring-Loan-Approval-Data)
3. [Building a Model](#3.-Building-a-Model)
4. [Becoming Data and Fairness Aware](#4.-Becoming-Data-and-Fairness-Aware)

## 1. Unfairness

It is natural to assume that a model built from "real-world" data will inherently represent the world-at-large. We often take the data that we have for granted, especially when we are first getting started with Data Science. However, if we do not pay attention to what our data look like, how they were collected, and what features they contain, we may unknowingly create models that propagate cultural biases and unfairnesses.

![hire](utility/images/undraw_hire_te5y.png)

In 2014, Amazon began building programs that could automate the hiring process for engineers. They wanted a machine to be able to pick out the top resumes from the thousands they receive every year. They trained their model on all of the resumes that they had, hoping that the model would be able to identify trends in keyword frequency within those applications. If most applications contained the word "intern," then one might reasonably expect that a resume containing it would be ranked higher than one that doesn't. However, as they began to deploy their model, it became increasingly apparent that the model was discriminating against women. When engineers investigated why this was the case, they found that the data they trained the model with, the resumes, had mostly come from men. The model had learned to prefer resumes that didn't contain the word "women's" because that word wasn't frequently seen during its training. Although gender was not explicitly a feature of the dataset, it was still present in the dataset, encoded within the experiences that applicants reported.

Amazon's case serves as a reminder that we must be careful of our data, even more so today as data becomes cheaper to collect.

## 2. Exploring Loan Approval Data

Imagine that you are a data scientist at a bank and that one of your company's primary business areas is lending money. The current loan approval process, which has been in place since the founding of the bank, has always relied on manual review of applications -- a process that is tedious and doesn't scale well in the modern age. The company wants to expand its business, but this archaic system is holding it back.

Thinking about how to approach this problem, you immediately consider using the bank's past loan approval records to build a model that can learn how a human application reviewer decides which applications to approve and which to reject.

![approval](utility/images/undraw_accept_request_vdsd.png)

### Acquiring the Data



Before we begin, let's make sure that we have the data. The cell below checks if you have the `loan-payments.csv` file in the `utility/data` directory.

In [None]:
from os.path import exists


data_dir = 'utility/data'

assert exists(f'{data_dir}/loan-payments.csv'), 'Loan data file is missing.'

Next, let's load our data. In the cell below, we read our [CSV][1] file into a [Pandas][2] [`DataFrame`][3] called `data`.



[1]: https://en.wikipedia.org/wiki/Comma-separated_values
[2]: https://pandas.pydata.org/
[3]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [None]:
import pandas as pd

data = pd.read_csv(f'{data_dir}/loan-payments.csv')

Let's take a look at what we have.

In [None]:
data

**Write-up!** How many examples are in our data set? How many features does it have?

**Write-up!** With your neighbor, come up with a description of what you think each feature is and what type of feature each one is. Which one should be our target variable? Which ones do you think will be useful for our model?
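If you want a quick way to check your counts, here is a minimal sketch on a made-up toy `DataFrame` (not the loan data; the column names and values are only for illustration). It shows how `shape` and `select_dtypes` can help you tell numeric columns apart from the rest.

```python
import pandas as pd

# Toy stand-in for the loan data; the columns and values are made up.
toy = pd.DataFrame({
    'principal': [1000, 800, 1000, 300],
    'age': [45, 33, 27, 50],
    'education': ['college', 'Bachelor', 'college', 'High School or Below'],
    'loan_status': ['PAIDOFF', 'PAIDOFF', 'COLLECTION', 'PAIDOFF'],
})

# shape is (number of rows, number of columns); the columns include the target
n_rows, n_columns = toy.shape
print(n_rows, n_columns)

# numeric columns; the remaining ones are usually strings or categories
numeric_columns = list(toy.select_dtypes(include='number').columns)
other_columns = [c for c in toy.columns if c not in numeric_columns]
print(numeric_columns, other_columns)
```

Keep in mind that a numeric dtype does not always mean a numeric feature: an ID column, for example, is stored as a number but behaves like a label.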

### Making Some Adjustments

Now let's drop the columns in `data` that contain features that we are not interested in. Since `loan_id`s are not informative for predicting new loans, we can ignore them. Additionally, the information in `effective_date`, `due_date`, and `paid_off_time` is already captured by `past_due_days`. It is unlikely that the specific dates on which a loan was issued, due, or paid off are predictive of success.

In [None]:
not_interested = ['loan_id', 'effective_date', 'due_date', 'paid_off_time']

data = data.drop(not_interested, axis=1)

Let's see our new data set.

In [None]:
data.head()

Did you notice that `past_due_days` has `NaN` values?

**Write-up!** Why might some of the values in `past_due_days` be `NaN`?  With your neighbor, discuss what we should do about these values and note your conclusion below.

**Try this!** Replace the `NaN` values in `past_due_days` with a reasonable value. `HINT` you can use the `fillna` function on `DataFrame`s to do this.
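As a reminder of how `fillna` behaves, here is a small sketch on a made-up column called `late_fee` (a hypothetical name, not part of the loan data). The fill value of `0.0` is only a placeholder for the demo; deciding what is reasonable for `past_due_days` is still up to you.

```python
import numpy as np
import pandas as pd

# Toy column with missing values; `late_fee` is a made-up name.
toy = pd.DataFrame({'late_fee': [np.nan, 25.0, np.nan, 40.0]})

# count the missing values before filling
n_missing = int(toy['late_fee'].isna().sum())
print(n_missing)

# placeholder fill value, chosen only for this demo
placeholder = 0.0

# fillna returns a new Series, so assign the result back to the column
toy['late_fee'] = toy['late_fee'].fillna(placeholder)
print(toy)
```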

In [None]:
# your code here


Let's see if it worked.

In [None]:
data.head()

Nice!

### Visualizing the Data Set

Now that we have narrowed down the features we want to use, let's visualize them.

**Try this!** For each feature, make a new cell below and create a plot that we can use to understand the values of that feature. These plots should be appropriate for the type of each feature (e.g. use a bar plot for categorical features). Ensure that you have all the components of a nice plot, making sure to include things like axes labels, a legend, and a title. Also include a `raw` cell below each, describing what you see. `HINT` you can copy and paste groups of cells by shift-clicking them on the left.
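If you are unsure where to start, the sketch below plots one categorical and one numeric column from a made-up toy `DataFrame` (the values are invented, not taken from the loan data). It is one possible layout, not the only right answer.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'education': ['college', 'Bachelor', 'college', 'High School or Below', 'college'],
    'age': [45, 33, 27, 50, 29],
})

fig, (ax_cat, ax_num) = plt.subplots(1, 2, figsize=(10, 4))

# categorical feature: one bar per category, bar height = number of examples
counts = toy['education'].value_counts()
ax_cat.bar(counts.index, counts.values)
ax_cat.set_xlabel('education')
ax_cat.set_ylabel('number of examples')
ax_cat.set_title('Examples per education level')

# numeric feature: a histogram shows how the values are distributed
ax_num.hist(toy['age'], bins=5)
ax_num.set_xlabel('age')
ax_num.set_ylabel('number of examples')
ax_num.set_title('Distribution of age')

fig.tight_layout()
```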

In [None]:
import matplotlib.pyplot as plt

# this cell is free!

# your code here


## 3. Building a Model

Now that we have a sense of the nuances of our dataset, we can try building some models.

![analytics](utility/images/undraw_predictive_analytics_kf9n.png)

Before we continue, we will need to encode our categorical features with enumerations instead of the string values that they currently have. As a reminder, this is what our dataset looks like right now.

In [None]:
data.head()

An easy way to do this encoding is to use the [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) from `sklearn`. In the cell below, we create a list called `categorical` containing the names of the columns corresponding to the categorical features in our dataset. We then create an instance of a `LabelEncoder` and use it to transform the categorical features.

In [None]:
from sklearn.preprocessing import LabelEncoder

categorical = ['loan_status', 'education', 'gender']

# create an instance of a LabelEncoder
encoder = LabelEncoder()

# make a copy of our data
encoded = data.copy()

# apply the encoder's `fit_transform` method to the values for each categorical
# feature column
encoded[categorical] = data[categorical].apply(encoder.fit_transform)

Let's take a look at the results.

In [None]:
encoded

Notice how the categorical values like "PAIDOFF" have now been replaced with numbers. We can see which numbers map to each value like this:

In [None]:
for column in categorical:
    print(*sorted(zip(encoded[column].unique(), data[column].unique()), key=lambda x: x[0]))
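Because a single `LabelEncoder` is refit on every column, it only remembers the classes of the last column it saw. As an alternative sketch (on made-up toy data, with a hypothetical `encoders` dictionary), we could keep one encoder per column and read the mapping from its `classes_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'loan_status': ['PAIDOFF', 'COLLECTION', 'PAIDOFF'],
    'gender': ['male', 'female', 'male'],
})

# keep one fitted encoder per column so each mapping stays available
encoders = {}
toy_encoded = toy.copy()
for column in toy.columns:
    encoders[column] = LabelEncoder()
    toy_encoded[column] = encoders[column].fit_transform(toy[column])

# classes_ is sorted, and the position of each class is its encoded value
for column, enc in encoders.items():
    print(column, list(enumerate(enc.classes_)))

# inverse_transform turns the numbers back into the original strings
recovered = list(encoders['gender'].inverse_transform(toy_encoded['gender']))
print(recovered)
```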

Next let's separate our features from our target variable, `loan_status`.

In [None]:
X, y = encoded.loc[:, encoded.columns != 'loan_status'], encoded.loan_status

### Establishing a Baseline

Now we're ready to start building models. First, let's create a train/test split of our data.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=3)

Then, let's train and evaluate a LogisticRegression model.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear', multi_class='auto')
model.fit(X_train, y_train)

**Try this!** In the cell below, evaluate the model's performance on the testing set.

In [None]:
# your code here


**Write-up!** How does our model perform on the test set?
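When judging a score, it can help to know how well a model would do by always guessing the most common class. The sketch below shows this on synthetic labels (the 80/20 split is made up and says nothing about the loan data) using `DummyClassifier` from `sklearn`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic, imbalanced labels: 80 examples of class 1 and 20 of class 0.
y_toy = np.array([1] * 80 + [0] * 20)

# this baseline ignores the features, so zeros are enough
X_toy = np.zeros((len(y_toy), 1))

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(f'majority-class accuracy: {baseline_accuracy:0.3f}')
```

If a trained model scores close to a baseline like this one, its accuracy may say more about the class balance than about what the model has learned.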

Let's also try looking at the model's performance on test examples of different genders.

In [None]:
print(f'''
validation (men) score: {model.score(X_test[X_test['gender'] == 1], y_test[X_test['gender'] == 1]):0.3f}
validation (women) score: {model.score(X_test[X_test['gender'] == 0], y_test[X_test['gender'] == 0]):0.3f}
''')

Yikes!

**Write-up!** What do you notice about these scores? How do these compare with the initial score we saw for the entire test set? What does this imply about our model?

### Dropping Gender

So our model is biased with respect to gender and gender is a feature of the model. Would it help to ignore the gender feature during training? Let's try it out.

Let's start by creating another train/test split, but this time using a copy of `X` that doesn't include `gender`.

In [None]:
X_without_gender = X.drop(['gender'], axis=1)

X_train, X_test, y_train, y_test = \
    train_test_split(X_without_gender, y, test_size=0.2, stratify=y, random_state=3)

Let's see what `X_train` looks like now.

In [None]:
X_train.head()

Now let's repeat our procedure for our baseline experiment.

In [None]:
model = LogisticRegression(solver='liblinear', multi_class='auto')
model.fit(X_train, y_train)

# `gender` is no longer a column of X_test, so look it up in `X` by index label
test_gender = X.loc[X_test.index, 'gender']

print(f'''
validation score: {model.score(X_test, y_test):0.3f}
validation (men) score: {model.score(X_test[test_gender == 1], y_test[test_gender == 1]):0.3f}
validation (women) score: {model.score(X_test[test_gender == 0], y_test[test_gender == 0]):0.3f}
''')

The results are the same?

**Write-up!** With your neighbor, discuss what this might imply about our model and our data. Also, discuss why it may or may not be a good idea to ignore "protected variables" like "gender" when training a model. Record your response below.
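One possible explanation worth checking is that other features act as a stand-in for the protected variable, as in the Amazon example. The sketch below probes for this on synthetic data, where the `proxy` and `noise` features are made up and `proxy` is built to track gender closely: we try to predict gender from the remaining features and see how well that works. It is an illustration of the idea, not a result about the loan data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic data: `proxy` tracks gender closely, `noise` is unrelated to it.
gender = rng.integers(0, 2, size=n)
proxy = gender + rng.normal(0.0, 0.3, size=n)
noise = rng.normal(0.0, 1.0, size=n)
features = np.column_stack([proxy, noise])

f_train, f_test, g_train, g_test = train_test_split(
    features, gender, test_size=0.2, stratify=gender, random_state=3)

# try to predict the protected attribute from the features that remain
probe = LogisticRegression()
probe.fit(f_train, g_train)

proxy_accuracy = probe.score(f_test, g_test)
print(f'accuracy predicting gender from other features: {proxy_accuracy:0.3f}')
```

An accuracy well above chance (0.5 for two balanced groups) would suggest that dropping the `gender` column does not remove the information it carries.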

## 4. Becoming Data and Fairness Aware

![team](utility/images/undraw_team_spirit_hrr4.png)

The goal of today's lab was to demonstrate how an accuracy score can mislead you into thinking that your model is great and that your mission has been accomplished. By digging only a little bit deeper and evaluating our model's performance on each gender separately, we found that it performed very differently between genders. It was biased!

Just like that, while we were building a model to predict creditworthiness and loan repayment, we ran into the same problem Amazon did with their resume reviewing algorithm. Because the data we used was imbalanced, we introduced bias into our model unintentionally.
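A quick way to check for this kind of imbalance is to look at how many examples each group contributes and how the outcomes are distributed within each group. Here is a minimal sketch on made-up toy data (the counts are invented for illustration).

```python
import pandas as pd

# Toy data: 8 examples from one group and 2 from the other (made up).
toy = pd.DataFrame({
    'gender': ['male'] * 8 + ['female'] * 2,
    'loan_status': ['PAIDOFF'] * 6 + ['COLLECTION'] * 2 + ['PAIDOFF', 'COLLECTION'],
})

# share of the examples that comes from each group
group_share = toy['gender'].value_counts(normalize=True)
print(group_share)

# outcome rates within each group; normalize='index' makes each row sum to 1
outcome_by_group = pd.crosstab(toy['gender'], toy['loan_status'], normalize='index')
print(outcome_by_group)
```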

If you search online, you will find a myriad of ways that could be used to mitigate the effects of an imbalanced dataset. However, at the end of the day, the **best solution for both Amazon and us is to collect more complete data**.

We have only barely scratched the surface of fairness in Data Science. The field is both complex and emerging. If you are looking for more information, I recommend starting with [Google's overview](https://developers.google.com/machine-learning/fairness-overview/) of the topic.

I hope that you will leave here today with a different, more careful perspective on your data and how it might unintentionally create bias in your models.