Model Building and Risk Analysis with Health Survey Data

Sonya Allin (York University, sallin@yorku.ca)
Lisa Zhang (University of Toronto, lczhang@cs.toronto.edu)
Mustafa Haiderbhai (University of Toronto)
Carolyn Quinlan (University of Toronto)
Rutwa Engineer (University of Toronto)
Michael Pawliuk (University of Toronto)

Overview

In this sequence of three programming labs, students build and analyze machine learning models that predict the presence of heart disease from NHANES survey responses. Students build a decision tree, a logistic regression model, and a neural network. More importantly, the labs integrate discussions of model evaluation, group-level fairness analysis, and bias-variance decomposition.

The first lab scaffolds exploratory data analysis so that learners develop an intuition for the data distribution and its limitations. Learners then build decision tree classifiers and perform a grid search over hyperparameters. Learners choose a final model and report its test accuracy, then reflect on the limitations of accuracy as a metric.
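The limitation of accuracy mentioned above can be illustrated with a minimal sketch. The labels below are made up (5 positives out of 100) and are not drawn from NHANES; the function names are hypothetical and are not taken from the lab notebooks.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall); recall is NaN if there are no positives."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return accuracy, recall

# Assumed, illustrative class balance: 5 positive cases, 95 negative cases.
y_true = np.array([1] * 5 + [0] * 95)

# A trivial "model" that never predicts the positive class.
y_always_negative = np.zeros_like(y_true)

acc, rec = accuracy_and_recall(y_true, y_always_negative)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

Under these assumed labels, the trivial predictor is correct on every negative case and misses every positive case, which is why a high accuracy alone says little about a model intended to detect a rare condition.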

The second lab asks learners to implement stochastic gradient descent and train a logistic regression model. Learners analyze the errors made by the model, considering the types of errors and how they differ between sensitive subgroups (men vs. women). The lab explores the use of different thresholds for men and women to predict heart disease and asks learners to critically consider the safety of this approach.
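A rough sketch of this workflow, under stated assumptions, might look like the following. The data is synthetic, the binary `group` indicator is a hypothetical stand-in for a sensitive attribute, and the thresholds 0.5 and 0.3 are arbitrary; none of this reproduces the lab's actual code or the NHANES columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_sgd(X, y, lr=0.1, epochs=50, batch_size=32, seed=0):
    """Minibatch SGD on the mean cross-entropy loss; returns (w, b)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = sigmoid(X[idx] @ w + b)
            err = p - y[idx]  # gradient of the loss w.r.t. the logits
            w -= lr * (X[idx].T @ err) / len(idx)
            b -= lr * err.mean()
    return w, b

def error_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate)."""
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return fp / np.sum(y_true == 0), fn / np.sum(y_true == 1)

# Synthetic data (an assumption for illustration, not NHANES).
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)  # hypothetical subgroup indicator
X = rng.normal(size=(n, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * group - 1.0
y = (rng.random(n) < sigmoid(logits)).astype(float)

w, b = train_logistic_sgd(X, y)
probs = sigmoid(X @ w + b)

# Compare error types per subgroup at two candidate thresholds.
results = {}
for g in (0, 1):
    mask = group == g
    for threshold in (0.5, 0.3):
        y_pred = (probs[mask] >= threshold).astype(float)
        results[(g, threshold)] = error_rates(y[mask], y_pred)
        fpr, fnr = results[(g, threshold)]
        print(f"group={g} threshold={threshold}: FPR={fpr:.3f} FNR={fnr:.3f}")
```

Lowering the threshold for a group can only keep or reduce its false negative rate while keeping or raising its false positive rate, which is the trade-off learners are asked to weigh when considering group-specific thresholds.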

The final lab begins with a demonstration of bias-variance decomposition using synthetic data. Learners then return to the NHANES data set and empirically explore sources of error. The lab introduces model averaging techniques, then asks learners to discuss the impact of data on sources of error and risk.
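One way such a synthetic demonstration could be set up is sketched below. The ground-truth function, noise level, sample sizes, and polynomial degrees are all assumptions chosen for illustration and are not taken from the lab. Errors are measured against the noise-free target, so the irreducible noise term does not appear in the decomposition.

```python
import numpy as np

def true_fn(x):
    # Assumed ground-truth function for the synthetic data.
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_datasets=200, n_train=30, noise_sd=0.3, seed=0):
    """Estimate (bias^2, variance, mse) of a polynomial fit of a given degree.

    Many training sets are sampled, a polynomial is fit to each, and the
    predictions are compared on a fixed grid of test inputs.
    """
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise_sd, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    # Squared bias: gap between the average prediction and the truth.
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)
    # Variance: spread of predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    # Mean squared error against the noise-free target.
    mse = np.mean((preds - true_fn(x_test)) ** 2)
    return bias_sq, variance, mse

decomposition = {d: bias_variance(d) for d in (1, 3, 7)}
for d, (bias_sq, variance, mse) in decomposition.items():
    print(f"degree={d}: bias^2={bias_sq:.4f} variance={variance:.4f} mse={mse:.4f}")
```

In this setup the estimated error equals squared bias plus variance by construction, so learners can see how the balance between the two terms shifts as the polynomial degree grows.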

Assignment Information

Summary	Sequence of three machine learning labs on health data, with embedded discussions related to model evaluation, risk, subgroup analysis, and bias-variance analysis.
Topics	Lab 1: Decision Trees, Model Evaluation, Hyperparameter Tuning; Lab 2: Logistic Regression, Subgroup Analysis; Lab 3: Polynomial Regression, Bias-Variance Decomposition, Ensemble Models
Audience	Upper-level undergraduate students in a machine learning course.
Difficulty	These labs are of moderate difficulty, with each lab expected to take approximately 2-3 hours to complete. However, the difficulty is adaptable: model-building portions, such as the SGD implementation in Lab 2, can be provided to students or omitted.
Strengths	The labs combine technical content with consideration of ethical, legal, and human factors in model use. The data exploration content encourages students to think about where data comes from and how data collection methods affect how models built from the data can or should be used. The subgroup analysis content is particularly illustrative of issues that could arise in real-world settings. We intentionally chose health data for these labs. The data set is sensitive in nature, so the potential risks are intuitively clear, yet small enough that training time is not prohibitive for undergraduate learners with limited time and resources.
Weaknesses	In an upper-level machine learning course, each lab could theoretically be completed in a two-week period. There is, however, a need to balance education about the nuances of data collection and interpretation (e.g., potential issues with self-reported measures, limitations of proxy measures) with data modeling and analysis techniques (e.g., use of linear predictive models); both objectives require time and discussion. Fast-paced courses may struggle to cover this breadth in a short period of time. Written reflection exercises after each lab may help students synthesize what they have learned.
Dependencies	Software: We use Jupyter Notebooks with Python for all three labs; the labs are designed to run on Google Colab. Packages: We use numpy, scikit-learn (sklearn), matplotlib, and pandas in these labs. Additionally, graphviz and pydotplus are used in Lab 1 to visualize decision trees. Prior Material: Lab 1 requires prior coverage of decision trees. Lab 2 requires prior coverage of logistic regression. Lab 3 requires prior coverage of polynomial regression and a theoretical discussion of the bias-variance decomposition.
Variants	Not all "Tasks" in the labs need to be graded. In our implementation, we selected a subset of the "Tasks" to be "Graded Tasks". Parts of the labs can be omitted depending on the desired level of technical difficulty. For example, if the stochastic gradient descent implementation in Lab 2 is inappropriate for a course, it can be removed without affecting the remaining discussion of error analysis. In Labs 1 and 2, the discussions of data exploration, model evaluation, and subgroup analysis are somewhat independent of the actual models developed. Thus, if decision trees and logistic regression are inappropriate for a course, other models could be used instead.

Summary of Resources

The labs are provided as Markdown files, which can be converted into Jupyter Notebook files. Please email the authors for solutions.

Makefile: Uses notedown to generate the ipynb files from the Markdown files.
lab01.md: Lab 1 Markdown file. Used to generate the student-facing lab01.ipynb file.
lab01.ipynb: Student-facing Jupyter Notebook file for Lab 1, compatible with Google Colab.
lab02.md: Lab 2 Markdown file. Used to generate the student-facing lab02.ipynb file.
lab02.ipynb: Student-facing Jupyter Notebook file for Lab 2, compatible with Google Colab.
lab03.md: Lab 3 Markdown file. Used to generate the student-facing lab03.ipynb file.
lab03.ipynb: Student-facing Jupyter Notebook file for Lab 3, compatible with Google Colab.
NHANES-heart.csv: Subset of the NHANES data set; this file is used for all three labs.