Differential Privacy with MedMNIST

Lisa Zhang (University of Toronto, lczhang@cs.toronto.edu)
Sonya Allin (York University, sallin@yorku.ca)
Mahdi Haghifam (Northeastern University)
Rutwa Engineer (University of Toronto)
Michael Pawliuk (University of Toronto)
Florian Shkurti (University of Toronto)

Overview

This sequence of two advanced deep learning assignments introduces learners to issues surrounding differential privacy and model memorization of training data. The assignments, which are implemented in Jupyter Notebooks, guide learners in implementing an advanced optimization method (DP-SGD, or Differentially Private Stochastic Gradient Descent) in PyTorch.

In the first assignment, learners build a Multi-Layer Perceptron (MLP) model that performs predictions on MedMNIST’s PneumoniaMNIST data set. This assignment introduces students to MLPs, and issues surrounding building and evaluating neural networks.
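For illustration, a minimal model of the kind learners build in the first assignment might look like the sketch below. The layer sizes are hypothetical choices, not the assignment's values; PneumoniaMNIST images are 28×28 grayscale, and the task is binary classification.

```python
import torch
import torch.nn as nn

class PneumoniaMLP(nn.Module):
    """Sketch of an MLP: flatten a 28x28 image, apply two linear layers."""
    def __init__(self, hidden_dim=100):  # hidden_dim is an illustrative choice
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # (N, 1, 28, 28) -> (N, 784)
            nn.Linear(28 * 28, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),        # two logits: normal vs. pneumonia
        )

    def forward(self, x):
        return self.net(x)

model = PneumoniaMLP()
logits = model(torch.randn(4, 1, 28, 28))    # a dummy batch of 4 "images"
print(logits.shape)                          # torch.Size([4, 2])
```

Note that the model uses only linear layers, which keeps the prerequisites light and the training time short.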

In the second assignment, learners analyze the model’s predictions on the training vs test data and notice that the model’s prediction pattern differs between the two, making it possible for end users to identify the training data. This observation motivates the idea that the output of neural networks trained using conventional methods may leak the sensitive information of individuals who contributed their data. Learners are then introduced to the idea of Differential Privacy (DP) as a remedy for this problem: DP requires that anything learned about an individual from the algorithm’s output could also be learned if their data were not part of the input. The prevailing method for training neural networks to achieve DP is DP-SGD. In the second assignment, following a step-by-step guide, learners implement the DP-SGD algorithm from scratch. Finally, they plot the model predictions on training vs holdout data and observe that DP training makes them nearly indistinguishable.
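The core of a DP-SGD update, which learners implement step by step, can be sketched as per-example gradient clipping followed by Gaussian noise addition. The sketch below uses a plain Python loop over examples for clarity, and its constants (clip norm, noise multiplier, learning rate) are illustrative, not the assignment's values.

```python
import torch
import torch.nn as nn

def dp_sgd_step(model, loss_fn, xs, ys, clip_norm=1.0, noise_multiplier=1.1, lr=0.1):
    """One DP-SGD update: clip each example's gradient, sum, add noise, step."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):
        # Per-example gradient (a simple loop; faster vectorized methods exist).
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        # Scale so the per-example gradient norm is at most clip_norm.
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.normal(0.0, noise_multiplier * clip_norm, size=p.shape)
            p -= lr * (s + noise) / len(xs)  # average the noisy, clipped sum

model = nn.Linear(784, 2)
xs, ys = torch.randn(8, 784), torch.randint(0, 2, (8,))
before = model.weight.clone()
dp_sgd_step(model, nn.CrossEntropyLoss(), xs, ys)
print(torch.equal(before, model.weight))  # False: the parameters were updated
```

The clipping bounds each individual's influence on the update, and the added noise masks whatever influence remains; together these are what make the resulting training procedure differentially private.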

Emphasis is placed on analyzing and understanding emergent issues in neural network training that are of active interest in the research community, including how a model’s training data can be reconstructed by advanced adversaries and where such data leaks could produce harm. Emphasis is also placed on developing skills to implement novel algorithms from descriptions written in machine learning papers.

Assignment Information

Summary

Sequence of two deep learning assignments on x-ray images: one on building an MLP and another on exploring/implementing differentially private SGD (DP-SGD).

Topics

Assignment 1: Multi-Layer Perceptron, Confusion Matrix, AUC, Hyperparameters, Grid Search
Assignment 2: Same topics plus Differential Privacy, Optimization, DP-SGD, Implementing Advanced Algorithms
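The evaluation topics from Assignment 1 map onto standard scikit-learn calls. As a sketch, with dummy labels and scores (not data from the assignment):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Dummy ground-truth labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.2, 0.6, 0.8, 0.9, 0.4, 0.1])

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, (y_score >= 0.5).astype(int))  # threshold at 0.5
print(cm)   # [[2 1]
            #  [1 2]]

# AUC: the fraction of positive/negative pairs ranked correctly by the score.
auc = roc_auc_score(y_true, y_score)
print(auc)
```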

Audience

Advanced undergraduate students or graduate students in a deep learning course.

Difficulty

The first assignment in the sequence is of moderate difficulty.

The second assignment in the sequence is the more novel contribution of this work, and is of higher difficulty. The assignment provides scaffolding to help learners understand the DP-SGD algorithm as described in a machine learning paper. However, the mathematical notation and optimization ideas used are advanced, and many new ideas (such as learning rate annealing and alternative ways of sampling) are introduced.

Strengths

Both data privacy and differential privacy are important and often overlooked emergent topics. This assignment provides an opportunity to discuss these topics in a technical setting, while building advanced learners’ skills in implementing algorithms from their mathematical descriptions.

Although the assignment is geared towards an advanced audience, the technical prerequisites are relatively few. The model used in the assignments has only linear layers, with no convolutional or recurrent layers. This design allows for flexibility in when the assignment is delivered in a deep learning course.

The dataset was chosen to be sensitive in nature (health data) yet small enough that training time is not prohibitive for undergraduate learners with limited computational resources and a need to balance various commitments.

Weaknesses

This assignment is advanced, which limits the breadth of its applicability. While the assignment asks students to implement a privacy preservation technique, instructors may want to supplement it with discussions of situations and contexts wherein privacy preservation may stand in conflict with application utility. We find that this broader discussion is useful.

Although we aimed for the assignments to be standalone, with all new ideas introduced carefully in the assignment handout, students may still find it useful to have the more advanced concepts introduced during lectures or other synchronous class meetings.

Due to the choice to use small datasets and only linear layers, the difference in performance characteristics between SGD and DP-SGD is minimal.

The assignment is tailored to a particular software stack, namely PyTorch and Jupyter Notebooks (Google Colab). Adapting it to another stack (e.g., TensorFlow) requires significant effort.

The unusually high amount of scaffolding in an advanced course may be seen as a weakness for an advanced assignment.

Dependencies

Software: We use Jupyter Notebooks with Python for both assignments; the assignments are designed to run on Google Colab.

Packages: We use PyTorch, torchvision, matplotlib, sklearn, and numpy for both assignments. The data set is imported from the medmnist library. We also use the differential privacy library opacus in the second assignment.

Prior Material: Multi-Layer Perceptrons and Backpropagation.

Variants

Not all "Tasks" in the lab need to be graded. In our implementation, we selected a subset of the "Tasks" to be "Graded Tasks".

Parts of the assignment can be adapted. For example, the DP-SGD implementation could be provided to students, or the implementation itself de-emphasized, placing greater focus on model comparison, privacy concerns, and the tradeoff between accuracy and privacy in various settings.

The two assignments can be combined into a single, larger assignment.

Summary of Resources

Markdown files are provided, which can be compiled into Jupyter Notebook files. Please email the authors for solutions.