In this assignment, you will build a system for predicting patient deaths in the Intensive Care Unit (ICU) using the large PhysioNet Computing in Cardiology Challenge 2012 dataset. For each patient in the dataset, demographic variables and time series of physiological variables were collected during their stay in the ICU.

Clinical risk prediction from physiological measurements, clinical notes, and demographic data using machine learning has seen advances in recent years. For example, see this article about Google's efforts in the area.

The risk prediction system you will build could be in principle be used to flag patients as being at risk of death so that physicians could intervene and improve their outcome. To be confident about the impact of such a system, you would need to to run an experiment. In this assignment, you will use a model in order to estimate the potential impact of the system.

The data you will be working with is all available from PhysioNet. You will be looking at only the data in "Training set A". The patient data files are https://physionet.org/content/challenge-2012/1.0.0/set-a.zip, and the outcomes file is https://physionet.org/challenge/2012/Outcomes-a.txt.

You will submit an `ipynb`

file and the `pdf`

file that was produced from it. Your code should be general enough that if the dataset were changed, the code would still run.

You will be graded on the correctness of your code as well as on the professionalism of the presentation of your report. The text of your report should be clear; the figures should do a good job of visualizing the data, and should have appropriate labels and legends. Overall, the document you produce should be easy to read and understand.

When you are asked to compute something, use code (rather than compute things outside of Jupyter Notebook and then input the answer), and include the code that you used in the report (you may deviate from this if the computation is really trivial.)

**10 points for a professional and well-formatted report. Points will be taken off if the report is not well-formatted and is difficult to read.****5 points for submitting the Rmd file on Blackboard and for submitting the pdf file on Gradescope. Pages on Gradescope should be assigned to questions.**

First, you should download the physiological data and patient outcomes file. Downloading set-a.zip, extract all the files from it to a directory, and download Outcomes-a.txt to the same directory. To extract files from a zip file, open the zip file, and then drag files that you see in a window that opens to whatever directory you want.

Read in the files as follows. Your working directory must be set to the directory that contains the text files you extracted. If your `filelist`

is empty, re-check what your work directory is.

**Note**: use something like

`os.chdir("/home/username/courses/datascience/project/set-a/")`

We want this data to be read from text files and assembled into a dataframe. To do so, please run the following code which will first define a function that reads a text file, and then runs that function on all the files and assembles the outputs into a single dataframe.

**Look at the function comp_patient and explain what it is doing in your report. Look at the first row of allpat_dat and at the first element of filelist, and explain precisely how the Age and Temp columns were computed. Reproduce the computation for Temp in your report and make sure that the numbers match. To reproduce the computation, look up the relevant numbers in the relevant txt file, and then enter them in your ipynb file and perform the computation in Python.**

In [1]:

```
import numpy as np
import glob, os
import pandas as pd
# CHANGE ME
os.chdir("/home/username/courses/datascience/project/set-a/")
### first we define a function that extracts the relevant info
def comp_patient2(patdat, attrs):
patdat[patdat == -1.0] = float('NaN')
patdat_dict = {}
for attr in attrs:
patdat_dict[attr] = [patdat["Value"][patdat["Parameter"]==attr].mean(axis = 0)]
return patdat_dict
attrs = ["Age", "Gender", "Height", "Weight", "Urine",
"HR", "Temp", "NIDiasABP", "SysABP", "DiasABP", "pH",
"PaCO2", "PaO2", "Platelets", "MAP", "K", "Na", "FiO2", "GCS", "RecordID"]
full_dat = pd.DataFrame(columns = attrs)
# Now let us take the list `all_pat_dat` and assemble it into a dataframe so we can wrange it with `Pandas`.
filenames = sorted(glob.glob("*.txt"))
for filename in filenames:
if filename == "Outcomes-a.txt":
continue
data = open(filename)
patient_dat_full = pd.read_csv(data, delimiter=',')
patient_dat = pd.DataFrame.from_dict(comp_patient2(patient_dat_full, attrs))
full_dat = full_dat.append(patient_dat)
```

`Outcomes-a.txt`

.

In [2]:

```
outcome_dat = pd.read_csv("Outcomes-a.txt")
outcome_dat[outcome_dat == -1] = float('NaN') # set all -1 to NaNs
full_dat_out = full_dat.merge(outcome_dat, left_on='RecordID', right_on='RecordID')
full_dat_out.replace([np.inf, -np.inf], np.nan)
col_means = full_dat_out.mean()
# Set everything that's NaN to the mean of that column:
# Note: we do this for simplicity. Strictly speaking, you should
# only use the training set to compute the column means
for i in range(full_dat_out.shape[1]):
mask = np.isnan(np.array(full_dat_out.iloc[:,i]))
full_dat_out.iloc[mask, i] = col_means[i]
```

Divide your data into training, validation, and test sets.

Use the features `HR`

, `Gender`

, `age`

, `temperature`

, `weight`

, `height`

, `PaO2`

, and `PaCO2`

, and fit a logistic regression model to predict in-hospital death. Compare the performance you obtain (on both the training and the validation sets) better than the base rate?

Write a function that, for a given threshold, calculates both the False Positive Rate (proportion of non-deaths identified as deaths) and True Positive Rate (proportion of deaths correctly identified as such) for your regression model.

For 100 threshold values equally spaced from 0 to 1, plot the True Positive Rate vs. the False Positive Rate. Use the validation set.

This plot is known as an ROC curve.

Using the plot generated in Part 3, what is the False Positive Rate associated with correctly identifying 90% of patients at risk for death in the ICU? Why might a high false positive rate be appropriate in this setting? You can read the answer off the ROC curve plot.

For this part, produce a short report that answers all the questions below. Include R code that produces the numbers that you need.

At the beginning of their shift, a doctor reviews their patients' charts, and decides what intervention is needed for each patient. In the following Parts, we will be trying to improve this process. We will consider a simplified version of what is going on. Suppose that if the doctor intervenes correctly, the patient will not die; suppose that the doctor has 60 minutes to look through 25 patient charts; and suppose that the probability of missing the correct treatment if the doctor spends $t$ minutes on reviewing a file is

$$P(\textrm{fail}) = \exp(-t^2/100).$$If the doctor reviews all the files, spends an equal amount of time on each chart, and there are 10 patients who will die without the correct intervention, how many patients are expected to die, if the doctor intervenes when they see that that's needed? What is the percentage of patients who are expected to die, out of 25?

Suppose now that the doctor is looking through all the patient charts in the validation set. They would have proportionately more time: $(N/25)\times 60$ minutes in total (where $N$ is the total number of patients in the set). How many patients would be expected to die, if the doctor intervenes correctly when they know they should do that?

Now, suppose that the doctor only reviews the files of patients for whom the model outputs a probability of greater than $20\%$. This would give the doctor more time to look through each file, but the doctor would never be able to intervene in the cases of patients form whom the output is $20\%$ or smaller. How many patients would be expected to die?

In this Part, you will explore the policy implications of using our model in an understaffed hospital.

Suppose that we are considering a policy of only reviewing the files of patients whose probability of death is above a threshold `thr`

. Each chart would be given an equal amount of time, and the total amount of time will be $(N/25)\times 60$.

Using the model from 5(a), plot the total number of expected deaths under the policy vs. the threshold. Using the plot, what is the best threshold to use that would minimize the number of deaths?

You should compute the expected number of deaths for the thresholds `np.arange(0, 1, 0.01)`

.

Use the validation set.

On the test set, compare the total number of expected deaths under the best policy that was selected in 5(b) to reviewing each patient's file. In relative terms (i.e., as a percentage), how many lives would be saved, if the assumptions underlying our simulation are accurate?

We made a number of assumptions when we tried to assess the impact of using our model. Critique those assumptions.