#In this assignment you will gain experience analyzing preprocessed clinical datasets. You will practice using common time-saving tools in the R programming language that are ideally suited to these tasks.
#You will work with a dataset that we have prepared for you using a process similar to what you were given in the prior homework. The dataset describes patients from the MIMIC III database who were put on mechanical ventilation and were stable for 12 hours. Some of these patients then experienced a sudden and sustained drop in oxygenation, while others did not.
#We have recorded a variety of features about each patient before the 12-hour mark (the index time), including counts of all prior diagnoses (aggregated with IC), all respiratory-related concepts in their notes, and indicators of events recorded in the patient charts. Indicator features are the number of times each event was recorded in the patient record, regardless of what the measured value was. For those chart events which have numeric values associated wtih them (e.g. lab tests) we found those in which a value was recorded for over 85% of the cohort and included the latest recorded value of those features. In addition, we have included demographic features (age and sex). For the small number of patients who did not have one or more of those features recorded, we used column-mean imputation to impute them. We also recorded whether or not each patient went on to experience a sudden and sustained drop in their oxygenation (the exposure). Finally, we recorded whether or not each patient eventually died during their hospitalization (the outcome). All of that data is contained in patient_feature_matrix.csv. Its companion file feature_descriptions.csv has descriptions of each of the features and their provenance. The final dataset you have access to is called cohort.csv, which contains the index time, exposure time (if any), in-hospital time of death (if any), and the time of censoring (when the patient was released from the hospital).
#The first portion of this assignment is identical to the first portion of the inferential analyses assignment.
#Please edit this document directly using either Jupyter Notebook or R markdown in R Studio and answer each of the questions below in-line. Jupyter and R markdown are useful tools for reproducible research that you will use over and over again in your later work. They are worth taking the short amount of time necessary to learn them. Turn in a single .pdf document showing all of your code and output for the entire assignment, with each question clearly demarcated. Submit your completed assignment through Canvas.
#Grading: All answers will be graded on the correctness and quality of your code and analyses. Partial credit will be given based on a demonstration of conceptual understanding and how close you can come to solving the problem. At various points we will ask you to produce particular values: the correctness of these numbers will not be used for your grade - they are tools for us to get an idea about what your code is doing.
#The first thing we need to do is load all of the important packages we will use for this assignment. Please load the packages caret and tidyverse. There are several other packages you will need or may want to use during the course of the assignment but if you need a package other than one of these two for a particular problem it will be noted in the problem statement.
install.packages(“caret”) install.packages(“ggplot2”) install.packages(“dplyr”) install.packages(“tidyverse”) library(caret) library(tidyverse) library(ggplot2) library(dplyr)
library(readr) patientFeatureMatrix <- read_csv(“Desktop/Homework6/patient_feature_matrix.csv”) featureDesc <- read.csv(“~/Desktop/Homework6/feature_descriptions.csv”) cohort <- read.csv(“~/Desktop/Homework6/cohort.csv”)
#Split the patient matrix up into a numerical matrix of features and a character vector of the outcome (died or survived). For the feature matrix, exclude the subject ID and the outcome variable and use data.matrix().
features <- data.matrix(patientFeatureMatrix) print(ncol(features)) features <- subset(features, TRUE, -subject_id) features <- subset(features, TRUE, -death_in_stay) print(ncol(features)) print(typeof(features)) #print(head(features)) outcome <- as.character(patientFeatureMatrix$death_in_stay) print(typeof(outcome[1])) #print(outcome) print(length(outcome))
#Before we do any modeling, let’s cut down on our feature space by removing low-variance features that probably aren’t useful enough to measure association with or use in a predictive model. caret has a function to do that, so let’s use it instead of reinventing the wheel.
#Find the relevant function in the caret documentation and use it to create a new patient-feature matrix with only the useful features. From now on we will use the result of this step instead of the full feature matrix.
featuresToFilt <- nearZeroVar(features, names = TRUE) #print(featuresToFilt) print(length(featuresToFilt)) %!in% <- Negate(%in%) cols <- which(colnames(features) %!in% featuresToFilt) print(length(cols)) filteredFeatures <- subset(features, TRUE, select = cols) print(dim(filteredFeatures)) print(ncol(filteredFeatures)) chartIndic <- which(startsWith(colnames(filteredFeatures), “chartindicator”)) print(length(chartIndic)) chartVal <- which(startsWith(colnames(filteredFeatures), “chartvalue”)) print(length(chartVal)) codes <- which(startsWith(colnames(filteredFeatures), “C”)) print(length(codes))
print(‘original features:’) chartIndic <- which(startsWith(colnames(features), “chartindicator”)) print(length(chartIndic)) chartVal <- which(startsWith(colnames(features), “chartvalue”)) print(length(chartVal)) codes <- which(startsWith(colnames(features), “C”)) print(length(codes)) print(head(outcome))
#In this part of the assignment we will see if we can predict which patients will die during their hospitalizations, given only the data from before the end of their 12-hour long stable ventiliation period.
#We will use the caret library for our predictive modeling tasks, so take a minute to acquaint yourself with it.
#Note on replicability: Many of the exercises below require you to run code that has some amount of randomness to it. To ensure that you get the same answer as everyone else, we will ask you to run set.seed(1) at various points. caret has the ability to run models in parallel across multiple cores, but we ask that you do not do this because each thread requires its own random seed and the results will not reconcile with single-core results.
#Note on packages and masking: You will load many packages in these exercises and some of them will import functions with names that conflict with other functions. To call the function from the package that you want, you can use the :: qualifier as in package::function(). For example, if both dplyr and plyr are loaded, you would use dplyr::summarize() to call dplyr’s summarize and plyr::summarize() to call plyr’s summarize.
#To find out how good the predictive models we will make are, we’ll need to randomly split the data into training and test sets.
#Use caret to make training and test sets that preserve the proportions of the outcome in each dataset. Use an 80% training / 20% testing split.
library(caret) set.seed(1) trainIndex <- createDataPartition(outcome, p = .8, list = FALSE, times = 1) head(trainIndex)
training <- filteredFeatures[trainIndex, ] testing <- filteredFeatures[-trainIndex, ] outcomeTrain <- outcome[trainIndex] outcomeTest <- outcome[-trainIndex]
print(dim(training)) print(dim(testing)) print(length(outcomeTrain)) print(length(outcomeTest))
#Fit an elastic net model on the training data and use it to predict on the test set. Use \(\lambda=0.01\) and \(\alpha=1\) (LASSO).
#What is the missclassification accuracy of the resulting model? install.packages(“glmnet”) library(glmnet) set.seed(1) outcomeTrainBinary <- as.numeric(outcomeTrain == “died”) outcomeTestBinary <- as.numeric(outcomeTest == “died”) print(dim(training)) print(length(outcomeTrainBinary)) lassoTrain <- train(training, outcomeTrain, method = “glmnet”, tuneGrid = expand.grid(alpha = 1, lambda = 0.01))
predictO <- predict(lassoTrain, newdata = data.frame(testing)) accuracy <- mean(outcomeTest == predictO) print(accuracy) #The mean accuracy is approximatey 81.77%.
#This looks like good performance, but misclassification accuracy can be misleading. It would be useful to look at the two-by-two table for the predictions vs. the true outcomes. caret has a function for this that also computes useful metrics to measure classifier performance. Find it in the documentation and use it to find the sensitivity and specificity of this model. What are those values for this model? What is a simple strategy or rule you could use to get a reasonable misclassification accuracy in this case without using any training data at all? xtab <- table(predictO, outcomeTest) conf_matrix <- confusionMatrix(xtab) print(conf_matrix)
#This model has a sensitivity of 0.96460 and a specificity of 0.15873 . If desiring a reasonable misclassification accuracy without using any training data, we could assign each predicted outcome to the majority class (ie. “survived” or “died”). As a result, we would correctly predict the outcome with the frequency of the most probable class.
#Alternatively, we can use the predicted class probabilities instead of the predicted classes and use either a precision-recall or ROC curve to assess the model.
#Write your own code and use ggplot2 to generate both an ROC curve and a precision-recall curve for the performance of this model on the test set.
library(ggplot2)
rocV <- function(labels, scores){ labels <- labels[order(scores, decreasing=TRUE)] data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), precision =cumsum(labels)/(cumsum(labels) + cumsum(!labels)), labels) }
#sensitivity = recall = true positive rate #specificity = true negatives/ true negatives + false positives = FPR predictBinary <- as.numeric(predictO == “died”) outcomeTestBinary <- as.numeric(outcomeTest == “died”)
rocVals <- rocV(outcomeTestBinary, predictBinary) ggplot(data = rocVals, aes(x = FPR, y = TPR))+ geom_line() + ggtitle(“ROC Curve”)
precision <- rocVals\(precision recall <- rocVals\)TPR pr <- data.frame(precision = precision, recall = recall) ggplot(data = pr, aes(x = recall, y = precision))+ geom_line() + ggtitle(“Precision-Recall Curve”)
#Another good way to assess the utility of a classifier is with a calibration plot.
#Write your own code and use ggplot2 to generate a calibration plot for the performance of this model on the test set.
#how to do calibration plot: histogram binning of all the predicted probabilities #after you bin them, you go in each bin + assess how many of them are actually one #empirical probability = amount that’s actually positive / total amount of values in that bin #you’re assessing whether or not the probability you got is actually an unbiased probability #ideal plot = diagonal line from 0 to 1 #take examples from test set + partition them into bins based on predicted probabilities #of those examples that correspond to predicted responsibility in given bin, count how many actually have the positive outcome #how many of predicted positive outcomes actually have those positive outcomes
calibrations <- data.frame(outcome = outcomeTestBinary, prediction = predictBinary) ggplot(data = calibrations, aes(x = prediction, y = outcome)) + geom_line() + ggtitle(“Calibration Plot”)
#Let’s see if we can find a different model that will do better by searching over different values of \(\gamma\) and \(\alpha\). To assess the utility of each model, we will use four-fold cross validation over the training set and calculate the AUC (note that caret annoyingly refers to the AUC as the ROC) on each held-out set. At the end, we will use the model parameters that give the best result and see how well the model does on the test set. We will test over a grid of \(\lambda\) and \(\alpha\) values. Use \(\lambda \in e^{[-6.5, -6, -5.5 ... -3, -2.5, -2]}\) and \(\alpha \in [0.1, 0.5, 0.9]\).
#Use caret to do this cross-validation and produce a caret model object using the functions trainControl, expand.grid, and train. You will likely have to search google and the caret documentation to find out how to do this and how to set the metric to AUC. Use set.seed(1) right before you call trainControl() to ensure that you get the same cross-validation folds as everyone else. Depending on your computer, fitting these models could take a few minutes. Call plot() on the returned model object to examine how the AUC changes with the different parameter choices.
#Examine the returned model object to find the values of \(\alpha\) and \(\lambda\) that produced the best result. What were they, and what was the AUC of that model? (hint: the model object contains two dataframes that you can inner join that will neatly give you this result)
alpha <- c(0.1, 0.5, 0.9) lambda <- c(exp(-6.5), exp(-6), exp(-5.5), exp(-5), exp(-4.5), exp(-4), exp(-3.5), exp(-3), exp(-2.5), exp(-2))
library(caret) ctrl <- trainControl(method = “cv”, number = 4, classProbs = TRUE, summaryFunction = twoClassSummary) lassoTrainCV <- train(training, outcomeTrain, method = “glmnet”, tuneGrid = expand.grid(alpha = alpha, lambda = lambda), trControl = ctrl)
plot(lassoTrainCV)
#Examining the model, we can see that the alpha and lambda that produce the best results are lambda = 0.082 and alpha = 0.1, resulting in an AUC of approximately 0.81.
#Use the pROC package to calculate the AUC of the ROC curve for this model. What is the AUC statistic of this model on the test set? Is it close to what was estimated by cross-validation?
#install.packages(“pROC”) library(pROC)
predict2Train <- predict(lassoTrainCV, newdata = data.frame(training), type = “prob”) rTrain <- pROC::roc(outcomeTrainBinary, predict2Train[,2]) auc(rTrain)
predict2Test <- predict(lassoTrainCV, newdata = data.frame(testing), type=“prob”) predict2BinaryTest <- as.numeric(predict2Test == “died”) outcomeTestBinary <- as.numeric(outcomeTest == “died”) rTest <- pROC::roc(outcomeTestBinary, predict2Test[,2]) auc(rTest)
#The AUC under the curve for the test set was 0.7648, which was lower than what we originally calculated when taking the AUC from the training set. This indicates the model has somewhat overfit.
#What if we had used univariate analyses on the training set to find all of the features that are significantly associated with mortality and have a large enough effect size and then used only those features to fit our models? Would you expect the cross-validation AUC to be on average larger, smaller, or the same as the test set AUC? Why? If you needed to reduce the feature space in this way, what could you in the cross validation to avoid potential mis-steps?
#The cross-validation AUC would on average be larger. We may be leading the model to overfit by selecting features that are a better fit for the training set, but may not be the most significant features for the testing set. If we needed to reduce the feature space, we could conduct our univariate analysis for each fold. In this way, we would choose our feature space for each fold.
#Is what is decribed above in 2.3.3 the same as we did earlier by removing the near-zero-variance features in the sense that we should we expect it to have a similar effect on the test vs. cross-validation AUC? Why or why not?
#What we described above in 2.3.3 is not the same as what we did earlier by removing the near-zero-variance features. This is because we conducted the near-zero-variance features on the full dataset, while we would perform the process of 2.3.3 on only the training set.
For the best model, what are the 10 most important features (in terms of the magnitude of their coefficients) and their descriptions and feature types? You will need to call the coef function on an internal datastructure of the caret object to do this. As always, dplyr comes in handy as well.
#install.packages(“dplyr”) library(dplyr) bestModel <- lassoTrainCV\(finalModel coeffs <- as.matrix(coef(bestModel, bestModel\)lambdaOpt)) coeffs <- as.data.frame(coeffs) coeffs$absolute <- abs(coeffs[,1]) top_n(coeffs, 11, absolute)
#(Intercept) 1.77551010 1.77551010 #oxy_drop 0.15564416 0.15564416 - engineered, sustained drop in oxygenation after stable ventilation #C0264545 -0.05545411 0.05545411 - note CUI, pleural thickening #chartindicator_795 -0.05072172 0.05072172 - chart indicator, Differential-Bands #chartindicator_850 -0.13417081 0.13417081 - chart indicator, Triglyceride (0-200) #chartvalue_184 0.07269283 0.07269283 - chart value, Eye Opening #chartvalue_190 -0.29259047 0.29259047 - chart value, FiO2 Set #chartvalue_454 0.10965138 0.10965138 - chart value, Motor Response #chartvalue_815 -0.10027653 0.10027653 - chart value, INR (2-4 ref. range) #chartvalue_833 0.04943822 0.04943822 - chart value, RBC #chartvalue_87 0.08035453 0.08035453 - chart value, Braden Score
#Let’s see if we can do better with a nonlinear model. Gradient boosted trees are considered to be state-of-the-art, so let’s give them a shot. In caret, boosted trees are implemented in the gbm method, which has several parameters. Describe each parameter and whether increasing each of them increases or decreases the bias or variance. #1) n.trees - the number of iterations/trees; Increasing n.trees decreases bias and increases variance. #2) shrinkage - learning rate, how quickly the algorithm adapts; A low learning rate will increase bias and decrease variance. #3) n.minobsinnode - the minimum number of training set samples in a node to commence splitting. Increasing n.minobsinnode decreases variance but increases bias. #4) Interaction depth - Having low order interactions will result in high bias and low variance.
#Using the same evaluation metric (ROC) and the same cross-validation setup (4-fold CV) as before, fit a gradient boosted tree model using caret. Set the interaction depth at 3, the minimum observations per node at 3, and the shrinkage at 0.1. Fit models ranging from 5 to 250 trees, in increments of 5 trees. Again use set.seed(1) before calling fitControl(), but this time also call set.seed(1) before calling train(). Training may take some time depending on your computer. Plot the model object to see how the cross-validation AUC changes as more trees are fit, and report the best parameter set and the resulting AUC, sensitivity, and specificity.
#method = gbm, specified number of trees in expand.grid for those parameters library(caret) set.seed(1) ctrl <- trainControl(method = “cv”, number = 4, classProbs = TRUE, summaryFunction = twoClassSummary) set.seed(1) gradBoostedTree <- train(training, outcomeTrain, method = “gbm”, tuneGrid = expand.grid(shrinkage = 0.1, n.minobsinnode = 3, interaction.depth = 3, n.trees = (1:50)*5), trControl = ctrl) plot(gradBoostedTree)
predictTreeTrain <- predict(gradBoostedTree, newdata = data.frame(training), type = “prob”) rTrainT <- pROC::roc(outcomeTrainBinary, predictTreeTrain[,2]) auc(rTrainT)
predictTreeTest <- predict(gradBoostedTree, newdata = data.frame(testing), type = “prob”) rTestT <- pROC::roc(outcomeTrainBinary, predictTestTree[,2]) auc(rTestT) #Result: AUC 0.9347
xtabTree <- table(predictBinary, outcomeTestBinary) sensitivityT <- sensitivity(xtabTree) print(sensitivityT) #Result: sensitivity of 0.9646018
specificityT <- specificity(xtabTree) print(specificityT) #Result: specificity of 0.1587302
Plot the test set ROC curve for this model.
predictTreeTest <- predict(gradBoostedTree, newdata = data.frame(testing), type=“prob”) outcomeTestBinary <- as.numeric(outcomeTest == “died”) rTreeTest <- pROC::roc(outcomeTestBinary, predictTreeTest[,2]) plot(rTreeTest)
#Use the varImp function to find the top ten most important features in this model. Report the importance measures, descriptions, and feature types for these top ten features. Read up about how these importance measures are calculated. Would a variable that is only split on in the first tree be more important than a variable that is only split on in the 200th tree? Which types of feature seem to be the most useful? Why do you think that is the case?
library(gbm) modelTree = gradBoostedTree\(finalModel varImp <- varImp(object = modelTree) varImp\)absolute <- abs(varImp[,1]) topVar <- top_n(varImp, 11, absolute) print(topVar)
#age_in_days 78.35896 78.35896 - age in days, demographic #chartindicator_127 36.25067 36.25067 - Circulation/SkinInt, chart indicator #chartindicator_620 17.90137 17.90137 - Responds to Stimuli [Type], chart indicator #chartindicator_655 21.02618 21.02618 - Spontaneous Movement, chart indicator #chartvalue_198 55.10935 55.10935 - GCS Total, chart value #chartvalue_454 33.24092 33.24092 - Motor Response, chart value #chartvalue_778 38.29700 38.29700 - Arterial PaCO2, chart value #chartvalue_781 41.45704 41.45704 - BUN (6-20), chart value #chartvalue_787 18.04376 18.04376 - Carbon Dioxide, chart value #chartvalue_828 20.56500 20.56500 - Platelets, chart value #chartvalue_87 19.67771 19.67771 - Braden Score, chart value
#Variable importance is a measure of the sum of the decrease in error once split by a variable. A variable that is only split on in the first tree is more significant than a variable that is split on in the 200th tree, as it effects each tree branching from it. Continuous numerical values appear to be most useful, because you can select your decision boundary.
#One of the nice things about tree ensembles is that they can automatically find and exploit interactions between features. Use the final model object and the plot command exported by the gbm package to plot the two-way partial dependence of the outcome on age_in_days and chartvalue_198 as well as the one-way partial dependance on each of them individually. Read a bit about partial dependence plots. Is the effect of age linear? What combination of these two features is most associated with worse outcomes? Do you think the result makes sense? Why or why not?
#plot(gradBoostedTree, i.var = 1, lwd = 2, col = “blue”, main = ““) #install.packages(”viridis”)
library(gbm) plot(gradBoostedTree\(finalModel, i.var = c("age_in_days", "chartvalue_198")) modelTree <- gradBoostedTree\)finalModel plot(modelTree, i.var = “age_in_days”) plot(modelTree, i.var = “chartvalue_198”)
#The graph appears as a step function, so the effect of age is non-linear. Old age and low chartvalue_198 is most assocaited with worse outcome. This result makes sense, as older people are more at-risk for being sick.
#What is strange about the range of the age variable in the partial dependence plot? Investigate by plotting a histogam of patient ages. What does this say about the data? Do you think this adversely affects the test set peformance of the gradient boosted tree model? What do you think the effect is on the linear model? Do you have evidence of that effect given the coefficients of the linear model?
age_in_years <- filteredFeatures[,“age_in_days”]/365 print(max(age_in_years)) hist(age_in_years)
#The range of the age variable in the partial dependence plot is well beyond the maximum span of a human life. This indicates the data is skewed to protect patient identify. Splitting the data based on age, we don’t care about the difference between 100 and 300 years old if the cut-off is 80 years old. The linear model cares about the exact values, but the gradient boosted tree model doesn’t care. In other words, the linear model is adversely affected but the gradient boosted tree model isn’t adversely affected. We can verify this by observing the coefficients of the linear model, as the coefficients are skewed towards 0.
#Given that we randomly split the data into training and test sets, do you think that the test set accuracy would be a good estimate, an overestimate, or an underestimate of the accuracy if we used this model to predict mortality for patients in the coming year? Justify your answer in one or two sentences. Give a suggestion for an alternative data-splitting method that could be better and why.
#The test set accuracy would be an over-estimation of the performance, because it doesn’t consider fluctuations of death rate throughout the years. For example, a test set split from a dataset in 2020 might not have as high a performance for predicting mortality in the year following the pandemic. This is because mortality may vary in the following year, and thus the test set would align more with the training data than the new data. An alternative data-splitting method would be to collect data across multiple years and have the training set represent the first few years of your data and the test set represent the last few years of your data.
| A | B | C | D | E |
|---|---|---|---|---|
| a great deal | a lot | a moderate amount | a little | none at all |
#I learned a great deal.