Cardiac Surgery Predictive Model

Introduction

This is an exploratory effort to build a model that predicts how many hours a cardiac surgery patient will spend in an ICU.

The solution uses the Random Forests algorithm in the R “caret” package.

Random Forests - Advantages/Disadvantages
Advantages:
– Can be very accurate
– Automatically includes interactions between variables
– Fairly easy to implement; fewer parameters to deal with
– Works even when number of examples is less than number of variables

Disadvantages:
– Meant for classification problems; not for predicting countinuous outcomes
– Can be computationally intense (slow to process)
– Has the disadvantage of all machine language solutions in that it is hard to explain
– Doesn’t handle missing values; must impute
– Can be prone to over-fitting. Important to use cross-validation

(Note, the following solution does not perform any normalization of numeric data in the data set, ie. numeric values were not centered or scaled.)

Load R Libraries

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

library(kernlab)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(e1071)
library(doMC)  ## to enable multi core parallel processing

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

registerDoMC(cores = 4)  ## MacBook Pro Air has four cores.  Parallel processing will happen automatically]

## Set random number seed for reproducibility purposes.
set.seed(1234)

Load the data set

This code loads the cardiac surgery data set from an Excel file into an R data frame.

## Set working directory for R code
## workingdir = "\\\\ccor_ds\\common\\dbase\\CardiacSurgery\\preparation\\analysis\\R\\Code"
workingdir = "/Users/mitchellfawcett/Documents/CardiacSurgery/Code"
setwd(workingdir)

## Read data file
## datapath <- "\\\\ccor_ds\\common\\dbase\\CardiacSurgery\\preparation\\analysis\\R\\Data"
datapath <- "/Users/mitchellfawcett/Documents/CardiacSurgery/Data"
filename <- "Copy of Surgical_Data_V12_Data_Dictionary_TC.csv"
df <- read.csv(paste(datapath, filename, sep = "/"), header = TRUE, na.strings = "NA", stringsAsFactors = TRUE)
## Display summary information about the raw data set

## Number of rows and columns in data set
dim(df)

## [1] 2286  320

Preprocess Data Set

Change blanks to “NA” (NA indicates a missing value in R). Identify the columns that have 25% or less missing values and columns with more than 25% missing values.

## Change string "ERROR" in N_First_Hematocrit_After_Surgery column to ""
df$N_First_Hematocrit_After_Surgery[df$N_First_Hematocrit_After_Surgery == "ERROR"] <- ""

## Warning in `[<-.factor`(`*tmp*`, df$N_First_Hematocrit_After_Surgery == :
## invalid factor level, NA generated

## Change the class of the column from character to numeric
df[, c("N_First_Hematocrit_After_Surgery")] <- sapply(df[, c("N_First_Hematocrit_After_Surgery")], as.numeric)

## Change blanks to NA
df[ df == "" ] = NA

## Variables with 25% or less missing values
names(df[, colSums(is.na(df)) <= nrow(df) * 0.25])

##   [1] "SS_Event_STS_ID"                  "SS_Patient_ID"                   
##   [3] "PatientPersonID"                  "Account_Number"                  
##   [5] "Payor"                            "VisitID"                         
##   [7] "VisitStartDateTime"               "VisitEndDateTime"                
##   [9] "Admission_Date"                   "Discharge_Date"                  
##  [11] "Surgery_Date"                     "Surgery_End_Time"                
##  [13] "Surgery_Start_Time"               "Gender"                          
##  [15] "Height"                           "Weight"                          
##  [17] "AgeInYears"                       "Race"                            
##  [19] "RaceCaucasian"                    "RaceAfrican_American"            
##  [21] "RaceAsian"                        "RaceNative_American"             
##  [23] "RacePacificIslander"              "RaceOther"                       
##  [25] "RaceHispanic"                     "Hospital"                        
##  [27] "ChronicLungDisease"               "N_RFDiabetes"                    
##  [29] "N_RFDialysis"                     "N_RFHypertension"                
##  [31] "N_RFTobaccoUse"                   "N_RFFamilyHistoryCAD"            
##  [33] "Infectious_Endocarditis"          "N_RFPeripheralVascularDisease"   
##  [35] "N_RFCerebrovascularDisease"       "N_Unstable_Angina"               
##  [37] "RFAlcoholUse"                     "RFCVD_CVA"                       
##  [39] "RFCVD_TIA"                        "RFLastHematocrit"                
##  [41] "LastCreatinineLevel"              "N_RFImmunosuppresiveRx"          
##  [43] "N_RF5mWalkTest"                   "CardiacPresentationAdmission"    
##  [45] "AnginaClass2wks"                  "ArrhythmiaWhen"                  
##  [47] "N_Arrhythmia"                     "Arrhythmia2HBWhen"               
##  [49] "Arrhythmia3HBWhen"                "ArrhythmiaAFib"                  
##  [51] "ArrhythmiaAFibDuration"           "ArrhythmiaAFlutterWhen"          
##  [53] "ArrhythmiaPermPaced"              "ArrhythmiaSSSWhen"               
##  [55] "ArrhythmiaVVWhen"                 "Mitral_Insufficiency"            
##  [57] "N_CardiogenicShock"               "CathAorticStenosis"              
##  [59] "CathMitralStenosis"               "CathPulmonicStenosis"            
##  [61] "CathTricuspidStenosis"            "N_CHF"                           
##  [63] "NumDiseasedCoroVessels"           "CVA"                             
##  [65] "CVA_When"                         "Diabetes_Control"                
##  [67] "EF"                               "EF_Done"                         
##  [69] "N_Diagnosis_Type"                 "PreOpMedACEIARBI48hrs"           
##  [71] "PreOpMedADPInhibitors5Days"       "PreOpMedAnticoagulants"          
##  [73] "PreOpMedAntiplatelets5Days"       "PreOpMedAspirin"                 
##  [75] "PreOpMedBetaBlockers"             "PreOpMedCoumadin"                
##  [77] "PreOpMedGPIIbIIIaInhibitor"       "PreOpMedInotropes"               
##  [79] "PreOpMedLipidLowering"            "PreOpMedNitratesIV"              
##  [81] "PreOpMedSteroids"                 "Previous_CABG"                   
##  [83] "N_PreviousCardiacIntervention"    "Previous_Valve"                  
##  [85] "N_PriorMI"                        "N_Priorheartfailure"             
##  [87] "STSBMI"                           "STSBSA"                          
##  [89] "N_Resuscitation"                  "STSProcedureName"                
##  [91] "N_STSPROC_AVR"                    "N_STSPROC_AVR_CAB"               
##  [93] "N_STSPROC_CAB"                    "N_STSPROC_MV_Repair"             
##  [95] "N_STSPROC_MV_Repair_CAB"          "N_STSPROC_MV_Replace"            
##  [97] "N_STSPROC_MV_Replace_CAB"         "N_STSPROC_Other"                 
##  [99] "Surgeon"                          "Status"                          
## [101] "Incidence"                        "StatusEmergentReason"            
## [103] "StatusUrgentReason"               "SkinCutStartTime"                
## [105] "SkinCutStopTime"                  "Aortic_Implant"                  
## [107] "Aortic_Implant_Size"              "Aortic_Implant_Type"             
## [109] "AorticImplantSTSCode"             "Mitral_Implant"                  
## [111] "Mitral_Implant_Size"              "Mitral_Implant_Type"             
## [113] "MitralImplantSTSCode"             "Tricuspid_Implant_Size"          
## [115] "Tricuspid_Implant_Type"           "Tricuspid_Procedure"             
## [117] "TricuspidImplantSTSCode"          "Pulmonic_Implant"                
## [119] "Pulmonic_Implant_Size"            "Pulmonic_Implant_Type"           
## [121] "Pulmonic_Procedure"               "PulmonicImplantSTSCode"          
## [123] "AorticOcclusion"                  "N_CABGPerformed"                 
## [125] "Circulatory_Arrest_Time"          "CirculatoryArrest"               
## [127] "Cross_Clamp_Time"                 "DistAnastArt"                    
## [129] "DistAnastVein"                    "ICUVisit"                        
## [131] "IMAGraftsUsed"                    "IMAHarvestTechnique"             
## [133] "InitHrsICU"                       "InitVentStartDate"               
## [135] "InitVentStartTime"                "InitVentStopDate"                
## [137] "InitVentStopTime"                 "IntraopBloodProducts"            
## [139] "LowestHematocrit"                 "LowestTemp"                      
## [141] "N_Bloodloss_Unit"                 "N_Intraoperative_Bleeding"       
## [143] "N_No_of_Comorbidities"            "N_Significant_Bleeding"          
## [145] "IABP"                             "Perfusion_Time"                  
## [147] "UrgentEmergentReason"             "Valve"                           
## [149] "ValveDisAortic"                   "ValveDisMitral"                  
## [151] "ValveDisPulmonic"                 "ValveDisTricuspid"               
## [153] "N_AorticValve"                    "VSAorticImplant"                 
## [155] "VSAorticProcedure"                "N_MitralValve"                   
## [157] "VSMitralImplant"                  "VSMitralProcedure"               
## [159] "VSPulmonicImplant"                "VSPulmonicProcedure"             
## [161] "VSTricuspidImplant"               "VSTricuspidProcedure"            
## [163] "TotHrsICU"                        "AddICUHours"                     
## [165] "ICUHours"                         "N_Count_of_Bilirubins"           
## [167] "N_Count_of_BUNs"                  "N_First_BUN_After_Surgery"       
## [169] "N_Count_of_Creatinines"           "N_First_Creatinine_After_Surgery"
## [171] "N_Count_of_GFRs"                  "N_First_GFR_After_Surgery"       
## [173] "N_Count_of_Hematocrits"           "N_First_Hematocrit_After_Surgery"
## [175] "N_Count_of_Hypertriglyceridemias" "N_Encephalopathy"                
## [177] "N_Stroke_Indicator"               "Complications_Any"               
## [179] "DCLocation"                       "Death"                           
## [181] "Operative_Death"                  "Mort30DayStatus"                 
## [183] "MortDCStatus"                     "MortLocation"                    
## [185] "MortPrimaryCause"                 "N_30Day_Revisit"                 
## [187] "N_60Day_Revisit"                  "N_7Day_Revisit"                  
## [189] "N_90Day_Revisit"                  "Oth_AFib"                        
## [191] "Oth_Anticoagulant"                "Oth_Cardiac_Arrest"              
## [193] "Oth_GI"                           "Oth_MultiSystem_Failure"         
## [195] "Oth_OtherComplication"            "Oth_Tamponade"                   
## [197] "PostOpCannulationSite"            "PostOpConduitHarvest"            
## [199] "PostOpCreatinineLevel"            "PostOpDeepSternalMediastin"      
## [201] "PostOpDeepSternalMediastinDate"   "PostOpDeepVenousThrombosis"      
## [203] "PostOpDialysisDuration"           "PostopEcho"                      
## [205] "PostOpLaryngealNerveInjury"       "PostOpNeuroComaEnceph"           
## [207] "PostOpNeuroParalysis"             "PostOpNeuroParalysisType"        
## [209] "PostOpNeuroStrokePermanent"       "PostOpNeuroStrokeTransientTIA"   
## [211] "PostOpPhrenicNerveInjury"         "PostOpPlanDelaySternalClose"     
## [213] "PostOpPleuralEffusionReqDrain"    "PostOpPneumothorax"              
## [215] "PostOpPulmonaryThromboembolism"   "PostOpRhythmReqPermDevice"       
## [217] "PostOpSepsis"                     "PostOpSepsisPositiveCultures"    
## [219] "PostOpSternalInstability"         "PostOpSternalSuperficialInfect"  
## [221] "PostOpSternotomyIssue"            "PostOpSurgicalSiteInfection"     
## [223] "PostOpVentHoursTotal"             "PostOpVTE"                       
## [225] "PostOpWound2ndMuscleFlap"         "PostOpWound2ndOmentalFlap"       
## [227] "PostOpWoundOpenPacking"           "PostOpWoundProcedure"            
## [229] "PostOpWoundVac"                   "Pulm_Pneumonia"                  
## [231] "Pulm_Ventilator_Prolonged"        "ReadmitICU"                      
## [233] "ReIntubated"                      "Renal_Dialysis_Required"         
## [235] "Renal_Failure"                    "Reop_Bleeding"                   
## [237] "Reop_Other_Cardiac"               "Reop_Other_NonCardiac"           
## [239] "ReopGraftOcclusion"               "ReopValveDysfunction"

## Variables with more than 25% missing values
names(df[, colSums(is.na(df)) > nrow(df) * 0.25])

##  [1] "Diabetes"                                  
##  [2] "RFDiabetes"                                
##  [3] "Dialysis"                                  
##  [4] "RFDialysis"                                
##  [5] "Hypertension"                              
##  [6] "RFHypertension"                            
##  [7] "TobaccoUse"                                
##  [8] "Rfothertobacco"                            
##  [9] "Family_History_CAD"                        
## [10] "RFFamilyHistoryCAD"                        
## [11] "InfectEndocarditis"                        
## [12] "Peripheral_Vasc_Disease"                   
## [13] "RFPeripheralVascularDisease"               
## [14] "cerebrovascular_disease"                   
## [15] "RFCerebrovascularDisease"                  
## [16] "RFBilirubinTotal"                          
## [17] "RFDyslipidemia"                            
## [18] "RFImmunosuppresiveRx"                      
## [19] "Immunosuppresive_Rx"                       
## [20] "RF5mWalkTest"                              
## [21] "RF5mWalkTest_old"                          
## [22] "RF5mWalkTestTime1"                         
## [23] "RF5mWalkTestTime2"                         
## [24] "RF5mWalkTestTime3"                         
## [25] "CardiacSymptomsAtSurgery"                  
## [26] "AortaDisease"                              
## [27] "AortaDiseaseAneurysm"                      
## [28] "AortaDiseaseArch"                          
## [29] "AortaDiseaseAscending"                     
## [30] "AortaDiseaseCoarctation"                   
## [31] "AortaDiseaseDescThoracic"                  
## [32] "AortaDiseaseDissection"                    
## [33] "AortaDiseaseDissectionTiming"              
## [34] "AortaDiseaseDissectionType"                
## [35] "AortaDiseaseIntraHematoma"                 
## [36] "AortaDiseasePresentation"                  
## [37] "AortaDiseasePseudoAneurysm"                
## [38] "AortaDiseaseRoot"                          
## [39] "AortaDiseaseRupture"                       
## [40] "AortaDiseaseThoracoAbd"                    
## [41] "AortaDiseaseUlcer"                         
## [42] "Arrhythmia"                                
## [43] "Aortic_Insufficiency"                      
## [44] "Pulmonic_Insufficiency"                    
## [45] "Cardiogenic_Shock"                         
## [46] "CardiogenicShock"                          
## [47] "CHF"                                       
## [48] "HeartFailure2wks"                          
## [49] "CoronaryDominance"                         
## [50] "CoronaryInfoKnown"                         
## [51] "CoronaryStenosisSource"                    
## [52] "LMNativeStenosis"                          
## [53] "NativePercentStenosisKnown"                
## [54] "NYHA"                                      
## [55] "previous_CV_intervention"                  
## [56] "PrevOthCardPCI"                            
## [57] "PreviousCardiacIntervention"               
## [58] "PriorMI"                                   
## [59] "PreviousMI"                                
## [60] "HeartFailurePrior"                         
## [61] "Priorheartfailure"                         
## [62] "Resuscitation"                             
## [63] "Resuscitation_history"                     
## [64] "AnesthesiaEndDate"                         
## [65] "AnesthesiaEndTime"                         
## [66] "Tricuspid_Insufficiency"                   
## [67] "CABG"                                      
## [68] "CABGPerformed"                             
## [69] "IABP_When"                                 
## [70] "AFibProcedure"                             
## [71] "OCAFibProc"                                
## [72] "N_AFibProcedure"                           
## [73] "VSAortic"                                  
## [74] "VSAorticValve"                             
## [75] "VSMitral"                                  
## [76] "VSMitralValve"                             
## [77] "N_First_Bilirubin_After_Surgery"           
## [78] "N_First_Hypertriglyceridemia_After_Surgery"
## [79] "DateOfDeath"                               
## [80] "PostOpReOpBleedTiming"

Examine target data

Look at the distribution of ICU hours.

summary(df$ICUHours)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   30.00   55.71   52.00 1483.00

hist(df$ICUHours, freq = TRUE, breaks = 7)

Format a target variable

There are 5 variables in the data set that have to do with ICU stay.

ICUVisit - Did an ICU visit occur? 1 = yes; 0 = no. Most patients went to an ICU.
InitHrsICU - The number of hours the patient spent in the ICU immediately after surgery.
TotHrsICU - ?
AddICUHours - ICU hours for ICU visit(s) after the initial ICU visit.
ICUHours - Combined total of initial ICU visit hours and additional ICU visits hours.
ReadmitICU - Was there an additional ICU visit after the initial ICU visit? 1 = yes; 0 = no.

I used ICUHours to create a categorical target variable ICUDuration. A = The patient spent 48 hours or less in ICUs; B = The patient spent 49 to 216 hours in ICUs; C = Patient spent more than 216 hours in ICUs.

df$ICUDuration <- as.factor(ifelse(df$ICUHours <= 48, "A",
                      ifelse(48 < df$ICUHours & df$ICUHours <= 216, "B", "C")
                      ))

## Note, ICUDuration was created as a factor variable so we can later use rfImpute (to impute missing values.)

Sample of data showing the translation of the target variable.

df[1:10, c("ICUHours", "ICUDuration")]

##    ICUHours ICUDuration
## 1        25           A
## 2        29           A
## 3       120           B
## 4         0           A
## 5        16           A
## 6        26           A
## 7        48           A
## 8        17           A
## 9        22           A
## 10      110           B

Select predictor variables

To build a simple baseline model I wanted variables that had less than 25% missing values. I also looked for variables whose names started with “N” because these variables were ones that had been unified across all versions of STS. No “expert” knowledge went into the selection.

These are the variables I picked. N_PriorMI, N_No_of_Comorbidities, RFLastHematocrit, N_Priorheartfailure, N_Arrhythmia, N_RF5mWalkTest, N_Diagnosis_Type, N_CardiogenicShock, N_First_BUN_After_Surgery, N_First_Creatinine_After_Surgery, N_First_Hematocrit_After_Surgery

df <- df[, c("N_PriorMI", "N_No_of_Comorbidities", "RFLastHematocrit", "N_Priorheartfailure", "N_Arrhythmia", "N_RF5mWalkTest", "N_Diagnosis_Type", "N_CardiogenicShock", "N_First_BUN_After_Surgery", "N_First_Creatinine_After_Surgery", "N_First_Hematocrit_After_Surgery", "ICUDuration")]

Here is a look at the data of the chosen variables.

str(df)

## 'data.frame':    2286 obs. of  12 variables:
##  $ N_PriorMI                       : Factor w/ 4 levels "","No","Unknown",..: 2 2 2 2 2 2 2 4 2 2 ...
##  $ N_No_of_Comorbidities           : int  2 2 7 1 5 5 3 6 4 2 ...
##  $ RFLastHematocrit                : num  40.3 35.9 39.9 40.2 43.3 36.3 41.2 29.4 50.6 26.7 ...
##  $ N_Priorheartfailure             : Factor w/ 3 levels "No","Unknown",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ N_Arrhythmia                    : Factor w/ 7 levels "","No","None",..: 3 3 5 3 3 3 3 3 3 3 ...
##  $ N_RF5mWalkTest                  : Factor w/ 4 levels "","No","Non-ambulatory patient",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ N_Diagnosis_Type                : Factor w/ 4 levels "Not on Care Pathway",..: 1 1 1 1 1 4 1 1 1 1 ...
##  $ N_CardiogenicShock              : Factor w/ 4 levels "No","Yes","Yes - At the time of the procedure",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ N_First_BUN_After_Surgery       : int  17 13 12 20 11 19 12 25 22 23 ...
##  $ N_First_Creatinine_After_Surgery: num  0.74 0.6 0.9 1.3 0.8 1.25 1.2 2.4 1.5 1.1 ...
##  $ N_First_Hematocrit_After_Surgery: num  105 62 98 189 165 121 75 42 228 25 ...
##  $ ICUDuration                     : Factor w/ 3 levels "A","B","C": 1 1 2 1 1 1 1 1 1 2 ...

Create training and testing data sets

I split the data set into a ratio of 60% training examples, 40% testing examples.

inTrain <- createDataPartition(y = df$ICUDuration, 
                               p = 0.60, 
                               list = FALSE)
dTrain <- df[inTrain,]
dTest <- df[-inTrain,]

## Number of rows and columns in the training and testing sets
dim(dTrain)

## [1] 1373   12

dim(dTest)

## [1] 913  12

Impute missing values in training data

dTrainImpute <- rfImpute(ICUDuration ~ ., dTrain)

## ntree      OOB      1      2      3
##   300:  30.23%  5.69% 91.37% 98.11%
## ntree      OOB      1      2      3
##   300:  30.08%  6.30% 88.99% 98.11%
## ntree      OOB      1      2      3
##   300:  30.37%  6.20% 90.48% 98.11%
## ntree      OOB      1      2      3
##   300:  29.86%  5.59% 90.18% 98.11%
## ntree      OOB      1      2      3
##   300:  30.37%  5.79% 91.67% 98.11%

Train model using imputed data

## Train the random forest model using training data
ICU.rf <- randomForest(ICUDuration ~ ., dTrainImpute)

## Display the confusion matrix for the model
ICU.rf$confusion

##     A  B C class.error
## A 924 57 3  0.06097561
## B 298 36 2  0.89285714
## C  38 14 1  0.98113208

Impute missing values in testing data

dTestImpute <- rfImpute(ICUDuration ~ ., dTest)

## ntree      OOB      1      2      3
##   300:  28.26%  5.34% 84.75% 97.14%
## ntree      OOB      1      2      3
##   300:  28.15%  4.58% 86.55% 97.14%
## ntree      OOB      1      2      3
##   300:  29.03%  5.50% 87.44% 97.14%
## ntree      OOB      1      2      3
##   300:  27.93%  4.73% 85.65% 94.29%
## ntree      OOB      1      2      3
##   300:  28.15%  5.80% 82.96% 97.14%

Make predictions using random forest model with test data

Exlude the first column (ICUDuration) from the test set since that is the value we are predicting.

ICU.testpred<-predict(ICU.rf, newdata = dTestImpute[, -(1)])

Confusion matrix for predictions made using testing data

confusionMatrix(ICU.testpred, dTestImpute$ICUDuration)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C
##          A 629 197  19
##          B  25  24  16
##          C   1   2   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7152          
##                  95% CI : (0.6847, 0.7443)
##     No Information Rate : 0.7174          
##     P-Value [Acc > NIR] : 0.5749          
##                                           
##                   Kappa : 0.1059          
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C
## Sensitivity            0.9603  0.10762 0.000000
## Specificity            0.1628  0.94058 0.996583
## Pos Pred Value         0.7444  0.36923 0.000000
## Neg Pred Value         0.6176  0.76533 0.961538
## Prevalence             0.7174  0.24425 0.038335
## Detection Rate         0.6889  0.02629 0.000000
## Detection Prevalence   0.9255  0.07119 0.003286
## Balanced Accuracy      0.5615  0.52410 0.498292