This is an exploratory effort to build a model that predicts how many hours a cardiac surgery patient will spend in an ICU.
The solution uses the Random Forests algorithm in the R “caret” package.
Random Forests - Advantages/Disadvantages
Advantages:
– Can be very accurate
– Automatically includes interactions between variables
– Fairly easy to implement; fewer parameters to deal with
– Works even when number of examples is less than number of variables
Disadvantages:
– Meant for classification problems; not for predicting countinuous outcomes
– Can be computationally intense (slow to process)
– Has the disadvantage of all machine language solutions in that it is hard to explain
– Doesn’t handle missing values; must impute
– Can be prone to over-fitting. Important to use cross-validation
(Note, the following solution does not perform any normalization of numeric data in the data set, ie. numeric values were not centered or scaled.)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(e1071)
library(doMC) ## to enable multi core parallel processing
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores = 4) ## MacBook Pro Air has four cores. Parallel processing will happen automatically]
## Set random number seed for reproducibility purposes.
set.seed(1234)
This code loads the cardiac surgery data set from an Excel file into an R data frame.
## Set working directory for R code
## workingdir = "\\\\ccor_ds\\common\\dbase\\CardiacSurgery\\preparation\\analysis\\R\\Code"
workingdir = "/Users/mitchellfawcett/Documents/CardiacSurgery/Code"
setwd(workingdir)
## Read data file
## datapath <- "\\\\ccor_ds\\common\\dbase\\CardiacSurgery\\preparation\\analysis\\R\\Data"
datapath <- "/Users/mitchellfawcett/Documents/CardiacSurgery/Data"
filename <- "Copy of Surgical_Data_V12_Data_Dictionary_TC.csv"
df <- read.csv(paste(datapath, filename, sep = "/"), header = TRUE, na.strings = "NA", stringsAsFactors = TRUE)
## Display summary information about the raw data set
## Number of rows and columns in data set
dim(df)
## [1] 2286 320
Change blanks to “NA” (NA indicates a missing value in R). Identify the columns that have 25% or less missing values and columns with more than 25% missing values.
## Change string "ERROR" in N_First_Hematocrit_After_Surgery column to ""
df$N_First_Hematocrit_After_Surgery[df$N_First_Hematocrit_After_Surgery == "ERROR"] <- ""
## Warning in `[<-.factor`(`*tmp*`, df$N_First_Hematocrit_After_Surgery == :
## invalid factor level, NA generated
## Change the class of the column from character to numeric
df[, c("N_First_Hematocrit_After_Surgery")] <- sapply(df[, c("N_First_Hematocrit_After_Surgery")], as.numeric)
## Change blanks to NA
df[ df == "" ] = NA
## Variables with 25% or less missing values
names(df[, colSums(is.na(df)) <= nrow(df) * 0.25])
## [1] "SS_Event_STS_ID" "SS_Patient_ID"
## [3] "PatientPersonID" "Account_Number"
## [5] "Payor" "VisitID"
## [7] "VisitStartDateTime" "VisitEndDateTime"
## [9] "Admission_Date" "Discharge_Date"
## [11] "Surgery_Date" "Surgery_End_Time"
## [13] "Surgery_Start_Time" "Gender"
## [15] "Height" "Weight"
## [17] "AgeInYears" "Race"
## [19] "RaceCaucasian" "RaceAfrican_American"
## [21] "RaceAsian" "RaceNative_American"
## [23] "RacePacificIslander" "RaceOther"
## [25] "RaceHispanic" "Hospital"
## [27] "ChronicLungDisease" "N_RFDiabetes"
## [29] "N_RFDialysis" "N_RFHypertension"
## [31] "N_RFTobaccoUse" "N_RFFamilyHistoryCAD"
## [33] "Infectious_Endocarditis" "N_RFPeripheralVascularDisease"
## [35] "N_RFCerebrovascularDisease" "N_Unstable_Angina"
## [37] "RFAlcoholUse" "RFCVD_CVA"
## [39] "RFCVD_TIA" "RFLastHematocrit"
## [41] "LastCreatinineLevel" "N_RFImmunosuppresiveRx"
## [43] "N_RF5mWalkTest" "CardiacPresentationAdmission"
## [45] "AnginaClass2wks" "ArrhythmiaWhen"
## [47] "N_Arrhythmia" "Arrhythmia2HBWhen"
## [49] "Arrhythmia3HBWhen" "ArrhythmiaAFib"
## [51] "ArrhythmiaAFibDuration" "ArrhythmiaAFlutterWhen"
## [53] "ArrhythmiaPermPaced" "ArrhythmiaSSSWhen"
## [55] "ArrhythmiaVVWhen" "Mitral_Insufficiency"
## [57] "N_CardiogenicShock" "CathAorticStenosis"
## [59] "CathMitralStenosis" "CathPulmonicStenosis"
## [61] "CathTricuspidStenosis" "N_CHF"
## [63] "NumDiseasedCoroVessels" "CVA"
## [65] "CVA_When" "Diabetes_Control"
## [67] "EF" "EF_Done"
## [69] "N_Diagnosis_Type" "PreOpMedACEIARBI48hrs"
## [71] "PreOpMedADPInhibitors5Days" "PreOpMedAnticoagulants"
## [73] "PreOpMedAntiplatelets5Days" "PreOpMedAspirin"
## [75] "PreOpMedBetaBlockers" "PreOpMedCoumadin"
## [77] "PreOpMedGPIIbIIIaInhibitor" "PreOpMedInotropes"
## [79] "PreOpMedLipidLowering" "PreOpMedNitratesIV"
## [81] "PreOpMedSteroids" "Previous_CABG"
## [83] "N_PreviousCardiacIntervention" "Previous_Valve"
## [85] "N_PriorMI" "N_Priorheartfailure"
## [87] "STSBMI" "STSBSA"
## [89] "N_Resuscitation" "STSProcedureName"
## [91] "N_STSPROC_AVR" "N_STSPROC_AVR_CAB"
## [93] "N_STSPROC_CAB" "N_STSPROC_MV_Repair"
## [95] "N_STSPROC_MV_Repair_CAB" "N_STSPROC_MV_Replace"
## [97] "N_STSPROC_MV_Replace_CAB" "N_STSPROC_Other"
## [99] "Surgeon" "Status"
## [101] "Incidence" "StatusEmergentReason"
## [103] "StatusUrgentReason" "SkinCutStartTime"
## [105] "SkinCutStopTime" "Aortic_Implant"
## [107] "Aortic_Implant_Size" "Aortic_Implant_Type"
## [109] "AorticImplantSTSCode" "Mitral_Implant"
## [111] "Mitral_Implant_Size" "Mitral_Implant_Type"
## [113] "MitralImplantSTSCode" "Tricuspid_Implant_Size"
## [115] "Tricuspid_Implant_Type" "Tricuspid_Procedure"
## [117] "TricuspidImplantSTSCode" "Pulmonic_Implant"
## [119] "Pulmonic_Implant_Size" "Pulmonic_Implant_Type"
## [121] "Pulmonic_Procedure" "PulmonicImplantSTSCode"
## [123] "AorticOcclusion" "N_CABGPerformed"
## [125] "Circulatory_Arrest_Time" "CirculatoryArrest"
## [127] "Cross_Clamp_Time" "DistAnastArt"
## [129] "DistAnastVein" "ICUVisit"
## [131] "IMAGraftsUsed" "IMAHarvestTechnique"
## [133] "InitHrsICU" "InitVentStartDate"
## [135] "InitVentStartTime" "InitVentStopDate"
## [137] "InitVentStopTime" "IntraopBloodProducts"
## [139] "LowestHematocrit" "LowestTemp"
## [141] "N_Bloodloss_Unit" "N_Intraoperative_Bleeding"
## [143] "N_No_of_Comorbidities" "N_Significant_Bleeding"
## [145] "IABP" "Perfusion_Time"
## [147] "UrgentEmergentReason" "Valve"
## [149] "ValveDisAortic" "ValveDisMitral"
## [151] "ValveDisPulmonic" "ValveDisTricuspid"
## [153] "N_AorticValve" "VSAorticImplant"
## [155] "VSAorticProcedure" "N_MitralValve"
## [157] "VSMitralImplant" "VSMitralProcedure"
## [159] "VSPulmonicImplant" "VSPulmonicProcedure"
## [161] "VSTricuspidImplant" "VSTricuspidProcedure"
## [163] "TotHrsICU" "AddICUHours"
## [165] "ICUHours" "N_Count_of_Bilirubins"
## [167] "N_Count_of_BUNs" "N_First_BUN_After_Surgery"
## [169] "N_Count_of_Creatinines" "N_First_Creatinine_After_Surgery"
## [171] "N_Count_of_GFRs" "N_First_GFR_After_Surgery"
## [173] "N_Count_of_Hematocrits" "N_First_Hematocrit_After_Surgery"
## [175] "N_Count_of_Hypertriglyceridemias" "N_Encephalopathy"
## [177] "N_Stroke_Indicator" "Complications_Any"
## [179] "DCLocation" "Death"
## [181] "Operative_Death" "Mort30DayStatus"
## [183] "MortDCStatus" "MortLocation"
## [185] "MortPrimaryCause" "N_30Day_Revisit"
## [187] "N_60Day_Revisit" "N_7Day_Revisit"
## [189] "N_90Day_Revisit" "Oth_AFib"
## [191] "Oth_Anticoagulant" "Oth_Cardiac_Arrest"
## [193] "Oth_GI" "Oth_MultiSystem_Failure"
## [195] "Oth_OtherComplication" "Oth_Tamponade"
## [197] "PostOpCannulationSite" "PostOpConduitHarvest"
## [199] "PostOpCreatinineLevel" "PostOpDeepSternalMediastin"
## [201] "PostOpDeepSternalMediastinDate" "PostOpDeepVenousThrombosis"
## [203] "PostOpDialysisDuration" "PostopEcho"
## [205] "PostOpLaryngealNerveInjury" "PostOpNeuroComaEnceph"
## [207] "PostOpNeuroParalysis" "PostOpNeuroParalysisType"
## [209] "PostOpNeuroStrokePermanent" "PostOpNeuroStrokeTransientTIA"
## [211] "PostOpPhrenicNerveInjury" "PostOpPlanDelaySternalClose"
## [213] "PostOpPleuralEffusionReqDrain" "PostOpPneumothorax"
## [215] "PostOpPulmonaryThromboembolism" "PostOpRhythmReqPermDevice"
## [217] "PostOpSepsis" "PostOpSepsisPositiveCultures"
## [219] "PostOpSternalInstability" "PostOpSternalSuperficialInfect"
## [221] "PostOpSternotomyIssue" "PostOpSurgicalSiteInfection"
## [223] "PostOpVentHoursTotal" "PostOpVTE"
## [225] "PostOpWound2ndMuscleFlap" "PostOpWound2ndOmentalFlap"
## [227] "PostOpWoundOpenPacking" "PostOpWoundProcedure"
## [229] "PostOpWoundVac" "Pulm_Pneumonia"
## [231] "Pulm_Ventilator_Prolonged" "ReadmitICU"
## [233] "ReIntubated" "Renal_Dialysis_Required"
## [235] "Renal_Failure" "Reop_Bleeding"
## [237] "Reop_Other_Cardiac" "Reop_Other_NonCardiac"
## [239] "ReopGraftOcclusion" "ReopValveDysfunction"
## Variables with more than 25% missing values
names(df[, colSums(is.na(df)) > nrow(df) * 0.25])
## [1] "Diabetes"
## [2] "RFDiabetes"
## [3] "Dialysis"
## [4] "RFDialysis"
## [5] "Hypertension"
## [6] "RFHypertension"
## [7] "TobaccoUse"
## [8] "Rfothertobacco"
## [9] "Family_History_CAD"
## [10] "RFFamilyHistoryCAD"
## [11] "InfectEndocarditis"
## [12] "Peripheral_Vasc_Disease"
## [13] "RFPeripheralVascularDisease"
## [14] "cerebrovascular_disease"
## [15] "RFCerebrovascularDisease"
## [16] "RFBilirubinTotal"
## [17] "RFDyslipidemia"
## [18] "RFImmunosuppresiveRx"
## [19] "Immunosuppresive_Rx"
## [20] "RF5mWalkTest"
## [21] "RF5mWalkTest_old"
## [22] "RF5mWalkTestTime1"
## [23] "RF5mWalkTestTime2"
## [24] "RF5mWalkTestTime3"
## [25] "CardiacSymptomsAtSurgery"
## [26] "AortaDisease"
## [27] "AortaDiseaseAneurysm"
## [28] "AortaDiseaseArch"
## [29] "AortaDiseaseAscending"
## [30] "AortaDiseaseCoarctation"
## [31] "AortaDiseaseDescThoracic"
## [32] "AortaDiseaseDissection"
## [33] "AortaDiseaseDissectionTiming"
## [34] "AortaDiseaseDissectionType"
## [35] "AortaDiseaseIntraHematoma"
## [36] "AortaDiseasePresentation"
## [37] "AortaDiseasePseudoAneurysm"
## [38] "AortaDiseaseRoot"
## [39] "AortaDiseaseRupture"
## [40] "AortaDiseaseThoracoAbd"
## [41] "AortaDiseaseUlcer"
## [42] "Arrhythmia"
## [43] "Aortic_Insufficiency"
## [44] "Pulmonic_Insufficiency"
## [45] "Cardiogenic_Shock"
## [46] "CardiogenicShock"
## [47] "CHF"
## [48] "HeartFailure2wks"
## [49] "CoronaryDominance"
## [50] "CoronaryInfoKnown"
## [51] "CoronaryStenosisSource"
## [52] "LMNativeStenosis"
## [53] "NativePercentStenosisKnown"
## [54] "NYHA"
## [55] "previous_CV_intervention"
## [56] "PrevOthCardPCI"
## [57] "PreviousCardiacIntervention"
## [58] "PriorMI"
## [59] "PreviousMI"
## [60] "HeartFailurePrior"
## [61] "Priorheartfailure"
## [62] "Resuscitation"
## [63] "Resuscitation_history"
## [64] "AnesthesiaEndDate"
## [65] "AnesthesiaEndTime"
## [66] "Tricuspid_Insufficiency"
## [67] "CABG"
## [68] "CABGPerformed"
## [69] "IABP_When"
## [70] "AFibProcedure"
## [71] "OCAFibProc"
## [72] "N_AFibProcedure"
## [73] "VSAortic"
## [74] "VSAorticValve"
## [75] "VSMitral"
## [76] "VSMitralValve"
## [77] "N_First_Bilirubin_After_Surgery"
## [78] "N_First_Hypertriglyceridemia_After_Surgery"
## [79] "DateOfDeath"
## [80] "PostOpReOpBleedTiming"
Look at the distribution of ICU hours.
summary(df$ICUHours)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 24.00 30.00 55.71 52.00 1483.00
hist(df$ICUHours, freq = TRUE, breaks = 7)
There are 5 variables in the data set that have to do with ICU stay.
I used ICUHours to create a categorical target variable ICUDuration. A = The patient spent 48 hours or less in ICUs; B = The patient spent 49 to 216 hours in ICUs; C = Patient spent more than 216 hours in ICUs.
df$ICUDuration <- as.factor(ifelse(df$ICUHours <= 48, "A",
ifelse(48 < df$ICUHours & df$ICUHours <= 216, "B", "C")
))
## Note, ICUDuration was created as a factor variable so we can later use rfImpute (to impute missing values.)
Sample of data showing the translation of the target variable.
df[1:10, c("ICUHours", "ICUDuration")]
## ICUHours ICUDuration
## 1 25 A
## 2 29 A
## 3 120 B
## 4 0 A
## 5 16 A
## 6 26 A
## 7 48 A
## 8 17 A
## 9 22 A
## 10 110 B
To build a simple baseline model I wanted variables that had less than 25% missing values. I also looked for variables whose names started with “N” because these variables were ones that had been unified across all versions of STS. No “expert” knowledge went into the selection.
These are the variables I picked. N_PriorMI, N_No_of_Comorbidities, RFLastHematocrit, N_Priorheartfailure, N_Arrhythmia, N_RF5mWalkTest, N_Diagnosis_Type, N_CardiogenicShock, N_First_BUN_After_Surgery, N_First_Creatinine_After_Surgery, N_First_Hematocrit_After_Surgery
df <- df[, c("N_PriorMI", "N_No_of_Comorbidities", "RFLastHematocrit", "N_Priorheartfailure", "N_Arrhythmia", "N_RF5mWalkTest", "N_Diagnosis_Type", "N_CardiogenicShock", "N_First_BUN_After_Surgery", "N_First_Creatinine_After_Surgery", "N_First_Hematocrit_After_Surgery", "ICUDuration")]
Here is a look at the data of the chosen variables.
str(df)
## 'data.frame': 2286 obs. of 12 variables:
## $ N_PriorMI : Factor w/ 4 levels "","No","Unknown",..: 2 2 2 2 2 2 2 4 2 2 ...
## $ N_No_of_Comorbidities : int 2 2 7 1 5 5 3 6 4 2 ...
## $ RFLastHematocrit : num 40.3 35.9 39.9 40.2 43.3 36.3 41.2 29.4 50.6 26.7 ...
## $ N_Priorheartfailure : Factor w/ 3 levels "No","Unknown",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ N_Arrhythmia : Factor w/ 7 levels "","No","None",..: 3 3 5 3 3 3 3 3 3 3 ...
## $ N_RF5mWalkTest : Factor w/ 4 levels "","No","Non-ambulatory patient",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ N_Diagnosis_Type : Factor w/ 4 levels "Not on Care Pathway",..: 1 1 1 1 1 4 1 1 1 1 ...
## $ N_CardiogenicShock : Factor w/ 4 levels "No","Yes","Yes - At the time of the procedure",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ N_First_BUN_After_Surgery : int 17 13 12 20 11 19 12 25 22 23 ...
## $ N_First_Creatinine_After_Surgery: num 0.74 0.6 0.9 1.3 0.8 1.25 1.2 2.4 1.5 1.1 ...
## $ N_First_Hematocrit_After_Surgery: num 105 62 98 189 165 121 75 42 228 25 ...
## $ ICUDuration : Factor w/ 3 levels "A","B","C": 1 1 2 1 1 1 1 1 1 2 ...
I split the data set into a ratio of 60% training examples, 40% testing examples.
inTrain <- createDataPartition(y = df$ICUDuration,
p = 0.60,
list = FALSE)
dTrain <- df[inTrain,]
dTest <- df[-inTrain,]
## Number of rows and columns in the training and testing sets
dim(dTrain)
## [1] 1373 12
dim(dTest)
## [1] 913 12
dTrainImpute <- rfImpute(ICUDuration ~ ., dTrain)
## ntree OOB 1 2 3
## 300: 30.23% 5.69% 91.37% 98.11%
## ntree OOB 1 2 3
## 300: 30.08% 6.30% 88.99% 98.11%
## ntree OOB 1 2 3
## 300: 30.37% 6.20% 90.48% 98.11%
## ntree OOB 1 2 3
## 300: 29.86% 5.59% 90.18% 98.11%
## ntree OOB 1 2 3
## 300: 30.37% 5.79% 91.67% 98.11%
## Train the random forest model using training data
ICU.rf <- randomForest(ICUDuration ~ ., dTrainImpute)
## Display the confusion matrix for the model
ICU.rf$confusion
## A B C class.error
## A 924 57 3 0.06097561
## B 298 36 2 0.89285714
## C 38 14 1 0.98113208
dTestImpute <- rfImpute(ICUDuration ~ ., dTest)
## ntree OOB 1 2 3
## 300: 28.26% 5.34% 84.75% 97.14%
## ntree OOB 1 2 3
## 300: 28.15% 4.58% 86.55% 97.14%
## ntree OOB 1 2 3
## 300: 29.03% 5.50% 87.44% 97.14%
## ntree OOB 1 2 3
## 300: 27.93% 4.73% 85.65% 94.29%
## ntree OOB 1 2 3
## 300: 28.15% 5.80% 82.96% 97.14%
Exlude the first column (ICUDuration) from the test set since that is the value we are predicting.
ICU.testpred<-predict(ICU.rf, newdata = dTestImpute[, -(1)])
confusionMatrix(ICU.testpred, dTestImpute$ICUDuration)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C
## A 629 197 19
## B 25 24 16
## C 1 2 0
##
## Overall Statistics
##
## Accuracy : 0.7152
## 95% CI : (0.6847, 0.7443)
## No Information Rate : 0.7174
## P-Value [Acc > NIR] : 0.5749
##
## Kappa : 0.1059
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C
## Sensitivity 0.9603 0.10762 0.000000
## Specificity 0.1628 0.94058 0.996583
## Pos Pred Value 0.7444 0.36923 0.000000
## Neg Pred Value 0.6176 0.76533 0.961538
## Prevalence 0.7174 0.24425 0.038335
## Detection Rate 0.6889 0.02629 0.000000
## Detection Prevalence 0.9255 0.07119 0.003286
## Balanced Accuracy 0.5615 0.52410 0.498292