Introduction

The dataset I’m using is a Kaggle dataset of surgical operative data. It contains patient demographics, procedure information, and temporal information about the time of day and year of each operation, along with two outcomes: 30-day mortality and complication events.

Link to dataset

This dataset originally came from the article “Operation Timing and 30-Day Mortality After Elective General Surgery”. The paper looked at the association between time of day/year of the surgery and complication/mortality events for elective surgeries.

Link to article where Procedure Categories are defined

The goal of this exercise is to accurately predict patients who had complications within 30 days of elective surgery. We also want to identify the features that are most important to that prediction. Ultimately, we want a model that can accurately identify high-risk patients so that they can receive additional interventions to reduce postoperative complications.

Read data

surgDf <- read.csv("G:/Documents/DATA622_HW4/Surgical-deepnet.csv")

For the paper, the authors calculated some risk scores that I’m removing from the dataset; I’m only interested in the underlying features and targets. This dataset is a little unusual in that it has been anonymized in a fashion that leaves no primary keys (patient ID or surgery date) in the data. Through inspection I found that there are 2900 duplicate rows, which I’m also filtering out at this step. The first thing that tipped me off that there were duplicates in the dataset was an inspection of ‘Age’.

hist(surgDf$Age)

surgDf2 <- surgDf %>%
 select(-c('ahrq_ccs', 'ccsComplicationRate', 'ccsMort30Rate',
           'complication_rsi', 'mortality_rsi')) %>%
 distinct()
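
As a sanity check on the duplicate claim, we can count exact duplicate rows in the raw data before deduplicating (a minimal sketch in base R; per the inspection above this should agree with the 2900 rows removed):

# Rows that exactly duplicate an earlier row; these are what distinct() drops
sum(duplicated(surgDf))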

I’m using pointblank’s scan_data function to visualize and inspect the data.

We have no missing values after deduplication. We can see that ‘complication’ is positively correlated with ‘Age’, ‘asa_status’, ‘baseline_cancer’, and ‘baseline_charlson’, and negatively correlated with ‘baseline_osteoart’ (osteoarthritis) and ‘bmi’.

Some of our features are highly correlated. Baseline cancer status is highly correlated with baseline Charlson Comorbidity Score, which makes sense since cancer is a comorbidity. Baseline cancer is also highly negatively correlated with osteoarthritis.

Interestingly, osteoarthritis is not associated with cancer. Link to article about osteoarthritis and cancer risk

I’m not rendering the scan_data output here; it was causing an error when uploading to RPubs.

#pointblank::scan_data(surgDf2)
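
As a lightweight substitute for the pointblank report, the checks behind the claims above can be reproduced directly (a minimal sketch; ‘complication’ is still numeric at this point, so cor() works on it):

# Confirm there are no missing values after deduplication
colSums(is.na(surgDf2))

# Correlations between the target and the features discussed above
round(cor(surgDf2[, c("complication", "Age", "bmi", "asa_status",
                      "baseline_cancer", "baseline_charlson",
                      "baseline_osteoart")]), 2)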

Method

I use a logistic regression model to inspect the association between the complication target variable and the predictor variables.

The features most associated with the complication outcome are Age, bmi, ASA status, osteoarthritis, diabetes, and the Charlson Comorbidity Index.

test <- glm(complication  ~ ., surgDf2, family = 'binomial')
summary(test)
## 
## Call:
## glm(formula = complication ~ ., family = "binomial", data = surgDf2)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.696758   0.170053  -4.097 4.18e-05 ***
## bmi                -0.051836   0.002979 -17.399  < 2e-16 ***
## Age                 0.020529   0.001809  11.349  < 2e-16 ***
## asa_status          0.438548   0.042706  10.269  < 2e-16 ***
## baseline_cancer    -0.133113   0.056329  -2.363  0.01812 *  
## baseline_charlson   0.085893   0.013936   6.163 7.12e-10 ***
## baseline_cvd        0.008742   0.049006   0.178  0.85842    
## baseline_dementia   0.437946   0.253947   1.725  0.08461 .  
## baseline_diabetes  -0.309450   0.066529  -4.651 3.30e-06 ***
## baseline_digestive -0.071939   0.050807  -1.416  0.15680    
## baseline_osteoart  -0.698338   0.063955 -10.919  < 2e-16 ***
## baseline_psych     -0.170580   0.074227  -2.298  0.02156 *  
## baseline_pulmonary -0.108457   0.069095  -1.570  0.11649    
## dow                 0.006768   0.014774   0.458  0.64691    
## gender              0.127501   0.043046   2.962  0.00306 ** 
## hour                0.004803   0.007199   0.667  0.50468    
## month              -0.006815   0.006106  -1.116  0.26439    
## moonphase           0.031358   0.018793   1.669  0.09520 .  
## mort30              0.766866   0.298376   2.570  0.01017 *  
## race                0.108324   0.054162   2.000  0.04550 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 14611  on 11732  degrees of freedom
## Residual deviance: 13461  on 11713  degrees of freedom
## AIC: 13501
## 
## Number of Fisher Scoring iterations: 4
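
Because the log-odds coefficients above are hard to read directly, they can be converted to odds ratios (a quick sketch using the fitted model; Wald intervals via confint.default avoid the slower profile-likelihood computation):

# Odds ratios with Wald 95% confidence intervals
round(exp(cbind(OR = coef(test), confint.default(test))), 3)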

Here we create an individual decision tree to see which features are most predictive of our target, postoperative complications. I like to have both the printed version of the tree and the diagram to look at.

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.1 
## ✔ dials        1.2.1      ✔ tune         1.2.0 
## ✔ infer        1.0.7      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
## ✔ recipes      1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
surgDf2$complication <- as.factor(surgDf2$complication)

set.seed(123)
data_split <- initial_split(surgDf2, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

tree_spec <- decision_tree(engine = "rpart", mode = "classification", tree_depth = 7)

# Fit the model to the training data
tree_fit1 <- tree_spec %>%
 fit(complication ~ ., data = train_data, model=TRUE) 
print(tree_fit1)
## parsnip model object
## 
## n= 8799 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 8799 2766 0 (0.685646096 0.314353904)  
##     2) Age< 75.05 8340 2308 0 (0.723261391 0.276738609)  
##       4) Age>=32 8185 2160 0 (0.736102627 0.263897373)  
##         8) bmi>=35.235 2675  411 0 (0.846355140 0.153644860) *
##         9) bmi< 35.235 5510 1749 0 (0.682577132 0.317422868)  
##          18) Age< 41.25 1304  152 0 (0.883435583 0.116564417) *
##          19) Age>=41.25 4206 1597 0 (0.620304327 0.379695673)  
##            38) Age>=54.55 3715 1116 0 (0.699596231 0.300403769)  
##              76) Age< 58.55 1590  201 0 (0.873584906 0.126415094) *
##              77) Age>=58.55 2125  915 0 (0.569411765 0.430588235)  
##               154) Age>=70.35 1410  231 0 (0.836170213 0.163829787) *
##               155) Age< 70.35 715   31 1 (0.043356643 0.956643357) *
##            39) Age< 54.55 491   10 1 (0.020366599 0.979633401) *
##       5) Age< 32 155    7 1 (0.045161290 0.954838710) *
##     3) Age>=75.05 459    1 1 (0.002178649 0.997821351) *
# Load the library
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
# Plot the decision tree
rpart.plot(tree_fit1$fit, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto", roundint=FALSE)

Here are the rules in table form.

rpart.rules(tree_fit1$fit, roundint=FALSE)
##  complication                                 
##          0.12 when Age is 32 to 41 & bmi <  35
##          0.13 when Age is 55 to 59 & bmi <  35
##          0.15 when Age is 32 to 75 & bmi >= 35
##          0.16 when Age is 70 to 75 & bmi <  35
##          0.95 when Age <  32                  
##          0.96 when Age is 59 to 70 & bmi <  35
##          0.98 when Age is 41 to 55 & bmi <  35
##          1.00 when Age >=       75
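
rpart also records a variable-importance score accumulated over the tree’s splits, which complements the plot and rules above (a brief sketch using the fitted tree):

# Variable importance from the underlying rpart fit
tree_fit1$fit$variable.importance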

The decision tree gives us an accuracy of 88% and 87% on the training and test data, respectively. I’m genuinely surprised by how well the decision tree performs.

predictions <- tree_fit1 %>%
 predict(train_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(train_data[,20]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5984  995
##          1   49 1771
##                                          
##                Accuracy : 0.8814         
##                  95% CI : (0.8744, 0.888)
##     No Information Rate : 0.6856         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6967         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9919         
##             Specificity : 0.6403         
##          Pos Pred Value : 0.8574         
##          Neg Pred Value : 0.9731         
##              Prevalence : 0.6856         
##          Detection Rate : 0.6801         
##    Detection Prevalence : 0.7932         
##       Balanced Accuracy : 0.8161         
##                                          
##        'Positive' Class : 0              
## 
predictions <- tree_fit1 %>%
 predict(test_data) %>%
 pull(.pred_class)


caret::confusionMatrix(as.factor(predictions), as.factor(test_data[,20]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1982  334
##          1   28  590
##                                           
##                Accuracy : 0.8766          
##                  95% CI : (0.8642, 0.8883)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.686           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9861          
##             Specificity : 0.6385          
##          Pos Pred Value : 0.8558          
##          Neg Pred Value : 0.9547          
##              Prevalence : 0.6851          
##          Detection Rate : 0.6755          
##    Detection Prevalence : 0.7894          
##       Balanced Accuracy : 0.8123          
##                                           
##        'Positive' Class : 0               
## 

Now let’s use a random forest to predict complications from our dataset features.

rffit <- randomForest::randomForest(complication ~ ., data = train_data)
rfpred <- predict(rffit, train_data[, -20])

Here we create a confusion matrix and measure the accuracy of our model on the training set. Our random forest classifier has an accuracy of 99% on the training data.

caret::confusionMatrix(as.factor(rfpred), as.factor(train_data[,20]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6033   36
##          1    0 2730
##                                           
##                Accuracy : 0.9959          
##                  95% CI : (0.9943, 0.9971)
##     No Information Rate : 0.6856          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9905          
##                                           
##  Mcnemar's Test P-Value : 5.433e-09       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9870          
##          Pos Pred Value : 0.9941          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6856          
##          Detection Rate : 0.6856          
##    Detection Prevalence : 0.6897          
##       Balanced Accuracy : 0.9935          
##                                           
##        'Positive' Class : 0               
## 

On the test data we get an accuracy of 86%.

rfpred <- predict(rffit, test_data[, -20])
caret::confusionMatrix(as.factor(rfpred), as.factor(test_data[,20]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2001  401
##          1    9  523
##                                           
##                Accuracy : 0.8603          
##                  95% CI : (0.8472, 0.8726)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6342          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9955          
##             Specificity : 0.5660          
##          Pos Pred Value : 0.8331          
##          Neg Pred Value : 0.9831          
##              Prevalence : 0.6851          
##          Detection Rate : 0.6820          
##    Detection Prevalence : 0.8187          
##       Balanced Accuracy : 0.7808          
##                                           
##        'Positive' Class : 0               
## 
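
The random forest also provides its own variable-importance measure, which we can use to cross-check the features flagged by the logistic regression and decision tree (a short sketch using the fitted rffit object):

# Mean decrease in Gini impurity for each predictor
randomForest::importance(rffit)
randomForest::varImpPlot(rffit, main = "Random forest variable importance")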

Next we try predicting complications using an SVM. We first try a linear kernel.

library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
svmfit = svm(complication ~ ., train_data , cost = 10, kernel = "linear", scale = TRUE)

We only get an accuracy of 68% from the SVM with a linear kernel on both the training and test data.

svmpred = predict(svmfit, train_data[, -20])
caret::confusionMatrix(as.factor(svmpred), as.factor(train_data[,20]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5992 2712
##          1   41   54
##                                           
##                Accuracy : 0.6871          
##                  95% CI : (0.6773, 0.6968)
##     No Information Rate : 0.6856          
##     P-Value [Acc > NIR] : 0.3875          
##                                           
##                   Kappa : 0.0172          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99320         
##             Specificity : 0.01952         
##          Pos Pred Value : 0.68842         
##          Neg Pred Value : 0.56842         
##              Prevalence : 0.68565         
##          Detection Rate : 0.68099         
##    Detection Prevalence : 0.98920         
##       Balanced Accuracy : 0.50636         
##                                           
##        'Positive' Class : 0               
## 
svmpred = predict(svmfit, test_data[, -20])
caret::confusionMatrix(as.factor(svmpred), as.factor(test_data[,20]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1998  902
##          1   12   22
##                                           
##                Accuracy : 0.6885          
##                  95% CI : (0.6714, 0.7052)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.3537          
##                                           
##                   Kappa : 0.0241          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.99403         
##             Specificity : 0.02381         
##          Pos Pred Value : 0.68897         
##          Neg Pred Value : 0.64706         
##              Prevalence : 0.68507         
##          Detection Rate : 0.68098         
##    Detection Prevalence : 0.98841         
##       Balanced Accuracy : 0.50892         
##                                           
##        'Positive' Class : 0               
## 

The SVM with a polynomial kernel gives us an accuracy of 75% and 70% on the training and test data, respectively. Interestingly, without scaling we only get an accuracy of 41%.

svmfit = svm(complication ~ ., data = train_data , cost = 10, kernel = "polynomial", scale = TRUE)
svmpred = predict(svmfit, train_data[, -20])
caret::confusionMatrix(as.factor(svmpred), as.factor(train_data[,20]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5751 1898
##          1  282  868
##                                           
##                Accuracy : 0.7522          
##                  95% CI : (0.7431, 0.7612)
##     No Information Rate : 0.6856          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3173          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9533          
##             Specificity : 0.3138          
##          Pos Pred Value : 0.7519          
##          Neg Pred Value : 0.7548          
##              Prevalence : 0.6856          
##          Detection Rate : 0.6536          
##    Detection Prevalence : 0.8693          
##       Balanced Accuracy : 0.6335          
##                                           
##        'Positive' Class : 0               
## 
svmpred = predict(svmfit, test_data[, -20])
caret::confusionMatrix(as.factor(svmpred), as.factor(test_data[,20]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1852  714
##          1  158  210
##                                           
##                Accuracy : 0.7028          
##                  95% CI : (0.6859, 0.7193)
##     No Information Rate : 0.6851          
##     P-Value [Acc > NIR] : 0.01994         
##                                           
##                   Kappa : 0.1775          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9214          
##             Specificity : 0.2273          
##          Pos Pred Value : 0.7217          
##          Neg Pred Value : 0.5707          
##              Prevalence : 0.6851          
##          Detection Rate : 0.6312          
##    Detection Prevalence : 0.8746          
##       Balanced Accuracy : 0.5743          
##                                           
##        'Positive' Class : 0               
## 
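
The cost parameter was fixed at 10 in both SVM fits above. e1071’s tune() can instead select cost (and the polynomial degree) by cross-validation; a minimal sketch with illustrative parameter ranges:

# Grid search by 10-fold cross-validation over cost and degree
set.seed(123)
svm_tune <- e1071::tune(svm, complication ~ ., data = train_data,
                        kernel = "polynomial", scale = TRUE,
                        ranges = list(cost = c(0.1, 1, 10), degree = 2:4))
summary(svm_tune)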

Results

With the random forest we were able to create a model that has 86% accuracy, 99% sensitivity, and 56% specificity on the test data. With the logistic regression model and the decision tree we were able to identify the most predictive features: Age, bmi, ASA status, osteoarthritis, diabetes, and the Charlson Comorbidity Index. The decision tree and logistic regression also surface something that may look counterintuitive: patients with a bmi under roughly 35 (the tree’s split point) are at higher risk for complications, and patients with a very low bmi appear to be at the highest risk. A very low bmi is often associated with preoperative comorbidities or frailty.

Low bmi women at risk for cardiovascular and respiratory disease

Low bmi patients at risk after spine surgery
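
To examine the bmi finding directly, one could tabulate the complication rate by bmi band (a minimal sketch; the cut points are illustrative, and ‘complication’ is a factor at this point):

# Complication rate within illustrative bmi bands
surgDf2 %>%
 mutate(bmi_band = cut(bmi, breaks = c(0, 18.5, 25, 30, 35, Inf))) %>%
 group_by(bmi_band) %>%
 summarise(n = n(), complication_rate = mean(complication == "1"))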

Conclusion

I feel this is a model that could be put into production to predict patient post-op complications. The model would be ideal if there were a low-cost, low-risk intervention that could reduce complication rates. For example, patients identified as ‘high-risk’ for complications could be given additional prophylactic antibiotics to prevent sepsis and surgical site infections. Because this model has a roughly 44% false-positive rate, only a low-risk/low-cost intervention would be appropriate.