Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project has been initialized in 2016 and aims to:
1.Encourage using Machine Learning techniques in medical research in Vietnam
2.Promote the use of R statistical programming language, an open source and leading tool for practicing data science.
Background
The 57th case study (Chapter 13 - Machine Learning in Medicine-Cookbook Two by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) allows us to set up a small competiton between 2 machine learning algorithms : Extreme Gradient Boosting Tree versus Support Vector Machine.
In this study, we extracted the laboratory data and survival outcome of 200 patients hospitalized for sepsis. Our objective is to estimate the mortality risk of these patients using 10 markers: gamma-glutamyl transferase (gammagt, U/l),aspartate aminotransferase and Alanine transaminase (U/l), bilirubine (μmol/l), urea (mmol/l), creatinine (μmol/l), creatinine clearance (ml/min), erythrocyte sedimentation rate (ESR, mm), c-reactive protein (CRP, mg/l) and leucocyte count (×10^9/l).
In the original textbook, the authors adopted an SVM algorithm (Knime Data Miner) to resolve this classification problem. We will extend this experiment by setting a benchmark study to compare the performance of two potential algorithms (SVM and XGBT). Each model will be trained on the same training subset (n=160 or 80% of original dataset) then tested on the same external subset (n=40 or 20%).
Materials and method
The original dataset was provided in Chaper 13, Machine Learning in Medicine-Cookbook Three (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). You can get the original data in csv format svm.csv) from their website: extras.springer.com.
Data analysis was performed in R statistical programming language. The caret package (by Max Kuhn: http://topepo.github.io/caret) will be used through out the process, including data splitting, model tuning, resampling, training and testing.
The Extreme gradient boosting Tree algorithm is supported by “xgboost” package, and Support vector machine with linear Kernel by “kernlab”" package.
The model training implied an integrated Tuning-Resampling process in caret, with a 5x5 cross-validations. This consists of training and testing the algorithm by CV while the model’s parameters are tuned until reaching their optimized performance. Then final model for each algorithm will be externally validated on the same test subset (n=40). Model’s averaged performance (including Accuracy, Sensitivity, Specificity, PPV, NPV, Kappa coefficient and Balanced Accuracy) will be compared by Wilcoxon-sign rank test.
First, following packages must be established in R-studio (some other packages will also be required during the analysis), and we will personalize a global theme for ggplot2
library(plyr)
library(tidyverse)
library(GGally)
my_theme <- function(base_size = 10, base_family = "sans"){
theme_minimal(base_size = base_size, base_family = base_family) +
theme(
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
axis.title = element_text(size = 10),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#f7fdff"),
strip.background = element_rect(fill = "#001d60", color = "#00113a", size =0.5),
strip.text = element_text(face = "bold", size = 10, color = "white"),
legend.position = "bottom",
legend.justification = "center",
legend.background = element_blank(),
panel.border = element_rect(color = "grey5", fill = NA, size = 0.5)
)
}
theme_set(my_theme())
Results
Step 0: Data loading and reshape
data<-read.csv("Case57.csv",sep=";")%>%as_tibble()%>%dplyr::rename(.,Death=VAR00001,GammaGT=VAR00002,ASAT=VAR00003,ALAT=VAR00004,Bilirubin=VAR00005,Urea=VAR00006,Creatinine=VAR00007,CreClearance=VAR00008,ESR=VAR00009,CRP=VAR00010,Leucocytes=VAR00011)
data$Outcome=recode_factor(data$Death,`0` = "Survive", `1` = "Death")
data
## # A tibble: 200 × 12
## Death GammaGT ASAT ALAT Bilirubin Urea Creatinine CreClearance ESR
## <int> <int> <int> <int> <int> <dbl> <int> <int> <int>
## 1 0 20 23 34 2 3.4 89 -111 2
## 2 0 14 21 33 3 2.0 67 -112 7
## 3 0 30 35 32 4 5.6 58 -116 8
## 4 0 35 34 40 4 6.0 76 -110 6
## 5 0 23 33 22 4 6.1 95 -120 9
## 6 0 26 31 24 3 5.4 78 -132 8
## 7 0 15 29 26 2 5.3 47 -120 12
## 8 0 13 26 24 1 6.3 65 -132 13
## 9 0 26 27 27 4 6.0 97 -112 14
## 10 0 34 25 13 3 4.0 67 -125 15
## # ... with 190 more rows, and 3 more variables: CRP <int>,
## # Leucocytes <int>, Outcome <fctr>
Once the above commmands executed, we will get a clean dataframe, that will be used throughout 4 steps.
Step 1) Data exploration
Table 1 and Figure 1: Descriptive statistics
psych::describeBy(data[,-1],group=data$Outcome)
## $Survive
## vars n mean sd median trimmed mad min max range
## GammaGT 1 107 25.08 20.01 21.0 21.37 7.41 2 134 132
## ASAT 2 107 25.26 27.13 21.0 20.63 8.90 3 236 233
## ALAT 3 107 25.18 19.42 24.0 21.87 10.38 2 120 118
## Bilirubin 4 107 6.14 20.16 3.0 2.95 1.48 1 201 200
## Urea 5 107 5.24 4.11 4.8 4.78 1.33 2 45 43
## Creatinine 6 107 94.12 88.59 80.0 80.87 11.86 47 845 798
## CreClearance 7 107 -110.81 16.38 -112.0 -112.86 10.38 -132 -7 125
## ESR 8 107 14.64 11.66 14.0 13.33 2.97 2 123 121
## CRP 9 107 6.95 9.38 6.0 5.86 1.48 2 100 98
## Leucocytes 10 107 6.27 3.21 6.0 5.94 1.48 2 29 27
## Outcome* 11 107 1.00 0.00 1.0 1.00 0.00 1 1 0
## skew kurtosis se
## GammaGT 3.27 11.81 1.93
## ASAT 5.29 34.23 2.62
## ALAT 3.32 12.22 1.88
## Bilirubin 8.58 78.78 1.95
## Urea 8.48 79.10 0.40
## Creatinine 7.10 52.78 8.56
## CreClearance 3.08 14.75 1.58
## ESR 7.52 66.92 1.13
## CRP 9.11 87.38 0.91
## Leucocytes 3.50 21.50 0.31
## Outcome* NaN NaN 0.00
##
## $Death
## vars n mean sd median trimmed mad min max
## GammaGT 1 93 402.31 432.37 269 320.21 220.91 13.0 2300
## ASAT 2 93 388.99 413.80 277 316.08 281.69 3.0 1980
## ALAT 3 93 380.80 369.73 267 322.85 257.97 15.0 1500
## Bilirubin 4 93 126.46 104.56 85 115.44 88.96 2.0 400
## Urea 5 93 26.77 22.65 18 23.31 15.72 3.2 98
## Creatinine 6 93 306.35 211.69 241 280.61 154.19 71.0 865
## CreClearance 7 93 -44.81 33.76 -38 -40.71 35.58 -132.0 -4
## ESR 8 93 66.86 42.14 48 62.32 34.10 7.0 180
## CRP 9 93 41.98 44.07 27 32.99 16.31 3.0 243
## Leucocytes 10 93 18.69 5.91 18 18.40 5.93 6.0 30
## Outcome* 11 93 2.00 0.00 2 2.00 0.00 2.0 2
## range skew kurtosis se
## GammaGT 2287.0 1.86 3.63 44.84
## ASAT 1977.0 2.08 4.84 42.91
## ALAT 1485.0 1.33 1.06 38.34
## Bilirubin 398.0 0.84 -0.21 10.84
## Urea 94.8 1.26 0.85 2.35
## Creatinine 794.0 1.00 -0.23 21.95
## CreClearance 128.0 -0.91 0.02 3.50
## ESR 173.0 0.79 -0.46 4.37
## CRP 240.0 2.24 5.31 4.57
## Leucocytes 24.0 0.39 -0.54 0.61
## Outcome* 0.0 NaN NaN 0.00
##
## attr(,"call")
## by.data.frame(data = x, INDICES = group, FUN = describe, type = type)
data%>%gather(GammaGT:Leucocytes,key="Marker",value="Value")%>%ggplot(aes(x=Value,fill=Outcome,color=Outcome))+geom_density(alpha=0.6,show.legend =T)+facet_wrap(~Marker,scales="free",ncol=4)+scale_fill_brewer(palette = "Set1")+scale_color_brewer(palette = "Set1")
Figure 2: Heatmap plot showing the level of 10 sepsis biomarkers in function of survival outcome
dfscale<-data[,-c(1,12)]%>%as.matrix()%>%scale()%>%as_tibble()%>%mutate(.,Outcome=data$Outcome,Id=row.names(.))
dfscale%>%gather(GammaGT:Leucocytes,key="Marker",value="Value")%>%ggplot(aes(x=reorder(Id,Value),y=reorder(Marker,Value),fill=Value))+geom_tile(color="black",show.legend=T)+facet_wrap(~Outcome,ncol=1,shrink=T,scale="free")+scale_fill_gradient2(low="#fcde00",mid="#fc3232",high="#7901a8",midpoint=3)+theme(axis.text.x=element_blank())+scale_y_discrete("Markers")+scale_x_discrete("Patient's Id")
We summon the “mighty”" caret package then perform data splitting
library(caret)
set.seed(123)
idTrain=createDataPartition(y=data$Death, p=160/200,list=FALSE)
trainset=data[idTrain,-1]
testset=data[-idTrain,-1]
As you can see, the original dataset was splitted into 2 subsets; train (n=160) and test (n=40)
Now, we will activate the Training-Tuning and Resampling process for two algorithms
# Extreme gradient boosting tree model
set.seed(123)
XGBT<-train(Outcome~.,
data=trainset,
method="xgbTree",
preProcess = NULL,
tuneLength=5,
trControl = trainControl(method = "repeatedcv", number=5,repeats=5,verboseIter = FALSE,summaryFunction=multiClassSummary)
)
#Support vector machine with Linear Kernel model
set.seed(123)
SVM<-train(Outcome~.,
data=trainset,
method="svmLinear",
preProcess = NULL,
tuneLength=5,
trControl = trainControl(method = "repeatedcv", number=5,repeats=5,verboseIter = FALSE,summaryFunction=multiClassSummary)
)
The Tuning process on XGBT could be presented as follow:
ggplot(XGBT)+scale_color_brewer(palette = "Set1")
The content of each model
XGBT$finalModel
## ##### xgb.Booster
## raw: 14.7 Kb
## call:
## xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight), data = dat, nrounds = param$nrounds,
## objective = "binary:logistic")
## params (as set within xgb.train):
## eta = "0.4", max_depth = "2", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", objective = "binary:logistic", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## niter: 50
## xNames: GammaGTASATALATBilirubinUreaCreatinineCreClearanceESRCRPLeucocytes
## problemType: Classification
## tuneValue:
## nrounds max_depth eta gamma colsample_bytree min_child_weight
## 61 50 2 0.4 0 0.6 1
## obsLevels: SurviveDeath
SVM$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 29
##
## Objective Function Value : -21.7457
## Training error : 0.05
Now the duel between those two monsters really begins
ROUND 1) Confusion matrix: Averaged performance by training
confusionMatrix(XGBT,positive ="Death")
## Cross-Validated (5 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Survive Death
## Survive 51.2 1.0
## Death 1.9 45.9
##
## Accuracy (average) : 0.9712
confusionMatrix(SVM,positive ="Death")
## Cross-Validated (5 fold, repeated 5 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Survive Death
## Survive 48.2 2.5
## Death 4.9 44.4
##
## Accuracy (average) : 0.9263
The confusion Matrix on averaged training performance of two models suggest that the XGBT performs slightly better.
So XGBT won this round.
ROUND 2: Averaged Performance through CV Resampling
This round, each model will show their performance through 25 steps of CV:
Figure 3A,B,C: Performance metrics of 2 models
resXGB<-XGBT$resample%>%separate(Resample,into = c("Fold", "Iter"), sep = ".Rep")
resXGB=dplyr::rename(resXGB,ACC=Accuracy,KAP=Kappa,SEN=Sensitivity,SPEC=Specificity,PPV=Pos_Pred_Value,NPV=Neg_Pred_Value,BAC=Balanced_Accuracy)%>%.[,-7]
resXGB$Iteration=c(1:nrow(resXGB))
resXGB$model="XGB"
resSVM<-SVM$resample%>%separate(Resample,into = c("Fold", "Iter"), sep = ".Rep")
resSVM=dplyr::rename(resSVM,ACC=Accuracy,KAP=Kappa,SEN=Sensitivity,SPEC=Specificity,PPV=Pos_Pred_Value,NPV=Neg_Pred_Value,BAC=Balanced_Accuracy)%>%.[,-7]
resSVM$Iteration=c(1:nrow(resSVM))
resSVM$model="SVM"
resXGB=gather(resXGB,ACC:BAC,key="Metric",value="Value")
resSVM=gather(resSVM,ACC:BAC,key="Metric",value="Value")
resdf=rbind(resSVM,resXGB)
resdf%>%ggplot(aes(x=Metric,y=Value,fill=model,color=model))+stat_summary(fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,size=1)+facet_grid(model~.)+coord_flip()+scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1")
resdf%>%ggplot(aes(x=Metric,y=Value))+geom_boxplot(aes(fill=model),alpha=0.6)+coord_flip()+facet_wrap(~Metric,scales="free",ncol=2)+theme(axis.text.y=element_blank())+scale_fill_brewer(palette = "Set1")+scale_color_brewer(palette = "Set1")
resdf%>%ggplot(aes(x=Value))+geom_freqpoly(aes(fill=model,color=model),size=1)+facet_wrap(~Metric,scales="free",ncol=2)+theme(axis.text.y=element_blank())+scale_color_brewer(palette = "Set1")
Figure 4A,B: Evolution of performance metrics of two models through 25 iteration steps and by each fold
resdf%>%ggplot(aes(x=Iteration,y=Value))+geom_path(aes(color=model),size=1,alpha=0.7)+facet_wrap(~Metric,scales="free",ncol=2)+scale_color_brewer(palette = "Set1")
resdf%>%ggplot(aes(x=Iter,y=Value,group=model))+geom_line(aes(color=model),size=1,show.legend=F)+facet_grid(Fold~Metric,scales="free")+scale_color_brewer(palette = "Set1")
Based on those graphs, the XGBT seems to perform better than SVM by most of criteria. The XGBT produced a more steady evolution, lower variability and higher averaged value of performance metrics in each one of 5 folds and through out overall 25 iteration steps.
To confirm this, we will use a Wilcoxon-sign rank test (a non parametric that is equivalent to paired samples t-test):
resdf%>%select(.,4:6)%>%split(.$Metric)%>%map(~wilcox.test(.$Value~.$model,paired=T))%>%map(~.[3])
## $ACC
## $ACC$p.value
## [1] 0.001038597
##
##
## $BAC
## $BAC$p.value
## [1] 0.0009421216
##
##
## $KAP
## $KAP$p.value
## [1] 0.001168321
##
##
## $NPV
## $NPV$p.value
## [1] 0.004364185
##
##
## $PPV
## $PPV$p.value
## [1] 0.0210427
##
##
## $SEN
## $SEN$p.value
## [1] 0.005713921
##
##
## $SPEC
## $SPEC$p.value
## [1] 0.0271587
The results show that XGBT model was really better in some aspects (p<0.05 for all paired comparison)
ROUND 3: External Validation
As the internal validation is not enough, we will validate each model on the same external dataset (n=40). Here are the results of Confusion Matrix analysis
predxgb<-predict(XGBT,testset,type="prob")%>%cbind(testset,.)
predxgb$Predicted=predict(XGBT,testset)
predxgb$Model="XGBT"
predsvm<-testset%>%mutate(.,Predicted=predict(SVM,.))
predsvm$Model="SVM"
confusionMatrix(predxgb$Predicted, reference=predxgb$Outcome,positive ="Death",mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survive Death
## Survive 22 0
## Death 0 18
##
## Accuracy : 1
## 95% CI : (0.9119, 1)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 4.116e-11
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Precision : 1.00
## Recall : 1.00
## F1 : 1.00
## Prevalence : 0.45
## Detection Rate : 0.45
## Detection Prevalence : 0.45
## Balanced Accuracy : 1.00
##
## 'Positive' Class : Death
##
confusionMatrix(predsvm$Predicted, reference=predsvm$Outcome,positive ="Death",mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Survive Death
## Survive 22 0
## Death 0 18
##
## Accuracy : 1
## 95% CI : (0.9119, 1)
## No Information Rate : 0.55
## P-Value [Acc > NIR] : 4.116e-11
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.00
## Specificity : 1.00
## Pos Pred Value : 1.00
## Neg Pred Value : 1.00
## Precision : 1.00
## Recall : 1.00
## F1 : 1.00
## Prevalence : 0.45
## Detection Rate : 0.45
## Detection Prevalence : 0.45
## Balanced Accuracy : 1.00
##
## 'Positive' Class : Death
##
pred<-rbind(predxgb[-c(12:13)],predsvm)
pred%>%ggplot(aes(x=Predicted))+geom_bar(aes(fill=Outcome),color="black",stat="count",position="fill",alpha=0.6)+facet_grid(Model~.,scales="free")+
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1")
predxgb%>%ggplot(aes(x=Death))+geom_histogram(aes(fill=Outcome,color=Outcome),alpha=0.6)+facet_grid(Outcome~.,scales="free")+
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1")
Those results indicate that both models worked perfectly well on an independent dataset. This is a draw match.
Discussion: To this point, the results of our benchmark experiment could be summarized as follow:
The XGBT is slighly better than the SVM with linear Kernel. It represents a steady and higher performance through out the resamling process. However, training a XGBT would cost much more time and this is a limitation for applied on larger dataset.
The external validation shows that both models (XGBT and SVM) are perfect for predicting the survival outcome in 40 patients. They could be considered equivalent for clinical application.
Advantage of XGBT model is that it allows us to understand the attribution of each biomarker to the survival prognosis. The role of each biomarker could be seen in Figures 5A and 5B:
Figure 5A,B:Importance features in the XGBT model
features<-trainset[,-11]%>%apply(.,2,function(x) as.numeric(as.character(x)))%>%as.matrix()%>%colnames()
xgboost::xgb.importance(features,XGBT$finalModel)%>%xgboost::xgb.ggplot.importance()
xgboost::xgb.importance(features,XGBT$finalModel)%>%gather(Gain:Frequency,key="Metric",value="Value")%>%ggplot(aes(x=reorder(Feature,Value),y=Value,fill=reorder(Feature,Value)))+geom_bar(color="black",width=0.5,stat="identity",alpha=0.7)+facet_wrap(~Metric,ncol=3,scales="free")+scale_x_discrete("Marker")+coord_flip()
Figure 6: Summarized result
datalog<-data%>%.[,c(2:7,9:11)]%>%log2(.)%>%mutate(.,Outcome=data$Outcome,PredXGBT=predict(XGBT,data[,-1]),PredSVM=predict(SVM,data[,-1]))
plotfuncLow <- function(data,mapping){
p <- ggplot(data = data,mapping=mapping)+geom_point(aes(fill=datalog$PredXGBT),shape=21,color="black")+stat_density2d(geom="polygon",aes(fill=datalog$PredXGBT,alpha = ..level..))+scale_fill_manual(values=c("blue","red"))
p
}
plotfuncUp <- function(data,mapping){
p <- ggplot(data = data,mapping=mapping)+geom_point(aes(fill=datalog$PredSVM),shape=21,color="black")+stat_density2d(geom="polygon",aes(fill=datalog$PredSVM,alpha = ..level..))+scale_fill_manual(values=c("violet","gold"))
p
}
library(GGally)
ggpairs(datalog,columns = 1:9,lower = list(continuous=plotfuncLow),diag=NULL,upper = list(continuous=plotfuncUp))
The figure 6 displays the classification by XGBT model (lower, blue= Survived, red=Death) versus SVM model (upper,violet=survived,orange=death) based on a scatter-plot matrix of most important biomarkers (logarithmic scale).
Conclusion: Based on our benchmark experiment, the XGBT seems to be silghtly better than the SVM algorith and both of them could be applied for our clinical context. Both models showed an excellent prognostic performance for survival outcome in patients with sepsis.
END