Case study N57: Extreme Gradient Boosting Tree V.S Support Vector Machine

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project has been initialized in 2016 and aims to:

1.Encourage using Machine Learning techniques in medical research in Vietnam

2.Promote the use of R statistical programming language, an open source and leading tool for practicing data science.

Background

The 57th case study (Chapter 13 - Machine Learning in Medicine-Cookbook Two by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) allows us to set up a small competiton between 2 machine learning algorithms : Extreme Gradient Boosting Tree versus Support Vector Machine.

In this study, we extracted the laboratory data and survival outcome of 200 patients hospitalized for sepsis. Our objective is to estimate the mortality risk of these patients using 10 markers: gamma-glutamyl transferase (gammagt, U/l),aspartate aminotransferase and Alanine transaminase (U/l), bilirubine (μmol/l), urea (mmol/l), creatinine (μmol/l), creatinine clearance (ml/min), erythrocyte sedimentation rate (ESR, mm), c-reactive protein (CRP, mg/l) and leucocyte count (×10^9/l).

In the original textbook, the authors adopted an SVM algorithm (Knime Data Miner) to resolve this classification problem. We will extend this experiment by setting a benchmark study to compare the performance of two potential algorithms (SVM and XGBT). Each model will be trained on the same training subset (n=160 or 80% of original dataset) then tested on the same external subset (n=40 or 20%).

Materials and method

The original dataset was provided in Chaper 13, Machine Learning in Medicine-Cookbook Three (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics 2014). You can get the original data in csv format svm.csv) from their website: extras.springer.com.

Data analysis was performed in R statistical programming language. The caret package (by Max Kuhn: http://topepo.github.io/caret) will be used through out the process, including data splitting, model tuning, resampling, training and testing.

The Extreme gradient boosting Tree algorithm is supported by “xgboost” package, and Support vector machine with linear Kernel by “kernlab”" package.

The model training implied an integrated Tuning-Resampling process in caret, with a 5x5 cross-validations. This consists of training and testing the algorithm by CV while the model’s parameters are tuned until reaching their optimized performance. Then final model for each algorithm will be externally validated on the same test subset (n=40). Model’s averaged performance (including Accuracy, Sensitivity, Specificity, PPV, NPV, Kappa coefficient and Balanced Accuracy) will be compared by Wilcoxon-sign rank test.

First, following packages must be established in R-studio (some other packages will also be required during the analysis), and we will personalize a global theme for ggplot2

library(plyr)
library(tidyverse)
library(GGally)

my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 10),
      panel.grid.major = element_line(color = "gray"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#f7fdff"),
      strip.background = element_rect(fill = "#001d60", color = "#00113a", size =0.5),
      strip.text = element_text(face = "bold", size = 10, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey5", fill = NA, size = 0.5)
    )
}

theme_set(my_theme())

Results

Step 0: Data loading and reshape

data<-read.csv("Case57.csv",sep=";")%>%as_tibble()%>%dplyr::rename(.,Death=VAR00001,GammaGT=VAR00002,ASAT=VAR00003,ALAT=VAR00004,Bilirubin=VAR00005,Urea=VAR00006,Creatinine=VAR00007,CreClearance=VAR00008,ESR=VAR00009,CRP=VAR00010,Leucocytes=VAR00011)

data$Outcome=recode_factor(data$Death,`0` = "Survive", `1` = "Death")

data

## # A tibble: 200 × 12
##    Death GammaGT  ASAT  ALAT Bilirubin  Urea Creatinine CreClearance   ESR
##    <int>   <int> <int> <int>     <int> <dbl>      <int>        <int> <int>
## 1      0      20    23    34         2   3.4         89         -111     2
## 2      0      14    21    33         3   2.0         67         -112     7
## 3      0      30    35    32         4   5.6         58         -116     8
## 4      0      35    34    40         4   6.0         76         -110     6
## 5      0      23    33    22         4   6.1         95         -120     9
## 6      0      26    31    24         3   5.4         78         -132     8
## 7      0      15    29    26         2   5.3         47         -120    12
## 8      0      13    26    24         1   6.3         65         -132    13
## 9      0      26    27    27         4   6.0         97         -112    14
## 10     0      34    25    13         3   4.0         67         -125    15
## # ... with 190 more rows, and 3 more variables: CRP <int>,
## #   Leucocytes <int>, Outcome <fctr>

Once the above commmands executed, we will get a clean dataframe, that will be used throughout 4 steps.

Step 1) Data exploration

Table 1 and Figure 1: Descriptive statistics

psych::describeBy(data[,-1],group=data$Outcome)

## $Survive
##              vars   n    mean    sd median trimmed   mad  min max range
## GammaGT         1 107   25.08 20.01   21.0   21.37  7.41    2 134   132
## ASAT            2 107   25.26 27.13   21.0   20.63  8.90    3 236   233
## ALAT            3 107   25.18 19.42   24.0   21.87 10.38    2 120   118
## Bilirubin       4 107    6.14 20.16    3.0    2.95  1.48    1 201   200
## Urea            5 107    5.24  4.11    4.8    4.78  1.33    2  45    43
## Creatinine      6 107   94.12 88.59   80.0   80.87 11.86   47 845   798
## CreClearance    7 107 -110.81 16.38 -112.0 -112.86 10.38 -132  -7   125
## ESR             8 107   14.64 11.66   14.0   13.33  2.97    2 123   121
## CRP             9 107    6.95  9.38    6.0    5.86  1.48    2 100    98
## Leucocytes     10 107    6.27  3.21    6.0    5.94  1.48    2  29    27
## Outcome*       11 107    1.00  0.00    1.0    1.00  0.00    1   1     0
##              skew kurtosis   se
## GammaGT      3.27    11.81 1.93
## ASAT         5.29    34.23 2.62
## ALAT         3.32    12.22 1.88
## Bilirubin    8.58    78.78 1.95
## Urea         8.48    79.10 0.40
## Creatinine   7.10    52.78 8.56
## CreClearance 3.08    14.75 1.58
## ESR          7.52    66.92 1.13
## CRP          9.11    87.38 0.91
## Leucocytes   3.50    21.50 0.31
## Outcome*      NaN      NaN 0.00
## 
## $Death
##              vars  n   mean     sd median trimmed    mad    min  max
## GammaGT         1 93 402.31 432.37    269  320.21 220.91   13.0 2300
## ASAT            2 93 388.99 413.80    277  316.08 281.69    3.0 1980
## ALAT            3 93 380.80 369.73    267  322.85 257.97   15.0 1500
## Bilirubin       4 93 126.46 104.56     85  115.44  88.96    2.0  400
## Urea            5 93  26.77  22.65     18   23.31  15.72    3.2   98
## Creatinine      6 93 306.35 211.69    241  280.61 154.19   71.0  865
## CreClearance    7 93 -44.81  33.76    -38  -40.71  35.58 -132.0   -4
## ESR             8 93  66.86  42.14     48   62.32  34.10    7.0  180
## CRP             9 93  41.98  44.07     27   32.99  16.31    3.0  243
## Leucocytes     10 93  18.69   5.91     18   18.40   5.93    6.0   30
## Outcome*       11 93   2.00   0.00      2    2.00   0.00    2.0    2
##               range  skew kurtosis    se
## GammaGT      2287.0  1.86     3.63 44.84
## ASAT         1977.0  2.08     4.84 42.91
## ALAT         1485.0  1.33     1.06 38.34
## Bilirubin     398.0  0.84    -0.21 10.84
## Urea           94.8  1.26     0.85  2.35
## Creatinine    794.0  1.00    -0.23 21.95
## CreClearance  128.0 -0.91     0.02  3.50
## ESR           173.0  0.79    -0.46  4.37
## CRP           240.0  2.24     5.31  4.57
## Leucocytes     24.0  0.39    -0.54  0.61
## Outcome*        0.0   NaN      NaN  0.00
## 
## attr(,"call")
## by.data.frame(data = x, INDICES = group, FUN = describe, type = type)

data%>%gather(GammaGT:Leucocytes,key="Marker",value="Value")%>%ggplot(aes(x=Value,fill=Outcome,color=Outcome))+geom_density(alpha=0.6,show.legend =T)+facet_wrap(~Marker,scales="free",ncol=4)+scale_fill_brewer(palette = "Set1")+scale_color_brewer(palette = "Set1")

Figure 2: Heatmap plot showing the level of 10 sepsis biomarkers in function of survival outcome

dfscale<-data[,-c(1,12)]%>%as.matrix()%>%scale()%>%as_tibble()%>%mutate(.,Outcome=data$Outcome,Id=row.names(.))

dfscale%>%gather(GammaGT:Leucocytes,key="Marker",value="Value")%>%ggplot(aes(x=reorder(Id,Value),y=reorder(Marker,Value),fill=Value))+geom_tile(color="black",show.legend=T)+facet_wrap(~Outcome,ncol=1,shrink=T,scale="free")+scale_fill_gradient2(low="#fcde00",mid="#fc3232",high="#7901a8",midpoint=3)+theme(axis.text.x=element_blank())+scale_y_discrete("Markers")+scale_x_discrete("Patient's Id")

We summon the “mighty”" caret package then perform data splitting

library(caret)

set.seed(123)
idTrain=createDataPartition(y=data$Death, p=160/200,list=FALSE)
trainset=data[idTrain,-1]
testset=data[-idTrain,-1]

As you can see, the original dataset was splitted into 2 subsets; train (n=160) and test (n=40)

Now, we will activate the Training-Tuning and Resampling process for two algorithms

# Extreme gradient boosting tree model

set.seed(123)
XGBT<-train(Outcome~.,
                  data=trainset,
                  method="xgbTree",
                  preProcess = NULL,
                  tuneLength=5,
                  trControl = trainControl(method = "repeatedcv", number=5,repeats=5,verboseIter = FALSE,summaryFunction=multiClassSummary)
)

#Support vector machine with Linear Kernel model

set.seed(123)
SVM<-train(Outcome~.,
            data=trainset,
            method="svmLinear",
            preProcess = NULL,
            tuneLength=5,
            trControl = trainControl(method = "repeatedcv", number=5,repeats=5,verboseIter = FALSE,summaryFunction=multiClassSummary)
)

The Tuning process on XGBT could be presented as follow:

ggplot(XGBT)+scale_color_brewer(palette = "Set1")

The content of each model

XGBT$finalModel

## ##### xgb.Booster
## raw: 14.7 Kb 
## call:
##   xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, 
##     gamma = param$gamma, colsample_bytree = param$colsample_bytree, 
##     min_child_weight = param$min_child_weight), data = dat, nrounds = param$nrounds, 
##     objective = "binary:logistic")
## params (as set within xgb.train):
##   eta = "0.4", max_depth = "2", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", objective = "binary:logistic", silent = "1"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## niter: 50
## xNames: GammaGTASATALATBilirubinUreaCreatinineCreClearanceESRCRPLeucocytes
## problemType: Classification
## tuneValue:
##     nrounds max_depth eta gamma colsample_bytree min_child_weight
## 61      50         2 0.4     0              0.6                1
## obsLevels: SurviveDeath

SVM$finalModel

## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 29 
## 
## Objective Function Value : -21.7457 
## Training error : 0.05

Now the duel between those two monsters really begins

ROUND 1) Confusion matrix: Averaged performance by training

confusionMatrix(XGBT,positive ="Death")

## Cross-Validated (5 fold, repeated 5 times) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction Survive Death
##    Survive    51.2   1.0
##    Death       1.9  45.9
##                             
##  Accuracy (average) : 0.9712

confusionMatrix(SVM,positive ="Death")

## Cross-Validated (5 fold, repeated 5 times) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction Survive Death
##    Survive    48.2   2.5
##    Death       4.9  44.4
##                             
##  Accuracy (average) : 0.9263

The confusion Matrix on averaged training performance of two models suggest that the XGBT performs slightly better.

So XGBT won this round.

ROUND 2: Averaged Performance through CV Resampling

This round, each model will show their performance through 25 steps of CV:

Figure 3A,B,C: Performance metrics of 2 models

resXGB<-XGBT$resample%>%separate(Resample,into = c("Fold", "Iter"), sep = ".Rep")
resXGB=dplyr::rename(resXGB,ACC=Accuracy,KAP=Kappa,SEN=Sensitivity,SPEC=Specificity,PPV=Pos_Pred_Value,NPV=Neg_Pred_Value,BAC=Balanced_Accuracy)%>%.[,-7]
resXGB$Iteration=c(1:nrow(resXGB))
resXGB$model="XGB"

resSVM<-SVM$resample%>%separate(Resample,into = c("Fold", "Iter"), sep = ".Rep")
resSVM=dplyr::rename(resSVM,ACC=Accuracy,KAP=Kappa,SEN=Sensitivity,SPEC=Specificity,PPV=Pos_Pred_Value,NPV=Neg_Pred_Value,BAC=Balanced_Accuracy)%>%.[,-7]
resSVM$Iteration=c(1:nrow(resSVM))
resSVM$model="SVM"

resXGB=gather(resXGB,ACC:BAC,key="Metric",value="Value")
resSVM=gather(resSVM,ACC:BAC,key="Metric",value="Value")

resdf=rbind(resSVM,resXGB)

resdf%>%ggplot(aes(x=Metric,y=Value,fill=model,color=model))+stat_summary(fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,size=1)+facet_grid(model~.)+coord_flip()+scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1")

resdf%>%ggplot(aes(x=Metric,y=Value))+geom_boxplot(aes(fill=model),alpha=0.6)+coord_flip()+facet_wrap(~Metric,scales="free",ncol=2)+theme(axis.text.y=element_blank())+scale_fill_brewer(palette = "Set1")+scale_color_brewer(palette = "Set1")

resdf%>%ggplot(aes(x=Value))+geom_freqpoly(aes(fill=model,color=model),size=1)+facet_wrap(~Metric,scales="free",ncol=2)+theme(axis.text.y=element_blank())+scale_color_brewer(palette = "Set1")

Figure 4A,B: Evolution of performance metrics of two models through 25 iteration steps and by each fold

resdf%>%ggplot(aes(x=Iteration,y=Value))+geom_path(aes(color=model),size=1,alpha=0.7)+facet_wrap(~Metric,scales="free",ncol=2)+scale_color_brewer(palette = "Set1")

resdf%>%ggplot(aes(x=Iter,y=Value,group=model))+geom_line(aes(color=model),size=1,show.legend=F)+facet_grid(Fold~Metric,scales="free")+scale_color_brewer(palette = "Set1")

Based on those graphs, the XGBT seems to perform better than SVM by most of criteria. The XGBT produced a more steady evolution, lower variability and higher averaged value of performance metrics in each one of 5 folds and through out overall 25 iteration steps.

To confirm this, we will use a Wilcoxon-sign rank test (a non parametric that is equivalent to paired samples t-test):

resdf%>%select(.,4:6)%>%split(.$Metric)%>%map(~wilcox.test(.$Value~.$model,paired=T))%>%map(~.[3])

## $ACC
## $ACC$p.value
## [1] 0.001038597
## 
## 
## $BAC
## $BAC$p.value
## [1] 0.0009421216
## 
## 
## $KAP
## $KAP$p.value
## [1] 0.001168321
## 
## 
## $NPV
## $NPV$p.value
## [1] 0.004364185
## 
## 
## $PPV
## $PPV$p.value
## [1] 0.0210427
## 
## 
## $SEN
## $SEN$p.value
## [1] 0.005713921
## 
## 
## $SPEC
## $SPEC$p.value
## [1] 0.0271587

The results show that XGBT model was really better in some aspects (p<0.05 for all paired comparison)

ROUND 3: External Validation

As the internal validation is not enough, we will validate each model on the same external dataset (n=40). Here are the results of Confusion Matrix analysis

predxgb<-predict(XGBT,testset,type="prob")%>%cbind(testset,.)
predxgb$Predicted=predict(XGBT,testset)
predxgb$Model="XGBT"

predsvm<-testset%>%mutate(.,Predicted=predict(SVM,.))
predsvm$Model="SVM"

confusionMatrix(predxgb$Predicted, reference=predxgb$Outcome,positive ="Death",mode="everything")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survive Death
##    Survive      22     0
##    Death         0    18
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9119, 1)
##     No Information Rate : 0.55       
##     P-Value [Acc > NIR] : 4.116e-11  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##               Precision : 1.00       
##                  Recall : 1.00       
##                      F1 : 1.00       
##              Prevalence : 0.45       
##          Detection Rate : 0.45       
##    Detection Prevalence : 0.45       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : Death      
##

confusionMatrix(predsvm$Predicted, reference=predsvm$Outcome,positive ="Death",mode="everything")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survive Death
##    Survive      22     0
##    Death         0    18
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9119, 1)
##     No Information Rate : 0.55       
##     P-Value [Acc > NIR] : 4.116e-11  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##               Precision : 1.00       
##                  Recall : 1.00       
##                      F1 : 1.00       
##              Prevalence : 0.45       
##          Detection Rate : 0.45       
##    Detection Prevalence : 0.45       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : Death      
##

pred<-rbind(predxgb[-c(12:13)],predsvm)

pred%>%ggplot(aes(x=Predicted))+geom_bar(aes(fill=Outcome),color="black",stat="count",position="fill",alpha=0.6)+facet_grid(Model~.,scales="free")+
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1")

predxgb%>%ggplot(aes(x=Death))+geom_histogram(aes(fill=Outcome,color=Outcome),alpha=0.6)+facet_grid(Outcome~.,scales="free")+
  scale_fill_brewer(palette = "Set1") +
  scale_color_brewer(palette = "Set1")

Those results indicate that both models worked perfectly well on an independent dataset. This is a draw match.

Discussion: To this point, the results of our benchmark experiment could be summarized as follow:

The XGBT is slighly better than the SVM with linear Kernel. It represents a steady and higher performance through out the resamling process. However, training a XGBT would cost much more time and this is a limitation for applied on larger dataset.
The external validation shows that both models (XGBT and SVM) are perfect for predicting the survival outcome in 40 patients. They could be considered equivalent for clinical application.

Advantage of XGBT model is that it allows us to understand the attribution of each biomarker to the survival prognosis. The role of each biomarker could be seen in Figures 5A and 5B:

Figure 5A,B:Importance features in the XGBT model

features<-trainset[,-11]%>%apply(.,2,function(x) as.numeric(as.character(x)))%>%as.matrix()%>%colnames()

xgboost::xgb.importance(features,XGBT$finalModel)%>%xgboost::xgb.ggplot.importance()

xgboost::xgb.importance(features,XGBT$finalModel)%>%gather(Gain:Frequency,key="Metric",value="Value")%>%ggplot(aes(x=reorder(Feature,Value),y=Value,fill=reorder(Feature,Value)))+geom_bar(color="black",width=0.5,stat="identity",alpha=0.7)+facet_wrap(~Metric,ncol=3,scales="free")+scale_x_discrete("Marker")+coord_flip()

Figure 6: Summarized result

datalog<-data%>%.[,c(2:7,9:11)]%>%log2(.)%>%mutate(.,Outcome=data$Outcome,PredXGBT=predict(XGBT,data[,-1]),PredSVM=predict(SVM,data[,-1]))

plotfuncLow <- function(data,mapping){
  p <- ggplot(data = data,mapping=mapping)+geom_point(aes(fill=datalog$PredXGBT),shape=21,color="black")+stat_density2d(geom="polygon",aes(fill=datalog$PredXGBT,alpha = ..level..))+scale_fill_manual(values=c("blue","red"))
  p
}

plotfuncUp <- function(data,mapping){
  p <- ggplot(data = data,mapping=mapping)+geom_point(aes(fill=datalog$PredSVM),shape=21,color="black")+stat_density2d(geom="polygon",aes(fill=datalog$PredSVM,alpha = ..level..))+scale_fill_manual(values=c("violet","gold"))
  p
}

library(GGally)

ggpairs(datalog,columns = 1:9,lower = list(continuous=plotfuncLow),diag=NULL,upper = list(continuous=plotfuncUp))

The figure 6 displays the classification by XGBT model (lower, blue= Survived, red=Death) versus SVM model (upper,violet=survived,orange=death) based on a scatter-plot matrix of most important biomarkers (logarithmic scale).

Conclusion: Based on our benchmark experiment, the XGBT seems to be silghtly better than the SVM algorith and both of them could be applied for our clinical context. Both models showed an excellent prognostic performance for survival outcome in patients with sepsis.

END

MLM Case study 57

Le Dong Nhat Nam (MD, PhD)

20 jan 2017

Case study N57: Extreme Gradient Boosting Tree V.S Support Vector Machine