Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project was launched in 2016 and aims to:
1. Encourage the use of Machine Learning techniques in medical research in Vietnam
2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.
Background
The main objective of Case Study 52 (Chapter 8, Machine Learning in Medicine-Cookbook 3 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) is to compare the performance of two diagnostic tests for Peripheral Vascular Disease.
There are two datasets from two parallel groups: the first group was evaluated with an old test (n = 588) and the second with a new test (n = 650). The study question is whether the new test performs better than the old one.
Looking at these data, we find that each set contains only two variables: the test score and the true diagnosis. Does this look familiar? Yes! Any medical student could quickly solve this problem using Receiver Operating Characteristic (ROC) analysis (also known as the C-statistic). However, our experiment will go beyond the ROC curve with some Machine Learning techniques.
Materials and methods
The original datasets were provided in Chapter 8 of Machine Learning in Medicine-Cookbook 3 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS SAV format (vascdisease1.sav and vascdisease2.sav) from the publisher's website: extras.springer.com.
Data analysis was performed in the R statistical programming language. First, we perform a conventional ROC curve analysis using the pROC package; then we try an alternative approach that consists of a benchmark study of several Machine Learning algorithms. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) will be used throughout the benchmark experiment, including model training and resampling by repeated k-fold cross-validation.
Instead of evaluating the diagnostic tests directly, our benchmark study evaluates the performance of classification rules based on four basic algorithms: Naive Bayes, k-Nearest Neighbors (KNN), logistic regression and decision tree.
First, the following packages must be loaded in RStudio (some other packages may also be required during the analysis), and we will personalize the aesthetics for ggplot2:
library(tidyverse)
library(foreign)
library(pROC)
library(mlr)
library(gridExtra)
my_theme <- function(base_size = 8, base_family = "sans") {
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 7),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 11),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#fffece"),
      strip.background = element_rect(fill = "#ffb20c", color = "black", size = 0.5),
      strip.text = element_text(face = "bold", size = 8, color = "black"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())
myfillcolors = c("#ee0cff", "#ffbb00", "#ff0c4c", "#ff300c", "#aaff00", "red")
mycolors = c("#9e00aa", "#ff8300", "#d1063b", "#db2202", "#28b201", "red3")
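If some of these packages are missing, they can be installed from CRAN first (a minimal sketch; e1071, kknn and rpart are assumed here as the back-ends that mlr will call later for Naive Bayes, KNN and the decision tree):
install.packages(c("tidyverse", "foreign", "pROC", "mlr", "gridExtra",
                   "psych", "PMCMR", "DescTools", "e1071", "kknn", "rpart"))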
Results
Step 0) Data loading and reshaping
require(foreign)
df1 = read.spss("vascdisease1.sav", use.value.labels = TRUE, to.data.frame = TRUE) %>% as_tibble() %>% na.omit()
df1$VAR00002 <- df1$VAR00002 %>% as.factor() %>% recode_factor(`0` = "No", `1` = "Yes")
df1 <- df1 %>% rename(Testscore = VAR00001, Disease = VAR00002) %>% mutate(Type = "Oldtest")
df2 = read.spss("vascdisease2.sav", use.value.labels = TRUE, to.data.frame = TRUE) %>% as_tibble() %>% na.omit()
df2$VAR00002 <- df2$VAR00002 %>% as.factor() %>% recode_factor(`0` = "No", `1` = "Yes")
df2 <- df2 %>% rename(Testscore = VAR00001, Disease = VAR00002) %>% mutate(Type = "Newtest")
df1
## # A tibble: 588 × 3
## Testscore Disease Type
## <dbl> <fctr> <chr>
## 1 1 No Oldtest
## 2 1 No Oldtest
## 3 2 No Oldtest
## 4 2 No Oldtest
## 5 3 No Oldtest
## 6 3 No Oldtest
## 7 3 No Oldtest
## 8 4 No Oldtest
## 9 4 No Oldtest
## 10 4 No Oldtest
## # ... with 578 more rows
df2
## # A tibble: 650 × 3
## Testscore Disease Type
## <dbl> <fctr> <chr>
## 1 1 No Newtest
## 2 2 No Newtest
## 3 2 No Newtest
## 4 3 No Newtest
## 5 3 No Newtest
## 6 3 No Newtest
## 7 4 No Newtest
## 8 4 No Newtest
## 9 4 No Newtest
## 10 4 No Newtest
## # ... with 640 more rows
Once the above commands are executed, we get two tibbles named df1 and df2, which will be used throughout our analysis.
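As a quick sanity check, we can verify the group sizes and the number of diseased subjects in each tibble (a small sketch with dplyr, already loaded via tidyverse):
df1 %>% count(Disease)  # old test group, 588 subjects in total
df2 %>% count(Disease)  # new test group, 650 subjects in total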
Step 1) Conventional ROC curve analysis
Figure 1: ROC curves for evaluating performance of two diagnostic tests
# ROC curve of the new test (df2)
rocobj2 <- plot.roc(df2$Disease, df2$Testscore,
                    main = "Old vs New test",
                    col = "purple", grid = TRUE)
# ROC curve of the old test (df1), overlaid on the same plot
rocobj1 <- lines.roc(df1$Disease, df1$Testscore, col = "orange")
# One-sided DeLong test: is the new AUC greater than the old one?
testobj <- roc.test(rocobj2, rocobj1, method = "delong", alternative = "greater")
text(0.4, 0.3, labels = paste("p-value =", format.pval(testobj$p.value), "(by DeLong)"), cex = .75)
text(0.4, 0.25, labels = paste("AUC New =", testobj$estimate[1]), cex = .75)
text(0.4, 0.2, labels = paste("AUC Old =", testobj$estimate[2]), cex = .75)
legend("topright", legend = c("Old", "New"), col = c("orange", "purple"), lwd = 2)
Figure 1 shows that the ROC-AUC of the new test is significantly higher than that of the old test (0.974 vs 0.945, p = 0.001 by the DeLong method).
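If desired, pROC can also attach a 95% confidence interval to each AUC (a short sketch; for an unsmoothed ROC object, ci.auc() uses the DeLong method by default):
ci.auc(rocobj1)  # 95% CI of the AUC for the old test
ci.auc(rocobj2)  # 95% CI of the AUC for the new test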
Tables 1A, 1B: Threshold analysis for the Old and New tests
roc1 = roc(df1$Disease, df1$Testscore, direction = "auto", ci = TRUE, smooth = FALSE)
pre1 = length(which(df1$Disease == "Yes")) / nrow(df1)  # disease prevalence in the old-test group
rocdf1 = as_tibble(cbind(roc1$thresholds, roc1$sensitivities, roc1$specificities)) %>%
  dplyr::rename(Cutoff = V1, Sens = V2, Spec = V3)
rocdf1 <- rocdf1 %>%
  mutate(FPR = 1 - Spec,
         FNR = 1 - Sens,
         PPV = (Sens * pre1) / (Sens * pre1 + (1 - Spec) * (1 - pre1)),
         NPV = (Spec * (1 - pre1)) / ((1 - Sens) * pre1 + Spec * (1 - pre1)),
         d_distance = sqrt((1 - Sens)^2 + (1 - Spec)^2),
         Youden_J = Sens + Spec - 1,
         Test = "Old") %>%
  na.omit()
rocdf1 %>% knitr::kable()
| Cutoff | Sens | Spec | FPR | FNR | PPV | NPV | d_distance | Youden_J | Test |
|---|---|---|---|---|---|---|---|---|---|
| 1.5 | 1.0000000 | 0.0072202 | 0.9927798 | 0.0000000 | 0.5307167 | 1.0000000 | 0.9927798 | 0.0072202 | Old |
| 2.5 | 1.0000000 | 0.0144404 | 0.9855596 | 0.0000000 | 0.5325342 | 1.0000000 | 0.9855596 | 0.0144404 | Old |
| 3.5 | 1.0000000 | 0.0252708 | 0.9747292 | 0.0000000 | 0.5352840 | 1.0000000 | 0.9747292 | 0.0252708 | Old |
| 4.5 | 1.0000000 | 0.0433213 | 0.9566787 | 0.0000000 | 0.5399306 | 1.0000000 | 0.9566787 | 0.0433213 | Old |
| 5.5 | 1.0000000 | 0.0685921 | 0.9314079 | 0.0000000 | 0.5465729 | 1.0000000 | 0.9314079 | 0.0685921 | Old |
| 6.5 | 1.0000000 | 0.1046931 | 0.8953069 | 0.0000000 | 0.5563506 | 1.0000000 | 0.8953069 | 0.1046931 | Old |
| 7.5 | 1.0000000 | 0.1480144 | 0.8519856 | 0.0000000 | 0.5685558 | 1.0000000 | 0.8519856 | 0.1480144 | Old |
| 8.5 | 1.0000000 | 0.2021661 | 0.7978339 | 0.0000000 | 0.5845865 | 1.0000000 | 0.7978339 | 0.2021661 | Old |
| 9.5 | 1.0000000 | 0.2599278 | 0.7400722 | 0.0000000 | 0.6027132 | 1.0000000 | 0.7400722 | 0.2599278 | Old |
| 10.5 | 1.0000000 | 0.3212996 | 0.6787004 | 0.0000000 | 0.6232465 | 1.0000000 | 0.6787004 | 0.3212996 | Old |
| 11.5 | 1.0000000 | 0.3826715 | 0.6173285 | 0.0000000 | 0.6452282 | 1.0000000 | 0.6173285 | 0.3826715 | Old |
| 12.5 | 1.0000000 | 0.4476534 | 0.5523466 | 0.0000000 | 0.6702586 | 1.0000000 | 0.5523466 | 0.4476534 | Old |
| 13.5 | 1.0000000 | 0.5126354 | 0.4873646 | 0.0000000 | 0.6973094 | 1.0000000 | 0.4873646 | 0.5126354 | Old |
| 14.5 | 1.0000000 | 0.5812274 | 0.4187726 | 0.0000000 | 0.7283372 | 1.0000000 | 0.4187726 | 0.5812274 | Old |
| 15.5 | 0.9903537 | 0.6425993 | 0.3574007 | 0.0096463 | 0.7567568 | 0.9834254 | 0.3575309 | 0.6329530 | Old |
| 16.5 | 0.9774920 | 0.6967509 | 0.3032491 | 0.0225080 | 0.7835052 | 0.9650000 | 0.3040833 | 0.6742429 | Old |
| 17.5 | 0.9581994 | 0.7545126 | 0.2454874 | 0.0418006 | 0.8142077 | 0.9414414 | 0.2490208 | 0.7127120 | Old |
| 18.5 | 0.9324759 | 0.7942238 | 0.2057762 | 0.0675241 | 0.8357349 | 0.9128631 | 0.2165718 | 0.7266997 | Old |
| 19.5 | 0.9035370 | 0.8303249 | 0.1696751 | 0.0964630 | 0.8567073 | 0.8846154 | 0.1951788 | 0.7338619 | Old |
| 20.5 | 0.8842444 | 0.8592058 | 0.1407942 | 0.1157556 | 0.8757962 | 0.8686131 | 0.1822701 | 0.7434501 | Old |
| 21.5 | 0.8392283 | 0.8772563 | 0.1227437 | 0.1607717 | 0.8847458 | 0.8293515 | 0.2022710 | 0.7164846 | Old |
| 22.5 | 0.7813505 | 0.9061372 | 0.0938628 | 0.2186495 | 0.9033457 | 0.7868339 | 0.2379450 | 0.6874877 | Old |
| 23.5 | 0.7170418 | 0.9314079 | 0.0685921 | 0.2829582 | 0.9214876 | 0.7456647 | 0.2911532 | 0.6484497 | Old |
| 24.5 | 0.6430868 | 0.9494585 | 0.0505415 | 0.3569132 | 0.9345794 | 0.7032086 | 0.3604739 | 0.5925453 | Old |
| 25.5 | 0.5723473 | 0.9675090 | 0.0324910 | 0.4276527 | 0.9518717 | 0.6683292 | 0.4288852 | 0.5398563 | Old |
| 26.5 | 0.4951768 | 0.9819495 | 0.0180505 | 0.5048232 | 0.9685535 | 0.6340326 | 0.5051458 | 0.4771263 | Old |
| 27.5 | 0.4244373 | 0.9927798 | 0.0072202 | 0.5755627 | 0.9850746 | 0.6057269 | 0.5756080 | 0.4172171 | Old |
| 28.5 | 0.3504823 | 1.0000000 | 0.0000000 | 0.6495177 | 1.0000000 | 0.5782881 | 0.6495177 | 0.3504823 | Old |
| 29.5 | 0.2797428 | 1.0000000 | 0.0000000 | 0.7202572 | 1.0000000 | 0.5528942 | 0.7202572 | 0.2797428 | Old |
| 30.5 | 0.2122186 | 1.0000000 | 0.0000000 | 0.7877814 | 1.0000000 | 0.5306513 | 0.7877814 | 0.2122186 | Old |
| 31.5 | 0.1511254 | 1.0000000 | 0.0000000 | 0.8488746 | 1.0000000 | 0.5120148 | 0.8488746 | 0.1511254 | Old |
| 32.5 | 0.0996785 | 1.0000000 | 0.0000000 | 0.9003215 | 1.0000000 | 0.4973070 | 0.9003215 | 0.0996785 | Old |
| 33.5 | 0.0610932 | 1.0000000 | 0.0000000 | 0.9389068 | 1.0000000 | 0.4868190 | 0.9389068 | 0.0610932 | Old |
| 34.5 | 0.0353698 | 1.0000000 | 0.0000000 | 0.9646302 | 1.0000000 | 0.4800693 | 0.9646302 | 0.0353698 | Old |
| 35.5 | 0.0192926 | 1.0000000 | 0.0000000 | 0.9807074 | 1.0000000 | 0.4759450 | 0.9807074 | 0.0192926 | Old |
| 36.5 | 0.0096463 | 1.0000000 | 0.0000000 | 0.9903537 | 1.0000000 | 0.4735043 | 0.9903537 | 0.0096463 | Old |
| 37.5 | 0.0032154 | 1.0000000 | 0.0000000 | 0.9967846 | 1.0000000 | 0.4718910 | 0.9967846 | 0.0032154 | Old |
roc2 = roc(df2$Disease, df2$Testscore, direction = "auto", ci = TRUE, smooth = FALSE)
pre2 = length(which(df2$Disease == "Yes")) / nrow(df2)
rocdf2 = as_tibble(cbind(roc2$thresholds, roc2$sensitivities, roc2$specificities)) %>%
  dplyr::rename(Cutoff = V1, Sens = V2, Spec = V3)
rocdf2 <- rocdf2 %>%
  mutate(FPR = 1 - Spec,
         FNR = 1 - Sens,
         PPV = (Sens * pre2) / (Sens * pre2 + (1 - Spec) * (1 - pre2)),
         NPV = (Spec * (1 - pre2)) / ((1 - Sens) * pre2 + Spec * (1 - pre2)),
         d_distance = sqrt((1 - Sens)^2 + (1 - Spec)^2),
         Youden_J = Sens + Spec - 1,
         Test = "New") %>%
  na.omit()
rocdf2 %>% knitr::kable()
| Cutoff | Sens | Spec | FPR | FNR | PPV | NPV | d_distance | Youden_J | Test |
|---|---|---|---|---|---|---|---|---|---|
| 1.5 | 1.0000000 | 0.0037453 | 0.9962547 | 0.0000000 | 0.5901387 | 1.0000000 | 0.9962547 | 0.0037453 | New |
| 2.5 | 1.0000000 | 0.0112360 | 0.9887640 | 0.0000000 | 0.5919629 | 1.0000000 | 0.9887640 | 0.0112360 | New |
| 3.5 | 1.0000000 | 0.0224719 | 0.9775281 | 0.0000000 | 0.5947205 | 1.0000000 | 0.9775281 | 0.0224719 | New |
| 4.5 | 1.0000000 | 0.0411985 | 0.9588015 | 0.0000000 | 0.5993740 | 1.0000000 | 0.9588015 | 0.0411985 | New |
| 5.5 | 1.0000000 | 0.0711610 | 0.9288390 | 0.0000000 | 0.6069731 | 1.0000000 | 0.9288390 | 0.0711610 | New |
| 6.5 | 1.0000000 | 0.1161049 | 0.8838951 | 0.0000000 | 0.6187399 | 1.0000000 | 0.8838951 | 0.1161049 | New |
| 7.5 | 1.0000000 | 0.1647940 | 0.8352060 | 0.0000000 | 0.6320132 | 1.0000000 | 0.8352060 | 0.1647940 | New |
| 8.5 | 1.0000000 | 0.2322097 | 0.7677903 | 0.0000000 | 0.6513605 | 1.0000000 | 0.7677903 | 0.2322097 | New |
| 9.5 | 1.0000000 | 0.3033708 | 0.6966292 | 0.0000000 | 0.6731107 | 1.0000000 | 0.6966292 | 0.3033708 | New |
| 10.5 | 1.0000000 | 0.3782772 | 0.6217228 | 0.0000000 | 0.6976321 | 1.0000000 | 0.6217228 | 0.3782772 | New |
| 11.5 | 1.0000000 | 0.4569288 | 0.5430712 | 0.0000000 | 0.7253788 | 1.0000000 | 0.5430712 | 0.4569288 | New |
| 12.5 | 1.0000000 | 0.5355805 | 0.4644195 | 0.0000000 | 0.7554241 | 1.0000000 | 0.4644195 | 0.5355805 | New |
| 13.5 | 1.0000000 | 0.6179775 | 0.3820225 | 0.0000000 | 0.7896907 | 1.0000000 | 0.3820225 | 0.6179775 | New |
| 14.5 | 0.9869452 | 0.6928839 | 0.3071161 | 0.0130548 | 0.8217391 | 0.9736842 | 0.3073934 | 0.6798291 | New |
| 15.5 | 0.9686684 | 0.7602996 | 0.2397004 | 0.0313316 | 0.8528736 | 0.9441860 | 0.2417394 | 0.7289680 | New |
| 16.5 | 0.9477807 | 0.8277154 | 0.1722846 | 0.0522193 | 0.8875306 | 0.9170124 | 0.1800246 | 0.7754960 | New |
| 17.5 | 0.9242820 | 0.8838951 | 0.1161049 | 0.0757180 | 0.9194805 | 0.8905660 | 0.1386130 | 0.8081771 | New |
| 18.5 | 0.8955614 | 0.9288390 | 0.0711610 | 0.1044386 | 0.9475138 | 0.8611111 | 0.1263777 | 0.8244003 | New |
| 19.5 | 0.8616188 | 0.9513109 | 0.0486891 | 0.1383812 | 0.9620991 | 0.8273616 | 0.1466969 | 0.8129297 | New |
| 20.5 | 0.8224543 | 0.9700375 | 0.0299625 | 0.1775457 | 0.9752322 | 0.7920489 | 0.1800562 | 0.7924918 | New |
| 21.5 | 0.7728460 | 0.9812734 | 0.0187266 | 0.2271540 | 0.9833887 | 0.7507163 | 0.2279247 | 0.7541194 | New |
| 22.5 | 0.7127937 | 0.9925094 | 0.0074906 | 0.2872063 | 0.9927273 | 0.7066667 | 0.2873039 | 0.7053031 | New |
| 23.5 | 0.6475196 | 1.0000000 | 0.0000000 | 0.3524804 | 1.0000000 | 0.6641791 | 0.3524804 | 0.6475196 | New |
| 24.5 | 0.5744125 | 1.0000000 | 0.0000000 | 0.4255875 | 1.0000000 | 0.6209302 | 0.4255875 | 0.5744125 | New |
| 25.5 | 0.5039164 | 1.0000000 | 0.0000000 | 0.4960836 | 1.0000000 | 0.5842451 | 0.4960836 | 0.5039164 | New |
| 26.5 | 0.4281984 | 1.0000000 | 0.0000000 | 0.5718016 | 1.0000000 | 0.5493827 | 0.5718016 | 0.4281984 | New |
| 27.5 | 0.3577023 | 1.0000000 | 0.0000000 | 0.6422977 | 1.0000000 | 0.5204678 | 0.6422977 | 0.3577023 | New |
| 28.5 | 0.2845953 | 1.0000000 | 0.0000000 | 0.7154047 | 1.0000000 | 0.4935305 | 0.7154047 | 0.2845953 | New |
| 29.5 | 0.2271540 | 1.0000000 | 0.0000000 | 0.7728460 | 1.0000000 | 0.4742451 | 0.7728460 | 0.2271540 | New |
| 30.5 | 0.1723238 | 1.0000000 | 0.0000000 | 0.8276762 | 1.0000000 | 0.4571918 | 0.8276762 | 0.1723238 | New |
| 31.5 | 0.1227154 | 1.0000000 | 0.0000000 | 0.8772846 | 1.0000000 | 0.4427861 | 0.8772846 | 0.1227154 | New |
| 32.5 | 0.0809399 | 1.0000000 | 0.0000000 | 0.9190601 | 1.0000000 | 0.4313409 | 0.9190601 | 0.0809399 | New |
| 33.5 | 0.0496084 | 1.0000000 | 0.0000000 | 0.9503916 | 1.0000000 | 0.4231379 | 0.9503916 | 0.0496084 | New |
| 34.5 | 0.0287206 | 1.0000000 | 0.0000000 | 0.9712794 | 1.0000000 | 0.4178404 | 0.9712794 | 0.0287206 | New |
| 35.5 | 0.0156658 | 1.0000000 | 0.0000000 | 0.9843342 | 1.0000000 | 0.4145963 | 0.9843342 | 0.0156658 | New |
| 36.5 | 0.0078329 | 1.0000000 | 0.0000000 | 0.9921671 | 1.0000000 | 0.4126739 | 0.9921671 | 0.0078329 | New |
| 37.5 | 0.0026110 | 1.0000000 | 0.0000000 | 0.9973890 | 1.0000000 | 0.4114022 | 0.9973890 | 0.0026110 | New |
Tables 1A and 1B present all the information we can extract from a ROC analysis, including sensitivity, specificity, false positive and false negative rates, positive and negative predictive values, d-distance and Youden's J index for each cut-off level of the test score.
We can identify the optimal cut-off for each test based on Youden's J index and the d-distance, as follows:
sel1 = rocdf1 %>% subset(d_distance == min(d_distance) | Youden_J == max(Youden_J))
sel2 = rocdf2 %>% subset(d_distance == min(d_distance) | Youden_J == max(Youden_J))
sel = rbind(sel1, sel2)
sel %>% knitr::kable()
| Cutoff | Sens | Spec | FPR | FNR | PPV | NPV | d_distance | Youden_J | Test |
|---|---|---|---|---|---|---|---|---|---|
| 20.5 | 0.8842444 | 0.8592058 | 0.1407942 | 0.1157556 | 0.8757962 | 0.8686131 | 0.1822701 | 0.7434501 | Old |
| 18.5 | 0.8955614 | 0.9288390 | 0.0711610 | 0.1044386 | 0.9475138 | 0.8611111 | 0.1263777 | 0.8244003 | New |
The results indicate that the optimal cut-offs for the old and new tests are 20.5 and 18.5 respectively, corresponding to a sensitivity of 0.884 vs 0.896 and a specificity of 0.859 vs 0.929. Thus, the new test provides higher sensitivity and specificity than the old one.
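As a cross-check, pROC can locate the optimal threshold directly via coords(); a minimal sketch (the exact output format depends on the pROC version):
coords(roc1, x = "best", best.method = "youden")  # should reproduce the 20.5 cut-off of the old test
coords(roc2, x = "best", best.method = "youden")  # should reproduce the 18.5 cut-off of the new test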
These results are represented in Figures 2A and 2B:
Figure 2A,B: Threshold vs Performance analysis
rocdf <- rbind(rocdf1, rocdf2) %>% gather(Sens:Youden_J, key = "Metric", value = "Value")
rocdf %>% ggplot(aes(x = Cutoff, y = Value, color = Metric)) + geom_line(size = 1) + facet_grid(Metric ~ Test, scales = "free") + geom_vline(xintercept = 18.5, color = "blue")
rocdf %>% ggplot(aes(x = Cutoff, y = Value, fill = Metric, color = Metric)) + geom_bar(stat = "identity", alpha = 0.6) + facet_grid(Metric ~ Test, scales = "free") + geom_vline(xintercept = 18.5, color = "blue")
This concludes the analysis by C-statistic. In the next step, we will go beyond the notions of probability thresholds and arbitrary cut-offs.
Step 2) Benchmark experiment
There are two questions we should answer:
1. Could the performance of each diagnostic test be improved by changing the decision rule (i.e. optimising the interpretation of the test score by implementing different classification algorithms)?
2. Can we still report our results with ROC-AUCs?
In a few minutes, you will find the answer to both of these questions:
library(mlr)
# One classification task per dataset; column 3 (Type) is dropped, "Yes" is the positive class
taskOld = makeClassifTask(id = "Old", data = df1[, -3], target = "Disease", positive = "Yes")
taskNew = makeClassifTask(id = "New", data = df2[, -3], target = "Disease", positive = "Yes")
tasks = list(taskOld, taskNew)
# 10 times repeated 10-fold cross-validation = 100 iterations per learner and task
resmp = makeResampleDesc("RepCV", folds = 10, reps = 10)
learners = list(
  makeLearner("classif.naiveBayes", id = "NBayes", predict.type = "prob", fix.factors.prediction = TRUE),
  makeLearner("classif.rpart", id = "Tree", predict.type = "prob", fix.factors.prediction = TRUE),
  makeLearner("classif.kknn", id = "KNN", predict.type = "prob", fix.factors.prediction = TRUE),
  makeLearner("classif.logreg", id = "LOG", predict.type = "prob", fix.factors.prediction = TRUE)
)
# Performance measures: ROC-AUC and balanced accuracy
mes = list(auc, bac)
set.seed(123)
bmrk = benchmark(learners, tasks, resmp, mes)
bmdata = getBMRPerformances(bmrk, as.df = TRUE) %>% as_tibble()
We have just performed a benchmark study involving four basic classification algorithms: Naive Bayes, KNN, logistic regression and CART. Each of them was trained and tested over 100 iterations on two different datasets. Their performance was evaluated by balanced accuracy and ROC-AUC (yes, the AUC of the ROC curve) in a 10x10 repeated cross-validation process.
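Before the pairwise comparisons, we can glance at the aggregated results: mlr can return the mean AUC and balanced accuracy per learner and task in a single data frame (a quick sketch):
getBMRAggrPerformances(bmrk, as.df = TRUE)  # mean performance over the 100 iterations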
Pairwise comparison of AUC among the 4 algorithms
bmdata %>% split(.$task.id) %>% map(~ psych::describeBy(.$auc, group = .$learner.id))
## $New
## $NBayes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.97 0.01 0.98 0.98 0.01 0.92 1 0.08 -1.11 2.21 0
##
## $Tree
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.91 0.03 0.92 0.91 0.03 0.82 0.97 0.15 -0.6 -0.36 0
##
## $KNN
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.96 0.02 0.96 0.96 0.02 0.85 1 0.15 -1.42 3.32 0
##
## $LOG
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.97 0.01 0.98 0.98 0.01 0.92 1 0.08 -1.11 2.21 0
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
##
## $Old
## $NBayes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.94 0.02 0.94 0.94 0.02 0.88 0.99 0.11 -0.4 -0.24 0
##
## $Tree
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 100 0.86 0.05 0.86 0.86 0.04 0.74 0.96 0.23 -0.31 -0.03
## se
## X1 0
##
## $KNN
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.92 0.03 0.92 0.92 0.03 0.81 0.99 0.18 -0.4 1.11 0
##
## $LOG
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.94 0.02 0.94 0.94 0.02 0.88 0.99 0.11 -0.4 -0.24 0
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
bmdata %>% split(.$task.id) %>% map(~ PMCMR::posthoc.kruskal.nemenyi.test(formula = auc ~ learner.id, data = ., dist = "Tukey"))
## $New
##
## Pairwise comparisons using Tukey and Kramer (Nemenyi) test
## with Tukey-Dist approximation for independent samples
##
## data: auc by learner.id
##
## NBayes Tree KNN
## Tree < 2e-16 - -
## KNN 7.6e-06 8.7e-14 -
## LOG 1 < 2e-16 7.6e-06
##
## P value adjustment method: none
##
## $Old
##
## Pairwise comparisons using Tukey and Kramer (Nemenyi) test
## with Tukey-Dist approximation for independent samples
##
## data: auc by learner.id
##
## NBayes Tree KNN
## Tree < 2e-16 - -
## KNN 5.5e-05 8.6e-14 -
## LOG 1 < 2e-16 5.5e-05
##
## P value adjustment method: none
The results show that within the same test (old or new), the diagnostic performance can be significantly modified by switching from one interpretation method to another. Indeed, the classification rules based on the Naive Bayes and logistic models yield the best AUC (0.94 for the old test and 0.97 for the new test).
Average diagnostic performance compared between the two tests
bmdata %>% split(.$learner.id) %>% map(~ psych::describeBy(.$auc, group = .$task.id))
## $NBayes
## $New
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.97 0.01 0.98 0.98 0.01 0.92 1 0.08 -1.11 2.21 0
##
## $Old
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.94 0.02 0.94 0.94 0.02 0.88 0.99 0.11 -0.4 -0.24 0
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
##
## $Tree
## $New
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.91 0.03 0.92 0.91 0.03 0.82 0.97 0.15 -0.6 -0.36 0
##
## $Old
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 100 0.86 0.05 0.86 0.86 0.04 0.74 0.96 0.23 -0.31 -0.03
## se
## X1 0
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
##
## $KNN
## $New
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.96 0.02 0.96 0.96 0.02 0.85 1 0.15 -1.42 3.32 0
##
## $Old
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.92 0.03 0.92 0.92 0.03 0.81 0.99 0.18 -0.4 1.11 0
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
##
## $LOG
## $New
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.97 0.01 0.98 0.98 0.01 0.92 1 0.08 -1.11 2.21 0
##
## $Old
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 100 0.94 0.02 0.94 0.94 0.02 0.88 0.99 0.11 -0.4 -0.24 0
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
bmdata %>% split(.$learner.id) %>% map(~ DescTools::NemenyiTest(x = .$auc, g = .$task.id, dist = "tukey"))
## $NBayes
##
## Nemenyi's test of multiple comparisons for independent samples (tukey)
##
## mean.rank.diff pval
## Old-New -74.81 2.4e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## $Tree
##
## Nemenyi's test of multiple comparisons for independent samples (tukey)
##
## mean.rank.diff pval
## Old-New -64.9 3e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## $KNN
##
## Nemenyi's test of multiple comparisons for independent samples (tukey)
##
## mean.rank.diff pval
## Old-New -67.68 2.2e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## $LOG
##
## Nemenyi's test of multiple comparisons for independent samples (tukey)
##
## mean.rank.diff pval
## Old-New -74.81 2.4e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We then compared the average AUC of the four algorithms between the old and new tests (the two datasets); the results indicate that the new test provides significantly better performance (as evaluated by AUC) than the old test.
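The same comparison can be condensed into a compact table with a grouped tidyverse summary (a sketch equivalent to the describeBy output above):
bmdata %>%
  group_by(learner.id, task.id) %>%
  summarise(mean_auc = mean(auc), sd_auc = sd(auc))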
generateCritDifferencesData(bmrk, p.value = 0.05, test = "nemenyi")
## $data
## mean.rank learner.id rank yend xend right short.name
## NBayes 1.5 NBayes 1.5 0.5 0 0 nbayes
## Tree 4.0 Tree 4.0 0.5 5 1 rpart
## KNN 3.0 KNN 3.0 1.5 5 1 kknn
## LOG 1.5 LOG 1.5 1.5 0 0 logreg
##
## $cd.info
## $cd.info$test
## [1] "nemenyi"
##
## $cd.info$cd
## [1] 3.316606
##
## $cd.info$x
## [1] 1.5
##
## $cd.info$y
## [1] 0.1
##
## $cd.info$nemenyi.data
## xstart xend diff y
## 1 1.5 4 2.5 0.1
##
##
## $friedman.nemenyi.test
##
## Friedman rank sum test
##
## data: auc.test.mean and learner.id and task.id
## Friedman chi-squared = 6, df = 3, p-value = 0.1116
##
##
## $baseline
## [1] NBayes
## Levels: KNN LOG NBayes Tree
##
## $p.value
## [1] 0.05
##
## attr(,"class")
## [1] "CritDifferencesData"
The ranking analysis suggests that the logistic and Naive Bayes models are the best algorithms for interpreting our test scores.
These results can be visualized in Figures 3A, B, C below:
bmdata %>% ggplot(aes(x = auc, fill = task.id)) + geom_density(alpha = 0.6) + facet_wrap(~ learner.id, ncol = 3, scales = "free") + geom_vline(xintercept = 0.9) + scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
bmdata %>% ggplot(aes(x = task.id, y = auc, fill = task.id, color = learner.id)) + stat_summary(fun.ymin = min, fun.ymax = max, fun.y = "median", shape = 21, alpha = 0.6, size = 1) + facet_wrap(~ learner.id, scales = "free") + coord_flip()
bmdata %>% ggplot(aes(x = learner.id, y = auc, fill = learner.id, color = learner.id)) + stat_summary(fun.ymin = min, fun.ymax = max, fun.y = "median", shape = 21, alpha = 0.6, size = 1) + facet_wrap(~ task.id, scales = "free") + coord_flip()
bmdata %>% ggplot(aes(x = iter, y = auc, color = task.id)) + geom_path(alpha = 0.9, size = 0.8) + facet_grid(learner.id ~ task.id, scales = "free") + scale_color_manual(values = c("#ff0c4c", "purple")) + geom_hline(yintercept = 0.9, color = "red")
Figure 4: Ranking of 4 algorithms by Nemenyi test
nemenyi = generateCritDifferencesData(bmrk, p.value = 0.05, test = "nemenyi")
plotCritDifferences(nemenyi) + coord_cartesian(xlim = c(-1, 5), ylim = c(0, 2))
Figure 5: ROC curves of the 4 algorithms
rocd = generateThreshVsPerfData(bmrk, measures = list(fpr, tpr))
plotROCCurves(rocd) + geom_line(size = 1, aes(color = learner)) + geom_polygon(aes(fill = learner), alpha = 0.5) + scale_fill_manual(values = myfillcolors) + scale_color_manual(values = mycolors)
Finally, we would like to reveal the content of our two best algorithms: the Naive Bayes and logistic models.
Figure 6: Classification rule by the NAIVE BAYES model
NB = makeLearner("classif.naiveBayes", id = "NB", predict.type = "prob", fix.factors.prediction = TRUE)
NBfit = train(NB, taskNew)
NBfit$learner.model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.4107692 0.5892308
##
## Conditional probabilities:
## Testscore
## Y [,1] [,2]
## No 12.05618 4.484794
## Yes 25.39687 5.106119
predict(NBfit, taskNew) %>% as_tibble() %>% mutate(TestScore = df2$Testscore) %>% ggplot(aes(x = TestScore, y = prob.Yes)) + geom_line(size = 1, color = "red")
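From the printed a-priori probabilities and Gaussian conditional parameters, we can even locate the score at which the Naive Bayes posterior crosses 0.5 (a worked sketch reusing the estimates shown above):
# Posterior P(Disease = Yes | Testscore = x), rebuilt from the fitted priors
# and the class-conditional normal densities
post_yes <- function(x) {
  num <- 0.5892308 * dnorm(x, mean = 25.39687, sd = 5.106119)
  den <- num + 0.4107692 * dnorm(x, mean = 12.05618, sd = 4.484794)
  num / den
}
uniroot(function(x) post_yes(x) - 0.5, interval = c(10, 25))$root  # approx. 17.9, near the ROC cut-off of 18.5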
Figure 7: Classification rule by the LOGISTIC model
LOG = makeLearner("classif.logreg", id = "logistic", predict.type = "prob", fix.factors.prediction = TRUE)
logistic = train(LOG, taskNew)
summary(logistic$learner.model)
##
## Call:
## stats::glm(formula = f, family = "binomial", data = getTaskData(.task,
## .subset), weights = .weights, model = FALSE)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.49821 -0.20006 0.02969 0.21166 2.12916
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.29682 0.91512 -11.25 <2e-16 ***
## Testscore 0.58140 0.05085 11.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 880.28 on 649 degrees of freedom
## Residual deviance: 261.74 on 648 degrees of freedom
## AIC: 265.74
##
## Number of Fisher Scoring iterations: 7
predict(logistic, taskNew) %>% as_tibble() %>% mutate(TestScore = df2$Testscore) %>% ggplot(aes(x = TestScore, y = prob.Yes)) + geom_line(size = 1, color = "purple")
The distribution graphs of the positive-class probability indicate that these two models are equivalent for clinical practice.
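This equivalence can be quantified: the logistic probability crosses 0.5 where the linear predictor is zero, i.e. at Testscore = 10.29682 / 0.58140 (using the coefficients printed above):
10.29682 / 0.58140  # approx. 17.71, essentially the same crossover as the Naive Bayes rule (approx. 17.9)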
Conclusion: Our study has successfully extended the conventional ROC curve analysis. This document carries two messages:
The performance of a diagnostic test can be improved by adopting the right decision rule or interpreting algorithm. Think carefully before picking an arbitrary cut-off point, because a benchmark study might give you something better.
The ROC curve as we know it is just a two-dimensional line plot showing the relationship between sensitivity and 1-specificity for a given classifier. So ANY classification rule (not only a specific test on its own, but any rule based on stand-alone or combined clinical data, including symptoms, laboratory markers and medical imaging, and simple or sophisticated algorithms...) can be visually evaluated with a ROC curve.
END