Case study 52: Beyond the ROC curves (or ML applied to Performance Evaluation)

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project was initiated in 2016 and aims to:

1. Encourage the use of Machine Learning techniques in medical research in Vietnam;

2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.

Background

The main objective of Case study 52 (Chapter 8 of Machine Learning in Medicine - Cookbook 3 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) is to compare the performance of two diagnostic tests for Peripheral Vascular Disease.

There are two datasets from two parallel groups. The first group was evaluated with the old test (n = 588) and the second with the new test (n = 650). The study question is whether the new test performs better than the old one.

Looking at these data, we find that each set contains only two variables: the test score and the true diagnosis. Does this look familiar? Yes! Any medical student can quickly solve this problem using Receiver Operating Characteristic (ROC) analysis (also known as the C-statistic). However, our experiment will go beyond the ROC curve with some machine learning techniques.

Materials and method

The original datasets were provided in Chapter 8 of Machine Learning in Medicine - Cookbook 3 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS SAV format (vascdisease1.sav and vascdisease2.sav) from the publisher's website: extras.springer.com.

Data analysis was performed in the R statistical programming language. First, we will perform a conventional ROC curve analysis using the pROC package; then we will try an alternative approach that consists of a benchmark study of several machine learning algorithms. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) will be used throughout our benchmark experiment, including model training and resampling by repeated k-fold cross-validation.

Instead of evaluating the diagnostic test directly, our benchmark study evaluates the performance of classification rules based on 4 basic algorithms: Naive Bayes, KNN, logistic regression and decision tree.

First, the following packages must be loaded in RStudio (some other packages might also be required during the analysis), and we will customize the aesthetic settings for ggplot2:

library(tidyverse)
library(foreign)
library(pROC)
library(mlr)
library(gridExtra)

my_theme <- function(base_size = 8, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size =7),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 11),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#fffece"),
      strip.background = element_rect(fill = "#ffb20c", color = "black", size =0.5),
      strip.text = element_text(face = "bold", size = 8, color = "black"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())

myfillcolors=c("#ee0cff", "#ffbb00","#ff0c4c", "#ff300c","#aaff00","red")
mycolors=c("#9e00aa","#ff8300","#d1063b" ,"#db2202","#28b201","red3")

Results

Step 0: Data loading and reshaping

require(foreign)

# Read the SPSS files, convert to tibbles and drop missing rows
df1=read.spss("vascdisease1.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()%>%na.omit()

# Recode the outcome (0/1) as a factor with labels No/Yes
df1$VAR00002<-df1$VAR00002%>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")

# Rename the variables and tag the dataset with the test type
df1<-df1%>%rename(.,Testscore=VAR00001,Disease=VAR00002)%>%mutate(.,Type="Oldtest")

# Same preprocessing for the new test
df2=read.spss("vascdisease2.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()%>%na.omit()

df2$VAR00002<-df2$VAR00002%>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")

df2<-df2%>%rename(.,Testscore=VAR00001,Disease=VAR00002)%>%mutate(.,Type="Newtest")

df1
## # A tibble: 588 × 3
##    Testscore Disease    Type
##        <dbl>  <fctr>   <chr>
## 1          1      No Oldtest
## 2          1      No Oldtest
## 3          2      No Oldtest
## 4          2      No Oldtest
## 5          3      No Oldtest
## 6          3      No Oldtest
## 7          3      No Oldtest
## 8          4      No Oldtest
## 9          4      No Oldtest
## 10         4      No Oldtest
## # ... with 578 more rows
df2
## # A tibble: 650 × 3
##    Testscore Disease    Type
##        <dbl>  <fctr>   <chr>
## 1          1      No Newtest
## 2          2      No Newtest
## 3          2      No Newtest
## 4          3      No Newtest
## 5          3      No Newtest
## 6          3      No Newtest
## 7          4      No Newtest
## 8          4      No Newtest
## 9          4      No Newtest
## 10         4      No Newtest
## # ... with 640 more rows

Once the above commands are executed, we get two tibbles named df1 and df2, which will be used throughout our analysis.

Step 1: Conventional ROC curve analysis

Figure 1: ROC curves for evaluating the performance of the two diagnostic tests

# ROC curve of the new test
rocobj2 <- plot.roc(df2$Disease, df2$Testscore,
                    main="Old vs New test",
                    col="purple",grid=T
                    )

# Add the ROC curve of the old test to the same plot
rocobj1 <- lines.roc(df1$Disease, df1$Testscore,col="orange")

# One-sided DeLong test: is the AUC of the new test greater than that of the old test?
testobj <- roc.test(rocobj2, rocobj1,method="delong",alternative="greater")

# Annotate the plot with the p-value and both AUCs
text(0.4,0.3,labels=paste("p-value =", format.pval(testobj$p.value),"(by Delong)"),cex =.75)
text(0.4,0.25,labels=paste("AUC New=",testobj$estimate[1]),cex =.75)
text(0.4,0.2,labels=paste("AUC Old=",testobj$estimate[2]),cex =.75)

legend("topright",legend=c("Old","New"),col=c("orange","purple"),lwd=2)

Figure 1 shows that the ROC-AUC of the new test is significantly higher than that of the old test (0.974 vs 0.945, p-value = 0.001 by the DeLong method).
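If we also want to report each AUC with its confidence interval, pROC can compute it directly from the fitted ROC objects; a minimal sketch (not part of the original analysis), assuming the rocobj1 and rocobj2 objects created above:

# 95% confidence intervals of the AUCs, for the old and new tests respectively
ci.auc(rocobj1)
ci.auc(rocobj2)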

Tables 1A, 1B: Threshold analysis for the Old and New tests

# ROC object and disease prevalence for the old test
roc1=roc(df1$Disease, df1$Testscore, direction="auto",ci=T,smooth=F)

pre1=length(which(df1$Disease=="Yes"))/nrow(df1)

# Collect the threshold, sensitivity and specificity of every possible cut-off
rocdf1=as_tibble(cbind(roc1$thresholds,roc1$sensitivities,roc1$specificities))%>%dplyr::rename(.,Cutoff=V1,Sens=V2,Spec=V3)

# Derive FPR, FNR, predictive values (using the prevalence), the distance to the top-left corner and Youden's J
rocdf1<-rocdf1%>%mutate(.,FPR=(1-.$Spec),
                        FNR=(1-.$Sens),
                        PPV=((.$Sens*pre1)/(.$Sens*pre1+(1-.$Spec)*(1-pre1))),
                        NPV=(.$Spec*(1-pre1)/((1-.$Sens)*pre1+.$Spec*(1-pre1))),
                        d_distance=sqrt((1-.$Sens)^2+(1-.$Spec)^2),
                        Youden_J=.$Sens+.$Spec-1,Test="Old"
)%>%na.omit()

rocdf1%>%knitr::kable()
Cutoff Sens Spec FPR FNR PPV NPV d_distance Youden_J Test
1.5 1.0000000 0.0072202 0.9927798 0.0000000 0.5307167 1.0000000 0.9927798 0.0072202 Old
2.5 1.0000000 0.0144404 0.9855596 0.0000000 0.5325342 1.0000000 0.9855596 0.0144404 Old
3.5 1.0000000 0.0252708 0.9747292 0.0000000 0.5352840 1.0000000 0.9747292 0.0252708 Old
4.5 1.0000000 0.0433213 0.9566787 0.0000000 0.5399306 1.0000000 0.9566787 0.0433213 Old
5.5 1.0000000 0.0685921 0.9314079 0.0000000 0.5465729 1.0000000 0.9314079 0.0685921 Old
6.5 1.0000000 0.1046931 0.8953069 0.0000000 0.5563506 1.0000000 0.8953069 0.1046931 Old
7.5 1.0000000 0.1480144 0.8519856 0.0000000 0.5685558 1.0000000 0.8519856 0.1480144 Old
8.5 1.0000000 0.2021661 0.7978339 0.0000000 0.5845865 1.0000000 0.7978339 0.2021661 Old
9.5 1.0000000 0.2599278 0.7400722 0.0000000 0.6027132 1.0000000 0.7400722 0.2599278 Old
10.5 1.0000000 0.3212996 0.6787004 0.0000000 0.6232465 1.0000000 0.6787004 0.3212996 Old
11.5 1.0000000 0.3826715 0.6173285 0.0000000 0.6452282 1.0000000 0.6173285 0.3826715 Old
12.5 1.0000000 0.4476534 0.5523466 0.0000000 0.6702586 1.0000000 0.5523466 0.4476534 Old
13.5 1.0000000 0.5126354 0.4873646 0.0000000 0.6973094 1.0000000 0.4873646 0.5126354 Old
14.5 1.0000000 0.5812274 0.4187726 0.0000000 0.7283372 1.0000000 0.4187726 0.5812274 Old
15.5 0.9903537 0.6425993 0.3574007 0.0096463 0.7567568 0.9834254 0.3575309 0.6329530 Old
16.5 0.9774920 0.6967509 0.3032491 0.0225080 0.7835052 0.9650000 0.3040833 0.6742429 Old
17.5 0.9581994 0.7545126 0.2454874 0.0418006 0.8142077 0.9414414 0.2490208 0.7127120 Old
18.5 0.9324759 0.7942238 0.2057762 0.0675241 0.8357349 0.9128631 0.2165718 0.7266997 Old
19.5 0.9035370 0.8303249 0.1696751 0.0964630 0.8567073 0.8846154 0.1951788 0.7338619 Old
20.5 0.8842444 0.8592058 0.1407942 0.1157556 0.8757962 0.8686131 0.1822701 0.7434501 Old
21.5 0.8392283 0.8772563 0.1227437 0.1607717 0.8847458 0.8293515 0.2022710 0.7164846 Old
22.5 0.7813505 0.9061372 0.0938628 0.2186495 0.9033457 0.7868339 0.2379450 0.6874877 Old
23.5 0.7170418 0.9314079 0.0685921 0.2829582 0.9214876 0.7456647 0.2911532 0.6484497 Old
24.5 0.6430868 0.9494585 0.0505415 0.3569132 0.9345794 0.7032086 0.3604739 0.5925453 Old
25.5 0.5723473 0.9675090 0.0324910 0.4276527 0.9518717 0.6683292 0.4288852 0.5398563 Old
26.5 0.4951768 0.9819495 0.0180505 0.5048232 0.9685535 0.6340326 0.5051458 0.4771263 Old
27.5 0.4244373 0.9927798 0.0072202 0.5755627 0.9850746 0.6057269 0.5756080 0.4172171 Old
28.5 0.3504823 1.0000000 0.0000000 0.6495177 1.0000000 0.5782881 0.6495177 0.3504823 Old
29.5 0.2797428 1.0000000 0.0000000 0.7202572 1.0000000 0.5528942 0.7202572 0.2797428 Old
30.5 0.2122186 1.0000000 0.0000000 0.7877814 1.0000000 0.5306513 0.7877814 0.2122186 Old
31.5 0.1511254 1.0000000 0.0000000 0.8488746 1.0000000 0.5120148 0.8488746 0.1511254 Old
32.5 0.0996785 1.0000000 0.0000000 0.9003215 1.0000000 0.4973070 0.9003215 0.0996785 Old
33.5 0.0610932 1.0000000 0.0000000 0.9389068 1.0000000 0.4868190 0.9389068 0.0610932 Old
34.5 0.0353698 1.0000000 0.0000000 0.9646302 1.0000000 0.4800693 0.9646302 0.0353698 Old
35.5 0.0192926 1.0000000 0.0000000 0.9807074 1.0000000 0.4759450 0.9807074 0.0192926 Old
36.5 0.0096463 1.0000000 0.0000000 0.9903537 1.0000000 0.4735043 0.9903537 0.0096463 Old
37.5 0.0032154 1.0000000 0.0000000 0.9967846 1.0000000 0.4718910 0.9967846 0.0032154 Old
# Repeat the same threshold analysis for the new test
roc2=roc(df2$Disease, df2$Testscore, direction="auto",ci=T,smooth=F)

pre2=length(which(df2$Disease=="Yes"))/nrow(df2)

rocdf2=as_tibble(cbind(roc2$thresholds,roc2$sensitivities,roc2$specificities))%>%dplyr::rename(.,Cutoff=V1,Sens=V2,Spec=V3)

rocdf2<-rocdf2%>%mutate(.,FPR=(1-.$Spec),
                        FNR=(1-.$Sens),
                        PPV=((.$Sens*pre2)/(.$Sens*pre2+(1-.$Spec)*(1-pre2))),
                        NPV=(.$Spec*(1-pre2)/((1-.$Sens)*pre2+.$Spec*(1-pre2))),
                        d_distance=sqrt((1-.$Sens)^2+(1-.$Spec)^2),
                        Youden_J=.$Sens+.$Spec-1,Test="New"
)%>%na.omit()

rocdf2%>%knitr::kable()
Cutoff Sens Spec FPR FNR PPV NPV d_distance Youden_J Test
1.5 1.0000000 0.0037453 0.9962547 0.0000000 0.5901387 1.0000000 0.9962547 0.0037453 New
2.5 1.0000000 0.0112360 0.9887640 0.0000000 0.5919629 1.0000000 0.9887640 0.0112360 New
3.5 1.0000000 0.0224719 0.9775281 0.0000000 0.5947205 1.0000000 0.9775281 0.0224719 New
4.5 1.0000000 0.0411985 0.9588015 0.0000000 0.5993740 1.0000000 0.9588015 0.0411985 New
5.5 1.0000000 0.0711610 0.9288390 0.0000000 0.6069731 1.0000000 0.9288390 0.0711610 New
6.5 1.0000000 0.1161049 0.8838951 0.0000000 0.6187399 1.0000000 0.8838951 0.1161049 New
7.5 1.0000000 0.1647940 0.8352060 0.0000000 0.6320132 1.0000000 0.8352060 0.1647940 New
8.5 1.0000000 0.2322097 0.7677903 0.0000000 0.6513605 1.0000000 0.7677903 0.2322097 New
9.5 1.0000000 0.3033708 0.6966292 0.0000000 0.6731107 1.0000000 0.6966292 0.3033708 New
10.5 1.0000000 0.3782772 0.6217228 0.0000000 0.6976321 1.0000000 0.6217228 0.3782772 New
11.5 1.0000000 0.4569288 0.5430712 0.0000000 0.7253788 1.0000000 0.5430712 0.4569288 New
12.5 1.0000000 0.5355805 0.4644195 0.0000000 0.7554241 1.0000000 0.4644195 0.5355805 New
13.5 1.0000000 0.6179775 0.3820225 0.0000000 0.7896907 1.0000000 0.3820225 0.6179775 New
14.5 0.9869452 0.6928839 0.3071161 0.0130548 0.8217391 0.9736842 0.3073934 0.6798291 New
15.5 0.9686684 0.7602996 0.2397004 0.0313316 0.8528736 0.9441860 0.2417394 0.7289680 New
16.5 0.9477807 0.8277154 0.1722846 0.0522193 0.8875306 0.9170124 0.1800246 0.7754960 New
17.5 0.9242820 0.8838951 0.1161049 0.0757180 0.9194805 0.8905660 0.1386130 0.8081771 New
18.5 0.8955614 0.9288390 0.0711610 0.1044386 0.9475138 0.8611111 0.1263777 0.8244003 New
19.5 0.8616188 0.9513109 0.0486891 0.1383812 0.9620991 0.8273616 0.1466969 0.8129297 New
20.5 0.8224543 0.9700375 0.0299625 0.1775457 0.9752322 0.7920489 0.1800562 0.7924918 New
21.5 0.7728460 0.9812734 0.0187266 0.2271540 0.9833887 0.7507163 0.2279247 0.7541194 New
22.5 0.7127937 0.9925094 0.0074906 0.2872063 0.9927273 0.7066667 0.2873039 0.7053031 New
23.5 0.6475196 1.0000000 0.0000000 0.3524804 1.0000000 0.6641791 0.3524804 0.6475196 New
24.5 0.5744125 1.0000000 0.0000000 0.4255875 1.0000000 0.6209302 0.4255875 0.5744125 New
25.5 0.5039164 1.0000000 0.0000000 0.4960836 1.0000000 0.5842451 0.4960836 0.5039164 New
26.5 0.4281984 1.0000000 0.0000000 0.5718016 1.0000000 0.5493827 0.5718016 0.4281984 New
27.5 0.3577023 1.0000000 0.0000000 0.6422977 1.0000000 0.5204678 0.6422977 0.3577023 New
28.5 0.2845953 1.0000000 0.0000000 0.7154047 1.0000000 0.4935305 0.7154047 0.2845953 New
29.5 0.2271540 1.0000000 0.0000000 0.7728460 1.0000000 0.4742451 0.7728460 0.2271540 New
30.5 0.1723238 1.0000000 0.0000000 0.8276762 1.0000000 0.4571918 0.8276762 0.1723238 New
31.5 0.1227154 1.0000000 0.0000000 0.8772846 1.0000000 0.4427861 0.8772846 0.1227154 New
32.5 0.0809399 1.0000000 0.0000000 0.9190601 1.0000000 0.4313409 0.9190601 0.0809399 New
33.5 0.0496084 1.0000000 0.0000000 0.9503916 1.0000000 0.4231379 0.9503916 0.0496084 New
34.5 0.0287206 1.0000000 0.0000000 0.9712794 1.0000000 0.4178404 0.9712794 0.0287206 New
35.5 0.0156658 1.0000000 0.0000000 0.9843342 1.0000000 0.4145963 0.9843342 0.0156658 New
36.5 0.0078329 1.0000000 0.0000000 0.9921671 1.0000000 0.4126739 0.9921671 0.0078329 New
37.5 0.0026110 1.0000000 0.0000000 0.9973890 1.0000000 0.4114022 0.9973890 0.0026110 New

Tables 1A and 1B present all the information we can extract from a ROC analysis, including sensitivity, specificity, false positive and false negative rates, positive and negative predictive values, d-distance (the distance to the top-left corner of the ROC plot) and Youden's J index for each cut-off level of the test score.

We can identify the optimal cut-off for each test based on Youden's J index and the d-distance, as follows:

sel1=rocdf1%>%subset(.,d_distance==min(d_distance) | Youden_J==max(Youden_J))

sel2=rocdf2%>%subset(.,d_distance==min(d_distance) | Youden_J==max(Youden_J))

sel=rbind(sel1,sel2)

sel%>%knitr::kable()
Cutoff Sens Spec FPR FNR PPV NPV d_distance Youden_J Test
20.5 0.8842444 0.8592058 0.1407942 0.1157556 0.8757962 0.8686131 0.1822701 0.7434501 Old
18.5 0.8955614 0.9288390 0.0711610 0.1044386 0.9475138 0.8611111 0.1263777 0.8244003 New

The results indicate that the optimal cut-offs for the old and new tests are 20.5 and 18.5 respectively, corresponding to a sensitivity of 0.884 vs 0.896 and a specificity of 0.859 vs 0.929. Thus, at its optimal cut-off, the new test provides higher sensitivity and specificity than the old one.
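As a quick sanity check (a minimal sketch, not part of the original analysis), these values can be recovered directly from a confusion matrix at the chosen cut-off; here the 18.5 threshold of the new test is assumed, with scores above the threshold interpreted as positive:

# Dichotomize the new test score at 18.5 and cross-tabulate against the true diagnosis
pred <- factor(ifelse(df2$Testscore > 18.5, "Yes", "No"), levels = c("No", "Yes"))
cm <- table(Predicted = pred, Truth = df2$Disease)
cm
c(Sensitivity = cm["Yes", "Yes"]/sum(cm[, "Yes"]),  # expected to be close to 0.896
  Specificity = cm["No", "No"]/sum(cm[, "No"]))     # expected to be close to 0.929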

These results are presented in Figures 2A and 2B:

Figures 2A, 2B: Threshold vs performance analysis

rocdf<-rbind(rocdf1,rocdf2)%>%gather(Sens:Youden_J,key="Metric",value="Value")

rocdf%>%ggplot(aes(x=Cutoff,y=Value,color=Metric))+geom_line(size=1)+facet_grid(Metric~Test,scales="free")+geom_vline(xintercept=18.5,color="blue")

rocdf%>%ggplot(aes(x=Cutoff,y=Value,fill=Metric,color=Metric))+geom_bar(stat="identity",alpha=0.6)+facet_grid(Metric~Test,scales="free")+geom_vline(xintercept=18.5,color="blue")

This concludes the analysis by C-statistic. In the next step, we will go beyond the notion of probabilities and arbitrary cut-offs.

Step 2: Benchmark experiment

There are two questions we should answer:

  1. Could the performance of each diagnostic test be improved by changing the decision rule (i.e. by optimizing the interpretation of the test score with different classification algorithms)?

  2. Can we still report our results with ROC-AUCs?

In a few minutes, we will find the answer to both of these questions:

library(mlr)

# One classification task per dataset (the Type column is dropped; positive class = "Yes")
taskOld=makeClassifTask(id="Old",data=df1[,-3],target="Disease",positive = "Yes")
taskNew=makeClassifTask(id="New",data=df2[,-3],target="Disease",positive = "Yes")

tasks=list(taskOld,taskNew)

# 10-fold cross-validation repeated 10 times (100 resampling iterations)
resmp= makeResampleDesc("RepCV",folds=10,reps=10)

# The four candidate learners, all predicting class probabilities
learners = list(
  makeLearner("classif.naiveBayes",id="NBayes",predict.type = "prob",fix.factors.prediction = TRUE),
  makeLearner("classif.rpart",id="Tree",predict.type = "prob",fix.factors.prediction = TRUE),
  makeLearner("classif.kknn",id="KNN",predict.type = "prob",fix.factors.prediction = TRUE),
  makeLearner("classif.logreg",id="LOG",predict.type = "prob",fix.factors.prediction = TRUE)
)

# Performance measures: ROC-AUC and balanced accuracy
mes=list(auc,bac)

set.seed(123)
bmrk=benchmark(learners,tasks,resmp,mes)

# Per-iteration performances as a data frame
bmdata=getBMRPerformances(bmrk, as.df = TRUE)%>%as_tibble()

We just performed a benchmark study involving 4 basic classification algorithms: Naive Bayes, k-NN, logistic regression and CART. Each of them was trained across 100 iterations on two different datasets, and their performance was evaluated by balanced accuracy and ROC-AUC (yes, the AUC of the ROC curve) in a 10x10 repeated cross-validation process.
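Before the detailed comparisons, a quick overview of the mean performances can be obtained directly from the benchmark object, or by summarizing the per-iteration results ourselves; a minimal sketch (not part of the original analysis), assuming the bmrk and bmdata objects created above:

# Aggregated (mean over the 100 iterations) performances per task and learner
getBMRAggrPerformances(bmrk, as.df = TRUE)

# Equivalent manual summary of the per-iteration results
bmdata%>%group_by(task.id,learner.id)%>%summarise(mean.auc=mean(auc),sd.auc=sd(auc),mean.bac=mean(bac))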

Pairwise comparison of AUC among the 4 algorithms

bmdata%>%split(.$task.id)%>%map(~psych::describeBy(.$auc,group=.$learner.id))
## $New
## $NBayes
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.97 0.01   0.98    0.98 0.01 0.92   1  0.08 -1.11     2.21  0
## 
## $Tree
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.91 0.03   0.92    0.91 0.03 0.82 0.97  0.15 -0.6    -0.36  0
## 
## $KNN
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.96 0.02   0.96    0.96 0.02 0.85   1  0.15 -1.42     3.32  0
## 
## $LOG
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.97 0.01   0.98    0.98 0.01 0.92   1  0.08 -1.11     2.21  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
## 
## $Old
## $NBayes
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.94 0.02   0.94    0.94 0.02 0.88 0.99  0.11 -0.4    -0.24  0
## 
## $Tree
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis
## X1    1 100 0.86 0.05   0.86    0.86 0.04 0.74 0.96  0.23 -0.31    -0.03
##    se
## X1  0
## 
## $KNN
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.92 0.03   0.92    0.92 0.03 0.81 0.99  0.18 -0.4     1.11  0
## 
## $LOG
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.94 0.02   0.94    0.94 0.02 0.88 0.99  0.11 -0.4    -0.24  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
bmdata%>%split(.$task.id)%>%map(~PMCMR::posthoc.kruskal.nemenyi.test(formula=auc~learner.id,data=.,dist="Tukey"))
## $New
## 
##  Pairwise comparisons using Tukey and Kramer (Nemenyi) test  
##                    with Tukey-Dist approximation for independent samples 
## 
## data:  auc by learner.id 
## 
##      NBayes  Tree    KNN    
## Tree < 2e-16 -       -      
## KNN  7.6e-06 8.7e-14 -      
## LOG  1       < 2e-16 7.6e-06
## 
## P value adjustment method: none 
## 
## $Old
## 
##  Pairwise comparisons using Tukey and Kramer (Nemenyi) test  
##                    with Tukey-Dist approximation for independent samples 
## 
## data:  auc by learner.id 
## 
##      NBayes  Tree    KNN    
## Tree < 2e-16 -       -      
## KNN  5.5e-05 8.6e-14 -      
## LOG  1       < 2e-16 5.5e-05
## 
## P value adjustment method: none

The results show that, within the same test (Old or New), the diagnostic performance can be significantly modified by switching from one interpretation method to another. Indeed, the classification rules based on the Naive Bayes and logistic models yield the best AUC (0.94 for the old test and 0.97 for the new test).

Averaged diagnostic performance compared between the 2 tests

bmdata%>%split(.$learner.id)%>%map(~psych::describeBy(.$auc,group=.$task.id))
## $NBayes
## $New
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.97 0.01   0.98    0.98 0.01 0.92   1  0.08 -1.11     2.21  0
## 
## $Old
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.94 0.02   0.94    0.94 0.02 0.88 0.99  0.11 -0.4    -0.24  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
## 
## $Tree
## $New
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.91 0.03   0.92    0.91 0.03 0.82 0.97  0.15 -0.6    -0.36  0
## 
## $Old
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis
## X1    1 100 0.86 0.05   0.86    0.86 0.04 0.74 0.96  0.23 -0.31    -0.03
##    se
## X1  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
## 
## $KNN
## $New
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.96 0.02   0.96    0.96 0.02 0.85   1  0.15 -1.42     3.32  0
## 
## $Old
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.92 0.03   0.92    0.92 0.03 0.81 0.99  0.18 -0.4     1.11  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
## 
## $LOG
## $New
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.97 0.01   0.98    0.98 0.01 0.92   1  0.08 -1.11     2.21  0
## 
## $Old
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.94 0.02   0.94    0.94 0.02 0.88 0.99  0.11 -0.4    -0.24  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
bmdata%>%split(.$learner.id)%>%map(~DescTools::NemenyiTest(x=.$auc,g=.$task.id,dist="tukey"))
## $NBayes
## 
##  Nemenyi's test of multiple comparisons for independent samples (tukey)  
## 
##         mean.rank.diff    pval    
## Old-New         -74.81 2.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## $Tree
## 
##  Nemenyi's test of multiple comparisons for independent samples (tukey)  
## 
##         mean.rank.diff  pval    
## Old-New          -64.9 3e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## $KNN
## 
##  Nemenyi's test of multiple comparisons for independent samples (tukey)  
## 
##         mean.rank.diff    pval    
## Old-New         -67.68 2.2e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## $LOG
## 
##  Nemenyi's test of multiple comparisons for independent samples (tukey)  
## 
##         mean.rank.diff    pval    
## Old-New         -74.81 2.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We then compared the average AUC of the 4 algorithms between the Old and New tests (the two datasets); the result indicates that the new test provides significantly better performance (as evaluated by AUC) than the old test.
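As an aside (a minimal sketch, not part of the original analysis), mlr also offers a built-in visualization of this comparison, assuming the bmrk object created above:

# Boxplots of the per-iteration AUC, facetted by task and grouped by learner
plotBMRBoxplots(bmrk, measure = auc)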

generateCritDifferencesData(bmrk, p.value = 0.05, test = "nemenyi")
## $data
##        mean.rank learner.id rank yend xend right short.name
## NBayes       1.5     NBayes  1.5  0.5    0     0     nbayes
## Tree         4.0       Tree  4.0  0.5    5     1      rpart
## KNN          3.0        KNN  3.0  1.5    5     1       kknn
## LOG          1.5        LOG  1.5  1.5    0     0     logreg
## 
## $cd.info
## $cd.info$test
## [1] "nemenyi"
## 
## $cd.info$cd
## [1] 3.316606
## 
## $cd.info$x
## [1] 1.5
## 
## $cd.info$y
## [1] 0.1
## 
## $cd.info$nemenyi.data
##   xstart xend diff   y
## 1    1.5    4  2.5 0.1
## 
## 
## $friedman.nemenyi.test
## 
##  Friedman rank sum test
## 
## data:  auc.test.mean and learner.id and task.id
## Friedman chi-squared = 6, df = 3, p-value = 0.1116
## 
## 
## $baseline
## [1] NBayes
## Levels: KNN LOG NBayes Tree
## 
## $p.value
## [1] 0.05
## 
## attr(,"class")
## [1] "CritDifferencesData"

The ranking analysis suggests that the logistic and Naive Bayes models are the best algorithms to use for interpreting our test scores.

These results can be visualized in Figures 3A, B and C below:

bmdata%>%ggplot(aes(x=auc,fill=task.id))+geom_density(alpha=0.6)+facet_wrap(~learner.id,ncol=3,scales="free")+geom_vline(xintercept=0.9)+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

bmdata%>%ggplot(aes(x=task.id,y=auc,fill=task.id,color=learner.id))+stat_summary(fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,alpha=0.6,size=1)+facet_wrap(~learner.id,scales="free")+coord_flip()

bmdata%>%ggplot(aes(x=learner.id,y=auc,fill=learner.id,color=learner.id))+stat_summary(fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,alpha=0.6,size=1)+facet_wrap(~task.id,scales="free")+coord_flip()

bmdata%>%ggplot(aes(x=iter,y=auc,color=task.id))+geom_path(alpha=0.9,size=0.8)+facet_grid(learner.id~task.id,scales="free")+scale_color_manual(values=c("#ff0c4c","purple"))+geom_hline(yintercept=0.9,color="red")

Figure 4: Ranking of the 4 algorithms by the Nemenyi test

nemenyi = generateCritDifferencesData(bmrk, p.value = 0.05, test = "nemenyi")
plotCritDifferences(nemenyi)+coord_cartesian(xlim = c(-1,5), ylim = c(0,2))

Figure 5: ROC curves of the 4 algorithms

rocd=generateThreshVsPerfData(bmrk, measures = list(fpr, tpr))
plotROCCurves(rocd)+geom_line(size=1,aes(color=learner))+geom_polygon(aes(fill=learner),alpha=0.5)+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

Finally, we would like to reveal the content of our two best algorithms: the Naive Bayes and logistic models.

Figure 6: Classification rule by the Naive Bayes model

# Train a Naive Bayes learner on the new-test task and inspect the fitted model
NB=makeLearner("classif.naiveBayes",id="NB",predict.type = "prob",fix.factors.prediction = TRUE)

NBfit=train(NB,taskNew)

NBfit$learner.model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##        No       Yes 
## 0.4107692 0.5892308 
## 
## Conditional probabilities:
##      Testscore
## Y         [,1]     [,2]
##   No  12.05618 4.484794
##   Yes 25.39687 5.106119
predict(NBfit,taskNew)%>%as_tibble()%>%mutate(.,TestScore=df2$Testscore)%>%ggplot(aes(x=TestScore,y=prob.Yes))+geom_line(size=1,color="red")

Figure 7: Classification rule by the logistic model

# Train a logistic regression learner on the new-test task and inspect the fitted model
LOG=makeLearner("classif.logreg",id="logistic",predict.type = "prob",fix.factors.prediction = TRUE)

logistic=train(LOG,taskNew)
summary(logistic$learner.model)
## 
## Call:
## stats::glm(formula = f, family = "binomial", data = getTaskData(.task, 
##     .subset), weights = .weights, model = FALSE)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.49821  -0.20006   0.02969   0.21166   2.12916  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.29682    0.91512  -11.25   <2e-16 ***
## Testscore     0.58140    0.05085   11.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 880.28  on 649  degrees of freedom
## Residual deviance: 261.74  on 648  degrees of freedom
## AIC: 265.74
## 
## Number of Fisher Scoring iterations: 7
predict(logistic,taskNew)%>%as_tibble()%>%mutate(.,TestScore=df2$Testscore)%>%ggplot(aes(x=TestScore,y=prob.Yes))+geom_line(size=1,color="purple")

The distribution graph of the positive-class probability indicates that these two models are equivalent for clinical practice.
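This visual impression can also be checked numerically; a minimal sketch (not part of the original analysis) comparing the predicted positive probabilities of the two fitted models, assuming the NBfit and logistic objects created above:

# Agreement between the Naive Bayes and logistic predicted probabilities on the new-test data
pNB  <- as.data.frame(predict(NBfit, taskNew))$prob.Yes
pLOG <- as.data.frame(predict(logistic, taskNew))$prob.Yes
cor(pNB, pLOG, method = "spearman")        # rank agreement of the two probability curves
table(NB = pNB >= 0.5, LOG = pLOG >= 0.5)  # agreement of the default 0.5 decision rule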

Conclusion: Our study has successfully extended the conventional ROC-curve analysis. There are two take-home messages in this document:

  1. The performance of a diagnostic test can be improved by adopting the right decision rule or interpretation algorithm. Think carefully before picking an arbitrary cut-off point, because a benchmark study might give you something better.

  2. The ROC curve as we know it is just a two-dimensional line plot showing the relationship between sensitivity and 1 - specificity of a given classifier. So ANY classification rule (not only a specific test on its own, but any rule built on stand-alone or combined clinical data, including symptoms, laboratory markers and medical imaging, using a simple or sophisticated algorithm…) can be visually evaluated by a ROC curve, as in the sketch below.
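As an illustration of this second point (a minimal sketch, assuming the logistic model fitted above), the predicted probabilities of any classifier can be fed directly into pROC, just like a raw test score:

# ROC curve built from the logistic decision rule rather than the raw test score
probs <- as.data.frame(predict(logistic, taskNew))$prob.Yes
roc_rule <- roc(df2$Disease, probs, ci = TRUE)
plot(roc_rule, col = "purple", main = "ROC of the logistic decision rule")
roc_rule$auc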

END