Case study 52: Beyond the ROC curves (or ML applied to Performance Evaluation)

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project was initiated in 2016 and aims to:

1. Encourage the use of Machine Learning techniques in medical research in Vietnam;

2. Promote the use of the R statistical programming language, an open-source and leading tool for practicing data science.

Background

The main objective of Case study 52 (Chapter 8 of Machine Learning in Medicine - Cookbook 3 by T. J. Cleophas and A. H. Zwinderman, Springer, 2014) is to compare the performance of two diagnostic tests for Peripheral Vascular Disease.

There are two datasets from two parallel groups. The first group was evaluated with the old test (n = 588) and the second with the new test (n = 650). The study question is whether the new test performs better than the old one.

Looking at these data, we find that each set contains only two variables: the test score and the true diagnosis. Does this look familiar? Yes! Any medical student can quickly solve this problem using Receiver Operating Characteristic (ROC) analysis (also known as the C-statistic). However, our experiment will go beyond the ROC curve with some machine learning techniques.

Materials and method

The original datasets were provided in Chapter 8 of Machine Learning in Medicine - Cookbook 3 (T. J. Cleophas and A. H. Zwinderman, SpringerBriefs in Statistics, 2014). You can get the original data in SPSS SAV format (vascdisease1.sav and vascdisease2.sav) from the publisher's website: extras.springer.com.

Data analysis was performed in the R statistical programming language. First, we will perform a conventional ROC curve analysis using the pROC package; then we will try an alternative approach that consists of a benchmark study of several machine learning algorithms. The mlr package (https://mlr-org.github.io/mlr-tutorial/release/html/index.html) will be used throughout our benchmark experiment, including model training and resampling by repeated k-fold cross-validation.

Instead of evaluating the diagnostic test directly, our benchmark study evaluates the performance of classification rules based on 4 basic algorithms: Naive Bayes, KNN, logistic regression and decision tree.

First, the following packages must be loaded in RStudio (some other packages might also be required during the analysis), and we will customize the aesthetic settings for ggplot2:

library(tidyverse)
library(foreign)
library(pROC)
library(mlr)
library(gridExtra)

my_theme <- function(base_size = 8, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size =7),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 11),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "#fffece"),
      strip.background = element_rect(fill = "#ffb20c", color = "black", size =0.5),
      strip.text = element_text(face = "bold", size = 8, color = "black"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())

myfillcolors=c("#ee0cff", "#ffbb00","#ff0c4c", "#ff300c","#aaff00","red")
mycolors=c("#9e00aa","#ff8300","#d1063b" ,"#db2202","#28b201","red3")

Results

Step 0: Data loading and reshaping

require(foreign)

# Read the SPSS files, convert to tibbles and drop missing rows
df1=read.spss("vascdisease1.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()%>%na.omit()

# Recode the outcome (0/1) as a factor with labels No/Yes
df1$VAR00002<-df1$VAR00002%>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")

# Rename the variables and tag the dataset with the test type
df1<-df1%>%rename(.,Testscore=VAR00001,Disease=VAR00002)%>%mutate(.,Type="Oldtest")

# Same preprocessing for the new test
df2=read.spss("vascdisease2.sav",use.value.labels = TRUE,to.data.frame = TRUE)%>%as_tibble()%>%na.omit()

df2$VAR00002<-df2$VAR00002%>%as.factor()%>%recode_factor(.,`0` = "No", `1` = "Yes")

df2<-df2%>%rename(.,Testscore=VAR00001,Disease=VAR00002)%>%mutate(.,Type="Newtest")

df1
## # A tibble: 588 × 3
##    Testscore Disease    Type
##        <dbl>  <fctr>   <chr>
## 1          1      No Oldtest
## 2          1      No Oldtest
## 3          2      No Oldtest
## 4          2      No Oldtest
## 5          3      No Oldtest
## 6          3      No Oldtest
## 7          3      No Oldtest
## 8          4      No Oldtest
## 9          4      No Oldtest
## 10         4      No Oldtest
## # ... with 578 more rows
df2
## # A tibble: 650 × 3
##    Testscore Disease    Type
##        <dbl>  <fctr>   <chr>
## 1          1      No Newtest
## 2          2      No Newtest
## 3          2      No Newtest
## 4          3      No Newtest
## 5          3      No Newtest
## 6          3      No Newtest
## 7          4      No Newtest
## 8          4      No Newtest
## 9          4      No Newtest
## 10         4      No Newtest
## # ... with 640 more rows

Once the above commands are executed, we get two tibbles named df1 and df2, which will be used throughout our analysis.

Step 1: Conventional ROC curve analysis

Figure 1: ROC curves for evaluating the performance of the two diagnostic tests

# ROC curve of the new test
rocobj2 <- plot.roc(df2$Disease, df2$Testscore,
                    main="Old vs New test",
                    col="purple",grid=T
                    )

# Add the ROC curve of the old test to the same plot
rocobj1 <- lines.roc(df1$Disease, df1$Testscore,col="orange")

# One-sided DeLong test: is the AUC of the new test greater than that of the old test?
testobj <- roc.test(rocobj2, rocobj1,method="delong",alternative="greater")

# Annotate the plot with the p-value and both AUCs
text(0.4,0.3,labels=paste("p-value =", format.pval(testobj$p.value),"(by Delong)"),cex =.75)
text(0.4,0.25,labels=paste("AUC New=",testobj$estimate[1]),cex =.75)
text(0.4,0.2,labels=paste("AUC Old=",testobj$estimate[2]),cex =.75)

legend("topright",legend=c("Old","New"),col=c("orange","purple"),lwd=2)

Figure 1 shows that the ROC-AUC of the new test is significantly higher than that of the old test (0.974 vs 0.945, p-value = 0.001 by the DeLong method).
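If we also want to report each AUC with its confidence interval, pROC can compute it directly from the fitted ROC objects; a minimal sketch (not part of the original analysis), assuming the rocobj1 and rocobj2 objects created above:

# 95% confidence intervals of the AUCs, for the old and new tests respectively
ci.auc(rocobj1)
ci.auc(rocobj2)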

Tables 1A, 1B: Threshold analysis for the Old and New tests

# ROC object and disease prevalence for the old test
roc1=roc(df1$Disease, df1$Testscore, direction="auto",ci=T,smooth=F)

pre1=length(which(df1$Disease=="Yes"))/nrow(df1)

# Collect the threshold, sensitivity and specificity of every possible cut-off
rocdf1=as_tibble(cbind(roc1$thresholds,roc1$sensitivities,roc1$specificities))%>%dplyr::rename(.,Cutoff=V1,Sens=V2,Spec=V3)

# Derive FPR, FNR, predictive values (using the prevalence), the distance to the top-left corner and Youden's J
rocdf1<-rocdf1%>%mutate(.,FPR=(1-.$Spec),
                        FNR=(1-.$Sens),
                        PPV=((.$Sens*pre1)/(.$Sens*pre1+(1-.$Spec)*(1-pre1))),
                        NPV=(.$Spec*(1-pre1)/((1-.$Sens)*pre1+.$Spec*(1-pre1))),
                        d_distance=sqrt((1-.$Sens)^2+(1-.$Spec)^2),
                        Youden_J=.$Sens+.$Spec-1,Test="Old"
)%>%na.omit()

rocdf1%>%knitr::kable()
Cutoff Sens Spec FPR FNR PPV NPV d_distance Youden_J Test
1.5 1.0000000 0.0072202 0.9927798 0.0000000 0.5307167 1.0000000 0.9927798 0.0072202 Old
2.5 1.0000000 0.0144404 0.9855596 0.0000000 0.5325342 1.0000000 0.9855596 0.0144404 Old
3.5 1.0000000 0.0252708 0.9747292 0.0000000 0.5352840 1.0000000 0.9747292 0.0252708 Old
4.5 1.0000000 0.0433213 0.9566787 0.0000000 0.5399306 1.0000000 0.9566787 0.0433213 Old
5.5 1.0000000 0.0685921 0.9314079 0.0000000 0.5465729 1.0000000 0.9314079 0.0685921 Old
6.5 1.0000000 0.1046931 0.8953069 0.0000000 0.5563506 1.0000000 0.8953069 0.1046931 Old
7.5 1.0000000 0.1480144 0.8519856 0.0000000 0.5685558 1.0000000 0.8519856 0.1480144 Old
8.5 1.0000000 0.2021661 0.7978339 0.0000000 0.5845865 1.0000000 0.7978339 0.2021661 Old
9.5 1.0000000 0.2599278 0.7400722 0.0000000 0.6027132 1.0000000 0.7400722 0.2599278 Old
10.5 1.0000000 0.3212996 0.6787004 0.0000000 0.6232465 1.0000000 0.6787004 0.3212996 Old
11.5 1.0000000 0.3826715 0.6173285 0.0000000 0.6452282 1.0000000 0.6173285 0.3826715 Old
12.5 1.0000000 0.4476534 0.5523466 0.0000000 0.6702586 1.0000000 0.5523466 0.4476534 Old
13.5 1.0000000 0.5126354 0.4873646 0.0000000 0.6973094 1.0000000 0.4873646 0.5126354 Old
14.5 1.0000000 0.5812274 0.4187726 0.0000000 0.7283372 1.0000000 0.4187726 0.5812274 Old
15.5 0.9903537 0.6425993 0.3574007 0.0096463 0.7567568 0.9834254 0.3575309 0.6329530 Old
16.5 0.9774920 0.6967509 0.3032491 0.0225080 0.7835052 0.9650000 0.3040833 0.6742429 Old
17.5 0.9581994 0.7545126 0.2454874 0.0418006 0.8142077 0.9414414 0.2490208 0.7127120 Old
18.5 0.9324759 0.7942238 0.2057762 0.0675241 0.8357349 0.9128631 0.2165718 0.7266997 Old
19.5 0.9035370 0.8303249 0.1696751 0.0964630 0.8567073 0.8846154 0.1951788 0.7338619 Old
20.5 0.8842444 0.8592058 0.1407942 0.1157556 0.8757962 0.8686131 0.1822701 0.7434501 Old
21.5 0.8392283 0.8772563 0.1227437 0.1607717 0.8847458 0.8293515 0.2022710 0.7164846 Old
22.5 0.7813505 0.9061372 0.0938628 0.2186495 0.9033457 0.7868339 0.2379450 0.6874877 Old
23.5 0.7170418 0.9314079 0.0685921 0.2829582 0.9214876 0.7456647 0.2911532 0.6484497 Old
24.5 0.6430868 0.9494585 0.0505415 0.3569132 0.9345794 0.7032086 0.3604739 0.5925453 Old
25.5 0.5723473 0.9675090 0.0324910 0.4276527 0.9518717 0.6683292 0.4288852 0.5398563 Old
26.5 0.4951768 0.9819495 0.0180505 0.5048232 0.9685535 0.6340326 0.5051458 0.4771263 Old
27.5 0.4244373 0.9927798 0.0072202 0.5755627 0.9850746 0.6057269 0.5756080 0.4172171 Old
28.5 0.3504823 1.0000000 0.0000000 0.6495177 1.0000000 0.5782881 0.6495177 0.3504823 Old
29.5 0.2797428 1.0000000 0.0000000 0.7202572 1.0000000 0.5528942 0.7202572 0.2797428 Old
30.5 0.2122186 1.0000000 0.0000000 0.7877814 1.0000000 0.5306513 0.7877814 0.2122186 Old
31.5 0.1511254 1.0000000 0.0000000 0.8488746 1.0000000 0.5120148 0.8488746 0.1511254 Old
32.5 0.0996785 1.0000000 0.0000000 0.9003215 1.0000000 0.4973070 0.9003215 0.0996785 Old
33.5 0.0610932 1.0000000 0.0000000 0.9389068 1.0000000 0.4868190 0.9389068 0.0610932 Old
34.5 0.0353698 1.0000000 0.0000000 0.9646302 1.0000000 0.4800693 0.9646302 0.0353698 Old
35.5 0.0192926 1.0000000 0.0000000 0.9807074 1.0000000 0.4759450 0.9807074 0.0192926 Old
36.5 0.0096463 1.0000000 0.0000000 0.9903537 1.0000000 0.4735043 0.9903537 0.0096463 Old
37.5 0.0032154 1.0000000 0.0000000 0.9967846 1.0000000 0.4718910 0.9967846 0.0032154 Old
# Repeat the same threshold analysis for the new test
roc2=roc(df2$Disease, df2$Testscore, direction="auto",ci=T,smooth=F)

pre2=length(which(df2$Disease=="Yes"))/nrow(df2)

rocdf2=as_tibble(cbind(roc2$thresholds,roc2$sensitivities,roc2$specificities))%>%dplyr::rename(.,Cutoff=V1,Sens=V2,Spec=V3)

rocdf2<-rocdf2%>%mutate(.,FPR=(1-.$Spec),
                        FNR=(1-.$Sens),
                        PPV=((.$Sens*pre2)/(.$Sens*pre2+(1-.$Spec)*(1-pre2))),
                        NPV=(.$Spec*(1-pre2)/((1-.$Sens)*pre2+.$Spec*(1-pre2))),
                        d_distance=sqrt((1-.$Sens)^2+(1-.$Spec)^2),
                        Youden_J=.$Sens+.$Spec-1,Test="New"
)%>%na.omit()

rocdf2%>%knitr::kable()
Cutoff Sens Spec FPR FNR PPV NPV d_distance Youden_J Test
1.5 1.0000000 0.0037453 0.9962547 0.0000000 0.5901387 1.0000000 0.9962547 0.0037453 New
2.5 1.0000000 0.0112360 0.9887640 0.0000000 0.5919629 1.0000000 0.9887640 0.0112360 New
3.5 1.0000000 0.0224719 0.9775281 0.0000000 0.5947205 1.0000000 0.9775281 0.0224719 New
4.5 1.0000000 0.0411985 0.9588015 0.0000000 0.5993740 1.0000000 0.9588015 0.0411985 New
5.5 1.0000000 0.0711610 0.9288390 0.0000000 0.6069731 1.0000000 0.9288390 0.0711610 New
6.5 1.0000000 0.1161049 0.8838951 0.0000000 0.6187399 1.0000000 0.8838951 0.1161049 New
7.5 1.0000000 0.1647940 0.8352060 0.0000000 0.6320132 1.0000000 0.8352060 0.1647940 New
8.5 1.0000000 0.2322097 0.7677903 0.0000000 0.6513605 1.0000000 0.7677903 0.2322097 New
9.5 1.0000000 0.3033708 0.6966292 0.0000000 0.6731107 1.0000000 0.6966292 0.3033708 New
10.5 1.0000000 0.3782772 0.6217228 0.0000000 0.6976321 1.0000000 0.6217228 0.3782772 New
11.5 1.0000000 0.4569288 0.5430712 0.0000000 0.7253788 1.0000000 0.5430712 0.4569288 New
12.5 1.0000000 0.5355805 0.4644195 0.0000000 0.7554241 1.0000000 0.4644195 0.5355805 New
13.5 1.0000000 0.6179775 0.3820225 0.0000000 0.7896907 1.0000000 0.3820225 0.6179775 New
14.5 0.9869452 0.6928839 0.3071161 0.0130548 0.8217391 0.9736842 0.3073934 0.6798291 New
15.5 0.9686684 0.7602996 0.2397004 0.0313316 0.8528736 0.9441860 0.2417394 0.7289680 New
16.5 0.9477807 0.8277154 0.1722846 0.0522193 0.8875306 0.9170124 0.1800246 0.7754960 New
17.5 0.9242820 0.8838951 0.1161049 0.0757180 0.9194805 0.8905660 0.1386130 0.8081771 New
18.5 0.8955614 0.9288390 0.0711610 0.1044386 0.9475138 0.8611111 0.1263777 0.8244003 New
19.5 0.8616188 0.9513109 0.0486891 0.1383812 0.9620991 0.8273616 0.1466969 0.8129297 New
20.5 0.8224543 0.9700375 0.0299625 0.1775457 0.9752322 0.7920489 0.1800562 0.7924918 New
21.5 0.7728460 0.9812734 0.0187266 0.2271540 0.9833887 0.7507163 0.2279247 0.7541194 New
22.5 0.7127937 0.9925094 0.0074906 0.2872063 0.9927273 0.7066667 0.2873039 0.7053031 New
23.5 0.6475196 1.0000000 0.0000000 0.3524804 1.0000000 0.6641791 0.3524804 0.6475196 New
24.5 0.5744125 1.0000000 0.0000000 0.4255875 1.0000000 0.6209302 0.4255875 0.5744125 New
25.5 0.5039164 1.0000000 0.0000000 0.4960836 1.0000000 0.5842451 0.4960836 0.5039164 New
26.5 0.4281984 1.0000000 0.0000000 0.5718016 1.0000000 0.5493827 0.5718016 0.4281984 New
27.5 0.3577023 1.0000000 0.0000000 0.6422977 1.0000000 0.5204678 0.6422977 0.3577023 New
28.5 0.2845953 1.0000000 0.0000000 0.7154047 1.0000000 0.4935305 0.7154047 0.2845953 New
29.5 0.2271540 1.0000000 0.0000000 0.7728460 1.0000000 0.4742451 0.7728460 0.2271540 New
30.5 0.1723238 1.0000000 0.0000000 0.8276762 1.0000000 0.4571918 0.8276762 0.1723238 New
31.5 0.1227154 1.0000000 0.0000000 0.8772846 1.0000000 0.4427861 0.8772846 0.1227154 New
32.5 0.0809399 1.0000000 0.0000000 0.9190601 1.0000000 0.4313409 0.9190601 0.0809399 New
33.5 0.0496084 1.0000000 0.0000000 0.9503916 1.0000000 0.4231379 0.9503916 0.0496084 New
34.5 0.0287206 1.0000000 0.0000000 0.9712794 1.0000000 0.4178404 0.9712794 0.0287206 New
35.5 0.0156658 1.0000000 0.0000000 0.9843342 1.0000000 0.4145963 0.9843342 0.0156658 New
36.5 0.0078329 1.0000000 0.0000000 0.9921671 1.0000000 0.4126739 0.9921671 0.0078329 New
37.5 0.0026110 1.0000000 0.0000000 0.9973890 1.0000000 0.4114022 0.9973890 0.0026110 New

Tables 1A and 1B present all the information we can extract from a ROC analysis, including sensitivity, specificity, false positive and false negative rates, positive and negative predictive values, d-distance (the distance to the top-left corner of the ROC plot) and Youden's J index for each cut-off level of the test score.

We can identify the optimal cut-off for each test based on Youden's J index and the d-distance, as follows:

sel1=rocdf1%>%subset(.,d_distance==min(d_distance) | Youden_J==max(Youden_J))

sel2=rocdf2%>%subset(.,d_distance==min(d_distance) | Youden_J==max(Youden_J))

sel=rbind(sel1,sel2)

sel%>%knitr::kable()
Cutoff Sens Spec FPR FNR PPV NPV d_distance Youden_J Test
20.5 0.8842444 0.8592058 0.1407942 0.1157556 0.8757962 0.8686131 0.1822701 0.7434501 Old
18.5 0.8955614 0.9288390 0.0711610 0.1044386 0.9475138 0.8611111 0.1263777 0.8244003 New

The results indicate that the optimal cut-offs for the old and new tests are 20.5 and 18.5 respectively, corresponding to a sensitivity of 0.884 vs 0.896 and a specificity of 0.859 vs 0.929. Thus, at its optimal cut-off, the new test provides higher sensitivity and specificity than the old one.
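As a quick sanity check (a minimal sketch, not part of the original analysis), these values can be recovered directly from a confusion matrix at the chosen cut-off; here the 18.5 threshold of the new test is assumed, with scores above the threshold interpreted as positive:

# Dichotomize the new test score at 18.5 and cross-tabulate against the true diagnosis
pred <- factor(ifelse(df2$Testscore > 18.5, "Yes", "No"), levels = c("No", "Yes"))
cm <- table(Predicted = pred, Truth = df2$Disease)
cm
c(Sensitivity = cm["Yes", "Yes"]/sum(cm[, "Yes"]),  # expected to be close to 0.896
  Specificity = cm["No", "No"]/sum(cm[, "No"]))     # expected to be close to 0.929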

These results are presented in Figures 2A and 2B:

Figures 2A, 2B: Threshold vs performance analysis

rocdf<-rbind(rocdf1,rocdf2)%>%gather(Sens:Youden_J,key="Metric",value="Value")

rocdf%>%ggplot(aes(x=Cutoff,y=Value,color=Metric))+geom_line(size=1)+facet_grid(Metric~Test,scales="free")+geom_vline(xintercept=18.5,color="blue")

rocdf%>%ggplot(aes(x=Cutoff,y=Value,fill=Metric,color=Metric))+geom_bar(stat="identity",alpha=0.6)+facet_grid(Metric~Test,scales="free")+geom_vline(xintercept=18.5,color="blue")

This concludes the analysis by C-statistic. In the next step, we will go beyond the notion of probabilities and arbitrary cut-offs.

Step 2: Benchmark experiment

There are two questions we should answer:

  1. Could the performance of each diagnostic test be improved by changing the decision rule (i.e. by optimizing the interpretation of the test score with different classification algorithms)?

  2. Can we still report our results with ROC-AUCs?

In a few minutes, we will find the answer to both of these questions:

library(mlr)

# One classification task per dataset (the Type column is dropped; positive class = "Yes")
taskOld=makeClassifTask(id="Old",data=df1[,-3],target="Disease",positive = "Yes")
taskNew=makeClassifTask(id="New",data=df2[,-3],target="Disease",positive = "Yes")

tasks=list(taskOld,taskNew)

# 10-fold cross-validation repeated 10 times (100 resampling iterations)
resmp= makeResampleDesc("RepCV",folds=10,reps=10)

# The four candidate learners, all predicting class probabilities
learners = list(
  makeLearner("classif.naiveBayes",id="NBayes",predict.type = "prob",fix.factors.prediction = TRUE),
  makeLearner("classif.rpart",id="Tree",predict.type = "prob",fix.factors.prediction = TRUE),
  makeLearner("classif.kknn",id="KNN",predict.type = "prob",fix.factors.prediction = TRUE),
  makeLearner("classif.logreg",id="LOG",predict.type = "prob",fix.factors.prediction = TRUE)
)

# Performance measures: ROC-AUC and balanced accuracy
mes=list(auc,bac)

set.seed(123)
bmrk=benchmark(learners,tasks,resmp,mes)

# Per-iteration performances as a data frame
bmdata=getBMRPerformances(bmrk, as.df = TRUE)%>%as_tibble()

We just performed a benchmark study involving 4 basic classification algorithms: Naive Bayes, k-NN, logistic regression and CART. Each of them was trained across 100 iterations on two different datasets, and their performance was evaluated by balanced accuracy and ROC-AUC (yes, the AUC of the ROC curve) in a 10x10 repeated cross-validation process.
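Before the detailed comparisons, a quick overview of the mean performances can be obtained directly from the benchmark object, or by summarizing the per-iteration results ourselves; a minimal sketch (not part of the original analysis), assuming the bmrk and bmdata objects created above:

# Aggregated (mean over the 100 iterations) performances per task and learner
getBMRAggrPerformances(bmrk, as.df = TRUE)

# Equivalent manual summary of the per-iteration results
bmdata%>%group_by(task.id,learner.id)%>%summarise(mean.auc=mean(auc),sd.auc=sd(auc),mean.bac=mean(bac))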

Pairwise comparison of AUC among the 4 algorithms

bmdata%>%split(.$task.id)%>%map(~psych::describeBy(.$auc,group=.$learner.id))
## $New
## $NBayes
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.97 0.01   0.98    0.98 0.01 0.92   1  0.08 -1.11     2.21  0
## 
## $Tree
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.91 0.03   0.92    0.91 0.03 0.82 0.97  0.15 -0.6    -0.36  0
## 
## $KNN
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.96 0.02   0.96    0.96 0.02 0.85   1  0.15 -1.42     3.32  0
## 
## $LOG
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.97 0.01   0.98    0.98 0.01 0.92   1  0.08 -1.11     2.21  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
## 
## $Old
## $NBayes
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.94 0.02   0.94    0.94 0.02 0.88 0.99  0.11 -0.4    -0.24  0
## 
## $Tree
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis
## X1    1 100 0.86 0.05   0.86    0.86 0.04 0.74 0.96  0.23 -0.31    -0.03
##    se
## X1  0
## 
## $KNN
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.92 0.03   0.92    0.92 0.03 0.81 0.99  0.18 -0.4     1.11  0
## 
## $LOG
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.94 0.02   0.94    0.94 0.02 0.88 0.99  0.11 -0.4    -0.24  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
bmdata%>%split(.$task.id)%>%map(~PMCMR::posthoc.kruskal.nemenyi.test(formula=auc~learner.id,data=.,dist="Tukey"))
## $New
## 
##  Pairwise comparisons using Tukey and Kramer (Nemenyi) test  
##                    with Tukey-Dist approximation for independent samples 
## 
## data:  auc by learner.id 
## 
##      NBayes  Tree    KNN    
## Tree < 2e-16 -       -      
## KNN  7.6e-06 8.7e-14 -      
## LOG  1       < 2e-16 7.6e-06
## 
## P value adjustment method: none 
## 
## $Old
## 
##  Pairwise comparisons using Tukey and Kramer (Nemenyi) test  
##                    with Tukey-Dist approximation for independent samples 
## 
## data:  auc by learner.id 
## 
##      NBayes  Tree    KNN    
## Tree < 2e-16 -       -      
## KNN  5.5e-05 8.6e-14 -      
## LOG  1       < 2e-16 5.5e-05
## 
## P value adjustment method: none

The results show that, within the same test (Old or New), the diagnostic performance can be significantly modified by switching from one interpretation method to another. Indeed, the classification rules based on the Naive Bayes and logistic models yield the best AUC (0.94 for the old test and 0.97 for the new test).

Averaged diagnostic performance compared between the 2 tests

bmdata%>%split(.$learner.id)%>%map(~psych::describeBy(.$auc,group=.$task.id))
## $NBayes
## $New
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.97 0.01   0.98    0.98 0.01 0.92   1  0.08 -1.11     2.21  0
## 
## $Old
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.94 0.02   0.94    0.94 0.02 0.88 0.99  0.11 -0.4    -0.24  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
## 
## $Tree
## $New
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.91 0.03   0.92    0.91 0.03 0.82 0.97  0.15 -0.6    -0.36  0
## 
## $Old
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis
## X1    1 100 0.86 0.05   0.86    0.86 0.04 0.74 0.96  0.23 -0.31    -0.03
##    se
## X1  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
## 
## $KNN
## $New
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.96 0.02   0.96    0.96 0.02 0.85   1  0.15 -1.42     3.32  0
## 
## $Old
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.92 0.03   0.92    0.92 0.03 0.81 0.99  0.18 -0.4     1.11  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
## 
## $LOG
## $New
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis se
## X1    1 100 0.97 0.01   0.98    0.98 0.01 0.92   1  0.08 -1.11     2.21  0
## 
## $Old
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 100 0.94 0.02   0.94    0.94 0.02 0.88 0.99  0.11 -0.4    -0.24  0
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
bmdata%>%split(.$learner.id)%>%map(~DescTools::NemenyiTest(x=.$auc,g=.$task.id,dist="tukey"))
## $NBayes
## 
##  Nemenyi's test of multiple comparisons for independent samples (tukey)  
## 
##         mean.rank.diff    pval    
## Old-New         -74.81 2.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## $Tree
## 
##  Nemenyi's test of multiple comparisons for independent samples (tukey)  
## 
##         mean.rank.diff  pval    
## Old-New          -64.9 3e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## $KNN
## 
##  Nemenyi's test of multiple comparisons for independent samples (tukey)  
## 
##         mean.rank.diff    pval    
## Old-New         -67.68 2.2e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## $LOG
## 
##  Nemenyi's test of multiple comparisons for independent samples (tukey)  
## 
##         mean.rank.diff    pval    
## Old-New         -74.81 2.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We then compared the average AUC of the 4 algorithms between the Old and New tests (the two datasets); the result indicates that the new test provides significantly better performance (as evaluated by AUC) than the old test.
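As an aside (a minimal sketch, not part of the original analysis), mlr also offers a built-in visualization of this comparison, assuming the bmrk object created above:

# Boxplots of the per-iteration AUC, facetted by task and grouped by learner
plotBMRBoxplots(bmrk, measure = auc)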

generateCritDifferencesData(bmrk, p.value = 0.05, test = "nemenyi")
## $data
##        mean.rank learner.id rank yend xend right short.name
## NBayes       1.5     NBayes  1.5  0.5    0     0     nbayes
## Tree         4.0       Tree  4.0  0.5    5     1      rpart
## KNN          3.0        KNN  3.0  1.5    5     1       kknn
## LOG          1.5        LOG  1.5  1.5    0     0     logreg
## 
## $cd.info
## $cd.info$test
## [1] "nemenyi"
## 
## $cd.info$cd
## [1] 3.316606
## 
## $cd.info$x
## [1] 1.5
## 
## $cd.info$y
## [1] 0.1
## 
## $cd.info$nemenyi.data
##   xstart xend diff   y
## 1    1.5    4  2.5 0.1
## 
## 
## $friedman.nemenyi.test
## 
##  Friedman rank sum test
## 
## data:  auc.test.mean and learner.id and task.id
## Friedman chi-squared = 6, df = 3, p-value = 0.1116
## 
## 
## $baseline
## [1] NBayes
## Levels: KNN LOG NBayes Tree
## 
## $p.value
## [1] 0.05
## 
## attr(,"class")
## [1] "CritDifferencesData"

The ranking analysis suggests that the logistic and Naive Bayes models are the best algorithms to use for interpreting our test scores.

These results can be visualized in Figures 3A, B and C below:

bmdata%>%ggplot(aes(x=auc,fill=task.id))+geom_density(alpha=0.6)+facet_wrap(~learner.id,ncol=3,scales="free")+geom_vline(xintercept=0.9)+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

bmdata%>%ggplot(aes(x=task.id,y=auc,fill=task.id,color=learner.id))+stat_summary(fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,alpha=0.6,size=1)+facet_wrap(~learner.id,scales="free")+coord_flip()

bmdata%>%ggplot(aes(x=learner.id,y=auc,fill=learner.id,color=learner.id))+stat_summary(fun.ymin=min,fun.ymax=max,fun.y="median",shape=21,alpha=0.6,size=1)+facet_wrap(~task.id,scales="free")+coord_flip()

bmdata%>%ggplot(aes(x=iter,y=auc,color=task.id))+geom_path(alpha=0.9,size=0.8)+facet_grid(learner.id~task.id,scales="free")+scale_color_manual(values=c("#ff0c4c","purple"))+geom_hline(yintercept=0.9,color="red")

Figure 4: Ranking of the 4 algorithms by the Nemenyi test

nemenyi = generateCritDifferencesData(bmrk, p.value = 0.05, test = "nemenyi")
plotCritDifferences(nemenyi)+coord_cartesian(xlim = c(-1,5), ylim = c(0,2))

Figure 5: ROC curves of the 4 algorithms

rocd=generateThreshVsPerfData(bmrk, measures = list(fpr, tpr))
plotROCCurves(rocd)+geom_line(size=1,aes(color=learner))+geom_polygon(aes(fill=learner),alpha=0.5)+scale_fill_manual(values=myfillcolors)+scale_color_manual(values=mycolors)

Finally, we would like to reveal the content of our two best algorithms: the Naive Bayes and logistic models.

Figure 6: Classification rule by the Naive Bayes model

# Train a Naive Bayes learner on the new-test task and inspect the fitted model
NB=makeLearner("classif.naiveBayes",id="NB",predict.type = "prob",fix.factors.prediction = TRUE)

NBfit=train(NB,taskNew)

NBfit$learner.model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##        No       Yes 
## 0.4107692 0.5892308 
## 
## Conditional probabilities:
##      Testscore
## Y         [,1]     [,2]
##   No  12.05618 4.484794
##   Yes 25.39687 5.106119
predict(NBfit,taskNew)%>%as_tibble()%>%mutate(.,TestScore=df2$Testscore)%>%ggplot(aes(x=TestScore,y=prob.Yes))+geom_line(size=1,color="red")

Figure 7: Classification rule by the logistic model

# Train a logistic regression learner on the new-test task and inspect the fitted model
LOG=makeLearner("classif.logreg",id="logistic",predict.type = "prob",fix.factors.prediction = TRUE)

logistic=train(LOG,taskNew)
summary(logistic$learner.model)
## 
## Call:
## stats::glm(formula = f, family = "binomial", data = getTaskData(.task, 
##     .subset), weights = .weights, model = FALSE)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.49821  -0.20006   0.02969   0.21166   2.12916  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.29682    0.91512  -11.25   <2e-16 ***
## Testscore     0.58140    0.05085   11.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 880.28  on 649  degrees of freedom
## Residual deviance: 261.74  on 648  degrees of freedom
## AIC: 265.74
## 
## Number of Fisher Scoring iterations: 7
predict(logistic,taskNew)%>%as_tibble()%>%mutate(.,TestScore=df2$Testscore)%>%ggplot(aes(x=TestScore,y=prob.Yes))+geom_line(size=1,color="purple")

The distribution graph of the positive-class probability indicates that these two models are equivalent for clinical practice.
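This visual impression can also be checked numerically; a minimal sketch (not part of the original analysis) comparing the predicted positive probabilities of the two fitted models, assuming the NBfit and logistic objects created above:

# Agreement between the Naive Bayes and logistic predicted probabilities on the new-test data
pNB  <- as.data.frame(predict(NBfit, taskNew))$prob.Yes
pLOG <- as.data.frame(predict(logistic, taskNew))$prob.Yes
cor(pNB, pLOG, method = "spearman")        # rank agreement of the two probability curves
table(NB = pNB >= 0.5, LOG = pLOG >= 0.5)  # agreement of the default 0.5 decision rule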

Conclusion: Our study has successfully extended the conventional ROC-curve analysis. There are two take-home messages in this document:

  1. The performance of a diagnostic test can be improved by adopting the right decision rule or interpretation algorithm. Think carefully before picking an arbitrary cut-off point, because a benchmark study might give you something better.

  2. The ROC curve as we know it is just a two-dimensional line plot showing the relationship between sensitivity and 1 - specificity of a given classifier. So ANY classification rule (not only a specific test on its own, but any rule built on stand-alone or combined clinical data, including symptoms, laboratory markers and medical imaging, using a simple or sophisticated algorithm…) can be visually evaluated by a ROC curve, as in the sketch below.
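As an illustration of this second point (a minimal sketch, assuming the logistic model fitted above), the predicted probabilities of any classifier can be fed directly into pROC, just like a raw test score:

# ROC curve built from the logistic decision rule rather than the raw test score
probs <- as.data.frame(predict(logistic, taskNew))$prob.Yes
roc_rule <- roc(df2$Disease, probs, ci = TRUE)
plot(roc_rule, col = "purple", main = "ROC of the logistic decision rule")
roc_rule$auc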

END