Case study X11: Generic Bootstrap aggregating applied to Naive Bayes

Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project has been initialized in 2016 and aims to:

  1. Encourage using Machine Learning techniques in medical research in Vietnam and

  2. Promote the use of R statistical programming language, an open source and leading tool for practicing data science.

Introduction

Bootstrap aggregating (BAGGING) is a machine learning ensembme meta-algorithm that was introduced by Leo Breiman in 1994. This technique aims to improve the performance and stability of a weak learner by averaging many weak learners on randomly generated training sets.

On an initial training set T of n instances, bagging algorithm generates many new training subset of equal sizes n’, by sampling randomly, uniformly and with replacement from T (some cases could be repeated, likes in Bootstrap). Then it will train a basic learner on each subset. The training might also imply a random feature selection for each model. The final step is aggregating, when all models are combined together. The prediction will be done by averaging (regression tasks) or by voting (classification tasks).

The most wellknown application of Bagging is in Random Forest algorithm where bagging is applied to Decision tree with random subsampling and feature selection. As we have seen before, a Random Forest is very powerful though it was based on weak learner which is decision tree. However, Bagging could be applied to any type of learner, such as KNN, linear regression or Naive Bayes… when we find it appropriate.

The mlr package provides a generic bagging procedure that could be used on any basic learners among the mlr library. The mlrs’ Bagging algorithm supports both Bootstraping and random subsampling (no replacement) with customisabled sample size. It also included a feature selection at a fixed ratio. The bagged model could be tuned up, trained, resampled, validated by conventional functions in mlr.

In this case study, we will explore the generic Bagging procedure on a binary classification task. Our experiment will imply the Biopsy dataset of Dr. William H. Wolberg (William H. Wolberg and O.L. Mangasarian, 1990). This dataset was previously used in Case study X6. Biopsies of breast tumours for 699 patients up to 15 July 1992 have been recorded. Each of nine attributes has been scored on a scale of 1 to 10, and the outcome is classified as “Benign” or “Malignant” tumors. Original dataset contained 699 instances of 10 variables. After removing missing values, there are 683 cases. The 9 features include scores of clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses.

The main objective is to apply the generic bagging algorithm on a basic Naive Bayes learner and to find out whether the Bagged model would performs better than the basic one on a test set ?

Materials and method

First, we will prepare the ggplot theme for our experiment

library(tidyverse)

my_theme <- function(base_size = 10, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 10),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 12),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "white"),
      strip.background = element_rect(fill = "#400156", color = "#400156", size =0.5),
      strip.text = element_text(face = "bold", size = 10, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())

Now we load the dataset from the famous http://vincentarelbundock.github.io website

df=read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/MASS/biopsy.csv")%>%as_tibble()%>%.[,c(3:12)]%>%na.omit()

names(df)=c("clumpthickness",
            "SizeUniformity",
            "ShapeUniformity",
            "Margin_adhesion",
            "EpiCellSize",
            "Barenuclei",
            "BlandChromatin",
            "NormalNucleoli",
            "Mitoses",
            "Class"
)

Data visualising

dfscale<-df[,-10]%>%as.matrix()%>%scale()%>%as_tibble()%>%mutate(.,Class=df$Class,Id=row.names(.))

library(viridis)

dfscale%>%gather(clumpthickness:Mitoses,key="Criteria",value="Score")%>%ggplot(aes(x=reorder(Id,-Score),y=reorder(Criteria,Score),fill=Score))+geom_tile(show.legend=T)+facet_wrap(~Class,ncol=2,shrink=T,scale="free")+scale_fill_viridis(option="D",begin=1,end=0)+theme(axis.text.y=element_blank(),axis.text.x = element_text(angle =45,hjust=1,vjust=1))+coord_flip()+scale_y_discrete("Criteria")+scale_x_discrete("Patient's Id")

df%>%gather(clumpthickness:Mitoses,key="Criteria",value="Score")%>%ggplot(aes(x=Class,y=Score,color=Class))+geom_jitter(alpha=0.2,show.legend =T)+facet_wrap(~Criteria,scales="free",ncol=3)+scale_color_manual(values=c("#22cc9e","#9437f2"))+theme(axis.text.x=element_blank())

Then we will split the origin dataset into training and testing subsets:

library(caret)

set.seed(123)
id=createDataPartition(y=df$Class, p=0.5,list=FALSE)
trainset=df[-id,]
testset=df[id,]

p1=trainset%>%as.data.frame()%>%ggplot(aes(x=Class,fill=Class))+stat_count(show.legend=F)+scale_fill_manual(values=c("#22cc9e","#9437f2"))+coord_flip()+ggtitle("Train")

p2=testset%>%as.data.frame()%>%ggplot(aes(x=Class,fill=Class))+stat_count(show.legend=F)+scale_fill_manual(values=c("#22cc9e","#9437f2"))+coord_flip()+ggtitle("Test")

p3=df%>%as.data.frame()%>%ggplot(aes(x=Class,fill=Class))+stat_count(show.legend=F)+scale_fill_manual(values=c("#22cc9e","#9437f2"))+coord_flip()+ggtitle("Origin")

library(gridExtra)

grid.arrange(p1,p2,p3,ncol=1)

Then we will load the mlr package and generate a classification Task

library(mlr)

tasktrain=makeClassifTask(id="Biopsy",data=trainset,target = "Class",positive = "malignant")

We will generate 2 learners:

  1. A basic learner which is Naive Bayes

  2. A Bagging learner with 200 iterations, bootstrap resampling with replacement at 80% of original sample size and a feature selection with selection rate fixed at 50%.

basic.learner=makeLearner("classif.naiveBayes",predict.type="response")

bagNB.learner=makeBaggingWrapper(basic.learner,bw.iters=200,bw.replace=TRUE,bw.size=0.8,bw.feats=0.5) 

#This is a trick: the makeBaggingWrapper only accept the learner with predict.type ="response", but we should reset the type to "probability" in order to use some evaluation metrics.

bagNB.learner= setPredictType(bagNB.learner, predict.type = "prob")
basic.learner= setPredictType(basic.learner, predict.type = "prob")

BASIC NAIVE BAYES MODEL

First we train the baisc learner:

#Basic NB mod
basicmod=mlr::train(basic.learner,tasktrain)
basicmod$learner.model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##    benign malignant 
## 0.6510264 0.3489736 
## 
## Conditional probabilities:
##            clumpthickness
## Y               [,1]     [,2]
##   benign    2.945946 1.660496
##   malignant 7.109244 2.378541
## 
##            SizeUniformity
## Y               [,1]      [,2]
##   benign    1.292793 0.7489615
##   malignant 6.630252 2.7551663
## 
##            ShapeUniformity
## Y               [,1]      [,2]
##   benign    1.432432 0.9332629
##   malignant 6.588235 2.5425590
## 
##            Margin_adhesion
## Y               [,1]      [,2]
##   benign    1.315315 0.7957123
##   malignant 5.462185 3.2016678
## 
##            EpiCellSize
## Y               [,1]      [,2]
##   benign    2.081081 0.9236475
##   malignant 5.453782 2.4242696
## 
##            Barenuclei
## Y               [,1]     [,2]
##   benign    1.355856 1.263636
##   malignant 7.638655 2.990752
## 
##            BlandChromatin
## Y               [,1]     [,2]
##   benign    2.049550 1.150359
##   malignant 5.739496 2.334205
## 
##            NormalNucleoli
## Y               [,1]      [,2]
##   benign    1.229730 0.8803634
##   malignant 5.554622 3.3387745
## 
##            Mitoses
## Y               [,1]      [,2]
##   benign    1.103604 0.6945721
##   malignant 2.512605 2.5240052

We evaluate its performance:

predbasic=predict(basicmod,newdata=testset)

mets=list(auc,bac,tpr,tnr,mmce,ber,fpr,fnr,ppv,npv)


pdfbasic=predbasic%>%performance(.,mets)%>%as_tibble()%>%mutate(.,Metric=row.names(.),Model="BasicNB")
cbind(pdfbasic$Metric,pdfbasic$value)
##       [,1]   [,2]                
##  [1,] "auc"  "0.988063063063063" 
##  [2,] "bac"  "0.96722972972973"  
##  [3,] "tpr"  "0.975"             
##  [4,] "tnr"  "0.959459459459459" 
##  [5,] "mmce" "0.0350877192982456"
##  [6,] "ber"  "0.0327702702702703"
##  [7,] "fpr"  "0.0405405405405405"
##  [8,] "fnr"  "0.025"             
##  [9,] "ppv"  "0.928571428571429" 
## [10,] "npv"  "0.986111111111111"
caret::confusionMatrix(predbasic$data$response,reference=testset$Class,mode="everything",positive="malignant")
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       213         3
##   malignant      9       117
##                                           
##                Accuracy : 0.9649          
##                  95% CI : (0.9395, 0.9817)
##     No Information Rate : 0.6491          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9238          
##  Mcnemar's Test P-Value : 0.1489          
##                                           
##             Sensitivity : 0.9750          
##             Specificity : 0.9595          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 0.9861          
##               Precision : 0.9286          
##                  Recall : 0.9750          
##                      F1 : 0.9512          
##              Prevalence : 0.3509          
##          Detection Rate : 0.3421          
##    Detection Prevalence : 0.3684          
##       Balanced Accuracy : 0.9672          
##                                           
##        'Positive' Class : malignant       
## 

BAGGED NB MODEL

set.seed(123)
bagNBmod=mlr::train(bagNB.learner,tasktrain)

predBAG=predict(bagNBmod,newdata=testset)

pdfBAG=predBAG%>%performance(.,mets)%>%as_tibble()%>%mutate(.,Metric=row.names(.),Model="Bagged.NB")
cbind(pdfBAG$Metric,pdfBAG$value)
##       [,1]   [,2]                
##  [1,] "auc"  "0.995213963963964" 
##  [2,] "bac"  "0.969481981981982" 
##  [3,] "tpr"  "0.975"             
##  [4,] "tnr"  "0.963963963963964" 
##  [5,] "mmce" "0.0321637426900585"
##  [6,] "ber"  "0.030518018018018" 
##  [7,] "fpr"  "0.036036036036036" 
##  [8,] "fnr"  "0.025"             
##  [9,] "ppv"  "0.936"             
## [10,] "npv"  "0.986175115207373"
caret::confusionMatrix(predBAG$data$response,reference=testset$Class,mode="everything",positive="malignant")
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign       214         3
##   malignant      8       117
##                                           
##                Accuracy : 0.9678          
##                  95% CI : (0.9432, 0.9838)
##     No Information Rate : 0.6491          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9301          
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.9750          
##             Specificity : 0.9640          
##          Pos Pred Value : 0.9360          
##          Neg Pred Value : 0.9862          
##               Precision : 0.9360          
##                  Recall : 0.9750          
##                      F1 : 0.9551          
##              Prevalence : 0.3509          
##          Detection Rate : 0.3421          
##    Detection Prevalence : 0.3655          
##       Balanced Accuracy : 0.9695          
##                                           
##        'Positive' Class : malignant       
## 

The validation on testing subset only showed a slight improvement in False positive rate (Bagged model: 8 cases vs Basic model: 9 cases)

Here are two ROCs of those models:

preddf1=predbasic$data%>%as_tibble()%>%mutate(.,Model="Basic.NB")
preddf2=predBAG$data%>%as_tibble()%>%mutate(.,Model="Bagged.NB")

predDF=rbind(preddf1,preddf2)

predDF%>%ggplot()+geom_histogram(aes(x=prob.malignant,fill=Model),alpha=0.8,binwidth = 0.03,color="black")+geom_vline(xintercept=0.5,size=1,linetype=2,color="red2")+scale_x_continuous("predicted probability")+scale_fill_manual(values=c("gold","purple"))+ggtitle("Cut-offs")+facet_grid(truth~Model,scales="free_y")

rocdf1=generateThreshVsPerfData(predbasic,measures = list(fpr, tpr))
plotROCCurves(rocdf1)+ggtitle("Basic NB")+geom_line(size=1,color="purple4")+geom_polygon(fill="violet",alpha=0.5)

rocdf2=generateThreshVsPerfData(predBAG,measures = list(fpr, tpr))
plotROCCurves(rocdf2)+ggtitle("Bagged NB")+geom_line(size=1,color="red")+geom_polygon(fill="gold",alpha=0.5)

As the one point validation did not show clearly difference between two models, we will evaluate their performance by resampling. A 10x10 cross-validation will be applied then 3 most important performance metrics : Balanced accuracy, False negative rate and Balanced error rate will be averaged on 100 iterations:

10x10 CROSS-VALIDATION

rdesc=makeResampleDesc("RepCV",reps=10L,folds=10L,stratify = TRUE)

r1=resample(basic.learner,tasktrain,measures=list(bac,fnr,ber),rdesc) # 10x10CV
r2=resample(bagNB.learner,tasktrain,measures=list(bac,fnr,ber),rdesc) 

Bagged Naive Bayes model is characterised by better stability (as showed by the iteration trace), higher accuracy (BAC) and lower false negative misclassification rate than the basic model. The difference in BAC was statistically significative (p=0.0006).

rdf1=r1$measures.test%>%as_tibble()%>%mutate(.,Model="BasicNB")
rdf2=r2$measures.test%>%as_tibble()%>%mutate(.,Model="BaggedNB")
rdlong=rbind(rdf1,rdf2)
names(rdlong)=c("Iteration","BAC","FNR","BER","Model")
rdlong$Iteration=as.numeric(rdlong$Iteration)

rdlong%>%ggplot(aes(x=BAC,fill=Model))+geom_density(color="black",alpha=0.3)+geom_vline(xintercept=0.97,size=1,color="blue",linetype=2)+scale_fill_manual(values=c("red","blue"))+ggtitle("Balanced Accuracy")

rdlong%>%ggplot(aes(x=Iteration,y=BAC,fill=Model,color=Model))+geom_line(alpha=0.5)+geom_point()+geom_hline(yintercept=0.97,size=1,linetype=2)+scale_color_manual(values=c("red","blue"))+ggtitle("Balanced Accuracy")+facet_wrap(~Model,ncol=1)

rdlong%>%ggplot(aes(x=Iteration,y=FNR,fill=Model,color=Model))+geom_line(alpha=0.5)+geom_point()+geom_hline(yintercept=0.05,size=1,linetype=2)+scale_color_manual(values=c("red","blue"))+ggtitle("False negative rate")+facet_wrap(~Model,ncol=1)

rdlong%>%ggplot(aes(x=Iteration,y=BER,fill=Model,color=Model))+geom_line(alpha=0.5)+geom_point()+geom_hline(yintercept=0.025,size=1,linetype=2)+scale_color_manual(values=c("red","blue"))+ggtitle("Balanced Error rate")+facet_wrap(~Model,ncol=1)

rdlong%>%ggplot(aes(x=BER,fill=Model))+geom_density(color="black",alpha=0.3)+geom_vline(xintercept=0.025,size=1,color="blue",linetype=2)+scale_fill_manual(values=c("red","blue"))+ggtitle("Balanced error rate")

rdlong%>%wilcox.test(formula=BAC~Model,data=.,paired=FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  BAC by Model
## W = 6386, p-value = 0.0005612
## alternative hypothesis: true location shift is not equal to 0

INTERPRETIVE STUDY

pfdf=cbind(predBAG$data,testset)%>%gather(.,clumpthickness:NormalNucleoli,key="Feature",value="Score")

pfdf$Score=as.integer(pfdf$Score)

pfdf%>%ggplot(aes(x=Score,y=prob.malignant,color=Feature))+geom_smooth(show.legend = F,se=F)+scale_x_continuous(breaks=c(0,1,2,3,4,5,6,7,8,9,10))+facet_wrap(~Feature,ncol=2,scales="free")

Conclusion

Through this experiment, we found that:

  1. Naive Bayes is a simple but very powerful classification algorithm.

  2. Generic bagging algorithm is helpful for improving the stability and accuracy of any weak learner. The mlr package provided a generic and scalable bagging procedure that implies both random subsetting and feature selection.

The Bagging algorithm could also be applied to situations with large number of features (as the process implies naturally a model regularisation) and/or limited sample size. In our case, Bagging algorithm as sucessfully improved the performance of basic naive Bayes model using 200 models on a small sample of about 200 observations and randomly selected 4-5 features for each model.

See you in the next tutorial and thank for joining us !

END