---
title: "MLM Case study X16"
subtitle: "Ensemble modelling"
author: "Le Dong Nhat Nam (MD, PhD)"
date: "July 05, 2017"
output: 
  html_document: 
    code_download: true
    code_folding: hide
    number_sections: yes
    theme: journal
    df_print: kable
    toc: TRUE
    toc_float: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error = FALSE, warning = FALSE, message = FALSE)

library(tidyverse)
library(matrixStats)
library(corrplot)
library(RColorBrewer)
library(caret)

```

![](ensembling.png)

# Foreword: About the Machine Learning in Medicine project

You are reading the 36th case study by the MLM team. The MLM project was initiated in 2016 and aims to:

(1) encourage the use of machine learning techniques in medical research in Vietnam,

and (2) promote the use of the R statistical programming language, an open-source and leading tool for practising data science.
    
# Introduction

In machine learning, ensembling means combining several base algorithms (learners) to build a stronger one. Such an integrated system combines the predictions of all individual learners to produce the final prediction. An ensemble model usually outperforms its standalone base learners, much as brainstorming and teamwork do in everyday life.
The present tutorial provides a hands-on experiment with some basic techniques of ensemble modelling. These include:

**Weighted average**: weights are applied to the predictions of the base learners before averaging them. Thus, the learners contribute differently to the averaged prediction, depending on their performance.

**Majority vote**: the class that receives the most votes from the base algorithms is retained. This method applies only to classification problems.

**Stacking**: this method builds multiple layers of models, where the output of the models in the lower layer is used as the input of a model in the upper layer. The top-layer model makes the final decision.
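
To make these three rules concrete, here is a minimal toy sketch in R (the probability vectors `p_a`, `p_b`, `p_c` and the weights are made up for illustration and are not part of the experiment below):

```{r, eval=FALSE}
# Toy illustration of the three ensemble rules for a 3-class problem
p_a <- c(A = 0.6, B = 0.3, C = 0.1)   # class probabilities from learner 1
p_b <- c(A = 0.2, B = 0.5, C = 0.3)   # learner 2
p_c <- c(A = 0.5, B = 0.4, C = 0.1)   # learner 3

# 1) Weighted average: weight each learner, then pick the most probable class
w_avg <- 0.5 * p_a + 0.3 * p_b + 0.2 * p_c
names(which.max(w_avg))               # "A"

# 2) Majority vote: each learner votes for its own top class
votes <- c(names(which.max(p_a)), names(which.max(p_b)), names(which.max(p_c)))
names(which.max(table(votes)))        # "A" (2 votes out of 3)

# 3) Stacking: the probability outputs become the features of a top-layer model
stack_features <- c(p_a, p_b, p_c)    # in practice: one such row per observation
```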

![](heroteam.png)

# Dataset and study objective

Our study addresses a multiclass classification problem involving Cardiotocography (CTG) reading in obstetrics. CTG is a method for monitoring the foetal heart rate (cardio) by ultrasound and its temporal relationship to uterine contractions (toco). It can be used both before labour (3rd trimester) and during labour to monitor the baby for any signs of distress. Changes in the foetal heart rate that occur along with contractions form a pattern, and certain changes in this pattern may suggest a problem.

A description of the dataset can be found here:

<https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names>

![](toco2.png)

The original dataset was created by Ayres de Campos et al. (Faculty of Medicine, University of Porto, Portugal) in 2000 to support an upgrade of a computer program (SisPorto 2.0) for the automated analysis of cardiotocograms. The dataset includes 2126 preprocessed foetal cardiotocograms (CTGs). The CTGs were classified by three expert obstetricians, and a consensus classification label was assigned to each of them.

**The 21 features included:** baseline FHR values (LBE, LB), determined by approximating the mean FHR rounded to increments of 5 beats per minute (bpm) over a 10-minute window, excluding accelerations, decelerations and periods of marked FHR variability (greater than 25 bpm); accelerations (AC); foetal movement (FM); uterine contractions (UC); percentage of time with abnormal short-term variability (ASTV); mean value of short-term variability (MSTV); percentage of time with abnormal long-term variability (ALTV); mean value of long-term variability (MLTV); light decelerations (DL); severe decelerations (DS); prolonged decelerations (DP); repetitive decelerations (DR); histogram width (Width); low frequency of the histogram (Min); high frequency of the histogram (Max); number of histogram peaks (Nmax); number of histogram zeros (Nzeros); and the histogram's mode, mean, median, variance and tendency (left asymmetric, symmetric, right asymmetric).

**Clinical objective**: to classify 10 morphologic patterns, including calm sleep, REM sleep, calm vigilance, active vigilance, shift pattern (A or Susp with shifts), accelerative/decelerative pattern (stress situation), decelerative pattern (vagal stimulation), largely decelerative pattern, flat-sinusoidal pattern (pathological state) and suspect pattern. This is therefore a multiclass classification problem.
The data can be downloaded from:

<https://archive.ics.uci.edu/ml/datasets/Cardiotocography#>

Reference: Ayres de Campos et al. (2000) SisPorto 2.0 A Program for Automated Analysis of Cardiotocograms. J Matern Fetal Med 5:311-318

**Practical objective**: the present data experiment aims to explore the utility of 3 ensembling techniques: weighted averaging, majority voting and stacking.

# Methods

Before being used as input, the original data were cleaned and transformed. Features with no or low variance, as well as those with high collinearity, were removed; only 15 features remained in the final data frame. Some of them were transformed with the Yeo-Johnson power transformation in order to improve their contrast among the 10 classes. The dataset was then randomly split into 2 subsets for training (50%, n = 1066) and testing (50%, n = 1060). Model performance was evaluated with a visual confusion matrix, the Kappa coefficient and the F1 score (for each label).
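
For reference, these are the standard definitions: the per-class F1 score is the harmonic mean of precision and recall, and Cohen's Kappa compares the observed accuracy $p_o$ with the accuracy $p_e$ expected by chance alone:

$$
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\qquad
\kappa = \frac{p_o - p_e}{1 - p_e}
$$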

*Our ensembling experiment consisted of 4 steps:*

+ 1) First, we trained 5 base classifiers based on a support vector machine (SVM with linear kernel), k-nearest neighbours (KNN), a simple decision tree (CART), One Rule (OneR) and Random Forest (ranger package).

+ 2) We developed the first ensemble model, based on the weighted averaging principle.

+ 3) In a similar way, we applied the majority voting principle to build another ensemble model.

+ 4) Finally, we used the stacking ensemble technique to combine the 5 base learners under a gradient boosting machine (GBM) learner.

The caret package (Max Kuhn, 2012) was used throughout our experiment. All 3 ensembling procedures were performed manually, using customised loops and functions in R. We also introduced 2 functions for visualising the confusion matrix and the per-class performance of a multiclass classifier.

# Data preparation and exploration

First, we prepare the ggplot theme for our experiment:

```{r,message = FALSE,warning=FALSE}

library(tidyverse)

# Compact ggplot theme reused for all figures in this tutorial
my_theme <- function(base_size = 5, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      axis.text = element_text(size = 5),
      axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
      axis.title = element_text(size = 5),
      panel.grid.major = element_line(color = "grey"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "white"),
      strip.background = element_rect(fill = "#4f0870", color = "#4f0870", size =0.5),
      strip.text = element_text(face = "bold", size =5, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      panel.border = element_rect(color = "grey30", fill = NA, size = 0.5)
    )
}
theme_set(my_theme())

# Colour palette: one colour per CTG class (10 classes)
mycol=c("#ff0048","#fce628","#0de2aa","#2dd6e2","#c362f7","#ffae22","#b1e827","#3495ea","#ac9af9","#ff51c5")
```

Now we load the dataset into R:

```{r,message = FALSE,warning=FALSE}
#load data

# Load the raw CSV (semicolon-separated) and drop the unused columns,
# keeping the 20 numeric features and the CLASS label
df=read.csv("Cardiotocography.csv",sep=";")%>%.[,-c(21:31)]%>%.[,-22]

# Make sure all 20 features are numeric
df[c(1:20)] <- lapply(df[c(1:20)], as.numeric)

# Recode the 10 numeric class codes into readable labels
df$CLASS=recode_factor(df$CLASS,`1` = "CalmSleep", `2` = "REM",`3` = "CalmVig",`4` = "ActVig", `5` = "Shiftpattern",`6` = "AD",`7` = "DE", `8` = "LD",`9` = "Flatsinusoidal",`10` = "Suspect")

df$CLASS=as.factor(df$CLASS)

head(df)%>%knitr::kable()

```

# Data cleaning, Visualisation and Transformations

```{r}
df %>% gather(LB:Variance, key = "Feature", value = "Value") %>%
  ggplot(aes(x = Value, fill = CLASS)) +
  geom_density(color = "black", alpha = 0.4, show.legend = TRUE) +
  facet_wrap(~Feature, drop = TRUE, ncol = 5, scales = "free") +
  my_theme(5) + scale_fill_manual(values = mycol) +
  ggtitle("Original data: Density curves")

```

The density plots tell us two things:

+ Some features seem to be equivalent (for example Mean, Mode and Median), as their density distributions are almost identical.

+ Some features have skewed, non-normal distributions (as they are count data). A transformation might be needed to improve the contrast of their values among the 10 classes. These include AC, ALTV, DL, DP, DS, MLTV, MSTV, Nmax, Nzeros, UC and Variance.

Therefore, we will apply a transformation to our data. This transformation consists of:

+ Removing the features with no or low variance, as they will not contribute to our prediction.
+ Systematically applying a Yeo-Johnson transformation to the features with skewed distributions (see the formula below).
+ Filtering out the features with high collinearity.
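
For reference, for a feature value $y$ and a transformation parameter $\lambda$ (estimated from the data by `preProcess`), the Yeo-Johnson transformation is defined as:

$$
\psi(y, \lambda) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda} & y \ge 0,\ \lambda \ne 0 \\
\log(y + 1) & y \ge 0,\ \lambda = 0 \\
-\dfrac{(1 - y)^{2 - \lambda} - 1}{2 - \lambda} & y < 0,\ \lambda \ne 2 \\
-\log(1 - y) & y < 0,\ \lambda = 2
\end{cases}
$$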

```{r}
library(caret)

# Drop near-zero-variance features, apply Yeo-Johnson power transforms,
# and filter out highly correlated features
YJT=preProcess(df[,-21],method=c("nzv", "conditionalX","YeoJohnson","corr"))
YJTdata=predict(YJT,newdata=df[,-21])%>%mutate(.,CLASS=df$CLASS)
  
head(YJTdata)%>%knitr::kable()

```

After transformation, our dataset contains only 15 features (so 5 features have been removed from the original data, due to low variance or collinearity).

We visualise again the data:

```{r}
YJTdata %>% gather(LB:Variance, key = "Feature", value = "Value") %>%
  ggplot(aes(x = Value, fill = CLASS)) +
  geom_density(color = "black", alpha = 0.4) +
  facet_wrap(~Feature, drop = TRUE, ncol = 5, scales = "free") +
  my_theme(5) + scale_fill_manual(values = mycol) +
  ggtitle("Transformed data: Density curves")

```

As we can see, the Yeo-Johnson transformation has corrected, as best it could, the abnormal skewness in AC, FM, DL, Nzeros, Width, UC and Variance. After transformation, those variables show better contrast. However, such a transformation might not be required for ASTV, LB, Min and Max.

Although the transformation automatically removed DP and DS because of their low variance, we saw that DP in raw value showed at least some contrast between 2 of the 10 classes, so we will keep this feature.

We decide to build the final dataset as follows:

+ Features kept in raw values: ASTV, LB, Min and Max.
+ Among Mean, Mode and Median, only one is needed; we keep the raw Median.
+ Between DP and DS, only one is needed, in raw value; we choose DP.
+ The other variables are used in their transformed form: MLTV, MSTV, Nmax, Nzeros, Variance, Width, UC, DL, FM and AC.

```{r}
rawval=c("ASTV","LB","Min","Max","Median","DP")
yjval=c("MLTV","MSTV","Nmax","Nzero","Variance","Width","UC","DL","FM","AC")

data=cbind(df[,(names(df) %in% rawval)],YJTdata[,(names(YJTdata) %in% yjval)])%>%mutate(.,CLASS=df$CLASS)

rm(YJTdata,df,rawval,yjval,YJT)

```

Our final dataset contains 15 features and can be visualised as below:

```{r}
data %>% gather(LB:Variance, key = "Feature", value = "Value") %>%
  ggplot(aes(x = Value, fill = CLASS)) +
  geom_density(color = "black", alpha = 0.4) +
  facet_wrap(~Feature, drop = TRUE, ncol = 3, scales = "free") +
  my_theme(5) + scale_fill_manual(values = mycol) +
  ggtitle("Cleaned and transformed data")

data %>% gather(LB:Variance, key = "Feature", value = "Value") %>%
  ggplot(aes(x = CLASS, y = Value, fill = CLASS)) +
  geom_violin(color = "black", alpha = 0.7) +
  facet_wrap(~Feature, drop = TRUE, ncol = 3, scales = "free") +
  my_theme(5) + scale_fill_manual(values = mycol) +
  ggtitle("Cleaned and transformed data")

```

This dataset will be randomly split into training and test subsets:

```{r}
set.seed(1234)

# Stratified 50/50 split on the CLASS label
idTrain=createDataPartition(y=data$CLASS,p=0.5,list=FALSE)
trainset=data[idTrain,]
testset=data[-idTrain,]

```

Here is a correlation matrix showing the association pattern among the 15 features in our training subset:

```{r}
library(corrplot)
library(RColorBrewer)

cor_mat=as.matrix(cor(method="spearman",as.matrix(trainset[,c(1:15)])))

cor_mat%>%corrplot(.,order="hclust",type="lower",method="color",tl.col="black", tl.srt=45,tl.cex=0.6,col=rev(brewer.pal(n=10, name="Spectral")))

```

Here is another matrix that shows the distribution of the classes from different bidimensional viewpoints:

```{r}

plotfuncmid <- function(data,mapping){
  p <- ggplot(data = data,mapping=mapping)+geom_density(aes(fill=data$CLASS),alpha=0.5,color="black")+scale_fill_manual(values=mycol)+scale_color_manual(values=mycol)
  p
}

plotfuncLow <- function(data,mapping){
  p <- ggplot(data = data,mapping=mapping)+stat_density2d(geom="polygon",aes(fill=data$CLASS,alpha = ..level..))+scale_fill_manual(values=mycol)
  p
}

plotfuncUp <- function(data,mapping){
  p <- ggplot(data = data,mapping=mapping)+geom_smooth(method="lm",se=FALSE,alpha=0.6,aes(color=data$CLASS))+scale_fill_manual(values=mycol)
  p
}


library(GGally)

ggpairs(data,columns=1:15,lower=list(continuous=plotfuncLow),diag=list(continuous=plotfuncmid),upper = list(continuous=plotfuncUp))

```

# Training 5 base learners

The first step in our experiment consists of training 5 different base learners. These include: SVM, CART, KNN, OneR and Random Forest. 

```{r,message = FALSE,results="hide"}

#5 models

# 10-fold cross-validation repeated 10 times; keep class probabilities for the ensembles
Control=trainControl(method="repeatedcv",number=10,repeats=10,classProbs=TRUE,summaryFunction=multiClassSummary,savePredictions ="final")

#1) SVM

set.seed(1234)

svmod=train(CLASS~.,data=trainset,method = "svmLinear2",trControl=Control,tuneLength=5)

predsvm=predict(svmod,newdata=testset)
cm1=confusionMatrix(predsvm,testset$CLASS)

#2) CART

set.seed(666)

cartmod=train(CLASS~.,data=trainset,method = "rpart",trControl=Control,tuneLength=5)

predcart=predict(cartmod,newdata=testset)
cm2=confusionMatrix(predcart,testset$CLASS)

#3) KNN

set.seed(333)

knnmod=train(CLASS~.,data=trainset,method = "knn",trControl=Control,tuneLength=5)

predknn=predict(knnmod,newdata=testset)
cm3=confusionMatrix(predknn,testset$CLASS)

#4) 1R

set.seed(333)

ormod=train(CLASS~.,data=trainset,method = "OneR",trControl=Control,tuneLength=5)

predor=predict(ormod,newdata=testset)
cm4=confusionMatrix(predor,testset$CLASS)

#5) Random Forest (ranger)

set.seed(333)

rgmod=train(CLASS~.,data=trainset,method = "ranger",trControl=Control)

predrg=predict(rgmod,newdata=testset)
cm5=confusionMatrix(predrg,testset$CLASS)

```

Because our problem is complex, with 10 classes to classify, we would like to represent the confusion matrices and model performance graphically. Here is the solution:

```{r,message = FALSE,warning=FALSE}
multiclasscmplot=function(modelobj,testdf,target,modelname){
  predt=predict(modelobj,newdata=testdf)
  cm=caret::confusionMatrix(predt,testdf[,target])
  
  cmt=cm$table%>%as_tibble()
  
  freq=Hmisc::describe(testdf[,target])%>%.$values%>%.$frequency
  labels=row.names(cm$table)
  
  # Convert each cell count into a percentage of its true-class (Reference) total
  k=1
  cmt$rate=cmt$n/1
  for(i in 1:10){
    for(k in 1:100){
      if(cmt$Reference[k]==labels[i]){
        cmt$rate[k]<-100*(cmt$n[k]/freq[i])
        k=k+1
      }
      else{k=k}
    }
  }
  
  cmt%>%ggplot(aes(x=Prediction,y=Reference,fill=rate))+
    geom_tile(color="black")+my_theme(10)+
    scale_fill_gradient2(low="white",mid="#008cff",high="#ff0059",midpoint=50)+
    theme(axis.text.x = element_text(angle =45, hjust = 0.5))+
    ggtitle(modelname)+
    geom_text(aes(y=Reference,x=Prediction,label=round(rate,1)),color="black",size=3)+coord_fixed()
}

```
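
The nested loops above simply express each cell count as a percentage of its true-class (Reference) total. For readers who prefer dplyr, an equivalent sketch of that step (assuming the same `cm` object; not evaluated here) would be:

```{r, eval=FALSE}
# Same per-class rate, computed by grouping the confusion-matrix cells on the true class
cmt <- cm$table %>%
  as_tibble() %>%
  group_by(Reference) %>%              # one group per true class
  mutate(rate = 100 * n / sum(n)) %>%  # cell count / true-class total
  ungroup()
```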

```{r,message = FALSE,warning=FALSE}
multiclasscmplot(modelobj=svmod,testdf=testset,target="CLASS",modelname="SVM")

multiclasscmplot(modelobj=cartmod,testdf=testset,target="CLASS",modelname="CART")

multiclasscmplot(modelobj=ormod,testdf=testset,target="CLASS",modelname="One_R")

multiclasscmplot(modelobj=knnmod,testdf=testset,target="CLASS",modelname="KNN")

multiclasscmplot(modelobj=rgmod,testdf=testset,target="CLASS",modelname="RandomForest")



```

```{r,message = FALSE,warning=FALSE}
multiclasscmsum=function(modelobj,testdf,target,modelname,lowcol,highcol){
  
  predt=predict(modelobj,newdata=testdf)
  cm=caret::confusionMatrix(predt,testdf[,target])
  
  # Keep the per-class F1 score (column 7) and Balanced Accuracy (column 11)
  sum=cm$byClass%>%as_tibble()%>%.[,c(7,11)]%>%mutate(.,Labels=row.names(cm$table))
  sum%>%ggplot(aes(x=reorder(Labels,F1),y=F1))+
    geom_bar(aes(fill=F1),stat="identity",color="black")+
    coord_flip()+
    geom_text(aes(label=round(F1,2)),hjust=1,size=5,color="white")+ggtitle(paste(modelname,"/ F1 score"))+
    scale_fill_gradient(low=lowcol,high=highcol)
  }

```

```{r,message = FALSE,warning=FALSE}
multiclasscmsum(modelobj=svmod,testdf=testset,target="CLASS",modelname="SVM",lowcol="#c94be5",highcol="#ff0066")

multiclasscmsum(modelobj=knnmod,testdf=testset,target="CLASS",modelname="KNN",lowcol="#c94be5",highcol="#ff0066")

multiclasscmsum(modelobj=rgmod,testdf=testset,target="CLASS",modelname="RF",lowcol="#c94be5",highcol="#ff0066")

multiclasscmsum(modelobj=ormod,testdf=testset,target="CLASS",modelname="ONER",lowcol="#c94be5",highcol="#ff0066")

multiclasscmsum(modelobj=cartmod,testdf=testset,target="CLASS",modelname="CART",lowcol="#c94be5",highcol="#ff0066")

```

Based on the confusion matrices and F1 score graphs, we can see that the best model was Random Forest and the second best was SVM. OneR and the decision tree (CART) were weak: both failed to classify the pathological patterns correctly among the 10 target labels. KNN was better than CART or OneR, but this base learner could not guarantee a correct prediction for all 10 labels.

# First ensemble rule: Weighted average

Our first ensemble model applies a simple weighted-average principle. As mentioned above, RF and SVM were the two best models for this prediction, so we assign them higher weights (0.45 and 0.30, respectively). In contrast, KNN and CART were weak and OneR was the worst, so we assign lower weights to them (0.1 for KNN and CART, 0.01 for OneR). We then combine the 5 weighted probability predictions to make a new prediction: the label with the highest combined probability among the 10 classes is taken as the final prediction.

```{r,message = FALSE,warning=FALSE}

pred1=predict(svmod,newdata=testset,type="prob")
pred1=pred1*0.30
pred1=pred1%>%as_tibble()%>%mutate(.,Model="SVM")

pred2=predict(rgmod,newdata=testset,type="prob")
pred2=pred2*0.45
pred2=pred2%>%as_tibble()%>%mutate(.,Model="RF")

pred3=predict(knnmod,newdata=testset,type="prob")
pred3=pred3*0.1
pred3=pred3%>%as_tibble()%>%mutate(.,Model="KNN")

pred4=predict(ormod,newdata=testset,type="prob")
pred4=pred4*0.01
pred4=pred4%>%as_tibble()%>%mutate(.,Model="ONER")

pred5=predict(cartmod,newdata=testset,type="prob")
pred5=pred5*0.1
pred5=pred5%>%as_tibble()%>%mutate(.,Model="CART")

# Container for the summed (weighted) probabilities; drop the "Model" column
predt=pred1[,-11]
predt[,]<-NA
Labels=colnames(predt)

# Sum the weighted class probabilities of the 5 learners, cell by cell
for(i in 1:1060){
  for (j in 1:10){
    predt[i,j]=pred1[i,j]+pred2[i,j]+pred3[i,j]+pred4[i,j]+pred5[i,j]
  }
}
  
predt$Predict=rep(NA,1060)

# For each test case, keep the label with the highest combined probability
k=1
for(i in 1:1060){
  for(k in 1:10){
    if(is.na(predt$Predict[i])){
  maxprob=max(predt[i,c(1:10)])
  predt$Predict[i]<-ifelse(predt[i,k]==maxprob,Labels[k],NA)
  k=k+1
    }
    else{(k=k)
}
  }
}
```
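
The element-wise loops above can also be written with plain matrix arithmetic; here is a compact sketch (it assumes the weighted `pred1` to `pred5` tibbles and the `Labels` vector created above, and is not evaluated here):

```{r, eval=FALSE}
# Sum the weighted class probabilities and keep the most probable label per test case
prob_sum <- as.matrix(pred1[, 1:10]) + as.matrix(pred2[, 1:10]) +
            as.matrix(pred3[, 1:10]) + as.matrix(pred4[, 1:10]) +
            as.matrix(pred5[, 1:10])
weighted_pred <- Labels[max.col(prob_sum, ties.method = "first")]
```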

We will also create two functions for plotting the confusion matrix and F1 scores for our Ensemble predictions:

```{r,message = FALSE,warning=FALSE}

ensemblecmplot=function(vote,truth){
  cm=confusionMatrix(reference = truth,vote)
  cmt=cm$table%>%as_tibble()
  freq=Hmisc::describe(truth)%>%.$values%>%.$frequency
  labels=row.names(cm$table)
  
  # Convert each cell count into a percentage of its true-class (Reference) total
  k=1
  cmt$rate=cmt$n/1
  for(i in 1:10){
    for(k in 1:100){
      if(cmt$Reference[k]==labels[i]){
        cmt$rate[k]<-100*(cmt$n[k]/freq[i])
        k=k+1
      }
      else{k=k}
    }
  }
  
  cmt%>%ggplot(aes(x=Prediction,y=Reference,fill=rate))+
    geom_tile(color="black")+my_theme(10)+
    scale_fill_gradient2(low="white",mid="#008cff",high="#ff0059",midpoint=50)+
    theme(axis.text.x = element_text(angle =45, hjust = 0.5))+
    ggtitle("Ensemble")+
    geom_text(aes(y=Reference,x=Prediction,label=round(rate,1)),color="black",size=3)+coord_fixed()
}
```

```{r,message = FALSE,warning=FALSE}
p1=ensemblecmplot(vote=predt$Predict,truth=testset$CLASS)+ggtitle("Weighted averaging")
p2=multiclasscmplot(modelobj=rgmod,testdf=testset,target="CLASS",modelname="RandomForest")

gridExtra::grid.arrange(p1,p2,ncol=2)
```

The confusion matrix of the weighted-average prediction was better than the one obtained with Random Forest, but not for all patterns. The averaged prediction improved for Shiftpattern, REM, Flatsinusoidal, DE, CalmSleep and ActVig; it was unchanged for Suspect, AD and LD, and did worse for CalmVig. Overall, though, we can say that the ensemble model did improve the accuracy of the prediction.

```{r,message = FALSE,warning=FALSE}

ensemblecmsum=function(vote,truth,lowcol,highcol){
  
  cm=confusionMatrix(reference = truth,vote)
  
  # Keep the per-class F1 score (column 7) and Balanced Accuracy (column 11)
  sum=cm$byClass%>%as_tibble()%>%.[,c(7,11)]%>%mutate(.,Labels=row.names(cm$table))
  sum%>%ggplot(aes(x=reorder(Labels,F1),y=F1))+
    geom_bar(aes(fill=F1),stat="identity",color="black")+
    coord_flip()+
    geom_text(aes(label=round(F1,2)),hjust=1,size=5,color="white")+ggtitle(paste("Ensemble"))+
    scale_fill_gradient(low=lowcol,high=highcol)
}

```

```{r,message = FALSE,warning=FALSE}

p1=ensemblecmsum(vote=predt$Predict,truth=testset$CLASS,lowcol="#c94be5",highcol="#ff0066")+ggtitle("Weighted averaging")
p2=multiclasscmsum(modelobj=rgmod,testdf=testset,target="CLASS",modelname="RF",lowcol="#c94be5",highcol="#ff0066")

gridExtra::grid.arrange(p1,p2,ncol=2)
```

Based on the per-class F1 scores, the weighted-average prediction performs better than the best individual model, which was Random Forest.

# Second ensemble rule: Majority Voting

The second ensemble method consists of taking the class with the highest number of votes among the 5 individual models as the final decision. We wrote a loop to do this manually. Note that a majority among 5 learners means that at least 3 learners gave the same prediction for a given case. If no majority can be reached, we systematically fall back on the prediction of the strongest model, which is Random Forest.

```{r,message = FALSE,warning=FALSE}
pdf=data.frame(RF=predrg,SVM=predsvm,KNN=predknn,ONER=predor,CART=predcart)

labels=row.names(cm1$table)

predt=rep(NA,1060)

k=1

# Assign label i to a case when at least 3 of the 5 learners agree on it
for(i in 1:10){
  for(k in 1:1060){
    if(is.na(predt[k])){
      predt[[k]]<-ifelse(pdf$RF[k]==labels[i] & pdf$SVM[k]==labels[i] & pdf$KNN[k]==labels[i],labels[i],
                 ifelse(pdf$RF[k]==labels[i] & pdf$SVM[k]==labels[i] & pdf$ONER[k]==labels[i],labels[i],
                 ifelse(pdf$RF[k]==labels[i] & pdf$SVM[k]==labels[i] & pdf$CART[k]==labels[i],labels[i],
                 ifelse(pdf$RF[k]==labels[i] & pdf$KNN[k]==labels[i] & pdf$ONER[k]==labels[i],labels[i],
                 ifelse(pdf$RF[k]==labels[i] & pdf$KNN[k]==labels[i] & pdf$CART[k]==labels[i],labels[i],
                 ifelse(pdf$RF[k]==labels[i] & pdf$ONER[k]==labels[i] & pdf$CART[k]==labels[i],labels[i],
                 ifelse(pdf$SVM[k]==labels[i] & pdf$KNN[k]==labels[i] & pdf$ONER[k]==labels[i],labels[i],
                 ifelse(pdf$SVM[k]==labels[i] & pdf$KNN[k]==labels[i] & pdf$CART[k]==labels[i],labels[i],
                 ifelse(pdf$KNN[k]==labels[i] & pdf$CART[k]==labels[i] & pdf$ONER[k]==labels[i],labels[i],NA
                                      )))))))))
      k=k+1}
    else{k=k}
  }
}

# Fall back on the Random Forest prediction when no majority was reached
for(k in 1:1060){
  if(is.na(predt[[k]])){predt[[k]]<-as.vector(pdf$RF[[k]])
  k=k+1
  }
}

pdf$Majority=as.vector(predt)
pdf$Truth=testset$CLASS
```
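
The nested `ifelse()` cascade above enumerates the 3-of-5 agreements explicitly. A shorter sketch of the same rule, which counts the votes per test case and falls back on Random Forest when no class reaches 3 votes (assuming the `pdf` data frame above; not evaluated here):

```{r, eval=FALSE}
# Keep the class voted for by at least 3 learners, otherwise trust Random Forest
majority_vote <- apply(pdf[, c("RF", "SVM", "KNN", "ONER", "CART")], 1, function(v){
  tab <- sort(table(v), decreasing = TRUE)
  if (tab[1] >= 3) names(tab)[1] else as.character(v["RF"])
})
```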

```{r,message = FALSE,warning=FALSE}

p1=ensemblecmplot(vote=pdf$Majority,truth=pdf$Truth)
p2=multiclasscmplot(modelobj=rgmod,testdf=testset,target="CLASS",modelname="RandomForest")

gridExtra::grid.arrange(p1,p2,ncol=2)

```

```{r,message = FALSE,warning=FALSE}
p1=ensemblecmsum(vote=pdf$Majority,truth=pdf$Truth,lowcol="red",highcol="purple")
p2=multiclasscmsum(modelobj=rgmod,testdf=testset,target="CLASS",modelname="RF",lowcol="#c94be5",highcol="#ff0066")
gridExtra::grid.arrange(p1,p2,ncol=2)

```

Overall, the ensemble prediction based on majority voting is worse than the one based on the weighted-average method, and worse than the standalone Random Forest model.

# The third ensemble model: GBM Stacking

The third ensemble method consists of using the output of the 5 individual models as the input of a top-layer model, here a GBM learner. In brief, we generate a new training subset and a new test subset that contain the predicted probabilities of the 10 labels from the 5 models (10 x 5 = 50 new input features) for training and testing a GBM model.

```{r,message = FALSE,warning=FALSE,results="hide"}

#TRAINPRED

pred1=predict(svmod,newdata=trainset,type="prob")%>%as_tibble()
colnames(pred1)<-paste(Labels,1,sep="_")

pred2=predict(rgmod,newdata=trainset,type="prob")%>%as_tibble()
colnames(pred2)<-paste(Labels,2,sep="_")

pred3=predict(knnmod,newdata=trainset,type="prob")%>%as_tibble()
colnames(pred3)<-paste(Labels,3,sep="_")

pred4=predict(ormod,newdata=trainset,type="prob")%>%as_tibble()
colnames(pred4)<-paste(Labels,4,sep="_")

pred5=predict(cartmod,newdata=trainset,type="prob")%>%as_tibble()
colnames(pred5)<-paste(Labels,5,sep="_")

trainstackdf=cbind(pred1,pred2,pred3,pred4,pred5)%>%mutate(.,CLASS=trainset$CLASS)

#TESTPRED

pred1=predict(svmod,newdata=testset,type="prob")%>%as_tibble()
colnames(pred1)<-paste(Labels,1,sep="_")

pred2=predict(rgmod,newdata=testset,type="prob")%>%as_tibble()
colnames(pred2)<-paste(Labels,2,sep="_")

pred3=predict(knnmod,newdata=testset,type="prob")%>%as_tibble()
colnames(pred3)<-paste(Labels,3,sep="_")

pred4=predict(ormod,newdata=testset,type="prob")%>%as_tibble()
colnames(pred4)<-paste(Labels,4,sep="_")

pred5=predict(cartmod,newdata=testset,type="prob")%>%as_tibble()
colnames(pred5)<-paste(Labels,5,sep="_")

teststackdf=cbind(pred1,pred2,pred3,pred4,pred5)%>%mutate(.,CLASS=testset$CLASS)



```
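
The two blocks above repeat the same pattern for each base learner. A more compact helper (a sketch using purrr, assuming the five fitted models above; not evaluated here) could build both stacked data frames in one call:

```{r, eval=FALSE}
library(purrr)

# Column-bind the class probabilities of every base learner into one wide data frame
make_stack_features <- function(models, newdata){
  imap_dfc(models, function(mod, id){
    p <- predict(mod, newdata = newdata, type = "prob")
    setNames(p, paste(colnames(p), id, sep = "_"))
  })
}

base_models  <- list(svm = svmod, rf = rgmod, knn = knnmod, oner = ormod, cart = cartmod)
trainstackdf <- make_stack_features(base_models, trainset) %>% mutate(CLASS = trainset$CLASS)
teststackdf  <- make_stack_features(base_models, testset)  %>% mutate(CLASS = testset$CLASS)
```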

```{r,message = FALSE,warning=FALSE,results="hide"}

#Stacking GBM

set.seed(1234)
Control2=trainControl(method="repeatedcv",number=5,repeats=5,classProbs=TRUE,summaryFunction=multiClassSummary)

stackGBM<-train(CLASS~.,data=trainstackdf,method='gbm',trControl=Control2,tuneLength=3)

predSTKGBM=predict(stackGBM,newdata=teststackdf)

cmstk=confusionMatrix(predSTKGBM,teststackdf$CLASS)

```

```{r,message = FALSE,warning=FALSE}

p1=ensemblecmplot(vote=predSTKGBM,truth=teststackdf$CLASS)+ggtitle("GBM Stacking")

p2=multiclasscmplot(modelobj=rgmod,testdf=testset,target="CLASS",modelname="RandomForest")

gridExtra::grid.arrange(p1,p2,ncol=2)

ensemblecmsum(vote=predSTKGBM,truth=teststackdf$CLASS,lowcol="gold",highcol="red")+ggtitle("GBM Stacking")

```

As you can see, the two-layer stacking ensemble model did a very good job; it performed even better than the standalone Random Forest model.

## Conclusion

Ensembling is a simple but effective approach to improve the performance of weak base algorithms. In most situations, an ensemble model works better than a standalone learner. Ensembling offers robustness and stability for both classification and regression tasks, even for complex problems like the multiclass classification in our study.

However, an ensemble model might not be necessary if you already have a strong learner such as Random Forest or GBM among the base learners. In fact, Random Forest and GBM are themselves ensemble models, corresponding to the bagging and boosting principles, respectively.

Ensemble modelling also has some disadvantages: the process is time consuming and makes the model very difficult to interpret. Consequently, ensembling is not a good choice for interpretive studies or real-time analysis tasks.

**Thank you for joining us and see you in the next tutorial.**
