Foreword:About the Machine Learning in Medicine project
This is the 35th case study by MLM team. The MLM project has been initialized in 2016 and aims to:
Encourage using Machine Learning techniques in medical research in Vietnam and
Promote the use of R statistical programming language, an open source and leading tool for practicing data science.
Introduction
In the previous tutorials (X13 and X14), we have explored the variable filtration and linear model selection. Filtration allows excluding the irrelevant features before model training, while model selection aims to optimize the model’s structure and performance and to prevent the overfitting. We have introduced some methods including Bayesian model averaging, Lasso and Elastic net regularisations. These solutions are strictly based on the linear regression.
In the present case study, we will explore a new method that could be used for feature filtration: the bootstrap aggregating or Random Forest algorithm. Given a classification problem, our goal is to fit the best Random Forest model for predictive purpose with strongest variables. We will also compare the performance of this method with logistic classifiers that based on Lasso, Ridge and Elastic net regularizations. Note that the Bayesian Model averaging (BMA) method will not be considered in our comparative study, due to high complexity of our data.
Dataset and Study objective
The present study implies the Cardiac Arrhythmia dataset. This dataset was created by H. Altay Guvenir (Ankara, Turkey) in 1997 for the development of a machine learning based algorithm to detect Arrhythmia from ECG recording.
The original dataset has 452 observations and 279 features, 206 of which are numerical and the rest are nominal. This was an ambitious study as the author want to classify 15 types of cardiac arrhythmia, such as Ischemic change, Myocardial infarction (anterior and Inferior), Sinus bradycardy and tachycardy, Ventricular Premature Contraction (PVC), Supraventricular Premature Contraction, Left/right bundle branch block, AtrioVentricular blocks (3 degrees), Left ventricule hypertrophy, Atrial Fibrillation or Flutter and Others.
For the demonstration purpose, we will simplify the study question to a binary classification problem. Our experiment aims to develop an algorithm to distinguish between the presence and absence of arrhythmia. As the dataset contains a large number of features, and many of them are irrelevant, we also want to select the best combination of important features in order to improve our model’s performance.
A description of the dataset could be found here:
<https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names
Material and method
Cardiac Arrhythmia is a famous dataset in the community of Medical datascientists because it’s a good opportunity to practice our skills of data wrangling, feature selection and multilabel modelling. This dataset has been used by Dr. Shirin Glander in her very good tutorial using “Sparkling water” (a tool that combines h2o and Spark):
https://shiring.github.io/machine_learning/2017/02/27/h2o
In our tutorial, we borrowed some of her R codes, but our approach is totally different. First, we treated numeric and nominal features separately, and we dont use neither PCA, nor variance based feature selection but focus on the Random Forest and other regularisation methods.
First, the original dataset D1 was decomposed into Numeric and Nominal parts. Each one was then visually explored in order to detect the missing values, invariance features or missing label problem. Those variables will be removed in the first place. The remaining variables will be gathered in a smaller dataset D2. This one was randomly splitted into two subsets for training and testing. From that moment, we dont touch the test subset as it will be used for the model validation.
On the train subset, we performed our data experiment. It consists of using a replicated Random Forest model training to identify the most important features in the train subset. The h2O package was used for this procedure.
The basic algorithm consisted of a Random Forest with max tree number of 100, max depth of 50, sample rate of 70% (which means each tree is trained on 70% of the input data); and each split only used about ~15% (234^0.5) features as candidates. This algorithm was trained by a 10x10 crossvalidation with balanced stratification for each fold and early stopping criteria.
Such procedure was replicated 100 times. For each step the trained RF model was validated on the test set, and feature importances have been extracted for each model. These information were then averaged for 100 models to provide a full and clear picture of the importance of each features. This allows building the best combination of most relevant variables with a good certainty. This feature combination was used as input for a new Random Forest algorithm. The final version of RF model entered a comparative study against 3 other logistic based algorithms.
We also built 3 other logistic models that based on Lasso, Ridge and Elasticnet regularisations. The Lasso and Ridge models were supported by h2o framework, while the elasticnet model was based on glmnet package. All 3 models were trained by a 10x10 crossvalidation.
Data preparation and exploration
First, we will prepare the ggplot theme for our experiment
library(tidyverse)
theme_bare <- function(base_size=8,base_family="sans"){theme_bw(base_size = base_size, base_family = base_family)+
theme(
axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "bottom",
panel.background = element_rect(fill = NA),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.margin = unit(c(0,0,0,0), "lines")
)
}
my_theme <- function(base_size =5, base_family = "sans"){
theme_bw(base_size = base_size, base_family = base_family) +
theme(
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = NA),
strip.background = element_rect(fill = "#8e0444", color = "#8e0444", size =0.5),
strip.text = element_text(face = "bold", size = 5, color = "white"),
legend.position = "bottom",
legend.justification = "center",
legend.background = element_blank(),
legend.margin = margin(0.5,0.5,0.5,0.5)
)
}
theme_set(my_theme())
Now we load the dataset from UCI database:
#load data
arrhythmia <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data", sep = ",")
arrhythmia[arrhythmia == "?"] <- NA
# Extract the nominal features
arrhythmia[-280] <- lapply(arrhythmia[-280], as.character)
arrhythmia[-280] <- lapply(arrhythmia[-280], as.numeric)
dfnominal=arrhythmia[,c(280,2,22:27,34:39,46:51,58:63,70:75,82:87,94:99,106:111,118:123,130:135,142:147,154:159)]
#k1=74
dfnominal[]<-lapply(dfnominal, as.factor)
colnames(dfnominal)[1] <- "class"
The nominal part has 74 categorical variables (all of them are binary factor). The target variable (class) is also included as a categorical factor. A simple visualisation will show us the proportion of 16 labels:
ggplot(dfnominal, aes(x = as.factor(class)))+geom_bar(aes(fill=as.factor(class)), alpha = 0.7,show.legend = F)+my_theme(10)+scale_x_discrete("16 labels of cardiac Arrhythmia")+geom_text(stat='count',aes(label=..count..),vjust=-0.5)

As mentioned above, we will simplify the study question by considering only two labels: healthy or arrhythmia (as Dr. Shirin Glander did).
dfnominal$class<-ifelse(dfnominal$class == "1", "healthy", "arrhythmia")
dfnominal$class =as.factor(dfnominal$class)
ggplot(dfnominal, aes(x = class))+geom_bar(aes(fill=as.factor(class)), alpha = 0.7,show.legend = F)+my_theme(10)+geom_text(stat='count',aes(label=..count..),hjust=5,size=10,color="white")+coord_flip()+scale_fill_manual(values=c("#f70260","#079eef"))

By counting the distribution of two target labels (healthy and arrhythmia) within each label of 73 nominal variables, we can see that some nominal variables don’t contribute to the prediction of the target labels. Those variables could be removed (even if we keep them, they will be later rejected as h2o can detect automatically those variables and remove them from the input before training).
dfnominal%>%gather(V2:V159,key="Feature",value="Level")%>%ggplot(aes(x=Feature,fill=class,color=class))+geom_bar(position="fill",alpha=0.7)+scale_fill_manual(values=c("#f70260","#079eef"))+scale_color_manual(values=c("#d80053","#006fd8"))+facet_wrap(~Level)+coord_polar()+scale_x_discrete("Levels (0 or 1)")

Now we extract the numeric variables from the original dataset. There are 206 numeric variables:
dfnumeric=arrhythmia[,c(1,3:21,28:33,40:45,52:57,64:69,76:81,88:93,100:105,112:117,124:129,136:141,148:153,160:279)]
First, we check whether there are any missing value among those numeric variables
bar_missing <- function(x){
library(dplyr)
library(reshape2)
library(ggplot2)
x %>%
is.na %>%
melt %>%
ggplot(data = .,
aes(x = Var2)) +
geom_bar(aes(y=(..count..),fill=value),alpha=0.7)+scale_fill_manual(values=c("gold","red3"),name = "",
labels = c("Available","Missing"))+
theme_minimal(5)+
theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
labs(x = "Variables in Dataset",
y = "Observations")+coord_flip()
}
dfnumeric%>%bar_missing()

As we could see, the variables V11,V12,V13 and V15 contain some missing values, while the variable V14 has so many missing values. We decided to remove those variables.
dfnumeric=dfnumeric[,-c(11,12,13,14,15)]#remove because too many missing value
The next step consists of visualising the variance of the remaining 205 numeric variables. A variable with small or null variance will be considered irrelevant for modelling so they could also be removed.
Based on the results, following variables have no variance: V20,68,140,152,165,205,265,275.
The graph shows that most of the numeric variable have low variance.
library(matrixStats)
colvars <- data.frame(feature = colnames(dfnumeric),
variance = colVars(as.matrix(dfnumeric)))
subset(colvars, variance==0)
## feature variance
## 14 V20 0
## 38 V68 0
## 74 V140 0
## 80 V152 0
## 87 V165 0
## 127 V205 0
## 187 V265 0
## 197 V275 0
subset(colvars, variance>=20)%>%
mutate(feature = factor(feature, levels = colnames(dfnumeric)))%>%
ggplot(aes(x = reorder(feature,variance), y = variance))+
geom_point(aes(color=variance), alpha = 0.7)+
theme_minimal(5)+
scale_color_gradient2(low="gold",high="purple",mid="red",midpoint = 700)+scale_x_discrete("Numeric variables")+coord_polar()

We decided to remove following variables, as they dont have any contribution to our prediction:
df=cbind(dfnominal,dfnumeric)
removed=c("V22", "V165", "V140", "V20", "V68", "V146", "V133", "V85", "V158", "V145", "V132", "V84", "V157", "V46", "V155", "V142", "V72", "V144", "V152", "V86", "V38", "V70", "V205", "V275", "V265")
df=df[,!(names(df) %in% removed)]
The final dataset contain 452 observations and 248 potential features. It’s a good start but there is still many (200) numeric features, as showed in this correlation matrix and correlation network. Filtering those variable would be difficult without using advanced methods.
library(corrplot)
library(RColorBrewer)
cor_mat=as.matrix(cor(use="pairwise.complete.obs",method="spearman",as.matrix(dfnumeric)))
cor_mat%>%corrplot(.,hclust.method ="ward.D2",type="lower",method="color",tl.col="black", tl.srt=45,tl.cex=0.5,col=rev(brewer.pal(n=10, name="RdBu")))

library(igraph)
diag(cor_mat)<-0
library(ggraph)
m=cor_mat
cdf=data.frame(row=rownames(m)[row(m)[upper.tri(m)]],
col=colnames(m)[col(m)[upper.tri(m)]],
corr=m[upper.tri(m)])
cdf=subset(cdf,abs(corr)>0.5)
cdf$Close=cdf$corr>=0.7
names(cdf)=c("from","to","corr","Close")
graph<-graph_from_data_frame(cdf)
ggraph(graph, layout = 'fr')+geom_edge_density(aes(fill = Close))+geom_node_point()+geom_node_text(aes(label = name),nudge_x=,nudge_y=0.1,color="grey")+geom_edge_link(aes(colour = corr),width=1,alpha=0.5)+theme_bare()+scale_edge_fill_manual(values=c("white","#ffdf14"))+scale_edge_color_gradient(high="#ff003f",low="#0094ff")
To this point, we got a smaller dataset of 248 variables, we will split this into two subsets, one for training and another for testing:
set.seed(1234)
idTrain=caret::createDataPartition(y=df$class,p=351/452,list=FALSE)
trainset=df[idTrain,]
testset=df[-idTrain,]
sp1=df%>%ggplot(aes(x=class,fill=class))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f70260","#079eef"))+coord_flip()+ggtitle("Origin")+geom_text(stat='count',aes(label=..count..),hjust=5,size=5,color="white")+theme_bw(10)
sp2=trainset%>%ggplot(aes(x=class,fill=class))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f70260","#079eef"))+coord_flip()+ggtitle("Train")+geom_text(stat='count',aes(label=..count..),hjust=5,size=5,color="white")+theme_bw(10)
sp3=testset%>%ggplot(aes(x=class,fill=class))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f70260","#079eef"))+coord_flip()+ggtitle("Test")+geom_text(stat='count',aes(label=..count..),hjust=5,size=5,color="white")+theme_bw(10)
gridExtra::grid.arrange(sp1,sp2,sp3,ncol=1)

Now we begin our machine learning experiment. First, we load the h2O package and convert train/test subsets into h2O class:
Feature selection using replicated Random Forest
library(h2o)
h2o.init(nthreads = -1,max_mem_size ="4g")
wtrain=as.h2o(trainset)
wtest=as.h2o(testset)
features=setdiff(colnames(wtrain),"class")
The first solution consists of replicating a basic Random Forest (RF) algorithm 100 times, then extracting the information of variable importances.
Note that each RF model is trained using a different seed number, however this randomness will be reproducible so you will get the same result as given here (unlike the randomforest algorithm in mlr package).
Let’s run these codes and drink a cup of coffee because it will take some times…
set.seed(12345)
rfseed=sample(10000,100,replace=F)
res=NULL
pdf=NULL
for (i in 1:100){
rfmod=h2o.randomForest(x = features,
y = "class",
training_frame = wtrain,nfolds=10,
fold_assignment = "Stratified",
balance_classes = TRUE,
ntrees = 100, max_depth = 50,mtries = -1,sample_rate = 0.7,
stopping_metric = "misclassification",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = F,
keep_cross_validation_predictions=F,
score_each_iteration = TRUE,
seed=rfseed[i])
pdftemp=predict(rfmod,wtest)%>%as_tibble()%>%mutate(.,Id=row.names(testset),Truth=testset$class,Accuracy=ifelse(testset$class ==.$predict, "Correct", "Wrong"),Model=i)
vimp=rfmod@model$variable_importances%>%as_tibble()%>%mutate(.,Seed=rfseed[i],Model=i)
res=rbind(res,vimp)
pdf=rbind(pdf,pdftemp)
}
res=as_tibble(res)
res$Model=as.factor(res$Model)
As you could see in the loop, we have performed 3 tasks on each RF model: Training by crossvalidation, predicting the probability of arrhythmia in test subset, and extracting the variable importance matrix of the trained model. By replicating those tasks 100 times for 100 different versions of RF models, we finally got 2 large dataframes: (1) “pdf” contains information of accuracy of 100 predictions in testset and (2) “res”" dataframe that contains information of ariable importances.
First, we will explore the pdf dataframe, as before exploring the variable importances we want to know how good 100 RF model performed if the input consists of all 248 features. If we dont consider a feature filtration is there a consequence on the model’s performance ?
pdf$Id=as.factor(pdf$Id)
pdf%>%ggplot(aes(x=Model,y=reorder(Id,arrhythmia),fill=arrhythmia))+geom_tile()+theme_bw()+scale_fill_gradient2(low="green",mid="gold",high="red",midpoint = 0.5)+facet_wrap(~Truth,shrink=TRUE,scale="free",ncol=1)+theme_bw(8)+theme(axis.text.x = element_text(angle =45, hjust = 1))

The first heatmap plot represents the predicted probability of arrhythmia by 100 RF models (horizontal axis) for each case among 100 patients (vertical axis), when all 248 features have been used as input. A good model would give high probability for patients with true arrhythmia (orange or red colors) and lower probability in healthy subjects (green color). The presence of green color in Arrhythmic group indicates a false negative, while the presence of red color in Healthy group means false positive. Both of them are wrong and affect the balanced accuracy of the model.
Overall, We can see that the Random Forest algorithm handled well the problem without any feature filtration. In fact, the randomness in RF algorithm already removed the irrelevant features. Such performance is consistent through 100 different versions, so we can trust the variable importance results given by these models. However, there are still false positive and false negative predictions.
The accuracy of 100 RF models can also be evaluated qualitatively, as follows:
pdf%>%ggplot(aes(x=Model,y=reorder(Id,1-arrhythmia),fill=Accuracy))+geom_tile()+theme_bw()+scale_fill_manual(values=c("skyblue","red"))+theme_bw(8)+theme(axis.text.x = element_text(angle =45, hjust = 1))+facet_wrap(~Truth,shrink=TRUE,scale="free",ncol=1)+scale_y_discrete("Idx")

We will run a second loop for estimating the averaged predicted probability and averaged accuracy of 100 models:
pdf2=NULL
for (i in 1:100){
Idt=pdf$Id[i]
dtemp=subset(pdf,Id==Idt)
sum=Hmisc::describe(dtemp)
predprob=sum$arrhythmia$counts[[5]]
acc=sum$Accuracy$values[[2]][1]/100
wrong=sum$Accuracy$values[[2]][2]/100
pdftemp=cbind.data.frame(Id=Idt,predprob,acc,wrong)
pdf2=rbind(pdf2,pdftemp)
}
The 2 following graphs display the median of predicted probability and percentage of accurate/wrong predictions when 100 models were applied to the test subset (you can feel it likes reading a confusion matrix).
pdf2$class=testset$class
pdf2$predprob=as.numeric(as.character(pdf2$predprob))
pdf2$Id=as.factor(pdf2$Id)
pdf2$wrong[is.na(pdf2$wrong)] <- 0
pdf2%>%ggplot(aes(x=reorder(Id,predprob),y=predprob,fill=predprob))+geom_bar(alpha=0.7,stat="identity")+theme_bw(8)+scale_fill_gradient2(low="gold",mid="red",high="purple",midpoint = 0.4)+facet_wrap(~class,shrink = TRUE,ncol=1,scales = "free")+geom_hline(yintercept = 0.5,linetype=2,size=1)

pdf2%>%gather(wrong,acc,key="Accuracy",value="Prediction")%>%ggplot(aes(x=reorder(Id,-predprob),y=Prediction,fill=Accuracy))+geom_bar(alpha=0.7,position="fill",stat="Identity")+theme_bw(8)+facet_wrap(~class,shrink = TRUE,ncol=1,scales = "free")+scale_fill_manual(values=c("skyblue","red"))+scale_x_discrete("Id")

So, we can trust our 100 RF models in terms of prediction, now we will see what information they give us about the importances of 248 features:
The following heatmap plots indicate the relative importance and scaled importance scores, respectively. More important a feature is, more saturated in color that feature would be. Based on those graphs, we can make a conclusion that up to 83/248 features (1/3) could be removed, as they did not contribute to the prediction. We only need about 150 to 160 features to build a good Random Forest model.
res%>%ggplot(aes(x=reorder(Model,relative_importance),y=reorder(variable,-relative_importance),fill=relative_importance,color=relative_importance))+geom_tile(show.legend = F)+theme_bw()+scale_fill_gradient2(low="white",mid="red",high="purple",midpoint = 50)+scale_color_gradient2(low="white",mid="red",high="purple",midpoint = 50)+theme_bw(5)+theme(axis.text.x = element_text(angle =45, hjust = 1))+scale_x_discrete("Models")+scale_y_discrete("Features")

res%>%ggplot(aes(x=reorder(Model,scaled_importance),y=reorder(variable,-scaled_importance),fill=scaled_importance,color=scaled_importance))+geom_tile(show.legend = F)+theme_bw()+scale_fill_gradient2(low="white",mid="gold",high="red",midpoint = 0.5)+scale_color_gradient2(low="white",mid="gold",high="red",midpoint =0.5)+theme_bw(5)+theme(axis.text.x = element_text(angle =45, hjust = 1))+scale_x_discrete("Models")+scale_y_discrete("Features")

As above, we will also run another loop to extract the averaged importance score for each features:
res2=NULL
for (i in 1:253){
dtemp=subset(res,variable==features[i])
sum=psych::describe(dtemp)
rei=sum[[5]][2]
sci=sum[[5]][3]
pct=sum[[5]][4]
rtemp=cbind.data.frame(rei,sci,pct,var=dtemp[[1]][1])
res2=rbind(res2,rtemp)
}
res2%>%ggplot(aes(x=reorder(var,rei),y=rei,fill=rei,color=rei))+geom_segment(aes(x=reorder(var,rei),xend=var,y=0,yend=rei),size=1,show.legend = F)+geom_point(size=1,show.legend = F)+scale_x_discrete("Features")+scale_y_continuous("Relative importance")+coord_flip()+scale_fill_gradient(low="blue",high="red")+scale_color_gradient(low="blue",high="red")+ggtitle("Relative importance")+theme_bw(5)

We decided to keep only 150 strongest variables in our input:
res3=res2%>%arrange(desc(rei))%>%.[c(1:150),]
res3%>%ggplot(aes(x=reorder(var,rei),y=rei,fill=rei,color=rei))+geom_segment(aes(x=reorder(var,rei),xend=var,y=0,yend=rei),size=1,show.legend = F)+geom_point(size=1,show.legend = F)+scale_x_discrete("Features")+scale_y_continuous("Relative importance")+coord_flip()+scale_fill_gradient(low="blue",high="red")+scale_color_gradient(low="blue",high="red")+ggtitle("Relative importance")+theme_bw(5)

Based on a new input that contains only 150 strongest features, we will train another model (Note that the seed number is reset to 333)
features2=as.vector(res3$var)
rfmod=h2o.randomForest(x = features2,
y = "class",
training_frame = wtrain,nfolds=10,
fold_assignment = "Stratified",
balance_classes = TRUE,
ntrees = 100, max_depth = 50,mtries = -1,sample_rate = 0.8,
stopping_metric = "misclassification",
stopping_tolerance = 0.01,
stopping_rounds = 3,
keep_cross_validation_fold_assignment = F,
keep_cross_validation_predictions=F,
score_each_iteration = TRUE,
seed=333)
Logistic models
We have finish our first task - building a Random Forest model for our classification problem. Now, we will consider another 3 solutions that based on Logistic regression.
The first one will be LASSO logistic model (seed =546)
lassomod=h2o.glm(x=features,
y="class",
training_frame=wtrain,
nfolds = 10,seed = 546,
keep_cross_validation_predictions = FALSE,
keep_cross_validation_fold_assignment = FALSE,
fold_assignment = "Stratified",
family = "binomial",
alpha=1,
lambda_search = TRUE,nlambdas =20,
early_stopping = TRUE,
standardize = TRUE,
missing_values_handling ="Skip",
compute_p_values = FALSE,
remove_collinear_columns = FALSE,
intercept = TRUE,
non_negative = FALSE
)
Here are the coefficients of selected features for this Lasso logistic model, note that the Lasso regularisation has automatically removed all the features that were found to be not important. The final logistic model only contains 25 features.
coefs=lassomod@model$coefficients_table%>%as.tibble()%>%subset(.,coefficients!=0)
coefs
## # A tibble: 36 x 3
## names coefficients standardized_coefficients
## <chr> <dbl> <dbl>
## 1 Intercept 2.494735507 0.12566888
## 2 V5 -0.019842112 -0.29095732
## 3 V8 -0.001109552 -0.03821251
## 4 V9 0.006421963 0.16184794
## 5 V21 -0.011460841 -0.11614159
## 6 V30 -0.001667630 -0.03562249
## 7 V40 -0.002662741 -0.05973024
## 8 V44 0.003609711 0.01170101
## 9 V56 0.011496913 0.01977849
## 10 V65 -0.002871800 -0.06530719
## # ... with 26 more rows
The second solution is Ridge regularisation (seed=999). This could also be trained in h2O. Unlikes LASSO, Ridge regularisation only shrink the coefficients close to zero, but not zero, so basically the Ridge model cannot be used for feature selection.
ridgemod=h2o.glm(x=features,
y="class",
training_frame=wtrain,
nfolds = 10,seed = 999,
keep_cross_validation_predictions = FALSE,
keep_cross_validation_fold_assignment = FALSE,
fold_assignment = "Stratified",
family = "binomial",
alpha=0,
lambda_search = TRUE,nlambdas =20,
early_stopping = TRUE,
standardize = TRUE,
missing_values_handling ="Skip",
compute_p_values = FALSE,
remove_collinear_columns = FALSE,
intercept = TRUE,
non_negative = FALSE
)
The last solution will be Elastic net model (seed =5566). The training of this model is done using glmnet package. Note that this package requires transforming input data to matrix, and it does not accept missing values.
#Elastic net
library(glmnet)
set.seed(5566)
trainsetenet=trainset%>%na.omit()
testsetenet=testset%>%na.omit()
trainsetenet[-1]<-lapply(trainsetenet[-1], as.numeric)
testsetenet[-1]<-lapply(testsetenet[-1], as.numeric)
x=as.matrix(trainsetenet[,-1])
y=trainsetenet[,1]
enet = cv.glmnet(x,y,family = "binomial",type.measure = "class")
plot(enet)

The elastic net regularisation is a hybrid method that combines both l1 and l2 regularisations (Lasso and Ridge). The results showed that Elastic net model only kept 30 features
coef(enet)%>%as.matrix()%>%as.data.frame()%>%subset(.,`1`!=0)
## 1
## (Intercept) 2.8119512883
## V25 -0.1274432489
## V49 0.1015183400
## V61 -0.5036084377
## V95 -0.0273590394
## V109 -0.1351489123
## V5 -0.0239069823
## V9 0.0041789599
## V21 -0.0016934672
## V30 -0.0010216309
## V69 -0.0053685839
## V81 -0.0073229387
## V90 0.0035919755
## V91 -0.0226190775
## V93 -0.0087924941
## V100 -0.0101638548
## V103 -0.0283531476
## V112 -0.0126353743
## V136 -0.0061550696
## V163 0.0108526736
## V167 0.2202514658
## V176 0.0415050940
## V199 -0.0174245959
## V211 0.1937751294
## V239 0.0008013161
## V252 -0.0071867336
## V256 0.0646729497
## V260 0.1065758008
Validation and comparative study
Finally, we got 4 candidates of classifiers. In order to determine which has the best performance, we will perform a comparative study. We will evaluate 14 performance metrics for each model when applied to the test subset, then plot the results on a radar chart to visualize the difference.
The performance metrics can be divided into two groups : the first group has best value of 1, including: Balanced accuracy(BAC), Area under ROC curve (AUC), Accuracy (ACC), F1 score, Positive/Negative predictive value and True negative/positive rates. These metrics indicate the accuracy, sensitivity and specificity of the model. The second group contains the metrics that have best value of 0 and worst value of 1, including Balance error rate (BER), Brier score, False negative/positive rates, Mean misclassification error and False discovery rate (FDR). Those scores measure the weakness of each model. For ECG reading, false negative prediction is more dangerous than false positive.
library(mlr)
task=makeClassifTask(data=testset,target="class")
dummylrner=makeLearner("classif.rpart",predict.type ="prob")
dummymod=train(dummylrner,task)
dummypred=predict(dummymod,task)
LASSOPRED=predict(lassomod,wtest)%>%as.data.frame()
LASSOPRED=LASSOPRED%>%mutate(.,id=c(1:100),truth=testset$class,prob.arrhythmia=.[,2],prob.healthy=.[,3],response=.$predict)%>%.[,-c(1:3)]
RIDGEPRED=predict(ridgemod,wtest)%>%as.data.frame()
RIDGEPRED=RIDGEPRED%>%mutate(.,id=c(1:100),truth=testset$class,prob.arrhythmia=.[,2],prob.healthy=.[,3],response=.$predict)%>%.[,-c(1:3)]
enetclass=predict(enet,as.matrix(testsetenet[,-1]),type="class")%>%as.data.frame()
ENETPRED=predict(enet,as.matrix(testsetenet[,-1]),type="response")%>%as.data.frame()%>%mutate(.,id=c(1:98),truth=testsetenet$class,prob.arrhythmia=1-.[,1],prob.healthy=.[,1],response=enetclass[,1])%>%.[,-1]
RFPRED=predict(rfmod,wtest)%>%as.data.frame()
RFPRED=RFPRED%>%mutate(.,id=c(1:100),truth=testset$class,prob.arrhythmia=.[,2],prob.healthy=.[,3],response=.$predict)%>%.[,-c(1:3)]
pred1=dummypred
pred1$data=LASSOPRED
pred2=dummypred
pred2$data=ENETPRED
pred3=dummypred
pred3$data=RFPRED
pred4=dummypred
pred4$data=RIDGEPRED%>%na.omit()
measure=list(mlr::bac,mlr::auc,mlr::acc,mlr::f1,mlr::ppv,mlr::npv,mlr::tnr,mlr::tpr)
p1=performance(pred1,measure)
p2=performance(pred2,measure)
p3=performance(pred3,measure)
p4=performance(pred4,measure)
lasso=p1%>%as.vector()
enet=p2%>%as.vector()
rf=p3%>%as.vector()
rdg=p4%>%as.vector()
labs=c("BAC","AUC","ACC","F1","PPV","NPV","TPR","TNR")
scores=list("Lasso"=lasso*10,"Elasticnet"=enet*10,"RandomForest"=rf*10,"Ridge"=rdg*10)
library(radarchart)
chartJSRadar(scores = scores, labs = labs, maxScale = 10)
measure2=list(ber,brier,fnr,fpr,mmce,fdr)
e1=performance(pred1,measure2)
e2=performance(pred2,measure2)
e3=performance(pred3,measure2)
e4=performance(pred4,measure2)
lasso=e1%>%as.vector()
enet=e2%>%as.vector()
rf=e3%>%as.vector()
rdg=e4%>%as.vector()
labs=c("BER","BRIER","FNR","FPR","MMCE","FDR")
scores=list("Lasso"=lasso,"Elasticnet"=enet,"RandomForest"=rf,"Ridge"=rdg)
chartJSRadar(scores = scores, labs = labs, maxScale =0.8)
The results suggest that the performances was almost similar for 4 methods.
Conclusion
In this case study, we have travelled a long journey from a messy dataset to the final classifier that could correctly distinguish patients with arrhythmia from healthy subjects in 78.6% of cases. Following points could be kept in mind:
Data exploration and visualisation is very important, this procedure allows to detect some obvious problem with our data, including the missing value, invariance variables, outliers, missing labels… and these information can already help us to eliminate the irrelevant features in our dataset, without using any special technique.
Random Forest is a powerful method for both feature selection and overfitting control. Due to the randomness in bootstrap aggregating, only one model could be not trustful; however by replicating the model training many times, we can reach a high level of certainty. In this tutorial, we provided a reproducible loop function for this method. Such procedure allows training 100 or more basic Random Forest algorithm with both randomness and reproducibility.
LASSO, Ridge and Elasticnet regularisations are also good for the feature selection task. These methods dont require too much computation cost but can also give a good performance. Lasso and Elasticnet also keep smaller number of features and as they are based on logistic regression, the coefficients of those variables could be interpreted, not like the Random Forest which is a Blackbox.
Thank you for joining us and see you in the next tutorial.
---
title: "MLM Case study X15"
subtitle: "Model selection. Part 2: Classification problem"
author: "Le Ngoc Kha Nhi (MD, PhD)"
date: "July 01, 2017"
output: 
  html_document: 
    code_download: true
    code_folding: hide
    number_sections: yes
    theme: journal
    toc: TRUE
    toc_float: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error = FALSE, warning = FALSE, message = FALSE)

library(tidyverse)
library(matrixStats)
library(corrplot)
library(RColorBrewer)
library(igraph)
library(ggraph)
library(glmnet)
library(h2o)
library(mlr)
library(radarchart)

```

![](ModelSelClassif.png)

## Foreword:About the Machine Learning in Medicine project 

This is the 35th case study by MLM team. The MLM project has been initialized in 2016 and aims to: 

(1) Encourage using Machine Learning techniques in medical research in Vietnam and

(2) Promote the use of R statistical programming language, an open source and leading tool for practicing data science.
    
## Introduction

In the previous tutorials (X13 and X14), we have explored the variable filtration and linear model selection. Filtration allows excluding the irrelevant features before model training, while model selection aims to optimize the model's structure and performance and to prevent the overfitting. We have introduced some methods including Bayesian model averaging, Lasso and Elastic net regularisations. These solutions are strictly based on the linear regression.

In the present case study, we will explore a new method that could be used for feature filtration: the bootstrap aggregating or Random Forest algorithm. Given a classification problem, our goal is to fit the best Random Forest model for predictive purpose with strongest variables. We will also compare the performance of this method with logistic classifiers that based on Lasso, Ridge and Elastic net regularizations. Note that the Bayesian Model averaging (BMA) method will not be considered in our comparative study, due to high complexity of our data.

## Dataset and Study objective

The present study implies the Cardiac Arrhythmia dataset. This dataset was created by H. Altay Guvenir (Ankara, Turkey) in 1997 for the development of a machine learning based algorithm to detect Arrhythmia from ECG recording. 

The original dataset has 452 observations and 279 features, 206 of which are numerical and the rest are nominal. This was an ambitious study as the author want to classify 15 types of cardiac arrhythmia, such as Ischemic change, Myocardial infarction (anterior and Inferior), Sinus bradycardy and tachycardy, Ventricular Premature Contraction (PVC), Supraventricular Premature Contraction, Left/right bundle branch block, AtrioVentricular blocks (3 degrees), Left ventricule hypertrophy, Atrial Fibrillation or Flutter and Others.

For the demonstration purpose, we will simplify the study question to a binary classification problem. Our experiment aims to develop an algorithm to distinguish between the presence and absence of arrhythmia. As the dataset contains a large number of features, and many of them are irrelevant, we also want to select the best combination of important features in order to improve our model’s performance.

A description of the dataset could be found here: 

<https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names

## Material and method

Cardiac Arrhythmia is a famous dataset in the community of Medical datascientists because it's a good opportunity to practice our skills of data wrangling, feature selection and multilabel modelling. This dataset has been used by Dr. Shirin Glander in her very good tutorial using "Sparkling water" (a tool that combines h2o and Spark):

<https://shiring.github.io/machine_learning/2017/02/27/h2o>


In our tutorial, we borrowed some of her R codes, but our approach is totally different. First, we treated numeric and nominal features separately, and we dont use  neither PCA, nor variance based feature selection but focus on the Random Forest and other regularisation methods. 

![](Modelselectiondata.png)

First, the original dataset D1 was decomposed into Numeric and Nominal parts. Each one was then visually explored in order to detect the missing values, invariance features or missing label problem. Those variables will be removed in the first place. The remaining variables will be gathered in a smaller dataset D2. This one was randomly splitted into two subsets for training and testing. From that moment, we dont touch the test subset as it will be used for the model validation.

On the train subset, we performed our data experiment. It consists of using a replicated Random Forest model training to identify the most important features in the train subset. The h2O package was used for this procedure. 

The basic algorithm consisted of a Random Forest with max tree number of 100, max depth of 50, sample rate of 70% (which means each tree is trained on 70% of the input data); and each split only used about ~15% (234^0.5) features as candidates. This algorithm was trained by a 10x10 crossvalidation with balanced stratification for each fold and early stopping criteria.

Such procedure was replicated 100 times. For each step the trained RF model was validated on the test set, and feature importances have been extracted for each model. These information were then averaged for 100 models to provide a full and clear picture of the importance of each features. This allows building the best combination of most relevant variables with a good certainty. This feature combination was used as input for a new Random Forest algorithm. The final version of RF model entered a comparative study against 3 other logistic based algorithms.

We also built 3 other logistic models that based on Lasso, Ridge and Elasticnet regularisations. The Lasso and Ridge models were supported by h2o framework, while the elasticnet model was based on glmnet package. All 3 models were trained by a 10x10 crossvalidation.

## Data preparation and exploration

First, we will prepare the ggplot theme for our experiment

```{r,message = FALSE,warning=FALSE}

library(tidyverse)

theme_bare <- function(base_size=8,base_family="sans"){theme_bw(base_size = base_size, base_family = base_family)+
    theme(
      axis.line = element_blank(), 
      axis.text.x = element_blank(), 
      axis.text.y = element_blank(),
      axis.ticks = element_blank(), 
      axis.title.x = element_blank(), 
      axis.title.y = element_blank(), 
      legend.position = "bottom", 
      panel.background = element_rect(fill = NA), 
      panel.border = element_blank(), 
      panel.grid.major = element_blank(), 
      panel.grid.minor = element_blank(), 
      plot.margin = unit(c(0,0,0,0), "lines")
    )
}

my_theme <- function(base_size =5, base_family = "sans"){
  theme_bw(base_size = base_size, base_family = base_family) +
    theme(
      panel.grid.major = element_line(color = "gray"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = NA),
      strip.background = element_rect(fill = "#8e0444", color = "#8e0444", size =0.5),
      strip.text = element_text(face = "bold", size = 5, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      legend.margin = margin(0.5,0.5,0.5,0.5)
    )
}

theme_set(my_theme())

```

Now we load the dataset from UCI database:

```{r,message = FALSE,warning=FALSE}
#load data

arrhythmia <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.data", sep = ",")
arrhythmia[arrhythmia == "?"] <- NA


# Extract the nominal features

arrhythmia[-280] <- lapply(arrhythmia[-280], as.character)
arrhythmia[-280] <- lapply(arrhythmia[-280], as.numeric)

dfnominal=arrhythmia[,c(280,2,22:27,34:39,46:51,58:63,70:75,82:87,94:99,106:111,118:123,130:135,142:147,154:159)]

#k1=74

dfnominal[]<-lapply(dfnominal, as.factor)

colnames(dfnominal)[1] <- "class"

```

The nominal part has 74 categorical variables (all of them are binary factor). The target variable (class) is also included as a categorical factor. A simple visualisation will show us the proportion of 16 labels:

```{r}
ggplot(dfnominal, aes(x = as.factor(class)))+geom_bar(aes(fill=as.factor(class)), alpha = 0.7,show.legend = F)+my_theme(10)+scale_x_discrete("16 labels of cardiac Arrhythmia")+geom_text(stat='count',aes(label=..count..),vjust=-0.5)

```

As mentioned above, we will simplify the study question by considering only two labels: healthy or arrhythmia (as Dr. Shirin Glander did). 

```{r}
dfnominal$class<-ifelse(dfnominal$class == "1", "healthy", "arrhythmia")
dfnominal$class =as.factor(dfnominal$class)
```

```{r}
ggplot(dfnominal, aes(x = class))+geom_bar(aes(fill=as.factor(class)), alpha = 0.7,show.legend = F)+my_theme(10)+geom_text(stat='count',aes(label=..count..),hjust=5,size=10,color="white")+coord_flip()+scale_fill_manual(values=c("#f70260","#079eef"))

```

By counting the distribution of two target labels (healthy and arrhythmia) within each label of 73 nominal variables, we can see that some nominal variables don't contribute to the prediction of the target labels. Those variables could be removed (even if we keep them, they will be later rejected as h2o can detect automatically those variables and remove them from the input before training).

```{r}
dfnominal%>%gather(V2:V159,key="Feature",value="Level")%>%ggplot(aes(x=Feature,fill=class,color=class))+geom_bar(position="fill",alpha=0.7)+scale_fill_manual(values=c("#f70260","#079eef"))+scale_color_manual(values=c("#d80053","#006fd8"))+facet_wrap(~Level)+coord_polar()+scale_x_discrete("Levels (0 or 1)")

```

Now we extract the numeric variables from the original dataset. There are 206 numeric variables:

```{r}
dfnumeric=arrhythmia[,c(1,3:21,28:33,40:45,52:57,64:69,76:81,88:93,100:105,112:117,124:129,136:141,148:153,160:279)]

```

First, we check whether there are any missing value among those numeric variables

```{r}
bar_missing <- function(x){
  library(dplyr)
  library(reshape2)
  library(ggplot2)
  x %>%
    is.na %>%
    melt %>%
    ggplot(data = .,
           aes(x = Var2)) +
    geom_bar(aes(y=(..count..),fill=value),alpha=0.7)+scale_fill_manual(values=c("gold","red3"),name = "",
                                                                        labels = c("Available","Missing"))+
    theme_minimal(5)+
    theme(axis.text.x = element_text(angle=45, vjust=0.5)) +
    labs(x = "Variables in Dataset",
         y = "Observations")+coord_flip()
}

dfnumeric%>%bar_missing()
```

As we could see, the variables V11,V12,V13 and V15 contain some missing values, while the variable V14 has so many missing values. We decided to remove those variables.

![](missing.png)

```{r}
dfnumeric=dfnumeric[,-c(11,12,13,14,15)]#remove because too many missing value
```

The next step consists of visualising the variance of the remaining 205 numeric variables. A variable with small or null variance will be considered irrelevant for modelling so they could also be removed.

Based on the results, following variables have no variance: V20,68,140,152,165,205,265,275. 

The graph shows that most of the numeric variable have low variance.

```{r}
library(matrixStats)

colvars <- data.frame(feature = colnames(dfnumeric),
                      variance = colVars(as.matrix(dfnumeric)))

subset(colvars, variance==0)

subset(colvars, variance>=20)%>%
  mutate(feature = factor(feature, levels = colnames(dfnumeric)))%>%
  ggplot(aes(x = reorder(feature,variance), y = variance))+
  geom_point(aes(color=variance), alpha = 0.7)+
  theme_minimal(5)+
  scale_color_gradient2(low="gold",high="purple",mid="red",midpoint = 700)+scale_x_discrete("Numeric variables")+coord_polar()

```

We decided to remove following variables, as they dont have any contribution to our prediction:

```{r}
df=cbind(dfnominal,dfnumeric)

removed=c("V22", "V165", "V140", "V20", "V68", "V146", "V133", "V85", "V158", "V145", "V132", "V84", "V157", "V46", "V155", "V142", "V72", "V144", "V152", "V86", "V38", "V70", "V205", "V275", "V265")

df=df[,!(names(df) %in% removed)] 

```

The final dataset contain 452 observations and 248 potential features. It's a good start but there is still many (200) numeric features, as showed in this correlation matrix and correlation network. Filtering those variable would be difficult without using advanced methods.

```{r}
library(corrplot)
library(RColorBrewer)

cor_mat=as.matrix(cor(use="pairwise.complete.obs",method="spearman",as.matrix(dfnumeric)))

cor_mat%>%corrplot(.,hclust.method ="ward.D2",type="lower",method="color",tl.col="black", tl.srt=45,tl.cex=0.5,col=rev(brewer.pal(n=10, name="RdBu")))

```


```{r,fig.show ="hide"}
library(igraph)

diag(cor_mat)<-0

library(ggraph)

m=cor_mat

cdf=data.frame(row=rownames(m)[row(m)[upper.tri(m)]], 
              col=colnames(m)[col(m)[upper.tri(m)]], 
              corr=m[upper.tri(m)])

cdf=subset(cdf,abs(corr)>0.5)

cdf$Close=cdf$corr>=0.7
names(cdf)=c("from","to","corr","Close")


graph<-graph_from_data_frame(cdf)

ggraph(graph, layout = 'fr')+geom_edge_density(aes(fill = Close))+geom_node_point()+geom_node_text(aes(label = name),nudge_x=,nudge_y=0.1,color="grey")+geom_edge_link(aes(colour = corr),width=1,alpha=0.5)+theme_bare()+scale_edge_fill_manual(values=c("white","#ffdf14"))+scale_edge_color_gradient(high="#ff003f",low="#0094ff")

```

![](cmat.png)

To this point, we got a smaller dataset of 248 variables, we will split this into two subsets, one for training and another for testing:

```{r,message = FALSE,results="hide"}

set.seed(1234)

idTrain=caret::createDataPartition(y=df$class,p=351/452,list=FALSE)
trainset=df[idTrain,]
testset=df[-idTrain,]

sp1=df%>%ggplot(aes(x=class,fill=class))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f70260","#079eef"))+coord_flip()+ggtitle("Origin")+geom_text(stat='count',aes(label=..count..),hjust=5,size=5,color="white")+theme_bw(10)

sp2=trainset%>%ggplot(aes(x=class,fill=class))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f70260","#079eef"))+coord_flip()+ggtitle("Train")+geom_text(stat='count',aes(label=..count..),hjust=5,size=5,color="white")+theme_bw(10)

sp3=testset%>%ggplot(aes(x=class,fill=class))+stat_count(color="black",alpha=0.7,show.legend = F)+scale_fill_manual(values=c("#f70260","#079eef"))+coord_flip()+ggtitle("Test")+geom_text(stat='count',aes(label=..count..),hjust=5,size=5,color="white")+theme_bw(10)

gridExtra::grid.arrange(sp1,sp2,sp3,ncol=1)


```

Now we begin our machine learning experiment. First, we load the h2O package and convert train/test subsets into h2O class:

## Feature selection using replicated Random Forest 

```{r,message = FALSE,results="hide"}

library(h2o)

h2o.init(nthreads = -1,max_mem_size ="4g")

wtrain=as.h2o(trainset)
wtest=as.h2o(testset)

features=setdiff(colnames(wtrain),"class")

```

The first solution consists of replicating a basic Random Forest (RF) algorithm 100 times, then extracting the information of variable importances.

Note that each RF model is trained using a different seed number, however this randomness will be reproducible so you will get the same result as given here (unlike the randomforest algorithm in mlr package).

Let's run these codes and drink a cup of coffee because it will take some times...

```{r,message = FALSE,results="hide"}

set.seed(12345)
rfseed=sample(10000,100,replace=F)

res=NULL
pdf=NULL

for (i in 1:100){
  
  rfmod=h2o.randomForest(x = features,
                         y = "class",
                         training_frame = wtrain,nfolds=10,
                         fold_assignment = "Stratified",
                         balance_classes = TRUE,
                         ntrees = 100, max_depth = 50,mtries = -1,sample_rate = 0.7,
                         stopping_metric = "misclassification",
                         stopping_tolerance = 0.01,
                         stopping_rounds = 3,
                         keep_cross_validation_fold_assignment = F,
                         keep_cross_validation_predictions=F,
                         score_each_iteration = TRUE,
                         seed=rfseed[i])
  
  pdftemp=predict(rfmod,wtest)%>%as_tibble()%>%mutate(.,Id=row.names(testset),Truth=testset$class,Accuracy=ifelse(testset$class ==.$predict, "Correct", "Wrong"),Model=i)
  
  vimp=rfmod@model$variable_importances%>%as_tibble()%>%mutate(.,Seed=rfseed[i],Model=i)
  
  res=rbind(res,vimp)
  
  pdf=rbind(pdf,pdftemp)

}

res=as_tibble(res)

res$Model=as.factor(res$Model)
```

As you could see in the loop, we have performed 3 tasks on each RF model: Training by crossvalidation, predicting the probability of arrhythmia in test subset, and extracting the variable importance matrix of the trained model. By replicating those tasks 100 times for 100 different versions of RF models, we finally got 2 large dataframes: (1) "pdf" contains information of accuracy of 100 predictions in testset and (2) "res"" dataframe that contains information of ariable importances.

First, we will explore the pdf dataframe, as before exploring the variable importances we want to know how good 100 RF model performed if the input consists of all 248 features. If we dont consider a feature filtration is there a consequence on the model's performance ?

```{r}
pdf$Id=as.factor(pdf$Id)

pdf%>%ggplot(aes(x=Model,y=reorder(Id,arrhythmia),fill=arrhythmia))+geom_tile()+theme_bw()+scale_fill_gradient2(low="green",mid="gold",high="red",midpoint = 0.5)+facet_wrap(~Truth,shrink=TRUE,scale="free",ncol=1)+theme_bw(8)+theme(axis.text.x = element_text(angle =45, hjust = 1))

```

The first heatmap plot represents the predicted probability of arrhythmia by 100 RF models (horizontal axis) for each case among 100 patients (vertical axis), when all 248 features have been used as input. A good model would give high probability for patients with true arrhythmia (orange or red colors) and lower probability in healthy subjects (green color). The presence of green color in Arrhythmic group indicates a false negative, while the presence of red color in Healthy group means false positive. Both of them are wrong and affect the balanced accuracy of the model.

Overall, We can see that the Random Forest algorithm handled well the problem without any feature filtration. In fact, the randomness in RF algorithm already removed the irrelevant features. Such performance is consistent through 100 different versions, so we can trust the variable importance results given by these models. However, there are still false positive and false negative predictions.

The accuracy of 100 RF models can also be evaluated qualitatively, as follows:

```{r}

pdf%>%ggplot(aes(x=Model,y=reorder(Id,1-arrhythmia),fill=Accuracy))+geom_tile()+theme_bw()+scale_fill_manual(values=c("skyblue","red"))+theme_bw(8)+theme(axis.text.x = element_text(angle =45, hjust = 1))+facet_wrap(~Truth,shrink=TRUE,scale="free",ncol=1)+scale_y_discrete("Idx")
```

We will run a second loop for estimating the averaged predicted probability and averaged accuracy of 100 models:

```{r}
pdf2=NULL

for (i in 1:100){
   Idt=pdf$Id[i]
   dtemp=subset(pdf,Id==Idt)
   sum=Hmisc::describe(dtemp)
   
   predprob=sum$arrhythmia$counts[[5]]
   acc=sum$Accuracy$values[[2]][1]/100
   wrong=sum$Accuracy$values[[2]][2]/100

   pdftemp=cbind.data.frame(Id=Idt,predprob,acc,wrong)
   pdf2=rbind(pdf2,pdftemp)
}

```

The 2 following graphs display the median of predicted probability and percentage of accurate/wrong predictions when 100 models were applied to the test subset (you can feel it likes reading a confusion matrix).

```{r}
pdf2$class=testset$class
pdf2$predprob=as.numeric(as.character(pdf2$predprob))

pdf2$Id=as.factor(pdf2$Id)

pdf2$wrong[is.na(pdf2$wrong)] <- 0


pdf2%>%ggplot(aes(x=reorder(Id,predprob),y=predprob,fill=predprob))+geom_bar(alpha=0.7,stat="identity")+theme_bw(8)+scale_fill_gradient2(low="gold",mid="red",high="purple",midpoint = 0.4)+facet_wrap(~class,shrink = TRUE,ncol=1,scales = "free")+geom_hline(yintercept = 0.5,linetype=2,size=1)

pdf2%>%gather(wrong,acc,key="Accuracy",value="Prediction")%>%ggplot(aes(x=reorder(Id,-predprob),y=Prediction,fill=Accuracy))+geom_bar(alpha=0.7,position="fill",stat="Identity")+theme_bw(8)+facet_wrap(~class,shrink = TRUE,ncol=1,scales = "free")+scale_fill_manual(values=c("skyblue","red"))+scale_x_discrete("Id")

```

So, we can trust our 100 RF models in terms of prediction, now we will see what information they give us about the importances of 248 features:

The following heatmap plots indicate the relative importance and scaled importance scores, respectively. More important a feature is, more saturated in color that feature would be. Based on those graphs, we can make a conclusion that up to 83/248 features (1/3) could be removed, as they did not contribute to the prediction. We only need about 150 to 160 features to build a good Random Forest model.

```{r}
res%>%ggplot(aes(x=reorder(Model,relative_importance),y=reorder(variable,-relative_importance),fill=relative_importance,color=relative_importance))+geom_tile(show.legend = F)+theme_bw()+scale_fill_gradient2(low="white",mid="red",high="purple",midpoint = 50)+scale_color_gradient2(low="white",mid="red",high="purple",midpoint = 50)+theme_bw(5)+theme(axis.text.x = element_text(angle =45, hjust = 1))+scale_x_discrete("Models")+scale_y_discrete("Features")

res%>%ggplot(aes(x=reorder(Model,scaled_importance),y=reorder(variable,-scaled_importance),fill=scaled_importance,color=scaled_importance))+geom_tile(show.legend = F)+theme_bw()+scale_fill_gradient2(low="white",mid="gold",high="red",midpoint = 0.5)+scale_color_gradient2(low="white",mid="gold",high="red",midpoint =0.5)+theme_bw(5)+theme(axis.text.x = element_text(angle =45, hjust = 1))+scale_x_discrete("Models")+scale_y_discrete("Features")


```

As above, we will also run another loop to extract the averaged importance score for each features:

```{r}
res2=NULL

for (i in 1:253){
  dtemp=subset(res,variable==features[i])
  sum=psych::describe(dtemp)
  rei=sum[[5]][2]
  sci=sum[[5]][3]
  pct=sum[[5]][4]

  rtemp=cbind.data.frame(rei,sci,pct,var=dtemp[[1]][1])
  res2=rbind(res2,rtemp)
  
}
```

```{r}
res2%>%ggplot(aes(x=reorder(var,rei),y=rei,fill=rei,color=rei))+geom_segment(aes(x=reorder(var,rei),xend=var,y=0,yend=rei),size=1,show.legend = F)+geom_point(size=1,show.legend = F)+scale_x_discrete("Features")+scale_y_continuous("Relative importance")+coord_flip()+scale_fill_gradient(low="blue",high="red")+scale_color_gradient(low="blue",high="red")+ggtitle("Relative importance")+theme_bw(5)

```

We decided to keep only 150 strongest variables in our input:

```{r}
res3=res2%>%arrange(desc(rei))%>%.[c(1:150),]

res3%>%ggplot(aes(x=reorder(var,rei),y=rei,fill=rei,color=rei))+geom_segment(aes(x=reorder(var,rei),xend=var,y=0,yend=rei),size=1,show.legend = F)+geom_point(size=1,show.legend = F)+scale_x_discrete("Features")+scale_y_continuous("Relative importance")+coord_flip()+scale_fill_gradient(low="blue",high="red")+scale_color_gradient(low="blue",high="red")+ggtitle("Relative importance")+theme_bw(5)

```

Based on a new input that contains only 150 strongest features, we will train another model (Note that the seed number is reset to 333)

```{r,message = FALSE,results="hide"}
features2=as.vector(res3$var)

rfmod=h2o.randomForest(x = features2,
                       y = "class",
                       training_frame = wtrain,nfolds=10,
                       fold_assignment = "Stratified",
                       balance_classes = TRUE,
                       ntrees = 100, max_depth = 50,mtries = -1,sample_rate = 0.8,
                       stopping_metric = "misclassification",
                       stopping_tolerance = 0.01,
                       stopping_rounds = 3,
                       keep_cross_validation_fold_assignment = F,
                       keep_cross_validation_predictions=F,
                       score_each_iteration = TRUE,
                       seed=333)

```

## Logistic models

We have finish our first task - building a Random Forest model for our classification problem. Now, we will consider another 3 solutions that based on Logistic regression. 

The first one will be LASSO logistic model (seed =546)

```{r,message = FALSE,results="hide"}
lassomod=h2o.glm(x=features,
               y="class",
               training_frame=wtrain, 
               nfolds = 10,seed = 546, 
               keep_cross_validation_predictions = FALSE,
               keep_cross_validation_fold_assignment = FALSE, 
               fold_assignment = "Stratified",
               family = "binomial",
               alpha=1,
               lambda_search = TRUE,nlambdas =20,
               early_stopping = TRUE,
               standardize = TRUE,
               missing_values_handling ="Skip",
               compute_p_values = FALSE, 
               remove_collinear_columns = FALSE,
               intercept = TRUE, 
               non_negative = FALSE
)
```

Here are the coefficients of selected features for this Lasso logistic model, note that the Lasso regularisation has automatically removed all the features that were found to be not important. The final logistic model only contains 25 features.

```{r}
coefs=lassomod@model$coefficients_table%>%as.tibble()%>%subset(.,coefficients!=0)

coefs
```

The second solution is Ridge regularisation (seed=999). This could also be trained in h2O. Unlikes LASSO, Ridge regularisation only shrink the coefficients close to zero, but not zero, so basically the Ridge model cannot be used for feature selection.

```{r,message = FALSE,results="hide"}
ridgemod=h2o.glm(x=features,
                 y="class",
                 training_frame=wtrain, 
                 nfolds = 10,seed = 999, 
                 keep_cross_validation_predictions = FALSE,
                 keep_cross_validation_fold_assignment = FALSE, 
                 fold_assignment = "Stratified",
                 family = "binomial",
                 alpha=0,
                 lambda_search = TRUE,nlambdas =20,
                 early_stopping = TRUE,
                 standardize = TRUE,
                 missing_values_handling ="Skip",
                 compute_p_values = FALSE, 
                 remove_collinear_columns = FALSE,
                 intercept = TRUE, 
                 non_negative = FALSE
)
```

The last solution will be Elastic net model (seed =5566). The training of this model is done using glmnet package. Note that this package requires transforming input data to matrix, and it does not accept missing values.

```{r}
#Elastic net

library(glmnet)

set.seed(5566)

trainsetenet=trainset%>%na.omit()
testsetenet=testset%>%na.omit()

trainsetenet[-1]<-lapply(trainsetenet[-1], as.numeric)
testsetenet[-1]<-lapply(testsetenet[-1], as.numeric)

x=as.matrix(trainsetenet[,-1])
y=trainsetenet[,1]

enet = cv.glmnet(x,y,family = "binomial",type.measure = "class")

plot(enet)
```

The elastic net regularisation is a hybrid method that combines both l1 and l2 regularisations (Lasso and Ridge). The results showed that Elastic net model only kept 30 features

```{r}

coef(enet)%>%as.matrix()%>%as.data.frame()%>%subset(.,`1`!=0)


```

## Validation and comparative study

Finally, we got 4 candidates of classifiers. In order to determine which has the best performance, we will perform a comparative study. We will evaluate 14 performance metrics for each model when applied to the test subset, then plot the results on a radar chart to visualize the difference.

The performance metrics can be divided into two groups : the first group has best value of 1, including: Balanced accuracy(BAC), Area under ROC curve (AUC), Accuracy (ACC), F1 score, Positive/Negative predictive value and True negative/positive rates. These metrics indicate the accuracy, sensitivity and specificity of the model. The second group contains the metrics that have best value of 0 and worst value of 1, including Balance error rate (BER), Brier score, False negative/positive rates, Mean misclassification error and False discovery rate (FDR). Those scores measure the weakness of each model. For ECG reading, false negative prediction is more dangerous than false positive.

```{r,message = FALSE,results="hide"}
library(mlr)

task=makeClassifTask(data=testset,target="class")

dummylrner=makeLearner("classif.rpart",predict.type ="prob")
dummymod=train(dummylrner,task)
dummypred=predict(dummymod,task)

LASSOPRED=predict(lassomod,wtest)%>%as.data.frame()
LASSOPRED=LASSOPRED%>%mutate(.,id=c(1:100),truth=testset$class,prob.arrhythmia=.[,2],prob.healthy=.[,3],response=.$predict)%>%.[,-c(1:3)]


RIDGEPRED=predict(ridgemod,wtest)%>%as.data.frame()
RIDGEPRED=RIDGEPRED%>%mutate(.,id=c(1:100),truth=testset$class,prob.arrhythmia=.[,2],prob.healthy=.[,3],response=.$predict)%>%.[,-c(1:3)]

enetclass=predict(enet,as.matrix(testsetenet[,-1]),type="class")%>%as.data.frame()
ENETPRED=predict(enet,as.matrix(testsetenet[,-1]),type="response")%>%as.data.frame()%>%mutate(.,id=c(1:98),truth=testsetenet$class,prob.arrhythmia=1-.[,1],prob.healthy=.[,1],response=enetclass[,1])%>%.[,-1]

RFPRED=predict(rfmod,wtest)%>%as.data.frame()
RFPRED=RFPRED%>%mutate(.,id=c(1:100),truth=testset$class,prob.arrhythmia=.[,2],prob.healthy=.[,3],response=.$predict)%>%.[,-c(1:3)]

```

```{r}

pred1=dummypred
pred1$data=LASSOPRED
pred2=dummypred
pred2$data=ENETPRED
pred3=dummypred
pred3$data=RFPRED
pred4=dummypred
pred4$data=RIDGEPRED%>%na.omit()

measure=list(mlr::bac,mlr::auc,mlr::acc,mlr::f1,mlr::ppv,mlr::npv,mlr::tnr,mlr::tpr)

p1=performance(pred1,measure)
p2=performance(pred2,measure)
p3=performance(pred3,measure)
p4=performance(pred4,measure)

lasso=p1%>%as.vector()
enet=p2%>%as.vector()
rf=p3%>%as.vector()
rdg=p4%>%as.vector()


```


```{r}
labs=c("BAC","AUC","ACC","F1","PPV","NPV","TPR","TNR")

scores=list("Lasso"=lasso*10,"Elasticnet"=enet*10,"RandomForest"=rf*10,"Ridge"=rdg*10)


library(radarchart)

chartJSRadar(scores = scores, labs = labs, maxScale = 10)
```

```{r}
measure2=list(ber,brier,fnr,fpr,mmce,fdr)
e1=performance(pred1,measure2)
e2=performance(pred2,measure2)
e3=performance(pred3,measure2)
e4=performance(pred4,measure2)

lasso=e1%>%as.vector()
enet=e2%>%as.vector()
rf=e3%>%as.vector()
rdg=e4%>%as.vector()

labs=c("BER","BRIER","FNR","FPR","MMCE","FDR")

scores=list("Lasso"=lasso,"Elasticnet"=enet,"RandomForest"=rf,"Ridge"=rdg)

chartJSRadar(scores = scores, labs = labs, maxScale =0.8)
```

The results suggest that the performances was almost similar for 4 methods.

## Conclusion

In this case study, we have travelled a long journey from a messy dataset to the final classifier that could correctly distinguish patients with arrhythmia from healthy subjects in 78.6% of cases. Following points could be kept in mind:

1) Data exploration and visualisation is very important, this procedure allows to detect some obvious problem with our data, including the missing value, invariance variables, outliers, missing labels... and these information can already help us to eliminate the irrelevant features in our dataset, without using any special technique.

2) Random Forest is a powerful method for both feature selection and overfitting control. Due to the randomness in bootstrap aggregating, only one model could be not trustful; however by replicating the model training many times, we can reach a high level of certainty. In this tutorial, we provided a reproducible loop function for this method. Such procedure allows training 100 or more basic Random Forest algorithm with both randomness and reproducibility.

3) LASSO, Ridge and Elasticnet regularisations are also good for the feature selection task. These methods dont require too much computation cost but can also give a good performance. Lasso and Elasticnet also keep smaller number of features and as they are based on logistic regression, the coefficients of those variables could be interpreted, not like the Random Forest which is a Blackbox.


**Thank you for joining us and see you in the next tutorial.**
