Foreword: About the Machine Learning in Medicine (MLM) project
The MLM project has been initialized in 2016 and aims to:
Encourage using Machine Learning techniques in medical research in Vietnam and
Promote the use of R statistical programming language, an open source and leading tool for practicing data science.
Introduction
How many features and which ones should be used to develop our model ? This question could be difficult, particularly when we have a very large dataset that contains hundred or thousands of variables. Feature selection can help us to identify the most relevant features for our problem.
Note that feature selection is different from Model selection, Model regularisation and dimensionality reduction. The feature selection procedure consists of selecting a subset of relevant features BEFORE fitting the models, whilst Model selection (Bayesian Model Averaging, Stepwise…) or Regularisation techniques such as boosting, Lasso or Elastic net imply modifications on the model itself. The dimensionality reduction techniques like PCA aims to generate new combinations from all features instead of excluding some of them.
Objective
In this case study X13, we will explore 12 simple filter methods of feature selection that are suported by the mlr package.
http://mlr-org.github.io/mlr-tutorial/devel/html/filter_methods/index.html
Filter methods consist of estimating a statistical measure to assign a scoring to each feature. The high ranked features will be considered more relevant and the ones with low rank could be excluded. Different statistical methods could be applied, including the simplest ones like Chi-square test, ANOVA or Kruskal-Wallis test or the scores that based on real machine learning algorithms like Information gain, gain ratio, Random Forest based variable’s importance. Those methods are often univariate and consider the feature independently, or with regard to the dependent variable.
We also introduce the Kuncheva stability index (KSI) for determining the best feature subset(s) for a binary classification problem.
Materials and method
Our experiment implies the SPECTF Heart Data Set, as described on the UCI ML database:
https://archive.ics.uci.edu/ml/datasets/SPECTF+Heart
The dataset represents a technique called “Cardiac Single Proton Emission Computed Tomography (SPECT)”. The train subset included 80 image sets and the test subset included 187 image sets. Each of the patients can be classified as normal and abnormal.
The data of these 267 SPECT image sets (patients) were processed to extract features that summarize the original SPECT images. As a result, 44 continuous feature pattern was created for each patient.Classification rule should be developed from these 44 patterns. Assuming that we already decided to apply Logistic model for constructing our classifier, and we would like to use only 15 features instead of 44. Our goal is to identify the most useful subset of 12 features from the original dataset.
Data preparation
First, we will prepare the ggplot theme for our experiment
library(tidyverse)
theme_bare <- function(base_size=8,base_family="sans"){theme_bw(base_size = base_size, base_family = base_family)+
theme(
axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "bottom",
panel.background = element_rect(fill = NA),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.margin = unit(c(0,0,0,0), "lines")
)
}
my_theme <- function(base_size =5, base_family = "sans"){
theme_bw(base_size = base_size, base_family = base_family) +
theme(
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = NA),
strip.background = element_rect(fill = "#001d60", color = "#00113a", size =0.5),
strip.text = element_text(face = "bold", size = 5, color = "white"),
legend.position = "bottom",
legend.justification = "center",
legend.background = element_blank(),
legend.margin = margin(0.5,0.5,0.5,0.5)
)
}
theme_set(my_theme())
myfillcolors=c("#ff003f","#0094ff", "#ae00ff" , "#94ff00", "#ffc700","#fc1814")
First, we load the data to R:
testset=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECTF.test",sep = ",")%>%as_tibble()
trainset=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECTF.train",sep = ",")%>%as_tibble()
testset$V1<-factor(testset$V1,levels = c(1,0), labels = c("Positive", "Negative") )
trainset$V1<-factor(trainset$V1,levels = c(1,0), labels = c("Positive", "Negative") )
names(testset)=c("Class",paste(rep("F",times=44),c(1:44),sep="") )
names(trainset)=c("Class",paste(rep("F",times=44),c(1:44),sep="") )
Data visualising
trainset%>%gather(F1:F44,key="Features",value="Value")%>%ggplot(aes(x=Value,fill=Class))+geom_density(alpha=0.5)+facet_wrap(~Features,scales="free",ncol=4)+scale_fill_manual(values=myfillcolors)
trainset%>%gather(F1:F44,key="Features",value="Value")%>%ggplot(aes(x=Class,y=Value,fill=Class))+geom_boxplot(alpha=0.7)+facet_wrap(~Features,scales="free",ncol=4)+scale_fill_manual(values=myfillcolors)+coord_flip()
library(corrplot)
library(RColorBrewer)
cor_mat=as.matrix(cor(as.matrix(trainset[,-1])))
cor_mat%>%corrplot(.,order="hclust",type="lower",method="ellipse",tl.col="black", tl.srt=45,tl.cex=0.5,col=rev(brewer.pal(n=11, name="RdBu")))

cor_mat=as.matrix(cor(as.matrix(trainset[,-1])))
library(igraph)
diag(cor_mat)<-0
library(ggraph)
m=cor_mat
df=data.frame(row=rownames(m)[row(m)[upper.tri(m)]],
col=colnames(m)[col(m)[upper.tri(m)]],
corr=m[upper.tri(m)])
cdf=subset(df,abs(corr)>0.5)
names(cdf)=c("from","to","corr")
graph<-graph_from_data_frame(cdf)
ggraph(graph, layout = 'linear',circular=T)+geom_edge_arc(aes(colour = corr,width=corr),alpha=0.3)+geom_node_label(aes(label = name))+coord_fixed()+scale_edge_color_gradient(high="#ff003f",low="#0094ff")+scale_edge_width_continuous()+theme_bare()
Feature selection by 12 Filter methods
We introduce a function called “vimplot”. This function will perform a filter method of interest then extract the subset of features and plot the result.
Then we apply the function with 12 different filter methods and save the output within the objects from impvar1 to impvar12
library(FSelector)
library(mlr)
library(randomForestSRC)
library(rJava)
library(MLmetrics)
vimplot=function(data,target,method){
train.task=makeClassifTask(data=data, target=target)
imp_var=generateFilterValuesData(train.task, method=method)
fsdf=imp_var$data%>%.[,-2]
names(fsdf)=c("Feature","Value")
fsdf$Method=method
fsdf$Feature
vimp=fsdf%>%arrange(desc(Value))%>% select(Feature)
plot=fsdf%>%ggplot(aes(x=reorder(fsdf$Feature,fsdf$Value),y=Value,fill=Value,color=Value))+geom_segment(aes(x=reorder(fsdf$Feature,fsdf$Value),xend=Feature,y=0,yend=Value),size=1,show.legend = F)+geom_point(size=2,show.legend = F)+scale_x_discrete("Features")+scale_y_continuous(method)+coord_flip()+scale_fill_gradient(low="blue",high="red")+scale_color_gradient(low="blue",high="red")+ggtitle(method)
return(list(plot=plot,name=vimp))
}
#ANOVA TEST
res=vimplot(data=trainset,target="Class",method="anova.test")
impvar1 <-res$name
res$plot

# cforest.importance
res=vimplot(data=trainset,target="Class",method="cforest.importance")
impvar2 <-res$name
res$plot

#chi.squared
res=vimplot(data=trainset,target="Class",method="chi.squared")
impvar3 <-res$name
res$plot

#gain.ratio
res=vimplot(data=trainset,target="Class",method="gain.ratio")
impvar4 <-res$name
res$plot

#information.gain
res=vimplot(data=trainset,target="Class",method="information.gain")
impvar5 <-res$name
res$plot

#OneR
res=vimplot(data=trainset,target="Class",method="oneR")
## Class ~ .
## <environment: 0x00000000704520a0>
impvar6 <-res$name
res$plot

#randomForest.importance
res=vimplot(data=trainset,target="Class",method="randomForest.importance")
impvar7 <-res$name
res$plot

#relief
res=vimplot(data=trainset,target="Class",method="relief")
impvar8 <-res$name
res$plot

#symmetrical.uncertainty
res=vimplot(data=trainset,target="Class",method="symmetrical.uncertainty")
impvar9 <-res$name
res$plot

# randomForestSRC.rfsrc
res=vimplot(data=trainset,target="Class",method="randomForestSRC.rfsrc")
impvar10 <-res$name
res$plot

# variance
res=vimplot(data=trainset,target="Class",method="variance")
impvar11<-res$name
res$plot

#univariate.model.score (rpart)
res=vimplot(data=trainset,target="Class",method="univariate.model.score")
impvar12<-res$name
res$plot

Kuncheva stability index
The Kuncheva stability index is supported by OmicsMarkeR pacakge that could be installed from Bioconductor platform:
#source("https://bioconductor.org/biocLite.R")
#biocLite("OmicsMarkeR")
First, we will load the package, then apply a loop on our 12 output objects of filter methods (V1 to V12). As mentioned above, we would like to reduce the number of features from 44 to only 15 (thus excluding 29 features). Note that the subsets of 15 features might be different, depending on the filtering method in use.
The Kuncheva stability index aims to measure the consistency level between those 12 feature subsets. The result looks similar to a correlation matrix.
library("OmicsMarkeR")
v1=impvar1$Feature[1:15]
v2=impvar2$Feature[1:15]
v3=impvar3$Feature[1:15]
v4=impvar4$Feature[1:15]
v5=impvar5$Feature[1:15]
v6=impvar6$Feature[1:15]
v7=impvar7$Feature[1:15]
v8=impvar8$Feature[1:15]
v9=impvar9$Feature[1:15]
v10=impvar10$Feature[1:15]
v11=impvar11$Feature[1:15]
v12=impvar12$Feature[1:15]
mat <- matrix(NA,12,12)
varcol <- paste0("v",seq(1,12,by = 1))
varrow <- paste0("v",seq(1,12,by = 1))
for (row in 1:12){
for(col in 1:12){
command <- paste0(" mat[row,col] <- kuncheva(",
varrow[row],",",
varcol[col],",num.features = dim(trainset[,-1])[[2]])")
eval(parse(text = command))
}
}
This is the KSI matrix
colnames(mat)=c("ANOV","CFRI","CHISQ","GRT","INFG","ONER","RFI","REL","SYMU","SRC","VAR","UMS")
rownames(mat)=c("ANOV","CFRI","CHISQ","GRT","INFG","ONER","RFI","REL","SYMU","SRC","VAR","UMS")
print(mat)
## ANOV CFRI CHISQ GRT INFG ONER
## ANOV 1.0000000 0.9494253 0.7471264 0.7471264 0.7471264 0.5448276
## CFRI 0.9494253 1.0000000 0.7471264 0.7471264 0.7471264 0.5448276
## CHISQ 0.7471264 0.7471264 1.0000000 1.0000000 1.0000000 0.7977011
## GRT 0.7471264 0.7471264 1.0000000 1.0000000 1.0000000 0.7977011
## INFG 0.7471264 0.7471264 1.0000000 1.0000000 1.0000000 0.7977011
## ONER 0.5448276 0.5448276 0.7977011 0.7977011 0.7977011 1.0000000
## RFI 0.7471264 0.7471264 0.7471264 0.7471264 0.7471264 0.5448276
## REL 0.7977011 0.7977011 0.6965517 0.6965517 0.6965517 0.5954023
## SYMU 0.7471264 0.7471264 1.0000000 1.0000000 1.0000000 0.7977011
## SRC 0.7977011 0.8482759 0.7977011 0.7977011 0.7977011 0.6459770
## VAR 0.6965517 0.6965517 0.6459770 0.6459770 0.6459770 0.5448276
## UMS 0.6459770 0.6965517 0.6459770 0.6459770 0.6459770 0.5448276
## RFI REL SYMU SRC VAR UMS
## ANOV 0.7471264 0.7977011 0.7471264 0.7977011 0.6965517 0.6459770
## CFRI 0.7471264 0.7977011 0.7471264 0.8482759 0.6965517 0.6965517
## CHISQ 0.7471264 0.6965517 1.0000000 0.7977011 0.6459770 0.6459770
## GRT 0.7471264 0.6965517 1.0000000 0.7977011 0.6459770 0.6459770
## INFG 0.7471264 0.6965517 1.0000000 0.7977011 0.6459770 0.6459770
## ONER 0.5448276 0.5954023 0.7977011 0.6459770 0.5448276 0.5448276
## RFI 1.0000000 0.6459770 0.7471264 0.7977011 0.7471264 0.7471264
## REL 0.6459770 1.0000000 0.6965517 0.7471264 0.6459770 0.6459770
## SYMU 0.7471264 0.6965517 1.0000000 0.7977011 0.6459770 0.6459770
## SRC 0.7977011 0.7471264 0.7977011 1.0000000 0.6459770 0.7471264
## VAR 0.7471264 0.6459770 0.6459770 0.6459770 1.0000000 0.5448276
## UMS 0.7471264 0.6459770 0.6459770 0.7471264 0.5448276 1.0000000
Based on this consistency matrix, we can develop some graphs to visualise the coherence level between 12 feaure filtering methods:
mat%>%corrplot(.,order="hclust",type="lower",method="pie",tl.col="black", tl.srt=45,tl.cex=0.5,col=(brewer.pal(n=9, name="Reds")))

diag(mat)<-0
m=mat
df=data.frame(row=rownames(m)[row(m)[upper.tri(m)]],
col=colnames(m)[col(m)[upper.tri(m)]],
corr=m[upper.tri(m)])
df$Coherent=df$corr>=0.8
names(df)=c("from","to","corr","coherent")
cdf=subset(df,abs(corr)>=0.8)
graph1<-graph_from_data_frame(df)
graph2<-graph_from_data_frame(cdf)
ggraph(graph2, layout = 'linear',circular=T)+geom_edge_arc2(aes(colour = corr,width=corr),alpha=0.3)+geom_node_label(aes(label = name))+coord_fixed()+scale_edge_color_gradient(high="#ff003f",low="#0094ff")+scale_edge_width_continuous()+theme_bare()

ggraph(graph1, layout = 'fr')+geom_edge_density(aes(fill = coherent))+geom_edge_link(alpha=0.6)+geom_node_label(aes(label = name))+theme_bare()+coord_fixed()+scale_edge_fill_manual(values=c("#0094ff","#ff003f"))

The network revealed that the coherence level was highest among methods that based on Decision tree or Random Forest algorithms; consistency level was lower between OneR or ANOVA. The methods that shared a high value of KSI could be considered equivalent, as they would result the same features subset. Note that such result could change for each iteration, as the Random forest and permutation algorithms are random.
EVALUATING THE PERFORMANCE OF 12 LOGISTIC MODELS
The last step consists of fitting 12 logistic models that based on 12 filtered feature subsets, then evaluating the performance of each model. A good feature subset would results accurate model, as evaluated by AUC, Balanced accuracy, F1 score and balanced error rate
set.seed(123)
res <- NULL
for (i in 1:12){
train=trainset[,c(eval(parse(text = varrow[i])),"Class")]
test=testset[,c(eval(parse(text = varrow[i])),"Class")]
#Logistic regression
train.task=makeClassifTask(data=train, target="Class")
test.task =makeClassifTask(data=test, target="Class")
# Build Model
learner_logis=makeLearner("classif.logreg", predict.type="prob")
model=mlr::train(learner_logis, train.task)
# Prediction and Evaluation
pred=predict(model, test.task)
mes=list(mlr::auc,mlr::bac,mlr::ber,mlr::f1)
perf=performance(pred,measures=mes)
auc=perf[[1]]
bac=perf[[2]]
ber=perf[[3]]
f1=perf[[4]]
line=data.frame(FSmethod = i,AUC=auc,BAC=bac,BER=ber,F1=f1)
res[[i]] <- line
}
Here is the verdict:
res <- do.call(rbind,res)
res$FSmethod <- c("ANOV","CFRI","CHISQ","GRT","INFG","ONER","RFI","REL","SYMU","SRC","VAR","UMS")
res
res%>%gather(AUC:F1,key="Metric",value="Value")%>%ggplot()+geom_tile(aes(x=reorder(Metric,Value),y=reorder(FSmethod,Value),fill=Value),color="black")+geom_text(aes(x=reorder(Metric,Value),y=reorder(FSmethod,Value),label=round(Value,4)),color="black")+scale_fill_distiller(palette = "Spectral")+theme_bw()+scale_x_discrete("Metrics")+scale_y_discrete("FS methods")

res%>%gather(AUC:F1,key="Metric",value="Value")%>%ggplot()+geom_point(shape=21,color="black",size=3,aes(x=reorder(FSmethod,Value),y=Value,fill=Metric),color="black")+scale_fill_manual(values=myfillcolors)+theme_bw()+scale_x_discrete("FSmethod")+scale_y_continuous("Values")+facet_wrap(~Metric,scales="free")+coord_flip()

Based on those result, Random Forest based importance (RFI) based method were the most powerful methods for feature selection.
V(graph1)$AUC=res$AUC
ggraph(graph1, layout = 'linear',circular=T)+geom_edge_fan(alpha=0.1,aes(colour = corr,width=corr))+geom_node_point(shape=21,aes(size=AUC,fill=AUC))+geom_node_text(aes(label = name),nudge_x=,nudge_y=0.1)+theme_bare()+coord_fixed()+scale_edge_color_gradient(low="#0094ff",high="#ff003f")+scale_fill_gradient(high="red",low="gold")

V(graph1)$BAC=res$BAC
ggraph(graph1, layout = 'linear',circular=T)+geom_edge_fan(alpha=0.1,aes(colour = corr,width=corr))+geom_node_point(shape=21,aes(size=BAC,fill=BAC))+geom_node_text(aes(label = name),nudge_x=,nudge_y=0.1)+theme_bare()+coord_fixed()+scale_edge_color_gradient(low="#0094ff",high="#ff003f")+scale_fill_gradient(high="red",low="gold")

The network analysis that combines KSI and AUC/Accuracy values showed that the coherence level as measured by KSI is NOT associated to the performance as measured by AUC or BAC. This means the methods that have the same KSI value could have an equivalent performance (as they generate the same feature subset), but the KSI is not a measure of goodness of filter method, as a method with low consistency in regard to other methods could become a better one.
Conclusion
In thiscase study, we introduced different methods for filtering the features. The consistency level among features subsets could be evaluated by the Kuncheva stability index. However, this index is not a measure of the performance for each method. The Random Forest based criteria, including the variable importance, information gain and Gain ratio are very useful for selecting features.
In a next tutorial, we will discuss about the methods that modify our models instead of input data, i.e Model selection.
Reference: https://pdfs.semanticscholar.org/9c4f/006d5547cddecef588e66ab6b33e6845b33a.pdf
Contribution:
This joint tutorial was created by Viet Tran and Nam Le Dong as a part of MLM project
Viet Tran designed the data experiment. He is also the author of most functions and loops for Feature selection and Model performance evaluation.
Nam Le Dong handled the graphs, network analysis and Rmarkdown writing.
See you in the next tutorial and thank for joining us !
END
---
title: "MLM Case study X13"
subtitle: "Feature selection"
author: "Viet Tran, Nam Le Dong"
date: "18 June, 2017"
output: 
  html_document: 
    code_download: true
    code_folding: hide
    number_sections: yes
    theme: journal
    df_print: paged
    toc: TRUE
    toc_float: TRUE
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE, error = FALSE, warning = FALSE, message = FALSE)

library(tidyverse)
library(igraph)
library(ggraph)
library(corrplot)
library(RColorBrewer)
library(FSelector)
library(mlr)
library(randomForestSRC)
library(rJava)
library(MLmetrics)
library("OmicsMarkeR")
```

![](featureselection.png)

# Foreword: About the Machine Learning in Medicine (MLM) project

The MLM project has been initialized in 2016 and aims to: 

1) Encourage using Machine Learning techniques in medical research in Vietnam and

2) Promote the use of R statistical programming language, an open source and leading tool for practicing data science.


# Introduction

How many features and which ones should be used to develop our model ? This question could be difficult, particularly when we have a very large dataset that contains hundred or thousands of variables. Feature selection can help us to identify the most relevant features for our problem.

Note that feature selection is different from Model selection, Model regularisation and dimensionality reduction. The feature selection procedure consists of selecting a subset of relevant features BEFORE fitting the models, whilst Model selection (Bayesian Model Averaging, Stepwise...) or Regularisation techniques such as boosting, Lasso or Elastic net imply modifications on the model itself. The dimensionality reduction techniques like PCA aims to generate new combinations from all features instead of excluding some of them.

# Objective

In this case study X13, we will explore 12 simple filter methods of feature selection that are suported by the mlr package. 

<http://mlr-org.github.io/mlr-tutorial/devel/html/filter_methods/index.html>

Filter methods consist of estimating a statistical measure to assign a scoring to each feature. The high ranked features will be considered more relevant and the ones with low rank could be excluded. Different statistical methods could be applied, including the simplest ones like Chi-square test, ANOVA or Kruskal-Wallis test or the scores that based on real machine learning algorithms like Information gain, gain ratio, Random Forest based variable's importance. Those methods are often univariate and consider the feature independently, or with regard to the dependent variable.

We also introduce the Kuncheva stability index (KSI) for determining the best feature subset(s) for a binary classification problem.

# Materials and method

Our experiment implies the SPECTF Heart Data Set, as described on the UCI ML database:

<https://archive.ics.uci.edu/ml/datasets/SPECTF+Heart>

The dataset represents a technique called "Cardiac Single Proton Emission Computed Tomography (SPECT)". The train subset included 80 image sets and the test subset included 187 image sets. Each of the patients can be classified as normal and abnormal. 

The data of these 267 SPECT image sets (patients) were processed to extract features that summarize the original SPECT images. As a result, 44 continuous feature pattern was created for each patient.Classification rule should be developed from these 44 patterns. 
Assuming that we already decided to apply Logistic model for constructing our classifier, and we would like to use only 15 features instead of 44. Our goal is to identify the most useful subset of 12 features from the original dataset.

# Data preparation

First, we will prepare the ggplot theme for our experiment

```{r,message = FALSE,warning=FALSE}

library(tidyverse)

theme_bare <- function(base_size=8,base_family="sans"){theme_bw(base_size = base_size, base_family = base_family)+
  theme(
  axis.line = element_blank(), 
  axis.text.x = element_blank(), 
  axis.text.y = element_blank(),
  axis.ticks = element_blank(), 
  axis.title.x = element_blank(), 
  axis.title.y = element_blank(), 
  legend.position = "bottom", 
  panel.background = element_rect(fill = NA), 
  panel.border = element_blank(), 
  panel.grid.major = element_blank(), 
  panel.grid.minor = element_blank(), 
  plot.margin = unit(c(0,0,0,0), "lines")
)
}



my_theme <- function(base_size =5, base_family = "sans"){
  theme_bw(base_size = base_size, base_family = base_family) +
    theme(
      panel.grid.major = element_line(color = "gray"),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = NA),
      strip.background = element_rect(fill = "#001d60", color = "#00113a", size =0.5),
      strip.text = element_text(face = "bold", size = 5, color = "white"),
      legend.position = "bottom",
      legend.justification = "center",
      legend.background = element_blank(),
      legend.margin = margin(0.5,0.5,0.5,0.5)
    )
}

theme_set(my_theme())

myfillcolors=c("#ff003f","#0094ff", "#ae00ff" , "#94ff00", "#ffc700","#fc1814")


```

First, we load the data to R:

```{r,message = FALSE,warning=FALSE}
testset=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECTF.test",sep = ",")%>%as_tibble()
trainset=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/spect/SPECTF.train",sep = ",")%>%as_tibble()

testset$V1<-factor(testset$V1,levels = c(1,0), labels = c("Positive", "Negative") )
trainset$V1<-factor(trainset$V1,levels = c(1,0), labels = c("Positive", "Negative") )

names(testset)=c("Class",paste(rep("F",times=44),c(1:44),sep="") )
names(trainset)=c("Class",paste(rep("F",times=44),c(1:44),sep="") )

```

# Data visualising

```{r,message = FALSE,results="hide",fig.show ="hide"}

trainset%>%gather(F1:F44,key="Features",value="Value")%>%ggplot(aes(x=Value,fill=Class))+geom_density(alpha=0.5)+facet_wrap(~Features,scales="free",ncol=4)+scale_fill_manual(values=myfillcolors)

trainset%>%gather(F1:F44,key="Features",value="Value")%>%ggplot(aes(x=Class,y=Value,fill=Class))+geom_boxplot(alpha=0.7)+facet_wrap(~Features,scales="free",ncol=4)+scale_fill_manual(values=myfillcolors)+coord_flip()

```

![](FS1.png)

![](FS2.png)

```{r,message = FALSE,warning=FALSE}
library(corrplot)
library(RColorBrewer)

cor_mat=as.matrix(cor(as.matrix(trainset[,-1])))

cor_mat%>%corrplot(.,order="hclust",type="lower",method="ellipse",tl.col="black", tl.srt=45,tl.cex=0.5,col=rev(brewer.pal(n=11, name="RdBu")))

```

```{r,message = FALSE,warning=FALSE,results="hide",fig.show ="hide"}
cor_mat=as.matrix(cor(as.matrix(trainset[,-1])))

library(igraph)

diag(cor_mat)<-0

library(ggraph)

m=cor_mat

df=data.frame(row=rownames(m)[row(m)[upper.tri(m)]], 
           col=colnames(m)[col(m)[upper.tri(m)]], 
           corr=m[upper.tri(m)])

cdf=subset(df,abs(corr)>0.5)

names(cdf)=c("from","to","corr")

graph<-graph_from_data_frame(cdf)

ggraph(graph, layout = 'linear',circular=T)+geom_edge_arc(aes(colour = corr,width=corr),alpha=0.3)+geom_node_label(aes(label = name))+coord_fixed()+scale_edge_color_gradient(high="#ff003f",low="#0094ff")+scale_edge_width_continuous()+theme_bare()

```

![](FS3.png)

# Feature selection by 12 Filter methods

We introduce a function called "vimplot". This function will perform a filter method of interest then extract the subset of features and plot the result.

Then we apply the function with 12 different filter methods and save the output within the objects from impvar1 to impvar12

```{r,message = FALSE,warning=FALSE}

library(FSelector)
library(mlr)
library(randomForestSRC)
library(rJava)
library(MLmetrics)

vimplot=function(data,target,method){
  train.task=makeClassifTask(data=data, target=target)
  imp_var=generateFilterValuesData(train.task, method=method)
  fsdf=imp_var$data%>%.[,-2]
  names(fsdf)=c("Feature","Value")
  fsdf$Method=method
  fsdf$Feature
  vimp=fsdf%>%arrange(desc(Value))%>% select(Feature)
  plot=fsdf%>%ggplot(aes(x=reorder(fsdf$Feature,fsdf$Value),y=Value,fill=Value,color=Value))+geom_segment(aes(x=reorder(fsdf$Feature,fsdf$Value),xend=Feature,y=0,yend=Value),size=1,show.legend = F)+geom_point(size=2,show.legend = F)+scale_x_discrete("Features")+scale_y_continuous(method)+coord_flip()+scale_fill_gradient(low="blue",high="red")+scale_color_gradient(low="blue",high="red")+ggtitle(method)
  return(list(plot=plot,name=vimp))
  }

#ANOVA TEST

res=vimplot(data=trainset,target="Class",method="anova.test")

impvar1 <-res$name

res$plot

# cforest.importance

res=vimplot(data=trainset,target="Class",method="cforest.importance")

impvar2 <-res$name

res$plot

#chi.squared

res=vimplot(data=trainset,target="Class",method="chi.squared")

impvar3 <-res$name

res$plot

#gain.ratio

res=vimplot(data=trainset,target="Class",method="gain.ratio")

impvar4 <-res$name

res$plot

#information.gain

res=vimplot(data=trainset,target="Class",method="information.gain")

impvar5 <-res$name

res$plot

#OneR

res=vimplot(data=trainset,target="Class",method="oneR")

impvar6 <-res$name

res$plot

#randomForest.importance

res=vimplot(data=trainset,target="Class",method="randomForest.importance")

impvar7 <-res$name

res$plot

#relief

res=vimplot(data=trainset,target="Class",method="relief")

impvar8 <-res$name

res$plot

#symmetrical.uncertainty

res=vimplot(data=trainset,target="Class",method="symmetrical.uncertainty")

impvar9 <-res$name

res$plot

# randomForestSRC.rfsrc

res=vimplot(data=trainset,target="Class",method="randomForestSRC.rfsrc")

impvar10 <-res$name

res$plot

# variance

res=vimplot(data=trainset,target="Class",method="variance")

impvar11<-res$name

res$plot

#univariate.model.score (rpart)

res=vimplot(data=trainset,target="Class",method="univariate.model.score")

impvar12<-res$name

res$plot

```

# Kuncheva stability index

The Kuncheva stability index is supported by OmicsMarkeR pacakge that could be installed from Bioconductor platform:

```{r,message = FALSE,warning=FALSE}

#source("https://bioconductor.org/biocLite.R")
#biocLite("OmicsMarkeR")
```

First, we will load the package, then apply a loop on our 12 output objects of filter methods (V1 to V12). As mentioned above, we would like to reduce the number of features from 44 to only 15 (thus excluding 29 features). Note that the subsets of 15 features might be different, depending on the filtering method in use. 

The Kuncheva stability index aims to measure the consistency level between those 12 feature subsets. The result looks similar to a correlation matrix. 

```{r,message = FALSE,warning=FALSE}
library("OmicsMarkeR")

v1=impvar1$Feature[1:15]
v2=impvar2$Feature[1:15]
v3=impvar3$Feature[1:15]
v4=impvar4$Feature[1:15]
v5=impvar5$Feature[1:15]
v6=impvar6$Feature[1:15]
v7=impvar7$Feature[1:15]
v8=impvar8$Feature[1:15]
v9=impvar9$Feature[1:15]
v10=impvar10$Feature[1:15]
v11=impvar11$Feature[1:15]
v12=impvar12$Feature[1:15]

mat <- matrix(NA,12,12)
varcol <- paste0("v",seq(1,12,by = 1))
varrow <- paste0("v",seq(1,12,by = 1))

for (row in 1:12){
  for(col in 1:12){
    command <- paste0(" mat[row,col] <- kuncheva(",
                      varrow[row],",",
                      varcol[col],",num.features = dim(trainset[,-1])[[2]])")
    
    eval(parse(text = command))
  }
}

```

This is the KSI matrix

```{r,message = FALSE,warning=FALSE}

colnames(mat)=c("ANOV","CFRI","CHISQ","GRT","INFG","ONER","RFI","REL","SYMU","SRC","VAR","UMS")
rownames(mat)=c("ANOV","CFRI","CHISQ","GRT","INFG","ONER","RFI","REL","SYMU","SRC","VAR","UMS")

print(mat)

```

Based on this consistency matrix, we can develop some graphs to visualise the coherence level between 12 feaure filtering methods:

```{r,message = FALSE,warning=FALSE}
mat%>%corrplot(.,order="hclust",type="lower",method="pie",tl.col="black", tl.srt=45,tl.cex=0.5,col=(brewer.pal(n=9, name="Reds")))

diag(mat)<-0

m=mat

df=data.frame(row=rownames(m)[row(m)[upper.tri(m)]], 
              col=colnames(m)[col(m)[upper.tri(m)]], 
              corr=m[upper.tri(m)])

df$Coherent=df$corr>=0.8
names(df)=c("from","to","corr","coherent")

cdf=subset(df,abs(corr)>=0.8)

graph1<-graph_from_data_frame(df)
graph2<-graph_from_data_frame(cdf)

ggraph(graph2, layout = 'linear',circular=T)+geom_edge_arc2(aes(colour = corr,width=corr),alpha=0.3)+geom_node_label(aes(label = name))+coord_fixed()+scale_edge_color_gradient(high="#ff003f",low="#0094ff")+scale_edge_width_continuous()+theme_bare()

ggraph(graph1, layout = 'fr')+geom_edge_density(aes(fill = coherent))+geom_edge_link(alpha=0.6)+geom_node_label(aes(label = name))+theme_bare()+coord_fixed()+scale_edge_fill_manual(values=c("#0094ff","#ff003f"))

```

The network revealed that the coherence level was highest among methods that based on Decision tree or Random Forest algorithms; consistency level was lower between OneR or ANOVA. The methods that shared a high value of KSI could be considered equivalent, as they would result the same features subset. Note that such result could change for each iteration, as the Random forest and permutation algorithms are random.

# EVALUATING THE PERFORMANCE OF 12 LOGISTIC MODELS

The last step consists of fitting 12 logistic models that based on 12 filtered feature subsets, then evaluating the performance of each model. A good feature subset would results accurate model, as evaluated by AUC, Balanced accuracy, F1 score and balanced error rate

```{r,message = FALSE,warning=FALSE}
set.seed(123)

res <- NULL

for (i in 1:12){
  
  train=trainset[,c(eval(parse(text = varrow[i])),"Class")]
  test=testset[,c(eval(parse(text = varrow[i])),"Class")]
  
  #Logistic regression
  train.task=makeClassifTask(data=train, target="Class")
  test.task =makeClassifTask(data=test, target="Class")
  
  # Build Model
  learner_logis=makeLearner("classif.logreg", predict.type="prob")
  model=mlr::train(learner_logis, train.task)
  
  # Prediction and Evaluation
  pred=predict(model, test.task)
  mes=list(mlr::auc,mlr::bac,mlr::ber,mlr::f1)
  perf=performance(pred,measures=mes)
  auc=perf[[1]]
  bac=perf[[2]]
  ber=perf[[3]]
  f1=perf[[4]]
  line=data.frame(FSmethod = i,AUC=auc,BAC=bac,BER=ber,F1=f1)
  res[[i]] <- line
}

```

Here is the verdict: 

```{r,message = FALSE,warning=FALSE}
res <- do.call(rbind,res)
res$FSmethod <- c("ANOV","CFRI","CHISQ","GRT","INFG","ONER","RFI","REL","SYMU","SRC","VAR","UMS")

res

res%>%gather(AUC:F1,key="Metric",value="Value")%>%ggplot()+geom_tile(aes(x=reorder(Metric,Value),y=reorder(FSmethod,Value),fill=Value),color="black")+geom_text(aes(x=reorder(Metric,Value),y=reorder(FSmethod,Value),label=round(Value,4)),color="black")+scale_fill_distiller(palette = "Spectral")+theme_bw()+scale_x_discrete("Metrics")+scale_y_discrete("FS methods")

res%>%gather(AUC:F1,key="Metric",value="Value")%>%ggplot()+geom_point(shape=21,color="black",size=3,aes(x=reorder(FSmethod,Value),y=Value,fill=Metric),color="black")+scale_fill_manual(values=myfillcolors)+theme_bw()+scale_x_discrete("FSmethod")+scale_y_continuous("Values")+facet_wrap(~Metric,scales="free")+coord_flip()

```

Based on those result, Random Forest based importance (RFI) based method were the most powerful methods for feature selection.

```{r,message = FALSE,warning=FALSE,results="hide"}

V(graph1)$AUC=res$AUC

ggraph(graph1, layout = 'linear',circular=T)+geom_edge_fan(alpha=0.1,aes(colour = corr,width=corr))+geom_node_point(shape=21,aes(size=AUC,fill=AUC))+geom_node_text(aes(label = name),nudge_x=,nudge_y=0.1)+theme_bare()+coord_fixed()+scale_edge_color_gradient(low="#0094ff",high="#ff003f")+scale_fill_gradient(high="red",low="gold")


V(graph1)$BAC=res$BAC

ggraph(graph1, layout = 'linear',circular=T)+geom_edge_fan(alpha=0.1,aes(colour = corr,width=corr))+geom_node_point(shape=21,aes(size=BAC,fill=BAC))+geom_node_text(aes(label = name),nudge_x=,nudge_y=0.1)+theme_bare()+coord_fixed()+scale_edge_color_gradient(low="#0094ff",high="#ff003f")+scale_fill_gradient(high="red",low="gold")


```

The network analysis that combines KSI and AUC/Accuracy values showed that the coherence level as measured by KSI is NOT associated to the performance as measured by AUC or BAC. This means the methods that have the same KSI value could have an equivalent performance (as they generate the same feature subset), but the KSI is not a measure of goodness of filter method, as a method with low consistency in regard to other methods could become a better one.

# Conclusion

In thiscase study, we introduced different methods for filtering the features. The consistency level among features subsets could be evaluated by the Kuncheva stability index. However, this index is not a measure of the performance for each method. The Random Forest based criteria, including the variable importance, information gain and Gain ratio are very useful for selecting features.

In a next tutorial, we will discuss about the methods that modify our models instead of input data, i.e Model selection.

Reference: <https://pdfs.semanticscholar.org/9c4f/006d5547cddecef588e66ab6b33e6845b33a.pdf>


# Contribution:

*This joint tutorial was created by Viet Tran and Nam Le Dong as a part of MLM project*

*Viet Tran* designed the data experiment. He is also the author of most functions and loops for Feature selection and Model performance evaluation.

*Nam Le Dong* handled the graphs, network analysis and Rmarkdown writing.

**See you in the next tutorial and thank for joining us !**

*END*
