The MLM project was initiated in 2016 and aims to:
Encourage the use of machine learning techniques in medical research in Vietnam, and
Promote the R statistical programming language, an open-source and leading tool for practicing data science.
In the previous tutorial (X13), we introduced feature selection methods for a classification problem. This time, we will do the same thing for a regression problem, with new tools. Given a dataset with a large matrix of covariates X, we want to build a linear regression model for estimating a response variable Y. This problem raises two questions:
I) Among the many potential features in the original dataset, how many, and which ones, should be included in the model?
II) How important are those features?
These questions can be resolved either before model training (using feature selection methods) or during it. When the linear regression algorithm is selected for fitting the model, some methods can serve both model regularisation and feature selection.
In the present tutorial, we will explore 5 different methods for model selection. These include:
a bootstrapped stepwise GLM (caret),
the Lasso (glmnet),
the elastic net (glmnet),
Bayesian Model Averaging (BMS), and
a wrapper-based feature selection (mlr).
Our experiment uses the “Body Fat” dataset by Roger W. Johnson (Carleton College): https://ww2.amstat.org/publications/jse/v4n1/datasets.johnson.html
The author measured the percentage of body fat (determined by an underwater weighing technique), age, weight, height, and ten body circumference measurements (e.g., abdomen) for 252 men; one erroneous record will be dropped below, leaving 251. The study aimed to predict body fat from the other measurements using multiple regression.
The caret package is used for training a stepwise GLM. The Lasso and elastic net linear models are fitted with the glmnet package. The MCMC-based Bayesian Model Averaging is done with the BMS package. Finally, we use the mlr package to wrap a linear regression learner with integrated random feature selection.
The performance of the 5 resulting regression models will then be compared.
First, we prepare the ggplot themes for our experiment:
library(tidyverse)
theme_bare <- function(base_size=8,base_family="sans"){theme_bw(base_size = base_size, base_family = base_family)+
theme(
axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "bottom",
panel.background = element_rect(fill = NA),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.margin = unit(c(0,0,0,0), "lines")
)
}
my_theme <- function(base_size =5, base_family = "sans"){
theme_bw(base_size = base_size, base_family = base_family) +
theme(
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = NA),
strip.background = element_rect(fill = "#001d60", color = "#00113a", size =0.5),
strip.text = element_text(face = "bold", size = 5, color = "white"),
legend.position = "bottom",
legend.justification = "center",
legend.background = element_blank(),
legend.margin = margin(0.5,0.5,0.5,0.5)
)
}
theme_set(my_theme())
Next, we load the data into R:
# read the raw dataset from the JSE archive
df = read.table("https://ww2.amstat.org/publications/jse/datasets/fat.dat.txt")
# keep the Brozek body fat estimate (column 2) and 13 candidate predictors
df = df[,c(2, 5:7, 10:19)]
names(df) = c("brozek_fat","age","weight","height","neck","chest","abdomen","hip","thigh","knee","ankle","biceps","forearm","wrist")
# drop observation 42, which contains an implausible height value
df = df[-42,]
head(df) %>% knitr::kable()
| brozek_fat | age | weight | height | neck | chest | abdomen | hip | thigh | knee | ankle | biceps | forearm | wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12.6 | 23 | 154.25 | 67.75 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 6.9 | 22 | 173.25 | 72.25 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 24.6 | 22 | 154.00 | 66.25 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 10.9 | 26 | 184.75 | 72.25 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
| 27.8 | 24 | 184.25 | 71.25 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
| 20.6 | 24 | 210.25 | 74.75 | 39.0 | 104.5 | 94.4 | 107.8 | 66.0 | 42.0 | 25.6 | 35.7 | 30.6 | 18.8 |
The dataset now contains one response variable (brozek_fat) and 13 features that will be considered as potential predictors in our model.
Hmisc::describe(df)
## df
##
## 14 Variables 251 Observations
## ---------------------------------------------------------------------------
## brozek_fat
## n missing distinct Info Mean Gmd .05 .10
## 251 0 174 1 18.89 8.807 6.85 8.80
## .25 .50 .75 .90 .95
## 12.80 19.00 24.55 28.80 31.20
##
## lowest : 0.0 1.9 4.1 4.6 4.7, highest: 33.8 34.7 36.5 38.2 45.1
## ---------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 251 0 51 0.999 44.89 14.35 25.0 27.0
## .25 .50 .75 .90 .95
## 35.5 43.0 54.0 63.0 67.0
##
## lowest : 22 23 24 25 26, highest: 69 70 72 74 81
## ---------------------------------------------------------------------------
## weight
## n missing distinct Info Mean Gmd .05 .10
## 251 0 196 1 178.8 31.83 136.4 146.8
## .25 .50 .75 .90 .95
## 158.8 176.2 196.9 217.0 225.8
##
## lowest : 118.50 125.00 125.25 125.75 126.50, highest: 241.75 244.25 247.25 262.75 363.15
## ---------------------------------------------------------------------------
## height
## n missing distinct Info Mean Gmd .05 .10
## 251 0 47 0.999 70.31 2.983 66.00 67.00
## .25 .50 .75 .90 .95
## 68.25 70.00 72.25 73.75 74.50
##
## lowest : 64.00 64.75 65.00 65.50 65.75, highest: 75.25 75.50 76.00 77.50 77.75
## ---------------------------------------------------------------------------
## neck
## n missing distinct Info Mean Gmd .05 .10
## 251 0 90 1 38 2.685 34.25 35.10
## .25 .50 .75 .90 .95
## 36.40 38.00 39.45 40.90 41.85
##
## lowest : 31.1 31.5 32.8 33.2 33.4, highest: 42.5 42.8 43.2 43.9 51.2
## ---------------------------------------------------------------------------
## chest
## n missing distinct Info Mean Gmd .05 .10
## 251 0 173 1 100.8 9.35 89.0 91.1
## .25 .50 .75 .90 .95
## 94.3 99.6 105.3 112.3 116.4
##
## lowest : 79.3 83.4 85.1 86.0 86.7, highest: 119.8 119.9 121.6 128.3 136.2
## ---------------------------------------------------------------------------
## abdomen
## n missing distinct Info Mean Gmd .05 .10
## 251 0 185 1 92.51 11.87 76.85 79.50
## .25 .50 .75 .90 .95
## 84.55 90.90 99.20 105.70 110.80
##
## lowest : 69.4 70.4 72.8 73.7 73.9, highest: 115.9 118.0 122.1 126.2 148.1
## ---------------------------------------------------------------------------
## hip
## n missing distinct Info Mean Gmd .05 .10
## 251 0 151 1 99.84 7.509 89.15 91.80
## .25 .50 .75 .90 .95
## 95.50 99.30 103.35 108.60 111.85
##
## lowest : 85.0 85.3 87.2 87.5 87.6, highest: 114.3 114.4 116.1 125.6 147.7
## ---------------------------------------------------------------------------
## thigh
## n missing distinct Info Mean Gmd .05 .10
## 251 0 138 1 59.36 5.702 51.15 53.00
## .25 .50 .75 .90 .95
## 56.00 59.00 62.30 65.80 68.45
##
## lowest : 47.2 49.3 49.6 50.0 50.1, highest: 71.2 72.5 72.9 74.4 87.3
## ---------------------------------------------------------------------------
## knee
## n missing distinct Info Mean Gmd .05 .10
## 251 0 90 1 38.57 2.668 34.80 35.60
## .25 .50 .75 .90 .95
## 36.95 38.50 39.90 41.70 42.65
##
## lowest : 33.0 33.4 33.5 33.7 34.2, highest: 44.0 44.2 45.0 46.0 49.1
## ---------------------------------------------------------------------------
## ankle
## n missing distinct Info Mean Gmd .05 .10
## 251 0 61 0.999 23.1 1.701 21.00 21.50
## .25 .50 .75 .90 .95
## 22.00 22.80 24.00 24.80 25.45
##
## lowest : 19.1 19.7 20.1 20.2 20.4, highest: 26.6 27.0 29.6 33.7 33.9
## ---------------------------------------------------------------------------
## biceps
## n missing distinct Info Mean Gmd .05 .10
## 251 0 104 1 32.27 3.406 27.60 28.70
## .25 .50 .75 .90 .95
## 30.20 32.00 34.35 36.20 37.20
##
## lowest : 24.8 25.3 25.6 25.8 26.0, highest: 38.2 38.4 38.5 39.1 45.0
## ---------------------------------------------------------------------------
## forearm
## n missing distinct Info Mean Gmd .05 .10
## 251 0 77 1 28.66 2.256 25.70 26.20
## .25 .50 .75 .90 .95
## 27.30 28.70 30.00 31.10 31.75
##
## lowest : 21.0 22.0 23.1 24.6 24.8, highest: 32.8 33.1 33.7 33.8 34.9
## ---------------------------------------------------------------------------
## wrist
## n missing distinct Info Mean Gmd .05 .10
## 251 0 44 0.998 18.23 1.047 16.8 17.0
## .25 .50 .75 .90 .95
## 17.6 18.3 18.8 19.4 19.8
##
## lowest : 15.8 16.1 16.3 16.5 16.6, highest: 20.1 20.2 20.4 20.9 21.4
## ---------------------------------------------------------------------------
Before modelling, we would like to see how the data are distributed among the 251 men, as well as the correlation pattern among the variables:
# lower panels: scatter plots coloured by body fat, with a linear fit
plotfuncLow <- function(data, mapping){
  ggplot(data = data, mapping = mapping) +
    geom_point(aes(fill = df$brozek_fat, color = df$brozek_fat),
               shape = 21, alpha = 0.3, show.legend = FALSE) +
    geom_smooth(method = "lm", se = FALSE, color = "red4") +
    scale_fill_gradient2(low = "#ccfc2d", mid = "#fccc2d", high = "#fc2d2d") +
    scale_color_gradient2(low = "#ccfc2d", mid = "#fccc2d", high = "#fc2d2d")
}
# diagonal panels: histograms
plotfuncmid <- function(data, mapping){
  ggplot(data = data, mapping = mapping) +
    geom_histogram(alpha = 0.5, fill = "#fc881b", color = "#fc411b")
}
library(GGally)
ggpairs(df,columns=1:14,lower=list(continuous=plotfuncLow),upper=NULL,diag=list(continuous=plotfuncmid))
library(corrplot)
library(RColorBrewer)
cor_mat = cor(df, method = "pearson")
cor_mat %>% corrplot(order = "hclust", type = "lower", method = "color",
                     tl.col = "black", tl.srt = 45, tl.cex = 0.5,
                     col = rev(brewer.pal(n = 10, name = "Spectral")))
A simple correlation network analysis can help us identify the six variables that correlate most strongly with body fat (absolute Pearson's r greater than 0.5): abdomen, chest, weight, hip, thigh and knee.
# standardise the variables, then build an edge list from the correlation matrix
dfscale <- df %>% as.matrix() %>% scale() %>% as_tibble()
m <- cor(as.matrix(dfscale))
library(igraph)
library(ggraph)
diag(m) <- 0
cdf <- data.frame(from = rownames(m)[row(m)[upper.tri(m)]],
                  to   = colnames(m)[col(m)[upper.tri(m)]],
                  corr = m[upper.tri(m)])
# keep only edges linking brozek_fat to features with |r| > 0.5
cdf <- subset(cdf, cdf$from == "brozek_fat" & abs(cdf$corr) > 0.5)
graph <- graph_from_data_frame(cdf)
ggraph(graph, layout = 'star', circular = FALSE) +
  geom_edge_fan(aes(colour = corr, width = corr), alpha = 0.4) +
  geom_node_label(aes(label = name)) +
  coord_fixed() +
  scale_edge_color_gradient(high = "#ff003f", low = "#0094ff") +
  scale_edge_width_continuous() +
  theme_bare(10)
As a preliminary exploration, the mlr package offers a number of filter methods for ranking features in a regression problem (see http://mlr-org.github.io/mlr-tutorial/devel/html/filter_methods/index.html). We illustrate six of them below:
library(mlr)
# helper: rank features with a given mlr filter method and plot the scores
vimplot = function(data, target, method){
  train.task = makeRegrTask(data = data, target = target)
  imp_var = generateFilterValuesData(train.task, method = method)
  fsdf = imp_var$data %>% .[,-2]
  names(fsdf) = c("Feature", "Value")
  fsdf %>% ggplot(aes(x = reorder(Feature, Value), y = Value, fill = Value, color = Value)) +
    geom_segment(aes(xend = Feature, y = 0, yend = Value), size = 1, show.legend = FALSE) +
    geom_point(size = 2, show.legend = FALSE) +
    scale_x_discrete("Features") + scale_y_continuous(method) + coord_flip() +
    scale_fill_gradient(low = "blue", high = "red") +
    scale_color_gradient(low = "blue", high = "red") + ggtitle(method)
}
# mrmr
p1 = vimplot(data = df, target = "brozek_fat", method = "mrmr")
# linear correlation
p2 = vimplot(data = df, target = "brozek_fat", method = "linear.correlation")
# carscore
p3 = vimplot(data = df, target = "brozek_fat", method = "carscore")
# variance
p4 = vimplot(data = df, target = "brozek_fat", method = "variance")
# information gain
p5 = vimplot(data = df, target = "brozek_fat", method = "information.gain")
# conditional random forest importance
p6 = vimplot(data = df, target = "brozek_fat", method = "cforest.importance")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3)
First, we consider the stepwise GLM, but this time we extend it a little with bootstrapping. This can be done using the caret package:
library(caret)
crt=trainControl(method="boot",number=100,verboseIter = F)
set.seed(123)
glmcar= caret::train(data=df, brozek_fat~ .,
method = "glmStepAIC",
trControl = crt,
verbose = FALSE)
Although the model training takes considerable time, the result is not very good, as we can see below: the resampled R-squared of the surviving model is only about 0.70, and 8 features were retained.
glmcar
## Generalized Linear Model with Stepwise Feature Selection
##
## 251 samples
## 13 predictor
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 251, 251, 251, 251, 251, 251, ...
## Resampling results:
##
## RMSE Rsquared
## 4.296321 0.695796
summary(glmcar)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10.0169 -2.7372 -0.1662 2.7053 9.5436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.55135 10.88287 -1.797 0.07366 .
## age 0.05679 0.02874 1.976 0.04927 *
## weight -0.08235 0.03708 -2.221 0.02729 *
## neck -0.41978 0.20893 -2.009 0.04563 *
## abdomen 0.88059 0.06686 13.171 < 2e-16 ***
## hip -0.20147 0.13016 -1.548 0.12297
## thigh 0.27889 0.12011 2.322 0.02106 *
## forearm 0.47843 0.17280 2.769 0.00606 **
## wrist -1.37393 0.47426 -2.897 0.00411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 15.75356)
##
## Null deviance: 14915.5 on 250 degrees of freedom
## Residual deviance: 3812.4 on 242 degrees of freedom
## AIC: 1415.2
##
## Number of Fisher Scoring iterations: 2
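If we only need the names of the retained features, caret provides a predictors() accessor that should list them directly from the train object:
# list the features retained by the stepwise selection
predictors(glmcar)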
Lasso penalizes the absolute values of the coefficients, shrinking some of them exactly to zero and thereby removing the less important features from the model. This method has proven useful for feature selection in problems with a large number of covariates. Lasso is also preferable to Ridge regularisation when our goal is to optimise the interpretability of the model. When there are many correlated features, Lasso tends to let only one of them survive and eliminates the others from the final model by shrinking their coefficients to zero.
Elastic net regularisation is a flexible compromise between Ridge and Lasso: it combines both L1 and L2 penalties under a mixing parameter called alpha (alpha = 0 gives pure Ridge regularisation, while alpha = 1 corresponds to the Lasso penalty). This method provides the strengths of both types of regularisation: the Lasso part optimises feature selection and interpretability, while the Ridge part allows a grouping effect (correlated variables are grouped together and, depending on the penalty strength, contribute to the prediction all at once or are dropped from the model together).
Both Lasso and elastic net regularisation are supported by the glmnet package:
# Lasso model (alpha = 1 is the pure L1 penalty)
library(glmnet)
set.seed(123)  # fix the CV folds for reproducibility
lasso = cv.glmnet(x = as.matrix(df[,-1]), y = df$brozek_fat, alpha = 1, nfolds = 5)
plot(lasso)
coef(lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -4.40419369
## age 0.03491352
## weight .
## height -0.25287929
## neck .
## chest .
## abdomen 0.60370088
## hip .
## thigh .
## knee .
## ankle .
## biceps .
## forearm .
## wrist -0.89634390
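Note that coef() on a cv.glmnet object uses the 1-standard-error rule (s = "lambda.1se") by default, which favours a sparser model. The small check below inspects the solution at lambda.min, the value minimising the cross-validated error, which generally retains more features:
# coefficients at the CV-error-minimising lambda (less sparse in general)
coef(lasso, s = "lambda.min")
# the two candidate lambda values:
c(lambda.min = lasso$lambda.min, lambda.1se = lasso$lambda.1se)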
# Elastic net: note that cv.glmnet defaults to alpha = 1 (the Lasso),
# so a mixing value 0 < alpha < 1 must be given explicitly
set.seed(123)
enet = cv.glmnet(x = as.matrix(df[,-1]), y = df$brozek_fat, alpha = 0.5, nfolds = 5)
plot(enet)
coef(enet)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -6.53342851
## age 0.02997906
## weight .
## height -0.24850948
## neck .
## chest .
## abdomen 0.58853559
## hip .
## thigh .
## knee .
## ankle .
## biceps .
## forearm .
## wrist -0.70732370
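cv.glmnet cross-validates lambda only; the mixing parameter alpha has to be chosen by the user. Below is a minimal sketch of tuning alpha on a grid, with the fold assignment fixed so that every alpha value is compared on the same folds (the grid and the seed are arbitrary choices, not part of the original analysis):
# grid search over alpha; foldid keeps the CV folds identical across fits
set.seed(123)
foldid <- sample(rep(1:5, length.out = nrow(df)))
alphas <- seq(0, 1, by = 0.25)
cv_err <- sapply(alphas, function(a) {
  fit <- cv.glmnet(x = as.matrix(df[,-1]), y = df$brozek_fat,
                   alpha = a, foldid = foldid)
  min(fit$cvm)  # best cross-validated MSE achieved with this alpha
})
alphas[which.min(cv_err)]  # mixing parameter with the lowest CV error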
Note that each method produced a different model structure.
We might believe that abdomen is important, as this variable was retained by both the Lasso and the elastic net models (together with age, height and wrist).
Given a response variable y and a matrix of all potential predictors (features) X, y can be estimated from X by a linear model:
\[ y= \beta_{0} + \beta_{\gamma} X_{\gamma}+\varepsilon \]
Such a model is defined by a constant \(\beta_0\), a subset of coefficients \(\beta_\gamma\) corresponding to a feature subset \(X_\gamma\), and a residual error \(\varepsilon\) assumed to follow a normal distribution with zero mean.
The goal of Bayesian Model Averaging (BMA) is to construct a weighted average over all possible combinations of features. The model weights are determined from posterior model probabilities, which can be estimated by the classic Bayes theorem:
\[p(M_{\gamma }|y,X) = \frac{p(y|M_{\gamma},X)p(M_{\gamma})}{p(y|X)} = \frac{p(y|M_{\gamma},X)p(M_{\gamma})}{\sum_{s=1}^{2^K} p(y|M_s,X)p(M_s)}\]
where K is the number of potential features;
p(y|X) denotes the integrated likelihood, which is simply a constant term over all models.
The left-hand side p(M|y,X) is the posterior model probability (PMP): the probability of a specific model, given the matrix X and the response variable y, relative to all possible models.
The PMP is proportional to the marginal likelihood of the model, p(y|M,X) (the probability of the data given the model M), times a prior model probability p(M).
The model prior reflects the researcher's belief about the model before looking at the data (for example, if the researcher believes strongly that a biomarker y is always associated with the patient's gender, age and weight, a prior can be formed that gives high probability to the model containing gender, age and weight over all possible models built from the other covariates).
When there is no prior knowledge about the problem, a popular choice is a uniform prior probability over all models.
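To make the formula concrete, here is a toy sketch (independent of the BMS workflow used later) that enumerates all 2^K models over K = 3 hand-picked features, approximates each marginal likelihood p(y|M,X) by exp(-BIC/2) (a common rough approximation, not an exact marginal likelihood), and normalises under a uniform model prior:
# toy enumeration of all 2^3 models over three hand-picked features;
# exp(-BIC/2) serves as a rough approximation of p(y|M,X)
feats <- c("abdomen", "weight", "wrist")
subsets <- expand.grid(rep(list(c(FALSE, TRUE)), length(feats)))
names(subsets) <- feats
logml <- apply(subsets, 1, function(s) {
  rhs <- if (any(s)) paste(feats[s], collapse = " + ") else "1"
  -BIC(lm(as.formula(paste("brozek_fat ~", rhs)), data = df)) / 2
})
pmp <- exp(logml - max(logml))  # the uniform prior cancels out
pmp <- pmp / sum(pmp)           # normalised posterior model probabilities
cbind(subsets, PMP = round(pmp, 4))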
The model-weighted posterior distribution of any coefficient \(\beta\) can then be estimated by:
\[p(\beta |y,X)=\sum_{\gamma =1}^{2^K}p(\beta |M_\gamma ,y,X)p(M_\gamma|X,y)\]
Although BMA can be performed in a straightforward way, by enumerating all possible feature combinations until the exact posterior is obtained, MCMC sampling is a more powerful approach. An MCMC sampler should be considered for large datasets with more than about 20 variables (with K features there are 2^K candidate models; here K = 13 already gives 8,192 of them). The sampler aims to cover the most important part of the posterior model distribution, and thus only approximates it, but as closely as possible.
In our experiment, we will apply an MCMC sampler based on the Metropolis-Hastings algorithm. The sampler walks through the model space as follows:
At step i, the sampler stands at a given model Mi with posterior p(Mi|y,X); at step i+1, a new candidate model Mj is proposed. The sampler switches from Mi to Mj with probability p_{i,j}:
\[p_{i,j}=\min\left(1,\ \frac{p(M_j|y,X)}{p(M_i|y,X)}\right)\]
If the candidate Mj is rejected, the sampler moves to the next step, where a new candidate Mk is proposed against Mi. If Mj is accepted, it replaces Mi as the current model and has to survive against further candidate models in the subsequent steps. The number of times each model survives converges to the distribution of posterior model probabilities (PMP) p(Mi|y,X).
Here, we apply the standard MCMC sampler called “birth-death”. At each step, the sampler randomly picks one feature F among the K potential features. If F is already included in the current model Mi, the candidate model Mj includes all the features of Mi except F, which is dropped; if Mi does not contain F, the candidate model includes all the features of Mi plus F. A simplified sketch of this sampler follows.
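The sketch below is for illustration only (it is not the BMS implementation) and again uses exp(-BIC/2) as a stand-in for the marginal likelihood:
# simplified birth-death Metropolis-Hastings sampler over feature subsets
log_marg <- function(incl) {
  rhs <- if (any(incl)) paste(names(df)[-1][incl], collapse = " + ") else "1"
  -BIC(lm(as.formula(paste("brozek_fat ~", rhs)), data = df)) / 2
}
set.seed(1)
K <- ncol(df) - 1
cur <- rep(FALSE, K)          # start from the null model
cur_lm <- log_marg(cur)
sizes <- integer(2000)
for (i in 1:2000) {
  cand <- cur
  j <- sample(K, 1)           # pick one feature at random...
  cand[j] <- !cand[j]         # ...and toggle it (birth or death move)
  cand_lm <- log_marg(cand)
  # accept with probability min(1, p(Mj|y,X) / p(Mi|y,X))
  if (log(runif(1)) < cand_lm - cur_lm) { cur <- cand; cur_lm <- cand_lm }
  sizes[i] <- sum(cur)        # record the size of the current model
}
table(sizes[-(1:500)]) / 1500 # visit frequencies after a 500-step burn-in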
The accuracy of the PMP approximation depends on the number of iterations the MCMC sampler runs through. The first part of the chain must also be discarded from the approximation (the burn-in or warm-up phase). In the bms() function used below, the argument burn specifies the number of burn-in draws and the argument iter the number of subsequent iterations to be retained.
Model-specific g prior
In BMA, the g prior measures the level of certainty about zero coefficients. A small g implies a strong belief (a conservative prior) that the coefficients are zero; a large g indicates that the researcher is highly uncertain whether the coefficients are zero.
The posterior distribution of the coefficients reflects this prior uncertainty: for a given g, it follows a t-distribution with expected value
\[E(\beta|y,X,g,M_\gamma) = \frac{g}{1+g}\,\hat{\beta}^{OLS}\]
The smaller g, the more important the prior, and the more the expected value of the coefficients is shrunk toward the prior mean of zero.
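A quick numerical look at this shrinkage factor g/(1+g):
# the posterior mean is the OLS estimate times g/(1+g):
# g = 1 shrinks the estimate halfway to zero; g = 1000 leaves it almost untouched
g <- c(0.1, 1, 10, 100, 1000)
round(g / (1 + g), 3)  # 0.091 0.500 0.909 0.990 0.999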
In MCMC sampling, the g prior need not be fixed; it can be specified from the information in the data (y, X). The Empirical Bayes local (EBL) approach sets \(g_i = \max(0, F_i^{OLS} - 1)\), where \(F_i^{OLS}\) is the standard OLS F-statistic for model \(M_i\).
Now it’s time for coding; We will use the BMS package: The following command consists of a MCMC with 1000 warmup steps and 10,000 steps to approximate the PMP of models. The sampler implied an EBL model specific g prior, the model prior is uniform. The MCMC algorithm is “Brith and death”
library(BMS)
bmsmod= bms(df, burn = 1000, iter = 10000, g = "EBL",mprior = "uniform",mcmc="bdn", user.int = F)
We first extract the coefficients from the averaged model output.
The first column of this matrix lists all 13 variable names. The Post Mean column gives each coefficient averaged over all models, including those in which the variable does not appear (its coefficient being zero there). The posterior standard deviation (Post SD) indicates the certainty of the positive or negative effect. The sign of a coefficient can also be inferred from Cond.Pos.Sign, the posterior probability of a positive coefficient conditional on inclusion (“sign certainty”).
The importance of each variable in predicting the outcome is measured by the PIP, the posterior inclusion probability, i.e. the sum of the PMPs of all models in which that feature is included. The sign of the Post Mean also indicates whether a feature's effect is positive or negative.
For example, abdomen is the most important variable: it was included in all models and has a certainly positive effect on body fat, with a standardised coefficient of 1.26.
The less important features have smaller unconditional coefficients, since their coefficients were null in most models.
The final column (Idx) gives the original position of each feature in the dataset:
coef(bmsmod, std.coefs = T, order.by.pip = T, include.constant = T)
## PIP Post Mean Post SD Cond.Pos.Sign Idx
## abdomen 1.0000 1.2579739893 0.09422997 1.0000000 6
## weight 0.9317 -0.4410399100 0.17323011 0.0000000 2
## wrist 0.8131 -0.1270660003 0.07994976 0.0000000 13
## forearm 0.6211 0.0700082566 0.06598249 1.0000000 12
## neck 0.2790 -0.0344094725 0.06710771 0.0000000 4
## biceps 0.2291 0.0227760452 0.05131205 1.0000000 11
## thigh 0.1885 0.0210560768 0.05624798 0.9994695 8
## hip 0.1834 -0.0303147090 0.08711173 0.0000000 7
## age 0.1579 0.0105590771 0.03152008 0.9898670 1
## height 0.1379 -0.0047555336 0.02478991 0.1254532 3
## ankle 0.1139 0.0038949770 0.01909055 0.9894644 10
## knee 0.0979 0.0038758457 0.02460337 0.9223698 9
## chest 0.0857 -0.0005275042 0.02950314 0.4527421 5
## (Intercept) 1.0000 -3.7175785209 NA NA 0
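As a sanity check, a feature's PIP can be reproduced approximately by summing the exact PMPs of the stored models that include it (only the best models are kept in the object, hence the approximation):
# PIP of abdomen is roughly the sum of exact PMPs over stored models including it
tm <- topmodels.bma(bmsmod)
sum(tm["PMP (Exact)", tm["abdomen", ] == 1])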
summary(bmsmod)
## Mean no. regressors Draws Burnins
## "4.8392" "10000" "1000"
## Time No. models visited Modelspace 2^K
## "1.320708 secs" "3325" "8192"
## % visited % Topmodels Corr PMP
## "41" "98" "0.9899"
## No. Obs. Model Prior g-Prior
## "251" "uniform / 6.5" "EBL"
## Shrinkage-Stats
## "Av=0.9928"
Corr PMP indicates the correlation between the MCMC visit counts and the analytical PMPs of the best models. At 0.99, this correlation indicates a good degree of convergence.
To verify how far the posterior model size distribution departs from our prior, the analytical and MCMC PMPs can be compared on a graph:
plotModelsize(bmsmod)
The posterior model size distribution is clearly different from our prior. The graph indicates that models including 4 features have the highest probability.
The importance and joint inclusion patterns of the features can be visualized in the following image plot.
Here red indicates a positive coefficient, blue a negative coefficient, and white corresponds to non-inclusion (a zero beta coefficient).
The horizontal axis shows the best models, ranked by PMP from left to right.
image(bmsmod,col=c("#199ac1","#e30c37"))
Based on this graph, the best model includes 4 features: abdomen, weight, wrist and forearm (note that this differs from the model structure suggested by the Lasso regularisation).
We can also ask for the 3 best models.
The output is a binary representation of each model's structure (1 = included, 0 = excluded). The result also provides the PMP values estimated both by exact enumeration and by the MCMC sampler:
topmodels.bma(bmsmod)[,1:3]
## 0883 0881 0885
## age 0.0000000 0.00000000 0.00000000
## weight 1.0000000 1.00000000 1.00000000
## height 0.0000000 0.00000000 0.00000000
## neck 0.0000000 0.00000000 0.00000000
## chest 0.0000000 0.00000000 0.00000000
## abdomen 1.0000000 1.00000000 1.00000000
## hip 0.0000000 0.00000000 0.00000000
## thigh 0.0000000 0.00000000 0.00000000
## knee 0.0000000 0.00000000 0.00000000
## ankle 0.0000000 0.00000000 0.00000000
## biceps 0.0000000 0.00000000 1.00000000
## forearm 1.0000000 0.00000000 0.00000000
## wrist 1.0000000 1.00000000 1.00000000
## PMP (Exact) 0.1364928 0.05813918 0.03899962
## PMP (MCMC) 0.1447000 0.05170000 0.03200000
Wrapper methods use the performance of a learning algorithm to assess the usefulness of a feature set. To select a feature subset, a learner is trained repeatedly on different feature subsets, and the subset leading to the best learner performance is chosen.
The search strategy is defined by functions following the naming convention makeFeatSelControl<search_strategy>. The available strategies include exhaustive search (makeFeatSelControlExhaustive), random search (makeFeatSelControlRandom), deterministic forward or backward search (makeFeatSelControlSequential) and a genetic algorithm (makeFeatSelControlGA).
Exhaustive search is a brute-force method: it evaluates every possible combination of features. Such a method is only feasible for simple problems (roughly K < 5); for larger datasets it is not recommended, as the computational cost is too high.
In our experiment, we will apply a random search. As the BMS model already suggested a high PMP for models with 4 variables, we set the maximal number of features to 4 in the random search.
A wrapper consists of 3 elements:
a resampling protocol, which could be cross-validation, k-fold CV, bootstrap, etc.;
a feature selection control mode, which is random search in our example;
a base learner, which here is linear regression (a GLM-based learner).
We must also create a regression task for training:
task = makeRegrTask(data = df, target = "brozek_fat")
# resampling protocol: 100-fold cross-validation
rdesc = makeResampleDesc("CV", iters = 100L)
# random search over subsets of at most 4 features, 100 iterations
ctrlRdm = makeFeatSelControlRandom(maxit = 100L, max.features = 4)
wrap = makeFeatSelWrapper(learner = "regr.glm", resampling = rdesc, control = ctrlRdm)
modwrap = train(wrap, task)
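The feature subset chosen by the random search (and its resampled performance) can also be retrieved directly from the wrapped model via mlr's getFeatSelResult():
# inspect the feature subset selected by the random search
getFeatSelResult(modwrap)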
Here is the result of model training:
modwrap$learner.model$next.model$learner.model%>%summary()
##
## Call:
## stats::glm(formula = f, family = family, data = d, control = ctrl,
## model = FALSE)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -11.3968 -2.8413 -0.3158 2.9395 9.4035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.27167 9.88286 -1.849 0.06568 .
## weight -0.08398 0.03269 -2.569 0.01078 *
## abdomen 0.91221 0.05353 17.041 < 2e-16 ***
## hip -0.10900 0.11525 -0.946 0.34522
## wrist -1.16974 0.41694 -2.806 0.00542 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 16.57369)
##
## Null deviance: 14915.5 on 250 degrees of freedom
## Residual deviance: 4077.1 on 246 degrees of freedom
## AIC: 1424
##
## Number of Fisher Scoring iterations: 2
The model indeed contains 4 variables, but its structure is quite different from those suggested by the other methods.
So far we have applied 5 different methods for feature/model selection. They are clearly not equivalent: each one suggested a different feature subset. Before closing our experiment, we will perform a small comparative study to evaluate the performance of the 5 models.
The model validation is based on 4 metrics: R-squared (rsq), root mean squared error (rmse), mean absolute error (mae) and sum of absolute errors (sae).
# collect the in-sample predictions of all 5 models
dummylrner = makeLearner("regr.glm")
dummymod = train(dummylrner, task)
dummypred = predict(dummymod, task)  # template mlr Prediction object
BMSpred = predict(bmsmod, df) %>% as.vector()
LASSOpred = predict(lasso, as.matrix(df[,-1])) %>% as.vector()
ENETpred = predict(enet, as.matrix(df[,-1])) %>% as.vector()
WRAPpred = predict(modwrap$learner.model$next.model$learner.model) %>% as.vector()
STEPpred = predict(glmcar, df) %>% as.vector()
pred1=dummypred
pred1$data$response=BMSpred
pred2=dummypred
pred2$data$response=LASSOpred
pred3=dummypred
pred3$data$response=ENETpred
pred4=dummypred
pred4$data$response=WRAPpred
pred5=dummypred
pred5$data$response=STEPpred
measure=list(rsq,rmse,mae,sae)
p1=performance(pred1,measure)
p2=performance(pred2,measure)
p3=performance(pred3,measure)
p4=performance(pred4,measure)
p5=performance(pred5,measure)
pdf=rbind(p1,p2,p3,p4,p5)%>%as_tibble()%>%mutate(.,Method=c("BMS","LASSO","ENET","WRAPPER","STEPWISE"))
pdfl=pdf%>%gather(rsq:sae,key="Metric",value="Value")
library(viridis)  # provides scale_fill_viridis()
ggplot(pdfl) +
  geom_tile(aes(y = Method, x = Metric, fill = Value), color = "black") +
  scale_fill_viridis(option = "A", begin = 0.7, end = 0.95) +
  geom_text(aes(y = Method, x = Metric, label = round(Value, 3)), color = "black") +
  ggtitle("Comparison") + theme_bw(15)
pdf2=cbind(BMSpred,LASSOpred,ENETpred,WRAPpred,STEPpred)%>%as_tibble()%>%mutate(.,Truth=df$brozek_fat)
pdf2%>%gather(BMSpred:Truth,key="Models",value="Value")%>%ggplot(aes(x=Value,fill=Models))+geom_density(alpha=0.3)+theme_bw()
In this case study, we explored 5 different methods for model (and feature) selection applied to a regression problem. It is clear that these 5 methods are not equivalent, either in the feature subsets they select or in model performance.
Overall, BMA, Lasso and elastic net are more effective than the stepwise and wrapper methods. BMA is the most interesting of all: it provides much useful information for both feature selection and model interpretation, including the posterior model probabilities and the certainty of the coefficient values. BMA also performed better than the elastic net and Lasso models.
In our example, the stepwise selection nevertheless achieved surprisingly good in-sample fit, with the lowest error and the largest R2 value, which is partly expected since it retains the most (8) features. However, the cost of such performance can be high: brute-force approaches such as exhaustive feature tuning or forward and backward stepwise selection are not recommended for large datasets.
Although the wrapper method is generic (it can be applied to any algorithm, not only the GLM learner), it is less reliable, due to poor reproducibility and a high computational cost. Even though the wrapper is driven by model performance, its final outcome was worse than that of the Bayesian Model Averaging method.
The elastic net performed better than the Lasso. This regularisation is also integrated into the GLM algorithm in h2o.
In conclusion, if your study is interpretive, BMA is the best choice. For predictive purposes, Lasso and elastic net are very good solutions. The wrapper method is not recommended unless you need to apply unusual algorithms.
Thanks for joining us, and see you in the next tutorial!