Introduction

Feature selection is an important step in modeling and benefits from in-depth knowledge of the problem domain. Choosing the right features can help a model perform better. For example, removing highly correlated attributes reduces redundancy and may improve prediction.

In this post, I will explore finding highly correlated variables, Recursive Feature Elimination, variable importance, stepwise selection, and Boruta feature selection.

First, I will plot histograms of the variables in the Pima Indians Diabetes dataset. Several variables exhibit skewed distributions and should be considered for transformation.

library("mlbench")
library("caret")
# load the data
data(PimaIndiansDiabetes)
# plot_histogram() is the current name of HistogramContinuous() in DataExplorer
DataExplorer::plot_histogram(PimaIndiansDiabetes)
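
As a rough illustration of how a transformation might be assessed, the sketch below measures skewness before and after a log transform. It uses simulated right-skewed values rather than the Pima data, and `skewness` is a hand-written helper; both are assumptions made for illustration only.

```r
# Moment-based sample skewness (hypothetical helper, not from a package)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(143)
# Simulated right-skewed, non-negative values standing in for a skewed predictor
x <- rlnorm(500, meanlog = 4, sdlog = 1)

# log1p(x) = log(1 + x) stays defined when x contains zeros
x_log <- log1p(x)

# Skewness close to 0 indicates a roughly symmetric distribution
c(raw = skewness(x), log1p = skewness(x_log))
```

A transformed variable whose skewness moves toward zero is usually a better candidate for models that assume roughly symmetric predictors.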

Correlations

Below, the response variable is converted from a factor to numeric so that its correlation with the other variables can be computed. I also created a deliberately correlated variable, HCorrelated, by multiplying glucose and mass. As expected, glucose and mass are both highly correlated with HCorrelated.

library("corrplot")
# work on a copy so the original data stay untouched
indians = PimaIndiansDiabetes
# recode the response as numeric: 1 = "pos", 0 = "neg"
indians$diabetes = ifelse(indians$diabetes == "pos", 1, 0)
# artificial variable built from glucose and mass
indians$HCorrelated = indians$glucose * indians$mass

cor_mx = cor(indians, use = "pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color",
         type = "upper", order = "original", number.cex = .7,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         # diag = TRUE shows the principal diagonal; set to FALSE to hide it
         diag = TRUE)

rm(indians,cor_mx)
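
The plot shows the correlations visually, but the highly correlated pairs can also be listed programmatically. The sketch below is one way to do that in base R. It uses simulated data that mimics how HCorrelated was built, and the 0.5 cutoff is an arbitrary assumption, not a recommended value. caret also offers a findCorrelation function that could be used for a similar purpose.

```r
set.seed(143)
n <- 300
# Simulated stand-ins for glucose and mass (not the real Pima values)
glucose <- rnorm(n, mean = 120, sd = 30)
mass    <- rnorm(n, mean = 32, sd = 7)
df <- data.frame(
  glucose     = glucose,
  mass        = mass,
  HCorrelated = glucose * mass,  # built the same way as in the post
  unrelated   = rnorm(n)
)

cor_mx <- cor(df)
cutoff <- 0.5  # assumed threshold for "highly correlated"

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(cor_mx) > cutoff & upper.tri(cor_mx), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mx)[idx[, "row"]],
  var2 = colnames(cor_mx)[idx[, "col"]],
  r    = cor_mx[idx]
)
high_pairs
```

Each row of `high_pairs` names two variables whose absolute correlation exceeds the cutoff; one variable from each pair is a candidate for removal.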

Recursive Feature Elimination

Recursive Feature Elimination (RFE) builds models on successively smaller subsets of the predictors to identify features that may not be required. The caret package provides an rfe function that facilitates this process.

Below, the rfe function is fitted to the Pima Indians Diabetes data. The control object uses the random forest selection functions (rfFuncs) with 10-fold cross-validation. The results show that the full set of eight variables achieves the highest cross-validated accuracy, 0.7721 (about 77.2%).

# define the control using a random forest selection function
control = rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
set.seed(143)
results = rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)
# summarize the results
print(results)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          1   0.6992 0.2758    0.03055 0.07310         
##          2   0.7382 0.3967    0.04873 0.12182         
##          3   0.7331 0.3929    0.03586 0.09671         
##          4   0.7422 0.4187    0.03456 0.09678         
##          5   0.7604 0.4588    0.02829 0.06920         
##          6   0.7395 0.4096    0.04892 0.11552         
##          7   0.7578 0.4494    0.05136 0.12454         
##          8   0.7721 0.4795    0.04204 0.10473        *
## 
## The top 5 variables (out of 8):
##    glucose, mass, age, pregnant, pedigree
# list the chosen features
predictors(results)
## [1] "glucose"  "mass"     "age"      "pregnant" "pedigree" "insulin" 
## [7] "triceps"  "pressure"
# plot the results
plot(results, type=c("g", "o"))

rm(control,results,PimaIndiansDiabetes)
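
The accuracy differences between subset sizes in the table above are small compared with their standard deviations, so a smaller subset may be an acceptable trade-off. The sketch below copies the accuracies from the printed output and picks the smallest subset whose accuracy is within a tolerance of the best. The 2% tolerance is an arbitrary assumption chosen for illustration.

```r
# Accuracies copied from the rfe output printed above
rfe_results <- data.frame(
  Variables = 1:8,
  Accuracy  = c(0.6992, 0.7382, 0.7331, 0.7422, 0.7604, 0.7395, 0.7578, 0.7721)
)

tol_pct <- 2  # assumed tolerance: accept up to a 2% relative loss in accuracy

best     <- max(rfe_results$Accuracy)
loss_pct <- (best - rfe_results$Accuracy) / best * 100

# Smallest subset size whose accuracy is within the tolerance of the best
smallest_ok <- min(rfe_results$Variables[loss_pct <= tol_pct])
smallest_ok
```

With these numbers the rule selects five variables, which corresponds to the top five listed in the output (glucose, mass, age, pregnant, pedigree).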

Variable Importance

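Another method relies on fitting a model and extracting its variable importance. The example below fits a single decision tree (rpart) with caret. Variable importance can vary by model, so a different algorithm may rank the variables differently.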

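# Assumption: Data is GlaucomaM from the TH.data package (loading step not shown in the original)
data("GlaucomaM", package = "TH.data")
Data = GlaucomaM
set.seed(143)
# fit a single decision tree (rpart), then extract its variable importance
rPartMod = train(Class ~ ., data=Data, method="rpart")
rpartImp = varImp(rPartMod)
plot(rpartImp, top = 6, main='Variable Importance Using Glaucoma Dataset')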

rm(rPartMod,rpartImp)

Stepwise Selection

Stepwise selection adds or removes variables one at a time, in the forward direction, the backward direction, or both. Model performance is measured with the Akaike information criterion (AIC), which estimates the quality of each model relative to the other candidate models; the model with the lowest AIC is preferred.
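
To make the AIC values in the step output less opaque, the sketch below reproduces the starting AIC of the full mtcars model by hand. For linear models, step relies on extractAIC, which computes n * log(RSS / n) + 2 * edf and drops additive constants, so the value differs from what AIC() reports. The variable names are my own.

```r
# Full linear model on the built-in mtcars data
fit <- lm(mpg ~ ., data = mtcars)

n   <- nrow(mtcars)            # number of observations
rss <- sum(residuals(fit)^2)   # residual sum of squares
edf <- length(coef(fit))       # estimated coefficients, including the intercept

# AIC as used by step() for lm fits (additive constants dropped)
aic_manual <- n * log(rss / n) + 2 * edf

# extractAIC() returns c(edf, AIC)
aic_step <- extractAIC(fit)[2]

round(c(manual = aic_manual, extractAIC = aic_step), 2)
```

Both values should agree with the "Start: AIC=70.9" line printed by step for the full model.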

Backward Selection

The backward procedure begins with the full model that includes all variables and removes one variable at a time until no removal lowers the AIC.

car = mtcars
step(lm(mpg~.,data=car),direction="backward")
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = car)
## 
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936

Forward Selection

The forward method begins with an intercept-only model and adds one variable at a time until no addition lowers the AIC.

step(lm(mpg~1,data=car),direction="forward",scope=~cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb)
## Start:  AIC=115.94
## mpg ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + wt    1    847.73  278.32  73.217
## + cyl   1    817.71  308.33  76.494
## + disp  1    808.89  317.16  77.397
## + hp    1    678.37  447.67  88.427
## + drat  1    522.48  603.57  97.988
## + vs    1    496.53  629.52  99.335
## + am    1    405.15  720.90 103.672
## + carb  1    341.78  784.27 106.369
## + gear  1    259.75  866.30 109.552
## + qsec  1    197.39  928.66 111.776
## <none>              1126.05 115.943
## 
## Step:  AIC=73.22
## mpg ~ wt
## 
##        Df Sum of Sq    RSS    AIC
## + cyl   1    87.150 191.17 63.198
## + hp    1    83.274 195.05 63.840
## + qsec  1    82.858 195.46 63.908
## + vs    1    54.228 224.09 68.283
## + carb  1    44.602 233.72 69.628
## + disp  1    31.639 246.68 71.356
## <none>              278.32 73.217
## + drat  1     9.081 269.24 74.156
## + gear  1     1.137 277.19 75.086
## + am    1     0.002 278.32 75.217
## 
## Step:  AIC=63.2
## mpg ~ wt + cyl
## 
##        Df Sum of Sq    RSS    AIC
## + hp    1   14.5514 176.62 62.665
## + carb  1   13.7724 177.40 62.805
## <none>              191.17 63.198
## + qsec  1   10.5674 180.60 63.378
## + gear  1    3.0281 188.14 64.687
## + disp  1    2.6796 188.49 64.746
## + vs    1    0.7059 190.47 65.080
## + am    1    0.1249 191.05 65.177
## + drat  1    0.0010 191.17 65.198
## 
## Step:  AIC=62.66
## mpg ~ wt + cyl + hp
## 
##        Df Sum of Sq    RSS    AIC
## <none>              176.62 62.665
## + am    1    6.6228 170.00 63.442
## + disp  1    6.1762 170.44 63.526
## + carb  1    2.5187 174.10 64.205
## + drat  1    2.2453 174.38 64.255
## + qsec  1    1.4010 175.22 64.410
## + gear  1    0.8558 175.76 64.509
## + vs    1    0.0599 176.56 64.654
## 
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = car)
## 
## Coefficients:
## (Intercept)           wt          cyl           hp  
##    38.75179     -3.16697     -0.94162     -0.01804

Both Selection

The both method combines the backward and forward procedures: at each step a variable may be removed or added back.

step(lm(mpg~.,data=car),direction="both")
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## + cyl   1    0.0799 147.49 70.898
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## + vs    1    0.2685 147.57 68.915
## + cyl   1    0.1883 147.66 68.932
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## + carb  1     0.685 147.84 66.973
## + vs    1     0.434 148.09 67.028
## + cyl   1     0.414 148.11 67.032
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## + gear  1     1.565 148.53 65.121
## + cyl   1     1.003 149.09 65.242
## + vs    1     0.645 149.45 65.319
## + carb  1     0.107 149.99 65.434
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## + drat  1     3.345 150.09 63.457
## + gear  1     2.977 150.46 63.535
## + cyl   1     2.447 150.99 63.648
## + vs    1     1.121 152.32 63.927
## + carb  1     0.011 153.43 64.160
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## + disp  1     6.629 153.44 62.162
## + carb  1     3.227 156.84 62.864
## + drat  1     1.428 158.64 63.229
## - qsec  1    20.225 180.29 63.323
## + cyl   1     0.249 159.82 63.465
## + vs    1     0.249 159.82 63.466
## + gear  1     0.171 159.90 63.481
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## + hp    1     9.219 160.07 61.515
## + carb  1     8.036 161.25 61.751
## + disp  1     3.276 166.01 62.682
## + cyl   1     1.501 167.78 63.022
## + drat  1     1.400 167.89 63.042
## + gear  1     0.123 169.16 63.284
## + vs    1     0.000 169.29 63.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = car)
## 
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936
rm(car)
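
The printed traces are long, and the backward and forward runs end at different models. When only the final models matter, one option is to suppress the trace and compare the results directly. The sketch below assumes the built-in mtcars data and uses object names of my own choosing.

```r
full <- lm(mpg ~ ., data = mtcars)   # all ten predictors
null <- lm(mpg ~ 1, data = mtcars)   # intercept only

# trace = 0 suppresses the step-by-step tables
backward <- step(full, direction = "backward", trace = 0)
forward  <- step(null, direction = "forward", scope = formula(full), trace = 0)

# Compare the selected formulas and their AIC (second element of extractAIC)
comparison <- data.frame(
  direction = c("backward", "forward"),
  formula   = c(deparse(formula(backward)), deparse(formula(forward))),
  AIC       = c(extractAIC(backward)[2], extractAIC(forward)[2])
)
comparison
```

The two directions search different paths, so they need not agree. Here the backward model has the lower AIC, which is one reason to try more than one direction.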

Boruta

Boruta is a feature ranking and selection algorithm built on random forests. Its advantages are the ease of selecting variables and the ability to adjust how the selection is made.

Below, the Boruta function is fitted to the dataset for evaluation.

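library('Boruta')
# the response must be a factor for classification
Data$Class = as.factor(Data$Class)
set.seed(143)
# doTrace=0 suppresses progress messages; rows with missing values are dropped
boruta_output = Boruta(Class ~ ., data=na.omit(Data), doTrace=0)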
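
Boruta judges each variable against "shadow" copies: shuffled versions of the predictors that cannot carry real information about the response. The sketch below illustrates only that comparison, under simplifying assumptions. The data are simulated, and absolute correlation is used as a stand-in importance measure, whereas Boruta itself uses random forest importance, repeats the comparison over many runs, and applies a statistical test.

```r
set.seed(143)
n <- 200
# Simulated predictors: one informative, two pure noise
x <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
y <- 2 * x$signal + rnorm(n)

# Shadow features: each column shuffled independently, breaking any link to y
shadow <- as.data.frame(lapply(x, sample))
names(shadow) <- paste0("shadow_", names(x))

# Stand-in importance: absolute correlation with y (NOT what Boruta uses)
importance <- function(df) sapply(df, function(col) abs(cor(col, y)))

real_imp   <- importance(x)
shadow_max <- max(importance(shadow))

# A "hit" means a real variable beat the best shadow in this single run
hits <- real_imp > shadow_max
hits
```

A variable that keeps beating the best shadow across repeated runs would be confirmed, one that rarely does would be rejected, and anything undecided would be left tentative.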

Significant Variables

The significant variables can be extracted from the selection. Tentative variables are those that Boruta could neither confirm nor reject within its allotted runs; they can be either kept or dropped.

Significant_vars = getSelectedAttributes(boruta_output, withTentative = TRUE)
Significant_vars
##  [1] "ag"   "as"   "ai"   "eag"  "ean"  "abrg" "abrs" "abrn" "abri" "hic" 
## [11] "mhcg" "mhcs" "mhcn" "mhci" "phcg" "phcn" "phci" "hvc"  "vbsg" "vbss"
## [21] "vbsn" "vbsi" "vasg" "vass" "vasi" "vbrg" "vbrs" "vbrn" "vbri" "varg"
## [31] "vart" "vars" "varn" "vari" "mdn"  "tmg"  "tmt"  "tms"  "tmi"  "rnf" 
## [41] "mdic" "emd"
rm(Significant_vars)

Boruta also provides the TentativeRoughFix function, which decides on the tentative variables for the user.

roughFixMod = TentativeRoughFix(boruta_output)
boruta_signif = getSelectedAttributes(roughFixMod)
boruta_signif
##  [1] "as"   "abrg" "abrs" "abrn" "abri" "hic"  "mhcg" "mhcn" "mhci" "phcg"
## [11] "phcn" "phci" "hvc"  "vbss" "vbsn" "vbsi" "vasg" "vass" "vbrg" "vbrs"
## [21] "vbrn" "vbri" "varg" "vart" "vars" "varn" "vari" "tmg"  "tms"  "tmi" 
## [31] "rnf"  "mdic"
rm(boruta_signif)

The importance scores can be listed as shown below; vari is the most important variable.

# Variable Importance Scores
imps = attStats(roughFixMod)
imps2 = imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
head(imps2[order(-imps2$meanImp), ],10)  # descending sort
rm(imps,imps2,roughFixMod)

The Boruta output can be plotted to see the variable importance.

plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")  
