1.0 Feature Selection Introduction
Feature selection is an important topic that requires in-depth knowledge of the problem domain. Having the right features can help the model perform better. For example, removing the highly correlated attributes can lead to a better model and improve prediction.
In this post, I will explore finding highly correlated variables, Recursive Feature Elimination, stepwise elimination, and Boruta feature selection.
1.1 Plot Histograms
First, I will plot a histogram of each variable in the Pima Indians Diabetes dataset. Several variables exhibit skewed distributions and should be considered for transformation.
library("mlbench")
library("caret")
# load the data
data(PimaIndiansDiabetes)
DataExplorer::plot_histogram(PimaIndiansDiabetes)
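To make "skewed" concrete, here is a minimal sketch on simulated data (not the Pima data). The skewness helper and the variable names are my own illustration and are not part of any package used in this post; the log transform is just one common choice.

```r
# Moment-based sample skewness (hypothetical helper written for this sketch)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(143)
# Simulated right-skewed measurements (assumption: log-normal shape)
skewed_var <- rlnorm(500, meanlog = 4, sdlog = 1)

skew_before <- skewness(skewed_var)        # clearly positive
skew_after  <- skewness(log(skewed_var))   # close to zero after the log transform

print(c(before = skew_before, after = skew_after))
```

A skewness near zero after the transformation suggests the transformed variable is closer to symmetric.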

1.2 Correlations
Below we convert the response variable from a factor to numeric so that its correlations with the other variables can be computed. I also created a deliberately correlated variable, HCorrelated, by multiplying glucose and mass. As expected, glucose and mass are both highly correlated with HCorrelated.
library("corrplot")
indians = PimaIndiansDiabetes
indians$diabetes = ifelse(indians$diabetes =="pos",1,0)
indians$HCorrelated =indians$glucose*indians$mass
cor_mx = cor(indians ,use="pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color",
type = "upper", order = "original", number.cex = .7,
addCoef.col = "black", # Add coefficient of correlation
tl.col = "black", tl.srt = 90, # Text label color and rotation
# diag = TRUE also draws the principal diagonal (set diag = FALSE to hide it)
diag = TRUE)
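The plot shows the correlations visually; the pairs can also be listed programmatically. The sketch below uses simulated columns that only mimic the glucose, mass, and HCorrelated setup, and the 0.5 cutoff is an arbitrary assumption. caret also offers findCorrelation for a similar purpose.

```r
set.seed(143)
# Simulated stand-ins for the real columns (assumption: independent normals)
glucose <- rnorm(200, mean = 120, sd = 30)
mass    <- rnorm(200, mean = 32,  sd = 7)
age     <- rnorm(200, mean = 33,  sd = 11)
toy <- data.frame(glucose, mass, age, HCorrelated = glucose * mass)

cor_toy <- cor(toy, method = "pearson")
# Keep only the upper triangle so each pair is reported once
cor_toy[lower.tri(cor_toy, diag = TRUE)] <- NA

cutoff <- 0.5  # arbitrary threshold for this sketch
high <- which(abs(cor_toy) > cutoff, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_toy)[high[, "row"]],
  var2 = colnames(cor_toy)[high[, "col"]],
  r    = cor_toy[high]
)
print(high_pairs)
```

Only the pairs involving the constructed product variable should exceed the cutoff here.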

2.0 Recursive Feature Elimination
2.1 Run the Feature Selection
Recursive Feature Elimination (RFE) builds models on different subsets of the features to identify features that might not be required. Caret provides an rfe function that facilitates this process.
Below we fit the rfe function to the Pima Indians Diabetes data. The control object uses random forest selection functions with 10-fold cross-validation. The final plot indicates that the model with all eight variables reaches an accuracy of 77.73%.
# define the control using a random forest selection function
# 10-fold cross-validation; repeats is only used when method = "repeatedcv"
control = rfeControl(functions=rfFuncs, method="cv", number=10, repeats = 1)
# run the RFE algorithm
set.seed(143)
# NOTE: PimaIndiansDiabetes[,9] is the response, a factor with levels neg/pos (not numeric).
# The sizes argument sets which subset sizes (numbers of variables) are evaluated.
results = rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control,
verbose=FALSE)
# summarize the results
print(results)
Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
The top 5 variables (out of 8):
glucose, mass, age, pregnant, insulin
2.2 List the predictors in the Order of Choice
# list the chosen features
predictors(results)
[1] "glucose" "mass" "age" "pregnant" "insulin" "pedigree" "triceps" "pressure"
2.3 Plot The Results
# plot the results
plot(results, type=c("g", "o"))
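To illustrate the elimination idea itself, here is a stripped-down sketch in base R. It is not what caret's rfe does internally: it uses a logistic regression instead of a random forest, ranks features by absolute z-value, and skips the resampling step entirely. The data and all names are simulated assumptions.

```r
set.seed(143)
n <- 300
# Simulated predictors: only a and b actually drive the outcome
x <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
y <- rbinom(n, 1, plogis(2 * x$a + 1 * x$b))

remaining <- names(x)
elimination_order <- character(0)

# Repeatedly refit and drop the weakest feature until one is left
while (length(remaining) > 1) {
  fit <- glm(y ~ ., data = x[, remaining, drop = FALSE], family = binomial)
  z <- abs(coef(summary(fit))[-1, "z value"])  # drop the intercept row
  weakest <- names(z)[which.min(z)]
  elimination_order <- c(elimination_order, weakest)
  remaining <- setdiff(remaining, weakest)
}

print(elimination_order)  # features in the order they were removed
print(remaining)          # the last surviving feature
```

In a full RFE, each subset size would additionally be scored on held-out data, which is the part rfeControl configures.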

3.0 Variable Importance
Another method relies on fitting a model and reading off its variable importance. Below, a decision tree is fitted through caret's rpart method. With this approach, variable importance can vary by model.
library("dplyr") # provides glimpse()
glimpse(PimaIndiansDiabetes)
Rows: 768
Columns: 9
$ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1, 3, 8, 7, 9, ...
$ glucose <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139, 189, 166, 100...
$ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0, 84, 74, 30, 7...
$ triceps <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0, 38, 30, 41, ...
$ insulin <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230, 0, 83, 96, 2...
$ mass <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37.6, 38.0, 27.1...
$ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158, 0.232, 0.191...
$ age <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 32, 31, 31, 33,...
$ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, neg, pos, pos, ...
Data=PimaIndiansDiabetes
set.seed(143)
rPartMod = train(diabetes ~ ., data=Data, method="rpart")
rpartImp = varImp(rPartMod)
plot(rpartImp, top = 8, main='Variable Importance Using Pima Indians Diabetes Dataset')
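Since importance scores depend on the model, a model-agnostic cross-check can be useful. The sketch below uses permutation importance, which is a different technique from the rpart-based varImp above: shuffle one column at a time and measure how much accuracy drops. The data, the accuracy helper, and the variable names are assumptions made up for this illustration.

```r
set.seed(143)
n <- 400
# Simulated data: one strong predictor, one weak predictor, one pure noise column
d <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
d$y <- rbinom(n, 1, plogis(2 * d$strong + 0.5 * d$weak))

fit <- glm(y ~ ., data = d, family = binomial)

# Classification accuracy at a 0.5 probability threshold (hypothetical helper)
accuracy <- function(model, data) {
  mean((predict(model, data, type = "response") > 0.5) == data$y)
}

base_acc <- accuracy(fit, d)

# Importance = drop in accuracy after shuffling a single column
perm_importance <- sapply(c("strong", "weak", "noise"), function(v) {
  shuffled <- d
  shuffled[[v]] <- sample(shuffled[[v]])
  base_acc - accuracy(fit, shuffled)
})

print(sort(perm_importance, decreasing = TRUE))
```

Here accuracy is measured on the training data for brevity; held-out data would give a more honest estimate.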

4.0 StepWise Selection
Stepwise selection is a method that allows variables to be added or removed in either direction. Model performance is measured by the Akaike information criterion (AIC), which estimates the quality of each model relative to the other candidate models. The model with the lowest AIC is considered the best.
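As a quick check of what AIC measures, the sketch below recomputes it by hand as -2 * logLik + 2 * k, where k is the number of estimated coefficients. The data are simulated and the variable names are assumptions. For a logistic regression on a 0/1 response, this is also equal to the residual deviance plus 2 * k.

```r
set.seed(143)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5 + 1.2 * x1))  # x2 is unrelated to y

fit <- glm(y ~ x1 + x2, family = binomial)

k <- length(coef(fit))                               # intercept + 2 slopes
aic_manual   <- -2 * as.numeric(logLik(fit)) + 2 * k
aic_deviance <- deviance(fit) + 2 * k                # valid for a 0/1 response

print(c(manual = aic_manual, from_deviance = aic_deviance, builtin = AIC(fit)))
```

The penalty term 2 * k is why dropping a variable that barely changes the deviance lowers the AIC.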
4.1a Backward Selection
The backward procedure begins with a general model that includes all variables and eliminates one variable at a time.
#glm(diabetes ~ ., data=Data, family="binomial")
step(glm(diabetes ~ ., data=Data, family="binomial"),direction="backward")
Start: AIC=741.45
diabetes ~ pregnant + glucose + pressure + triceps + insulin +
mass + pedigree + age
Df Deviance AIC
- triceps 1 723.45 739.45
- insulin 1 725.19 741.19
<none> 723.45 741.45
- age 1 725.97 741.97
- pressure 1 729.99 745.99
- pedigree 1 733.78 749.78
- pregnant 1 738.68 754.68
- mass 1 764.22 780.22
- glucose 1 838.37 854.37
Step: AIC=739.45
diabetes ~ pregnant + glucose + pressure + insulin + mass + pedigree +
age
Df Deviance AIC
<none> 723.45 739.45
- insulin 1 725.46 739.46
- age 1 725.97 739.97
- pressure 1 730.13 744.13
- pedigree 1 733.92 747.92
- pregnant 1 738.69 752.69
- mass 1 768.77 782.77
- glucose 1 840.87 854.87
Call: glm(formula = diabetes ~ pregnant + glucose + pressure + insulin +
mass + pedigree + age, family = "binomial", data = Data)
Coefficients:
(Intercept) pregnant glucose pressure insulin mass pedigree
-8.405136 0.123172 0.035112 -0.013214 -0.001157 0.090089 0.947595
age
0.014789
Degrees of Freedom: 767 Total (i.e. Null); 760 Residual
Null Deviance: 993.5
Residual Deviance: 723.5 AIC: 739.5
4.1b Forward Selection
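The forward method begins with a simple model and adds suitable variables one at a time until the best model is obtained. Note that the call below starts from the full model, so step has nothing left to add and returns the full model unchanged.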
#glm(diabetes ~ ., data=Data, family="binomial")
step(glm(diabetes ~ ., data=Data, family="binomial"),direction="forward")
Start: AIC=741.45
diabetes ~ pregnant + glucose + pressure + triceps + insulin +
mass + pedigree + age
Call: glm(formula = diabetes ~ pregnant + glucose + pressure + triceps +
insulin + mass + pedigree + age, family = "binomial", data = Data)
Coefficients:
(Intercept) pregnant glucose pressure triceps insulin mass
-8.404696 0.123182 0.035164 -0.013296 0.000619 -0.001192 0.089701
pedigree age
0.945180 0.014869
Degrees of Freedom: 767 Total (i.e. Null); 759 Residual
Null Deviance: 993.5
Residual Deviance: 723.4 AIC: 741.4
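For a forward search that actually adds variables, step needs to start from a small model and be told how far it may grow through its scope argument. The following is a minimal sketch on simulated data; the toy data frame and its column names are assumptions, and the same pattern should carry over to the diabetes data.

```r
set.seed(143)
n <- 300
# Simulated data: x1 and x2 matter, x3 does not
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1 * toy$x2))

# Start from the intercept-only model
null_model <- glm(y ~ 1, data = toy, family = binomial)

# scope tells step() the smallest and largest models it may consider
forward_fit <- step(
  null_model,
  scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  trace = 0  # suppress the per-step tables
)

selected <- attr(terms(forward_fit), "term.labels")
print(selected)
```

Setting trace to its default would print the same kind of step tables shown in the backward example.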
4.1c Both Selection
The both method is a combination of the backward and forward procedures.
#glm(diabetes ~ ., data=Data, family="binomial")
step(glm(diabetes ~ ., data=Data, family="binomial"),direction="both")
Start: AIC=741.45
diabetes ~ pregnant + glucose + pressure + triceps + insulin +
mass + pedigree + age
Df Deviance AIC
- triceps 1 723.45 739.45
- insulin 1 725.19 741.19
<none> 723.45 741.45
- age 1 725.97 741.97
- pressure 1 729.99 745.99
- pedigree 1 733.78 749.78
- pregnant 1 738.68 754.68
- mass 1 764.22 780.22
- glucose 1 838.37 854.37
Step: AIC=739.45
diabetes ~ pregnant + glucose + pressure + insulin + mass + pedigree +
age
Df Deviance AIC
<none> 723.45 739.45
- insulin 1 725.46 739.46
- age 1 725.97 739.97
+ triceps 1 723.45 741.45
- pressure 1 730.13 744.13
- pedigree 1 733.92 747.92
- pregnant 1 738.69 752.69
- mass 1 768.77 782.77
- glucose 1 840.87 854.87
Call: glm(formula = diabetes ~ pregnant + glucose + pressure + insulin +
mass + pedigree + age, family = "binomial", data = Data)
Coefficients:
(Intercept) pregnant glucose pressure insulin mass pedigree
-8.405136 0.123172 0.035112 -0.013214 -0.001157 0.090089 0.947595
age
0.014789
Degrees of Freedom: 767 Total (i.e. Null); 760 Residual
Null Deviance: 993.5
Residual Deviance: 723.5 AIC: 739.5
The model without triceps has the lowest AIC (739.45), so it appears to be the best.
5.0 Boruta
Boruta is a feature ranking and selection algorithm based on the random forest algorithm. The advantages of using this package are the ease of variable selection and the ability to adjust the selection.
Below I fitted the Boruta function with the dataset for evaluation.
library('Boruta')
set.seed(143)
boruta_output = Boruta(diabetes ~ ., data=na.omit(Data), doTrace=0)
5.1 Significant Variables
The significant variables can be extracted from the selection. Tentative variables are those that Boruta could neither confirm nor reject, so they can be dropped or kept.
Significant_vars = getSelectedAttributes(boruta_output, withTentative = TRUE)
Significant_vars
[1] "pregnant" "glucose" "pressure" "triceps" "insulin" "mass" "pedigree" "age"
All eight variables appear to be significant.
Boruta also provides TentativeRoughFix, which makes the decision on tentative variables for the user.
roughFixMod = TentativeRoughFix(boruta_output)
boruta_signif = getSelectedAttributes(roughFixMod)
boruta_signif
[1] "pregnant" "glucose" "pressure" "triceps" "insulin" "mass" "pedigree" "age"
5.2 Importance of Variables
The importance of each variable can be listed with attStats, sorted in descending order so that the most important variable comes first.
# Variable Importance Scores
imps = attStats(roughFixMod)
imps2 = imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
head(imps2[order(-imps2$meanImp), ],10) # descending sort
5.3 Plotting Importance of Variables
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")

6.0 Information Value and Weights of Evidence (Categorical Variables)
The Information Value can be used to judge how important a given categorical variable is in explaining the binary Y variable. It goes well with logistic regression and other classification models that can model binary variables.
Let’s try to find out how important the categorical variables are in predicting if an individual will earn >50k from the ‘adult.csv’ dataset. Just run the code below to import the dataset.
library(InformationValue)
inputData <- read.csv("./Data/adult.csv")
print(head(inputData))
Alright, let’s now find the information value for the categorical variables in the inputData.
Here is what the quantum of Information Value means:
Less than 0.02: the predictor is not useful for modeling (separating the Goods from the Bads).
0.02 to 0.1: the predictor has only a weak relationship.
0.1 to 0.3: the predictor has a medium-strength relationship.
0.3 or higher: the predictor has a strong relationship.
That was about IV. Then what is Weight of Evidence?
Weights of evidence can be useful to find out how important a given categorical variable is in explaining the ‘events’ (called ‘Goods’ in the table below).
WOE = ln(% good of all goods / % bad of all bads)
# The ‘Information Value’ of the categorical variable can then be derived from the respective WOE values.
#
# IV = (perc good of all goods - perc bad of all bads) * WOE
#
# The ‘WOETable’ below gives the computation in more detail.
WOETable(X=inputData[, 'workclass'], Y=inputData$income)
The total IV of a variable is the sum of the IVs of its categories.
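The WOE and IV formulas can be worked through by hand on a tiny example. The sketch below uses a made-up ten-row data frame (the categories and counts are invented for illustration) and base R only, so it does not reproduce WOETable's exact output format. The iv_strength helper is hypothetical and simply encodes the thresholds listed earlier.

```r
# Made-up data: 1 = good (event), 0 = bad
toy <- data.frame(
  workclass = c(rep("Private", 6), rep("Gov", 4)),
  good      = c(1, 1, 0, 0, 0, 0,   1, 1, 1, 0)
)

counts   <- table(toy$workclass, toy$good)       # rows: category, cols: "0", "1"
pct_good <- counts[, "1"] / sum(counts[, "1"])   # share of all goods in each category
pct_bad  <- counts[, "0"] / sum(counts[, "0"])   # share of all bads in each category

woe      <- log(pct_good / pct_bad)
iv       <- (pct_good - pct_bad) * woe
total_iv <- sum(iv)

# Map an IV to the strength bands listed earlier (hypothetical helper)
iv_strength <- function(iv) {
  cut(iv, breaks = c(-Inf, 0.02, 0.1, 0.3, Inf),
      labels = c("not useful", "weak", "medium", "strong"), right = FALSE)
}

print(data.frame(pct_good, pct_bad, woe, iv))
print(total_iv)
print(as.character(iv_strength(total_iv)))
```

Note that a category with zero goods or zero bads would give an infinite WOE, so real implementations usually adjust such counts.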
8.0 WOE Binning
This package generates, visualizes, tabulates and deploys a supervised weight of evidence (WOE) binning of variables.
Details
The package woeBinning automates the process of binning of numeric variables and factors with respect to a dichotomous target variable. Additionally, it visualizes the realized binning solution, tabulates it and deploys it to (new) data. All functions can be used with single variables or an entire data frame.
8.1 Binning
woe.binning generates a supervised fine and coarse classing of numeric variables and factors.
woe.tree.binning generates a supervised tree-like segmentation of numeric variables and factors.
woe.binning.plot visualizes the binning solution generated and saved via woe.binning or woe.tree.binning.
woe.binning.table tabulates the binning solution generated and saved via woe.binning or woe.tree.binning.
woe.binning.deploy deploys the binning solution generated and saved via woe.binning or woe.tree.binning to (new) data.
8.2 Examples
# Load the woeBinning package, which provides the germancredit data
library(woeBinning)
# Load German credit data and create subset
data(germancredit)
df <- germancredit[, c('creditability', 'credit.amount', 'duration.in.month',
'savings.account.and.bonds', 'purpose')]
