1.0 Feature Selection Introduction

Feature selection is an important topic that requires in-depth knowledge of the problem domain. Having the right features can help the model perform better. For example, removing the highly correlated attributes can lead to a better model and improve prediction.

In this post, I will explore finding highly correlated variables, Recursive Feature Elimination, stepwise elimination, and Boruta feature selection.

1.1 Plot Histograms

First, I will plot histograms of the variables in the Pima Indians Diabetes dataset. Several variables exhibit skewed distributions and should be considered for transformation.

library("mlbench")
library("caret")
# load the data
data(PimaIndiansDiabetes)
DataExplorer::plot_histogram(PimaIndiansDiabetes)
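
As a rough check on which variables might benefit from a transformation, the sketch below compares the sample skewness of each numeric variable before and after a log1p transform. The skewness helper is written here for illustration (it is not part of mlbench or caret), and log1p is only one candidate transformation.

```r
library("mlbench")
data(PimaIndiansDiabetes)

# Sample skewness (third standardized moment); helper defined for this sketch only
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Keep the numeric predictors (the response, diabetes, is a factor)
num_vars <- PimaIndiansDiabetes[, sapply(PimaIndiansDiabetes, is.numeric)]

skew_raw <- sapply(num_vars, skewness)
skew_log <- sapply(num_vars, function(x) skewness(log1p(x)))

# Side-by-side comparison; values near 0 indicate a roughly symmetric distribution
round(cbind(raw = skew_raw, log1p = skew_log), 2)
```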

1.2 Correlations

Below we transform the response variable from a factor to a numeric variable, which lets us see its correlations with the other variables. I also created a highly correlated variable, HCorrelated, by multiplying glucose and mass. As expected, glucose and mass are both highly correlated with HCorrelated.


library("corrplot")
indians = PimaIndiansDiabetes
indians$diabetes = ifelse(indians$diabetes =="pos",1,0)
indians$HCorrelated =indians$glucose*indians$mass

cor_mx = cor(indians  ,use="pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color", 
         type = "upper", order = "original", number.cex = .7,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         # diag = TRUE keeps the principal diagonal; set diag = FALSE to hide it
         diag = TRUE)
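
The plot shows the correlations visually. To list the highly correlated pairs explicitly, something like the following base R sketch could be used. The cutoff of 0.6 is an arbitrary value chosen for illustration, not a recommendation.

```r
library("mlbench")
data(PimaIndiansDiabetes)

# Rebuild the data used for the correlation plot
indians <- PimaIndiansDiabetes
indians$diabetes <- ifelse(indians$diabetes == "pos", 1, 0)
indians$HCorrelated <- indians$glucose * indians$mass
cor_mx <- cor(indians, use = "pairwise.complete.obs", method = "pearson")

cutoff <- 0.6  # assumed threshold for "highly correlated"

# Look only at the upper triangle so each pair is reported once
idx <- which(abs(cor_mx) > cutoff & upper.tri(cor_mx), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mx)[idx[, "row"]],
  var2 = colnames(cor_mx)[idx[, "col"]],
  r    = cor_mx[idx]
)

# Strongest correlations first
high_pairs[order(-abs(high_pairs$r)), ]
```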

2.0 Recursive Feature Elimination

2.1 Run the Feature Selection

Recursive Feature Elimination (RFE) builds models on different subsets of the features to identify features that might not be required. The caret package provides an rfe function that facilitates this process.

Below we fit the rfe function to the Pima Indians Diabetes data. The control uses the random forest selection functions (rfFuncs) with 10-fold cross-validation. The final plot indicates that the model with all eight variables reaches an accuracy of 77.73%.

# define the control using a random forest selection function
# method="cv" runs 10-fold cross-validation once; use method="repeatedcv" with repeats > 1 for repeated CV
control = rfeControl(functions=rfFuncs, method="cv", number=10)
# run the RFE algorithm
set.seed(143)
#
#   NOTE: PimaIndiansDiabetes[,9] is the response, a factor with levels neg and pos (not numeric).
#   The sizes argument sets the subset sizes (numbers of variables) to evaluate.
#
results = rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control,
              verbose=FALSE)
# summarize the results
print(results)

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold) 

Resampling performance over subset size:

The top 5 variables (out of 8):
   glucose, mass, age, pregnant, insulin

2.2 List the predictors in the Order of Choice


# list the chosen features
predictors(results)
[1] "glucose"  "mass"     "age"      "pregnant" "insulin"  "pedigree" "triceps"  "pressure"

2.3 Plot The Results

# plot the results
plot(results, type=c("g", "o"))

3.0 Variable Importance

Another method relies on fitting a model and identifying variable importance. The code below fits a single decision tree (rpart) through caret rather than a random forest. With this method, variable importance can vary by model.

dplyr::glimpse(PimaIndiansDiabetes)
Rows: 768
Columns: 9
$ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1, 3, 8, 7, 9, ...
$ glucose  <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139, 189, 166, 100...
$ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0, 84, 74, 30, 7...
$ triceps  <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0, 38, 30, 41, ...
$ insulin  <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230, 0, 83, 96, 2...
$ mass     <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37.6, 38.0, 27.1...
$ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158, 0.232, 0.191...
$ age      <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 32, 31, 31, 33,...
$ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, neg, pos, pos, ...
Data=PimaIndiansDiabetes

set.seed(143)
rPartMod = train(diabetes ~ ., data=Data, method="rpart")
rpartImp = varImp(rPartMod)
plot(rpartImp, top = 8, main='Variable Importance Using Pima Indians Diabetes Dataset')

4.0 StepWise Selection

Stepwise selection is a method that allows variables to be added or removed in either direction. Model performance is measured by the Akaike information criterion (AIC), which estimates the quality of each model relative to the other candidate models; the model with the lowest AIC is the best.
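
For reference, the AIC values printed by step() in the sections below can be checked by hand. With k estimated coefficients and maximized likelihood L-hat, and using the fact that for a logistic regression on a binary (0/1) response the residual deviance D equals -2 ln L-hat, the arithmetic is:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L} = D + 2k

% Full model: 8 predictors + intercept, so k = 9, D = 723.45
\mathrm{AIC}_{\text{full}} = 723.45 + 2 \cdot 9 = 741.45

% After dropping triceps: k = 8, D = 723.45 (to two decimals)
\mathrm{AIC}_{-\text{triceps}} = 723.45 + 2 \cdot 8 = 739.45
```

Dropping triceps leaves the deviance essentially unchanged while saving one parameter, which is why the AIC falls by about 2.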

4.1a Backward Selection

The backward procedure begins with a general model that includes all variables and eliminates one variable at a time.



#glm(diabetes ~ ., data=Data, family="binomial")
step(glm(diabetes ~ ., data=Data, family="binomial"),direction="backward")
Start:  AIC=741.45
diabetes ~ pregnant + glucose + pressure + triceps + insulin + 
    mass + pedigree + age

           Df Deviance    AIC
- triceps   1   723.45 739.45
- insulin   1   725.19 741.19
<none>          723.45 741.45
- age       1   725.97 741.97
- pressure  1   729.99 745.99
- pedigree  1   733.78 749.78
- pregnant  1   738.68 754.68
- mass      1   764.22 780.22
- glucose   1   838.37 854.37

Step:  AIC=739.45
diabetes ~ pregnant + glucose + pressure + insulin + mass + pedigree + 
    age

           Df Deviance    AIC
<none>          723.45 739.45
- insulin   1   725.46 739.46
- age       1   725.97 739.97
- pressure  1   730.13 744.13
- pedigree  1   733.92 747.92
- pregnant  1   738.69 752.69
- mass      1   768.77 782.77
- glucose   1   840.87 854.87

Call:  glm(formula = diabetes ~ pregnant + glucose + pressure + insulin + 
    mass + pedigree + age, family = "binomial", data = Data)

Coefficients:
(Intercept)     pregnant      glucose     pressure      insulin         mass     pedigree  
  -8.405136     0.123172     0.035112    -0.013214    -0.001157     0.090089     0.947595  
        age  
   0.014789  

Degrees of Freedom: 767 Total (i.e. Null);  760 Residual
Null Deviance:      993.5 
Residual Deviance: 723.5    AIC: 739.5

4.1b Forward Selection

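The forward method begins with a simple model and then adds one suitable variable at a time until no further addition improves the AIC. Note that the call below starts from the full model, so there is nothing left to add and step() returns the full model unchanged (AIC 741.45).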



#glm(diabetes ~ ., data=Data, family="binomial")
step(glm(diabetes ~ ., data=Data, family="binomial"),direction="forward")
Start:  AIC=741.45
diabetes ~ pregnant + glucose + pressure + triceps + insulin + 
    mass + pedigree + age


Call:  glm(formula = diabetes ~ pregnant + glucose + pressure + triceps + 
    insulin + mass + pedigree + age, family = "binomial", data = Data)

Coefficients:
(Intercept)     pregnant      glucose     pressure      triceps      insulin         mass  
  -8.404696     0.123182     0.035164    -0.013296     0.000619    -0.001192     0.089701  
   pedigree          age  
   0.945180     0.014869  

Degrees of Freedom: 767 Total (i.e. Null);  759 Residual
Null Deviance:      993.5 
Residual Deviance: 723.4    AIC: 741.4
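
For forward selection to actually add variables, the search has to start from a smaller model and be given a scope to grow into. A minimal sketch, assuming the intercept-only model as the starting point and the full model as the upper bound (the object names null_mod, full_mod and forward_mod are mine):

```r
library("mlbench")
data(PimaIndiansDiabetes)
Data <- PimaIndiansDiabetes

# Starting point: intercept-only model; upper bound: all eight predictors
null_mod <- glm(diabetes ~ 1, data = Data, family = "binomial")
full_mod <- glm(diabetes ~ ., data = Data, family = "binomial")

# Add one variable at a time while the AIC keeps improving
forward_mod <- step(null_mod,
                    scope = list(lower = formula(null_mod),
                                 upper = formula(full_mod)),
                    direction = "forward",
                    trace = 0)  # trace = 0 suppresses the step-by-step log

formula(forward_mod)
AIC(forward_mod)
```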

4.1c Both Selection

The both method combines the backward and forward procedures: at each step a variable can be dropped, or a previously dropped variable can be added back.



#glm(diabetes ~ ., data=Data, family="binomial")
step(glm(diabetes ~ ., data=Data, family="binomial"),direction="both")
Start:  AIC=741.45
diabetes ~ pregnant + glucose + pressure + triceps + insulin + 
    mass + pedigree + age

           Df Deviance    AIC
- triceps   1   723.45 739.45
- insulin   1   725.19 741.19
<none>          723.45 741.45
- age       1   725.97 741.97
- pressure  1   729.99 745.99
- pedigree  1   733.78 749.78
- pregnant  1   738.68 754.68
- mass      1   764.22 780.22
- glucose   1   838.37 854.37

Step:  AIC=739.45
diabetes ~ pregnant + glucose + pressure + insulin + mass + pedigree + 
    age

           Df Deviance    AIC
<none>          723.45 739.45
- insulin   1   725.46 739.46
- age       1   725.97 739.97
+ triceps   1   723.45 741.45
- pressure  1   730.13 744.13
- pedigree  1   733.92 747.92
- pregnant  1   738.69 752.69
- mass      1   768.77 782.77
- glucose   1   840.87 854.87

Call:  glm(formula = diabetes ~ pregnant + glucose + pressure + insulin + 
    mass + pedigree + age, family = "binomial", data = Data)

Coefficients:
(Intercept)     pregnant      glucose     pressure      insulin         mass     pedigree  
  -8.405136     0.123172     0.035112    -0.013214    -0.001157     0.090089     0.947595  
        age  
   0.014789  

Degrees of Freedom: 767 Total (i.e. Null);  760 Residual
Null Deviance:      993.5 
Residual Deviance: 723.5    AIC: 739.5

The model without triceps appears to be best: dropping it lowers the AIC from 741.45 to 739.45.

5.0 Boruta

Boruta is a feature ranking and selection algorithm based on the random forest algorithm. The advantages of using this package are the ease of variable selection and the ability to adjust the selection.

Below I fitted the Boruta function with the dataset for evaluation.

library('Boruta')
set.seed(143)
boruta_output = Boruta(diabetes ~ ., data=na.omit(Data), doTrace=0) 

5.1 Significant Variables

The significant variables can be extracted from the selection. Tentative variables are variables that can be dropped or kept.


Significant_vars = getSelectedAttributes(boruta_output, withTentative = TRUE)
Significant_vars
[1] "pregnant" "glucose"  "pressure" "triceps"  "insulin"  "mass"     "pedigree" "age"     

All eight variables appear to be significant.

Boruta also provides TentativeRoughFix, which makes the decision on tentative variables for the user.

roughFixMod = TentativeRoughFix(boruta_output)
boruta_signif = getSelectedAttributes(roughFixMod)
boruta_signif
[1] "pregnant" "glucose"  "pressure" "triceps"  "insulin"  "mass"     "pedigree" "age"     

5.2 Importance of Variables

The importance scores of the variables can be listed with the method below, sorted so that the most important variable comes first.


# Variable Importance Scores
imps = attStats(roughFixMod)
imps2 = imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
head(imps2[order(-imps2$meanImp), ],10)  # descending sort

5.3 Plotting Importance of Variables


plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")  

6.0 Information Value and Weights of Evidence (Categorical Variables)

The Information Value can be used to judge how important a given categorical variable is in explaining the binary Y variable. It goes well with logistic regression and other classification models that can model binary variables.

Let’s try to find out how important the categorical variables are in predicting if an individual will earn >50k from the ‘adult.csv’ dataset. Just run the code below to import the dataset.

library(InformationValue)
inputData <- read.csv("./Data/adult.csv")
print(head(inputData))

Alright, let’s now find the information value for the categorical variables in the inputData.

Here is how to interpret the Information Value:

Less than 0.02: the predictor is not useful for modeling (separating the Goods from the Bads).

0.02 to 0.1: the predictor has only a weak relationship.

0.1 to 0.3: the predictor has a medium-strength relationship.

0.3 or higher: the predictor has a strong relationship.

That was about IV. Then what is Weight of Evidence?

Weight of Evidence (WOE) measures how well a given category of a categorical variable separates the ‘events’ (called ‘Goods’ in the table below) from the non-events. For each category:

WOE = ln(% of all Goods in the category / % of all Bads in the category)

# The ‘Information Value’ of the categorical variable can then be derived from the respective WOE values.
#
# IV = (perc good of all goods - perc bad of all bads) * WOE
#
# The ‘WOETable’ below gives the computation in more detail.

WOETable(X=inputData[, 'workclass'], Y=inputData$income)

The total IV of a variable is the sum of the IVs of its categories.
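
To make the WOE and IV definitions concrete, here is a small base R sketch on made-up data: three categories with 10 observations each, where 1 marks a Good and 0 a Bad. The data and object names are invented for illustration and have nothing to do with adult.csv.

```r
# Toy data: category A has 8 Goods / 2 Bads, B has 5 / 5, C has 2 / 8
x <- rep(c("A", "B", "C"), each = 10)
y <- c(rep(1, 8), rep(0, 2),
       rep(1, 5), rep(0, 5),
       rep(1, 2), rep(0, 8))

goods <- tapply(y == 1, x, sum)   # Goods per category
bads  <- tapply(y == 0, x, sum)   # Bads per category

pct_good <- goods / sum(goods)    # share of all Goods in each category
pct_bad  <- bads / sum(bads)      # share of all Bads in each category

woe <- log(pct_good / pct_bad)    # WOE per category
iv  <- (pct_good - pct_bad) * woe # IV contribution per category

woe_table <- data.frame(
  category = names(goods),
  goods    = as.vector(goods),
  bads     = as.vector(bads),
  woe      = as.vector(woe),
  iv       = as.vector(iv)
)
woe_table

# Total IV of the variable: sum over its categories
total_iv <- sum(iv)
total_iv  # 0.8 * log(4), about 1.109
```

Category B has equal shares of Goods and Bads, so its WOE and IV contribution are both zero; A and C are mirror images and contribute equally. A category with zero Goods or zero Bads would give an infinite WOE, which this sketch does not handle.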

7.0 WOE from the Information Package

7.1 WOE for Continuous Variables and Factor Variables

library(Information)

7.2 Make an independent Variable a Factor

It is important to note the number of bins for the ‘rank’ variable. Since it is a categorical variable, the number of bins equals the number of unique values of the factor. The parameter bins=10 does not apply to a factor variable.


mydata$rank=factor(mydata$rank)

7.3 Compute The Info Value
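
The following sections use an object called IV. A minimal sketch of how such an object can be created with create_infotables from the Information package is shown below. Since the original data are not shown here, the sketch uses simulated stand-in data: mydata_sim, the response name admit, and all simulation parameters are assumptions. With the real data, replace mydata_sim by mydata and admit by the name of the binary response column.

```r
library(Information)

# Hypothetical stand-in data: two continuous predictors and one factor
set.seed(143)
n <- 400
mydata_sim <- data.frame(
  gre  = round(rnorm(n, mean = 580, sd = 115)),
  gpa  = round(runif(n, min = 2.2, max = 4), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)

# Simulated binary response; create_infotables expects a numeric 0/1 column
lin_pred <- -4 + 0.004 * mydata_sim$gre + 0.5 * mydata_sim$gpa -
  0.3 * as.integer(mydata_sim$rank)
mydata_sim$admit <- rbinom(n, size = 1, prob = plogis(lin_pred))

# bins = 10 applies to the continuous variables; the factor keeps its own levels
IV <- create_infotables(data = mydata_sim, y = "admit", bins = 10, parallel = FALSE)

# One row per predictor with its total information value
IV$Summary
```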

7.5 Put the Tables in Data Frames

To get the WOE table for the variable gre, extract it from the Tables list inside the IV object.

We can do this for any of the variables

gre = data.frame(IV$Tables$gre)

gpa = data.frame(IV$Tables$gpa)

X = data.frame(IV$Tables$X)

rank = data.frame(IV$Tables$rank)

7.6 Plot WOE Scores for One Variable

We can plot one variable at a time


plot_infotables(IV, "gre")

7.7 Plot WOE Scores for Many Variables

We can plot many variables at a time


plot_infotables(IV, IV$Summary$Variable[1:4], same_scale=FALSE)

8.0 WOE Binning

This package generates, visualizes, tabulates and deploys a supervised weight of evidence (WOE) binning of variables.

The package woeBinning automates the process of binning of numeric variables and factors with respect to a dichotomous target variable. Additionally, it visualizes the realized binning solution, tabulates it and deploys it to (new) data. All functions can be used with single variables or an entire data frame.

8.1 Binning

woe.binning generates a supervised fine and coarse classing of numeric variables and factors.

woe.tree.binning generates a supervised tree-like segmentation of numeric variables and factors.

woe.binning.plot visualizes the binning solution generated and saved via woe.binning or woe.tree.binning.

woe.binning.table tabulates the binning solution generated and saved via woe.binning or woe.tree.binning.

woe.binning.deploy deploys the binning solution generated and saved via woe.binning or woe.tree.binning to (new) data.

8.2 Examples


# Load the woeBinning package, which provides the German credit data
library("woeBinning")
# Load German credit data and create subset
data(germancredit)
df <- germancredit[, c('creditability', 'credit.amount', 'duration.in.month',
                  'savings.account.and.bonds', 'purpose')]
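
The functions listed in 8.1 can then be chained roughly as follows. This sketch uses the default settings of each function; the object names binning, binning_table and df_binned are mine.

```r
library("woeBinning")

# German credit data shipped with woeBinning, reduced to a few columns
data(germancredit)
df <- germancredit[, c('creditability', 'credit.amount', 'duration.in.month',
                       'savings.account.and.bonds', 'purpose')]

# Bin all predictors in df with respect to the binary target 'creditability'
binning <- woe.binning(df, 'creditability', df)

# Visualize and tabulate the binning solution
woe.binning.plot(binning)
binning_table <- woe.binning.table(binning)

# Deploy the binning: adds the binned versions of the predictors to the data
df_binned <- woe.binning.deploy(df, binning)
head(df_binned)
```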
