The aim of this document is to introduce the reader to Boruta, a modern feature selection algorithm used for dimension reduction. The method is widely used in Kaggle competitions to perform feature selection before further predictive models are developed.

Introduction

Dimension reduction has several advantages from a machine learning point of view. Firstly, since the model has fewer degrees of freedom, the risk of overfitting (a gap between the errors obtained on the training and validation data sets, i.e. poor generalization) is lower. Secondly, if one uses feature selection methods or linear methods (e.g. LDA, PCA), the reduction promotes the most important variables (or their linear combinations), which improves the interpretability of the resulting model. There are many approaches to feature selection:

iterative algorithms (subset selection, stepwise regression) that add and remove variables from the model until some criterion is met
shrinkage and penalization methods, which add a penalty term to the cost function when too many variables are considered, so that over consecutive iterations some coefficients shrink to zero
model selection based on information criteria (BIC, AIC)
feature selection based on the correlation coefficient or VIF (Variance Inflation Factor)
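
To make the first and third families above more concrete, here is a minimal sketch of stepwise selection driven by AIC, using only base R. The built-in mtcars data set and the choice of mpg as the response are assumptions made purely for illustration; they are not part of the Ozone analysis in this document.

```r
# Full linear model with every available predictor (mtcars is an illustrative choice)
full <- lm(mpg ~ ., data = mtcars)

# Stepwise search: terms are dropped (and possibly re-added) while AIC improves
selected <- step(full, direction = "both", trace = 0)

# Formula of the reduced model and the AIC of both models
formula(selected)
c(full = AIC(full), selected = AIC(selected))
```

The reduced model keeps only the terms whose removal would worsen AIC, which is a minimal-optimal style of selection rather than the all-relevant one discussed next.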

The method that we want to introduce today is an all-relevant feature selection method based on a fully supervised approach (that is, features are selected according to how much each variable contributes to explaining the response variable).

Boruta algorithm

The Boruta method was invented by two Polish researchers working at the University of Warsaw: Miron Kursa and Witold Rudnicki. It works as a wrapper algorithm around Random Forest. As mentioned before, this method follows the all-relevant approach to selection: it tries to find every feature that carries information about the response. In contrast, most traditional feature selection algorithms follow a minimal-optimal approach, relying on a small subset of features that yields a minimal error on a chosen classifier. Readers familiar with how a random forest works will know that it delivers an importance measure, which indicates which features mattered in the classification or regression task. The measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index, and for regression it is measured by the residual sum of squares.
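
As an illustration of the impurity-based importance described above, the sketch below fits a small forest and prints the importance of each feature. It assumes the ranger package is installed (recent versions of Boruta depend on it); the built-in iris data set and the number of trees are arbitrary choices for the example.

```r
library(ranger)

set.seed(1)

# Classification forest; importance = "impurity" requests the Gini-based measure
rf <- ranger(Species ~ ., data = iris, num.trees = 200, importance = "impurity")

# Named numeric vector: one impurity-decrease score per feature, largest first
imp <- sort(rf$variable.importance, decreasing = TRUE)
imp
```
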
Let’s list the core steps of the Boruta algorithm:

firstly, the data set is extended by adding copies of all variables (extending the information system); these copies are called shadow attributes
the added attributes are shuffled randomly in order to remove any correlation with the response variable
a random forest classifier is run on the whole extended data set and Z-scores are computed for all attributes (another importance measure implemented in the basic random forest)
out of all shadow attributes, find the one with the maximum Z-score and then assign a hit to every attribute that scored better than it
for each attribute with undetermined importance, perform a two-sided test of equality with the Z-score obtained for the shadow attribute with the maximum Z-score
mark the attributes whose importance is significantly lower than that of the shadow attribute with the maximum Z-score as ‘unimportant’ and permanently remove them from the data set; mark those whose importance is significantly higher as ‘important’
remove all shadow, artificially added attributes
repeat the procedure until importance has been assigned to all the attributes, or the algorithm has reached the previously set limit of random forest runs.
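
The first four steps above can be sketched in a few lines of R. This is a simplified illustration, not the Boruta implementation: it assumes the ranger package is installed, uses a synthetic data set (x1 and x2 drive the response, noise does not), and only counts hits over a fixed number of runs, leaving out the statistical test and the removal of rejected attributes. The names one_run, n_runs and the shadow_ prefix are hypothetical.

```r
library(ranger)

set.seed(42)
n <- 300
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
y <- 2 * X$x1 - X$x2 + rnorm(n, sd = 0.5)

one_run <- function(X, y) {
  # Steps 1-2: copy every attribute and shuffle each copy independently
  shadows <- as.data.frame(lapply(X, sample))
  names(shadows) <- paste0("shadow_", names(X))
  extended <- cbind(X, shadows, y = y)

  # Step 3: random forest on the extended data; scaled permutation
  # importance plays the role of the Z-score
  rf <- ranger(y ~ ., data = extended, num.trees = 500,
               importance = "permutation",
               scale.permutation.importance = TRUE)
  z <- rf$variable.importance

  # Step 4: a hit for every real attribute that beats the best shadow
  is_shadow <- startsWith(names(z), "shadow_")
  z[!is_shadow] > max(z[is_shadow])
}

# Repeat the run and count hits per attribute
n_runs <- 10
hit_matrix <- replicate(n_runs, one_run(X, y))
counts <- rowSums(hit_matrix)
counts
```

Attributes that collect hits in nearly every run are candidates for ‘important’, while those that rarely beat the best shadow are candidates for ‘unimportant’; Boruta makes that decision with a statistical test rather than by eye.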

Code snippet for basic usage

We will show how this algorithm works on the Ozone data set, which can be found in the mlbench package. Let’s start by loading the relevant packages.

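# Install mlbench (provides the Ozone data) only if it is missing, then load it
if (!requireNamespace("mlbench", quietly = TRUE)) {
  install.packages("mlbench")
}
library(mlbench)

# Same for Boruta itself
if (!requireNamespace("Boruta", quietly = TRUE)) {
  install.packages("Boruta")
}
library(Boruta)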

data("Ozone")
ozone <- na.omit(Ozone)
summary(ozone)

##        V1           V2      V3           V4              V5      
##  3      :21   9      :  9   1:37   Min.   : 1.00   Min.   :5320  
##  4      :21   12     :  9   2:45   1st Qu.: 5.00   1st Qu.:5690  
##  12     :21   13     :  8   3:43   Median : 9.00   Median :5760  
##  10     :18   14     :  8   4:36   Mean   :11.37   Mean   :5746  
##  1      :17   15     :  8   5:42   3rd Qu.:16.00   3rd Qu.:5830  
##  2      :17   22     :  8   6: 0   Max.   :38.00   Max.   :5950  
##  (Other):88   (Other):153   7: 0                                 
##        V6               V7              V8              V9       
##  Min.   : 0.000   Min.   :19.00   Min.   :25.00   Min.   :27.68  
##  1st Qu.: 3.000   1st Qu.:46.00   1st Qu.:51.50   1st Qu.:49.64  
##  Median : 5.000   Median :64.00   Median :61.00   Median :56.48  
##  Mean   : 4.867   Mean   :57.61   Mean   :61.11   Mean   :56.54  
##  3rd Qu.: 6.000   3rd Qu.:73.00   3rd Qu.:71.00   3rd Qu.:66.20  
##  Max.   :11.000   Max.   :93.00   Max.   :93.00   Max.   :82.58  
##                                                                  
##       V10            V11              V12             V13       
##  Min.   : 111   Min.   :-69.00   Min.   :27.50   Min.   :  0.0  
##  1st Qu.: 869   1st Qu.:-14.00   1st Qu.:51.26   1st Qu.: 60.0  
##  Median :2083   Median : 18.00   Median :60.98   Median :100.0  
##  Mean   :2602   Mean   : 14.43   Mean   :60.69   Mean   :122.2  
##  3rd Qu.:5000   3rd Qu.: 43.00   3rd Qu.:70.88   3rd Qu.:150.0  
##  Max.   :5000   Max.   :107.00   Max.   :90.68   Max.   :350.0  
##

This data set contains the data needed to predict the daily maximum one-hour-average ozone reading (V4). The listed features describe e.g. temperature, humidity, wind speed, pressure and visibility. Before we start with dimension reduction, we need to ensure the reproducibility of the results by setting a random seed. As described above, the algorithm contains many steps that perform random permutations, which might produce different results on different machines and in different R sessions. The ntree parameter sets the number of trees grown in each random forest run; the limit on the number of random forest runs mentioned earlier is controlled by a separate argument, maxRuns (100 by default).

set.seed(42)
Boruta.Ozone <- Boruta(V4 ~ ., data = ozone, ntree = 500)
Boruta.Ozone

## Boruta performed 18 iterations in 1.744612 secs.
##  9 attributes confirmed important: V1, V10, V11, V12, V13 and 4
## more.
##  3 attributes confirmed unimportant: V2, V3, V6.

The Ozone set consists of 12 predictor attributes (besides the response V4), three of which are rejected by our run (V2, V3, V6). In the plot below we can see the variability of Z-scores among all features.

plot(Boruta.Ozone)

The box between V2 and V5 is the Z-score distribution for the shadow feature with the maximum Z-score. One can see that it clearly separates the relevant features (green) from the irrelevant ones (red). The blue boxes correspond to the minimal, average and maximum Z-score of the shadow attributes. Now one can proceed with the selected variables to develop other predictive algorithms.
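
As a possible next step, the sketch below pulls the confirmed attributes out of the Boruta result and uses them in a follow-up model. It repeats the run from above so that it is self-contained; the names boruta_fit, selected and reduced_model are hypothetical, and the linear model is only a placeholder for whatever predictive algorithm one prefers.

```r
library(mlbench)
library(Boruta)

data("Ozone")
ozone <- na.omit(Ozone)

set.seed(42)
boruta_fit <- Boruta(V4 ~ ., data = ozone, ntree = 500)

# Names of the attributes confirmed as important (tentative ones excluded)
selected <- getSelectedAttributes(boruta_fit, withTentative = FALSE)
selected

# Per-attribute importance statistics and the final decision
attStats(boruta_fit)

# Build a formula from the confirmed attributes and fit a follow-up model
reduced_formula <- reformulate(selected, response = "V4")
reduced_model <- lm(reduced_formula, data = ozone)
```

If some attributes are left as tentative after the run, TentativeRoughFix() from the same package can be used to force a decision on them before the selection is extracted.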

Boruta - modern dimension reduction algorithm

Adam Kordeczka

13 03 2018
