The aim of this document is to introduce the reader to Boruta, a modern feature selection algorithm that can be used for dimension reduction. The method is widely used in Kaggle competitions to perform feature selection before building predictive models.
Dimension reduction has several advantages from a machine learning point of view. First, because the model has fewer degrees of freedom, the risk of overfitting (a gap between the errors obtained on the training and validation data sets, that is, poor generalization) is lower. Second, feature selection methods and linear methods (e.g. LDA, PCA) promote the most important variables (or their linear combinations), which improves the interpretability of the resulting model. There are many approaches to performing feature selection.
The method introduced here performs all-relevant feature selection with a fully supervised approach, meaning that features are selected according to how much each variable contributes to explaining the response variable.
The Boruta method was invented by two Polish researchers working at the University of Warsaw: Miron Kursa and Witold Rudnicki. It works as a wrapper algorithm around a random forest. As mentioned before, it follows the all-relevant approach: it tries to find every feature that carries information about the response. In contrast, most traditional feature selection algorithms follow a minimal-optimal approach, relying on a small subset of features that yields the minimal error for a chosen classifier. Readers familiar with random forests will know that the model delivers an importance measure indicating which features mattered for the classification or regression task. One common measure is the total decrease in node impurity from splitting on the variable, averaged over all trees. For classification, node impurity is measured by the Gini index; for regression, it is measured by the residual sum of squares. Note that the Boruta package by default relies on Z-scores of a permutation-based measure (mean decrease in accuracy) rather than on the impurity-based one.
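To make the idea of an impurity-based importance measure concrete, here is a minimal sketch. It assumes the ranger package is installed (Boruta depends on it) and uses the built-in iris data purely as a stand-in example; neither choice comes from the original analysis.

```r
# Assumption: the ranger package is available; iris is only an illustration.
library(ranger)

set.seed(1)

# Fit a classification forest and ask for impurity (Gini) based importance.
rf <- ranger(Species ~ ., data = iris,
             num.trees = 500, importance = "impurity")

# One importance value per predictor; larger means more useful for splitting.
imp <- sort(rf$variable.importance, decreasing = TRUE)
print(imp)
```

Replacing `importance = "impurity"` with `importance = "permutation"` requests the permutation-based measure instead.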
Let’s list the core steps of the Boruta algorithm:
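The central idea, comparing each real feature against randomly permuted "shadow" copies, can be sketched in base R. This is a simplified illustration and not the package's implementation: the toy data (`x1`, `x2`, `x3`), the number of runs, and the use of absolute Spearman correlation as a stand-in for random forest importance are all assumptions made so that the sketch runs without extra packages. The final binomial test only mimics the kind of hit-count test Boruta applies.

```r
set.seed(42)
n <- 200

# Hypothetical toy data: x1 and x2 drive the response, x3 is pure noise.
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 - 1.5 * x$x2 + rnorm(n)

# Stand-in importance: absolute Spearman correlation with the response.
# (Boruta itself uses random forest importance Z-scores here.)
importance <- function(df, y) {
  sapply(df, function(col) abs(cor(col, y, method = "spearman")))
}

n_runs <- 50
hits <- setNames(integer(ncol(x)), names(x))

for (i in seq_len(n_runs)) {
  # 1. Build shadow features: permuting each column breaks its link to y.
  shadows <- as.data.frame(lapply(x, sample))
  names(shadows) <- paste0("shadow_", names(x))

  # 2. Score real and shadow features together.
  imp <- importance(cbind(x, shadows), y)

  # 3. A real feature records a hit when it beats the best shadow feature.
  best_shadow <- max(imp[names(shadows)])
  hits <- hits + (imp[names(x)] > best_shadow)
}

# 4. Features that win far more often than chance are treated as relevant.
p_values <- sapply(hits, function(h) {
  binom.test(h, n_runs, p = 0.5, alternative = "greater")$p.value
})

print(hits)
print(p_values)
```

Because the stand-in importance is deterministic, only the shadow scores vary between runs; with a random forest, the scores of the real features vary as well, which is why Boruta works with Z-scores.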
# Code snippet for basic usage
We will show how this algorithm works on the Ozone data set, which is available in the mlbench package. Let’s start by loading the relevant packages.
# Install each package only if it is missing, then load it.
if (!requireNamespace("mlbench", quietly = TRUE)) {
  install.packages("mlbench")
}
library(mlbench)

if (!requireNamespace("Boruta", quietly = TRUE)) {
  install.packages("Boruta")
}
library(Boruta)
data("Ozone")
ozone <- na.omit(Ozone)
summary(ozone)
##        V1            V2       V3           V4              V5
##  3      :21   9      :  9   1:37   Min.   : 1.00   Min.   :5320
##  4      :21   12     :  9   2:45   1st Qu.: 5.00   1st Qu.:5690
##  12     :21   13     :  8   3:43   Median : 9.00   Median :5760
##  10     :18   14     :  8   4:36   Mean   :11.37   Mean   :5746
##  1      :17   15     :  8   5:42   3rd Qu.:16.00   3rd Qu.:5830
##  2      :17   22     :  8   6: 0   Max.   :38.00   Max.   :5950
##  (Other):88   (Other):153   7: 0
##        V6               V7              V8              V9
##  Min.   : 0.000   Min.   :19.00   Min.   :25.00   Min.   :27.68
##  1st Qu.: 3.000   1st Qu.:46.00   1st Qu.:51.50   1st Qu.:49.64
##  Median : 5.000   Median :64.00   Median :61.00   Median :56.48
##  Mean   : 4.867   Mean   :57.61   Mean   :61.11   Mean   :56.54
##  3rd Qu.: 6.000   3rd Qu.:73.00   3rd Qu.:71.00   3rd Qu.:66.20
##  Max.   :11.000   Max.   :93.00   Max.   :93.00   Max.   :82.58
##
##       V10            V11              V12             V13
##  Min.   : 111   Min.   :-69.00   Min.   :27.50   Min.   :  0.0
##  1st Qu.: 869   1st Qu.:-14.00   1st Qu.:51.26   1st Qu.: 60.0
##  Median :2083   Median : 18.00   Median :60.98   Median :100.0
##  Mean   :2602   Mean   : 14.43   Mean   :60.69   Mean   :122.2
##  3rd Qu.:5000   3rd Qu.: 43.00   3rd Qu.:70.88   3rd Qu.:150.0
##  Max.   :5000   Max.   :107.00   Max.   :90.68   Max.   :350.0
##
This data set contains the data needed to predict the daily maximum one-hour-average ozone reading (V4). The listed features describe, for example, temperature, humidity, wind speed, pressure, and visibility. Before we start with dimension reduction, we need to ensure the reproducibility of the results by setting a random seed. As pointed out in the introduction, the algorithm contains many steps that perform random permutations, which might produce different results on different machines and in different R sessions. The ntree parameter sets the number of trees grown in each random forest run; the limit on the number of random forest runs is controlled by a separate parameter, maxRuns (100 by default).
set.seed(42)
Boruta.Ozone <- Boruta(V4 ~ ., data = ozone, ntree = 500)
Boruta.Ozone
## Boruta performed 18 iterations in 1.744612 secs.
## 9 attributes confirmed important: V1, V10, V11, V12, V13 and 4
## more.
## 3 attributes confirmed unimportant: V2, V3, V6.
The Ozone data set has 12 predictor attributes, three of which are rejected in our run (V2, V3, V6). The plot below shows the variability of Z-scores across all features.
plot(Boruta.Ozone)
The box between V2 and V5 is the Z-score distribution of the shadow feature with the maximum Z-score. One can see that it clearly separates the relevant features (green) from the irrelevant ones (red). The blue boxes correspond to the minimal, average, and maximum Z-score of the shadow attributes. One can now proceed with the selected variables to develop other predictive models.
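One possible way to carry the selected variables into a follow-up model is sketched below. It repeats the Boruta run so that the snippet is self-contained; the object names (`boruta_fit`, `selected`, `confirmed_formula`, `fit`) and the choice of a linear model are illustrative assumptions, not part of the original analysis.

```r
library(mlbench)
library(Boruta)

data("Ozone")
ozone <- na.omit(Ozone)

set.seed(42)
boruta_fit <- Boruta(V4 ~ ., data = ozone, ntree = 500)

# Per-attribute importance statistics together with the final decision.
print(attStats(boruta_fit))

# Names of the confirmed attributes (tentative ones are left out).
selected <- getSelectedAttributes(boruta_fit, withTentative = FALSE)
print(selected)

# Formula with the original response and only the confirmed attributes.
confirmed_formula <- getConfirmedFormula(boruta_fit)

# Illustrative follow-up model; any other learner could be used instead.
fit <- lm(confirmed_formula, data = ozone)
print(summary(fit)$r.squared)
```

If a run leaves some attributes tentative, `getNonRejectedFormula()` keeps them in the formula as well.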
We reduced the feature space from 12 attributes to 9 without in-depth analysis or many lines of code. The Boruta algorithm is simple to use: there aren’t many parameters to tune. One has to remember that the data set must be complete (no NAs); otherwise an error occurs.
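A quick completeness check before calling Boruta might look like the sketch below. It uses the airquality data set that ships with base R as a hypothetical example with missing values. Dropping incomplete rows mirrors the na.omit call used earlier, though imputation may be preferable when many rows would be lost.

```r
# airquality ships with base R and contains missing values.
data("airquality")

# Count missing values per column to see where the gaps are.
print(colSums(is.na(airquality)))

# Keep only rows without any missing value.
complete <- complete.cases(airquality)
aq_clean <- airquality[complete, ]

# Confirm the cleaned data is complete before passing it on.
print(anyNA(aq_clean))
print(c(before = nrow(airquality), after = nrow(aq_clean)))
```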