In this introductory post I explore my favorite machine learner, Random Forest. Specifically, I will be applying it to the Wisconsin Breast Cancer dataset to extract Feature Importance in classifying a benign or malignant response.
Random Forest (RF) is a non-parametric ensemble method that can be used for both classification and regression. Developed by Leo Breiman, RF works by aggregating many decision trees (500, 1000, etc.) into a “forest”. Each tree casts a “vote” for the class of a given observation, and the forest’s prediction is the aggregate of those votes; along the way, the trees also accumulate importance scores for the input variables they use.
As an example, consider a data set consisting of n observations (rows) and m input variables (columns). For each tree, Breiman’s method of bagging (bootstrap + aggregating) works like this:

- Draw a bootstrap sample of the n observations (sampling with replacement); roughly one third of the observations are left out of the sample (“out-of-bag”).
- Grow a classification tree on the bootstrap sample, considering only a random subset of the m input variables at each split.
- Repeat for every tree in the forest, and aggregate the trees’ votes to form the final classification.
If, in a subsequent tree, excluding previously used variables yields a better classification, those excluded variables are weighted lower in importance, and vice versa.
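To make the resampling step concrete, here is a small illustrative sketch (hypothetical, not taken from this analysis) of drawing a single bootstrap sample and identifying the out-of-bag rows:

```r
# One hypothetical bagging iteration on a dataset of n rows
n <- 683
in_bag <- sample(n, size = n, replace = TRUE)   # bootstrap sample used to grow one tree
oob    <- setdiff(seq_len(n), in_bag)           # rows never drawn: the "out-of-bag" set
length(unique(in_bag)) / n                      # ~0.63 of rows appear in-bag
length(oob) / n                                 # ~0.37 of rows are out-of-bag
```

Because each tree gets its own bootstrap sample, every observation is out-of-bag for a sizeable fraction of the trees, which is what makes the OOB error estimate used later in this post possible.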
This bagging methodology gives RF many excellent properties, the most relevant of which here are:

- an internal (out-of-bag) estimate of prediction error, with no need for a separate cross-validation set;
- built-in measures of variable (feature) importance;
- strong performance with little to no tuning by the user.
The Wisconsin Breast Cancer dataset is made up of 683 samples. It contains two response classes:

- benign
- malignant
The nine input variables (phenotypes) describing each sample are:

- Clump Thickness
- Uniformity of Cell Size
- Uniformity of Cell Shape
- Marginal Adhesion
- Single Epithelial Cell Size
- Bare Nuclei
- Bland Chromatin
- Normal Nucleoli
- Mitoses
A density comparison of the nine input variables (grouped by class) reveals obvious distinctions between the characteristics of benign and malignant cells:
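As a rough sketch of how such a comparison can be produced (assuming the data are taken from the `mlbench` package’s `BreastCancer` set, one common source for this dataset, and reshaped with `tidyr` for plotting with `ggplot2`; the exact code used for the figure isn’t shown here):

```r
library(mlbench)
library(ggplot2)
library(tidyr)

data(BreastCancer)
bc <- na.omit(BreastCancer)   # keep the 683 complete samples
bc$Id <- NULL                 # drop the sample ID column
# convert the factor-coded 1-10 scores to numeric
bc[, 1:9] <- lapply(bc[, 1:9], function(x) as.numeric(as.character(x)))

# density of each of the nine inputs, grouped by class
bc_long <- pivot_longer(bc, cols = -Class, names_to = "feature", values_to = "value")
ggplot(bc_long, aes(x = value, fill = Class)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ feature, scales = "free")
```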
There are numerous ports of the Random Forest algorithm in R, but here I am using R’s basic `randomForest` package. As a note of interest, I have retained all of the `randomForest()` default values (ntree = 500, etc.) in order to demonstrate just how powerful the algorithm is with minimal (if any) tuning by the user.
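A minimal sketch of the fit, continuing from the `bc` data frame prepared above (the seed is arbitrary and only for reproducibility):

```r
library(randomForest)

set.seed(1)
# all defaults: ntree = 500, mtry = floor(sqrt(9)) = 3, etc.
rf <- randomForest(Class ~ ., data = bc)
rf   # printing the object shows the OOB error estimate and confusion matrix
```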
Out-Of-Bag (OOB)
For OOB prediction validation, I created a binary vector of TRUE or FALSE values based on whether or not the OOB prediction matched the actual class.
| oob | actual | correct |
|---|---|---|
| benign | benign | TRUE |
| malignant | benign | FALSE |
| benign | benign | TRUE |
| malignant | benign | FALSE |
| benign | benign | TRUE |
| malignant | malignant | TRUE |
| … | … | … |
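A sketch of how such a comparison could be built (assuming the `rf` fit from above; `randomForest` stores each sample’s OOB prediction in `rf$predicted`):

```r
# compare each sample's out-of-bag prediction with its actual class
oob_check <- data.frame(
  oob     = rf$predicted,                # OOB prediction for each sample
  actual  = bc$Class,
  correct = rf$predicted == bc$Class     # TRUE/FALSE vector described above
)
head(oob_check)
```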
A look at the frequency of correct predictions is a testament to RF’s OOB strength:
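For reference, that frequency can be tallied directly from the vector above (a sketch):

```r
table(oob_check$correct)   # counts of FALSE / TRUE
mean(oob_check$correct)    # proportion of correct OOB predictions
```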
Percent Increase in MSE (%IncMSE)
As previously mentioned, during the forest’s development each tree uses only a subset of the input variables (i.e., “features”). If excluding a previously used feature degrades the classification (i.e., increases the error), that feature shows a higher % increase in MSE. So, logically, input variables yielding a higher %IncMSE when excluded have a higher Feature Importance.
Taking a look at the feature importance yields the following visual:
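One way to produce such a visual is sketched below. Note that `randomForest` reports %IncMSE when the response is treated as numeric (regression mode); with a factor response it reports MeanDecreaseAccuracy instead. The sketch therefore assumes a numerically coded response, which is one way %IncMSE figures like these could be obtained:

```r
# regression-mode fit with a 0/1 response so that importance() reports %IncMSE
bc_num <- bc
bc_num$Class <- as.numeric(bc_num$Class == "malignant")   # 0 = benign, 1 = malignant

rf_reg <- randomForest(Class ~ ., data = bc_num, importance = TRUE)
importance(rf_reg)             # %IncMSE and IncNodePurity for each feature
varImpPlot(rf_reg, type = 1)   # type = 1 plots the %IncMSE ranking
```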
As we can see, the Uniformity of Cell Size feature is ranked as having the highest importance according to our RF model.
Now, this result would make a lot of sense to any cell biologist who has ever compared cancer cells with normal cells. However, we can use another method to confirm that Uniformity of Cell Size is a highly important class descriptor: Partial Dependence.
Partial Dependence
The `partialPlot()` function in R’s `randomForest` package depicts the marginal effect of a variable of interest (in our case, Uniformity of Cell Size) on the class response.
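A sketch of the call (assuming the `rf` and `bc` objects from above; `Cell.size` is the `mlbench` column name corresponding to Uniformity of Cell Size):

```r
# marginal effect of Uniformity of Cell Size on the class response (malignant)
partialPlot(rf, pred.data = bc, x.var = "Cell.size", which.class = "malignant",
            main = "Partial Dependence on Uniformity of Cell Size")
```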
Looking at the partial dependence plot below, we can see a strong relationship between Uniformity of Cell Size and the predicted class:
The analysis performed here is just the tip of the iceberg with regard to the efficacy of the Random Forest algorithm. My aim with this post was to show that with minimal (often no) tuning of R’s `randomForest()` function, the user can extract a plethora of valuable information from a dataset.
And Feature Importance is by no means where RF’s utility ends! The bagging methodology that makes RF such a powerhouse also makes it highly robust for classification, regression, and prediction. It’s what makes RF my go-to ML algorithm.