In this introductory post I explore my favorite machine learner, Random Forest. Specifically, I will be applying it to the Wisconsin Breast Cancer dataset to extract Feature Importance in classifying a benign or malignant response.
Random Forest (RF) is a non-parametric ensemble method that can be used for both classification and regression. Developed by Leo Breiman, RF works by aggregating many decision trees (500, 1000, etc.) into a “forest”. Each tree casts a “vote” for the class of a given observation, and the forest’s prediction is the aggregate of those votes; along the way, the trees also accumulate importance scores for the input variables they use.
As an example, consider a data set consisting of n observations (rows) and m input variables (columns). For each tree, Breiman’s method of bagging (bootstrap + aggregating) works like this:

- Draw a bootstrap sample of the n observations (sampling with replacement); roughly one third of the observations are left out of the sample (“out-of-bag”).
- Grow a classification tree on the bootstrap sample, considering only a random subset of the m input variables at each split.
- Repeat for every tree in the forest, and aggregate the trees’ votes to form the final classification.
If, in a subsequent tree, excluding previously used variables yields a better classification, those excluded variables are weighted lower in importance, and vice versa.
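To make the resampling step concrete, here is a small illustrative sketch (hypothetical, not taken from this analysis) of drawing a single bootstrap sample and identifying the out-of-bag rows:

```r
# One hypothetical bagging iteration on a dataset of n rows
n <- 683
in_bag <- sample(n, size = n, replace = TRUE)   # bootstrap sample used to grow one tree
oob    <- setdiff(seq_len(n), in_bag)           # rows never drawn: the "out-of-bag" set
length(unique(in_bag)) / n                      # ~0.63 of rows appear in-bag
length(oob) / n                                 # ~0.37 of rows are out-of-bag
```

Because each tree gets its own bootstrap sample, every observation is out-of-bag for a sizeable fraction of the trees, which is what makes the OOB error estimate used later in this post possible.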
This bagging methodology gives RF many excellent properties, the most relevant of which here are:

- an internal (out-of-bag) estimate of prediction error, with no need for a separate cross-validation set;
- built-in measures of variable (feature) importance;
- strong performance with little to no tuning by the user.
The Wisconsin Breast Cancer dataset is made up of 683 samples. It contains two response classes:

- benign
- malignant
The nine input variables (phenotypes) describing each sample are:

- Clump Thickness
- Uniformity of Cell Size
- Uniformity of Cell Shape
- Marginal Adhesion
- Single Epithelial Cell Size
- Bare Nuclei
- Bland Chromatin
- Normal Nucleoli
- Mitoses
A density comparison of the nine input variables (grouped by class) reveals obvious distinctions between the characteristics of benign and malignant cells:
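As a rough sketch of how such a comparison can be produced (assuming the data are taken from the `mlbench` package’s `BreastCancer` set, one common source for this dataset, and reshaped with `tidyr` for plotting with `ggplot2`; the exact code used for the figure isn’t shown here):

```r
library(mlbench)
library(ggplot2)
library(tidyr)

data(BreastCancer)
bc <- na.omit(BreastCancer)   # keep the 683 complete samples
bc$Id <- NULL                 # drop the sample ID column
# convert the factor-coded 1-10 scores to numeric
bc[, 1:9] <- lapply(bc[, 1:9], function(x) as.numeric(as.character(x)))

# density of each of the nine inputs, grouped by class
bc_long <- pivot_longer(bc, cols = -Class, names_to = "feature", values_to = "value")
ggplot(bc_long, aes(x = value, fill = Class)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ feature, scales = "free")
```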
There are numerous ports of the Random Forest algorithm in R, but here I am using R’s basic `randomForest` package. As a note of interest, I have retained all of the `randomForest()` default values (ntree = 500, etc.) in order to demonstrate just how powerful the algorithm is with minimal (if any) tuning by the user.
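A minimal sketch of the fit, continuing from the `bc` data frame prepared above (the seed is arbitrary and only for reproducibility):

```r
library(randomForest)

set.seed(1)
# all defaults: ntree = 500, mtry = floor(sqrt(9)) = 3, etc.
rf <- randomForest(Class ~ ., data = bc)
rf   # printing the object shows the OOB error estimate and confusion matrix
```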
Out-Of-Bag (OOB)
For OOB prediction validation, I created a binary vector of TRUE or FALSE values based on whether or not the OOB prediction matched the actual class.
| oob | actual | correct |
|---|---|---|
| benign | benign | TRUE |
| malignant | benign | FALSE |
| benign | benign | TRUE |
| malignant | benign | FALSE |
| benign | benign | TRUE |
| malignant | malignant | TRUE |
| … | … | … |
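A sketch of how such a comparison could be built (assuming the `rf` fit from above; `randomForest` stores each sample’s OOB prediction in `rf$predicted`):

```r
# compare each sample's out-of-bag prediction with its actual class
oob_check <- data.frame(
  oob     = rf$predicted,                # OOB prediction for each sample
  actual  = bc$Class,
  correct = rf$predicted == bc$Class     # TRUE/FALSE vector described above
)
head(oob_check)
```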
A look at the frequency of correct predictions is a testament to RF’s OOB strength:
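For reference, that frequency can be tallied directly from the vector above (a sketch):

```r
table(oob_check$correct)   # counts of FALSE / TRUE
mean(oob_check$correct)    # proportion of correct OOB predictions
```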
Percent Increase in MSE (%IncMSE)
As previously mentioned, during the forest’s development each tree uses only a subset of the input variables (i.e., “features”). If excluding a previously used feature degrades the classification (i.e., increases the error), that feature shows a higher % increase in MSE. So, logically, input variables yielding a higher %IncMSE when excluded have a higher Feature Importance.
Taking a look at the feature importance yields the following visual:
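One way to produce such a visual is sketched below. Note that `randomForest` reports %IncMSE when the response is treated as numeric (regression mode); with a factor response it reports MeanDecreaseAccuracy instead. The sketch therefore assumes a numerically coded response, which is one way %IncMSE figures like these could be obtained:

```r
# regression-mode fit with a 0/1 response so that importance() reports %IncMSE
bc_num <- bc
bc_num$Class <- as.numeric(bc_num$Class == "malignant")   # 0 = benign, 1 = malignant

rf_reg <- randomForest(Class ~ ., data = bc_num, importance = TRUE)
importance(rf_reg)             # %IncMSE and IncNodePurity for each feature
varImpPlot(rf_reg, type = 1)   # type = 1 plots the %IncMSE ranking
```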
As we can see, the Uniformity of Cell Size feature is ranked as having the highest importance according to our RF model.
Now, this result would make a lot of sense to any cell biologist who has ever compared cancer cells with normal cells. However, we can use another method to confirm that Uniformity of Cell Size is a highly important class descriptor: Partial Dependence.
Partial Dependence
The `partialPlot()` function in R’s `randomForest` package depicts the marginal effect of a variable of interest (in our case, Uniformity of Cell Size) on the class response.
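A sketch of the call (assuming the `rf` and `bc` objects from above; `Cell.size` is the `mlbench` column name corresponding to Uniformity of Cell Size):

```r
# marginal effect of Uniformity of Cell Size on the class response (malignant)
partialPlot(rf, pred.data = bc, x.var = "Cell.size", which.class = "malignant",
            main = "Partial Dependence on Uniformity of Cell Size")
```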
Looking at the partial dependence plot below, we can see a strong relationship between Uniformity of Cell Size and the predicted class:
The analysis performed here is just the tip of the iceberg with regard to the efficacy of the Random Forest algorithm. My aim with this post was to show that with minimal (often no) tuning of R’s `randomForest()` function, the user can extract a plethora of valuable information from a dataset.
And Feature Importance is by no means where RF’s utility ends! The bagging methodology that makes RF such a powerhouse also makes it highly robust for classification, regression, and prediction. It’s what makes RF my go-to ML algorithm.