Random Forest Package

Classification procedures are among the most widely used statistical methods in ecology. Random Forests (RF) is a powerful machine learning classifier that has been used extensively in other fields but is still relatively new to ecology (Cutler et al. 2007). RF models generate many classification trees (training the model) and aggregate their results to create a final model (Liaw and Wiener 2002). Training is stochastic, since the same input can produce a different fitted model each time, but the completed model is deterministic: the same input will always give the same output unless the model is re-trained. Compared to other statistical classifiers, the advantages of RF include: 1) high classification accuracy, 2) the ability to run efficiently on large databases with hundreds of input variables without variable deletion, 3) the ability to estimate variable importance, and 4) the ability to model complex interactions among predictor variables (Breiman 2001, Liaw and Wiener 2002, Cutler et al. 2007, Rodriguez-Galiano et al. 2012).

I had a dataset (n = 134) containing both binomial (mark-recapture) and continuous (habitat percent cover) variables. Because there were many predictor variables and I did not know how they interacted, I used the statistical classifier Random Forests (RF) to model the data instead of generalized linear models. For this model, the predictor variables were vegetation and substrate percent cover within each plot, and the response variable was presence or absence of tiger beetles.

Example rows of the data (Detectability = 1 if a beetle was detected in the plot, 0 if not):

Detectability  Pebble  Pine.twig.litter  Rock.slab  Tree
            1      50                 5         40     5
            0      25                30         15    30
            0      10                15         70     5
            1      35                10         25    30
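Before fitting, the response column needs to be a factor so that randomForest() performs classification rather than regression (see the y argument below). A minimal sketch of preparing the data; the file name and the read.csv() step are assumptions for illustration, not the original workflow:

library(randomForest)                           # load the package before fitting any models
NSvSgroup2 <- read.csv("NSvSgroup2.csv")        # hypothetical file name; the original data source is not specified
NSvSgroup2[, 1] <- as.factor(NSvSgroup2[, 1])   # the 0/1 response must be a factor so randomForest() runs classification
str(NSvSgroup2)                                 # column 1 = detectability (0/1), columns 2:13 = percent-cover predictors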

RF Model

randomForest(x, y, mtry = 5, ntree = 1000, importance = TRUE, proximity = TRUE)
  • x: A data frame or a matrix of predictors, or a formula describing the model to be fitted.
  • y: A response vector. If a factor, classification is assumed, otherwise regression is assumed.
  • mtry: Number of variables randomly sampled as candidates at each split.
  • ntree: Number of trees to grow.
  • importance: Should importance of predictors be assessed?
  • proximity: Should proximity measure among the rows be calculated?
nsg.rf = randomForest(y = NSvSgroup2[, 1], x = NSvSgroup2[, 2:13], mtry = 5, proximity = TRUE, importance = TRUE, ntree = 1000)
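Because training draws random bootstrap samples and random candidate variables at each split, repeated runs of the call above can give slightly different error rates and importance values. If exactly reproducible output is wanted, a seed can be set first (a minimal sketch, not part of the original analysis):

set.seed(2023)   # any fixed value; chosen arbitrarily here for illustration
nsg.rf = randomForest(y = NSvSgroup2[, 1], x = NSvSgroup2[, 2:13], mtry = 5, proximity = TRUE, importance = TRUE, ntree = 1000)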

Out-of-Bag Error

The RF algorithm starts by drawing many bootstrap samples from the original dataset. Each bootstrap sample does not include all of the original data; on average it contains about 63% of the original observations (Cutler et al. 2007). To train the model, a classification tree is fit to each bootstrap sample, but only a random subset of the predictor variables is considered at each node. The number of variables tried at each split (mtry) and the number of trees grown (ntree) can be tuned to improve the accuracy of the model. Once the trees are fully grown, they are used to predict the out-of-bag observations (Cutler et al. 2007), i.e. the observations from the original dataset that do not occur in a given bootstrap sample (Breiman 2001, Cutler et al. 2007). Each out-of-bag observation is classified by a majority vote of the trees that did not train on it, and the proportion of these predictions that are wrong is the out-of-bag (OOB) error rate, which estimates the accuracy of the completed model.
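The fitted object stores both the out-of-bag predictions and the running error rates, so the OOB error reported in the summary below can also be reproduced by hand. A minimal sketch using components that randomForest objects provide:

oob.pred <- nsg.rf$predicted               # each observation predicted only by trees that did not train on it
mean(oob.pred != NSvSgroup2[, 1])          # proportion misclassified, i.e. the OOB error rate
tail(nsg.rf$err.rate, 1)                   # OOB and per-class error rates after the final (1000th) tree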

nsg.rf
## 
## Call:
##  randomForest(x = NSvSgroup2[, 2:13], y = NSvSgroup2[, 1], ntree = 1000,      mtry = 5, importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 10.29%
## Confusion matrix:
##    0  1 class.error
## 0 14  3  0.17647059
## 1  4 47  0.07843137
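The class-specific error rates in the confusion matrix are simply the off-diagonal share of each row, which can be checked directly from the fitted object (a short sketch):

nsg.rf$confusion             # same OOB confusion matrix as printed above
3 / (14 + 3)                 # class 0 error: 3 of 17 true absences misclassified (0.176)
4 / (4 + 47)                 # class 1 error: 4 of 51 true presences misclassified (0.078)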

Variable Importance Plot

To determine which variables are most important to the model, we can use the varImpPlot() function from the randomForest package. Variable importance plots show how much each predictor variable contributes to the accuracy of the model. Higher values of mean decrease in accuracy indicate predictor variables that are more important to the classification (Cutler et al. 2007).

varImpPlot(x, sort = TRUE, n.var, main = "Title")
  • x: Random forest model.
  • sort: Should the variables be sorted in decreasing order of importance?
  • n.var: How many variables to show?
  • main: Plot title.
varImpPlot(nsg.rf, sort = TRUE, n.var = 12, main = "Historical area vs. currently present", pch = 16)

MeanDecreaseAccuracy

  • Tests how much worse the model predicts when the information in a variable is taken away (its values are randomly permuted in the out-of-bag data)
  • The variable at the top of the plot has the maximum importance, i.e. it contributes the most to accuracy
  • If the top variable's values were scrambled while making predictions, the resulting mean decrease in accuracy would be the largest

These plots also show how strong the splits on each variable were, based on the mean decrease in Gini (node impurity). This measure is less reliable because it is based on the training data and does not take the out-of-bag error into consideration.

nsg.rf$importance
##                             0            1 MeanDecreaseAccuracy
## Small.grains      0.030329365  0.004404055         1.016522e-02
## Pebble            0.186985101  0.053171252         8.403731e-02
## Boulder           0.013158730 -0.002617587         1.060731e-03
## Rock.slab         0.064806854  0.008753708         2.263144e-02
## Moss.lichen       0.007436905  0.001797182         2.344414e-03
## Pine.twig.litter  0.167582756  0.043533205         7.339990e-02
## Soil              0.005167857 -0.001636625         2.627528e-05
## Road              0.004197222  0.006239461         5.837909e-03
## Wood.chips        0.000000000  0.000000000         0.000000e+00
## Tree              0.008734524  0.004010677         5.476981e-03
## Shrub            -0.001465476  0.001470912         7.474707e-04
## Forb              0.004925325  0.002624160         2.872497e-03
##                  MeanDecreaseGini
## Small.grains           0.92514215
## Pebble                 6.36942758
## Boulder                0.89707605
## Rock.slab              3.18437211
## Moss.lichen            0.86597258
## Pine.twig.litter       6.55211575
## Soil                   1.04533391
## Road                   0.74323554
## Wood.chips             0.06763972
## Tree                   1.61323568
## Shrub                  1.93785719
## Forb                   0.97057549

The first column lists the predictor variables used in the RF model; the following columns show how much each variable contributes to predicting absences (0), presences (1), or both (MeanDecreaseAccuracy). The last column (MeanDecreaseGini) shows each variable's total decrease in node impurity. Pebble percent cover has the largest mean decrease in accuracy (about 0.084, i.e. 8.4 percentage points), which amounts to roughly 40% of the summed importance across all predictors, just under half of the model's total predictive power.
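The same ranking and shares can be computed directly with the importance() accessor rather than read off the table. A short sketch:

imp <- importance(nsg.rf, type = 1, scale = FALSE)            # unscaled mean decrease in accuracy, as in nsg.rf$importance
imp <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE] # sort variables from most to least important
round(imp / sum(imp), 3)                                       # each variable's share of the summed importance; Pebble is largest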

Partial Dependence Plots

Now that we know which variables are most important to the model, we need to determine their relationship to the response variable. Partial dependence plots graph the marginal effect a predictor variable has on the class probability (the response). The direction of the slope indicates whether that relationship is positive or negative.

partialPlot(x, pred.data, x.var, which.class, xlab = "Title", ylab = "Title")
  • x: Random forest model.
  • pred.data: A data frame used for constructing the plot.
  • x.var: Name of the variable for which partial dependence is to be examined.
  • which.class: For classification data, the class to focus on.
  • xlab: X-axis title.
  • ylab: Y-axis title.
partialPlot(nsg.rf, NSvSgroup2, Pebble, "1", xlab = "Pebble percent cover", ylab = "Log of fraction of votes", lwd = 4, col = "green")

Because this partial plot is for Pebble and is drawn for class = 1 (beetle detected), a curve in the negative region means that class 0 (beetle not detected) is the more common predicted class. At 0% pebble cover the curve is negative, and it rises as pebble cover increases, so non-detection becomes less and less common. The curve never climbs far above zero, but that is fine: being slightly above zero at high pebble cover, and higher than it is at 0% cover, is what shows the beetles' preference for higher pebble percent cover.

partialPlot(nsg.rf, NSvSgroup2, Pine.twig.litter, "1", xlab = "Pine and twig litter percent cover", ylab = "Log of fraction of votes", lwd = 4, col = "red")

As pine and twig litter percent cover increases, class 0 (beetle not detected) becomes more and more common. This indicates a strong negative relationship between beetle presence and pine and twig litter percent cover.

partialPlot(nsg.rf, NSvSgroup2, Rock.slab, "1", xlab = "Rock slab percent cover", ylab = "Log of fraction of votes", lwd = 4, col = "blue")

Rock slab percent cover differs from the previous plots: its partial dependence plot shows a bimodal trend. Beetle presence has a strong positive relationship with rock slab cover around 35-45% and a weaker positive relationship around 60-75%.
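The three calls above can also be written as a single loop over the most important predictors, following the looping pattern shown in the partialPlot() help page (a sketch; the variable list simply mirrors the three plots above):

top.vars <- c("Pebble", "Pine.twig.litter", "Rock.slab")   # top three variables by mean decrease in accuracy
par(mfrow = c(1, 3))                                        # one row of three panels
for (i in seq_along(top.vars)) {
  partialPlot(nsg.rf, NSvSgroup2, top.vars[i], "1",
              xlab = paste(top.vars[i], "percent cover"),
              ylab = "Log of fraction of votes", lwd = 4,
              main = paste("Partial dependence on", top.vars[i]))
}
par(mfrow = c(1, 1))                                        # reset the plotting layout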