Random Forests

Random forests are an extension of bagging. Key features:

  1. Samples are bootstrapped, so each tree is grown on a different resampled version of the training data.
  2. At each split, a random subset of the variables is considered – meaning each split can use a different combination of variables to build its prediction (see the short sketch after the pros and cons below).
  3. We then grow several (or many!) trees and vote on or average their outcomes.

Pros:

  1. Accuracy – often among the most accurate off-the-shelf methods

Cons:

  1. Speed – growing many trees is slow
  2. Interpretability – a forest is much harder to read than a single tree
  3. Overfitting – it is hard to tell which trees are overfitting, so careful cross-validation matters
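
To make feature 2 concrete, here is a minimal sketch using the randomForest package directly (the caret call below wraps this same function); the ntree and mtry values here are just illustrative choices, not recommendations:

require(randomForest)
set.seed(123)                      # reproducible bootstrap samples
rfSketch<-randomForest(Species~.,data=iris,
                       ntree=500,  # number of trees to grow and vote over
                       mtry=2)     # variables randomly sampled at each split (default is sqrt(p) for classification)
rfSketch                           # prints the out-of-bag error estimate and confusion matrix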

Let’s try an example with the iris dataset:

require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
data(iris)
inTrain<-createDataPartition(y=iris$Species,p=.7,list=FALSE)
training<-iris[inTrain,]
testing<-iris[-inTrain,]
modFit<-train(Species~.,data=training,method="rf",prox=TRUE) # prox=TRUE keeps the proximity matrix (used for the class centers below)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
modFit
## Random Forest 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9461078  0.9180661
##   3     0.9462081  0.9182368
##   4     0.9452081  0.9167274
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.
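
Above, caret tried mtry = 2, 3, and 4 and kept the most accurate. If you want to control the candidate mtry values and the resampling scheme yourself, a hypothetical variant of the call above (not part of the original notes) looks like this:

modFit2<-train(Species~.,data=training,method="rf",
               tuneGrid=data.frame(mtry=c(2,3,4)),           # candidate values of mtry to compare
               trControl=trainControl(method="cv",number=5)) # 5-fold cross-validation instead of the default bootstrap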

To look at a specific tree, we can use getTree:

getTree(modFit$finalModel,k=2)
##    left daughter right daughter split var split point status prediction
## 1              2              3         3        4.75      1          0
## 2              4              5         4        0.70      1          0
## 3              6              7         4        1.75      1          0
## 4              0              0         0        0.00     -1          1
## 5              0              0         0        0.00     -1          2
## 6              8              9         1        6.50      1          0
## 7              0              0         0        0.00     -1          3
## 8             10             11         3        4.95      1          0
## 9              0              0         0        0.00     -1          2
## 10             0              0         0        0.00     -1          2
## 11             0              0         0        0.00     -1          3
irisP<-classCenter(training[,c(3,4)],training$Species,modFit$finalModel$prox) # class centers computed from the proximity matrix
irisP<-as.data.frame(irisP);irisP$Species<-rownames(irisP)
ggplot(data=training,aes(x=Petal.Width,y=Petal.Length,col=Species))+
  geom_point()+
  geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP) # X marks the class centers

This is neat: the X marks are the class centers computed from the proximity matrix, and for each species they sit at the center of mass of that species’ points – the region the forest’s predictions gravitate toward.

Predicting new values

pred<-predict(modFit,testing)
testing$predRight<-pred==testing$Species
table(pred,testing$Species)
##             
## pred         setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         1
##   virginica       0          1        14
qplot(Petal.Width,Petal.Length,col=predRight,data=testing,main="newdata Predictions")
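
To boil the table and plot down to a single number, one extra check (not in the original notes) is the overall test-set accuracy; caret’s confusionMatrix reports it along with per-class statistics:

mean(testing$predRight)               # proportion of test cases classified correctly
confusionMatrix(pred,testing$Species) # accuracy plus per-class sensitivity/specificity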

Boosting

Basic Idea:

  1. Take a lot of (possibly) weak predictors
  2. Weight them and add them up
  3. Get a single stronger predictor

Note: Check out AdaBoost, probably the most famous boosting algorithm; a toy sketch of its weight-and-add idea follows.
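
To make steps 1–3 concrete, here is a toy AdaBoost-style sketch of my own (the simulated data and all the object names are made up for illustration) that boosts simple threshold “stumps” on a single predictor:

set.seed(1)
n<-200
x<-runif(n)
y<-ifelse(x>0.5,1,-1)                   # true rule, labels coded -1/+1
flip<-sample(n,20); y[flip]<- -y[flip]  # add some label noise

w<-rep(1/n,n)                           # observation weights, updated every round
alphas<-c(); ths<-c(); sgns<-c()

for(t in 1:10){
  # weak learner: the threshold stump with the lowest weighted error (step 1)
  cand<-expand.grid(th=seq(0.05,0.95,by=0.05),s=c(1,-1))
  errs<-apply(cand,1,function(p) sum(w*(ifelse(x>p["th"],p["s"],-p["s"])!=y)))
  best<-cand[which.min(errs),]
  pred<-ifelse(x>best$th,best$s,-best$s)
  err<-sum(w*(pred!=y))
  alpha<-0.5*log((1-err)/err)           # weight given to this weak learner (step 2)
  w<-w*exp(-alpha*y*pred); w<-w/sum(w)  # up-weight the cases this stump got wrong
  alphas<-c(alphas,alpha); ths<-c(ths,best$th); sgns<-c(sgns,best$s)
}

# Step 3: the boosted predictor is the sign of the weighted sum of all the stumps
boosted<-function(xnew) sign(rowSums(sapply(seq_along(alphas),
  function(t) alphas[t]*ifelse(xnew>ths[t],sgns[t],-sgns[t]))))
mean(boosted(x)==y)                     # training accuracy of the combined predictor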

Boosting libraries in R:

  * gbm: boosting with trees
  * mboost: model-based boosting
  * ada: additive logistic regression
  * gamBoost: boosting for generalized additive models

Let’s try an example with the wage data from the ISLR library.

require(ISLR)
## Loading required package: ISLR
require(ggplot2)
require(caret)
data(Wage)
Wage<-subset(Wage,select=-c(logwage)) # drop logwage, since it is just a transformation of the outcome we are predicting
inTrain<-createDataPartition(y=Wage$wage,p=.7,list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]

Now we fit the model:

modFit<-train(wage~.,data=training,method="gbm",verbose=FALSE)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
## Loading required package: plyr
print(modFit)
## Stochastic Gradient Boosting 
## 
## 2102 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE      Rsquared 
##   1                   50      34.84529  0.3278808
##   1                  100      34.31544  0.3387085
##   1                  150      34.22663  0.3407599
##   2                   50      34.30300  0.3400982
##   2                  100      34.18234  0.3425403
##   2                  150      34.30223  0.3390463
##   3                   50      34.19510  0.3426544
##   3                  100      34.27236  0.3396166
##   3                  150      34.50482  0.3326119
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
qplot(predict(modFit,testing),wage,data=testing)

We can see from the plot above that the model does a decent job of predicting wage, but it certainly does not explain all of the variation (otherwise the points would lie much closer to the diagonal).
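
To put a rough number on that impression, one quick extra check (not part of the original output) is the test-set RMSE; predWage here is just a throwaway name:

predWage<-predict(modFit,testing)
sqrt(mean((predWage-testing$wage)^2)) # root mean squared error on the test set, in the same units as wage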
