Random forests are an extension of bagging. The key features:
* Each tree is grown on a bootstrap sample of the training data.
* At each split, only a random subset of the predictors is considered.
* Many trees are grown, and their predictions are combined by vote (classification) or average (regression).
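As a point of reference, these features map directly onto the randomForest package's arguments: ntree controls the number of bootstrapped trees and mtry the size of the random predictor subset. A minimal sketch, assuming the randomForest package is installed:
# Minimal sketch: fit a random forest directly with the randomForest package.
# ntree = number of bootstrapped trees, mtry = predictors sampled at each split.
library(randomForest)
set.seed(125)
rfFit<-randomForest(Species~.,data=iris,ntree=500,mtry=2)
rfFit  # prints the OOB error estimate and confusion matrix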
Let’s work through an example with the iris data set, this time using caret:
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
data(iris)
inTrain<-createDataPartition(y=iris$Species,p=.7,list=FALSE)  # 70/30 split, stratified by species
training<-iris[inTrain,]
testing<-iris[-inTrain,]
modFit<-train(Species~.,data=training,method="rf",prox=TRUE)  # prox=TRUE keeps the proximity matrix (used for the class centers below)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
modFit
## Random Forest
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 105, 105, 105, 105, 105, 105, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9461078 0.9180661
## 3 0.9462081 0.9182368
## 4 0.9452081 0.9167274
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
To look at a specific tree, we can use getTree:
getTree(modFit$finalModel,k=2)
## left daughter right daughter split var split point status prediction
## 1 2 3 3 4.75 1 0
## 2 4 5 4 0.70 1 0
## 3 6 7 4 1.75 1 0
## 4 0 0 0 0.00 -1 1
## 5 0 0 0 0.00 -1 2
## 6 8 9 1 6.50 1 0
## 7 0 0 0 0.00 -1 3
## 8 10 11 3 4.95 1 0
## 9 0 0 0 0.00 -1 2
## 10 0 0 0 0.00 -1 2
## 11 0 0 0 0.00 -1 3
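We can also compute and plot the class 'centers' of the petal measurements, using the proximity matrix stored in the fitted forest: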
irisP<-classCenter(training[,c(3,4)],training$Species,modFit$finalModel$prox)
irisP<-as.data.frame(irisP);irisP$Species<-rownames(irisP)
ggplot(data=training,aes(x=Petal.Width,y=Petal.Length,col=Species))+geom_point() + geom_point(aes(x=Petal.Width,y=Petal.Length,col=Species),size=5,shape=4,data=irisP)
This is neat because it shows that, for a given species, the predictions tend toward that species’ center of mass (plotted above as the large X’s).
pred<-predict(modFit,testing)
testing$predRight<-pred==testing$Species
table(pred,testing$Species)
##
## pred setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 1
## virginica 0 1 14
qplot(Petal.Width,Petal.Length,col=predRight,data=testing,main="newdata Predictions")
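For a numeric summary of the test-set performance, we can pass the same predictions to caret’s confusionMatrix (reusing pred and testing from above):
confusionMatrix(pred,testing$Species)  # overall accuracy, kappa, and per-class statistics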
Boosting is another ensemble approach: it combines many weak predictors, suitably weighted, into a stronger one. Note: check out AdaBoost, probably the most famous boosting algorithm (a minimal sketch follows the list below).
Boosting libraries in R:
* gbm: boosting with trees
* mboost: model-based boosting
* ada: additive logistic regression (AdaBoost)
* gamBoost: boosting for generalized additive models
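As a rough illustration of AdaBoost itself, the ada package can be used; a minimal sketch, noting that ada() handles only two-class outcomes, so we keep just two iris species:
# Minimal AdaBoost sketch with the ada package (binary classification only).
library(ada)
iris2<-droplevels(subset(iris,Species!="setosa"))  # keep versicolor vs. virginica
adaFit<-ada(Species~.,data=iris2,iter=50)          # 50 boosting iterations
adaFit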
Let’s try an example with the wage data from the ISLR library.
require(ISLR)
## Loading required package: ISLR
require(ggplot2)
require(caret)
data(Wage)
Wage<-subset(Wage,select=-c(logwage))  # drop logwage: it is just a transform of the outcome we want to predict
inTrain<-createDataPartition(y=Wage$wage,p=.7,list=FALSE)  # 70/30 split on wage
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
Now we fit the model:
modFit<-train(wage~.,data=training,method="gbm",verbose=FALSE)  # verbose=FALSE suppresses gbm's per-iteration output
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
## Loading required package: plyr
print(modFit)
## Stochastic Gradient Boosting
##
## 2102 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared
## 1 50 34.84529 0.3278808
## 1 100 34.31544 0.3387085
## 1 150 34.22663 0.3407599
## 2 50 34.30300 0.3400982
## 2 100 34.18234 0.3425403
## 2 150 34.30223 0.3390463
## 3 50 34.19510 0.3426544
## 3 100 34.27236 0.3396166
## 3 150 34.50482 0.3326119
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 100,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
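The selected tuning values correspond to a direct call to gbm(); a rough sketch outside of caret, reusing the training and testing data frames from above (note that predict.gbm needs n.trees specified explicitly; names like gbmFit are just for illustration):
# Sketch: the same boosted model fit directly with gbm (squared-error loss).
library(gbm)
set.seed(825)
gbmFit<-gbm(wage~.,data=training,distribution="gaussian",
            n.trees=100,interaction.depth=2,shrinkage=0.1,n.minobsinnode=10)
head(predict(gbmFit,newdata=testing,n.trees=100))  # predicted wages for the first few test rows
Returning to the caret fit, we can plot predicted versus observed wages on the test set: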
qplot(predict(modFit,testing),wage,data=testing)
We can see from the above plot that the model does a decent job of predicting wage, but it certainly does not explain all of the variation (otherwise the points would lie closer to the diagonal).
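To put a number on that, we can compute the test-set RMSE (reusing modFit and testing from above):
predWage<-predict(modFit,testing)
sqrt(mean((predWage-testing$wage)^2))  # test-set RMSE, comparable to the resampled RMSE above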