The basic idea is to reduce the variance of the model fit by averaging a large number of approximately independent models of the same type. To obtain nearly independent models from the same training set, the training observations are bootstrapped and, at each split, only a random subset of the predictors is considered.
The RF algorithm, by definition, requires fully grown, unpruned trees. This is because RF can only reduce variance, not bias (where error = bias + variance). Since the bias of the entire forest is roughly equal to the bias of a single tree, the base model has to be a very deep tree to guarantee low bias. Variance is subsequently reduced by growing many deep, uncorrelated trees and averaging their predictions.
In practice, random forests seldom overfit. What would tend to favor overfitting is having too many trees in the forest: at some point it is not necessary to keep adding trees (it does not reduce variance anymore, and can even slightly increase it), and it is very hard to tell which trees are responsible, so it is important to use cross-validation when building random forests.
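A quick way to see the "enough trees" point is to watch the out-of-bag (OOB) error as trees are added. The sketch below uses the randomForest package on the built-in iris data; it is illustrative only, and where the error flattens out depends on your data.
library(randomForest)
set.seed(123)
rf = randomForest(Species ~ ., data = iris, ntree = 500)
# err.rate[, "OOB"] is the cumulative out-of-bag error after each tree;
# it typically flattens out well before ntree is reached
plot(rf$err.rate[, "OOB"], type = "l", xlab = "number of trees", ylab = "OOB error")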
Random forests can be thought of as an extension of bagging for classification and regression trees.
If the dataset has one strongly dominant predictor, all bagged trees will use that predictor for their first split, and because of that all the trees will have similar splits. When we average these highly correlated trees, the reduction in variance is much smaller than the reduction we would get from averaging uncorrelated trees.
This problem with the bagging method is overcome by random forests.
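One rough way to see the effect is to compare plain bagging (mtry = p, every predictor available at every split) with a random forest (mtry roughly sqrt(p)). This is only a sketch on iris, where the gap tends to be small because there are just 4 predictors and no single one dominates strongly.
library(randomForest)
set.seed(123)
bag = randomForest(Species ~ ., data = iris, mtry = 4) # all 4 predictors at each split = bagging
rf  = randomForest(Species ~ ., data = iris, mtry = 2) # random subset at each split = random forest
bag$err.rate[nrow(bag$err.rate), "OOB"] # final OOB error, bagging
rf$err.rate[nrow(rf$err.rate), "OOB"]   # final OOB error, random forest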
The basic idea is very similar to bagging: we bootstrap samples, that is, we take resamples of our observed training data, and then we build a classification or regression tree on each of those bootstrap samples.
The one difference is that at each split of each tree, we also take a random sample of the variables. In other words, only a subset of the predictors is considered at each potential split; a common default for classification is m = sqrt(p), where p is the total number of predictors.
This makes for a diverse set of potential trees that can be built. The idea is to grow a large number of trees and then either vote (classification) or average (regression) across them to get the prediction for a new observation.
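The voting step can be made concrete with predict(..., predict.all = TRUE) from the randomForest package, which returns each individual tree's prediction alongside the aggregated forest prediction. A small sketch on iris, predicting on three training rows purely for illustration:
library(randomForest)
set.seed(123)
rf = randomForest(Species ~ ., data = iris, ntree = 100)
pr = predict(rf, iris[c(1, 51, 101), ], predict.all = TRUE)
pr$individual[, 1:5] # votes of the first 5 trees for 3 flowers
# majority vote across all trees matches the forest's own prediction
apply(pr$individual, 1, function(v) names(which.max(table(v))))
pr$aggregate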
The pros of this approach are that it is quite accurate; along with boosting, it is one of the most widely used and most accurate methods for prediction in competitions like Kaggle.
The cons are that it can be quite slow, since it has to build a large number of trees, and it can be hard to interpret, in the sense that you may have a large number of trees averaged together, each built on a bootstrap sample with randomly sampled split variables, which can be a little complicated to understand.
It can also lead to a little bit of overfitting, which is complicated by the fact that it is very hard to understand which trees are leading to that overfitting, and so it is very important to use cross-validation when building random forests.
suppressMessages(library(caret))
data(iris)
names(iris) = tolower(names(iris))
names(iris)
## [1] "sepal.length" "sepal.width" "petal.length" "petal.width"
## [5] "species"
index = createDataPartition(y=iris$species, p=0.7, list=FALSE)
train = iris[index,]
test = iris[-index,]
dim(train)
## [1] 105 5
dim(test)
## [1] 45 5
# fit the model
modfit = train(species ~., data=train,
method="rf",
prox=TRUE,
trControl = trainControl(method = "cv"))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
modfit
## Random Forest
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 93, 95, 95, 93, 95, 96, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9314646 0.8953493
## 3 0.9314646 0.8962963
## 4 0.9414646 0.9116739
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
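Here caret picked mtry from its small default grid. If you want to control which mtry values are tried, you can pass your own grid via tuneGrid; a sketch reusing the train data from above:
modfit2 = train(species ~ ., data = train,
                method = "rf",
                tuneGrid = data.frame(mtry = c(2, 3, 4)),
                trControl = trainControl(method = "cv", number = 10))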
getTree(modfit$finalModel, k=2) # 2ND TREE
## left daughter right daughter split var split point status prediction
## 1 2 3 3 2.50 1 0
## 2 0 0 0 0.00 -1 1
## 3 4 5 3 4.85 1 0
## 4 0 0 0 0.00 -1 2
## 5 6 7 4 1.75 1 0
## 6 8 9 2 2.35 1 0
## 7 0 0 0 0.00 -1 3
## 8 0 0 0 0.00 -1 3
## 9 0 0 0 0.00 -1 2
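Individual trees like the one above are hard to read at scale; one common way to interpret the forest as a whole is variable importance. The calls below assume the modfit object fitted above:
varImp(modfit)                # caret's (scaled) importance for the fitted model
importance(modfit$finalModel) # raw mean decrease in Gini from randomForest
varImpPlot(modfit$finalModel)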
# class label centers (3 in this case using petal.width and petal.length)
irisP = classCenter(x=train[,c(3,4)], label=train$species, prox=modfit$finalModel$prox) # classCenter is a randomForest function
irisP = as.data.frame(irisP) #convert into df
irisP$species = rownames(irisP) # add a column of species
#plot:
p = qplot(petal.width, petal.length, col=species, data=train, size=3)
p + geom_point(aes(x=petal.width, y=petal.length, col=species),
size=8,
shape=4,
data=irisP)
pred = predict(modfit, test)
# add a column in test set for prediction TRUE or FALSE:
test$predRight = pred == test$species
table(pred, test$species)
##
## pred setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 15 1
## virginica 0 0 14
qplot(petal.width, petal.length, col=test$predRight, data=test, size=3, alpha=0.1)
The points where predRight is FALSE are the misclassified observations (per the table above, one virginica flower predicted as versicolor).
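Instead of building the table by hand, caret's confusionMatrix gives the same cross-tabulation plus overall accuracy, kappa, and per-class statistics:
confusionMatrix(pred, test$species)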
Random forests are usually one of the top performing algorithms along with boosting in any prediction contests.
They’re often difficult to interpret because of these multiple trees that we’re fitting but they can be very accurate for a wide range of problems.
You can also check out the rfcv function in the randomForest package for a cross-validated view of the fit, but the train function in caret already handles cross-validation for you (as in the call above).
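For completeness, a sketch of rfcv: it reports cross-validated prediction error as the number of predictors is sequentially reduced, so it is more about feature selection than tuning the forest itself.
cvres = rfcv(trainx = train[, 1:4], trainy = train$species, cv.fold = 5)
cvres$error.cv # CV error for each number of predictors tried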