A random forest combines a tree methodology with “bootstrap” samples of the data. Within each tree, split statistics identify the covariates that play a role at each node, and the forest as a whole tells us how important each covariate is.
Let us assume we have a classification problem with a binary outcome determined by 10 covariates, so there are 10 predictors per subject. (There is, however, no limit on the number of levels the response variable can have.)
In a random forest we build trees with the following process:
1. Select a random sample of subjects, with replacement, from the training data.
2. This is a bootstrap sample: repeating the draw 10 times would give 10 different random samples of subjects, and hence 10 datasets.
3. Build a classification tree on the bootstrap sample, with one modification. Choose and fix a number between 1 and 10 (10 being the number of predictors); this number is called “M”.
4. Follow the classification process outlined previously, except that at the split we choose M predictors at random without replacement and, from this subset, pick the predictor and cut-point that give the best split.
5. Repeat step 4 for the next node. Every time we split a node, the M candidate predictors are drawn afresh.
6. Continue until the lower limit on node size is reached. No pruning is used in this method.
7. Take another bootstrap sample and build a tree as described in steps 3-6.
8. Repeat this bootstrapping and tree building 500 times (more if we choose).
9. For each subject, take the decision made by the majority of the trees; this majority vote is the forest’s classification for that subject.
The aim is to reduce the misclassification rate compared with a single classification tree.
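As a rough sketch of two of these steps in R, the code below draws one bootstrap sample (steps 1-2) and defines a majority-vote function (step 9); the per-split selection of M predictors (steps 4-5) is handled internally by the randomForest package used in the rest of this section. The object tree_preds is hypothetical, standing for a matrix of per-tree predictions.
set.seed(42)
# Steps 1-2: one bootstrap sample of subjects (rows), drawn with replacement
boot_idx <- sample(nrow(iris), replace = TRUE)
boot_sample <- iris[boot_idx, ]
# Step 9: majority vote across trees; tree_preds is a hypothetical matrix with
# one row per subject and one column per tree, holding each tree's predicted class
majority_vote <- function(tree_preds) {
  apply(tree_preds, 1, function(p) names(which.max(table(p))))
}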
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
rf1 <- randomForest(Species~ ., data=iris, importance=T )
rf1
##
## Call:
## randomForest(formula = Species ~ ., data = iris, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.33%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 5 45 0.10
summary(rf1)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 150 factor numeric
## err.rate 2000 -none- numeric
## confusion 12 -none- numeric
## votes 450 matrix numeric
## oob.times 150 -none- numeric
## classes 3 -none- character
## importance 20 -none- numeric
## importanceSD 16 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 150 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
For classification, the default value of M (the mtry argument) is the square root of the number of predictors, rounded down; this is the “No. of variables tried at each split” shown above (here sqrt(4) = 2).
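As a small sketch, the default can be verified and overridden through the mtry argument (rf_m3 is just an illustrative name):
floor(sqrt(4))                                         # 4 predictors in iris, so the default mtry is 2
rf_m3 <- randomForest(Species ~ ., data=iris, mtry=3)  # try 3 variables at each split instead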
OOB = out-of-bag error estimate. Each tree is grown on a bootstrap sample, so roughly a third of the observations are left out (“out of bag”) of any given tree. Each observation is predicted using only the trees that did not see it, and the OOB error rate is the misclassification rate of these predictions aggregated over the whole dataset.
The confusion matrix shows the misclassification rate for each of the species.
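Both quantities can also be read directly off the fitted object; a small sketch:
tail(rf1$err.rate[, "OOB"], 1)   # OOB error after all 500 trees (the 5.33% above)
rf1$confusion                    # confusion matrix with per-class error rates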
We can obtain an approximate 95% confidence interval for the misclassification rate by repeating this fitting process 1000 times in a loop and taking the empirical quantiles of the OOB error.
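A minimal sketch of that idea (reduce the number of repeats for a quicker check):
set.seed(1)
oob <- replicate(1000, {
  fit <- randomForest(Species ~ ., data=iris)
  tail(fit$err.rate[, "OOB"], 1)     # OOB error of this refit
})
quantile(oob, c(0.025, 0.975))       # approximate 95% interval for the OOB error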
varImpPlot(rf1, pch=16, col=2, n.var=4, sort=TRUE, main="Importance of Variables in Iris Data")
Petal length and petal width are the most important predictors in the random forest; sepal length and sepal width matter much less.
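The numeric values behind the plot can be inspected directly:
round(importance(rf1), 2)   # per-class importance plus MeanDecreaseAccuracy and MeanDecreaseGini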
The underlying process is as follows. For each tree, take its out-of-bag observations and shuffle the values of one predictor (say petal length), keeping the other variables unchanged. Predict these scrambled observations with the tree and compare the accuracy with the unshuffled version; averaging the drop over all trees gives the mean decrease in accuracy. A value of about 30% for petal length means the OOB error rate would rise by roughly that much if we scrambled petal length. This gives us a criterion for how much information each predictor carries.
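As a rough illustration of the permutation idea (not the exact OOB-based computation randomForest performs internally), we can scramble one predictor and see how much the forest’s accuracy on the same data drops:
set.seed(2)
scrambled <- iris
scrambled$Petal.Length <- sample(scrambled$Petal.Length)   # shuffle petal length only
mean(predict(rf1, iris) != iris$Species)                   # error with the intact data
mean(predict(rf1, scrambled) != iris$Species)              # error after scrambling petal length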
If we build the random forest with only the important covariates, the error rate can go down.
rf2 <- randomForest(Species ~ Petal.Length + Petal.Width, data=iris, importance=T )
rf2
##
## Call:
## randomForest(formula = Species ~ Petal.Length + Petal.Width, data = iris, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 4%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 3 47 0.06
Thus the random forest importance measures can also serve as a variable selection tool.
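As a hedged sketch of that idea, the importance scores can drive the selection programmatically (type = 1 requests the mean decrease in accuracy; top_vars and rf3 are just illustrative names):
imp <- importance(rf1, type = 1)                               # mean decrease in accuracy for each predictor
top_vars <- rownames(imp)[order(imp, decreasing = TRUE)][1:2]  # the two strongest predictors
rf3 <- randomForest(reformulate(top_vars, "Species"), data = iris)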