A random forest is a combination of a tree methodology and "bootstrap" sampling.
Within each tree, statistics are used to identify the covariates that play a role at each node.
The random forest also tells us the importance of each covariate.

Let us assume we have a classification problem for a dataset where the outcome is binary, and that this outcome is determined by 10 covariates. Thus there are 10 predictors per subject for the outcome.

However, there is no limitation on the number of levels of the response variable.

Theory behind the Process

In a random forest we build many trees using the following process (a small code sketch follows the list):
1. Select a random sample of subjects with replacement from the training data.
2. This gives a bootstrap sample: if we repeat the sampling 10 times, we obtain 10 different bootstrap datasets.
3. Build a classification tree on the bootstrap sample, with a modification. Choose and fix a number between 1 and 10 (10 being the number of predictors); call this number "M".
4. Follow the classification-tree process outlined previously, except that at each split we choose M predictors at random without replacement and, from this subset, select the best predictor and the best split point.
5. Repeat step 4 for the next node, so the subset of candidate predictors is drawn afresh every time a node is split.
6. Continue until the lower limit on node size is reached. No pruning is used in this method.
7. Take another bootstrap sample and build a tree as described in steps 3-6.
8. Repeat the bootstrapping and tree building 500 times (more if we choose).
9. For each observation, take the majority vote across the trees; this majority class is the forest's prediction.
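To make steps 1, 2 and 9 concrete, here is a minimal sketch in R. It grows the individual trees with rpart, so it omits the per-split random choice of M predictors that the randomForest package performs internally; it is meant only to illustrate the bootstrap-and-vote idea, not to replace the package.

library(rpart)

set.seed(1)
n_trees <- 500
trees <- vector("list", n_trees)
for (b in 1:n_trees) {
  idx <- sample(nrow(iris), replace = TRUE)            # steps 1-2: bootstrap sample
  trees[[b]] <- rpart(Species ~ ., data = iris[idx, ], method = "class")
}
# Step 9: majority vote across the trees for each observation
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred != iris$Species)                             # error rate of the voted forest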

Why use a random forest?

To reduce the misclassification rate compared with a single classification tree.

Load the randomForest package after installing it (e.g. with install.packages("randomForest")):
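library(randomForest)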

## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

Build the Model

rf1 <- randomForest(Species ~ ., data = iris, importance = T)
rf1
## 
## Call:
##  randomForest(formula = Species ~ ., data = iris, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 5.33%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         50          0         0        0.00
## versicolor      0         47         3        0.06
## virginica       0          5        45        0.10
summary(rf1)
##                 Length Class  Mode     
## call               4   -none- call     
## type               1   -none- character
## predicted        150   factor numeric  
## err.rate        2000   -none- numeric  
## confusion         12   -none- numeric  
## votes            450   matrix numeric  
## oob.times        150   -none- numeric  
## classes            3   -none- character
## importance        20   -none- numeric  
## importanceSD      16   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                150   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call

The default value of M is the square root of the number of predictors; this is the number of variables tried at each split. The iris data have 4 predictors, so 2 variables are tried at each split, as shown above.
OOB = out-of-bag error estimate. Each of the 500 trees is grown on a bootstrap sample, so roughly a third of the observations are left out ("out of bag") of any given tree. Each observation is classified by majority vote over the trees that did not use it, and the OOB error rate is the misclassification rate of these predictions aggregated over the whole dataset.
The confusion matrix shows the misclassification rate for each of the species.
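The same quantities can be pulled directly from the fitted object; the component names are the ones listed by summary(rf1) above.

rf1$err.rate[rf1$ntree, "OOB"]   # final OOB error rate, as printed above
rf1$confusion                    # confusion matrix with per-class error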

We can get a 95% confidence interval for the misclassification rate by repeating this process, say, 1000 times in a loop and taking the 2.5th and 97.5th percentiles of the resulting OOB error rates.
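A minimal sketch of that idea, using 100 refits rather than 1000 to keep the run time short:

set.seed(2)
oob <- replicate(100, {
  fit <- randomForest(Species ~ ., data = iris)
  fit$err.rate[fit$ntree, "OOB"]       # final out-of-bag error rate of this forest
})
quantile(oob, c(0.025, 0.975))         # approximate 95% interval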

Graphing the importance of the variables:

varImpPlot(rf1, pch=16, col=2, n.var=4, sort=T, main="Importance of Variable in Iris Data")

Petal length and petal width are the most important predictors in the random forest.
Sepal length and sepal width are much less important.

The underlying process is as follows. The values of one predictor (say petal length) are shuffled while the other variables are left unchanged, and the out-of-bag observations are re-predicted; the resulting OOB error is compared with the original one. The mean decrease in accuracy reports this drop: a value of about 30% for petal length means the OOB error rate increases by roughly that much if we mess up the petal length.

This gives us a criterion for how much information each covariate carries.
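A rough sketch of the permutation idea, re-using the fitted rf1: scramble Petal.Length, re-predict, and compare with the intact data. (The importance measure built into randomForest does this per tree on the out-of-bag samples, which is more careful than this whole-dataset version, so the numbers will not match the plot exactly.)

set.seed(3)
shuffled <- iris
shuffled$Petal.Length <- sample(shuffled$Petal.Length)   # scramble one predictor
mean(predict(rf1, shuffled) != iris$Species)             # error with petal length scrambled
mean(predict(rf1, iris) != iris$Species)                 # error on the intact data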

If we rebuild the random forest with only the most important covariates, the error rate can go down:

rf2 <- randomForest(Species ~ Petal.Length + Petal.Width, data = iris, importance = T)
rf2
## 
## Call:
##  randomForest(formula = Species ~ Petal.Length + Petal.Width, data = iris,      importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 4%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         50          0         0        0.00
## versicolor      0         47         3        0.06
## virginica       0          3        47        0.06

Thus the random forest technique also serves as a variable selection technique: the importance measures tell us which covariates are worth keeping.