In this short article, I would like to walk readers through some basics of random forest in R, using the built-in iris dataset.
Source: R programming: Random forest (in Korean)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.1
library(tree)
library(ggplot2)
library(GGally)
library(dplyr)
iris %>% head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris %>% tail()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Just for your information, photos of the above flowers are as follows:
Setosa
Versicolor
Virginica
Random forest is built on decision trees: it grows many trees and combines their votes (a hand-rolled sketch of this idea follows the tree plot below). Let us first fit a single decision tree.
decision_tree <- tree(Species ~ ., data = iris) # How to read this call:
# 1. use the tree() function
# 2. to classify Species
# 3. based on all (.) other variables
# 4. in the iris dataset
decision_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 329.600 setosa ( 0.33333 0.33333 0.33333 )
## 2) Petal.Length < 2.45 50 0.000 setosa ( 1.00000 0.00000 0.00000 ) *
## 3) Petal.Length > 2.45 100 138.600 versicolor ( 0.00000 0.50000 0.50000 )
## 6) Petal.Width < 1.75 54 33.320 versicolor ( 0.00000 0.90741 0.09259 )
## 12) Petal.Length < 4.95 48 9.721 versicolor ( 0.00000 0.97917 0.02083 )
## 24) Sepal.Length < 5.15 5 5.004 versicolor ( 0.00000 0.80000 0.20000 ) *
## 25) Sepal.Length > 5.15 43 0.000 versicolor ( 0.00000 1.00000 0.00000 ) *
## 13) Petal.Length > 4.95 6 7.638 virginica ( 0.00000 0.33333 0.66667 ) *
## 7) Petal.Width > 1.75 46 9.635 virginica ( 0.00000 0.02174 0.97826 )
## 14) Petal.Length < 4.95 6 5.407 virginica ( 0.00000 0.16667 0.83333 ) *
## 15) Petal.Length > 4.95 40 0.000 virginica ( 0.00000 0.00000 1.00000 ) *
summary(decision_tree)
##
## Classification tree:
## tree(formula = Species ~ ., data = iris)
## Variables actually used in tree construction:
## [1] "Petal.Length" "Petal.Width" "Sepal.Length"
## Number of terminal nodes: 6
## Residual mean deviance: 0.1253 = 18.05 / 144
## Misclassification error rate: 0.02667 = 4 / 150
The first block above describes the nodes, i.e. how the tree splits the observations in order to separate the species. The second is a summary of the fitted tree.
This becomes much clearer once you visualise the tree. It helps to work out what each line of the node table and the summary means while looking at the plot.
plot(decision_tree)
text(decision_tree)
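To make the connection between a single tree and a random forest concrete, here is a minimal, hand-rolled bagging sketch. It is purely illustrative and not how randomForest works internally (a real random forest also samples a random subset of variables at each split); the seed and the number of trees (25) are arbitrary choices of mine.
set.seed(1) # arbitrary seed, only so the sketch is reproducible
votes <- replicate(25, { # grow 25 trees, each on a bootstrap sample of the rows
  boot_rows <- sample(nrow(iris), replace = TRUE)
  fit <- tree(Species ~ ., data = iris[boot_rows, ])
  as.character(predict(fit, iris, type = "class"))
})
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v)))) # majority vote per observation
table(observed = iris$Species, predicted = ensemble_pred)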
Before we move on to random forest, there is a handy tool you can use to get a quick feel for a dataset.
ggpairs() builds a matrix of plots showing the pairwise relationships between all the variables in one shot.
ggpairs(iris[,1:5])
You should be able to see that the Petal measurements are likely to be better predictors of species than the Sepal measurements; the last column and the last row demonstrate the point clearly.
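As an optional variation (the colour mapping here is my addition, not part of the original), mapping Species to colour makes the separation by the Petal measurements even easier to see:
ggpairs(iris, mapping = aes(color = Species))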
There is one more thing to keep in mind before we begin: the basic logic of the modelling process.
In every modelling process you split the data randomly into two sets: a training set, on which you build the model, and a test set, with which you check the model's predictive power. You may wonder why we need to. The reason is actually very simple. There is an endless variety of model types out there, and picking the right one for a given problem is what marks a researcher's ability. But if you build a model from the whole dataset, it is as if you already knew all the answers: you already know which observation is which species, so you can build a seemingly (but not actually) perfect model. If instead you only have, say, 70% of the dataset and build the model from that, extrapolating the model to the test set shows whether it actually works.
A common convention is to set aside 30% of the dataset as the test set.
index_row <- sample(2,
                    nrow(iris),
                    replace = T,
                    prob = c(0.7, 0.3)) # randomly label each row (1: training, 2: test)
train_data <- iris[index_row == 1, ]
test_data  <- iris[index_row == 2, ]
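Note that the split is random, so your counts (and all the numbers below) will differ slightly between runs; calling set.seed() with any value of your choice before sample() would make the split reproducible. A quick check of the resulting proportions:
table(index_row)              # rows labelled 1 (training) vs 2 (test)
prop.table(table(index_row))  # should be roughly 0.7 / 0.3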
Now that we have divided the dataset, we are ready to run the randomForest() function.
iris_classifier <- randomForest(Species ~.,
data = train_data, #train data set
importance = T)
iris_classifier # printing the fit shows the OOB error rate and confusion matrix
##
## Call:
## randomForest(formula = Species ~ ., data = train_data, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.97%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 34 0 0 0.00000000
## versicolor 0 32 2 0.05882353
## virginica 0 1 32 0.03030303
plot(iris_classifier)
importance(iris_classifier) #Petal features are more important
## setosa versicolor virginica MeanDecreaseAccuracy
## Sepal.Length 6.474330 0.6083961 10.696517 10.855986
## Sepal.Width 4.770164 5.3968525 6.612005 7.854257
## Petal.Length 23.575582 27.3513069 27.103185 32.412798
## Petal.Width 21.364612 28.0634599 32.477279 34.900055
## MeanDecreaseGini
## Sepal.Length 5.416192
## Sepal.Width 2.773880
## Petal.Length 28.496606
## Petal.Width 29.894320
varImpPlot(iris_classifier)
Interpretation
Confusion matrix? → It is about evaluating the model: it shows how far the model's (out-of-bag) predictions deviate from the actual species.
Ex) Of the 34 versicolor observations in the training set, the model classified 32 as versicolor and 2 as virginica.
Plot? → It is about the number of trees required. It tells us we do not even need 100 trees to push down the error rate (the default is 500 trees); a quick way to confirm this is sketched after the plots below.
Importance? → It shows the predictive power of each predictor. Consistent with the ggpairs plot we saw earlier, the Petal variables have the greater predictive power. This is visualised in the plot above.
Importance can be reconfirmed with these plots:
qplot(Petal.Width, Petal.Length, data=iris, color = Species)
qplot(Sepal.Width, Sepal.Length, data=iris, color = Species)
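To confirm the point about the number of trees, you could refit with a smaller forest; ntree = 100 is an assumed value read off the error-rate plot, not something taken from the original article. The OOB error should come out close to that of the 500-tree fit.
iris_classifier_small <- randomForest(Species ~ ., data = train_data, ntree = 100) # fewer trees
iris_classifier_small # compare its OOB error rate with the 500-tree model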
predicted_table <- predict(iris_classifier, test_data[,-5]) # predict species for the test set (Species column dropped)
table(observed = test_data[,5], predicted = predicted_table) # confusion matrix on the test set
## predicted
## observed setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 16 0
## virginica 0 4 13
The table above shows how the model performs on unseen data: only 4 of the 17 virginica observations in the test set are misclassified as versicolor, and every other test observation is predicted correctly.
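As a small add-on (not in the original), you can compute the overall test-set accuracy directly:
mean(predicted_table == test_data$Species) # proportion of correct predictions on the test set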