In this short article, I would like to walk readers through some basics of random forest in R, using the built-in iris dataset.
Source: R programming: Random forest (in Korean)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.1
library(tree)
library(ggplot2)
library(GGally)
library(dplyr)
iris %>% head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris %>% tail()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Just for your information, photos of the above flowers are as follows:
Setosa
Versicolor
Virginica
Random forest is built on decision trees: it grows many trees and combines their votes (a hand-rolled sketch of this idea follows the tree plot below). Let us first fit a single decision tree.
decision_tree <- tree(Species ~ ., data = iris) # How to read this call:
# 1. use the tree() function
# 2. to classify Species
# 3. based on all (.) other variables
# 4. in the iris dataset
decision_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 329.600 setosa ( 0.33333 0.33333 0.33333 )
## 2) Petal.Length < 2.45 50 0.000 setosa ( 1.00000 0.00000 0.00000 ) *
## 3) Petal.Length > 2.45 100 138.600 versicolor ( 0.00000 0.50000 0.50000 )
## 6) Petal.Width < 1.75 54 33.320 versicolor ( 0.00000 0.90741 0.09259 )
## 12) Petal.Length < 4.95 48 9.721 versicolor ( 0.00000 0.97917 0.02083 )
## 24) Sepal.Length < 5.15 5 5.004 versicolor ( 0.00000 0.80000 0.20000 ) *
## 25) Sepal.Length > 5.15 43 0.000 versicolor ( 0.00000 1.00000 0.00000 ) *
## 13) Petal.Length > 4.95 6 7.638 virginica ( 0.00000 0.33333 0.66667 ) *
## 7) Petal.Width > 1.75 46 9.635 virginica ( 0.00000 0.02174 0.97826 )
## 14) Petal.Length < 4.95 6 5.407 virginica ( 0.00000 0.16667 0.83333 ) *
## 15) Petal.Length > 4.95 40 0.000 virginica ( 0.00000 0.00000 1.00000 ) *
summary(decision_tree)
##
## Classification tree:
## tree(formula = Species ~ ., data = iris)
## Variables actually used in tree construction:
## [1] "Petal.Length" "Petal.Width" "Sepal.Length"
## Number of terminal nodes: 6
## Residual mean deviance: 0.1253 = 18.05 / 144
## Misclassification error rate: 0.02667 = 4 / 150
The first block above describes the nodes, i.e. how the tree splits the observations in order to separate the species. The second is a summary of the fitted tree.
This becomes much clearer once you visualise the tree. It helps to work out what each line of the node table and the summary means while looking at the plot.
plot(decision_tree)
text(decision_tree)
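To make the connection between a single tree and a random forest concrete, here is a minimal, hand-rolled bagging sketch. It is purely illustrative and not how randomForest works internally (a real random forest also samples a random subset of variables at each split); the seed and the number of trees (25) are arbitrary choices of mine.
set.seed(1) # arbitrary seed, only so the sketch is reproducible
votes <- replicate(25, { # grow 25 trees, each on a bootstrap sample of the rows
  boot_rows <- sample(nrow(iris), replace = TRUE)
  fit <- tree(Species ~ ., data = iris[boot_rows, ])
  as.character(predict(fit, iris, type = "class"))
})
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v)))) # majority vote per observation
table(observed = iris$Species, predicted = ensemble_pred)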
Before we move on to random forest, there is a handy tool you can use to get a quick feel for a dataset.
ggpairs() builds a matrix of plots showing the pairwise relationships between all the variables in one shot.
ggpairs(iris[,1:5])
You should be able to see that the Petal measurements are likely to be better predictors of species than the Sepal measurements; the last column and the last row demonstrate the point clearly.
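As an optional variation (the colour mapping here is my addition, not part of the original), mapping Species to colour makes the separation by the Petal measurements even easier to see:
ggpairs(iris, mapping = aes(color = Species))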
There is one more thing to keep in mind before we begin: the basic logic of the modelling process.
In every modelling process you split the data randomly into two sets: a training set, on which you build the model, and a test set, with which you check the model's predictive power. You may wonder why we need to. The reason is actually very simple. There is an endless variety of model types out there, and picking the right one for a given problem is what marks a researcher's ability. But if you build a model from the whole dataset, it is as if you already knew all the answers: you already know which observation is which species, so you can build a seemingly (but not actually) perfect model. If instead you only have, say, 70% of the dataset and build the model from that, extrapolating the model to the test set shows whether it actually works.
A common convention is to set aside 30% of the dataset as the test set.
index_row <- sample(2,
                    nrow(iris),
                    replace = T,
                    prob = c(0.7, 0.3)) # randomly label each row (1: training, 2: test)
train_data <- iris[index_row == 1, ]
test_data  <- iris[index_row == 2, ]
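Note that the split is random, so your counts (and all the numbers below) will differ slightly between runs; calling set.seed() with any value of your choice before sample() would make the split reproducible. A quick check of the resulting proportions:
table(index_row)              # rows labelled 1 (training) vs 2 (test)
prop.table(table(index_row))  # should be roughly 0.7 / 0.3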
Now that we have divided the dataset, we are ready to run the randomForest() function.
iris_classifier <- randomForest(Species ~.,
data = train_data, #train data set
importance = T)
iris_classifier # printing the fit shows the OOB error rate and confusion matrix
##
## Call:
## randomForest(formula = Species ~ ., data = train_data, importance = T)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.97%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 34 0 0 0.00000000
## versicolor 0 32 2 0.05882353
## virginica 0 1 32 0.03030303
plot(iris_classifier)
importance(iris_classifier) #Petal features are more important
## setosa versicolor virginica MeanDecreaseAccuracy
## Sepal.Length 6.474330 0.6083961 10.696517 10.855986
## Sepal.Width 4.770164 5.3968525 6.612005 7.854257
## Petal.Length 23.575582 27.3513069 27.103185 32.412798
## Petal.Width 21.364612 28.0634599 32.477279 34.900055
## MeanDecreaseGini
## Sepal.Length 5.416192
## Sepal.Width 2.773880
## Petal.Length 28.496606
## Petal.Width 29.894320
varImpPlot(iris_classifier)
Interpretation
Confusion matrix? → It is about evaluating the model: it shows how far the model's (out-of-bag) predictions deviate from the actual species.
Ex) Of the 34 versicolor observations in the training set, the model classified 32 as versicolor and 2 as virginica.
Plot? → It is about the number of trees required. It tells us we do not even need 100 trees to push down the error rate (the default is 500 trees); a quick way to confirm this is sketched after the plots below.
Importance? → It shows the predictive power of each predictor. Consistent with the ggpairs plot we saw earlier, the Petal variables have the greater predictive power. This is visualised in the plot above.
Importance can be reconfirmed with these plots:
qplot(Petal.Width, Petal.Length, data=iris, color = Species)
qplot(Sepal.Width, Sepal.Length, data=iris, color = Species)
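To confirm the point about the number of trees, you could refit with a smaller forest; ntree = 100 is an assumed value read off the error-rate plot, not something taken from the original article. The OOB error should come out close to that of the 500-tree fit.
iris_classifier_small <- randomForest(Species ~ ., data = train_data, ntree = 100) # fewer trees
iris_classifier_small # compare its OOB error rate with the 500-tree model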
predicted_table <- predict(iris_classifier, test_data[,-5]) # predict species for the test set (Species column dropped)
table(observed = test_data[,5], predicted = predicted_table) # confusion matrix on the test set
## predicted
## observed setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 16 0
## virginica 0 4 13
The table above shows how the model performs on unseen data: only 4 of the 17 virginica observations in the test set are misclassified as versicolor, and every other test observation is predicted correctly.
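As a small add-on (not in the original), you can compute the overall test-set accuracy directly:
mean(predicted_table == test_data$Species) # proportion of correct predictions on the test set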