A Brief Description of Random Forest

Random forest is a supervised learning algorithm that can be used for both classification and regression, though it is most often applied to classification problems. Just as a forest is made up of trees, and more trees generally make a more robust forest, the random forest algorithm builds many decision trees on random samples of the data, collects a prediction from each tree, and selects the final answer by majority vote (or by averaging, in the regression case). As an ensemble method it usually outperforms a single decision tree because aggregating many trees reduces over-fitting.
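
As a toy illustration of the voting step (only a sketch of the idea, not how the randomForest package is implemented internally), suppose three hypothetical trees each cast a vote for a class label; the class with the most votes wins:

tree_votes <- c("Not Healthy", "Healthy", "Not Healthy")  # predictions from three hypothetical trees
names(which.max(table(tree_votes)))                       # majority vote picks "Not Healthy"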

Random Forest Classifier (figure)

Advantages Of Using Random Forest

There are four advantages that I have found while using random forest as a machine-learning technique:

  1. It can be used for both classification and regression tasks.

  2. Over-fitting is a critical problem that can make results worse, but with enough trees in the forest the Random Forest classifier is very unlikely to overfit.

  3. The Random Forest classifier can handle missing values (see the sketch after this list).

  4. The Random Forest classifier can also work directly with categorical variables.
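
The diabetes data used below does not contain NA values, so the following is only a hypothetical sketch of how randomForest can handle missing predictor values, using its built-in na.roughfix imputation (column medians for numeric predictors, the most frequent level for factors). The iris data and the injected NAs here are purely illustrative:

library(randomForest)                                 # also loaded in the coding section below
iris_na <- iris
iris_na$Sepal.Length[c(3, 7, 20)] <- NA               # inject a few missing values for the demo
model_na <- randomForest(Species ~ ., data = iris_na,
                         na.action = na.roughfix,     # impute NAs before growing the forest
                         ntree = 100)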


Detailed Coding For Random Forest

First, we will install (if needed) and load the required packages for random forest.
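
If the two packages are not already installed, they can be installed once with:

install.packages(c("randomForest", "ggplot2"))   # only needed once per R installation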

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin

Load The Data Set

data <-read.csv(file.choose())
head(data)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
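
Note that file.choose() opens an interactive file picker, which is convenient but not reproducible in a script. A non-interactive alternative is to hard-code the path; the file name below is only a placeholder for wherever the CSV is actually stored:

# data <- read.csv("path/to/diabetes.csv")   # placeholder path; adjust to your own file location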

Let’s see the structure of the data set!

str(data)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Now we will change the data type of the target variable. Here “Outcome” is our target variable. First, replace 1 with “Not Healthy” and 0 with “Healthy” to make the data set easier to read. Then change the data type of “Outcome” from integer to factor, which is what the classifier expects. Finally, check the data set again to see what has changed.

Let’s do it!

data$Outcome <- ifelse(data$Outcome == 1,"Not Healthy","Healthy")
data$Outcome <- as.factor(data$Outcome)
head(data)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age     Outcome
## 1                    0.627  50 Not Healthy
## 2                    0.351  31     Healthy
## 3                    0.672  32 Not Healthy
## 4                    0.167  21     Healthy
## 5                    2.288  33 Not Healthy
## 6                    0.201  30     Healthy
str(data)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : Factor w/ 2 levels "Healthy","Not Healthy": 2 1 2 1 2 1 2 1 2 2 ...

Now it’s time to scale the continuous variables of the data set. Scaling centers each variable at 0 and rescales it to unit variance, so all predictors end up on a common, standardized scale, which makes them easier to compare and to work with in the later analysis.

data$Pregnancies<- scale(data$Pregnancies)
data$Glucose <- scale(data$Glucose)
data$BloodPressure <-scale(data$BloodPressure)
data$SkinThickness <- scale(data$SkinThickness)
data$Insulin <- scale(data$Insulin)
data$BMI <- scale(data$BMI)
data$DiabetesPedigreeFunction <- scale(data$DiabetesPedigreeFunction)
data$Age <- scale(data$Age)
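
As a side note, the same standardization can be done in one step instead of column by column; this is just an equivalent alternative (the first eight columns are the numeric predictors):

# data[, 1:8] <- scale(data[, 1:8])   # standardize all eight predictors at once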

Let’s check whether the variables have indeed been standardized, i.e. whether the mean is (numerically) 0 and the variance is 1, using Pregnancies as an example.

mean(data$Pregnancies)
## [1] -6.901102e-17
var(data$Pregnancies)
##      [,1]
## [1,]    1

Now it’s time to build a model. By default, randomForest grows 500 trees.
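
Because the forest is grown on random bootstrap samples, the exact numbers printed below can differ slightly from run to run. If reproducible output is desired, a random seed can be fixed first; the value 42 is arbitrary:

set.seed(42)   # any fixed value makes the bootstrap samples, and hence the results, reproducible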

model1 <- randomForest(Outcome ~ .,data = data)
print(model1)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = data) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 23.44%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         426          74   0.1480000
## Not Healthy     106         162   0.3955224

So, from Model1 we get an OOB error rate of about 23%. Not bad at all!

Now let’s create a data frame, using the following code, to show how the “OOB”, “Healthy”, and “Not Healthy” error rates of Model1 fluctuate as the number of trees increases.

oob.err.data1 <- data.frame(
  Trees = rep(1:nrow(model1$err.rate), 3), 
  Type = rep(c("OOB","Healthy","Not Healthy"), each = nrow(model1$err.rate)),
  Error = c(model1$err.rate[,"OOB"], model1$err.rate[,"Healthy"], model1$err.rate[,"Not Healthy"]))

It’s quite tedious for us (future data scientists) to read through the 1,500 rows of this data frame, so let’s plot it instead to get a clearer picture of how the error changes as more trees are added.

ggplot(data = oob.err.data1, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))

If we increase the number of trees to 1,000, will there be any major change in the accuracy of the model? Let’s see!

model2 <- randomForest(Outcome ~ ., data = data, ntree = 1000)
print(model2)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = data, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 22.79%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         434          66   0.1320000
## Not Healthy     109         159   0.4067164

In Model2 we can see that the error rate is slightly lower than in Model1. We also repeat the same plotting process for Model2 that we did for Model1; after all, a single accuracy figure is not the only thing to consider when building an optimal model.

oob.err.data2 <- data.frame(
  Trees = rep(1:nrow(model2$err.rate), 3), 
  Type = rep(c("OOB","Healthy","Not Healthy"), each = nrow(model2$err.rate)),
  Error = c(model2$err.rate[,"OOB"], model2$err.rate[,"Healthy"], model2$err.rate[,"Not Healthy"]))
ggplot(data = oob.err.data2, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))

From these two graphs we can see that growing more than about 500 trees does not change the error rate significantly, so to save computation time we will stick with 500 trees.

Previously we varied the number of trees while keeping the number of variables tried at each split (mtry) fixed. Now we will do the opposite: fix the number of trees at 500 and vary the number of variables.

The following code tries mtry values from 1 to 8, each with 500 trees, and records the final OOB error rate of each model.

oob.values <- vector(length = 8)
for(i in 1:8){
  temp.model1 <- randomForest(Outcome ~ ., data = data, mtry = i, ntree = 500)
  oob.values[i] <- temp.model1$err.rate[nrow(temp.model1$err.rate),1]
}
oob.values
## [1] 0.2304688 0.2330729 0.2304688 0.2421875 0.2356771 0.2356771 0.2343750
## [8] 0.2421875

Looking at these OOB errors, the differences between the mtry values are very small: mtry = 1 and mtry = 3 are marginally the lowest, and mtry = 2 (the package default for this data, floor(sqrt(8)) = 2) is practically just as good. So we will keep mtry = 2 for the final model.
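
If we wanted to pick the mtry value with the lowest OOB error programmatically rather than by eye, a small sketch is:

which.min(oob.values)   # index (= mtry value) of the lowest OOB error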

So let’s build the final model using mtry = 2 and 500 trees.

model3 <- randomForest(Outcome ~ ., data = data, ntree = 500, mtry = 2)
print(model3)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = data, ntree = 500,      mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 23.31%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         430          70   0.1400000
## Not Healthy     109         159   0.4067164

Here the error rate is slightly lower than in Model1 and comparable to Model2, even with half as many trees. Now it’s time to check which variables in this data set are the most important for our analysis.

importance(model3)
##                          MeanDecreaseGini
## Pregnancies                      28.89716
## Glucose                          89.29345
## BloodPressure                    30.49542
## SkinThickness                    24.52306
## Insulin                          25.92183
## BMI                              56.43743
## DiabetesPedigreeFunction         42.81106
## Age                              47.88851
varImpPlot(model3)

“Glucose” and “BMI” are the most important variables for this analysis, because they reduce the Gini impurity much more strongly than the others. But that does not mean the rest of the variables are useless!
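
If a sorted, text-only view of the importance scores is preferred over the plot, a small sketch is:

sort(importance(model3)[, "MeanDecreaseGini"], decreasing = TRUE)   # variables ranked by Gini importance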

Conclusion

Although the tuning above pointed us to mtry = 2, this only means that two candidate variables are tried at each split; it does not mean that only two predictors matter. Dropping the other variables from the data set and recomputing the accuracy would therefore throw information away, so we will keep working with the full data set as given.