Random forest is a supervised learning algorithm that can be used for both classification and regression, although it is mainly used for classification problems. Just as a forest is made up of trees, and more trees usually mean a more robust forest, the random forest algorithm builds many decision trees on bootstrap samples of the data, collects a prediction from each tree, and selects the final answer by majority vote. It is an ensemble method that is better than a single decision tree because averaging over many trees reduces over-fitting.
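To make the voting idea concrete, here is a minimal sketch of bagging with majority voting, using rpart trees on the built-in iris data (an illustration only; a real random forest additionally samples a random subset of predictors at every split, which is what randomForest will do for us below):
library(rpart)
set.seed(42)
# Grow 25 trees, each on a different bootstrap sample of the rows
trees <- lapply(1:25, function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot)
})
# Every tree votes; the majority class becomes the ensemble prediction
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)   # accuracy of the ensemble on the training rows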
Random Forest Classifier
There are four main advantages I have found while using random forest as a machine-learning technique.
It can be used for both classification and regression tasks.
Over-fitting is a critical problem that can make results worse, but with enough trees in the forest the random forest classifier is unlikely to over-fit the model.
The third advantage is that random forest can deal with missing values; in R this is handled with helpers such as na.roughfix() and rfImpute(), as sketched just below.
The last advantage is that the random forest classifier can also handle categorical variables.
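A note on the missing-value point: randomForest() itself will stop on NAs unless you deal with them, but the package ships na.roughfix() (fill with column medians/modes) and rfImpute() (proximity-based imputation). A minimal sketch on iris with a few artificially introduced missing values (the diabetes data used below codes missingness as zeros rather than literal NAs, so this is purely illustrative):
library(randomForest)
set.seed(1)
df <- iris
df$Sepal.Length[sample(nrow(df), 10)] <- NA      # inject some missing values
df.rough <- na.roughfix(df)                      # quick fix: medians / modes
df.imp <- rfImpute(Species ~ ., data = df)       # impute via forest proximities
rf.fit <- randomForest(Species ~ ., data = df.imp)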
First, we will install (if necessary) and load the required packages for random forest.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
Load The Data Set
data <- read.csv(file.choose())   # select the diabetes CSV (the Pima Indians Diabetes data set)
head(data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Let’s see the structure of the data set!
str(data)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Now we will change the data type of the target variable; here "Outcome" is our target. First, replace 1 with "Not Healthy" and 0 with "Healthy" to make the data set easier to read. Then change the data type of "Outcome" from integer to factor, which is what the classifier expects for classification. Finally, check the data set again to see what has changed.
Let’s do it!
data$Outcome <- ifelse(data$Outcome == 1,"Not Healthy","Healthy")
data$Outcome <- as.factor(data$Outcome)
head(data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 Not Healthy
## 2 0.351 31 Healthy
## 3 0.672 32 Not Healthy
## 4 0.167 21 Healthy
## 5 2.288 33 Not Healthy
## 6 0.201 30 Healthy
str(data)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : Factor w/ 2 levels "Healthy","Not Healthy": 2 1 2 1 2 1 2 1 2 2 ...
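Before modelling, it is also worth glancing at the class balance, since it helps interpret the per-class error rates we will see later (a small optional check):
table(data$Outcome)
prop.table(table(data$Outcome))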
Now it's time to scale the continuous variables of the data set. Scaling standardizes each continuous variable so that it has mean 0 and standard deviation 1, which puts all predictors on a common scale and makes them easier to compare in the rest of the analysis. (Tree-based models such as random forest are not sensitive to the scale of the predictors, so this step is mainly for convenience rather than accuracy.)
data$Pregnancies<- scale(data$Pregnancies)
data$Glucose <- scale(data$Glucose)
data$BloodPressure <-scale(data$BloodPressure)
data$SkinThickness <- scale(data$SkinThickness)
data$Insulin <- scale(data$Insulin)
data$BMI <- scale(data$BMI)
data$DiabetesPedigreeFunction <- scale(data$DiabetesPedigreeFunction)
data$Age <- scale(data$Age)
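The same standardization can be written more compactly; the sketch below is equivalent to the column-by-column calls above (run one version or the other, not both). Wrapping with as.numeric() drops the one-column matrix that scale() returns, leaving plain numeric columns:
predictors <- setdiff(names(data), "Outcome")
data[predictors] <- lapply(data[predictors], function(x) as.numeric(scale(x)))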
Let's check whether the variables really have been standardized (mean 0, variance 1)!
mean(data$Pregnancies)
## [1] -6.901102e-17
var(data$Pregnancies)
## [,1]
## [1,] 1
Now it's time to build a model. By default, randomForest() grows 500 trees.
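One small note: random forests are stochastic, so the exact error rates reported below will differ slightly between runs. If you want reproducible numbers, set a seed before fitting (optional; the value 123 is arbitrary):
set.seed(123)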
model1 <- randomForest(Outcome ~ .,data = data)
print(model1)
##
## Call:
## randomForest(formula = Outcome ~ ., data = data)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 23.44%
## Confusion matrix:
## Healthy Not Healthy class.error
## Healthy 426 74 0.1480000
## Not Healthy 106 162 0.3955224
So Model1 has an out-of-bag (OOB) error rate of about 23%, which sounds not bad at all. Note from the confusion matrix, though, that the error is much higher for the "Not Healthy" class (about 40%) than for the "Healthy" class (about 15%).
Now let's create a data frame with the following code to show how the "OOB", "Healthy" and "Not Healthy" error rates of Model1 change as the number of trees increases.
oob.err.data1 <- data.frame(
Trees = rep(1:nrow(model1$err.rate), 3),
Type = rep(c("OOB","Healthy","Not Healthy"), each = nrow(model1$err.rate)),
Error = c(model1$err.rate[,"OOB"], model1$err.rate[,"Healthy"], model1$err.rate[,"Not Healthy"]))
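A quick peek at this data frame shows its structure, one row per combination of number of trees and error type:
head(oob.err.data1)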
It is rather tedious to scan all 1,500 rows of this data frame, so let's plot it instead to get a clearer picture of how the error changes as more trees are added.
ggplot(data = oob.err.data1, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))
If we increase the number of trees to 1000, will there be any big change in the accuracy of the model? Let's see!
model2 <- randomForest(Outcome ~ ., data = data, ntree = 1000)
print(model2)
##
## Call:
## randomForest(formula = Outcome ~ ., data = data, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 22.79%
## Confusion matrix:
## Healthy Not Healthy class.error
## Healthy 434 66 0.1320000
## Not Healthy 109 159 0.4067164
In Model2 the OOB error rate (22.79%) is slightly lower than in Model1 (23.44%). We repeat the same error-rate plot for Model2 as we did for Model1; after all, a single accuracy number is not the only thing to look at when building a good model.
oob.err.data2 <- data.frame(
Trees = rep(1:nrow(model2$err.rate), 3),
Type = rep(c("OOB","Healthy","Not Healthy"), each = nrow(model2$err.rate)),
Error = c(model2$err.rate[,"OOB"], model2$err.rate[,"Healthy"], model2$err.rate[,"Not Healthy"]))
ggplot(data = oob.err.data2, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))
From these two graphs we can see that going beyond roughly 500 trees does not change the error rate significantly, so to save computation time we will stay with 500 trees.
Previously we varied the number of trees while keeping the number of variables tried at each split (mtry) fixed. Now we do the opposite: fix the number of trees at 500 and vary mtry.
The following code fits one 500-tree forest for each value of mtry from 1 to 8 and records the final OOB error rate of each model.
oob.values <- vector(length = 8)
for (i in 1:8) {
  # Fit a 500-tree forest trying i variables at each split; store its final OOB error
  temp.model1 <- randomForest(Outcome ~ ., data = data, mtry = i, ntree = 500)
  oob.values[i] <- temp.model1$err.rate[nrow(temp.model1$err.rate), 1]
}
oob.values
## [1] 0.2304688 0.2330729 0.2304688 0.2421875 0.2356771 0.2356771 0.2343750
## [8] 0.2421875
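As an aside, the randomForest package also provides tuneRF(), which searches over mtry automatically; a rough sketch of the equivalent call (it takes the predictors and the response separately, and the predictor columns are first converted to plain numeric vectors because scale() left them as one-column matrices):
preds <- as.data.frame(lapply(data[, setdiff(names(data), "Outcome")], as.numeric))
set.seed(123)
tuned <- tuneRF(x = preds, y = data$Outcome, ntreeTry = 500, stepFactor = 1.5, improve = 0.01)
tuned   # one row per mtry value tried, with its OOB error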
Looking at the OOB errors above, the differences between the mtry values are small (and, because the forests are random, they shift a little from run to run); one to three variables per split all do about equally well.
We will therefore keep mtry = 2, which is also the package default here (floor(sqrt(8)) = 2), and build the final model with 2 variables per split and 500 trees.
model3 <- randomForest(Outcome ~ ., data = data, ntree = 500, mtry = 2)
print(model3)
##
## Call:
## randomForest(formula = Outcome ~ ., data = data, ntree = 500, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 23.31%
## Confusion matrix:
## Healthy Not Healthy class.error
## Healthy 430 70 0.1400000
## Not Healthy 109 159 0.4067164
Here the OOB error rate (23.31%) sits between Model1 (23.44%) and Model2 (22.79%); the differences are small and within run-to-run variation. Now it's time to check which variables of this data set are most important for our analysis.
importance(model3)
## MeanDecreaseGini
## Pregnancies 28.89716
## Glucose 89.29345
## BloodPressure 30.49542
## SkinThickness 24.52306
## Insulin 25.92183
## BMI 56.43743
## DiabetesPedigreeFunction 42.81106
## Age 47.88851
varImpPlot(model3)
"Glucose" and "BMI" are the most important variables in this analysis, because splits on them reduce the Gini impurity far more strongly than splits on the other variables. But that does not mean the rest are useless!
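If the forest is refit with importance = TRUE, randomForest also reports permutation-based importance (MeanDecreaseAccuracy) next to MeanDecreaseGini, which gives a second view of the same question; a quick sketch:
model3.imp <- randomForest(Outcome ~ ., data = data, ntree = 500, mtry = 2, importance = TRUE)
importance(model3.imp)   # now includes MeanDecreaseAccuracy columns as well
varImpPlot(model3.imp)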
Remember that mtry = 2 means each split considers two randomly chosen predictors, not that only two predictors are used overall, so there is no reason to drop the remaining variables from the data set; doing that would simply throw information away. We will therefore keep working with the full set of predictors.
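Finally, the fitted forest can of course be used to classify new observations with predict(); a minimal sketch using the first few (already scaled) rows of our data as stand-ins for new patients:
new.obs <- data[1:5, ]                              # the Outcome column is ignored at prediction time
predict(model3, newdata = new.obs)                  # predicted class labels
predict(model3, newdata = new.obs, type = "prob")   # class probabilities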