INTRODUCTION

The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree, with the aim of creating an uncorrelated forest of trees whose prediction by committee is more accurate than that of any single tree. In the random forest approach, a large number of decision trees are created and every observation is fed into every tree. A new observation is passed through all the trees, each tree casts a vote, and the most common outcome across the forest is taken as the final classification.

DISCUSSION ON RANDOM FOREST

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests are about having multiple trees, a forest of trees. Those trees can all be of the same type or algorithm, or the forest can be made up of a mixture of tree types (algorithms). There are some interesting metaphorical ways of describing how such a forest acts (decides).

The two most important parameters of a random forest are ntree and mtry: "ntree" sets how many trees are grown in the forest, and "mtry" sets how many randomly selected variables are tried at each split. Together, these two parameters drive the variance explained and the error-rate percentage; as the error rate decreases, the variance explained increases, which indicates a better-fitting model.

An error estimate is made for the cases that were not used while building each tree. This is called the OOB (out-of-bag) error estimate and is reported as a percentage.

The R package "randomForest" is used to create random forests.
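
As a minimal sketch of these two parameters and the OOB estimate (using R's built-in iris data purely for illustration, and assuming the randomForest package is installed), both are passed directly to randomForest(), and the final OOB error can be read from the last row of the fitted model's err.rate matrix:

library(randomForest)

# Illustrative fit: 500 trees, 2 randomly chosen variables tried at each split
fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

# The last row of err.rate holds the final OOB error estimate (as a proportion)
fit$err.rate[nrow(fit$err.rate), "OOB"]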

NEED OF RANDOM FOREST OVER DECISION TREE

Even though decision trees are convenient and easily implemented, they lack accuracy. Decision trees work very effectively on the training data used to build them, but they are not flexible when it comes to classifying new samples, which means the accuracy during the testing phase is low. This happens because of a phenomenon called over-fitting. [Over-fitting occurs when a model studies the training data to such an extent that it negatively influences the performance of the model on new data.]

This means that noise in the training data is recorded and learned as concepts by the model. The problem is that these concepts do not apply to the testing data; they negatively impact the model's ability to classify new data and reduce its accuracy on the test set.

This is where random forest comes in. It is based on the idea of bagging, which reduces the variance of the predictions by combining the results of multiple decision trees built on different bootstrap samples of the data set, as the sketch below illustrates.
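
To make the bagging idea concrete, here is a minimal sketch (assuming the rpart package is installed; the built-in iris data is used only as a stand-in). Each tree is trained on a different bootstrap sample, and the ensemble prediction is the majority vote across the trees; random forest additionally samples features at each split, which randomForest() handles internally.

library(rpart)

set.seed(1)
n_trees <- 25

# Grow one decision tree per bootstrap sample of the data
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot)
})

# Each tree votes for a class; the majority vote is the bagged prediction
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == iris$Species)   # resubstitution accuracy of the ensemble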

...

...

MAJOR STEPS FOR RANDOM FOREST GENERATION

The working of Random Forest is as follows:

  1. Loading the dataset
  2. Changing the data type of the target variable
  3. Scaling the continuous variables
  4. Listing missing (NA) values in each column
  5. Creating the model
  6. Creating error rate dataframe for all the trees
  7. Number of trees vs. error plot
  8. Model built with 1000 trees
  9. Creating error rate dataframe for all the trees
  10. Number of trees vs. error plot
  11. Testing model accuracy with different values of random feature selection
  12. Printing the result
  13. Building the final model with the optimal parameters
  14. Checking important predictors

Here we'll take a very popular classification dataset which is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set.

The dataset consists of several predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Below, I use the random forest technique to demonstrate the generation steps listed above on this dataset.

Loading the dataset

# Read the diabetes dataset from CSV and inspect the first rows
data <- read.csv("Diabetes.csv")
head(data)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
str(data)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Changing the data type of the target variable

# Recode the 0/1 outcome as descriptive labels and convert it to a factor
data$Outcome <- ifelse(data$Outcome == 0, yes = "Healthy", no = "Not Healthy")
data$Outcome <- as.factor(data$Outcome)

Scaling the continuous variables

# Standardise each continuous predictor (mean 0, standard deviation 1)
data$Pregnancies <- scale(data$Pregnancies)
data$Glucose <- scale(data$Glucose)
data$BloodPressure <- scale(data$BloodPressure)
data$SkinThickness <- scale(data$SkinThickness)
data$Insulin <- scale(data$Insulin)
data$BMI <- scale(data$BMI)
data$DiabetesPedigreeFunction <- scale(data$DiabetesPedigreeFunction)
data$Age <- scale(data$Age)
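
The same standardisation can also be written more compactly by scaling every column except the target in one step (a stylistic alternative that produces the same standardised values):

# Scale all predictors (everything except Outcome) in a single pass
num_cols <- setdiff(names(data), "Outcome")
data[num_cols] <- lapply(data[num_cols], function(x) as.numeric(scale(x)))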

Listing missing (NA) values in each column

# Count missing (NA) values in each column
sapply(data, function(x) sum(is.na(x)))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0
# Set a seed so the random forest results are reproducible, then load the package
set.seed(123)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.

Creating the model

# Fit a random forest with the default settings (500 trees, mtry = floor(sqrt(8)) = 2)
model <- randomForest(Outcome ~ ., data = data)
print(model)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = data) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 23.31%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         431          69   0.1380000
## Not Healthy     110         158   0.4104478
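
Once the forest is fitted, a new observation is classified by passing it through every tree and taking the majority vote, which is what predict() does. As a quick illustration of the interface (using a few training rows, so this is not an out-of-sample test):

# Predicted class labels (majority vote across the 500 trees)
predict(model, newdata = head(data))

# Per-class vote proportions instead of hard labels
head(predict(model, newdata = data, type = "prob"))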

Creating error rate dataframe for all the trees

# Long-format data frame: OOB and per-class error rates for each number of trees
oob.err.data <- data.frame(
  Trees = rep(1:nrow(model$err.rate), 3), 
  Type = rep(c("OOB","Healthy","Unhealthy"), each = nrow(model$err.rate)),
  Error = c(model$err.rate[,"OOB"], model$err.rate[,"Healthy"], model$err.rate[,"Not Healthy"]))

Number of trees vs. error plot

# caret attaches ggplot2, which provides the ggplot() used below
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
ggplot(data = oob.err.data, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))
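
The plot shows how the OOB error and the two class-wise errors change as trees are added; once the curves flatten out, growing additional trees gives little further improvement, which the next step checks by refitting with 1000 trees.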

Model built with 1000 trees

model1 <- randomForest(Outcome ~ ., data = data, ntree = 1000)
print(model1)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = data, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 23.44%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         428          72   0.1440000
## Not Healthy     108         160   0.4029851

Creating error rate dataframe for all the trees

oob.err.data1 <- data.frame(
  Trees = rep(1:nrow(model1$err.rate), 3), 
  Type = rep(c("OOB","Healthy","Unhealthy"), each = nrow(model1$err.rate)),
  Error = c(model1$err.rate[,"OOB"], model1$err.rate[,"Healthy"], model1$err.rate[,"Not Healthy"]))

Number of trees vs. error plot

ggplot(data = oob.err.data1, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))

Testing model accuracy with different values of random feature selection

# OOB error for each candidate value of mtry (1 to 8 predictors)
oob.values <- vector(length = 8)
for(i in 1:8){
  temp.model <- randomForest(Outcome ~ ., data = data, mtry = i, ntree = 500)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
}
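
To complete the "Printing the result" step from the outline, the OOB errors can be inspected and the value of mtry with the lowest error picked out; the final model below is built with mtry = 8.

oob.values                 # OOB error for mtry = 1, 2, ..., 8
which.min(oob.values)      # mtry value with the lowest OOB error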

Building the final model with the optimal parameters

# Refit the forest with mtry = 8, the value chosen after the OOB comparison above
model2 <- randomForest(Outcome ~ ., data = data, ntree = 500, mtry = 8)
print(model2)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = data, ntree = 500,      mtry = 8) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 24.61%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         421          79   0.1580000
## Not Healthy     110         158   0.4104478

Checking important predictors

importance(model2)
##                          MeanDecreaseGini
## Pregnancies                      23.44817
## Glucose                         113.77692
## BloodPressure                    30.11033
## SkinThickness                    17.19549
## Insulin                          18.18521
## BMI                              58.56169
## DiabetesPedigreeFunction         45.22922
## Age                              42.12460
varImpPlot(model2)
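
In both the table and the variable-importance plot, Glucose has by far the largest MeanDecreaseGini, followed by BMI and DiabetesPedigreeFunction, so these measurements contribute most to the splits in the forest, while SkinThickness and Insulin contribute least.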

Hope I was able to share some helpful concepts with you. See you in the next article. My website: Archita.