The random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree, aiming to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree. In the random forest approach, a large number of decision trees are created and every observation is fed into every tree. A new observation is passed through all the trees, and the majority vote across them is used as the final classification.
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. A random forest is, quite literally, a forest of trees. Those trees can all be of the same type or algorithm, or the forest can be made up of a mixture of tree types (algorithms), and the forest metaphor extends naturally to how the ensemble decides: each tree casts a vote and the forest acts on the collective decision.
The two most important parameters in a random forest are ntree and mtry: "ntree" sets how many trees are grown, and "mtry" sets how many variables are tried as split candidates at each step. Together these two parameters drive the variance explained and the error rate percent, and as the error rate decreases the variance explained increases, which means the model fits better.
An error estimate is made for the cases that were not used while building each tree. This is called the OOB (out-of-bag) error estimate and is reported as a percentage.
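As a minimal sketch of how these pieces fit together (the names my_data and Class are placeholders, not part of the diabetes example below), both parameters can be passed to randomForest() directly, and the OOB estimate can be read off the fitted object:

library(randomForest)
# Hypothetical data frame `my_data` with a factor target `Class`.
fit <- randomForest(Class ~ ., data = my_data, ntree = 500, mtry = 3)
# err.rate has one row per tree; the "OOB" column of the last row is the
# out-of-bag error estimate after all trees have been grown.
fit$err.rate[nrow(fit$err.rate), "OOB"]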
The R package "randomForest" is used to create random forests.
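If it is not already installed, the package can be installed once from CRAN and then loaded in each session:

# install.packages("randomForest")
library(randomForest)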
Even though decision trees are convenient and easily implemented, they lack accuracy. Decision trees work very effectively on the training data used to build them, but they are not flexible when it comes to classifying new samples, which means accuracy during the testing phase is very low. This happens due to a process called over-fitting. [Over-fitting occurs when a model studies the training data to such an extent that it negatively influences the performance of the model on new data.]
This means that the noise in the training data is recorded and learned as concepts by the model. The problem is that these concepts do not apply to the testing data, so they hurt the model's ability to classify new data and reduce accuracy on the testing data.
This is where Random Forest comes in. It is based on the idea of bagging, which reduces the variance of the predictions by combining the results of multiple decision trees built on different samples of the data set, as the sketch below illustrates.
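To make the bagging idea concrete, here is a minimal sketch using the built-in iris data and plain rpart trees (not the diabetes data analysed below): several trees are fit on bootstrap samples and their predictions are combined by majority vote. This shows bagging only; randomForest additionally samples a random subset of variables at each split.

library(rpart)

set.seed(1)
n_trees <- 25
trees <- lapply(1:n_trees, function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]   # bootstrap sample of the rows
  rpart(Species ~ ., data = boot)                      # one tree per sample
})

# Each tree votes on every observation; the most frequent class wins.
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(majority == iris$Species)   # accuracy of the bagged committee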
...
The working of Random Forest is demonstrated below.
Here we'll take a very popular classification dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set.
The dataset consists of several predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age and so on.
Further, I have used the Random Forest technique here to demonstrate building and tuning a forest on this dataset.
data <- read.csv("Diabetes.csv")
head(data)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
str(data)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
# Recode the 0/1 outcome as a labelled factor so randomForest treats this as a classification problem.
data$Outcome <- ifelse(data$Outcome == 0, yes = "Healthy", no = "Not Healthy")
data$Outcome <- as.factor(data$Outcome)
# Standardise the numeric predictors (centre to mean 0, scale to unit variance).
data$Pregnancies <- scale(data$Pregnancies)
data$Glucose <- scale(data$Glucose)
data$BloodPressure <- scale(data$BloodPressure)
data$SkinThickness <- scale(data$SkinThickness)
data$Insulin <- scale(data$Insulin)
data$BMI <- scale(data$BMI)
data$DiabetesPedigreeFunction <- scale(data$DiabetesPedigreeFunction)
data$Age <- scale(data$Age)
# Count the missing values in each column.
sapply(data, function(x) sum(is.na(x)))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
set.seed(123)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
# Fit a forest with the defaults: ntree = 500 trees, mtry = floor(sqrt(8)) = 2 variables per split.
model <- randomForest(Outcome ~ ., data = data)
print(model)
##
## Call:
## randomForest(formula = Outcome ~ ., data = data)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 23.31%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         431          69   0.1380000
## Not Healthy     110         158   0.4104478
# Reshape the per-tree error rates into long format for plotting: one row per
# tree for the OOB error and for each class error.
oob.err.data <- data.frame(
  Trees = rep(1:nrow(model$err.rate), 3),
  Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model$err.rate)),
  Error = c(model$err.rate[, "OOB"], model$err.rate[, "Healthy"], model$err.rate[, "Not Healthy"]))
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
ggplot(data = oob.err.data, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))
# Grow a larger forest to see whether more trees reduce the OOB error.
model1 <- randomForest(Outcome ~ ., data = data, ntree = 1000)
print(model1)
##
## Call:
## randomForest(formula = Outcome ~ ., data = data, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 23.44%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         428          72   0.1440000
## Not Healthy     108         160   0.4029851
# Same reshaping for the 1000-tree model.
oob.err.data1 <- data.frame(
  Trees = rep(1:nrow(model1$err.rate), 3),
  Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model1$err.rate)),
  Error = c(model1$err.rate[, "OOB"], model1$err.rate[, "Healthy"], model1$err.rate[, "Not Healthy"]))
ggplot(data = oob.err.data1, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))
# Try every possible mtry (1 to 8 predictors) and record the final OOB error for each.
oob.values <- vector(length = 8)
for(i in 1:8){
  temp.model <- randomForest(Outcome ~ ., data = data, mtry = i, ntree = 500)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
}
oob.values
## [1] 0.2343750 0.2434896 0.2395833 0.2356771 0.2369792 0.2408854 0.2473958
## [8] 0.2330729
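Rather than reading the best value off by eye, the mtry with the lowest OOB error can be picked programmatically; in this run the minimum is at mtry = 8.

which.min(oob.values)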
# Refit with mtry = 8, the value that gave the lowest OOB error above.
model2 <- randomForest(Outcome ~ ., data = data, ntree = 500, mtry = 8)
print(model2)
##
## Call:
## randomForest(formula = Outcome ~ ., data = data, ntree = 500, mtry = 8)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 24.61%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         421          79   0.1580000
## Not Healthy     110         158   0.4104478
# Variable importance measured as the mean decrease in Gini impurity.
importance(model2)
##                          MeanDecreaseGini
## Pregnancies                      23.44817
## Glucose                         113.77692
## BloodPressure                    30.11033
## SkinThickness                    17.19549
## Insulin                          18.18521
## BMI                              58.56169
## DiabetesPedigreeFunction         45.22922
## Age                              42.12460
varImpPlot(model2)
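As a quick illustration of using the tuned forest, new observations can be scored with predict(); here the first few rows of the (already scaled) training data simply stand in for unseen patients.

predict(model2, newdata = head(data))
predict(model2, newdata = head(data), type = "prob")   # class probabilities instead of labels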
Hope I was able to share some helpful concepts with you. See you in the next article. My website: Archita.