Introduction

Random forest is a popular supervised machine learning algorithm used for both classification and regression problems. It is based on ensemble learning, which combines multiple models to solve a complex problem and improve predictive performance. Aggregating the results of multiple predictors often gives a better prediction than the best individual predictor. A group of predictors is called an ensemble, so this technique is called ensemble learning.

The random forest algorithm builds multiple decision trees and collects the prediction from each tree. For classification, the final result is the class that receives the majority of the votes; for regression, it is the average of the trees' predictions. This technique is called Random Forest.
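The majority vote described above can be illustrated with a minimal sketch in base R. The votes matrix is made up for illustration (five hypothetical trees, three hypothetical patients) and the helper name majority_vote is an assumption, not part of any package.

```r
# Hypothetical votes from five trees (columns) for three patients (rows).
votes <- matrix(
  c("Healthy",     "Healthy",     "Not Healthy", "Healthy",     "Not Healthy",
    "Not Healthy", "Not Healthy", "Healthy",     "Not Healthy", "Not Healthy",
    "Healthy",     "Healthy",     "Healthy",     "Not Healthy", "Healthy"),
  nrow = 3, byrow = TRUE
)

# For each patient, pick the class predicted by the most trees.
majority_vote <- function(v) names(which.max(table(v)))
final <- apply(votes, 1, majority_vote)
print(final)
```

A real forest applies the same idea across hundreds of trees; ties are not handled specially in this sketch.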

We will proceed as follows to build a random forest:

Step 1) Import the data

Step 2) Train the model

Step 3) Set the control parameter

Step 4) Optimize the model

Step 5) Check variable importance

Importing required library packages

   library(randomForest)
   library(ggplot2)
   library(caret)

Import the Data

Here we use a dataset that originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict diagnostically whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

To import the Diabetes dataset:

    library(readr)
    diabetes <- read_csv("C:/Users/admin/Desktop/softanbees class-assignments/Diabetes.csv")
## Parsed with column specification:
## cols(
##   Pregnancies = col_double(),
##   Glucose = col_double(),
##   BloodPressure = col_double(),
##   SkinThickness = col_double(),
##   Insulin = col_double(),
##   BMI = col_double(),
##   DiabetesPedigreeFunction = col_double(),
##   Age = col_double(),
##   Outcome = col_double()
## )

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

    head(diabetes)
## # A tibble: 6 x 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI DiabetesPedigre~
##         <dbl>   <dbl>         <dbl>         <dbl>   <dbl> <dbl>            <dbl>
## 1           6     148            72            35       0  33.6            0.627
## 2           1      85            66            29       0  26.6            0.351
## 3           8     183            64             0       0  23.3            0.672
## 4           1      89            66            23      94  28.1            0.167
## 5           0     137            40            35     168  43.1            2.29 
## 6           5     116            74             0       0  25.6            0.201
## # ... with 2 more variables: Age <dbl>, Outcome <dbl>
    str(diabetes)
## tibble [768 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Pregnancies             : num [1:768] 6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : num [1:768] 148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : num [1:768] 72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : num [1:768] 35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : num [1:768] 0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num [1:768] 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num [1:768] 0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : num [1:768] 50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : num [1:768] 1 0 1 0 1 0 1 0 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Pregnancies = col_double(),
##   ..   Glucose = col_double(),
##   ..   BloodPressure = col_double(),
##   ..   SkinThickness = col_double(),
##   ..   Insulin = col_double(),
##   ..   BMI = col_double(),
##   ..   DiabetesPedigreeFunction = col_double(),
##   ..   Age = col_double(),
##   ..   Outcome = col_double()
##   .. )
    attach(diabetes)

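Here we change the data type of the target variable from numeric to factor, because it is a categorical variable. We also scale the continuous predictor variables. Strictly speaking, tree-based models such as random forest do not need scaled predictors, since splits depend only on the ordering of the values, but standardizing puts all predictors on a common scale. To change the data type of the target variable: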

   diabetes$Outcome <- ifelse(diabetes$Outcome == 1,"Not Healthy", "Healthy")
   diabetes$Outcome <- as.factor(diabetes$Outcome)

To scale the continuous variables:

   diabetes$Pregnancies <- scale(diabetes$Pregnancies)
   diabetes$Glucose <- scale(diabetes$Glucose)
   diabetes$BloodPressure <- scale(diabetes$BloodPressure)
   diabetes$SkinThickness <- scale(diabetes$SkinThickness)
   diabetes$Insulin <- scale(diabetes$Insulin)
   diabetes$BMI <- scale(diabetes$BMI)
   diabetes$DiabetesPedigreeFunction <- scale(diabetes$DiabetesPedigreeFunction)
   diabetes$Age <- scale(diabetes$Age)
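As a small check of what scale() does to each column, the sketch below compares it with the manual formula (value minus mean, divided by the sample standard deviation). The vector x simply reuses the first five Glucose values printed earlier; the variable names are illustrative only.

```r
# First five Glucose values from head(diabetes), used as a toy vector.
x <- c(148, 85, 183, 89, 137)

# scale() centers on the mean and divides by the sample standard deviation.
# as.numeric() drops the one-column matrix shape that scale() returns.
z_builtin <- as.numeric(scale(x))
z_manual  <- (x - mean(x)) / sd(x)

print(all.equal(z_builtin, z_manual))
```

After scaling, the values have mean 0 and standard deviation 1, while their ordering is unchanged.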

Train the Model

Now we will do cross-validation with k = 10 folds. First, let’s understand why this helps.

The basic idea behind cross-validation techniques is to divide the data into two sets: 1. the training set, used to train (i.e. build) the model; 2. the testing set (or validation set), used to test (i.e. validate) the model by estimating the prediction error. Cross-validation is also known as a re-sampling method because it involves fitting the same statistical method multiple times using different subsets of the data.

Usually we call “set.seed” and then split the main dataset into training and testing sets once. In some cases the training set may exclude some important rows, so accuracy drops considerably. The result also changes with the value passed to “set.seed”, which makes it quite unpredictable. So we use cross-validation.

For example, suppose we have a dataset of 1000 rows and want to split it into training and testing sets with 5 folds. Five folds means the dataset is split into 5 parts of 200 rows each. In the first round, the 1st part is the testing set and the other 4 parts are the training set; the model is fitted and its accuracy is recorded. In the next round, the 2nd part is the testing set and the remaining four parts are the training set, and accuracy is measured again. The process continues in this way until, in the last round, the 5th part is the testing set and the rest are the training set.

Why is this better than the previous approach? Calling set.seed and splitting the data into training and testing sets just once (hold-out validation) is itself a simple form of validation, but the training set may miss some important rows, so the model may fail to recognise similar rows in the testing set. Accuracy then drops, and the held-out rows are never used for training. Cross-validation reduces this problem because every part is used for both training and testing across the rounds, and the reported accuracy is the average over all rounds.
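The 5-fold example above can be sketched in base R as follows. This only shows how the rows are divided; no model is fitted, and the seed, n and k are illustrative values taken from the example rather than from the diabetes data.

```r
set.seed(456)   # any seed works; fixed here only for reproducibility
n <- 1000       # rows in the hypothetical dataset
k <- 5          # number of folds

# Assign each row to one of k folds at random, 200 rows per fold.
fold_id <- sample(rep(1:k, length.out = n))

for (i in 1:k) {
  test_rows  <- which(fold_id == i)   # fold i is held out for testing
  train_rows <- which(fold_id != i)   # the other k - 1 folds train the model
  cat("Fold", i, ": train =", length(train_rows),
      ", test =", length(test_rows), "\n")
}
```

Every row appears in exactly one testing fold, so each row is used for testing once and for training k - 1 times.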

For this reason, it is good practice to use cross-validation when building the model. Cross-validation with 10 folds:

   set.seed(456)
   library(caret)
   train_control <- trainControl(method="cv", number=10)

Random forest builds many decision trees, each considering a random subset of features at every split. For classification, the model takes a majority vote over the predictions of the trees; for regression, it averages them.

Random forest has some parameters that can be changed to improve the generalization of the prediction. The underlying function is randomForest() from the randomForest package; here we train the model through caret's train() function with method="rf", using the syntax below.

Fit random-forest Model

   model <- train(Outcome~., data=diabetes, trControl=train_control, method="rf")
   print(model)
## Random Forest 
## 
## 768 samples
##   8 predictor
##   2 classes: 'Healthy', 'Not Healthy' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 691, 691, 692, 692, 691, 691, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.7551606  0.4443989
##   5     0.7564593  0.4546849
##   8     0.7617054  0.4638439
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.

Here, in the formula, Outcome is the target variable and the ~ sign denotes that Outcome depends on the predictors on the right-hand side. The “.” means that all other columns in the dataset are used as predictors.

trControl: A list of values that define how this function acts. See trainControl and http://topepo.github.io/caret/using-your-own-model-in-train.html. (NOTE: If given, this argument must be named.)

method: A string specifying which classification or regression model to use. Possible values are found using names(getModelInfo()). See http://topepo.github.io/caret/train-models-by-tag.html. A list of functions can also be passed for a custom model function. See http://topepo.github.io/caret/using-your-own-model-in-train.html for details.

Building tree

   library(randomForest)
   model1 <- randomForest(Outcome ~ ., data = diabetes)
   print(model1)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = diabetes) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 23.96%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         427          73   0.1460000
## Not Healthy     111         157   0.4141791
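The error figures in this output can be checked by hand from the confusion matrix. The sketch below copies the four counts from the output above into a matrix (the object names cm, class_error and oob_error are illustrative) and recomputes the per-class errors and the overall OOB error.

```r
# Counts copied from the confusion matrix printed above
# (rows = actual class, columns = predicted class).
cm <- matrix(c(427,  73,
               111, 157),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("Healthy", "Not Healthy"),
                             predicted = c("Healthy", "Not Healthy")))

# Per-class error: misclassified rows of a class / all rows of that class.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: all misclassified rows / all rows.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 7))
print(round(100 * oob_error, 2))
```

These reproduce the class.error column (73/500 and 111/268) and the 23.96% OOB estimate (184 misclassified rows out of 768).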

Now let’s discuss the term “out-of-bag (OOB) estimate of error rate”. Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating (bagging) to sub-sample the data used for training. OOB error is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample.[1]

In simple words, each tree is trained on a bootstrap sample drawn with replacement from the data, so some rows are left out of that tree's sample. After the tree is fitted, it predicts those left-out rows, which act as a test set for that tree. Each row is then classified by a vote among only the trees that did not see it, and the OOB error is 1 minus the accuracy of these predictions. Note that this is separate from the 10-fold cross-validation used with caret above: the OOB estimate is computed inside randomForest() from the bootstrap samples and does not involve folds. Now let’s look at the OOB, Healthy and Not Healthy error rates as trees are added:

   print(model1$err.rate)
##              OOB   Healthy Not Healthy
##   [1,] 0.3344948 0.2659574   0.4646465
##   [2,] 0.3242678 0.2345277   0.4853801
##   [3,] 0.3188153 0.2278820   0.4875622
##   [4,] 0.3013910 0.2132701   0.4666667
##   [5,] 0.3106936 0.2195122   0.4813278
##   [6,] 0.3059805 0.2038627   0.4940711
##   [7,] 0.3074830 0.2029289   0.5019455
##   [8,] 0.2929427 0.1905738   0.4828897
##   [9,] 0.2998679 0.2008114   0.4848485
##  [10,] 0.3083004 0.2064777   0.4981132
##  [11,] 0.2877792 0.1878788   0.4736842
##  [12,] 0.2844037 0.1854839   0.4681648
##  [13,] 0.2640523 0.1726908   0.4344569
##  [14,] 0.2715405 0.1663327   0.4681648
##  [15,] 0.2659713 0.1703407   0.4440299
##  [16,] 0.2698827 0.1683367   0.4589552
##  [17,] 0.2656250 0.1740000   0.4365672
##  [18,] 0.2591146 0.1600000   0.4440299
##  [19,] 0.2526042 0.1500000   0.4440299
##  [20,] 0.2434896 0.1500000   0.4179104
##  [21,] 0.2460938 0.1500000   0.4253731
##  [22,] 0.2434896 0.1500000   0.4179104
##  [23,] 0.2500000 0.1620000   0.4141791
##  [24,] 0.2500000 0.1600000   0.4179104
##  [25,] 0.2513021 0.1600000   0.4216418
##  [26,] 0.2473958 0.1560000   0.4179104
##  [27,] 0.2434896 0.1480000   0.4216418
##  [28,] 0.2500000 0.1540000   0.4291045
##  [29,] 0.2395833 0.1420000   0.4216418
##  [30,] 0.2382812 0.1520000   0.3992537
##  [31,] 0.2421875 0.1500000   0.4141791
##  [32,] 0.2408854 0.1480000   0.4141791
##  [33,] 0.2447917 0.1520000   0.4179104
##  [34,] 0.2500000 0.1600000   0.4179104
##  [35,] 0.2408854 0.1500000   0.4104478
##  [36,] 0.2447917 0.1560000   0.4104478
##  [37,] 0.2473958 0.1520000   0.4253731
##  [38,] 0.2500000 0.1600000   0.4179104
##  [39,] 0.2408854 0.1520000   0.4067164
##  [40,] 0.2500000 0.1540000   0.4291045
##  [41,] 0.2473958 0.1560000   0.4179104
##  [42,] 0.2500000 0.1600000   0.4179104
##  [43,] 0.2421875 0.1540000   0.4067164
##  [44,] 0.2539062 0.1620000   0.4253731
##  [45,] 0.2460938 0.1540000   0.4179104
##  [46,] 0.2552083 0.1660000   0.4216418
##  [47,] 0.2513021 0.1620000   0.4179104
##  [48,] 0.2447917 0.1540000   0.4141791
##  [49,] 0.2460938 0.1580000   0.4104478
##  [50,] 0.2500000 0.1620000   0.4141791
##  [51,] 0.2434896 0.1560000   0.4067164
##  [52,] 0.2395833 0.1520000   0.4029851
##  [53,] 0.2460938 0.1560000   0.4141791
##  [54,] 0.2408854 0.1500000   0.4104478
##  [55,] 0.2421875 0.1440000   0.4253731
##  [56,] 0.2421875 0.1460000   0.4216418
##  [57,] 0.2408854 0.1460000   0.4179104
##  [58,] 0.2317708 0.1380000   0.4067164
##  [59,] 0.2356771 0.1420000   0.4104478
##  [60,] 0.2421875 0.1540000   0.4067164
##  [61,] 0.2395833 0.1500000   0.4067164
##  [62,] 0.2369792 0.1440000   0.4104478
##  [63,] 0.2408854 0.1500000   0.4104478
##  [64,] 0.2395833 0.1500000   0.4067164
##  [65,] 0.2421875 0.1500000   0.4141791
##  [66,] 0.2408854 0.1480000   0.4141791
##  [67,] 0.2395833 0.1440000   0.4179104
##  [68,] 0.2421875 0.1480000   0.4179104
##  [69,] 0.2408854 0.1520000   0.4067164
##  [70,] 0.2408854 0.1500000   0.4104478
##  [71,] 0.2421875 0.1500000   0.4141791
##  [72,] 0.2408854 0.1500000   0.4104478
##  [73,] 0.2434896 0.1540000   0.4104478
##  [74,] 0.2447917 0.1540000   0.4141791
##  [75,] 0.2434896 0.1540000   0.4104478
##  [76,] 0.2421875 0.1520000   0.4104478
##  [77,] 0.2447917 0.1560000   0.4104478
##  [78,] 0.2408854 0.1520000   0.4067164
##  [79,] 0.2408854 0.1520000   0.4067164
##  [80,] 0.2421875 0.1540000   0.4067164
##  [81,] 0.2395833 0.1520000   0.4029851
##  [82,] 0.2408854 0.1540000   0.4029851
##  [83,] 0.2369792 0.1520000   0.3955224
##  [84,] 0.2408854 0.1540000   0.4029851
##  [85,] 0.2421875 0.1520000   0.4104478
##  [86,] 0.2421875 0.1540000   0.4067164
##  [87,] 0.2395833 0.1460000   0.4141791
##  [88,] 0.2434896 0.1520000   0.4141791
##  [89,] 0.2356771 0.1460000   0.4029851
##  [90,] 0.2395833 0.1500000   0.4067164
##  [91,] 0.2343750 0.1500000   0.3917910
##  [92,] 0.2369792 0.1480000   0.4029851
##  [93,] 0.2382812 0.1500000   0.4029851
##  [94,] 0.2382812 0.1460000   0.4104478
##  [95,] 0.2356771 0.1440000   0.4067164
##  [96,] 0.2356771 0.1440000   0.4067164
##  [97,] 0.2369792 0.1440000   0.4104478
##  [98,] 0.2369792 0.1440000   0.4104478
##  [99,] 0.2382812 0.1460000   0.4104478
## [100,] 0.2395833 0.1460000   0.4141791
## [101,] 0.2369792 0.1440000   0.4104478
## [102,] 0.2356771 0.1460000   0.4029851
## [103,] 0.2343750 0.1440000   0.4029851
## [104,] 0.2369792 0.1500000   0.3992537
## [105,] 0.2356771 0.1500000   0.3955224
## [106,] 0.2317708 0.1440000   0.3955224
## [107,] 0.2343750 0.1460000   0.3992537
## [108,] 0.2330729 0.1440000   0.3992537
## [109,] 0.2343750 0.1480000   0.3955224
## [110,] 0.2304688 0.1400000   0.3992537
## [111,] 0.2304688 0.1400000   0.3992537
## [112,] 0.2330729 0.1400000   0.4067164
## [113,] 0.2382812 0.1440000   0.4141791
## [114,] 0.2317708 0.1380000   0.4067164
## [115,] 0.2317708 0.1380000   0.4067164
## [116,] 0.2304688 0.1400000   0.3992537
## [117,] 0.2330729 0.1420000   0.4029851
## [118,] 0.2304688 0.1400000   0.3992537
## [119,] 0.2317708 0.1420000   0.3992537
## [120,] 0.2317708 0.1420000   0.3992537
## [121,] 0.2330729 0.1420000   0.4029851
## [122,] 0.2317708 0.1420000   0.3992537
## [123,] 0.2330729 0.1420000   0.4029851
## [124,] 0.2317708 0.1420000   0.3992537
## [125,] 0.2330729 0.1440000   0.3992537
## [126,] 0.2356771 0.1440000   0.4067164
## [127,] 0.2343750 0.1440000   0.4029851
## [128,] 0.2330729 0.1460000   0.3955224
## [129,] 0.2343750 0.1440000   0.4029851
## [130,] 0.2330729 0.1420000   0.4029851
## [131,] 0.2317708 0.1420000   0.3992537
## [132,] 0.2356771 0.1460000   0.4029851
## [133,] 0.2382812 0.1460000   0.4104478
## [134,] 0.2356771 0.1440000   0.4067164
## [135,] 0.2330729 0.1400000   0.4067164
## [136,] 0.2343750 0.1420000   0.4067164
## [137,] 0.2343750 0.1440000   0.4029851
## [138,] 0.2304688 0.1400000   0.3992537
## [139,] 0.2317708 0.1400000   0.4029851
## [140,] 0.2304688 0.1400000   0.3992537
## [141,] 0.2291667 0.1400000   0.3955224
## [142,] 0.2278646 0.1400000   0.3917910
## [143,] 0.2317708 0.1420000   0.3992537
## [144,] 0.2291667 0.1420000   0.3917910
## [145,] 0.2291667 0.1400000   0.3955224
## [146,] 0.2291667 0.1420000   0.3917910
## [147,] 0.2291667 0.1380000   0.3992537
## [148,] 0.2304688 0.1420000   0.3955224
## [149,] 0.2278646 0.1400000   0.3917910
## [150,] 0.2265625 0.1380000   0.3917910
## [151,] 0.2291667 0.1380000   0.3992537
## [152,] 0.2291667 0.1380000   0.3992537
## [153,] 0.2278646 0.1380000   0.3955224
## [154,] 0.2330729 0.1420000   0.4029851
## [155,] 0.2330729 0.1420000   0.4029851
## [156,] 0.2330729 0.1440000   0.3992537
## [157,] 0.2330729 0.1440000   0.3992537
## [158,] 0.2317708 0.1420000   0.3992537
## [159,] 0.2317708 0.1440000   0.3955224
## [160,] 0.2330729 0.1440000   0.3992537
## [161,] 0.2343750 0.1420000   0.4067164
## [162,] 0.2343750 0.1440000   0.4029851
## [163,] 0.2304688 0.1420000   0.3955224
## [164,] 0.2304688 0.1380000   0.4029851
## [165,] 0.2291667 0.1400000   0.3955224
## [166,] 0.2317708 0.1420000   0.3992537
## [167,] 0.2343750 0.1440000   0.4029851
## [168,] 0.2330729 0.1440000   0.3992537
## [169,] 0.2317708 0.1400000   0.4029851
## [170,] 0.2304688 0.1400000   0.3992537
## [171,] 0.2291667 0.1400000   0.3955224
## [172,] 0.2278646 0.1400000   0.3917910
## [173,] 0.2291667 0.1400000   0.3955224
## [174,] 0.2278646 0.1420000   0.3880597
## [175,] 0.2252604 0.1380000   0.3880597
## [176,] 0.2278646 0.1380000   0.3955224
## [177,] 0.2317708 0.1440000   0.3955224
## [178,] 0.2317708 0.1420000   0.3992537
## [179,] 0.2330729 0.1400000   0.4067164
## [180,] 0.2330729 0.1420000   0.4029851
## [181,] 0.2343750 0.1420000   0.4067164
## [182,] 0.2343750 0.1420000   0.4067164
## [183,] 0.2343750 0.1420000   0.4067164
## [184,] 0.2330729 0.1420000   0.4029851
## [185,] 0.2330729 0.1400000   0.4067164
## [186,] 0.2304688 0.1400000   0.3992537
## [187,] 0.2317708 0.1420000   0.3992537
## [188,] 0.2369792 0.1420000   0.4141791
## [189,] 0.2382812 0.1440000   0.4141791
## [190,] 0.2369792 0.1440000   0.4104478
## [191,] 0.2369792 0.1440000   0.4104478
## [192,] 0.2408854 0.1460000   0.4179104
## [193,] 0.2395833 0.1440000   0.4179104
## [194,] 0.2382812 0.1420000   0.4179104
## [195,] 0.2369792 0.1440000   0.4104478
## [196,] 0.2356771 0.1420000   0.4104478
## [197,] 0.2382812 0.1420000   0.4179104
## [198,] 0.2382812 0.1420000   0.4179104
## [199,] 0.2395833 0.1440000   0.4179104
## [200,] 0.2382812 0.1440000   0.4141791
## [201,] 0.2382812 0.1420000   0.4179104
## [202,] 0.2395833 0.1440000   0.4179104
## [203,] 0.2408854 0.1460000   0.4179104
## [204,] 0.2382812 0.1460000   0.4104478
## [205,] 0.2395833 0.1460000   0.4141791
## [206,] 0.2421875 0.1460000   0.4216418
## [207,] 0.2382812 0.1440000   0.4141791
## [208,] 0.2382812 0.1440000   0.4141791
## [209,] 0.2369792 0.1440000   0.4104478
## [210,] 0.2369792 0.1420000   0.4141791
## [211,] 0.2382812 0.1440000   0.4141791
## [212,] 0.2382812 0.1440000   0.4141791
## [213,] 0.2395833 0.1440000   0.4179104
## [214,] 0.2395833 0.1440000   0.4179104
## [215,] 0.2369792 0.1420000   0.4141791
## [216,] 0.2369792 0.1420000   0.4141791
## [217,] 0.2369792 0.1400000   0.4179104
## [218,] 0.2369792 0.1420000   0.4141791
## [219,] 0.2382812 0.1420000   0.4179104
## [220,] 0.2369792 0.1420000   0.4141791
## [221,] 0.2356771 0.1420000   0.4104478
## [222,] 0.2382812 0.1420000   0.4179104
## [223,] 0.2395833 0.1440000   0.4179104
## [224,] 0.2395833 0.1420000   0.4216418
## [225,] 0.2382812 0.1400000   0.4216418
## [226,] 0.2369792 0.1400000   0.4179104
## [227,] 0.2343750 0.1400000   0.4104478
## [228,] 0.2343750 0.1400000   0.4104478
## [229,] 0.2369792 0.1400000   0.4179104
## [230,] 0.2356771 0.1400000   0.4141791
## [231,] 0.2356771 0.1400000   0.4141791
## [232,] 0.2356771 0.1380000   0.4179104
## [233,] 0.2369792 0.1380000   0.4216418
## [234,] 0.2356771 0.1380000   0.4179104
## [235,] 0.2343750 0.1380000   0.4141791
## [236,] 0.2382812 0.1460000   0.4104478
## [237,] 0.2369792 0.1440000   0.4104478
## [238,] 0.2369792 0.1440000   0.4104478
## [239,] 0.2369792 0.1420000   0.4141791
## [240,] 0.2356771 0.1420000   0.4104478
## [241,] 0.2395833 0.1440000   0.4179104
## [242,] 0.2382812 0.1440000   0.4141791
## [243,] 0.2382812 0.1440000   0.4141791
## [244,] 0.2356771 0.1440000   0.4067164
## [245,] 0.2343750 0.1400000   0.4104478
## [246,] 0.2330729 0.1420000   0.4029851
## [247,] 0.2356771 0.1440000   0.4067164
## [248,] 0.2343750 0.1420000   0.4067164
## [249,] 0.2356771 0.1420000   0.4104478
## [250,] 0.2343750 0.1400000   0.4104478
## [251,] 0.2317708 0.1400000   0.4029851
## [252,] 0.2343750 0.1420000   0.4067164
## [253,] 0.2343750 0.1440000   0.4029851
## [254,] 0.2356771 0.1420000   0.4104478
## [255,] 0.2330729 0.1440000   0.3992537
## [256,] 0.2304688 0.1400000   0.3992537
## [257,] 0.2317708 0.1420000   0.3992537
## [258,] 0.2304688 0.1400000   0.3992537
## [259,] 0.2317708 0.1440000   0.3955224
## [260,] 0.2304688 0.1400000   0.3992537
## [261,] 0.2330729 0.1420000   0.4029851
## [262,] 0.2317708 0.1420000   0.3992537
## [263,] 0.2291667 0.1400000   0.3955224
## [264,] 0.2304688 0.1420000   0.3955224
## [265,] 0.2304688 0.1440000   0.3917910
## [266,] 0.2291667 0.1400000   0.3955224
## [267,] 0.2291667 0.1400000   0.3955224
## [268,] 0.2304688 0.1420000   0.3955224
## [269,] 0.2291667 0.1400000   0.3955224
## [270,] 0.2278646 0.1400000   0.3917910
## [271,] 0.2278646 0.1420000   0.3880597
## [272,] 0.2278646 0.1400000   0.3917910
## [273,] 0.2304688 0.1440000   0.3917910
## [274,] 0.2304688 0.1440000   0.3917910
## [275,] 0.2304688 0.1440000   0.3917910
## [276,] 0.2317708 0.1420000   0.3992537
## [277,] 0.2291667 0.1420000   0.3917910
## [278,] 0.2291667 0.1440000   0.3880597
## [279,] 0.2304688 0.1440000   0.3917910
## [280,] 0.2291667 0.1400000   0.3955224
## [281,] 0.2278646 0.1400000   0.3917910
## [282,] 0.2252604 0.1380000   0.3880597
## [283,] 0.2265625 0.1380000   0.3917910
## [284,] 0.2265625 0.1380000   0.3917910
## [285,] 0.2291667 0.1420000   0.3917910
## [286,] 0.2304688 0.1420000   0.3955224
## [287,] 0.2278646 0.1400000   0.3917910
## [288,] 0.2291667 0.1400000   0.3955224
## [289,] 0.2278646 0.1380000   0.3955224
## [290,] 0.2278646 0.1380000   0.3955224
## [291,] 0.2265625 0.1360000   0.3955224
## [292,] 0.2317708 0.1400000   0.4029851
## [293,] 0.2265625 0.1360000   0.3955224
## [294,] 0.2278646 0.1360000   0.3992537
## [295,] 0.2278646 0.1360000   0.3992537
## [296,] 0.2278646 0.1380000   0.3955224
## [297,] 0.2265625 0.1360000   0.3955224
## [298,] 0.2291667 0.1360000   0.4029851
## [299,] 0.2330729 0.1400000   0.4067164
## [300,] 0.2343750 0.1420000   0.4067164
## [301,] 0.2330729 0.1420000   0.4029851
## [302,] 0.2343750 0.1400000   0.4104478
## [303,] 0.2304688 0.1400000   0.3992537
## [304,] 0.2304688 0.1380000   0.4029851
## [305,] 0.2356771 0.1440000   0.4067164
## [306,] 0.2343750 0.1440000   0.4029851
## [307,] 0.2330729 0.1420000   0.4029851
## [308,] 0.2382812 0.1460000   0.4104478
## [309,] 0.2356771 0.1460000   0.4029851
## [310,] 0.2343750 0.1460000   0.3992537
## [311,] 0.2356771 0.1440000   0.4067164
## [312,] 0.2330729 0.1440000   0.3992537
## [313,] 0.2369792 0.1480000   0.4029851
## [314,] 0.2356771 0.1480000   0.3992537
## [315,] 0.2369792 0.1460000   0.4067164
## [316,] 0.2356771 0.1460000   0.4029851
## [317,] 0.2382812 0.1460000   0.4104478
## [318,] 0.2382812 0.1460000   0.4104478
## [319,] 0.2343750 0.1440000   0.4029851
## [320,] 0.2330729 0.1440000   0.3992537
## [321,] 0.2369792 0.1440000   0.4104478
## [322,] 0.2382812 0.1440000   0.4141791
## [323,] 0.2382812 0.1440000   0.4141791
## [324,] 0.2369792 0.1460000   0.4067164
## [325,] 0.2369792 0.1460000   0.4067164
## [326,] 0.2369792 0.1460000   0.4067164
## [327,] 0.2356771 0.1420000   0.4104478
## [328,] 0.2356771 0.1440000   0.4067164
## [329,] 0.2382812 0.1460000   0.4104478
## [330,] 0.2369792 0.1420000   0.4141791
## [331,] 0.2356771 0.1420000   0.4104478
## [332,] 0.2369792 0.1420000   0.4141791
## [333,] 0.2421875 0.1480000   0.4179104
## [334,] 0.2369792 0.1460000   0.4067164
## [335,] 0.2382812 0.1460000   0.4104478
## [336,] 0.2382812 0.1440000   0.4141791
## [337,] 0.2356771 0.1420000   0.4104478
## [338,] 0.2369792 0.1420000   0.4141791
## [339,] 0.2330729 0.1400000   0.4067164
## [340,] 0.2356771 0.1440000   0.4067164
## [341,] 0.2317708 0.1380000   0.4067164
## [342,] 0.2343750 0.1400000   0.4104478
## [343,] 0.2343750 0.1400000   0.4104478
## [344,] 0.2356771 0.1420000   0.4104478
## [345,] 0.2369792 0.1440000   0.4104478
## [346,] 0.2369792 0.1440000   0.4104478
## [347,] 0.2356771 0.1440000   0.4067164
## [348,] 0.2330729 0.1400000   0.4067164
## [349,] 0.2330729 0.1400000   0.4067164
## [350,] 0.2343750 0.1420000   0.4067164
## [351,] 0.2330729 0.1400000   0.4067164
## [352,] 0.2330729 0.1400000   0.4067164
## [353,] 0.2330729 0.1400000   0.4067164
## [354,] 0.2330729 0.1400000   0.4067164
## [355,] 0.2343750 0.1400000   0.4104478
## [356,] 0.2356771 0.1420000   0.4104478
## [357,] 0.2356771 0.1420000   0.4104478
## [358,] 0.2343750 0.1420000   0.4067164
## [359,] 0.2369792 0.1440000   0.4104478
## [360,] 0.2395833 0.1460000   0.4141791
## [361,] 0.2343750 0.1420000   0.4067164
## [362,] 0.2369792 0.1420000   0.4141791
## [363,] 0.2330729 0.1400000   0.4067164
## [364,] 0.2317708 0.1400000   0.4029851
## [365,] 0.2330729 0.1420000   0.4029851
## [366,] 0.2304688 0.1400000   0.3992537
## [367,] 0.2330729 0.1420000   0.4029851
## [368,] 0.2304688 0.1400000   0.3992537
## [369,] 0.2304688 0.1400000   0.3992537
## [370,] 0.2317708 0.1440000   0.3955224
## [371,] 0.2317708 0.1440000   0.3955224
## [372,] 0.2356771 0.1440000   0.4067164
## [373,] 0.2291667 0.1400000   0.3955224
## [374,] 0.2343750 0.1420000   0.4067164
## [375,] 0.2343750 0.1440000   0.4029851
## [376,] 0.2330729 0.1420000   0.4029851
## [377,] 0.2356771 0.1420000   0.4104478
## [378,] 0.2356771 0.1420000   0.4104478
## [379,] 0.2382812 0.1420000   0.4179104
## [380,] 0.2356771 0.1420000   0.4104478
## [381,] 0.2382812 0.1420000   0.4179104
## [382,] 0.2369792 0.1420000   0.4141791
## [383,] 0.2343750 0.1400000   0.4104478
## [384,] 0.2369792 0.1420000   0.4141791
## [385,] 0.2369792 0.1420000   0.4141791
## [386,] 0.2395833 0.1440000   0.4179104
## [387,] 0.2369792 0.1420000   0.4141791
## [388,] 0.2369792 0.1400000   0.4179104
## [389,] 0.2369792 0.1420000   0.4141791
## [390,] 0.2369792 0.1400000   0.4179104
## [391,] 0.2369792 0.1420000   0.4141791
## [392,] 0.2369792 0.1420000   0.4141791
## [393,] 0.2382812 0.1440000   0.4141791
## [394,] 0.2369792 0.1440000   0.4104478
## [395,] 0.2382812 0.1460000   0.4104478
## [396,] 0.2382812 0.1460000   0.4104478
## [397,] 0.2382812 0.1460000   0.4104478
## [398,] 0.2395833 0.1460000   0.4141791
## [399,] 0.2382812 0.1440000   0.4141791
## [400,] 0.2369792 0.1440000   0.4104478
## [401,] 0.2382812 0.1440000   0.4141791
## [402,] 0.2369792 0.1440000   0.4104478
## [403,] 0.2343750 0.1420000   0.4067164
## [404,] 0.2356771 0.1440000   0.4067164
## [405,] 0.2369792 0.1440000   0.4104478
## [406,] 0.2369792 0.1440000   0.4104478
## [407,] 0.2369792 0.1460000   0.4067164
## [408,] 0.2356771 0.1420000   0.4104478
## [409,] 0.2356771 0.1420000   0.4104478
## [410,] 0.2356771 0.1420000   0.4104478
## [411,] 0.2356771 0.1440000   0.4067164
## [412,] 0.2369792 0.1440000   0.4104478
## [413,] 0.2356771 0.1440000   0.4067164
## [414,] 0.2356771 0.1440000   0.4067164
## [415,] 0.2369792 0.1440000   0.4104478
## [416,] 0.2382812 0.1460000   0.4104478
## [417,] 0.2343750 0.1420000   0.4067164
## [418,] 0.2356771 0.1440000   0.4067164
## [419,] 0.2343750 0.1420000   0.4067164
## [420,] 0.2356771 0.1440000   0.4067164
## [421,] 0.2382812 0.1460000   0.4104478
## [422,] 0.2369792 0.1440000   0.4104478
## [423,] 0.2369792 0.1440000   0.4104478
## [424,] 0.2369792 0.1440000   0.4104478
## [425,] 0.2356771 0.1420000   0.4104478
## [426,] 0.2356771 0.1440000   0.4067164
## [427,] 0.2356771 0.1440000   0.4067164
## [428,] 0.2382812 0.1440000   0.4141791
## [429,] 0.2382812 0.1460000   0.4104478
## [430,] 0.2356771 0.1440000   0.4067164
## [431,] 0.2382812 0.1440000   0.4141791
## [432,] 0.2356771 0.1420000   0.4104478
## [433,] 0.2395833 0.1440000   0.4179104
## [434,] 0.2382812 0.1440000   0.4141791
## [435,] 0.2369792 0.1440000   0.4104478
## [436,] 0.2369792 0.1440000   0.4104478
## [437,] 0.2356771 0.1420000   0.4104478
## [438,] 0.2356771 0.1400000   0.4141791
## [439,] 0.2382812 0.1440000   0.4141791
## [440,] 0.2395833 0.1440000   0.4179104
## [441,] 0.2382812 0.1440000   0.4141791
## [442,] 0.2343750 0.1400000   0.4104478
## [443,] 0.2356771 0.1420000   0.4104478
## [444,] 0.2356771 0.1420000   0.4104478
## [445,] 0.2369792 0.1420000   0.4141791
## [446,] 0.2356771 0.1440000   0.4067164
## [447,] 0.2382812 0.1460000   0.4104478
## [448,] 0.2343750 0.1420000   0.4067164
## [449,] 0.2356771 0.1440000   0.4067164
## [450,] 0.2356771 0.1420000   0.4104478
## [451,] 0.2356771 0.1440000   0.4067164
## [452,] 0.2369792 0.1440000   0.4104478
## [453,] 0.2356771 0.1440000   0.4067164
## [454,] 0.2356771 0.1440000   0.4067164
## [455,] 0.2369792 0.1460000   0.4067164
## [456,] 0.2369792 0.1460000   0.4067164
## [457,] 0.2382812 0.1460000   0.4104478
## [458,] 0.2369792 0.1460000   0.4067164
## [459,] 0.2395833 0.1460000   0.4141791
## [460,] 0.2395833 0.1460000   0.4141791
## [461,] 0.2369792 0.1460000   0.4067164
## [462,] 0.2369792 0.1460000   0.4067164
## [463,] 0.2343750 0.1440000   0.4029851
## [464,] 0.2343750 0.1440000   0.4029851
## [465,] 0.2356771 0.1460000   0.4029851
## [466,] 0.2343750 0.1440000   0.4029851
## [467,] 0.2343750 0.1440000   0.4029851
## [468,] 0.2356771 0.1440000   0.4067164
## [469,] 0.2369792 0.1460000   0.4067164
## [470,] 0.2356771 0.1460000   0.4029851
## [471,] 0.2356771 0.1460000   0.4029851
## [472,] 0.2369792 0.1440000   0.4104478
## [473,] 0.2343750 0.1440000   0.4029851
## [474,] 0.2382812 0.1460000   0.4104478
## [475,] 0.2369792 0.1440000   0.4104478
## [476,] 0.2369792 0.1440000   0.4104478
## [477,] 0.2369792 0.1440000   0.4104478
## [478,] 0.2369792 0.1440000   0.4104478
## [479,] 0.2369792 0.1440000   0.4104478
## [480,] 0.2356771 0.1440000   0.4067164
## [481,] 0.2356771 0.1440000   0.4067164
## [482,] 0.2356771 0.1440000   0.4067164
## [483,] 0.2369792 0.1460000   0.4067164
## [484,] 0.2356771 0.1440000   0.4067164
## [485,] 0.2382812 0.1460000   0.4104478
## [486,] 0.2382812 0.1460000   0.4104478
## [487,] 0.2369792 0.1460000   0.4067164
## [488,] 0.2382812 0.1460000   0.4104478
## [489,] 0.2369792 0.1460000   0.4067164
## [490,] 0.2382812 0.1460000   0.4104478
## [491,] 0.2395833 0.1460000   0.4141791
## [492,] 0.2382812 0.1460000   0.4104478
## [493,] 0.2382812 0.1460000   0.4104478
## [494,] 0.2395833 0.1460000   0.4141791
## [495,] 0.2382812 0.1460000   0.4104478
## [496,] 0.2395833 0.1460000   0.4141791
## [497,] 0.2382812 0.1460000   0.4104478
## [498,] 0.2395833 0.1460000   0.4141791
## [499,] 0.2395833 0.1460000   0.4141791
## [500,] 0.2395833 0.1460000   0.4141791

As you can see, there are 500 rows because I did not specify the ntree value when building the random forest model; in the randomForest package the default value of ntree is 500. ntree is the number of trees to grow. It should not be set too small, so that every input row gets predicted at least a few times. What do the OOB, Healthy and Not Healthy columns denote? Focus on the OOB column first. The 1st row gives the OOB (out-of-bag) error rate when only 1 tree has been grown; the 2nd row gives the OOB error rate for the first 2 trees taken together; the 3rd row gives it for the first 3 trees taken together; and so on up to the 500th row, where all 500 trees are taken together.

The 2nd column, the Healthy error rate, is the rate of predicting Not Healthy when the patient is actually Healthy. The 1st row is the Healthy error rate when 1 tree has been grown, the 2nd row when 2 trees have been grown, and so on.

The 3rd column, the Not Healthy error rate, is the rate of predicting Healthy when the patient is actually Not Healthy. The 1st row is the Not Healthy error rate when 1 tree has been grown, the 2nd row when 2 trees have been grown, and so on.
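
The three columns are linked: once every patient has been out-of-bag at least once, the OOB error is the average of the two class error rates weighted by class size. Here is a minimal sketch of that arithmetic for row 500 of the output above. The class sizes (500 Healthy, 268 Not Healthy) are read off the confusion matrices printed for this dataset, and the variable names are only for illustration.

```r
# Class sizes (taken from the confusion matrices printed for this dataset)
n_healthy <- 500
n_not_healthy <- 268

# Row 500 of the error-rate matrix shown above
err_healthy <- 0.1460000
err_not_healthy <- 0.4141791

# Number of misclassified patients in each class
wrong_healthy <- err_healthy * n_healthy
wrong_not_healthy <- err_not_healthy * n_not_healthy

# Overall OOB error = all misclassified patients / all patients
oob_error <- (wrong_healthy + wrong_not_healthy) / (n_healthy + n_not_healthy)
print(round(oob_error, 4))
```

The result, about 0.2396, agrees with the OOB value in row 500.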

Set the control parameter

   a) Find the best value of ntree
   b) Find the best value of mtry

a) Finding the best value of ntree
First, we build a data frame, oob.err.data, that holds the error rates for all the trees.

   oob.err.data <- data.frame(
     Trees = rep(1:nrow(model1$err.rate), 3),
     Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model1$err.rate)),
     Error = c(model1$err.rate[, "OOB"], model1$err.rate[, "Healthy"], model1$err.rate[, "Not Healthy"]))
   # Note: the total number of rows is 1500, since there are 3 types of error (OOB, Healthy, Unhealthy), each with 500 values: 500 x 3 = 1500

This code creates three columns named Trees, Type and Error.

  Trees = rep(1:nrow(model1$err.rate), 3)

Since nrow(model1$err.rate) is 500, this is the sequence 1 to 500 repeated 3 times, i.e. 1500 rows in total.

  Type = rep(c("OOB","Healthy","Unhealthy"), each = nrow(model1$err.rate))

The first 500 rows are labelled OOB, the second 500 rows are labelled Healthy and the third 500 rows are labelled Unhealthy.

  Error = c(model1$err.rate[,"OOB"], model1$err.rate[,"Healthy"], model1$err.rate[,"Not Healthy"])

In the Error column, the first 500 rows hold the OOB error rates of model1 for 1 to 500 trees, the second 500 rows hold the Healthy error rates and the third 500 rows hold the Not Healthy error rates.
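
The difference between repeating a whole sequence and repeating each element is easier to see on a tiny case. The sketch below uses a made-up error-rate matrix for just 2 trees (toy.err and its values are hypothetical, not output from model1) and stacks it into the same long format.

```r
# Toy error-rate matrix for 2 trees (made-up values, same column names as err.rate)
toy.err <- matrix(c(0.30, 0.25, 0.20, 0.15, 0.45, 0.40), nrow = 2,
                  dimnames = list(NULL, c("OOB", "Healthy", "Not Healthy")))

# Whole sequence repeated 3 times: 1 2 1 2 1 2
trees <- rep(1:nrow(toy.err), 3)

# Each label repeated once per tree: OOB OOB Healthy Healthy Unhealthy Unhealthy
type <- rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(toy.err))

# Stack the three columns on top of each other in the same order as the labels
toy.long <- data.frame(
  Trees = trees,
  Type = type,
  Error = c(toy.err[, "OOB"], toy.err[, "Healthy"], toy.err[, "Not Healthy"]))
print(toy.long)
```

Each row of toy.long is one (tree count, error type) pair, which is the shape ggplot needs to draw one line per Type.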

Now let's use ggplot to plot the error rate against the number of trees.

   library(ggplot2)
   ggplot(data = oob.err.data, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))

From this chart we can see three curves, Healthy, OOB and Unhealthy, plotted against the number of trees. The Healthy and Unhealthy error rates are the false positive and false negative rates. After about 300 trees the error rate lines are more or less flat or fluctuate randomly. Random fluctuation means the process has stabilized and no further improvement is taking place; this holds for OOB and Healthy. The Unhealthy line still drifts a little more in one direction, either upward or downward. You can check this in the error rates of model1, where the Unhealthy error rate is also quite high compared to Healthy and OOB.
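
To put a number on "randomly fluctuating", one can look at how far the OOB error moves over the last few trees. This is a small sketch using the OOB values of rows 491 to 500 copied from the output above. It assumes 768 patients in total (500 Healthy plus 268 Not Healthy, as in the confusion matrices), and the variable names are only for illustration.

```r
# OOB error for trees 491..500, copied from the printed err.rate output
oob_tail <- c(0.2395833, 0.2382812, 0.2382812, 0.2395833, 0.2382812,
              0.2395833, 0.2382812, 0.2395833, 0.2395833, 0.2395833)
n_patients <- 768  # assumed: 500 Healthy + 268 Not Healthy

# Spread of the OOB error over these trees
spread <- max(oob_tail) - min(oob_tail)

# Express the spread as a number of patients
print(round(spread * n_patients))
```

The spread corresponds to a single patient being classified differently, which supports reading the tail of the curve as noise rather than improvement.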

Now let's try ntree = 1000 to see whether or not the error rate improves.

# model built with 1000 trees
library(randomForest)
model2 <- randomForest(Outcome ~ ., data = diabetes, ntree = 1000)
print(model2)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = diabetes, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 22.92%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         430          70   0.1400000
## Not Healthy     106         162   0.3955224

Creating error rate dataframe for all the trees

   oob.err.data1 <- data.frame(
     Trees = rep(1:nrow(model2$err.rate), 3),
     Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model2$err.rate)),
     Error = c(model2$err.rate[, "OOB"], model2$err.rate[, "Healthy"], model2$err.rate[, "Not Healthy"]))

No. of tree vs error plot

    ggplot(data = oob.err.data1, aes(x = Trees, y= Error)) + geom_line(aes(color = Type))

Now we can see that the error rate does not improve: the lines stay flat, so the process has stabilized and adding more trees does not help. Since we observed no improvement after about 300 trees, we will set ntree = 300 to optimize the model.

b) Finding the best value of mtry
We test the model's OOB error with different numbers of randomly selected features (mtry).

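   set.seed(789)
   oob.values <- vector(length = 10)
   for (i in 1:10) {
     # fit a forest that tries i randomly chosen predictors at each split
     temp.model <- randomForest(Outcome ~ ., data = diabetes, mtry = i)
     # keep the OOB error rate after the last tree (final row, 1st column = OOB)
     oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
   }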

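Here we run a for loop that builds "temp.model" for each mtry value from 1 to 10. For each i, oob.values[i] stores the OOB error rate of the model with mtry = i. We then choose the mtry value with the lowest entry in oob.values. Note that the dataset has only 8 predictors, so mtry = 9 and mtry = 10 are outside the valid range; as far as I know, randomForest warns about an invalid mtry and resets it to the valid range, so those two runs are effectively refits with mtry = 8.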

print result

set.seed(123)
oob.values
##  [1] 0.2395833 0.2317708 0.2408854 0.2421875 0.2408854 0.2408854 0.2395833
##  [8] 0.2434896 0.2447917 0.2382812

So, to optimize our model we will take mtry = 2, which gives the lowest OOB error rate (0.2317708).
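
Instead of reading the smallest value off by eye, the choice can be made with which.min. This is a small sketch on the values copied from the printed output; oob_printed and best_mtry are names used only here.

```r
# OOB error rates for mtry = 1..10, copied from the printed output
oob_printed <- c(0.2395833, 0.2317708, 0.2408854, 0.2421875, 0.2408854,
                 0.2408854, 0.2395833, 0.2434896, 0.2447917, 0.2382812)

# Position of the smallest error, which here equals the mtry value
best_mtry <- which.min(oob_printed)
print(best_mtry)
print(oob_printed[best_mtry])

# Gap between the best and the worst setting
gap <- max(oob_printed) - min(oob_printed)
print(gap)
```

which.min picks position 2. The gap between the best and the worst setting is only about 0.013, so the ranking of the mtry values may well change with a different seed.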

Optimize the model

Building the final model with the chosen ntree and mtry values

set.seed(354)
model3 <- randomForest(Outcome ~ ., data = diabetes, ntree = 300, mtry = 2)
print(model3)
## 
## Call:
##  randomForest(formula = Outcome ~ ., data = diabetes, ntree = 300,      mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 23.44%
## Confusion matrix:
##             Healthy Not Healthy class.error
## Healthy         425          75    0.150000
## Not Healthy     105         163    0.391791

So, finally, our optimized model's OOB estimate of the error rate is 23.44%, i.e. our model's accuracy is 76.56%.
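
The OOB error rate, the accuracy and the class.error column can all be recomputed from the confusion matrix printed for model3. This is a minimal sketch in base R with the four counts typed in by hand; conf is a name used only here.

```r
# Confusion matrix of model3 (rows = actual class, columns = predicted class)
conf <- matrix(c(425, 105, 75, 163), nrow = 2,
               dimnames = list(actual = c("Healthy", "Not Healthy"),
                               predicted = c("Healthy", "Not Healthy")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
oob_rate <- 1 - accuracy

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

print(accuracy)     # 0.765625
print(oob_rate)     # 0.234375
print(class_error)
```

The two class errors come out as 0.15 and about 0.3918, matching the class.error column of the printed output.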

Additionally, we will check how important each predictor is to the model, i.e. how much it contributes. To measure this we use "MeanDecreaseGini". Gini is a measure of "inequality" when used to describe a society's distribution of income, or a measure of "node impurity" in tree-based classification. A lower Gini impurity after splitting (i.e. a higher decrease in Gini) means that a particular predictor variable plays a greater role in partitioning the data into the defined classes.
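
To make "node impurity" concrete: for a node with class proportions p, the Gini impurity is 1 - sum(p^2). The sketch below computes the decrease in Gini for one split. gini() is a helper defined here, not a randomForest function, and the child node counts are made up for illustration; only the 500/268 root counts come from this dataset.

```r
# Gini impurity of a node, computed from its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

root  <- c(Healthy = 500, NotHealthy = 268)  # class counts in this dataset
left  <- c(Healthy = 400, NotHealthy = 68)   # hypothetical child node
right <- c(Healthy = 100, NotHealthy = 200)  # hypothetical child node

# Impurity after the split: child impurities weighted by child size
after <- (sum(left) * gini(left) + sum(right) * gini(right)) / sum(root)

# Decrease in Gini achieved by this split
decrease <- gini(root) - after
print(gini(root))
print(decrease)
```

Roughly speaking, MeanDecreaseGini adds up such decreases over all the splits made on a variable and averages them over the trees; the exact weighting inside randomForest may differ from this simplified version.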

Variable importance checking

checking important predictors

importance(model3)
##                          MeanDecreaseGini
## Pregnancies                      30.06586
## Glucose                          88.54565
## BloodPressure                    29.88637
## SkinThickness                    24.62372
## Insulin                          24.87244
## BMI                              57.75925
## DiabetesPedigreeFunction         43.30932
## Age                              47.39903
varImpPlot(model3)

You can see that Glucose has the highest impact on predicting whether or not a patient has diabetes, and SkinThickness has the lowest. However, it is not a good idea to simply delete the predictor with the lowest impact: doing so loses data and may reduce accuracy. Even if accuracy increases after deleting that predictor, it is still advisable to check whether or not you are losing a significant amount of information by dropping it.
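
The raw MeanDecreaseGini numbers are easier to compare as shares of the total. This is a small sketch in base R using the values copied from the importance(model3) output; imp and imp_share are names used only here.

```r
# MeanDecreaseGini values copied from the importance(model3) output above
imp <- c(Pregnancies = 30.06586, Glucose = 88.54565, BloodPressure = 29.88637,
         SkinThickness = 24.62372, Insulin = 24.87244, BMI = 57.75925,
         DiabetesPedigreeFunction = 43.30932, Age = 47.39903)

# Share of the total decrease in Gini, in percent, from most to least important
imp_share <- sort(100 * imp / sum(imp), decreasing = TRUE)
print(round(imp_share, 1))
```

Glucose accounts for roughly a quarter of the total, while SkinThickness and Insulin each contribute about 7%. The lowest-ranked predictors are close to each other, which is another reason not to drop one of them on rank alone.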

I hope I was able to share some useful content with you. See you in the next article. My Website Link Saikat