AI4OPT

Fall 2022

Data Engineering and Mining II

December 5, 2022

The Random Forest Project

Directions:

You are given three datasets

  • the Iris dataset

  • the Pima Indians diabetes dataset, and

  • a dataset that you create and name yourself in Excel, then save as a .csv file (the “.csv Excel dataset”)

You are to run the .csv Excel dataset, plus one of the two remaining datasets, through the random forest model.

You should also improve the performance of each of the two models.

Remember:

  • Use a resampling method to split the dataset into subsets.

  • Implement, or call, the random forest model (use as many arguments as possible).

  • Evaluate the performance of the model using metrics such as “Accuracy,” etc.

  • If the model is not performing well, then go back and tune the rf parameters.
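
A minimal sketch of this workflow, using caret and the built-in iris data purely for illustration (the argument values shown are assumptions, not prescribed settings):

library(caret)
library(randomForest)

set.seed(1)
# Resampling method: 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Fit a random forest, tuning mtry over a small grid
fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = ctrl, tuneGrid = expand.grid(mtry = 2:4))
fit$results  # accuracy and kappa for each mtry value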

First Dataset: Left-Right-Up-Down (Butros.csv)

Loading some libraries

library(randomForest)
library(mlbench)
library(RCurl)
library(caret)
library(rpart)

Read the dataset Butros.csv

Michael <- read.csv("Butros.csv")
str(Michael)
## 'data.frame':    10 obs. of  4 variables:
##  $ Left : int  1 0 1 0 0 0 0 1 1 0
##  $ Right: int  45 0 92 18 26 48 41 52 64 80
##  $ Up   : int  24 26 32 41 80 76 92 39 46 50
##  $ Down : int  100 69 46 24 0 32 86 71 65 48

Create a Random Forest for the dataset

set.seed(2022)
rf <- randomForest(Left~.,data=Michael)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
rf
## 
## Call:
##  randomForest(formula = Left ~ ., data = Michael) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.2774899
##                     % Var explained: -15.62

Change Left to a factor and rerun the model

Michael$Left <- as.factor(Michael$Left)
str(Michael)
## 'data.frame':    10 obs. of  4 variables:
##  $ Left : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 2 2 1
##  $ Right: int  45 0 92 18 26 48 41 52 64 80
##  $ Up   : int  24 26 32 41 80 76 92 39 46 50
##  $ Down : int  100 69 46 24 0 32 86 71 65 48
set.seed(2022)
rf <- randomForest(Left~.,data=Michael)
rf
## 
## Call:
##  randomForest(formula = Left ~ ., data = Michael) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 70%
## Confusion matrix:
##   0 1 class.error
## 0 3 3         0.5
## 1 4 0         1.0
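
The directions ask for as many arguments as possible; a more heavily parameterized call might look like the sketch below (the values of ntree, mtry, and nodesize are illustrative assumptions, not tuned choices):

set.seed(2022)
rf_tuned <- randomForest(Left ~ ., data = Michael,
                         ntree = 1000,       # grow more trees for a steadier OOB estimate
                         mtry = 2,           # predictors tried at each split
                         nodesize = 1,       # minimum size of terminal nodes
                         importance = TRUE)  # record variable importance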

Evaluate the Random Forest performance using cross-validation

control <- trainControl(method = "cv", number = 3)
grid_rf <- expand.grid(mtry=3)
m_rf <- train(Left~., data=Michael, method = "rf", importance=TRUE, 
              trControl=control, tuneGrid = grid_rf)
m_rf
## Random Forest 
## 
## 10 samples
##  3 predictor
##  2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 7, 7, 6 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.7777778  0.6  
## 
## Tuning parameter 'mtry' was held constant at a value of 3

Evaluate the Random Forest performance using repeated cross-validation

fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
grid_rf <- expand.grid(mtry = 3)
m_rf <- train(Left ~ ., data = Michael, method = "rf", importance = TRUE,
              trControl = fitControl, tuneGrid = grid_rf)
m_rf
## Random Forest 
## 
## 10 samples
##  3 predictor
##  2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 7, 6, 7 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.6388889  0.3  
## 
## Tuning parameter 'mtry' was held constant at a value of 3

Making a prediction

pred <- predict(m_rf, Michael)
table(pred,Michael$Left)
##     
## pred 0 1
##    0 6 0
##    1 0 4

Conclusions

  • The accuracy for this dataset was not as high as we would like from a random forest. With only 10 observations, the resampling estimates are highly unstable, and correlation among the trees in the forest may also play a role.

  • We first ran a regression forest by mistake; converting Left to a factor, as the warning suggested, produced the intended classification forest.

  • The model scored higher under 3-fold cross-validation (accuracy 0.78) than under repeated cross-validation (accuracy 0.64), though with so few samples the difference is well within noise.

  • The prediction table had nonzero entries only on the main diagonal, that is, there were no misclassifications. This is less surprising than it looks, since the predictions were made on the same ten rows the model was trained on.



Second Dataset: Iris

Read dataset and load libraries

library(randomForest)
library(caret)
data("iris")

Split into training and testing subsets

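# note: no seed is set here, so this 80/20 split (and the results below) will differ from run to run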
Index <- createDataPartition(iris$Species,p=0.80, list=FALSE)
training <- iris[Index, ]
testing <- iris[-Index, ]

Create Random Forest

model <- randomForest(Species~., data=training)
print(model)
## 
## Call:
##  randomForest(formula = Species ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 5%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         40          0         0       0.000
## versicolor      0         37         3       0.075
## virginica       0          3        37       0.075
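
As an optional follow-up not shown in the original output, the fitted forest's variable importance can be inspected with functions from the randomForest package (a minimal sketch):

importance(model)   # mean decrease in Gini impurity per predictor
varImpPlot(model)   # dot chart of the same scores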

Make a prediction

pred <- predict(model, testing)
cm <- confusionMatrix(pred, testing$Species)
print(cm)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          0        10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8843, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : 4.857e-15  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

Tuning parameters on training subset

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
tmodel <- train(Species ~ ., data = training, method = "rf", trControl = fitControl)
print(tmodel)
## Random Forest 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9472222  0.9208333
##   3     0.9527778  0.9291667
##   4     0.9500000  0.9250000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
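
Here train searched its default grid of mtry values (2, 3, 4). An explicit grid could be supplied instead, mirroring the approach used for the other datasets; a sketch (tmodel.grid is a hypothetical name):

grid <- expand.grid(mtry = 2:4)
tmodel.grid <- train(Species ~ ., data = training, method = "rf",
                     trControl = fitControl, tuneGrid = grid)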

Making new predictions using testing subset

tpred <- predict(tmodel, testing)
ttable <- confusionMatrix(tpred, testing$Species)
print(ttable)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          0        10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8843, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : 4.857e-15  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

Conclusions

  • The accuracy of the model was high before tuning (a 5% OOB error on the training subset) and remained high after tuning the parameters of the random forest algorithm (95.3% resampled accuracy, 100% on the testing subset).

  • Parameters were tuned using repeated cross-validation with 10 folds repeated 3 times.

  • Optimal configuration after tuning:

    • mtry = 3

    • accuracy = 0.9527778

    • kappa = 0.9291667

  • Predictions made after tuning the parameters yielded no misclassifications on the testing subset, that is, an accuracy of 100%.



Third Dataset: Pima Indians Diabetes

Load library and check dataset structure

library(mlbench)
data(PimaIndiansDiabetes)
str(PimaIndiansDiabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
head(PimaIndiansDiabetes, n=5)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35       0 33.6    0.627  50      pos
## 2        1      85       66      29       0 26.6    0.351  31      neg
## 3        8     183       64       0       0 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
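
Note that several columns contain zeros that are physiologically impossible (e.g., mass and pressure values of 0); in this dataset such zeros are known to encode missing measurements. One possible cleaning step, not applied in the runs below (PimaClean is a hypothetical name):

# Recode impossible zeros as NA in the affected columns
zero_as_missing <- c("glucose", "pressure", "triceps", "insulin", "mass")
PimaClean <- PimaIndiansDiabetes
PimaClean[zero_as_missing] <- lapply(PimaClean[zero_as_missing],
                                     function(x) replace(x, x == 0, NA))
# randomForest can then handle the NAs, e.g. via na.action = na.roughfix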

Split dataset into training and testing subsets

trainIndex <- createDataPartition(PimaIndiansDiabetes$diabetes, p=0.80, list=FALSE)
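# warning: diabetes is already a factor, so this line only adds a duplicate
# copy of the response; the diabetes ~ . formulas below will pick it up as a predictor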
PimaIndiansDiabetes$Outcome <- as.factor(PimaIndiansDiabetes$diabetes)
diabetes.training <- PimaIndiansDiabetes[trainIndex, ]
diabetes.testing <- PimaIndiansDiabetes[-trainIndex, ]
prop.table(table(diabetes.training$Outcome))
## 
##       neg       pos 
## 0.6504065 0.3495935

Test Data Proportion

prop.table(table(diabetes.testing$diabetes))
## 
##       neg       pos 
## 0.6535948 0.3464052

Create Random Forest model

model <- randomForest(diabetes~., data=diabetes.training)
print(model)
## 
## Call:
##  randomForest(formula = diabetes ~ ., data = diabetes.training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##     neg pos class.error
## neg 400   0           0
## pos   0 215           0
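
A 0% out-of-bag error is a red flag rather than a success: the formula diabetes ~ . picked up the duplicated Outcome column (the caret summary below reports 9 predictors, versus the dataset's 8 measurements), so every tree could read the response directly. A leakage-free refit would drop that column first; a minimal sketch (model.clean is a hypothetical name, and its output is not shown):

# Exclude the duplicated response so only the 8 real measurements are predictors
model.clean <- randomForest(diabetes ~ . - Outcome, data = diabetes.training)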

Parameter tuning

control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
grid <- expand.grid(mtry = c(3, 4, 5))
model.random.forest <- train(diabetes ~ ., data = diabetes.training, method = "rf",
                             tuneGrid = grid, trControl = control)
model.random.forest
## Random Forest 
## 
## 615 samples
##   9 predictor
##   2 classes: 'neg', 'pos' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 615, 615, 615, 615, 615, 615, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##   3     1         1    
##   4     1         1    
##   5     1         1    
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
plot(model.random.forest)

Evaluate model performance

pred <- predict(model.random.forest,diabetes.testing)
confusionMatrix(pred,diabetes.testing$diabetes)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg 100   0
##        pos   0  53
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9762, 1)
##     No Information Rate : 0.6536     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6536     
##          Detection Rate : 0.6536     
##    Detection Prevalence : 0.6536     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : neg        
## 

Conclusions

  • The random forest yielded a 0% OOB error rate before parameter tuning. A perfect score here signals label leakage rather than model quality: the duplicated Outcome column, an exact copy of the diabetes response, was included as a predictor by the diabetes ~ . formula.

  • Parameters were tuned using repeated cross-validation with 10 folds and 10 repetitions, over a grid of mtry values (3, 4, and 5).

  • All values of mtry yielded accuracy and kappa values of 1, again a consequence of the leaked response column rather than of genuine predictive power.

  • Predictions from the tuned model on the testing subset yielded no misclassifications; with the leak removed (see the refit sketch above), accuracy this perfect should not be expected.