Random Forest Project - Iris Dataset

What is the Iris dataset and why is it so cool?

The iris dataset is widely known and was introduced by Ronald Fisher in 1936. It contains

  • Three iris species

    • Setosa

    • Virginica

    • Versicolor

  • Four features, measured in centimeters

    • Sepal Length

    • Sepal Width

    • Petal Length

    • Petal Width

All flowers have sepals and petals, and normally they are different colors, but that is not the case with the iris flower.

Below is a diagram of a typical flower showing the sepal and the petal.

What is the goal of this project?

The goal of this project is to build a random forest model that predicts the species of an iris flower and to evaluate its performance using metrics such as

  • Accuracy

  • Kappa

  • Specificity

  • Sensitivity

  • Error rate

In other words, we want to train the model on a portion of the data so it learns the patterns linking the features to the species, and then test it on unseen data to evaluate how well it classifies the flowers.

Step 1: Prepare the Dataset

Importing libraries

library(randomForest)  # randomForest()
library(mlbench)       # benchmark datasets (loaded but not used below)
library(datasets)      # the iris dataset
library(kernlab)       # kernel-based methods (loaded but not used below)
library(caret)         # createDataPartition(), train(), confusionMatrix()
library(klaR)          # NaiveBayes()
library(gmodels)       # CrossTable() (loaded but not used below)

Load the iris dataset and summarize attributes

data(iris)
# Display part of the data
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#Display statistical summary of data
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Analysis:

The iris dataset does not need to be normalized: all four features are measured in centimeters and span similar ranges (0.1 cm to 7.9 cm). Tree-based models such as random forests are also insensitive to feature scaling.
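To confirm the features are on comparable scales, we can inspect their ranges directly (a quick check, not part of the original code):

# Minimum and maximum of each numeric feature, in centimeters
sapply(iris[1:4], range)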

#Display name and data type of each column
sapply(iris, class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

Visualize data

Comparing Sepal Length with Sepal Width
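The scatterplot itself is not embedded in this write-up; here is a minimal sketch of how it could be produced with base R graphics (the original plotting code is not shown, so the styling is an assumption):

# Sepal dimensions colored by species (sketch; original plot code not shown)
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Sepal Length (cm)", ylab = "Sepal Width (cm)",
     main = "Sepal Length vs Sepal Width by Species")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 19)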

Analysis:

We can see that Setosa's sepal measurements are clearly separated from those of Versicolor and Virginica, which overlap with each other. I therefore predict that our random forest model may misclassify some Versicolor and Virginica observations, but not Setosa.

Comparing Petal Length and Petal Width
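Again, a minimal base R sketch of this plot (the original plotting code is not shown):

# Petal dimensions colored by species (sketch; original plot code not shown)
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)",
     main = "Petal Length vs Petal Width by Species")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)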

Analysis:

There is a strong positive correlation between petal width and petal length.

Step 2: Train the model

Resampling method: Data Split

Data splitting means partitioning the data into a training set and a testing set. The training set is used to fit the model; the testing set, which the model has never seen, is used to evaluate how well the model predicts the target variable, which in our case is Species.

In this first attempt we will train on 80% of the data and test on the remaining 20%. As a baseline before the random forest, we start with a Naive Bayes classifier.

set.seed(7)
trainIndex <- createDataPartition(iris$Species, p=0.80, list = FALSE)
dataTrain <- iris[trainIndex,]
dataTest <- iris[-trainIndex,]  #The argument -trainIndex is telling the algorithm to use all the data that is not in trainIndex. 
#Train a Naive Bayes model
fit <- NaiveBayes(Species~., data=dataTrain)
#Make Predictions
Predictions <- predict(fit, dataTest[,1:4])
#Summarize results 
confusionMatrix(Predictions$class,dataTest$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9000
## Specificity                 1.0000            0.9500           1.0000
## Pos Pred Value              1.0000            0.9091           1.0000
## Neg Pred Value              1.0000            1.0000           0.9524
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3000
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9750           0.9500

So what does this all mean? I got you!

CONFUSION MATRIX

A confusion matrix is a performance measurement for classification. It is useful for measuring performance metrics like Accuracy, Specificity, Sensitivity, Precision, etc.

The diagonal of the confusion matrix shows the number of correct classifications.

Let’s take a look at our confusion matrix.

In our confusion matrix we can see that 1 Virginica was misclassified as Versicolor. This is not surprising: earlier, in the scatterplots, we saw more overlap between Versicolor and Virginica than with Setosa.

ACCURACY

Our accuracy is 0.9667, which is very high; the closer the value is to 1, the better. In our case 96.67% of the observations were predicted correctly.

\(accuracy=\frac{\text{number of correct predictions}}{\text{total number of predictions made}}\)

KAPPA

Kappa measures the agreement between the actual labels (rater 1) and the predicted labels (rater 2), corrected for the agreement that would be expected by chance.

  • If the raters are in perfect agreement then the value of Kappa is 1.

  • If the raters have no agreement between them then the value of Kappa is 0.

  • If the kappa value is over 0.7 then it is considered a very good agreement.

In our case the Kappa value was 0.95 which means high agreement between raters.
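As a sanity check, Kappa can be recomputed from the confusion matrix above. The observed agreement is \(p_o = 0.9667\); the chance agreement \(p_e\) comes from the row and column totals (each reference class has 10 of the 30 flowers, and the prediction totals are 10, 11, and 9):

\(p_e = \frac{10 \cdot 10 + 11 \cdot 10 + 9 \cdot 10}{30^2} = \frac{1}{3}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.9667 - 0.3333}{1 - 0.3333} = 0.95\)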

SENSITIVITY

Sensitivity, also known as recall, is the true positive rate.

In terms of a real example, sensitivity would be the probability of correctly diagnosing someone who has a disease as having it.

\(sensitivity=\frac{\text{number of true positives}}{\text{number of actual positives}}\)

In our example, we were able to correctly identify 100% of the setosa and versicolor species but only 90% of the virginica species.
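Reading the Virginica numbers off our confusion matrix: 9 of the 10 actual Virginica flowers were labeled Virginica, so \(sensitivity = \frac{9}{10} = 0.9\), matching the output above.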

SPECIFICITY

Specificity is the true negative rate.

For example, specificity would be the probability of correctly identifying people who do not have a disease as disease-free.

\(specificity=\frac{\text{number of true negatives}}{\text{number of actual negatives}}\)
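For example, for Versicolor in our matrix: 20 of the test flowers are not Versicolor, and 19 of those were correctly not labeled Versicolor (one Virginica was), so \(specificity = \frac{19}{20} = 0.95\), matching the output above.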

ERROR RATE

The error rate is the percentage of wrong predictions and can be found in two ways:

  1. Dividing the number of misclassifications by the total number of observations

  2. Subtracting Accuracy from 1

In our case the error rate is \(1-0.9667=0.0333\), and thus about \(3.3\%\) of the test data was misclassified.
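We can verify this in R directly from the objects created above (a small sketch, not part of the original code):

# Error rate = misclassifications / total observations
cm <- table(Predictions$class, dataTest$Species)
1 - sum(diag(cm)) / sum(cm)   # 1/30, approximately 0.0333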

ATTEMPT 2

In this attempt we will train on 70% of the data and test on the remaining 30%.

set.seed(7)
trainIndex <- createDataPartition(iris$Species, p=0.70, list = FALSE)
dataTrain2 <- iris[trainIndex,]
dataTest2 <- iris[-trainIndex,]  #The argument -trainIndex is telling the algorithm to use all the data that is not in trainIndex. 
#Train a Naive Bayes model
fit <- NaiveBayes(Species~., data=dataTrain2)
#Make Predictions
Predictions <- predict(fit, dataTest2[,1:4])
#Summarize results 
confusionMatrix(Predictions$class,dataTest2$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         1
##   virginica       0          0        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9778          
##                  95% CI : (0.8823, 0.9994)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9667          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9333
## Specificity                 1.0000            0.9667           1.0000
## Pos Pred Value              1.0000            0.9375           1.0000
## Neg Pred Value              1.0000            1.0000           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3111
## Detection Prevalence        0.3333            0.3556           0.3111
## Balanced Accuracy           1.0000            0.9833           0.9667

We can see that the accuracy increased when we trained on 70% of the data and tested on 30%, and the Kappa value increased as well. Note, though, that both attempts misclassify exactly one flower; the larger test set simply makes that single error a smaller fraction.

Resampling method: k-fold Cross-Validation

In k-fold cross-validation the data is split into k equal folds; each fold is held out once as a test set while the model is trained on the remaining k-1 folds, and the k accuracy estimates are averaged.

5-fold Cross-Validation

set.seed(7)
trainControl <- trainControl(method="cv", number=5)
model_forest <- train(Species~., data = iris, method="rf",
                       metric="Accuracy", trControl =trainControl)
#Summarize fit 
print(model_forest)
## Random Forest 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 120, 120, 120, 120, 120 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa
##   2     0.9533333  0.93 
##   3     0.9533333  0.93 
##   4     0.9533333  0.93 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(model_forest)

Analysis:

The accuracy and Kappa are identical for mtry = 2, 3, and 4. Notice that the accuracy is not as high as that of the Naive Bayes model above.

mtry is the number of predictors randomly sampled as candidates at each split.
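If we wanted caret to search specific mtry values rather than its default grid, we could pass a tuneGrid; a sketch under the same 5-fold control (not part of the original run):

# Explicitly evaluate mtry = 2, 3, 4 with the same cross-validation control
tuneGrid <- expand.grid(mtry = c(2, 3, 4))
model_forest_grid <- train(Species~., data = iris, method="rf",
                           metric="Accuracy", trControl=trainControl,
                           tuneGrid=tuneGrid)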

10-fold Cross-Validation

set.seed(7)
trainControl <- trainControl(method="cv", number=10)
model_forest <- train(Species~., data = iris, method="rf",
                       metric="Accuracy", trControl =trainControl)
#Summarize fit 
print(model_forest)
## Random Forest 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##   2     0.96      0.94 
##   3     0.96      0.94 
##   4     0.96      0.94 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(model_forest)

Random Forest Method

Here we retrain the caret model with importance=TRUE so that variable importance is recorded, reusing the 10-fold cross-validation control defined above.

set.seed(7)
model_forest <- train(Species~., data = iris, method="rf",
                       metric="Accuracy", importance=TRUE , trControl =trainControl)
#Summarize fit 
print(model_forest)
## Random Forest 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa
##   2     0.9600000  0.94 
##   3     0.9600000  0.94 
##   4     0.9533333  0.93 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(model_forest)
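Since this model was fit with importance=TRUE, we could also inspect which predictors the forest relies on; a one-line sketch (the importance plot is not shown in the original):

# Mean decrease in accuracy and in Gini impurity for each predictor
varImpPlot(model_forest$finalModel)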

Next we fit a randomForest directly on the 70% training split and rely on its out-of-bag (OOB) error: for each observation, the error is estimated using only the trees that did not see that observation during training.

set.seed(7)
model_forest <- randomForest(Species~., data=dataTrain2)
print(model_forest)
## 
## Call:
##  randomForest(formula = Species ~ ., data = dataTrain2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 7.62%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         35          0         0   0.0000000
## versicolor      0         31         4   0.1142857
## virginica       0          4        31   0.1142857

3 variables at each split with 500 trees

model_forest_version1 <- randomForest(Species~.,data=dataTrain2,mtry=3,ntree=500)
model_forest_version1
## 
## Call:
##  randomForest(formula = Species ~ ., data = dataTrain2, mtry = 3,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 6.67%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         35          0         0  0.00000000
## versicolor      0         32         3  0.08571429
## virginica       0          4        31  0.11428571

3 variables at each split with 1000 trees

model_forest_version2 <- randomForest(Species~.,data=dataTrain2,mtry=3,ntree=1000)
model_forest_version2
## 
## Call:
##  randomForest(formula = Species ~ ., data = dataTrain2, mtry = 3,      ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 4.76%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         35          0         0  0.00000000
## versicolor      0         33         2  0.05714286
## virginica       0          3        32  0.08571429

Analysis:

We can see that the OOB error rate decreases as we increase the number of variables tried at each split and the number of trees. Keep in mind that the training set has only 105 rows, so these differences amount to a few observations and can vary with the random seed.
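One way to see whether adding trees keeps helping is randomForest's built-in plot method, which draws the OOB error (and the per-class errors) against the number of trees; a sketch (not in the original):

# OOB and per-class error rates as the forest grows
plot(model_forest_version2, main = "Error vs. number of trees")
legend("topright", colnames(model_forest_version2$err.rate), col = 1:4, lty = 1:4)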

PREDICTION TIME!!

Let’s use our unseen test data.

# Make predictions on unused data using the final model. 
magicwand <- predict(model_forest_version2, dataTest2)
confusionMatrix(magicwand, dataTest2$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         1
##   virginica       0          0        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9778          
##                  95% CI : (0.8823, 0.9994)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9667          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9333
## Specificity                 1.0000            0.9667           1.0000
## Pos Pred Value              1.0000            0.9375           1.0000
## Neg Pred Value              1.0000            1.0000           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3111
## Detection Prevalence        0.3333            0.3556           0.3111
## Balanced Accuracy           1.0000            0.9833           0.9667

We can see that we get exactly the same test-set results from the Naive Bayes model with the 70-30 split and from the random forest with 1000 trees and 3 variables tried at each split.

Why, you wonder?

Stay tuned!