The iris dataset is widely known and was introduced by Ronald Fisher in 1936. It contains
Three plant species
Setosa
Virginica
Versicolor
Four Features in centimeters
Sepal Length
Sepal Width
Petal Length
Petal Width
Flowers have sepals and petals, and normally the two differ in color, but that is not the case with the iris flower.
Below is a more common diagram of a flower showing the sepal and petal.
The goal of this project is to train a random forest model and evaluate its performance using metrics such as
Accuracy
Kappa
Specificity
Sensitivity
Error rate
In other words, we want to train the model on a portion of the data so it can learn patterns between the variables that predict the species of flower, and then test it on unseen data to evaluate how well it classifies the flowers.
library(randomForest)
library(mlbench)
library(datasets)
library(kernlab)
library(caret)
library(klaR)
library(gmodels)
data(iris)
# Display part of the data
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
#Display statistical summary of data
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Analysis:
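The iris dataset does not need to be normalized: all four features are measured in the same unit (centimeters) and their values fall in a narrow range, from 0.1 to 7.9. In addition, tree-based models such as random forest split on one feature at a time, so they are not sensitive to feature scale.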
#Display name and data type of each column
sapply(iris, class)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "numeric" "numeric" "numeric" "numeric" "factor"
Analysis:
We can see that the Setosa species has sepal widths and sepal lengths that are very different from those of Versicolor and Virginica, which overlap with each other. I therefore predict that our Random Forest Model will confuse some Versicolor and Virginica observations with each other but will not misclassify Setosa.
Analysis:
There is a strong positive correlation between petal width and petal length.
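To put a number on that correlation, here is a minimal sketch in base R. It only assumes the built-in iris data; the variable names petal_cor and feature_cor are mine.

```r
# Pearson correlation between petal length and petal width.
data(iris)
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
print(round(petal_cor, 3))

# Correlation matrix of all four numeric features, for context.
feature_cor <- cor(iris[, 1:4])
print(round(feature_cor, 2))
```

The petal correlation should come out at roughly 0.96, which is what "strong positive" means here.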
A data split partitions the data into training data and testing data. The training data is used to fit the model. The testing data, which the model has not seen, is used to evaluate performance, in other words to check how well the model predicts the target variable, which in our case is species.
In our first model we will train on 80% of the data and test on the remaining 20%.
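To make the idea of a stratified split concrete, here is a rough base R sketch that samples 80% of the rows within each species. This is only an illustration: it is not how createDataPartition is implemented, it will not pick the same rows, and the names train_sketch and test_sketch are hypothetical.

```r
data(iris)
set.seed(7)

# Group the row numbers by species, then sample 80% of each group.
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.80 * length(rows)))
}))

train_sketch <- iris[train_rows, ]
test_sketch  <- iris[-train_rows, ]  # every row that was not sampled

# Each species keeps the same share in both parts.
print(table(train_sketch$Species))
print(table(test_sketch$Species))
```

Sampling within each species keeps the classes balanced, so the test set ends up with 10 flowers of each species.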
set.seed(7)
trainIndex <- createDataPartition(iris$Species, p=0.80, list = FALSE)
dataTrain <- iris[trainIndex,]
dataTest <- iris[-trainIndex,] #The argument -trainIndex is telling the algorithm to use all the data that is not in trainIndex.
#Train a Naive Bayes model
fit <-NaiveBayes(Species~., data=dataTrain)
#Make Predictions
Predictions <- predict(fit, dataTest[,1:4])
#Summarize results
confusionMatrix(Predictions$class,dataTest$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 1
## virginica 0 0 9
##
## Overall Statistics
##
## Accuracy : 0.9667
## 95% CI : (0.8278, 0.9992)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.963e-13
##
## Kappa : 0.95
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.9000
## Specificity 1.0000 0.9500 1.0000
## Pos Pred Value 1.0000 0.9091 1.0000
## Neg Pred Value 1.0000 1.0000 0.9524
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3000
## Detection Prevalence 0.3333 0.3667 0.3000
## Balanced Accuracy 1.0000 0.9750 0.9500
CONFUSION MATRIX
A confusion matrix is a performance measurement for classification. It is useful for measuring performance metrics like Accuracy, Specificity, Sensitivity, Precision, etc.
The diagonal of the confusion matrix shows the number of correct classifications; every off-diagonal cell is a misclassification.
Let’s take a look at our confusion matrix.
In our confusion matrix we can see that 1 Virginica was misclassified as Versicolor. This is not surprising, because in the scatterplots we observed earlier there are more similarities between Versicolor and Virginica than between either of them and Setosa.
ACCURACY
Our accuracy level is 0.9667 which is very high. The closer the value is to 1 the better. In our case 96.67% of our observations were predicted correctly.
\(accuracy=\frac{\text{number of correct predictions}}{\text{total number of predictions made}}\)
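As a quick check, the accuracy can be recomputed by hand from the 80/20 confusion matrix. The sketch below retypes the counts from the output above; the name cm is mine.

```r
# Confusion matrix from the 80/20 Naive Bayes run
# (rows = predicted class, columns = actual class).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 4))
```

With 29 of 30 flowers on the diagonal this gives 0.9667, the value reported by confusionMatrix.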
KAPPA
Kappa measures the agreement between the actual data (rater 1) and the predicted data (rater 2) after correcting for the agreement that would be expected by chance.
If the raters are in perfect agreement then the value of Kappa is 1.
If the raters agree no more often than chance would predict then the value of Kappa is 0.
As a common rule of thumb, a Kappa value over 0.7 is considered very good agreement, although the exact cutoffs vary by source.
In our case the Kappa value was 0.95 which means high agreement between raters.
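Cohen's Kappa can be written as \(\kappa=\frac{p_o-p_e}{1-p_e}\), where \(p_o\) is the observed agreement (the accuracy) and \(p_e\) is the agreement expected by chance. The sketch below applies that formula to the 80/20 confusion matrix; the counts are retyped from the output above and the variable names are mine.

```r
# Confusion matrix from the 80/20 Naive Bayes run
# (rows = predicted class, columns = actual class).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
observed <- sum(diag(cm)) / n                      # p_o: share of agreeing cases
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # p_e: agreement expected by chance
kappa_value <- (observed - expected) / (1 - expected)
print(round(kappa_value, 4))
```

Here the chance agreement is 1/3 (three equally sized classes), so Kappa works out to 0.95.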
SENSITIVITY
Sensitivity, also known as recall, is the true positive rate.
In terms of a real example, sensitivity would be the probability of correctly diagnosing someone who has a disease.
\(sensitivity=\frac{\text{number of true positives}}{\text{number of actual positives}}\)
In our example, we were able to correctly identify 100% of the setosa and versicolor species but only 90% of the virginica species.
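Per-class sensitivity can be read straight off the confusion matrix: for each species, divide the correctly predicted count by the number of flowers that actually belong to that species (the column total in caret's layout). A small sketch, with the counts retyped from the 80/20 output:

```r
# Confusion matrix from the 80/20 Naive Bayes run
# (rows = predicted class, columns = actual class).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# True positives per class divided by actual members of that class.
sensitivity_by_class <- diag(cm) / colSums(cm)
print(sensitivity_by_class)
```

The one Virginica predicted as Versicolor is a false negative for Virginica, which is why only that class drops to 0.9.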
SPECIFICITY
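Specificity is the true negative rate.
For example, specificity would be the probability of correctly identifying people who do not have a disease.
\(specificity=\frac{\text{number of true negatives}}{\text{number of actual negatives}}\)
In our example, specificity was 100% for setosa and virginica and 95% for versicolor, because one flower that was not a versicolor was predicted as versicolor.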
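Specificity takes a bit more bookkeeping in a three-class problem, because "negative" means "any other species". The following sketch treats each species in turn as the positive class; the counts come from the 80/20 output and the helper names are mine.

```r
# Confusion matrix from the 80/20 Naive Bayes run
# (rows = predicted class, columns = actual class).
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)
true_pos  <- diag(cm)
false_pos <- rowSums(cm) - true_pos   # predicted as the class but actually another
actual_neg <- n - colSums(cm)         # flowers that do not belong to the class
true_neg  <- actual_neg - false_pos   # negatives that were not predicted as the class

specificity_by_class <- true_neg / actual_neg
print(specificity_by_class)
```

Versicolor has one false positive among 20 non-versicolor flowers, giving 19/20 = 0.95, while the other two classes have none.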
ERROR RATE
The error rate is the percentage of wrong predictions and can be found in two ways:
Dividing the number of misclassifications by total number of observations
Subtracting Accuracy from 1
In our case the error rate is \(1-0.9667=0.0333\), so about \(3.3\%\) of the test data (1 of 30 observations) was misclassified.
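Both routes to the error rate give the same number, which is easy to confirm on the 80/20 confusion matrix (counts retyped from the output above; variable names are mine):

```r
# Confusion matrix from the 80/20 Naive Bayes run
# (rows = predicted class, columns = actual class).
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE)

# Way 1: misclassifications divided by total observations.
misclassified <- sum(cm) - sum(diag(cm))
error_from_counts <- misclassified / sum(cm)

# Way 2: one minus accuracy.
error_from_accuracy <- 1 - sum(diag(cm)) / sum(cm)

print(round(c(error_from_counts, error_from_accuracy), 4))
```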
In our second model we will train on 70% and test on 30% of the data.
set.seed(7)
trainIndex <- createDataPartition(iris$Species, p=0.70, list = FALSE)
dataTrain2 <- iris[trainIndex,]
dataTest2 <- iris[-trainIndex,] #The argument -trainIndex is telling the algorithm to use all the data that is not in trainIndex.
#Train a Naive Bayes model
fit <-NaiveBayes(Species~., data=dataTrain2)
#Make Predictions
Predictions <- predict(fit, dataTest2[,1:4])
#Summarize results
confusionMatrix(Predictions$class,dataTest2$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 15 1
## virginica 0 0 14
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.9333
## Specificity 1.0000 0.9667 1.0000
## Pos Pred Value 1.0000 0.9375 1.0000
## Neg Pred Value 1.0000 1.0000 0.9677
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3111
## Detection Prevalence 0.3333 0.3556 0.3111
## Balanced Accuracy 1.0000 0.9833 0.9667
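We can see that the accuracy increased from 0.9667 to 0.9778 when we trained on 70% of the data and tested on 30%, and the Kappa value increased from 0.95 to 0.9667. Note, however, that both models misclassified exactly one flower: the score is higher mainly because that one error is spread over 45 test observations instead of 30, and the two 95% confidence intervals overlap heavily.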
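One way to judge how much weight the difference between the two splits deserves is to look at the uncertainty around each accuracy. Both runs got a single flower wrong (29 of 30 and 44 of 45 correct), so the sketch below computes an exact binomial 95% interval for each with base R's binom.test. I am assuming this is comparable to the interval confusionMatrix prints; the variable names are mine.

```r
# Exact (Clopper-Pearson) 95% intervals for the two test-set accuracies.
ci_80_20 <- as.numeric(binom.test(29, 30)$conf.int)  # 80/20 split: 29 of 30 correct
ci_70_30 <- as.numeric(binom.test(44, 45)$conf.int)  # 70/30 split: 44 of 45 correct

print(round(ci_80_20, 4))
print(round(ci_70_30, 4))

# TRUE when the two intervals share at least one value.
intervals_overlap <- ci_80_20[1] <= ci_70_30[2] && ci_70_30[1] <= ci_80_20[2]
print(intervals_overlap)
```

Because the intervals overlap almost completely, these two test sets alone do not show that one split produces a better model than the other.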
set.seed(7)
trainControl <- trainControl(method="cv", number =5)
model_forest <- train(Species~., data = iris, method="rf",
metric="Accuracy", trControl =trainControl)
#Summarize fit
print(model_forest)
## Random Forest
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 120, 120, 120, 120, 120
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9533333 0.93
## 3 0.9533333 0.93
## 4 0.9533333 0.93
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(model_forest)
Analysis:
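The accuracy and the Kappa are the same for mtry values of 2, 3 and 4, so the choice of mtry makes no difference here. Notice that the cross-validated accuracy (0.9533) is not as high as the accuracy of the Naive Bayes Model, although the two numbers are not directly comparable: this one is averaged over 5 folds, while the Naive Bayes accuracy comes from a single test set.
mtry is the number of variables randomly sampled as candidates at each split.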
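For classification, the randomForest package defaults mtry to the square root of the number of predictors, rounded down. A tiny sketch of that arithmetic for iris (the variable names are mine):

```r
data(iris)

# Number of predictors: every column except the Species target.
n_predictors <- ncol(iris) - 1

# Default mtry for a classification forest: floor(sqrt(p)).
default_mtry <- floor(sqrt(n_predictors))
print(default_mtry)
```

With four predictors this gives 2, so a forest fitted without an explicit mtry tries 2 variables at each split.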
set.seed(7)
trainControl <- trainControl(method="cv", number =10)
model_forest <- train(Species~., data = iris, method="rf",
metric="Accuracy", trControl =trainControl)
#Summarize fit
print(model_forest)
## Random Forest
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.96 0.94
## 3 0.96 0.94
## 4 0.96 0.94
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(model_forest)
set.seed(7)
model_forest <- train(Species~., data = iris, method="rf",
metric="Accuracy", importance=TRUE , trControl =trainControl)
#Summarize fit
print(model_forest)
## Random Forest
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9600000 0.94
## 3 0.9600000 0.94
## 4 0.9533333 0.93
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(model_forest)
set.seed(7)
model_forest <- randomForest(Species~.,data=dataTrain2)
print(model_forest)
##
## Call:
## randomForest(formula = Species ~ ., data = dataTrain2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 7.62%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 35 0 0 0.0000000
## versicolor 0 31 4 0.1142857
## virginica 0 4 31 0.1142857
3 variables at each split with 500 trees
model_forest_version1 <- randomForest(Species~.,data=dataTrain2,mtry=3,ntree=500)
model_forest_version1
##
## Call:
## randomForest(formula = Species ~ ., data = dataTrain2, mtry = 3, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 6.67%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 35 0 0 0.00000000
## versicolor 0 32 3 0.08571429
## virginica 0 4 31 0.11428571
3 variables at each split with 1000 trees
model_forest_version2 <- randomForest(Species~.,data=dataTrain2,mtry=3,ntree=1000)
model_forest_version2
##
## Call:
## randomForest(formula = Species ~ ., data = dataTrain2, mtry = 3, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 4.76%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 35 0 0 0.00000000
## versicolor 0 33 2 0.05714286
## virginica 0 3 32 0.08571429
Analysis:
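We can see that the OOB error rate is lowering as we increase the number of variables tried at each split and increase the number of trees. The differences are small, though (8, 7 and 5 misclassified training observations out of 105), and with a training set this small part of the change may simply be random variation between forests.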
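It helps to translate the three OOB error rates back into counts. The sketch below adds up the off-diagonal cells of each OOB confusion matrix printed above and divides by the 105 training rows; the labels in the vector are mine.

```r
# Training rows in the 70/30 split (35 per species).
n_train <- 105

# Off-diagonal (misclassified) counts from the three OOB confusion matrices.
oob_errors <- c(mtry2_ntree500  = 4 + 4,
                mtry3_ntree500  = 3 + 4,
                mtry3_ntree1000 = 2 + 3)

oob_rate_percent <- 100 * oob_errors / n_train
print(round(oob_rate_percent, 2))
```

The percentages 7.62, 6.67 and 4.76 therefore correspond to 8, 7 and 5 flowers, so each step changes the verdict on only one or two observations.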
PREDICTION TIME!!
Let’s use our unseen data / test data.
# Make predictions on unused data using the final model.
magicwand <- predict(model_forest_version2, dataTest2)
confusionMatrix(magicwand, dataTest2$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 15 1
## virginica 0 0 14
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.9333
## Specificity 1.0000 0.9667 1.0000
## Pos Pred Value 1.0000 0.9375 1.0000
## Neg Pred Value 1.0000 1.0000 0.9677
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3111
## Detection Prevalence 0.3333 0.3556 0.3111
## Balanced Accuracy 1.0000 0.9833 0.9667
We can see that the Random Forest with 1000 trees and 3 variables tried at each split gives exactly the same test results as the Naive Bayes Method with the 70-30 split.
Why, you wonder?
Stay tuned!