Iris Flower Classification

Sambhav Shrestha

1. Introduction

In this project, I analyze the Iris flower dataset and classify the flowers based on their features. I will use several machine learning models and try to find the one that most accurately distinguishes one species from another. First, we start by importing the required libraries.

2. Importing Required Libraries

# Importing Required Libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(cowplot)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
library(corrplot)
## corrplot 0.84 loaded
library(caTools)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(lda)
library(rpart)
library(rpart.plot)
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice

3. Loading Data

The iris dataset is built into R and can be loaded with the following code.

# importing the iris dataset
data(iris)

4. Data Summary

# summary of iris dataset
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# first and last rows
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
# check for any missing values
colSums(is.na(iris))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            0            0            0            0            0

From the above summary, we can see that this dataset has 5 columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species), with the first four being attributes and the Species column being the label (setosa, versicolor, virginica), distributed equally among 150 observations. There are no missing values, and the four attributes are on the same scale (cm), so we don’t need to normalize them.
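
For reference, here is a minimal sketch of how caret’s preProcess could standardize the attributes if they were on different scales (hypothetical; not needed for this analysis).

# hypothetical standardization with caret's preProcess (not needed here,
# since all four attributes already share the cm scale)
pp <- preProcess(iris[, 1:4], method = c("center", "scale"))
iris_scaled <- predict(pp, iris[, 1:4])
summary(iris_scaled)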

5. Split the Data

Let’s prepare our training and test sets by splitting the data. I chose an 80:20 split.

# split the data into training set and test set
set.seed(100)
split <- sample.split(iris$Species, SplitRatio = 0.8)
train <- subset(iris, split == TRUE)
test <- subset(iris, split == FALSE)
dim(train)
## [1] 120   5
dim(test)
## [1] 30  5
# check that we have an equal number of each species in the training and test data
count(train, Species)
##      Species  n
## 1     setosa 40
## 2 versicolor 40
## 3  virginica 40
count(test, Species)
##      Species  n
## 1     setosa 10
## 2 versicolor 10
## 3  virginica 10

6. Training-Data Visualization

We will look at both univariate and multivariate plots to understand each attribute and the relationships between attributes in our training data.

Bar Plot

qplot(x = train$Species, fill = train$Species, xlab = "Species", ylab = "count")

Box Plot

Box plots help visualize the interquartile range (IQR) and the dispersion of the dataset.

par(mfrow=c(1,4))
color <- c("red", "green", "orange", "yellow")
for (i in 1:4) {
  boxplot(train[, -c(5)][i], main=names(train)[i], col = color[i] )
}

We can see that the sepal length and sepal width of the flowers have a comparatively smaller IQR than the petal measurements. Also, the petal measurements are more negatively skewed.

Let’s compare the box plots for each species. To do so, let’s first find the mean of each attribute by species.

# table of means
train %>% 
  group_by(Species) %>% summarise(avg_SL = mean(Sepal.Length), avg_SW = mean(Sepal.Width), avg_PL = mean(Petal.Length), avg_PW = mean(Petal.Width))
## # A tibble: 3 x 5
##   Species    avg_SL avg_SW avg_PL avg_PW
##   <fct>       <dbl>  <dbl>  <dbl>  <dbl>
## 1 setosa       5.03   3.46   1.47  0.238
## 2 versicolor   5.92   2.78   4.21  1.32 
## 3 virginica    6.66   2.99   5.64  2.05

We can already see some characteristics of each species. Setosa has a smaller average petal length and width and a larger sepal width than the other two. Virginica has the largest sepal length, petal length and petal width on average. Let’s visualize our data in multivariate box plots.

featurePlot(x=train[, -c(5)], y=train[, 5], plot='box')

Scatter Plot

A scatter plot helps us visually identify the separation between species.

featurePlot(x=train[, -c(5)], y=train[, 5], plot='ellipse')

From the above scatter plot, the separation between Setosa and the other two species is clearly visible. Versicolor and Virginica overlap each other, but Virginica has higher attribute values.

Scatter and Density Plots

Scatter plots help in finding the relationships between columns. We can check whether there is a linear relationship between the length and width of the petals and sepals for each species, and also draw a density plot of each attribute by species.

# scatter plot between sepal length and width
scatsepal <- train %>% ggplot(aes(x = Sepal.Length, y = Sepal.Width, shape = Species, color = Species)) + 
  geom_point(size=2) + geom_smooth(method=lm, se = FALSE, formula = y ~ x)  + theme(legend.position = "none")

# density plots
xdensity1 <- train %>% ggplot(aes(x = Sepal.Length, fill = Species)) + geom_density(alpha = 0.5)
ydensity1 <- train %>% ggplot(aes(x = Sepal.Width, fill = Species)) + geom_density(alpha = 0.5) + theme(legend.position = "none")

# blank plot
blankPlot <- ggplot()+ theme_void()

# arranging all the plots
grid.arrange(xdensity1, blankPlot, scatsepal, ydensity1, ncol=2, nrow=2, widths=c(3, 2), heights=c(2, 3))

# scatter plot between petal length and width
scatpetal <- train %>% ggplot(aes(x = Petal.Length, y = Petal.Width, shape = Species, color = Species)) + 
  geom_point(size=2) + geom_smooth(method=lm, se = FALSE, formula = y ~ x)  + theme(legend.position = "none")

# density plots
xdensity2 <- train %>% ggplot(aes(x = Petal.Length, fill = Species)) + geom_density(alpha = 0.5)
ydensity2 <- train %>% ggplot(aes(x = Petal.Width, fill = Species)) + geom_density(alpha = 0.5) + theme(legend.position = "none")

# blank plot
blankPlot <- ggplot()+ theme_void()

# arranging all the plots
grid.arrange(xdensity2, blankPlot, scatpetal, ydensity2, ncol=2, nrow=2, widths=c(3, 2), heights=c(2, 3))

From the two combined plots above, we can see that the petal plots show three separate clusters, one per species, while the sepal plots have overlapping clusters. Thus, we can conclude that the petal measurements have a strong bearing on the model. Now, we shall build our models.

Correlation Plot

correlation <- cor(train[, 1:4], method = 'pearson')
corrplot(correlation,  number.cex = 1, method = "color", type = "lower", tl.cex=0.8, tl.col="black")

The correlation plot shows a strong correlation between Petal.Length and Sepal.Length, and between Petal.Width and Petal.Length.
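
If we wanted to drop redundant predictors, caret’s findCorrelation can flag columns whose pairwise correlation exceeds a cutoff. A sketch with an illustrative cutoff (we keep all four attributes in this project):

# flag indices of predictors whose pairwise correlation exceeds 0.9
# (the cutoff is illustrative; no columns are actually dropped below)
findCorrelation(correlation, cutoff = 0.9)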

7. Data Modeling

We will use 10-fold cross-validation to estimate accuracy. This splits our training data into 10 parts, training on 9 of them and testing on the remaining 1, rotating through all 10 folds.

control <- trainControl(method='cv', number=10)
metric <- 'Accuracy'
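
For more stable estimates, repeated cross-validation is a drop-in alternative. A sketch (not used in the models below):

# hypothetical alternative: 10-fold CV repeated 3 times averages out
# fold-assignment noise at the cost of extra compute
control_repeated <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)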

Linear Discriminant Analysis

The first model that I will use is LDA (Linear Discriminant Analysis), which is similar to logistic regression but handles more than two classes.
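
Under the hood, caret’s method='lda' wraps MASS::lda, and fitting it directly exposes the group means and discriminant coefficients. A minimal sketch for inspection only (note that loading MASS masks dplyr’s select):

# fit LDA directly to inspect the discriminant functions
library(MASS)
lda_direct <- lda(Species ~ ., data = train)
lda_direct$scaling # coefficients of the two linear discriminants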

set.seed(123)
lda_fit <- train(Species~., data=train, method='lda', 
                  trControl=control, metric=metric)
lda_fit
## Linear Discriminant Analysis 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.9833333  0.975

The accuracy of LDA on the training set is 98.33% with a Kappa of 0.975. Let’s test it on the test data.

# predicting on test data
lda_predict <- predict(lda_fit, test)

# confusion matrix
lda_cm <- confusionMatrix(lda_predict, test$Species)
lda_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         0
##   virginica       0          1        10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           1.0000
## Specificity                 1.0000            1.0000           0.9500
## Pos Pred Value              1.0000            1.0000           0.9091
## Neg Pred Value              1.0000            0.9524           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.3333
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.9500           0.9750
# keep track of all model accuracies
Models_Accuracies <- tibble(model = "LDA", accuracy = lda_cm$overall['Accuracy'])

The accuracy of LDA on the test set is 96.67%.

Classification and Regression Trees (CART)

set.seed(123)
cart_fit <- train(Species~., data=train, method='rpart', 
                  trControl=control, metric=metric)
cart_fit
## CART 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   cp      Accuracy   Kappa
##   0.0000  0.9500000  0.925
##   0.4625  0.7833333  0.675
##   0.5000  0.3333333  0.000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

The best accuracy obtained from the CART model was 95% on the training data (at cp = 0).

# predicting on test data
cart_predict <- predict(cart_fit, test)

# confusion matrix
cart_cm <- confusionMatrix(cart_predict, test$Species)
cart_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          7         1
##   virginica       0          3         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.6928, 0.9624)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.296e-09       
##                                           
##                   Kappa : 0.8             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.7000           0.9000
## Specificity                 1.0000            0.9500           0.8500
## Pos Pred Value              1.0000            0.8750           0.7500
## Neg Pred Value              1.0000            0.8636           0.9444
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2333           0.3000
## Detection Prevalence        0.3333            0.2667           0.4000
## Balanced Accuracy           1.0000            0.8250           0.8750
# keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "CART", accuracy = cart_cm$overall['Accuracy'])

Again, the accuracy on the test set dropped, to 86.67% with a Kappa of 0.8.
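
By default, caret tried only three cp values; a finer grid can be searched explicitly. A sketch with an illustrative grid (results may differ from the fit above):

# hypothetical finer search over the complexity parameter cp
cart_grid_fit <- train(Species~., data=train, method='rpart',
                       tuneGrid = data.frame(cp = seq(0, 0.1, by = 0.01)),
                       trControl=control, metric=metric)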

K-Nearest Neighbors (KNN)

set.seed(123)
knn_fit <- train(Species~., data=train, method='knn', trControl=control, metric=metric)
knn_fit
## k-Nearest Neighbors 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa 
##   5  0.9916667  0.9875
##   7  0.9916667  0.9875
##   9  0.9916667  0.9875
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

The accuracy from KNN on the training data was 99.17%, the highest so far.

Let’s see how it performs on test data

# test data
knn_predict <- predict(knn_fit, test)

# confusion matrix
knn_cm <- confusionMatrix(knn_predict, test$Species)
knn_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         2
##   virginica       0          1         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           0.8000
## Specificity                 1.0000            0.9000           0.9500
## Pos Pred Value              1.0000            0.8182           0.8889
## Neg Pred Value              1.0000            0.9474           0.9048
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.2667
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9000           0.8750
# keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "KNN", accuracy = knn_cm$overall['Accuracy'])

The accuracy on the test set dropped to 90%. It seems that KNN overfitted the training set.
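
One way to probe the overfitting is to widen the search over k, since larger neighborhoods smooth the decision boundary. A sketch with an illustrative grid:

# hypothetical wider search over k; larger k values average over more
# neighbors and can reduce overfitting
knn_grid_fit <- train(Species~., data=train, method='knn',
                      tuneGrid = expand.grid(k = seq(3, 21, by = 2)),
                      trControl=control, metric=metric)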

Support Vector Machines (SVM)

set.seed(123)
svm_fit <- train(Species~., data=train, method='svmRadial', trControl=control, metric=metric)
svm_fit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa 
##   0.25  0.9583333  0.9375
##   0.50  0.9833333  0.9750
##   1.00  0.9833333  0.9750
## 
## Tuning parameter 'sigma' was held constant at a value of 0.6165725
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.6165725 and C = 0.5.

The best accuracy from the SVM model is 98.33% on the training data (at C = 0.5).

# test data
svm_predict <- predict(svm_fit, test)

# confusion matrix
svm_cm <- confusionMatrix(svm_predict, test$Species)
svm_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa          9          0         0
##   versicolor      1          8         2
##   virginica       0          2         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.6528, 0.9436)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.444e-08       
##                                           
##                   Kappa : 0.75            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 0.9000            0.8000           0.8000
## Specificity                 1.0000            0.8500           0.9000
## Pos Pred Value              1.0000            0.7273           0.8000
## Neg Pred Value              0.9524            0.8947           0.9000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3000            0.2667           0.2667
## Detection Prevalence        0.3000            0.3667           0.3333
## Balanced Accuracy           0.9500            0.8250           0.8500
# keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "SVM", accuracy = svm_cm$overall['Accuracy'])

Like KNN, SVM also performed poorly on the test set, with an accuracy of only 83.33%.
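
The radial kernel has two knobs, sigma and C, and caret held sigma constant above. A sketch of an explicit grid over both (the values are illustrative, not tuned):

# hypothetical joint search over kernel width (sigma) and cost (C)
svm_grid_fit <- train(Species~., data=train, method='svmRadial',
                      tuneGrid = expand.grid(sigma = c(0.1, 0.5, 1),
                                             C = c(0.25, 0.5, 1, 2, 4)),
                      trControl=control, metric=metric)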

Decision Trees

tree_fit <- rpart(Species ~ ., train, control = rpart.control(minsplit = 6, minbucket = 2), method = "class")
rpart.plot(tree_fit)

# test data
tree_predict <- predict(tree_fit, test, type="class")
tree_cm <- confusionMatrix(tree_predict, test$Species)
tree_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          8         1
##   virginica       0          2         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8000           0.9000
## Specificity                 1.0000            0.9500           0.9000
## Pos Pred Value              1.0000            0.8889           0.8182
## Neg Pred Value              1.0000            0.9048           0.9474
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2667           0.3000
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.8750           0.9000
# keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "Decision Tree", accuracy = tree_cm$overall['Accuracy'])

The accuracy from the decision tree was 90%.
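
The tree could also be pruned by inspecting its complexity table. A sketch (the cp value below is illustrative):

# printcp shows the cross-validated error at each complexity value;
# prune() trims splits below the chosen cp (0.01 is illustrative)
printcp(tree_fit)
tree_pruned <- prune(tree_fit, cp = 0.01)
rpart.plot(tree_pruned)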

Random Forest

set.seed(123)
rf_fit <- train(Species~., data=train, method='ranger', trControl=control, metric=metric)
rf_fit
## Random Forest 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa 
##   2     gini        0.9833333  0.9750
##   2     extratrees  0.9666667  0.9500
##   3     gini        0.9833333  0.9750
##   3     extratrees  0.9666667  0.9500
##   4     gini        0.9750000  0.9625
##   4     extratrees  0.9750000  0.9625
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2, splitrule = gini
##  and min.node.size = 1.

The random forest model achieved a training-set accuracy of 98.33%.

# Test data
rf_predict <- predict(rf_fit, test)

# Confusion matrix
rf_cm <- confusionMatrix(rf_predict, test$Species)
rf_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          8         2
##   virginica       0          2         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.6928, 0.9624)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.296e-09       
##                                           
##                   Kappa : 0.8             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8000           0.8000
## Specificity                 1.0000            0.9000           0.9000
## Pos Pred Value              1.0000            0.8000           0.8000
## Neg Pred Value              1.0000            0.9000           0.9000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2667           0.2667
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.8500           0.8500
# Keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "Random Forest", accuracy = rf_cm$overall['Accuracy'])

The random forest model also didn’t perform well on the test data, with an accuracy of only 86.67%.
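
To see which attributes drive the forest’s splits, ranger can compute impurity-based variable importance through caret. A sketch (a hypothetical refit with importance enabled):

# hypothetical refit with impurity importance; 'importance' is passed
# through caret to ranger, and varImp then reports the scores
rf_imp_fit <- train(Species~., data=train, method='ranger',
                    importance='impurity', trControl=control, metric=metric)
varImp(rf_imp_fit)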

XGBoost Model

set.seed(40)

# Convert class labels from factor to numeric
labels <- train$Species
y <- as.integer(labels) - 1

# xgb fit
xgb_fit <- xgboost(data = data.matrix(train[,-5]), 
 label = y,
 num_class = 3,
 eta = 0.3,
 gamma = 0.1,
 max_depth = 30, 
 nrounds = 20, 
 objective = "multi:softprob",
 colsample_bytree = 0.6,
 verbose = 0,
 nthread = 7,
 nfold = 10,
 prediction = TRUE
)
## [14:34:56] WARNING: amalgamation/../src/learner.cc:541: 
## Parameters: { nfold, prediction } might not be used.
## 
##   This may not be accurate due to some parameters are only used in language bindings but
##   passed down to XGBoost core.  Or some parameters are not used but slip through this
##   verification. Please open an issue if you find above cases.
## 
## 
## [14:34:56] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
xgb_fit
## ##### xgb.Booster
## raw: 45.5 Kb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = nrounds, 
##     watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
##     early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
##     save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
##     callbacks = callbacks, num_class = 3, eta = 0.3, gamma = 0.1, 
##     max_depth = 30, objective = "multi:softprob", colsample_bytree = 0.6, 
##     nthread = 7, nfold = 10, prediction = TRUE)
## params (as set within xgb.train):
##   num_class = "3", eta = "0.3", gamma = "0.1", max_depth = "30", objective = "multi:softprob", colsample_bytree = "0.6", nthread = "7", nfold = "10", prediction = "TRUE", validate_parameters = "TRUE"
## xgb.attributes:
##   niter
## callbacks:
##   cb.evaluation.log()
## # of features: 4 
## niter: 20
## nfeatures : 4 
## evaluation_log:
##     iter train_mlogloss
##        1       0.741699
##        2       0.525989
## ---                    
##       19       0.025517
##       20       0.024078
# Test data
xgb_predict <- predict(xgb_fit, data.matrix(test[, -5]), reshape = T) %>% as.data.frame()
colnames(xgb_predict) = levels(labels)

# Use the predicted label with the highest probability
xgb_predict$prediction = apply(xgb_predict, 1 ,function(x) colnames(xgb_predict)[which.max(x)])

# Confusion matrix
xgb_cm <- confusionMatrix(factor(xgb_predict$prediction), factor(test$Species))
xgb_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          8         1
##   virginica       0          2         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8000           0.9000
## Specificity                 1.0000            0.9500           0.9000
## Pos Pred Value              1.0000            0.8889           0.8182
## Neg Pred Value              1.0000            0.9048           0.9474
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2667           0.3000
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.8750           0.9000
# Keep track of all model accuracies
Models_Accuracies <- add_row(Models_Accuracies, model = "XGBoost", accuracy = xgb_cm$overall['Accuracy'])

XGBoost consistently achieved 90% accuracy on the test data across different hyperparameter settings.
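
Note that nfold and prediction are arguments of xgb.cv(), not xgboost(), which is why the warning above appeared. A cross-validated fit would look roughly like this sketch (the settings are illustrative):

# hypothetical cross-validated XGBoost run; xgb.cv() accepts nfold and
# prediction, unlike xgboost()
xgb_cv <- xgb.cv(data = data.matrix(train[, -5]), label = y,
                 params = list(num_class = 3, eta = 0.3, max_depth = 6,
                               objective = "multi:softprob"),
                 nrounds = 20, nfold = 10, verbose = 0, prediction = TRUE)
head(xgb_cv$evaluation_log)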

8. Result Comparison

Training Set Accuracy

Let’s compare how the models performed on the training data.

# Compare the results of these algorithms
iris.results <- resamples(list(lda=lda_fit, cart=cart_fit, knn=knn_fit, svm=svm_fit, rf=rf_fit))

# Table Comparison
summary(iris.results)
## 
## Call:
## summary.resamples(object = iris.results)
## 
## Models: lda, cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean 3rd Qu. Max. NA's
## lda  0.9166667 1.0000000 1.0000000 0.9833333       1    1    0
## cart 0.8333333 0.9166667 0.9583333 0.9500000       1    1    0
## knn  0.9166667 1.0000000 1.0000000 0.9916667       1    1    0
## svm  0.8333333 1.0000000 1.0000000 0.9833333       1    1    0
## rf   0.9166667 1.0000000 1.0000000 0.9833333       1    1    0
## 
## Kappa 
##       Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
## lda  0.875   1.000 1.0000 0.9750       1    1    0
## cart 0.750   0.875 0.9375 0.9250       1    1    0
## knn  0.875   1.000 1.0000 0.9875       1    1    0
## svm  0.750   1.000 1.0000 0.9750       1    1    0
## rf   0.875   1.000 1.0000 0.9750       1    1    0
dotplot(iris.results)

From the above data, we can see that KNN performed better than all the other models on the training data, with an average accuracy of 99.17%. Training-set accuracy can be misleading for several reasons, one being overfitting: a model that performs extremely well on the training data may break down on the test data.

Test Set Accuracy

Let’s compare the accuracies on our test set.

arrange(Models_Accuracies, desc(accuracy))
## # A tibble: 7 x 2
##   model         accuracy
##   <chr>            <dbl>
## 1 LDA              0.967
## 2 KNN              0.9  
## 3 Decision Tree    0.9  
## 4 XGBoost          0.9  
## 5 CART             0.867
## 6 Random Forest    0.867
## 7 SVM              0.833

9. Conclusion

As we can see, LDA performed considerably better than the other models, followed by KNN, XGBoost and the decision tree. LDA makes use of all four variables and so was able to capture the variance structure better. That said, the dataset is quite small, which may also explain why the other models did not perform as well as expected.