1. Introduction

In this, my first R Markdown document, I will attempt to train different types of models that classify Iris flowers based on their features. The Iris dataset is well-known in data science circles as being a great place to start with machine learning, and being a keen gardener myself, this is an apt starting point for me. The data has been sourced from the UCI Machine Learning Repository. The dataset is attributed to R.A. Fisher (often accredited as the godfather of modern statistics) and dates back to 1936: it contains 4 attributes (sepal length and width in centimetres, and petal length and width in centimentres) of 150 observations of Iris flowers split equally among 3 classes: the Iris Setosa, the Iris Versicolour & the Iris Virginica. Fisher’s write up of his attempt to create a taxonomic framework for the flowers is a fascinating read and I recommend it as further reading for those that want to delve deeper.

2. Installing packages

The following packages will be required for this project.

library(tidyverse)
library(cowplot)
library(ggpubr)
library(MASS)
library(caret)

3. Loading the data

The Iris dataset is available pre-loaded in R, so the following code loads and summarises the data.

data(iris)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

We can see that there are no missing values in this dataset so there is no requirement for imputation, and also the 4 continuous variables are on the same scale (centimetres) so there is no normalization required. I will now proceed to splitting the data into training and test sets and I will ensure that the test set gets an equal number of each flower (with such a small dataset, there is a high chance that we’d only get negligible or nil levels of one species, and so the accuracy of our models would be compromised).

set.seed(49) # crucial step for replication purposes.
irissplit <- createDataPartition(iris$Species, p=.8, list=FALSE, times=1)
train <- iris[irissplit, ]
test <- iris[-irissplit, ]
# confirm we have 30 rows x 5 variables in the train set (20% of 150)...
dim(test)
## [1] 30  5
# ...and that we have a 33/33/33 split between species.
count(test, Species)
## # A tibble: 3 x 2
##   Species        n
##   <fct>      <int>
## 1 setosa        10
## 2 versicolor    10
## 3 virginica     10

4. Exploratory analysis

# table of means
train %>% 
  group_by(Species) %>% summarise(avg_SL = mean(Sepal.Length), avg_PL = mean(Petal.Length), avg_SW = mean(Sepal.Width), avg_PW = mean(Petal.Width)) 
## # A tibble: 3 x 5
##   Species    avg_SL avg_PL avg_SW avg_PW
##   <fct>       <dbl>  <dbl>  <dbl>  <dbl>
## 1 setosa       4.97   1.45   3.39  0.238
## 2 versicolor   5.94   4.27   2.77  1.34 
## 3 virginica    6.64   5.61   3.01  2.04

Immediately we can see that the average length and width of the Setosa’s petals are way smaller than the other two varieties. Let’s now do some visualisation.

# boxplots
box1 <- ggplot(train, aes(x=Species, y=Sepal.Length)) + geom_boxplot()
box2 <- ggplot(train, aes(x=Species, y=Sepal.Width)) + geom_boxplot()
box3 <- ggplot(train, aes(x=Species, y=Petal.Length)) + geom_boxplot()
box4 <- ggplot(train, aes(x=Species, y=Petal.Width)) + geom_boxplot()
plot_grid(box1, box2, box3, box4, labels = "AUTO", ncol = 2, nrow = 2)

The boxplots confirm that the range of the Setosa’s petal length and width are lower than the other two, so already these two features seem useful for predicting species. Next, lets see if there’s a linear relationship between length and width of both the petals and species of each species.

# scatterplots + density combined
scat1 <- ggscatter(train, x = "Sepal.Length", y="Sepal.Width", shape = "Species", color = "Species", palette = "lancet", size = 2, alpha = 0.6) + stat_smooth(method=lm, se=FALSE, colour="black")
xdens1 <- ggdensity(train, "Sepal.Length", colour = "Species", fill = "Species", palette = "lancet")
ydens1 <- ggdensity(train, "Sepal.Width", colour = "Species", fill = "Species", palette = "lancet") + rotate()
ggarrange(xdens1, NULL, scat1, ydens1, nrow=2, ncol=2, align = "hv", widths = c(2,1), heights = c(1,2), common.legend = TRUE)

scat2 <- ggscatter(train, x = "Petal.Length", y="Petal.Width", shape = "Species", color = "Species", palette = "lancet", size = 2, alpha = 0.6) + stat_smooth(method=lm, se=FALSE, colour="black")
xdens2 <- ggdensity(train, "Petal.Length", colour = "Species", fill = "Species", palette = "lancet")
ydens2 <- ggdensity(train, "Petal.Width", colour = "Species", fill = "Species", palette = "lancet") + rotate()
ggarrange(xdens2, NULL, scat2, ydens2, nrow = 2, ncol = 2, align = "hv", widths = c(2,1), heights = c(1,2), common.legend = TRUE)

These plots are revelatory. Using the sepal measurements alone will be unhelpful in distinguishing between species, particularly Versicolor and Virginica. However, the petal measurements seem positively associated and that we appear to have three clusters appearing. This leads me to deduce that petal length and width will have a stronger bearing on the model but that sepal measurements may help to separate Versicolors from Virginicas. On with the modelling.

5. Training a model

a. Discriminant Analysis

It seems fitting (excuse the pun) to start with the method that Fisher himself used, and then I will attempt to improve upon it using other machine learning algorithms. LDAs are useful for classification problems where there are more than 2 values for the predictor variable which is the case here, otherwise we could use logistic regression. Using a LDA is predicated on the normality assumption, and from the bell-curve shape of the density plots, its reasonable to proceed.

model_LDA <- lda(Species ~ ., train)
model_LDA
## Call:
## lda(Species ~ ., data = train)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa           4.9725      3.3925       1.4500      0.2375
## versicolor       5.9375      2.7675       4.2675      1.3375
## virginica        6.6450      3.0100       5.6075      2.0450
## 
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.8776422  0.08459848
## Sepal.Width   1.3427090  2.27738505
## Petal.Length -2.2499175 -0.91449073
## Petal.Width  -2.6969658  2.67386351
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9905 0.0095

From the model outputs, I can use the coefficients of each linear discriminant to show the decision rules (to 2 decimal places): \(LD1 = 0.88 * Sepal.Length + 1.34 * Sepal.Width - 2.25 * Petal.Length - 2.70 * Petal.Width\) \(LD2 = 0.08 * Sepal.Length + 2.28 * Sepal.Width - 0.91 * Petal.Length + 2.67 * Petal.Width\)

The proportion of trace tells us how well each discriminant distinguishes between the species, and given the very high size of the first discriminant (0.9905), we see that LD1 explains 99% of the variance and so LD2 is contributing very little to the distinction of species, as can be seen from the plot below.

dataforplot <- data.frame(species = train$Species, lda = predict(model_LDA)$x)
ggscatter(dataforplot, x="lda.LD1", y="lda.LD2", shape = "species", color = "species", palette = "lancet", size = 2, alpha = 0.6)

And now on to the predictions: is this model good at predicting Iris species based on 4 measurements?

predict_LDA <- predict(model_LDA, test)
confusion_LDA <- confusionMatrix(predict_LDA$class, test$Species)
confusion_LDA
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         0
##   virginica       0          1        10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           1.0000
## Specificity                 1.0000            1.0000           0.9500
## Pos Pred Value              1.0000            1.0000           0.9091
## Neg Pred Value              1.0000            0.9524           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.3333
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.9500           0.9750

96.7% accuracy has been achieved with only one mistaken prediction, so this is a high benchmark, but lets see if we can better it. I’ll store this result for later.

Models_Accuracies <- tibble(model = "LDA", accuracy = confusion_LDA$overall['Accuracy'])

b. Decision Tree

I have chosen this algorithm next as the outputs are generally easily interpretable (especially when the number of predictor variables is low) and being non-parametric, they are good at handling a mixed bag of variable types.

library(rpart)
library(rpart.plot)
iristree <- rpart(formula = Species ~ ., data = train, control = rpart.control(minsplit = 5, minbucket = 2), method = "class") # method = "class" directs the function to treat Species as a categorical variable.
print(iristree)
## n= 120 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 120 80 setosa (0.33333333 0.33333333 0.33333333)  
##    2) Petal.Length< 2.45 40  0 setosa (1.00000000 0.00000000 0.00000000) *
##    3) Petal.Length>=2.45 80 40 versicolor (0.00000000 0.50000000 0.50000000)  
##      6) Petal.Length< 5.05 44  4 versicolor (0.00000000 0.90909091 0.09090909)  
##       12) Petal.Width< 1.75 39  0 versicolor (0.00000000 1.00000000 0.00000000) *
##       13) Petal.Width>=1.75 5  1 virginica (0.00000000 0.20000000 0.80000000) *
##      7) Petal.Length>=5.05 36  0 virginica (0.00000000 0.00000000 1.00000000) *
rpart.plot(iristree)

The plot itself acts as a useful set of rules to apply to further datasets to predict species, and so the allocation rules would be:

  1. If Petal.Length < 2.5cm then species = Setosa

  2. If Petal.Length > 5.1cm OR If Petal.Length between 2.5cm and 5.1cm and Petal.Width > 1.8cm then species = Virginica

  3. If Petal.Length between 2.5cm and 5.1cm and Petal.Width < 1.8cm then species = Versicolor

Its interesting to note that this model does not accommodate the two sepal variables. I suspect that this will hamper the model’s accuracy.

Let’s look at how well the model performs:

predict_tree <- predict(iristree, test, type="class")
confusion_tree <- confusionMatrix(predict_tree, test$Species)
Models_Accuracies <- add_row(Models_Accuracies, model = "Classification Tree", accuracy = confusion_tree$overall['Accuracy'])
confusion_tree
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         2
##   virginica       0          1         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           0.8000
## Specificity                 1.0000            0.9000           0.9500
## Pos Pred Value              1.0000            0.8182           0.8889
## Neg Pred Value              1.0000            0.9474           0.9048
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.2667
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9000           0.8750

We have 3 flowers misclassified, and so we have an accuracy rate of 3 / 30 = 90%, therefore the model does not perform as well as the LDA. The next logical step then would be to combine multiple trees, so on to random forest.

c. Random Forests

Unlike regression models, we need to select our hyperparameters before the model is fit. But first I will run a simple baseline random forest from which we can then tune the hyperparameters to improve accuracy.

library(randomForest)
rf_baseline <- randomForest(Species ~ ., train, importance=TRUE, proximity=TRUE)
rf_baseline
## 
## Call:
##  randomForest(formula = Species ~ ., data = train, importance = TRUE,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 3.33%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         40          0         0        0.00
## versicolor      0         38         2        0.05
## virginica       0          2        38        0.05

The Out-of-bag error for this model is 3.33% which gives us an accuracy of 96.7%, and the error breaks down into 2 misclassifications of versicolor when it should be virginica, and 2 misclassifications of virginica when it should be versicolor.

varImpPlot(rf_baseline)

The importance information and plots show that the most important predictors in the model are petal length and petal width. No surprise given what we’ve seen already in our scatterplots of length v width, and also the decision tree earlier on that completely disregarded sepal length and width.

Now let’s see how the trained model performs on the test set.

predict_rf_baseline <- predict(rf_baseline, test)
confusion_RF <- confusionMatrix(predict_rf_baseline, test$Species)
Models_Accuracies <- add_row(Models_Accuracies, model = "Random Forest baseline", accuracy = confusion_RF$overall['Accuracy'])
confusion_RF
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         2
##   virginica       0          1         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           0.8000
## Specificity                 1.0000            0.9000           0.9500
## Pos Pred Value              1.0000            0.8182           0.8889
## Neg Pred Value              1.0000            0.9474           0.9048
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.2667
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9000           0.8750

The test accuracy here is not as strong as the out of bag accuracy and the class errors seem to be pretty similar (no difficulty in classifying the setosa flowers, the misclassification comes from the other two species).

I will now complexify the random forest a little by increasing the number of trees and variables to try at each split.

rf_2 <- randomForest(formula = Species ~ ., data = train, importance = TRUE, proximity = TRUE, ntree=1000, mtry=4) 
rf_2
## 
## Call:
##  randomForest(formula = Species ~ ., data = train, importance = TRUE,      proximity = TRUE, ntree = 1000, mtry = 4) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 2.5%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         40          0         0       0.000
## versicolor      0         38         2       0.050
## virginica       0          1        39       0.025

There is a small uptick in accuracy thanks to decreasing the number of misclassifications from 4 to 3.

varImpPlot(rf_2)

Petal length is now even more important in this model and the importance of sepal length and width has fallen even more.

predict_rf_2 <- predict(rf_2, test)
confusion_rf2 <- confusionMatrix(predict_rf_2, test$Species)
Models_Accuracies <- add_row(Models_Accuracies, model = "Random Forest complex", accuracy = confusion_rf2$overall['Accuracy'])
confusion_rf2
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         2
##   virginica       0          1         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           0.8000
## Specificity                 1.0000            0.9000           0.9500
## Pos Pred Value              1.0000            0.8182           0.8889
## Neg Pred Value              1.0000            0.9474           0.9048
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.2667
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9000           0.8750

Unfortunately, although the accuracy improved on the training data, it did not alter the accuracy on the test set: the model still misclassifies 3 out of 30 training set cases.

d. K-nearest Neighbour

Finally, I will use K-nearest neighbours as an example of an instance-based algorithm which is the only one used here that doesn’t learn a model, so it will be interesting to see how it performs. I will create 10 models, where k lies between 1 and 10 inclusive.

library(class)
model_knn <- list()
accuracy_knn <- numeric()
for (i in 1:10) {
 model_knn[[i]] <- knn(train[,-5], test[,-5], train$Species, k=i, prob=TRUE) 
 accuracy_knn[i] <- sum(model_knn[[i]]==test$Species)/length(test$Species)*100
}
accuracy_knn
##  [1] 90.00000 90.00000 90.00000 93.33333 93.33333 93.33333 93.33333
##  [8] 93.33333 93.33333 93.33333

The accuracy of the model improves a little when k>3 but the ceiling of the model seems to be around the 93.3% mark. I think taking k any higher than 10 for such a small dataset would lead to overfitting.

model_knn4 <- knn(train[,-5], test[,-5], train$Species, k=4, prob=TRUE)
confusion_knn <- table(test$Species, model_knn4)
confusion_knn
##             model_knn4
##              setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         1
##   virginica       0          1         9
Models_Accuracies <- add_row(Models_Accuracies, model = "KNN", accuracy = accuracy_knn[4]/100)

6. Summary

Looking at the summary of accuracies below:

arrange(Models_Accuracies, desc(accuracy))
## # A tibble: 5 x 2
##   model                  accuracy
##   <chr>                     <dbl>
## 1 LDA                       0.967
## 2 KNN                       0.933
## 3 Classification Tree       0.9  
## 4 Random Forest baseline    0.9  
## 5 Random Forest complex     0.9

We can see that the very algorithm that Fisher himself used to classify Iris flowers worked the best, followed by K-nearest neighbours. All of the tree-based methods (whether single or ensemble) performed the worst. On reflection, this makes sense; we don’t have an extensive dataset here to work with both width- and depth-wise. The LDA encompassed all 4 variables and so was able to reflect the variance within the data more accurately. By excluding or downplaying 2 of the 4 variables, the tree methods failed to live up to the accuracy of the LDA, or even the KNN model.