In this, my first R Markdown document, I will train different types of models to classify Iris flowers based on their measurements. The Iris dataset is well known in data science circles as a great place to start with machine learning, and being a keen gardener myself, it is an apt starting point for me. The data has been sourced from the UCI Machine Learning Repository. The dataset is attributed to R.A. Fisher (often credited as the godfather of modern statistics) and dates back to 1936: it contains 4 attributes (sepal length and width, and petal length and width, all in centimetres) for 150 observations of Iris flowers split equally among 3 classes: Iris Setosa, Iris Versicolour and Iris Virginica. Fisher’s write-up of his attempt to create a taxonomic framework for the flowers is a fascinating read, and I recommend it to those who want to delve deeper.
The following packages will be required for this project.
library(tidyverse) # data wrangling and ggplot2
library(cowplot)   # plot_grid() for arranging ggplots
library(ggpubr)    # ggscatter(), ggdensity() and ggarrange()
library(MASS)      # lda()
library(caret)     # createDataPartition() and confusionMatrix()
The Iris dataset is available pre-loaded in R, so the following code loads and summarises the data.
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
We can see that there are no missing values in this dataset, so there is no requirement for imputation, and the 4 continuous variables are all on the same scale (centimetres), so no normalisation is required either. I will now split the data into training and test sets, ensuring that the test set contains an equal number of each species (with such a small dataset, a purely random split risks under-representing one species in the test set, which would compromise how we assess the accuracy of our models).
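As a quick sanity check (not part of the summary output above), the absence of missing values can be confirmed directly:
sum(is.na(iris)) # 0: nothing to impute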
set.seed(49) # crucial step for replication purposes.
irissplit <- createDataPartition(iris$Species, p=.8, list=FALSE, times=1)
train <- iris[irissplit, ]
test <- iris[-irissplit, ]
# confirm we have 30 rows x 5 variables in the test set (20% of 150)...
dim(test)
## [1] 30 5
# ...and that we have a 33/33/33 split between species.
count(test, Species)
## # A tibble: 3 x 2
## Species n
## <fct> <int>
## 1 setosa 10
## 2 versicolor 10
## 3 virginica 10
# table of means
train %>%
  group_by(Species) %>%
  summarise(avg_SL = mean(Sepal.Length),
            avg_PL = mean(Petal.Length),
            avg_SW = mean(Sepal.Width),
            avg_PW = mean(Petal.Width))
## # A tibble: 3 x 5
## Species avg_SL avg_PL avg_SW avg_PW
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 4.97 1.45 3.39 0.238
## 2 versicolor 5.94 4.27 2.77 1.34
## 3 virginica 6.64 5.61 3.01 2.04
Immediately we can see that the average length and width of the Setosa’s petals are considerably smaller than those of the other two varieties. Let’s now do some visualisation.
# boxplots
box1 <- ggplot(train, aes(x=Species, y=Sepal.Length)) + geom_boxplot()
box2 <- ggplot(train, aes(x=Species, y=Sepal.Width)) + geom_boxplot()
box3 <- ggplot(train, aes(x=Species, y=Petal.Length)) + geom_boxplot()
box4 <- ggplot(train, aes(x=Species, y=Petal.Width)) + geom_boxplot()
plot_grid(box1, box2, box3, box4, labels = "AUTO", ncol = 2, nrow = 2)
The boxplots confirm that the range of the Setosa’s petal length and width is lower than that of the other two species, so these two features already look useful for predicting species. Next, let’s see whether there is a linear relationship between length and width of the sepals and of the petals for each species.
# scatterplots + density combined
scat1 <- ggscatter(train, x = "Sepal.Length", y="Sepal.Width", shape = "Species", color = "Species", palette = "lancet", size = 2, alpha = 0.6) + stat_smooth(method=lm, se=FALSE, colour="black")
xdens1 <- ggdensity(train, "Sepal.Length", colour = "Species", fill = "Species", palette = "lancet")
ydens1 <- ggdensity(train, "Sepal.Width", colour = "Species", fill = "Species", palette = "lancet") + rotate()
ggarrange(xdens1, NULL, scat1, ydens1, nrow=2, ncol=2, align = "hv", widths = c(2,1), heights = c(1,2), common.legend = TRUE)
scat2 <- ggscatter(train, x = "Petal.Length", y="Petal.Width", shape = "Species", color = "Species", palette = "lancet", size = 2, alpha = 0.6) + stat_smooth(method=lm, se=FALSE, colour="black")
xdens2 <- ggdensity(train, "Petal.Length", colour = "Species", fill = "Species", palette = "lancet")
ydens2 <- ggdensity(train, "Petal.Width", colour = "Species", fill = "Species", palette = "lancet") + rotate()
ggarrange(xdens2, NULL, scat2, ydens2, nrow = 2, ncol = 2, align = "hv", widths = c(2,1), heights = c(1,2), common.legend = TRUE)
These plots are revelatory. The sepal measurements alone will be unhelpful for distinguishing between species, particularly Versicolor and Virginica. However, the petal measurements are positively associated and appear to form three clusters. This leads me to expect that petal length and width will have a stronger bearing on the model, but that the sepal measurements may help to separate Versicolors from Virginicas. On with the modelling.
It seems fitting (excuse the pun) to start with the method that Fisher himself used, linear discriminant analysis (LDA), and then attempt to improve upon it using other machine learning algorithms. LDA is useful for classification problems where the response variable has more than 2 classes, which is the case here; otherwise we could use logistic regression. LDA is predicated on the assumption that the predictors are normally distributed within each class, and from the bell-curve shape of the density plots, it’s reasonable to proceed.
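For a slightly more formal check of that assumption, something along these lines could be used (a sketch, shown here for petal length only; the other variables can be checked the same way):
# Shapiro-Wilk test of within-species normality for one predictor
train %>%
  group_by(Species) %>%
  summarise(shapiro_p = shapiro.test(Petal.Length)$p.value)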
model_LDA <- lda(Species ~ ., train)
model_LDA
## Call:
## lda(Species ~ ., data = train)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 4.9725 3.3925 1.4500 0.2375
## versicolor 5.9375 2.7675 4.2675 1.3375
## virginica 6.6450 3.0100 5.6075 2.0450
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.8776422 0.08459848
## Sepal.Width 1.3427090 2.27738505
## Petal.Length -2.2499175 -0.91449073
## Petal.Width -2.6969658 2.67386351
##
## Proportion of trace:
## LD1 LD2
## 0.9905 0.0095
From the model outputs, I can use the coefficients of each linear discriminant to write out the discriminant functions (to 2 decimal places):

\[LD1 = 0.88 \times Sepal.Length + 1.34 \times Sepal.Width - 2.25 \times Petal.Length - 2.70 \times Petal.Width\]

\[LD2 = 0.08 \times Sepal.Length + 2.28 \times Sepal.Width - 0.91 \times Petal.Length + 2.67 \times Petal.Width\]
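To check my reading of these coefficients, the discriminant scores that predict() returns can be reproduced by hand: centre each predictor at the prior-weighted mean of the group means and multiply by the scaling matrix. A minimal sketch, using the model object fitted above:
# reproduce the LD1/LD2 scores manually from the fitted model
centre <- colSums(model_LDA$prior * model_LDA$means) # prior-weighted mean of group means
ld_scores <- scale(as.matrix(train[, 1:4]), center = centre, scale = FALSE) %*% model_LDA$scaling
head(ld_scores) # should match head(predict(model_LDA)$x)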
The proportion of trace tells us how much of the between-group variance each discriminant captures. Given the very high value for the first discriminant (0.9905), LD1 explains over 99% of that variance, so LD2 contributes very little to separating the species, as can be seen from the plot below.
dataforplot <- data.frame(species = train$Species, lda = predict(model_LDA)$x)
ggscatter(dataforplot, x="lda.LD1", y="lda.LD2", shape = "species", color = "species", palette = "lancet", size = 2, alpha = 0.6)
And now on to the predictions: is this model good at predicting Iris species based on 4 measurements?
predict_LDA <- predict(model_LDA, test)
confusion_LDA <- confusionMatrix(predict_LDA$class, test$Species)
confusion_LDA
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 0
## virginica 0 1 10
##
## Overall Statistics
##
## Accuracy : 0.9667
## 95% CI : (0.8278, 0.9992)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.963e-13
##
## Kappa : 0.95
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 1.0000
## Specificity 1.0000 1.0000 0.9500
## Pos Pred Value 1.0000 1.0000 0.9091
## Neg Pred Value 1.0000 0.9524 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.3333
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.9500 0.9750
96.7% accuracy has been achieved with only one mistaken prediction, so this is a high benchmark, but let’s see if we can better it. I’ll store this result for later.
Models_Accuracies <- tibble(model = "LDA", accuracy = confusion_LDA$overall['Accuracy'])
I have chosen a classification tree as the next algorithm because its output is generally easy to interpret (especially when the number of predictor variables is low) and, being non-parametric, trees are good at handling a mixed bag of variable types.
library(rpart)
library(rpart.plot)
iristree <- rpart(formula = Species ~ ., data = train, control = rpart.control(minsplit = 5, minbucket = 2), method = "class") # method = "class" directs the function to treat Species as a categorical variable.
print(iristree)
## n= 120
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 120 80 setosa (0.33333333 0.33333333 0.33333333)
## 2) Petal.Length< 2.45 40 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 80 40 versicolor (0.00000000 0.50000000 0.50000000)
## 6) Petal.Length< 5.05 44 4 versicolor (0.00000000 0.90909091 0.09090909)
## 12) Petal.Width< 1.75 39 0 versicolor (0.00000000 1.00000000 0.00000000) *
## 13) Petal.Width>=1.75 5 1 virginica (0.00000000 0.20000000 0.80000000) *
## 7) Petal.Length>=5.05 36 0 virginica (0.00000000 0.00000000 1.00000000) *
rpart.plot(iristree)
The plot itself acts as a useful set of rules that can be applied to further data to predict species (they are applied in code just below). The allocation rules are:

- If Petal.Length < 2.5cm, then species = Setosa.
- If Petal.Length >= 5.1cm, or if Petal.Length is between 2.5cm and 5.1cm and Petal.Width >= 1.8cm, then species = Virginica.
- If Petal.Length is between 2.5cm and 5.1cm and Petal.Width < 1.8cm, then species = Versicolor.
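As a sketch of how these rules translate into code (using the exact split points from the printed tree rather than the rounded values shown in the plot), the same allocation can be written as a case_when and compared against the true species:
# apply the tree's split rules by hand to the test set
manual_pred <- test %>%
  mutate(rule_species = case_when(
    Petal.Length < 2.45 ~ "setosa",
    Petal.Length >= 5.05 | Petal.Width >= 1.75 ~ "virginica",
    TRUE ~ "versicolor"
  ))
table(manual_pred$rule_species, manual_pred$Species)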
It’s interesting to note that this model does not use either of the sepal variables. I suspect that this will hamper the model’s accuracy.
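We can check this against the variable importance scores stored in the fitted tree (a quick sketch; note that rpart can still assign some importance to the sepal variables through surrogate and competing splits, even though they do not appear in the final tree):
# importance accumulated over primary, surrogate and competing splits
iristree$variable.importance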
Let’s look at how well the model performs:
predict_tree <- predict(iristree, test, type="class")
confusion_tree <- confusionMatrix(predict_tree, test$Species)
Models_Accuracies <- add_row(Models_Accuracies, model = "Classification Tree", accuracy = confusion_tree$overall['Accuracy'])
confusion_tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 2
## virginica 0 1 8
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.7347, 0.9789)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.665e-10
##
## Kappa : 0.85
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 0.8000
## Specificity 1.0000 0.9000 0.9500
## Pos Pred Value 1.0000 0.8182 0.8889
## Neg Pred Value 1.0000 0.9474 0.9048
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.2667
## Detection Prevalence 0.3333 0.3667 0.3000
## Balanced Accuracy 1.0000 0.9000 0.8750
We have 3 flowers misclassified, giving an accuracy of 27 / 30 = 90%, so the model does not perform as well as the LDA. The next logical step is to combine multiple trees, so on to random forests.
Unlike regression models, a random forest has hyperparameters that need to be chosen before the model is fit. First, though, I will run a simple baseline random forest with the default settings, which we can then tune to try to improve accuracy.
library(randomForest)
rf_baseline <- randomForest(Species ~ ., train, importance=TRUE, proximity=TRUE)
rf_baseline
##
## Call:
## randomForest(formula = Species ~ ., data = train, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 3.33%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 40 0 0 0.00
## versicolor 0 38 2 0.05
## virginica 0 2 38 0.05
The out-of-bag (OOB) error for this model is 3.33%, which equates to an accuracy of 96.7%. The error breaks down into 2 versicolor flowers misclassified as virginica and 2 virginica flowers misclassified as versicolor.
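As an aside, plotting the fitted object shows how the OOB and per-class error rates settle down as more trees are added (a quick sketch using randomForest’s built-in plot method):
# OOB and per-class error rates as a function of the number of trees
plot(rf_baseline)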
varImpPlot(rf_baseline)
The importance information and plots show that the most important predictors in the model are petal length and petal width. No surprise given what we’ve seen already in our scatterplots of length v width, and also the decision tree earlier on that completely disregarded sepal length and width.
Now let’s see how the trained model performs on the test set.
predict_rf_baseline <- predict(rf_baseline, test)
confusion_RF <- confusionMatrix(predict_rf_baseline, test$Species)
Models_Accuracies <- add_row(Models_Accuracies, model = "Random Forest baseline", accuracy = confusion_RF$overall['Accuracy'])
confusion_RF
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 2
## virginica 0 1 8
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.7347, 0.9789)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.665e-10
##
## Kappa : 0.85
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 0.8000
## Specificity 1.0000 0.9000 0.9500
## Pos Pred Value 1.0000 0.8182 0.8889
## Neg Pred Value 1.0000 0.9474 0.9048
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.2667
## Detection Prevalence 0.3333 0.3667 0.3000
## Balanced Accuracy 1.0000 0.9000 0.8750
The test accuracy here is not as strong as the out-of-bag accuracy, but the pattern of class errors is similar: there is no difficulty in classifying the setosa flowers, and the misclassifications come from confusing the other two species.
I will now make the random forest a little more complex by increasing the number of trees and the number of variables tried at each split (note that with only 4 predictors, mtry = 4 means every variable is considered at every split).
rf_2 <- randomForest(formula = Species ~ ., data = train, importance = TRUE, proximity = TRUE, ntree=1000, mtry=4)
rf_2
##
## Call:
## randomForest(formula = Species ~ ., data = train, importance = TRUE, proximity = TRUE, ntree = 1000, mtry = 4)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 2.5%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 40 0 0 0.000
## versicolor 0 38 2 0.050
## virginica 0 1 39 0.025
There is a small uptick in out-of-bag accuracy, with the number of misclassifications falling from 4 to 3.
varImpPlot(rf_2)
Petal length is now even more important in this model and the importance of sepal length and width has fallen even more.
predict_rf_2 <- predict(rf_2, test)
confusion_rf2 <- confusionMatrix(predict_rf_2, test$Species)
Models_Accuracies <- add_row(Models_Accuracies, model = "Random Forest complex", accuracy = confusion_rf2$overall['Accuracy'])
confusion_rf2
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 2
## virginica 0 1 8
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.7347, 0.9789)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.665e-10
##
## Kappa : 0.85
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 0.8000
## Specificity 1.0000 0.9000 0.9500
## Pos Pred Value 1.0000 0.8182 0.8889
## Neg Pred Value 1.0000 0.9474 0.9048
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.2667
## Detection Prevalence 0.3333 0.3667 0.3000
## Balanced Accuracy 1.0000 0.9000 0.8750
Unfortunately, although the accuracy improved on the training data, it did not alter the accuracy on the test set: the model still misclassifies 3 of the 30 test cases.
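Rather than adjusting ntree and mtry by hand, a more systematic option would be to let caret search over mtry with cross-validation. A minimal sketch, assuming the train data frame above (the selected value will depend on the resampling seed):
# 5-fold cross-validation over mtry = 1..4 using caret's wrapper around randomForest
set.seed(49)
rf_tuned <- train(Species ~ ., data = train, method = "rf",
                  tuneGrid = expand.grid(mtry = 1:4),
                  trControl = trainControl(method = "cv", number = 5))
rf_tuned$bestTune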
Finally, I will use K-nearest neighbours as an example of an instance-based algorithm: it is the only method used here that does not fit an explicit model during training, so it will be interesting to see how it performs. I will fit the classifier for each value of k between 1 and 10 inclusive.
library(class)
model_knn <- list()
accuracy_knn <- numeric()
for (i in 1:10) {
  model_knn[[i]] <- knn(train[, -5], test[, -5], train$Species, k = i, prob = TRUE)
  # test-set accuracy (%) for this value of k
  accuracy_knn[i] <- sum(model_knn[[i]] == test$Species) / length(test$Species) * 100
}
accuracy_knn
## [1] 90.00000 90.00000 90.00000 93.33333 93.33333 93.33333 93.33333
## [8] 93.33333 93.33333 93.33333
The accuracy of the model improves a little once k exceeds 3, but the ceiling seems to be around the 93.3% mark. With such a small dataset I would not take k much higher than 10: each class has only 40 training observations, so large neighbourhoods would start to blur the class boundaries.
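A quick plot of the accuracies computed above makes the plateau easier to see (a minimal sketch using base graphics):
# test-set accuracy for each value of k
plot(1:10, accuracy_knn, type = "b", xlab = "k", ylab = "Test accuracy (%)")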
model_knn4 <- knn(train[,-5], test[,-5], train$Species, k=4, prob=TRUE)
confusion_knn <- table(test$Species, model_knn4)
confusion_knn
## model_knn4
## setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 1
## virginica 0 1 9
Models_Accuracies <- add_row(Models_Accuracies, model = "KNN", accuracy = accuracy_knn[4]/100)
Looking at the summary of accuracies below:
arrange(Models_Accuracies, desc(accuracy))
## # A tibble: 5 x 2
## model accuracy
## <chr> <dbl>
## 1 LDA 0.967
## 2 KNN 0.933
## 3 Classification Tree 0.9
## 4 Random Forest baseline 0.9
## 5 Random Forest complex 0.9
We can see that the very algorithm Fisher himself used to classify Iris flowers performed best, followed by K-nearest neighbours, while all of the tree-based methods (whether single trees or ensembles) came joint last. On reflection, this makes sense: the dataset is not extensive in either the number of variables or the number of observations. The LDA made use of all 4 variables and so captured the structure in the data more fully, whereas by excluding or downplaying 2 of the 4 variables, the tree-based methods could not match the accuracy of the LDA, or even the KNN model.