Abstract
A machine learning approach to analysing three similar-looking Iris flower species and predicting a flower's species from its morphological features.
Executive summary
Aim
- To analyse the Iris multivariate data set and determine the best prediction model(s) to predict a flower species based on its morphological features.
Methodology
Data
- The Iris multivariate data set, with five (5) variables and 150 samples, was used for this analysis and model selection.
Technique(s)
Supervised machine learning.
Three machine learning techniques, namely, quadratic discriminant analysis (QDA), decision tree and naive Bayes classifier techniques were used to analyse, classify, and predict flower species.
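Data were divided into training and test data sets for model building and evaluation, respectively. The prediction accuracy results for the decision tree and naive Bayes models are based on performance on the test data set; the QDA model was evaluated by leave-one-out cross-validation.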
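The split described above can be sketched as follows. This is a minimal Python illustration, not the R code used for the report: the 90/60 sizes are taken from the model output in the technical notes, while the seed and the function name `split_indices` are assumptions made for the example.

```python
import random

def split_indices(n_rows, n_train, seed=0):
    """Shuffle row indices and cut them into a training and a test part."""
    rng = random.Random(seed)          # fixed seed (an assumption) for reproducibility
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# 150 flowers in total, 90 used for training and the remaining 60 for testing.
train_idx, test_idx = split_indices(n_rows=150, n_train=90)
print(len(train_idx), len(test_idx))
```

A purely random split like this one does not guarantee equal class proportions in the two parts, which is consistent with the unequal prior probabilities reported for the training data.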
Results
Quadratic discriminant analysis
The quadratic discriminant analysis model classified and predicted flower species with an overall accuracy of about 94.7% (142 of 150 flowers). In particular, the model predicted Setosa with an accuracy of 100%, while Versicolor and Virginica were predicted with accuracies of 90% and 94%, respectively.
The first discriminant component alone explained 99% of the variation in the data.
Decision tree
The decision tree model predicted flower species with an overall accuracy of 95% (57 of 60 test flowers). All of its Setosa and Virginica predictions were correct (100%), while 85% of its Versicolor predictions were correct (17 of 20).
Interpretation:
When the petal length is \(\le 1.9\) cm, the flower species is highly likely to be Setosa.
When the petal length is \(> 1.9\) cm and the petal width is \(\le 1.7\) cm, the flower species is highly likely to be Versicolor.
When the petal length is \(> 1.9\) cm and the petal width is \(> 1.7\) cm, the flower species is highly likely to be Virginica.
Naive Bayes classifier
- The naive Bayes classifier predicted flower species with an overall accuracy of 96.7% (58 of 60 test flowers). All of its Setosa and Virginica predictions were correct (100%), while 89.5% of its Versicolor predictions were correct (17 of 19).
Conclusion
- If the objective is higher prediction accuracy, the naive Bayes classifier is preferable to the decision tree model. If the objective is model simplicity, the decision tree model is preferable to the naive Bayes classifier.
Technical notes
Exploratory data analysis
Structure of the data set
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Class : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The output above shows there are 5 (five) variables and 150 observations in the data set.
The four (4) variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are numeric and represent the morphological features of a flower in centimetres (cm).
The Class is a categorical variable, and contains the species or class of a flower. There are three different flower species, namely, Setosa, Versicolor and Virginica.
Therefore, we have a multivariate data set with 4 (four) independent variables, and 1 (one) dependent variable.
Descriptive statistics of the data set
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Class |
|---|---|---|---|---|
| Min. :4.30 | Min. :2.00 | Min. :1.00 | Min. :0.1 | setosa :50 |
| 1st Qu.:5.10 | 1st Qu.:2.80 | 1st Qu.:1.60 | 1st Qu.:0.3 | versicolor:50 |
| Median :5.80 | Median :3.00 | Median :4.35 | Median :1.3 | virginica :50 |
| Mean :5.84 | Mean :3.05 | Mean :3.76 | Mean :1.2 | |
| 3rd Qu.:6.40 | 3rd Qu.:3.30 | 3rd Qu.:5.10 | 3rd Qu.:1.8 | |
| Max. :7.90 | Max. :4.40 | Max. :6.90 | Max. :2.5 | |
The output above shows the descriptive statistics for the flowers’ morphological features: minimum, first quartile, median, mean, third quartile and maximum. There are no missing data.
The ranges of Sepal Width (2.4 cm) and Petal Width (2.4 cm) are narrower than those of Sepal Length (3.6 cm) and Petal Length (5.9 cm). All features were measured in centimetres (cm).
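As a small illustration of how such a summary is computed, the Python sketch below applies the same six statistics to the first ten Sepal.Length values listed in the structure output. It is not the R call used for the report; the function name `six_number_summary` is an assumption, and the `inclusive` quantile method is chosen because it should correspond to R's default quantile rule.

```python
import statistics

# First ten Sepal.Length values shown in the structure output above (cm).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

def six_number_summary(values):
    """Minimum, quartiles, median, mean and maximum of a numeric sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }

summary = six_number_summary(sepal_length)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Because only ten rows are used, the values differ from the full-data table above.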
Visualisation of the data
Box-plots
The box-plots above show the distribution of each morphological feature against the flower species.
Range:
- The morphological feature range for each flower species appears to be different.
Distribution:
- The plots of Sepal Width and Petal Width show narrower distribution compared with the plots of Sepal Length and Petal Length.
- The distribution of morphological features for each species appears to be reasonably normal.
Skewness:
- The plot of Petal Length shows slight left-skewness for Versicolor species.
Outliers:
- The plot of Sepal Length shows a potential outlier for Virginica.
- The plot of Petal Length shows a potential outlier for Versicolor.
In summary, there are no major issues with the data set.
Pairs plot and correlation coefficients
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | 1.00 | -0.11 | 0.87 | 0.82 |
| Sepal.Width | -0.11 | 1.00 | -0.42 | -0.36 |
| Petal.Length | 0.87 | -0.42 | 1.00 | 0.96 |
| Petal.Width | 0.82 | -0.36 | 0.96 | 1.00 |
The pairs plot above shows a matrix of scatter-plots representing the pairwise relationships between the explanatory variables. The flower species, shown in different colours, appear to form clusters.
Based on the Pearson correlation coefficient, there is positive correlation between
- Petal Length and Petal Width (0.96),
- Sepal Length and Petal Length (0.87), and
- Sepal Length and Petal Width (0.82).
and negative correlation between
- Petal Length and Sepal Width (-0.42), and
- Petal Width and Sepal Width (-0.36).
In summary, several variables are highly correlated with each other, most notably Petal Length and Petal Width (0.96).
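The Pearson coefficient used above can be written out directly. The following Python sketch is illustrative only: the function name `pearson` is an assumption, and the input is limited to the first ten rows shown in the structure output (all Setosa), so the result is not comparable with the full-data value of -0.11 in the table.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# First ten Sepal.Length and Sepal.Width values from the structure output (cm).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1]
r = pearson(sepal_length, sepal_width)
print(round(r, 2))
```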
Model: Quadratic discriminant analysis (QDA)
Multivariate data assumptions check
Normality test
Based on the Mahalanobis distance test and histogram plot above, there appears to be some right-skewness in the data.
The Q-Q plot above shows observations deviating from the line at the top suggesting right-skewness in the data.
In summary, the normality check raises no major concerns: apart from mild right-skewness, the data appear to be approximately normally distributed.
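For readers unfamiliar with the Mahalanobis distance used in this check, the Python sketch below computes squared distances for two variables only, using the first ten rows from the structure output. It is a simplified illustration under those assumptions (two features, ten rows, hypothetical function name `mahalanobis_sq`), not a reproduction of the four-variable test in the report.

```python
# First ten Sepal.Length and Sepal.Width values from the structure output (cm).
rows = [(5.1, 3.5), (4.9, 3.0), (4.7, 3.2), (4.6, 3.1), (5.0, 3.6),
        (5.4, 3.9), (4.6, 3.4), (5.0, 3.4), (4.4, 2.9), (4.9, 3.1)]

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] with divisor n - 1.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d' S^-1 d using the closed-form 2x2 inverse.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

d2 = mahalanobis_sq(rows)
print([round(d, 2) for d in d2])
```

Under multivariate normality these squared distances are usually compared with chi-squared quantiles (degrees of freedom equal to the number of variables), which is the comparison a Q-Q plot of this kind typically displays.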
Equal variance test
## Df Pillai approx F num Df den Df Pr(>F)
## Class 2 0.39499 8.9209 8 290 6.198e-11 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the equality-of-variance test output above, there is strong evidence (p ≈ 6.2 × 10^-11) against the hypothesis that the three flower species share the same variance structure. The equal-variance assumption is therefore violated, and linear discriminant analysis, which relies on it, is not appropriate.
Quadratic discriminant analysis, which does not require equal variances across classes, can be used instead for further analysis and classification.
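The "approx F" column in the test output can be related to Pillai's trace through the usual conversion, sketched below in Python. The numbers are read from the output above; the value s = 2 and the function name `pillai_to_f` are assumptions introduced for the illustration.

```python
def pillai_to_f(pillai, s, num_df, den_df):
    """Approximate F statistic for Pillai's trace V: (V / (s - V)) * (den_df / num_df)."""
    return (pillai / (s - pillai)) * (den_df / num_df)

# Values read from the test output above; s = 2 is an assumption
# (the smaller of the number of responses and the hypothesis degrees of freedom).
f_value = pillai_to_f(pillai=0.39499, s=2, num_df=8, den_df=290)
print(round(f_value, 2))
```

The result agrees with the reported approximate F of 8.9209 to within the rounding of the printed Pillai value.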
Fit a QDA model
## Call:
## qda(Class ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris.qda.df)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## Sepal.Length Sepal.Width Petal.Width
## setosa 5.006 3.418 0.244
## versicolor 5.936 2.770 1.326
## virginica 6.588 2.974 2.026
The QDA model summary output above shows the fitted model, prior probability and group means.
The QDA model was fitted with three predictor variables: Sepal Length, Sepal Width and Petal Width. Petal Length was dropped from the model because of multicollinearity (its correlation with Petal Width is 0.96).
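For reference, the textbook form of the quadratic discriminant score is given below. The notation is introduced here and is not part of the fitted output: \(x\) is the vector of the three predictors, \(\mu_k\) and \(\Sigma_k\) are the mean vector and covariance matrix of species \(k\), and \(\pi_k\) is its prior probability (1/3 for each species in the output above).

```latex
\[
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
              - \tfrac{1}{2}\,(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
              + \log\pi_k
\]
```

A flower is assigned to the species with the largest score. Because each species has its own \(\Sigma_k\), the boundaries between species are quadratic in \(x\), which is what distinguishes QDA from linear discriminant analysis.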
Perform leave-one-out cross-validation
## [1] "Classification by QD1:142"
## [1] "Classification by QD2:142"
- From the output above, the first quadratic discriminant function (QD1) does most of the work: the count reported for QD1 (142) does not change when the second function (QD2) is considered, so QD2 adds no further value. QD1 therefore explains most of the variation in the data and is sufficient to classify and predict the flower species.
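The leave-one-out procedure itself can be sketched as follows. To keep the Python example short and self-contained, a simple nearest-mean classifier on one feature stands in for QDA, and the petal widths are hypothetical values invented for illustration; neither is taken from the report.

```python
def nearest_mean_fit(xs, ys):
    """Return the per-class mean of a single numeric feature."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def nearest_mean_predict(means, x):
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def loocv_correct(xs, ys):
    """Leave-one-out cross-validation: count held-out rows predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        # Refit on every row except row i, then predict row i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        means = nearest_mean_fit(train_x, train_y)
        correct += nearest_mean_predict(means, xs[i]) == ys[i]
    return correct

# Hypothetical petal widths (cm), invented for illustration only.
widths = [0.2, 0.3, 0.2, 1.3, 1.4, 1.2, 2.0, 2.1, 1.9]
labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3
print(loocv_correct(widths, labels), "of", len(widths))
```

Each row is predicted by a model that never saw it, so the count of correct predictions is an out-of-sample measure even though no separate test set is held back.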
Classification accuracy scores
| setosa | versicolor | virginica |
|---|---|---|
| 1 | 0.00 | 0.00 |
| 0 | 0.90 | 0.10 |
| 0 | 0.06 | 0.94 |
- The classification table and discriminant scores above suggest that the fitted QDA model has an overall classification accuracy of about 94.7% (142 of 150 flowers).
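The overall figure can be recovered from the table with a few lines of arithmetic, sketched here in Python. The row order (setosa, versicolor, virginica) is an assumption, since the table carries no row labels; the class size of 50 comes from the descriptive statistics.

```python
# Row proportions from the classification table above (rows sum to 1).
# Row order is assumed to match the column order.
row_props = [
    [1.00, 0.00, 0.00],   # setosa
    [0.00, 0.90, 0.10],   # versicolor
    [0.00, 0.06, 0.94],   # virginica
]
class_size = 50  # flowers per species, from the descriptive statistics

# Convert the diagonal proportions back to counts of correct classifications.
correct = [round(row_props[i][i] * class_size) for i in range(3)]
overall = sum(correct) / (class_size * 3)
print(correct, round(overall, 4))
```

The total of correct classifications matches the count of 142 reported in the cross-validation output.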
Model: Decision tree
Fit a decision tree model
##
## Call:
## C5.0.formula(formula = Class ~ ., data = iris.train)
##
##
## C5.0 [Release 2.07 GPL Edition] Mon Dec 28 12:53:36 2020
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 90 cases (5 attributes) from undefined.data
##
## Decision tree:
##
## Petal.Length <= 1.9: setosa (27)
## Petal.Length > 1.9:
## :...Petal.Width <= 1.7: versicolor (34/2)
## Petal.Width > 1.7: virginica (29/1)
##
##
## Evaluation on training data (90 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 3 3( 3.3%) <<
##
##
## (a) (b) (c) <-classified as
## ---- ---- ----
## 27 (a): class setosa
## 32 1 (b): class versicolor
## 2 28 (c): class virginica
##
##
## Attribute usage:
##
## 100.00% Petal.Length
## 70.00% Petal.Width
##
##
## Time: 0.0 secs
The decision tree model summary output above shows that a flower species can be predicted using two features, petal length and petal width.
The training data classification summary output above shows that the decision tree's accuracy on the training data is about 96.7% (3 errors in 90 cases).
Decision tree cross validation
| | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 23 | 0 | 0 |
| versicolor | 0 | 17 | 3 |
| virginica | 0 | 0 | 17 |
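- The test data classification summary output above shows that the decision tree prediction accuracy is 95% (57 of 60 test flowers correctly classified). The three errors all fall in the versicolor row and virginica column.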
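The figures quoted for this table can be checked with the short Python sketch below, which computes the overall accuracy together with the share of each row and of each column that lies on the diagonal. The variable names are assumptions; the counts are those in the table.

```python
labels = ["setosa", "versicolor", "virginica"]
# Decision tree test-set confusion matrix from the table above.
confusion = [
    [23, 0, 0],
    [0, 17, 3],
    [0, 0, 17],
]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(3)) / total

# Share of each row and each column that lies on the diagonal.
row_rate = {labels[i]: confusion[i][i] / sum(confusion[i]) for i in range(3)}
col_totals = [sum(row[j] for row in confusion) for j in range(3)]
col_rate = {labels[j]: confusion[j][j] / col_totals[j] for j in range(3)}

print(f"accuracy = {accuracy:.3f}")
print("row-wise:", row_rate)
print("column-wise:", col_rate)
```

The 85% quoted for Versicolor in the executive summary is the row-wise figure. The column totals (23, 17, 20) complement the training class counts (27, 33, 30) to 50 per species, which suggests that columns hold the true species and rows the predictions, although the output does not state this explicitly.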
Decision tree diagram
The decision tree diagram above suggests that a flower species can be predicted by two features, i.e., by its petal length and petal width.
Model interpretation:
When the petal length is \(\le 1.9\) cm, the flower species is highly likely to be Setosa.
When the petal length is \(> 1.9\) cm and the petal width is \(\le 1.7\) cm, the flower species is highly likely to be Versicolor.
When the petal length is \(> 1.9\) cm and the petal width is \(> 1.7\) cm, the flower species is highly likely to be Virginica.
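The three rules above translate directly into code. The Python sketch below uses the two thresholds reported by the fitted tree; the function name `predict_species` and the last two example measurements are hypothetical.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Apply the two splits reported by the fitted decision tree."""
    if petal_length_cm <= 1.9:
        return "setosa"
    if petal_width_cm <= 1.7:
        return "versicolor"
    return "virginica"

# First observation in the data set: petal length 1.4 cm, petal width 0.2 cm.
print(predict_species(1.4, 0.2))
# Hypothetical measurements on either side of the petal-width split.
print(predict_species(4.5, 1.5))
print(predict_species(5.5, 2.0))
```

The sepal measurements do not appear at all, which is what makes this the simplest of the three models to apply by hand.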
Model: Naive Bayes Classifier
Fit a naive Bayes classification model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = iris.train[, 1:4], y = iris.train[, 5])
##
## A-priori probabilities:
## iris.train[, 5]
## setosa versicolor virginica
## 0.3000000 0.3666667 0.3333333
##
## Conditional probabilities:
## Sepal.Length
## iris.train[, 5] [,1] [,2]
## setosa 5.033333 0.3441824
## versicolor 5.984848 0.4944196
## virginica 6.673333 0.6596411
##
## Sepal.Width
## iris.train[, 5] [,1] [,2]
## setosa 3.466667 0.3594868
## versicolor 2.775758 0.3269464
## virginica 2.976667 0.3276703
##
## Petal.Length
## iris.train[, 5] [,1] [,2]
## setosa 1.485185 0.1586086
## versicolor 4.300000 0.4472136
## virginica 5.593333 0.6191949
##
## Petal.Width
## iris.train[, 5] [,1] [,2]
## setosa 0.2407407 0.09710921
## versicolor 1.3363636 0.20739181
## virginica 2.0166667 0.24925175
- The model summary output above shows the model fitted to the training data, the prior probabilities by flower species, and, for each morphological feature, the conditional distribution by species summarised by its mean (column [,1]) and standard deviation (column [,2]).
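To show how these quantities combine into a prediction, the Python sketch below scores a flower under each species using a normal density per feature and the prior probabilities. The parameters are copied from the output above; the normal-density form, the use of log scores and the names `log_normal_pdf` and `predict` are assumptions of this illustration rather than a description of the fitted R object's internals.

```python
import math

# Priors, means and standard deviations copied from the model output above.
PRIORS = {"setosa": 0.3000000, "versicolor": 0.3666667, "virginica": 0.3333333}
# Feature order: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width.
PARAMS = {
    "setosa": [(5.033333, 0.3441824), (3.466667, 0.3594868),
               (1.485185, 0.1586086), (0.2407407, 0.09710921)],
    "versicolor": [(5.984848, 0.4944196), (2.775758, 0.3269464),
                   (4.300000, 0.4472136), (1.3363636, 0.20739181)],
    "virginica": [(6.673333, 0.6596411), (2.976667, 0.3276703),
                  (5.593333, 0.6191949), (2.0166667, 0.24925175)],
}

def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -math.log(sd) - 0.5 * math.log(2 * math.pi) - 0.5 * ((x - mean) / sd) ** 2

def predict(features):
    """Return the class with the highest log prior plus summed log densities."""
    scores = {
        label: math.log(PRIORS[label])
        + sum(log_normal_pdf(x, m, s) for x, (m, s) in zip(features, PARAMS[label]))
        for label in PRIORS
    }
    return max(scores, key=scores.get)

# First observation in the data set (cm).
print(predict([5.1, 3.5, 1.4, 0.2]))
```

Summing the per-feature log densities is where the "naive" independence assumption enters: the features are treated as independent within each species, despite the high correlations noted in the exploratory analysis.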
Naive Bayes classifier distribution plot
- The plot above shows the distribution of morphological features by flower species. Setosa appears to have a petal length and petal width clearly distinct from those of Versicolor and Virginica. The remaining morphological features overlap across all three flower species.
Naive Bayes classifier cross validation
| | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 23 | 0 | 0 |
| versicolor | 0 | 17 | 2 |
| virginica | 0 | 0 | 18 |
- The test data classification summary output above shows that the naive Bayes classifier prediction accuracy is 96.7% (58 of 60 test flowers correctly classified).
References
- Iris, Ronald Fisher, 1936, Data set, https://archive.ics.uci.edu/ml/datasets/iris.
- Iris, Several authors, Several years, Image, https://en.wikipedia.org/wiki/Iris_flower_data_set.
- GGPlot2, Hadley Wickham, 2016, R package, https://ggplot2.tidyverse.org.
- Decision tree, Max Kuhn, 2020, R package, https://www.rdocumentation.org/packages/C50.
- Naive Bayes classifier, Several authors, 2019, R package, https://cran.r-project.org/web/packages/e1071/index.html.
Keywords
- iris, multivariate, machine learning, supervised learning, classification, prediction, discriminant analysis, QDA, decision tree, naive Bayes classifier.