Simple Analysis of the Anderson/Fisher Iris Data

Author

Dr Andrew Dalby

Analysing The Iris Data

The Iris Data has become a standard test set for statistical data-mining and machine learning. The purpose of the dataset was to classify the species/subspecies of blue irises, and the measurements were first collected in 1935 by Edgar Anderson on the Gaspe Peninsula (Anderson 1935). The data were published in the Bulletin of the American Iris Society, and if anyone has access to a copy I would like to see the original, because it doesn’t seem to be available anywhere.

Anderson then followed this up with a paper called “The Species Problem in Iris” in 1936 (Anderson 1936). In this paper he detailed the distribution and properties of the main blue iris species in America. Geography plays a significant part in their identification as they have different but overlapping ranges. There are also significant differences in seed size, and they have distinct karyotypes, which makes them distinct species. From the outline drawings, the three main species, Iris versicolor, Iris virginica and Iris setosa, can also be distinguished by the shapes and sizes of their petals and sepals. Anderson created ideographs for the different species based on these sepal and petal measurements. This paper contains summaries of the numerical data but not the numerical data itself.

The dataset gained significance when it was used by Fisher to create a numerical method for automatically creating a taxonomy based on measurements of multiple characteristics (Fisher 1936). The advantage of these methods is that they can be automated and they often use easy-to-collect data. This means that subsequent classification does not require an expert; it also acquires a sense of objectivity, because it does not depend on the subjective opinion of the person carrying out the evaluation. Fisher published his study in the Annals of Eugenics and cited measurements from craniometry and from mandibles as other examples of using multiple measurements to create a classification.

The dataset contains 50 plants from each of the three species. Four measurements are made for each plant: sepal length, sepal width, petal length and petal width. In the original Anderson paper shape was a big determinant, and I am far from convinced that these single measurements capture those shape differences.

The Iris dataset is available within R as one of the example datasets (R Core Team 2016). Note that in some older versions of the dataset used for machine learning there was a typographical error in one of the data points (Bezdek et al. 1999).

# load the packages used for data handling and plotting
library("tibble")
library("ggplot2")

# the long version: one row per flower, with a Species column
iris_tib <- as_tibble(iris)

# the wide version: iris3 is a 50 x 4 x 3 array of measurements by species
iris3_tib <- as_tibble(iris3)

This dataset has 4 measurements for each experimental unit that are used to classify the plants into three different classes, with 50 data points for each class. This allows the construction of a linear model containing 5 parameters, one for each variable plus an intercept, without any concern about over-fitting. As there are more than two classes you cannot fit a simple (binary) logistic model to the data.
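A quick structural check confirms the counts described above; this is a minimal sketch that assumes dplyr is installed (it is not loaded in the original analysis):

library("dplyr")

# four numeric measurements plus the Species factor
glimpse(iris_tib)

# 50 observations for each of the three species
count(iris_tib, Species)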

Exploratory Data Analysis

To become familiar with your data it is a good idea to carry out some basic exploratory data analysis. There are long (iris) and wide (iris3) versions of the dataset available; I am going to use the long version to summarise the data. To check for normality, a good place to start is with histograms of each variable, which I plotted using ggplot2 (Wickham 2016).
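For reference, the shapes of the two built-in versions can be checked directly; a small sketch, not part of the original write-up:

dim(iris)   # 150 x 5: long format, one row per flower plus a Species column
dim(iris3)  # 50 x 4 x 3: wide format, one slice of measurements per species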

# per-species numerical summaries (filter() comes from dplyr)
#summary(filter(iris_tib, Species=="setosa"))
#summary(filter(iris_tib, Species=="versicolor"))
#summary(filter(iris_tib, Species=="virginica"))

ggplot(iris_tib,aes(x=Sepal.Length)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
    ggtitle("The Histogram of Sepal Length (cm)") + 
    xlab("Sepal Length (cm)") + ylab("")

ggplot(iris_tib,aes(x=Petal.Length)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
    ggtitle("The Histogram of Petal Length (cm)") + 
    xlab("Petal Length (cm)") + ylab("")

ggplot(iris_tib,aes(x=Sepal.Width)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
    ggtitle("The Histogram of Sepal Width (cm)") + 
    xlab("Sepal Width (cm)") + ylab("")

ggplot(iris_tib,aes(x=Petal.Width)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
    ggtitle("The Histogram of Petal Width (cm)") + 
    xlab("Petal Width (cm)") + ylab("")

These plots are clearly multi-modal because they pool the overlapping distributions of the different species. You can use facets in ggplot to create separate plots for each species. The plots also suggest that the petal measurements discriminate between the species better than the sepal measurements.

ggplot(iris_tib,aes(x=Sepal.Length)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
    ggtitle("The Histogram of Sepal Length (cm)") + 
    xlab("Sepal Length (cm)") + ylab("")+
    facet_wrap(~Species)

ggplot(iris_tib,aes(x=Petal.Length)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
    ggtitle("The Histogram of Petal Length (cm)") + 
    xlab("Petal Length (cm)") + ylab("")+
    facet_wrap(~Species)

ggplot(iris_tib,aes(x=Petal.Width)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
    ggtitle("The Histogram of Petal Width (cm)") + 
    xlab("Petal Width (cm)") + ylab("")+
    facet_wrap(~Species)

A good way to see whether the measurements discriminate between the species is to plot petal length against petal width as a scatterplot coloured by species. You can do the same for sepal length and width, but there the discrimination between species is poor.

ggplot(iris_tib,aes(x=Petal.Length,y=Petal.Width,color=Species))+
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")

ggplot(iris_tib,aes(x=Sepal.Length,y=Sepal.Width,color=Species))+
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")

When Fisher completed his analysis he admitted that there was overlap between the distributions of I. virginica and I. versicolor, and that unambiguous assignment of these two species was not possible using just these four measurements. He felt, however, that the measurements should be sufficient for cultivated plants, where there is less variation than in the wild. Anderson’s ideographs were much less ambiguous and the three species are very clearly distinguished, which reflects Anderson’s interest in the biology and Fisher’s in the mathematical perspective.

Decision Trees

My first thought for assigning plants to the different groups is a decision tree. This should capture most of the key features of the data. Decision trees are available in R through the rpart library (Therneau and Atkinson 2023).

library("rpart")
library("rpart.plot")

# fit a classification tree predicting Species from all four measurements
dt1 <- rpart(Species ~ .,data=iris_tib)
printcp(dt1)

Classification tree:
rpart(formula = Species ~ ., data = iris_tib)

Variables actually used in tree construction:
[1] Petal.Length Petal.Width 

Root node error: 100/150 = 0.66667

n= 150 

    CP nsplit rel error xerror     xstd
1 0.50      0      1.00   1.14 0.052307
2 0.44      1      0.50   0.61 0.060161
3 0.01      2      0.06   0.11 0.031927
rpart.plot(dt1)

As expected, the decision tree focuses on the petal lengths and widths and the sepal data does not contribute. The first split separates I. setosa from the other two species and the second split separates I. versicolor from I. virginica. There is the slight issue of overlap that Fisher reported, and there are a small number of incorrect classifications in the model.
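Those misclassifications can be counted by tabulating the tree’s fitted classes against the true species; a small sketch using base R’s table(), not part of the original analysis:

# compare the tree's in-sample predictions with the true labels
table(predicted = predict(dt1, type = "class"),
      actual = iris_tib$Species)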

Evaluating the model on the same data used to fit it gives an optimistically biased estimate of the error you would see on a new set of data, so you should use some sort of cross-validation technique to divide the data into training and testing datasets.
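As a minimal sketch of why this matters, the resubstitution accuracy of the tree above can be computed directly; it will generally overstate the performance on unseen data:

# in-sample (resubstitution) accuracy: an optimistic estimate
mean(predict(dt1, type = "class") == iris_tib$Species)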

Cross Validation

To carry out cross-validation I am going to use the caret library. The dataset is split into 20% for testing and 80% for training. Within the training data, 10-fold cross-validation is carried out and repeated 3 times. The final decision tree is the same as before, with a cross-validated accuracy of about 90%.

library("caret")
Loading required package: lattice
# hold out 20% of the rows for testing, stratified by species
trainIndex <- createDataPartition(iris_tib$Species, p = .8, 
                                  list = FALSE, 
                                  times = 1)
irisTrain <- iris_tib[trainIndex,]
irisTest <- iris_tib[-trainIndex,]

# 10-fold cross-validation repeated 3 times within the training data
trctrl <- trainControl(method="repeatedcv", number =10, repeats = 3)

# tune the rpart complexity parameter over 10 candidate values
cv_dt1 <- train(Species ~ .,
             method = "rpart", data = irisTrain,
             trControl=trctrl,
             tuneLength = 10)

cv_dt1
CART 

120 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.00000000  0.8916667  0.8375000
  0.05555556  0.8972222  0.8458333
  0.11111111  0.8972222  0.8458333
  0.16666667  0.8972222  0.8458333
  0.22222222  0.8972222  0.8458333
  0.27777778  0.8972222  0.8458333
  0.33333333  0.8972222  0.8458333
  0.38888889  0.8972222  0.8458333
  0.44444444  0.6694444  0.5041667
  0.50000000  0.3333333  0.0000000

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.3888889.
rpart.plot(cv_dt1$finalModel)

test_pred_info <- predict(cv_dt1, newdata=irisTest)
confusionMatrix(test_pred_info,irisTest$Species)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          0        10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 4.857e-15  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000

Final Thoughts

This is a rather simple case. There are a few things that stand out.

  1. Fisher’s original model used too many variables. Petal size is critical; everything else is largely irrelevant (see the variable-importance sketch after this list).

  2. As he stated, his prediction used these four measurements, but others might have been better (we now know that they probably aren’t).
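One quick way to see the dominance of the petal measurements in the fitted model is caret’s variable importance summary; a small sketch, not part of the original analysis:

# variable importance for the cross-validated tree:
# the petal measurements dominate the importance scores
varImp(cv_dt1)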

The key point for anyone working in machine learning and artificial intelligence is:

KNOW YOUR DATA

Fisher could have asked the experts on iris classification how they classified irises into the different species and then collected that data. He did capture most of the important variables, but that was partly luck; a more holistic classification could have involved colour, location and so on. Holistic approaches get a bad reputation, but this is the way humans think, rather than machines. Try writing down how you recognise a cat to be a cat and a lion to be a lion. You aren’t going to write anything about distances between eyes and muzzle, but a machine learning algorithm for identifying them is going to make that sort of measurement.

A good simple model for decision trees is the game Guess Who. Each question you ask partitions the remaining candidates. You start with the most obvious feature, their apparent gender.

In more advanced problems you don’t know what the significant characteristics are unless you ask the experts. I am a contributor to iNaturalist and the expert identifiers of species there are amazing. If you don’t identify the significant features you can end up trying to fit a model that is missing the essential variables, which is doomed to be a poor model or, at best, an overly complex one.

Blind use of artificial intelligence produces stupid answers. For example, a professor of AI once came to tell my biology students that an interesting pattern he had found at the end of DNA sequences was NNNNNN. This indicates a region of bad sequence reads, where we know there is a nucleotide but we cannot identify it; the letter N is the wild card for sequences, not a pattern. My students were shocked that a professor could make such a stupid mistake. Unfortunately this is all too common in the field; of all the professors of artificial intelligence I have met, artificial often seems a particularly appropriate word.

References

Anderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.” Bull. Am. Iris Soc. 59: 2–5.
———. 1936. “The Species Problem in Iris.” Annals of the Missouri Botanical Garden 23 (3): 457. https://doi.org/10.2307/2394164.
Bezdek, J. C., J. M. Keller, R. Krishnapuram, L. I. Kuncheva, and N. R. Pal. 1999. “Will the Real Iris Data Please Stand Up?” IEEE Transactions on Fuzzy Systems 7 (3): 368–69. https://doi.org/10.1109/91.771092.
Fisher, R. A. 1936. “The Use Of Multiple Measurements In Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Therneau, Terry, and Beth Atkinson. 2023. “Rpart: Recursive Partitioning and Regression Trees.” https://CRAN.R-project.org/package=rpart.
Wickham, Hadley. 2016. “Ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2.tidyverse.org.