Clustering the Anderson Iris Data

Clustering is one of the simplest methods in machine learning. Because this data is already labelled, we can draw an ellipse around each species and see whether the species form easily distinguished clusters. The Fisher/Anderson 1936 dataset contains three species, and setosa is clearly distinct from the other two.

library("ggplot2")
library("ggforce")  # provides geom_mark_ellipse()

ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) +
  geom_mark_ellipse(aes(fill = Species)) +
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
  geom_mark_ellipse(aes(fill = Species)) +
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")

In the petal measurements, setosa forms a clearly distinct group. In the sepal measurements there is too much overlap to distinguish the three groups clearly along those two axes.
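An ellipse that has to enclose every point of a species is stretched by a single outlying flower, which can exaggerate the apparent overlap. As a rough cross-check, the sketch below draws 95% normal-theory ellipses with `stat_ellipse()` from ggplot2 instead. This is a swapped-in technique rather than the method used for the plots above, and the 95% level and the name `p_norm` are choices made here for illustration.

```r
library("ggplot2")

# 95% ellipses based on a bivariate normal fit to each species.
# Unlike an enclosing ellipse, these may leave a few points outside.
p_norm <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  stat_ellipse(type = "norm", level = 0.95) +
  geom_point() +
  ggtitle("Petal Length Against Width (cm), 95% Normal Ellipses")

p_norm
```

If the species still overlap with these tighter ellipses, the overlap is less likely to be caused by a few extreme points.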

Virginica and versicolor show some overlap in the petal measurements and complete overlap in the sepal measurements.
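The ellipses use the species labels, so they show how well the known groups separate rather than whether an algorithm would find them unaided. A minimal sketch of the unsupervised version is to run k-means on the petal measurements and cross-tabulate the clusters against the labels. The choice of k-means, three centres, the seed, and the names `petal`, `km` and `cluster_table` are assumptions made for this illustration.

```r
# Cluster on the two petal measurements only, ignoring the species labels
set.seed(42)  # k-means starts from random centres; fix the seed for repeatability
petal <- iris[, c("Petal.Length", "Petal.Width")]
km <- kmeans(petal, centers = 3, nstart = 25)

# Rows are the true species, columns are the clusters k-means found
cluster_table <- table(Species = iris$Species, Cluster = km$cluster)
cluster_table
```

A species that sits entirely in one column has been recovered as its own cluster. Counts spread across two columns show where virginica and versicolor are being confused.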

We also have an alternative set of data for virginica and versicolor, Anderson's 1928 measurements, which lets us compare the degree of clustering.

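library("dplyr")  # re-exports as_tibble()
# Anderson's 1928 measurements; the CSV is expected in the working directory
df <- read.csv("Andersons_Irises_1928.csv")
iris_tib <- as_tibble(df)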

ggplot(iris_tib, aes(x=Petal.Length, y=Petal.Width, color=Species)) +
  geom_mark_ellipse(aes(fill = Species)) +
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")

ggplot(iris_tib, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
  geom_mark_ellipse(aes(fill = Species)) +
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")

The Anderson 1928 data shows an even larger degree of overlap in both the petal and sepal measurements. This suggests that the Fisher/Anderson 1936 dataset was particularly well differentiated, and that the real prediction error of models trained on the Fisher data is likely to be higher than originally estimated.
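One way to put a number on this claim is to fit a classifier to the Fisher data and score it on the 1928 data. The sketch below uses linear discriminant analysis from the MASS package as an example model. The helper `lda_error` is hypothetical, and the commented-out call assumes that `iris_tib` uses the same column names and the same species labels as `iris`, which has not been checked here.

```r
# Hypothetical helper: misclassification rate of an LDA model that is
# fitted on `train` and evaluated on `test`. Both data frames need a
# Species column plus the four measurement columns.
lda_error <- function(train, test) {
  train$Species <- droplevels(factor(train$Species))
  fit <- MASS::lda(
    Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
    data = train
  )
  pred <- predict(fit, newdata = test)$class
  mean(as.character(pred) != as.character(test$Species))
}

# The 1928 data covers only virginica and versicolor, so drop setosa
fisher_two <- subset(iris, Species != "setosa")

# Error when the model is scored on the data it was trained on
resub_error <- lda_error(fisher_two, fisher_two)
resub_error

# Error on the independent 1928 data (assumes matching columns and labels)
# external_error <- lda_error(fisher_two, iris_tib)
```

If `external_error` comes out well above `resub_error`, that would support the suggestion that error rates estimated from the Fisher data alone are optimistic.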