Anderson Iris Data 1928 - Graphical Display

The first step of any investigation using data is exploratory data analysis. In this case you want to see the relationships between the petal lengths and widths and between the sepal lengths and widths. As the data were collected over a number of years, you can check for any time-dependent effect. They were also collected at multiple sites, so you can look at location-dependent effects as well.

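# Load plyr before dplyr, as dplyr recommends when both are used
library("ggplot2")
library("plyr")
library("dplyr")
# Assumes the CSV file is in the current working directory
df <- read.csv("Andersons_Irises_1928.csv")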

mu1 <- ddply(df, "Species", summarise, grp.mean=mean(Petal.Length))

ggplot(df, aes(x=Petal.Length, color=Species)) +
  geom_histogram(fill="white", position="dodge", binwidth=0.4)+
  geom_vline(data=mu1, aes(xintercept=grp.mean, color=Species),
             linetype="dashed")+
  theme(legend.position="top")

mu2 <- ddply(df, "Species", summarise, grp.mean=mean(Petal.Width))

ggplot(df, aes(x=Petal.Width, color=Species)) +
  geom_histogram(fill="white", position="dodge", binwidth=0.4)+
  geom_vline(data=mu2, aes(xintercept=grp.mean, color=Species),
             linetype="dashed")+
  theme(legend.position="top")

mu3 <- ddply(df, "Species", summarise, grp.mean=mean(Sepal.Length))

ggplot(df, aes(x=Sepal.Length, color=Species)) +
  geom_histogram(fill="white", position="dodge", binwidth=0.4)+
  geom_vline(data=mu3, aes(xintercept=grp.mean, color=Species),
             linetype="dashed")+
  theme(legend.position="top")

mu4 <- ddply(df, "Species", summarise, grp.mean=mean(Sepal.Width))

ggplot(df, aes(x=Sepal.Width, color=Species)) +
  geom_histogram(fill="white", position="dodge", binwidth=0.4)+
  geom_vline(data=mu4, aes(xintercept=grp.mean, color=Species),
             linetype="dashed")+
  theme(legend.position="top")

The first plots, drawn directly from the data frame, show how the mean petal and sepal measurements differ between the two species. The species differ more in their petal measurements than in their sepal measurements.
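
The four histogram blocks above differ only in the column being plotted, so they could be generated by a single function. The sketch below is one way to do that. `species_means()` and `hist_by_species()` are hypothetical helper names, and R's built-in `iris` data set (three species, same measurement column names) stands in for `Andersons_Irises_1928.csv` so that the code runs on its own.

```r
library(ggplot2)

# Hypothetical helper: mean of one measurement column per species,
# computed with base R so that plyr is not required.
species_means <- function(data, var) {
  mu <- aggregate(data[[var]], by = list(Species = data$Species), FUN = mean)
  names(mu)[2] <- "grp.mean"
  mu
}

# Hypothetical helper: dodged histogram of `var` (a column name given as a
# string) with a dashed vertical line at each species mean.
hist_by_species <- function(data, var, binwidth = 0.4) {
  ggplot(data, aes(x = .data[[var]], color = Species)) +
    geom_histogram(fill = "white", position = "dodge", binwidth = binwidth) +
    geom_vline(data = species_means(data, var),
               aes(xintercept = grp.mean, color = Species),
               linetype = "dashed") +
    xlab(var) +
    theme(legend.position = "top")
}

# Stand-in data (assumption): built-in iris, not the 1928 file.
measurements <- c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")
plots <- lapply(measurements, function(v) hist_by_species(iris, v))
names(plots) <- measurements

plots[["Petal.Length"]]  # draw one of the four plots
```

With the real data, `hist_by_species(df, "Petal.Length")` would take the place of the first block, and the other three follow by changing the column name.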

The data frame is then converted to a tibble for the remaining plots. ggplot2 accepts either, but a tibble prints more compactly and fits the rest of the tidyverse workflow.

Histograms summarise the distributions of the petal lengths and widths. A scatterplot shows the relationship between petal length and width, and colouring the points by species shows whether that relationship differs between species. This is the graphical counterpart of an ANCOVA.
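
The coloured scatterplot has a direct numerical counterpart. As a minimal sketch of the ANCOVA it corresponds to, the code below regresses petal width on petal length with species as a factor, once with a common slope and once with a separate slope per species. R's built-in `iris` data set is used as a stand-in (an assumption, so the numbers are not those of the 1928 file). With the real data you would pass `df` instead.

```r
library(ggplot2)

# Stand-in data (assumption): built-in iris rather than the 1928 CSV.
dat <- iris

# ANCOVA with interaction: separate intercept and slope for each species.
fit_full <- lm(Petal.Width ~ Petal.Length * Species, data = dat)

# Simpler model: common slope, species-specific intercepts.
fit_common <- lm(Petal.Width ~ Petal.Length + Species, data = dat)

# F-test of whether species-specific slopes improve on a common slope.
print(anova(fit_common, fit_full))

# The interaction model drawn on the scatterplot: one fitted line per species.
ggplot(dat, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```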

The dataset also records the state in which each measurement was taken and the year. State is a proxy for species because different species occur in different locations. Measurements may also differ between years, depending on environmental conditions during plant growth.
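
Before plotting, a cross-tabulation shows directly how the species are distributed over states and years. In this sketch `toy` is a small made-up data frame with placeholder values (not Anderson's records or locations). Only the column names `Species`, `State` and `Year` are taken from the code in this document.

```r
# Made-up placeholder rows (assumption): replace `toy` with `df` for real use.
toy <- data.frame(
  Species = c("SpeciesA", "SpeciesA", "SpeciesA", "SpeciesB", "SpeciesB"),
  State   = c("StateA", "StateA", "StateB", "StateC", "StateC"),
  Year    = c(1925, 1926, 1926, 1925, 1926)
)

# Rows = state, columns = species. A state whose row has a single
# non-zero entry contains only one species.
by_state <- table(toy$State, toy$Species)
print(by_state)

# Number of distinct species recorded in each state.
species_per_state <- rowSums(by_state > 0)
print(species_per_state)

# Same idea for years: number of records of each species per year.
by_year <- table(toy$Year, toy$Species)
print(by_year)
```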

iris_tib <- as_tibble(df)
ggplot(iris_tib,aes(x=Petal.Length)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
  ggtitle("The Histogram of Petal Length (cm)") + 
  xlab("Petal Length (cm)") + ylab("")

ggplot(iris_tib,aes(x=Petal.Width)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
  ggtitle("The Histogram of Petal Width (cm)") + 
  xlab("Petal Width (cm)") + ylab("")

ggplot(iris_tib,aes(x=Petal.Length,y=Petal.Width,color=Species))+
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")

ggplot(iris_tib,aes(x=Petal.Length,y=Petal.Width,color=State))+
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")

ggplot(iris_tib,aes(x=Petal.Length,y=Petal.Width,color=Year))+
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")

It is possible to create side by side comparisons of the histograms for the different species rather than the overlapping histograms from the first section.

You can do the same for the scatterplots, with one panel per species and state as the colour. Alternatively, you can facet by state and create one plot per state, which makes it clear that most states have only a single species.

Finally, you can look at the effect of year on the scatterplots of the two species to see how the distributions change over time.

ggplot(iris_tib,aes(x=Petal.Length)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
  ggtitle("The Histogram of Petal Length (cm)") + 
  xlab("Petal Length (cm)") + ylab("")+
  facet_wrap(~Species)

ggplot(iris_tib,aes(x=Petal.Width)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
  ggtitle("The Histogram of Petal Width (cm)") + 
  xlab("Petal Width (cm)") + ylab("")+
  facet_wrap(~Species)

ggplot(iris_tib,aes(x=Petal.Length,y=Petal.Width,color=State))+
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")+
  facet_wrap(~Species)

ggplot(iris_tib,aes(x=Petal.Length,y=Petal.Width,color=Species))+
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")+
  facet_wrap(~State)

ggplot(iris_tib,aes(x=Petal.Length,y=Petal.Width,color=Species))+
  geom_point()+
  ggtitle("The Scatterplot of Petal Length Against Width (cm)")+
  facet_wrap(~Year)

Once you have the complete set of plots for the petal data you can repeat the process with the sepal data.
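
Because the sepal plots repeat the petal plots with different column names, the repetition can be folded into functions that take the flower part as an argument. This is a sketch only: `density_hist()` and `part_scatter()` are hypothetical names, and the built-in `iris` data set stands in for the 1928 file, so only the species colouring and facet are shown.

```r
library(ggplot2)

# Hypothetical helper: density-scaled histogram with a density curve for
# one column, e.g. part = "Sepal", dim = "Length" -> column "Sepal.Length".
density_hist <- function(data, part, dim) {
  var   <- paste(part, dim, sep = ".")
  label <- paste(part, dim, "(cm)")
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 0.2,
                   color = "black", fill = "skyblue") +
    geom_density(color = "purple") +
    ggtitle(paste("The Histogram of", label)) +
    xlab(label) + ylab("")
}

# Hypothetical helper: length-against-width scatterplot for one flower part,
# coloured by the column named in `colour_by`.
part_scatter <- function(data, part, colour_by = "Species") {
  ggplot(data, aes(x = .data[[paste0(part, ".Length")]],
                   y = .data[[paste0(part, ".Width")]],
                   color = .data[[colour_by]])) +
    geom_point() +
    labs(x = paste(part, "Length (cm)"), y = paste(part, "Width (cm)"),
         color = colour_by) +
    ggtitle(paste("The Scatterplot of", part, "Length Against Width (cm)"))
}

# Stand-in data (assumption): built-in iris.
sepal_plots <- list(
  length_hist = density_hist(iris, "Sepal", "Length"),
  width_hist  = density_hist(iris, "Sepal", "Width") + facet_wrap(~Species),
  scatter     = part_scatter(iris, "Sepal")
)
sepal_plots$scatter
```

Calling the same functions with `"Petal"` reproduces the petal plots, and adding `facet_wrap()` to any returned plot gives the faceted versions.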

ggplot(iris_tib,aes(x=Sepal.Length)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
  ggtitle("The Histogram of Sepal Length (cm)") + 
  xlab("Sepal Length (cm)") + ylab("")

ggplot(iris_tib,aes(x=Sepal.Width)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
  ggtitle("The Histogram of Sepal Width (cm)") + 
  xlab("Sepal Width (cm)") + ylab("")

ggplot(iris_tib,aes(x=Sepal.Length,y=Sepal.Width,color=Species))+
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")

ggplot(iris_tib,aes(x=Sepal.Length,y=Sepal.Width,color=State))+
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")

ggplot(iris_tib,aes(x=Sepal.Length,y=Sepal.Width,color=Year))+
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")

ggplot(iris_tib,aes(x=Sepal.Length)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
  ggtitle("The Histogram of Sepal Length (cm)") + 
  xlab("Sepal Length (cm)") + ylab("")+
  facet_wrap(~Species)

ggplot(iris_tib,aes(x=Sepal.Width)) +
  geom_histogram(aes(y=after_stat(density)),binwidth=0.2, color="black", fill="skyblue") +
  geom_density(color="purple") +  
  ggtitle("The Histogram of Sepal Width (cm)") + 
  xlab("Sepal Width (cm)") + ylab("")+
  facet_wrap(~Species)

ggplot(iris_tib,aes(x=Sepal.Length,y=Sepal.Width,color=State))+
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")+
  facet_wrap(~Species)

ggplot(iris_tib,aes(x=Sepal.Length,y=Sepal.Width,color=Species))+
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")+
  facet_wrap(~State)

ggplot(iris_tib,aes(x=Sepal.Length,y=Sepal.Width,color=Species))+
  geom_point()+
  ggtitle("The Scatterplot of Sepal Length Against Width (cm)")+
  facet_wrap(~Year)

There seems to be some effect of year: the 1926 versicolor data are shifted to larger sizes for the sepals and, to a lesser extent, the petals. It would be worth looking at climate data for the 1926 collection. Alternatively, the fieldwork might have been carried out later in the year.
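
The apparent shift can be checked numerically by tabulating the mean of each measurement per species and year. The sketch below uses a tiny made-up data frame, `toy`, with placeholder values (not Anderson's measurements). Only the column names come from the code above. With the real data, replace `toy` with `df`.

```r
# Made-up placeholder values (assumption), two records per species-year cell.
toy <- data.frame(
  Species      = rep(c("SpeciesA", "SpeciesB"), each = 4),
  Year         = rep(c(1925, 1925, 1926, 1926), times = 2),
  Sepal.Length = c(5.0, 5.2, 5.6, 5.8, 6.0, 6.2, 6.1, 6.1),
  Sepal.Width  = c(2.8, 3.0, 3.1, 3.3, 3.0, 3.0, 2.9, 3.1)
)

# Mean of each sepal measurement for every species-year combination.
yearly_means <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species + Year,
                          data = toy, FUN = mean)
print(yearly_means)

# Year-to-year change in mean sepal length within each species.
wide <- reshape(yearly_means[, c("Species", "Year", "Sepal.Length")],
                idvar = "Species", timevar = "Year", direction = "wide")
wide$shift <- wide$Sepal.Length.1926 - wide$Sepal.Length.1925
print(wide)
```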