1 Introduction

The aim of this project is to explore the famous Iris data set (Fisher's or Anderson's) and to use the results for clustering and classification, which will be my next project. The data set gives the measurements in centimeters of sepal length, sepal width, petal length, and petal width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.

1.1 Load libraries and helper function

The main package used for data exploration is tidyverse, developed by Hadley Wickham.

library(tidyverse)
library(stringr)
library(DT)
library(grid)
library(gridExtra)
library(corrplot)

We’ll use the multiplot function from the Cookbook for R.

# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

1.2 Load data set

data1 <- iris

dim_iris <- dim(data1)

The iris data set ships with R by default. It has 150 observations and 5 variables.

1.3 File structure and content

Variables	Description
Sepal.Length	Length of Sepal in cm
Sepal.Width	Width of Sepal in cm
Petal.Length	Length of Petal in cm
Petal.Width	Width of Petal in cm
Species	Iris species - Iris setosa, versicolor, virginica

Since the Species variable is categorical, it is stored as a factor; the rest of the variables are numeric.
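As a quick check of the variable types described above, the following sketch lists the class of each column and the levels of Species. It uses only base R and assumes the built-in iris data frame; the name col_classes is introduced here for illustration.

```r
# Assumes the built-in iris data frame, as loaded in section 1.2
data1 <- iris

# Class of every column: four measurement columns and one categorical column
col_classes <- sapply(data1, class)
print(col_classes)

# Levels of the categorical variable
print(levels(data1$Species))
```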

2 Missing values

The first thing I do with any data set is check whether there are any missing values.

missing_count <- sum(is.na(data1))

str_c("There are ", missing_count, " missing values")

## [1] "There are 0 missing values"

Great! There are no missing values.
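The total count does not say which column a missing value would come from. As a small sketch, assuming the same built-in iris data, the check can be broken down per column and per row with base R; missing_by_column and complete_rows are names chosen here for illustration.

```r
# Assumes the built-in iris data frame, as loaded in section 1.2
data1 <- iris

# Number of missing values in each column
missing_by_column <- colSums(is.na(data1))
print(missing_by_column)

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(data1))
print(complete_rows)
```

On a data set that does have gaps, the per-column counts show where imputation or removal would be needed.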

3 Data exploration

3.1 Descriptive statistics

h1 <- data1 %>% 
  ggplot(aes(Sepal.Length))+
  geom_histogram(aes(fill = Species), binwidth =0.2, col = "black")+
  geom_vline(aes(xintercept = mean(Sepal.Length)), linetype = "dashed", color = "black")+
  labs(x = "Sepal Length (cm)", y = "Frequency")+
  theme(legend.position = "none")

h2 <- data1 %>% 
  ggplot(aes(Sepal.Width))+
  geom_histogram(aes(fill = Species), binwidth =0.2, col = "black")+
  geom_vline(aes(xintercept = mean(Sepal.Width)), linetype = "dashed", color = "black")+
  labs(x = "Sepal Width (cm)", y = "Frequency")+
  theme(legend.position = "none")

h3 <- data1 %>% 
  ggplot(aes(Petal.Length))+
  geom_histogram(aes(fill = Species), binwidth =0.2, col = "black")+
  geom_vline(aes(xintercept = mean(Petal.Length)), linetype = "dashed", color = "black")+
  labs(x = "Petal Length (cm)", y = "Frequency")+
  theme(legend.position = "none")

h4 <- data1 %>% 
  ggplot(aes(Petal.Width))+
  geom_histogram(aes(fill = Species), binwidth =0.2, col = "black")+
  geom_vline(aes(xintercept = mean(Petal.Width)), linetype = "dashed", color = "black")+
  labs(x = "Petal Width (cm)", y = "Frequency")+
  theme(legend.position = "right")

grid.arrange(h1,h2,h3,h4, nrow=2, top = textGrob("Iris Histogram"))

Because the petal length and petal width distributions are skewed to the left, I would be cautious about using their means. The setosa petal lengths and widths are concentrated on the far left, apart from the other species, which is very interesting!
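The skewness read off the histograms can be put into numbers. The sketch below compares the mean and median of each measurement and computes a simple moment-based skewness. The helper sample_skewness is a hypothetical function written for this illustration, not part of any package, and other skewness estimators give slightly different values.

```r
# Assumes the built-in iris data frame, as loaded in section 1.2
data1 <- iris
measurements <- data1[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Hypothetical helper: moment-based skewness (third central moment
# divided by the second central moment raised to the power 1.5)
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Negative values indicate a longer left tail
skew <- sapply(measurements, sample_skewness)
print(round(skew, 2))

# Mean and median side by side; a mean below the median also points to left skew
center <- sapply(measurements, function(x) c(mean = mean(x), median = median(x)))
print(round(center, 2))
```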

v1 <- data1 %>% 
  ggplot(aes(Species, Sepal.Length))+
  geom_violin(aes(fill = Species))+
  geom_boxplot(width = 0.1)+
  scale_y_continuous("Sepal Length", breaks = seq(0, 10, by = .5))+
  theme(legend.position = "none")

v2 <- data1 %>% 
  ggplot(aes(Species, Sepal.Width))+
  geom_violin(aes(fill = Species))+
  geom_boxplot(width = 0.1)+
  scale_y_continuous("Sepal Width", breaks = seq(0, 10, by = .5))+
  theme(legend.position = "none")

v3 <- data1 %>% 
  ggplot(aes(Species, Petal.Length))+
  geom_violin(aes(fill = Species))+
  geom_boxplot(width = 0.1)+
  scale_y_continuous("Petal Length", breaks = seq(0, 10, by = .5))+
  theme(legend.position = "none")

v4 <- data1 %>% 
  ggplot(aes(Species, Petal.Width))+
  geom_violin(aes(fill = Species))+
  geom_boxplot(width = 0.1)+
  scale_y_continuous("Petal Width", breaks = seq(0, 10, by = .5))+
  theme(legend.position = "right")

grid.arrange(v1,v2,v3,v4, nrow = 2, top = textGrob("Box plot of Iris species"))

Although the plots above are clean, I prefer the plot below because it puts all the measurements for each species into a single panel on a common scale.

gather(data1, Var, value, -Species) %>% 
  ggplot(aes(Var, value))+
  geom_violin(aes(fill = Species))+
  facet_grid(~Species)+
  theme(axis.text.x = element_text(angle = 90, vjust = .5))+
  labs(x = "Measurements", y = "Length in cm", title = "Violin Boxplot of Species")+
  geom_boxplot(width=0.1)

The violin box plot of species above shows that Iris virginica has the highest median petal length, petal width, and sepal length compared with versicolor and setosa. However, Iris setosa has the highest median sepal width. We can also see a large difference between setosa’s sepal measurements and its petal measurements; that difference is smaller in versicolor and virginica. The violin plot also indicates that the virginica sepal width and petal width values are highly concentrated around the median.
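The median comparisons read from the plot can be checked directly. This is a minimal sketch using base R's aggregate on the built-in iris data; the name medians is chosen here for illustration.

```r
# Assumes the built-in iris data frame, as loaded in section 1.2
data1 <- iris

# Median of every measurement within each species
medians <- aggregate(. ~ Species, data = data1, FUN = median)
print(medians)

# Species with the highest median for each measurement
top_species <- sapply(medians[, -1], function(x) as.character(medians$Species[which.max(x)]))
print(top_species)
```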

Avg_Iris <- data1 %>% 
  group_by(Species) %>% 
  summarise(Avg_sepal_length = mean(Sepal.Length), Avg_sepal_width = mean(Sepal.Width), Avg_petal_length = mean(Petal.Length),
            Avg_petal_width = mean(Petal.Width))

Sd_Iris <- data1 %>% 
  group_by(Species) %>% 
  summarise(Sd_sepal_length = sd(Sepal.Length), Sd_sepal_width = sd(Sepal.Width), Sd_petal_length = sd(Petal.Length),
            Sd_petal_width = sd(Petal.Width))

datatable(Avg_Iris, caption = "Average measurement of all Species") %>% 
  formatRound(2:5,digits = 2)

datatable(Sd_Iris, caption = "Standard Deviation of all Species") %>% 
  formatRound(2:5,digits = 2)

3.2 Scatter plot

s1 <- data1 %>% 
  ggplot(aes(Sepal.Length, Sepal.Width))+
  geom_point(aes(col = Species))+
  theme(legend.position = "none")

s2 <- data1 %>% 
  ggplot(aes(Petal.Length, Petal.Width))+
  geom_point(aes(col = Species))+
  theme(legend.position = "none")

s3 <- data1 %>% 
  ggplot(aes(Species))+
  geom_bar(aes(fill = Species))+
  theme(legend.position = "top")

layout <- matrix(c(1,2, 3, 3 ), 2, 2, byrow = T)
multiplot(s1, s2, s3, layout = layout)

We can see from the plots above that the petal measurements show clear separation into clusters.
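One rough way to quantify the separation seen in the petal scatter plot is to compare the range of petal length within each species. The sketch below uses base R on the built-in iris data; petal_range is a name introduced here, and comparing ranges is only an informal check, not a clustering method.

```r
# Assumes the built-in iris data frame, as loaded in section 1.2
data1 <- iris

# Minimum and maximum petal length per species (one column per species)
petal_range <- sapply(split(data1$Petal.Length, data1$Species), range)
rownames(petal_range) <- c("min", "max")
print(petal_range)

# TRUE when the setosa range lies entirely below the versicolor range
setosa_separated <- petal_range["max", "setosa"] < petal_range["min", "versicolor"]
print(setosa_separated)

# TRUE when versicolor and virginica ranges overlap
upper_overlap <- petal_range["max", "versicolor"] > petal_range["min", "virginica"]
print(upper_overlap)
```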

3.3 Correlation matrix

cor_iris <- data1 %>% 
  select(-Species) %>% 
  cor()

p.mat <- data1 %>% 
  select(-Species) %>% 
  cor.mtest() %>% 
  .$p

corrplot(cor_iris, type = "upper", method = "number", diag = F)

The correlation plot shows that Petal length and Petal width are highly correlated.
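To back the reading of the correlation plot with a number, the coefficient for the petal pair can be computed on its own and tested with base R's cor.test. This is a sketch on the built-in iris data; r_petal and test_petal are names chosen for illustration.

```r
# Assumes the built-in iris data frame, as loaded in section 1.2
data1 <- iris

# Pearson correlation between petal length and petal width
r_petal <- cor(data1$Petal.Length, data1$Petal.Width)
print(round(r_petal, 3))

# Significance test and confidence interval for the same pair
test_petal <- cor.test(data1$Petal.Length, data1$Petal.Width)
print(test_petal$p.value)
print(test_petal$conf.int)
```

Note that this correlation is computed across all three species together, so part of it reflects differences between species rather than the relationship within a single species.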


Famous Iris data set exploration