The aim of this project is to explore the famous Iris data set (known as Fisher’s or Anderson’s Iris data) and to use the results for clustering and classification, which will be my next project. The Iris data set gives the measurements, in centimeters, of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
The main package used for data exploration is tidyverse, developed by Hadley Wickham.
library(tidyverse)
library(stringr)
library(DT)
library(grid)
library(gridExtra)
library(corrplot)
We’ll also use the multiplot function from the Cookbook for R to arrange several plots on one page.
# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots <- length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
data1 <- iris
dim_iris <- dim(data1)
The Iris data set ships with R by default. It has 150 observations and 5 variables.
Variables | Description |
---|---|
Sepal.Length | Length of Sepal in cm |
Sepal.Width | Width of Sepal in cm |
Petal.Length | Length of Petal in cm |
Petal.Width | Width of Petal in cm |
Species | Iris species - Iris setosa, versicolor, virginica |
The Species variable is categorical and is stored as a factor; the other four variables are numeric.
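The column types can be confirmed directly. This is a minimal sketch in base R on the built-in `iris` data; the name `col_classes` is my own and not part of the original analysis.

```r
# Inspect the class of every column in the built-in iris data
col_classes <- sapply(iris, class)
print(col_classes)

# Species is a factor; list its levels
levels(iris$Species)
```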
The first thing I do with any data set is check whether there are any missing values.
missing_count <- sum(is.na(data1))
str_c("There are ", missing_count, " missing values")
## [1] "There are 0 missing values"
Great! There are no missing values.
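A single total can hide which column the gaps are in when a data set is less tidy than this one. A per-column count might look like the sketch below, in base R; `missing_by_col` is an illustrative name.

```r
data1 <- iris

# Count missing values separately for each column
missing_by_col <- colSums(is.na(data1))
print(missing_by_col)

# Quick yes/no check for the whole data frame
anyNA(data1)
```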
h1 <- data1 %>%
ggplot(aes(Sepal.Length))+
geom_histogram(aes(fill = Species), binwidth =0.2, col = "black")+
geom_vline(aes(xintercept = mean(Sepal.Length)), linetype = "dashed", color = "black")+
labs(x = "Sepal Length (cm)", y = "Frequency")+
theme(legend.position = "none")
h2 <- data1 %>%
ggplot(aes(Sepal.Width))+
geom_histogram(aes(fill = Species), binwidth =0.2, col = "black")+
geom_vline(aes(xintercept = mean(Sepal.Width)), linetype = "dashed", color = "black")+
labs(x = "Sepal Width (cm)", y = "Frequency")+
theme(legend.position = "none")
h3 <- data1 %>%
ggplot(aes(Petal.Length))+
geom_histogram(aes(fill = Species), binwidth =0.2, col = "black")+
geom_vline(aes(xintercept = mean(Petal.Length)), linetype = "dashed", color = "black")+
labs(x = "Petal Length (cm)", y = "Frequency")+
theme(legend.position = "none")
h4 <- data1 %>%
ggplot(aes(Petal.Width))+
geom_histogram(aes(fill = Species), binwidth =0.2, col = "black")+
geom_vline(aes(xintercept = mean(Petal.Width)), linetype = "dashed", color = "black")+
labs(x = "Petal Width (cm)", y = "Frequency")+
theme(legend.position = "right")
grid.arrange(h1,h2,h3,h4, nrow=2, top = textGrob("Iris Histogram"))
Because petal length and width are skewed to the left, I would be cautious about using their means. The Setosa petal lengths and widths are concentrated on the far left, well apart from the other species, which is very interesting!
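One way to put a number on that skew is to compare the mean with the median and compute a moment-based sample skewness. The sketch below uses base R only; the helper `skewness()` and the data frame `petal_summary` are my own illustrative names, and other skewness definitions (for example those in add-on packages) differ slightly in their small-sample correction.

```r
# Moment-based sample skewness: negative values indicate a longer left tail
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

petal_summary <- data.frame(
  variable = c("Petal.Length", "Petal.Width"),
  mean     = c(mean(iris$Petal.Length), mean(iris$Petal.Width)),
  median   = c(median(iris$Petal.Length), median(iris$Petal.Width)),
  skewness = c(skewness(iris$Petal.Length), skewness(iris$Petal.Width))
)
print(petal_summary)
```

A mean that sits below the median, together with a negative skewness, is consistent with the left skew seen in the histograms.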
v1 <- data1 %>%
ggplot(aes(Species, Sepal.Length))+
geom_violin(aes(fill = Species))+
geom_boxplot(width = 0.1)+
scale_y_continuous("Sepal Length", breaks = seq(0, 10, by = .5))+
theme(legend.position = "none")
v2 <- data1 %>%
ggplot(aes(Species, Sepal.Width))+
geom_violin(aes(fill = Species))+
geom_boxplot(width = 0.1)+
scale_y_continuous("Sepal Width", breaks = seq(0, 10, by = .5))+
theme(legend.position = "none")
v3 <- data1 %>%
ggplot(aes(Species, Petal.Length))+
geom_violin(aes(fill = Species))+
geom_boxplot(width = 0.1)+
scale_y_continuous("Petal Length", breaks = seq(0, 10, by = .5))+
theme(legend.position = "none")
v4 <- data1 %>%
ggplot(aes(Species, Petal.Width))+
geom_violin(aes(fill = Species))+
geom_boxplot(width = 0.1)+
scale_y_continuous("Petal Width", breaks = seq(0, 10, by = .5))+
theme(legend.position = "right")
grid.arrange(v1,v2,v3,v4, nrow = 2, top = textGrob("Box plot of Iris species"))
Although the plots above are clean, I prefer the plot below because it puts all four measurements for each species on a single shared scale.
gather(data1, Var, value, -Species) %>%
ggplot(aes(Var, value))+
geom_violin(aes(fill = Species))+
facet_grid(~Species)+
theme(axis.text.x = element_text(angle = 90, vjust = .5))+
labs(x = "Measurements", y = "Length in cm", title = "Violin Boxplot of Species")+
geom_boxplot(width=0.1)
The violin boxplot above shows that Iris Virginica has the highest median petal length, petal width and sepal length when compared with Versicolor and Setosa. However, Iris Setosa has the highest median sepal width. We can also see a large difference between Setosa’s sepal length and width and its petal length and width. That difference is smaller in Versicolor and Virginica. The violin plot also indicates that the Virginica sepal width and petal width values are highly concentrated around the median.
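The medians read off the plot can be checked numerically. This is a small sketch in base R; `species_medians` and `top_species` are names I introduce here for illustration.

```r
# Median of each measurement within each species
species_medians <- aggregate(. ~ Species, data = iris, FUN = median)
print(species_medians)

# For each measurement, the species with the largest median
top_species <- sapply(species_medians[-1], function(col) {
  as.character(species_medians$Species[which.max(col)])
})
print(top_species)
```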
Avg_Iris <- data1 %>%
group_by(Species) %>%
summarise(Avg_sepal_length = mean(Sepal.Length), Avg_sepal_width = mean(Sepal.Width), Avg_petal_length = mean(Petal.Length),
Avg_petal_width = mean(Petal.Width))
Sd_Iris <- data1 %>%
group_by(Species) %>%
summarise(Sd_sepal_length = sd(Sepal.Length), Sd_sepal_width = sd(Sepal.Width), Sd_petal_length = sd(Petal.Length),
Sd_petal_width = sd(Petal.Width))
datatable(Avg_Iris, caption = "Average measurement of all Species") %>%
formatRound(2:5,digits = 2)
datatable(Sd_Iris, caption = "Standard Deviation of all Species") %>%
formatRound(2:5,digits = 2)
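The two summaries above could also be built in a single pass. The sketch below assumes dplyr 1.0.0 or later, where `across()` is available; `summary_iris` is an illustrative name, and the resulting column names follow `across()`'s default `{column}_{function}` pattern rather than the names used above.

```r
library(dplyr)

# Mean and standard deviation of every measurement, per species, in one table
summary_iris <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), list(avg = mean, sd = sd)))
print(summary_iris)
```

Keeping both statistics in one table makes it easier to compare spread against the average for the same measurement, at the cost of a wider table.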
s1 <- data1 %>%
ggplot(aes(Sepal.Length, Sepal.Width))+
geom_point(aes(col = Species))+
theme(legend.position = "none")
s2 <- data1 %>%
ggplot(aes(Petal.Length, Petal.Width))+
geom_point(aes(col = Species))+
theme(legend.position = "none")
s3 <- data1 %>%
ggplot(aes(Species))+
geom_bar(aes(fill = Species))+
theme(legend.position = "top")
layout <- matrix(c(1,2, 3, 3 ), 2, 2, byrow = T)
multiplot(s1, s2, s3, layout = layout)
We can see from the plots above that the petal measurements separate the species into visible clusters, much more clearly than the sepal measurements do.
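The separation in the petal measurements can be checked with per-species ranges. The sketch below uses base R; the cut-off of 2.5 cm is a hypothetical threshold chosen by eye from the scatter plot, not a fitted model, and the variable names are my own.

```r
# Minimum and maximum petal length for each species
petal_length_range <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(petal_length_range) <- c("min", "max")
print(petal_length_range)

# Hypothetical cut-off chosen by eye: does it isolate setosa?
is_setosa_guess <- iris$Petal.Length < 2.5
print(table(guess = is_setosa_guess, actual = iris$Species))
```

If the setosa maximum falls below the minimum of the other two species, a single threshold on petal length is enough to pick out setosa, which is what the scatter plot suggests.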
cor_iris <- data1 %>%
select(-Species) %>%
cor()
p.mat <- data1 %>%
select(-Species) %>%
cor.mtest() %>%
.$p
corrplot(cor_iris, type = "upper", method = "number", diag = F)
The correlation plot shows that petal length and petal width are highly correlated (r ≈ 0.96).
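Because the species form separate clusters, it may be worth checking how much of this pooled correlation survives within each species. The sketch below uses base R's `cor.test()` and `cor()`; `pooled` and `within_species` are illustrative names.

```r
# Pooled correlation across all 150 flowers, with its p-value
pooled <- cor.test(iris$Petal.Length, iris$Petal.Width)
print(pooled$estimate)
print(pooled$p.value)

# The same correlation computed separately within each species
within_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Petal.Length, d$Petal.Width)
})
print(within_species)
```

Comparing the two helps judge whether the strong pooled correlation reflects a relationship inside each species or mostly the differences between species.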
=================================================== Junk commands
# data1 %>%
# gather(Var, value, -Species) %>%
# ggplot(aes(value))+
# geom_density(aes(fill = Species))+
# facet_grid(Species~Var)