About Iris Dataset

The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by Sir Ronald Fisher in his paper in 1936 as an example of linear discriminant analysis. It is also called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula from the same pasture, picked on the same day and measured at the same time by the same person with the same apparatus.

Image credits: https://pixabay.com/en/iris-early-flower-garden-blossom-2392750/

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

Summary Statistics

The Iris dataset contains 150 observations of 5 variables. Sepal length, Sepal width, Petal length and Petal width are quantitative variables giving the lengths and widths of flower parts in cm. Species is a categorical variable with three levels: setosa, versicolor and virginica.

library(class)
library(ggplot2)
library(GGally)
library(ggcorrplot)

data(iris)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
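
As a complement to the pooled summary above, the same statistics can be broken down by species. The following is a minimal sketch using base R's aggregate(); the names raw_iris and species_means are chosen here for illustration, and the data is read from datasets::iris so the sketch always sees the original measurements in cm.

```r
# Work on an untouched copy of the built-in data
raw_iris <- datasets::iris

# Mean of each of the four measurements within each species
species_means <- aggregate(. ~ Species, data = raw_iris, FUN = mean)
print(species_means)
```

The result has one row per species and one column per measurement, which makes differences between species easier to see than in the overall summary.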

Distribution of Variables

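Below are density plots and histograms showing the distribution of the quantitative variables Sepal length, Sepal width, Petal length and Petal width. The variables are first standardized with scale(), so the plots are on a z-score scale rather than in centimeters.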

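# Standardize the four measurements to mean 0 and sd 1 (this overwrites iris)
iris[, 1:4] <- scale(iris[, 1:4])
par(mfrow = c(2, 2))
# One pooled density curve per variable, titled with the variable name
plot(density(iris$Sepal.Length), main = "Sepal.Length")
plot(density(iris$Sepal.Width), main = "Sepal.Width")
plot(density(iris$Petal.Length), main = "Petal.Length")
plot(density(iris$Petal.Width), main = "Petal.Width")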

The density curves for Petal Length and Petal Width each show two peaks, which suggests that the observations fall into at least two groups.
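
One way to probe the two groups suggested by the petal measurements is to compare the range of Petal Length within each species. This is a rough sketch under the assumption that species is the grouping behind the two peaks; raw_iris, petal_range and setosa_separated are illustrative names, and the unscaled data from datasets::iris is used so the values are in cm.

```r
raw_iris <- datasets::iris

# Minimum and maximum petal length for each species (columns = species)
petal_range <- sapply(split(raw_iris$Petal.Length, raw_iris$Species), range)
rownames(petal_range) <- c("min", "max")
print(petal_range)

# TRUE when the longest setosa petal is shorter than the shortest petal
# of both other species, i.e. setosa forms a separate group
setosa_separated <- petal_range["max", "setosa"] <
  min(petal_range["min", c("versicolor", "virginica")])
print(setosa_separated)
```

If setosa_separated is TRUE, the lower peak in the density plot corresponds to setosa alone, while versicolor and virginica together make up the upper peak.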

par(mfrow=c(2,2))
hist(iris$Sepal.Length, col="blue", breaks=20)
hist(iris$Sepal.Width, col="blue", breaks=20)
hist(iris$Petal.Length, col="blue", breaks=20)
hist(iris$Petal.Width, col="blue", breaks=20)

Taken over the Iris dataset as a whole, Petal Length and Petal Width do not follow a normal distribution: both histograms show two separate clusters of values rather than a single bell shape.
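
The visual impression of non-normality can be checked with a formal test. The sketch below applies the Shapiro-Wilk test (shapiro.test() from base R's stats package) to each measurement; the choice of this particular test and the name normality_p are assumptions made here for illustration. Standardizing a variable does not change the test result, so the raw data from datasets::iris is used.

```r
raw_iris <- datasets::iris

# Shapiro-Wilk p-value for each of the four measurements, pooled over species
normality_p <- sapply(raw_iris[, 1:4], function(x) shapiro.test(x)$p.value)
print(signif(normality_p, 3))

# Variables for which normality is rejected at the 5% level
print(names(normality_p)[normality_p < 0.05])
```

A small p-value indicates that the pooled values are unlikely to come from a single normal distribution. It says nothing about whether each species is approximately normal on its own.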

Relationship between Variables

We are interested in whether length and width are related, for sepals and for petals respectively. We use scatter plots for this analysis, with one panel per species and a fitted linear trend.

# Length vs. width per species with a linear fit: sepals first, then petals
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Species)

ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Species)

We observe that, within each species, length and width appear to be positively related for both petals and sepals. We use a pairs plot, which also reports the pairwise correlation coefficients, to check the strength of these relationships.

ggpairs(iris)
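
The correlations can also be computed directly. The following sketch uses base R's cor() on the four measurements, first pooled and then within each species for the sepal variables; overall_cor and sepal_cor_by_species are illustrative names. Correlations are unaffected by standardizing, so datasets::iris is used as is.

```r
raw_iris <- datasets::iris

# Pearson correlation matrix of the four measurements, pooled over species
overall_cor <- cor(raw_iris[, 1:4])
print(round(overall_cor, 2))

# Sepal length vs. sepal width, computed separately within each species
sepal_cor_by_species <- sapply(
  split(raw_iris, raw_iris$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)
print(round(sepal_cor_by_species, 2))
```

Comparing the two outputs is worthwhile. The pooled correlation between sepal length and sepal width is slightly negative even though it is positive within every species, so the pooled matrix alone can hide the relationship seen in the faceted scatter plots. The matrix overall_cor could also be passed to ggcorrplot::ggcorrplot() to draw it as a heat map.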