The Iris flower data set, also known as Fisher’s Iris data set, is a multivariate data set introduced by Sir Ronald Fisher in his 1936 paper as an example of linear discriminant analysis. It is also called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula, all from the same pasture, picked on the same day and measured at the same time by the same person with the same apparatus.
Image credits: https://pixabay.com/en/iris-early-flower-garden-blossom-2392750/
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
The Iris dataset contains 150 observations and 5 variables. Sepal length, Sepal width, Petal length and Petal width are quantitative variables giving the lengths and widths of the flower parts in cm. Species is a categorical variable with three levels: setosa, versicolor and virginica.
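The dimensions and class balance described above can be checked directly in base R. This is a minimal sketch: it reads the copy of the data shipped in R's built-in `datasets` package, and the variable name `raw` is only illustrative.

```r
# Use the copy from the datasets package so the check does not depend on
# any object named `iris` in the current workspace
raw <- datasets::iris

dim(raw)             # number of rows (observations) and columns (variables)
sapply(raw, class)   # storage class of each column
table(raw$Species)   # number of samples per species
```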
library(class)
library(ggplot2)
library(GGally)
library(ggcorrplot)
data(iris)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
Below are the histogram and density plots showing the distribution of the quantitative variables Sepal length, Sepal width, Petal length and Petal width.
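# Standardize the four measurements to mean 0 and standard deviation 1.
# Every plot below uses these scaled values, not the raw centimeters.
iris[, 1:4] <- scale(iris[, 1:4])
par(mfrow = c(2, 2))  # 2 x 2 grid of panels
# Each call draws one pooled density curve across all three species
plot(density(iris$Sepal.Length), main = "Sepal.Length")
plot(density(iris$Sepal.Width), main = "Sepal.Width")
plot(density(iris$Petal.Length), main = "Petal.Length")
plot(density(iris$Petal.Width), main = "Petal.Width")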
The density curves for Petal Length and Petal Width are both bimodal, which suggests that the observations fall into at least two distinct groups.
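One way to see where the two groups come from is to compare the range of petal length within each species. The sketch below assumes the unscaled measurements from `datasets::iris`; `raw` and `petal_ranges` are illustrative names, not part of the original analysis.

```r
raw <- datasets::iris

# Minimum and maximum petal length within each species (in cm)
petal_ranges <- sapply(split(raw$Petal.Length, raw$Species), range)
rownames(petal_ranges) <- c("min", "max")
petal_ranges

# TRUE when the longest setosa petal is shorter than the shortest petal
# found in either of the other two species
petal_ranges["max", "setosa"] <
  min(petal_ranges["min", c("versicolor", "virginica")])
```

If the comparison is TRUE, the lower mode of the density curve consists of setosa alone, while versicolor and virginica share the upper mode.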
par(mfrow=c(2,2))
hist(iris$Sepal.Length, col="blue", breaks=20)
hist(iris$Sepal.Width, col="blue", breaks=20)
hist(iris$Petal.Length, col="blue", breaks=20)
hist(iris$Petal.Width, col="blue", breaks=20)
Taken over the Iris dataset as a whole, Petal Length and Petal Width do not follow a normal distribution: both histograms show two separated clusters of values rather than a single bell shape.
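A formal test can complement the visual impression. The sketch below applies the Shapiro-Wilk test (`shapiro.test` from base R's `stats` package) to the two petal variables. The names `raw`, `sw_length` and `sw_width` are illustrative, and judging the p-values against 0.05 is a conventional choice rather than something the analysis above specifies.

```r
raw <- datasets::iris

# Shapiro-Wilk test: the null hypothesis is that the sample comes from
# a normal distribution
sw_length <- shapiro.test(raw$Petal.Length)
sw_width  <- shapiro.test(raw$Petal.Width)

# Small p-values are evidence against normality
c(Petal.Length = sw_length$p.value, Petal.Width = sw_width$p.value)
```

The test statistic is unchanged by centering and scaling, so the standardized columns give the same result as the raw measurements.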
Next, we check whether length and width are related, for sepals and for petals separately. We use scatter plots for this analysis, faceted by species and overlaid with a fitted linear trend line.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Species)
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Species)
Within each species, length and width appear to be positively related for both sepals and petals. We use a pairs plot, which also reports the pairwise correlations, to check the strength of these relationships.
ggpairs(iris)
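The pairwise correlations can also be computed numerically with base R's `cor`. The sketch below computes the Pearson correlation matrix for all flowers pooled together, and then the sepal length/width correlation within each species. The names `raw`, `corr_all` and `sepal_by_species` are illustrative.

```r
raw <- datasets::iris

# Pearson correlation matrix of the four measurements, all species pooled
corr_all <- cor(raw[, 1:4])
round(corr_all, 2)

# Sepal length vs. sepal width correlation, computed separately per species
sepal_by_species <- sapply(
  split(raw, raw$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)
round(sepal_by_species, 2)
```

Comparing the pooled sepal correlation with the per-species values shows whether the relationship seen within each facet also holds for the data set as a whole. Since `ggcorrplot` is already loaded, `ggcorrplot(corr_all, lab = TRUE)` is one way to draw the pooled matrix.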