The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by Sir Ronald Fisher in his paper in 1936 as an example of linear discriminant analysis. It is also called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula from the same pasture, picked on the same day and measured at the same time by the same person with the same apparatus.
Image credits: https://pixabay.com/en/iris-early-flower-garden-blossom-2392750/
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
The Iris dataset contains 150 observations and 5 variables. The variables Sepal Length, Sepal Width, Petal Length and Petal Width are quantitative and describe the lengths and widths of parts of the flowers in cm. The variable Species is categorical, with three levels: setosa, versicolor and virginica.
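The structure described above can be checked directly in base R. This is a minimal sketch; the names `iris_raw`, `dims`, `col_classes` and `species_counts` are our own and are not used elsewhere in the analysis.

```r
# Work on a fresh copy of the built-in data set.
iris_raw <- datasets::iris

dims <- dim(iris_raw)                      # number of rows and columns
col_classes <- sapply(iris_raw, class)     # storage class of each variable
species_counts <- table(iris_raw$Species)  # observations per species

print(dims)
print(col_classes)
print(species_counts)
```

The four measurement columns should come back as numeric and Species as a factor, matching the description of four quantitative variables and one categorical variable.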
library(class)
library(ggplot2)
library(GGally)
library(ggcorrplot)
data(iris)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
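Below are the density plots and histograms showing the distribution of the quantitative variables Sepal Length, Sepal Width, Petal Length and Petal Width. The measurements are standardized first, so the horizontal axes are in standard-deviation units rather than centimeters.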
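# Standardize the four measurements to mean 0 and standard deviation 1.
# This overwrites the local copy of iris, so every later step uses the scaled values.
iris[, 1:4] <- scale(iris[, 1:4])
par(mfrow = c(2, 2))
# Each density curve is a single line, so a per-observation colour vector has no effect and is dropped.
plot(density(iris$Sepal.Length), main = "Sepal.Length")
plot(density(iris$Sepal.Width), main = "Sepal.Width")
plot(density(iris$Petal.Length), main = "Petal.Length")
plot(density(iris$Petal.Width), main = "Petal.Width")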
The density curves for Petal Length and Petal Width each have two peaks, which suggests that the observations fall into at least two distinct groups on these variables.
par(mfrow=c(2,2))
hist(iris$Sepal.Length, col="blue", breaks=20)
hist(iris$Sepal.Width, col="blue", breaks=20)
hist(iris$Petal.Length, col="blue", breaks=20)
hist(iris$Petal.Width, col="blue", breaks=20)
Looking at the distributions of Petal Length and Petal Width over the Iris dataset as a whole, we see that neither follows a normal distribution.
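The visual impression of non-normality can be backed up with a formal test. The sketch below applies the Shapiro-Wilk test (`shapiro.test` from base R's stats package) to each measurement. This test is our own addition and not part of the original analysis, and the names `iris_raw`, `measure_cols` and `p_values` are illustrative. The test does not depend on location or scale, so it makes no difference whether the raw or the standardized values are used.

```r
# Fresh copy of the data; Shapiro-Wilk is unaffected by centring or scaling.
iris_raw <- datasets::iris
measure_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# One p-value per variable; a small p-value is evidence against normality.
p_values <- sapply(iris_raw[measure_cols], function(x) shapiro.test(x)$p.value)

print(signif(p_values, 3))
```

For the two petal variables the p-values fall well below 0.05, in line with the two-peaked shapes seen in the plots.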
We are interested in seeing if there is a relationship between width and length of petal and sepal respectively. We use scatter plots for this analysis.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Species)
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Species)
Within each species, length and width appear to be positively related for both petals and sepals. We use a correlation matrix to check the strength of the correlations.
ggpairs(iris)
From the correlation plot, we observe a strong positive correlation between Petal Length and Petal Width, with a correlation coefficient of 0.963. Petal Length and Sepal Length are also strongly correlated. Petal Length and Sepal Width are negatively correlated, while Sepal Length and Sepal Width have only a very weak overall correlation, meaning there is little linear association between them when the three species are pooled.
From the boxplots in the Species column, we can observe outliers within individual species. The ranges of Sepal Length, Sepal Width, Petal Length and Petal Width also differ between species. The ranges for setosa are distinctly different, while those for versicolor and virginica are close to each other. This shows that the dimensions of the flower parts differ from species to species.
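The correlations read off the pairs plot can also be computed numerically with `cor`. The sketch below does this on a fresh copy of the data (correlations are unchanged by standardization) and, as an extra step of our own, repeats the Sepal Length versus Sepal Width correlation separately for each species. The names `cor_all` and `cor_sepal_by_species` are illustrative.

```r
iris_raw <- datasets::iris

# Pearson correlation matrix of the four measurements, all species pooled.
cor_all <- cor(iris_raw[, 1:4])
print(round(cor_all, 3))

# Sepal length vs. sepal width, computed within each species separately.
cor_sepal_by_species <- sapply(
  split(iris_raw, iris_raw$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)
print(round(cor_sepal_by_species, 3))
```

The pooled sepal correlation is weak and negative, while the three within-species values are positive. This is why the faceted scatter plots show upward-sloping lines even though the pooled coefficient is close to zero: the differences between species mask the trend inside each one.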
We want to see whether we can correctly predict the species a flower belongs to from its Petal Width, Petal Length, Sepal Width and Sepal Length.
We first divide the Iris dataset into training and test dataset to apply KNN classification. 60% of the data is used for training while the KNN classification is tested on the remaining 40% of the data.
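set.seed(12366894)
# Separate the rows by species so that each class contributes equally to both sets.
setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica <- iris[iris$Species == "virginica", ]
# Draw 60% of the row positions (30 of 50). Every species has 50 rows, so the same positions are reused for all three.
ind <- sample(1:nrow(setosa), nrow(setosa) * 0.6)
iris.train <- rbind(setosa[ind, ], versicolor[ind, ], virginica[ind, ])
iris.test <- rbind(setosa[-ind, ], versicolor[-ind, ], virginica[-ind, ])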
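The split above reuses one set of row positions for all three species, which works because every species has exactly 50 rows. A more general variant draws a separate sample within each group. The following is a minimal sketch of that idea; `stratified_split` is a hypothetical helper of our own, not a function from any package, and the seed is arbitrary.

```r
# Hypothetical helper: split `data` into train/test, sampling separately
# within each level of the grouping column so class proportions are kept.
stratified_split <- function(data, group_col, train_frac = 0.6) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group_col]])
  train_idx <- unlist(lapply(rows_by_group, function(rows) {
    n_train <- round(length(rows) * train_frac)
    rows[sample.int(length(rows), n_train)]
  }), use.names = FALSE)
  list(train = data[train_idx, ], test = data[-train_idx, ])
}

set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
parts <- stratified_split(datasets::iris, "Species")

print(table(parts$train$Species))
print(table(parts$test$Species))
```

With the default 60% fraction this yields 30 training and 20 test flowers per species, the same sizes as the split used in the analysis, although the selected rows will differ.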
We run the KNN classification algorithm for every value of K from 1 to 15 and look for the value that gives the lowest test error.
error <- numeric(15)
for (i in 1:15) {
  knn.fit <- knn(train = iris.train[, 1:4], test = iris.test[, 1:4], cl = iris.train$Species, k = i)
  # Test error: share of test flowers whose predicted species is wrong.
  error[i] <- 1 - mean(knn.fit == iris.test$Species)
}
ggplot(data = data.frame(k = 1:15, error = error), aes(x = k, y = error)) +
  geom_line(color = "blue")
We can see that K=5 gives the lowest test error.
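K can also be chosen without looking at the test set, using leave-one-out cross-validation on the training data. The sketch below uses `knn.cv` from the class package for this. It is an alternative of our own and not the procedure used in this analysis; it rebuilds a 60% training sample with an arbitrary seed, so its chosen K need not agree with the value above. The names `features`, `labels`, `train_rows`, `cv_error` and `best_k` are illustrative.

```r
library(class)

set.seed(42)  # arbitrary seed for this sketch
iris_raw <- datasets::iris
features <- scale(iris_raw[, 1:4])  # standardized measurements
labels <- iris_raw$Species

# 30 training rows per species, sampled separately within each species.
train_rows <- unlist(
  lapply(split(seq_len(nrow(iris_raw)), labels), function(rows) sample(rows, 30)),
  use.names = FALSE
)

# Leave-one-out error on the training rows for each candidate K.
k_values <- 1:15
cv_error <- sapply(k_values, function(k) {
  pred <- knn.cv(train = features[train_rows, ], cl = labels[train_rows], k = k)
  mean(pred != labels[train_rows])
})

best_k <- k_values[which.min(cv_error)]
print(data.frame(k = k_values, cv_error = round(cv_error, 3)))
print(best_k)
```

Selecting K this way keeps the test set untouched until the final evaluation, so the reported test error is not influenced by the choice of K.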
We run KNN classification on the data set using the value of K as K=5. We then check for the misclassification rate in the predictions made.
set.seed(12366894)
iris_pred <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k=5)
table(iris.test$Species,iris_pred)
##             iris_pred
##              setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         19         1
##   virginica       0          1        19
We see that the misclassification rate is 3.33% (2 out of 60 observations) which is low. Hence, we can say that the performance of the classifier is good on the test data set.
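The misclassification rate follows directly from the confusion matrix: correct predictions sit on the diagonal, so the error is one minus the diagonal share. The sketch below re-enters the counts from the table above by hand; the names `cm`, `accuracy`, `error_rate` and `recall` are our own.

```r
# Confusion matrix copied from the table above (rows: actual, columns: predicted).
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(
  c(20,  0,  0,
     0, 19,  1,
     0,  1, 19),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = species, predicted = species)
)

accuracy <- sum(diag(cm)) / sum(cm)  # correctly classified share: 58 / 60
error_rate <- 1 - accuracy           # misclassified share: 2 / 60
recall <- diag(cm) / rowSums(cm)     # per-species share of flowers correctly identified

print(round(100 * error_rate, 2))    # 3.33
print(recall)
```

The per-species values show that both errors come from confusion between versicolor and virginica, while every setosa flower is classified correctly.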