KNN Classification on Iris Data

About Iris Dataset

The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by Sir Ronald Fisher in his paper in 1936 as an example of linear discriminant analysis. It is also called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula from the same pasture, picked on the same day and measured at the same time by the same person with the same apparatus.

Image credits: https://pixabay.com/en/iris-early-flower-garden-blossom-2392750/

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

Summary Statistics

The Iris dataset contains 150 observations and 5 variables. Sepal length, Sepal width, Petal length and Petal width are quantitative variables giving the lengths and widths of parts of the flowers in cm. Species is a categorical variable with three levels: setosa, versicolor and virginica.

library(class)
library(ggplot2)
library(GGally)
library(ggcorrplot)

data(iris)
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##
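The overall summary pools all three species. As a small complement, the sketch below computes the mean of each measurement per species with base R. The names `iris_raw` and `species_means` are illustrative and not part of the original analysis; `datasets::iris` is used so the block reads an untouched copy of the data.

```r
# Untouched copy of the data, independent of any later changes to `iris`
iris_raw <- datasets::iris

# Mean of every quantitative variable within each species
species_means <- aggregate(. ~ Species, data = iris_raw, FUN = mean)
print(species_means)
```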

Distribution of Variables

Below are the histogram and density plots showing the distribution of the quantitative variables Sepal length, Sepal width, Petal length and Petal width.

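# Standardize the four measurements (mean 0, sd 1); the plots and the KNN fit below use the scaled values
iris[,1:4] <- scale(iris[,1:4])
par(mfrow=c(2,2))
# Kernel density estimate of each measurement, pooled over all species
plot(density(iris$Sepal.Length))
plot(density(iris$Sepal.Width))
plot(density(iris$Petal.Length))
plot(density(iris$Petal.Width))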

The densities of Petal Length and Petal Width each show two peaks, which suggests that the observations fall into at least two distinct groups on these variables.
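One way to check where the two peaks come from is to estimate the density separately for each species. The following is a minimal base R sketch under that assumption; `iris_raw` and `dens` are illustrative names, and the unscaled measurements are used so the axis stays in centimeters.

```r
# Untouched copy of the data
iris_raw <- datasets::iris

# One kernel density estimate of Petal.Length per species
dens <- lapply(split(iris_raw$Petal.Length, iris_raw$Species), density)

# Common axis limits so all three curves fit in one panel
x_range <- range(sapply(dens, function(d) range(d$x)))
y_max <- max(sapply(dens, function(d) max(d$y)))

plot(NULL, xlim = x_range, ylim = c(0, y_max),
     xlab = "Petal.Length (cm)", ylab = "Density",
     main = "Petal.Length density by species")
for (i in seq_along(dens)) {
  lines(dens[[i]], col = i)
}
legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
```

If the setosa curve sits apart from the other two, the pooled two-peak shape is mostly a species effect rather than a property of any single species.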

par(mfrow=c(2,2))
hist(iris$Sepal.Length, col="blue", breaks=20)
hist(iris$Sepal.Width, col="blue", breaks=20)
hist(iris$Petal.Length, col="blue", breaks=20)
hist(iris$Petal.Width, col="blue", breaks=20)

Looking at the Iris dataset as a whole, the distributions of Petal Length and Petal Width are clearly not normal: the histograms show a separate cluster of small values apart from the main body of the data.
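The visual impression can be backed by a formal test. The sketch below applies the Shapiro-Wilk test from base R (`shapiro.test`) to each measurement; this test is an addition for illustration and was not part of the original analysis. A small p-value is evidence against normality. The names `iris_raw` and `sw_pvalues` are illustrative.

```r
# Untouched copy of the data (the test is unaffected by standardization)
iris_raw <- datasets::iris

# Shapiro-Wilk p-value for each of the four quantitative variables
sw_pvalues <- sapply(iris_raw[, 1:4], function(x) shapiro.test(x)$p.value)
print(signif(sw_pvalues, 3))
```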

Relationship between Variables

We are interested in seeing if there is a relationship between width and length of petal and sepal respectively. We use scatter plots for this analysis.

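# Sepal length vs. sepal width, one panel per species, with a fitted regression line
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_grid(. ~ Species)

# Petal length vs. petal width, one panel per species, with a fitted regression line
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_grid(. ~ Species)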

Within each species, length and width appear to be positively related for both the petals and the sepals. We use the correlation matrix to check the strength of the correlations across the whole dataset.

ggpairs(iris)

From the correlation plot, we observe a strong positive correlation between Petal Length and Petal Width, with a correlation coefficient of 0.963. Petal Length and Sepal Length are also strongly correlated. Petal Length and Sepal Width are negatively correlated, while Sepal Length and Sepal Width have only a very weak correlation, indicating little linear association between them when all species are pooled.
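The coefficients can also be read off numerically. The sketch below computes the pooled Pearson correlation matrix with base R `cor`, and then the Sepal Length and Sepal Width correlation within each species, since pooled and within-group correlations can differ. The names `iris_raw`, `cor_mat` and `sepal_cor_by_species` are illustrative.

```r
# Untouched copy of the data (correlation is unaffected by standardization)
iris_raw <- datasets::iris

# Pooled Pearson correlations between the four measurements
cor_mat <- cor(iris_raw[, 1:4])
print(round(cor_mat, 3))

# Sepal.Length vs. Sepal.Width correlation computed separately per species
sepal_cor_by_species <- sapply(
  split(iris_raw, iris_raw$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)
print(round(sepal_cor_by_species, 3))
```

Comparing the two outputs shows whether the weak pooled sepal correlation also holds inside each species or is a result of mixing the groups.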

From the boxplots in the species column, we can observe outliers within individual species. The ranges of sepal length, sepal width, petal length and petal width also differ between species. The ranges for setosa are distinctly different, while those for versicolor and virginica are close to each other. This shows that the dimensions of the flower parts differ between species.
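The ranges seen in the boxplots can be tabulated directly. Below is a minimal sketch for one variable, Petal Length, on the unscaled data; `iris_raw` and `petal_len_range` are illustrative names.

```r
# Untouched copy of the data, in centimeters
iris_raw <- datasets::iris

# Minimum and maximum Petal.Length within each species
petal_len_range <- sapply(split(iris_raw$Petal.Length, iris_raw$Species), range)
rownames(petal_len_range) <- c("min", "max")
print(petal_len_range)
```

A setosa maximum below the versicolor minimum means the two are fully separated on this variable, while overlapping versicolor and virginica ranges explain why those two species are harder to tell apart.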

Classification into Different Species using the K-Nearest Neighbor Method

We want to see whether we can correctly predict the species a flower belongs to from its Petal Width, Petal Length, Sepal Width and Sepal Length.
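To make the method concrete, here is a minimal from-scratch sketch of how a K-nearest-neighbor prediction for a single flower can be computed: measure the Euclidean distance to every training flower, take the K closest, and let them vote. The helper `knn_predict_one` is hypothetical and written only for illustration; the analysis itself uses `knn` from the class package, which breaks ties at random rather than by level order.

```r
# Hypothetical helper, not part of the class package
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k nearest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; which.max picks the first level among tied counts
  names(which.max(table(nearest)))
}

iris_raw <- datasets::iris
# Standardize so that no single measurement dominates the distance
x <- scale(iris_raw[, 1:4])

# Predict the species of the first flower from all the others
pred <- knn_predict_one(x[-1, ], iris_raw$Species[-1], x[1, ], k = 5)
print(pred)
```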

We first divide the Iris dataset into training and test datasets to apply KNN classification. 60% of the data is used for training, and the classifier is tested on the remaining 40%. The split is made within each species, so both datasets keep the three classes in equal proportion.

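set.seed(12366894)
# Separate the rows of each species so that every class is sampled in the same proportion
setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica <- iris[iris$Species == "virginica", ]

# Draw 60% of the row positions (30 of 50); the same positions are reused for each species
ind <- sample(1:nrow(setosa), nrow(setosa) * 0.6)

# Training set (90 rows) and test set (60 rows)
iris.train <- rbind(setosa[ind, ], versicolor[ind, ], virginica[ind, ])
iris.test <- rbind(setosa[-ind, ], versicolor[-ind, ], virginica[-ind, ])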
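As an alternative sketch, a stratified 60/40 split can also draw a separate random sample inside each species instead of reusing one set of row positions. This is a variation for illustration, not the split used in this analysis; the seed value and the names `iris_raw`, `train_idx`, `train_set` and `test_set` are assumptions.

```r
set.seed(1)  # arbitrary seed, chosen only for this illustration
iris_raw <- datasets::iris

# For each species, sample 60% of its row numbers independently
rows_by_species <- split(seq_len(nrow(iris_raw)), iris_raw$Species)
train_idx <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.6 * length(rows)))
}))

train_set <- iris_raw[train_idx, ]
test_set <- iris_raw[-train_idx, ]

# Both sets should contain the three species in equal numbers
print(table(train_set$Species))
print(table(test_set$Species))
```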

Choosing the Value of K

We run the KNN classification algorithm for each value of K from 1 to 15 and look for the value that gives the lowest error on the test data.

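# Test error for each K from 1 to 15
error <- numeric(15)
for (i in 1:15) {
  knn.fit <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k = i)
  # Misclassification rate: share of test flowers whose predicted species is wrong
  error[i] <- 1 - mean(knn.fit == iris.test$Species)
}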

# Plot the test error against K
ggplot(data = data.frame(k = 1:15, error = error), aes(x = k, y = error)) +
  geom_line(color = "blue")

We can see that K=5 gives the lowest test error.
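A common alternative to reading K off the test error is leave-one-out cross-validation, which the class package offers as `knn.cv`. The sketch below is a swapped-in technique for illustration, not the procedure used above. For brevity it runs on all 150 standardized observations; in practice it would be applied to the training portion only. The seed and the names `k_values`, `cv_error` and `best_k` are assumptions.

```r
library(class)

set.seed(1)  # knn.cv breaks distance ties at random; seed chosen arbitrarily
iris_raw <- datasets::iris
x <- scale(iris_raw[, 1:4])
y <- iris_raw$Species

# Leave-one-out misclassification rate for each candidate K
k_values <- 1:15
cv_error <- sapply(k_values, function(k) {
  mean(knn.cv(train = x, cl = y, k = k) != y)
})

# Smallest K that attains the lowest cross-validated error
best_k <- k_values[which.min(cv_error)]
print(data.frame(k = k_values, cv_error = round(cv_error, 4)))
print(best_k)
```

Choosing K this way leaves the test set untouched, so the final test error remains an independent estimate.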

Prediction Accuracy using the chosen K value

We run KNN classification with K=5 and then check the misclassification rate of the predictions on the test data.

set.seed(12366894)
iris_pred <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k=5)
table(iris.test$Species,iris_pred)

##             iris_pred
##              setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         19         1
##   virginica       0          1        19

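We see that the misclassification rate is 3.33% (2 of 60 observations): one versicolor flower was predicted as virginica and one virginica flower as versicolor, while every setosa flower was classified correctly. The classifier therefore performs well on the test data set. Because K was chosen using this same test data, the estimate is likely to be somewhat optimistic.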
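The arithmetic behind the rate can be reproduced from the confusion matrix alone. The sketch below re-enters the counts from the table above by hand, so it runs without refitting the model; the names `cm`, `accuracy`, `misclass_rate` and `recall_by_species` are illustrative.

```r
# Confusion matrix copied from the table above (rows: actual, columns: predicted)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(20,  0,  0,
                0, 19,  1,
                0,  1, 19),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = species, predicted = species))

# Correct predictions lie on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
misclass_rate <- 1 - accuracy

# Share of each actual species that was predicted correctly
recall_by_species <- diag(cm) / rowSums(cm)

print(round(100 * misclass_rate, 2))
print(recall_by_species)
```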