Introduction

The Iris dataset contains 150 observations of 5 variables. Sepal length, sepal width, petal length and petal width are quantitative variables giving the lengths and widths of flower parts in cm. Species is a categorical variable with three levels: setosa, versicolor and virginica.

The goal of this study is to perform an exploratory analysis of the data, build KNN models with different values of k, and compare their out-of-sample prediction accuracy.

Summary Statistics

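The output below contains summary statistics such as the range, mean and median for the Iris dataset, followed by the number of observations per species and the standard deviation of each quantitative variable.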

library(class)
library(ggplot2)
library(GGally)

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
apply(iris[,1:4], 2, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377
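
The statistics above pool all three species. As a complement, here is a minimal sketch of the same kind of summary computed per species with base R's aggregate(); the names iris_raw, species_means and species_sds are illustrative and not part of the original analysis.

```r
# Work on an untouched copy of the built-in data
iris_raw <- datasets::iris

# Mean and standard deviation of each quantitative variable, by species
species_means <- aggregate(. ~ Species, data = iris_raw, FUN = mean)
species_sds   <- aggregate(. ~ Species, data = iris_raw, FUN = sd)

species_means
species_sds
```

Per-species summaries make it easier to see how far the groups differ before any model is fitted.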

Histogram Plots

Below are the histograms and density plots showing the distributions of the quantitative variables sepal length, sepal width, petal length and petal width. Looking at petal length and petal width over the Iris dataset as a whole, we see that neither follows a normal distribution.
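
The plotting code for these histograms is not included in the report, so the following is only a sketch of one way to draw them. It assumes ggplot2 (version 3.3.0 or later, for after_stat()) and reshapes the data with base R's stack(); the names iris_long and p_hist are illustrative.

```r
library(ggplot2)

# Long format: one row per measurement, with columns `values` and `ind`
iris_long <- stack(datasets::iris[, 1:4])

# Histogram on the density scale with a kernel density curve on top,
# one panel per variable
p_hist <- ggplot(iris_long, aes(x = values)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20,
                 fill = "grey80", colour = "grey40") +
  geom_density(colour = "blue") +
  facet_wrap(~ ind, scales = "free")

p_hist
```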

Scatter Plots

The plots below show the relationship between petal length and petal width, and between sepal length and sepal width, for all three species of Iris. Petal length and petal width are strongly positively correlated, whereas sepal length and sepal width show only a weak relationship when the three species are pooled.
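
The scatter plots themselves are not reproduced in the text. A minimal sketch of how they could be drawn, assuming ggplot2 and colouring the points by species (the object names p_petal and p_sepal are illustrative):

```r
library(ggplot2)

iris_raw <- datasets::iris

# Petal length against petal width, coloured by species
p_petal <- ggplot(iris_raw, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point()

# Sepal length against sepal width, coloured by species
p_sepal <- ggplot(iris_raw, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

p_petal
p_sepal
```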

The correlation matrix shows the relationship between each pair of variables, together with the value of the correlation coefficient.

From the correlation plot, we observe a strong positive correlation between petal length and petal width, with a correlation coefficient of 0.963. The boxplots in the species column show outliers within individual species. The ranges of sepal length, sepal width, petal length and petal width also differ between species: the range for setosa is distinctly different, while those of versicolor and virginica are close to each other. This shows that the dimensions of the flower parts differ between species.

ggpairs(iris)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
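
The pooled correlation coefficients shown in the plot matrix can also be checked numerically. A short sketch using base R's cor() on the unscaled data (num_vars and cor_mat are illustrative names):

```r
# Pearson correlations between the four quantitative variables,
# computed over all 150 observations
num_vars <- datasets::iris[, 1:4]
cor_mat  <- round(cor(num_vars), 3)

cor_mat
```

Reading the Petal.Length row against the Petal.Width column gives the coefficient quoted in the text.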

Finding the Optimum K for K nearest neighbor

We now divide the Iris dataset into training and test datasets to apply KNN classification. 60% of the data is used for training, and the KNN classifier is tested on the remaining 40%.

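set.seed(12383117)
# Standardize the four quantitative variables (mean 0, sd 1) so that no
# single variable dominates the Euclidean distance used by knn()
iris[,1:4] <- scale(iris[,1:4])
setosa <- iris[iris$Species=="setosa",]
versicolor <- iris[iris$Species=="versicolor",]
virginica <- iris[iris$Species=="virginica",]

# Draw 30 of the 50 row positions (60%) and reuse them for every species,
# which keeps the training and test sets balanced across species
ind <- sample(1:nrow(setosa), nrow(setosa)*0.6)
iris.train <- rbind(setosa[ind,], versicolor[ind,], virginica[ind,])
iris.test <- rbind(setosa[-ind,], versicolor[-ind,], virginica[-ind,])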
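
As a sanity check on the split, the sketch below repeats the same steps on a fresh copy of the data and confirms that every species contributes 30 training and 20 test rows. The names iris_scaled, by_species, train and test are illustrative, and split() with lapply() is used here only as a compact stand-in for the three per-species subsets.

```r
# Fresh, standardized copy so the check does not depend on earlier objects
iris_scaled <- datasets::iris
iris_scaled[, 1:4] <- scale(iris_scaled[, 1:4])

set.seed(12383117)
by_species <- split(iris_scaled, iris_scaled$Species)

# Same 30 row positions reused within each species (60% of 50)
ind <- sample(1:50, 30)
train <- do.call(rbind, lapply(by_species, function(d) d[ind, ]))
test  <- do.call(rbind, lapply(by_species, function(d) d[-ind, ]))

table(train$Species)
table(test$Species)
```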

The plot below shows the classification error for different values of k. We observe that as the flexibility increases (i.e. as 1/k increases, moving from large k to small k), the classification error initially decreases and then starts increasing again.

# Test-set misclassification rate for k = 1, ..., 15
error <- numeric(15)
for (i in 1:15)
{
  knn.fit <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k = i)
  error[i] <- 1 - mean(knn.fit == iris.test$Species)
}

ggplot(data = data.frame(k = 1:15, error = error), aes(x = k, y = error)) +
  geom_line(color = "Blue")

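The minimum error is observed at k = 2 and k = 8. Because both values give the same test error, either is a reasonable choice; we use k = 2 for the final model, noting that in KNN a smaller k corresponds to a more flexible model, so k = 8 would be the less complex alternative.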

iris_pred <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k=2)

table(iris.test$Species,iris_pred)
##             iris_pred
##              setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         19         1
##   virginica       0          2        18

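In the confusion matrix above, 57 of the 60 test observations are classified correctly, which corresponds to an out-of-sample prediction accuracy of 95% at k = 2. The exact figure can vary slightly between runs, because knn() breaks voting ties at random and ties are common when k is even.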
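
Overall accuracy is the share of test observations on the diagonal of the confusion matrix. The sketch below recomputes it, along with the per-species hit rate, from the counts printed in the table above; the matrix is typed in by hand, and the names cm, accuracy and per_class_recall are illustrative.

```r
species <- c("setosa", "versicolor", "virginica")

# Rows: actual species; columns: predicted species (counts from the table above)
cm <- matrix(c(20,  0,  0,
                0, 19,  1,
                0,  2, 18),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = species, predicted = species))

# Share of all test observations that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of each actual species that was classified correctly
per_class_recall <- diag(cm) / rowSums(cm)

accuracy
per_class_recall
```

When the fitted objects are still in memory, the same calculation applies directly to table(iris.test$Species, iris_pred).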