Introduction

The Iris dataset contains 150 observations and 5 variables, with 50 flowers of each species.

We perform an exploratory analysis of the dataset and build a classification model using the k-nearest neighbours (KNN) method.
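
These counts can be confirmed directly in R:

dim(iris)            # 150 observations, 5 variables
table(iris$Species)  # 50 flowers of each species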

Packages Required

We need the following packages for our analysis:

library(class)    # knn() classifier
library(ggplot2)  # plotting
library(GGally)   # ggpairs() correlation matrix plot

Summary Statistics

The table below contains the summary statistics for the dataset. We also check the standard deviation of each numeric variable.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
apply(iris[,1:4], 2, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377

Data Visualization

Histogram Plots

Below are histograms showing the distributions of the four quantitative variables: Sepal length, Sepal width, Petal length, and Petal width.

par(mfrow=c(2,2))
hist(iris$Sepal.Length, col="blue", breaks=20)
hist(iris$Sepal.Width, col="blue", breaks=20)
hist(iris$Petal.Length, col="blue", breaks=20)
hist(iris$Petal.Width, col="blue", breaks=20)

Scatter Plots

The scatter plots below show Sepal width against Sepal length, and Petal width against Petal length. Virginica has the largest values of Petal width and Petal length, as the quick check after the plots confirms.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()

ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point()
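
As a quick check on this observation, we can compute the per-species maxima of the petal measurements:

# maximum petal dimensions per species; virginica should have the largest values
aggregate(cbind(Petal.Length, Petal.Width) ~ Species, data = iris, FUN = max)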

Correlation Matrix

From the correlation plot, we observe that:

  • There is a strong positive correlation between Petal length and Petal width with a correlation coefficient of 0.963
  • There is a strong positive correlation between Petal width and Sepal length with a correlation coefficient of 0.818
  • There is a strong positive correlation between Petal length and Sepal length with a correlation coefficient of 0.872

ggpairs(iris)
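
The coefficients quoted above can also be verified numerically:

round(cor(iris[, 1:4]), 3)  # should reproduce the values listed above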

Classification using KNN

Splitting the dataset

We now split the Iris dataset into training and test sets for the KNN classification: 80% of the data is used for training, and the classifier is evaluated on the remaining 20%. The four numeric predictors are first standardized so that all variables contribute equally to the distance calculation.

set.seed(12420352)

# standardize the four numeric predictors so no variable dominates the distance metric
iris[, 1:4] <- scale(iris[, 1:4])

# split by species so the 80/20 split is stratified
setosa     <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica  <- iris[iris$Species == "virginica", ]

# sample 40 of the 50 rows of each species for training; the rest form the test set
ind <- sample(1:nrow(setosa), nrow(setosa) * 0.8)
iris.train <- rbind(setosa[ind, ], versicolor[ind, ], virginica[ind, ])
iris.test  <- rbind(setosa[-ind, ], versicolor[-ind, ], virginica[-ind, ])

Finding the Optimum Value of k

The plot below shows the classification error for different values of k. The error decreases initially, reaches a minimum, and then increases as larger values of k oversmooth the decision boundary (underfitting), before roughly levelling off.

# evaluate the test error for k = 1 to 15
error <- c()
for (i in 1:15) {
  knn.fit <- knn(train = iris.train[, 1:4], test = iris.test[, 1:4],
                 cl = iris.train$Species, k = i)
  error[i] <- 1 - mean(knn.fit == iris.test$Species)
}

From the plot below we see that the minimum error occurs at k = 5 and k = 7. We proceed with k = 5.

ggplot(data = data.frame(k = 1:15, error = error), aes(x = k, y = error)) +
  geom_line(color = "blue") +
  labs(x = "k", y = "Classification error")
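
Rather than reading the minimum off the plot, we can also list the values of k that attain it:

which(error == min(error))  # values of k with the lowest test error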

Confusion Matrix

The minimum error is observed at k = 5. We now compute the confusion matrix and accuracy for this model:

iris_pred <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k=5)

table(iris.test$Species,iris_pred)
##             iris_pred
##              setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          1         9
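
The accuracy follows directly from the predictions: 29 of the 30 test flowers are classified correctly.

# out-of-sample accuracy: proportion of correct predictions on the test set
mean(iris_pred == iris.test$Species)
## [1] 0.9666667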

We observe that the out-of-sample prediction accuracy at k = 5 is 96.67%.

Conclusion

We built a KNN classifier that achieves a prediction accuracy of 96.67% on the test dataset.