Classification of Iris data set using KNN

Introduction

Fisher’s (or Anderson’s) iris data set gives measurements (in cm) of sepal length, sepal width, petal length, and petal width for 50 flowers from each of three species of iris: Iris setosa, versicolor, and virginica. The data set has 150 rows and 5 columns.

Here, k-nearest neighbors (KNN) is used to classify the flowers into species: each flower in the test set is assigned the class that is most common among its k nearest neighbors in the training set.

Approach

* Exploratory data analysis
* Splitting into training and testing data
* Fitting a KNN model
* Evaluating the model

Packages required

Loading the required packages:

library(datasets)  # iris data set
library(ggplot2)   # plotting
library(dplyr)     # data manipulation and the pipe
library(class)     # knn()
library(GGally)    # ggpairs()

Exploratory data analysis

data(iris)

#dimensions of the data set
dim(iris)
## [1] 150   5
#a look at first few rows of the data set
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#structure of data
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#summary statistics
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
#scatter plot matrix of the four numeric variables (base R)
pairs(iris[1:4])

#the same matrix with pairwise correlations and density plots (GGally)
ggpairs(iris[1:4])
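
Coloring the same scatter-plot matrix by species makes the separation between groups more apparent; a possible variant (the color mapping is an addition and not part of the original code):

ggpairs(iris, mapping = aes(color = Species), columns = 1:4)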

Data visualization

iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  ggtitle("Variation of Sepal length by Sepal width in different species") +
  xlab("Sepal length") +
  ylab("Sepal Width")

iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point() +
  ggtitle("Variation of Petal length by Petal width in different species") +
  xlab("Petal length") +
  ylab("Petal Width")

Splitting data into training and testing data sets

#Scaling the data
iris.std <- iris 
iris.std[,-5] <- scale(iris[, -5])

#Splitting data
set.seed(123)
ind <- sample(nrow(iris.std), nrow(iris.std) * 0.6)
iris.train <- iris.std[ind, ]
iris.test <- iris.std[-ind, ]
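
A quick sanity check (not part of the original write-up) is to confirm that all three species appear in both the training and the testing set:

#species counts in the training and testing sets
table(iris.train$Species)
table(iris.test$Species)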

KNN model

#finding best value of k
pred_accuracy <- c()
for (i in 1:20) {
  knn.iris <- knn(train = iris.train[,-5], test = iris.test[,-5], cl = iris.train[,5], k = i)
  pred_accuracy[i] <- mean(knn.iris == iris.test$Species)
}

ggplot(data = data.frame(pred_accuracy), aes(x = 1:20, y = pred_accuracy)) +
  geom_line() +
  xlab("Value of k") +
  ylab("Prediction Accuracy")

table(iris.test[,5], knn.iris, dnn = c("True", "Predicted"))
##             Predicted
## True         setosa versicolor virginica
##   setosa         17          0         0
##   versicolor      0         21         4
##   virginica       0          2        16

The higher the prediction accuracy, the better the model. From the plot, k = 5 gives the highest prediction accuracy, at about 95%.
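
Note that knn.iris still holds the fit from the final loop iteration (k = 20), so the confusion matrix above reflects that model rather than the selected one. A sketch of refitting with the chosen k before tabulating the final result (k = 5 here is an assumption read off the accuracy plot):

#refit with the selected k and compare predictions with the true labels
knn.best <- knn(train = iris.train[, -5], test = iris.test[, -5], cl = iris.train[, 5], k = 5)
table(iris.test[, 5], knn.best, dnn = c("True", "Predicted"))
mean(knn.best == iris.test$Species)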