The Iris dataset contains 150 observations and 5 variables, with 50 flowers of each species. We perform an exploratory analysis of the dataset and then build a classification model using the K-nearest neighbours (KNN) method.
We need the following packages for our analysis:
library(class)
library(ggplot2)
library(GGally)
The table below contains the summary statistics for the dataset. We also check the standard deviation of each variable.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
apply(iris[,1:4], 2, sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8280661 0.4358663 1.7652982 0.7622377
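The standard deviations differ by a factor of about four, so the variables are standardized before fitting KNN later on. A quick sketch confirming that scale() gives every column unit standard deviation:
# after standardizing, each column has mean 0 and sd 1
apply(scale(iris[,1:4]), 2, sd)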
Below are histograms showing the distributions of the four quantitative variables: sepal length, sepal width, petal length and petal width.
par(mfrow=c(2,2))  # arrange the four histograms in a 2x2 grid
hist(iris$Sepal.Length, col="blue", breaks=20)
hist(iris$Sepal.Width, col="blue", breaks=20)
hist(iris$Petal.Length, col="blue", breaks=20)
hist(iris$Petal.Width, col="blue", breaks=20)
We next check the distribution of sepal width vs. sepal length and petal width vs. petal length, coloured by species. Virginica has the largest petal width and petal length values, and the three species separate well in the petal dimensions.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point()
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
geom_point()
From the pairs plot below, we observe that petal length and petal width are strongly positively correlated, both with each other and with sepal length, while sepal width is only weakly correlated with the other measurements:
ggpairs(iris)
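To put numbers on these relationships, the correlation matrix of the four measurements can be printed directly:
# pairwise Pearson correlations between the four measurements
cor(iris[,1:4])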
We now divide the Iris dataset into training and test sets to apply KNN classification: 80% of the data is used for training, and the classifier is evaluated on the remaining 20%.
set.seed(12420352)
iris[,1:4] <- scale(iris[,1:4])  # standardize features so no variable dominates the distance
setosa     <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]
virginica  <- iris[iris$Species == "virginica", ]
# each species has exactly 50 rows, so one index vector gives a stratified 80/20 split
ind <- sample(1:nrow(setosa), nrow(setosa)*0.8)
iris.train <- rbind(setosa[ind,], versicolor[ind,], virginica[ind,])
iris.test  <- rbind(setosa[-ind,], versicolor[-ind,], virginica[-ind,])
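A quick check confirms the split is stratified, with 40 flowers of each species in the training set and 10 in the test set:
# sanity check: class balance after the stratified split
table(iris.train$Species)
table(iris.test$Species)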
The plot below shows the test-set classification error for different values of k. The error decreases initially but then starts increasing as k grows, because a large neighbourhood over-smooths the decision boundary (underfitting); it levels off afterwards.
# test-set misclassification error for k = 1 to 15
error <- numeric(15)
for (i in 1:15) {
  knn.fit <- knn(train = iris.train[,1:4], test = iris.test[,1:4],
                 cl = iris.train$Species, k = i)
  error[i] <- 1 - mean(knn.fit == iris.test$Species)
}
From the plot below we see that the minimum error occurs at k = 5 and k = 7. Both perform equally well on the test set, so we go with the smaller neighbourhood, k = 5.
ggplot(data = data.frame(k = 1:15, error = error), aes(x = k, y = error)) +
geom_line(color = "blue")
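As a complementary check that does not use the test set at all, we can run leave-one-out cross-validation on the training data with knn.cv from the class package; a minimal sketch:
# leave-one-out CV error on the training set for k = 1 to 15
cv.error <- sapply(1:15, function(k) {
  cv.pred <- knn.cv(train = iris.train[,1:4], cl = iris.train$Species, k = k)
  1 - mean(cv.pred == iris.train$Species)
})
which.min(cv.error)  # k with the lowest cross-validation error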
We fit the final model with k = 5 and compute the confusion matrix on the test set:
iris_pred <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k=5)
table(iris.test$Species,iris_pred)
## iris_pred
## setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 1 9
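The accuracy can be computed directly from the predictions:
# proportion of test observations classified correctly
mean(iris_pred == iris.test$Species)
## [1] 0.9666667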
The out-of-sample prediction accuracy at k = 5 is 29/30 = 96.67%; the only mistake is one virginica flower misclassified as versicolor. We have thus built a KNN classifier that achieves 96.67% accuracy on the test dataset.