Introduction
The Iris dataset contains 150 observations of 5 variables, with 50 flowers of each species.
The variables Sepal length, Sepal width, Petal length and Petal width are quantitative and describe the lengths and widths of parts of the flowers in cm. The variable Species is categorical and takes three values: setosa, versicolor and virginica.
We carry out an exploratory analysis of the dataset and build a classification model using the K-nearest neighbours (KNN) method.
library(class)
library(ggplot2)
library(GGally)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
apply(iris[,1:4], 2, sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##    0.8280661    0.4358663    1.7652982    0.7622377
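The overall summary pools all three species together. As a small sketch that is not part of the original analysis, per-species means can be obtained with base R's aggregate(); the name species_means is our own choice.

```r
# Mean of each measurement within each species,
# computed on the built-in (unscaled) iris data.
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)
```

Each row of species_means holds one species and each column one measurement, which makes it easy to see which variables separate the species most clearly.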
Data Visualization
Histogram Plots
The histograms below show the distributions of the quantitative variables Sepal length, Sepal width, Petal length and Petal width.
par(mfrow=c(2,2))
hist(iris$Sepal.Length, col="blue", breaks=20)
hist(iris$Sepal.Width, col="blue", breaks=20)
hist(iris$Petal.Length, col="blue", breaks=20)
hist(iris$Petal.Width, col="blue", breaks=20)
Scatter Plots
We plot Sepal width against Sepal length and Petal width against Petal length, with points coloured by species. Virginica has the largest values of both Petal width and Petal length.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point()
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point()
Correlation Matrix
From the correlation plot, we observe that:
There is a strong positive correlation between Petal length and Petal width, with a correlation coefficient of 0.963.
There is a strong positive correlation between Petal width and Sepal length, with a correlation coefficient of 0.818.
There is a strong positive correlation between Petal length and Sepal length, with a correlation coefficient of 0.872.
ggpairs(iris)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
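The coefficients quoted above can also be checked numerically, without reading them off the plot. A minimal sketch using base R's cor() (Pearson correlation by default); the name cor_mat is our own.

```r
# Pairwise Pearson correlations of the four measurements, rounded to 3 decimals
cor_mat <- round(cor(iris[, 1:4]), 3)
print(cor_mat)

# The three pairs discussed in the text
cor_mat["Petal.Length", "Petal.Width"]
cor_mat["Petal.Width", "Sepal.Length"]
cor_mat["Petal.Length", "Sepal.Length"]
```

Correlation does not change when the variables are standardised, so this gives the same matrix whether it is run before or after the scaling step.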
Classification using KNN
Splitting the dataset
We now divide the Iris dataset into training and test sets to apply KNN classification. 80% of the data is used for training, and the KNN classifier is tested on the remaining 20%.
set.seed(12420352)
# Standardise the four measurements (mean 0, sd 1) so that each contributes equally to the distances
iris[,1:4] <- scale(iris[,1:4])
# One data frame per species (50 rows each)
setosa <- iris[iris$Species=="setosa",]
versicolor <- iris[iris$Species=="versicolor",]
virginica <- iris[iris$Species=="virginica",]
# Draw 40 of the 50 row positions; the same positions are reused for every species
ind <- sample(1:nrow(setosa), nrow(setosa)*0.8)
iris.train <- rbind(setosa[ind,], versicolor[ind,], virginica[ind,])
iris.test <- rbind(setosa[-ind,], versicolor[-ind,], virginica[-ind,])
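It is worth confirming that a split of this kind is balanced and that no flower ends up in both sets. The sketch below repeats the same idea in a self-contained form; the names iris_scaled, by_species, train and test are our own, and the exact rows drawn may differ between R versions because the default sampling algorithm changed in R 3.6.0.

```r
set.seed(12420352)

# Work on a copy so the built-in data set is left untouched
iris_scaled <- iris
iris_scaled[, 1:4] <- scale(iris_scaled[, 1:4])

# Split into one data frame per species, then take the same
# 40 row positions from each for training and the other 10 for testing
by_species <- split(iris_scaled, iris_scaled$Species)
ind <- sample(1:50, 40)
train <- do.call(rbind, lapply(by_species, function(d) d[ind, ]))
test  <- do.call(rbind, lapply(by_species, function(d) d[-ind, ]))

# Sanity checks: class balance and no shared rows
table(train$Species)
table(test$Species)
length(intersect(rownames(train), rownames(test)))
```

With 50 flowers per species, an 80/20 split should give 40 training and 10 test rows for each species, and the intersection of row names should be empty.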
Finding the optimum value of K
The plot below shows the classification error for different values of k. The error decreases initially and then starts increasing: a very small k follows individual training points too closely, while a large k averages over too many neighbours and underfits. It takes a constant value afterwards.
# Test-set misclassification rate for k = 1, ..., 15
error <- numeric(15)
for (i in 1:15) {
  knn.fit <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k = i)
  error[i] <- 1 - mean(knn.fit == iris.test$Species)
}
From the plot below, we see that the minimum error occurs when k is equal to 5 or 7. Since both give the same error, we go with the smaller value, k=5.
ggplot(data = data.frame(k = 1:15, error = error), aes(x = k, y = error)) +
  geom_line(color = "blue")
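Instead of reading the best k off the plot, the values of k that attain the minimum error can be listed directly. The following is a self-contained sketch under our own assumptions: the names x, y, train_idx, ks and best_k are hypothetical, and the stratified split is drawn independently for each species, so the resulting errors need not match the ones plotted above.

```r
library(class)  # provides knn()

set.seed(12420352)
x <- scale(iris[, 1:4])   # standardised measurements
y <- iris$Species

# Stratified 80/20 split: sample 40 row indices within each species
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(rows) sample(rows, 40)))

# Test-set misclassification rate for each candidate k
ks <- 1:15
error <- sapply(ks, function(k) {
  pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
              cl = y[train_idx], k = k)
  mean(pred != y[-train_idx])
})

# Every k that ties for the lowest error
best_k <- ks[error == min(error)]
best_k
```

Listing all tied values, rather than only which.min(error), makes it explicit when several choices of k perform equally well on the test set.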
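Confusion Matrix
The minimum error is observed at k=5. Below is the confusion matrix for this model, with the actual species in rows and the predicted species in columns; 29 of the 30 test flowers are classified correctly.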
iris_pred <- knn(train = iris.train[,1:4], test = iris.test[,1:4], cl = iris.train$Species, k=5)
table(iris.test$Species,iris_pred)
##             iris_pred
##              setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          1         9
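Accuracy follows directly from the confusion matrix as the sum of the diagonal divided by the total count. A short sketch that re-enters the table printed above by hand; conf_mat, accuracy and recall are names of our own.

```r
# Confusion matrix copied from the output above
# (rows: actual species, columns: predicted species)
species <- c("setosa", "versicolor", "virginica")
conf_mat <- matrix(c(10,  0, 0,
                      0, 10, 0,
                      0,  1, 9),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(actual = species, predicted = species))

# Overall accuracy: correctly classified flowers / all test flowers
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy   # 29 / 30

# Per-species recall: correct predictions / actual count for that species
recall <- diag(conf_mat) / rowSums(conf_mat)
recall
```

The single error is a virginica flower predicted as versicolor, so recall is 1 for setosa and versicolor and 0.9 for virginica.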