Fisher’s (or Anderson’s) iris data set contains 50 flowers from each of 3 iris species and gives measurements (in cm) of sepal length, sepal width, petal length, and petal width. The species are Iris setosa, Iris versicolor, and Iris virginica. The data set has 150 rows and 5 columns.
Here, k-nearest neighbors (KNN) is used to classify the flowers into different species.
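To make the idea concrete, here is a minimal sketch of how KNN labels a single flower: compute the Euclidean distance to every training flower, take the k closest, and return the most common species among them. The helper `knn_one` is hypothetical and written only for illustration; it is not part of the `class` package used later, and it breaks ties by factor level order, which may differ from what `class::knn` does.

```r
# Hypothetical helper (illustration only, not from the class package):
# classify one observation by majority vote among its k nearest neighbors.
knn_one <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  new_x <- as.numeric(unlist(new_x))
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # labels of the k closest training rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # most frequent label; ties resolve to the first level in table order
  names(which.max(table(nearest)))
}

# Usage: classify the first flower using all other flowers as training data
data(iris)
pred <- knn_one(iris[-1, 1:4], iris$Species[-1], iris[1, 1:4], k = 5)
pred
```

The sketch uses the raw measurements; the analysis below standardizes the features first so that no single measurement dominates the distance.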
Approach
* Exploratory data analysis
* Splitting into training and testing data
* Fitting the KNN model
Packages required
Loading the required packages:
library(datasets)
library(ggplot2)
library(dplyr)
library(class)
library(GGally)
data(iris)
#dimensions of the data set
dim(iris)
## [1] 150 5
#a look at first few rows of the data set
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
#structure of data
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#summary statistics
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
##
##
##
pairs(iris[1:4])
ggpairs(iris[1:4])
iris %>%
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
  geom_point() +
  ggtitle("Variation of Sepal length by Sepal width in different species") +
  xlab("Sepal length") +
  ylab("Sepal Width")
iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width, col = Species)) +
  geom_point() +
  ggtitle("Variation of Petal length by Petal width in different species") +
  xlab("Petal length") +
  ylab("Petal Width")
#Scaling the data
iris.std <- iris
iris.std[,-5] <- scale(iris[, -5])
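KNN relies on distances, so a feature measured on a larger scale would otherwise carry more weight. With its defaults, `scale()` centers each column at zero and divides by the column's standard deviation. A quick check, written as a self-contained sketch, confirms that the standardized columns behave that way:

```r
data(iris)
iris.std <- iris
iris.std[, -5] <- scale(iris[, -5])

# after standardizing, each numeric column should have mean ~0 and sd 1
col_means <- colMeans(iris.std[, -5])
col_sds <- apply(iris.std[, -5], 2, sd)
round(col_means, 10)
col_sds
```

Note that the statistics here come from all 150 rows, as in the code above. Computing them from the training rows only is a common alternative, but that would be a change to the approach shown here.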
#Splitting data
set.seed(123)
ind <- sample(nrow(iris.std), nrow(iris.std) * 0.6)
iris.train <- iris.std[ind, ]
iris.test <- iris.std[-ind, ]
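Because `sample()` draws rows at random without regard to species, the 60/40 split is not guaranteed to keep the three species balanced (the confusion matrix later in this walkthrough shows 17, 25, and 18 test flowers per species). The sketch below, which uses the illustrative names `train` and `test`, checks the sizes of the two sets and the species counts in each. Exact counts can depend on the R version's sampling algorithm, so no output is shown.

```r
data(iris)
set.seed(123)
ind <- sample(nrow(iris), nrow(iris) * 0.6)
train <- iris[ind, ]
test <- iris[-ind, ]

# sizes of the two sets: 60% and 40% of 150 rows
nrow(train)
nrow(test)

# species counts in each set; a simple random split need not be balanced
table(train$Species)
table(test$Species)
```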
#finding best value of k
pred_accuracy <- numeric(20)
for (i in 1:20) {
  #fit KNN with k = i and record the share of correct test predictions
  knn.iris <- knn(train = iris.train[,-5], test = iris.test[,-5], cl = iris.train[,5], k = i)
  pred_accuracy[i] <- mean(knn.iris == iris.test$Species)
}
#plot accuracy against k
ggplot(data = data.frame(k = 1:20, pred_accuracy), aes(x = k, y = pred_accuracy)) +
  geom_line() +
  xlab("Value of k") +
  ylab("Prediction Accuracy")
#confusion matrix (knn.iris is from the last loop iteration, k = 20)
table(iris.test[,5], knn.iris, dnn = c("True", "Predicted"))
##             Predicted
## True         setosa versicolor virginica
##   setosa         17          0         0
##   versicolor      0         21         4
##   virginica       0          2        16
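The higher the prediction accuracy, the better the model. From the plot, k = 5 gives the highest prediction accuracy, 95%. The confusion matrix above was computed from knn.iris as left by the final loop iteration (k = 20), so it shows 54 of 60 test flowers classified correctly (90%) rather than the result for the best k.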
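One way to tie the confusion matrix to the selected k is to refit the model at that k and tabulate its predictions. The following is a self-contained sketch under the same scaling, seed, and split as above. The names `ks`, `best_k`, `knn.best`, and `conf_mat` are illustrative. `knn()` breaks distance ties at random, so accuracies may vary slightly between runs, and the selected k may differ across R versions.

```r
library(class)

data(iris)
iris.std <- iris
iris.std[, -5] <- scale(iris[, -5])

set.seed(123)
ind <- sample(nrow(iris.std), nrow(iris.std) * 0.6)
iris.train <- iris.std[ind, ]
iris.test <- iris.std[-ind, ]

# test accuracy for each candidate k
ks <- 1:20
pred_accuracy <- sapply(ks, function(k) {
  pred <- knn(train = iris.train[, -5], test = iris.test[, -5],
              cl = iris.train[, 5], k = k)
  mean(pred == iris.test$Species)
})

# which.max returns the first maximum, i.e. the smallest k among ties
best_k <- ks[which.max(pred_accuracy)]

# refit at the chosen k and build the confusion matrix for that model
knn.best <- knn(train = iris.train[, -5], test = iris.test[, -5],
                cl = iris.train[, 5], k = best_k)
conf_mat <- table(iris.test[, 5], knn.best, dnn = c("True", "Predicted"))
conf_mat

# accuracy is the share of counts on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Since k is chosen here on the same test set that reports the accuracy, the figure is likely somewhat optimistic; it is a description of this split rather than an estimate for new flowers.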