In this small project I will use the KNN (k-nearest neighbors) method for iris species identification. The iris dataset is built into R and contains measurements of 4 attributes (in centimeters) for 50 flowers from each of 3 species (“setosa”, “versicolor” and “virginica”). There are 150 rows in total, and they will be split into a training and a testing dataset. The training dataset will be used to train my model, while the testing dataset will be used to verify the model's accuracy. The goal of this project is to find out which k-value (number of nearest neighbors) works best for species identification.
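For context, knn() from the class package classifies each test point by a majority vote among its k closest training points, measured by Euclidean distance. A minimal sketch of that distance computation (the two vectors are illustrative 4-attribute measurements, not tied to any particular row):
# Euclidean distance between two flowers described by 4 measurements
euclid_dist = function(a, b) sqrt(sum((a - b)^2))
euclid_dist(c(5.1, 3.5, 1.4, 0.2), c(4.9, 3.0, 1.4, 0.2))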
library(tidyverse)  # data wrangling (also loads ggplot2)
library(ggplot2)    # plotting
library(class)      # knn()
library(caret)      # confusionMatrix()
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
As the variables are measured on different scales, it is better to normalize the data columns first.
# define a simple normalization function
normalize=function(x){
  return( (x-min(x))/(max(x)-min(x)) )
}
# Loop over the first 4 numeric columns
for (i in 1:4){
  iris[i]=normalize(iris[i])
}
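As a quick sanity check (a minimal illustration, not part of the analysis itself), the function maps any numeric vector onto the 0-1 range:
# Smallest value maps to 0, largest to 1, the rest proportionally in between
normalize(c(2, 4, 6, 10))   # 0.00 0.25 0.50 1.00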
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017 1st Qu.:0.08333
## Median :0.4167 Median :0.4167 Median :0.5678 Median :0.50000
## Mean :0.4287 Mean :0.4406 Mean :0.4675 Mean :0.45806
## 3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949 3rd Qu.:0.70833
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Now all four predictors are normalized to the same 0-1 scale. Next, I split the data into a 70% training set and a 30% testing set.
# 70/30 train-test split by random row indices
set.seed(111)
iris_train_no=sample(1:nrow(iris), nrow(iris)*0.7, replace=FALSE)
iris_train_data=iris[iris_train_no,]
iris_test_data=iris[-iris_train_no,]
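It is worth verifying that all three species appear in both splits; a quick check (the exact counts depend on the random seed):
# Sizes of the two splits and species composition (counts vary with the seed)
nrow(iris_train_data)
nrow(iris_test_data)
table(iris_train_data$Species)
table(iris_test_data$Species)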
# Here I arbitrarily pick 3 as the nearest-neighbor number for a first test
set.seed(111)
knn3=knn(iris_train_data[1:4], iris_test_data[1:4],
         cl=as.factor(iris_train_data$Species), k=3)
# Then build a confusion matrix
knn3_cm=confusionMatrix(knn3,iris_test_data$Species)
knn3_cm$overall[1]
## Accuracy
## 0.9333333
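Beyond the headline accuracy, the caret object also stores the full confusion table, which shows which species get confused with one another:
# Cross-tabulation of predicted vs. actual species for the k = 3 model
knn3_cm$table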
When we set the number of nearest neighbors to k = 3, the model reaches an accuracy of 93.3%. What if we set k to other values (e.g. 1-50)? Will the result improve or change with a higher k?
# Loop over k = 1 to 50 and store each accuracy rate in a vector
knn_loop_accuracy=c()
for (i in 1:50){
  set.seed(111)
  knn_loop=knn(iris_train_data[1:4], iris_test_data[1:4],
               cl=as.factor(iris_train_data$Species), k=i)
  knn_loop_cm=confusionMatrix(knn_loop, iris_test_data$Species)
  knn_loop_accuracy[i]=knn_loop_cm$overall[1]
}
# Put the results in a dataframe and visualize them
knn_number_accuracy=data.frame(knn_number=c(1:50),accuracy=knn_loop_accuracy)
ggplot(knn_number_accuracy, aes(x=knn_number, y=accuracy)) +
  geom_line(color="red") +
  geom_point(color="skyblue") +
  scale_y_continuous(limits=c(0.8, 1)) +
  scale_x_continuous(n.breaks=10) +
  labs(title="Accuracy Rate of Using Different K-values") +
  theme_classic()
When k is between 1 and 25, every model has an accuracy rate above 90%. However, as k keeps increasing, the accuracy rate keeps dropping. This is because a larger neighborhood is more likely to include points from other species, so noise from those species increasingly sways the vote.
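As a sanity check on reading the plot, the best k can also be extracted programmatically; which.max() returns the first maximum, i.e. the smallest such k in case of ties:
# k with the highest test accuracy (smallest such k if tied)
best_k = which.max(knn_number_accuracy$accuracy)
best_k
knn_number_accuracy$accuracy[best_k]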
Therefore, I select k = 6, which has the highest accuracy rate of 95.6%. This should be the best choice for iris species prediction.
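Finally, a minimal sketch of applying the chosen model to a new flower. Note that any new measurements must first be normalized with the same per-column minima and maxima as the original data; the values below are hypothetical, already-normalized inputs, not real measurements:
# Classify one hypothetical flower (values already on the 0-1 scale) with k = 6
set.seed(111)
new_flower = data.frame(Sepal.Length=0.50, Sepal.Width=0.40,
                        Petal.Length=0.60, Petal.Width=0.55)
knn(iris_train_data[1:4], new_flower, cl=iris_train_data$Species, k=6)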