KNN project

In this project, we will work with the iris dataset from the UCI Machine Learning Repository and attempt to predict the class of an iris plant from its measurements.

Data Set Information: This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.

Attribute Information:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class: Iris Setosa, Iris Versicolour, or Iris Virginica

#Importing necessary library (iris itself ships with base R's datasets package, so ISLR is not required to load it)

library(ISLR)

#head of iris dataset

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

#str() on iris dataset

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Standardize Data

 standard.iris <- scale(iris[, -5])

Checking that the standardized columns have variance 1

 var(standard.iris[,1])
## [1] 1
 var(standard.iris[,2])
## [1] 1
final.data <- cbind(standard.iris,iris[5])
#head of final.data
head(final.data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
## 2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
## 3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
## 4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
## 5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
## 6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
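
To make the standardization step concrete, here is a minimal sketch of what `scale()` computes with its default arguments: each column has its mean subtracted and is then divided by its standard deviation. The names `col.means`, `col.sds`, `manual.scaled` and `max.diff` are introduced only for this illustration.

```r
# Standardize the four numeric columns by hand: (x - mean) / sd per column.
features <- as.matrix(iris[, -5])
col.means <- colMeans(features)
col.sds <- apply(features, 2, sd)
manual.scaled <- sweep(sweep(features, 2, col.means, "-"), 2, col.sds, "/")

# Compare with scale(); the difference should be zero up to floating-point error.
max.diff <- max(abs(manual.scaled - scale(features)))
print(max.diff)

# Each standardized column should have mean ~0 and standard deviation 1.
print(round(colMeans(manual.scaled), 10))
print(apply(manual.scaled, 2, sd))
```

Standardizing matters for KNN because the classifier relies on distances, so a feature measured on a larger scale would otherwise dominate the result.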

Train and test

library(caTools)

#training and testing

set.seed(101)
split.flag <- sample.split(final.data$Species, SplitRatio = 0.70)
train <- subset(final.data, split.flag == TRUE)
test <- subset(final.data, split.flag == FALSE)
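
A quick sanity check on the split can be sketched as follows. The block repeats the standardization and split so that it runs on its own, and it assumes the `caTools` package is installed. The name `split.flag` is chosen here for illustration. `sample.split()` is documented to preserve the relative ratios of the labels, so the class counts in each partition should be balanced.

```r
library(caTools)  # provides sample.split(), as used in the walkthrough

# Rebuild the standardized data so this block runs on its own.
final.data <- cbind(as.data.frame(scale(iris[, -5])), iris[5])

set.seed(101)
split.flag <- sample.split(final.data$Species, SplitRatio = 0.70)
train <- subset(final.data, split.flag == TRUE)
test <- subset(final.data, split.flag == FALSE)

# Row counts of each partition.
print(c(train = nrow(train), test = nrow(test)))

# Class counts per partition; sample.split() aims to preserve class ratios.
print(table(train$Species))
print(table(test$Species))
```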

#Build a KNN model

library(class)

Running the model on the test set

predicted.species <- knn(train[1:4] , test[1:4] , train$Species , k=1)

#predicted species

(predicted.species)
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     versicolor versicolor versicolor
## [19] versicolor versicolor virginica  versicolor versicolor versicolor
## [25] versicolor versicolor virginica  versicolor versicolor versicolor
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica  virginica  virginica 
## [43] virginica  virginica  virginica 
## Levels: setosa versicolor virginica
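
For intuition about what `knn()` does with `k=1`, the sketch below labels each test row with the class of its closest training row under Euclidean distance, which is the distance `class::knn()` is documented to use. It is an illustration, not the package's implementation. The helper `nearest.neighbour` is hypothetical, and the split is a plain base-R random split rather than the `caTools` split above, so the predictions will not match the printed output.

```r
library(class)  # knn(); a recommended package bundled with most R installations

# Illustrative split in base R (NOT the caTools split used in the walkthrough).
set.seed(101)
scaled <- as.data.frame(scale(iris[, -5]))
train.idx <- sample(nrow(iris), size = 105)
train.x <- as.matrix(scaled[train.idx, ])
test.x <- as.matrix(scaled[-train.idx, ])
train.y <- iris$Species[train.idx]

# Hypothetical helper: label each test row with the class of its single
# nearest training row under (squared) Euclidean distance.
nearest.neighbour <- function(train.x, test.x, train.y) {
  nearest <- apply(test.x, 1, function(row) {
    squared.dist <- rowSums(sweep(train.x, 2, row, "-")^2)
    which.min(squared.dist)
  })
  train.y[nearest]
}

manual.pred <- nearest.neighbour(train.x, test.x, train.y)
knn.pred <- knn(train.x, test.x, train.y, k = 1)

# Share of test rows where the hand-rolled version agrees with class::knn().
print(mean(manual.pred == knn.pred))
```

The two can disagree when several training rows are exactly equidistant from a test row, since the tie-handling differs. For larger `k`, the prediction becomes a majority vote among the `k` closest training rows.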

#Misclassification error

mean(predicted.species != test$Species)
## [1] 0.04444444
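
An error of 0.0444 corresponds to 2 misclassified rows out of the 45 in the test set. To see which classes get confused, a confusion matrix can be sketched as below. The block repeats the earlier steps so it runs on its own and assumes `caTools` and `class` are installed. The names `split.flag`, `confusion` and `error.from.table` are introduced for illustration.

```r
library(caTools)
library(class)

# Repeat the preprocessing and split from the walkthrough.
final.data <- cbind(as.data.frame(scale(iris[, -5])), iris[5])
set.seed(101)
split.flag <- sample.split(final.data$Species, SplitRatio = 0.70)
train <- subset(final.data, split.flag == TRUE)
test <- subset(final.data, split.flag == FALSE)

predicted.species <- knn(train[1:4], test[1:4], train$Species, k = 1)

# Rows are predictions, columns are the true labels.
confusion <- table(predicted = predicted.species, actual = test$Species)
print(confusion)

# Off-diagonal counts are the mistakes; their share equals the error rate.
error.from.table <- 1 - sum(diag(confusion)) / sum(confusion)
print(error.from.table)
```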

#choosing a k value

#initializing predicted.species and error.rate to NULL
predicted.species <- NULL
error.rate <- NULL

for(i in 1:10)
  {
  set.seed(101)
  predicted.species <- knn(train[1:4],test[1:4],train$Species,k=i)
  error.rate[i] <- mean(predicted.species!=test$Species)
  }

#Creating a dataframe

library(ggplot2)
k.values <- 1:10
error.df <- data.frame(error.rate,k.values)

#Plotting the error rate against k

ggplot(error.df,aes(k.values,error.rate))+geom_point()+geom_line(lty='dotted',color='red')

You should notice that the error is lowest for k values between 2 and 6. It then jumps back up, which is due to how small the data set is. At k=10, k is already about 10% of the 105 training rows, which is quite large.
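
Besides reading the plot, the k values with the lowest test error can be pulled out programmatically. The following is a minimal sketch under the same setup as above (it repeats the split so it runs on its own and assumes `caTools` and `class` are installed). It uses `sapply()` instead of growing a vector in a loop, and the names `split.flag` and `best.k` are introduced for illustration.

```r
library(caTools)
library(class)

# Repeat the preprocessing and split from the walkthrough.
final.data <- cbind(as.data.frame(scale(iris[, -5])), iris[5])
set.seed(101)
split.flag <- sample.split(final.data$Species, SplitRatio = 0.70)
train <- subset(final.data, split.flag == TRUE)
test <- subset(final.data, split.flag == FALSE)

# Test error for k = 1..10; the seed is reset per k, mirroring the loop above.
k.values <- 1:10
error.rate <- sapply(k.values, function(k) {
  set.seed(101)
  predicted <- knn(train[1:4], test[1:4], train$Species, k = k)
  mean(predicted != test$Species)
})
error.df <- data.frame(k.values, error.rate)
print(error.df)

# All k values that tie for the lowest error rate.
best.k <- error.df$k.values[error.df$error.rate == min(error.df$error.rate)]
print(best.k)
```

With only 45 test rows, each misclassification moves the error rate by about 0.022, so small differences between neighbouring k values should be read with caution.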