KNN project
In this project, we will work with the UCI Iris dataset and attempt to predict the class (species) of the iris plant.
Data Set Information: This is perhaps the best known database to be found in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
Predicted attribute: class of iris plant.
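Attribute Information: sepal length, sepal width, petal length, and petal width (all in cm), plus the class (setosa, versicolor, or virginica).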
#Importing libraries (iris ships with base R's datasets package, so ISLR is not required for it)
library(ISLR)
#head of iris dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str() on the iris dataset
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Standardize Data
standard.iris <- scale(iris[, -5])
Checking the variance of the standardized columns (each should be 1)
var(standard.iris[,1])
## [1] 1
var(standard.iris[,2])
## [1] 1
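To make explicit what `scale()` does here, the following minimal sketch recomputes the z-scores by hand, (x - mean) / sd for each column, and checks that they agree with `scale()`. The names `features`, `col.means`, `col.sds`, and `manual.iris` are introduced only for this illustration.

```r
# Standardize by hand: subtract each column mean, divide by each column sd
features <- as.matrix(iris[, -5])
col.means <- colMeans(features)
col.sds <- apply(features, 2, sd)
manual.iris <- sweep(sweep(features, 2, col.means, "-"), 2, col.sds, "/")

# scale() with its default arguments should give the same numbers
standard.iris <- scale(iris[, -5])
all.equal(manual.iris, standard.iris, check.attributes = FALSE)

# Every standardized column has mean (numerically) 0 and variance 1
round(colMeans(standard.iris), 10)
apply(standard.iris, 2, var)
```

Standardizing matters for KNN because the distance between two points is otherwise dominated by whichever feature has the largest numeric range.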
final.data <- cbind(standard.iris,iris[5])
#head of final.data
head(final.data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
## 2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
## 3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
## 4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
## 5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
## 6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
Train and test
library(caTools)
#Stratified 70/30 train/test split on Species
set.seed(101)
is.train <- sample.split(final.data$Species, SplitRatio = 0.70)
train <- subset(final.data, is.train == TRUE)
test <- subset(final.data, is.train == FALSE)
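`sample.split()` keeps the class proportions of `Species` in both subsets. As a rough illustration of that idea, here is a base-R sketch of a stratified 70/30 split. It is not caTools' actual implementation and will not select the same rows; the names `rows.by.class`, `train.idx`, `train.sketch`, and `test.sketch` are hypothetical.

```r
# Stratified 70/30 split in base R (illustrative; not caTools' implementation)
set.seed(101)
final.data <- cbind(as.data.frame(scale(iris[, -5])), iris[5])

# For each species, draw 70% of its row indices for training
rows.by.class <- split(seq_len(nrow(final.data)), final.data$Species)
train.idx <- unlist(lapply(rows.by.class, function(idx) {
  sample(idx, size = round(0.70 * length(idx)))
}))

train.sketch <- final.data[train.idx, ]
test.sketch <- final.data[-train.idx, ]

# Each class contributes 35 training rows and 15 test rows
table(train.sketch$Species)
table(test.sketch$Species)
```

With 50 rows per class, this yields 105 training rows and 45 test rows, the same sizes as the split used in this project.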
#Build a KNN model
library(class)
Fit the model and predict on the test set
predicted.species <- knn(train[1:4] , test[1:4] , train$Species , k=1)
#Predicted species
(predicted.species)
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa versicolor versicolor versicolor
## [19] versicolor versicolor virginica versicolor versicolor versicolor
## [25] versicolor versicolor virginica versicolor versicolor versicolor
## [31] virginica virginica virginica virginica virginica virginica
## [37] virginica virginica virginica virginica virginica virginica
## [43] virginica virginica virginica
## Levels: setosa versicolor virginica
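For intuition about what `knn()` computes, here is a minimal from-scratch sketch: for each test point, measure the Euclidean distance to every training row, take the k closest, and return the majority label. The function `knn.sketch` is hypothetical and written only for illustration. It breaks voting ties by taking the first factor level, whereas `class::knn` breaks them at random, so the two can disagree on tied votes.

```r
# Minimal k-nearest-neighbours classifier (illustrative sketch)
knn.sketch <- function(train.x, test.x, labels, k = 1) {
  train.x <- as.matrix(train.x)
  test.x <- as.matrix(test.x)
  predictions <- apply(test.x, 1, function(point) {
    # Euclidean distance from this test point to every training row
    distances <- sqrt(rowSums(sweep(train.x, 2, point, "-")^2))
    # Labels of the k closest training rows
    nearest <- labels[order(distances)[seq_len(k)]]
    # Majority vote; which.max takes the first level on a tie
    names(which.max(table(nearest)))
  })
  factor(predictions, levels = levels(labels))
}

# Usage on standardized iris features with a simple random split
# (not the caTools split, so the error rate may differ from the text)
x <- scale(iris[, -5])
set.seed(101)
train.rows <- sample(nrow(x), 105)
pred <- knn.sketch(x[train.rows, ], x[-train.rows, ],
                   iris$Species[train.rows], k = 3)
mean(pred != iris$Species[-train.rows])
```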
#Misclassification error
mean(predicted.species != test$Species)
## [1] 0.04444444
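An error rate of 0.0444 corresponds to 2 of the 45 test rows. In the printed predictions above, both appear to be `virginica` predictions inside the versicolor block (positions 21 and 27). A confusion matrix shows this breakdown directly. The sketch below rebuilds a comparable split in base R so that it runs on its own; it is not the caTools split, so its counts may differ from the output above. The names `pred`, `actual`, and `confusion` are assumptions for this example.

```r
library(class)

# Rebuild a comparable 105/45 split (simple random, not stratified)
set.seed(101)
x <- scale(iris[, -5])
train.rows <- sample(nrow(x), 105)
pred <- knn(x[train.rows, ], x[-train.rows, ], iris$Species[train.rows], k = 1)
actual <- iris$Species[-train.rows]

# Rows are predictions, columns are true classes
confusion <- table(predicted = pred, actual = actual)
confusion

# Overall error = 1 - (diagonal / total); equals mean(pred != actual)
1 - sum(diag(confusion)) / sum(confusion)
```

In the project's own session, `table(predicted.species, test$Species)` gives the same kind of table for the actual split.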
#Choosing a k value
#Initializing predicted.species and error.rate to NULL
predicted.species <- NULL
error.rate <- NULL
for(i in 1:10)
{
set.seed(101)
predicted.species <- knn(train[1:4],test[1:4],train$Species,k=i)
error.rate[i] <- mean(predicted.species!=test$Species)
}
#Creating a dataframe
library(ggplot2)
k.values <- 1:10
error.df <- data.frame(error.rate,k.values)
#Plotting the error rate against k
ggplot(error.df,aes(k.values,error.rate))+geom_point()+geom_line(lty='dotted',color='red')
You should notice that the error is lowest for k values between 2 and 6 and then climbs again. This instability reflects how small the data set is: the test set has only 45 observations, so a single misclassification moves the error rate by about 2.2 percentage points. At k=10, k is already close to 10% of the 105 training observations, which is quite large.
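Because a 45-row test set gives a noisy error curve, one possible complement is leave-one-out cross-validation, which uses all 150 rows for evaluation. The sketch below uses `knn.cv()` from the `class` package; it is an alternative way to compare k values, not the method used in this project, and `loocv.error` is a name introduced here.

```r
library(class)

x <- scale(iris[, -5])
k.values <- 1:10

# knn.cv() classifies each row using all the other rows (leave-one-out)
set.seed(101)  # voting ties are broken at random
loocv.error <- sapply(k.values, function(k) {
  mean(knn.cv(x, iris$Species, k = k) != iris$Species)
})

data.frame(k.values, loocv.error)
```

Each misclassification now changes the error by 1/150 rather than 1/45, so differences between neighbouring k values are somewhat easier to interpret.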