k-Nearest Neighbours (k-NN) is a simple and effective approach that is well suited to classification tasks.
k-NN classifies an unlabelled example by using information about its k nearest labelled neighbours.
Choosing k: common practice is to begin with k equal to the square root of the number of training samples.
k is usually chosen to be an odd number so that ties can be broken.
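As a quick sketch of that rule of thumb (the names n_train and k_guess are ours for illustration, not part of the analysis below):
n_train <- 600                                  # size of the training set we build later
k_guess <- round(sqrt(n_train))                 # square-root rule of thumb: ~24
if (k_guess %% 2 == 0) k_guess <- k_guess + 1   # nudge even values to odd to help break ties
k_guess                                         # 25, the value used below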
The k-NN algorithm uses Euclidean distance, the straight-line distance one would measure if it were possible to connect the two points with a ruler.
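For example, a minimal sketch of the Euclidean distance between two feature vectors (the helper euclidean_dist and the example points are ours for illustration):
# Euclidean distance: square root of the summed squared coordinate differences
euclidean_dist <- function(a, b) sqrt(sum((a - b)^2))
euclidean_dist(c(1, 1, 1, 1), c(1, 1, 1, 2))    # vectors differing by 1 in one feature -> distance 1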
Dataset: Balance Scale Data Set, http://archive.ics.uci.edu/ml/datasets/Balance+Scale
# Let's read the dataset
balanceCSV <- read.csv("balance-scale.csv")
head(balanceCSV)
## Class LWght LDist RWght RDist
## 1 2 1 1 1 1
## 2 3 1 1 1 2
## 3 3 1 1 1 3
## 4 3 1 1 1 4
## 5 3 1 1 1 5
## 6 3 1 1 2 1
Attribute information for the Balance Scale Data Set:
• Class Name: 3 values (L, B, R), coded as the integers 1, 2, 3 in this CSV
• Left-Weight: 5 values (1, 2, 3, 4, 5)
• Left-Distance: 5 values (1, 2, 3, 4, 5)
• Right-Weight: 5 values (1, 2, 3, 4, 5)
• Right-Distance: 5 values (1, 2, 3, 4, 5)
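Because the class appears here as an integer code rather than L/B/R, it can be useful to check how the classes are distributed before modelling (a minimal sketch, assuming the class column is named Class as in the output above):
# Frequency of each class code in the full dataset
table(balanceCSV$Class)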
## The Structure of the dataset:
str(balanceCSV)
## 'data.frame': 625 obs. of 5 variables:
## $ Class: int 2 3 3 3 3 3 3 3 3 3 ...
## $ LWght: int 1 1 1 1 1 1 1 1 1 1 ...
## $ LDist: int 1 1 1 1 1 1 1 1 1 1 ...
## $ RWght: int 1 1 1 1 1 2 2 2 2 2 ...
## $ RDist: int 1 2 3 4 5 1 2 3 4 5 ...
## The Summary for the complete dataset
summary(balanceCSV)
## Class LWght LDist RWght RDist
## Min. :1 Min. :1 Min. :1 Min. :1 Min. :1
## 1st Qu.:1 1st Qu.:2 1st Qu.:2 1st Qu.:2 1st Qu.:2
## Median :2 Median :3 Median :3 Median :3 Median :3
## Mean :2 Mean :3 Mean :3 Mean :3 Mean :3
## 3rd Qu.:3 3rd Qu.:4 3rd Qu.:4 3rd Qu.:4 3rd Qu.:4
## Max. :3 Max. :5 Max. :5 Max. :5 Max. :5
If all of the variables in the summary above share the same range, normalization is not strictly necessary; otherwise we should rescale them.
Here the four predictors all run from 1 to 5, so we normalize only for learning purposes.
# Min-max normalization: rescale a numeric vector to the [0, 1] range
normalize <- function(y) {
  return((y - min(y)) / (max(y) - min(y)))
}
# Normalize the four predictors (columns 2-5); column 1 is the class label
balanceCSV_normalized <- as.data.frame(lapply(balanceCSV[2:5], normalize))
summary(balanceCSV_normalized)
## LWght LDist RWght RDist
## Min. :0.00 Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.25 1st Qu.:0.25 1st Qu.:0.25 1st Qu.:0.25
## Median :0.50 Median :0.50 Median :0.50 Median :0.50
## Mean :0.50 Mean :0.50 Mean :0.50 Mean :0.50
## 3rd Qu.:0.75 3rd Qu.:0.75 3rd Qu.:0.75 3rd Qu.:0.75
## Max. :1.00 Max. :1.00 Max. :1.00 Max. :1.00
# Shuffle the rows for a random train/test split; we keep the un-normalized data,
# since all four predictors already share the same 1-5 scale
balanceDF <- balanceCSV[order(runif(nrow(balanceCSV))),]
tail(balanceDF,25)
## Class LWght LDist RWght RDist
## 75 3 1 3 5 5
## 420 3 4 2 4 5
## 318 3 3 3 4 3
## 266 3 3 1 4 1
## 140 3 2 1 3 5
## 573 2 5 3 5 3
## 69 3 1 3 4 4
## 419 3 4 2 4 4
## 438 1 4 3 3 3
## 399 3 4 1 5 4
## 562 1 5 3 3 2
## 548 3 5 2 5 3
## 583 1 5 4 2 3
## 237 1 2 5 3 2
## 74 3 1 3 5 4
## 456 1 4 4 2 1
## 571 1 5 3 5 1
## 510 3 5 1 2 5
## 38 3 1 2 3 3
## 342 1 3 4 4 2
## 528 1 5 2 1 3
## 409 2 4 2 2 4
## 304 1 3 3 1 4
## 494 1 4 5 4 4
## 213 3 2 4 3 3
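Because the shuffle above relies on runif() without a fixed seed, the row order (and therefore the exact numbers below) will change on every run. A minimal sketch of a reproducible alternative (the seed value 123 is arbitrary):
set.seed(123)                                        # fix the random number generator
balanceDF <- balanceCSV[sample(nrow(balanceCSV)), ]  # same shuffle, now reproducible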
balanceDF_train <- balanceDF[1:600,]    # first 600 shuffled rows for training
balanceDF_test <- balanceDF[601:625,]   # remaining 25 rows for testing
nrow(balanceDF_train)
## [1] 600
nrow(balanceDF_test)
## [1] 25
require(class)
## Loading required package: class
sqrt(nrow(balanceDF))   # rule of thumb: start with k near the square root of the number of examples
## [1] 25
balanceDF_train_labels <- balanceDF_train[,1]   # class labels of the training rows
balanceDF_test_labels <- balanceDF_test[,1]     # class labels of the test rows
balanceDF_test_labels
## [1] 3 3 3 3 3 2 3 3 1 3 1 3 1 1 3 1 1 3 3 1 1 2 1 1 3
# Classify the test rows using only the four predictors (drop the class column from both sets)
knn_model <- knn(train = balanceDF_train[, -1], test = balanceDF_test[, -1], cl = balanceDF_train_labels, k = 25)
summary(knn_model)
## 1 2 3
## 11 0 14
table(balanceDF_test_labels, knn_model)   # confusion matrix: true labels (rows) vs predictions (columns)
## knn_model
## balanceDF_test_labels 1 2 3
## 1 10 0 0
## 2 1 0 1
## 3 0 0 13
Running our model on the test dataset, the confusion matrix shows that 23 of the 25 test examples are classified correctly; the only two errors occur for class 2, which is also the rarest class in this test set.
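For completeness, the overall accuracy can be computed directly from the predictions (a minimal sketch using the objects created above):
# Proportion of test examples whose predicted class matches the true label
mean(knn_model == balanceDF_test_labels)   # 23/25 = 0.92 for the run shown above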