K Nearest Neighbour

K-Nearest Neighbours (K-NN) is a simple and effective approach that is well suited to classification tasks.
K-NN classifies an unlabelled example using the classes of its k nearest labelled neighbours.
Choosing k: common practice is to begin with k equal to the square root of the number of training samples.
k is usually chosen to be odd so that ties can be broken by a majority vote.
K-NN typically uses Euclidean distance, the straight-line distance one would measure by connecting two points with a ruler.
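
As a quick illustration of that distance (the vectors below are made up for the example and are not part of the tutorial's data), it can be computed directly in R:

# Hypothetical feature vectors, only to illustrate the distance K-NN relies on
p <- c(1, 3, 5, 2)
q <- c(2, 1, 4, 4)
sqrt(sum((p - q)^2))
## [1] 3.162278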

Download the dataset from

http://archive.ics.uci.edu/ml/datasets/Balance+Scale

# Let's read the dataset
balanceCSV <- read.csv("balance-scale.csv")
head(balanceCSV)
##   Class LWght LDist RWght RDist
## 1     2     1     1     1     1
## 2     3     1     1     1     2
## 3     3     1     1     1     3
## 4     3     1     1     1     4
## 5     3     1     1     1     5
## 6     3     1     1     2     1

Attribute information for the Balance Scale Data Set:

•Class Name: 3 values (L, B, R)
•Left-Weight: 5 values (1, 2, 3, 4, 5)
•Left-Distance: 5 values (1, 2, 3, 4, 5)
•Right-Weight: 5 values (1, 2, 3, 4, 5)
•Right-Distance: 5 values (1, 2, 3, 4, 5)

Note that in the CSV read above the class name has already been encoded as an integer (1, 2, 3).

## The structure of the dataset:
str(balanceCSV)
## 'data.frame':    625 obs. of  5 variables:
##  $ Class: int  2 3 3 3 3 3 3 3 3 3 ...
##  $ LWght: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ LDist: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RWght: int  1 1 1 1 1 2 2 2 2 2 ...
##  $ RDist: int  1 2 3 4 5 1 2 3 4 5 ...
## The summary of the complete dataset
summary(balanceCSV)
##      Class       LWght       LDist       RWght       RDist
##  Min.   :1   Min.   :1   Min.   :1   Min.   :1   Min.   :1  
##  1st Qu.:1   1st Qu.:2   1st Qu.:2   1st Qu.:2   1st Qu.:2  
##  Median :2   Median :3   Median :3   Median :3   Median :3  
##  Mean   :2   Mean   :3   Mean   :3   Mean   :3   Mean   :3  
##  3rd Qu.:3   3rd Qu.:4   3rd Qu.:4   3rd Qu.:4   3rd Qu.:4  
##  Max.   :3   Max.   :5   Max.   :5   Max.   :5   Max.   :5

Use the normalize function below if the ranges of the feature values differ widely.

The summary above shows that every feature here already spans the same 1 to 5 range, so normalization is not strictly needed; we apply it anyway purely for learning purposes.

normalize <- function(y) {return((y - min(y)) / (max(y) - min(y)))}
balanceCSV_normalized <- as.data.frame(lapply(balanceCSV[2:5], normalize))
summary(balanceCSV_normalized)
##      LWght          LDist          RWght          RDist     
##  Min.   :0.00   Min.   :0.00   Min.   :0.00   Min.   :0.00  
##  1st Qu.:0.25   1st Qu.:0.25   1st Qu.:0.25   1st Qu.:0.25  
##  Median :0.50   Median :0.50   Median :0.50   Median :0.50  
##  Mean   :0.50   Mean   :0.50   Mean   :0.50   Mean   :0.50  
##  3rd Qu.:0.75   3rd Qu.:0.75   3rd Qu.:0.75   3rd Qu.:0.75  
##  Max.   :1.00   Max.   :1.00   Max.   :1.00   Max.   :1.00
# Shuffle the rows so that the train/test split below is random
balanceDF <- balanceCSV[order(runif(nrow(balanceCSV))), ]
tail(balanceDF,25)
##     Class LWght LDist RWght RDist
## 75      3     1     3     5     5
## 420     3     4     2     4     5
## 318     3     3     3     4     3
## 266     3     3     1     4     1
## 140     3     2     1     3     5
## 573     2     5     3     5     3
## 69      3     1     3     4     4
## 419     3     4     2     4     4
## 438     1     4     3     3     3
## 399     3     4     1     5     4
## 562     1     5     3     3     2
## 548     3     5     2     5     3
## 583     1     5     4     2     3
## 237     1     2     5     3     2
## 74      3     1     3     5     4
## 456     1     4     4     2     1
## 571     1     5     3     5     1
## 510     3     5     1     2     5
## 38      3     1     2     3     3
## 342     1     3     4     4     2
## 528     1     5     2     1     3
## 409     2     4     2     2     4
## 304     1     3     3     1     4
## 494     1     4     5     4     4
## 213     3     2     4     3     3
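
Because runif() draws random numbers, the shuffled order shown above (and therefore the exact split and results that follow) will change on every run. If reproducibility matters, a seed can be fixed first; a minimal sketch (the seed value 123 is arbitrary and not part of the original):

set.seed(123)  # any fixed seed makes the shuffle, and hence the split, repeatable
balanceDF <- balanceCSV[order(runif(nrow(balanceCSV))), ]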

Splitting the data into training and test sets

balanceDF_train <- balanceDF[1:600,]
balanceDF_test <- balanceDF[601:625,]

nrow(balanceDF_train)
## [1] 600
nrow(balanceDF_test)
## [1] 25

Let's find a value for k

require(class)
## Loading required package: class
sqrt(nrow(balanceDF))
## [1] 25
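
Here sqrt(625) is exactly 25, which is already odd. In general the square root will not be a whole number, so a small sketch like the one below (not part of the original code) rounds it and nudges it to an odd value:

k <- round(sqrt(nrow(balanceDF)))
if (k %% 2 == 0) k <- k + 1   # prefer an odd k to reduce the chance of tied votes
k
## [1] 25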

Extracting the class labels into separate vectors

balanceDF_train_labels <- balanceDF_train[,1]
balanceDF_test_labels <- balanceDF_test[,1]
balanceDF_test_labels
##  [1] 3 3 3 3 3 2 3 3 1 3 1 3 1 1 3 1 1 3 3 1 1 2 1 1 3

The knn() function from the class package takes the form knn(train = , test = , cl = , k = ):

knn_model <- knn(train = balanceDF_train, test = balanceDF_test, cl = balanceDF_train_labels, k = 25)
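
Note that the call above passes the full data frames, so the class column (column 1) is included among the features. A stricter sketch would keep only the four measurement columns as predictors (the name knn_model_strict is just illustrative; the results reported below come from the original call):

# Same call, but using only columns 2-5 (the predictors) as features
knn_model_strict <- knn(train = balanceDF_train[, 2:5],
                        test  = balanceDF_test[, 2:5],
                        cl    = balanceDF_train_labels,
                        k     = 25)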

Summary of our KNN Model

summary(knn_model)
##  1  2  3 
## 11  0 14

Testing the model on the test dataset

table(balanceDF_test_labels,knn_model)
##                      knn_model
## balanceDF_test_labels  1  2  3
##                     1 10  0  0
##                     2  1  0  1
##                     3  0  0 13
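
A quick way to turn this table into an overall accuracy is to compare the predictions with the true labels directly; for the matrix above this works out to 23 correct out of 25, i.e. 0.92:

# Proportion of test examples classified correctly
mean(as.character(knn_model) == as.character(balanceDF_test_labels))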

Running the model on the held-out test set validates it: the confusion matrix above shows that 23 of the 25 test examples were classified correctly, with only the two class-2 examples misclassified.