Example 1 of K Nearest Neighbour Algorithm

The first example applies the kNN algorithm to the Smarket data set that ships with the ISLR library. It is the same example used in the book “An Introduction to Statistical Learning”.

Load the ISLR library and attach the Smarket data set to the workspace.

library(ISLR)
head(Smarket)

##   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1 2001  0.381 -0.192 -2.624 -1.055  5.010  1.191  0.959        Up
## 2 2001  0.959  0.381 -0.192 -2.624 -1.055  1.296  1.032        Up
## 3 2001  1.032  0.959  0.381 -0.192 -2.624  1.411 -0.623      Down
## 4 2001 -0.623  1.032  0.959  0.381 -0.192  1.276  0.614        Up
## 5 2001  0.614 -0.623  1.032  0.959  0.381  1.206  0.213        Up
## 6 2001  0.213  0.614 -0.623  1.032  0.959  1.349  1.392        Up

attach(Smarket)

The kNN algorithm will use the Lag1 and Lag2 features to predict the direction of the market. In order to evaluate performance, the observations before 2005 are used for training and the observations from 2005 for testing.

train <- Year < 2005
test <- !train

#kNN is distance-based, so columns should be standardized when features are measured on different scales

stan_data <- scale(Smarket[,c(2,3)]) ##Here 2,3 in the vector represent Lag1 and Lag2 features

#Standardized variables have mean 0 and standard deviation 1. Check the variance before and after standardization
var(Smarket[,2])

## [1] 1.291

var(Smarket[,3])

## [1] 1.291

var(stan_data[,1])

## [1] 1

var(stan_data[,2])

## [1] 1

kNN works best with standardized features, because the distances it relies on are otherwise dominated by the features with the largest scales. The scale() function standardizes the variables so that each has mean 0 and standard deviation 1, which is why var() returns 1 for a standardized feature. var() is called on the columns before and after standardization to compare the results.
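
What scale() does can be reproduced by hand. The following is a minimal sketch on a small made-up matrix (the values and the names x, manual and auto are illustrative, not taken from Smarket): each column has its mean subtracted and is then divided by its standard deviation.

```r
# Toy matrix with two columns on very different scales (made-up values).
x <- cbind(a = c(1, 2, 3, 4, 5),
           b = c(100, 300, 200, 500, 400))

# Manual standardization: subtract the column mean, divide by the column sd.
manual <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(manual, 2, apply(x, 2, sd), "/")

# scale() with its defaults (center = TRUE, scale = TRUE) does the same thing.
auto <- scale(x)

# Compare the values only; scale() also attaches centering/scaling attributes.
all.equal(manual, auto, check.attributes = FALSE)

# After standardization every column has mean 0 and variance 1.
round(colMeans(auto), 10)
apply(auto, 2, var)
```

Without this step, a column such as b would dominate any distance computed between two rows.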

Now split the data into training and test sets for kNN.

train.X <- stan_data[train,]
test.X <- stan_data[test,]
train.Direction <- Direction[train]
test.Direction <- Direction[test]
#A seed must be set to get reproducible results, because knn() breaks ties at random
set.seed(1)

To run the kNN algorithm, call the knn() function, which is available in the “class” library.

library(class)
knn.pred <- knn(train=train.X,test = test.X,cl = train.Direction,k = 3)
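
To make the idea behind knn() concrete, here is a simplified sketch of how a single test point could be classified: compute the Euclidean distance to every training point, take the k closest, and let them vote. The function knn_one and the toy data are hypothetical illustrations. The real knn() differs in implementation details and breaks ties at random, whereas this sketch simply takes the first most frequent label.

```r
# Classify one query point by majority vote among its k nearest neighbours.
# knn_one is a hypothetical helper for illustration, not part of the class package.
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train, 2, query, "-")^2))
  # Labels of the k closest training points
  nearest <- labels[order(d)[seq_len(k)]]
  # Most frequent label among them (ties go to the first level, unlike knn())
  names(which.max(table(nearest)))
}

# Made-up training data: two well-separated groups in two dimensions.
toy_train  <- rbind(c(0, 0), c(0, 1), c(1, 0),
                    c(5, 5), c(5, 6), c(6, 5))
toy_labels <- factor(c("Down", "Down", "Down", "Up", "Up", "Up"))

knn_one(toy_train, toy_labels, query = c(0.5, 0.5), k = 3)
knn_one(toy_train, toy_labels, query = c(5.5, 5.5), k = 3)
```

The first query sits among the "Down" points and the second among the "Up" points, so each is assigned the label of its surrounding group.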

The next step is to evaluate the performance of the model. A simple table() call would do, but here the CrossTable() function from the gmodels library is used.

library(gmodels)
CrossTable(x=test.Direction,y=knn.pred,prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  252 
## 
##  
##                | knn.pred 
## test.Direction |      Down |        Up | Row Total | 
## ---------------|-----------|-----------|-----------|
##           Down |        48 |        63 |       111 | 
##                |     0.432 |     0.568 |     0.440 | 
##                |     0.466 |     0.423 |           | 
##                |     0.190 |     0.250 |           | 
## ---------------|-----------|-----------|-----------|
##             Up |        55 |        86 |       141 | 
##                |     0.390 |     0.610 |     0.560 | 
##                |     0.534 |     0.577 |           | 
##                |     0.218 |     0.341 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       103 |       149 |       252 | 
##                |     0.409 |     0.591 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

#Proportion of test observations that are predicted correctly
mean(knn.pred == test.Direction)

## [1] 0.5317
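
The same figure can be read off the cross table, where the correct predictions sit on the diagonal. As a quick check, the counts printed above are typed in by hand here (the matrix tab is only for illustration):

```r
# Counts copied from the k = 3 cross table (rows: actual, columns: predicted).
tab <- matrix(c(48, 63,
                55, 86),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("Down", "Up"),
                              predicted = c("Down", "Up")))

# Accuracy = correct predictions (diagonal) / all predictions.
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 4)   # 134 / 252, which rounds to 0.5317
```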

With k=3, about 53% of the predictions are correct. Let us try increasing the value of k.

set.seed(2)
knn.pred <- knn(train=train.X,test = test.X,cl = train.Direction,k = 10)

As before, CrossTable() is used to evaluate the model.

CrossTable(x=test.Direction,y=knn.pred,prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  252 
## 
##  
##                | knn.pred 
## test.Direction |      Down |        Up | Row Total | 
## ---------------|-----------|-----------|-----------|
##           Down |        46 |        65 |       111 | 
##                |     0.414 |     0.586 |     0.440 | 
##                |     0.418 |     0.458 |           | 
##                |     0.183 |     0.258 |           | 
## ---------------|-----------|-----------|-----------|
##             Up |        64 |        77 |       141 | 
##                |     0.454 |     0.546 |     0.560 | 
##                |     0.582 |     0.542 |           | 
##                |     0.254 |     0.306 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       110 |       142 |       252 | 
##                |     0.437 |     0.563 |           | 
## ---------------|-----------|-----------|-----------|
## 
##

#Proportion of test observations that are predicted correctly
mean(knn.pred == test.Direction)

## [1] 0.4881

With k=10 the correct prediction rate drops to about 49%, so increasing k did not help.
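
Rather than trying values of k one at a time, several values can be compared in a loop. The sketch below assumes the class package is installed and uses made-up data so that it runs on its own; the helper knn_accuracy and the toy objects are hypothetical. With the objects from this example, train.X, test.X, train.Direction and test.Direction would be passed instead.

```r
library(class)

# Hypothetical helper: test-set accuracy of knn() for one value of k.
knn_accuracy <- function(train_x, test_x, train_y, test_y, k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
}

# Made-up data: two standardized features and a random two-level label.
set.seed(1)
toy_x <- scale(matrix(rnorm(400), ncol = 2))
toy_y <- factor(sample(c("Down", "Up"), 200, replace = TRUE))
in_train <- seq_len(200) <= 150   # first 150 rows for training

# Accuracy for several odd values of k (odd k makes voting ties unlikely
# when there are only two classes).
ks <- c(1, 3, 5, 7, 9)
acc <- sapply(ks, function(k) {
  knn_accuracy(toy_x[in_train, ], toy_x[!in_train, ],
               toy_y[in_train], toy_y[!in_train], k)
})
data.frame(k = ks, accuracy = acc)
```

Because the toy labels are random, the accuracies here carry no meaning; the point is only the pattern of evaluating each k on the same held-out rows.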

Next, let us add more features to see whether the algorithm improves. All columns are used except Year, Today and Direction.

stand_data2 <- scale(Smarket[,2:7]) ##Columns 2 to 7 are Lag1 to Lag5 and Volume
var(stand_data2)

##             Lag1      Lag2     Lag3      Lag4      Lag5   Volume
## Lag1    1.000000 -0.026294 -0.01080 -0.002986 -0.005675  0.04091
## Lag2   -0.026294  1.000000 -0.02590 -0.010854 -0.003558 -0.04338
## Lag3   -0.010803 -0.025897  1.00000 -0.024051 -0.018808 -0.04182
## Lag4   -0.002986 -0.010854 -0.02405  1.000000 -0.027084 -0.04841
## Lag5   -0.005675 -0.003558 -0.01881 -0.027084  1.000000 -0.02200
## Volume  0.040910 -0.043383 -0.04182 -0.048414 -0.022002  1.00000
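
The off-diagonal entries in this output have a direct interpretation: the variance-covariance matrix of standardized columns is the correlation matrix of the original columns. A small check on made-up data (the matrix m is illustrative, not Smarket):

```r
# Made-up data: three numeric columns on different scales.
set.seed(1)
m <- cbind(u = rnorm(50),
           v = rnorm(50, mean = 10, sd = 5),
           w = runif(50, min = 0, max = 100))

# Covariance of the standardized columns ...
cov_scaled <- var(scale(m))
# ... equals the correlation of the original columns.
cor_orig <- cor(m)

all.equal(cov_scaled, cor_orig)
```

So the 1s on the diagonal confirm the standardization, and the small values elsewhere say that the lag and volume features are almost uncorrelated with each other.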

train.X2 <- stand_data2[train,]
test.X2 <- stand_data2[test,]
train.Direction2 <- Direction[train]
test.Direction2 <- Direction[test]

#Calling knn function
knn.pred <- knn(train = train.X2,test = test.X2,cl = train.Direction2,k = 3)

#Evaluating the performance of the model
CrossTable(x = knn.pred, y = test.Direction2, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  252 
## 
##  
##              | test.Direction2 
##     knn.pred |      Down |        Up | Row Total | 
## -------------|-----------|-----------|-----------|
##         Down |        44 |        55 |        99 | 
##              |     0.444 |     0.556 |     0.393 | 
##              |     0.396 |     0.390 |           | 
##              |     0.175 |     0.218 |           | 
## -------------|-----------|-----------|-----------|
##           Up |        67 |        86 |       153 | 
##              |     0.438 |     0.562 |     0.607 | 
##              |     0.604 |     0.610 |           | 
##              |     0.266 |     0.341 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       111 |       141 |       252 | 
##              |     0.440 |     0.560 |           | 
## -------------|-----------|-----------|-----------|
## 
##

#Proportion of test observations that are predicted correctly
mean(knn.pred == test.Direction2)

## [1] 0.5159

#Calling knn function with k=10
knn.pred <- knn(train = train.X2,test = test.X2,cl = train.Direction2,k = 10)

#Evaluating the performance of the model
CrossTable(x = knn.pred, y = test.Direction2, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  252 
## 
##  
##              | test.Direction2 
##     knn.pred |      Down |        Up | Row Total | 
## -------------|-----------|-----------|-----------|
##         Down |        36 |        55 |        91 | 
##              |     0.396 |     0.604 |     0.361 | 
##              |     0.324 |     0.390 |           | 
##              |     0.143 |     0.218 |           | 
## -------------|-----------|-----------|-----------|
##           Up |        75 |        86 |       161 | 
##              |     0.466 |     0.534 |     0.639 | 
##              |     0.676 |     0.610 |           | 
##              |     0.298 |     0.341 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       111 |       141 |       252 | 
##              |     0.440 |     0.560 |           | 
## -------------|-----------|-----------|-----------|
## 
##

#Proportion of test observations that are predicted correctly
mean(knn.pred == test.Direction2)

## [1] 0.4841

Adding more features did not bring much improvement either.
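
For context, these rates can be set against a naive baseline that always predicts the more frequent class. This comparison is not part of the original analysis; it is a small sketch that reuses the test-set class counts shown in the cross tables (111 Down, 141 Up) and the four accuracies printed in this example. The names test_counts, baseline and knn_acc are made up for illustration.

```r
# Test-set class counts, copied from the cross tables.
test_counts <- c(Down = 111, Up = 141)

# Always predicting the most frequent class gets this share right.
baseline <- max(test_counts) / sum(test_counts)
round(baseline, 4)

# Accuracies reported in this example, for comparison.
knn_acc <- c(lag12_k3 = 0.5317, lag12_k10 = 0.4881,
             all_k3 = 0.5159, all_k10 = 0.4841)
knn_acc > baseline
```

None of the four kNN runs beats the baseline of about 56%.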

Vijayakumar Jawaharlal

April 22, 2014