kNN example 1 - Normalizing features and applying resampling methods

This continues the improved version of example 1. In this script, I want to try various resampling methods to find the optimal k value. The following three resampling methods are used in this script:

Cross Validation
Bootstrap
Fixed training/validation sample

The e1071 package is used for the resampling methods. It has a set of tools for tuning parameters to get the best performance out of a model, with tuning support for svm, random forest, decision tree, knn, etc. Refer to the help pages of the e1071 package.

The tune.knn() function is used for resampling. tune.control() selects the sampling scheme, and summary() and plot() show the results from tune.knn().

normalize <- function(x) {
    norm <- ((x - min(x))/(max(x) - min(x)))
    return (norm)
}


marketSample <- function(smarkt, test_pct = 20) {
    train_pct = 100 - test_pct
    #Seeding for reproducibility
    set.seed(test_pct)
    test_indx <- sample(nrow(smarkt), nrow(smarkt) * test_pct/100, replace=FALSE)
    test.all <- smarkt[test_indx,]
    
    test.data <- test.all[,-9]
    test.direction <- test.all$Direction
    
    #Prep. training data, which is nothing but the rows remaining after the test data is removed
    train.all <- smarkt[-test_indx,]
    train.data <- train.all[,-9]
    train.direction <- train.all$Direction
    
    rtrn_list <- list(testdata = test.data, testdir = test.direction, traindata = train.data, traindir = train.direction)
    
    return(rtrn_list)
    
}
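
One thing to watch with min-max scaling as written in normalize(): a column whose values are all identical has max(x) - min(x) equal to zero, so the division produces NaN. The Smarket columns do not have this problem, but a guarded variant could look like the sketch below. normalize_safe is a hypothetical name, not part of the original script, and it assumes the input has no missing values.

```r
# Hypothetical variant of normalize(): min-max scaling with a guard
# for constant columns, where max(x) - min(x) is zero.
# Assumes x is numeric with no missing values.
normalize_safe <- function(x) {
    rng <- range(x)
    if (rng[1] == rng[2]) {
        # A constant column carries no information; map it to 0
        return(x * 0)
    }
    (x - rng[1]) / (rng[2] - rng[1])
}

scaled   <- normalize_safe(c(2, 4, 6))   # 0.0 0.5 1.0
constant <- normalize_safe(c(5, 5, 5))   # 0 0 0 instead of NaN
```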

Data prep using normalize function

library(ISLR)
input.norml <- as.data.frame(lapply(Smarket[,-9],normalize))
input.norml$Direction <- Smarket$Direction

full.data <- input.norml[,-9]
full.dir <- input.norml[,9]

temp_list <- marketSample(input.norml, 20)
train.data <- temp_list$traindata
train.dir <- temp_list$traindir

test.data <- temp_list$testdata
test.dir <- temp_list$testdir

rm(temp_list)

Let us apply various resampling methods to find the optimal k.

Let us start with k-fold cross validation to find the optimal k value for knn().

library(e1071)
#The full data set can be used for cross validation
#cross is an argument of tune.control(); 10 folds is also its default
knn.cross <- tune.knn(x = full.data, y = full.dir, k = 1:20, tunecontrol = tune.control(sampling = "cross", cross = 10))
#Summarize the resampling results set
summary(knn.cross)

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##   k
##  19
## 
## - best performance: 0.1248 
## 
## - Detailed performance results:
##     k  error dispersion
## 1   1 0.1856    0.03180
## 2   2 0.2184    0.03524
## 3   3 0.1687    0.02999
## 4   4 0.1703    0.03122
## 5   5 0.1583    0.03067
## 6   6 0.1599    0.03427
## 7   7 0.1511    0.03273
## 8   8 0.1368    0.02053
## 9   9 0.1432    0.03268
## 10 10 0.1544    0.03372
## 11 11 0.1424    0.03966
## 12 12 0.1471    0.03280
## 13 13 0.1375    0.03457
## 14 14 0.1416    0.03347
## 15 15 0.1384    0.03056
## 16 16 0.1400    0.03759
## 17 17 0.1279    0.03740
## 18 18 0.1272    0.02638
## 19 19 0.1248    0.03225
## 20 20 0.1280    0.02410

#Plot the error rate 
plot(knn.cross)
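
To make clear what the error and dispersion columns above represent, here is a minimal sketch of k-fold cross validation written by hand with knn() from the class package. cv_error_knn is a hypothetical helper, not a function from e1071, and it assumes that the reported error is the mean of the per-fold misclassification rates and the dispersion is their standard deviation. The toy run uses the built-in iris data rather than Smarket so that the block runs on its own.

```r
library(class)  # provides knn()

# Hypothetical helper: k-fold cross-validated misclassification rate for kNN.
# x: data frame or matrix of numeric features, y: factor of class labels.
cv_error_knn <- function(x, y, k, folds = 10) {
    n <- nrow(x)
    # Randomly assign each row to one of `folds` groups of near-equal size
    fold_id <- sample(rep(seq_len(folds), length.out = n))
    fold_err <- sapply(seq_len(folds), function(f) {
        held_out <- fold_id == f
        pred <- knn(train = x[!held_out, , drop = FALSE],
                    test  = x[held_out, , drop = FALSE],
                    cl    = y[!held_out], k = k)
        # Misclassification rate on the held-out fold
        mean(pred != y[held_out])
    })
    c(error = mean(fold_err), dispersion = sd(fold_err))
}

# Toy run on the built-in iris data (not the Smarket data used above)
set.seed(1)
iris_cv <- cv_error_knn(iris[, 1:4], iris$Species, k = 5)
```

Calling cv_error_knn(full.data, full.dir, k = 19) should land in the neighborhood of the tune.knn() figure, though not on it exactly, because the fold assignment is random.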


Resampling using bootstrapping

#The full data set can be used for bootstrapping
knn.boot <- tune.knn(x = full.data, y = full.dir, k = 1:20, tunecontrol = tune.control(sampling = "boot"))
#Summarize the resampling results set
summary(knn.boot)

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: bootstrapping 
## 
## - best parameters:
##   k
##  20
## 
## - best performance: 0.1599 
## 
## - Detailed performance results:
##     k  error dispersion
## 1   1 0.2027   0.011899
## 2   2 0.2177   0.016510
## 3   3 0.2180   0.008659
## 4   4 0.2092   0.012496
## 5   5 0.1956   0.017534
## 6   6 0.1906   0.019819
## 7   7 0.1912   0.021306
## 8   8 0.1856   0.019959
## 9   9 0.1818   0.023124
## 10 10 0.1796   0.017639
## 11 11 0.1777   0.018028
## 12 12 0.1748   0.013825
## 13 13 0.1737   0.017660
## 14 14 0.1773   0.017987
## 15 15 0.1716   0.017996
## 16 16 0.1682   0.019212
## 17 17 0.1659   0.023116
## 18 18 0.1658   0.024365
## 19 19 0.1605   0.020610
## 20 20 0.1599   0.022703

#Plot the error rate 
plot(knn.boot)
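
The bootstrap variant can be sketched the same way. boot_error_knn below is a hypothetical helper; the choices of 10 replicates, a sample size of 9/10 of the rows, and scoring on the rows that were not drawn are my assumptions about what the default tune.control() settings do, so treat it as an illustration rather than a reimplementation. The toy run again uses iris.

```r
library(class)  # provides knn()

# Hypothetical helper: bootstrap estimate of the kNN misclassification rate.
# Each replicate trains on rows drawn with replacement and is scored on
# the rows that were never drawn (the out-of-bag rows).
boot_error_knn <- function(x, y, k, nboot = 10, boot_size = 0.9) {
    n <- nrow(x)
    rep_err <- replicate(nboot, {
        in_bag <- sample(n, size = floor(boot_size * n), replace = TRUE)
        oob <- setdiff(seq_len(n), in_bag)
        pred <- knn(train = x[in_bag, , drop = FALSE],
                    test  = x[oob, , drop = FALSE],
                    cl    = y[in_bag], k = k)
        mean(pred != y[oob])
    })
    c(error = mean(rep_err), dispersion = sd(rep_err))
}

# Toy run on the built-in iris data (not the Smarket data used above)
set.seed(1)
iris_boot <- boot_error_knn(iris[, 1:4], iris$Species, k = 5)
```

Drawing 0.9 * n rows with replacement leaves only about 59% of the distinct rows in each training sample (1 - exp(-0.9)), which may be part of why the bootstrap errors above sit higher than the cross-validation errors for the same k.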


Resampling using a fixed training/validation set

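#The full data set is split once into a training part and a validation part
#fix is the training fraction in tune.control(); 2/3 is also its default
knn.fix <- tune.knn(x = full.data, y = full.dir, k = 1:20, tunecontrol = tune.control(sampling = "fix", fix = 2/3))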
#Summarize the resampling results set
summary(knn.fix)

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: fixed training/validation set 
## 
## - best parameters:
##  k
##  7
## 
## - best performance: 0.1487 
## 
## - Detailed performance results:
##     k  error dispersion
## 1   1 0.1871         NA
## 2   2 0.2134         NA
## 3   3 0.1703         NA
## 4   4 0.1751         NA
## 5   5 0.1583         NA
## 6   6 0.1655         NA
## 7   7 0.1487         NA
## 8   8 0.1751         NA
## 9   9 0.1631         NA
## 10 10 0.1631         NA
## 11 11 0.1535         NA
## 12 12 0.1751         NA
## 13 13 0.1535         NA
## 14 14 0.1607         NA
## 15 15 0.1727         NA
## 16 16 0.1679         NA
## 17 17 0.1775         NA
## 18 18 0.1583         NA
## 19 19 0.1679         NA
## 20 20 0.1775         NA

#Plot the error rate 
plot(knn.fix)
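
With a fixed split there is only one error measurement per k, which is why the dispersion column above is NA. A minimal sketch of the idea follows; fixed_error_knn is a hypothetical helper, and the 2/3 training fraction is an assumption chosen because the errors printed above are all multiples of 1/417, which is what a 833/417 split of the 1250 Smarket rows would give.

```r
library(class)  # provides knn()

# Hypothetical helper: kNN error on one fixed training/validation split.
fixed_error_knn <- function(x, y, k, train_frac = 2/3) {
    n <- nrow(x)
    train_idx <- sample(n, size = trunc(train_frac * n))
    pred <- knn(train = x[train_idx, , drop = FALSE],
                test  = x[-train_idx, , drop = FALSE],
                cl    = y[train_idx], k = k)
    # Single misclassification rate, so no dispersion can be computed
    mean(pred != y[-train_idx])
}

# Validation rows left over from the 1250 Smarket rows at a 2/3 split
n_valid <- 1250 - trunc(1250 * 2/3)   # 417

# Toy run on the built-in iris data (not the Smarket data used above)
set.seed(1)
iris_fix <- fixed_error_knn(iris[, 1:4], iris$Species, k = 5)
```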


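Based on 10-fold cross validation, k = 19 is best, and bootstrapping picks k = 20. The fixed split picks k = 7, but that comes from a single split with no dispersion estimate, so I want to check both k = 19 and k = 20 on the held-out test data.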

library(class)
knn.pred <- knn(train = train.data,test = test.data,cl = train.dir,k = 19)
table(knn.pred,test.dir)

##         test.dir
## knn.pred Down  Up
##     Down  103  11
##     Up     19 117

mean(knn.pred == test.dir)

## [1] 0.88

knn.pred1 <- knn(train = train.data,test = test.data,cl = train.dir,k = 20)
table(knn.pred1,test.dir)

##          test.dir
## knn.pred1 Down  Up
##      Down  107  10
##      Up     15 118

mean(knn.pred1 == test.dir)

## [1] 0.9
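
The accuracy figure can also be read straight off the confusion matrix, along with the per-class hit rates that a single accuracy number hides. The sketch below rebuilds the k = 20 table from the counts printed above; the variable names are mine and not part of the original script.

```r
# Confusion matrix for k = 20, copied from the output above
# (rows: predicted direction, columns: actual direction)
conf <- matrix(c(107, 15, 10, 118), nrow = 2,
               dimnames = list(pred = c("Down", "Up"),
                               actual = c("Down", "Up")))

# Correct predictions sit on the diagonal: (107 + 118) / 250
accuracy <- sum(diag(conf)) / sum(conf)

# Share of actual "Up" days predicted as "Up": 118 / 128
recall_up <- conf["Up", "Up"] / sum(conf[, "Up"])

# Share of actual "Down" days predicted as "Down": 107 / 122
recall_down <- conf["Down", "Down"] / sum(conf[, "Down"])
```

On real output, conf can be replaced by table(knn.pred1, test.dir) directly.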

Prediction accuracy is now about 88 to 90%. This is a much improved result over the previous script “knn_example1_imprvd”.


Vijayakumar Jawaharlal

April 23, 2014