This is a continuation of the improved example 1. In this script, I want to try various resampling methods to find the optimal k value. The following three resampling methods are used in this script: k-fold cross validation, bootstrapping, and a fixed training/validation split.
The e1071 package is used for the resampling methods. This package has a set of tools to tune various parameters and find the best-performing model. It has tuning wrappers for svm, random forest, decision tree, knn etc. Refer to the help pages of the e1071 package.
The tune.knn() function is used for resampling. I also use tune.control() to choose the sampling method, and summary() and plot() to depict the results from tune.knn().
# Min-max scaling: rescale a numeric vector to the [0, 1] range
normalize <- function(x) {
  norm <- (x - min(x)) / (max(x) - min(x))
  return(norm)
}
# Split a data frame into test and training sets.
# Column 9 (Direction) is the class label and is dropped from the predictors.
marketSample <- function(smarkt, test_pct = 20) {
  # Seeding for reproducibility
  set.seed(test_pct)
  # Randomly pick test_pct percent of the rows as the test set
  test_indx <- sample(nrow(smarkt), nrow(smarkt) * test_pct / 100, replace = FALSE)
  test.all <- smarkt[test_indx, ]
  test.data <- test.all[, -9]
  test.direction <- test.all$Direction
  # Prepare the training data, which is nothing but the rows left after removing the test rows
  train.all <- smarkt[-test_indx, ]
  train.data <- train.all[, -9]
  train.direction <- train.all$Direction
  rtrn_list <- list(testdata = test.data, testdir = test.direction, traindata = train.data, traindir = train.direction)
  return(rtrn_list)
}
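As a quick check of what normalize() does, here is a minimal sketch on a made-up vector (the values are illustrative and are not taken from Smarket). It also shows one edge case that the function above does not guard against: a constant column gives a 0/0 division.

```r
# Same min-max scaling as defined above, repeated so this snippet runs on its own
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

toy <- c(2, 4, 6, 10)          # made-up values for illustration
scaled <- normalize(toy)
print(scaled)                  # 0.00 0.25 0.50 1.00

# A constant vector has max == min, so every element becomes NaN (0/0)
print(normalize(c(5, 5, 5)))
```

Scaling matters for knn because the distance is otherwise dominated by the columns with the largest numeric range.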
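Data preparation using the normalize function
library(ISLR)
# Scale every predictor column (everything except Direction, column 9) to [0, 1]
input.norml <- as.data.frame(lapply(Smarket[, -9], normalize))
input.norml$Direction <- Smarket$Direction
# Full predictor set and labels, used by the resampling runs
full.data <- input.norml[, -9]
full.dir <- input.norml[, 9]
# 80/20 train/test split, used for the final check
temp_list <- marketSample(input.norml, 20)
train.data <- temp_list$traindata
train.dir <- temp_list$traindir
test.data <- temp_list$testdata
test.dir <- temp_list$testdir
rm(temp_list)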
Let us apply the various resampling methods to find the optimal k.
Let us start with k-fold cross validation to find the optimal k value for knn()
library(e1071)
#Full data set can be used for cross validation
#The number of folds is an argument of tune.control(), not of tune.knn()
knn.cross <- tune.knn(x = full.data, y = full.dir, k = 1:20, tunecontrol = tune.control(sampling = "cross", cross = 10))
#Summarize the resampling results set
summary(knn.cross)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## k
## 19
##
## - best performance: 0.1248
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.1856 0.03180
## 2 2 0.2184 0.03524
## 3 3 0.1687 0.02999
## 4 4 0.1703 0.03122
## 5 5 0.1583 0.03067
## 6 6 0.1599 0.03427
## 7 7 0.1511 0.03273
## 8 8 0.1368 0.02053
## 9 9 0.1432 0.03268
## 10 10 0.1544 0.03372
## 11 11 0.1424 0.03966
## 12 12 0.1471 0.03280
## 13 13 0.1375 0.03457
## 14 14 0.1416 0.03347
## 15 15 0.1384 0.03056
## 16 16 0.1400 0.03759
## 17 17 0.1279 0.03740
## 18 18 0.1272 0.02638
## 19 19 0.1248 0.03225
## 20 20 0.1280 0.02410
#Plot the error rate
plot(knn.cross)
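Instead of reading the best k off the printed summary, it can be pulled from the object that tune.knn() returns. The sketch below assumes the e1071 package is installed and uses the built-in iris data with 5 folds purely so that it runs on its own; the demo.* names are hypothetical, and the same fields can be read from knn.cross.

```r
library(e1071)

set.seed(1)                       # the fold assignment is random
demo.x <- iris[, 1:4]             # built-in data, used only for illustration
demo.y <- iris$Species

demo.tune <- tune.knn(x = demo.x, y = demo.y, k = 1:10,
                      tunecontrol = tune.control(sampling = "cross", cross = 5))

best_k   <- demo.tune$best.parameters$k   # k with the lowest mean error
best_err <- demo.tune$best.performance    # that mean error
print(best_k)
print(best_err)
head(demo.tune$performances)              # columns: k, error, dispersion
```

Because the folds are drawn at random, the selected k can change from run to run unless a seed is set first.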
Resampling using bootstrapping
#Full data set can be used for bootstrapping
knn.boot <- tune.knn(x = full.data, y = full.dir, k = 1:20, tunecontrol = tune.control(sampling = "boot"))
#Summarize the resampling results set
summary(knn.boot)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: bootstrapping
##
## - best parameters:
## k
## 20
##
## - best performance: 0.1599
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.2027 0.011899
## 2 2 0.2177 0.016510
## 3 3 0.2180 0.008659
## 4 4 0.2092 0.012496
## 5 5 0.1956 0.017534
## 6 6 0.1906 0.019819
## 7 7 0.1912 0.021306
## 8 8 0.1856 0.019959
## 9 9 0.1818 0.023124
## 10 10 0.1796 0.017639
## 11 11 0.1777 0.018028
## 12 12 0.1748 0.013825
## 13 13 0.1737 0.017660
## 14 14 0.1773 0.017987
## 15 15 0.1716 0.017996
## 16 16 0.1682 0.019212
## 17 17 0.1659 0.023116
## 18 18 0.1658 0.024365
## 19 19 0.1605 0.020610
## 20 20 0.1599 0.022703
#Plot the error rate
plot(knn.boot)
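To make the idea behind the bootstrap estimate concrete, here is a minimal hand-rolled sketch for a single k. It is not e1071's internal code: the number of rounds, the sample size and the use of the built-in iris data are my assumptions for illustration. Each round draws rows with replacement, fits on them, and scores on the rows that were never drawn.

```r
library(class)

set.seed(1)
x <- as.matrix(iris[, 1:4])     # built-in data, used only for illustration
y <- iris$Species
n <- nrow(x)
n_boot <- 10                    # number of bootstrap rounds (assumed)
k_try <- 5                      # a single candidate k

errs <- numeric(n_boot)
for (b in seq_len(n_boot)) {
  train_idx <- sample(n, n, replace = TRUE)     # draw rows with replacement
  oob_idx <- setdiff(seq_len(n), train_idx)     # rows that were never drawn
  pred <- knn(train = x[train_idx, , drop = FALSE],
              test = x[oob_idx, , drop = FALSE],
              cl = y[train_idx], k = k_try)
  errs[b] <- mean(pred != y[oob_idx])           # error on the held-out rows
}
print(mean(errs))   # bootstrap error estimate for this k
print(sd(errs))     # spread across rounds, comparable to the dispersion column
```

Repeating this loop for each k in 1:20 and picking the smallest mean error mirrors what the summary table reports.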
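Resampling using a fixed training/validation set
#Full data set is split once into a training part and a validation part
#The training fraction is an argument of tune.control(); 2/3 is the default
knn.fix <- tune.knn(x = full.data, y = full.dir, k = 1:20, tunecontrol = tune.control(sampling = "fix", fix = 2/3))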
#Summarize the resampling results set
summary(knn.fix)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: fixed training/validation set
##
## - best parameters:
## k
## 7
##
## - best performance: 0.1487
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.1871 NA
## 2 2 0.2134 NA
## 3 3 0.1703 NA
## 4 4 0.1751 NA
## 5 5 0.1583 NA
## 6 6 0.1655 NA
## 7 7 0.1487 NA
## 8 8 0.1751 NA
## 9 9 0.1631 NA
## 10 10 0.1631 NA
## 11 11 0.1535 NA
## 12 12 0.1751 NA
## 13 13 0.1535 NA
## 14 14 0.1607 NA
## 15 15 0.1727 NA
## 16 16 0.1679 NA
## 17 17 0.1775 NA
## 18 18 0.1583 NA
## 19 19 0.1679 NA
## 20 20 0.1775 NA
#Plot the error rate
plot(knn.fix)
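The dispersion column is NA here because a fixed split evaluates each k exactly once, so there is nothing to take a standard deviation over. The short calculation below is a sketch that assumes the training fraction is 2/3 (the tune.control() default for fix, as far as I know); under that assumption the reported errors line up with whole numbers of misclassified validation rows.

```r
n <- 1250                        # rows in Smarket
fix_frac <- 2 / 3                # assumed training fraction
n_train <- floor(n * fix_frac)   # 833 rows used for training
n_valid <- n - n_train           # 417 rows used for validation
print(n_valid)

# The k = 7 error from the table, expressed as a count of validation rows
print(0.1487 * n_valid)          # close to 62
```

With a single validation set of this size, each misclassified row moves the error by about 0.0024, which is one reason the curve looks noisier than the cross validation one.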
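Based on 10-fold cross validation, k = 19 is best, while bootstrapping picks k = 20. The single fixed split picks k = 7, but it has no dispersion estimate, so I want to check both k = 19 and k = 20 on the held-out test set.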
library(class)
knn.pred <- knn(train = train.data,test = test.data,cl = train.dir,k = 19)
table(knn.pred,test.dir)
## test.dir
## knn.pred Down Up
## Down 103 11
## Up 19 117
mean(knn.pred == test.dir)
## [1] 0.88
knn.pred1 <- knn(train = train.data,test = test.data,cl = train.dir,k = 20)
table(knn.pred1,test.dir)
## test.dir
## knn.pred1 Down Up
## Down 107 10
## Up 15 118
mean(knn.pred1 == test.dir)
## [1] 0.9
Prediction accuracy is now about 88% to 90%. This is a much improved result compared with the previous script “knn_example1_imprvd”.
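The accuracy figures can be cross-checked directly from the confusion matrices. The sketch below re-enters the k = 19 table from above by hand (rows are predictions, columns are the actual directions); the variable names are mine.

```r
# Confusion matrix for k = 19, copied from the output above
cm19 <- matrix(c(103, 19, 11, 117), nrow = 2,
               dimnames = list(pred = c("Down", "Up"),
                               actual = c("Down", "Up")))

accuracy <- sum(diag(cm19)) / sum(cm19)                        # (103 + 117) / 250
down_rate <- cm19["Down", "Down"] / sum(cm19[, "Down"])        # share of actual Down days caught
up_rate <- cm19["Up", "Up"] / sum(cm19[, "Up"])                # share of actual Up days caught

print(accuracy)   # 0.88
print(down_rate)
print(up_rate)
```

Looking at the per-class rates alongside the overall accuracy shows whether the model favors one direction over the other.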