Recently I got familiar with the caret package. caret is a great R package that provides a uniform interface to nearly 150 ML algorithms, along with convenient functions for sampling data (into training and test sets), preprocessing, evaluating models, and so on.

To get familiar with the caret package, please check the following URLs:

https://www.youtube.com/watch?v=7Jbb2ItbTC4
http://caret.r-forge.r-project.org
http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf

I am going to use the same dataset as in the previous examples. The intention of this exercise is to get familiar with the caret package.
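For reference, Smarket from the ISLR package holds 1,250 daily S&P 500 observations with eight predictors (Year, Lag1 through Lag5, Volume, Today) and the response factor Direction (Down/Up). A quick way to inspect it:

library(ISLR)
str(Smarket)                  # 1250 obs. of 9 variables
summary(Smarket$Direction)    # class counts for the response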

Sampling

library(ISLR)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(300)
#Splitting the data into training and test sets using createDataPartition() from caret
indxTrain <- createDataPartition(y = Smarket$Direction,p = 0.75,list = FALSE)
training <- Smarket[indxTrain,]
testing <- Smarket[-indxTrain,]

#Checking the class distribution in the original data and in the partitions
prop.table(table(training$Direction)) * 100
## 
##  Down    Up 
## 48.19 51.81
prop.table(table(testing$Direction)) * 100
## 
##  Down    Up 
## 48.08 51.92
prop.table(table(Smarket$Direction)) * 100
## 
##  Down    Up 
## 48.16 51.84

The createDataPartition() function creates a stratified sample effortlessly; we don't need to write a complex function like in the previous example.
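For comparison, a stratified split done by hand in base R might look like the sketch below; this is only an illustration of what createDataPartition() replaces, not the exact function from the earlier post.

# Manual stratified 75/25 split: sample within each class so the
# Down/Up proportions are preserved
set.seed(300)
idx <- unlist(lapply(split(seq_len(nrow(Smarket)), Smarket$Direction),
                     function(i) sample(i, size = floor(0.75 * length(i)))))
manualTrain <- Smarket[idx, ]
manualTest  <- Smarket[-idx, ]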

Preprocessing

kNN requires the variables to be normalized or scaled. caret provides facilities to preprocess data; I am going to use centering and scaling.

trainX <- training[,names(training) != "Direction"]
preProcValues <- preProcess(x = trainX,method = c("center", "scale"))
preProcValues
## 
## Call:
## preProcess.default(x = trainX, method = c("center", "scale"))
## 
## Created from 938 samples and 8 variables
## Pre-processing: centered, scaled
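Note that preProcess() only estimates the transformation (here, the per-column means and standard deviations); to actually transform data you apply the object with predict(). A minimal sketch:

# Apply the learned center/scale transform; the test set is transformed
# with the training set's means and SDs, avoiding information leakage
trainXtrans <- predict(preProcValues, trainX)
testX       <- testing[, names(testing) != "Direction"]
testXtrans  <- predict(preProcValues, testX)
round(colMeans(trainXtrans), 3)   # centered columns are ~0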

Training and train control

set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)
knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)

#Output of kNN fit
knnFit
## k-Nearest Neighbors 
## 
## 938 samples
##   8 predictors
##   2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## 
## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ... 
## 
## Resampling results across tuning parameters:
## 
##   k   Accuracy  Kappa  Accuracy SD  Kappa SD
##   5   0.9       0.7    0.04         0.07    
##   7   0.9       0.8    0.04         0.08    
##   9   0.9       0.8    0.04         0.08    
##   11  0.9       0.8    0.03         0.07    
##   13  0.9       0.8    0.03         0.07    
##   15  0.9       0.8    0.03         0.06    
##   17  0.9       0.8    0.03         0.07    
##   19  0.9       0.8    0.04         0.08    
##   21  0.9       0.8    0.03         0.07    
##   23  0.9       0.8    0.03         0.07    
##   25  0.9       0.8    0.03         0.06    
##   27  0.9       0.8    0.03         0.06    
##   29  0.9       0.8    0.03         0.07    
##   31  0.9       0.8    0.03         0.07    
##   33  0.9       0.8    0.03         0.07    
##   35  0.9       0.8    0.03         0.07    
##   37  0.9       0.8    0.03         0.06    
##   39  0.9       0.8    0.03         0.06    
##   41  0.9       0.8    0.03         0.06    
##   43  0.9       0.8    0.03         0.05    
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 23.
#Plotting accuracy vs. number of neighbours (based on repeated cross-validation)
plot(knnFit)

[Plot: cross-validated accuracy vs. number of neighbours]
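tuneLength = 20 lets caret pick the candidate grid on its own; for kNN this is the odd values k = 5, 7, ..., 43, which is why the selected k = 23 sits inside the table above. If you want explicit control over the candidates, a tuneGrid can be passed instead of tuneLength; a sketch:

# Equivalent fit with an explicit tuning grid (the column must be named k)
knnGrid <- expand.grid(k = seq(5, 43, by = 2))
knnFit2 <- train(Direction ~ ., data = training, method = "knn",
                 trControl = ctrl, preProcess = c("center", "scale"),
                 tuneGrid = knnGrid)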

knnPredict <- predict(knnFit,newdata = testing )
#Get the confusion matrix to see accuracy value and other parameter values
confusionMatrix(knnPredict, testing$Direction )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down  123   8
##       Up     27 154
##                                         
##                Accuracy : 0.888         
##                  95% CI : (0.847, 0.921)
##     No Information Rate : 0.519         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.774         
##  Mcnemar's Test P-Value : 0.00235       
##                                         
##             Sensitivity : 0.820         
##             Specificity : 0.951         
##          Pos Pred Value : 0.939         
##          Neg Pred Value : 0.851         
##              Prevalence : 0.481         
##          Detection Rate : 0.394         
##    Detection Prevalence : 0.420         
##       Balanced Accuracy : 0.885         
##                                         
##        'Positive' Class : Down          
## 
mean(knnPredict == testing$Direction)
## [1] 0.8878
#Now trying the twoClassSummary function (needs classProbs = TRUE in trainControl)

ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)
knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Warning: The metric "Accuracy" was not in the result set. ROC will be used
## instead.
#Output of kNN fit
knnFit
## k-Nearest Neighbors 
## 
## 938 samples
##   8 predictors
##   2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## 
## Summary of sample sizes: 844, 844, 845, 843, 844, 844, ... 
## 
## Resampling results across tuning parameters:
## 
##   k   ROC  Sens  Spec  ROC SD  Sens SD  Spec SD
##   5   0.9  0.8   0.9   0.03    0.07     0.05   
##   7   1    0.8   0.9   0.02    0.06     0.05   
##   9   1    0.8   0.9   0.02    0.07     0.05   
##   11  1    0.8   0.9   0.02    0.07     0.05   
##   13  1    0.8   0.9   0.02    0.07     0.05   
##   15  1    0.9   0.9   0.02    0.06     0.04   
##   17  1    0.9   0.9   0.02    0.06     0.04   
##   19  1    0.9   0.9   0.02    0.07     0.03   
##   21  1    0.9   0.9   0.01    0.06     0.03   
##   23  1    0.9   0.9   0.01    0.06     0.03   
##   25  1    0.9   0.9   0.01    0.07     0.03   
##   27  1    0.9   0.9   0.01    0.06     0.03   
##   29  1    0.8   0.9   0.01    0.07     0.03   
##   31  1    0.8   0.9   0.01    0.06     0.03   
##   33  1    0.8   0.9   0.01    0.07     0.03   
##   35  1    0.8   0.9   0.01    0.06     0.03   
##   37  1    0.8   0.9   0.01    0.06     0.03   
##   39  1    0.8   0.9   0.01    0.06     0.02   
##   41  1    0.8   0.9   0.01    0.06     0.02   
##   43  1    0.8   0.9   0.01    0.06     0.03   
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was k = 43.
#Plotting ROC vs. number of neighbours (the tuning metric is now ROC rather than accuracy)
plot(knnFit, print.thres = 0.5, type="S")

[Plot: cross-validated ROC vs. number of neighbours]

knnPredict <- predict(knnFit,newdata = testing )
#Get the confusion matrix to see accuracy value and other parameter values
confusionMatrix(knnPredict, testing$Direction )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down  123   9
##       Up     27 153
##                                         
##                Accuracy : 0.885         
##                  95% CI : (0.844, 0.918)
##     No Information Rate : 0.519         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.768         
##  Mcnemar's Test P-Value : 0.00461       
##                                         
##             Sensitivity : 0.820         
##             Specificity : 0.944         
##          Pos Pred Value : 0.932         
##          Neg Pred Value : 0.850         
##              Prevalence : 0.481         
##          Detection Rate : 0.394         
##    Detection Prevalence : 0.423         
##       Balanced Accuracy : 0.882         
##                                         
##        'Positive' Class : Down          
## 
mean(knnPredict == testing$Direction)
## [1] 0.8846

Trying to plot the ROC curve to check specificity and sensitivity.

library(pROC)
knnPredict <- predict(knnFit,newdata = testing , type="prob")
#Note: this call has a bug; `levels` should receive the factor levels,
#rev(levels(testing$Direction)), not the reversed data vector.
#See the corrected call after the output below.
knnROC <- roc(testing$Direction,knnPredict[,"Down"], levels = rev(testing$Direction))
knnROC
## 
## Call:
## roc.default(response = testing$Direction, predictor = knnPredict[,     "Down"], levels = rev(testing$Direction))
## 
## Data: knnPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).
## Area under the curve: 0.5
plot(knnROC, type="S", print.thres= 0.5)

[Plot: ROC curve for the kNN predictions]

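The area under the curve of 0.5 above is an artifact of the bug flagged in the comment: roc()'s levels argument expects the two response levels (controls first, then cases), not the reversed data vector, so pROC ends up comparing a level against itself. A corrected sketch:

#Corrected call: pass the factor levels, not the data vector
knnROC <- roc(response  = testing$Direction,
              predictor = knnPredict[, "Down"],
              levels    = rev(levels(testing$Direction)))  # c("Up", "Down")
auc(knnROC)                                 # now reflects real discrimination
plot(knnROC, type = "S", print.thres = 0.5)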

Applying random forest to see whether performance improves.

set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)

# Random forest
rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
rfFit
## Random Forest 
## 
## 938 samples
##   8 predictors
##   2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## 
## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     1         1      0.004        0.009   
##   3     1         1      0.004        0.009   
##   4     1         1      0.004        0.009   
##   5     1         1      0.004        0.009   
##   6     1         1      0.004        0.009   
##   7     1         1      0.004        0.009   
##   8     1         1      0.004        0.009   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
plot(rfFit)

[Plot: cross-validated accuracy vs. mtry for the random forest]

rfPredict <- predict(rfFit,newdata = testing )
confusionMatrix(rfPredict, testing$Direction )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down  150   0
##       Up      0 162
##                                     
##                Accuracy : 1         
##                  95% CI : (0.988, 1)
##     No Information Rate : 0.519     
##     P-Value [Acc > NIR] : <2e-16    
##                                     
##                   Kappa : 1         
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##              Prevalence : 0.481     
##          Detection Rate : 0.481     
##    Detection Prevalence : 0.481     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : Down      
## 
mean(rfPredict == testing$Direction)
## [1] 1
#With twoClassSummary
ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)
# Random forest
rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
## Warning: The metric "Accuracy" was not in the result set. ROC will be used
## instead.
rfFit
## Random Forest 
## 
## 938 samples
##   8 predictors
##   2 classes: 'Down', 'Up' 
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## 
## Summary of sample sizes: 844, 844, 845, 845, 843, 845, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  ROC  Sens  Spec  ROC SD  Sens SD  Spec SD
##   2     1    1     1     9e-05   0.007    0.005  
##   3     1    1     1     8e-05   0.007    0.005  
##   4     1    1     1     0       0.007    0.005  
##   5     1    1     1     0       0.007    0.005  
##   6     1    1     1     8e-05   0.007    0.005  
##   7     1    1     1     0       0.007    0.005  
##   8     1    1     1     0       0.007    0.005  
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 4.
plot(rfFit)

[Plot: cross-validated ROC vs. mtry for the random forest]

#Trying plot with some more parameters
plot(rfFit, print.thres = 0.5, type="S")

[Plot: cross-validated ROC vs. mtry, plotted with the extra parameters]

rfPredict <- predict(rfFit,newdata = testing )
confusionMatrix(rfPredict, testing$Direction )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down  150   0
##       Up      0 162
##                                     
##                Accuracy : 1         
##                  95% CI : (0.988, 1)
##     No Information Rate : 0.519     
##     P-Value [Acc > NIR] : <2e-16    
##                                     
##                   Kappa : 1         
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##              Prevalence : 0.481     
##          Detection Rate : 0.481     
##    Detection Prevalence : 0.481     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : Down      
## 
mean(rfPredict == testing$Direction)
## [1] 1
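The perfect accuracy is not a modeling win: in Smarket, Direction is defined directly from Today (the day's percentage return), so keeping Today among the predictors leaks the label into the features, and the forest simply learns that rule. A sketch of a fairer fit that drops the leaking column (illustrative, not from the original post):

# Refit without Today; accuracy should fall to a realistic level,
# since market direction is genuinely hard to predict from the lags alone
rfFitNoLeak <- train(Direction ~ . - Today, data = training,
                     method = "rf", trControl = ctrl,
                     preProcess = c("center", "scale"))
confusionMatrix(predict(rfFitNoLeak, newdata = testing), testing$Direction)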

Plotting the ROC curve

library(pROC)
rfPredict <- predict(rfFit,newdata = testing , type="prob")
#Same `levels` bug as in the kNN section; it should be rev(levels(testing$Direction))
rfROC <- roc(testing$Direction,rfPredict[,"Down"], levels = rev(testing$Direction))
rfROC
## 
## Call:
## roc.default(response = testing$Direction, predictor = rfPredict[,     "Down"], levels = rev(testing$Direction))
## 
## Data: rfPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction 2).
## Area under the curve: 0.5
plot(rfROC, type="S", print.thres= 0.5)

[Plot: ROC curve for the random forest predictions]


Yet to learn about interpreting the ROC curve :) (the flat 0.5 AUCs above come from the levels bug noted earlier, not from the models themselves)