Recently I got familiar with the caret package. caret is a great R package that provides a uniform interface to nearly 150 machine-learning algorithms. It also provides convenient functions for sampling the data (into training and test sets), preprocessing, evaluating models, and so on.
To get familiar with the caret package, check the following resources:
https://www.youtube.com/watch?v=7Jbb2ItbTC4
http://caret.r-forge.r-project.org
http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf
I am going to use the same dataset as in the previous examples; the intention of this exercise is to get familiar with the caret package.
library(ISLR)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
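As a quick check on the "nearly 150 algorithms" claim, caret keeps a registry of every method it supports. A minimal sketch (the exact count varies by caret version):
#List every method caret supports and count them
models <- names(getModelInfo())
length(models)               #~150+, depending on the caret version
c("knn", "rf") %in% models   #the two methods used in this post
modelLookup("knn")           #tuning parameter(s) for kNN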
set.seed(300)
#Splitting the data into training and test sets using createDataPartition() from caret
indxTrain <- createDataPartition(y = Smarket$Direction,p = 0.75,list = FALSE)
training <- Smarket[indxTrain,]
testing <- Smarket[-indxTrain,]
#Checking the class distribution in the original data and in the partitions
prop.table(table(training$Direction)) * 100
##
## Down Up
## 48.19 51.81
prop.table(table(testing$Direction)) * 100
##
## Down Up
## 48.08 51.92
prop.table(table(Smarket$Direction)) * 100
##
## Down Up
## 48.16 51.84
The createDataPartition() function creates a stratified sample effortlessly; we don't need to write a complex sampling function as in the previous example (see the base-R sketch below for what it replaces).
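For comparison, a minimal base-R sketch of the same stratified 75/25 split; the names (idx, manualTrain, manualTest) are hypothetical helpers, not part of caret:
set.seed(300)
#Sample 75% of the row indices within each class to preserve the Up/Down proportions
idx <- unlist(lapply(split(seq_len(nrow(Smarket)), Smarket$Direction),
                     function(i) sample(i, size = floor(0.75 * length(i)))))
manualTrain <- Smarket[idx, ]
manualTest  <- Smarket[-idx, ]
prop.table(table(manualTrain$Direction)) * 100  #should roughly match the caret split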
kNN requires the variables to be normalized or scaled. caret provides facilities to preprocess the data; I am going to choose centering and scaling.
trainX <- training[,names(training) != "Direction"]
preProcValues <- preProcess(x = trainX,method = c("center", "scale"))
preProcValues
##
## Call:
## preProcess.default(x = trainX, method = c("center", "scale"))
##
## Created from 938 samples and 8 variables
## Pre-processing: centered, scaled
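preProcess() only estimates the centering and scaling parameters; to actually transform a data set you pass the object to predict(). A minimal sketch (train() below applies this automatically via its preProcess argument, so this is just for illustration):
#Apply the learned center/scale transformation to the training and test predictors
trainXtrans <- predict(preProcValues, trainX)
testX <- testing[, names(testing) != "Direction"]
testXtrans <- predict(preProcValues, testX)
round(colMeans(trainXtrans), 3)  #centered columns should have mean ~0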
set.seed(400)
ctrl <- trainControl(method = "repeatedcv", repeats = 3)  #10-fold CV, repeated 3 times
knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
#Output of the kNN fit. Note: the printed values are rounded to one significant digit
#(hence the apparently duplicated k rows); the actual tuning grid for tuneLength = 20
#is k = 5, 7, 9, ..., 43, which is how k = 23 gets selected below.
knnFit
## k-Nearest Neighbors
##
## 938 samples
## 8 predictors
## 2 classes: 'Down', 'Up'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ...
##
## Resampling results across tuning parameters:
##
## k Accuracy Kappa Accuracy SD Kappa SD
## 5 0.9 0.7 0.04 0.07
## 7 0.9 0.8 0.04 0.08
## 9 0.9 0.8 0.04 0.08
## 10 0.9 0.8 0.03 0.07
## 10 0.9 0.8 0.03 0.07
## 20 0.9 0.8 0.03 0.06
## 20 0.9 0.8 0.03 0.07
## 20 0.9 0.8 0.04 0.08
## 20 0.9 0.8 0.03 0.07
## 20 0.9 0.8 0.03 0.07
## 20 0.9 0.8 0.03 0.06
## 30 0.9 0.8 0.03 0.06
## 30 0.9 0.8 0.03 0.07
## 30 0.9 0.8 0.03 0.07
## 30 0.9 0.8 0.03 0.07
## 40 0.9 0.8 0.03 0.07
## 40 0.9 0.8 0.03 0.06
## 40 0.9 0.8 0.03 0.06
## 40 0.9 0.8 0.03 0.06
## 40 0.9 0.8 0.03 0.05
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
#Plotting number of neighbours vs. accuracy (based on repeated cross-validation)
plot(knnFit)
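The chosen k and the full resampling table can also be read straight off the fit object; a quick sketch:
knnFit$bestTune       #the selected number of neighbours (k = 23 here)
head(knnFit$results)  #per-k accuracy and kappa with their standard deviations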
knnPredict <- predict(knnFit,newdata = testing )
#Get the confusion matrix to see accuracy and the other performance statistics
confusionMatrix(knnPredict, testing$Direction )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 123 8
## Up 27 154
##
## Accuracy : 0.888
## 95% CI : (0.847, 0.921)
## No Information Rate : 0.519
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.774
## Mcnemar's Test P-Value : 0.00235
##
## Sensitivity : 0.820
## Specificity : 0.951
## Pos Pred Value : 0.939
## Neg Pred Value : 0.851
## Prevalence : 0.481
## Detection Rate : 0.394
## Detection Prevalence : 0.420
## Balanced Accuracy : 0.885
##
## 'Positive' Class : Down
##
mean(knnPredict == testing$Direction)
## [1] 0.8878
#Now trying the twoClassSummary() function, which evaluates models on ROC/sensitivity/specificity instead of accuracy
ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)
knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Warning: The metric "Accuracy" was not in the result set. ROC will be used
## instead.
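The warning appears because train() still defaults to metric = "Accuracy", which twoClassSummary does not compute. Passing the metric explicitly avoids the fallback; a minimal sketch (knnFitROC is a hypothetical name; the fit is equivalent to the one above):
#Ask train() to optimize ROC explicitly so no fallback warning is raised
knnFitROC <- train(Direction ~ ., data = training, method = "knn",
                   trControl = ctrl, metric = "ROC",
                   preProcess = c("center", "scale"), tuneLength = 20)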
#Output of kNN fit
knnFit
## k-Nearest Neighbors
##
## 938 samples
## 8 predictors
## 2 classes: 'Down', 'Up'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 844, 844, 845, 843, 844, 844, ...
##
## Resampling results across tuning parameters:
##
## k ROC Sens Spec ROC SD Sens SD Spec SD
## 5 0.9 0.8 0.9 0.03 0.07 0.05
## 7 1 0.8 0.9 0.02 0.06 0.05
## 9 1 0.8 0.9 0.02 0.07 0.05
## 10 1 0.8 0.9 0.02 0.07 0.05
## 10 1 0.8 0.9 0.02 0.07 0.05
## 20 1 0.9 0.9 0.02 0.06 0.04
## 20 1 0.9 0.9 0.02 0.06 0.04
## 20 1 0.9 0.9 0.02 0.07 0.03
## 20 1 0.9 0.9 0.01 0.06 0.03
## 20 1 0.9 0.9 0.01 0.06 0.03
## 20 1 0.9 0.9 0.01 0.07 0.03
## 30 1 0.9 0.9 0.01 0.06 0.03
## 30 1 0.8 0.9 0.01 0.07 0.03
## 30 1 0.8 0.9 0.01 0.06 0.03
## 30 1 0.8 0.9 0.01 0.07 0.03
## 40 1 0.8 0.9 0.01 0.06 0.03
## 40 1 0.8 0.9 0.01 0.06 0.03
## 40 1 0.8 0.9 0.01 0.06 0.02
## 40 1 0.8 0.9 0.01 0.06 0.02
## 40 1 0.8 0.9 0.01 0.06 0.03
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 43.
#Plotting number of neighbours vs. ROC (the tuning metric is now AUC rather than accuracy; based on repeated cross-validation)
plot(knnFit, print.thres = 0.5, type="S")
knnPredict <- predict(knnFit,newdata = testing )
#Get the confusion matrix to see accuracy and the other performance statistics
confusionMatrix(knnPredict, testing$Direction )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 123 9
## Up 27 153
##
## Accuracy : 0.885
## 95% CI : (0.844, 0.918)
## No Information Rate : 0.519
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.768
## Mcnemar's Test P-Value : 0.00461
##
## Sensitivity : 0.820
## Specificity : 0.944
## Pos Pred Value : 0.932
## Neg Pred Value : 0.850
## Prevalence : 0.481
## Detection Rate : 0.394
## Detection Prevalence : 0.423
## Balanced Accuracy : 0.882
##
## 'Positive' Class : Down
##
mean(knnPredict == testing$Direction)
## [1] 0.8846
Trying to plot the ROC curve to check specificity and sensitivity. One caveat: the levels argument of pROC::roc() expects the two factor levels (control level first), not the data vector itself. Passing rev(testing$Direction), as I originally did, makes pROC use the same label for both controls and cases, which is why the output reported a meaningless AUC of 0.5 despite ~88% test accuracy. The corrected call:
library(pROC)
knnPredict <- predict(knnFit, newdata = testing, type = "prob")
#levels must be the factor levels (control level first), not the data vector
knnROC <- roc(testing$Direction, knnPredict[,"Down"], levels = rev(levels(testing$Direction)))
knnROC
plot(knnROC, type = "S", print.thres = 0.5)
set.seed(400)
ctrl <- trainControl(method = "repeatedcv", repeats = 3)  #10-fold CV, repeated 3 times
# Random forest. Note: in Smarket, Direction is derived directly from Today
# (Up when the day's return is positive), so a flexible model that keeps Today
# as a predictor can reconstruct Direction perfectly. The perfect scores below
# reflect this leakage, not genuine predictive power.
rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
rfFit
## Random Forest
##
## 938 samples
## 8 predictors
## 2 classes: 'Down', 'Up'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 1 0.004 0.009
## 3 1 1 0.004 0.009
## 4 1 1 0.004 0.009
## 5 1 1 0.004 0.009
## 6 1 1 0.004 0.009
## 7 1 1 0.004 0.009
## 8 1 1 0.004 0.009
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(rfFit)
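To confirm which predictor drives the perfect fit, caret's varImp() reports the forest's importance scores; a minimal sketch (given the leakage noted above, I would expect Today to dominate):
#Variable importance from the final random forest
varImp(rfFit)
plot(varImp(rfFit))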
rfPredict <- predict(rfFit,newdata = testing )
confusionMatrix(rfPredict, testing$Direction )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 150 0
## Up 0 162
##
## Accuracy : 1
## 95% CI : (0.988, 1)
## No Information Rate : 0.519
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.481
## Detection Rate : 0.481
## Detection Prevalence : 0.481
## Balanced Accuracy : 1.000
##
## 'Positive' Class : Down
##
mean(rfPredict == testing$Direction)
## [1] 1
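For a fairer benchmark we could simply drop the leaked predictor from the formula; a hypothetical sketch (rfFitNoLeak is an illustrative name; with Today removed, accuracy should fall back toward what the Lag-based predictors can genuinely support):
#Refit without Today (and Year, which is an index rather than a signal)
rfFitNoLeak <- train(Direction ~ . - Today - Year, data = training, method = "rf",
                     trControl = ctrl, preProcess = c("center","scale"), tuneLength = 5)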
#With twoClassSummary
ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)
# Random forest (same leakage caveat as above)
rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)
## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .
## Warning: The metric "Accuracy" was not in the result set. ROC will be used
## instead.
rfFit
## Random Forest
##
## 938 samples
## 8 predictors
## 2 classes: 'Down', 'Up'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 844, 844, 845, 845, 843, 845, ...
##
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec ROC SD Sens SD Spec SD
## 2 1 1 1 9e-05 0.007 0.005
## 3 1 1 1 8e-05 0.007 0.005
## 4 1 1 1 0 0.007 0.005
## 5 1 1 1 0 0.007 0.005
## 6 1 1 1 8e-05 0.007 0.005
## 7 1 1 1 0 0.007 0.005
## 8 1 1 1 0 0.007 0.005
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
plot(rfFit)
#Trying the plot with some additional parameters
plot(rfFit, print.thres = 0.5, type="S")
rfPredict <- predict(rfFit,newdata = testing )
confusionMatrix(rfPredict, testing$Direction )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 150 0
## Up 0 162
##
## Accuracy : 1
## 95% CI : (0.988, 1)
## No Information Rate : 0.519
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.481
## Detection Rate : 0.481
## Detection Prevalence : 0.481
## Balanced Accuracy : 1.000
##
## 'Positive' Class : Down
##
mean(rfPredict == testing$Direction)
## [1] 1
Plotting the ROC curve, with the same levels fix as before. Given the perfect confusion matrix above, the corrected AUC should come out at 1.0 rather than the meaningless 0.5 that the original buggy call reported.
library(pROC)
rfPredict <- predict(rfFit, newdata = testing, type = "prob")
#levels must be the factor levels (control level first), not the data vector
rfROC <- roc(testing$Direction, rfPredict[,"Down"], levels = rev(levels(testing$Direction)))
rfROC
plot(rfROC, type = "S", print.thres = 0.5)
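As a final touch, pROC can overlay both curves and attach a confidence interval to an AUC; a minimal sketch assuming the knnROC and rfROC objects from above are still in scope:
#Overlay the two ROC curves for a direct comparison
plot(knnROC, type = "S", col = "blue")
lines(rfROC, col = "red")
legend("bottomright", legend = c("kNN", "random forest"),
       col = c("blue", "red"), lty = 1)
#DeLong confidence interval for the kNN AUC
ci.auc(knnROC)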