#Loading all required libraries
library(caret)
#In order to perform training in parallel on 4 cores
library(doParallel)
registerDoParallel(cores=4)
Before any analysis can be performed, the training data has to be cleaned.
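The pml data frame is assumed to have been loaded beforehand; a minimal sketch of that step (the file name and na.strings values are assumptions) could look like:
#Loading the training data; file name and na.strings are assumptions
#(the WLE data encodes missing values as NA, "" and "#DIV/0!")
pml <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))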
#Let's examine the dimensions of the original data
dim(pml)
## [1] 19622 160
The original data set contains 160 variables (including the response variable “classe”) and 19622 cases.
#Number of variables that have more than 90% NAs
sum(!apply(pml, MARGIN = 2, FUN = function (column) {sum(is.na(column))}) < 0.1 * nrow(pml))
## [1] 67
#Removing variables with a lot of NAs
pml <- pml[, apply(pml, MARGIN = 2, FUN = function (column) {sum(is.na(column))}) < 0.1 * nrow(pml)]
#Dimension of data set without NAs
dim(pml)
## [1] 19622 93
So, 67 attributes of the original data consist of 90% or more NAs. Using them to build a model would mean restricting the data to complete cases, shrinking it by more than a factor of 10, so the better option is simply to drop these variables.
#Number of "near zero variance" variables
nsv <- nearZeroVar(pml, saveMetrics = TRUE)
sum(nsv$nzv)
## [1] 34
#Removing "near zero variance" variables from data set
pml <- pml[,!nsv$nzv]
#Dimension of data set without "near zero variance" variables
dim(pml)
## [1] 19622 59
These 34 variables would not improve prediction accuracy because their variation is very small (their entropy is close to zero), so they should not be used for training either.
#Deleting timestamps and order variables
pml <- pml[, -c(1, 3, 4, 5)]
#Dimension of data
dim(pml)
## [1] 19622 55
The variables “X”, “raw_timestamp_part_1”, “raw_timestamp_part_2” and “cvtd_timestamp” carry no useful information for prediction (they only encode row order and recording time), so they are removed as well.
#Number of NAs in final data
sum(apply(pml, MARGIN = 2, FUN = function (column) {sum(is.na(column))}))
## [1] 0
So, there are no NAs left in the data. We end up with 54 predictors and one response variable.
#Splitting data
set.seed(0)
inTrain <- createDataPartition(y=pml$classe, p=0.8, list=FALSE)
training <- pml[inTrain,]
testing <- pml[-inTrain,]
dim(training); dim(testing)
## [1] 15699 55
## [1] 3923 55
There are 15699 cases for training and 3923 cases for testing.
#Cross-validation options
fitControl = trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = TRUE)
Cross-validation is used to tune the model and to estimate the out-of-sample error. The scheme is repeated k-fold cross-validation with 10 folds and 5 repeats, so every candidate parameter combination is evaluated on 50 resamples.
#Tuning options
c50Grid <- expand.grid(.trials = c(1:100),
                       .model = c("tree"),
                       .winnow = c(TRUE, FALSE))
A C5.0 boosting model is used. Boosting adds weak learners sequentially so that each new learner picks up the slack of the earlier ones: every subsequent tree tries to improve the predictions for the cases the previous trees got wrong. Because of the tuning option .model = c(“tree”), the ensemble consists of trees rather than rule sets, and splits inside each tree are chosen by information gain. The remaining tuning options are trials, the number of boosting iterations (trees) to consider, which here ranges over 1:100, and winnow, which toggles winnowing on and off; winnowing attempts to remove uninformative predictors before training as a way to reduce overfitting. A quick check on the size of this search is sketched below.
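As a quick check on the size of this search (a sketch, not part of the original run):
#Size of the tuning grid: 100 trial values x 1 model type x 2 winnow settings,
#i.e. 200 candidate combinations, each assessed on 10 folds x 5 repeats = 50 resamples
nrow(c50Grid)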
#Training model
C50_model <- train(classe ~ .,
                   method = "C5.0",
                   data = training,
                   tuneGrid = c50Grid,
                   trControl = fitControl)
The training data is used to fit the model, and all remaining variables serve as predictors of the “classe” variable. eval = FALSE was set for this chunk to save time when knitting (evaluating and tuning the model takes almost 1 hour on 4 cores). The next chunk loads the already tuned model from my working directory:
#Loading tuned model
load(file = "C50_model.rda")
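For completeness, the model was presumably written to this file once after training, with something along the lines of:
#Saving the tuned model so later knits can skip the ~1 hour training step
#(a sketch; the file name matches the load() call above)
save(C50_model, file = "C50_model.rda")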
#Tuning process
plot(C50_model)
This plot shows the tuning process. There are two lines: one shows the cross-validation accuracy for different numbers of trials without winnowing, the other with winnowing. As can be seen, winnowing makes almost no difference; the accuracy is nearly identical either way. Moreover, from roughly 20 trials onward the accuracy barely changes, so using 20 trees or up to 100 trees gives almost the same result. The plateau can also be checked numerically, as sketched below.
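A sketch using the resampling summary that caret stores in the results field of the train object:
#Cross-validation accuracy for a few trial counts, with and without winnowing
subset(C50_model$results, trials %in% c(1, 10, 20, 50, 100),
       select = c(trials, winnow, Accuracy, Kappa))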
Best tuning parameters:
#Tuning parameters
C50_model$bestTune
## trials model winnow
## 178 78 tree TRUE
#Density of accuracy and Kappa for different k-folds cross-validation iterations
resampleHist(C50_model, type = "density", layout = c(2, 1), adjust = 1.5)
As can be seen, the model is highly accurate: every cross-validation resample achieves more than 99% accuracy. This can be confirmed directly, as sketched below.
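A sketch reading the per-resample accuracies kept in the train object (assuming the default resample-saving behaviour of trainControl):
#Worst and best resample accuracy for the selected tuning parameters
range(C50_model$resample$Accuracy)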
#Important variables
important_variables <- varImp(C50_model, scale = TRUE)
plot(important_variables, top = 58)
This plot shows how important the individual variables are to the fitted model. As can be seen, 8 variables do not contribute to the prediction at all.
#Variables with no importance
row.names(tail(important_variables$importance, 8))
## [1] "accel_belt_y" "magnet_belt_y" "roll_arm"
## [4] "gyros_arm_y" "accel_arm_y" "magnet_arm_y"
## [7] "pitch_dumbbell" "magnet_forearm_x"
#In sample error
confusionMatrix(training$classe, predict(C50_model, newdata = training))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4464 0 0 0 0
## B 0 3038 0 0 0
## C 0 0 2738 0 0
## D 0 0 0 2573 0
## E 0 0 0 0 2886
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
As expected for a boosted tree ensemble, the model achieves the best possible accuracy of 1 on the training data: no training case is misclassified. This in-sample figure is optimistic, so the out-of-sample error is estimated on the held-out testing set below.
#Out of sample error
confusionMatrix(testing$classe, predict(C50_model, newdata = testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 0 0 0 0
## B 1 758 0 0 0
## C 0 1 683 0 0
## D 0 0 0 643 0
## E 0 1 0 2 718
##
## Overall Statistics
##
## Accuracy : 0.9987
## 95% CI : (0.997, 0.9996)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9984
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9974 1.0000 0.9969 1.0000
## Specificity 1.0000 0.9997 0.9997 1.0000 0.9991
## Pos Pred Value 1.0000 0.9987 0.9985 1.0000 0.9958
## Neg Pred Value 0.9996 0.9994 1.0000 0.9994 1.0000
## Prevalence 0.2847 0.1937 0.1741 0.1644 0.1830
## Detection Rate 0.2845 0.1932 0.1741 0.1639 0.1830
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9996 0.9985 0.9998 0.9984 0.9995
The model achieves a very high accuracy of 0.9987 on the testing data: only 5 cases are misclassified. All 20 cases from the pml.testing data were also classified correctly by this model.
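The 20 graded cases were presumably scored the same way as the validation set; a minimal sketch, assuming pml.testing was read from pml-testing.csv in the same way as the training data:
#Predicting the 20 graded cases (pml.testing is an assumption: the raw 20-row
#test file, which still contains all predictor columns used by the model)
predict(C50_model, newdata = pml.testing)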