#Loading all required libraries
library(caret)
#In order to perform training in parallel on 4 cores
library(doParallel)
registerDoParallel(cores=4)

Cleaning the data set

First of all, in order to perform any kind of analysis, the training data (the data frame pml) should be cleaned.
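The loading step itself is not shown; a minimal sketch of the assumed step (the file name is an assumption, and read.csv defaults are used, which is consistent with the column counts reported below):

#Assumed loading step (a sketch; the file name is an assumption)
pml <- read.csv("pml-training.csv")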

#Lets examine dimension of original data
dim(pml)
## [1] 19622   160

Currently, the data set contains 160 variables (including the response variable “classe”) and 19622 cases.

  1. Deleting all attributes that have more than 90% NAs:
#Count the NAs in each column
na_counts <- apply(pml, MARGIN = 2, FUN = function(column) sum(is.na(column)))
#Number of variables with at least 10% NAs (each of them is in fact >90% NA)
sum(na_counts >= 0.1 * nrow(pml))
## [1] 67
#Keeping only the variables with fewer than 10% NAs
pml <- pml[, na_counts < 0.1 * nrow(pml)]

#Dimension of data set without NAs
dim(pml)
## [1] 19622    93

So, 67 attributes of the original data are more than 90% NA. To make use of these attributes we would have to shrink our data more than tenfold by keeping only complete rows; the better way is simply to ignore these variables.
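To confirm that the 10% threshold is not borderline, one can look at the distribution of per-column NA fractions before the removal step (a sketch):

#Distribution of per-column NA fractions (run before the removal above);
#columns are either complete or almost entirely NA
table(round(colMeans(is.na(pml)), 2))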

  2. Deleting variables with near-zero variance:
#Number of "near zero variance" variables
nsv <- nearZeroVar(pml, saveMetrics = TRUE)
sum(nsv$nzv)
## [1] 34
#Removing "near zero variance" variables from data set
pml <- pml[,!nsv$nzv]

#Dimension of data set without "near zero variance" variables
dim(pml)
## [1] 19622    59

These 34 variables would not increase prediction accuracy because their variation is very small (their entropy is almost zero), so they should not be used for training either.
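The diagnostics behind the nzv flag can be inspected directly, for example the frequency ratio and percentage of unique values of the flagged variables (a sketch, using the nsv table computed above):

#Diagnostics for the flagged variables (a sketch)
head(nsv[nsv$nzv, c("freqRatio", "percentUnique")])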

  3. Deleting non-informative variables:
#Deleting timestamps and order variables
pml <- pml[, -c(1, 3, 4, 5)]

#Dimension of data
dim(pml)
## [1] 19622    55

There is no useful information in the variables “X” (the row index), “raw_timestamp_part_1”, “raw_timestamp_part_2”, and “cvtd_timestamp”.
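Dropping columns by position is fragile if the column order ever changes; an equivalent name-based version of the same step would be (a sketch):

#Equivalent removal by name rather than by position (a sketch)
drop_vars <- c("X", "raw_timestamp_part_1", "raw_timestamp_part_2",
               "cvtd_timestamp")
pml <- pml[, !(names(pml) %in% drop_vars)]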

  4. Verifying that no NAs are left:
#Number of NAs in final data
sum(apply(pml, MARGIN = 2, FUN = function (column) {sum(is.na(column))}))
## [1] 0

So, there are no NAs left in the data. We have 54 predictors and one response variable.


Training the model

  1. Splitting the data set into training and testing sets:
#Splitting data
set.seed(0)
inTrain <- createDataPartition(y=pml$classe, p=0.8, list=FALSE)
training <- pml[inTrain,]
testing <- pml[-inTrain,]
dim(training); dim(testing)
## [1] 15699    55
## [1] 3923   55

There are 15699 cases for training and 3923 cases for testing.
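createDataPartition samples within each class of the outcome, which can be checked by comparing class proportions across the two splits (a sketch):

#Class proportions should be nearly identical in both splits
round(rbind(training = prop.table(table(training$classe)),
            testing  = prop.table(table(testing$classe))), 3)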

  2. Cross-validation:
#Cross-validation options
fitControl = trainControl(method = "repeatedcv", number = 10, repeats = 5, verboseIter = TRUE)

Cross-validation is used to tune the model and to estimate the out-of-sample error.
The type of cross-validation is repeated k-fold, with 10 folds and 5 repeats.

  3. Model and tuning space:
#Tuning options
c50Grid <- expand.grid(.trials = c(1:100),
                       .model = c("tree"),
                       .winnow = c(TRUE, FALSE))

A C5.0 boosting model is used. Boosting is the process of adding weak learners in such a way that each new learner picks up the slack of the earlier ones, so this approach ends up with a set of trees (not rules, because of the tuning option .model = c(“tree”)). Each subsequent tree tries to improve prediction accuracy on the cases that the previous trees predicted poorly. Splitting while building the trees is based on information gain. The other tuning options are trials, the number of boosting iterations (trees) to consider, which here ranges over 1:100, and winnow, which tries the model both with and without winnowing, an approach to dealing with overfitting that attempts to remove predictors to improve model accuracy.
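Given this grid and the repeated cross-validation above, a back-of-the-envelope sketch of the tuning workload (caret can often evaluate smaller trials values as sub-models of a larger fit, so the true cost is lower, but this still explains the long run time reported below):

#Candidate parameter combinations in the grid
nrow(c50Grid)            #100 trials x 1 model type x 2 winnow = 200
#Combinations times resamples (10 folds x 5 repeats)
nrow(c50Grid) * 10 * 5   #10000 resampled evaluations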

  4. Training the model:
#Training model
C50_model <- train(classe ~.,
                   method = "C5.0",
                   data = training,
                   tuneGrid = c50Grid,
                   trControl = fitControl)

The training data is used for fitting and, as the formula classe ~ . shows, all remaining variables serve as predictors of the “classe” variable. eval = FALSE has been used in this chunk to save time (model evaluation and tuning takes almost 1 hour on 4 cores). The next chunk loads the already tuned model from my working directory:

#Loading tuned model
load(file = "C50_model.rda")
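The .rda file is assumed to have been produced once, right after training, with something like:

#One-time save after training (a sketch, matching the load() above)
save(C50_model, file = "C50_model.rda")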

Examining and evaluating the model

  1. Best tuning parameters:
#Tuning process
plot(C50_model)

This plot shows the tuning process. There are two lines: the first shows cross-validation accuracy for different numbers of trials without winnowing, the second with winnowing. As can be observed, winnowing makes almost no difference; the accuracy is nearly the same either way. Moreover, beyond roughly 20 trials the accuracy barely changes, so using 20 trees or up to 100 gives almost the same result.
Best tuning parameters:

#Tuning parameters
C50_model$bestTune
##     trials model winnow
## 178     78  tree   TRUE
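The cross-validated performance for exactly these parameters can be read off the resampling results (a sketch; merge() joins on the shared tuning columns):

#Resampled accuracy and Kappa for the selected parameters (a sketch)
merge(C50_model$bestTune, C50_model$results)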
  2. Density of accuracy and Kappa coefficients across the cross-validation resamples:
#Density of accuracy and Kappa for different k-folds cross-validation iterations
resampleHist(C50_model, type = "density", layout = c(2, 1), adjust = 1.5)

As can be observed, the model is highly accurate: every cross-validation resample achieved more than 99% accuracy.
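The same information is available numerically from the stored resamples (a sketch):

#Numeric summary of accuracy across the 50 cross-validation resamples
summary(C50_model$resample$Accuracy)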

  3. Important variables for prediction:
#Important variables
important_variables <- varImp(C50_model, scale = TRUE)
plot(important_variables, top = 58)

This plot shows how important each variable is to the fitted model. As can be observed, there are 8 variables that are not important for prediction at all.

#Variables with zero importance
row.names(tail(important_variables$importance, 8))
## [1] "accel_belt_y"     "magnet_belt_y"    "roll_arm"        
## [4] "gyros_arm_y"      "accel_arm_y"      "magnet_arm_y"    
## [7] "pitch_dumbbell"   "magnet_forearm_x"
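The same list can be extracted by value rather than by eyeballing the tail of the table (a sketch; for C5.0, varImp reports predictor usage, so unused predictors score 0 in the Overall column):

#Predictors with zero importance, selected programmatically (a sketch)
imp <- important_variables$importance
rownames(imp)[imp$Overall == 0]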
  4. Accuracy on training data (in-sample error):
#In sample error
confusionMatrix(training$classe, predict(C50_model, newdata = training))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4464    0    0    0    0
##          B    0 3038    0    0    0
##          C    0    0 2738    0    0
##          D    0    0    0 2573    0
##          E    0    0    0    0 2886
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

As can be observed, the model reports the best possible accuracy, 1, on the training data: no cases were classified incorrectly. In-sample accuracy is an optimistic estimate, though, so the test-set results below give the more honest picture.

  5. Accuracy on testing data (out-of-sample error):
#Out of sample error
confusionMatrix(testing$classe, predict(C50_model, newdata = testing))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    0    0    0    0
##          B    1  758    0    0    0
##          C    0    1  683    0    0
##          D    0    0    0  643    0
##          E    0    1    0    2  718
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9987         
##                  95% CI : (0.997, 0.9996)
##     No Information Rate : 0.2847         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9984         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9974   1.0000   0.9969   1.0000
## Specificity            1.0000   0.9997   0.9997   1.0000   0.9991
## Pos Pred Value         1.0000   0.9987   0.9985   1.0000   0.9958
## Neg Pred Value         0.9996   0.9994   1.0000   0.9994   1.0000
## Prevalence             0.2847   0.1937   0.1741   0.1644   0.1830
## Detection Rate         0.2845   0.1932   0.1741   0.1639   0.1830
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9996   0.9985   0.9998   0.9984   0.9995

As can be observed, the model reports very high accuracy, 0.9987, on the testing data, i.e. an estimated out-of-sample error of about 0.13%: only 5 of 3923 cases were classified incorrectly. All 20 cases from the pml.testing data were also classified correctly by this model.
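For reference, the 20 submission cases would be scored like this (a sketch, assuming pml.testing has been loaded; predict() only needs the predictor columns to be present):

#Scoring the 20 held-out cases (a sketch)
predict(C50_model, newdata = pml.testing)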