Practical Machine Learning. Course Project. Prediction Assignment Writeup

Overview

In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise (the classe variable).

The original data was preprocessed with the dplyr package to remove columns that are mostly NA or empty; any missing values left in the remaining columns are imputed during model fitting (knnImpute). Then 6 models (lda2, rf, gbm, svm, knn and kknn) were fitted: five with 3-fold cross-validation using the caret library, and the SVM directly with e1071. Comparison of the models on the validation subset showed that the random forest model has the highest accuracy. The selected model was then applied to the test data set.

Loading and Preprocessing the Data

The following code downloads the csv files with the training and testing data to the current working directory and then reads them in.

trainUrl<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainUrl,destfile=paste0(getwd(),"/pml-training.csv"), method = "curl")
download.file(testUrl,destfile=paste0(getwd(),"/pml-testing.csv"), method = "curl")
pmlTrain<-read.csv("pml-training.csv", stringsAsFactors = FALSE, na.strings=c("NA", "", "#DIV/0!"))
pmlTest<-read.csv("pml-testing.csv", stringsAsFactors = FALSE, na.strings=c("NA", "", "#DIV/0!"))

The training data set contains 19622 observations of 160 variables. Many columns in the training data set consist mostly of NAs or empty strings (read in as NA), so I’ve removed them from both the training and testing data sets. I’ve also excluded variables 1:7, which are not relevant for predicting the way the exercise was done.

library(dplyr)
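# Removing columns that are almost entirely NA (more than 19000 of 19622 rows)
csum<-colSums(is.na(pmlTrain))
nanames<-names(csum[csum>19000])
# all_of() selects by an external character vector without the ambiguity warning
pmlTrain2<-select(pmlTrain, -all_of(nanames))
pmlTesting<-select(pmlTest, -all_of(nanames))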
# Removing irrelevant columns
pmlTrain2<-pmlTrain2[,-(1:7)]
pmlTesting<-pmlTesting[,-(1:7)]
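
The cutoff of 19000 missing values corresponds to roughly 97% of the 19622 rows. As a minimal sketch of the same idea, the filter can be written in terms of the share of missing values instead of an absolute count. The toy data frame, its column names and the 50% cutoff below are made up for illustration and are not part of the original analysis.

```r
# Toy data frame standing in for pmlTrain (values are made up for illustration)
toy <- data.frame(
  roll_belt  = c(1.4, 1.5, 1.6, 1.7),
  kurtosis_x = c(NA, NA, NA, 0.2),   # 75% missing
  classe     = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Share of missing values per column
naShare <- colMeans(is.na(toy))

# Keep columns whose NA share does not exceed the hypothetical 50% cutoff
keep <- names(naShare)[naShare <= 0.5]
toyClean <- toy[, keep, drop = FALSE]
names(toyClean)
```

A relative cutoff keeps working if the number of rows changes, which a hard-coded count does not.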

The next step is to check the remaining columns for zero or near-zero variance and remove any that are found.

library(caret)
nzval<-nearZeroVar(pmlTrain2, saveMetrics = TRUE)
nzval

As we can see, nzv is FALSE for every column, so there is nothing to remove.

                     freqRatio percentUnique zeroVar   nzv
roll_belt             1.101904     6.7781062   FALSE FALSE
pitch_belt            1.036082     9.3772296   FALSE FALSE
yaw_belt              1.058480     9.9734991   FALSE FALSE
total_accel_belt      1.063160     0.1477933   FALSE FALSE
gyros_belt_x          1.058651     0.7134849   FALSE FALSE
gyros_belt_y          1.144000     0.3516461   FALSE FALSE
gyros_belt_z          1.066214     0.8612782   FALSE FALSE
accel_belt_x          1.055412     0.8357966   FALSE FALSE
accel_belt_y          1.113725     0.7287738   FALSE FALSE
accel_belt_z          1.078767     1.5237998   FALSE FALSE
magnet_belt_x         1.090141     1.6664968   FALSE FALSE
magnet_belt_y         1.099688     1.5187035   FALSE FALSE
magnet_belt_z         1.006369     2.3290184   FALSE FALSE
roll_arm             52.338462    13.5256345   FALSE FALSE
pitch_arm            87.256410    15.7323412   FALSE FALSE
yaw_arm              33.029126    14.6570176   FALSE FALSE
total_accel_arm       1.024526     0.3363572   FALSE FALSE
gyros_arm_x           1.015504     3.2769341   FALSE FALSE
gyros_arm_y           1.454369     1.9162165   FALSE FALSE
gyros_arm_z           1.110687     1.2638875   FALSE FALSE
accel_arm_x           1.017341     3.9598410   FALSE FALSE
accel_arm_y           1.140187     2.7367241   FALSE FALSE
accel_arm_z           1.128000     4.0362858   FALSE FALSE
magnet_arm_x          1.000000     6.8239731   FALSE FALSE
magnet_arm_y          1.056818     4.4439914   FALSE FALSE
magnet_arm_z          1.036364     6.4468454   FALSE FALSE
roll_dumbbell         1.022388    84.2065029   FALSE FALSE
pitch_dumbbell        2.277372    81.7449801   FALSE FALSE
yaw_dumbbell          1.132231    83.4828254   FALSE FALSE
total_accel_dumbbell  1.072634     0.2191418   FALSE FALSE
gyros_dumbbell_x      1.003268     1.2282132   FALSE FALSE
gyros_dumbbell_y      1.264957     1.4167771   FALSE FALSE
gyros_dumbbell_z      1.060100     1.0498420   FALSE FALSE
accel_dumbbell_x      1.018018     2.1659362   FALSE FALSE
accel_dumbbell_y      1.053061     2.3748853   FALSE FALSE
accel_dumbbell_z      1.133333     2.0894914   FALSE FALSE
magnet_dumbbell_x     1.098266     5.7486495   FALSE FALSE
magnet_dumbbell_y     1.197740     4.3012945   FALSE FALSE
magnet_dumbbell_z     1.020833     3.4451126   FALSE FALSE
roll_forearm         11.589286    11.0895933   FALSE FALSE
pitch_forearm        65.983051    14.8557741   FALSE FALSE
yaw_forearm          15.322835    10.1467740   FALSE FALSE
total_accel_forearm   1.128928     0.3567424   FALSE FALSE
gyros_forearm_x       1.059273     1.5187035   FALSE FALSE
gyros_forearm_y       1.036554     3.7763735   FALSE FALSE
gyros_forearm_z       1.122917     1.5645704   FALSE FALSE
accel_forearm_x       1.126437     4.0464784   FALSE FALSE
accel_forearm_y       1.059406     5.1116094   FALSE FALSE
accel_forearm_z       1.006250     2.9558659   FALSE FALSE
magnet_forearm_x      1.012346     7.7667924   FALSE FALSE
magnet_forearm_y      1.246914     9.5403119   FALSE FALSE
magnet_forearm_z      1.000000     8.5771073   FALSE FALSE
classe                1.469581     0.0254816   FALSE FALSE
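
To make the two metrics in the table concrete, here is a rough base R sketch of the rule behind the nzv column. The helper names are hypothetical, and the thresholds are assumed to be caret's documented defaults (freqCut = 95/5, uniqueCut = 10); this is an illustration of the idea, not caret's actual implementation.

```r
# Frequency ratio: count of the most common value over the second most common
freqRatio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(0)
  as.numeric(counts[1] / counts[2])
}

# Percentage of distinct values among all observations
percentUnique <- function(x) 100 * length(unique(x)) / length(x)

# Thresholds assumed to match caret's documented defaults
freqCut <- 95 / 5
uniqueCut <- 10

# A column is flagged if it is constant, or if one value dominates
# AND there are few distinct values
isNearZeroVar <- function(x) {
  length(unique(x)) == 1 ||
    (freqRatio(x) > freqCut && percentUnique(x) <= uniqueCut)
}

dominated <- c(rep(0, 99), 1)   # one value dominates, only two distinct values
spread    <- seq_len(100)       # every value distinct
flagDominated <- isNearZeroVar(dominated)
flagSpread    <- isNearZeroVar(spread)

# roll_arm from the table: high frequency ratio, but too many distinct values
rollArmFlagged <- 52.338462 > freqCut && 13.5256345 <= uniqueCut
c(flagDominated, flagSpread, rollArmFlagged)
```

Under these assumptions, this explains why columns such as roll_arm and pitch_arm are kept despite their large freqRatio: both conditions have to hold at the same time.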

After these steps the number of variables decreased to 53 (52 predictors plus classe). Since the training set has many observations, we can split it into two subsets, training (75%) and validation (25%), to find out which model performs best before applying it to the test set.

pmlTrain2$classe<-as.factor(pmlTrain2$classe)

library(caret)
set.seed(123456)
inTrain<-createDataPartition(y=pmlTrain2$classe, p=0.75, list = FALSE)
pmlValidation<-pmlTrain2[-inTrain,]
pmlTraining<-pmlTrain2[inTrain,]
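
createDataPartition samples within each level of classe, so both subsets keep roughly the original class proportions. A minimal base R sketch of such a stratified split is shown here; the labels and the seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed for the illustration

# Made-up class labels with unequal class sizes
classe <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Within each class, draw 75% of the row indices for training
idxByClass <- split(seq_along(classe), classe)
trainIdx <- sort(unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE))

table(classe[trainIdx])    # class counts in the training part
table(classe[-trainIdx])   # class counts in the validation part
```

Because the draw happens class by class, the 2:1:1 ratio of the toy labels is preserved in both parts.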

Fitting models

Many models use random numbers while their parameters are estimated. To make the results reproducible and to ensure that the same resamples are used between calls to train, we’ll call set.seed before every call to the train function. We will fit 6 models:

Linear Discriminant Analysis - lda2
Random Forest - rf
Generalized Boosted Regression Model - gbm
Support Vector Machines - svm
Two types of K-Nearest Neighbours (regular and weighted) - knn, kknn

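For every model fitted with train, 3-fold cross-validation is used by passing trControl=trainControl(method="cv", number=3); the SVM is fitted directly with e1071 and is not cross-validated. Then we’ll predict the classe variable for the validation data set and build confusion matrices. Note that the confusionMatrix calls pass the true labels first and the predictions second, so in the printed tables the rows labelled Prediction hold the actual classes and the columns labelled Reference hold the predicted ones. Overall accuracy and kappa are unaffected by this order, but the per-class Sensitivity and Pos Pred Value (and likewise Specificity and Neg Pred Value) are swapped.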
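
As a rough sketch of what 3-fold cross-validation does with the 14718 training rows, the following base R code assigns each row to one of three folds and counts how many rows remain for fitting when one fold is held out. The seed is arbitrary, and the folds here are purely random.

```r
set.seed(1)  # arbitrary seed for the illustration
n <- 14718   # number of rows in pmlTraining, as reported in the model summaries
k <- 3

# Assign every row to one of k folds at random, keeping fold sizes equal
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold, the model is fitted on the other k - 1 folds
trainSizes <- sapply(seq_len(k), function(i) sum(fold != i))
trainSizes
```

Each fit uses about two thirds of the rows. The model summaries in this report list 9812, 9811 and 9813 instead of three equal sizes, presumably because caret builds its folds within each class, which can shift sizes by a row or two.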

Linear Discriminant Analysis

set.seed(123456)
fitlda2<-train(classe~., data=pmlTraining, method="lda2", preProcess="knnImpute",
              trControl=trainControl(method="cv", number=3))
fitlda2

Linear Discriminant Analysis 

14718 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

Pre-processing: nearest neighbor imputation (52), centered (52),
 scaled (52) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 9812, 9811, 9813 
Resampling results across tuning parameters:

  dimen  Accuracy   Kappa    
  1      0.4697651  0.3166359
  2      0.5950536  0.4862335
  3      0.6449928  0.5497851
  4      0.7008431  0.6213326

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was dimen = 4.

vPredict<-predict(fitlda2, pmlValidation)
cmlda2<-confusionMatrix(pmlValidation$classe,vPredict); cmlda2

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1153   34  108   97    3
         B  152  604  113   35   45
         C   76   74  560  114   31
         D   46   37  100  602   19
         E   43  157   80  108  513

Overall Statistics
                                          
               Accuracy : 0.6998          
                 95% CI : (0.6868, 0.7126)
    No Information Rate : 0.2998          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.62            
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.7844   0.6667   0.5827   0.6297   0.8396
Specificity            0.9295   0.9137   0.9252   0.9488   0.9096
Pos Pred Value         0.8265   0.6365   0.6550   0.7488   0.5694
Neg Pred Value         0.9097   0.9236   0.9010   0.9137   0.9755
Prevalence             0.2998   0.1847   0.1960   0.1949   0.1246
Detection Rate         0.2351   0.1232   0.1142   0.1228   0.1046
Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
Balanced Accuracy      0.8569   0.7902   0.7540   0.7893   0.8746
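
The overall statistics can be reproduced by hand from the table. The sketch below copies the lda2 counts from the output above and recomputes accuracy and Cohen's kappa in base R; the variable names are chosen for this illustration only.

```r
# Counts copied from the lda2 table above, in the printed row and column order
cm <- matrix(c(1153,  34, 108,  97,   3,
                152, 604, 113,  35,  45,
                 76,  74, 560, 114,  31,
                 46,  37, 100, 602,  19,
                 43, 157,  80, 108, 513),
             nrow = 5, byrow = TRUE,
             dimnames = list(rows = LETTERS[1:5], cols = LETTERS[1:5]))

n <- sum(cm)

# Accuracy: share of observations on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa: agreement beyond chance, relative to the maximum possible
kappaHat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappaHat), 4)
```

Both values agree with the printed Accuracy of 0.6998 and Kappa of 0.62, and neither depends on which axis holds the predictions.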

Random Forests

set.seed(123456)
fitrf<-train(classe~., data=pmlTraining, method="rf", preProcess="knnImpute",
             trControl=trainControl(method="cv", number=3))
fitrf

Random Forest 

14718 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

Pre-processing: nearest neighbor imputation (52), centered (52),
 scaled (52) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 9812, 9811, 9813 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9883135  0.9852154
  27    0.9896042  0.9868490
  52    0.9803634  0.9751556

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 27.

vPredict<-predict(fitrf, pmlValidation)
cmrf<-confusionMatrix(pmlValidation$classe,vPredict); cmrf

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1395    0    0    0    0
         B    4  943    2    0    0
         C    0    3  849    3    0
         D    0    0    7  797    0
         E    0    1    4    3  893

Overall Statistics
                                         
               Accuracy : 0.9945         
                 95% CI : (0.992, 0.9964)
    No Information Rate : 0.2853         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.993          
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9971   0.9958   0.9849   0.9925   1.0000
Specificity            1.0000   0.9985   0.9985   0.9983   0.9980
Pos Pred Value         1.0000   0.9937   0.9930   0.9913   0.9911
Neg Pred Value         0.9989   0.9990   0.9968   0.9985   1.0000
Prevalence             0.2853   0.1931   0.1758   0.1637   0.1821
Detection Rate         0.2845   0.1923   0.1731   0.1625   0.1821
Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
Balanced Accuracy      0.9986   0.9971   0.9917   0.9954   0.9990

Generalized Boosted Regression Model

set.seed(123456)
fitgbm<-train(classe~., data=pmlTraining, method="gbm", preProcess="knnImpute",
              trControl=trainControl(method="cv", number=3), verbose=FALSE)
fitgbm

Stochastic Gradient Boosting 

14718 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

Pre-processing: nearest neighbor imputation (52), centered (52),
 scaled (52) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 9812, 9811, 9813 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa    
  1                   50      0.7506455  0.6839001
  1                  100      0.8204911  0.7727524
  1                  150      0.8526968  0.8135909
  2                   50      0.8576571  0.8196083
  2                  100      0.9087509  0.8844991
  2                  150      0.9334823  0.9158221
  3                   50      0.8938040  0.8655349
  3                  100      0.9427908  0.9275977
  3                  150      0.9590973  0.9482429

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150,
 interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

vPredict<-predict(fitgbm, pmlValidation)
cmgbm<-confusionMatrix(pmlValidation$classe,vPredict); cmgbm

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1366   15    9    4    1
         B   26  897   25    0    1
         C    0   26  818    8    3
         D    1    2   30  767    4
         E    1    4   19   15  862

Overall Statistics
                                          
               Accuracy : 0.9604          
                 95% CI : (0.9546, 0.9657)
    No Information Rate : 0.2843          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.95            
 Mcnemar's Test P-Value : 5.442e-07       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9799   0.9502   0.9079   0.9660   0.9897
Specificity            0.9917   0.9869   0.9908   0.9910   0.9903
Pos Pred Value         0.9792   0.9452   0.9567   0.9540   0.9567
Neg Pred Value         0.9920   0.9881   0.9795   0.9934   0.9978
Prevalence             0.2843   0.1925   0.1837   0.1619   0.1776
Detection Rate         0.2785   0.1829   0.1668   0.1564   0.1758
Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
Balanced Accuracy      0.9858   0.9685   0.9493   0.9785   0.9900

Support Vector Machines

SVM is one of the most widely used and robust classifiers. It can learn linear decision boundaries efficiently and, through kernels, also non-linear boundaries for problems that are not linearly separable. Here it is fitted with e1071 using the default radial kernel. As the results below show, it’s a little less accurate than rf and gbm, but much more accurate than lda2.

library(e1071)
set.seed(123456)
fitsvm<-svm(classe~., data=pmlTraining)
fitsvm


Call:
svm(formula = classe ~ ., data = pmlTraining)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.01923077 

Number of Support Vectors:  6760
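
The printed gamma of 0.01923077 equals 1/52, one over the number of predictors. As a small sketch of what this parameter controls, the radial kernel can be written out in base R; the function name and the two example points are made up for illustration.

```r
# Radial basis kernel: similarity decays with the squared Euclidean distance
rbfKernel <- function(x, y, gamma) exp(-gamma * sum((x - y)^2))

# One over the number of predictors, matching the printed value
gamma <- 1 / 52
round(gamma, 8)

x <- rep(0, 52)
simSame  <- rbfKernel(x, x, gamma)       # identical points
simApart <- rbfKernel(x, x + 1, gamma)   # squared distance 52, so exp(-1)
c(simSame, simApart)
```

A larger gamma makes the similarity fall off faster with distance, which gives a more flexible and more wiggly decision boundary.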

vPredict<-predict(fitsvm, pmlValidation)
cmsvm<-confusionMatrix(pmlValidation$classe,vPredict); cmsvm

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1382    4    9    0    0
         B   59  862   27    0    1
         C    2   31  812    9    1
         D    2    0   82  719    1
         E    1    7   28   21  844

Overall Statistics
                                         
               Accuracy : 0.9419         
                 95% CI : (0.935, 0.9483)
    No Information Rate : 0.2949         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9264         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9557   0.9535   0.8476   0.9599   0.9965
Specificity            0.9962   0.9782   0.9891   0.9795   0.9860
Pos Pred Value         0.9907   0.9083   0.9497   0.8943   0.9367
Neg Pred Value         0.9818   0.9894   0.9639   0.9927   0.9993
Prevalence             0.2949   0.1843   0.1954   0.1527   0.1727
Detection Rate         0.2818   0.1758   0.1656   0.1466   0.1721
Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
Balanced Accuracy      0.9760   0.9659   0.9184   0.9697   0.9912

K-Nearest Neighbor Classifier

set.seed(123456)
fitknn<-train(classe~., data=pmlTraining, method="knn", preProcess="knnImpute",
              trControl=trainControl(method="cv", number=3))
fitknn

k-Nearest Neighbors 

14718 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

Pre-processing: nearest neighbor imputation (52), centered (52),
 scaled (52) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 9812, 9811, 9813 
Resampling results across tuning parameters:

  k  Accuracy   Kappa    
  5  0.9474790  0.9335486
  7  0.9313758  0.9131477
  9  0.9173792  0.8954277

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.

vPredict<-predict(fitknn, pmlValidation)
cmknn<-confusionMatrix(pmlValidation$classe,vPredict); cmknn

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1382    5    7    0    1
         B   22  912   15    0    0
         C    4   16  825   10    0
         D    3    0   32  767    2
         E    0   12   16    6  867

Overall Statistics
                                         
               Accuracy : 0.9692         
                 95% CI : (0.964, 0.9739)
    No Information Rate : 0.2877         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.961          
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9794   0.9651   0.9218   0.9796   0.9966
Specificity            0.9963   0.9907   0.9925   0.9910   0.9916
Pos Pred Value         0.9907   0.9610   0.9649   0.9540   0.9623
Neg Pred Value         0.9917   0.9917   0.9827   0.9961   0.9993
Prevalence             0.2877   0.1927   0.1825   0.1597   0.1774
Detection Rate         0.2818   0.1860   0.1682   0.1564   0.1768
Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
Balanced Accuracy      0.9879   0.9779   0.9572   0.9853   0.9941

Weighted k-Nearest Neighbor Classifier

The weighted variant (kknn) performs k-nearest neighbor classification as follows: for each row of the test set, the k nearest training set vectors (according to Minkowski distance) are found, and the classification is done via the maximum of summed kernel densities, so closer neighbors carry more weight.

library(kknn)
set.seed(123456)
fitkknn<-train(classe~., data=pmlTraining, method="kknn", preProcess="knnImpute",
              trControl=trainControl(method="cv", number=3))
fitkknn

k-Nearest Neighbors 

14718 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

Pre-processing: nearest neighbor imputation (52), centered (52),
 scaled (52) 
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 9812, 9811, 9813 
Resampling results across tuning parameters:

  kmax  Accuracy   Kappa    
  5     0.9824025  0.9777427
  7     0.9824025  0.9777427
  9     0.9824025  0.9777427

Tuning parameter 'distance' was held constant at a value of 2

Tuning parameter 'kernel' was held constant at a value of optimal
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were kmax = 9, distance = 2 and
 kernel = optimal.

vPredict<-predict(fitkknn, pmlValidation)
cmkknn<-confusionMatrix(pmlValidation$classe,vPredict); cmkknn

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1391    1    1    1    1
         B    5  939    4    1    0
         C    3    8  840    4    0
         D    1    1   12  788    2
         E    0    3    1    3  894

Overall Statistics
                                          
               Accuracy : 0.9894          
                 95% CI : (0.9861, 0.9921)
    No Information Rate : 0.2855          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9866          
 Mcnemar's Test P-Value : 0.1641          

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9936   0.9863   0.9790   0.9887   0.9967
Specificity            0.9989   0.9975   0.9963   0.9961   0.9983
Pos Pred Value         0.9971   0.9895   0.9825   0.9801   0.9922
Neg Pred Value         0.9974   0.9967   0.9956   0.9978   0.9993
Prevalence             0.2855   0.1941   0.1750   0.1625   0.1829
Detection Rate         0.2836   0.1915   0.1713   0.1607   0.1823
Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
Balanced Accuracy      0.9962   0.9919   0.9877   0.9924   0.9975
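
The weighted vote described at the start of this section can be sketched in base R. This is a simplified illustration under stated assumptions: the function names are hypothetical, the triangular weight 1 - d stands in for the "optimal" kernel selected by the fitted model, and the toy points are made up.

```r
# Minkowski distance of order p (p = 2 is the Euclidean distance)
minkowski <- function(a, b, p = 2) sum(abs(a - b)^p)^(1 / p)

# Weighted k-NN vote for one query point.
# The triangular weight 1 - d is a simplifying assumption for this sketch;
# it is not the "optimal" kernel used by the fitted model.
weightedKnn <- function(trainX, trainY, query, k = 3, p = 2) {
  d <- apply(trainX, 1, minkowski, b = query, p = p)
  nearest <- order(d)[seq_len(k + 1)]
  kNearest <- nearest[seq_len(k)]
  # Scale distances by the (k + 1)-th neighbour so that weights lie in [0, 1]
  w <- 1 - d[kNearest] / d[nearest[k + 1]]
  # Sum the weights per class and return the class with the largest total
  votes <- tapply(w, trainY[kNearest], sum)
  names(which.max(votes))
}

trainX <- rbind(c(0, 0), c(0.1, 0), c(2, 2), c(2.1, 2), c(2, 2.1), c(3, 3))
trainY <- c("A", "A", "B", "B", "B", "B")
predicted <- weightedKnn(trainX, trainY, query = c(0.2, 0.1), k = 5)
predicted
```

With k = 5 the neighbourhood holds two close A points and three distant B points. An unweighted majority vote would pick B, while the distance weights favour A.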

Plots of final models

Meaningful plots of the final model can be built for only 2 models. For random forest we can see that the error decreases as the number of trees increases. The plot for the kknn model shows the quality of the classification as a function of the number of neighbors.

par(mfrow=c(1,2),mar=c(5,4,2,2))
plot(fitrf$finalModel, main="RF")
plot(fitkknn$finalModel, main="KKNN")

Plots for tuning parameters

The following plots show how the accuracy changes while the parameters of the models are tuned. The model parameters are selected based on the accuracy value.

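library(ggplot2)
library(Rmisc)  # assumed source of multiplot(); it is not part of caret or ggplot2
p1<-ggplot(fitlda2) + labs(title="lda2") + theme_bw()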
p2<-ggplot(fitrf) + labs(title="rf") + theme_bw()
p3<-ggplot(fitgbm) + labs(title="gbm") + theme_bw()
p4<-ggplot(fitknn) + labs(title="knn") + theme_bw()
p5<-ggplot(fitkknn) + labs(title="kknn") + theme_bw()
multiplot(p1, p4, p2, p5, p3, cols=3)

Accuracy comparison

Random forest and KKNN have the best accuracy on the validation subset. RF is slightly more accurate (0.9945 vs 0.9894), so I’ll use it to predict on the test data set.

accuracyDF<-data.frame(Model=c("lda2", "rf", "gbm", "svm","knn", "kknn"),
                       Accuracy=c(cmlda2$overall[1], cmrf$overall[1], cmgbm$overall[1],
                                  cmsvm$overall[1], cmknn$overall[1], cmkknn$overall[1]))
accuracyDF

  Model  Accuracy
1  lda2 0.6998369
2    rf 0.9944943
3   gbm 0.9604405
4   svm 0.9418842
5   knn 0.9692088
6  kknn 0.9893964
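
Since the validation subset was not used for fitting, one minus the validation accuracy serves as an estimate of the out-of-sample error. The following sketch copies the accuracies from the table above and ranks the models; the Error column and the helper variable names are additions for this illustration.

```r
# Validation accuracies copied from the table above
accuracyDF <- data.frame(
  Model = c("lda2", "rf", "gbm", "svm", "knn", "kknn"),
  Accuracy = c(0.6998369, 0.9944943, 0.9604405, 0.9418842, 0.9692088, 0.9893964),
  stringsAsFactors = FALSE
)

# Estimated out-of-sample error is one minus the validation accuracy
accuracyDF$Error <- 1 - accuracyDF$Accuracy

# Rank the models from most to least accurate
ranked <- accuracyDF[order(accuracyDF$Accuracy, decreasing = TRUE), ]
best <- ranked$Model[1]
ranked
```

For the random forest this gives an estimated out-of-sample error of about 0.55%.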

Predicting for Test Data

Let’s predict the manner in which the exercises were done for the test data set using the random forest model trained with cross-validation.

testPredict<-predict(fitrf, pmlTesting)
testPredict

 [1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
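
As a rough plausibility check, the validation accuracy can be used to estimate how many of the 20 test predictions are likely to be correct. The sketch assumes that the test cases are independent and that the validation accuracy carries over to the test set; both are assumptions, not results from the original analysis.

```r
accuracy <- 0.9944943   # random forest accuracy on the validation subset
nTest <- 20             # number of cases in the test set

# Probability that all predictions are correct, assuming independent cases
pAllCorrect <- accuracy^nTest

# Expected number of misclassified test cases under the same assumption
expectedErrors <- nTest * (1 - accuracy)

round(c(pAllCorrect = pAllCorrect, expectedErrors = expectedErrors), 3)
```

Under these assumptions the expected number of misclassified test cases is well below one.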