Synopsis

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These devices are part of the “quantified self movement” - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, I shall investigate activity tracker data from six male participants aged between 20 and 28 years, with limited weightlifting experience. These participants were asked to perform one set of ten repetitions of the Unilateral Dumbbell Biceps Curl using a 1.25 kg dumbbell in five different fashions, comprising the correct technique and four common technique errors as follows:

  1. exactly according to the specification (Class A)
  2. throwing the elbows to the front (Class B)
  3. lifting the dumbbell only halfway (Class C)
  4. lowering the dumbbell only halfway (Class D)
  5. throwing the hips to the front (Class E).

This data was sourced from Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., and Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises, Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13), Stuttgart, Germany: ACM SIGCHI, 2013. (http://groupware.les.inf.puc-rio.br/har#wle_paper_section)

I shall undertake three types of data analysis to investigate the degree of accuracy with which predictor variables can predict whether a Unilateral Dumbbell Biceps Curl is being performed correctly, and if not, which common error is being made. These are:

  1. Random Forest
  2. Decision Trees
  3. Generalised Boosting Model

The most accurate of these prediction models shall be used to predict whether a Unilateral Dumbbell Biceps Curl is being performed correctly, and if not, which common error is being made, using the test dataset.

Load software and data

The following block of R code shall:

  1. load a series of R packages
  2. download the activity tracker data in the form of two .csv files (training and testing) from a web address and read the data into R
  3. partition the training dataset into a training dataset (70%) and a validation dataset (30%)
  4. remove variables that have near zero variability, are mostly blank or NA, and have no predictive value (e.g. identification numbers, timestamps etc.)
# Load required software packages.
        library(rpart)
        library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
        library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
        library(gbm)
## Loaded gbm 2.1.5
# Load required data
        training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
        testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")

# Make copy of testing dataset - keep original testing dataset unchanged
        testcopy <- testing

# Partition training dataset into a 70:30 training:validation dataset
        inTrain  <- createDataPartition(training$classe, p = 0.7, list = FALSE)
        trainset <- training[inTrain, ]
        validset <- training[-inTrain, ]

# Remove variables with near zero variability i.e. poor predictors
        nzv <- nearZeroVar(trainset)
        trainset <- trainset[, -nzv]
        validset <- validset[, -nzv]
     
# Remove mostly blank or NA variables
        mostNA <- sapply(trainset, function(x) mean(is.na(x))) > 0.90
        trainset <- trainset[, mostNA == FALSE]
        validset <- validset[, mostNA == FALSE]
        
# Remove first six columns - identification, timestamp etc.
        trainset <- trainset[, -(1:6)]
        validset <- validset[, -(1:6)]
        
# Check dataset dimensions
        dim(trainset)
## [1] 13737    53
        dim(validset)       
## [1] 5885   53

Prediction models

This section shall compare the Random Forest, Decision Tree, and Generalised Boosting prediction models using the same training and validation datasets.

Random Forest

# Random Forest model
        rfmodel <- randomForest(classe ~ ., data = trainset, ntree = 500)
        rfmodel
## 
## Call:
##  randomForest(formula = classe ~ ., data = trainset, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.53%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3901    5    0    0    0 0.001280082
## B   11 2640    7    0    0 0.006772009
## C    0   13 2381    2    0 0.006260434
## D    0    0   24 2226    2 0.011545293
## E    0    0    2    7 2516 0.003564356
# Predict against validation dataset
        rfpredict <- predict(rfmodel, validset, type = "class")
        rfcm <- confusionMatrix(rfpredict, validset$classe)
        rfcm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    8    0    0    0
##          B    1 1129    5    0    0
##          C    0    2 1021   15    0
##          D    0    0    0  949    4
##          E    0    0    0    0 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9941          
##                  95% CI : (0.9917, 0.9959)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9925          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9912   0.9951   0.9844   0.9963
## Specificity            0.9981   0.9987   0.9965   0.9992   1.0000
## Pos Pred Value         0.9952   0.9947   0.9836   0.9958   1.0000
## Neg Pred Value         0.9998   0.9979   0.9990   0.9970   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1918   0.1735   0.1613   0.1832
## Detection Prevalence   0.2856   0.1929   0.1764   0.1619   0.1832
## Balanced Accuracy      0.9988   0.9950   0.9958   0.9918   0.9982
# Plot confusion matrix results
        plot(rfcm$table, main = paste("Random Forest Accuracy =", (100*round(rfcm$overall['Accuracy'], 4)),"%"))

Decision Tree

# Decision Tree model
        dtmodel <- rpart(classe ~ ., data = trainset, method = "class")
        dtmodel
## n= 13737 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##     1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)  
##       2) roll_belt< 130.5 12590 8694 A (0.31 0.21 0.19 0.18 0.11)  
##         4) pitch_forearm< -33.95 1103    6 A (0.99 0.0054 0 0 0) *
##         5) pitch_forearm>=-33.95 11487 8688 A (0.24 0.23 0.21 0.2 0.12)  
##          10) magnet_dumbbell_y< 436.5 9674 6936 A (0.28 0.18 0.24 0.19 0.11)  
##            20) roll_forearm< 123.5 6024 3565 A (0.41 0.18 0.18 0.17 0.06)  
##              40) magnet_dumbbell_z< -24.5 2155  733 A (0.66 0.21 0.02 0.079 0.031)  
##                80) roll_forearm>=-136.5 1803  413 A (0.77 0.17 0.023 0.027 0.0067) *
##                81) roll_forearm< -136.5 352  211 B (0.091 0.4 0.0085 0.34 0.16) *
##              41) magnet_dumbbell_z>=-24.5 3869 2801 C (0.27 0.17 0.28 0.21 0.077)  
##                82) yaw_belt>=168.5 509   81 A (0.84 0.083 0 0.073 0.0039) *
##                83) yaw_belt< 168.5 3360 2292 C (0.18 0.18 0.32 0.23 0.087)  
##                 166) accel_dumbbell_y>=-40.5 2907 2133 D (0.21 0.2 0.23 0.27 0.095)  
##                   332) pitch_belt< -42.85 353   66 B (0.014 0.81 0.11 0.034 0.025) *
##                   333) pitch_belt>=-42.85 2554 1792 D (0.23 0.11 0.25 0.3 0.1)  
##                     666) roll_belt>=125.5 604  243 C (0.37 0.02 0.6 0.0099 0.0033)  
##                      1332) magnet_belt_z< -324.5 193    4 A (0.98 0 0.01 0 0.01) *
##                      1333) magnet_belt_z>=-324.5 411   52 C (0.083 0.029 0.87 0.015 0) *
##                     667) roll_belt< 125.5 1950 1194 D (0.19 0.14 0.14 0.39 0.14)  
##                      1334) pitch_belt>=0.895 1249  965 A (0.23 0.21 0.15 0.22 0.19)  
##                        2668) accel_dumbbell_z< 25.5 775  510 A (0.34 0.13 0.23 0.27 0.032)  
##                          5336) yaw_forearm>=-65.05 539  274 A (0.49 0.17 0.25 0.078 0.011)  
##                           10672) magnet_forearm_z>=-119.5 332   72 A (0.78 0.11 0.006 0.099 0.006) *
##                           10673) magnet_forearm_z< -119.5 207   72 C (0.024 0.26 0.65 0.043 0.019) *
##                          5337) yaw_forearm< -65.05 236   71 D (0 0.047 0.17 0.7 0.081) *
##                        2669) accel_dumbbell_z>=25.5 474  262 E (0.04 0.35 0.015 0.15 0.45)  
##                          5338) roll_dumbbell< 39.4826 172   42 B (0.041 0.76 0.041 0.047 0.12) *
##                          5339) roll_dumbbell>=39.4826 302  110 E (0.04 0.12 0 0.21 0.64) *
##                      1335) pitch_belt< 0.895 701  222 D (0.13 0.021 0.12 0.68 0.041) *
##                 167) accel_dumbbell_y< -40.5 453   56 C (0.0088 0.049 0.88 0.029 0.038) *
##            21) roll_forearm>=123.5 3650 2448 C (0.076 0.18 0.33 0.23 0.18)  
##              42) magnet_dumbbell_y< 291.5 2158 1126 C (0.09 0.14 0.48 0.15 0.15)  
##                84) magnet_forearm_z< -245.5 171   39 A (0.77 0.064 0 0.058 0.11) *
##                85) magnet_forearm_z>=-245.5 1987  955 C (0.032 0.14 0.52 0.16 0.15)  
##                 170) pitch_belt>=26.15 127   20 B (0.11 0.84 0.0079 0 0.039) *
##                 171) pitch_belt< 26.15 1860  829 C (0.026 0.096 0.55 0.17 0.16) *
##              43) magnet_dumbbell_y>=291.5 1492  959 D (0.056 0.24 0.11 0.36 0.24)  
##                86) accel_forearm_x>=-102.5 961  643 E (0.048 0.31 0.16 0.15 0.33)  
##                 172) magnet_arm_y>=186 392  174 B (0.0077 0.56 0.22 0.1 0.11) *
##                 173) magnet_arm_y< 186 569  295 E (0.076 0.14 0.11 0.19 0.48) *
##                87) accel_forearm_x< -102.5 531  146 D (0.072 0.1 0.038 0.73 0.064) *
##          11) magnet_dumbbell_y>=436.5 1813  906 B (0.034 0.5 0.045 0.22 0.2)  
##            22) total_accel_dumbbell>=5.5 1303  468 B (0.047 0.64 0.061 0.02 0.23)  
##              44) roll_belt>=-0.565 1096  261 B (0.056 0.76 0.073 0.024 0.086) *
##              45) roll_belt< -0.565 207    0 E (0 0 0 0 1) *
##            23) total_accel_dumbbell< 5.5 510  132 D (0 0.14 0.0039 0.74 0.11) *
##       3) roll_belt>=130.5 1147   10 E (0.0087 0 0 0 0.99) *
# Predict against validation dataset
        dtpredict <- predict(dtmodel, validset, type = "class")
        dtcm <- confusionMatrix(dtpredict, validset$classe)
        dtcm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1472  166   23   45   14
##          B   69  732   93   88  102
##          C   52   97  843  151  122
##          D   53   90   43  606   61
##          E   28   54   24   74  783
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7538          
##                  95% CI : (0.7426, 0.7647)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6883          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8793   0.6427   0.8216   0.6286   0.7237
## Specificity            0.9411   0.9258   0.9132   0.9498   0.9625
## Pos Pred Value         0.8558   0.6753   0.6664   0.7104   0.8131
## Neg Pred Value         0.9515   0.9152   0.9604   0.9289   0.9393
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2501   0.1244   0.1432   0.1030   0.1331
## Detection Prevalence   0.2923   0.1842   0.2150   0.1449   0.1636
## Balanced Accuracy      0.9102   0.7843   0.8674   0.7892   0.8431
# Plot confusion matrix results
        plot(dtcm$table, main = paste("Decision Tree Accuracy =", (100*round(dtcm$overall['Accuracy'], 4)),"%"))

Generalised Boosting Model

# Generalised Boosting Model
        gbcontrol <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
        gbmodel <- train(classe ~ ., data = trainset, method = "gbm", trControl = gbcontrol, verbose = FALSE)
        gbmodel
## Stochastic Gradient Boosting 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 10989, 10989, 10989, 10991, 10990 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7527119  0.6865706
##   1                  100      0.8226687  0.7756390
##   1                  150      0.8517869  0.8124982
##   2                   50      0.8565191  0.8182452
##   2                  100      0.9067493  0.8820254
##   2                  150      0.9314267  0.9132484
##   3                   50      0.8969203  0.8695164
##   3                  100      0.9409630  0.9252936
##   3                  150      0.9598900  0.9492506
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
# Predict against validation dataset
        gbpredict <- predict(gbmodel, validset, type = "raw")
        gbcm <- confusionMatrix(gbpredict, validset$classe)
        gbcm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1644   40    0    2    1
##          B   16 1058   34    5   12
##          C   10   39  980   42    8
##          D    3    2   10  907   15
##          E    1    0    2    8 1046
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9575         
##                  95% CI : (0.952, 0.9625)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9463         
##                                          
##  Mcnemar's Test P-Value : 4.435e-09      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9821   0.9289   0.9552   0.9409   0.9667
## Specificity            0.9898   0.9859   0.9796   0.9939   0.9977
## Pos Pred Value         0.9745   0.9404   0.9082   0.9680   0.9896
## Neg Pred Value         0.9929   0.9830   0.9904   0.9885   0.9925
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2794   0.1798   0.1665   0.1541   0.1777
## Detection Prevalence   0.2867   0.1912   0.1833   0.1592   0.1796
## Balanced Accuracy      0.9859   0.9574   0.9674   0.9674   0.9822
# Plot confusion matrix results
        plot(gbcm$table, main = paste("Generalised Boosted Model Accuracy =", (100*round(gbcm$overall['Accuracy'], 4)),"%"))

Findings

The Confusion Matrix cross-validates what the prediction model (developed from the training dataset) predicted in the validation dataset (Prediction), against the values that were actually in the validation dataset (Reference). For example, if the prediction model predicted 1,677 samples in the validation dataset in Class A (correct weightlifting technique), and the validation dataset actually had 1,672 samples in Class A, then the corss-validation indicates that the prediction model had a high rate of predictive accuracy.

Out of sample error rates are measured by calculating 1 minus the Accuracy figure that was calculated when validating the prediction model developed from the training dataset against the validation dataset.

The Random Forest prediction model has an accuracy rate of 99.41%, so the out of sample error rate is 0.59%.

The Decision Tree prediction model has an accuracy rate of 75.38%, so the out of sample error rate is 24.62%.

The Generalised Boosted Model prediction model has an accuracy rate of 95.75%, so the out of sample error rate is 4.25%.

On this basis, the Random Forest prediction model shall be selected to be applied to the twenty observations in the training dataset.

Conclusion

This section shall apply the Random Forest prediction model to a copy of the original testing dataset to predict a new testcopy$classe_rf variable for each of the twenty observations in the testing dataset.

As the accuracy of the Generalised Boosting prediction model was close to that of the Random Forest prediction model, it is worth applying the Generalised Boosting prediction model to another copy of the original testing dataset to predict a new testcopy2$classe_gbm variable for each of the twenty observations in the testing dataset. The values for this variable for at least nineteen of the twenty observations in the testing dataset should be the same as those predicted by the Random Forest prediction model.

# Run Random Forest prediction against test dataset
        testcopy$classe_rf <- predict(rfmodel, testcopy, type = "class")

# Make another copy of the original testing dataset
        testcopy2 <- testing

# Run Generalised Boosting Model prediction against copy of test dataset
        testcopy2$classe_gbm <- predict(gbmodel, testcopy2, type = "raw")
# Create data frame to compare RF test against GBM test
        comparison <- data.frame("Random Forest" = testcopy$classe_rf, "Generalised Boosting Model" = testcopy2$classe_gbm)
        comparison

Software versions used

# Print relevant software and versions
        sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gbm_2.1.5           randomForest_4.6-14 caret_6.0-83       
## [4] ggplot2_3.1.1       lattice_0.20-38     rpart_4.1-15       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.6           purrr_0.3.2       
##  [4] reshape2_1.4.3     splines_3.5.3      colorspace_1.4-1  
##  [7] generics_0.0.2     htmltools_0.3.6    stats4_3.5.3      
## [10] yaml_2.2.0         survival_2.43-3    prodlim_2018.04.18
## [13] rlang_0.3.4        e1071_1.7-1        ModelMetrics_1.2.2
## [16] pillar_1.3.1       glue_1.3.1         withr_2.1.2       
## [19] foreach_1.4.4      plyr_1.8.4         lava_1.6.5        
## [22] stringr_1.4.0      timeDate_3043.102  munsell_0.5.0     
## [25] gtable_0.3.0       recipes_0.1.5      codetools_0.2-16  
## [28] evaluate_0.13      knitr_1.22         class_7.3-15      
## [31] Rcpp_1.0.1         scales_1.0.0       ipred_0.9-8       
## [34] jsonlite_1.6       gridExtra_2.3      digest_0.6.18     
## [37] stringi_1.4.3      dplyr_0.8.0.1      grid_3.5.3        
## [40] tools_3.5.3        magrittr_1.5       lazyeval_0.2.2    
## [43] tibble_2.1.1       crayon_1.3.4       pkgconfig_2.0.2   
## [46] MASS_7.3-51.1      Matrix_1.2-15      data.table_1.12.2 
## [49] lubridate_1.7.4    gower_0.2.0        assertthat_0.2.1  
## [52] rmarkdown_1.12     iterators_1.0.10   R6_2.4.0          
## [55] nnet_7.3-12        nlme_3.1-137       compiler_3.5.3