Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These devices are part of the “quantified self movement” - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, I shall investigate activity tracker data from six male participants aged between 20 and 28 years, with limited weightlifting experience. These participants were asked to perform one set of ten repetitions of the Unilateral Dumbbell Biceps Curl using a 1.25 kg dumbbell in five different fashions, comprising the correct technique and four common technique errors as follows:
This data was sourced from Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., and Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises, Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13), Stuttgart, Germany: ACM SIGCHI, 2013. (http://groupware.les.inf.puc-rio.br/har#wle_paper_section)
I shall undertake three types of data analysis to investigate the degree of accuracy with which predictor variables can predict whether a Unilateral Dumbbell Biceps Curl is being performed correctly, and if not, which common error is being made. These are:
The most accurate of these prediction models shall be used to predict whether a Unilateral Dumbbell Biceps Curl is being performed correctly, and if not, which common error is being made, using the test dataset.
The following block of R code shall:
# Load required software packages.
library(rpart)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm)
## Loaded gbm 2.1.5
# Load required data
training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
# Make copy of testing dataset - keep original testing dataset unchanged
testcopy <- testing
# Partition training dataset into a 70:30 training:validation dataset
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainset <- training[inTrain, ]
validset <- training[-inTrain, ]
# Remove variables with near zero variability i.e. poor predictors
nzv <- nearZeroVar(trainset)
trainset <- trainset[, -nzv]
validset <- validset[, -nzv]
# Remove mostly blank or NA variables
mostNA <- sapply(trainset, function(x) mean(is.na(x))) > 0.90
trainset <- trainset[, mostNA == FALSE]
validset <- validset[, mostNA == FALSE]
# Remove first six columns - identification, timestamp etc.
trainset <- trainset[, -(1:6)]
validset <- validset[, -(1:6)]
# Check dataset dimensions
dim(trainset)
## [1] 13737 53
dim(validset)
## [1] 5885 53
This section shall compare the Random Forest, Decision Tree, and Generalised Boosting prediction models using the same training and validation datasets.
# Random Forest model
rfmodel <- randomForest(classe ~ ., data = trainset, ntree = 500)
rfmodel
##
## Call:
## randomForest(formula = classe ~ ., data = trainset, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.53%
## Confusion matrix:
## A B C D E class.error
## A 3901 5 0 0 0 0.001280082
## B 11 2640 7 0 0 0.006772009
## C 0 13 2381 2 0 0.006260434
## D 0 0 24 2226 2 0.011545293
## E 0 0 2 7 2516 0.003564356
# Predict against validation dataset
rfpredict <- predict(rfmodel, validset, type = "class")
rfcm <- confusionMatrix(rfpredict, validset$classe)
rfcm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 8 0 0 0
## B 1 1129 5 0 0
## C 0 2 1021 15 0
## D 0 0 0 949 4
## E 0 0 0 0 1078
##
## Overall Statistics
##
## Accuracy : 0.9941
## 95% CI : (0.9917, 0.9959)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9925
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9912 0.9951 0.9844 0.9963
## Specificity 0.9981 0.9987 0.9965 0.9992 1.0000
## Pos Pred Value 0.9952 0.9947 0.9836 0.9958 1.0000
## Neg Pred Value 0.9998 0.9979 0.9990 0.9970 0.9992
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1918 0.1735 0.1613 0.1832
## Detection Prevalence 0.2856 0.1929 0.1764 0.1619 0.1832
## Balanced Accuracy 0.9988 0.9950 0.9958 0.9918 0.9982
# Plot confusion matrix results
plot(rfcm$table, main = paste("Random Forest Accuracy =", (100*round(rfcm$overall['Accuracy'], 4)),"%"))
# Decision Tree model
dtmodel <- rpart(classe ~ ., data = trainset, method = "class")
dtmodel
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 12590 8694 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.95 1103 6 A (0.99 0.0054 0 0 0) *
## 5) pitch_forearm>=-33.95 11487 8688 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 436.5 9674 6936 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 123.5 6024 3565 A (0.41 0.18 0.18 0.17 0.06)
## 40) magnet_dumbbell_z< -24.5 2155 733 A (0.66 0.21 0.02 0.079 0.031)
## 80) roll_forearm>=-136.5 1803 413 A (0.77 0.17 0.023 0.027 0.0067) *
## 81) roll_forearm< -136.5 352 211 B (0.091 0.4 0.0085 0.34 0.16) *
## 41) magnet_dumbbell_z>=-24.5 3869 2801 C (0.27 0.17 0.28 0.21 0.077)
## 82) yaw_belt>=168.5 509 81 A (0.84 0.083 0 0.073 0.0039) *
## 83) yaw_belt< 168.5 3360 2292 C (0.18 0.18 0.32 0.23 0.087)
## 166) accel_dumbbell_y>=-40.5 2907 2133 D (0.21 0.2 0.23 0.27 0.095)
## 332) pitch_belt< -42.85 353 66 B (0.014 0.81 0.11 0.034 0.025) *
## 333) pitch_belt>=-42.85 2554 1792 D (0.23 0.11 0.25 0.3 0.1)
## 666) roll_belt>=125.5 604 243 C (0.37 0.02 0.6 0.0099 0.0033)
## 1332) magnet_belt_z< -324.5 193 4 A (0.98 0 0.01 0 0.01) *
## 1333) magnet_belt_z>=-324.5 411 52 C (0.083 0.029 0.87 0.015 0) *
## 667) roll_belt< 125.5 1950 1194 D (0.19 0.14 0.14 0.39 0.14)
## 1334) pitch_belt>=0.895 1249 965 A (0.23 0.21 0.15 0.22 0.19)
## 2668) accel_dumbbell_z< 25.5 775 510 A (0.34 0.13 0.23 0.27 0.032)
## 5336) yaw_forearm>=-65.05 539 274 A (0.49 0.17 0.25 0.078 0.011)
## 10672) magnet_forearm_z>=-119.5 332 72 A (0.78 0.11 0.006 0.099 0.006) *
## 10673) magnet_forearm_z< -119.5 207 72 C (0.024 0.26 0.65 0.043 0.019) *
## 5337) yaw_forearm< -65.05 236 71 D (0 0.047 0.17 0.7 0.081) *
## 2669) accel_dumbbell_z>=25.5 474 262 E (0.04 0.35 0.015 0.15 0.45)
## 5338) roll_dumbbell< 39.4826 172 42 B (0.041 0.76 0.041 0.047 0.12) *
## 5339) roll_dumbbell>=39.4826 302 110 E (0.04 0.12 0 0.21 0.64) *
## 1335) pitch_belt< 0.895 701 222 D (0.13 0.021 0.12 0.68 0.041) *
## 167) accel_dumbbell_y< -40.5 453 56 C (0.0088 0.049 0.88 0.029 0.038) *
## 21) roll_forearm>=123.5 3650 2448 C (0.076 0.18 0.33 0.23 0.18)
## 42) magnet_dumbbell_y< 291.5 2158 1126 C (0.09 0.14 0.48 0.15 0.15)
## 84) magnet_forearm_z< -245.5 171 39 A (0.77 0.064 0 0.058 0.11) *
## 85) magnet_forearm_z>=-245.5 1987 955 C (0.032 0.14 0.52 0.16 0.15)
## 170) pitch_belt>=26.15 127 20 B (0.11 0.84 0.0079 0 0.039) *
## 171) pitch_belt< 26.15 1860 829 C (0.026 0.096 0.55 0.17 0.16) *
## 43) magnet_dumbbell_y>=291.5 1492 959 D (0.056 0.24 0.11 0.36 0.24)
## 86) accel_forearm_x>=-102.5 961 643 E (0.048 0.31 0.16 0.15 0.33)
## 172) magnet_arm_y>=186 392 174 B (0.0077 0.56 0.22 0.1 0.11) *
## 173) magnet_arm_y< 186 569 295 E (0.076 0.14 0.11 0.19 0.48) *
## 87) accel_forearm_x< -102.5 531 146 D (0.072 0.1 0.038 0.73 0.064) *
## 11) magnet_dumbbell_y>=436.5 1813 906 B (0.034 0.5 0.045 0.22 0.2)
## 22) total_accel_dumbbell>=5.5 1303 468 B (0.047 0.64 0.061 0.02 0.23)
## 44) roll_belt>=-0.565 1096 261 B (0.056 0.76 0.073 0.024 0.086) *
## 45) roll_belt< -0.565 207 0 E (0 0 0 0 1) *
## 23) total_accel_dumbbell< 5.5 510 132 D (0 0.14 0.0039 0.74 0.11) *
## 3) roll_belt>=130.5 1147 10 E (0.0087 0 0 0 0.99) *
# Predict against validation dataset
dtpredict <- predict(dtmodel, validset, type = "class")
dtcm <- confusionMatrix(dtpredict, validset$classe)
dtcm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1472 166 23 45 14
## B 69 732 93 88 102
## C 52 97 843 151 122
## D 53 90 43 606 61
## E 28 54 24 74 783
##
## Overall Statistics
##
## Accuracy : 0.7538
## 95% CI : (0.7426, 0.7647)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6883
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8793 0.6427 0.8216 0.6286 0.7237
## Specificity 0.9411 0.9258 0.9132 0.9498 0.9625
## Pos Pred Value 0.8558 0.6753 0.6664 0.7104 0.8131
## Neg Pred Value 0.9515 0.9152 0.9604 0.9289 0.9393
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2501 0.1244 0.1432 0.1030 0.1331
## Detection Prevalence 0.2923 0.1842 0.2150 0.1449 0.1636
## Balanced Accuracy 0.9102 0.7843 0.8674 0.7892 0.8431
# Plot confusion matrix results
plot(dtcm$table, main = paste("Decision Tree Accuracy =", (100*round(dtcm$overall['Accuracy'], 4)),"%"))
# Generalised Boosting Model
gbcontrol <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
gbmodel <- train(classe ~ ., data = trainset, method = "gbm", trControl = gbcontrol, verbose = FALSE)
gbmodel
## Stochastic Gradient Boosting
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 10989, 10989, 10989, 10991, 10990
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7527119 0.6865706
## 1 100 0.8226687 0.7756390
## 1 150 0.8517869 0.8124982
## 2 50 0.8565191 0.8182452
## 2 100 0.9067493 0.8820254
## 2 150 0.9314267 0.9132484
## 3 50 0.8969203 0.8695164
## 3 100 0.9409630 0.9252936
## 3 150 0.9598900 0.9492506
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
# Predict against validation dataset
gbpredict <- predict(gbmodel, validset, type = "raw")
gbcm <- confusionMatrix(gbpredict, validset$classe)
gbcm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1644 40 0 2 1
## B 16 1058 34 5 12
## C 10 39 980 42 8
## D 3 2 10 907 15
## E 1 0 2 8 1046
##
## Overall Statistics
##
## Accuracy : 0.9575
## 95% CI : (0.952, 0.9625)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9463
##
## Mcnemar's Test P-Value : 4.435e-09
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9821 0.9289 0.9552 0.9409 0.9667
## Specificity 0.9898 0.9859 0.9796 0.9939 0.9977
## Pos Pred Value 0.9745 0.9404 0.9082 0.9680 0.9896
## Neg Pred Value 0.9929 0.9830 0.9904 0.9885 0.9925
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2794 0.1798 0.1665 0.1541 0.1777
## Detection Prevalence 0.2867 0.1912 0.1833 0.1592 0.1796
## Balanced Accuracy 0.9859 0.9574 0.9674 0.9674 0.9822
# Plot confusion matrix results
plot(gbcm$table, main = paste("Generalised Boosted Model Accuracy =", (100*round(gbcm$overall['Accuracy'], 4)),"%"))
The Confusion Matrix cross-validates what the prediction model (developed from the training dataset) predicted in the validation dataset (Prediction), against the values that were actually in the validation dataset (Reference). For example, if the prediction model predicted 1,677 samples in the validation dataset in Class A (correct weightlifting technique), and the validation dataset actually had 1,672 samples in Class A, then the corss-validation indicates that the prediction model had a high rate of predictive accuracy.
Out of sample error rates are measured by calculating 1 minus the Accuracy figure that was calculated when validating the prediction model developed from the training dataset against the validation dataset.
The Random Forest prediction model has an accuracy rate of 99.41%, so the out of sample error rate is 0.59%.
The Decision Tree prediction model has an accuracy rate of 75.38%, so the out of sample error rate is 24.62%.
The Generalised Boosted Model prediction model has an accuracy rate of 95.75%, so the out of sample error rate is 4.25%.
On this basis, the Random Forest prediction model shall be selected to be applied to the twenty observations in the training dataset.
This section shall apply the Random Forest prediction model to a copy of the original testing dataset to predict a new testcopy$classe_rf variable for each of the twenty observations in the testing dataset.
As the accuracy of the Generalised Boosting prediction model was close to that of the Random Forest prediction model, it is worth applying the Generalised Boosting prediction model to another copy of the original testing dataset to predict a new testcopy2$classe_gbm variable for each of the twenty observations in the testing dataset. The values for this variable for at least nineteen of the twenty observations in the testing dataset should be the same as those predicted by the Random Forest prediction model.
# Run Random Forest prediction against test dataset
testcopy$classe_rf <- predict(rfmodel, testcopy, type = "class")
# Make another copy of the original testing dataset
testcopy2 <- testing
# Run Generalised Boosting Model prediction against copy of test dataset
testcopy2$classe_gbm <- predict(gbmodel, testcopy2, type = "raw")
# Create data frame to compare RF test against GBM test
comparison <- data.frame("Random Forest" = testcopy$classe_rf, "Generalised Boosting Model" = testcopy2$classe_gbm)
comparison
# Print relevant software and versions
sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gbm_2.1.5 randomForest_4.6-14 caret_6.0-83
## [4] ggplot2_3.1.1 lattice_0.20-38 rpart_4.1-15
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.6 purrr_0.3.2
## [4] reshape2_1.4.3 splines_3.5.3 colorspace_1.4-1
## [7] generics_0.0.2 htmltools_0.3.6 stats4_3.5.3
## [10] yaml_2.2.0 survival_2.43-3 prodlim_2018.04.18
## [13] rlang_0.3.4 e1071_1.7-1 ModelMetrics_1.2.2
## [16] pillar_1.3.1 glue_1.3.1 withr_2.1.2
## [19] foreach_1.4.4 plyr_1.8.4 lava_1.6.5
## [22] stringr_1.4.0 timeDate_3043.102 munsell_0.5.0
## [25] gtable_0.3.0 recipes_0.1.5 codetools_0.2-16
## [28] evaluate_0.13 knitr_1.22 class_7.3-15
## [31] Rcpp_1.0.1 scales_1.0.0 ipred_0.9-8
## [34] jsonlite_1.6 gridExtra_2.3 digest_0.6.18
## [37] stringi_1.4.3 dplyr_0.8.0.1 grid_3.5.3
## [40] tools_3.5.3 magrittr_1.5 lazyeval_0.2.2
## [43] tibble_2.1.1 crayon_1.3.4 pkgconfig_2.0.2
## [46] MASS_7.3-51.1 Matrix_1.2-15 data.table_1.12.2
## [49] lubridate_1.7.4 gower_0.2.0 assertthat_0.2.1
## [52] rmarkdown_1.12 iterators_1.0.10 R6_2.4.0
## [55] nnet_7.3-12 nlme_3.1-137 compiler_3.5.3