Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
* Class A: exactly according to the specification
* Class B: throwing the elbows to the front
* Class C: lifting the dumbbell only halfway
* Class D: lowering the dumbbell only halfway
* Class E: throwing the hips to the front

Read more: http://groupware.les.inf.puc-rio.br/har

The Data

Provided for this project is a data set with 19,622 observations and 160 variables. Some of the variables are raw data collected from sensors attached at the bicep (labeled arm), the wrist (labeled forearm), the waist (labeled belt), and on the dumbbell. No codebook was provided for further analysis of the variables.
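For reference, a minimal sketch of how the data set might be loaded, assuming the training CSV is the file commonly distributed as pml-training.csv sitting in the working directory; blank cells are read as NA because many of the summary columns are empty:

pml <- read.csv("pml-training.csv", na.strings = c("NA", ""))  # 19622 rows, 160 columns
pml$classe <- factor(pml$classe)                               # the outcome: classes A through E
dim(pml)
table(pml$classe)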

The data appear to be time series in which a single 'window' covers one repetition; at the end of each bicep curl the raw measurements within the window were summarized into derived features that the original researchers analyzed further.

The way we have been instructed to use the data is to treat each observation as independent, since every row has been labeled with the class variable. Whether a single time instance of sensor data can validly classify the 'goodness' or 'badness' of a real-world bicep curl is unknown. The approach has nonetheless proven statistically successful on the separated testing data, and all three models achieved 100% accuracy on the 20 unlabeled samples provided separately for evaluation.

Summary of Results

My first model used a random forest with 25 bootstrap resamples. On the testing data it reported a 95% confidence interval for accuracy of 98.71% to 99.24%. The model made no mistakes on the training data, and I am certain there is overfitting. To combat the overfitting I created the second model.

My second model also used a random forest but with a 5-fold cross-validation resampling method. The accuracy was essentially unchanged, with a 95% confidence interval for the out-of-sample accuracy estimate of 98.75% to 99.27%.

The third and final model used gradient boosting with 25 bootstrap resamples. This model was less accurate on the testing set: the 95% confidence interval puts the out-of-sample accuracy between 92.07% and 93.41%.

All three models scored 100% on the 20-sample unlabeled test set.

Cross Validation

I chose to create a training set and a test set: 70% of the provided observations were randomly selected with the caret function 'createDataPartition()' to form the training set, and the remaining 30% were used for estimating out-of-sample error. Further cross-validation was employed within the models, as sketched below.
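A sketch of that split, assuming the loaded data frame is named pml and an arbitrary seed:

library(caret)
set.seed(1234)                                       # illustrative seed
inTrain  <- createDataPartition(pml$classe, p = 0.70, list = FALSE)
training <- pml[inTrain, ]                           # 70% for fitting the models
testing  <- pml[-inTrain, ]                          # 30% for out-of-sample error estimates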

Tidying the Data

Since the 20 samples set aside to test our models are not time series, the variables that indicate time have no good reason to be part of the model. The window variables also refer to the beginning and end of a time series, so they are not necessary either. The factor variables that were computed as averages or composites of all the window data are only non-NA or non-empty at the end of a window, so these variables are removed from consideration. Since each person performs 10 reps of each outcome, there is no reason to keep identifying variables as predictors. After noticing total columns for the acceleration of the arms, forearms, dumbbell, and belt, I selected only those variables and plotted the values against the class; there was no clear separation in the total variables between classes, so those were removed as well. A sketch of this clean-up follows.
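A minimal sketch of these clean-up steps (excluding the total-acceleration check), assuming the column names of the standard data set and the training set from the split above:

# Drop identifiers, timestamps, and the window markers
meta_cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
               "cvtd_timestamp", "new_window", "num_window")
training  <- training[, !(names(training) %in% meta_cols)]

# Drop the summary columns that are NA (or blank) for nearly every row
mostly_na <- sapply(training, function(col) mean(is.na(col)) > 0.9)
training  <- training[, !mostly_na]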

After removing all previously mentioned variables there were still a total of 49 possible predictors. At this point I used my knowledge of form and weight lifting (decisions had to be made somehow) and decided to look at the rotation variables (roll, pitch, and yaw) for each of the sensor positions. Roll and pitch are rotations about the two horizontal axes, and yaw is the rotation about the vertical axis. With this final reduction I had 12 predictors to work with. Below is the list of chosen variables (the outcome 'classe' plus the 12 predictors), followed by a sketch of the selection.

##  [1] "classe"         "roll_belt"      "pitch_belt"     "yaw_belt"      
##  [5] "roll_arm"       "pitch_arm"      "yaw_arm"        "roll_dumbbell" 
##  [9] "pitch_dumbbell" "yaw_dumbbell"   "roll_forearm"   "pitch_forearm" 
## [13] "yaw_forearm"
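A sketch of how that final selection might be expressed, assuming the training and testing data frames from the split above:

keep_cols <- c("classe", grep("^(roll|pitch|yaw)_", names(training), value = TRUE))
training  <- training[, keep_cols]                   # outcome plus the 12 rotation predictors
testing   <- testing[, keep_cols]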

Analysis

Model 1: Random Forest With 25 Bootstrapped Resamples

The first model I chose to run was a random forest with all the default settings except for allowing parallel processing. This model did very well on the training data, too well: it made no mistakes, which implies overfitting. In a real-world situation the test data would probably not do so well, especially since the test observations were randomly chosen from data that was initially not independent, being part of a time series data set. Below is a sketch of the fit, followed by the results of the model fitting and the confusion matrix of the test set that was initially set aside.
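A sketch of how such a fit might look with caret, under the assumptions of the earlier sketches and with hypothetical object names; caret's default trainControl already allows the registered parallel backend to be used:

library(caret)
library(doParallel)

cl <- makeCluster(detectCores() - 1)     # parallel backend for the resampling loops
registerDoParallel(cl)

set.seed(1234)
fit_rf_boot <- train(classe ~ ., data = training, method = "rf")   # caret default: 25 bootstrap resamples
stopCluster(cl)

pred_rf_boot <- predict(fit_rf_boot, newdata = testing)
confusionMatrix(pred_rf_boot, testing$classe)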

##    user  system elapsed 
##    8.14    0.16  208.76
## Random Forest 
## 
## 13737 samples
##    12 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9813280  0.9763828
##    7    0.9811595  0.9761705
##   12    0.9740713  0.9672108
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
##           Reference
## Prediction    A    B    C    D    E
##          A 3906    0    0    0    0
##          B    0 2658    0    0    0
##          C    0    0 2396    0    0
##          D    0    0    0 2252    0
##          E    0    0    0    0 2525
## [1] "Results of Testing Set"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671   12    0    0    0
##          B    2 1111    9    1    1
##          C    1   14 1013    9    2
##          D    0    2    4  952    0
##          E    0    0    0    2 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.99            
##                  95% CI : (0.9871, 0.9924)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9873          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9754   0.9873   0.9876   0.9972
## Specificity            0.9972   0.9973   0.9946   0.9988   0.9996
## Pos Pred Value         0.9929   0.9884   0.9750   0.9937   0.9981
## Neg Pred Value         0.9993   0.9941   0.9973   0.9976   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2839   0.1888   0.1721   0.1618   0.1833
## Detection Prevalence   0.2860   0.1910   0.1766   0.1628   0.1837
## Balanced Accuracy      0.9977   0.9863   0.9910   0.9932   0.9984

Model 2: Random Forest With 5-Fold Cross-Validation

The second model used 5-fold cross-validation and did slightly less memorizing, but it is still overfitting: it again scores 100% accuracy on the training data while estimating the out-of-sample accuracy with a 95% confidence interval of 98.75% to 99.27%, which still seems unrealistically high. Below is a sketch of the change in resampling, followed by the results of the model and the confusion matrices.
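A sketch of the only change from the first model, swapping the resampling method through trainControl (hypothetical object names):

set.seed(1234)
ctrl_cv   <- trainControl(method = "cv", number = 5)    # 5-fold cross-validation
fit_rf_cv <- train(classe ~ ., data = training, method = "rf",
                   trControl = ctrl_cv)

pred_rf_cv <- predict(fit_rf_cv, newdata = testing)
confusionMatrix(pred_rf_cv, testing$classe)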

##    user  system elapsed 
##   10.50    0.11   52.14
## Random Forest 
## 
## 13737 samples
##    12 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10990, 10990, 10990, 10989, 10989 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9826016  0.9779940
##    7    0.9837663  0.9794690
##   12    0.9791072  0.9735798
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3906    0    0    0    0
##          B    0 2658    0    0    0
##          C    0    0 2396    0    0
##          D    0    0    0 2252    0
##          E    0    0    0    0 2525
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
## [1] "Results of Testing Set"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1668   11    0    0    0
##          B    6 1113   10    1    0
##          C    0   15 1013    4    2
##          D    0    0    3  955    1
##          E    0    0    0    4 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9903          
##                  95% CI : (0.9875, 0.9927)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9877          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9772   0.9873   0.9907   0.9972
## Specificity            0.9974   0.9964   0.9957   0.9992   0.9992
## Pos Pred Value         0.9934   0.9850   0.9797   0.9958   0.9963
## Neg Pred Value         0.9986   0.9945   0.9973   0.9982   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2834   0.1891   0.1721   0.1623   0.1833
## Detection Prevalence   0.2853   0.1920   0.1757   0.1630   0.1840
## Balanced Accuracy      0.9969   0.9868   0.9915   0.9949   0.9982

Model 3: Gradient Boosting Machine

The third model was trained with gradient boosting, keeping the caret defaults, which again use 25 bootstrap resamples. This model did not score perfectly on the training data (about 94%), and its estimated out-of-sample accuracy on the testing set was lower than the other two (92.76%). While less accurate, that is still a quite respectable 95% confidence interval of 92.07% to 93.41%. Below is a sketch of the fit, followed by the model results.
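A sketch of the boosted fit, again with the caret defaults and hypothetical object names; verbose = FALSE simply suppresses gbm's iteration log:

set.seed(1234)
fit_gbm <- train(classe ~ ., data = training, method = "gbm",
                 verbose = FALSE)                       # caret default: 25 bootstrap resamples

pred_gbm <- predict(fit_gbm, newdata = testing)
confusionMatrix(pred_gbm, testing$classe)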

##    user  system elapsed 
##    7.67    0.12  154.08
## Stochastic Gradient Boosting 
## 
## 13737 samples
##    12 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.6813039  0.5958739
##   1                  100      0.7404196  0.6722965
##   1                  150      0.7657862  0.7043980
##   2                   50      0.7926340  0.7383526
##   2                  100      0.8556348  0.8178005
##   2                  150      0.8837896  0.8532863
##   3                   50      0.8532821  0.8147779
##   3                  100      0.9031235  0.8776398
##   3                  150      0.9251727  0.9054541
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3759   72    4    7    5
##          B   77 2429  118   21   38
##          C   35  120 2216   49   35
##          D   18   32   52 2167   49
##          E   17    5    6    8 2398
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9441          
##                  95% CI : (0.9401, 0.9479)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9293          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9624   0.9138   0.9249   0.9623   0.9497
## Specificity            0.9910   0.9771   0.9789   0.9869   0.9968
## Pos Pred Value         0.9771   0.9053   0.9026   0.9349   0.9852
## Neg Pred Value         0.9851   0.9793   0.9840   0.9926   0.9888
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2736   0.1768   0.1613   0.1577   0.1746
## Detection Prevalence   0.2800   0.1953   0.1787   0.1687   0.1772
## Balanced Accuracy      0.9767   0.9455   0.9519   0.9746   0.9732

All three models scored 100% on the 20 samples that came without labels. Below are the gradient boosting model's results on the held-out testing set, followed by a table of each model's predictions on the 20 unlabeled samples.

## [1] "Results on Testing Set:"
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1587   42    1    5    7
##          B   56 1005   70    9   12
##          C   16   68  931   26   24
##          D   13   21   21  917   20
##          E    2    3    3    7 1019
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9276          
##                  95% CI : (0.9207, 0.9341)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9085          
##                                           
##  Mcnemar's Test P-Value : 3.266e-08       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9480   0.8824   0.9074   0.9512   0.9418
## Specificity            0.9869   0.9690   0.9724   0.9848   0.9969
## Pos Pred Value         0.9665   0.8724   0.8742   0.9244   0.9855
## Neg Pred Value         0.9795   0.9717   0.9803   0.9904   0.9870
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2697   0.1708   0.1582   0.1558   0.1732
## Detection Prevalence   0.2790   0.1958   0.1810   0.1686   0.1757
## Balanced Accuracy      0.9675   0.9257   0.9399   0.9680   0.9693
## [1]  20 160
##    Sample_Number Random_Forest Random_Forest.cv Gradient_Boost
## 1              1             B                B              B
## 2              2             A                A              A
## 3              3             B                B              B
## 4              4             A                A              A
## 5              5             A                A              A
## 6              6             E                E              E
## 7              7             D                D              D
## 8              8             B                B              B
## 9              9             A                A              A
## 10            10             A                A              A
## 11            11             B                B              B
## 12            12             C                C              C
## 13            13             B                B              B
## 14            14             A                A              A
## 15            15             E                E              E
## 16            16             E                E              E
## 17            17             A                A              A
## 18            18             B                B              B
## 19            19             B                B              B
## 20            20             B                B              B
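A sketch of how the comparison table above could be assembled, assuming the 20 unlabeled samples live in the file commonly distributed as pml-testing.csv and the three fitted model objects from the earlier sketches:

validation <- read.csv("pml-testing.csv", na.strings = c("NA", ""))   # 20 rows, 160 columns

data.frame(Sample_Number    = 1:20,
           Random_Forest    = predict(fit_rf_boot, newdata = validation),
           Random_Forest.cv = predict(fit_rf_cv,   newdata = validation),
           Gradient_Boost   = predict(fit_gbm,     newdata = validation))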

Appendix

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] doParallel_1.0.16 iterators_1.0.13  foreach_1.5.1     ggpubr_0.4.0     
## [5] dplyr_1.0.3       caret_6.0-86      ggplot2_3.3.3     lattice_0.20-41  
## 
## loaded via a namespace (and not attached):
##  [1] tidyr_1.1.2          splines_4.0.3        carData_3.0-4       
##  [4] prodlim_2019.11.13   assertthat_0.2.1     highr_0.8           
##  [7] stats4_4.0.3         cellranger_1.1.0     yaml_2.2.1          
## [10] ipred_0.9-9          pillar_1.4.7         backports_1.2.0     
## [13] glue_1.4.2           pROC_1.17.0.1        digest_0.6.27       
## [16] ggsignif_0.6.1       randomForest_4.6-14  gbm_2.1.8           
## [19] colorspace_2.0-0     recipes_0.1.15       cowplot_1.1.1       
## [22] htmltools_0.5.1.1    Matrix_1.2-18        plyr_1.8.6          
## [25] timeDate_3043.102    pkgconfig_2.0.3      broom_0.7.4         
## [28] haven_2.3.1          purrr_0.3.4          scales_1.1.1        
## [31] openxlsx_4.2.3       gower_0.2.2          lava_1.6.8.1        
## [34] rio_0.5.16           tibble_3.0.6         generics_0.1.0      
## [37] farver_2.0.3         car_3.0-10           ellipsis_0.3.1      
## [40] withr_2.4.1          nnet_7.3-15          cli_2.3.0           
## [43] survival_3.2-7       magrittr_2.0.1       crayon_1.4.1        
## [46] readxl_1.3.1         evaluate_0.14        nlme_3.1-151        
## [49] MASS_7.3-53          rstatix_0.7.0        forcats_0.5.1       
## [52] foreign_0.8-81       class_7.3-18         tools_4.0.3         
## [55] data.table_1.13.6    hms_1.0.0            lifecycle_0.2.0     
## [58] stringr_1.4.0        munsell_0.5.0        zip_2.1.1           
## [61] e1071_1.7-4          compiler_4.0.3       rlang_0.4.10        
## [64] grid_4.0.3           labeling_0.4.2       rmarkdown_2.6       
## [67] gtable_0.3.0         ModelMetrics_1.2.2.2 codetools_0.2-18    
## [70] abind_1.4-5          DBI_1.1.1            curl_4.3            
## [73] reshape2_1.4.4       R6_2.5.0             gridExtra_2.3       
## [76] lubridate_1.7.9.2    knitr_1.31           stringi_1.5.3       
## [79] Rcpp_1.0.6           vctrs_0.3.6          rpart_4.1-15        
## [82] tidyselect_1.1.0     xfun_0.20