Executive Summary

In this project we attempted to construct a predictive model to quantify how well a bumbell barbell lifting movement is executed using data from accelerometers on the participant and dumbell. The predictive model should predict the correct class of movement from a set of five different classes using data at any single point in time. Two models are constructed using linear discriminant analysis (LDA) and XGBOOST respectively. The XGBOOST model achieved almost 100% accuracy on the cross validation set, and 100% accuracy on the test set.

Introduction

The objective of this project is to construct a predictive model to quantify how well a particular set of movement (in this case, dumbell lifting) is done, using data from accelerometers on the belt, forearm, arm, and dumbell of participants at any point in time.

This project is based on the study of the same title conducted by Velloso, E. et. al. (2013). However, the scope of this project is to construct a model based on the same dataset for the prediction of quality of weight lifting exercise execution. The full dataset used in this analysis and the relevant documentation can be found here.

The training dataset can be downloaded here. The testing dataset can be downloaded here.

Data Exploration, Cleaning & Processing

Load data and dependencies.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
raw_training <- read.csv("pml training.csv")
raw_testing <- read.csv("pml testing.csv")
set.seed(1234)

Split the raw_training dataset into a training dataset (80%) and a cross-validation dataset (20%) based on the classe variable.

inTrain <- createDataPartition(raw_training$classe, p=0.8, list = FALSE)
training <- raw_training[inTrain,]
cval <- raw_training[-inTrain,]

As the dataset consist of many variables with high proportions of NAs and near zero variance variables, we will remove these variables from our dataset.

newdat <- training[, colMeans(is.na(training)) <= .50] 
zerovar_list <- nearZeroVar(newdat)
newdat1 <- newdat[,-zerovar_list]

Due to the nature of the objectives of this project and the test set provided, we will not be analysing the dataset as a time series (details in the limitation section below). We also removed the test subject’s identity to reduce design bias.

ctraining <- newdat1[,-(1:6)]
ncol(ctraining)
## [1] 53
names(ctraining)
##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
##  [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [52] "magnet_forearm_z"     "classe"

Now we will scale and center the ctraining dataset. We will then apply the same transformation to the cross validation dataset and the testing dataset.

sc_trans <- preProcess(ctraining)
sc_training <- predict(sc_trans,ctraining)
sc_cval <- predict(sc_trans,cval)
sc_testing <- predict(sc_trans,raw_testing)

Model Construction

We will be constructing two different prediction models from the sc_training dataset. The first model will constructed using linear discriminant analysis and the second model will be constructed using XGBOOST.

Model A: Linear Discriminant Analysis

fit_lda <- train(classe~., data=sc_training, method="lda")
fit_lda$finalModel
## Call:
## lda(x, grouping = y)
## 
## Prior probabilities of groups:
##         A         B         C         D         E 
## 0.2843493 0.1935155 0.1744060 0.1638958 0.1838334 
## 
## Group means:
##      roll_belt   pitch_belt     yaw_belt total_accel_belt gyros_belt_x
## A -0.077443603 -0.011785709  0.002361025     -0.080095400  0.006851356
## B  0.016403641 -0.021812716 -0.013590133     -0.019257667  0.008992629
## C  0.003027305 -0.041098313  0.025280998     -0.018768107 -0.043120823
## D -0.058043077  0.079122770 -0.077641536     -0.009970404 -0.048992944
## E  0.151396417  0.009640585  0.055890337      0.170856405  0.064525154
##   gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## A  0.015264137  0.043784663 -0.009763041  -0.04384357   0.09551994
## B  0.046557283 -0.011936887  0.031607291   0.06416663  -0.01380817
## C -0.003627635 -0.013659246  0.034848303   0.03572628   0.01661086
## D -0.054927248 -0.040142986 -0.103088537   0.01077577   0.03572394
## E -0.020207852 -0.006411489  0.040676167  -0.04323117  -0.18082121
##   magnet_belt_x magnet_belt_y magnet_belt_z    roll_arm   pitch_arm
## A   0.048624551    0.23666465    0.12130770 -0.24915922  0.26438740
## B  -0.088855849    0.15681204    0.13271328  0.19133764 -0.05034410
## C   0.007935449    0.17074299    0.11977763  0.09465541  0.08171536
## D  -0.119911220    0.02323308    0.08201516  0.06127816 -0.19318834
## E   0.117702143   -0.71383887   -0.51409448  0.03954532 -0.26124153
##       yaw_arm total_accel_arm  gyros_arm_x  gyros_arm_y   gyros_arm_z
## A -0.15764381      0.18118161 -0.008141219  0.044281354 -0.0088396654
## B  0.11565641      0.09366581 -0.021575367 -0.023750942 -0.0009383429
## C  0.06568048     -0.12047337  0.037834371 -0.015562841  0.0154537075
## D  0.09143959     -0.19607911 -0.002033118  0.002813387 -0.0170270380
## E -0.02174269     -0.08973798  0.001222825 -0.031235061  0.0151799272
##   accel_arm_x accel_arm_y accel_arm_z magnet_arm_x magnet_arm_y
## A -0.39371970  0.12918648 -0.03590900  -0.47254585    0.3887828
## B  0.08949264 -0.06690850 -0.17862222   0.08888067   -0.1235333
## C -0.08794711  0.07541363  0.13038511  -0.07033764    0.1549478
## D  0.41088657 -0.06392360  0.18218511   0.45686294   -0.2847947
## E  0.23190372 -0.14394577 -0.04255185   0.29677801   -0.3644154
##   magnet_arm_z roll_dumbbell pitch_dumbbell yaw_dumbbell
## A   0.31239669   -0.02378267     -0.2172060 -0.011676960
## B  -0.33409183    0.16567431      0.3688967  0.155860817
## C   0.16673647   -0.55516609     -0.3896282 -0.199239472
## D  -0.01557554    0.38278022      0.2384820 -0.001553199
## E  -0.27581997    0.04781654      0.1046733  0.044398767
##   total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## A           0.08930021     -0.022435194      -0.01554755      0.018092036
## B           0.07292203      0.005745444      -0.06165504     -0.007735324
## C          -0.08800551      0.019954976       0.01367445     -0.010868581
## D          -0.23524210      0.026163817      -0.05237685     -0.001944579
## E           0.07833116     -0.013603667       0.12267404     -0.007796728
##   accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## A       -0.3203466      0.003222691      -0.17253386        -0.1773147
## B        0.4056592      0.207224209       0.20000399         0.2256251
## C       -0.1734469     -0.290865938      -0.11765471        -0.1182183
## D        0.0918842      0.005066460       0.04734681         0.0374826
## E        0.1511138      0.048309667       0.12574300         0.1154965
##   magnet_dumbbell_y magnet_dumbbell_z roll_forearm pitch_forearm
## A      -0.004743084       -0.25292902  -0.07346000   -0.62298407
## B       0.138653663        0.03155998  -0.02437864    0.14016192
## C      -0.193502111        0.11429841   0.24653516    0.05787679
## D      -0.009709318        0.06855867  -0.17898517    0.61433111
## E       0.053615439        0.18844263   0.06496997    0.21346098
##   yaw_forearm total_accel_forearm gyros_forearm_x gyros_forearm_y
## A  0.05352226         -0.25368057      0.03017840     0.016409488
## B -0.06511763          0.06284656     -0.02187857     0.006886273
## C  0.19402192          0.01313372      0.05591282    -0.005529746
## D -0.14986876          0.13640120     -0.05253348    -0.021324866
## E -0.06469707          0.19216279     -0.02985791    -0.008372533
##   gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## A     0.009747001      0.32935462      0.03369202     -0.02788343
## B     0.011490608     -0.09452260     -0.14551513      0.07784583
## C    -0.009978561      0.08358785      0.24065547     -0.04354579
## D    -0.021907132     -0.50403516     -0.06459929      0.03773621
## E     0.001825805     -0.03986848     -0.06965590     -0.03114722
##   magnet_forearm_x magnet_forearm_y magnet_forearm_z
## A       0.32962215        0.1808764       0.03422768
## B      -0.03779198       -0.2065221      -0.05586125
## C      -0.06207973        0.2437320       0.18738863
## D      -0.40613944       -0.1250257      -0.08867847
## E      -0.04908183       -0.1821433      -0.09285732
## 
## Coefficients of linear discriminants:
##                                LD1          LD2           LD3          LD4
## roll_belt             3.5815026250  5.834741796 -0.3504022024  4.156171948
## pitch_belt            0.8241749775  0.001780319 -1.5197072946  0.290882478
## yaw_belt             -0.8860805909  0.047428478 -1.2063398532 -0.303051641
## total_accel_belt     -0.0811132589 -0.443969429 -2.1686253265 -1.176747640
## gyros_belt_x          0.1164589994  0.064892393  0.2009520965  0.087233471
## gyros_belt_y         -0.1245918155 -0.168802605 -0.0459597849  0.081602431
## gyros_belt_z          0.1393996924  0.139508645  0.1014173963 -0.145970982
## accel_belt_x         -0.0201024275  0.045137587  0.7179080197  0.142939884
## accel_belt_y         -0.6769253950 -0.777780785  1.8239259201  0.145035152
## accel_belt_z          0.5441525884  2.628935546 -1.5316823751  1.508407190
## magnet_belt_x        -0.6972667422  0.020209468 -1.5557787542 -0.109157992
## magnet_belt_y        -0.7800857155 -0.297582119  0.0763329231 -0.086222966
## magnet_belt_z         0.4887261909  0.032807405  0.7755037590  0.155407703
## roll_arm              0.0534117487  0.044396735  0.1992871755  0.017431104
## pitch_arm            -0.1011275139  0.208616508  0.1520816012  0.040396650
## yaw_arm               0.1000811147 -0.071320063  0.1346983727 -0.110873051
## total_accel_arm       0.0523356137 -0.283417034 -0.2186822157 -0.184099610
## gyros_arm_x           0.2514772010 -0.012016068 -0.1283627027  0.133483979
## gyros_arm_y           0.0640718959 -0.090019548 -0.0802758794  0.203618505
## gyros_arm_z          -0.0872396855 -0.055570745  0.0166823321  0.091180253
## accel_arm_x          -0.5283285712 -1.014379568 -1.4613009370 -0.315722655
## accel_arm_y          -0.3138907420  1.586741535 -0.2660440016  0.453340569
## accel_arm_z           1.3396417477 -0.238827018  0.3004775653 -0.944608538
## magnet_arm_x         -0.0006205584 -0.071373847  0.9984184639  0.457405344
## magnet_arm_y         -0.3055572200 -0.916269358  1.1794461322 -0.003511072
## magnet_arm_z         -1.1938669330 -0.921269078 -1.7742803950  0.704308904
## roll_dumbbell         0.1694535144 -0.285798432 -0.2614508785 -0.534968207
## pitch_dumbbell       -0.2068831767 -0.148120750 -0.1702540183 -0.171879960
## yaw_dumbbell         -0.6222248704  0.555179819 -0.3518655345 -0.259404734
## total_accel_dumbbell  0.7436605678  0.668534100  0.0001358791  0.031412947
## gyros_dumbbell_x      0.4755021327 -0.815404903  0.4924935554  0.094273340
## gyros_dumbbell_y      0.1457649150 -0.175146633  0.0427837576  0.138077548
## gyros_dumbbell_z      0.2169267700 -0.784470375  0.5191011497  0.121423806
## accel_dumbbell_x      0.8831664101  0.577624388  0.1292659546  0.402575235
## accel_dumbbell_y      0.1872891979  0.213259099  0.4001860038 -0.228610339
## accel_dumbbell_z      0.2801734874  0.223038411  0.2586173623  0.121743927
## magnet_dumbbell_x    -1.5115675698  0.134208146  0.7655690779 -0.785670622
## magnet_dumbbell_y    -0.3188061469  0.721497764 -0.4777244464 -0.624034284
## magnet_dumbbell_z     1.8628139544 -1.416367834  0.2214539092  1.298226114
## roll_forearm          0.1678744897  0.154883882  0.0188932533  0.138007420
## pitch_forearm         0.4395478553 -0.334647457  0.1543297605 -0.010601837
## yaw_forearm          -0.0253957857  0.095870696  0.0497703642  0.131055939
## total_accel_forearm   0.3275059478  0.055321590 -0.0786619316  0.031168521
## gyros_forearm_x      -0.0138251577 -0.039107770  0.1252496990  0.058945605
## gyros_forearm_y      -0.0361423724 -0.131825001  0.0456430449  0.012544694
## gyros_forearm_z       0.1684594127  0.241293247 -0.0659780230 -0.079180582
## accel_forearm_x       0.6642863373  1.904853276 -0.1379577099  0.669983350
## accel_forearm_y       0.1647498404 -0.232145108 -0.1421117475 -0.391081959
## accel_forearm_z      -0.9965675188  0.449103204  0.4513312959 -0.646823258
## magnet_forearm_x     -0.6525891876 -1.200260745  0.0604218863 -0.380651574
## magnet_forearm_y     -0.5000183270 -0.714243996  0.2608265190  0.161454202
## magnet_forearm_z     -0.0495717176 -0.536038091 -0.0102278158  0.447921692
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.4911 0.2402 0.1532 0.1155

Let’s test the model on our cross validation dataset.

pred_lda <- predict(fit_lda,sc_cval)
confusionMatrix(pred_lda,sc_cval$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 925 110  65  39  31
##          B  26 496  74  26 120
##          C  88  82 430  84  63
##          D  75  35  96 471  50
##          E   2  36  19  23 457
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7084          
##                  95% CI : (0.6939, 0.7226)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6308          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8289   0.6535   0.6287   0.7325   0.6338
## Specificity            0.9127   0.9223   0.9021   0.9220   0.9750
## Pos Pred Value         0.7906   0.6685   0.5756   0.6479   0.8510
## Neg Pred Value         0.9306   0.9173   0.9200   0.9462   0.9220
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2358   0.1264   0.1096   0.1201   0.1165
## Detection Prevalence   0.2982   0.1891   0.1904   0.1853   0.1369
## Balanced Accuracy      0.8708   0.7879   0.7654   0.8272   0.8044

According to the results of Model A on the cross validation dataset, the model only have an accuracy of 0.7083864.

Model B: XGBOOST

control <- trainControl(method="cv", number=3, allowParallel = TRUE)
fit_xgb <- train(classe~., method ="xgbTree", data=sc_training, trControl = control)
fit_xgb$finalModel
## ##### xgb.Booster
## raw: 472.2 Kb 
## call:
##   xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, 
##     gamma = param$gamma, colsample_bytree = param$colsample_bytree, 
##     min_child_weight = param$min_child_weight, subsample = param$subsample), 
##     data = x, nrounds = param$nrounds, num_class = length(lev), 
##     objective = "multi:softprob")
## params (as set within xgb.train):
##   eta = "0.4", max_depth = "3", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "0.5", num_class = "5", objective = "multi:softprob", silent = "1"
## xgb.attributes:
##   niter
## callbacks:
##   cb.print.evaluation(period = print_every_n)
## niter: 150
## xNames: roll_beltpitch_beltyaw_belttotal_accel_beltgyros_belt_xgyros_belt_ygyros_belt_zaccel_belt_xaccel_belt_yaccel_belt_zmagnet_belt_xmagnet_belt_ymagnet_belt_zroll_armpitch_armyaw_armtotal_accel_armgyros_arm_xgyros_arm_ygyros_arm_zaccel_arm_xaccel_arm_yaccel_arm_zmagnet_arm_xmagnet_arm_ymagnet_arm_zroll_dumbbellpitch_dumbbellyaw_dumbbelltotal_accel_dumbbellgyros_dumbbell_xgyros_dumbbell_ygyros_dumbbell_zaccel_dumbbell_xaccel_dumbbell_yaccel_dumbbell_zmagnet_dumbbell_xmagnet_dumbbell_ymagnet_dumbbell_zroll_forearmpitch_forearmyaw_forearmtotal_accel_forearmgyros_forearm_xgyros_forearm_ygyros_forearm_zaccel_forearm_xaccel_forearm_yaccel_forearm_zmagnet_forearm_xmagnet_forearm_ymagnet_forearm_z
## problemType: Classification
## tuneValue:
##     nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 93     150         3 0.4     0              0.6                1       0.5
## obsLevels: ABCDE
## param:
##  list()

Let’s test the model on our cross validation dataset.

pred_xgb <- predict(fit_xgb,sc_cval)
confusionMatrix(pred_xgb,sc_cval$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    2    0    0    0
##          B    0  756    5    0    0
##          C    0    1  679    1    0
##          D    0    0    0  642    0
##          E    0    0    0    0  721
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9977         
##                  95% CI : (0.9956, 0.999)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9971         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9960   0.9927   0.9984   1.0000
## Specificity            0.9993   0.9984   0.9994   1.0000   1.0000
## Pos Pred Value         0.9982   0.9934   0.9971   1.0000   1.0000
## Neg Pred Value         1.0000   0.9991   0.9985   0.9997   1.0000
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1927   0.1731   0.1637   0.1838
## Detection Prevalence   0.2850   0.1940   0.1736   0.1637   0.1838
## Balanced Accuracy      0.9996   0.9972   0.9960   0.9992   1.0000

According to the results of Model B on the cross validation dataset, the model have a high accuracy of 0.9977058.

Model Selection & Prediction

As Model B using XGBOOST have almost 100% accuracy on the cross validation set, no other models will be constructed or ensembled. We will be using Model B directly for the prediction of the testing dataset.

pred_testing <- predict(fit_xgb,sc_testing)
pred_testing
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Conclusion

Our model performed a 100% accuracy prediction on the testing dataset.

Limitations

Although the accuracy of the model prediction is good, it might not generalize well when applied to a wider range of subjects, as only six participants are involved in this study, and only a few specific movements (five different barbell lifting movements) are recorded. And due to the data in itself is a time-series dataset, it is should be analysed using time series analysis (in this case, a sliding window analysis as a complete movement includes a combination of sequential entry records), and the testing dataset should be split by time chunks or by windows.

References

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.