Machine Learning - Predicting Dumbbells

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).

This paper describes building a model that predicts the manner in which the participants performed the exercise, along with the use of cross validation and the expected out-of-sample error.

Exploring the Data

An initial summary of the training data showed that a number of variables contained mostly NA values, while others contained very similar values. The boxplots of some of the variables associated with the belt sensors illustrate the latter: their ranges are very narrow and their distributions are skewed.

[Figure: boxplots of some of the belt-sensor variables]

In addition, of the 19622 rows in the training set, many of the variables, such as var_accel_arm, min_roll_belt, and min_pitch_dumbbell, had only 406 values, with the rest NA. With so few observed values, imputing the remaining 19216 would likely not add enough variability to be meaningful.
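A minimal sketch of how the missing values can be tallied per column in base R. The data frame `toy` and its values are made up for illustration; they are not the actual training data, and the report does not show the code it used.

```r
# Toy stand-in for the training data; values are made up for illustration.
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48, 1.45, 1.43),
  var_accel_arm = c(NA, NA, NA, 0.5, NA),
  min_roll_belt = c(NA, NA, NA, -94.3, NA)
)

# Number of observed and missing values in each column.
n_present <- colSums(!is.na(toy))
n_missing <- colSums(is.na(toy))

# Share of each column that is NA.
na_share <- n_missing / nrow(toy)
print(data.frame(n_present, n_missing, na_share))
```

Applied to the real training set, the same counts would show 406 observed values and 19216 NA values (19622 - 406) for the variables named above.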

I began selecting variables for my model by cleaning out the similar and mostly NA variables. Removing the variables with near zero variance reduced the training set from 160 variables to 100, and removing those with NA values reduced it further to 59.
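The two filtering steps can be sketched in base R roughly as follows. `drop_uninformative` is a hypothetical helper, and the thresholds `freq_cut` and `unique_cut` are illustrative assumptions; the report does not state which function or cutoffs were actually used.

```r
# Hypothetical helper: drop near-zero-variance columns, then columns with any NA.
drop_uninformative <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  near_zero <- vapply(df, function(x) {
    counts <- sort(table(x), decreasing = TRUE)    # NA values are not counted
    if (length(counts) < 2) return(TRUE)           # at most one distinct value
    freq_ratio <- counts[[1]] / counts[[2]]        # most common vs. second most common
    pct_unique <- 100 * length(counts) / length(x) # distinct values as % of rows
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  df <- df[, !near_zero, drop = FALSE]

  # Keep only the columns with no missing values.
  df[, colSums(is.na(df)) == 0, drop = FALSE]
}

# Toy data, made up for illustration.
toy <- data.frame(
  new_window = c(rep("no", 99), "yes"),   # almost constant
  roll_belt  = seq(1, 100) / 10,          # varies freely
  max_roll   = c(rep(NA, 98), 1.2, 3.4),  # mostly NA
  classe     = rep(c("A", "B"), 50)
)

kept <- drop_uninformative(toy)
names(kept)
```

In this toy example the almost-constant column and the mostly NA column are dropped, leaving the varying predictor and the response.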

Starting Cross Validation

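Even with the reduced set I still had many variables. Before modeling, I built a cross validation set, using the K-fold method to create 10 training sets and 10 test sets out of the training set.

This gave me 10 training sets with around 59 columns each and 10 testing sets (also taken from the training set) with around 160 columns each. In terms of rows, each training set holds roughly nine tenths of the 19622 observations (17661 in the first, per the root node of the tree output in this report) and each testing set holds the remaining tenth. I also removed the classe variable from these test sets so that predictions would be made without access to the true class.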
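A minimal sketch of a 10-fold split in base R. `make_folds` is a hypothetical helper and the seed is arbitrary; this is a plain random split, which is an assumption, since the report does not show how its folds were generated.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Hypothetical helper: randomly assign each of n rows to one of k folds.
make_folds <- function(n, k = 10) {
  sample(rep(seq_len(k), length.out = n))
}

n_rows  <- 19622                      # rows in the training data, from the report
fold_id <- make_folds(n_rows, k = 10)

# Fold 1 is held out as a test set; the other nine folds form the training set.
test_rows_1  <- which(fold_id == 1)
train_rows_1 <- which(fold_id != 1)

length(train_rows_1)  # about nine tenths of the rows
length(test_rows_1)   # about one tenth of the rows
```

Repeating the last step for folds 2 through 10 yields the remaining training and test sets, with every row appearing in exactly one test set.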

Beginning to Build a Model

I began to build a model on the new training set by examining the correlated predictors. The sensors were located on the belt, arm, forearm, and dumbbell of each participant, with several sensors measuring different values, so I suspected that there might be strong correlation between many of these variables. Using the cor() function, I identified 24 variables in the remaining set with correlation greater than 0.80 and, together with classe, used them to create a small training set. Out of curiosity, I also created a smaller training set from the variables with correlation greater than 0.90. (Note that by “small” and “smaller” I mean fewer predictors for the model.) I then tried these two versions of the training set with the different algorithms.

The variables for the small training set are: user_name, cvtd_timestamp, roll_belt, pitch_belt, yaw_belt, total_accel_belt, accel_belt_x, accel_belt_y, accel_belt_z, magnet_belt_x, gyros_arm_x, gyros_arm_y, accel_arm_x, magnet_arm_x, magnet_arm_y, magnet_arm_z, pitch_dumbbell, yaw_dumbbell, gyros_dumbbell_x, gyros_dumbbell_z, accel_dumbbell_x, accel_dumbbell_z, gyros_forearm_y, gyros_forearm_z, classe.

The variables for the smaller training set are: roll_belt, pitch_belt, total_accel_belt, accel_belt_x, accel_belt_y, accel_belt_z, gyros_arm_x, gyros_arm_y, gyros_dumbbell_x, gyros_dumbbell_z, gyros_forearm_z, classe.
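A minimal sketch of how variables can be selected by pairwise correlation with cor(). `correlated_vars` is a hypothetical helper, the toy values are made up, and using the absolute correlation is an assumption; the report does not say exactly how the cutoffs were applied.

```r
# Toy numeric predictors, made up for illustration.
x <- 1:20
toy <- data.frame(
  roll_belt        = x,
  total_accel_belt = x + rep(c(-1, 1), 10),     # tracks roll_belt closely
  gyros_arm_x      = rep(c(1, -1, -1, 1), 5)    # unrelated to the other two
)

# Hypothetical helper: names of the variables whose absolute correlation with
# at least one other variable exceeds `cutoff`.
correlated_vars <- function(df, cutoff = 0.80) {
  cor_mat <- abs(cor(df))
  diag(cor_mat) <- 0                 # ignore each variable's correlation with itself
  names(which(apply(cor_mat, 1, max) > cutoff))
}

small_vars   <- correlated_vars(toy, cutoff = 0.80)
smaller_vars <- correlated_vars(toy, cutoff = 0.90)
small_vars
```

Raising the cutoff can only shrink the selection, which is consistent with the smaller training set above being a subset of the small one.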

Since the response variable is a factor rather than continuous, a linear regression model would not suit this data set. I therefore decided to use the first of my 10 training sets with different tree functions, and with boost, to compare the models’ accuracy levels. I chose to compare the accuracy of the rpart, tree, randomForest, and boost functions.

Testing Models with Cross-Validation Training Set 1

Here are some of the results from my experiments with fitting models using different functions. Not all are shown due to space limitations.

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##    1) root 17661 60000 A ( 0.284 0.193 0.174 0.164 0.184 )  
##      2) roll_belt < 129.5 16055 50000 A ( 0.309 0.213 0.192 0.180 0.106 )  
##        4) cvtd_timestamp < 19.5 15076 50000 A ( 0.329 0.227 0.204 0.161 0.080 )  
##          8) pitch_dumbbell < -2.01532 8670 20000 A ( 0.407 0.189 0.282 0.098 0.024 )  
##           16) cvtd_timestamp < 2.5 1213  1000 A ( 0.807 0.193 0.000 0.000 0.000 ) *
##           17) cvtd_timestamp > 2.5 7457 20000 A ( 0.341 0.189 0.328 0.114 0.028 )  
##             34) pitch_belt < 0.825 1610  4000 C ( 0.000 0.188 0.402 0.394 0.016 )  
##               68) cvtd_timestamp < 3.5 1099  2000 C ( 0.000 0.269 0.578 0.153 0.000 )  
##                136) pitch_belt < -43.25 303   200 B ( 0.000 0.914 0.086 0.000 0.000 ) *
##                137) pitch_belt > -43.25 796  1000 C ( 0.000 0.024 0.765 0.211 0.000 ) *
##               69) cvtd_timestamp > 3.5 511   400 D ( 0.000 0.014 0.023 0.912 0.051 ) *
##             35) pitch_belt > 0.825 5847 10000 A ( 0.435 0.189 0.308 0.037 0.031 )  
##               70) magnet_arm_x < -360.5 1674  2000 A ( 0.803 0.055 0.131 0.004 0.007 ) *
##               71) magnet_arm_x > -360.5 4173 10000 C ( 0.288 0.243 0.379 0.050 0.041 )  
##                142) cvtd_timestamp < 6.5 544   700 A ( 0.688 0.312 0.000 0.000 0.000 ) *
##                143) cvtd_timestamp > 6.5 3629 10000 C ( 0.228 0.232 0.436 0.057 0.047 )  
##                  286) user_name < 3.5 1441  4000 C ( 0.086 0.120 0.568 0.117 0.108 ) *
##                  287) user_name > 3.5 2188  5000 C ( 0.321 0.306 0.349 0.017 0.006 ) *
##          9) pitch_dumbbell > -2.01532 6406 20000 B ( 0.224 0.277 0.097 0.246 0.156 )  
##           18) cvtd_timestamp < 16.5 5775 20000 B ( 0.248 0.307 0.108 0.246 0.091 )  
##             36) roll_belt < -0.01 366   500 E ( 0.000 0.000 0.000 0.320 0.680 ) *
##             37) roll_belt > -0.01 5409 20000 B ( 0.265 0.328 0.115 0.241 0.051 )  
##               74) accel_dumbbell_x < 39.5 4638 10000 A ( 0.300 0.242 0.129 0.280 0.049 )  
##                148) cvtd_timestamp < 6.5 570   900 B ( 0.221 0.725 0.047 0.000 0.007 ) *
##                149) cvtd_timestamp > 6.5 4068 10000 D ( 0.311 0.174 0.141 0.319 0.055 )  
##                  298) cvtd_timestamp < 7.5 518   300 D ( 0.000 0.025 0.000 0.931 0.044 ) *
##                  299) cvtd_timestamp > 7.5 3550 10000 A ( 0.357 0.196 0.161 0.230 0.056 )  
##                    598) cvtd_timestamp < 10.5 773  1000 A ( 0.568 0.395 0.026 0.000 0.012 )  
##                     1196) yaw_belt < -93.15 402   100 A ( 0.963 0.037 0.000 0.000 0.000 ) *
##                     1197) yaw_belt > -93.15 371   500 B ( 0.140 0.782 0.054 0.000 0.024 ) *
##                    599) cvtd_timestamp > 10.5 2777  8000 A ( 0.298 0.140 0.199 0.293 0.069 )  
##                     1198) user_name < 3 545   900 D ( 0.000 0.000 0.099 0.640 0.261 ) *
##                     1199) user_name > 3 2232  6000 A ( 0.371 0.175 0.224 0.209 0.022 )  
##                       2398) cvtd_timestamp < 13.5 1181  2000 A ( 0.483 0.208 0.309 0.000 0.000 ) *
##                       2399) cvtd_timestamp > 13.5 1051  3000 D ( 0.245 0.137 0.127 0.443 0.047 )  
##                         4798) accel_belt_z < 13 515   400 D ( 0.000 0.000 0.012 0.893 0.095 ) *
##                         4799) accel_belt_z > 13 536  1000 A ( 0.481 0.269 0.239 0.011 0.000 )  
##                           9598) cvtd_timestamp < 15.5 252     0 A ( 1.000 0.000 0.000 0.000 0.000 ) *
##                           9599) cvtd_timestamp > 15.5 284   500 B ( 0.021 0.507 0.451 0.021 0.000 ) *
##               75) accel_dumbbell_x > 39.5 771   900 B ( 0.053 0.848 0.031 0.006 0.061 ) *
##           19) cvtd_timestamp > 16.5 631   700 E ( 0.000 0.000 0.000 0.246 0.754 ) *
##        5) cvtd_timestamp > 19.5 979  1000 E ( 0.000 0.000 0.008 0.483 0.509 )  
##         10) pitch_belt < 3.14 316     0 E ( 0.000 0.000 0.000 0.000 1.000 ) *
##         11) pitch_belt > 3.14 663   900 D ( 0.000 0.000 0.012 0.713 0.275 ) *
##      3) roll_belt > 129.5 1606   500 E ( 0.039 0.000 0.000 0.000 0.961 ) *

Accuracy of Tests

accuracy.all

##        methods  small smaller
## 1        rpart 0.5166 0.42282
## 2         tree 0.1742 0.03158
## 3 randomForest 0.9995 0.99440
## 4        boost 0.9643 0.99440

Comparing the accuracy of the different functions across the two training sets, the randomForest function with the small training set had the highest accuracy of the eight attempts (0.9995). randomForest and boost with the smaller training set were close behind (both 0.9944), followed by boost with the small training set (0.9643). It’s interesting to see that having more predictors improved the accuracy for rpart, tree, and randomForest, but not for boost. The randomForest function, with the small training set, appears to be the best choice.

Cross Validation with Chosen Model

Having built a model on the first training set and tested it on the first test set, I needed to repeat the evaluation using the chosen model (randomForest) with all of the cross validation training and test sets, and average the estimated errors. For each of the 10 training/test sets, I measured the accuracy of the prediction as the number of correct predictions divided by the number of observations in the test set.
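The evaluation loop described above can be sketched in base R as follows. `cv_accuracy` and `majority_class` are hypothetical names, and the majority-class predictor is only a stand-in so the example runs without extra packages; in the report the model fitted in each fold is randomForest.

```r
# Accuracy: correct predictions divided by the number of test observations.
accuracy <- function(predicted, actual) {
  sum(predicted == actual) / length(actual)
}

# Hypothetical helper. `fit_predict(train, test)` must return one predicted
# class per row of `test`.
cv_accuracy <- function(df, fold_id, fit_predict, response = "classe") {
  vapply(sort(unique(fold_id)), function(i) {
    train <- df[fold_id != i, , drop = FALSE]
    test  <- df[fold_id == i, , drop = FALSE]
    accuracy(fit_predict(train, test), test[[response]])
  }, numeric(1))
}

# Stand-in model for illustration only: always predict the most common class.
majority_class <- function(train, test) {
  top <- names(which.max(table(train$classe)))
  rep(top, nrow(test))
}

# Toy data, made up for illustration.
toy <- data.frame(
  roll_belt = 1:20,
  classe    = rep(c("A", "A", "A", "B"), 5)
)
fold_id <- rep(1:5, each = 4)   # five folds of four rows each

acc <- cv_accuracy(toy, fold_id, majority_class)
acc        # accuracy per fold
mean(acc)  # cross-validated accuracy estimate
```

Passing the model as a function keeps the loop independent of the algorithm, so the same code could evaluate any of the candidate functions on the same folds.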

After completing the cross validation tests to validate the model, the accuracy values for the 10 predictions were:

##  [1] 1.000 0.999 1.000 0.999 0.999 0.999 0.999 0.998 0.999 1.000

The mean was 0.999. With accuracy that high, I chose to fit the model on the entire training set (recombining the cross validation training and test sets) and use it to predict the classe variable for the testing set.
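As a quick check of the arithmetic, the mean can be recomputed from the ten rounded accuracy values listed above. The variable name `fold_accuracy` is illustrative, and because the inputs are rounded to three decimals the result is approximate.

```r
# Accuracy values for the 10 folds, as printed in the report.
fold_accuracy <- c(1.000, 0.999, 1.000, 0.999, 0.999,
                   0.999, 0.999, 0.998, 0.999, 1.000)

mean(fold_accuracy)       # mean accuracy across the 10 folds
1 - mean(fold_accuracy)   # implied estimate of the out-of-sample error
range(fold_accuracy)      # worst and best fold
```

The mean of these values is 0.9992, which rounds to the 0.999 reported, and the implied out-of-sample error estimate is roughly 0.1%.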

Conclusion

The final model is

## 
## Call:
##  randomForest(formula = classe ~ ., data = training.final, importance = TRUE,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0.62%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 5561   10    9    0    0    0.003405
## B   31 3751   15    0    0    0.012115
## C    4   17 3386   15    0    0.010520
## D    1    0    9 3204    2    0.003731
## E    0    0    2    7 3598    0.002495

It’s interesting to note that the final model’s OOB estimate of the error rate, 0.62%, corresponds to an accuracy of about 0.994, slightly lower than the mean cross validation accuracy of 0.999. On the testing set, however, my randomForest model predicted 18 of 20 cases correctly, an accuracy of 0.9, which is below the roughly 0.99 that the out-of-sample error estimates led me to expect.
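The error figures in the final model output can be recomputed directly from its confusion matrix, which makes the comparison between error rates and accuracies easier to follow. This is a sketch in base R; the variable names are illustrative.

```r
# Out-of-bag confusion matrix copied from the final model output
# (rows: actual class, columns: predicted class).
conf <- matrix(
  c(5561,   10,    9,    0,    0,
      31, 3751,   15,    0,    0,
       4,   17, 3386,   15,    0,
       1,    0,    9, 3204,    2,
       0,    0,    2,    7, 3598),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

oob_error    <- 1 - sum(diag(conf)) / sum(conf)  # share of misclassified rows
oob_accuracy <- 1 - oob_error
class_error  <- 1 - diag(conf) / rowSums(conf)   # per-class error (class.error)

test_accuracy <- 18 / 20                         # testing-set result from the report

round(100 * oob_error, 2)   # OOB error in percent
round(oob_accuracy, 4)      # OOB accuracy
round(class_error, 6)
test_accuracy
```

The matrix sums to 19622 rows with 122 misclassified, which reproduces the 0.62% OOB error, an accuracy of about 0.994. That is the figure to set against the cross validation accuracy and the 0.9 obtained on the testing set.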

References

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.