Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
This paper discusses building a model that predicts the manner in which the participants did the exercise, along with the use of cross validation and the expected out-of-sample error.
Inspecting an initial summary of the training data showed a number of variables containing mostly NA values, and others containing very similar values, as shown by the very narrow range and skewness in boxplots of some of the variables associated with the belt sensors.
In addition, out of the 19622 rows in the training set, many of the variables, such as var_accel_arm, min_roll_belt, and min_pitch_dumbbell, had only 406 values, with the rest NA. With so few observed values, imputing the remaining 19216 would likely not add enough variability to be meaningful.
I began selecting variables for my model by cleaning out the similar and mostly NA variables. Removing those with near zero variance reduced the training set from 160 variables to 100, and removing those with NA values reduced it further to 59.
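A minimal sketch of this kind of column filtering, in base R on toy data, might look like the following. It is not the code used for this analysis: the data frame `toy`, the helper `is_near_constant`, and the 0.95 threshold are all assumptions made for illustration, and a packaged function such as caret's nearZeroVar applies a more refined rule.

```r
# Toy data standing in for the real training set (hypothetical values).
set.seed(1)
toy <- data.frame(
  roll_belt     = rnorm(100),
  new_window    = factor(c(rep("no", 98), rep("yes", 2))),  # almost constant
  min_roll_belt = c(rnorm(2), rep(NA, 98)),                 # mostly NA
  classe        = factor(sample(LETTERS[1:5], 100, replace = TRUE))
)

# Crude near-zero-variance check (an assumption, not caret's exact rule):
# flag a column when its most common value covers more than max_share
# of the non-missing observations.
is_near_constant <- function(x, max_share = 0.95) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(TRUE)
  max(table(x)) / length(x) > max_share
}

near_constant <- vapply(toy, is_near_constant, logical(1))
has_na        <- vapply(toy, anyNA, logical(1))

# Keep only columns that are neither near-constant nor contain NA values.
reduced <- toy[, !(near_constant | has_na), drop = FALSE]
names(reduced)
```

In this toy example only roll_belt and classe survive, since new_window is nearly constant and min_roll_belt is mostly NA.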
Even with the reduced set I still had many variables. Before building a model, I set up the cross validation sets, using the K-fold method to create 10 training sets and 10 test sets out of the training set.
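This gave me 10 training sets with a length of around 59 and 10 testing sets (also taken from the training set) with a length of around 160. With 19622 rows split into 10 folds, each training set holds roughly 17660 observations and each test set roughly 1962, which is consistent with the root node count of 17661 in the tree output further down. I also needed to remove the classe variable from these test sets so that prediction could happen.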
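The fold construction can be sketched in base R as follows. This is an illustration under assumptions rather than the code used here: the names `fold_id`, `test_idx`, and `train_idx` are hypothetical, and the assignment is purely random, whereas a helper such as caret's createFolds stratifies on the outcome and so gives slightly different fold sizes.

```r
set.seed(42)
n_rows <- 19622   # rows in the training data, as reported above
k      <- 10

# Randomly assign every row to one of k folds of near-equal size.
fold_id <- sample(rep(seq_len(k), length.out = n_rows))

# For fold i, the test set is fold i and the training set is every other row.
test_idx  <- lapply(seq_len(k), function(i) which(fold_id == i))
train_idx <- lapply(seq_len(k), function(i) which(fold_id != i))

sapply(test_idx, length)    # about 1962 rows per test fold
sapply(train_idx, length)   # about 17660 rows per training fold
```

Each row lands in exactly one test fold, so every observation is used for testing once and for training nine times.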
I then began to build a model on the new training set, starting by examining the correlated predictors. The sensors were placed on the belt, arm, forearm, and dumbbell of each participant, with several sensors measuring different values, so I suspected that there might be strong correlation between many of these variables. Using the cor() function, I identified 24 variables in the remaining set with greater than .80 correlation, and together with classe these formed a small training set. Out of curiosity, I also created a smaller training set from the variables with greater than .90 correlation. (Note that by “small” and “smaller” I mean fewer predictors for the model.) I then tried these two versions of the training set with the different algorithms.
The variables for the small training set are: user_name, cvtd_timestamp, roll_belt, pitch_belt, yaw_belt, total_accel_belt, accel_belt_x, accel_belt_y, accel_belt_z, magnet_belt_x, gyros_arm_x, gyros_arm_y, accel_arm_x, magnet_arm_x, magnet_arm_y, magnet_arm_z, pitch_dumbbell, yaw_dumbbell, gyros_dumbbell_x, gyros_dumbbell_z, accel_dumbbell_x, accel_dumbbell_z, gyros_forearm_y, gyros_forearm_z, classe.
The variables for the smaller training set are: roll_belt, pitch_belt, total_accel_belt, accel_belt_x, accel_belt_y, accel_belt_z, gyros_arm_x, gyros_arm_y, gyros_dumbbell_x, gyros_dumbbell_z, gyros_forearm_z, classe.
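One way to pick out variables by correlation with cor() is sketched below on toy data. The data frame `toy` and the helper `correlated_vars` are hypothetical, and the rule used (keep a variable if its absolute correlation with at least one other variable exceeds the cutoff) is only one plausible reading of the selection described above.

```r
set.seed(7)
n    <- 200
base <- rnorm(n)
toy <- data.frame(
  roll_belt        = base,
  total_accel_belt = base + rnorm(n, sd = 0.1),  # strongly correlated with roll_belt
  gyros_arm_x      = rnorm(n),                   # independent noise
  pitch_forearm    = rnorm(n)                    # independent noise
)

# Names of variables whose absolute correlation with at least one
# other variable exceeds the cutoff.
correlated_vars <- function(df, cutoff) {
  cm <- abs(cor(df))
  diag(cm) <- 0   # ignore each variable's correlation with itself
  names(which(apply(cm, 1, max) > cutoff))
}

correlated_vars(toy, 0.80)
correlated_vars(toy, 0.90)
```

Raising the cutoff can only shrink the selected set, which is why the .90 list above is a subset of the .80 list.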
Since the response variable is a factor with five levels rather than a continuous value, linear models would be difficult to fit for this data set. I therefore used the first of my 10 training sets with different tree functions, and with boosting, to compare the models’ accuracy levels. The functions compared were rpart, tree, randomForest, and boost.
Here are some of the results from my experiments with fitting models using different functions. Not all are shown due to space limitations.
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
##
## 1) root 17661 60000 A ( 0.284 0.193 0.174 0.164 0.184 )
##   2) roll_belt < 129.5 16055 50000 A ( 0.309 0.213 0.192 0.180 0.106 )
##     4) cvtd_timestamp < 19.5 15076 50000 A ( 0.329 0.227 0.204 0.161 0.080 )
##       8) pitch_dumbbell < -2.01532 8670 20000 A ( 0.407 0.189 0.282 0.098 0.024 )
##         16) cvtd_timestamp < 2.5 1213 1000 A ( 0.807 0.193 0.000 0.000 0.000 ) *
##         17) cvtd_timestamp > 2.5 7457 20000 A ( 0.341 0.189 0.328 0.114 0.028 )
##           34) pitch_belt < 0.825 1610 4000 C ( 0.000 0.188 0.402 0.394 0.016 )
##             68) cvtd_timestamp < 3.5 1099 2000 C ( 0.000 0.269 0.578 0.153 0.000 )
##               136) pitch_belt < -43.25 303 200 B ( 0.000 0.914 0.086 0.000 0.000 ) *
##               137) pitch_belt > -43.25 796 1000 C ( 0.000 0.024 0.765 0.211 0.000 ) *
##             69) cvtd_timestamp > 3.5 511 400 D ( 0.000 0.014 0.023 0.912 0.051 ) *
##           35) pitch_belt > 0.825 5847 10000 A ( 0.435 0.189 0.308 0.037 0.031 )
##             70) magnet_arm_x < -360.5 1674 2000 A ( 0.803 0.055 0.131 0.004 0.007 ) *
##             71) magnet_arm_x > -360.5 4173 10000 C ( 0.288 0.243 0.379 0.050 0.041 )
##               142) cvtd_timestamp < 6.5 544 700 A ( 0.688 0.312 0.000 0.000 0.000 ) *
##               143) cvtd_timestamp > 6.5 3629 10000 C ( 0.228 0.232 0.436 0.057 0.047 )
##                 286) user_name < 3.5 1441 4000 C ( 0.086 0.120 0.568 0.117 0.108 ) *
##                 287) user_name > 3.5 2188 5000 C ( 0.321 0.306 0.349 0.017 0.006 ) *
##       9) pitch_dumbbell > -2.01532 6406 20000 B ( 0.224 0.277 0.097 0.246 0.156 )
##         18) cvtd_timestamp < 16.5 5775 20000 B ( 0.248 0.307 0.108 0.246 0.091 )
##           36) roll_belt < -0.01 366 500 E ( 0.000 0.000 0.000 0.320 0.680 ) *
##           37) roll_belt > -0.01 5409 20000 B ( 0.265 0.328 0.115 0.241 0.051 )
##             74) accel_dumbbell_x < 39.5 4638 10000 A ( 0.300 0.242 0.129 0.280 0.049 )
##               148) cvtd_timestamp < 6.5 570 900 B ( 0.221 0.725 0.047 0.000 0.007 ) *
##               149) cvtd_timestamp > 6.5 4068 10000 D ( 0.311 0.174 0.141 0.319 0.055 )
##                 298) cvtd_timestamp < 7.5 518 300 D ( 0.000 0.025 0.000 0.931 0.044 ) *
##                 299) cvtd_timestamp > 7.5 3550 10000 A ( 0.357 0.196 0.161 0.230 0.056 )
##                   598) cvtd_timestamp < 10.5 773 1000 A ( 0.568 0.395 0.026 0.000 0.012 )
##                     1196) yaw_belt < -93.15 402 100 A ( 0.963 0.037 0.000 0.000 0.000 ) *
##                     1197) yaw_belt > -93.15 371 500 B ( 0.140 0.782 0.054 0.000 0.024 ) *
##                   599) cvtd_timestamp > 10.5 2777 8000 A ( 0.298 0.140 0.199 0.293 0.069 )
##                     1198) user_name < 3 545 900 D ( 0.000 0.000 0.099 0.640 0.261 ) *
##                     1199) user_name > 3 2232 6000 A ( 0.371 0.175 0.224 0.209 0.022 )
##                       2398) cvtd_timestamp < 13.5 1181 2000 A ( 0.483 0.208 0.309 0.000 0.000 ) *
##                       2399) cvtd_timestamp > 13.5 1051 3000 D ( 0.245 0.137 0.127 0.443 0.047 )
##                         4798) accel_belt_z < 13 515 400 D ( 0.000 0.000 0.012 0.893 0.095 ) *
##                         4799) accel_belt_z > 13 536 1000 A ( 0.481 0.269 0.239 0.011 0.000 )
##                           9598) cvtd_timestamp < 15.5 252 0 A ( 1.000 0.000 0.000 0.000 0.000 ) *
##                           9599) cvtd_timestamp > 15.5 284 500 B ( 0.021 0.507 0.451 0.021 0.000 ) *
##             75) accel_dumbbell_x > 39.5 771 900 B ( 0.053 0.848 0.031 0.006 0.061 ) *
##         19) cvtd_timestamp > 16.5 631 700 E ( 0.000 0.000 0.000 0.246 0.754 ) *
##     5) cvtd_timestamp > 19.5 979 1000 E ( 0.000 0.000 0.008 0.483 0.509 )
##       10) pitch_belt < 3.14 316 0 E ( 0.000 0.000 0.000 0.000 1.000 ) *
##       11) pitch_belt > 3.14 663 900 D ( 0.000 0.000 0.012 0.713 0.275 ) *
##   3) roll_belt > 129.5 1606 500 E ( 0.039 0.000 0.000 0.000 0.961 ) *
accuracy.all
##        methods  small smaller
## 1        rpart 0.5166 0.42282
## 2         tree 0.1742 0.03158
## 3 randomForest 0.9995 0.99440
## 4        boost 0.9643 0.99440
Comparing the accuracy of the different functions across the two sizes of training set, the randomForest function with the small training set had the highest accuracy of the eight attempts (0.9995). Both randomForest and boost with the smaller training set were close behind (0.9944 each), followed by boost with the small training set (0.9643). It’s interesting to see that having more predictors improved the accuracy rate for some, though not all, of the functions: rpart, tree, and randomForest did better with the small set, while boost did better with the smaller one. The randomForest function, with the small training set, appears to be the best choice.
Now that I had built a model on the first training set and tested it on the matching test set, I needed to repeat the evaluation using the chosen model (randomForest) with all of the cross validation training and test sets, and average the estimated errors. For each of the 10 training/test sets, I measured the accuracy of the prediction as the number of correct predictions divided by the number of observations in the test set.
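The accuracy measure described above can be sketched in base R as follows. The helper `accuracy` and the two tiny folds in `fold_results` are hypothetical and exist only to show the arithmetic; they are not results from this analysis.

```r
# Share of predictions that match the true class.
accuracy <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  mean(as.character(predicted) == as.character(actual))
}

# Hypothetical predictions for two tiny folds.
fold_results <- list(
  list(predicted = c("A", "B", "C", "D"), actual = c("A", "B", "C", "E")),
  list(predicted = c("A", "A", "E", "E"), actual = c("A", "A", "E", "E"))
)

# One accuracy value per fold, then the cross validation average.
fold_accuracy <- vapply(
  fold_results,
  function(f) accuracy(f$predicted, f$actual),
  numeric(1)
)
fold_accuracy         # 0.75 1.00
mean(fold_accuracy)   # 0.875
```

Averaging the per-fold values gives a single estimate of out-of-sample accuracy; one minus that average is the estimated out-of-sample error.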
After completing the cross validation tests to validate the model, the accuracy values for the 10 predictions were:
## [1] 1.000 0.999 1.000 0.999 0.999 0.999 0.999 0.998 0.999 1.000
The mean was 0.999. With the accuracy that high, I chose to refit the model on the entire training set (all 19622 rows, that is, the cross validation training and test folds combined) and use it to predict the classe variable for the testing set.
The final model is:
##
## Call:
##  randomForest(formula = classe ~ ., data = training.final, importance = TRUE, proximity = TRUE)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##         OOB estimate of  error rate: 0.62%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 5561   10    9    0    0    0.003405
## B   31 3751   15    0    0    0.012115
## C    4   17 3386   15    0    0.010520
## D    1    0    9 3204    2    0.003731
## E    0    0    2    7 3598    0.002495
It’s interesting to note that in the final model the OOB estimate of the error rate (0.62%, equivalent to an accuracy of about 0.994) is somewhat higher than the error implied by the cross validation results, whose mean accuracy was 0.999. Still, the accuracy on the testing set with my randomForest model fit was 18/20, or 0.9, below the roughly 0.99 that the out-of-sample estimates led me to expect. With only 20 test cases, each misclassification costs 5 percentage points.
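The OOB error rate and the per-class errors can be recomputed directly from the confusion matrix printed above. The sketch below copies those counts into a base R matrix; the variable names `conf`, `class_error`, and `oob_error` are my own, and rows are taken to be the actual classes and columns the predicted classes.

```r
# Counts copied from the confusion matrix of the final model.
conf <- matrix(
  c(5561,   10,    9,    0,    0,
      31, 3751,   15,    0,    0,
       4,   17, 3386,   15,    0,
       1,    0,    9, 3204,    2,
       0,    0,    2,    7, 3598),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall error: share of all observations off the diagonal.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(class_error, 6)
round(100 * oob_error, 2)   # 0.62 (percent)
1 - oob_error               # accuracy of about 0.9938
```

The 122 off-diagonal observations out of 19622 give the 0.62% error rate, which corresponds to an accuracy of about 0.994.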
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.
Data source: http://groupware.les.inf.puc-rio.br/har