Using the training and test datasets provided by [http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises], we will attempt to construct the best-fitting model to predict whether a specific exercise was performed correctly (classe == A) or in one of four incorrect manners (classe == B:E). See the Appendix for a full description of the classes.
In addition to finding the most accurate model, we'll keep track of processing time for each model, because the most accurate model may not be the "best" model for real-life, large data sets.
We are interested in the accelerometer data as it relates to whether an exercise was performed correctly. Visual inspection of the data shows that we can remove columns 1 through 7, which contain identifying information, timestamps, and other data irrelevant to our question.
Many columns contain almost no data. If more than 80% of the values in a column are NA, eliminate that column from the data set.
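A minimal sketch of that cleaning step, assuming the raw data have already been read into a data frame named `training` (the object name and the threshold handling are assumptions, not the author's exact code):

```r
training <- training[, -(1:7)]            # drop identifier and timestamp columns
naFrac   <- colMeans(is.na(training))     # fraction of NA values in each column
training <- training[, naFrac <= 0.80]    # keep columns that are at most 80% NA
dim(training)
```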
## [1] 19622 53
Next, remove columns with near-zero or zero variance.
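One way this check could be written with caret's `nearZeroVar()` (a sketch, assuming `training` is the data frame from the previous step):

```r
library(caret)
nzv <- nearZeroVar(training)                       # indices of (near) zero-variance columns
if (length(nzv) > 0) training <- training[, -nzv]  # nothing is removed for this data
```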
There are still 53 columns, identical to the previous step, so we can leave out the costly `nearZeroVar()` call in the future. What is left contains data that can be used to classify whether an exercise was performed correctly (class A) or incorrectly (classes B through E).
The dataset is still very large, containing 19622 observations and 53 variables. To speed up initial model testing, we'll create three subsets of the training data, then partition each of those subsets. Once we model and predict on the smaller data sets, we can apply the best model (most accurate relative to processing time) to the full data set.
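A sketch of how the subsets and partitions could be built with caret's `createDataPartition()`; the object names (`strain1`, `stest1`) and the split fractions are assumptions chosen to roughly match the sample sizes reported later:

```r
library(caret)
set.seed(3383)
# take roughly a third of the training data as the first working subset
inSub <- createDataPartition(training$classe, p = 1/3, list = FALSE)
sub1  <- training[inSub, ]
# partition the subset into a modeling piece and a hold-out piece
inTrain <- createDataPartition(sub1$classe, p = 0.70, list = FALSE)
strain1 <- sub1[inTrain, ]
stest1  <- sub1[-inTrain, ]
```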
As this is a classification problem, we'll be using a variety of tree-based models: rpart, randomForest, and gbm.
Create the first model fit using recursive partitioning and regression trees (rpart) and predict with it. rpart is great for creating a decision tree plot (see Appendix, fig 1) and is more scalable, but it tends to have lower accuracy than the other methods.
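A sketch of that first fit, assuming the `rpart` and `caret` packages are loaded and the partitioned objects `strain1`/`stest1` sketched above are used:

```r
library(rpart)
library(caret)
set.seed(3383)
ptm <- proc.time()
model.rpart <- rpart(classe ~ ., data = strain1, method = "class")  # classification tree
pred.rpart  <- predict(model.rpart, stest1, type = "class")
cmp1        <- confusionMatrix(pred.rpart, stest1$classe)
rpart.time  <- proc.time() - ptm
```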
Below are the accuracy and the confusion matrix:
## Accuracy
## 0.7188465
|   | A   | B   | C   | D   | E   |
|---|-----|-----|-----|-----|-----|
| A | 466 | 49  | 5   | 7   | 4   |
| B | 11  | 220 | 33  | 12  | 25  |
| C | 14  | 24  | 229 | 43  | 44  |
| D | 50  | 73  | 64  | 246 | 49  |
| E | 11  | 10  | 8   | 10  | 235 |
`randomForest` differs from rpart in that it builds many trees, a forest rather than a single tree, and aggregates their predictions (by majority vote for classification) to produce the final model. It can be less scalable in terms of CPU, but if processing time is not an issue it generally produces the best results.
ptm <- proc.time()
set.seed(3383)
# Fit the random forest on the training data; this doubles as our final training
# model, since it turns out to be the best fit. (allowParallel is a caret
# trainControl option, not a randomForest() argument, so it is dropped here.)
model.fit <- randomForest(classe ~ ., data = strain, importance = TRUE)
predict <- predict(model.fit, stest1)
cmp2 <- confusionMatrix(predict, stest1$classe)
t2 <- cmp2$table
trainall.time <- proc.time() - ptm
Predict on the second model:
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9788877 0.9732862 0.9714669 0.9848079 0.2842430
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
cmp2$overall["Accuracy"]
##  Accuracy 
## 0.9788877
kable(t2)
|   | A   | B   | C   | D   | E   |
|---|-----|-----|-----|-----|-----|
| A | 549 | 7   | 0   | 1   | 0   |
| B | 1   | 365 | 11  | 0   | 0   |
| C | 0   | 4   | 327 | 10  | 2   |
| D | 0   | 0   | 1   | 307 | 2   |
| E | 2   | 0   | 0   | 0   | 353 |
`randomForest` gave us a pretty good accuracy; we'll see if we can do better with boosting, using method `gbm` (stochastic gradient boosting for classification) in the `train()` function.
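The boosting fit likely takes a form similar to the call below (a sketch, not the author's exact code); the 5-fold, 1-repeat cross-validation matches the resampling summary that follows:

```r
library(caret)
set.seed(3383)
ptm  <- proc.time()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
model.gbm <- train(classe ~ ., data = strain1, method = "gbm",
                   trControl = ctrl, verbose = FALSE)
gbm.time  <- proc.time() - ptm
model.gbm
```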
## Stochastic Gradient Boosting
##
## 4537 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 3629, 3629, 3629, 3631, 3630
## Resampling results across tuning parameters:
##
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD   
##   1                   50      0.7562297  0.6907052  0.014001115  0.017692966
##   1                  100      0.8249983  0.7783593  0.013139273  0.016482413
##   1                  150      0.8510022  0.8113774  0.012936625  0.016348686
##   2                   50      0.8446087  0.8031443  0.010778871  0.013682292
##   2                  100      0.9014762  0.8752628  0.007972290  0.010065436
##   2                  150      0.9219713  0.9012586  0.012290797  0.015590197
##   3                   50      0.8902355  0.8610216  0.012405872  0.015684179
##   3                  100      0.9325532  0.9146411  0.006508225  0.008276051
##   3                  150      0.9446755  0.9300113  0.005098588  0.006467697
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
Predict on the third model:
##  Accuracy 
## 0.9588054
|   | A   | B   | C   | D   | E   |
|---|-----|-----|-----|-----|-----|
| A | 542 | 9   | 1   | 0   | 0   |
| B | 4   | 357 | 18  | 1   | 3   |
| C | 2   | 9   | 316 | 9   | 6   |
| D | 0   | 0   | 3   | 306 | 7   |
| E | 4   | 1   | 1   | 2   | 341 |
Model 2, random forest, has a slightly higher accuracy (0.9788877) than gradient boosting (0.9588054), and its computation time is only about 23% of the latter's, so we'll use this model.
The out-of-sample error rate should be about 0.0211123 (one minus the estimated accuracy).
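That figure is simply one minus the hold-out accuracy of the random forest, e.g.:

```r
1 - cmp2$overall["Accuracy"]   # estimated out-of-sample error rate for Model 2
```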
| model             | Accuracy  | user.self (s) | sys.self (s) | elapsed (s) |
|-------------------|-----------|---------------|--------------|-------------|
| rpart             | 0.7188465 | 1.06          | 0.00         | 1.06        |
| gradient boosting | 0.9588054 | 108.10        | 0.25         | 108.40      |
| random forest     | 0.9788877 | 24.72         | 0.20         | 25.02       |
Applying Model 2, trained on the full training set, to the testing set, we get the following predictions:
print(finalPrediction[1:20])
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
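For completeness, a sketch of how these predictions could be produced, assuming the 20-case test file is `pml-testing.csv` and is given the same column treatment as the training data (the file name and cleaning details are assumptions):

```r
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
testing <- testing[, names(testing) %in% names(training)]  # keep matching predictor columns
finalPrediction <- predict(model.fit, testing)
```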
fig 1. Recursive Partitioning (rpart) Tree Plot & Prediction Plot Model 1
fig 2. Random Forest (randomForest) Error Plot & Prediction Plot Model 2
fig 3. Random Forest (randomForest) Error Plot & Prediction Plot Model 2
fig 4. Model 2 Random Forest Variable Importance
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.