The goal of this project is to use data from wearable devices to build a model that can tell whether an exercise was performed correctly or incorrectly. A specific exercise, barbell lifts, was performed by 6 participants in 5 different ways, both correctly and incorrectly. The report below covers data processing, exploratory analysis, model selection, model validation, estimation of out-of-sample error, and testing of the model on a separate set of data.
The data consist primarily of readings from four accelerometers located on the belt, forearm, arm, and dumbbell. The dataset contains 160 variables, however, so the first step is to identify candidate variables for the model.
A literature review (http://groupware.les.inf.puc-rio.br/work.jsf?p1=10335) suggested that average and standard deviation values would be most helpful. We therefore try to replicate these findings by working with a subset of the data that summarizes the observations as averages and standard deviations, and by comparing it with subsets containing other variables.
## Random Forest
##
## 246 samples
## 24 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 164, 164, 164
## Resampling results across tuning parameters:
##
##   mtry  Accuracy  Kappa
##    2    0.703     0.624
##   13    0.683     0.600
##   24    0.675     0.590
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  A  B  C  D  E
##          A 36  7  3  4  0
##          B  3 21  1  0  3
##          C  3  1 23  3  1
##          D  0  1  1 20  2
##          E  1  1  0  0 25
##
## Overall Statistics
##
## Accuracy : 0.7812
## 95% CI : (0.7091, 0.8427)
## No Information Rate : 0.2688
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7232
## Mcnemar's Test P-Value : 0.2469
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8372   0.6774   0.8214   0.7407   0.8065
## Specificity            0.8803   0.9457   0.9394   0.9699   0.9845
## Pos Pred Value         0.7200   0.7500   0.7419   0.8333   0.9259
## Neg Pred Value         0.9364   0.9242   0.9612   0.9485   0.9549
## Prevalence             0.2687   0.1938   0.1750   0.1688   0.1938
## Detection Rate         0.2250   0.1313   0.1437   0.1250   0.1562
## Detection Prevalence   0.3125   0.1750   0.1938   0.1500   0.1688
## Balanced Accuracy      0.8588   0.8116   0.8804   0.8553   0.8955
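Training the random forest model from the caret package, we get an accuracy of about 0.78 in the confusion matrix above (0.78125, that is, 125 of 160 observations classified correctly). The 3-fold cross-validated accuracy of 0.703 is lower, so the out-of-sample error is estimated at around 30%.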
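The accuracy figure can be re-derived by hand from the confusion matrix above: correct predictions sit on the diagonal, and each column holds one reference class. The following is a minimal Python sketch of that arithmetic, not the R/caret code that produced the report; the names `rf_small`, `accuracy`, and `sensitivity` are illustrative.

```python
# Random forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class, order A..E).
rf_small = [
    [36, 7, 3, 4, 0],
    [3, 21, 1, 0, 3],
    [3, 1, 23, 3, 1],
    [0, 1, 1, 20, 2],
    [1, 1, 0, 0, 25],
]


def accuracy(matrix):
    """Share of all predictions that lie on the diagonal (correct ones)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def sensitivity(matrix, k):
    """Correct predictions of class k divided by all reference cases of class k."""
    reference_total = sum(row[k] for row in matrix)
    return matrix[k][k] / reference_total


acc = accuracy(rf_small)
print(f"accuracy = {acc:.5f}, error = {1 - acc:.5f}")
print(f"sensitivity for class A = {sensitivity(rf_small, 0):.4f}")
```

This gives an accuracy of 0.78125 and a class A sensitivity of 0.8372, matching the caret output.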
However, one problem with this approach is that the data provided for testing contain no values for this set of predictors, so the model cannot be applied to them.
In the following sections we consider alternative methods and alternative sets of predictors.
First, we check whether another method does better.
##
## predlda  A  B  C  D  E
##       A 24  8  4  5  1
##       B  8 11  0  2  3
##       C  6  2 21  3  2
##       D  1  7  3 14  4
##       E  4  3  0  3 21
Training the “lda” model from the caret package, we get an accuracy of about 0.57 (0.56875).
##
## predgbm  A  B  C  D  E
##       A 35 11  3  2  0
##       B  5 15  3  1  3
##       C  2  1 21  4  2
##       D  0  2  1 17  2
##       E  1  2  0  3 24
##
## predrpart  A  B  C  D  E
##         A 24  7  2  4  0
##         B 10 14  6  1  2
##         C  6  2 10  3  2
##         D  2  7 10 17  2
##         E  1  1  0  2 25
Using the “gbm” method, we get an accuracy of about 0.70; the “rpart” method yields an accuracy of about 0.56 (0.5625). Combining these models seems to reduce accuracy.
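The report does not say how the models were combined. As one possible illustration, the Python sketch below uses a simple majority vote over three models and shows how a combination can score lower than its best member. The voting rule, the function name `majority_vote`, and all labels in the toy data are assumptions made up for this example; they are not results from the report.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-observation labels from three models by majority vote."""
    combined = []
    for votes in zip(*model_predictions):
        label, count = Counter(votes).most_common(1)[0]
        # If all models disagree there is no majority: fall back to the first model.
        combined.append(label if count > 1 else votes[0])
    return combined


def share_correct(predicted, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical toy labels, invented for illustration only.
truth = ["A", "B", "C", "D"]
rf_pred = ["A", "B", "C", "D"]
gbm_pred = ["A", "C", "C", "E"]
rpart_pred = ["B", "C", "D", "A"]

combined = majority_vote(rf_pred, gbm_pred, rpart_pred)
print(combined, share_correct(rf_pred, truth), share_correct(combined, truth))
```

In this toy case the two weaker models outvote the strongest one on the second observation, so the combined accuracy (0.75) falls below that of the best single model (1.0).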
It is possible that we are discarding too much of the data, so we tried using more predictors.
Using the “lda” method from the caret package with more variables, the reported accuracy is unchanged at 0.56875, and the accuracy of the “rpart” method likewise stays at 0.5625. The “gbm” and random forest methods become slower with a larger subset of the data. If computational time is not an issue, much greater accuracy can be achieved by using all predictors that have no missing values and dropping the first seven columns: observation number, person’s name, time info, more time info, etc.
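The column selection described above (drop the first seven bookkeeping columns, then keep only predictors without missing values) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the report's R code: the outcome column name `classe`, the missing-value markers, and every column name and value in the toy rows are hypothetical.

```python
def complete_predictors(rows, n_leading=7, outcome="classe", missing=("", "NA")):
    """Return predictor columns that have no missing value in any row.

    rows      -- list of dicts with identical keys, e.g. from csv.DictReader
    n_leading -- number of leading bookkeeping columns to drop
    outcome   -- name of the class column (assumed here, not taken from the report)
    missing   -- strings treated as missing values (assumed markers)
    """
    columns = list(rows[0].keys())
    candidates = [c for c in columns[n_leading:] if c != outcome]
    return [c for c in candidates if all(row[c] not in missing for row in rows)]


# Hypothetical toy rows: seven bookkeeping columns, three sensor columns, one outcome.
leading = {"id": "1", "user": "p1", "t1": "0", "t2": "0", "t3": "0", "w1": "no", "w2": "1"}
rows = [
    {**leading, "roll_belt": "1.41", "avg_roll_belt": "NA", "pitch_belt": "8.07", "classe": "A"},
    {**leading, "roll_belt": "1.48", "avg_roll_belt": "1.5", "pitch_belt": "8.05", "classe": "B"},
]

selected = complete_predictors(rows)
print(selected)
```

Here `avg_roll_belt` is dropped because it is missing in the first row, leaving `roll_belt` and `pitch_belt`.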
Below are the calculations that take longer but classify much more accurately.
##
## predrf    A    B    C    D    E
##      A 2229   11    0    0    0
##      B    3 1505   16    0    0
##      C    0    2 1350   22    0
##      D    0    0    2 1263    3
##      E    0    0    0    1 1439
The classes predicted for the provided test dataset are B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B. Judging by the confusion matrix above, the model should get only about 1% of such predictions wrong.
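The "about 1%" figure follows from the random forest confusion matrix above, and it can be translated into what to expect for the 20 test cases. The Python sketch below does that arithmetic; treating the 20 cases as independent draws with the same error rate is a simplifying assumption made here, not a claim from the report.

```python
# Random forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class, order A..E).
rf_full = [
    [2229, 11, 0, 0, 0],
    [3, 1505, 16, 0, 0],
    [0, 2, 1350, 22, 0],
    [0, 0, 2, 1263, 3],
    [0, 0, 0, 1, 1439],
]

correct = sum(rf_full[i][i] for i in range(len(rf_full)))
total = sum(sum(row) for row in rf_full)
error_rate = 1 - correct / total

n_test = 20
expected_wrong = n_test * error_rate
# Assumes independent test cases that share the same error rate.
p_all_correct = (1 - error_rate) ** n_test

print(f"error rate = {error_rate:.2%}")
print(f"expected misclassifications out of {n_test} = {expected_wrong:.2f}")
print(f"probability that all {n_test} are correct = {p_all_correct:.2f}")
```

With 7786 of 7846 observations classified correctly, the error rate is about 0.76%, so on average well under one of the 20 predictions would be wrong.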
When computation time matters, the best approach seems to be the random forest method applied to the subset of variables holding averages and standard deviations, that is, the predictors listed below.
| Predictor |
|---|
| avg_pitch_arm |
| avg_pitch_belt |
| avg_pitch_dumbbell |
| avg_pitch_forearm |
| avg_roll_arm |
| avg_roll_belt |
| avg_roll_dumbbell |
| avg_roll_forearm |
| avg_yaw_arm |
| avg_yaw_belt |
| avg_yaw_dumbbell |
| avg_yaw_forearm |
| stddev_pitch_arm |
| stddev_pitch_belt |
| stddev_pitch_dumbbell |
| stddev_pitch_forearm |
| stddev_roll_arm |
| stddev_roll_belt |
| stddev_roll_dumbbell |
| stddev_roll_forearm |
| stddev_yaw_arm |
| stddev_yaw_belt |
| stddev_yaw_dumbbell |
| stddev_yaw_forearm |
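The 24 predictor names listed above follow a regular pattern: a statistic (`avg` or `stddev`), an angle (`pitch`, `roll`, or `yaw`), and a sensor location (`arm`, `belt`, `dumbbell`, or `forearm`). A short Python sketch can generate the list instead of typing it out; the variable names are illustrative.

```python
from itertools import product

statistics = ("avg", "stddev")
angles = ("pitch", "roll", "yaw")
locations = ("arm", "belt", "dumbbell", "forearm")

# product() varies the last factor fastest, which reproduces the order of the list above.
summary_predictors = [
    f"{stat}_{angle}_{location}"
    for stat, angle, location in product(statistics, angles, locations)
]

print(len(summary_predictors))
print(summary_predictors[:4])
```

The count of 2 x 3 x 4 = 24 names agrees with the "24 predictor" line in the random forest output.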
The calculations are quick because the dataset is much smaller: only key statistics (mean and standard deviation) describing the observations are used. For greater accuracy, however, use the ‘raw’ accelerometer data with the random forest method from the caret package, if computational time is not an issue.