The Weight Lifting Dataset used in this study contains motion metrics of 6 individuals performing a specific weight lifting exercise (barbell lifts) in 5 different ways: 1 correct way according to specification and 4 incorrect.
An initial exploration of the data will narrow down the feature space and test the predictors’ distributions for normality in order to assess the suitability of PCA preprocessing and Discriminant Analysis modeling.
Finally, we’ll predict the 5 different ways this exercise can be done using 2 off-the-shelf classification models: multinomial logistic regression and random forest.
Due to performance constraints, 5-fold cross-validation will be used for both models to tune their parameters and to estimate out-of-sample error; the model with the highest cross-validated accuracy will then be chosen.
The original study used a sliding window that aggregated consecutive sensor readings and generated 96 new features per window from the 52 raw features. Only the raw sensor readings will be used in our modeling; the features generated by the sliding-window technique will be eliminated. Timestamp and window features will be discarded, and to avoid further complexity, username information will also be discarded.
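For reference, a minimal sketch of this selection on the training data is shown below; `train_link` is an assumed variable pointing at the training CSV, and the test-set code at the end of this report mirrors these steps.

train_original <- read.csv(train_link, na.strings = c("NA","#DIV/0!"))  # "#DIV/0!" read as NA; train_link is an assumption
train_set <- train_original[, colSums(is.na(train_original)) == 0]      # drop sliding-window summary features (NA outside window boundaries)
train_set <- train_set[, -(1:7)]                                        # drop row index, username, timestamp and window columns
dim(train_set)                                                          # 52 raw sensor features + the classe outcome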
This initial feature selection leaves us with 53 variables: 52 continuous predictors and 1 categorical outcome. Numerical predictors and a categorical outcome hint at a Discriminant Analysis approach, unless LDA’s assumption about the normality of the predictors’ distributions within each class can’t be met; that is, \(P(x \mid y)\) should be Gaussian for every predictor \(x\) and every outcome class \(y\).
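As an illustration of how this assumption can be checked, the sketch below applies a Shapiro-Wilk test to each predictor within each class; the 5000-row cap and the 0.05 threshold are arbitrary choices for the sketch, not values from the original analysis.

# Illustrative per-class normality check on the 52 predictors (sketch only)
set.seed(1)
normality_pvals <- sapply(names(train_set)[1:52], function(feature) {
  sapply(unique(train_set$classe), function(cl) {
    x <- train_set[train_set$classe == cl, feature]
    x <- sample(x, min(length(x), 5000))                       # shapiro.test() handles at most 5000 values
    tryCatch(shapiro.test(x)$p.value, error = function(e) NA)  # NA if a predictor is constant within a class
  })
})
mean(normality_pvals < 0.05, na.rm = TRUE)                     # share of (class, predictor) pairs rejecting normality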
There’s another consequence of the non-normality we just assessed: PCA preprocessing might miss information carried by higher-order statistics, since PCA only captures variance and does so best when the data are normally distributed. Our 2 models will therefore be applied off-the-shelf with basic centering and scaling of the 52 raw sensor features.
As introduced earlier, both models will be trained and cross-validated on 5 data folds. Different parameter settings will be tuned through this cross-validation, and the final model choice will be based on accuracy:
library(caret)  # provides train() and trainControl()
# 5-fold cross-validation; allowParallel uses a parallel backend if one is registered
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
# Model 1: multinomial logistic regression (tunes the decay regularization parameter)
set.seed(23)
fit1 <- train(classe~., method="multinom", data=train_set, preProcess=c("center","scale"), trControl = fitControl)
# Model 2: random forest (tunes mtry, the number of predictors tried at each split)
set.seed(7)
fit2 <- train(classe~., method="rf", data=train_set, preProcess=c("center","scale"), trControl = fitControl)
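The per-parameter cross-validation summaries reported below can be read straight from the fitted caret objects; fit1$results and fit2$results hold the accuracy and Kappa estimates for each value of decay and mtry that was tried:

# Cross-validated performance for each tuning parameter value (source of the tables below)
fit1$results[, c("decay", "Accuracy", "Kappa", "AccuracySD", "KappaSD")]
fit2$results[, c("mtry", "Accuracy", "Kappa", "AccuracySD", "KappaSD")]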
Multinomial logistic regression (fit1) tuning results:

| decay | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|
| 0e+00 | 0.7320365 | 0.6603409 | 0.0103164 | 0.0130881 |
| 1e-04 | 0.7309152 | 0.6589210 | 0.0103954 | 0.0132286 |
| 1e-01 | 0.7316285 | 0.6597812 | 0.0080569 | 0.0103398 |
Random forest (fit2) tuning results:

| mtry | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|
| 2 | 0.9941903 | 0.9926504 | 0.0012132 | 0.0015351 |
| 27 | 0.9944959 | 0.9930376 | 0.0012302 | 0.0015560 |
| 52 | 0.9888901 | 0.9859447 | 0.0036771 | 0.0046541 |
The accuracy of the Random Forest model stands out: even under performance constraints, the best tuned model reached over 99% cross-validated accuracy. Further visualizations of the results can be found in Appendix 1.
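For a side-by-side look at the two cross-validated fits, caret’s resamples() helper can summarise fold-level accuracy and Kappa for both models; a minimal sketch (the list names are our own):

# Compare fold-level metrics from both fits; since different seeds were used, the folds
# are not paired, so this compares resampling distributions rather than matched folds
cv_results <- resamples(list(multinom = fit1, rf = fit2))
summary(cv_results)  # accuracy and Kappa distributions across the 5 folds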
Applying our final model to the test set involves the following steps:
test_original <- read.csv(test_link, na.strings = c("NA","#DIV/0!")) # "#DIV/0!" included as NA value
test_set <- test_original[,colSums(is.na(train_original))==0] # Keep only columns with no NAs in the training data (drops sliding-window summary features)
test_set <- test_set[,-(1:7)] # Eliminate row index, username, timestamp and window features
predict(fit2, newdata=test_set[,1:52]) # Predict on the 52 raw sensor features
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
## [1] "Javier Prado"