Introduction

The intent of this analysis is to use machine learning techniques to identify specific physical exercise activities from data acquired by wearable biometric sensors. In particular, the goal is to distinguish correct from incorrect methods of performing a bicep curl with a dumbbell weight.

The data for this analysis comes from the Human Activity Recognition project. Details about the project and the data available can be found on the project web site.

Multiple models were tested to see which might yield the highest accuracy on a blind test data set. The final model selected has surprisingly high accuracy, and in fact achieved a perfect score categorizing correct/incorrect exercises in the test data.

Exploratory Data Analysis and Preparation

This analysis was based on the Weight Lifting Exercise (WLE) data set available at the web site noted previously. The goal of the WLE data set is to provide data with which to train and test models for predicting correct and incorrect techniques for doing dumbbell bicep curl lifts. The raw data is collected from biometric devices worn by a test subject while doing the exercises. The devices provide real-time accelerometer, gyroscope, and magnetic field measurements.

The data consists of a time sequence of raw measurement data interspersed with various summary statistics (maximums, minimums, averages, etc.) computed at regular intervals (“windows”) in time. For this analysis, the summarized data were not used because they comprised too small a percentage of the overall data set (2.07%). Ignoring the summary data, only a small fraction of the remaining measurements is NA (approx. 0.112%). These values were set to zero for the model fitting. This proved very effective, so no other means of imputing missing values was tried.
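This cleaning step is simple enough to sketch. The snippet below shows one way the summary columns could be dropped and the remaining NA values set to zero; the file name, column-name patterns, and object names are assumptions based on the published WLE data layout, not code from the original analysis.

```r
# Read the raw WLE training file (file name assumed), treating the common
# placeholder strings in the raw file as NA.
raw <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))

# Drop the per-window summary-statistic columns (kurtosis_*, skewness_*, max_*,
# min_*, amplitude_*, var_*, avg_*, stddev_*) and the row-id/timestamp/window
# bookkeeping columns, keeping the raw sensor measurements and the response.
summary_cols <- grep("^(kurtosis|skewness|max|min|amplitude|var|avg|stddev)_", names(raw))
id_cols      <- which(names(raw) %in% c("X", "user_name", "raw_timestamp_part_1",
                                        "raw_timestamp_part_2", "cvtd_timestamp",
                                        "new_window", "num_window"))
wle <- raw[, -c(summary_cols, id_cols)]

# Set the few remaining NA measurements (roughly 0.1%) to zero.
wle[is.na(wle)] <- 0
```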

Each observation in the data contains an indication of the bicep curl technique being demonstrated/measured at that point in the time sequence. Five curl exercise techniques were singled out in the research: one “correct” technique and four common “incorrect” techniques. The particular technique being used is the response variable to be classified/predicted by the model.

Model Training

The WLE training data set provided was randomly divided into a working training set (hereafter referred to as the training set) and a validation set. A 70/30 split was used.
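A minimal sketch of that split using caret's createDataPartition (the seed and object names are illustrative):

```r
library(caret)

set.seed(1234)                                            # illustrative seed
in_train   <- createDataPartition(wle$classe, p = 0.70, list = FALSE)
training   <- wle[in_train, ]
validation <- wle[-in_train, ]
```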

A random forest method with a maximum of 100 trees and 10-fold cross-validation was used to fit three separate models to the training set. The random forest method was chosen because it is known to produce highly accurate models, and because it is the method used by the authors of the original paper describing the WLE research project (refer to the WLE web site noted previously). Originally, 500 trees were used for model fitting, but further review indicated that 100 trees was sufficient, with the additional benefit of reducing processing time for model fitting (see chart below).
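The fitting call common to all three models might look like the sketch below; caret's "rf" method passes ntree and importance through to randomForest, and the object names are assumptions.

```r
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

fit <- train(classe ~ ., data = training,
             method     = "rf",    # randomForest under the hood
             ntree      = 100,     # cap the forest at 100 trees
             importance = TRUE,    # keep per-class variable importance
             trControl  = ctrl)
```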

Three models were defined based upon a visual review of the raw data. Two types of data were apparent: 1) raw measurements from the accelerometers, gyroscopes, and magnetic field sensors in the biometric devices; and 2) higher-level data for each device in the form of roll, pitch, and yaw positions along with a measure of total acceleration. Model #1 fit only the roll, pitch, yaw, and total acceleration measurements as predictors for curl technique. Model #2 fit only the accelerometer, gyroscope, and magnetic measurements as predictors. Model #3 fit all of the measurements together.
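One way to carve out the three predictor sets is by column-name pattern, as sketched below. The regular expressions are assumptions based on the WLE naming convention (roll_belt, gyros_arm_x, magnet_dumbbell_z, and so on); each subset is passed to the same train() call shown above.

```r
rpy_cols    <- grep("^(roll|pitch|yaw|total_accel)_", names(training), value = TRUE)  # model #1
sensor_cols <- grep("^(accel|gyros|magnet)_",         names(training), value = TRUE)  # model #2
all_cols    <- c(rpy_cols, sensor_cols)                                               # model #3

fit1 <- train(x = training[, rpy_cols], y = factor(training$classe),
              method = "rf", ntree = 100, importance = TRUE, trControl = ctrl)
# fit2 and fit3 are fit the same way using sensor_cols and all_cols.
```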

Each of the models performed extremely well in terms of classification accuracy on the training set. All were above 0.97 in accuracy. The best model proved to be model #3, with an accuracy of 0.993. Second best was model #1, using just the roll, pitch, yaw, and total acceleration measurements. Such high accuracy rates raised concerns that the models might be overfitting the training data, but that proved not to be the case during the model validation step.

The information below shows the model training results and a chart of the top 10 classifying variables with their importance for selecting each exercise technique (labeled A through E).
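The printed summary and the importance chart can be produced directly from the fitted caret object, roughly as follows (fit3 is the assumed name of model #3):

```r
fit3$finalModel               # underlying randomForest fit (summary printed below)
plot(varImp(fit3), top = 10)  # top 10 predictors, importance per class A-E
```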

## 
## Call:
##  randomForest(x = x, y = y, ntree = 100, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.79%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3894    7    3    1    1   0.0030722
## B   21 2625   11    1    0   0.0124153
## C    0   12 2377    7    0   0.0079299
## D    0    0   28 2221    3   0.0137655
## E    0    1    5    7 2512   0.0051485

Model Validation

Model #3 was used to predict the exercise techniques of the observations in the validation data set in order to validate the effectiveness of the model outside the training set. A confusion matrix of the results is shown below.
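A sketch of that step, assuming the validation data frame and the fitted model fit3 from above:

```r
pred_val <- predict(fit3, newdata = validation)
confusionMatrix(pred_val, factor(validation$classe))
```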

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1669   10    0    0    0
##          B    4 1128    5    4    0
##          C    0    1 1018   16    2
##          D    0    0    3  944    2
##          E    1    0    0    0 1078
## 
## Overall Statistics
##                                         
##                Accuracy : 0.992         
##                  95% CI : (0.989, 0.994)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.99          
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.997    0.990    0.992    0.979    0.996
## Specificity             0.998    0.997    0.996    0.999    1.000
## Pos Pred Value          0.994    0.989    0.982    0.995    0.999
## Neg Pred Value          0.999    0.998    0.998    0.996    0.999
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.192    0.173    0.160    0.183
## Detection Prevalence    0.285    0.194    0.176    0.161    0.183
## Balanced Accuracy       0.997    0.994    0.994    0.989    0.998

The results proved model #3 to be equally accurate at classifying the validation set. The next step was to try the model on the real test data set.

Testing Model Accuracy With Test Data Set

The test data set consisted of 20 records selected from the WLE data set. This data set contained only the predictor measurements and did not include the variable identifying the true exercise technique.

After the data was loaded and pre-processed in the same manner as the training/validation data, model #3 was used to classify the exercise technique for each observation. The model’s classification results were compared blind against the correct classifications (which were neither known nor accessible ahead of time). The result was a perfect 20/20 match.
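This final step might look like the sketch below; the test file name and object names are assumptions, and the same column cleanup is applied before predicting.

```r
test_raw <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))

# Keep the same predictor columns used for training and zero out any NAs.
testing <- test_raw[, intersect(names(test_raw), names(training))]
testing[is.na(testing)] <- 0

predict(fit3, newdata = testing)   # predicted technique (A-E) for the 20 records
```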

Conclusion

The goal of this analysis was to create an accurate model for identifying correct and incorrect dumbbell bicep curl exercise techniques using the WLE data set. This was achieved with greater than 99% accuracy using a random forest method to create the model. Three separate models were compared. The best one was used to classify correct/incorrect exercise techniques in a blind test set with 100% accuracy.