Introduction

This article focuses on recognising when a particular weight exercise, the single-arm curl, is performed correctly. It was written to complete a project for Practical Machine Learning, and the idea and data for the project come from the Human Activity Recognition website. Data was gathered from sensors attached to the dumbbell, forearm, bicep area and waist of each test candidate. The candidates were then asked, under the guidance of a professional, to perform the single-arm curl using five specific methods. The first method was considered correct and the other four represented common mistakes associated with this type of exercise. The goal of this article is therefore to build and train a model on this data that identifies when the exercise is being done correctly or, more precisely, classifies the exercise performed into one of the five predefined methods.

Data Cleaning

The training and required test data sets were provided by the Practical Machine Learning organisers. Each set has 160 features, including the outcome variable called classe, which denotes which of the five methods a candidate was using at the time of recording. The number of features needs to be reduced to make the problem more tractable. Far fewer than the 160 features have usable values in the test data. Since the test data set will be used to predict the exercise method, the training data set is reduced to the same features. In short, the summary-statistic features in the training data have no usable values in the test data set, so they cannot be used for prediction. Removing those features reduces the overall number from 160 to 60. Of the remaining 60, the first 7 are candidate and time reference variables, which are not needed. That leaves 53 features, including the classe variable, which is separated out as the outcome to predict.

Because a final test data set is already available, the training data set is split 60/40 into a sub-training set and a validation set for the purposes of validation.
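The cleaning steps above can be sketched as follows. This is a minimal illustration in standard-library Python, not the code used for the project (the analysis itself relied on R and caret). The toy rows, the helper names `usable_columns` and `split_rows`, and the plain random split are all assumptions made for the example; the real split may well have been stratified by classe.

```python
import random

def usable_columns(rows, columns):
    """Return the columns that hold a usable value in every row."""
    missing = (None, "", "NA")
    return [c for c in columns if all(r.get(c) not in missing for r in rows)]

def split_rows(rows, train_fraction=0.6, seed=42):
    """Shuffle rows reproducibly and split them into two lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-ins for the real files (hypothetical values).
test_rows = [
    {"user_name": "p1", "roll_belt": 1.4, "avg_roll_belt": "NA"},
    {"user_name": "p2", "roll_belt": 1.5, "avg_roll_belt": "NA"},
]
train_rows = [
    {"user_name": "p1", "roll_belt": 1.0 + i, "avg_roll_belt": "NA",
     "classe": "AB"[i % 2]}
    for i in range(10)
]

# Candidate and time reference columns (7 of them in the article).
bookkeeping = {"user_name"}

# Keep only features that are usable in the test data and are not bookkeeping.
all_columns = ["user_name", "roll_belt", "avg_roll_belt"]
keep = [c for c in usable_columns(test_rows, all_columns) if c not in bookkeeping]

# Reduce the training data to those features plus the outcome, then split 60/40.
reduced = [{c: r[c] for c in keep + ["classe"]} for r in train_rows]
sub_train, validation = split_rows(reduced)
print(keep, len(sub_train), len(validation))
```

Deriving the feature list from the test data, rather than from the training data, mirrors the reasoning in the text: a feature that is empty at prediction time cannot help the model.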

Exploratory Analysis

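The remaining 53 variables are made up of the outcome variable classe and 52 variables representing 13 measurements from each of the four sensors. Judging by the variable names, the 13 measurements per sensor are roll, pitch, yaw and total acceleration, plus the x, y and z components of the gyroscope, accelerometer and magnetometer readings.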

Skimming over the data, it appears that some features may have more or less constant values, implying that their variance may be zero or near zero. If so, they could be excluded from the data set. The following gives the zero and near-zero variance metrics for the first ten variables.

##                  freqRatio percentUnique zeroVar   nzv
## roll_belt         1.143667     8.6786685   FALSE FALSE
## pitch_belt        1.136364    13.4850543   FALSE FALSE
## yaw_belt          1.071895    14.6229620   FALSE FALSE
## total_accel_belt  1.063466     0.2292799   FALSE FALSE
## gyros_belt_x      1.088670     1.0190217   FALSE FALSE
## gyros_belt_y      1.174414     0.5689538   FALSE FALSE
## gyros_belt_z      1.081232     1.3756793   FALSE FALSE
## accel_belt_x      1.010060     1.2907609   FALSE FALSE
## accel_belt_y      1.091684     1.1294158   FALSE FALSE
## accel_belt_z      1.151575     2.4286685   FALSE FALSE

None of the first ten variables qualifies for exclusion, and counting the zeroVar and nzv flags across all variables gives 0 and 0 respectively, confirming that no variables should be removed.
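To make the freqRatio and percentUnique columns concrete, here is a rough standard-library Python sketch of how such metrics can be computed for one variable. It is an illustration rather than the code behind the table above. The function name `near_zero_var` is hypothetical, and the cut-offs (a frequency ratio above 95/5 together with at most 10 percent unique values) are assumed to match the defaults of caret's nearZeroVar.

```python
from collections import Counter

def near_zero_var(values, freq_cut=95 / 5, unique_cut=10):
    """Compute zero and near-zero variance metrics for one variable."""
    counts = Counter(values).most_common()
    top = counts[0][1]
    second = counts[1][1] if len(counts) > 1 else 0
    # Ratio of the most common value's count to the second most common one.
    freq_ratio = top / second if second else 0.0
    # Distinct values as a percentage of the number of samples.
    percent_unique = 100 * len(counts) / len(values)
    zero_var = len(counts) == 1
    nzv = zero_var or (freq_ratio > freq_cut and percent_unique <= unique_cut)
    return {
        "freqRatio": freq_ratio,
        "percentUnique": percent_unique,
        "zeroVar": zero_var,
        "nzv": nzv,
    }

varied = [1, 1, 2, 2, 3, 4, 5, 6, 7, 8]   # well-spread values
mostly_constant = [0] * 99 + [1]          # one value dominates
constant = [5] * 20                       # a single value only

for name, column in [("varied", varied),
                     ("mostly_constant", mostly_constant),
                     ("constant", constant)]:
    print(name, near_zero_var(column))
```

A variable such as total_accel_belt has a low percentUnique in the table, but its freqRatio is close to 1, so under these assumed cut-offs it would not be flagged; both conditions have to hold.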

It is important to note that the gyroscope and magnetometer measure orientation and change in orientation, which in all likelihood is already captured by the first three variables: Roll, Pitch and Yaw. Further, the direction of acceleration may be less important than its magnitude. If the first three variables capture the orientation, then adding the total acceleration would likely corroborate the type of method used, so the fourth variable, Total Acceleration, should be sufficient. It is therefore postulated that only the first four variables per sensor are required to successfully capture the characteristics of each method.

Machine Learning

The training data is filtered down to the following variables for each of the four sensors: roll, pitch, yaw and total acceleration. This leaves 16 predictors.

The resulting data is then used to train the model to predict the class of exercise type given the data in the validation set.
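The filtering step can be sketched by building the variable names from sensor and metric labels. This is a short Python illustration, not the project's code. The belt names follow the variables shown in the exploratory output; the names for the other three sensors (`arm`, `dumbbell`, `forearm`) are assumed to follow the same pattern, and `select_features` is a hypothetical helper.

```python
# Sensor suffixes; only "belt" appears in the output shown in this article,
# the other three are assumed to follow the same naming pattern.
SENSORS = ["belt", "arm", "dumbbell", "forearm"]

# The first four variables postulated to be sufficient for each sensor.
METRICS = ["roll", "pitch", "yaw", "total_accel"]

selected = [f"{metric}_{sensor}" for sensor in SENSORS for metric in METRICS]

def select_features(row, names=selected):
    """Pick the selected predictors out of one observation (a dict)."""
    return [row[name] for name in names]

print(len(selected))
print(selected[:4])
```

Keeping the selection as a list of names means the same list can be applied unchanged to the sub-training, validation and test sets.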

The algorithm used is boosting with trees, via the gbm method within the train function from the caret package, applied to the sub-training set. This algorithm was selected because it combines classification trees with boosting: many small trees are fitted in sequence, each one concentrating on the observations the previous trees predicted poorly, and their weighted combination improves the prediction of the outcome. The confusion matrix and accuracy are calculated first for the sub-training set and then for the validation set. The out-of-sample error is generally expected to be slightly higher than the in-sample error, so the accuracy for the validation set should be lower.
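The idea behind boosting with trees can be shown with a deliberately simplified sketch. The Python below is not what gbm does for a five-class outcome; it is an assumed toy version that boosts one-split trees (stumps) on a single feature with a squared-error loss, so that the sequential "fit the remaining error" step is easy to see. The names `fit_stump` and `boost`, the toy data and the parameter values are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=50, shrinkage=0.1):
    """Fit stumps in sequence, each one to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean outcome
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take only a small step towards each new stump's correction.
        preds = [p + shrinkage * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + shrinkage * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]        # toy outcome: 1 when x > 4

one_tree = boost(xs, ys, n_trees=1)
model = boost(xs, ys, n_trees=50)
print(one_tree(2), model(2))
print(one_tree(7), model(7))
```

With a single stump the predictions barely move away from the mean; after 50 small steps they sit close to the true values, which is the behaviour the shrinkage and tree-count parameters trade off.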

Results

The following is the confusion matrix for the model predicting the sub-training set.

     A    B    C    D    E
A 3265   67    5   15    5
B   49 2083   93   10   24
C   20   95 1909   41   32
D   12   24   45 1858   33
E    2   10    2    6 2071

The following is the confusion matrix for the model predicting the validation data set.

     A    B    C    D    E
A 2150   64    3    7    6
B   50 1353   78   11   32
C   11   73 1249   27   24
D   14   21   36 1236   18
E    7    7    2    5 1362

The accuracy for the model on the training and validation data set:

Data Set Used     Accuracy
Training Set      0.9498981
Validation Set    0.9367831
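These accuracy figures can be re-derived from the two confusion matrices: accuracy is the sum of the diagonal (correct classifications) divided by the total count. The short Python check below copies the matrices from this article; the function name `accuracy` is just a label chosen for the example.

```python
def accuracy(matrix):
    """Share of observations on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Confusion matrix for the sub-training set (classes A to E).
train_cm = [
    [3265, 67, 5, 15, 5],
    [49, 2083, 93, 10, 24],
    [20, 95, 1909, 41, 32],
    [12, 24, 45, 1858, 33],
    [2, 10, 2, 6, 2071],
]

# Confusion matrix for the validation set (classes A to E).
valid_cm = [
    [2150, 64, 3, 7, 6],
    [50, 1353, 78, 11, 32],
    [11, 73, 1249, 27, 24],
    [14, 21, 36, 1236, 18],
    [7, 7, 2, 5, 1362],
]

train_acc = accuracy(train_cm)   # 11186 / 11776
valid_acc = accuracy(valid_cm)   # 7350 / 7846
print(round(train_acc, 7), round(valid_acc, 7))
print(round(1 - train_acc, 4), round(1 - valid_acc, 4))  # error rates
```

The first line of output reproduces the two accuracies in the table, and the second gives the corresponding in-sample and out-of-sample error rates of roughly 5.0% and 6.3%.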

As expected, the out-of-sample error is higher, so the accuracy is lower on the validation set (about 93.7%, against about 95.0% on the sub-training set). This accuracy was considered good enough, and the model was used to submit the test data predictions successfully.