WeightLifting Done Right

Summary.

Human Activity Recognition (HAR) is gaining increasing attention from the pervasive computing research community, especially for the development of context-aware systems. There are many potential applications for HAR, such as elderly monitoring, life log systems for monitoring energy expenditure and supporting weight-loss programs, and digital assistants for weight lifting exercises.

The analysis presented here uses the Weight Lifting Exercise (WLE) dataset available at http://groupware.les.inf.puc-rio.br/har. Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).

The goal was to obtain a prediction algorithm that takes the sensor readings contained within the WLE dataset and correctly predicts the corresponding class.

#Load some required packages
library(ggplot2)
library(caret)

library(randomForest)

Exploratory data analysis

The development of the prediction model uses a data file containing training data (pml-training.csv) and a separate testing (pml-testing.csv) file to be used for assessing the accuracy of the predictive model. Note that the latter is not to be used for model development.

An initial inspection of the data file showed many missing values (NA) as well as cells containing “#DIV/0!”. The following command reads in the training data and treats both as missing:

#Training and testing data are assumed to be in the same directory.
train = read.table("pml-training.csv", header = TRUE, sep = ",",
                   na.strings = c("NA", "#DIV/0!"))
dim(train)

## [1] 19622   160
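
As a minimal sketch of what the na.strings argument does, the following reads a small made-up CSV from an in-memory character vector. The column names and values are illustrative assumptions, not taken from the WLE data. read.csv is used here because its default comment.char is empty, whereas the default comment.char = "#" of read.table would cut an unquoted #DIV/0! short.

```r
# Toy CSV with hypothetical column names; not part of the WLE data.
csvLines <- c("id,kurtosis_demo,roll_demo",
              "1,#DIV/0!,1.41",
              "2,NA,1.42",
              "3,0.5,1.48")

demo <- read.csv(text = csvLines, header = TRUE,
                 na.strings = c("NA", "#DIV/0!"))

# Both "NA" and "#DIV/0!" are converted to missing values.
print(colSums(is.na(demo)))

# With the stray strings gone, the column is read as numeric.
print(class(demo$kurtosis_demo))
```

Without the second entry in na.strings, the presence of "#DIV/0!" would force the whole column to be read as character data.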

The training data has 19,622 observations (rows) and 160 variables (columns). The first column is an integer designating the row number, and the second column contains the user name. Columns three through seven contain values related to the time window during which the exercise took place. Since these values are not sensor readings per se, they are omitted from the data frame. The remaining sensor columns (eight to 159) contain the values recorded by four sensors: one in a belt around the subject’s waist (belt), one on the right arm (arm), one on the right wrist (forearm), and one on the dumbbell itself (dumbbell). The variables represent the raw accelerometer, gyroscope, and magnetometer readings as well as derived values for the Euler angles (roll, pitch, and yaw). Column 160 holds the outcome, classe. For a full discussion of these values see Velloso et al. (2013).

Selecting and cleaning the data

#Select sensor-related variables plus the outcome (column 160, classe)
vars <- grep(pattern = "_belt|_arm|_dumbbell|_forearm", names(train))
df <- train[, c(vars, 160)]
#Remove variables that are almost entirely missing (more than 19200 NAs)
df <- df[, colSums(is.na(df)) <= 19200]
dim(df)

## [1] 19622    53

This leaves 52 features (all numeric) to build a model for predicting the outcome, i.e., the class of the exercise. The latter is represented as a factor in the dataframe.
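
The filtering step can be illustrated on a toy data frame. This is only a sketch: the column names, values, and the 50% threshold are assumptions chosen for the example. It also shows one pitfall worth knowing when dropping columns by index: if no column exceeds the threshold, negative indexing with which() selects no columns at all, while a logical mask keeps everything.

```r
# Toy data frame with hypothetical columns; not the WLE data.
toy <- data.frame(
  roll_belt = c(1.4, 1.5, 1.6, 1.7),
  kurt_belt = c(NA, NA, NA, 0.2),    # mostly missing
  pitch_arm = c(22, 23, 21, 20),
  classe    = factor(c("A", "A", "B", "B"))
)

# Share of missing values per column.
naShare <- colMeans(is.na(toy))
print(naShare)

# A logical mask keeps every column whose NA share is at most 50%.
kept <- toy[, naShare <= 0.5]
print(names(kept))

# Pitfall: no column of `kept` has NAs, so which() returns integer(0),
# and negative indexing with integer(0) selects zero columns.
noneFlagged <- which(colSums(is.na(kept)) > 0)
dropped <- kept[, -noneFlagged]
print(ncol(dropped))

# The logical mask keeps all three columns in the same situation.
safe <- kept[, colSums(is.na(kept)) == 0]
print(ncol(safe))
```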

First, we split our dataset into training (80% of the total number of observations) and testing (remaining 20% of observations) subsets, based on the outcome, after we set a seed in order to ensure reproducibility.

set.seed(9779)
inTrain <- createDataPartition(y=df$classe, p=0.80, list=FALSE)
training <- df[inTrain,]
testing <- df[-inTrain,]
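
To illustrate what splitting "based on the outcome" means, here is a base-R sketch of a stratified 80/20 split on a made-up factor. It is a stand-in for the idea of sampling within each class, not caret's internal code, and the class sizes are assumptions chosen so the proportions come out exactly.

```r
# Toy outcome with unbalanced classes; hypothetical data.
set.seed(9779)
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each class separately.
idxByClass <- split(seq_along(classe), classe)
inTrainToy <- unlist(lapply(idxByClass, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

# Class proportions match in the full data and in both subsets.
print(prop.table(table(classe)))
print(prop.table(table(classe[inTrainToy])))
print(prop.table(table(classe[-inTrainToy])))
```

A simple random split would only preserve these proportions approximately, which matters more when some classes are rare.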

Model Development

With 52 features and a five-level categorical outcome, linear regression is not a natural fit: the outcome is a class label rather than a continuous value, the resulting model would not be easily interpretable, and an exhaustive search for the best subset of predictors may have to enumerate 2^52 possible models. Among the non-parametric methods available, clustering via k-means with five initial cluster centers representing the five class outcomes is one option, although it is unsupervised and does not use the class labels. Here we opted instead for a random forest, fitted with the randomForest function from the randomForest package. Each tree is grown on a bootstrap sample of the training data, and the observations left out of that sample yield an out-of-bag (OOB) error estimate, so a separate cross-validation step is not strictly necessary. The default number of trees, 500, is used to train the model.

rfModel <- randomForest(classe~., data=training, ntree=500)
rfModel

## 
## Call:
##  randomForest(formula = classe ~ ., data = training, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.41%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4460    3    1    0    0 0.0008960573
## B   11 3024    3    0    0 0.0046082949
## C    0   14 2723    1    0 0.0054784514
## D    0    0   23 2548    2 0.0097162845
## E    0    0    1    5 2880 0.0020790021
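
The summary figures in this output can be checked by hand. The sketch below copies the printed OOB confusion matrix into a plain matrix (rows are the actual classes, columns the predicted classes) and recomputes the overall OOB error, the per-class errors, and the number of variables tried at each split. The last of these relies on the randomForest default for classification, floor(sqrt(p)) for p predictors.

```r
# OOB confusion matrix copied from the printed model summary.
oob <- matrix(c(4460,    3,    1,    0,    0,
                  11, 3024,    3,    0,    0,
                   0,   14, 2723,    1,    0,
                   0,    0,   23, 2548,    2,
                   0,    0,    1,    5, 2880),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = LETTERS[1:5],
                              predicted = LETTERS[1:5]))

# Overall OOB error: share of off-diagonal counts (64 of 15699).
oobError <- 1 - sum(diag(oob)) / sum(oob)
print(round(100 * oobError, 2))   # 0.41 (percent)

# Per-class error: off-diagonal share within each row.
classError <- 1 - diag(oob) / rowSums(oob)
print(round(classError, 4))

# Default mtry for classification with p = 52 predictors.
mtryDefault <- floor(sqrt(52))
print(mtryDefault)                # 7
```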

The results show that the model has a low out-of-bag error rate of approximately 0.41%, while the confusion matrix shows that the misclassification rate for every class is below 1%, as indicated by the class.error column. The resulting random forest model is subsequently used to predict the class of the observations left out of the training set.

predictions <- predict(rfModel, newdata=testing)
confusionMatrix(predictions, testing$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1115    4    0    0    0
##          B    0  754    2    0    0
##          C    0    1  682    1    2
##          D    0    0    0  642    0
##          E    1    0    0    0  719
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9972         
##                  95% CI : (0.995, 0.9986)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9965         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9934   0.9971   0.9984   0.9972
## Specificity            0.9986   0.9994   0.9988   1.0000   0.9997
## Pos Pred Value         0.9964   0.9974   0.9942   1.0000   0.9986
## Neg Pred Value         0.9996   0.9984   0.9994   0.9997   0.9994
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2842   0.1922   0.1738   0.1637   0.1833
## Detection Prevalence   0.2852   0.1927   0.1749   0.1637   0.1835
## Balanced Accuracy      0.9988   0.9964   0.9979   0.9992   0.9985
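
For readers who want to see where these statistics come from, the following sketch recomputes the accuracy, sensitivity, and specificity from the confusion matrix printed above, using base R only. The matrix is copied by hand from the output (rows are predictions, columns are reference classes); the variable names are chosen for this example.

```r
# Test-set confusion matrix copied from the printed output.
cm <- matrix(c(1115,   4,   0,   0,   0,
                  0, 754,   2,   0,   0,
                  0,   1, 682,   1,   2,
                  0,   0,   0, 642,   0,
                  1,   0,   0,   0, 719),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = LETTERS[1:5],
                             reference = LETTERS[1:5]))

n  <- sum(cm)
tp <- diag(cm)
fp <- rowSums(cm) - tp   # predicted as the class, actually another
fn <- colSums(cm) - tp   # actually the class, predicted as another
tn <- n - tp - fp - fn

accuracy    <- sum(tp) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(specificity, 4))
```

Each class is treated one-versus-rest: sensitivity is the share of a class that was recognized, and specificity is the share of all other observations that were not assigned to it.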

The sensitivity and the specificity of the resulting model exceed 0.99 for all classes. The accuracy of 0.9972 and the narrow 95% confidence interval (0.995 to 0.9986) likewise suggest that the random forest model performs well on the held-out data. In order to get some idea of how the model performed visually, we can plot the original class designations on the x-axis and the predictions made by the model on the y-axis. Note that the classes A through E, which are stored as a factor, must first be converted to the numeric values 1 through 5.

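#Convert the factor levels A through E to the integers 1 through 5
x <- as.numeric(testing$classe)
y <- as.numeric(predictions)
#Jitter both axes so overlapping points spread out; colour by predicted class
g <- ggplot(data.frame(x = x, y = y), aes(x = jitter(x), y = jitter(y))) +
  geom_point(aes(colour = factor(y))) +
  labs(title = "Prediction of Weightlifting Classes by Random Forests",
       colour = "Predicted class") +
  xlab("Actual Class Designations (A=1, B=2, etc.)") +
  ylab("Predicted Class Designations (A=1, B=2, etc.)")
g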

Note that the color of each data point is based on the predicted class. Within each vertical band of points (one actual class), any point whose color differs from the dominant color of the band has therefore been incorrectly classified.

Finally, using the function varImpPlot we can get an idea of the variable importance as measured by the random forest model.

varImpPlot(rfModel)
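
The numbers behind such a plot can also be inspected directly with the importance function from the randomForest package. The sketch below uses a stand-in model fitted to the built-in iris data, which is an assumption made so the example runs on its own; in this report the same calls would be applied to rfModel.

```r
library(randomForest)

# Stand-in model on the built-in iris data (not the WLE model).
set.seed(1)
irisFit <- randomForest(Species ~ ., data = iris, ntree = 100)

# For a classification forest fitted with default settings, importance()
# returns a one-column matrix named MeanDecreaseGini.
imp <- importance(irisFit)

# Sort the predictors from most to least important.
ranked <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ,
              drop = FALSE]
print(ranked)
```

Larger values of MeanDecreaseGini indicate predictors that contributed more to reducing node impurity across the trees.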

References

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. Accessed July 24, 2016.