Executive Summary

This paper presents a random forest model that predicts the curling quality of selected weightlifters with approximately 98% accuracy, based on data from smartphone-like sensors mounted on the belt, arm, forearm, and dumbbell. However, the results are found to be sensitive to the particular weightlifters in the study; this could possibly be addressed by using aggregate sensor data rather than the raw data used in this project.

Introduction

The data for this project come from a study of six weightlifters who wore smartphone-like sensors on the belt, upper arm, forearm, and on a dumbbell. The weightlifters performed curls repeatedly in the presence of a trainer while the sensors were sampled at 45 Hz. Data were provided on sensor orientation, as well as from the accelerometer, gyroscope, and magnetometer. Each weightlifter was asked to perform curls in each of five curling classes: correct curls in class A, and curls performed with a specific common error in each of the remaining classes. The aim of the project is to predict automatically from raw sensor measurements whether a weightlifter is performing curls correctly, or whether the curls belong to one of the given error classes. The data are described more completely here:

Exploratory Analysis and Data Cleaning

By inspection, the input data are divided into two classes: raw data (new window = no) and summary data (new window = yes). The dataset has a time-windowed design. The raw-data features consist of sensor readings, while the summary rows additionally include aggregates over the previous time window, such as maximum roll and average magnetometer x. The project requires predictions to be made from raw data, so the aggregate columns are removed during the cleaning step.
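As a rough sketch of that cleaning step (the file name and the aggregate-column prefixes are assumptions based on the dataset description):

pml_data <- read.csv("pml_training.csv", na.strings = c("NA", ""))

# Keep only raw rows; summary rows are flagged with new_window == "yes"
raw_data <- pml_data[pml_data$new_window == "no", ]

# Aggregate columns share prefixes such as max_, avg_, var_, etc.
agg_prefixes <- "^(max|min|avg|var|stddev|amplitude|kurtosis|skewness)_"
raw_data <- raw_data[, !grepl(agg_prefixes, names(raw_data))]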

Given the vector nature of most of the data, 3D plots were examined using the plot3D library; the rgl and evd libraries were also used to rotate the 3D plots in real time during exploration. Figures 1a and 1b show dumbbell accelerometer data in 3D, with an expected overall kinked shape corresponding to the acceleration profile of performing arm curls at different points in the stroke. Other variables show interesting movements in the orientation, gyroscope, and magnetometer readings.

Figure 1: Dumbbell accelerometer readings by (a) curling class and (b) user.
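A plot along the lines of Figure 1a can be produced with scatter3D from the plot3D package; this is a sketch rather than the exact figure code, and assumes the cleaned data frame and dumbbell accelerometer column names used above:

library(plot3D)

# Dumbbell accelerometer readings, coloured by curling class (cf. Figure 1a)
with(raw_data, scatter3D(accel_dumbbell_x, accel_dumbbell_y, accel_dumbbell_z,
                         colvar = as.integer(factor(classe)),
                         col = rainbow(5), pch = 16, cex = 0.3))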

If the points are colored by the curling class (variable name classe, Figure 1a), we can see some modest separation between the curling classes. However, if the points are colored by the particular weightlifter (variable name user_name, Figure 1b), the separation is much greater. This is also true of the dumbbell orientation, shown in Figure 2.

Figure 2: Dumbbell orientation readings by (a) curling class and (b) user.

This pattern repeats itself for almost every variable in the dataset. To predict the curling class, a machine learning algorithm must therefore exploit many small correlations that are not easily visible in exploratory analysis. Care should also be taken in interpreting the results because of the large and visible clustering by user; we return to this point in the results and analysis section.

The clean dataset excludes the timestamp columns, the window numbering, and the aggregate columns as described above. The training dataset is a randomly drawn set comprising 60% of the data with 11532 rows, and the testing set contains the other 40% with 7684 rows. The index and user columns are left in for ease of plotting along the way, but are removed when models are being calculated.
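The split itself can be sketched with caret's createDataPartition, matching the 60/40 proportion described above (the seed value and object names are assumptions):

library(caret)

set.seed(1234)   # an assumed seed; the report only states that the seed was fixed
in_train <- createDataPartition(clean_data$classe, p = 0.6, list = FALSE)
train_data   <- clean_data[in_train, ]    # 60% training split
testval_data <- clean_data[-in_train, ]   # 40% testing split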

Results and Analysis

A random forest classifier is used because of its excellent performance characteristics, with the default of 500 trees. The training machine has 12 Intel Core i7 processors at 3.2 GHz with 12 GB of memory. To manage the running time, the doParallel library was used and 10 cores were registered. The model finishes in approximately 8 minutes on this machine, and the results are shown in Table 1 below. The model performs well on the training set, with in-sample accuracy and kappa both above 98%.

library(doParallel)
registerDoParallel(10)    # register 10 of the 12 available cores
rf_model <- train(classe ~ ., method = "rf", data = train_data[, -c(1, 2)])  # drop index and user columns
rf_model
## Random Forest 
## 
## 11532 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11532, 11532, 11532, 11532, 11532, 11532, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9848870  0.9808739  0.002224193  0.002817366
##   27    0.9851428  0.9811991  0.002451804  0.003104132
##   52    0.9723812  0.9650483  0.004198998  0.005318467
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
rf_model$finalModel$confusion
##      A    B    C    D    E class.error
## A 3274    7    1    0    2 0.003045067
## B   19 2206    8    0    0 0.012091357
## C    0   13 1978    7    0 0.010010010
## D    0    4   26 1868    2 0.016842105
## E    0    3    3    6 2105 0.005668399

Table 1: Random forest model and in-sample confusion matrix.

The training error in each curling class is plotted in Figure 3 versus the number of trees. The error rate stabilizes after approximately 50 trees. Running time can therefore be reduced even further by reducing the number of trees in the random forest to 50 from the default 500.

Figure 3: Training error versus number of trees included in the model.
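The error curves in Figure 3 come directly from the fitted randomForest object, and the tree count can be capped through caret's pass-through arguments; a minimal sketch (the rf_model_50 name is an assumption):

# Out-of-bag and per-class error versus the number of trees (cf. Figure 3)
plot(rf_model$finalModel, main = "Error vs. number of trees")

# Refit with a smaller forest; ntree is passed through train() to randomForest()
rf_model_50 <- train(classe ~ ., method = "rf", ntree = 50,
                     data = train_data[, -c(1, 2)])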

The prediction accuracy on the test set is roughly the same as on the training set, with better than 98% out-of-sample accuracy. Out-of-sample accuracy is normally expected to be somewhat lower than in-sample accuracy, and indeed it was before the random seed was fixed. The confusion matrix is shown in Table 2.

pred_rf_model <- predict(rf_model, newdata=testval_data)
confusionMatrix(pred_rf_model, testval_data$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2185   11    0    2    0
##          B    0 1467   11    2    2
##          C    2    7 1340   19    3
##          D    0    0    3 1223    2
##          E    0    0    0    1 1404
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9915          
##                  95% CI : (0.9892, 0.9935)
##     No Information Rate : 0.2846          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9893          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9879   0.9897   0.9808   0.9950
## Specificity            0.9976   0.9976   0.9951   0.9992   0.9998
## Pos Pred Value         0.9941   0.9899   0.9774   0.9959   0.9993
## Neg Pred Value         0.9996   0.9971   0.9978   0.9963   0.9989
## Prevalence             0.2846   0.1933   0.1762   0.1623   0.1836
## Detection Rate         0.2844   0.1909   0.1744   0.1592   0.1827
## Detection Prevalence   0.2860   0.1929   0.1784   0.1598   0.1828
## Balanced Accuracy      0.9984   0.9927   0.9924   0.9900   0.9974

Table 2: Confusion matrix generated by random forest predictions on the test dataset.

Principal Component Analysis

We also pre-processed the data with principal component analysis and ran two additional models, with 20 and 40 principal components respectively.

# PCA with 20 components; drop the index, user, and classe columns before pre-processing
pca20 <- preProcess(train_data[,-c(1,2,length(train_data))], 
                    method="pca", pcaComp = 20)
train_pca20_data <- predict(pca20, train_data[,-c(1,2,length(train_data))])
rf_pca20_model <- train(train_data$classe~., method="rf", data=train_pca20_data)

# The same with 40 components
pca40 <- preProcess(train_data[,-c(1,2,length(train_data))], 
                    method="pca", pcaComp = 40)
train_pca40_data <- predict(pca40, train_data[,-c(1,2,length(train_data))])
rf_pca40_model <- train(train_data$classe~., method="rf", data=train_pca40_data)

The accuracy of these models was 95% for the 20-component analysis and 96% for the 40-component analysis. This suggests that the number of principal components would have to be nearly as large as the number of original input variables to recover the 98% accuracy obtained without PCA, so PCA is not adding much value to this analysis.
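One way to probe this is to look at how quickly the leading components capture the variance of the 52 predictors; a quick diagnostic along these lines (a sketch, not code from the original analysis):

# Cumulative variance explained by the principal components
pca_fit <- prcomp(train_data[,-c(1,2,length(train_data))], center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)
which(cum_var >= 0.95)[1]   # number of components needed to reach 95% of the variance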

Cross Validation

The random forest model was re-run with 6-fold cross validation. The cross validation shows better than 98% accuracy with a standard deviation of 0.0024279.
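A minimal sketch of that re-run using caret's trainControl (the fold count follows the text; object names are assumptions):

# Refit the same model under 6-fold cross validation
cv_ctrl <- trainControl(method = "cv", number = 6)
rf_cv_model <- train(classe ~ ., method = "rf",
                     data = train_data[, -c(1, 2)], trControl = cv_ctrl)
rf_cv_model$resample   # per-fold accuracy and kappa (Table 3)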

##    Accuracy     Kappa Resample
## 1 0.9927121 0.9907792    Fold2
## 2 0.9859448 0.9822159    Fold1
## 3 0.9880395 0.9848682    Fold3
## 4 0.9911596 0.9888132    Fold6
## 5 0.9890682 0.9861687    Fold5
## 6 0.9880395 0.9848753    Fold4

Table 3: Per-fold accuracy under 6-fold cross validation.

Finally, the model fits were re-evaluated using a custom-built “leave one user out” cross validation. For this analysis, data from one of the six weightlifters was left out of the training dataset, the random forest model was trained on the remaining five lifters, and its accuracy in predicting the held-out lifter's data was checked.

# Leave-one-user-out: for each lifter, train on the other five and predict the held-out lifter
user_names <- unique(train_data$user_name)
users <- lapply(seq_along(user_names), function(x) { train_data$user_name == user_names[x] })
train_data_u  <- lapply(1:6, function(x) train_data[users[[x]],])    # held-out lifter
train_data_nu <- lapply(1:6, function(x) train_data[!users[[x]],])   # remaining five lifters
rf_model_u <- lapply(1:6, function(x) { train(classe~., method="rf", data=train_data_nu[[x]][,-c(1,2)])})
pred_u <- lapply(1:6, function(x) { predict(rf_model_u[[x]], newdata=train_data_u[[x]])})
conf_u <- lapply(1:6, function(x) { confusionMatrix(pred_u[[x]], train_data_u[[x]]$classe)})
acc_u <- sapply(conf_u, function(x) {x$overall[1]})
acc_u
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy 
## 0.4669533 0.3324590 0.1841077 0.6085478 0.1706109 0.5597356

These results are not very promising. The accuracy of predicting the curling class for one weightlifter from a model trained on the other five ranges from 17% to 60%. This suggests that the method presented here may be good at identifying curling quality only for the specific weightlifters in the study, and may perform poorly outside of the sample.

Important Predictors

As a final investigation, we can look at which variables the model considers most important. Figure 4 shows a plot of importance versus predictor index in the random forest model.

Figure 4: Important predictors as found by the random forest model.
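The importance scores behind Figure 4 can be read off the fitted forest; a minimal sketch using the Gini-based importance stored in the final model (this reproduces the scores, not the exact figure):

# Importance of each predictor in the fitted forest (the basis for Figure 4)
imp <- rf_model$finalModel$importance
imp[order(imp[, 1], decreasing = TRUE)[1:10], , drop = FALSE]   # ten most important predictors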

If we re-run the model with only the nine predictors with importance > 200 (excluding index and user name, which are the first two in the figure), we reach approximately 96% out-of-sample accuracy.
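A sketch of how that reduced model and the pred_imps predictions used below might be produced (the threshold handling and object names are assumptions):

# Keep only predictors whose importance exceeds 200, then refit and predict
imp <- rf_model$finalModel$importance
top_vars <- rownames(imp)[imp[, 1] > 200]
rf_imps_model <- train(classe ~ ., method = "rf",
                       data = train_data[, c(top_vars, "classe")])
pred_imps <- predict(rf_imps_model, newdata = testval_data)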

confusionMatrix(pred_imps, testval_data$classe)$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.609578e-01   9.505582e-01   9.563835e-01   9.651781e-01   2.846174e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00   4.141341e-11

This warrants further investigation, and seems to be a more promising way to reduce the number of predictors than PCA.

Conclusion

In this paper, a random forest model was presented that is able to predict with approximately 98% accuracy the curling quality of selected weightlifters based on data from smartphone-like sensors mounted on the belt, arm, forearm, and dumbbell. However, it is also found that the results are sensitive to the particular weightlifters in the study.

On this last point, the academic paper linked in the introduction reveals some interesting facts. First, the researchers did not attempt to build a prediction model based on raw data as was done in this paper. Rather, they used aggregate features extracted from the data, such as the maximum curl roll angle during a 2.5 second window. The aggregate data do seem better suited to this analysis. For example, one of the curling classes is lifting the dumbbell only halfway, but raw data streamed at 45 Hz from a half curl is very likely to look a lot like raw data streamed from half of a real curl. However, we are constrained by the contents of pml_testing.csv to use raw data. Second, the researchers also performed a “leave one user out” analysis. At an average of 78%, their results were better than the ones presented above, probably because they were using aggregates; however, it also shows that users are a strong source of systematic bias even in their case. Unfortunately, we cannot verify this assertion because only raw data were provided with this project.

In closing, the project demonstrates an impressive ability of random forests to train on data with no obvious visual clustering and still extract a working classifier.