Executive Summary

In general, in the past it has been common to track how much of an exercise someone is doing. This data collection has not, however, addressed whether an exercise is done correctly. The dataset in this paper was collected from accelerometers on the belt, forearm, arm, and dumbell of 6 participants (males aged 20-28), and it relates specifically to how an exercise is performed. What kind of action the participant took (either correct or one of 4 incorrect ways) is categorized as one of 5 letters. ‘A’ corresponds to performing the dumbbell curl correctly. ‘B’, ‘C’, ‘D’, and ‘E’ are all variations that perform the exercise incorrectly. The ultimate goal is to use the data to determine in which manner the exercise is performed.
Usefully the random forest model in the caret package in R provides an oob (out-of-bag or out-of-sample) error estimate while predicting the “classe” variable on the dataset. We see that the random forest produces a model with over 99.8% prediction accuracy on an separate data sample (this is estimated out-of-sample error). This lends confidence to its predictions on the assigned testing set that has no given “classe” values

Analysis

Load Data

Here we can see the size (in bytes) of the files.

pmlTrainFile <- "pml-training.csv"
file.info(pmlTrainFile)$size 
## [1] 12222368
pmlTestFile <- "pml-testing.csv"
file.info(pmlTestFile)$size 
## [1] 15134

Here’s reading the data files into R. I’ve renamed the ‘test’ file to ‘pmlProject’ for clarity. You can see that I include a little code around the call to read.csv just so that I can see how long the load took.

a <- Sys.time(); pmlTrain <- read.csv(file = pmlTrainFile, as.is = TRUE); b <- Sys.time(); print(paste("Reading the file from", a, "to", b, ".")) 
## [1] "Reading the file from 2019-06-05 13:24:12 to 2019-06-05 13:24:13 ."
a <- Sys.time(); pmlProject <- read.csv(file = pmlTestFile, as.is = TRUE); b <- Sys.time(); print(paste("Reading the file from", a, "to", b, ".")) 
## [1] "Reading the file from 2019-06-05 13:24:13 to 2019-06-05 13:24:13 ."

Reading the training file takes between 1 and 3 seconds. Reading the project file takes less than a second.

And here’s how much data we’ve got.

dim(pmlTrain)
## [1] 19622   160
dim(pmlProject)
## [1]  20 160

So there are 19622 rows in the training data set and 20 rows in the set to predict. That 160 features is a lot, though. It would be good to try to reduce that.

Clean dataset

library(caret)

The first thing to do is check whether any of the features (variables) has zero or nearly zero variance. If they do then we should exclude them from the model because they will not be able to assist much in the prediction.

nzvDF <- nearZeroVar(pmlTrain, saveMetrics = TRUE)

A “true” in the third column indicates that the variable has zero variance. That means every value is the same.

table(nzvDF[,3])
## 
## FALSE 
##   160

Every value is FALSE. There are no variables with zero variance. A “true” in the fourth column indicates that the variable has near zero (very little) variance.

table(nzvDF[,4])
## 
## FALSE  TRUE 
##   100    60

60 variables are in this list. We’ll remove them from the dataset.

nzvDontUse <- nearZeroVar(pmlTrain, saveMetrics = FALSE)
pmlTrain1 <- pmlTrain[,-nzvDontUse]
pmlProject1 <- pmlProject[,-nzvDontUse]
dim(pmlTrain1)
## [1] 19622   100
dim(pmlProject1)
## [1]  20 100

By doing this there are 60 fewer variables (columns) in the dataset. Unfortunately 100 is still a lot.

The “num_window” variable isn’t explicitly a measurement, so it’s worth investigating a bit.

table(pmlTrain$new_window, pmlTrain$classe)
##      
##          A    B    C    D    E
##   no  5471 3718 3352 3147 3528
##   yes  109   79   70   69   79

“Yes” and “no” appear distributed amoung the classe variables in a similar way as the class variables are distributed. We won’t eliminate it from consideration.

Are there any more variables worth eliminating? Perhaps there a some with a large portion of NAs in their values.

fracNAs <- apply(pmlTrain1, MARGIN = 2, function(x) {mean(is.na(x))})
table(fracNAs)
## fracNAs
##                 0 0.979308938946081 
##                59                41

That says that 41 features are 97.93% NAs. We should remove those features from the dataset. Note that in the following transformation I also remove the first five columns of the dataset. These are an index and some timestamps that are not relevant to the analysis.

mostlyNAs <- which(fracNAs > .5)
mostlyNAs
##            max_roll_belt           max_picth_belt            min_roll_belt 
##                       11                       12                       13 
##           min_pitch_belt      amplitude_roll_belt     amplitude_pitch_belt 
##                       14                       15                       16 
##     var_total_accel_belt            avg_roll_belt         stddev_roll_belt 
##                       17                       18                       19 
##            var_roll_belt           avg_pitch_belt        stddev_pitch_belt 
##                       20                       21                       22 
##           var_pitch_belt             avg_yaw_belt          stddev_yaw_belt 
##                       23                       24                       25 
##             var_yaw_belt            var_accel_arm            max_picth_arm 
##                       26                       40                       50 
##              max_yaw_arm              min_yaw_arm        amplitude_yaw_arm 
##                       51                       52                       53 
##        max_roll_dumbbell       max_picth_dumbbell        min_roll_dumbbell 
##                       57                       58                       59 
##       min_pitch_dumbbell  amplitude_roll_dumbbell amplitude_pitch_dumbbell 
##                       60                       61                       62 
##       var_accel_dumbbell        avg_roll_dumbbell     stddev_roll_dumbbell 
##                       64                       65                       66 
##        var_roll_dumbbell       avg_pitch_dumbbell    stddev_pitch_dumbbell 
##                       67                       68                       69 
##       var_pitch_dumbbell         avg_yaw_dumbbell      stddev_yaw_dumbbell 
##                       70                       71                       72 
##         var_yaw_dumbbell        max_picth_forearm        min_pitch_forearm 
##                       73                       86                       87 
##  amplitude_pitch_forearm        var_accel_forearm 
##                       88                       90
pmlTrain2 <- (pmlTrain1[,-mostlyNAs])[,-c(1:5)]

And now our dataset has these dimensions.

dim(pmlTrain2)
## [1] 19622    54

That means we’ve eliminated 106 unusable features from the original dataset.

Let’s check that there are no more features that are mostly NAs.

fracNAs1 <- apply(pmlTrain2, MARGIN = 2, function(x) {mean(is.na(x))})
table(fracNAs1)
## fracNAs1
##  0 
## 54

None left. That’s good.

We have to make sure that the data we’ll predict from is transformed in the same way as the data we used to train the model.

pmlProject2 <- (pmlProject1[,-mostlyNAs])[,-c(1:5)]
dim(pmlProject2)
## [1] 20 54

Check that there aren’t any columns in this dataset that are mostly NAs.

fracNAsProj <- apply(pmlProject2, MARGIN = 2, function(x) {mean(is.na(x))})
table(fracNAsProj)
## fracNAsProj
##  0 
## 54
Split the data into training data and testing data.

Normally we’d split the data into a training set and a test set. In this case it’s unnecessary because the random forest model computes an out-of-bag (oob or out-of-sample) error for us during the training (creation) of the model.

Use parallel processing to reduce computation time

In order to perform the model fitting in a reasonable amount of time it is necessary to make use of the parallel processing capabilities of the computer’s processor.
Note that in the statement it is convention to leave one core free for the operating system to perform its functions.

library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

Because there are plenty of rows it is the case that regular k-fold cross-validation may work for training the model. FOr the sake of completeness I’ll allow the training algorithm to use the default booststrap method.

pmlRFControl <- trainControl(allowParallel = TRUE)
a <- Sys.time(); pmlRF3 <- train(classe ~.,method="rf",data=pmlTrain2, trControl = pmlRFControl); b <- Sys.time(); print(paste("Training Random Forest model", a, "to", b, ".")) 

This takes 68 minutes using standard bootstrapping on 7 processing threads.

Now we have to close the parallel processing.

stopCluster(cluster)
registerDoSEQ()

And because it took that long we’ll save the model to a file and then load it again. That saves the processing time in the future.

save(pmlRF3, file = "Practical Machine Learning Random Forest Algorithm.rda")
load(file = "Practical Machine Learning Random Forest Algorithm.rda")

Model Analysis

Remember, the caret::train function’s random forest model contains detail about the prediciton error

(pmlRF3$finalModel)$confusion
##      A    B    C    D    E  class.error
## A 5578    1    0    0    1 0.0003584229
## B    6 3789    1    1    0 0.0021069265
## C    0    4 3418    0    0 0.0011689071
## D    0    0    9 3205    2 0.0034203980
## E    0    0    0    3 3604 0.0008317161

and the out-of-bag error estimate.

(pmlRF3$finalModel)$err.rate[500,1]
##        OOB 
## 0.00142697

So we can see that the class prediction error is less than \(3.5 \times 10^-3\) for each class and that the estimated oob error is 0.001427 or approximately \(1.4 \times 10^-3\).

The random forest uses a subset of the predictors for each tree. In this plot we can see how many features it used.

The plot confirms that R got the best forest when it used 27 features of the 53 available

We can also see what R thinks about the most important variables in the model.

It would appear the most effective predictor was num_window. This was followed by roll_belt, pitch_forearm, yaw_belt, magnet_dumbbell_z, pitch_belt, magnet_dumbell_y, and roll_forearm. The rest had progressively less effect on the Gini value (a measure of prediction accuracy).

And finally we can predict the classes for the ‘Project’ dataset.

predict(pmlRF3, pmlProject2)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

There are other classification methods that could potentially be useful. Linear Discriminant Analysis and Support Vector Machines are a couple. However, given the accuracy of this model (OOB error estimate of 0.001427) I am disinclined to pursue any other classification methods.

Note also that it may be possible to reduce overall processing time

  • Principal Component Analysis could further decrease the number of features
  • Using cross-validation instead of bootstrapping could reduce the amount of calculation that goes into training the model
  • The algorithm may not need to produce the default 500 trees in order to get an appropriate accuracy estimate

Conclusion

It appears to be possible to determine accurately the appropriateness of the form of an exercise (at least this exercise) using sensors on various parts of the body. A random forest model predicts very well and estimates its own error (on an unseen dataset) as approximately 0.001427 or about 0.142697 %. On the given testing data the predictions are B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B