In general, it has been common in the past to track how much of an exercise someone does. That data collection has not, however, addressed whether an exercise is done correctly. The dataset in this paper was collected from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants (males aged 20-28), and it relates specifically to how an exercise is performed. The manner in which each participant performed the exercise (correctly, or in one of 4 incorrect ways) is recorded as one of 5 letters. ‘A’ corresponds to performing the dumbbell curl correctly; ‘B’, ‘C’, ‘D’, and ‘E’ are variations that perform the exercise incorrectly. The ultimate goal is to use the data to determine in which manner the exercise is performed.
Usefully, the random forest model in the caret package in R provides an out-of-bag (oob) error estimate, which serves as an out-of-sample error estimate, while predicting the “classe” variable on the dataset. We see that the random forest produces a model with over 99.8% prediction accuracy on a separate data sample (this is the estimated out-of-sample error). That lends confidence to its predictions on the assigned testing set, which has no given “classe” values.
Here we can see the size (in bytes) of the files.
pmlTrainFile <- "pml-training.csv"
file.info(pmlTrainFile)$size
## [1] 12222368
pmlTestFile <- "pml-testing.csv"
file.info(pmlTestFile)$size
## [1] 15134
Here the data files are read into R. I’ve renamed the data from the ‘test’ file to ‘pmlProject’ for clarity. You can see that I include a little code around the call to read.csv just so that I can see how long the load takes.
a <- Sys.time(); pmlTrain <- read.csv(file = pmlTrainFile, as.is = TRUE); b <- Sys.time(); print(paste("Reading the file from", a, "to", b, "."))
## [1] "Reading the file from 2019-06-05 13:24:12 to 2019-06-05 13:24:13 ."
a <- Sys.time(); pmlProject <- read.csv(file = pmlTestFile, as.is = TRUE); b <- Sys.time(); print(paste("Reading the file from", a, "to", b, "."))
## [1] "Reading the file from 2019-06-05 13:24:13 to 2019-06-05 13:24:13 ."
Reading the training file takes between 1 and 3 seconds. Reading the project file takes less than a second.
And here’s how much data we’ve got.
dim(pmlTrain)
## [1] 19622 160
dim(pmlProject)
## [1] 20 160
So there are 19622 rows in the training data set and 20 rows in the set to predict. 160 features is a lot, though; it would be good to try to reduce that.
library(caret)
The first thing to do is check whether any of the features (variables) have zero or near-zero variance. If they do, we should exclude them from the model because they will not be able to assist much in the prediction.
nzvDF <- nearZeroVar(pmlTrain, saveMetrics = TRUE)
A “true” in the third column indicates that the variable has zero variance. That means every value is the same.
table(nzvDF[,3])
##
## FALSE
## 160
Every value is FALSE. There are no variables with zero variance. A “true” in the fourth column indicates that the variable has near zero (very little) variance.
table(nzvDF[,4])
##
## FALSE TRUE
## 100 60
60 variables are in this list. We’ll remove them from the dataset.
nzvDontUse <- nearZeroVar(pmlTrain, saveMetrics = FALSE)
pmlTrain1 <- pmlTrain[,-nzvDontUse]
pmlProject1 <- pmlProject[,-nzvDontUse]
dim(pmlTrain1)
## [1] 19622 100
dim(pmlProject1)
## [1] 20 100
By doing this there are 60 fewer variables (columns) in the dataset. Unfortunately 100 is still a lot.
The “new_window” variable isn’t explicitly a measurement, so it’s worth investigating a bit.
table(pmlTrain$new_window, pmlTrain$classe)
##
## A B C D E
## no 5471 3718 3352 3147 3528
## yes 109 79 70 69 79
“Yes” and “no” appear to be distributed among the classe values in roughly the same proportions as the classe values themselves, so we won’t eliminate it from consideration.
Are there any more variables worth eliminating? Perhaps there are some with a large portion of NAs in their values.
fracNAs <- apply(pmlTrain1, MARGIN = 2, function(x) {mean(is.na(x))})
table(fracNAs)
## fracNAs
## 0 0.979308938946081
## 59 41
That says that 41 features are 97.93% NAs. We should remove those features from the dataset. Note that in the following transformation I also remove the first five columns of the dataset. These are an index and some timestamps that are not relevant to the analysis.
mostlyNAs <- which(fracNAs > .5)
mostlyNAs
## max_roll_belt max_picth_belt min_roll_belt
## 11 12 13
## min_pitch_belt amplitude_roll_belt amplitude_pitch_belt
## 14 15 16
## var_total_accel_belt avg_roll_belt stddev_roll_belt
## 17 18 19
## var_roll_belt avg_pitch_belt stddev_pitch_belt
## 20 21 22
## var_pitch_belt avg_yaw_belt stddev_yaw_belt
## 23 24 25
## var_yaw_belt var_accel_arm max_picth_arm
## 26 40 50
## max_yaw_arm min_yaw_arm amplitude_yaw_arm
## 51 52 53
## max_roll_dumbbell max_picth_dumbbell min_roll_dumbbell
## 57 58 59
## min_pitch_dumbbell amplitude_roll_dumbbell amplitude_pitch_dumbbell
## 60 61 62
## var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell
## 64 65 66
## var_roll_dumbbell avg_pitch_dumbbell stddev_pitch_dumbbell
## 67 68 69
## var_pitch_dumbbell avg_yaw_dumbbell stddev_yaw_dumbbell
## 70 71 72
## var_yaw_dumbbell max_picth_forearm min_pitch_forearm
## 73 86 87
## amplitude_pitch_forearm var_accel_forearm
## 88 90
pmlTrain2 <- (pmlTrain1[,-mostlyNAs])[,-c(1:5)]
And now our dataset has these dimensions.
dim(pmlTrain2)
## [1] 19622 54
That means we’ve eliminated 106 unusable features from the original dataset.
Let’s check that there are no more features that are mostly NAs.
fracNAs1 <- apply(pmlTrain2, MARGIN = 2, function(x) {mean(is.na(x))})
table(fracNAs1)
## fracNAs1
## 0
## 54
None left. That’s good.
We have to make sure that the data we’ll predict from is transformed in the same way as the data we used to train the model.
pmlProject2 <- (pmlProject1[,-mostlyNAs])[,-c(1:5)]
dim(pmlProject2)
## [1] 20 54
Check that there aren’t any columns in this dataset that are mostly NAs.
fracNAsProj <- apply(pmlProject2, MARGIN = 2, function(x) {mean(is.na(x))})
table(fracNAsProj)
## fracNAsProj
## 0
## 54
Normally we’d split the data into a training set and a test set. In this case it’s unnecessary because the random forest model computes an out-of-bag (oob) error, which serves as an out-of-sample error estimate, during the training (creation) of the model.
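For reference, if a separate hold-out set were wanted anyway, a conventional split with caret might look like the following sketch. The 75/25 proportion and the seed are illustrative assumptions, not part of this analysis, and the code is not run here.

# Sketch only: a conventional hold-out split (not used in this analysis).
# The 75/25 proportion and the seed are illustrative assumptions.
set.seed(12345)
inTrain <- createDataPartition(pmlTrain2$classe, p = 0.75, list = FALSE)
trainingPart <- pmlTrain2[inTrain, ]
testingPart <- pmlTrain2[-inTrain, ]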
In order to perform the model fitting in a reasonable amount of time it is necessary to make use of the parallel processing capabilities of the computer’s processor.
Note that in the makeCluster statement it is conventional to leave one core free so the operating system can perform its functions.
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
Because there are plenty of rows, regular k-fold cross-validation should work for training the model. For the sake of completeness I’ll allow the training algorithm to use the default bootstrap method.
pmlRFControl <- trainControl(allowParallel = TRUE)
a <- Sys.time(); pmlRF3 <- train(classe ~.,method="rf",data=pmlTrain2, trControl = pmlRFControl); b <- Sys.time(); print(paste("Training Random Forest model", a, "to", b, "."))
This takes 68 minutes using standard bootstrapping on 7 processing threads.
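If training time were a bigger concern, a k-fold cross-validation control could be used instead of the default bootstrap resampling. A sketch (with an assumed 5 folds; this is not what was run above) would look like this:

# Sketch only: 5-fold cross-validation instead of the default bootstrap.
# The number of folds is an illustrative assumption.
pmlRFControlCV <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
# pmlRFcv <- train(classe ~ ., method = "rf", data = pmlTrain2, trControl = pmlRFControlCV)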
Now we have to close the parallel processing.
stopCluster(cluster)
registerDoSEQ()
And because it took that long we’ll save the model to a file and then load it again. That saves the processing time in the future.
save(pmlRF3, file = "Practical Machine Learning Random Forest Algorithm.rda")
load(file = "Practical Machine Learning Random Forest Algorithm.rda")
Remember, the caret::train function’s random forest model contains detail about the prediction error
(pmlRF3$finalModel)$confusion
## A B C D E class.error
## A 5578 1 0 0 1 0.0003584229
## B 6 3789 1 1 0 0.0021069265
## C 0 4 3418 0 0 0.0011689071
## D 0 0 9 3205 2 0.0034203980
## E 0 0 0 3 3604 0.0008317161
and the out-of-bag error estimate.
(pmlRF3$finalModel)$err.rate[500,1]
## OOB
## 0.00142697
So we can see that the class prediction error is less than \(3.5 \times 10^{-3}\) for each class and that the estimated oob error is 0.001427, or approximately \(1.4 \times 10^{-3}\).
The random forest uses a random subset of the predictors at each split. In this plot we can see how many predictors produced the best accuracy. The plot confirms that R got the best forest when it used 27 of the 53 available features.
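The plot referenced above can be generated directly from the fitted caret object; the call below is a reasonable way to do it, though the exact plotting code used originally isn’t shown.

# Resampled accuracy versus the number of randomly selected predictors (mtry)
plot(pmlRF3)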
We can also see what R thinks about the most important variables in the model.
It would appear the most effective predictor was num_window. This was followed by roll_belt, pitch_forearm, yaw_belt, magnet_dumbbell_z, pitch_belt, magnet_dumbbell_y, and roll_forearm. The rest had progressively less effect on the mean decrease in Gini (a measure of variable importance).
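The importance ranking described above can be pulled out of the model object; one common way (assuming the ranking came from the underlying randomForest fit) is:

# Variable importance (mean decrease in Gini) from the underlying randomForest object
randomForest::varImpPlot(pmlRF3$finalModel)
# caret's scaled importance measure gives a similar ranking
varImp(pmlRF3)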
And finally we can predict the classes for the ‘Project’ dataset.
predict(pmlRF3, pmlProject2)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
There are other classification methods that could potentially be useful. Linear Discriminant Analysis and Support Vector Machines are a couple. However, given the accuracy of this model (OOB error estimate of 0.001427) I am disinclined to pursue any other classification methods.
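For comparison, fitting one of those alternatives with caret would follow the same pattern as the random forest above. The lines below are a minimal sketch of a Linear Discriminant Analysis fit; it was not run as part of this analysis.

# Sketch only: a Linear Discriminant Analysis model fit with the same formula and data
pmlLDA <- train(classe ~ ., method = "lda", data = pmlTrain2)
predict(pmlLDA, pmlProject2)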
Note also that it may be possible to reduce overall processing time by using k-fold cross-validation instead of the default bootstrap resampling when training the model.
It appears to be possible to accurately determine the appropriateness of the form of an exercise (at least this exercise) using sensors on various parts of the body. A random forest model predicts very well and estimates its own error (on an unseen dataset) as approximately 0.001427, or about 0.14%. On the given testing data the predictions are B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B.