We built a classification model trained on data from accelerometers on a dumbbell and on the belt, arm, and forearm (Figure 1, below) of six human subjects to predict whether they executed 10 repetitions of a simple Unilateral Dumbbell Biceps Curl correctly or not. See the Weight Lifting Exercises Dataset collected by the Human Activity Recognition (HAR) project at the Pontifical Catholic University of Rio de Janeiro (Brazil).
We predicted five responses, labeled A through E: a correct lift (A) and four different incorrect lifts: throwing the elbows forward (B), lifting only halfway (C), lowering only halfway (D), and throwing the hips forward (E).
Since we predicted those responses without any subject matter expertise to select plausible causal features, we built four alternative models that select features automatically. Then, we assessed their Out of Sample Error with cross-validation on a training subset and their Accuracy in predicting a separate validation subset. No model stacking was necessary.
Finally, we made predictions on a very small testing set with the most accurate model, a cross-validated Random Forest. Its Accuracy was only slightly higher than that of a much faster Support Vector Machines model and a slightly faster Generalized Boosted-regression Model. Each of those three was much more accurate than a Linear Discriminant Analysis model.
Fig. 1, Location of Accelerometers:
library(plyr)
library(dplyr)
library(caret)
library(kernlab)
library(randomForest)
library(ggplot2)
library(ggthemes)
options(digits = 3, max.print = 200)
We used a version of the original dataset already split into a very large training set (N=19,622) and a very small testing set (N=20) for the Practical Machine Learning course. Each contained 160 columns.
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!dir.exists("./data")) dir.create("./data")  # ensure the download target exists
download.file(urlTrain, destfile = "./data/pml-training.csv", method = "curl")
download.file(urlTest, destfile = "./data/pml-testing.csv", method = "curl")
training <- read.csv("./data/pml-training.csv", na.strings = c("", "NA"))
testing <- read.csv("./data/pml-testing.csv", na.strings = c("", "NA"))
dim(training)
[1] 19622 160
dim(testing)
[1] 20 160
Visual inspection of the dataset summary (not included here) immediately revealed one major problem: 19,216 of the 19,622 records contained only NA or #DIV/0! for 100 of the 160 features, rendering them useless for prediction. Here are the counts of features with 0 or 19,216 NAs, the only combinations in the entire dataset:
trainingIsNA <- is.na(training)
table(apply(trainingIsNA, 2, sum))
0 19216
60 100
trainingAnyNAs <- apply(training, 2, anyNA)
table(trainingAnyNAs)
trainingAnyNAs
FALSE TRUE
60 100
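For completeness, an alternative (not used above) is to treat the #DIV/0! strings as NA already at read time, so the affected columns parse as numeric rather than character; a minimal sketch, assuming the same file path as above:
trainingAlt <- read.csv("./data/pml-training.csv",
                        na.strings = c("", "NA", "#DIV/0!"))
# The mostly-empty columns are still riddled with NAs, so the same anyNA
# filter should flag the same 100 columns for removal
sum(apply(trainingAlt, 2, anyNA))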
So, the first data cleaning step was simply to drop all 100 features containing mostly NA and #DIV/0!, leaving 60 features and 19,622 complete records.
trainingNoNAs <- training[, trainingAnyNAs == FALSE]
dim(trainingNoNAs)
[1] 19622 60
A second feature selection step was to drop the first seven columns (row index, user name, timestamps, and time-window markers), which had no predictive value, leaving a slimmed-down training set with only 53 features.
trainingCleanVars <- trainingNoNAs[, -c(1:7)]
dim(trainingCleanVars)
[1] 19622 53
While the reduction in features might sound a bit drastic, those remaining covered readings in all three dimensions (x, y, and z) from all four accelerometers (dumbbell, belt, arm, and forearm), as well as the classe response, a non-numeric factor. A quick per-sensor count of the retained features follows the listing below.
names(trainingCleanVars)
[1] "roll_belt" "pitch_belt" "yaw_belt"
[4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
[7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
[10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
[13] "magnet_belt_z" "roll_arm" "pitch_arm"
[16] "yaw_arm" "total_accel_arm" "gyros_arm_x"
[19] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
[22] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
[25] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
[28] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
[31] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
[34] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
[37] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
[40] "roll_forearm" "pitch_forearm" "yaw_forearm"
[43] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
[46] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
[49] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
[52] "magnet_forearm_z" "classe"
We double-checked that the same number of incomplete features…
testingNoNAs <- testing[, trainingAnyNAs == FALSE]
dim(testingNoNAs)
[1] 20 60
…and the same seven non-predictive columns were dropped from the much smaller testing dataset, as well.
testingCleanVars <- testingNoNAs[, -c(1:7)]
dim(testingCleanVars)
[1] 20 53
Finally, we confirmed that none of the retained features suffered from near-zero variance.
NearZeroClean <- nzv(trainingCleanVars, saveMetrics = TRUE)
sum(NearZeroClean$nzv)
[1] 0
Taking advantage of the training set’s large size, we split it into still-large train (N=14,718) and valid (N=4,904) subsets for model building and assessment, respectively, and also created features-only (52 columns) and response-only (1 column) versions of each for convenience.
set.seed(34567)
inTrain <- createDataPartition(trainingCleanVars$classe, p = 0.75, list = FALSE)
trainCleanVars <- trainingCleanVars[inTrain, ]
validCleanVars <- trainingCleanVars[-inTrain, ]
trainFeatures <- as.matrix(trainCleanVars[, -53])
trainResponse <- as.factor(trainCleanVars[, 53])
validFeatures <- validCleanVars[, -53]
validResponse <- validCleanVars[, 53]
dim(trainFeatures)
[1] 14718 52
str(trainResponse)
Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(validFeatures)
[1] 4904 52
str(validResponse)
Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
We confirmed that the resulting distribution of classe responses was similar for the two subsets:
table(trainResponse)
trainResponse
A B C D E
4185 2848 2567 2412 2706
table(validResponse)
validResponse
A B C D E
1395 949 855 804 901
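As a further check, the class proportions of the two subsets can be compared directly; a minimal sketch:
# Proportions of each classe level in the train and valid subsets
round(rbind(train = prop.table(table(trainResponse)),
            valid = prop.table(table(validResponse))), 3)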
We set up 10-fold Cross Validation (without repetition, in order to conserve runtime) in caret for the Linear Discriminant Analysis model and the Generalized Boosted-regression Model; for the Support Vector Machines and Random Forest models, built with non-caret calls, Cross Validation was set up directly in those calls. (The choice of whether to build each type of model in caret or not was based on runtime.)
ctrl <- trainControl(method = "cv", number = 10)
Since the 52 features were all numeric or integer, we decided to scale and center them in each of the separate model algorithm calls below. While not strictly necessary for the Generalized Boosted-regression Model and Random Forest tree-generating models, this normalization was required for Support Vector Machines and would help for Linear Discriminant Analysis, too.
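In the model calls below the normalization is requested through each function's own argument (caret's preProc, ksvm's scaled). An equivalent explicit form, shown here only as a sketch, would use caret's preProcess:
# Estimate centering/scaling parameters on the train subset only,
# then apply the same transformation to the train and validation features
prep <- preProcess(trainCleanVars[, -53], method = c("center", "scale"))
trainScaled <- predict(prep, trainCleanVars[, -53])
validScaled <- predict(prep, validCleanVars[, -53])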
Out of Sample Error for the LDA model built with caret was 1 minus its 0.701 cross-validated Accuracy, a disappointing 29.9%. Runtime was short, approximately 1 minute.
set.seed(34567)
fitLda <- train(classe ~ ., data = trainCleanVars, method = "lda", preProc = c("center",
"scale"), trControl = ctrl)
fitLda
Linear Discriminant Analysis
14718 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
Pre-processing: centered (52), scaled (52)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 13246, 13245, 13245, 13247, 13248, 13248, ...
Resampling results:
Accuracy Kappa
0.701 0.622
The LDA model got fully 1,421 of the 4,904 validation predictions wrong. So, its Confusion Matrix exhibited lots of error, both Type I false positives and Type II false negatives.
predictLda <- predict(fitLda, validFeatures)
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
confusionMatrix(predictLda, validResponse)$table
Reference
Prediction A B C D E
A 1140 128 69 56 40
B 35 623 98 25 155
C 115 123 576 94 81
D 101 33 98 606 87
E 4 42 14 23 538
table(predictLda == validResponse)
FALSE TRUE
1421 3483
confusionMatrix(predictLda, validResponse)$overall[1]
Accuracy
0.71
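For a per-class view of where LDA struggled, confusionMatrix also reports class-wise statistics; a minimal sketch:
# Sensitivity and specificity for each classe level under the LDA model
confusionMatrix(predictLda, validResponse)$byClass[, c("Sensitivity", "Specificity")]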
Out of Sample Error for the GBM model built with caret was dramatically lower at 1 minus the 0.964 cross-validated Accuracy, or 3.6%, for the automatically chosen tuning parameters n.trees = 150 and interaction.depth = 3. GBM got only 189 validation predictions wrong, with slightly more Type I false positives than Type II false negatives. Runtime, though, was much longer, approximately 16 minutes.
set.seed(34567)
fitGbm <- train(classe ~ ., data = trainCleanVars, method = "gbm", preProc = c("scale",
"center"), trControl = ctrl, verbose = FALSE)
fitGbm
Stochastic Gradient Boosting
14718 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
Pre-processing: scaled (52), centered (52)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 13246, 13245, 13245, 13247, 13248, 13248, ...
Resampling results across tuning parameters:
interaction.depth n.trees Accuracy Kappa
1 50 0.754 0.688
1 100 0.825 0.778
1 150 0.857 0.819
2 50 0.855 0.817
2 100 0.908 0.884
2 150 0.933 0.915
3 50 0.899 0.873
3 100 0.945 0.930
3 150 0.964 0.954
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150,
interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
predictGbm <- predict(fitGbm, validCleanVars)
confusionMatrix(predictGbm, validCleanVars$classe)$table
Reference
Prediction A B C D E
A 1361 27 0 0 1
B 24 898 27 3 4
C 7 23 818 26 7
D 3 0 10 765 16
E 0 1 0 10 873
table(predictGbm == validCleanVars$classe)
FALSE TRUE
189 4715
confusionMatrix(predictGbm, validCleanVars$classe)$overall[1]
Accuracy
0.961
Figure 2 plots GBM model Accuracy by Boosting Iterations while tuning the Maximum Tree Depth parameter to 1, 2, and 3, which makes clear why caret set it to 3.
plot(fitGbm, main = "Fig. 2: Tuning the Maximum Tree Depth parameter", ylab = "Accuracy",
xlab = "Boosting Iterations")
The SVM model, built with ksvm using the Radial Basis Function kernel in the kernlab package, scaled and centered all features by default, but 10-fold Cross Validation had to be chosen manually by setting the cross option to 10.
Out of Sample Error, reported below as Cross validation error, was also impressively low at 1.8%, not much higher than the Training error of 1.2%. SVM got only 67 predictions wrong, mostly Type I false positives. Runtime was fast, too, approximately 2 minutes.
set.seed(34567)
fitSVMrbf <- ksvm(classe ~ ., data = trainCleanVars, kernel = "rbfdot", scaled = TRUE,
C = 16, cross = 10)
fitSVMrbf
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 16
Gaussian Radial Basis kernel function.
Hyperparameter : sigma = 0.0137325880957421
Number of Support Vectors : 3838
Objective Function Value : -5166 -1622 -1440 -662 -3845 -1026 -1526 -6935 -2314 -1966
Training error : 0.01223
Cross validation error : 0.0179
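The two error rates printed above can also be pulled out programmatically with kernlab's accessor functions; a minimal sketch:
error(fitSVMrbf)  # training error
cross(fitSVMrbf)  # 10-fold cross validation error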
predictSVMrbf <- predict(fitSVMrbf, validCleanVars)
confusionMatrix(predictSVMrbf, validCleanVars$classe)$table
Reference
Prediction A B C D E
A 1391 12 0 0 0
B 4 934 1 0 0
C 0 0 852 35 1
D 0 1 2 769 9
E 0 2 0 0 891
table(predictSVMrbf == validCleanVars$classe)
FALSE TRUE
67 4837
confusionMatrix(predictSVMrbf, validCleanVars$classe)$overall[1]
Accuracy
0.986
The Out of Sample Error of 0.9% for the RF model built with randomForest was the lowest of all. We tuned its mtry parameter to 13, as explained below, and therefore read the corresponding Out of Sample Error from the table line for 13 “variables” (i.e., the number of features randomly selected for consideration at each split). RF got only 17 of the 4,904 predictions for the valid subset wrong, with very few errors of either Type I or Type II.
Random Forest is not as prone to over-fitting as other algorithms, so it may not require cross validation on its own (see StackExchange, “Does modeling with Random Forests require cross-validation?”). Because it bags a large number of trees grown on random resamples drawn with replacement, the OOB estimate of error rate it reports (0.5%, below) is not completely comparable to the Out of Sample Error rates reported so far: “OOB” means Out Of Bag, not 10-fold Cross Validation, for which the additional, slow call to rfcv below was required. This extra Cross Validation step is the only reason RF runtime ended up being the slowest of all, approximately 25 minutes combined.
set.seed(34567)
fitRandomFcv500 <- rfcv(trainFeatures, trainResponse, cv.fold = 10)
str(fitRandomFcv500)
List of 3
$ n.var : num [1:6] 52 26 13 6 3 1
$ error.cv : Named num [1:6] 0.0055 0.00666 0.00931 0.04253 0.10871 ...
..- attr(*, "names")= chr [1:6] "52" "26" "13" "6" ...
$ predicted:List of 6
..$ 52: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ 26: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ 13: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ 6 : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ 3 : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ 1 : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 3 2 1 1 1 2 2 ...
names(fitRandomFcv500)
[1] "n.var" "error.cv" "predicted"
error <- as.numeric(fitRandomFcv500$error.cv)
variables <- as.numeric(fitRandomFcv500$n.var)
errorRates500 <- data.frame(cbind(variables, error))
errorRates500
variables error
1 52 0.00550
2 26 0.00666
3 13 0.00931
4 6 0.04253
5 3 0.10871
6 1 0.58846
Figure 3 shows that Out of Sample Error does not drop much more beyond n.var = 13 variables at each tree split.
ggplot(errorRates500) + aes(variables, error) + geom_point(color = "blue2") +
theme_few() + geom_step(color = "blue2", alpha = 0.5) + scale_x_continuous(breaks = c(1,
3, 6, 13, 26, 52)) + labs(y = "Out of Sample Error", x = "Variables at each split",
title = "Fig. 3: Tuning Random Forest's mtry parameter:", subtitle = "OOB Error drops quickly until mtry reaches 13")
Based on that preliminary rfcv Cross Validation step, we decided to manually tune the RF model’s mtry parameter to 13. Without needing to execute Cross Validation again, it then ran very quickly, approximately 1 minute.
set.seed(34567)
fitRandomF <- randomForest(classe ~ ., trainCleanVars, mtry = 13)
fitRandomF
Call:
randomForest(formula = classe ~ ., data = trainCleanVars, mtry = 13)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 13
OOB estimate of error rate: 0.5%
Confusion matrix:
A B C D E class.error
A 4180 2 1 1 1 0.00119
B 13 2830 5 0 0 0.00632
C 0 12 2551 4 0 0.00623
D 0 1 20 2388 3 0.00995
E 0 0 3 8 2695 0.00407
Although randomForest defaulted to growing 500 trees, Figure 4 suggests that no more than 100 would be necessary to keep Out of Bag Error low, which might make deployment of an even faster model possible (but that’s beyond the scope of this report).
OOBerrors <- data.frame(fitRandomF$err.rate)
ggplot(OOBerrors) + aes(seq_len(nrow(OOBerrors)), OOB) + geom_line(color = "red2") +
    theme_few() + labs(title = "Fig. 4: Out of Bag Error Rate drops as Forest grows",
    y = "OOB Error", x = "Trees")
predictRf <- predict(fitRandomF, validFeatures)
confusionMatrix(predictRf, validResponse)$table
Reference
Prediction A B C D E
A 1392 2 0 0 0
B 3 945 3 0 0
C 0 2 851 5 1
D 0 0 1 799 0
E 0 0 0 0 900
table(predictRf == validResponse)
FALSE TRUE
17 4887
confusionMatrix(predictRf, validResponse)$overall[1]
Accuracy
0.997
predictFinalLda <- predict(fitLda, testingCleanVars)
predictFinalGbm <- predict(fitGbm, testingCleanVars)
Loading required package: gbm
Loading required package: survival
Attaching package: 'survival'
The following object is masked from 'package:caret':
cluster
Loading required package: splines
Loading required package: parallel
Loaded gbm 2.1.3
predictFinalSVM <- predict(fitSVMrbf, testingCleanVars)
predictFinalRF <- predict(fitRandomF, testingCleanVars)
Since the classe response was withheld from our small testing set, we settled instead for a hint of potential success: the encouraging fact that the top three models agreed completely. Only LDA disagreed, making 7 different predictions, presumably wrong, which was enough to eliminate it from further consideration.
PredictionAgreement <- c(sum(predictFinalLda == predictFinalGbm), sum(predictFinalGbm ==
predictFinalSVM), sum(predictFinalSVM == predictFinalRF))
ModelPairs <- c("LDA vs GBM", "GBM vs SVM", "SVM vs RF")
data.frame(cbind(ModelPairs, PredictionAgreement))
ModelPairs PredictionAgreement
1 LDA vs GBM 13
2 GBM vs SVM 20
3 SVM vs RF 20
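The specific test cases on which LDA disagreed with the other three models can be listed directly; a minimal sketch:
# Indices of the 7 test cases where LDA's prediction differs from RF's
which(predictFinalLda != predictFinalRF)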
Although the top three models were nearly indistinguishable, the table below ranks them, by Accuracy on the large valid subset rather than just cross validation on the train subset, in this order: RF > SVM > GBM >> LDA.
modelAccuracy <- c(RF = confusionMatrix(predictRf, validResponse)$overall[1],
SVM = confusionMatrix(predictSVMrbf, validCleanVars$classe)$overall[1],
GBM = confusionMatrix(predictGbm, validCleanVars$classe)$overall[1], LDA = confusionMatrix(predictLda,
validResponse)$overall[1])
data.frame(modelAccuracy, row.names = c("RF", "SVM", "GBM", "LDA"))
modelAccuracy
RF 0.997
SVM 0.986
GBM 0.961
LDA 0.710
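The same comparison can be expressed as Out of Sample Error (1 minus Accuracy) on the valid subset; a minimal sketch:
# Validation-set error rate for each model, in ranked order
data.frame(Accuracy = modelAccuracy, OutOfSampleError = 1 - modelAccuracy,
           row.names = c("RF", "SVM", "GBM", "LDA"))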
Only LDA appeared to have failed on some of the 20 testing values. Of the remaining three models, SVM had the fastest runtime, but RF was the most accurate by a very slim margin, with GBM the also-ran on both dimensions. We opted for highest Accuracy over shortest runtime, choosing to make the final predictions with the Random Forest model despite its manual tuning, separate Cross Validation, and relative slowness. The final response values (A through E) predicted by the doubly cross-validated RF model were:
as.matrix(predictFinalRF)
[,1]
1 "B"
2 "A"
3 "B"
4 "A"
5 "A"
6 "E"
7 "D"
8 "B"
9 "A"
10 "A"
11 "B"
12 "C"
13 "B"
14 "A"
15 "E"
16 "E"
17 "A"
18 "B"
19 "B"
20 "B"