Course Project, Practical Machine Learning

Summary:

Predicting incorrect, as well as correct, weight lifting

We built a classification model trained on data from accelerometers on a dumbbell and on the belt, arm and forearm (Figure 1, below) of six human subjects to predict whether they executed 10 repetitions of a simple Unilateral Dumbbell Biceps Curl correctly or not. See the Weight Lifting Exercises Dataset collected by the Human Activity Recognition (HAR) project at the Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Brazil.

We predicted five responses, labeled A through E: a correct lift (A) and four different incorrect lifts, namely throwing the elbows forward (B), lifting only halfway (C), lowering only halfway (D) and throwing the hips forward (E).

Since we predicted those responses without any subject matter expertise to select plausible causal features, we built four alternative models that select features automatically. Then, we assessed their Out of Sample Error with cross-validation on a training subset and their Accuracy in predicting a separate validation subset. No model stacking was necessary.

Finally, we made predictions on a very small testing set with the most accurate model, the cross-validated Random Forest. Its Accuracy was only slightly higher than that of a much faster Support Vector Machines model and a slightly faster Generalized Boosted-regression Model; each of those three was far more accurate than a Linear Discriminant Analysis model.

Fig. 1, Location of Accelerometers: Courtesy of HAR

library(plyr)
library(dplyr)
library(caret)
library(kernlab)
library(randomForest)
library(ggplot2)
library(ggthemes)
options(digits = 3, max.print = 200)

Getting the Dataset

We used a version of the original dataset already split into a very large training set (N=19622) and a very small testing set (N=20) for the Practical Machine Learning course. They contained 160 features.

# Create the local data folder if it does not already exist
if (!dir.exists("./data")) dir.create("./data")
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(urlTrain, destfile = "./data/pml-training.csv", method = "curl")
download.file(urlTest, destfile = "./data/pml-testing.csv", method = "curl")
training <- read.csv("./data/pml-training.csv", na.strings = c("", "NA"))
testing <- read.csv("./data/pml-testing.csv", na.strings = c("", "NA"))
dim(training)
[1] 19622   160
dim(testing)
[1]  20 160

Data Preparation

1. Too many NAs in most features

Visual inspection of the dataset summary (not included here) immediately revealed one major problem: 19,216 of the 19,622 records contained only NA or #DIV/0! for 100 of the 160 features, rendering those features useless for prediction. Here are the counts of features with 0 or 19,216 NAs, the only combinations in the entire dataset:

trainingIsNA <- is.na(training)
table(apply(trainingIsNA, 2, sum))

    0 19216 
   60   100 
trainingAnyNAs <- apply(training, 2, anyNA)
table(trainingAnyNAs)
trainingAnyNAs
FALSE  TRUE 
   60   100 

So, the first data cleaning step was simply to drop all 100 features containing mostly NA and #DIV/0!, leaving 60 features and 19,622 complete records.

trainingNoNAs <- training[, trainingAnyNAs == FALSE]
dim(trainingNoNAs)
[1] 19622    60

A second feature-selection step was to drop the first seven columns (row index, user name, timestamps and window markers), which identify records rather than measure movement and so had no predictive value, leaving a slimmed-down training set with only 53 features.

trainingCleanVars <- trainingNoNAs[, -c(1:7)]
dim(trainingCleanVars)
[1] 19622    53

While the reduction in features might sound a bit drastic, those remaining covered readings in all three dimensions (x, y and z) from all four accelerometers (dumbbell, belt, arm and forearm), as well as the classe response, a non-numeric factor.

names(trainingCleanVars)
 [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
 [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
 [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
[10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
[13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
[16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
[19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
[22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
[25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
[28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
[31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
[34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
[37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
[40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
[43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
[46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
[49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
[52] "magnet_forearm_z"     "classe"              

We double-checked that the same number of incomplete features…

testingNoNAs <- testing[, trainingAnyNAs == FALSE]
dim(testingNoNAs)
[1] 20 60

…and the same seven identifier and timestamp columns were dropped from the much smaller testing dataset, as well.

testingCleanVars <- testingNoNAs[, -c(1:7)]
dim(testingCleanVars)
[1] 20 53
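As a quick extra check (a sketch, not part of the original run), the two cleaned sets should share every column except the last one; in this course version of the data, the testing file replaces classe with a problem_id column.

# Sketch: any column names unique to one of the cleaned sets
setdiff(names(trainingCleanVars), names(testingCleanVars))
setdiff(names(testingCleanVars), names(trainingCleanVars))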

Finally, we confirmed that none of the retained features suffered from near-zero variance.

NearZeroClean <- nzv(trainingCleanVars, saveMetrics = TRUE)
sum(NearZeroClean$nzv)
[1] 0

Split off extra validation subset (for assessing model Accuracy)

Taking advantage of the training set’s large size, we split it into still-large train (N = 14,718) and valid (N = 4,904) subsets for model building and assessment, respectively, along with features-only (52 columns) and response-only versions of each for convenience.

set.seed(34567)
inTrain <- createDataPartition(trainingCleanVars$classe, p = 0.75, list = FALSE)
trainCleanVars <- trainingCleanVars[inTrain, ]
validCleanVars <- trainingCleanVars[-inTrain, ]
trainFeatures <- as.matrix(trainCleanVars[, -53])
trainResponse <- as.factor(trainCleanVars[, 53])
validFeatures <- validCleanVars[, -53]
validResponse <- validCleanVars[, 53]
dim(trainFeatures)
[1] 14718    52
str(trainResponse)
 Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(validFeatures)
[1] 4904   52
str(validResponse)
 Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

We confirmed that the resulting distribution of classe responses was similar for the two subsets:

table(trainResponse)
trainResponse
   A    B    C    D    E 
4185 2848 2567 2412 2706 
table(validResponse)
validResponse
   A    B    C    D    E 
1395  949  855  804  901 
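As an additional sanity check (a brief sketch using base R's prop.table), the class proportions, not just the raw counts, line up closely across the two subsets:

round(prop.table(table(trainResponse)), 3)
round(prop.table(table(validResponse)), 3)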

Cross Validation & Normalization

Cross Validation to measure Out of Sample Error

We set up 10-fold Cross Validation (without repetition, to conserve runtime) for the Linear Discriminant Analysis model and the Generalized Boosted-regression Model executed in caret, and left it to be set up directly in the non-caret calls that built the Support Vector Machines and Random Forest models. (The choice of whether to build each type of model in caret was based on runtime.)

ctrl <- trainControl(method = "cv", number = 10)
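For comparison, a repeated version of the same resampling scheme would look like the sketch below (hypothetical settings; each repeat multiplies the training runtime, which is why we did not use it):

# Sketch only: 10-fold Cross Validation repeated 3 times (roughly 3x the runtime)
ctrlRepeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)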

Scale and Center all features

Since the 52 features were all numeric or integer, we decided to scale and center them in each of the separate model algorithm calls below. While not strictly necessary for the Generalized Boosted-regression Model and Random Forest tree-generating models, this normalization was required for Support Vector Machines and would help for Linear Discriminant Analysis, too.
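An equivalent, more explicit way to do the same normalization (a sketch only, not what we ran) is to estimate the centering and scaling parameters once on the train subset and then apply them to the validation features, which makes it obvious that no validation statistics leak into training:

# Sketch: estimate center/scale on the train features, then reuse those parameters
preProcParams <- preProcess(trainCleanVars[, -53], method = c("center", "scale"))
trainScaled <- predict(preProcParams, trainCleanVars[, -53])
validScaled <- predict(preProcParams, validCleanVars[, -53])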

Four alternative models with Cross Validation

1. Linear Discriminant Analysis (LDA)

Out of Sample Error for the LDA model built with caret was 1 minus the 0.701 cross-validated Accuracy, a disappointing 29.9%. Runtime was short, about one minute.

set.seed(34567)
fitLda <- train(classe ~ ., data = trainCleanVars, method = "lda", preProc = c("center", 
    "scale"), trControl = ctrl)
fitLda
Linear Discriminant Analysis 

14718 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

Pre-processing: centered (52), scaled (52) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 13246, 13245, 13245, 13247, 13248, 13248, ... 
Resampling results:

  Accuracy  Kappa
  0.701     0.622

 

The LDA model got fully 1,421 of the 4,904 validation predictions wrong, so its Confusion Matrix exhibited plenty of both Type I false positives and Type II false negatives.

predictLda <- predict(fitLda, validFeatures)
confusionMatrix(predictLda, validResponse)$table
          Reference
Prediction    A    B    C    D    E
         A 1140  128   69   56   40
         B   35  623   98   25  155
         C  115  123  576   94   81
         D  101   33   98  606   87
         E    4   42   14   23  538
table(predictLda == validResponse)

FALSE  TRUE 
 1421  3483 
confusionMatrix(predictLda, validResponse)$overall[1]
Accuracy 
    0.71 
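To see exactly where those errors fall (a quick sketch), the per-class Sensitivity and Specificity can be pulled from the same confusionMatrix object:

# Sketch: per-class error profile of the LDA predictions on the valid subset
confusionMatrix(predictLda, validResponse)$byClass[, c("Sensitivity", "Specificity")]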

2. Generalized Boosted-regression Modeling (GBM) trees

Out of Sample Error for the GBM model built with caret was dramatically lower at 1 - 0.961, or 3.9%, for the automatically chosen tuning parameters n.trees = 150 and interaction.depth = 3. GBM got only 189 predictions wrong, with slightly more Type I false positives than Type II false negatives. Runtime, though, was much longer, approximately 16 minutes.

set.seed(34567)
fitGbm <- train(classe ~ ., data = trainCleanVars, method = "gbm", preProc = c("scale", 
    "center"), trControl = ctrl, verbose = FALSE)
fitGbm
Stochastic Gradient Boosting 

14718 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

Pre-processing: scaled (52), centered (52) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 13246, 13245, 13245, 13247, 13248, 13248, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy  Kappa
  1                   50      0.754     0.688
  1                  100      0.825     0.778
  1                  150      0.857     0.819
  2                   50      0.855     0.817
  2                  100      0.908     0.884
  2                  150      0.933     0.915
  3                   50      0.899     0.873
  3                  100      0.945     0.930
  3                  150      0.964     0.954

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were n.trees = 150,
 interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10. 
predictGbm <- predict(fitGbm, validCleanVars)
confusionMatrix(predictGbm, validCleanVars$classe)$table
          Reference
Prediction    A    B    C    D    E
         A 1361   27    0    0    1
         B   24  898   27    3    4
         C    7   23  818   26    7
         D    3    0   10  765   16
         E    0    1    0   10  873
table(predictGbm == validCleanVars$classe)

FALSE  TRUE 
  189  4715 
confusionMatrix(predictGbm, validCleanVars$classe)$overall[1]
Accuracy 
   0.961 

Figure 2 plots GBM model Accuracy by Boosting Iterations while tuning the Maximum Tree Depth parameter (interaction.depth) to 1, 2 and 3, which makes clear why caret set it to 3.

plot(fitGbm, main = "Fig. 2: Tuning the Maximum Tree Depth parameter", ylab = "Accuracy", 
    xlab = "Boosting Iterations")

3. Support Vector Machines (SVM)

The SVM model, built with ksvm using the Radial Basis Function kernel in the kernlab package, scaled and centered all features by default, but 10-fold Cross Validation had to be chosen manually by setting the cross option to 10.

Out of Sample Error, reported below as Cross validation error, was also impressively low at 1.8%, not much higher than the Training error of 1.2%. SVM got only 67 predictions wrong, mostly Type I false positives. Runtime was fast, too, approximately 2 minutes.

set.seed(34567)
fitSVMrbf <- ksvm(classe ~ ., data = trainCleanVars, kernel = "rbfdot", scaled = TRUE, 
    C = 16, cross = 10)
fitSVMrbf
Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 16 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.0137325880957421 

Number of Support Vectors : 3838 

Objective Function Value : -5166 -1622 -1440 -662 -3845 -1026 -1526 -6935 -2314 -1966 
Training error : 0.01223 
Cross validation error : 0.0179 
predictSVMrbf <- predict(fitSVMrbf, validCleanVars)
confusionMatrix(predictSVMrbf, validCleanVars$classe)$table
          Reference
Prediction    A    B    C    D    E
         A 1391   12    0    0    0
         B    4  934    1    0    0
         C    0    0  852   35    1
         D    0    1    2  769    9
         E    0    2    0    0  891
table(predictSVMrbf == validCleanVars$classe)

FALSE  TRUE 
   67  4837 
confusionMatrix(predictSVMrbf, validCleanVars$classe)$overall[1]
Accuracy 
   0.986 
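The cost parameter C = 16 was fixed by hand in the call above. A more systematic, if slower, alternative (a sketch with hypothetical values, not run here) would let caret search a small grid over C with the same RBF kernel, estimating the kernel width sigma via kernlab's sigest:

# Sketch only: tune the SVM cost parameter C with caret's svmRadial method
svmGrid <- expand.grid(sigma = sigest(classe ~ ., data = trainCleanVars)[2],
    C = c(1, 4, 16, 64))
fitSvmTuned <- train(classe ~ ., data = trainCleanVars, method = "svmRadial",
    preProc = c("center", "scale"), trControl = ctrl, tuneGrid = svmGrid)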

4. Random Forest (RF) bagged trees

Two Error rates: Out of Bag & Cross Validation

The Out of Sample Error of 0.9% for the RF model built with randomForest was the lowest of all. We tuned the mtry parameter to 13, as explained below, and therefore read the corresponding Out of Sample Error on the table line for 13 “variables” (i.e., randomly selected for consideration at each split). RF got only 17 (out of 4,904) predictions for the valid subset wrong, very few of either Type I or II.

Random Forest is not as prone to over-fitting as other algorithms, so it may not require cross validation of its own (see the StackExchange question, “Does modeling with Random Forests require cross-validation?”). Because it bags a large number of trees grown on random resamples drawn with replacement, it reports an Out Of Bag (OOB) estimate of the error rate with the fitted model below; OOB error is not, however, completely comparable to the 10-fold Cross Validation errors reported so far. For a comparable figure we made an additional, slow call to rfcv, shown next. That extra Cross Validation step is the only reason RF runtime ended up being the slowest of all, approximately 25 minutes combined.

set.seed(34567)
fitRandomFcv500 <- rfcv(trainFeatures, trainResponse, cv.fold = 10)
str(fitRandomFcv500)
List of 3
 $ n.var    : num [1:6] 52 26 13 6 3 1
 $ error.cv : Named num [1:6] 0.0055 0.00666 0.00931 0.04253 0.10871 ...
  ..- attr(*, "names")= chr [1:6] "52" "26" "13" "6" ...
 $ predicted:List of 6
  ..$ 52: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ 26: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ 13: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ 6 : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ 3 : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..$ 1 : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 3 2 1 1 1 2 2 ...
names(fitRandomFcv500)
[1] "n.var"     "error.cv"  "predicted"
error <- as.numeric(fitRandomFcv500$error.cv)
variables <- as.numeric(fitRandomFcv500$n.var)
errorRates500 <- data.frame(cbind(variables, error))
errorRates500
  variables   error
1        52 0.00550
2        26 0.00666
3        13 0.00931
4         6 0.04253
5         3 0.10871
6         1 0.58846

Figure 3 shows that Out of Sample Error does not drop much more beyond n.var = 13 variables at each tree split.

ggplot(errorRates500) + aes(variables, error) + geom_point(color = "blue2") + 
    theme_few() + geom_step(color = "blue2", alpha = 0.5) + scale_x_continuous(breaks = c(1, 
    3, 6, 13, 26, 52)) + labs(y = "Out of Sample Error", x = "Variables at each split", 
    title = "Fig. 3: Tuning Random Forest's mtry parameter:", subtitle = "OOB Error drops quickly until mtry reaches 13")

Based on that preliminary Cross Validation step, we decided to tune the RF model’s mtry parameter to 13 manually. Without needing to execute Cross Validation again, the final fit then ran very quickly, in about one minute.
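The randomForest package also offers a lighter-weight way to pick mtry, tuneRF, which relies on the OOB error alone instead of a separate 10-fold Cross Validation; a sketch (with illustrative settings, not what we ran) is below:

# Sketch only: search mtry by OOB error, starting near the default sqrt(52), about 7
set.seed(34567)
tuneRF(trainFeatures, trainResponse, mtryStart = 7, stepFactor = 2,
    improve = 0.01, ntreeTry = 100)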

set.seed(34567)
fitRandomF <- randomForest(classe ~ ., trainCleanVars, mtry = 13)
fitRandomF

Call:
 randomForest(formula = classe ~ ., data = trainCleanVars, mtry = 13) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 13

        OOB estimate of  error rate: 0.5%
Confusion matrix:
     A    B    C    D    E class.error
A 4180    2    1    1    1     0.00119
B   13 2830    5    0    0     0.00632
C    0   12 2551    4    0     0.00623
D    0    1   20 2388    3     0.00995
E    0    0    3    8 2695     0.00407

Although randomForest defaulted to growing 500 trees, Figure 4 suggests that no more than 100 would be necessary to keep Out of Bag Error low, which might make deployment of an even faster model possible (but that’s beyond the scope of this report).

OOBerrors <- data.frame(fitRandomF$err.rate)
ggplot(OOBerrors) + aes(row(OOBerrors)[, 1], OOBerrors[, 1]) + geom_line(color = "red2") + 
    theme_few() + labs(title = "Fig. 4: Out of Bag Error Rate drops as Forest grows", 
    y = "OOB Error", x = "Trees")

predictRf <- predict(fitRandomF, validFeatures)
confusionMatrix(predictRf, validResponse)$table
          Reference
Prediction    A    B    C    D    E
         A 1392    2    0    0    0
         B    3  945    3    0    0
         C    0    2  851    5    1
         D    0    0    1  799    0
         E    0    0    0    0  900
table(predictRf == validResponse)

FALSE  TRUE 
   17  4887 
confusionMatrix(predictRf, validResponse)$overall[1]
Accuracy 
   0.997 

Final Predictions

predictFinalLda <- predict(fitLda, testingCleanVars)
predictFinalGbm <- predict(fitGbm, testingCleanVars)
predictFinalSVM <- predict(fitSVMrbf, testingCleanVars)
predictFinalRF <- predict(fitRandomF, testingCleanVars)

Since the classe response was withheld from our small testing set, we settled instead for a hint of potential success: the encouraging fact that the top three models agreed completely. Only LDA disagreed, making 7 different, presumably incorrect, predictions, which was enough to eliminate it from further consideration.

PredictionAgreement <- c(sum(predictFinalLda == predictFinalGbm), sum(predictFinalGbm == 
    predictFinalSVM), sum(predictFinalSVM == predictFinalRF))
ModelPairs <- c("LDA vs GBM", "GBM vs SVM", "SVM vs RF")
data.frame(cbind(ModelPairs, PredictionAgreement))
  ModelPairs PredictionAgreement
1 LDA vs GBM                  13
2 GBM vs SVM                  20
3  SVM vs RF                  20
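For completeness (a quick sketch), the specific testing cases on which LDA parted ways with the otherwise unanimous predictions can be listed directly:

# Sketch: which of the 20 testing cases LDA predicts differently from RF
which(predictFinalLda != predictFinalRF)
data.frame(LDA = predictFinalLda, RF = predictFinalRF)[predictFinalLda != predictFinalRF, ]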

Ranking models by Accuracy

Although the top three models were nearly indistinguishable, the table below ranks them (by Accuracy on the large valid subset, a true Out of Sample measure, rather than by cross validation on the train subset) in this order: RF > SVM > GBM >> LDA.

modelAccuracy <- c(RF = confusionMatrix(predictRf, validResponse)$overall[1], 
    SVM = confusionMatrix(predictSVMrbf, validCleanVars$classe)$overall[1], 
    GBM = confusionMatrix(predictGbm, validCleanVars$classe)$overall[1], LDA = confusionMatrix(predictLda, 
        validResponse)$overall[1])
data.frame(modelAccuracy, row.names = c("RF", "SVM", "GBM", "LDA"))
    modelAccuracy
RF          0.997
SVM         0.986
GBM         0.961
LDA         0.710

Conclusion:

Only LDA appeared to have failed to predict all 20 testing values correctly. Of the remaining three models, SVM had the fastest runtime, but RF was the most accurate by a very slim margin, with GBM the also-ran on both dimensions. We opted for highest Accuracy over shortest runtime, choosing to make the final predictions with the Random Forest model despite its manual tuning, separate Cross Validation and relative slowness. The final response values (A through E) predicted by the doubly cross-validated RF model were:

as.matrix(predictFinalRF)
   [,1]
1  "B" 
2  "A" 
3  "B" 
4  "A" 
5  "A" 
6  "E" 
7  "D" 
8  "B" 
9  "A" 
10 "A" 
11 "B" 
12 "C" 
13 "B" 
14 "A" 
15 "E" 
16 "E" 
17 "A" 
18 "B" 
19 "B" 
20 "B" 

NOTE: wordcount < 1,500