Introduction

This project addresses a sample machine learning problem: predicting, from personal activity data, whether a barbell lift was performed correctly or in one of several incorrect ways.

The present analysis uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Barbell lifts were performed in 5 different ways, which we will attempt to classify using the accelerometer data. The 5 lift types are coded in the data as the “classe” variable.

Initially, we download and inspect the data.

Loading and Preparing the Data for Exploratory Analysis

require("downloader"); require("caret"); require("ggplot2"); require("glmnet"); require("randomForest")
## Loading required package: downloader
## Loading required package: caret
## Warning: package 'caret' was built under R version 3.1.2
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 1.9-8
## 
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
dataset_url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
dataset_url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download(dataset_url_train, dest = "training.csv")
download(dataset_url_test, dest = "testing.csv")
training <- read.csv("training.csv", header = TRUE, na.strings = c("", "NA", "#DIV/0!"))
testing <- read.csv("testing.csv", header = TRUE, na.strings = c("", "NA", "#DIV/0!"))

Exploratory Data Analysis and Data Cleaning

A cursory examination of the training data reveals that it begins with an index variable and a series of timestamps. Because these are uninterpretable (the timestamps) or not useful for model building (the index variable), they are removed at the outset.

We also see that a number of the potential covariates contain negligible amounts of actual data. We continue by removing variables in which more than 95% of the observations are missing (NA, blank, or #DIV/0!, all of which were read in as NA).

vars2remove <- c(1, 3:5)                            # index variable and timestamp columns
varsNA <- which(colMeans(is.na(training)) > .95)    # variables that are more than 95% NA
training_clean <- training[, -c(vars2remove, varsNA)]

Next, we break the training set up into factor and non-factor variables to facilitate model building.

varsFactor <- which(sapply(training_clean, FUN = class) == "factor")
trainingFactor <- training_clean[, varsFactor]                   # factor variables, including "classe"
trainingNoFactor <- as.matrix(training_clean[, -varsFactor])     # numeric predictors, as a matrix for glmnet

Model Building

For this multinomial classification problem, we compare two approaches: a generalized linear model (GLM) fit by penalized maximum likelihood, with cross-validation used to select the penalty, and a random forest (RF) model. The GLM is quite flexible, readily interpretable and, with regularization, robust with regard to overfitting (Fox, 2008). The RF model’s main asset is its accuracy.

For the GLM, we opt for an elastic net penalty (specifying alpha = 0.5 in the function call) in order to strike a compromise between the ridge and lasso penalties.
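
As a purely illustrative sketch (not part of the original model fit), refitting with a few values of alpha shows how the penalty mix affects variable selection: alpha = 0 is ridge, alpha = 1 is the lasso, and alpha = 0.5 weights both equally. This assumes the trainingNoFactor and trainingFactor objects defined above.

# Illustrative comparison of penalty mixes; the lasso component drives the
# number of non-zero coefficients retained along the lambda path.
for (a in c(0, 0.5, 1)) {
  fit <- glmnet(x = trainingNoFactor, y = trainingFactor[, "classe"],
                family = "multinomial", alpha = a)
  cat("alpha =", a, "- non-zero variables at the smallest lambda:",
      tail(fit$df, 1), "\n")
}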

We first fit a model without cross validation, then plot this initial model for each level of the “classe” variable.

eNet <- glmnet(x = trainingNoFactor, y = trainingFactor[, "classe"],
               family = "multinomial", alpha = 0.5)
print(eNet)
## 
## Call:  glmnet(x = trainingNoFactor, y = trainingFactor[, "classe"],      family = "multinomial", alpha = 0.5) 
## 
##        Df       %Dev    Lambda
##   [1,]  0 -2.671e-13 3.551e-01
##   [2,]  1  4.698e-03 3.236e-01
##   [3,]  1  9.093e-03 2.948e-01
##   [4,]  1  1.318e-02 2.686e-01
##   [5,]  3  1.953e-02 2.448e-01
##   [6,]  3  2.752e-02 2.230e-01
##   [7,]  3  3.470e-02 2.032e-01
##   [8,]  4  4.130e-02 1.852e-01
##   [9,]  4  4.960e-02 1.687e-01
##  [10,]  6  5.977e-02 1.537e-01
##  [11,]  7  7.134e-02 1.401e-01
##  [12,]  9  8.266e-02 1.276e-01
##  [13,]  9  9.331e-02 1.163e-01
##  [14,] 11  1.043e-01 1.060e-01
##  [15,] 12  1.158e-01 9.654e-02
##  [16,] 14  1.269e-01 8.797e-02
##  [17,] 15  1.374e-01 8.015e-02
##  [18,] 18  1.473e-01 7.303e-02
##  [19,] 20  1.587e-01 6.654e-02
##  [20,] 22  1.698e-01 6.063e-02
##  [21,] 23  1.810e-01 5.525e-02
##  [22,] 24  1.914e-01 5.034e-02
##  [23,] 25  2.013e-01 4.587e-02
##  [24,] 26  2.114e-01 4.179e-02
##  [25,] 27  2.223e-01 3.808e-02
##  [26,] 28  2.327e-01 3.470e-02
##  [27,] 29  2.429e-01 3.161e-02
##  [28,] 30  2.538e-01 2.881e-02
##  [29,] 35  2.650e-01 2.625e-02
##  [30,] 37  2.765e-01 2.391e-02
##  [31,] 39  2.884e-01 2.179e-02
##  [32,] 42  2.998e-01 1.985e-02
##  [33,] 44  3.104e-01 1.809e-02
##  [34,] 46  3.209e-01 1.648e-02
##  [35,] 48  3.317e-01 1.502e-02
##  [36,] 49  3.425e-01 1.368e-02
##  [37,] 49  3.528e-01 1.247e-02
##  [38,] 49  3.628e-01 1.136e-02
##  [39,] 50  3.733e-01 1.035e-02
##  [40,] 51  3.835e-01 9.433e-03
##  [41,] 51  3.932e-01 8.595e-03
##  [42,] 51  4.027e-01 7.831e-03
##  [43,] 51  4.123e-01 7.135e-03
##  [44,] 51  4.213e-01 6.501e-03
##  [45,] 51  4.299e-01 5.924e-03
##  [46,] 52  4.382e-01 5.398e-03
##  [47,] 52  4.462e-01 4.918e-03
##  [48,] 52  4.539e-01 4.481e-03
##  [49,] 51  4.612e-01 4.083e-03
##  [50,] 51  4.683e-01 3.720e-03
##  [51,] 51  4.752e-01 3.390e-03
##  [52,] 51  4.816e-01 3.089e-03
##  [53,] 51  4.878e-01 2.814e-03
##  [54,] 51  4.937e-01 2.564e-03
##  [55,] 51  4.991e-01 2.337e-03
##  [56,] 51  5.042e-01 2.129e-03
##  [57,] 51  5.090e-01 1.940e-03
##  [58,] 51  5.134e-01 1.767e-03
##  [59,] 52  5.177e-01 1.610e-03
##  [60,] 52  5.217e-01 1.467e-03
##  [61,] 52  5.255e-01 1.337e-03
##  [62,] 52  5.290e-01 1.218e-03
##  [63,] 52  5.325e-01 1.110e-03
##  [64,] 52  5.357e-01 1.011e-03
##  [65,] 53  5.386e-01 9.216e-04
##  [66,] 53  5.414e-01 8.397e-04
##  [67,] 53  5.440e-01 7.651e-04
##  [68,] 53  5.465e-01 6.971e-04
##  [69,] 53  5.488e-01 6.352e-04
##  [70,] 53  5.509e-01 5.788e-04
##  [71,] 53  5.529e-01 5.274e-04
##  [72,] 53  5.547e-01 4.805e-04
##  [73,] 53  5.565e-01 4.378e-04
##  [74,] 53  5.582e-01 3.989e-04
##  [75,] 53  5.597e-01 3.635e-04
##  [76,] 53  5.611e-01 3.312e-04
##  [77,] 53  5.625e-01 3.018e-04
##  [78,] 53  5.637e-01 2.750e-04
##  [79,] 53  5.648e-01 2.505e-04
##  [80,] 53  5.659e-01 2.283e-04
##  [81,] 53  5.669e-01 2.080e-04
##  [82,] 53  5.678e-01 1.895e-04
##  [83,] 53  5.687e-01 1.727e-04
##  [84,] 53  5.695e-01 1.573e-04
##  [85,] 53  5.702e-01 1.434e-04
##  [86,] 53  5.708e-01 1.306e-04
##  [87,] 53  5.715e-01 1.190e-04
##  [88,] 53  5.720e-01 1.085e-04
##  [89,] 53  5.725e-01 9.882e-05
##  [90,] 53  5.730e-01 9.004e-05
##  [91,] 53  5.734e-01 8.204e-05
##  [92,] 53  5.739e-01 7.475e-05
##  [93,] 53  5.742e-01 6.811e-05
##  [94,] 53  5.745e-01 6.206e-05
##  [95,] 53  5.749e-01 5.655e-05
##  [96,] 53  5.752e-01 5.152e-05
##  [97,] 53  5.754e-01 4.695e-05
##  [98,] 53  5.756e-01 4.278e-05
##  [99,] 53  5.758e-01 3.898e-05
## [100,] 53  5.760e-01 3.551e-05
plot(eNet, label = TRUE)

In the plot, each curve traces the path of a single variable’s coefficient across the sequence of lambda values. Across the levels of the “classe” variable, variables 6, 7 and 8 (gyros_belt_x, gyros_belt_y and gyros_belt_z, i.e. the gyroscope readings from the sensor on participants’ belts) appear to have a large impact.
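
To make this reading of the plot more concrete, the coefficients at one lambda value can be extracted directly from the fitted object. The sketch below is illustrative only; the 50th lambda in the path is an arbitrary choice, and class “A” is used as an example.

# Extract the per-class coefficient matrices at one lambda from the path and
# list, for class "A", the predictors with the largest absolute coefficients.
coefs <- coef(eNet, s = eNet$lambda[50])                # list: one sparse matrix per class
coefsA <- as.matrix(coefs[["A"]])[-1, , drop = FALSE]   # drop the intercept row
head(sort(abs(coefsA[, 1]), decreasing = TRUE))         # top predictors for class "A"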

Cross Validation

In order to choose from the sequence of models with different lambda values produced by glmnet, we perform cross-validation using the cv.glmnet function, with misclassification error as the criterion in a 5-fold procedure.

The lambda value that yields the minimum cross-validation error is returned and marked by a dotted vertical line on the plot of the model’s misclassification error. The optimal lambda is very small, indicating that cross-validation favours only a light penalty.

eNet.cv <- cv.glmnet(x = trainingNoFactor, y = trainingFactor[, "classe"],
                     family = "multinomial", alpha = 0.5, nfolds = 5,
                     type.measure = "class")
eNet.cv$lambda.min
## [1] 5.654657e-05
plot(eNet.cv)

Prediction

Generalized Linear Model

We begin by generating class predictions from the cross-validated elastic net model on the training data and comparing them with the observed classes.

# In-sample class predictions at lambda.min, compared against the observed classes
sample.error <- predict(eNet.cv, newx = trainingNoFactor, type = "class", s = "lambda.min")
confMat <- confusionMatrix(sample.error, trainingFactor$classe)
print(confMat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4865  447  318  217  158
##          B  188 2578  300  149  407
##          C  232  363 2429  382  229
##          D  232  125  205 2314  263
##          E   63  284  170  154 2550
## 
## Overall Statistics
##                                          
##                Accuracy : 0.751          
##                  95% CI : (0.7449, 0.757)
##     No Information Rate : 0.2844         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6842         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8719   0.6790   0.7098   0.7195   0.7070
## Specificity            0.9188   0.9340   0.9256   0.9497   0.9581
## Pos Pred Value         0.8102   0.7118   0.6682   0.7372   0.7917
## Neg Pred Value         0.9475   0.9238   0.9379   0.9453   0.9356
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2479   0.1314   0.1238   0.1179   0.1300
## Detection Prevalence   0.3060   0.1846   0.1853   0.1600   0.1642
## Balanced Accuracy      0.8953   0.8065   0.8177   0.8346   0.8325

An initial estimate of the model’s performance can be obtained by comparing the values predicted by the cross-validated model with the actual observed values of the “classe” variable. Because these are predictions on the training data, the estimate is optimistic, and the out-of-sample error is likely to be somewhat higher. The confusion matrix generated from this comparison shows an overall accuracy of 75.1% with a Kappa of 0.684.
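
A more direct check of out-of-sample accuracy, not performed in the original analysis, would be a simple hold-out split using caret’s createDataPartition; the sketch below assumes the cleaned objects defined above and refits the elastic net on 75% of the rows.

# Hypothetical hold-out check: refit on 75% of the rows, score on the rest.
set.seed(1234)
inTrain <- createDataPartition(trainingFactor$classe, p = 0.75, list = FALSE)[, 1]
fitHold <- cv.glmnet(x = trainingNoFactor[inTrain, ],
                     y = trainingFactor$classe[inTrain],
                     family = "multinomial", alpha = 0.5, nfolds = 5,
                     type.measure = "class")
predHold <- predict(fitHold, newx = trainingNoFactor[-inTrain, ],
                    type = "class", s = "lambda.min")
mean(as.vector(predHold) == trainingFactor$classe[-inTrain])   # held-out accuracy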

Random Forest Model

To get an idea of the effectiveness of the generalized linear model, we compare it here to a random forest approach. Random forests are known to be quite accurate, if less readily interpretable and sometimes prone to overfitting, in comparison with other ML approaches.

RFmodel <- randomForest(trainingFactor$classe ~ ., data = trainingNoFactor)
print(RFmodel)
## 
## Call:
##  randomForest(formula = trainingFactor$classe ~ ., data = trainingNoFactor) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.13%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 5579    0    0    0    1 0.0001792115
## B    2 3794    1    0    0 0.0007900974
## C    0    5 3416    1    0 0.0017533606
## D    0    0   10 3205    1 0.0034203980
## E    0    0    0    5 3602 0.0013861935

Our confusion matrix here looks considerably more encouraging, with an out-of-bag (OOB) error estimate of only 0.13%.
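
Although the random forest is more of a black box, its variable importance measures offer some insight into which predictors drive the fit. This is an illustrative addition, using the Gini importance computed by default by randomForest.

# List the predictors with the largest mean decrease in Gini impurity and
# plot the ten most important ones.
imp <- importance(RFmodel)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE])
varImpPlot(RFmodel, n.var = 10)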

At this point, we are ready to predict the classes for the actual testing data. First, we perform the same operations on the testing data that were performed on the training data.

vars2remove <- c(1, 3:5)
varsNA <- which(colMeans(is.na(testing)) > .95)
testing_clean <- testing[, -c(vars2remove, varsNA)]

testing_clean$problem_id <- factor(testing_clean$problem_id)   # so it is split off with the other factor variables
varsFactor <- which(sapply(testing_clean, FUN = class) == "factor")
testingFactor <- testing_clean[, varsFactor]
testingNoFactor <- data.matrix(testing_clean[, -varsFactor])

GLM_predics <- predict(eNet.cv, newx = testingNoFactor, type = "class", s = "lambda.min")
RF_predicts <- predict(RFmodel, testingNoFactor, type = "class")
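
As a quick, purely illustrative check, the two sets of test-set predictions can be cross-tabulated to see how often the models agree on the 20 test cases.

# Cross-tabulate the two prediction vectors and report their agreement rate.
table(GLM = as.vector(GLM_predics), RF = RF_predicts)
mean(as.vector(GLM_predics) == as.character(RF_predicts))   # proportion of agreement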

Conclusion

The random forest model yields considerably higher accuracy, but is more of a “black box” than the generalized linear model. Because the task at hand emphasizes accurate predictions, the RF model is preferable. If the research question had concerned the role of the covariates, the GLM might have been the preferred approach.