This project addresses a sample machine learning problem: predicting whether barbell lifts were performed correctly or incorrectly using personal activity data.
The present analysis uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The barbell lifts were performed in 5 different ways (one correct and four incorrect), which we will attempt to classify from the sensor data. The 5 lift types are coded in the data as the “classe” variable.
Initially, we download and inspect the data.
require("downloader"); require("caret"); require("ggplot2"); require("glmnet"); require("randomForest")
## Loading required package: downloader
## Loading required package: caret
## Warning: package 'caret' was built under R version 3.1.2
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 1.9-8
##
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
dataset_url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
dataset_url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download(dataset_url_train, dest = "training.csv")
download(dataset_url_test, dest = "testing.csv")
training <- read.csv("training.csv", header = TRUE, na.strings = c("", "NA", "#DIV/0!"))
testing <- read.csv("testing.csv")
A cursory examination of the training data reveals that it begins with an index variable and a series of timestamps. Because these are either not interpretable (the timestamps) or not useful for model building (the index variable), they are removed at the outset.
We also see that a number of the potential covariates contain negligible amounts of actual data. We therefore remove variables in which more than 95% of the observations are missing (blank, NA, or #DIV/0!, all of which were read in as NA above).
vars2remove <- c(1, 3:5)                             # index variable and the timestamp columns
varsNA <- which(colMeans(is.na(training)) > .95)     # variables that are more than 95% NA
training_clean <- training[, -c(vars2remove, varsNA)]
Next, we break the training set up into factor and non-factor variables to facilitate model building.
varsFactor <- which(sapply(training_clean, FUN = class) == "factor")   # user_name, new_window, classe
trainingFactor <- training_clean[, varsFactor]
trainingNoFactor <- as.matrix(training_clean[, -varsFactor])           # numeric predictors only
For this multinomial classification problem, we compare two approaches: a penalized generalized linear model (GLM), specifically a multinomial logistic regression fit by penalized maximum likelihood with cross-validation used to select the penalty, and a random forest (RF) model. The GLM is flexible, readily interpretable and, with penalization, resistant to overfitting (Fox, 2008). The RF model’s main asset is its accuracy.
For the GLM, we opt for an elastic net penalty (by specifying alpha = 0.5 in the glmnet call) in order to achieve a compromise between the ridge (alpha = 0) and lasso (alpha = 1) penalties.
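The mixing parameter alpha is fixed at 0.5 here rather than tuned. If one wanted to check that choice, a small grid search using cv.glmnet (introduced below to select lambda) would be one option. The following is a sketch only; the seed, the fold assignment via foldid and the grid of alpha values are illustrative choices, not part of the original analysis.
# compare cross-validated misclassification error across a small grid of alpha values
set.seed(123)
foldid <- sample(rep(1:5, length.out = nrow(trainingNoFactor)))   # fixed folds so the fits are comparable
alphas <- c(0, 0.5, 1)                                            # ridge, elastic net, lasso
cvErrors <- sapply(alphas, function(a) {
    fit <- cv.glmnet(x = trainingNoFactor, y = trainingFactor[, "classe"],
                     family = "multinomial", alpha = a, foldid = foldid,
                     type.measure = "class")
    min(fit$cvm)                                                  # best misclassification error for this alpha
})
names(cvErrors) <- paste0("alpha = ", alphas)
cvErrors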
We first fit a model without cross validation, then plot this initial model for each level of the “classe” variable.
eNet <- glmnet(x = trainingNoFactor, y = trainingFactor[, "classe"],
               family = "multinomial", alpha = 0.5)
print(eNet)
##
## Call: glmnet(x = trainingNoFactor, y = trainingFactor[, "classe"], family = "multinomial", alpha = 0.5)
##
## Df %Dev Lambda
## [1,] 0 -2.671e-13 3.551e-01
## [2,] 1 4.698e-03 3.236e-01
## [3,] 1 9.093e-03 2.948e-01
## [4,] 1 1.318e-02 2.686e-01
## [5,] 3 1.953e-02 2.448e-01
## [6,] 3 2.752e-02 2.230e-01
## [7,] 3 3.470e-02 2.032e-01
## [8,] 4 4.130e-02 1.852e-01
## [9,] 4 4.960e-02 1.687e-01
## [10,] 6 5.977e-02 1.537e-01
## [11,] 7 7.134e-02 1.401e-01
## [12,] 9 8.266e-02 1.276e-01
## [13,] 9 9.331e-02 1.163e-01
## [14,] 11 1.043e-01 1.060e-01
## [15,] 12 1.158e-01 9.654e-02
## [16,] 14 1.269e-01 8.797e-02
## [17,] 15 1.374e-01 8.015e-02
## [18,] 18 1.473e-01 7.303e-02
## [19,] 20 1.587e-01 6.654e-02
## [20,] 22 1.698e-01 6.063e-02
## [21,] 23 1.810e-01 5.525e-02
## [22,] 24 1.914e-01 5.034e-02
## [23,] 25 2.013e-01 4.587e-02
## [24,] 26 2.114e-01 4.179e-02
## [25,] 27 2.223e-01 3.808e-02
## [26,] 28 2.327e-01 3.470e-02
## [27,] 29 2.429e-01 3.161e-02
## [28,] 30 2.538e-01 2.881e-02
## [29,] 35 2.650e-01 2.625e-02
## [30,] 37 2.765e-01 2.391e-02
## [31,] 39 2.884e-01 2.179e-02
## [32,] 42 2.998e-01 1.985e-02
## [33,] 44 3.104e-01 1.809e-02
## [34,] 46 3.209e-01 1.648e-02
## [35,] 48 3.317e-01 1.502e-02
## [36,] 49 3.425e-01 1.368e-02
## [37,] 49 3.528e-01 1.247e-02
## [38,] 49 3.628e-01 1.136e-02
## [39,] 50 3.733e-01 1.035e-02
## [40,] 51 3.835e-01 9.433e-03
## [41,] 51 3.932e-01 8.595e-03
## [42,] 51 4.027e-01 7.831e-03
## [43,] 51 4.123e-01 7.135e-03
## [44,] 51 4.213e-01 6.501e-03
## [45,] 51 4.299e-01 5.924e-03
## [46,] 52 4.382e-01 5.398e-03
## [47,] 52 4.462e-01 4.918e-03
## [48,] 52 4.539e-01 4.481e-03
## [49,] 51 4.612e-01 4.083e-03
## [50,] 51 4.683e-01 3.720e-03
## [51,] 51 4.752e-01 3.390e-03
## [52,] 51 4.816e-01 3.089e-03
## [53,] 51 4.878e-01 2.814e-03
## [54,] 51 4.937e-01 2.564e-03
## [55,] 51 4.991e-01 2.337e-03
## [56,] 51 5.042e-01 2.129e-03
## [57,] 51 5.090e-01 1.940e-03
## [58,] 51 5.134e-01 1.767e-03
## [59,] 52 5.177e-01 1.610e-03
## [60,] 52 5.217e-01 1.467e-03
## [61,] 52 5.255e-01 1.337e-03
## [62,] 52 5.290e-01 1.218e-03
## [63,] 52 5.325e-01 1.110e-03
## [64,] 52 5.357e-01 1.011e-03
## [65,] 53 5.386e-01 9.216e-04
## [66,] 53 5.414e-01 8.397e-04
## [67,] 53 5.440e-01 7.651e-04
## [68,] 53 5.465e-01 6.971e-04
## [69,] 53 5.488e-01 6.352e-04
## [70,] 53 5.509e-01 5.788e-04
## [71,] 53 5.529e-01 5.274e-04
## [72,] 53 5.547e-01 4.805e-04
## [73,] 53 5.565e-01 4.378e-04
## [74,] 53 5.582e-01 3.989e-04
## [75,] 53 5.597e-01 3.635e-04
## [76,] 53 5.611e-01 3.312e-04
## [77,] 53 5.625e-01 3.018e-04
## [78,] 53 5.637e-01 2.750e-04
## [79,] 53 5.648e-01 2.505e-04
## [80,] 53 5.659e-01 2.283e-04
## [81,] 53 5.669e-01 2.080e-04
## [82,] 53 5.678e-01 1.895e-04
## [83,] 53 5.687e-01 1.727e-04
## [84,] 53 5.695e-01 1.573e-04
## [85,] 53 5.702e-01 1.434e-04
## [86,] 53 5.708e-01 1.306e-04
## [87,] 53 5.715e-01 1.190e-04
## [88,] 53 5.720e-01 1.085e-04
## [89,] 53 5.725e-01 9.882e-05
## [90,] 53 5.730e-01 9.004e-05
## [91,] 53 5.734e-01 8.204e-05
## [92,] 53 5.739e-01 7.475e-05
## [93,] 53 5.742e-01 6.811e-05
## [94,] 53 5.745e-01 6.206e-05
## [95,] 53 5.749e-01 5.655e-05
## [96,] 53 5.752e-01 5.152e-05
## [97,] 53 5.754e-01 4.695e-05
## [98,] 53 5.756e-01 4.278e-05
## [99,] 53 5.758e-01 3.898e-05
## [100,] 53 5.760e-01 3.551e-05
plot(eNet, label = TRUE)
In the plot, each curve traces a single variable’s coefficient as the penalty is relaxed, plotted (by default) against the L1 norm of the whole coefficient vector, i.e. across the sequence of lambda values. Across the levels of the “classe” variable, variables 6, 7 and 8 (gyros_belt_x, gyros_belt_y and gyros_belt_z, the gyroscope readings from the belt sensor) appear to have a large impact.
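Rather than reading coefficients off the plot, they can also be extracted directly at a chosen value of lambda. This is a sketch only; the lambda value used is illustrative.
# column names of the influential predictors identified above
colnames(trainingNoFactor)[6:8]
# per-class coefficient matrices at an illustrative lambda; keep only the nonzero entries
coefList <- coef(eNet, s = 0.01)
lapply(coefList, function(m) m[m[, 1] != 0, , drop = FALSE])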
In order to choose from the sequence of models at different lambda values provided by glmnet, we perform 5-fold cross-validation using the cv.glmnet function, with misclassification error as the criterion.
The lambda value that yields the minimum cross-validation error is returned and is marked on the plot of misclassification error by the left-hand dotted vertical line (the right-hand line marks lambda.1se, the largest lambda within one standard error of the minimum). The cross-validation error at this optimal lambda provides a first estimate of the model’s out-of-sample misclassification rate.
eNet.cv <- cv.glmnet(x = trainingNoFactor, y = trainingFactor[, "classe"],
                     family = "multinomial", alpha = 0.5, nfolds = 5,
                     type.measure = "class")
eNet.cv$lambda.min
## [1] 5.654657e-05
plot(eNet.cv)
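The cross-validated misclassification rate at lambda.min can also be read directly from the fitted object (a brief sketch; the output is not reproduced here):
# cross-validated misclassification error at the selected lambda
eNet.cv$cvm[which(eNet.cv$lambda == eNet.cv$lambda.min)]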
With the cross-validated GLM in hand, we generate class predictions on the training data and compare them with the observed classes.
trainPreds <- predict(eNet.cv, newx = trainingNoFactor, type = "class", s = "lambda.min")
confMat <- confusionMatrix(factor(trainPreds, levels = levels(trainingFactor$classe)),
                           trainingFactor$classe)
print(confMat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4865 447 318 217 158
## B 188 2578 300 149 407
## C 232 363 2429 382 229
## D 232 125 205 2314 263
## E 63 284 170 154 2550
##
## Overall Statistics
##
## Accuracy : 0.751
## 95% CI : (0.7449, 0.757)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6842
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8719 0.6790 0.7098 0.7195 0.7070
## Specificity 0.9188 0.9340 0.9256 0.9497 0.9581
## Pos Pred Value 0.8102 0.7118 0.6682 0.7372 0.7917
## Neg Pred Value 0.9475 0.9238 0.9379 0.9453 0.9356
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2479 0.1314 0.1238 0.1179 0.1300
## Detection Prevalence 0.3060 0.1846 0.1853 0.1600 0.1642
## Balanced Accuracy 0.8953 0.8065 0.8177 0.8346 0.8325
Comparing the fitted model’s class predictions with the actual observed values of the “classe” variable yields the confusion matrix above, with an overall accuracy of about 75.1% and a Kappa of 0.684. Because these predictions are made on the same data used to fit the model, this is an in-sample figure and therefore an optimistic one; the 5-fold cross-validation misclassification error at lambda.min gives the more honest estimate of out-of-sample error.
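An alternative way to estimate out-of-sample error (not part of the original analysis and shown only as a sketch, with an arbitrary seed and a 70/30 split) is to hold out a validation set before fitting and score the model on it, using caret’s createDataPartition:
# hold out 30% of the training rows as a validation set (illustrative only)
set.seed(321)
inTrain <- createDataPartition(trainingFactor$classe, p = 0.7, list = FALSE)
fitCV <- cv.glmnet(x = trainingNoFactor[inTrain, ], y = trainingFactor$classe[inTrain],
                   family = "multinomial", alpha = 0.5, nfolds = 5, type.measure = "class")
valPreds <- predict(fitCV, newx = trainingNoFactor[-inTrain, ], type = "class", s = "lambda.min")
confusionMatrix(factor(valPreds, levels = levels(trainingFactor$classe)),
                trainingFactor$classe[-inTrain])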
To gauge the effectiveness of the generalized linear model, we compare it to a random forest approach. Random forests are known to be quite accurate, if less readily interpretable and sometimes prone to overfitting, in comparison to other machine learning approaches.
RFmodel <- randomForest(trainingFactor$classe ~ ., data = trainingNoFactor)
print(RFmodel)
##
## Call:
## randomForest(formula = trainingFactor$classe ~ ., data = trainingNoFactor)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.13%
## Confusion matrix:
## A B C D E class.error
## A 5579 0 0 0 1 0.0001792115
## B 2 3794 1 0 0 0.0007900974
## C 0 5 3416 1 0 0.0017533606
## D 0 0 10 3205 1 0.0034203980
## E 0 0 0 5 3602 0.0013861935
The out-of-bag (OOB) confusion matrix looks far more encouraging, with an estimated error rate of only 0.13%, compared with roughly 25% in-sample misclassification for the GLM.
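Although the random forest is harder to interpret, its variable importance measures give some insight into which predictors drive this accuracy (a brief sketch; the output and plot are not reproduced here):
# mean decrease in Gini impurity for each predictor; larger values indicate more influential variables
imp <- importance(RFmodel)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)
varImpPlot(RFmodel, n.var = 10)   # plot the ten most important predictors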
At this point, we are ready to predict the classes for the actual testing data. First, we perform the same operations on the testing data that were performed on the training data.
vars2remove <- c(1, 3:5)
varsNA <- which(colMeans(is.na(testing)) > .95)
testing_clean <- testing[, -c(vars2remove, varsNA)]
testing_clean$problem_id <- factor(testing_clean$problem_id)   # problem_id replaces classe in the test set
varsFactor <- which(sapply(testing_clean, FUN = class) == "factor")
testingFactor <- testing_clean[, varsFactor]
testingNoFactor <- data.matrix(testing_clean[, -varsFactor])
GLM_predicts <- predict(eNet.cv, newx = testingNoFactor, type = "class", s = "lambda.min")
RF_predicts <- predict(RFmodel, testingNoFactor, type = "class")
The random forest model is considerably more accurate, but it is more of a “black box” than the generalized linear model. Because the task at hand emphasizes accurate prediction, the RF model is preferable here. If the research question had instead concerned the role of individual covariates, the GLM might have been the preferred approach.
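As a final check (a sketch only; the agreement between the two models is not reported in the original analysis), the two sets of test-set predictions can be compared side by side:
# side-by-side comparison of the GLM and RF predictions for the 20 test cases
comparison <- data.frame(problem_id = testing_clean$problem_id,
                         GLM = as.vector(GLM_predicts),
                         RF = as.character(RF_predicts))
comparison
mean(comparison$GLM == comparison$RF)   # proportion of test cases on which the two models agree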