knitr::opts_chunk$set(echo = TRUE)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha

Summary

This was for a machine learning class I recently completed. The proliferation of devices like Fitbit, Apple iWatches, etc, allows us to collect large amounts of data on personal activities. Many people (including yours truly), rarely take our devices off. While much of the research in this area is focused on quantifying how much we do particular activities, there is little data quantifying how well we do those activities.

In this analysis, I use data from the weight lifting exercise data set provided by Velloso et al. (2013) to quantify how well six volunteers lift dumbbells. The volunteers were instructed to do one set of ten repetitions for each of five different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A is the proper way of lifting dumbbells, the other four are common mistakes.

Each volunteer wore fitness devises on their biceps, forearms, hips, and their dumbbells during the exercise. An experienced weightlifter supervised each volunteer to ensure each lift was done according to the class they were simulating. The goal of the assignment was to a) correctly classify each category of exercise using the accelerometer data and b) correctly classify the exercise categories of a second test data set using accelerometer data alone.

Data preparation

I downloaded the data using the links provided.

training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160

The training data set includes 19,622 observations in each of 160 columns whereas the testing data includes 20 observations in 160 columns. I then excluded variables that were mostly NA, removed metadata, and further excluded variables with little variation.

training <- training[ , -c(1:7)]
training <- training[ , colMeans(is.na(training)) < 0.9]
near_zero <- nearZeroVar(training)
training <- training[ , -near_zero]
dim(training)
## [1] 19622    53

I then split the training data set into trainer and validation data sets, allowing the “testing” data to be left for the Course Prediction Quiz.

training$classe <- as.factor(training$classe)
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
trainer <- training[inTrain, ]
validation <- training[-inTrain, ]

Analysis

I used the caret package to fit random forest (rf), lda, and gradient boosting machine (gbm) and the kernlab package to fit support vector machine (svm) models to the trainer data. For my final model, I combined predictors from the individual model predictions. Cross-validation is done using 3-fold cross validation. The accuracy of each model was assessed on the validation data using a confusion matrix.

Random Forest

fit_rf <- train(classe~., data = trainer, method = "rf", trControl = trainControl(method = "cv", number = 3, verboseIter = FALSE), prox = TRUE)
pred_rf <- predict(fit_rf, validation)
confusionMatrix(pred_rf, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    4    0    0    0
##          B    1 1132    5    0    0
##          C    0    3 1020   19    1
##          D    0    0    1  944    2
##          E    0    0    0    1 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9937          
##                  95% CI : (0.9913, 0.9956)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.992           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9939   0.9942   0.9793   0.9972
## Specificity            0.9991   0.9987   0.9953   0.9994   0.9998
## Pos Pred Value         0.9976   0.9947   0.9779   0.9968   0.9991
## Neg Pred Value         0.9998   0.9985   0.9988   0.9959   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1924   0.1733   0.1604   0.1833
## Detection Prevalence   0.2850   0.1934   0.1772   0.1609   0.1835
## Balanced Accuracy      0.9992   0.9963   0.9947   0.9893   0.9985

Linear Discriminant Analysis

fit_lda <- train(classe~., data = trainer, method = "lda", trControl = trainControl(method = "cv", number = 3, verboseIter = FALSE), prox = TRUE, na.action = na.omit)
pred_lda <- predict(fit_lda, validation)
confusionMatrix(pred_lda, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1376  181   97   54   30
##          B   43  700   94   36  183
##          C  146  161  690  142   97
##          D  104   46  123  695  101
##          E    5   51   22   37  671
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7021          
##                  95% CI : (0.6903, 0.7138)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6231          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8220   0.6146   0.6725   0.7210   0.6201
## Specificity            0.9140   0.9250   0.8876   0.9240   0.9761
## Pos Pred Value         0.7917   0.6629   0.5583   0.6501   0.8537
## Neg Pred Value         0.9281   0.9091   0.9277   0.9441   0.9194
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2338   0.1189   0.1172   0.1181   0.1140
## Detection Prevalence   0.2953   0.1794   0.2100   0.1816   0.1336
## Balanced Accuracy      0.8680   0.7698   0.7801   0.8225   0.7981

Gradient Boosting Machines

fit_gbm <- train(classe~., data = trainer, method = "gbm", trControl = trainControl(method = "cv", number = 3, verboseIter = FALSE), verbose = FALSE, tuneLength = 5)
pred_gbm <- predict(fit_gbm, validation)
confusionMatrix(pred_gbm, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    8    0    0    0
##          B    1 1122    8    0    2
##          C    2    9 1012   10    5
##          D    0    0    6  951    6
##          E    0    0    0    3 1069
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9898          
##                  95% CI : (0.9869, 0.9922)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9871          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9851   0.9864   0.9865   0.9880
## Specificity            0.9981   0.9977   0.9946   0.9976   0.9994
## Pos Pred Value         0.9952   0.9903   0.9750   0.9875   0.9972
## Neg Pred Value         0.9993   0.9964   0.9971   0.9974   0.9973
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2839   0.1907   0.1720   0.1616   0.1816
## Detection Prevalence   0.2853   0.1925   0.1764   0.1636   0.1822
## Balanced Accuracy      0.9982   0.9914   0.9905   0.9920   0.9937

Support Vector Machine

fit_svm <- train(classe~., data = trainer, method = "svmLinear", trControl = trainControl(method = "cv", number = 3, verboseIter = FALSE), verbose = FALSE, tuneLength = 5)
pred_svm <- predict(fit_svm, validation)
confusionMatrix(pred_svm, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1538  158   89   60   54
##          B   33  789   87   33  153
##          C   49   84  800  118   80
##          D   48   23   26  718   63
##          E    6   85   24   35  732
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7777          
##                  95% CI : (0.7669, 0.7883)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7175          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9188   0.6927   0.7797   0.7448   0.6765
## Specificity            0.9143   0.9355   0.9319   0.9675   0.9688
## Pos Pred Value         0.8099   0.7205   0.7073   0.8178   0.8299
## Neg Pred Value         0.9659   0.9269   0.9525   0.9509   0.9300
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2613   0.1341   0.1359   0.1220   0.1244
## Detection Prevalence   0.3227   0.1861   0.1922   0.1492   0.1499
## Balanced Accuracy      0.9165   0.8141   0.8558   0.8561   0.8226

Combined Model

pred_df <- data.frame(pred_rf, pred_lda, pred_gbm, pred_svm, classe = validation$classe)
fit_combo <- train(classe~., method = "rf", data = pred_df)
pred_combo <- predict(fit_combo, pred_df)
confusionMatrix(pred_combo, pred_df$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    4    0    0    0
##          B    0 1132    3    0    0
##          C    0    3 1018    7    1
##          D    0    0    5  956    2
##          E    0    0    0    1 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9956          
##                  95% CI : (0.9935, 0.9971)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9944          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9939   0.9922   0.9917   0.9972
## Specificity            0.9991   0.9994   0.9977   0.9986   0.9998
## Pos Pred Value         0.9976   0.9974   0.9893   0.9927   0.9991
## Neg Pred Value         1.0000   0.9985   0.9984   0.9984   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1924   0.1730   0.1624   0.1833
## Detection Prevalence   0.2851   0.1929   0.1749   0.1636   0.1835
## Balanced Accuracy      0.9995   0.9966   0.9950   0.9951   0.9985

Overall accuracy and out of sample error rate

Model Accuracy OoS Error
RF 0.9941 0.0059
LDA 0.6955 0.3045
GBM 0.9901 0.0099
SVM 0.7794 0.2206
Combo 0.9958 0.0042

The combined model gave the best predictions, with a 99.58% accuracy rate and only a 0.42% out of sample error rate.

Prediction on the testing data set

Finally, I ran the Combo model on the testing data set originally set aside at the start of this analysis and printed out the resultant predictions of the category of exercise recorded in each row, whether correct dumbbell lifting or one of the four mistake categories.

pred_rf_final <- predict(fit_rf, testing)
pred_lda_final <- predict(fit_lda, testing)
pred_gbm_final <- predict(fit_gbm, testing)
pred_svm_final <- predict(fit_svm, testing)
pred_final_df <- data.frame(pred_rf = pred_rf_final, pred_lda = pred_lda_final, pred_gbm = pred_gbm_final, pred_svm = pred_svm_final)
combo_pred_final <- predict(fit_combo, pred_final_df)
combo_pred_final
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

At the end of the assignment, the correct categories for the test data set were revealed. My predictions were correct.

Literature Cited

Velloso, E., A. Bulling, H. Gellersen, W. Ugulino, and H. Fuks. 2013. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013. http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har