knitr::opts_chunk$set(echo = TRUE)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
This was for a machine learning class I recently completed. The proliferation of devices like Fitbit, Apple iWatches, etc, allows us to collect large amounts of data on personal activities. Many people (including yours truly), rarely take our devices off. While much of the research in this area is focused on quantifying how much we do particular activities, there is little data quantifying how well we do those activities.
In this analysis, I use data from the weight lifting exercise data set provided by Velloso et al. (2013) to quantify how well six volunteers lift dumbbells. The volunteers were instructed to do one set of ten repetitions for each of five different ways: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A is the proper way of lifting dumbbells, the other four are common mistakes.
Each volunteer wore fitness devises on their biceps, forearms, hips, and their dumbbells during the exercise. An experienced weightlifter supervised each volunteer to ensure each lift was done according to the class they were simulating. The goal of the assignment was to a) correctly classify each category of exercise using the accelerometer data and b) correctly classify the exercise categories of a second test data set using accelerometer data alone.
I downloaded the data using the links provided.
training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
The training data set includes 19,622 observations in each of 160 columns whereas the testing data includes 20 observations in 160 columns. I then excluded variables that were mostly NA, removed metadata, and further excluded variables with little variation.
training <- training[ , -c(1:7)]
training <- training[ , colMeans(is.na(training)) < 0.9]
near_zero <- nearZeroVar(training)
training <- training[ , -near_zero]
dim(training)
## [1] 19622 53
I then split the training data set into trainer and validation data sets, allowing the “testing” data to be left for the Course Prediction Quiz.
training$classe <- as.factor(training$classe)
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
trainer <- training[inTrain, ]
validation <- training[-inTrain, ]
I used the caret package to fit random forest (rf), lda, and gradient boosting machine (gbm) and the kernlab package to fit support vector machine (svm) models to the trainer data. For my final model, I combined predictors from the individual model predictions. Cross-validation is done using 3-fold cross validation. The accuracy of each model was assessed on the validation data using a confusion matrix.
fit_rf <- train(classe~., data = trainer, method = "rf", trControl = trainControl(method = "cv", number = 3, verboseIter = FALSE), prox = TRUE)
pred_rf <- predict(fit_rf, validation)
confusionMatrix(pred_rf, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 4 0 0 0
## B 1 1132 5 0 0
## C 0 3 1020 19 1
## D 0 0 1 944 2
## E 0 0 0 1 1079
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.9913, 0.9956)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.992
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9939 0.9942 0.9793 0.9972
## Specificity 0.9991 0.9987 0.9953 0.9994 0.9998
## Pos Pred Value 0.9976 0.9947 0.9779 0.9968 0.9991
## Neg Pred Value 0.9998 0.9985 0.9988 0.9959 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1924 0.1733 0.1604 0.1833
## Detection Prevalence 0.2850 0.1934 0.1772 0.1609 0.1835
## Balanced Accuracy 0.9992 0.9963 0.9947 0.9893 0.9985
fit_lda <- train(classe~., data = trainer, method = "lda", trControl = trainControl(method = "cv", number = 3, verboseIter = FALSE), prox = TRUE, na.action = na.omit)
pred_lda <- predict(fit_lda, validation)
confusionMatrix(pred_lda, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1376 181 97 54 30
## B 43 700 94 36 183
## C 146 161 690 142 97
## D 104 46 123 695 101
## E 5 51 22 37 671
##
## Overall Statistics
##
## Accuracy : 0.7021
## 95% CI : (0.6903, 0.7138)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6231
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8220 0.6146 0.6725 0.7210 0.6201
## Specificity 0.9140 0.9250 0.8876 0.9240 0.9761
## Pos Pred Value 0.7917 0.6629 0.5583 0.6501 0.8537
## Neg Pred Value 0.9281 0.9091 0.9277 0.9441 0.9194
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2338 0.1189 0.1172 0.1181 0.1140
## Detection Prevalence 0.2953 0.1794 0.2100 0.1816 0.1336
## Balanced Accuracy 0.8680 0.7698 0.7801 0.8225 0.7981
fit_gbm <- train(classe~., data = trainer, method = "gbm", trControl = trainControl(method = "cv", number = 3, verboseIter = FALSE), verbose = FALSE, tuneLength = 5)
pred_gbm <- predict(fit_gbm, validation)
confusionMatrix(pred_gbm, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 8 0 0 0
## B 1 1122 8 0 2
## C 2 9 1012 10 5
## D 0 0 6 951 6
## E 0 0 0 3 1069
##
## Overall Statistics
##
## Accuracy : 0.9898
## 95% CI : (0.9869, 0.9922)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9871
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9851 0.9864 0.9865 0.9880
## Specificity 0.9981 0.9977 0.9946 0.9976 0.9994
## Pos Pred Value 0.9952 0.9903 0.9750 0.9875 0.9972
## Neg Pred Value 0.9993 0.9964 0.9971 0.9974 0.9973
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1907 0.1720 0.1616 0.1816
## Detection Prevalence 0.2853 0.1925 0.1764 0.1636 0.1822
## Balanced Accuracy 0.9982 0.9914 0.9905 0.9920 0.9937
fit_svm <- train(classe~., data = trainer, method = "svmLinear", trControl = trainControl(method = "cv", number = 3, verboseIter = FALSE), verbose = FALSE, tuneLength = 5)
pred_svm <- predict(fit_svm, validation)
confusionMatrix(pred_svm, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1538 158 89 60 54
## B 33 789 87 33 153
## C 49 84 800 118 80
## D 48 23 26 718 63
## E 6 85 24 35 732
##
## Overall Statistics
##
## Accuracy : 0.7777
## 95% CI : (0.7669, 0.7883)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7175
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9188 0.6927 0.7797 0.7448 0.6765
## Specificity 0.9143 0.9355 0.9319 0.9675 0.9688
## Pos Pred Value 0.8099 0.7205 0.7073 0.8178 0.8299
## Neg Pred Value 0.9659 0.9269 0.9525 0.9509 0.9300
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2613 0.1341 0.1359 0.1220 0.1244
## Detection Prevalence 0.3227 0.1861 0.1922 0.1492 0.1499
## Balanced Accuracy 0.9165 0.8141 0.8558 0.8561 0.8226
pred_df <- data.frame(pred_rf, pred_lda, pred_gbm, pred_svm, classe = validation$classe)
fit_combo <- train(classe~., method = "rf", data = pred_df)
pred_combo <- predict(fit_combo, pred_df)
confusionMatrix(pred_combo, pred_df$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 4 0 0 0
## B 0 1132 3 0 0
## C 0 3 1018 7 1
## D 0 0 5 956 2
## E 0 0 0 1 1079
##
## Overall Statistics
##
## Accuracy : 0.9956
## 95% CI : (0.9935, 0.9971)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9944
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9939 0.9922 0.9917 0.9972
## Specificity 0.9991 0.9994 0.9977 0.9986 0.9998
## Pos Pred Value 0.9976 0.9974 0.9893 0.9927 0.9991
## Neg Pred Value 1.0000 0.9985 0.9984 0.9984 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1924 0.1730 0.1624 0.1833
## Detection Prevalence 0.2851 0.1929 0.1749 0.1636 0.1835
## Balanced Accuracy 0.9995 0.9966 0.9950 0.9951 0.9985
| Model | Accuracy | OoS Error |
|---|---|---|
| RF | 0.9941 | 0.0059 |
| LDA | 0.6955 | 0.3045 |
| GBM | 0.9901 | 0.0099 |
| SVM | 0.7794 | 0.2206 |
| Combo | 0.9958 | 0.0042 |
The combined model gave the best predictions, with a 99.58% accuracy rate and only a 0.42% out of sample error rate.
Finally, I ran the Combo model on the testing data set originally set aside at the start of this analysis and printed out the resultant predictions of the category of exercise recorded in each row, whether correct dumbbell lifting or one of the four mistake categories.
pred_rf_final <- predict(fit_rf, testing)
pred_lda_final <- predict(fit_lda, testing)
pred_gbm_final <- predict(fit_gbm, testing)
pred_svm_final <- predict(fit_svm, testing)
pred_final_df <- data.frame(pred_rf = pred_rf_final, pred_lda = pred_lda_final, pred_gbm = pred_gbm_final, pred_svm = pred_svm_final)
combo_pred_final <- predict(fit_combo, pred_final_df)
combo_pred_final
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
At the end of the assignment, the correct categories for the test data set were revealed. My predictions were correct.
Velloso, E., A. Bulling, H. Gellersen, W. Ugulino, and H. Fuks. 2013. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013. http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har