Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The data for this project are kindly provided by:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H.: Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013 (https://dl.acm.org/doi/10.1145/2459236.2459256).
The Weight Lifting Exercises (WLE) dataset is used to investigate how well an activity is being performed. Six participants each performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
Class A - exactly according to the specification,
Class B - throwing the elbows to the front,
Class C - lifting the dumbbell only halfway,
Class D - lowering the dumbbell only halfway,
Class E - throwing the hips to the front.
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
A review of the training and test data sets indicates that there are 160 variables (37 character, 35 integer, 88 numeric) with 19622 and 20 observations, respectively. The last variable, "classe", indicates which of the 5 classes of exercise was performed on each trial. Tabulating the number of trials per class shows that the correct execution (A) was performed noticeably more often than any single incorrect variant, each of which occurred roughly equally often (about 5,500 class A repetitions versus about 3,500 for each of B–E).
library(caret)    # createDataPartition, train, trainControl, confusionMatrix
library(dplyr)    # count()
library(rattle)   # fancyRpartPlot()

trainInit <- read.csv("pml-training.csv")
testInit <- read.csv("pml-testing.csv")
dim(trainInit)   # 19622 observations of 160 variables
dim(testInit)    # 20 observations of 160 variables
table(sapply(trainInit, class))       # variable types
which(names(trainInit) == "classe")   # the outcome is the last column (160)
trainInit %>% count(classe, sort = TRUE)
The original data set contains 159 potential predictors, many of which are not appropriate for modeling. One possible strategy would be to eliminate variables that contain a large number of NAs or that have near-zero variance; this was attempted initially and did reduce the number of variables, but an alternative strategy was used in the end. Data collection occurred from 4 sensors attached to each subject (belt, arm, forearm, and dumbbell). Using grep with the pattern "belt", it was evident that the roll, pitch, and yaw measurements, together with the accelerometer, gyroscope, and magnetometer readings, were all numeric. Focusing on these raw variables (while ignoring the derived summary variables) gave a rich data set of 52 predictors.
# Select the raw sensor measurements by name prefix
predictorInd <- c(grep("^accel", names(trainInit)), grep("^gyros", names(trainInit)),
                  grep("^magnet", names(trainInit)), grep("^roll", names(trainInit)),
                  grep("^pitch", names(trainInit)), grep("^total", names(trainInit)),
                  grep("^yaw", names(trainInit)))
trainReduce <- trainInit[, c(predictorInd, 160)]   # column 160 is the outcome "classe"
names(trainReduce)
## [1] "accel_belt_x" "accel_belt_y" "accel_belt_z"
## [4] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [7] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [10] "accel_forearm_x" "accel_forearm_y" "accel_forearm_z"
## [13] "gyros_belt_x" "gyros_belt_y" "gyros_belt_z"
## [16] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [19] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [22] "gyros_forearm_x" "gyros_forearm_y" "gyros_forearm_z"
## [25] "magnet_belt_x" "magnet_belt_y" "magnet_belt_z"
## [28] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [31] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [34] "magnet_forearm_x" "magnet_forearm_y" "magnet_forearm_z"
## [37] "roll_belt" "roll_arm" "roll_dumbbell"
## [40] "roll_forearm" "pitch_belt" "pitch_arm"
## [43] "pitch_dumbbell" "pitch_forearm" "total_accel_belt"
## [46] "total_accel_arm" "total_accel_dumbbell" "total_accel_forearm"
## [49] "yaw_belt" "yaw_arm" "yaw_dumbbell"
## [52] "yaw_forearm" "classe"
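For reference, the initially attempted NA / near-zero-variance reduction could be sketched as follows; the 95% NA cutoff is an assumed value for illustration, not taken from the original analysis.

nzv <- nearZeroVar(trainInit)              # caret: indices of near-zero-variance columns
tmp <- trainInit[, -nzv]
mostlyNA <- sapply(tmp, function(x) mean(is.na(x)) > 0.95)   # assumed 95% NA cutoff
tmp <- tmp[, !mostlyNA]
dim(tmp)   # fewer variables, though the name-based selection above was preferred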
Although the data already come with a "test" set, that set has no class variable and is reserved for the final blind predictions. To estimate the out-of-sample error rate before making those 20 predictions, the training set was split into two parts: a smaller training set and a validation set. This 70%/30% split yielded data sets with 13737 and 5885 observations, respectively.
set.seed(1234)   # assumed seed for reproducibility; the original run did not record one
inTrain <- createDataPartition(y=trainReduce$classe, p=0.7, list=FALSE)
trainFinal <- trainReduce[inTrain,]    # 70% for model fitting
trainValid <- trainReduce[-inTrain,]   # 30% held out for validation
dim(trainFinal)
dim(trainValid)
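Since createDataPartition samples within each level of the outcome, the class proportions should be preserved in both pieces; a quick check:

round(prop.table(table(trainFinal$classe)), 3)   # should match...
round(prop.table(table(trainValid$classe)), 3)   # ...the full-set class proportions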
The first model applied to the data was a decision tree. The fit used 3-fold cross-validation and ran very quickly. Although this model is easy to understand and interpret (see the decision tree plot below), its accuracy was fairly poor, with only 54% of the validation-set cases classified correctly. The confusion matrix indicates that most of category A (correct form) was identified correctly, but many trials with incorrect form were misclassified as correct.
#Decision Tree Analysis
control <- trainControl(method="cv", number=3, verboseIter=FALSE)   # 3-fold CV, reused for all models
mod_tree <- train(classe~., data=trainFinal, method="rpart", trControl = control, tuneLength = 5)
fancyRpartPlot(mod_tree$finalModel)          # plot the fitted tree
pred_tree <- predict(mod_tree, trainValid)   # predict on the validation set
cnftree <- confusionMatrix(pred_tree, factor(trainValid$classe))
cnftree$table
## Reference
## Prediction A B C D E
## A 1510 472 486 425 156
## B 40 369 37 12 130
## C 93 110 416 137 141
## D 27 188 87 390 190
## E 4 0 0 0 465
cnftree$overall['Accuracy']
## Accuracy
## 0.5352591
The second model applied to the data set was Linear Discriminant Analysis (LDA), using the same 3-fold cross-validation. Once again the analysis ran quickly, but as with the decision tree, the accuracy of the model was relatively weak at 70%.
#Linear Discriminant Analysis
mod_lda <- train(classe~., data=trainFinal, method="lda", trControl = control)
pred_lda <- predict(mod_lda, trainValid)
cnflda <- confusionMatrix(pred_lda, factor(trainValid$classe))
cnflda$table
## Reference
## Prediction A B C D E
## A 1395 177 107 43 46
## B 38 723 109 42 188
## C 122 144 658 110 104
## D 114 43 117 728 109
## E 5 52 35 41 635
cnflda$overall['Accuracy']
## Accuracy
## 0.7033135
The third model was Quadratic Discriminant Analysis (QDA), which relaxes LDA's assumption of a common covariance matrix across classes and thereby allows quadratic decision boundaries. It used the same 3-fold cross-validation and took approximately the same time as the LDA, but produced much better predictions, with 89% accuracy.
#Quadratic Discriminant Analysis
mod_qda <- train(classe~., data=trainFinal, method="qda", trControl = control)
pred_qda <- predict(mod_qda, trainValid)
cnfqda <- confusionMatrix(pred_qda, factor(trainValid$classe))
cnfqda$table
## Reference
## Prediction A B C D E
## A 1553 51 2 2 1
## B 68 957 53 2 31
## C 21 113 958 123 50
## D 27 4 7 821 27
## E 5 14 6 16 973
cnfqda$overall['Accuracy']
## Accuracy
## 0.8941376
The fourth model applied to the data set was a Random Forest (with the same cross-validation), and performance improved dramatically. The analysis completed notably more slowly than the first three models, but accuracy on the validation set jumped to 99%, with very few cases misclassified.
#Random Forest Analysis
mod_rf <- train(classe~., data=trainFinal, method="rf", trControl = control)
pred_rf <- predict(mod_rf, trainValid)
cnfrf <- confusionMatrix(pred_rf, factor(trainValid$classe))
cnfrf$table
## Reference
## Prediction A B C D E
## A 1674 6 0 0 0
## B 0 1128 10 0 1
## C 0 5 1012 10 0
## D 0 0 4 954 4
## E 0 0 0 0 1077
cnfrf$overall['Accuracy']
## Accuracy
## 0.9932031
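To quantify the run-time gap noted above, each train() call can be wrapped in system.time(); a minimal sketch (it refits the models, and timings will vary by machine):

lda_time <- system.time(train(classe~., data=trainFinal, method="lda", trControl=control))
rf_time  <- system.time(train(classe~., data=trainFinal, method="rf", trControl=control))
c(LDA = lda_time[["elapsed"]], RF = rf_time[["elapsed"]])   # elapsed seconds per fit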
The last algorithm applied to the data set was a Generalized Boosted Model (GBM). It mirrored the Random Forest closely in both execution time (long) and accuracy (99%). As before, it used 3-fold cross-validation, with tuneLength = 5 letting caret search over the boosting tuning parameters.
#Generalized Boosted Regression Analysis
mod_gbm <- train(classe~., data=trainFinal, method="gbm", trControl = control, tuneLength = 5, verbose = F)
pred_gbm <- predict(mod_gbm, trainValid)
cnfgbm <- confusionMatrix(pred_gbm, factor(trainValid$classe))
cnfgbm$table
## Reference
## Prediction A B C D E
## A 1671 5 0 0 0
## B 3 1127 5 0 2
## C 0 7 1007 7 3
## D 0 0 13 953 10
## E 0 0 1 4 1067
cnfgbm$overall['Accuracy']
## Accuracy
## 0.9898046
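caret records the winning tuning-parameter combination for each fit in the model's bestTune element; inspecting it shows what the cross-validated search selected:

mod_rf$bestTune    # mtry for the random forest
mod_gbm$bestTune   # n.trees, interaction.depth, shrinkage, n.minobsinnode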
A quick review of the accuracy of the various models on the validation set shows dramatic variation, from 54% to 99%. Given these results, there seems little reason to use the decision tree or LDA methods (unless a simple, interpretable model is desired). QDA ran nearly as quickly as those two algorithms and performed much better. If computing time is not an issue, the Random Forest and the GBM performed best and should be strongly preferred.
vacc_trees <- cnftree$overall['Accuracy']
vacc_lda <- cnflda$overall['Accuracy']
vacc_qda <- cnfqda$overall['Accuracy']
vacc_rf <- cnfrf$overall['Accuracy']
vacc_gbm <- cnfgbm$overall['Accuracy']
vacc <- data.frame(vacc_trees,vacc_lda,vacc_qda,vacc_rf,vacc_gbm)
colnames(vacc) <- c("DT","LDA","QDA","RF","GBM")
vacc
## DT LDA QDA RF GBM
## Accuracy 0.5352591 0.7033135 0.8941376 0.9932031 0.9898046
Dividing each confusion matrix by its column totals converts the counts into the percentage of trials from each true class that is assigned to each predicted class. The diagonal elements then show how often each activity was correctly classified. Stacking the diagonals from all five models lets us compare, class by class, how well each model did. The table shows that class A was the easiest to identify for every model, while the weaker models struggled most with classes B and C (and LDA also with E).
# Column-normalise a confusion matrix (percent of each true class) and
# keep the diagonal, i.e. the per-class correct-classification rate
classPct <- function(cm) diag(round(prop.table(cm$table, margin = 2) * 100, 1))
de <- rbind(DT  = classPct(cnftree),
            LDA = classPct(cnflda),
            QDA = classPct(cnfqda),
            RF  = classPct(cnfrf),
            GBM = classPct(cnfgbm))
de
## A B C D E
## DT 90.2 32.4 40.5 40.5 43.0
## LDA 83.3 63.5 64.1 75.5 58.7
## QDA 92.8 84.0 93.4 85.2 89.9
## RF 100.0 99.0 98.6 99.0 99.5
## GBM 99.8 98.9 98.1 98.9 98.6
The final step of this analysis was to apply the chosen model to the blind test set to make the 20 predictions. The RF model was used, as it performed best on the validation set (though GBM was virtually equivalent). The predicted classes for all five models are shown below; not surprisingly, the chosen RF and the alternative GBM agree completely on their predictions, while the weaker methods deviate on a number of cases. Submitting the predicted classes to the Prediction Quiz yielded a perfect score for the Random Forest (and GBM) predictions, which is not unexpected given the very low out-of-sample error rates estimated on the validation set.
# Predictions on the 20 blind test cases from every model
tfit_lda <- predict(mod_lda,testInit)
tfit_qda <- predict(mod_qda,testInit)
tfit_tree <- predict(mod_tree,testInit)
tfit_rf <- predict(mod_rf,testInit)
tfit_gbm <- predict(mod_gbm,testInit)
mfits <- data.frame(tfit_lda,tfit_qda,tfit_tree,tfit_rf,tfit_gbm)
colnames(mfits) <- c("LDA","QDA","DT","RF","GBM")
mfits
## LDA QDA DT RF GBM
## 1 B A C B B
## 2 A A A A A
## 3 B B D B B
## 4 C A A A A
## 5 C A A A A
## 6 E E C E E
## 7 D D D D D
## 8 D B A B B
## 9 A A A A A
## 10 A A A A A
## 11 D B C B B
## 12 A C D C C
## 13 B B C B B
## 14 A A A A A
## 15 E E D E E
## 16 A E A E E
## 17 A A A A A
## 18 B B A B B
## 19 B B A B B
## 20 B B D B B
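As a quick check of the complete RF/GBM agreement noted above:

all(as.character(mfits$RF) == as.character(mfits$GBM))   # TRUE, per the table above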