Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The data for this project are kindly provided by:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H.: Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013 (https://dl.acm.org/doi/10.1145/2459236.2459256).
The Weight Lifting Exercises (WLE) dataset is used to investigate how well an activity is being performed. Six participants each performed one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
Class A - exactly according to the specification,
Class B - throwing the elbows to the front,
Class C - lifting the dumbbell only halfway,
Class D - lowering the dumbbell only halfway,
Class E - throwing the hips to the front.
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
A review of the training and test data sets indicates that there are 160 variables (37 character, 35 integer, 88 numeric) with 19622 and 20 observations, respectively. The last variable, "classe", indicates which of the 5 classes of exercise was performed on each trial. Tabulating the number of trials per class shows that the correct execution (A) was performed noticeably more often than any single incorrect variant, each of which occurred roughly equally often (about 5,500 class A repetitions versus about 3,500 for each of B–E).
library(caret)    # createDataPartition, train, trainControl, confusionMatrix
library(dplyr)    # count()
library(rattle)   # fancyRpartPlot()

trainInit <- read.csv("pml-training.csv")
testInit <- read.csv("pml-testing.csv")
dim(trainInit)   # 19622 observations of 160 variables
dim(testInit)    # 20 observations of 160 variables
table(sapply(trainInit, class))       # variable types
which(names(trainInit) == "classe")   # the outcome is the last column (160)
trainInit %>% count(classe, sort = TRUE)
The original data set contains 159 potential predictors, many of which are not appropriate for modeling. One possible strategy would be to eliminate variables that contain a large number of NAs or that have near-zero variance; this was attempted initially and did reduce the number of variables, but an alternative strategy was used in the end. Data collection occurred from 4 sensors attached to each subject (belt, arm, forearm, and dumbbell). Using grep with the pattern "belt", it was evident that the roll, pitch, and yaw measurements, together with the accelerometer, gyroscope, and magnetometer readings, were all numeric. Focusing on these raw variables (while ignoring the derived summary variables) gave a rich data set of 52 predictors.
# Select the raw sensor measurements by name prefix
predictorInd <- c(grep("^accel", names(trainInit)), grep("^gyros", names(trainInit)),
                  grep("^magnet", names(trainInit)), grep("^roll", names(trainInit)),
                  grep("^pitch", names(trainInit)), grep("^total", names(trainInit)),
                  grep("^yaw", names(trainInit)))
trainReduce <- trainInit[, c(predictorInd, 160)]   # column 160 is the outcome "classe"
names(trainReduce)
## [1] "accel_belt_x" "accel_belt_y" "accel_belt_z"
## [4] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [7] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [10] "accel_forearm_x" "accel_forearm_y" "accel_forearm_z"
## [13] "gyros_belt_x" "gyros_belt_y" "gyros_belt_z"
## [16] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [19] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [22] "gyros_forearm_x" "gyros_forearm_y" "gyros_forearm_z"
## [25] "magnet_belt_x" "magnet_belt_y" "magnet_belt_z"
## [28] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [31] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [34] "magnet_forearm_x" "magnet_forearm_y" "magnet_forearm_z"
## [37] "roll_belt" "roll_arm" "roll_dumbbell"
## [40] "roll_forearm" "pitch_belt" "pitch_arm"
## [43] "pitch_dumbbell" "pitch_forearm" "total_accel_belt"
## [46] "total_accel_arm" "total_accel_dumbbell" "total_accel_forearm"
## [49] "yaw_belt" "yaw_arm" "yaw_dumbbell"
## [52] "yaw_forearm" "classe"
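For reference, the initially attempted NA / near-zero-variance reduction could be sketched as follows; the 95% NA cutoff is an assumed value for illustration, not taken from the original analysis.

nzv <- nearZeroVar(trainInit)              # caret: indices of near-zero-variance columns
tmp <- trainInit[, -nzv]
mostlyNA <- sapply(tmp, function(x) mean(is.na(x)) > 0.95)   # assumed 95% NA cutoff
tmp <- tmp[, !mostlyNA]
dim(tmp)   # fewer variables, though the name-based selection above was preferred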
Although the data already come with a "test" set, that set has no class variable and is reserved for the final blind predictions. To estimate the out-of-sample error rate before making those 20 predictions, the training set was split into two parts: a smaller training set and a validation set. This 70%/30% split yielded data sets with 13737 and 5885 observations, respectively.
set.seed(1234)   # assumed seed for reproducibility; the original run did not record one
inTrain <- createDataPartition(y=trainReduce$classe, p=0.7, list=FALSE)
trainFinal <- trainReduce[inTrain,]    # 70% for model fitting
trainValid <- trainReduce[-inTrain,]   # 30% held out for validation
dim(trainFinal)
dim(trainValid)
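Since createDataPartition samples within each level of the outcome, the class proportions should be preserved in both pieces; a quick check:

round(prop.table(table(trainFinal$classe)), 3)   # should match...
round(prop.table(table(trainValid$classe)), 3)   # ...the full-set class proportions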
The first model applied to the data was a decision tree. The fit used 3-fold cross-validation and ran very quickly. Although this model is easy to understand and interpret (see the decision tree plot below), its accuracy was fairly poor, with only 54% of the validation-set cases classified correctly. The confusion matrix indicates that most of category A (correct form) was identified correctly, but many trials with incorrect form were misclassified as correct.
#Decision Tree Analysis
control <- trainControl(method="cv", number=3, verboseIter=FALSE)   # 3-fold CV, reused for all models
mod_tree <- train(classe~., data=trainFinal, method="rpart", trControl = control, tuneLength = 5)
fancyRpartPlot(mod_tree$finalModel)          # plot the fitted tree
pred_tree <- predict(mod_tree, trainValid)   # predict on the validation set
cnftree <- confusionMatrix(pred_tree, factor(trainValid$classe))
cnftree$table
## Reference
## Prediction A B C D E
## A 1510 472 486 425 156
## B 40 369 37 12 130
## C 93 110 416 137 141
## D 27 188 87 390 190
## E 4 0 0 0 465
cnftree$overall['Accuracy']
## Accuracy
## 0.5352591
The second model applied to the data set was Linear Discriminant Analysis (LDA), using the same 3-fold cross-validation. Once again the analysis ran quickly, but as with the decision tree, the accuracy of the model was relatively weak at 70%.
#Linear Discriminant Analysis
mod_lda <- train(classe~., data=trainFinal, method="lda", trControl = control)
pred_lda <- predict(mod_lda, trainValid)
cnflda <- confusionMatrix(pred_lda, factor(trainValid$classe))
cnflda$table
## Reference
## Prediction A B C D E
## A 1395 177 107 43 46
## B 38 723 109 42 188
## C 122 144 658 110 104
## D 114 43 117 728 109
## E 5 52 35 41 635
cnflda$overall['Accuracy']
## Accuracy
## 0.7033135
The third model was Quadratic Discriminant Analysis (QDA), which relaxes LDA's assumption of a common covariance matrix across classes and thereby allows quadratic decision boundaries. It used the same 3-fold cross-validation and took approximately the same time as the LDA, but produced much better predictions, with 89% accuracy.
#Quadratic Discriminant Analysis
mod_qda <- train(classe~., data=trainFinal, method="qda", trControl = control)
pred_qda <- predict(mod_qda, trainValid)
cnfqda <- confusionMatrix(pred_qda, factor(trainValid$classe))
cnfqda$table
## Reference
## Prediction A B C D E
## A 1553 51 2 2 1
## B 68 957 53 2 31
## C 21 113 958 123 50
## D 27 4 7 821 27
## E 5 14 6 16 973
cnfqda$overall['Accuracy']
## Accuracy
## 0.8941376
The fourth model applied to the data set was a Random Forest (with the same cross-validation), and performance improved dramatically. The analysis completed notably more slowly than the first three models, but accuracy on the validation set jumped to 99%, with very few cases misclassified.
#Random Forest Analysis
mod_rf <- train(classe~., data=trainFinal, method="rf", trControl = control)
pred_rf <- predict(mod_rf, trainValid)
cnfrf <- confusionMatrix(pred_rf, factor(trainValid$classe))
cnfrf$table
## Reference
## Prediction A B C D E
## A 1674 6 0 0 0
## B 0 1128 10 0 1
## C 0 5 1012 10 0
## D 0 0 4 954 4
## E 0 0 0 0 1077
cnfrf$overall['Accuracy']
## Accuracy
## 0.9932031
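To quantify the run-time gap noted above, each train() call can be wrapped in system.time(); a minimal sketch (it refits the models, and timings will vary by machine):

lda_time <- system.time(train(classe~., data=trainFinal, method="lda", trControl=control))
rf_time  <- system.time(train(classe~., data=trainFinal, method="rf", trControl=control))
c(LDA = lda_time[["elapsed"]], RF = rf_time[["elapsed"]])   # elapsed seconds per fit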
The last algorithm applied to the data set was a Generalized Boosted Model (GBM). It mirrored the Random Forest closely in both execution time (long) and accuracy (99%). As before, it used 3-fold cross-validation, with tuneLength = 5 letting caret search over the boosting tuning parameters.
#Generalized Boosted Regression Analysis
mod_gbm <- train(classe~., data=trainFinal, method="gbm", trControl = control, tuneLength = 5, verbose = F)
pred_gbm <- predict(mod_gbm, trainValid)
cnfgbm <- confusionMatrix(pred_gbm, factor(trainValid$classe))
cnfgbm$table
## Reference
## Prediction A B C D E
## A 1671 5 0 0 0
## B 3 1127 5 0 2
## C 0 7 1007 7 3
## D 0 0 13 953 10
## E 0 0 1 4 1067
cnfgbm$overall['Accuracy']
## Accuracy
## 0.9898046
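caret records the winning tuning-parameter combination for each fit in the model's bestTune element; inspecting it shows what the cross-validated search selected:

mod_rf$bestTune    # mtry for the random forest
mod_gbm$bestTune   # n.trees, interaction.depth, shrinkage, n.minobsinnode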
A quick review of the accuracy of the various models on the validation set shows dramatic variation, from 54% to 99%. Given these results, there seems little reason to use the decision tree or LDA methods (unless a simple, interpretable model is desired). QDA ran nearly as quickly as those two algorithms and performed much better. If computing time is not an issue, the Random Forest and the GBM performed best and should be strongly preferred.
vacc_trees <- cnftree$overall['Accuracy']
vacc_lda <- cnflda$overall['Accuracy']
vacc_qda <- cnfqda$overall['Accuracy']
vacc_rf <- cnfrf$overall['Accuracy']
vacc_gbm <- cnfgbm$overall['Accuracy']
vacc <- data.frame(vacc_trees,vacc_lda,vacc_qda,vacc_rf,vacc_gbm)
colnames(vacc) <- c("DT","LDA","QDA","RF","GBM")
vacc
## DT LDA QDA RF GBM
## Accuracy 0.5352591 0.7033135 0.8941376 0.9932031 0.9898046
Dividing each confusion matrix by its column totals converts the counts into the percentage of trials from each true class that is assigned to each predicted class. The diagonal elements then show how often each activity was correctly classified. Stacking the diagonals from all five models lets us compare, class by class, how well each model did. The table shows that class A was the easiest to identify for every model, while the weaker models struggled most with classes B and C (and LDA also with E).
# Column-normalise a confusion matrix (percent of each true class) and
# keep the diagonal, i.e. the per-class correct-classification rate
classPct <- function(cm) diag(round(prop.table(cm$table, margin = 2) * 100, 1))
de <- rbind(DT  = classPct(cnftree),
            LDA = classPct(cnflda),
            QDA = classPct(cnfqda),
            RF  = classPct(cnfrf),
            GBM = classPct(cnfgbm))
de
## A B C D E
## DT 90.2 32.4 40.5 40.5 43.0
## LDA 83.3 63.5 64.1 75.5 58.7
## QDA 92.8 84.0 93.4 85.2 89.9
## RF 100.0 99.0 98.6 99.0 99.5
## GBM 99.8 98.9 98.1 98.9 98.6
The final step of this analysis was to apply the chosen model to the blind test set to make the 20 predictions. The RF model was used, as it performed best on the validation set (though GBM was virtually equivalent). The predicted classes for all five models are shown below; not surprisingly, the chosen RF and the alternative GBM agree completely on their predictions, while the weaker methods deviate on a number of cases. Submitting the predicted classes to the Prediction Quiz yielded a perfect score for the Random Forest (and GBM) predictions, which is not unexpected given the very low out-of-sample error rates estimated on the validation set.
# Predictions on the 20 blind test cases from every model
tfit_lda <- predict(mod_lda,testInit)
tfit_qda <- predict(mod_qda,testInit)
tfit_tree <- predict(mod_tree,testInit)
tfit_rf <- predict(mod_rf,testInit)
tfit_gbm <- predict(mod_gbm,testInit)
mfits <- data.frame(tfit_lda,tfit_qda,tfit_tree,tfit_rf,tfit_gbm)
colnames(mfits) <- c("LDA","QDA","DT","RF","GBM")
mfits
## LDA QDA DT RF GBM
## 1 B A C B B
## 2 A A A A A
## 3 B B D B B
## 4 C A A A A
## 5 C A A A A
## 6 E E C E E
## 7 D D D D D
## 8 D B A B B
## 9 A A A A A
## 10 A A A A A
## 11 D B C B B
## 12 A C D C C
## 13 B B C B B
## 14 A A A A A
## 15 E E D E E
## 16 A E A E E
## 17 A A A A A
## 18 B B A B B
## 19 B B A B B
## 20 B B D B B
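As a quick check of the complete RF/GBM agreement noted above:

all(as.character(mfits$RF) == as.character(mfits$GBM))   # TRUE, per the table above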