Prediction Assignment Practical Machine Learning

Background

This report describes the search for a model that predicts how a dumbbell biceps curl was performed. The data set* was designed to examine how “well” six individuals did biceps curls. Each person did the curl correctly and then made four different errors: throwing the elbows to the front, lifting only halfway, lowering only halfway, and throwing the hips to the front. Sensors on the dumbbell, waist belt, forearm, and upper arm recorded data in the x, y, and z axes. Other variables include roll, pitch, yaw, gyros, acceleration, magnet, time, and user name, along with calculated summaries such as kurtosis, skewness, amplitude, average, variance, and standard deviation. In total there are 160 variables, including the one that identifies how the curl was performed: correctly, or with which error.

If one does not regularly do biceps curls, one might not understand how they are done properly or what mistakes look like. Fortunately, YouTube has many videos demonstrating “Unilateral dumbbell biceps curls” and several more on “Top Mistakes Doing Bicep Curls.” After watching numerous videos, one will understand that the upper arm and waist belt barely move when a biceps curl is done properly. When the elbows are incorrectly thrown to the front, the dumbbell follows a higher arc. Throwing the hips to the front should show up on the belt sensor, which the other movements do not necessarily affect.

Cleaning Data

The data set is quite large, with more than 19,000 observations and 160 variables. A very small testing data set was provided for the final analysis. As a best practice, 30 percent of the data was held out as a validation set, and an exploring data set of 2,000 observations was sampled from the remaining training data. The research was done on the exploring data until the final model was decided upon. The original data set included many columns made up mostly of NAs and blank cells, because they hold summary calculations across many observations. These columns were eliminated from the exploring data set, which reduced it to 53 columns. The outcome variable, “classe,” identifies the type of biceps curl and was originally a character variable. As part of the cleaning process it was converted to a factor variable with five levels: A, B, C, D, and E.
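The report selects the complete columns by position. As a hedged alternative, the same cleaning step can be sketched by testing each column's contents. The data frame `toy`, its values, and the helper `is_complete` are hypothetical stand-ins for illustration and are not part of the original analysis.

```r
# Toy stand-in for the sensor data (hypothetical values, not the real data set).
toy <- data.frame(
  classe        = c("A", "B", "E", "A"),
  roll_belt     = c(1.41, 1.48, 130.9, 1.42),
  kurtosis_belt = c("", "", "", "-1.2"),  # mostly blank summary column
  avg_roll_belt = c(NA, NA, NA, 1.5),     # mostly NA summary column
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a column has no NA and no empty-string cells.
is_complete <- function(col) !any(is.na(col) | col == "")

# Keep only the complete columns.
keep <- vapply(toy, is_complete, logical(1))
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)
```

Selecting by content rather than by column number avoids depending on the column order of the input file, at the cost of also dropping any measurement column that happens to contain a single missing value.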

Exploratory Analysis

After watching many videos, one could guess that a belt movement variable would be important. Some quick plots were made to look at the belt movement data. There are 38 variables measuring some aspect of belt movement. Classe E identifies the error of throwing the hips forward, which was easy to pick out in some plots.

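library(caret)    # createDataPartition, train, confusionMatrix
library(dplyr)    # sample_n, select
library(ggplot2)  # ggplot

#create validation set (70% training, 30% validation)
#note: the partition is drawn before set.seed(), so the exact split is not reproducible
inTrain <- createDataPartition(y = dumbbell$classe, p = .7, list = FALSE)
training <- dumbbell[inTrain, ]
validation <- dumbbell[-inTrain, ]
training$classe <- as.factor(training$classe)

#sample 2,000 training observations for exploration
set.seed(123)
exploring <- sample_n(training, 2000)
exploring <- as.data.frame(exploring)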

#data set without na values
exploringy <- select(exploring, 1, 9:12, 38:50, 61:69, 85:87, 103, 114:125, 141, 152:160)

#plot that separates throwing hips to front well, lowering halfway somewhat, other movements not separated
idea2 <- ggplot(exploring, aes(x = (roll_forearm+magnet_dumbbell_y), y = roll_belt, colour = factor(classe))) 
idea2 + geom_point(size = 2, alpha =.3)

Models

Being able to predict one classe was possible, but the goal was to predict all five. Using the caret package, several methods were examined, including rf, rpart, rpart2, ada, and pca. The rpart2 method built a tree that placed all five classes in separate leaves using seven variables, and those seven variables were then used to build other models. Although these models could separate the classes, accuracy was at best 90%, and a higher accuracy was desired.

idea6 <- train(classe ~ ., method = "rpart2", data = exploringy) #finds seven variables, includes tuning parameters
print(idea6$finalModel)

## n= 2000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 2000 1425 A (0.29 0.2 0.17 0.17 0.17)  
##    2) roll_belt< 130.5 1824 1251 A (0.31 0.22 0.19 0.19 0.09)  
##      4) pitch_forearm< -33.55 158    1 A (0.99 0.0063 0 0 0) *
##      5) pitch_forearm>=-33.55 1666 1250 A (0.25 0.24 0.21 0.21 0.099)  
##       10) roll_forearm< 127.5 1079  703 A (0.35 0.26 0.14 0.21 0.052)  
##         20) magnet_dumbbell_y< 444.5 910  536 A (0.41 0.2 0.15 0.18 0.051) *
##         21) magnet_dumbbell_y>=444.5 169   75 B (0.012 0.56 0.036 0.34 0.059)  
##           42) total_accel_dumbbell>=6.5 98   13 B (0.02 0.87 0.061 0 0.051) *
##           43) total_accel_dumbbell< 6.5 71   14 D (0 0.13 0 0.8 0.07) *
##       11) roll_forearm>=127.5 587  390 C (0.068 0.2 0.34 0.21 0.19)  
##         22) magnet_dumbbell_y< 291.5 314  135 C (0.086 0.086 0.57 0.11 0.15)  
##           44) magnet_dumbbell_z>=287 39   16 A (0.59 0.026 0.051 0.1 0.23) *
##           45) magnet_dumbbell_z< 287 275   98 C (0.015 0.095 0.64 0.11 0.14) *
##         23) magnet_dumbbell_y>=291.5 273  182 B (0.048 0.33 0.066 0.33 0.23)  
##           46) accel_forearm_x>=-101 165   95 B (0.048 0.42 0.11 0.12 0.3) *
##           47) accel_forearm_x< -101 108   39 D (0.046 0.19 0 0.64 0.12) *
##    3) roll_belt>=130.5 176    2 E (0.011 0 0 0 0.99) *

#terminal nodes (*) of the tree built from seven variables, with the probability of each classification (A, B, C, D, E)
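The variables a tree actually splits on can also be read from the fitted object instead of from the printout. The following is a minimal sketch, assuming an rpart fit; it uses the built-in iris data as a stand-in, and for a caret model the same idea would apply to the `finalModel` element.

```r
library(rpart)  # recommended package shipped with standard R installations

# Stand-in tree on the built-in iris data (not the dumbbell data).
fit <- rpart(Species ~ ., data = iris, method = "class")

# fit$frame has one row per node; `var` holds the split variable,
# or "<leaf>" for terminal nodes.
split_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
split_vars
```

A vector like `split_vars` can then be used to subset the data before fitting other models on the same variables.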

To get better accuracy, a different approach was taken: all the variables were included using the dot option in the formula, and the method was changed. Three classification methods were tried: rf (random forest), which had an out-of-bag (OOB) error rate of 5.15%, or an accuracy of 94.85%; svmRadial, which had a significantly higher error rate; and xgbTree (xgboost).

Best Fit

The xgbTree method, or extreme gradient boosting (xgboost), was determined to be the best model. It predicts the exploring data set, on which it was fit, almost perfectly. The only problem with this model is that it takes a long time to run on data sets as large as the one used in this report. Its accuracy on the validation data set (30% of the original data) is 95.45%. As in the plot, classe E, throwing the hips forward, separates well from the other classes: it has the highest specificity (0.9940) and positive predictive value (0.9726) of the five classes, with 29 of the 1,058 observations predicted as E belonging to another class.

validation$classe <- as.factor(validation$classe)
validationy <- select(validation, 1, 9:12, 38:50, 61:69, 85:87, 103, 114:125, 141, 152:160)
idea11 <- train(classe~ ., method= "xgbTree", data = exploringy) #The BEST model
ideaPred11 <- predict(idea11, newdata = validationy)
confusionMatrix(ideaPred11, validationy$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1656   39    1    6    2
##          B    5 1047   32    7   15
##          C    4   36  969   30   19
##          D    7    4   15  916   17
##          E    2   13    9    5 1029
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9545          
##                  95% CI : (0.9488, 0.9596)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9424          
##  Mcnemar's Test P-Value : 2.718e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9892   0.9192   0.9444   0.9502   0.9510
## Specificity            0.9886   0.9876   0.9817   0.9913   0.9940
## Pos Pred Value         0.9718   0.9467   0.9159   0.9552   0.9726
## Neg Pred Value         0.9957   0.9807   0.9882   0.9903   0.9890
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2814   0.1779   0.1647   0.1556   0.1749
## Detection Prevalence   0.2895   0.1879   0.1798   0.1630   0.1798
## Balanced Accuracy      0.9889   0.9534   0.9631   0.9707   0.9725
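The overall accuracy and the per-class statistics above follow directly from the counts in the confusion matrix. As a worked check, the sketch below re-enters the validation counts by hand (the object name `cm` is introduced here for illustration) and recomputes a few of the figures.

```r
# Validation confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(1656,   39,    1,    6,    2,
       5, 1047,   32,    7,   15,
       4,   36,  969,   30,   19,
       7,    4,   15,  916,   17,
       2,   13,    9,    5, 1029),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all observations
sensitivity <- diag(cm) / colSums(cm)    # correct / actual members of each class
ppv         <- diag(cm) / rowSums(cm)    # correct / predicted members of each class

round(accuracy, 4)
round(sensitivity, 4)
round(ppv, 4)
```

The diagonal holds 5,617 correct predictions out of 5,885 validation observations, which gives the reported accuracy of 0.9545.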

Cross Validation

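To validate the model, a hold-out validation data set containing 30% of the original data was created at the beginning. The accuracy on the validation data set was 95.45%. Finally, the model was run on the entire training set, where its accuracy was 96.26%. Because the training set includes the 2,000 exploring observations used to fit the model, the validation figure is the better estimate of out-of-sample accuracy.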

#predict on the full training set (includes the 2,000 exploring rows used to fit idea11)
training <- as.data.frame(training)
trainingy <- select(training, 1, 9:12, 38:50, 61:69, 85:87, 103, 114:125, 141, 152:160) #data set without na values
ideaPred11a <- predict(idea11, newdata = trainingy)
confusionMatrix(ideaPred11a, trainingy$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3853   72    1   14    5
##          B   23 2494   76    9   30
##          C   14   63 2292   77   32
##          D   14   11   24 2146   20
##          E    2   18    3    6 2438
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9626          
##                  95% CI : (0.9593, 0.9657)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9527          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9864   0.9383   0.9566   0.9529   0.9655
## Specificity            0.9906   0.9875   0.9836   0.9940   0.9974
## Pos Pred Value         0.9767   0.9476   0.9249   0.9688   0.9882
## Neg Pred Value         0.9946   0.9852   0.9908   0.9908   0.9923
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2805   0.1816   0.1668   0.1562   0.1775
## Detection Prevalence   0.2872   0.1916   0.1804   0.1612   0.1796
## Balanced Accuracy      0.9885   0.9629   0.9701   0.9735   0.9815
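The report relies on a single hold-out split. For comparison, k-fold cross-validation repeats the hold-out idea k times so that every observation is held out exactly once. In caret this is usually requested with `trainControl(method = "cv", number = k)`; the sketch below instead does it by hand with rpart on the built-in iris data so each step is visible. The data set, the choice of k = 5, and the model are assumptions for illustration, not the report's method.

```r
library(rpart)  # recommended package shipped with standard R installations

set.seed(123)
k <- 5

# Randomly assign each row of iris to one of k equally sized folds.
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
fold_accuracy <- vapply(seq_len(k), function(i) {
  fit  <- rpart(Species ~ ., data = iris[folds != i, ], method = "class")
  pred <- predict(fit, newdata = iris[folds == i, ], type = "class")
  mean(pred == iris$Species[folds == i])
}, numeric(1))

fold_accuracy        # one held-out accuracy per fold
mean(fold_accuracy)  # cross-validated accuracy estimate
```

Averaging over folds gives an estimate that depends less on which rows happen to land in a single validation split.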

Conclusion

The best model found for classifying biceps curl movements uses the extreme gradient boosting method (xgbTree) in the caret package with all the retained variables. It gives an error rate of less than 5% on the exploring, validation, and training data sets.

*Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.