H-RM Tan
Wearable devices that monitor physical activity are on the rise and provide a wealth of useful information. Apart from measuring the quantity of activity, assessing the manner in which an activity is performed could also improve remote human activity monitoring. The Weight Lifting Exercise Dataset (Velloso, Bulling, Gellersen, Ugulino, & Fuks, 2013) provides a means to derive a “proof-of-concept” for decoding the mode of weight-lifting performance. Data were acquired from accelerometer sensors on the belt, forearm, arm, and dumbbell worn by 6 participants as they performed barbell lifts correctly and incorrectly in 5 different ways. Further information is available from http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises.
The required libraries (parallel; doMC; caret; randomForest; glm; survival; dplyr; cluster; splines) are included in the knitr Rmd setup.
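For reference, a setup chunk along the following lines would load these packages (a sketch only; ‘glm’ is provided by the base stats package rather than a separate library, and the exact setup chunk is not reproduced here):
library(parallel)      # parallel processing support
library(doMC)          # multicore backend registered before model training
library(caret)         # model training, cross-validation, confusion matrices
library(randomForest)  # random forest implementation
library(survival)
library(dplyr)
library(cluster)
library(splines)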
We load the data from the URLs provided.
trainUrl <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainO <- read.csv(trainUrl, na.strings = c(""," ","NA") )
testO <- read.csv(testUrl, na.strings = c(""," ","NA") )
Variables containing missing values (~62% of the columns) are excluded using a custom function that keeps only the columns with no missing values.
# Keep only the columns that contain no missing values
keepCols <- function(df) { names(df)[!vapply(df, anyNA, logical(1))] }
train1 <- trainO[keepCols(trainO)]
test1 <- testO[keepCols(testO)]
Variables that are unlikely to be informative for the current modelling and prediction, e.g. the data entry row ‘X’, user_name, timestamps, and window indicators, are also excluded.
train <- train1[!grepl("^X|user|timestamp|window", names(train1))]
test <- test1[!grepl("^X|user|timestamp|window", names(test1))]
Setting a specific seed for the randomisation helps with reproducibility. We subsequently split the train data for training (80%) and validating (20%) the model(s).
set.seed(2612)
inTrain <- createDataPartition(train$classe, p = 0.8, list = FALSE)
training <- train[inTrain, ]
validating <- train[-inTrain, ]
The dimensions of these data splits and the original train and test sets are as follows:
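(The summary below could be reproduced with something like the following sketch, where trainData and testData are taken to refer to the filtered train and test sets.)
dims <- rbind(training   = dim(training),
              validating = dim(validating),
              trainData  = dim(train),
              testData   = dim(test))
colnames(dims) <- c("Nrows", "Nvars")
dims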
## Nrows Nvars
## training 15699 53
## validating 3923 53
## trainData 19622 53
## testData 20 53
Both Gradient Boosting and Random Forest models are frequently used in, e.g., Kaggle competitions, in part for their high predictive accuracy. For the current purpose, a simple set of parameters and their inherent algorithmic defaults are used for both models to predict the mode of weight-lifting performance. We use a control parameter with 10-fold cross-validation, which should help reduce estimation bias at the expense of some computational time (ref. http://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation). (The models have been pre-run and loaded during the knitr Rmd setup. If you wish to run them on your machine, please note they could take a while to complete.)
We assess the variable(s) of importance, accuracy,‘out-of-bag’ errors, and prediction outcomes for both models.
A gradient boosting algorithm generally attempts to iteratively “boost” (by increasing their weighting factors) many weak (but better than random) predictive models into a strong one, in the form of an ensemble of weak models, while minimising the training error. Additionally, for any given training data, the GBM algorithm attempts to find an optimal linear combination of trees (assuming the final model is the weighted sum of the predictions of the individual trees) through parameter tuning.
set.seed(2612)
registerDoMC(3)
control <- trainControl(method = "cv", number = 10)
fit_gbm <- train(classe ~ ., data = training, method = "gbm",
trControl = control)
fit_gbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 44 had non-zero influence.
varImp(fit_gbm)
## gbm variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.000
## pitch_forearm 47.362
## yaw_belt 38.312
## magnet_dumbbell_z 30.579
## magnet_dumbbell_y 26.524
## roll_forearm 25.200
## magnet_belt_z 21.422
## accel_dumbbell_y 16.072
## accel_forearm_x 14.217
## gyros_belt_z 12.529
## pitch_belt 11.633
## gyros_dumbbell_y 9.438
## magnet_forearm_z 9.331
## accel_dumbbell_x 6.604
## yaw_arm 6.509
## accel_forearm_z 6.251
## magnet_dumbbell_x 6.096
## roll_dumbbell 6.007
## magnet_belt_y 5.764
## roll_arm 5.130
dotPlot(varImp(fit_gbm))
It appears that the top 3 variables (‘roll_belt’; ‘pitch_forearm’; ‘yaw_belt’), and especially ‘roll_belt’, have the strongest influence on the model. We validate the GBM to see how it performs.
validate_gbm <- predict(fit_gbm, validating)
(confMat_val_gbm <- confusionMatrix(validating$classe, validate_gbm))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1102 7 5 1 1
## B 25 704 29 1 0
## C 0 21 656 7 0
## D 1 1 17 620 4
## E 0 11 9 7 694
##
## Overall Statistics
##
## Accuracy : 0.9625
## 95% CI : (0.9561, 0.9683)
## No Information Rate : 0.2875
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9526
## Mcnemar's Test P-Value : 6.391e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9770 0.9462 0.9162 0.9748 0.9928
## Specificity 0.9950 0.9827 0.9913 0.9930 0.9916
## Pos Pred Value 0.9875 0.9275 0.9591 0.9642 0.9626
## Neg Pred Value 0.9907 0.9874 0.9815 0.9951 0.9984
## Prevalence 0.2875 0.1897 0.1825 0.1621 0.1782
## Detection Rate 0.2809 0.1795 0.1672 0.1580 0.1769
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9860 0.9645 0.9537 0.9839 0.9922
# (accuracy_val_gbm <- postResample(validate_gbm, validating$classe) )
(OutOfSampleErrorRate_gbm <- (1 - as.numeric(confMat_val_gbm$overall[1]) )*100 )
## [1] 3.747132
The validated GBM yielded a rather high accuracy of 0.963, and the out-of-sample error rate is about 3.75%.
We also assess the random forest model before making prediction(s) on the test data.
Random Forests are trained with random sampling of the data as well as the option to randomise feature selection during training. They are constructed from an ensemble of decision trees at training time. The outcome is the most frequently predicted class (classification) or the mean prediction (regression) across the individual trees. The premise for this approach is that randomisation generalises better outside of the training dataset, as it minimises the overfitting typical of individual decision trees.
set.seed(2612)
registerDoMC(cores = 3)
control <- trainControl(method = "cv", number = 10)
fit_rf <- train(classe ~ ., data = training,
method = "rf", trControl = control, importance=TRUE)
fit_rf
## Random Forest
##
## 15699 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 14129, 14128, 14128, 14131, 14130, 14129, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9935667 0.9918619
## 27 0.9937574 0.9921028
## 52 0.9889808 0.9860603
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
The training accuracy is around 0.99 with mtry = 27, i.e. 27 predictors randomly sampled at each split. It is often useful to also check the variables of importance (as with the GBM before), as well as how accuracy may depend on the number of randomly selected predictor variables.
par(mfrow=c(1,3))
varImpPlot(fit_rf$finalModel, type = 1, pch=16, col='navy')
varImpPlot(fit_rf$finalModel, type = 2, pch=16, col='dark green')
plot(fit_rf$results[,c(1,2)] , xlab = "Predictors", ylab = "Accuracy" , col="red")
lines(fit_rf$results[,c(1,2)], col="red")
As with the GBM, it appears that the top 3 variables (‘roll_belt’; ‘yaw_belt’; ‘pitch_forearm’) have the strongest influence on the model. This can be appreciated both by the amount the mean accuracy decreases when they are excluded from the model, and by the correspondingly large mean decrease in their Gini coefficient, which reflects node purity (ref. https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html). It is worth noting that allowing more variables at each split did not necessarily improve the training accuracy.
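For a numeric view of the same information, the importance matrix behind these plots can be inspected directly (a sketch, output not shown; the MeanDecreaseAccuracy and MeanDecreaseGini columns are available because the model was fitted with importance = TRUE):
# Top 10 predictors ranked by mean decrease in accuracy
imp_rf <- importance(fit_rf$finalModel)
head(imp_rf[order(imp_rf[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)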
fit_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.57%
## Confusion matrix:
## A B C D E class.error
## A 4461 2 0 0 1 0.000672043
## B 18 3012 8 0 0 0.008558262
## C 0 10 2716 12 0 0.008035062
## D 0 1 25 2545 2 0.010882239
## E 0 1 4 5 2876 0.003465003
The final trained model (mtry = 27) has an out-of-bag error rate of about 0.57%. Let’s validate the RF to see how it performs.
validate_rf <- predict(fit_rf, validating)
(confMat_val_rf <- confusionMatrix(validating$classe, validate_rf))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1114 1 1 0 0
## B 5 751 3 0 0
## C 0 4 677 3 0
## D 0 0 4 638 1
## E 0 0 2 3 716
##
## Overall Statistics
##
## Accuracy : 0.9931
## 95% CI : (0.99, 0.9955)
## No Information Rate : 0.2852
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9913
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9955 0.9934 0.9854 0.9907 0.9986
## Specificity 0.9993 0.9975 0.9978 0.9985 0.9984
## Pos Pred Value 0.9982 0.9895 0.9898 0.9922 0.9931
## Neg Pred Value 0.9982 0.9984 0.9969 0.9982 0.9997
## Prevalence 0.2852 0.1927 0.1751 0.1642 0.1828
## Detection Rate 0.2840 0.1914 0.1726 0.1626 0.1825
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9974 0.9954 0.9916 0.9946 0.9985
# (accuracy_val_rf <- postResample(validate_rf, validating$classe))
(OutOfSampleErrorRate_rf <- (1 - as.numeric(confMat_val_rf$overall[1]) )*100 )
## [1] 0.6882488
The validated RF yielded a high accuracy of 0.993, and the out-of-sample error rate is about 0.69%.
Let’s predict the test data with both models.
While the GBM has a slightly lower accuracy than the RF model, both yielded the same (Quiz-verified) predicted mode of weight-lifting performance on the test dataset.
(predict_rf <- predict(fit_rf, test))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
(predict_gbm <- predict(fit_gbm, test))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
At least for this dataset, both the GBM and RF models do comparatively well in their prediction accuracy. It is interesting to observe the higher out-of-sample error rate of the GBM relative to the RF model.
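Since both models were fitted with caret using 10-fold cross-validation, their resampled performance can also be compared directly; a sketch (output not shown):
# Collect the cross-validation resamples from both fits and compare them
resamps <- resamples(list(GBM = fit_gbm, RF = fit_rf))
summary(resamps)
bwplot(resamps)   # side-by-side accuracy and kappa distributions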
Parameter tuning may simplify the model(s). It would be nice to figure out a way to plot the final tree model(s).
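On the latter point, randomForest does not provide a plotting method for individual trees, but the structure of a single tree in the final model can at least be inspected as a data frame (a sketch):
# Extract the first tree of the forest; labelVar = TRUE shows split
# variable names rather than column indices
tree1 <- getTree(fit_rf$finalModel, k = 1, labelVar = TRUE)
head(tree1)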