H-RM Tan
Wearable devices that monitor physical activity are on the rise and provide a wealth of useful information. Apart from measuring the quantity of activity, assessing the manner in which an activity is performed could also improve remote human activity monitoring. The Weight Lifting Exercise Dataset (Velloso, Bulling, Gellersen, Ugulino, & Fuks, 2013) provides a means to derive a “proof-of-concept” for decoding the mode of weight-lifting performance. Data were acquired from accelerometer sensors on the belt, forearm, arm, and dumbbell worn by 6 participants as they performed barbell lifts correctly and incorrectly in 5 different ways. Further information is available from http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises.
The required libraries (parallel; doMC; caret; randomForest; glm; survival; dplyr; cluster; splines) are included in the knitr Rmd setup.
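For reference, a setup chunk along the following lines would load these packages (a sketch only; ‘glm’ is provided by the base stats package rather than a separate library, and the exact setup chunk is not reproduced here):
library(parallel)      # parallel processing support
library(doMC)          # multicore backend registered before model training
library(caret)         # model training, cross-validation, confusion matrices
library(randomForest)  # random forest implementation
library(survival)
library(dplyr)
library(cluster)
library(splines)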
We load the data from the URLs provided.
trainUrl <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainO <- read.csv(trainUrl, na.strings = c(""," ","NA") )
testO <- read.csv(testUrl, na.strings = c(""," ","NA") )
Variables containing missing values (~62% of the columns) are excluded using a custom function that keeps only the columns with no missing values.
# Keep only the columns that contain no missing values
keepCols <- function(df) { names(df)[!vapply(df, anyNA, logical(1))] }
train1 <- trainO[keepCols(trainO)]
test1 <- testO[keepCols(testO)]
Variables that are unlikely to be informative for the current modelling and prediction, e.g. the data entry row ‘X’, user_name, timestamps, and window indicators, are also excluded.
train <- train1[!grepl("^X|user|timestamp|window", names(train1))]
test <- test1[!grepl("^X|user|timestamp|window", names(test1))]
Setting a specific seed for the randomisation helps with reproducibility. We subsequently split the train data for training (80%) and validating (20%) the model(s).
set.seed(2612)
inTrain <- createDataPartition(train$classe, p = 0.8, list = FALSE)
training <- train[inTrain, ]
validating <- train[-inTrain, ]
The dimensions of these data splits and the original train and test sets are as follows:
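(The summary below could be reproduced with something like the following sketch, where trainData and testData are taken to refer to the filtered train and test sets.)
dims <- rbind(training   = dim(training),
              validating = dim(validating),
              trainData  = dim(train),
              testData   = dim(test))
colnames(dims) <- c("Nrows", "Nvars")
dims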
## Nrows Nvars
## training 15699 53
## validating 3923 53
## trainData 19622 53
## testData 20 53
Both Gradient Boosting and Random Forest models are frequently used in, e.g., Kaggle competitions, in part for their high predictive accuracy. For the current purpose, a simple set of parameters and their inherent algorithmic defaults are used for both models to predict the mode of weight-lifting performance. We use a control parameter with 10-fold cross-validation, which should help reduce estimation bias at the expense of some computational time (ref. http://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation). (The models have been pre-run and loaded during the knitr Rmd setup. If you wish to run them on your machine, please note they could take a while to complete.)
We assess the variable(s) of importance, accuracy,‘out-of-bag’ errors, and prediction outcomes for both models.
A gradient boosting algorithm generally attempts to iteratively “boost” (by increasing their weighting factors) many weak (but better than random) predictive models into a strong one, in the form of an ensemble of weak models, while minimising the training error. Additionally, for any given training data, the GBM algorithm attempts to find an optimal linear combination of trees (assuming the final model is the weighted sum of the predictions of the individual trees) through parameter tuning.
set.seed(2612)
registerDoMC(3)
control <- trainControl(method = "cv", number = 10)
fit_gbm <- train(classe ~ ., data = training, method = "gbm",
trControl = control)
fit_gbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 44 had non-zero influence.
varImp(fit_gbm)
## gbm variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.000
## pitch_forearm 47.362
## yaw_belt 38.312
## magnet_dumbbell_z 30.579
## magnet_dumbbell_y 26.524
## roll_forearm 25.200
## magnet_belt_z 21.422
## accel_dumbbell_y 16.072
## accel_forearm_x 14.217
## gyros_belt_z 12.529
## pitch_belt 11.633
## gyros_dumbbell_y 9.438
## magnet_forearm_z 9.331
## accel_dumbbell_x 6.604
## yaw_arm 6.509
## accel_forearm_z 6.251
## magnet_dumbbell_x 6.096
## roll_dumbbell 6.007
## magnet_belt_y 5.764
## roll_arm 5.130
dotPlot(varImp(fit_gbm))
It appears that the top 3 variables (‘roll_belt’; ‘pitch_forearm’; ‘yaw_belt’), and especially ‘roll_belt’, have the strongest influence on the model. We validate the GBM to see how it performs.
validate_gbm <- predict(fit_gbm, validating)
(confMat_val_gbm <- confusionMatrix(validating$classe, validate_gbm))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1102 7 5 1 1
## B 25 704 29 1 0
## C 0 21 656 7 0
## D 1 1 17 620 4
## E 0 11 9 7 694
##
## Overall Statistics
##
## Accuracy : 0.9625
## 95% CI : (0.9561, 0.9683)
## No Information Rate : 0.2875
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9526
## Mcnemar's Test P-Value : 6.391e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9770 0.9462 0.9162 0.9748 0.9928
## Specificity 0.9950 0.9827 0.9913 0.9930 0.9916
## Pos Pred Value 0.9875 0.9275 0.9591 0.9642 0.9626
## Neg Pred Value 0.9907 0.9874 0.9815 0.9951 0.9984
## Prevalence 0.2875 0.1897 0.1825 0.1621 0.1782
## Detection Rate 0.2809 0.1795 0.1672 0.1580 0.1769
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9860 0.9645 0.9537 0.9839 0.9922
# (accuracy_val_gbm <- postResample(validate_gbm, validating$classe) )
(OutOfSampleErrorRate_gbm <- (1 - as.numeric(confMat_val_gbm$overall[1]) )*100 )
## [1] 3.747132
The validated GBM yielded a rather high accuracy of 0.963, and the out-of-sample error rate is about 3.75%.
We also assess the random forest model before making prediction(s) on the test data.
Random Forests are trained with random sampling of the data as well as the option to randomise feature selection during training. They are constructed from an ensemble of decision trees at training time. The outcome is the most frequently predicted class (classification) or the mean prediction (regression) across the individual trees. The premise for this approach is that randomisation generalises better outside of the training dataset, as it minimises the overfitting typical of individual decision trees.
set.seed(2612)
registerDoMC(cores = 3)
control <- trainControl(method = "cv", number = 10)
fit_rf <- train(classe ~ ., data = training,
method = "rf", trControl = control, importance=TRUE)
fit_rf
## Random Forest
##
## 15699 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 14129, 14128, 14128, 14131, 14130, 14129, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9935667 0.9918619
## 27 0.9937574 0.9921028
## 52 0.9889808 0.9860603
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
The training accuracy is around 0.99 with mtry = 27, i.e. 27 predictors randomly sampled at each split. It is often useful to also check the variables of importance (as with the GBM before), as well as how accuracy may depend on the number of randomly selected predictor variables.
par(mfrow=c(1,3))
varImpPlot(fit_rf$finalModel, type = 1, pch=16, col='navy')
varImpPlot(fit_rf$finalModel, type = 2, pch=16, col='dark green')
plot(fit_rf$results[,c(1,2)] , xlab = "Predictors", ylab = "Accuracy" , col="red")
lines(fit_rf$results[,c(1,2)], col="red")
As with the GBM, it appears that the top 3 variables (‘roll_belt’; ‘yaw_belt’; ‘pitch_forearm’) have the strongest influence on the model. This can be appreciated both by the amount the mean accuracy decreases when they are excluded from the model, and by the correspondingly large mean decrease in their Gini coefficient, which reflects node purity (ref. https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html). It is worth noting that allowing more variables at each split did not necessarily improve the training accuracy.
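For a numeric view of the same information, the importance matrix behind these plots can be inspected directly (a sketch, output not shown; the MeanDecreaseAccuracy and MeanDecreaseGini columns are available because the model was fitted with importance = TRUE):
# Top 10 predictors ranked by mean decrease in accuracy
imp_rf <- importance(fit_rf$finalModel)
head(imp_rf[order(imp_rf[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)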
fit_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.57%
## Confusion matrix:
## A B C D E class.error
## A 4461 2 0 0 1 0.000672043
## B 18 3012 8 0 0 0.008558262
## C 0 10 2716 12 0 0.008035062
## D 0 1 25 2545 2 0.010882239
## E 0 1 4 5 2876 0.003465003
The final trained model (mtry = 27) has an out-of-bag error rate of about 0.57%. Let’s validate the RF to see how it performs.
validate_rf <- predict(fit_rf, validating)
(confMat_val_rf <- confusionMatrix(validating$classe, validate_rf))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1114 1 1 0 0
## B 5 751 3 0 0
## C 0 4 677 3 0
## D 0 0 4 638 1
## E 0 0 2 3 716
##
## Overall Statistics
##
## Accuracy : 0.9931
## 95% CI : (0.99, 0.9955)
## No Information Rate : 0.2852
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9913
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9955 0.9934 0.9854 0.9907 0.9986
## Specificity 0.9993 0.9975 0.9978 0.9985 0.9984
## Pos Pred Value 0.9982 0.9895 0.9898 0.9922 0.9931
## Neg Pred Value 0.9982 0.9984 0.9969 0.9982 0.9997
## Prevalence 0.2852 0.1927 0.1751 0.1642 0.1828
## Detection Rate 0.2840 0.1914 0.1726 0.1626 0.1825
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9974 0.9954 0.9916 0.9946 0.9985
# (accuracy_val_rf <- postResample(validate_rf, validating$classe))
(OutOfSampleErrorRate_rf <- (1 - as.numeric(confMat_val_rf$overall[1]) )*100 )
## [1] 0.6882488
The validated RF yielded a high accuracy of 0.993, and the out-of-sample error rate is about 0.69%.
Let’s predict the test data with both models.
While the GBM has a slightly lower accuracy than the RF model, both yielded the same (Quiz-verified) predicted mode of weight-lifting performance on the test dataset.
(predict_rf <- predict(fit_rf, test))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
(predict_gbm <- predict(fit_gbm, test))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
At least for this dataset, both the GBM and RF models do comparatively well in their prediction accuracy. It is interesting to observe the higher out-of-sample error rate of the GBM relative to the RF model.
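Since both models were fitted with caret using 10-fold cross-validation, their resampled performance can also be compared directly; a sketch (output not shown):
# Collect the cross-validation resamples from both fits and compare them
resamps <- resamples(list(GBM = fit_gbm, RF = fit_rf))
summary(resamps)
bwplot(resamps)   # side-by-side accuracy and kappa distributions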
Parameter tuning may simplify the model(s). It would be nice to figure out a way to plot the final tree model(s).
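On the latter point, randomForest does not provide a plotting method for individual trees, but the structure of a single tree in the final model can at least be inspected as a data frame (a sketch):
# Extract the first tree of the forest; labelVar = TRUE shows split
# variable names rather than column indices
tree1 <- getTree(fit_rf$finalModel, k = 1, labelVar = TRUE)
head(tree1)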