Summary

The aim of this study is to build a machine learning model that accurately predicts how well an exercise is executed, using measurements captured by body sensors as described in the Weight Lifting Exercise Dataset compiled by Velloso et al. (Ref: http://groupware.les.inf.puc-rio.br/har). The final model is a Random Forest applied to 52 predictors selected from the original 160 variables. Using repeated 10-fold cross-validation, I obtained a model with 99.3% accuracy. Applying the final model to a held-out testing subset of the training data (4904 observations), I estimate an out-of-sample error rate of 0.57%.

Loading and preprocessing the data

The training dataset and caret package were loaded as follows.

library(caret)
training <- read.csv("pml-training.csv", header = T)

Examining the structure of the dataset showed that the first seven variables were bookkeeping variables (row index, user name, timestamps and window markers) that serve no purpose in prediction, so I removed them.

training <- training[,-(1:7)]

Removing variables with near-zero variance

As an initial pre-processing step, I examined the dataset for variables with near-zero variance using the nearZeroVar() function.

##Looking for near-zero-variance covariates
nearZero <- nearZeroVar(training, saveMetrics = T)
head(nearZero)
##                       freqRatio percentUnique zeroVar   nzv
## roll_belt              1.101904     6.7781062   FALSE FALSE
## pitch_belt             1.036082     9.3772296   FALSE FALSE
## yaw_belt               1.058480     9.9734991   FALSE FALSE
## total_accel_belt       1.063160     0.1477933   FALSE FALSE
## kurtosis_roll_belt  1921.600000     2.0232392   FALSE  TRUE
## kurtosis_picth_belt  600.500000     1.6155336   FALSE  TRUE

Any variable flagged as having near-zero variance (nzv == TRUE) was removed, leaving a data frame with 94 variables.

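##Row names of nearZero are the variable names; collect the flagged ones
nearZeroIndex <- nearZero[nearZero$nzv == TRUE, ]
zeroNames <- row.names(nearZeroIndex)
##Keep every column whose name is not in the flagged set
Names <- names(training)
goodNamesRevInd <- Names %in% zeroNames
goodNames <- Names[!goodNamesRevInd]
trainingCov <- training[, goodNames]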
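
To make the freqRatio and percentUnique columns above more concrete, here is a minimal base R sketch of how such metrics can be computed for a single column. The helper nzv_metrics() is hypothetical and written only for illustration; it is not part of caret, and it assumes the cutoffs freqCut = 95/5 and uniqueCut = 10, which I believe are the nearZeroVar() defaults.

```r
# Illustrative re-implementation of the metrics reported by nearZeroVar().
# nzv_metrics() is a hypothetical helper, not a caret function.
nzv_metrics <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)   # table() drops NA by default
  # Ratio of the most common value's count to the second most common's
  freqRatio <- if (length(counts) >= 2) counts[[1]] / counts[[2]] else 0
  # Distinct values as a percentage of the number of observations
  percentUnique <- 100 * length(counts) / length(x)
  zeroVar <- length(counts) <= 1
  data.frame(freqRatio = freqRatio,
             percentUnique = percentUnique,
             zeroVar = zeroVar,
             nzv = zeroVar || (freqRatio > freqCut && percentUnique <= uniqueCut))
}

# Toy column: 990 zeros and 10 ones, so a single value dominates
toy <- c(rep(0, 990), rep(1, 10))
nzv_metrics(toy)   # freqRatio = 99, percentUnique = 0.2, nzv = TRUE
```

A column is flagged only when one value dominates (high freqRatio) and there are few distinct values (low percentUnique), which matches the pattern seen for kurtosis_roll_belt in the output above.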

Removing summary variables

Many of the remaining variables contain mostly NA values. These are summary variables, that is, statistics calculated from the raw sensor readings. Because they are missing for most observations and likely add little predictive power, I removed every variable that contains NA values.

TCraws <- trainingCov[, ! apply(trainingCov, 2, 
                               function(x) any(is.na(x)))]

The resulting dataset, TCraws, contains 19622 observations of 53 variables and represents my trimmed, pre-processed training set.
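
Before dropping columns it can be useful to check how much of each column is actually missing. The sketch below does this on a made-up toy data frame (the column names and values are invented for illustration, and the 50% threshold is an assumption, not something used in the analysis above).

```r
# Toy data frame standing in for trainingCov (values are made up)
toy <- data.frame(raw_x  = c(1.2, 0.7, 3.1, 2.4),
                  avg_x  = c(NA, NA, NA, 1.85),   # summary column, mostly NA
                  classe = c("A", "A", "B", "B"))

# Share of missing values per column
naFraction <- colMeans(is.na(toy))
naFraction

# Keep columns below a hypothetical 50% missingness threshold
toyClean <- toy[, naFraction < 0.5]
names(toyClean)   # "raw_x" "classe"
```

When the affected columns are either complete or almost entirely NA, a threshold like this and the stricter any(is.na(x)) rule used above select the same columns.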

Model selection

Subsetting into sub-train and sub-test subsets

To train and test candidate models, I split the trimmed training set into sub-train and sub-test subsets containing 75% and 25% of the data, respectively.

set.seed(0708)
insubTrain <- createDataPartition(y = TCraws$classe, p = 0.75, list = F)
subTrain <- TCraws[insubTrain,]
subTest <- TCraws[-insubTrain,]

The subTrain set is still large (14,718 observations), which makes model selection computationally cumbersome. To streamline the process, I took every 20th row of subTrain, giving a 5% sample of 736 observations that served as my training set for model selection.

subTrain3Index <- seq(from = 1, to = nrow(subTrain), by = 20)
subTrain3 <- subTrain[subTrain3Index, ]
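
The sample size quoted above follows directly from the step size. This short check reproduces the arithmetic using only the row count reported for subTrain; the random alternative at the end is an illustration, not what was used in this analysis.

```r
nSubTrain <- 14718                        # rows in subTrain, as reported above

# Every 20th row, starting at row 1
idx <- seq(from = 1, to = nSubTrain, by = 20)
length(idx)                               # 736
length(idx) / nSubTrain                   # about 0.05

# Illustrative alternative: a simple random sample of the same size,
# which does not depend on the row order of the data
set.seed(1)
idxRandom <- sort(sample(nSubTrain, length(idx)))
length(idxRandom)                         # 736
```

Taking every 20th row is simple and reproducible, but it relies on the row order carrying no pattern with a period of 20; a random sample avoids that assumption.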

Testing and tuning model parameters

Because this dataset contains both numeric and categorical data, I chose candidate algorithms that handle both (http://topepo.github.io/caret/modelList.html). I shortlisted three model types: a single classification tree (CART, method = "rpart"), boosting with trees, and Random Forest. (Code is listed below, but not evaluated in this document.)

##Single classification tree (CART) with bootstrap resampling
CARTTrain13 <- train(classe ~ ., data = subTrain3,
                     method = "rpart")

##Boosting with trees with bootstrapping resampling
gbmTrain113 <- train(classe ~ ., data = subTrain3,
                   method = "gbm",
                   verbose = F)
##Random Forest with bootstrapping resampling
rfTrainAB13 <- train(classe ~ ., data = subTrain3,
                    method = "rf")
CARTTrain13
gbmTrain113
rfTrainAB13

These models produced maximum accuracies of 47%, 79% and 80%, respectively.

To further tune the models, I replaced bootstrap resampling with repeated k-fold cross-validation (10 folds by default, repeated 3 times). (Code is listed below, but not evaluated in this document.)

##Set train control for repeated k-fold resampling
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
##Test the models again
CARTTrain23 <- train(classe ~ ., data = subTrain3,
                     method = "rpart",
                     trControl = ctrl)
gbmTrain23 <- train(classe ~ ., data = subTrain3,
                   method = "gbm",
                   trControl = ctrl,
                   verbose = F)
rfTrainAB23 <- train(classe ~ ., data = subTrain3,
                    method = "rf",
                    trControl = ctrl)
CARTTrain23
gbmTrain23
rfTrainAB23

Under cross-validation the maximum accuracies were 46.6%, 82.3% and 84.8%, respectively: the single tree was essentially unchanged, while boosting and Random Forest both improved. I concluded that the tree ensembles performed far better on this dataset than a single tree, and that the Random Forest algorithm slightly out-performed boosting.
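
As a rough illustration of the resampling scheme requested by method = "repeatedcv" with repeats = 3, the following base R sketch assigns rows to folds by hand. The helper make_repeated_folds() is hypothetical, and it is a simplification: as far as I know, caret also balances the class proportions within each fold, which this sketch does not attempt.

```r
# make_repeated_folds() is a hypothetical helper, not a caret function.
# It returns one vector of fold labels (1..k) per repeat.
make_repeated_folds <- function(n, k = 10, repeats = 3) {
  lapply(seq_len(repeats), function(r) {
    # Shuffle the fold labels so each repeat uses a different partition
    sample(rep(seq_len(k), length.out = n))
  })
}

set.seed(1)
folds <- make_repeated_folds(n = 736)   # 736 rows, as in subTrain3
length(folds)                           # 3 repeats
table(folds[[1]])                       # each fold holds 73 or 74 rows

# In repeat 1, fold 1 is held out and the other nine folds are used to fit
holdOut <- which(folds[[1]] == 1)
fitRows <- which(folds[[1]] != 1)
length(holdOut) + length(fitRows)       # 736
```

Each model is therefore fitted 30 times (10 folds times 3 repeats) per candidate tuning value, and the reported accuracy is the average over the held-out folds.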

To determine whether centering and scaling the predictors might further improve the Random Forest algorithm, I tested it with the following code. (Code is listed below, but not evaluated in this document.)

rfTrainAB33 <- train(classe ~ ., data = subTrain3,
                     method = "rf",
                     trControl = ctrl,
                     preProc = c("center", "scale"))
rfTrainAB33

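This produced lower accuracy than above. Because tree splits depend only on the ordering of each predictor's values, centering and scaling are not expected to help a Random Forest, so I decided not to include pre-processing in the final model.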

Training the final model

To train my final model, I applied the Random Forest algorithm with repeated k-fold cross-validation to the larger sub-training dataset of 14,718 observations. (Code is listed below, but not evaluated in this document.)

ctrl <- trainControl(method = "repeatedcv", repeats = 3)
rfModel1 <- train(classe ~ ., data = subTrain,
                     method = "rf",
                     trControl = ctrl)
rfModel1
rfModel1$finalModel

This model achieves 99.29% accuracy with an out-of-bag (OOB) estimated error rate of 0.57%. However, this algorithm was computationally expensive, requiring nearly an hour to run on a computer with 4 GB of RAM.
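
The out-of-bag estimate exists because each tree in a Random Forest is grown on a bootstrap sample, which leaves out roughly a third of the rows; those rows can then be used to test that tree. The following sketch illustrates the proportion for a dataset the size of subTrain (the seed is arbitrary).

```r
n <- 14718                          # rows in subTrain

# Chance that a given row is absent from one bootstrap sample of size n
(1 - 1 / n)^n
exp(-1)                             # limiting value, about 0.368

# Simulate one bootstrap sample and measure the share of rows left out
set.seed(42)
inBag <- sample(n, replace = TRUE)
oobShare <- mean(!(seq_len(n) %in% inBag))
oobShare
```

Because every row is out of bag for a sizeable share of the trees, the OOB error behaves like a built-in validation estimate and needs no separate resampling loop.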

I therefore made some adjustments to the tuning parameters and used the randomForest algorithm directly (outside of the caret train() function) to speed up the computation.

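##Using direct rf function
library(randomForest)
set.seed(123)
ctrl2 <- trainControl(method = "repeatedcv", repeats = 3,
                      returnData = F,
                      returnResamp = "none",
                      savePredictions = F)
##Note: trControl is an argument of caret's train(), not of randomForest().
##randomForest() does not use it, so no cross-validation is run here; the
##OOB estimate printed below serves as the internal error estimate.
rfModel3 <- randomForest(classe ~ ., data = subTrain, trControl = ctrl2)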
rfModel3
## 
## Call:
##  randomForest(formula = classe ~ ., data = subTrain, trControl = ctrl2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.48%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4182    2    0    0    1 0.0007168459
## B   14 2826    8    0    0 0.0077247191
## C    0    9 2555    3    0 0.0046747176
## D    0    0   21 2388    3 0.0099502488
## E    0    1    1    8 2696 0.0036954915
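
The OOB error rate and the class.error column can be recomputed from the confusion matrix itself. The sketch below copies the counts from the output above into a matrix by hand (it does not need the fitted model) and assumes, as in randomForest's printed output, that rows are the true classes.

```r
# OOB confusion matrix copied from the randomForest output above
# (rows = true class, columns = predicted class)
oobConf <- matrix(c(4182,    2,    0,    0,    1,
                      14, 2826,    8,    0,    0,
                       0,    9, 2555,    3,    0,
                       0,    0,   21, 2388,    3,
                       0,    1,    1,    8, 2696),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Per-class error: share of each true class that was misclassified
classError <- 1 - diag(oobConf) / rowSums(oobConf)
round(classError, 4)

# Overall OOB error: all off-diagonal counts over all observations
oobError <- 1 - sum(diag(oobConf)) / sum(oobConf)
round(100 * oobError, 2)   # 0.48 (percent)

# Default number of variables tried per split for classification
floor(sqrt(52))            # 7, matching the output above
```

The 71 misclassified rows out of 14,718 give the 0.48% shown in the output, and the value of 7 variables per split is consistent with the default for 52 predictors.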

Testing the model on the sub-test subset of the training set

With the final model in hand, I tested its performance on my subTest subset of the training set.

set.seed(124)
DirRFPred <- predict(rfModel3, subTest)

To assess the performance of the model on the sub-test set, I tabulated the predictions against the true classes and calculated the out-of-sample error rate.

##Table of predicted values versus true values
table(DirRFPred, subTest$classe)
##          
## DirRFPred    A    B    C    D    E
##         A 1393    4    0    0    0
##         B    2  944    3    0    0
##         C    0    1  851   14    2
##         D    0    0    1  790    1
##         E    0    0    0    0  898
##Calculate success rate
success <- DirRFPred == subTest$classe
##Produce table of success rate
table(success)
## success
## FALSE  TRUE 
##    28  4876
##Calculate out-of-sample error rate
oosRate <- mean(!success)
oosRate
## [1] 0.005709625

The output from these commands shows the table of predicted versus true classes, the counts of correct and incorrect predictions, and the out-of-sample error rate, which is 0.57% (28 errors in 4904 predictions).
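
The same prediction table also gives a per-class view of performance. The sketch below re-enters the counts from the output above by hand and computes the overall error, the sensitivity for each class, and an exact binomial confidence interval for the error rate; the interval is an extra diagnostic I am suggesting, not part of the original analysis.

```r
# Prediction table copied from the output above
# (rows = predicted class, columns = true class)
testConf <- matrix(c(1393,   4,   0,   0,   0,
                        2, 944,   3,   0,   0,
                        0,   1, 851,  14,   2,
                        0,   0,   1, 790,   1,
                        0,   0,   0,   0, 898),
                   nrow = 5, byrow = TRUE,
                   dimnames = list(predicted = LETTERS[1:5],
                                   actual = LETTERS[1:5]))

accuracy <- sum(diag(testConf)) / sum(testConf)
round(1 - accuracy, 4)                          # 0.0057

# Sensitivity: share of each true class that was predicted correctly
sensitivity <- diag(testConf) / colSums(testConf)
round(sensitivity, 4)

# Exact 95% binomial interval for the error rate (28 errors in 4904)
nErrors <- sum(testConf) - sum(diag(testConf))
binom.test(nErrors, sum(testConf))$conf.int
```

Class D has the lowest sensitivity here, since 14 of its observations were predicted as class C.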

Conclusions

In this study, I examined three types of predictive algorithms for accuracy in making predictions on the Weight Lifting Exercise Dataset compiled by Velloso et al. I found that a Random Forest algorithm, selected using repeated k-fold cross-validation, produced a robust, accurate model with an out-of-sample error rate of 0.57%. Moreover, I saw first-hand how tweaking tuning parameters and making wise choices about how the model is fitted can greatly affect the computational demands of the algorithm.