The aim of this study is to build a machine learning model that accurately predicts the quality of exercise execution from body-sensor measurements, as described in the Weight Lifting Exercise Dataset compiled by Velloso et al. (Ref: http://groupware.les.inf.puc-rio.br/har). The final model is a Random Forest applied to 52 predictors selected from the original 160 variables. Using 10-fold cross-validation, I obtained a model with 99.3% accuracy. Applying the final model to a held-out testing subset of the training data (4904 observations), I estimate an out-of-sample error rate of 0.57%.
The training dataset and caret package were loaded as follows.
library(caret)
training <- read.csv("pml-training.csv", header = T)
Examining the structure of the dataset showed that the first seven variables were bookkeeping variables that served no purpose in prediction. These were therefore removed.
training <- training[,-(1:7)]
As an initial pre-processing step, I examined the dataset for variables with near-zero variance using the nearZeroVar() function.
## Looking for near-zero-variance covariates
nearZero <- nearZeroVar(training, saveMetrics = T)
head(nearZero)
##                       freqRatio percentUnique zeroVar   nzv
## roll_belt              1.101904     6.7781062   FALSE FALSE
## pitch_belt             1.036082     9.3772296   FALSE FALSE
## yaw_belt               1.058480     9.9734991   FALSE FALSE
## total_accel_belt       1.063160     0.1477933   FALSE FALSE
## kurtosis_roll_belt  1921.600000     2.0232392   FALSE  TRUE
## kurtosis_picth_belt  600.500000     1.6155336   FALSE  TRUE
Any variable with near-zero variance was removed, leaving a data frame with 94 variables.
nearZeroIndex <- nearZero[nearZero$nzv == TRUE, ]
zeroNames <- row.names(nearZeroIndex)
Names <- names(training)
goodNamesRevInd <- Names %in% zeroNames
goodNames <- Names[!goodNamesRevInd]
trainingCov <- training[, goodNames]
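The two metrics in the nearZeroVar() output can be recomputed by hand. The following is a minimal base-R sketch on toy vectors, not caret's implementation: nzv_metrics is a hypothetical helper, the cut-offs mirror caret's documented defaults (freqCut = 95/5, uniqueCut = 10), and the handling of single-valued columns is simplified.

```r
# Hypothetical helper (not part of caret): recomputes freqRatio and
# percentUnique for one column and applies the default cut-offs.
nzv_metrics <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)   # NA values are excluded
  # Count of the most common value divided by count of the second most common
  freqRatio <- if (length(counts) >= 2) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations
  percentUnique <- 100 * length(counts) / length(x)
  list(freqRatio = freqRatio,
       percentUnique = percentUnique,
       nzv = freqRatio > freqCut && percentUnique <= uniqueCut)
}

# Toy data (assumed, not from the study)
mostlyZero <- c(rep(0, 96), 1, 2, 3, 4)   # one dominant value
spreadOut  <- seq_len(100) / 10           # every value distinct

nzv_metrics(mostlyZero)   # flagged as near-zero variance
nzv_metrics(spreadOut)    # not flagged
```

A column is flagged only when both conditions hold: one value dominates and there are few distinct values overall.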
Many of the remaining variables contain mostly NA values. These are summary statistics calculated from the raw sensor readings rather than raw measurements. Because these variables likely have little predictive power, I also removed them from the dataset.
TCraws <- trainingCov[, ! apply(trainingCov, 2,
function(x) any(is.na(x)))]
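An equivalent filter can be written with colMeans(is.na(...)), which also reports how much of each column is missing before anything is dropped. This is a small sketch on an assumed toy data frame; the column names are illustrative only.

```r
# Toy data frame (assumed): one raw column, one mostly-NA summary column,
# and a label column.
toy <- data.frame(
  raw          = c(1.2, 3.4, 5.6, 7.8),
  summary_stat = c(NA, NA, NA, 0.9),
  label        = c("A", "B", "A", "C")
)

# Fraction of missing values per column
naFraction <- colMeans(is.na(toy))
naFraction

# Keep only the columns with no missing values at all
kept <- toy[, naFraction == 0, drop = FALSE]
names(kept)
```

Inspecting naFraction first makes it easy to confirm that the dropped columns really are mostly empty, rather than columns with only a handful of gaps.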
The resulting dataset, TCraws, contains 19622 observations of 53 variables and represents my trimmed, pre-processed training set.
To train and test candidate models, I split the trimmed training set into sub-train and sub-test subsets containing 75% and 25% of the data, respectively.
set.seed(0708)
insubTrain <- createDataPartition(y = TCraws$classe, p = 0.75, list = F)
subTrain <- TCraws[insubTrain,]
subTest <- TCraws[-insubTrain,]
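For a factor outcome, createDataPartition() samples within each class so that the class proportions are roughly preserved in both subsets. The idea can be sketched in base R as follows; the toy class vector is assumed, and caret's exact internals may differ.

```r
set.seed(1)

# Toy outcome (assumed): five unbalanced classes, 120 rows in total
classe <- factor(rep(c("A", "B", "C", "D", "E"),
                     times = c(40, 30, 20, 20, 10)))

# Draw 75% of the row indices separately within each class
inTrain <- unlist(lapply(split(seq_along(classe), classe), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}))

# Class proportions in the full data and in the training subset
round(prop.table(table(classe)), 3)
round(prop.table(table(classe[inTrain])), 3)
```

Sampling within classes keeps rare classes from being under-represented in the held-out subset purely by chance.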
The subTrain set is still large (14,718 observations), making model selection computationally cumbersome. To streamline the process, I took every 20th row of subTrain, giving a 5% subset of 736 observations that served as my sample training set for model selection.
subTrain3Index <- seq(from = 1, to = nrow(subTrain), by = 20)
subTrain3 <- subTrain[subTrain3Index, ]
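The size of this subset follows directly from the step of 20. A quick check, using the subTrain size reported in the text:

```r
nSubTrain <- 14718                        # size of subTrain reported in the text
idx <- seq(from = 1, to = nSubTrain, by = 20)

length(idx)               # number of rows selected
length(idx) / nSubTrain   # fraction of subTrain selected
```

Taking every 20th row is a systematic sample. A simple random sample of the same size, for example sample(nSubTrain, length(idx)), would be a reasonable alternative if row order were a concern.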
Because this dataset contains both numeric and categorical data, I chose candidate algorithms that can handle both (http://topepo.github.io/caret/modelList.html). I shortlisted three model types: a single classification tree (CART, via method = "rpart"), boosting with trees and Random Forest. Note that "rpart" fits one tree rather than a bagged ensemble; caret's bagged-tree method is "treebag". (Code is listed below, but not evaluated in this document.)
## Single classification tree (CART) with bootstrap resampling
CARTTrain13 <- train(classe ~ ., data = subTrain3,
method = "rpart")
##Boosting with trees with bootstrapping resampling
gbmTrain113 <- train(classe ~ ., data = subTrain3,
method = "gbm",
verbose = F)
##Random Forest with bootstrapping resampling
rfTrainAB13 <- train(classe ~ ., data = subTrain3,
method = "rf")
CARTTrain13
gbmTrain113
rfTrainAB13
These models produced maximum accuracies of 47%, 79% and 80%, respectively.
To further train the models, I introduced repeated k-fold cross-validation (caret's default of 10 folds, repeated three times) in place of bootstrapping. (Code is listed below, but not evaluated in this document.)
## Set train control for repeated k-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
##Test the models again
CARTTrain23 <- train(classe ~ ., data = subTrain3,
method = "rpart",
trControl = ctrl)
gbmTrain23 <- train(classe ~ ., data = subTrain3,
method = "gbm",
trControl = ctrl,
verbose = F)
rfTrainAB23 <- train(classe ~ ., data = subTrain3,
method = "rf",
trControl = ctrl)
CARTTrain23
gbmTrain23
rfTrainAB23
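This changed the maximum accuracies to 46.6%, 82.3% and 84.8%, respectively: the single tree was essentially unchanged, while boosting and Random Forest both improved. I concluded that the ensemble methods performed far better on this dataset than a single classification tree, and that the Random Forest algorithm slightly outperformed boosting.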
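The resampling scheme requested by method = "repeatedcv" can be illustrated with a hand-rolled 10-fold loop. The sketch below is base R only and makes several assumptions for the sake of a self-contained example: it uses the built-in iris data rather than the study data, and a simple nearest-centroid classifier stands in for the tree-based models.

```r
set.seed(42)
k <- 10
n <- nrow(iris)

# Randomly assign every row to one of k folds of equal size
folds <- sample(rep(seq_len(k), length.out = n))

foldAccuracy <- vapply(seq_len(k), function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]

  # "Fit": one mean vector (centroid) per class, from training rows only
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  centers   <- as.matrix(centroids[, -1])

  # Squared distance from every held-out row to every class centroid
  d <- apply(centers, 1, function(ctr) {
    rowSums(sweep(as.matrix(test[, 1:4]), 2, ctr)^2)
  })

  # Predict the class of the nearest centroid and score the held-out fold
  pred <- as.character(centroids$Species)[max.col(-d, ties.method = "first")]
  mean(pred == as.character(test$Species))
}, numeric(1))

mean(foldAccuracy)   # cross-validated accuracy estimate
```

Repeated cross-validation, as requested by repeats = 3, reruns this loop with fresh fold assignments and averages over all repeats.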
To determine whether centering and scaling the predictors might further improve the Random Forest algorithm, I tested it with the following code. (Code is listed below, but not evaluated in this document.)
rfTrainAB33 <- train(classe ~ ., data = subTrain3,
method = "rf",
trControl = ctrl,
preProc = c("center", "scale"))
rfTrainAB33
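This produced lower accuracy than above, so I decided not to include pre-processing in the final model. This is not surprising: tree-based splits depend only on the ordering of predictor values, so centering and scaling would not be expected to help a Random Forest.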
To train my final model, I applied the Random Forest algorithm with repeated 10-fold cross-validation to the larger sub-training dataset of 14,718 observations. (Code is listed below, but not evaluated in this document.)
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
rfModel1 <- train(classe ~ ., data = subTrain,
method = "rf",
trControl = ctrl)
rfModel1
rfModel1$finalModel
This model achieves 99.29% accuracy with an out-of-bag (OOB) estimated error rate of 0.57%. However, fitting it was computationally expensive, requiring nearly an hour on a computer with 4 GB of RAM.
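The out-of-bag estimate exists because each tree is grown on a bootstrap sample of the rows, and the rows that a tree never saw can be used to test it. A short base-R sketch of how many rows a single bootstrap sample leaves out, using the subTrain size from the text:

```r
set.seed(7)
n <- 14718   # number of rows in subTrain, as reported in the text

# One bootstrap sample: n draws with replacement from the row indices
inBag <- sample.int(n, size = n, replace = TRUE)

# Fraction of rows never drawn, i.e. out-of-bag for this one tree
oobFraction <- 1 - length(unique(inBag)) / n
oobFraction

# Theoretical value: (1 - 1/n)^n, which approaches exp(-1) for large n
(1 - 1 / n)^n
```

Roughly a third of the rows are out-of-bag for any given tree, so every row is scored by many trees that were not trained on it.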
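I therefore fitted the model with the randomForest algorithm directly (outside of the caret train() function) to speed up the computation. Note that randomForest() does not take a trControl argument, so the ctrl2 object in the call below appears to be passed through and ignored; the speed-up most likely comes from fitting a single forest with default settings (500 trees, 7 variables tried at each split) instead of refitting across cross-validation resamples and candidate tuning values.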
##Using direct rf function
library(randomForest)
set.seed(123)
ctrl2 <- trainControl(method = "repeatedcv", repeats = 3,
returnData = F,
returnResamp = "none",
savePredictions = F)
rfModel3 <- randomForest(classe ~ ., data = subTrain, trControl = ctrl2)
rfModel3
##
## Call:
## randomForest(formula = classe ~ ., data = subTrain, trControl = ctrl2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.48%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4182    2    0    0    1 0.0007168459
## B   14 2826    8    0    0 0.0077247191
## C    0    9 2555    3    0 0.0046747176
## D    0    0   21 2388    3 0.0099502488
## E    0    1    1    8 2696 0.0036954915
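The OOB error rate and the class.error column can be recomputed from the confusion matrix itself. The sketch below re-enters the counts printed above, with rows as actual classes and columns as OOB predictions.

```r
# Counts copied from the OOB confusion matrix printed above
oobConf <- matrix(
  c(4182,    2,    0,    0,    1,
      14, 2826,    8,    0,    0,
       0,    9, 2555,    3,    0,
       0,    0,   21, 2388,    3,
       0,    1,    1,    8, 2696),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall OOB error: share of rows that fall off the diagonal
oobError <- 1 - sum(diag(oobConf)) / sum(oobConf)
round(100 * oobError, 2)   # in percent

# Per-class error: misclassified share of each actual class (row)
classError <- 1 - diag(oobConf) / rowSums(oobConf)
classError
```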
With the final model in hand, I tested its performance on the subTest subset of the training set.
set.seed(124)
DirRFPred <- predict(rfModel3, subTest)
To assess the performance of the model on the test set, I computed the prediction accuracy and the out-of-sample error rate.
##Table of predicted values versus true values
table(DirRFPred, subTest$classe)
##
## DirRFPred    A    B    C    D    E
##         A 1393    4    0    0    0
##         B    2  944    3    0    0
##         C    0    1  851   14    2
##         D    0    0    1  790    1
##         E    0    0    0    0  898
##Calculate success rate
success <- DirRFPred == subTest$classe
##Produce table of success rate
table(success)
## success
## FALSE TRUE
## 28 4876
##Calculate out-of-sample error rate
oosRate <- mean(!success)
oosRate
## [1] 0.005709625
The output from these commands shows the success rate table as well as the out-of-sample error rate, which is 0.57%.
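The same table also yields per-class figures and a rough uncertainty interval for the accuracy. The sketch below re-enters the counts from the table of predicted versus true values (rows are predictions, columns are true classes) and uses binom.test() from base R; caret's confusionMatrix() reports comparable statistics.

```r
# Counts copied from the predicted-versus-true table above
testTab <- matrix(
  c(1393,   4,   0,   0,   0,
       2, 944,   3,   0,   0,
       0,   1, 851,  14,   2,
       0,   0,   1, 790,   1,
       0,   0,   0,   0, 898),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = LETTERS[1:5], actual = LETTERS[1:5])
)

# Overall accuracy and its complement, the out-of-sample error rate
accuracy <- sum(diag(testTab)) / sum(testTab)
1 - accuracy

# Per-class sensitivity: correctly predicted share of each true class (column)
sensitivity <- diag(testTab) / colSums(testTab)
round(sensitivity, 4)

# Exact binomial 95% confidence interval for the accuracy
ci <- binom.test(sum(diag(testTab)), sum(testTab))$conf.int
ci
```

Class D has the lowest sensitivity here, driven by the 14 class D observations that were predicted as class C.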
In this study, I examined three types of predictive algorithms for accuracy in making predictions on the Weight Lifting Exercise Dataset compiled by Velloso et al. I found that a Random Forest algorithm, selected using repeated 10-fold cross-validation, produced a robust, accurate model with an out-of-sample error rate of 0.57%. Moreover, I saw first-hand how tweaking tuning parameters and making wise choices about how a model is fitted can greatly affect the computational demands of the algorithm.