Executive Summary

The purpose of this report is to predict how well six participants aged 20-28 years perform a Unilateral Dumbbell Biceps Curl. Each participant was asked to perform one set of 10 repetitions using correct and incorrect weight lifting technique while data were collected from inertial measurement units (IMUs) in the participant's glove, armband, lumbar belt and dumbbell. The resulting data contain five ‘classes’ corresponding to correct execution and four common mistakes. The data were explored and then modelled using three prediction techniques: random forest, classification trees, and gradient boosting. The random forest technique was selected because of its high accuracy, and the resulting model is used to predict how well the exercise was performed on a separate testing dataset.

Note that the echo = FALSE parameter has been added to the code chunks to prevent printing of the R code, all of which can be found in the appendix.

Summary of the data

## [1] 19622   160
## [1]  20 160

Exploration of the training dataset reveals a data frame with 19622 observations of 160 variables; the testing dataset contains 20 observations of 160 variables. The testing dataset will be set aside to test the final prediction model.
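
For reference, the loading code from Appendix A is shown below; the datasets are read directly from the course URLs.

pml_training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
pml_testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
# dimensions of the training and testing datasets
dim(pml_training)
dim(pml_testing)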

Exploratory data analyses for predictor variable selection

The dataset is first inspected for predictors that are almost constant across samples. These predictors are non-informative and may adversely affect prediction models. The report uses the nearZeroVar() function in caret to remove predictors that have a single unique value across samples (zero-variance predictors) and predictors that have both 1) few unique values relative to the number of samples and 2) a large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero-variance predictors). In addition, predictor variables that contain NA or blank values will be removed because there are too many missing values to impute. The variables at the following column positions possess near-zero variance and are excluded from the dataset.
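
A condensed version of the corresponding cleaning code from Appendix A (assuming pml_training has been loaded as above):

library(caret)
library(dplyr)
# column positions of zero- and near-zero-variance predictors
nearZeroVar(pml_training)
df1 <- pml_training[, -nearZeroVar(pml_training)]
# drop columns that contain NA or blank values, then the first six bookkeeping columns
df1 <- df1 %>% select_if(~ !any(is.na(.) | . == ""))
df1 <- df1[, -(1:6)]
dim(df1)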

##  [1]   6  12  13  14  15  16  17  20  23  26  51  52  53  54  55  56  57  58  59
## [20]  69  70  71  72  73  74  75  78  79  81  82  87  88  89  90  91  92  95  98
## [39] 101 125 126 127 128 129 130 131 133 134 136 137 139 142 143 144 145 146 147
## [58] 148 149 150
## [1] 19622    53

Removal of the near-zero-variance variables, the columns with missing or blank values, and the first six bookkeeping columns has reduced the dataset to 53 columns (52 predictor variables plus the ‘classe’ outcome).

Correlation matrix to identify highly correlated predictor variables

This analysis identifies and removes predictor variables that are highly correlated with one another (and therefore add little value to the prediction model). Before processing, the data have a maximum pairwise correlation of 0.98, indicating the presence of highly correlated variables. The findCorrelation() function in caret is used to determine which predictor variables are highly correlated and to remove them.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.992008 -0.110080  0.002092  0.001790  0.092552  0.980924
Figure 1: Correlation matrix after the removal of the highly correlated predictor variables

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.768688 -0.108470  0.009256  0.011198  0.110351  0.780565

After removal of the highly correlated variables (cutoff set at 0.8), the maximum correlation has been reduced to 0.78.
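
The corresponding correlation-filtering code from Appendix A (df1 is the cleaned dataset from the previous step):

# pairwise correlations of the predictors (excluding the 'classe' outcome in column 53)
df1_cor <- cor(df1[, -53])
summary(df1_cor[upper.tri(df1_cor)])
# identify and drop predictors with a pairwise correlation above the 0.8 cutoff
cor.index <- findCorrelation(df1_cor, cutoff = 0.8)
df2 <- df1[, -cor.index]
# summarise the correlations of the remaining predictors
df2_cor <- cor(df2[, -40])
summary(df2_cor[upper.tri(df2_cor)])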

Selection of the most significant predictor variables

Following the analysis and selection of the most significant predictor variables, the dataset dimensions have been reduced to 40 columns (39 predictor variables and 1 outcome variable).

## [1] 19622    40

Partitioning of the reduced training dataset into separate training and validation datasets
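
The partitioning code from Appendix A uses a 60/40 split of the reduced dataset:

# create a 60/40 training/validation split
set.seed(32323)
inTrain <- createDataPartition(df2$classe, p = 0.6, list = FALSE)
training <- df2[inTrain, ]
validation <- df2[-inTrain, ]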

Predictive modelling and model selection

The purpose of these analyses is to predict one of the ‘classe’ outcomes corresponding to the execution of a Unilateral Dumbbell Biceps Curl. Because the outcome is a discrete value, this is a classification (rather than a regression) problem, and for this reason the analyses use classification algorithms, namely classification trees, random forest and gradient boosting, which are well suited to this type of data.

Cross-validation for estimation of prediction error

Cross-validation is a technique for evaluating machine learning models by training several models on subsets of the available input data (the training dataset in these analyses) and evaluating them on the complementary subset of the data (the validation dataset). It is used to detect overfitting, i.e., a model that gives accurate predictions for the training data but fails to generalize to new data.

These analyses use k-fold cross-validation with k = 5, chosen for processing-time reasons. The k-fold cross-validation method splits the training dataset into 5 subsets (folds). Each fold is held out in turn while the model is trained on the remaining folds; the accuracy is determined on each held-out fold and an overall accuracy estimate is calculated for the model.
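
In caret this is specified through trainControl(); the code from Appendix A also enables parallel processing to shorten the random forest fit:

library(parallel)
library(doParallel)
# leave one core free for the operating system
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
# 5-fold cross-validation, reused by the models below
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)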

Prediction with Random Forest

Random forest builds a large collection of de-correlated trees and then either votes over or averages them to obtain the prediction for a new observation. When used for classification, as in this case, a random forest obtains a class vote from each tree and classifies using the majority vote.
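
The random forest fit from Appendix A, with the tuning parameter fixed at mtry = 7:

# separate the outcome (column 40) and predictors to avoid the slower formula interface
y <- training[, 40]
x <- training[, -40]
set.seed(95014)
fitrf <- train(x, y, method = "rf", trControl = fitControl, tuneGrid = data.frame(mtry = 7))
stopCluster(cluster)
registerDoSEQ()
# accuracy on the held-out validation partition
prf <- predict(fitrf, validation)
confusionMatrix(prf, as.factor(validation$classe))$overall[1]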

## Accuracy 
## 0.990441

Prediction with classification trees

A classification tree takes the predictor variables used to predict ‘classe’, splits the observations into groups on each of the variables in turn, and evaluates the homogeneity of the outcome within each group. The model continues to split until the groups are sufficiently homogeneous, or sufficiently small, for splitting to stop.
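
The classification tree fit from Appendix A (fitControl1 repeats the 5-fold cross-validation without the parallel option):

library(rattle)
# classification tree (rpart) with 5-fold cross-validation
fitControl1 <- trainControl(method = "cv", number = 5)
fitdt <- train(classe ~ ., method = "rpart", data = training, trControl = fitControl1)
fancyRpartPlot(fitdt$finalModel, sub = "", caption = "")
# accuracy on the validation partition
pdt <- predict(fitdt, validation)
confusionMatrix(pdt, as.factor(validation$classe))$overall[1]
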
Figure 2: Plot of the classification tree model

##  Accuracy 
## 0.5406577

Prediction with gradient boosting

Boosting combines the outputs of many ‘weak’ classifiers to produce a stronger predictor. It sequentially applies the weak classification algorithm to repeatedly modified versions of the data, producing a sequence of weak classifiers whose predictions are then combined through a weighted majority vote to produce the final prediction.
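
The gradient boosting fit from Appendix A (gbm through caret, reusing the same 5-fold cross-validation control):

# gradient boosted trees with 5-fold cross-validation
fitgbm <- train(classe ~ ., method = "gbm", data = training, trControl = fitControl1, verbose = FALSE)
# accuracy on the validation partition
pgbm <- predict(fitgbm, validation)
confusionMatrix(pgbm, as.factor(validation$classe))$overall[1]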

##  Accuracy 
## 0.9463421

Summary of accuracy

Summary of Accuracy

              Random.Forest   Gradient.Boosting   Classification.Tree
Accuracy               0.99                0.95                  0.54

Summary of out-of-sample error rates

The out-of-sample error, sometimes called the generalization error, is the error rate obtained on a new dataset. It is estimated here by taking each model's accuracy on the validation dataset and subtracting it from 1.
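
The corresponding calculation from Appendix A simply subtracts each validation accuracy from 1:

library(knitr)
# out-of-sample error estimate: 1 minus validation accuracy for each model
oose <- data.frame(
  Random.Forest       = 1 - confusionMatrix(prf,  as.factor(validation$classe))$overall[1],
  Gradient.Boosting   = 1 - confusionMatrix(pgbm, as.factor(validation$classe))$overall[1],
  Classification.Tree = 1 - confusionMatrix(pdt,  as.factor(validation$classe))$overall[1])
kable(oose, caption = "Out-of-sample error rates", digits = 3)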

Out-of-sample error rates

                Random.Forest   Gradient.Boosting   Classification.Tree
Error rate               0.01               0.054                 0.459

Final model selection

The random forest model has been selected because of its high accuracy. It is used to predict how well the exercise was performed for each observation in the entirely separate testing dataset.

Final prediction on testing dataset
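
The selected random forest model is applied to the 20 observations in the testing dataset (code from Appendix A):

# final prediction on the held-out testing dataset
prf_final <- predict(fitrf, pml_testing)
prf_final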

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Appendix A: All R code for this report

# knitr global options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, fig.cap = TRUE, fig.align = "center",
                      fig.path = "figures/")
# set global R options separately rather than inside opts_chunk$set()
options(scipen = 999)
knitr::opts_current$get('label')
# use captioner to add figure number and caption
library(captioner)
fig_nums <- captioner()
fig_nums("figa", "Correlation matrix after the removal of the highly correlated predictor variables")
fig_nums("figb", "Plot of the classification tree model")
# load the dataset and display the dimensions of the training and testing dataset
pml_training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
pml_testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
dim(pml_training)
dim(pml_testing)
# load libraries to remove unnecessary variables
library(caret)
library(dplyr)
# identifying and removing zero- and near zero-variance predictors (these may cause issues when subsampling)
nearZeroVar(pml_training)
df1 <- pml_training[,-nearZeroVar(pml_training)]
# identifying and removing columns with missing values
df1 <- df1 %>% select_if(~ !any(is.na(.) | . == "")) 
# remove first 6 columns that are not important
df1 <- df1[,-(1:6)] 
dim(df1)
library(corrplot)
# a correlation of the predictors (excluding 'classe') and summary
df1_cor <- cor(df1[,-53])
summary(df1_cor[upper.tri(df1_cor)]) # max = 0.98
# use findCorrelation() function to determine the highly correlated predictor variables
cor.index <- findCorrelation(df1_cor, cutoff=0.8)
# remove the highly correlated variables 
df2 <- df1[, -cor.index]
df2_cor <- cor(df2[,-40])
# and display the resulting correlation matrix and summary
diag(df2_cor) <- 0
corrplot(df2_cor)
summary(df2_cor[upper.tri(df2_cor)]) 
# display dimensions of the dataset before creating the partition for training
dim(df2)
# create partitioned data with a 60/40 split
set.seed(32323)
inTrain <- createDataPartition(df2$classe, p = 0.6, list = FALSE)
training <- df2[ inTrain,]
validation <- df2[-inTrain,]
# set up prediction with random forest (with parallel processing)
set.seed(95014)
# set up x and y to avoid slowness of caret() with model syntax
y <- training[,40]
x <- training[,-40]
# use parallel processing capabilities to speed up performance
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # leave 1 core for OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fitrf <- train(x, y, method="rf", trControl = fitControl, tuneGrid=data.frame(mtry=7))
stopCluster(cluster)
registerDoSEQ()
# model prediction
prf <- predict(fitrf, validation)
confusionMatrix(prf, as.factor(validation$classe))$overall[1]
# set up prediction with classification trees
library(rattle)
# cart model
fitControl1 <- trainControl(method = "cv", number = 5)
fitdt <- train(classe ~ ., method="rpart", data=training, trControl = fitControl1)
fancyRpartPlot(fitdt$finalModel, sub = "", caption = "")
# model prediction
pdt <- predict(fitdt, validation)
confusionMatrix(pdt, as.factor(validation$classe))$overall[1]
# set up prediction with boosting
fitgbm <- train(classe ~ ., method="gbm", data=training, trControl = fitControl1, verbose = FALSE)
# model prediction
pgbm <- predict(fitgbm, validation)
confusionMatrix(pgbm, as.factor(validation$classe))$overall[1]
# summary of accuracy
sumacc <- data.frame(Random.Forest=confusionMatrix(prf, as.factor(validation$classe))$overall[1], 
                Gradient.Boosting=confusionMatrix(pgbm, as.factor(validation$classe))$overall[1],
                Classification.Tree=confusionMatrix(pdt, as.factor(validation$classe))$overall[1])
library(knitr)
kable(sumacc, caption = "Summary of Accuracy", digits = 2)
# summary of out-of-sample errors (1-accuracy)
oose <- data.frame(Random.Forest=1-(confusionMatrix(prf, as.factor(validation$classe))$overall[1]), 
                Gradient.Boosting=1-(confusionMatrix(pgbm, as.factor(validation$classe))$overall[1]),
                Classification.Tree=1-(confusionMatrix(pdt, as.factor(validation$classe))$overall[1]))
kable(oose, caption = "Out-of-sample error rates", digits = 3)
# final model prediction and result
prf_final <- predict(fitrf, pml_testing)
prf_final
