The purpose of this report is to predict how well 6 participants aged 20-28 years perform a Unilateral Dumbbell Biceps Curl. Participants were asked to perform one set of 10 repetitions using correct and incorrect weight-lifting technique while data were collected from inertial measurement units (IMUs) in a glove, an armband, a lumbar belt, and the dumbbell. The resulting data contain 5 ‘classes’ corresponding to correct execution and 4 common mistakes. The data were explored and then trained using 3 prediction modelling techniques: random forest, classification trees, and gradient boosting. The random forest technique was selected because of its high accuracy, producing a prediction model capable of predicting how well the exercise is performed on a separate testing dataset.
Note that the echo = FALSE parameter has been added to the code chunks to prevent printing of the R code, all of which can be found in the appendix.
## [1] 19622 160
## [1] 20 160
Exploration reveals a training dataset of 19622 observations of 160 variables and a testing dataset of 20 observations of the same 160 variables. The testing dataset will be set aside to test the final prediction model.
The dataset was inspected to identify predictors that are almost constant across samples. These predictors are non-informative and may adversely affect prediction models. The report uses the nearZeroVar() function in caret to remove predictors that have one unique value across samples (zero-variance predictors) and predictors that have both 1) few unique values relative to the number of samples and 2) a large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero-variance predictors). In addition, predictor variables that contain NA or are blank will be removed because there are too many missing values to impute. The following variables possess near-zero variance and will be excluded from the dataset.
## [1] 6 12 13 14 15 16 17 20 23 26 51 52 53 54 55 56 57 58 59
## [20] 69 70 71 72 73 74 75 78 79 81 82 87 88 89 90 91 92 95 98
## [39] 101 125 126 127 128 129 130 131 133 134 136 137 139 142 143 144 145 146 147
## [58] 148 149 150
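To make the two criteria concrete, nearZeroVar() can also return its diagnostic metrics rather than just the column indices. A minimal sketch, assuming caret is loaded and pml_training has been read in as in the appendix:
# saveMetrics = TRUE returns per-column diagnostics instead of indices
nzv_metrics <- nearZeroVar(pml_training, saveMetrics = TRUE)
head(nzv_metrics)      # freqRatio, percentUnique, zeroVar and nzv flags
sum(nzv_metrics$nzv)   # number of near-zero-variance predictors flagged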
## [1] 19622 53
Removal of the near-zero-variance variables, the variables with missing or blank values, and the first 6 identifier columns has reduced the dataset to 53 columns (52 predictor variables plus the classe outcome).
The analyses will identify and remove predictor variables that are highly correlated (and therefore add little value to the prediction model). Before processing, the data has a maximum pairwise correlation of 0.98, indicating the existence of highly correlated variables. The report uses the findCorrelation() function in caret to determine the highly correlated predictor variables and remove them.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.992008 -0.110080 0.002092 0.001790 0.092552 0.980924
Figure 1: Correlation matrix after the removal of the highly correlated predictor variables
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.768688 -0.108470 0.009256 0.011198 0.110351 0.780565
After removal of the highly correlated variables (cutoff set at 0.8), the maximum correlation has been reduced to 0.78.
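To illustrate what findCorrelation() does, here is a minimal sketch on a small invented matrix (the variables a, b and c are hypothetical):
# three predictors, two of which are almost identical
set.seed(42)
a <- rnorm(100); b <- a + rnorm(100, sd = 0.05); c <- rnorm(100)
toy_cor <- cor(cbind(a = a, b = b, c = c))
findCorrelation(toy_cor, cutoff = 0.8)   # index of the column to drop from the correlated pair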
Following the analysis and selection of the most significant predictor variables, the dataset dimensions have been reduced to 40 columns (39 predictor variables and 1 outcome variable).
## [1] 19622 40
The purpose of these analyses is to predict one of the ‘classe’ outcomes corresponding to the execution of a Unilateral Dumbbell Biceps Curl. Because the outcome is a discrete value, this is a classification (rather than a regression) problem, and for this reason the analyses will use classification algorithms, namely classification trees, random forest, and gradient boosting, that are well suited to this type of data.
Cross-validation is a technique for evaluating machine learning models by training several models on subsets of the available input data (the training dataset in these analyses) and evaluating them on the complementary subset of the data (the validation dataset). It is used to detect overfitting, i.e., a model that gives accurate predictions for the training data but fails to generalize to new data.
These analyses use k-fold cross-validation with k = 5, chosen to keep processing time manageable. The k-fold method splits the training dataset into 5 subsets; each subset is held out in turn while the model is trained on the remaining subsets. The accuracy is determined for each held-out subset, and an overall accuracy estimate is calculated for the model.
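A minimal sketch of the 5-fold split itself, assuming caret is loaded and the training partition has been created as in the appendix (trainControl(method = "cv", number = 5) performs this split internally):
# createFolds() produces 5 roughly equal, class-balanced folds
set.seed(32323)
folds <- createFolds(factor(training$classe), k = 5)
sapply(folds, length)   # each fold holds about one fifth of the training rows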
A random forest builds a large collection of de-correlated trees and then either votes or averages over them to obtain the prediction for a new observation. When used for classification, as in this case, a random forest obtains a class vote from each tree and then classifies using the majority vote.
## Accuracy
## 0.990441
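The per-tree voting can be seen directly with the randomForest package. A minimal sketch on the built-in iris data (a toy forest, unrelated to the model fitted in this report):
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 25)
# predict.all = TRUE exposes the class prediction of every individual tree
votes <- predict(rf, iris[1, ], predict.all = TRUE)$individual
table(as.vector(votes))   # the vote cast by each of the 25 trees
predict(rf, iris[1, ])    # the forest's majority-vote classification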
A classification tree, by contrast, classifies an observation with a single tree: the predictor space is partitioned by a sequence of binary splits, and each terminal node is assigned the majority class of its training samples. The fitted tree is shown below.
Figure 2: Plot of the classification tree model
## Accuracy
## 0.5406577
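A minimal single-tree sketch on the built-in iris data, assuming the rpart package is installed (again unrelated to the model fitted in this report):
library(rpart)
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                                # the sequence of binary splits
predict(tree, iris[1, ], type = "class")   # the majority class of the terminal node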
Boosting combines the outputs of many ‘weak’ classifiers to produce a stronger predictor. It sequentially applies the weak classification algorithm to repeatedly modified (re-weighted) versions of the data, producing a sequence of weak classifiers. The predictions from all of them are then combined through a weighted majority vote to produce the final prediction.
## Accuracy
## 0.9463421
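To illustrate the re-weighting and weighted-vote mechanism, here is a minimal AdaBoost-style sketch on an invented two-class toy problem, using rpart stumps as the weak classifier (the report itself uses caret's gbm implementation, shown in the appendix):
library(rpart)
set.seed(1)
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(ifelse(x$x1 + x$x2 + rnorm(n, sd = 0.5) > 0, "A", "B"))
w <- rep(1 / n, n)   # start with uniform sample weights
M <- 20
stumps <- vector("list", M); alpha <- numeric(M)
for (m in 1:M) {
  fit <- rpart(y ~ ., data = x, weights = w,
               control = rpart.control(maxdepth = 1))   # a one-split "stump"
  pred <- predict(fit, x, type = "class")
  err <- sum(w * (pred != y)) / sum(w)     # weighted misclassification error
  alpha[m] <- log((1 - err) / err)         # weight of this weak classifier
  w <- w * exp(alpha[m] * (pred != y))     # up-weight the misclassified samples
  stumps[[m]] <- fit
}
# final prediction: weighted majority vote over all weak classifiers
score <- Reduce(`+`, lapply(1:M, function(m)
  alpha[m] * ifelse(predict(stumps[[m]], x, type = "class") == "A", 1, -1)))
table(predicted = ifelse(score > 0, "A", "B"), truth = y)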
| | Random.Forest | Gradient.Boosting | Classification.Tree |
|:--|--:|--:|--:|
| Accuracy | 0.99 | 0.95 | 0.54 |

Table: Summary of Accuracy
The out-of-sample error, sometimes called generalization error, is the error rate obtained on a new dataset. It is estimated here by computing each model's accuracy on the held-out validation set and subtracting it from 1, as sketched after the table below.
| | Random.Forest | Gradient.Boosting | Classification.Tree |
|:--|--:|--:|--:|
| Error rate | 0.01 | 0.054 | 0.459 |

Table: Out-of-sample error rates
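A minimal sketch of this calculation, using the fitted random forest (fitrf) and the validation partition created in the appendix:
# out-of-sample error estimated as 1 minus validation-set accuracy
cm <- confusionMatrix(predict(fitrf, validation), as.factor(validation$classe))
unname(1 - cm$overall["Accuracy"])   # approximately 0.01 for the random forest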
The random forest model has been selected because of its high accuracy, and it is used to predict how well the exercise is performed in each of the 20 cases in the entirely separate testing dataset.
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# knitr global options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, fig.align = "center",
                      fig.path = "figures/")
options(scipen = 999)   # avoid scientific notation in printed numbers
# use captioner to add figure number and caption
library(captioner)
fig_nums <- captioner()
fig_nums("figa", "Correlation matrix after the removal of the highly correlated predictor variables")
fig_nums("figb", "Plot of the classification tree model")
# load the dataset and display the dimensions of the training and testing dataset
pml_training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
pml_testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
dim(pml_training)
dim(pml_testing)
# load libraries to remove unnecessary variables
library(caret)
library(dplyr)
# identify and remove zero- and near-zero-variance predictors (these may cause issues when subsampling)
nzv <- nearZeroVar(pml_training)
nzv
df1 <- pml_training[, -nzv]
# identifying and removing columns with missing values
df1 <- df1 %>% select_if(~ !any(is.na(.) | . == ""))
# remove the first 6 columns (row index, user name, timestamps, window number), which identify rather than predict
df1 <- df1[, -(1:6)]
dim(df1)
library(corrplot)
# a correlation of the predictors (excluding 'classe') and summary
df1_cor <- cor(df1[,-53])
summary(df1_cor[upper.tri(df1_cor)]) # max = 0.98
# use findCorrelation() function to determine the highly correlated predictor variables
cor.index <- findCorrelation(df1_cor, cutoff=0.8)
# remove the highly correlated variables
df2 <- df1[, -cor.index]
df2_cor <- cor(df2[,-40])
# and display the resulting correlation matrix and summary
diag(df2_cor) <- 0
corrplot(df2_cor)
summary(df2_cor[upper.tri(df2_cor)])
# display dimensions of the dataset before creating the partition for training
dim(df2)
# create partitioned data with a 60/40 split
set.seed(32323)
inTrain <- createDataPartition(df2$classe, p = 0.6, list = FALSE)
training <- df2[ inTrain,]
validation <- df2[-inTrain,]
# set up prediction with random forest (with parallel processing)
set.seed(95014)
# set up x and y to avoid slowness of caret() with model syntax
y <- training[,40]
x <- training[,-40]
# use parallel processing capabilities to speed up performance
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # leave 1 core for OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fitrf <- train(x, y, method = "rf", trControl = fitControl, tuneGrid = data.frame(mtry = 7))
stopCluster(cluster)
registerDoSEQ()
# model prediction
prf <- predict(fitrf, validation)
confusionMatrix(prf, as.factor(validation$classe))$overall[1]
# set up prediction with classification trees
library(rattle)
# cart model
fitControl1 <- trainControl(method = "cv", number = 5)
fitdt <- train(classe ~ ., method="rpart", data=training, trControl = fitControl1)
fancyRpartPlot(fitdt$finalModel, sub = "", caption = "")
# model prediction
pdt <- predict(fitdt, validation)
confusionMatrix(pdt, as.factor(validation$classe))$overall[1]
# set up prediction with boosting
fitgbm <- train(classe ~ ., method="gbm", data=training, trControl = fitControl1, verbose = FALSE)
# model prediction
pgbm <- predict(fitgbm, validation)
confusionMatrix(pgbm, as.factor(validation$classe))$overall[1]
# summary of accuracy: compute each confusion matrix once and reuse it
acc_rf  <- confusionMatrix(prf,  as.factor(validation$classe))$overall[1]
acc_gbm <- confusionMatrix(pgbm, as.factor(validation$classe))$overall[1]
acc_dt  <- confusionMatrix(pdt,  as.factor(validation$classe))$overall[1]
sumacc <- data.frame(Random.Forest = acc_rf, Gradient.Boosting = acc_gbm,
                     Classification.Tree = acc_dt)
library(knitr)
kable(sumacc, caption = "Summary of Accuracy", digits = 2)
# summary of out-of-sample errors (1 - accuracy)
oose <- data.frame(Random.Forest = 1 - acc_rf, Gradient.Boosting = 1 - acc_gbm,
                   Classification.Tree = 1 - acc_dt)
rownames(oose) <- "Error rate"   # the value is an error, not an accuracy
kable(oose, caption = "Out-of-sample error rates", digits = 3)
# final model prediction and result
prf_final <- predict(fitrf, pml_testing)
prf_final