Introduction

The goal of this analysis is to create a machine learning model which can successfully differentiate between weightlifting techniques based on accelerometer data. The data for this analysis comes from the Weight Lifting Exercises (WLE) dataset, which contains measurements from accelerometers on participants’ arms, forearms, waists and dumbbells as they performed 10 unilateral dumbbell bicep curls. Participants performed these exercises according to one of five specific techniques, one of which was correct. Using the R language and the caret package, we will build a machine learning model to classify observations based on which of the five techniques they correspond to. For more information on the WLE dataset, see this paper by Velloso, Bulling, Gellersen, Ugulino, and Fuks.

Loading and Processing the Data

The first step is to download the data if necessary and load it into R.

train_data_file <- "pml-training.csv"
train_data_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
if(!file.exists(train_data_file)) {
    download.file(url = train_data_url,
                  destfile = train_data_file,
                  method = "curl")
}
training_set <- read.csv(train_data_file)
dim(training_set)
## [1] 19622   160
test_data_file <- "pml-testing.csv"
test_data_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if(!file.exists(test_data_file)) {
    download.file(url = test_data_url,
                  destfile = test_data_file,
                  method = "curl")
}
test_set <- read.csv(test_data_file)
dim(test_set)
## [1]  20 160

These datasets are large and cannot be easily displayed, but viewing the training and test sets reveals many variables which cannot be used for prediction. Some are unrelated to the weightlifting technique used, such as the participant name, and some have no values recorded in the test set. These variables are therefore removed.

training_set <- training_set[-c(1:7)]
test_set <- test_set[-c(1:7, dim(test_set)[2])]
not_missing <- which(!is.na(colSums(test_set)))
training_set <- training_set[c(not_missing, dim(training_set)[2])]
test_set <- test_set[not_missing]
dim(training_set)
## [1] 19622    53
dim(test_set)
## [1] 20 52

We now have 52 predictors, as opposed to 159. Next, we check whether there remain any missing values in the data which need to be addressed.

any(is.na(training_set))
## [1] FALSE
any(is.na(test_set))
## [1] FALSE

Finally, we plot the distributions of each of the variables to check for outliers or problematic patterns.

library(reshape2)
library(ggplot2)
training_set_melt <- melt(training_set)
ggplot(training_set_melt, aes(variable, value)) + 
    geom_boxplot(outlier.alpha = 0.33, outlier.size = 1) + 
    coord_flip() +
    xlab("Variable") +
    ylab("Value") +
    ggtitle("Distribution of Training Data")

No problems are apparent from this plot; most of the data seems relatively evenly distributed around zero, and only one distant outlier is present.

Model Training, Selection and Evaluation

We will create our model using the caret package, which automates many aspects of model selection, and we will use 10-fold cross-validation so that the reported training accuracy will approximate the out-of-sample accuracy. Aiming for an accuracy of 99%, we will trial different algorithms, and possibly stack models, stopping as soon as this threshold is achieved. Firstly, we will train a random forest model.

library(caret)
library(doParallel)

set.seed(3791)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

train_control <- trainControl(method="cv", number=10)
model_rf <- train(classe ~ ., training_set, 
                  method="rf", trControl=train_control)

stopCluster(cl)
model_rf
## Random Forest 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17660, 17659, 17659, 17660, 17661, 17658, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9953624  0.9941335
##   27    0.9950055  0.9936821
##   52    0.9899094  0.9872352
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

It can be seen that this random forest model achieves a classification accuracy of 99.53624%, and a corresponding error rate of 0.46376%, and thus seems to be effective at distinguishing between the different weightlifting techniques using accelerometer data. Additionally, since these accuracy and error rate figures were obtained through 10-fold cross-validation, they should be a close estimate of the out-of-sample accuracy and error. Since the accuracy of this model is over 99%, no further models are trained.