Introduction

This paper summarizes the knowledge gained during the Practical Machine Learning module of the Coursera Data Science Specialization. It describes the process of fitting several tree-based classification algorithms to training and validation datasets, evaluating the predictions produced by each fitted model, and then choosing the best fit and using it to predict the activity class for an unlabeled dataset.

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here. The test data are available here.

The data for this project come from this source. If you use their data for any purpose, please cite them in your work.

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Read more.

Creating Training and Testing Datasets

From the main training data, separate training and validation datasets will be created; the provided testing dataset will remain untouched until the appropriate model has been selected, at which time it will be used to predict the activity performed by each trial participant.

Load the Data

Before the data were loaded into R, they were examined in a spreadsheet program. This examination revealed columns containing spurious values, such as divide-by-zero indicators. These values were identified and supplied to the na.strings argument of read.csv so that they are treated as missing.

trainingData <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testingData <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))

Load the Necessary Packages

This paper makes extensive use of the caret package for the data tidying work as well as model building. More information about this package can be found here.

library(caret)

Data Tidying

With the data loaded and the spurious values converted to NA, the datasets are in a good state for tidying.

dim(trainingData)
## [1] 19622   160

Clearly there are many (160) variables/predictors to choose from, but not all of them may be suitable. The task here is to reduce the number of predictors to a reasonable working set. The first step in that process is to look at the data. Judging from the column names seen in the spreadsheet (and in the absence of a data dictionary), the first 7 columns (row index, user name, timestamps, and window identifiers) are likely candidates for exclusion:

head(testingData[, 1:7], n = 3)
##   X user_name raw_timestamp_part_1 raw_timestamp_part_2   cvtd_timestamp
## 1 1     pedro           1323095002               868349 05/12/2011 14:23
## 2 2    jeremy           1322673067               778725 30/11/2011 17:11
## 3 3    jeremy           1322673075               342967 30/11/2011 17:11
##   new_window num_window
## 1         no         74
## 2         no        431
## 3         no        439
tail(testingData[, 1:7], n = 3)
##     X user_name raw_timestamp_part_1 raw_timestamp_part_2   cvtd_timestamp
## 18 18  carlitos           1323084285               176314 05/12/2011 11:24
## 19 19     pedro           1323094999               828379 05/12/2011 14:23
## 20 20    eurico           1322489658               106658 28/11/2011 14:14
##    new_window num_window
## 18         no        361
## 19         no         72
## 20         no        255

These values are clearly not useful as predictor variables and can be removed from both the training and testing datasets.

trainingData <- trainingData[, 8:ncol(trainingData)]
testingData <- testingData[, 8:ncol(testingData)]

The data contain many columns with a large number of NA values. Columns in which 50% or more of the values are NA contribute essentially nothing to the model, so they are removed with the following lines of code:

trainingData <- trainingData[, colSums(is.na(trainingData)) < (nrow(trainingData) * 0.5)]
testingData <- testingData[, colSums(is.na(testingData)) < (nrow(testingData) * 0.5)]

The last step in the data tidying process is to check for any columns whose variance is near or equal to zero, since such variables carry little information and can cause problems for some models.

nearZeroVar(trainingData)
## integer(0)
nearZeroVar(testingData)
## integer(0)

The result, integer(0), means that no columns have near-zero variance (i.e., both zeroVar and nzv are FALSE for every predictor when the saveMetrics parameter is set to TRUE), so all of the remaining columns can be considered adequate predictors.
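
For reference, the per-predictor metrics mentioned above can be inspected directly by setting saveMetrics = TRUE. The following snippet is illustrative only and is not part of the original analysis:

# Illustrative: show the near-zero-variance diagnostics for each remaining predictor.
# zeroVar and nzv should be FALSE for every column.
nzvMetrics <- nearZeroVar(trainingData, saveMetrics = TRUE)
head(nzvMetrics)    # columns: freqRatio, percentUnique, zeroVar, nzv
any(nzvMetrics$nzv) # FALSE when no predictor has near-zero variance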

The data tidying portion of the work is done; on to model building.

Model Evaluation and Selection

This part of the process entails evaluating and selecting a classifier algorithm from among those discussed in the course. Because these algorithms are compute-intensive, the selection is limited by the processing power of the machine on which this paper was composed.

Reproducibility of work

To ensure consistent results across subsequent runs of the code, a random seed is set so that the same random values are generated for every model training and fitting step.

set.seed(9999)

Create Training and Testing Datasets

In order to train the model while preserving the real testing dataset, the original training data is split into two sub-datasets: a sub-training set and a sub-testing set. The naming is a bit confusing, but the purpose is to obtain an honest estimate of out-of-sample error from a refined model fit before making predictions on the real test dataset. Basically, the procedure is:

  1. split the original training set into sub-training/test sets
  2. build the model on sub-training set
  3. evaluate the model on sub-test set
  4. repeat and average estimated errors

This is accomplished using the createDataPartition method in the caret package:

INDEX <- createDataPartition(y = trainingData$classe, p = 0.6, list = FALSE)

# Create a sub-training set using 60% of the original training data
trainingSet <- trainingData[INDEX,]

# Create a sub-testing set using 40% of the original training data
testingSet <- trainingData[-INDEX,]

Next, a trainControl instance is created using 3-fold cross-validation. The idea is to break the training set into k folds of roughly equal size. The first fold is held out and a model is fit on the remaining k - 1 folds (here, 2); that model then predicts the held-out fold. The process is repeated until each of the k folds has been held out and predicted once.
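
To make the fold structure concrete, caret's createFolds function can be used to see how the sub-training set would be partitioned into three roughly equal folds. This snippet is illustrative only and is not part of the original analysis:

# Illustrative: createFolds returns a list of held-out row indices, one element per fold.
folds <- createFolds(trainingSet$classe, k = 3)
sapply(folds, length) # each fold holds roughly one third of the rows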

trainingControl <- trainControl(method = "cv", number = 3)

Once the trainingControl has been created, it is passed to the train function, which applies the desired machine learning algorithm to construct a model from the subset of the training data. Due to the lack of sufficient computing power, only two different tree-based algorithms will be evaluated:

  1. Decision Trees (rpart)

  2. Random Forest Decision Trees (rf)

A helper function, trainer, is used to fit the data to each model.
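
Its definition is not shown above; a minimal sketch, assuming it simply wraps caret::train with the sub-training set and the trainingControl object defined earlier, might be:

# Hypothetical reconstruction of the trainer helper (not from the original report):
# fit a caret model of the requested type on the sub-training set, reusing the
# 3-fold cross-validation control defined above.
trainer <- function(method) {
    train(classe ~ ., data = trainingSet, method = method,
          trControl = trainingControl)
}

With such a helper in place, the two models are fit as follows: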

rpartModel <- trainer("rpart")
rfModel <- trainer("rf")

Prediction using Training and Test Sets

Once the data has been fit to a model, the predictive performance of each algorithm can be evaluated. The predict method is used for this purpose, and the predicted values are then summarized with a confusionMatrix, a table that is often used to describe the performance of a classification model (or "classifier") on a set of data for which the true values are known. It is similar in concept to a truth table: for a binary (2-class) classifier it reports the counts of true positives, false positives, true negatives, and false negatives. Here there are five classes, so the table has a row and a column for each class.

rpartPrediction <- predict(rpartModel, newdata = trainingSet, type = "raw")
rpartCM <- confusionMatrix(rpartPrediction, trainingSet$classe)

rfPrediction <- predict(rfModel, newdata = trainingSet, type="raw")
rfCM <- confusionMatrix(rfPrediction, trainingSet$classe)
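
The accuracy figures quoted below can be read from the overall element of each confusionMatrix object; for example (a hypothetical snippet, the original report presumably used inline R expressions for these values):

# Overall (in-sample) accuracy for each model
rpartCM$overall[["Accuracy"]]
rfCM$overall[["Accuracy"]]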

The accuracy for the rpart algorithm is 0.497962, while for the rf algorithm it is 1. These are in-sample accuracies, since the predictions were made on the same sub-training set used to fit the models; the out-of-sample performance is assessed on the sub-testing set below. Even so, the rf algorithm clearly surpasses the rpart algorithm, and looking at their truth tables adds further support for this conclusion:

##           Reference
## Prediction    A    B    C    D    E
##          A 3035  909  954  873  304
##          B   54  792   62  329  292
##          C  252  578 1038  728  570
##          D    0    0    0    0    0
##          E    7    0    0    0  999
##           Reference
## Prediction    A    B    C    D    E
##          A 3348    0    0    0    0
##          B    0 2279    0    0    0
##          C    0    0 2054    0    0
##          D    0    0    0 1930    0
##          E    0    0    0    0 2165

Looking at the diagonals for each table leaves no “confusion” about which algorithm to apply to the testing subset and for predicting the labels of the full test dataset.

Since the rf algorithm clearly produces the most accurate predictions, it is used to evaluate the sub-testing set:

prediction <- predict(rfModel, newdata = testingSet, type="raw")
testRFCM <- confusionMatrix(prediction, testingSet$classe)

The accuracy here is 0.9910783 and the confusion matrix is:

##           Reference
## Prediction    A    B    C    D    E
##          A 2231   13    0    0    0
##          B    1 1501   10    0    0
##          C    0    4 1353   23    6
##          D    0    0    5 1260    5
##          E    0    0    0    3 1431
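
This accuracy on the held-out sub-testing set corresponds to an estimated out-of-sample error of roughly 1 - 0.9910783 ≈ 0.0089, or about 0.9%. It can be computed directly from the confusionMatrix object; a small illustrative snippet:

# Estimated out-of-sample error from the held-out sub-testing set
1 - testRFCM$overall[["Accuracy"]]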

Predictions Using Test Data

The final step in the project is to use the model to make 20 predictions about the manner in which each of the participants performed the exercise; the output of the predictions is shown below and was submitted for verification by the course participants in a peer-review fashion.

prediction <- predict(rfModel, testingData, type = "raw")
for (i in 1:length(prediction)) 
        print(paste0("Prediction ", i, " = ", prediction[i]))
## [1] "Prediction 1 = B"
## [1] "Prediction 2 = A"
## [1] "Prediction 3 = B"
## [1] "Prediction 4 = A"
## [1] "Prediction 5 = A"
## [1] "Prediction 6 = E"
## [1] "Prediction 7 = D"
## [1] "Prediction 8 = B"
## [1] "Prediction 9 = A"
## [1] "Prediction 10 = A"
## [1] "Prediction 11 = B"
## [1] "Prediction 12 = C"
## [1] "Prediction 13 = B"
## [1] "Prediction 14 = A"
## [1] "Prediction 15 = E"
## [1] "Prediction 16 = E"
## [1] "Prediction 17 = A"
## [1] "Prediction 18 = B"
## [1] "Prediction 19 = B"
## [1] "Prediction 20 = B"