The Human Activity Recognition data in the Qualitative Activity Recognition of Weight Lifting Exercises dataset collected by researchers at Groupware@LES presents a multiclass classification problem. The dataset contains data on subjects performing dumbbell biceps curls correctly in one way and incorrectly in four different ways. The analysis below classifies each movement according to the 5 ways that subjects performed the activity: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).1

# load packages needed for analysis
library(randomForest)
library(caret)

# Assumes pml-training.csv file is located in the current working directory.
# Split the pml-training set into training and testing sets.
data <- read.csv("pml-training.csv", na.strings=c("", " ","NA"))
inTrain <- createDataPartition(y=data$classe, p=0.6, list=FALSE)
training <- data[inTrain, ]
testing <- data[-inTrain, ]
dim(training); dim(testing);
## [1] 11776   160
## [1] 7846  160

Explore the dataset

Look at the variables classes, levels and data examples using th str() function, and look for zeros and NAs in the dataset.

# Don't print the output of objStr - the ouput is 160 lines long.
objStr <- str(training)
sum(na.omit(training==0))
## [1] 2511
sum(is.na(training))
## [1] 1151700
naCol <- round(colMeans(is.na(training)), 2)
# Print the precentages of NAs in columns.
unique(naCol)
## [1] 0.00 0.98

Clean the Dataset

A number of columns in the dataset are comprised of 98% NAs. Remove these columns from the training and testing set, as there isn’t enough non-NA data in these columns to properly impute data into the NA rows. Also, the first 7 columns in the dataset contain identification and timestamp data that don’t directly impact whether or not the subject performed the weight lifting exercise correctly. Drop the “X”" column, which is a row identifier, the timestamp columns (“raw_timestamp_part_1”, “raw_timestamp_part_2”, “cvtd_timestamp”), and the start and stop indicators for groups of correctly or incorrectly performed exercises (“new_window”, “num_window”).

training <- training[, !(naCol==0.98)]
training <- training[, -c(1,3:7)]
dim(training)
## [1] 11776    54
trainNames <- colnames(training)
testing <- testing[trainNames]

Cross Validation & Build the Model

Cross validation is incorporated into the random forests model, which estimates the out of sample error for the observations that are not randomly sampled (out of bag) in a particular tree.2 Tune the mtry value before fitting the random forests model to find the optimal value of mtry (the number of variables randomly sampled as candidates at each split in the tree) as it impacts the out-of-bag error estimate. Then fit and print the model.

set.seed(123)
mtryRf <- tuneRF(x=training[ , -length(training)], y=training[ , length(training)])
## mtry = 7  OOB error = 1% 
## Searching left ...
## mtry = 4     OOB error = 1.28% 
## -0.279661 0.05 
## Searching right ...
## mtry = 14    OOB error = 1.02% 
## -0.01694915 0.05

modelRf <- randomForest(classe ~ ., data=training, importance=TRUE, mtry=7, ntree=500)
print(modelRf)
## 
## Call:
##  randomForest(formula = classe ~ ., data = training, importance = TRUE,      mtry = 7, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.7%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3343    4    0    0    1 0.001493429
## B   16 2257    6    0    0 0.009653357
## C    0   16 2032    6    0 0.010710808
## D    0    0   27 1902    1 0.014507772
## E    0    0    1    4 2160 0.002309469

Evaluate the Model

Check the importance of the variables included in the model in the plot below. The MeanDecreaseAccuracy importance measure tends to be more optimistic than the MeanDecreaseGini measure3. Variables listed at the top of the MeanDecreaseGini plot contribute most to partitioning data at nodes in the tree. Note that while the user_name variable was included in the training set as a subject identifier, it does not appear to be an important variable when classifying the dumbbell outcomes.

varImpPlot(modelRf, main="Variable Importance", pch=19, color="dodgerblue")

As the model output above indicates, the OOB estimate for the error rate is 0.7%. The plot below shows the error rate for OOB and each classifier verses the number of trees. As the number of trees increase, the error rates uniformly decrease and roughly flatten.

plot(modelRf, log="y", main="Error Rate vs Number of Trees")
legend("topright", colnames(modelRf$err.rate), col=1:6, cex=0.8, fill=1:6)

Classify the Test Classes

Predict the classes of the dumbbell exercises using the testing set data to see how well the random forest model predicts using new data set.

pred <- predict(modelRf, newdata=testing, type="response")
table(observed=testing[ ,"classe"], predicted=pred)
##         predicted
## observed    A    B    C    D    E
##        A 2230    2    0    0    0
##        B    1 1516    1    0    0
##        C    0    3 1365    0    0
##        D    0    0    7 1278    1
##        E    0    0    0    3 1439

The estimated out-of-sample error is as follows:

round((sum(testing$classe != pred)) / length(pred), 4)
## [1] 0.0023

An ROC curve plot to illustrate the performance of the classification task is not appropriate for this dataset. This dataset is a multiclass classification problem and ROC curves are generally limited to binary classification problems due to the quadratic increase of degrees of freedom and the increase of dimensions in multiclass classification problems4.


1 http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf
2 http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr
3 http://stats.stackexchange.com/a/83625
4 http://en.wikipedia.org/wiki/Receiver_operating_characteristic