Background

This project was part of the Coursera MOOC Practical Machine Learning by Johns Hopkins University.

The data for this assignment was collected from on-body sensors. It is available, along with more information on the research, from the Groupware@LES Human Activity Recognition project.

Here is their brief description of the setting:

Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

Assignment

The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Initial Analysis

The first part involved exploring different summaries of the data. Reading the original researchers' descriptions of the measurement gathering and feature building strategies was also very informative. These steps have not been documented here to keep this presentation short.

After the initial analysis it was clear that many of the variables would not contribute to prediction accuracy and could be dropped. I also decided to use a random forest model, since this is a basic classification problem with many variables and the outcomes are quite evenly distributed (a quick class-count check, shown below after the data is loaded, supports this).

Loading Libraries and Reading in the Data

## Prevent scientific notation in printed output
options(scipen = 999)

library(caret)
library(randomForest)
library(mlearning)
library(knitr)

## The raw files encode missing values as "NA", empty strings, and "#DIV/0!"
rawTraining <- read.csv('pml-training.csv', stringsAsFactors = F,
                        na.strings = c("NA", "", "#DIV/0!"))
rawTest <- read.csv('pml-testing.csv', stringsAsFactors = F,
                    na.strings = c("NA", "", "#DIV/0!"))
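
As a quick check of the class balance mentioned earlier (an illustrative step, not part of the original script):

## The five outcome classes are reasonably evenly represented
table(rawTraining$classe)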

Prepare Data For Analysis

## Remove the first 7 variables (row index, user name, timestamps, and window
## markers) as they don't contain useful information for prediction
training <- rawTraining[, -(1:7)]
testing <- rawTest[, -(1:7)]

## Most variables have no missing values, but some are over 97% missing.
## Remove those. The filter is computed on the test set's NA pattern so that
## both sets keep exactly the same columns.
completeCols <- colSums(is.na(testing)) == 0
training <- training[completeCols]
testing <- testing[completeCols]
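
## Illustrative check (not in the original script): columns are either complete
## or almost entirely missing, so this filter loses very little information
table(colMeans(is.na(rawTraining)) > 0.97)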

## Remove variables with near zero variance, computed on the training set
## and applied to both sets so their columns stay aligned
nzv <- nearZeroVar(training)
if (length(nzv) > 0) {
  training <- training[-nzv]
  testing <- testing[-nzv]
}

Model Building

A random forest model seemed suitable from the get-go, and it performed well. I tried it with different parameter settings, and all of them achieved an out-of-bag (OOB) error estimate below 1% (a sketch of such a search is shown after the model output below). In the end, the default parameters gave the best accuracy I could find: an OOB error of 0.47%.

## Set outcome variable "classe" to factor
training$classe <- as.factor(training$classe)

## Create data partitions
set.seed(9875)

trainIndex <- createDataPartition(training$classe, list = F, p = 0.7)
train <- training[trainIndex, ]
validation <- training[- trainIndex, ]

## Train random forest model
rfModel <- randomForest(classe ~ .,
                        data = train,
                        mtry = 7,
                        ntree = 500,
                        proximity = F)
print(rfModel)
## 
## Call:
##  randomForest(formula = classe ~ ., data = train, mtry = 7, ntree = 500,      proximity = F) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.47%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3904    2    0    0    0 0.0005120328
## B   12 2644    2    0    0 0.0052671181
## C    0   14 2377    5    0 0.0079298831
## D    0    0   17 2233    2 0.0084369449
## E    0    0    4    6 2515 0.0039603960
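
As an aside, the parameter trials mentioned above can be sketched with the tuneRF helper from the randomForest package. This is only an illustration of the kind of search that could be used; the exact trials were not recorded.

## Search over mtry around the default, growing/shrinking it by stepFactor
## until the OOB error stops improving (illustrative sketch)
tuneRF(x = train[, names(train) != "classe"],
       y = train$classe,
       ntreeTry = 200,
       stepFactor = 1.5,
       improve = 0.01)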

Cross Validation and Out-of-Sample Error

Random forests provide an internal estimate of out-of-sample error through their out-of-bag samples, so a separate validation set is not strictly necessary. Since there is plenty of data, however, the model was also evaluated on a held-out validation set.
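
For comparison, the same kind of estimate could be obtained with explicit k-fold cross-validation through caret (a sketch only; this was not run as part of the original analysis and is much slower than relying on the OOB estimate):

## 5-fold cross-validation around the same random forest (illustrative sketch);
## note that caret's train() function is distinct from our `train` data frame
cvControl <- trainControl(method = "cv", number = 5)
cvModel <- train(classe ~ ., data = train, method = "rf",
                 trControl = cvControl, ntree = 500)
print(cvModel)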

Applying the model to the validation set gives an error rate of 0.5%, in close agreement with the OOB estimate of 0.47%. We can confidently use this model to predict the outcomes of the 20 cases in the actual test set.

## Use the model to predict on the validation set and create a confusion matrix of the results
rfPredict <- predict(rfModel, validation)
predictConfusion <- confusion(validation$classe, rfPredict)

## Compare model estimates and prediction results
print(predictConfusion)
## 5885 items classified with 5858 true positives (error rate = 0.5%)
##        Predicted
## Actual    01   02   03   04   05 (sum) (FNR%)
##   01 A  1671    8    0    0    0  1679      0
##   02 B     2 1130    0    6    0  1138      1
##   03 E     1    0 1079    0    1  1081      0
##   04 C     0    1    1 1020    5  1027      1
##   05 D     0    0    2    0  958   960      0
##   (sum) 1674 1139 1082 1026  964  5885      0
confusionBarplot(predictConfusion, col = 'white', main = 'False Positives vs False Negatives')

## Plot the top ten prediction variables based on MeanDecreaseGini
varImpPlot(rfModel, n.var = 10, main = "Top 10 Variables by Relative Importance (MeanDecreaseGini)")
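
Finally, as the assignment requires, the fitted model can be applied to the 20 test cases (the predicted labels are not shown here):

## Predict the classe of the 20 test cases with the fitted model
testPredictions <- predict(rfModel, testing)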