Final Project - Practical Machine Learning

Introduction

In this analysis we use accelerometer data to build a predictive model that classifies how an exercise was performed. Each of six participants was instructed in five different ways of performing a Unilateral Dumbbell Biceps Curl and then asked to perform 10 repetitions of each. The five exercise classes were:

- exactly according to the specification (Class A),
- throwing the elbows to the front (Class B),
- lifting the dumbbell only halfway (Class C),
- lowering the dumbbell only halfway (Class D) and
- throwing the hips to the front (Class E).

We use a random forest model with 5-fold cross-validation to predict the exercise class.

Pre-processing and Exploratory Data Analysis

First we read our training and testing data sets and assign them as variables.

library(caret)

## Warning: package 'caret' was built under R version 3.4.1

## Loading required package: lattice

## Loading required package: ggplot2

training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")

Our training set has 19,622 observations of 160 variables, while the testing set has 20 observations of 160 variables. The two sets share the same variables, except that the testing set has a problem_id variable in place of the classe variable. We discuss this in more detail shortly.

More importantly, many columns consist overwhelmingly of missing values (NAs). Because these columns are problematic for our model, we simply remove them. We also remove columns 1 through 7, which hold non-predictive information such as the row index, user name and various timestamps. The resulting training_clean and testing_clean data sets each have 53 columns: 52 predictors plus classe (training) or problem_id (testing).

training_clean <- training[, c(8:11, 37:49, 60:68, 84:86, 102, 
                               113:124, 140, 151:160)]

testing_clean <- testing[, c(8:11, 37:49, 60:68, 84:86, 102, 
                                113:124, 140, 151:160)]
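The hard-coded column indices above could also be derived from the data. The following is a minimal sketch on a made-up toy data frame, not the author's procedure: the column values, the 0.5 cut-off (na_threshold) and the metadata_cols vector are all assumptions chosen for illustration. On the real files it may also be necessary to tell read.csv which strings count as missing (its na.strings argument) before the NA share is meaningful.

```r
# Toy stand-in for the training set; all values are invented.
toy <- data.frame(
  X          = 1:6,                                # row index (metadata)
  user_name  = c("a", "a", "b", "b", "c", "c"),    # metadata
  roll_belt  = c(1.4, 1.5, 1.3, 1.6, 1.2, 1.5),
  kurtosis   = c(NA, NA, NA, NA, NA, 0.2),         # mostly missing
  pitch_belt = c(8.1, 8.0, 8.2, 8.1, 7.9, 8.0),
  classe     = factor(c("A", "A", "B", "B", "C", "C"))
)

na_threshold  <- 0.5                   # hypothetical cut-off for "mostly missing"
metadata_cols <- c("X", "user_name")   # stand-ins for columns 1 through 7

# Share of missing values in each column
na_fraction <- colMeans(is.na(toy))

# Keep columns that are mostly observed and are not metadata
keep <- names(toy)[na_fraction <= na_threshold & !(names(toy) %in% metadata_cols)]

toy_clean <- toy[, keep]
print(names(toy_clean))
```

Selecting by name rather than by position makes it easier to apply the same selection to a second data frame whose columns differ slightly.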

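The testing set has no classe labels; its last column, problem_id, is only a case identifier. So that testing_clean has the same column names as training_clean, we copy problem_id into a new column named classe and then drop problem_id (column 53). The classe column in testing_clean is therefore a placeholder, not the true class. The classe variable is what we're trying to predict.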

testing_clean$classe <- testing_clean$problem_id
testing_clean <- testing_clean[, -53]
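Dropping a column by position (-53) works only as long as the column order never changes. As a hedged alternative, the sketch below selects the predictor columns by name; train_toy and test_toy are invented stand-ins for training_clean and testing_clean, and this is an illustration rather than the approach used in this report.

```r
# Invented stand-ins for the cleaned training and testing sets.
train_toy <- data.frame(
  roll_belt  = c(1.4, 1.5),
  pitch_belt = c(8.1, 8.0),
  classe     = factor(c("A", "B"))
)
test_toy <- data.frame(
  roll_belt  = 1.3,
  pitch_belt = 8.2,
  problem_id = 1L
)

# Predictors are every training column except the outcome
predictors <- setdiff(names(train_toy), "classe")

# Check that the test set has all of them, then select by name
missing_cols <- setdiff(predictors, names(test_toy))
stopifnot(length(missing_cols) == 0)

test_x <- test_toy[, predictors, drop = FALSE]
print(names(test_x))
```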

Cross-validation

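Cross-validation gives us an estimate of how the random forest model performs on data it was not trained on, and it lets caret choose the tuning parameter mtry without touching the testing set. In this analysis we use 5-fold cross-validation.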

train_control <- trainControl(method="cv", number=5)
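To make concrete what 5-fold cross-validation does with 19,622 rows, here is a minimal base R sketch of a fold assignment. It is an illustration only: the seed is arbitrary, and the folds here are purely random, whereas caret appears to balance folds by outcome class, so its fold sizes can differ by a row or two.

```r
set.seed(42)   # arbitrary seed, for reproducibility of this sketch only

n <- 19622     # number of training rows reported above
k <- 5         # number of folds

# Assign each row to one of k folds, as evenly as possible, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, the model is fit on the other k - 1 folds
train_sizes <- sapply(seq_len(k), function(i) sum(fold_id != i))

print(table(fold_id))
print(train_sizes)   # each value is 15697 or 15698
```

Each model is trained on roughly four fifths of the data and evaluated on the remaining fifth, and the five held-out accuracies are averaged.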

Model Construction

Next, we fit the random forest model. We are predicting the classe variable based on all other remaining variables.

modFit <- train(classe ~ ., data = training_clean, trControl = train_control, method = "rf")

## Loading required package: randomForest

## Warning: package 'randomForest' was built under R version 3.4.1

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

Out-of-sample error rate

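The cross-validated accuracy of the selected random forest model is 99.45%, so we estimate the out-of-sample error rate at about 0.55% (1 minus the accuracy). caret selected an mtry value of 2, which means that 2 variables are randomly sampled as split candidates at each split. The random forest's own out-of-bag (OOB) error estimate of 0.41% is of similar size.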

print(modFit)

## Random Forest 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15699, 15697, 15697, 15696, 15699 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9944962  0.9930379
##   27    0.9940376  0.9924576
##   52    0.9883805  0.9853028
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
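The three mtry values in the table and the choice of the final value can be reproduced with a few lines of base R. This is a sketch under an assumption: that the default grid consists of 3 evenly spaced values between 2 and the number of predictors, which matches the output above but is not taken from caret's source here. The accuracies are copied from the table.

```r
p <- 52   # number of predictors reported in the output above

# Assumed rule: 3 evenly spaced candidates from 2 to p, rounded down
mtry_grid <- floor(seq(2, p, length.out = 3))

# Cross-validated accuracies copied from the resampling table
results <- data.frame(
  mtry     = mtry_grid,
  Accuracy = c(0.9944962, 0.9940376, 0.9883805)
)

# The final model uses the candidate with the largest accuracy
best_mtry <- results$mtry[which.max(results$Accuracy)]

print(results)
print(best_mtry)
```

The gap between mtry = 2 and mtry = 27 is under 0.05 percentage points, so the choice between them has little practical effect here.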

modFit$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.41%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 5578    1    0    0    1 0.0003584229
## B   11 3783    3    0    0 0.0036871214
## C    0   21 3400    1    0 0.0064289889
## D    0    0   37 3177    2 0.0121268657
## E    0    0    0    4 3603 0.0011089548
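The OOB error rate and the class.error column can be recomputed from the confusion matrix, which is a useful check on how the summary figures are derived. The counts below are copied from the output above; rows are taken to be the actual classes and columns the predicted classes, which is the reading that reproduces the printed class.error values.

```r
# Confusion matrix counts copied from the finalModel output
conf <- matrix(
  c(5578,    1,    0,    0,    1,
      11, 3783,    3,    0,    0,
       0,   21, 3400,    1,    0,
       0,    0,   37, 3177,    2,
       0,    0,    0,    4, 3603),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall OOB error: share of observations off the diagonal
oob_error <- 1 - sum(diag(conf)) / sum(conf)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

# Cross-validated error for mtry = 2, from the accuracy reported by caret
cv_error <- 1 - 0.9944962

print(round(100 * oob_error, 2))   # 0.41 (percent)
print(round(class_error, 4))
print(round(100 * cv_error, 2))    # 0.55 (percent)
```

Class D has the highest error, driven mostly by 37 observations predicted as Class C.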

Conclusion

Citation:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.