Background and Purpose

This analysis uses exercise data from an activity tracker (e.g., Fitbit, Jawbone Up) that recorded properly and improperly performed barbell exercises. The goal of this analysis is to use the training data to correctly classify properly and improperly performed lifts in the test set.

Load and Clean Data

Load Data

The data come in two separate files: one for the training set and another for the test set (read in below as validation, since the training data will later be split into its own train and test partitions).

train <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"), header = TRUE, stringsAsFactors = FALSE, na.strings = c("", NA))
validation <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"), header = TRUE, stringsAsFactors = FALSE, na.strings = c("", NA))

Clean Data

I’ll want to perform the same manipulations on both sets, so I’ll place the train and validation data into a list and clean them simultaneously.

dataList <- list(train = train, validation = validation)
names(dataList) # just checking
## [1] "train"      "validation"

A large set of variables is almost entirely missing: the summary-statistic columns (names containing descriptors such as max_, min_, avg_, kurtosis_, and skewness_) are populated only on the new_window == 'yes' summary rows and are NA everywhere else.
With this in mind, the data will be subset to new_window == 'no' and these summary variables (along with the identifier and timestamp variables) will be dropped from both the training and validation data.
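
To see the pattern, missingness in a representative summary column (max_roll_belt) can be tabulated against new_window (an illustrative check, not part of the cleaning pipeline; output omitted):

# illustrative check: NAs in one summary column by window type ----
table(train$new_window, is.na(train$max_roll_belt))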

library(dplyr, quietly = T)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# clean out NAs ----
dataList <- lapply(dataList, function(x) {subset(x, new_window == "no")})
dataList <- lapply(dataList, function(x) {x[, !grepl("^X|^user|timestamp|window|^kurtosis|^skewness|^min|^max|^avg|^stdd|^var|^amplitude", names(x))]})

# check to see if there are missing data ----
lapply(dataList, function(x) {sum(is.na(x))})
## $train
## [1] 0
## 
## $validation
## [1] 0
# assign train and test back to the environment ----
list2env(dataList, .GlobalEnv)
## <environment: R_GlobalEnv>

No missing data remain in the train and validation sets.

Model Training

The outcome variable, classe, is categorical with five levels (A through E), so an algorithm that can handle multi-class outcomes is needed.
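
A quick look at the outcome distribution confirms the five levels and shows that class A is the most common (an illustrative check; output omitted):

# check outcome distribution ----
round(prop.table(table(train$classe)), 3)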

Split the Data

Split the train data into trainData and testData using an 80/20 partition.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
trainIndex <- createDataPartition(train$classe, p = .80, list = FALSE)
trainData <- train[trainIndex, ]
testData <- train[-trainIndex, ]
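
A quick sanity check on the split sizes (illustrative; note the partition itself is not seeded, so exact counts may vary slightly between runs):

# check split sizes ----
c(trainData = nrow(trainData), testData = nrow(testData))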

Stochastic Gradient Boosting

The first algorithm will be stochastic gradient boosting (GBM). Warning: depending on your computer’s processing speed, this could take a long time to run due to the repeated cross-validation. I originally had code here to run this across multiple processors; a minimal sketch of that setup is shown below.
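
This sketch assumes the doParallel package is installed; caret automatically distributes resampling across a registered parallel backend.

# optional: register a parallel backend before calling train() ----
library(doParallel)
cl <- makeCluster(max(1, detectCores() - 1))
registerDoParallel(cl)
# ...train models here, then release the workers with stopCluster(cl)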

set.seed(9)
cv_settings <- trainControl(method = "repeatedcv", repeats = 3, number = 10)
gbm_train <- train(as.factor(classe) ~ ., data = trainData, 
                   trControl = cv_settings, method = "gbm", verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
confusionMatrix(predict(gbm_train, testData), testData$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1077   18    0    1    0
##          B   11  703   20    3   12
##          C    3   21  643   27    8
##          D    0    0    6  593   14
##          E    3    1    1    5  671
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9599          
##                  95% CI : (0.9532, 0.9659)
##     No Information Rate : 0.2848          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9493          
##  Mcnemar's Test P-Value : 3.167e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9845   0.9462   0.9597   0.9428   0.9518
## Specificity            0.9931   0.9852   0.9814   0.9938   0.9968
## Pos Pred Value         0.9827   0.9386   0.9160   0.9674   0.9853
## Neg Pred Value         0.9938   0.9871   0.9914   0.9888   0.9892
## Prevalence             0.2848   0.1934   0.1744   0.1638   0.1835
## Detection Rate         0.2804   0.1830   0.1674   0.1544   0.1747
## Detection Prevalence   0.2853   0.1950   0.1828   0.1596   0.1773
## Balanced Accuracy      0.9888   0.9657   0.9705   0.9683   0.9743

Random Forest

The second model will be a random forest.

set.seed(9)
rf_train <- train(as.factor(classe) ~ ., data = trainData, 
                  trControl = cv_settings, method = "rf")
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
confusionMatrix(predict(rf_train, testData), testData$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1092    3    0    0    0
##          B    1  739    3    0    1
##          C    1    1  665   13    0
##          D    0    0    2  615    0
##          E    0    0    0    1  704
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9932          
##                  95% CI : (0.9901, 0.9956)
##     No Information Rate : 0.2848          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9914          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9946   0.9925   0.9777   0.9986
## Specificity            0.9989   0.9984   0.9953   0.9994   0.9997
## Pos Pred Value         0.9973   0.9933   0.9779   0.9968   0.9986
## Neg Pred Value         0.9993   0.9987   0.9984   0.9957   0.9997
## Prevalence             0.2848   0.1934   0.1744   0.1638   0.1835
## Detection Rate         0.2843   0.1924   0.1731   0.1601   0.1833
## Detection Prevalence   0.2851   0.1937   0.1770   0.1606   0.1835
## Balanced Accuracy      0.9985   0.9965   0.9939   0.9886   0.9991
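
Since both models were trained with the same seed and resampling scheme, their cross-validated performance can also be compared directly with caret's resamples (output omitted):

# compare cross-validated accuracy across models ----
summary(resamples(list(GBM = gbm_train, RF = rf_train)))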

With an accuracy of 99.3% and sensitivities between roughly 98% and 100%, the random forest is clearly the better classifier compared to GBM. With this in mind, the random forest model will be used to predict classes on the validation set.
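
For completeness, the predictors driving the forest can be ranked with caret's varImp (output omitted):

# inspect variable importance for the chosen model ----
varImp(rf_train)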

Out of Sample Performance

The random forest model was used to predict a classe for each case in the validation data set.

validation_predictions <- predict(rf_train, validation)
validation$predicted_classe <- validation_predictions
validation %>% select(problem_id, predicted_classe)
##    problem_id predicted_classe
## 1           1                B
## 2           2                A
## 3           3                B
## 4           4                A
## 5           5                A
## 6           6                E
## 7           7                D
## 8           8                B
## 9           9                A
## 10         10                A
## 11         11                B
## 12         12                C
## 13         13                B
## 14         14                A
## 15         15                E
## 16         16                E
## 17         17                A
## 18         18                B
## 19         19                B
## 20         20                B
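
For submission, each prediction can be written to its own text file. This is a convenience sketch; the file-naming scheme is an assumption, not part of the original analysis.

# write one file per prediction (file names are illustrative) ----
for (i in seq_len(nrow(validation))) {
  write.table(validation$predicted_classe[i],
              file = paste0("problem_id_", validation$problem_id[i], ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}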

The values in predicted_classe were evaluated during the quiz portion of the course, and all 20 predictions were correct.

Conclusion

The random forest classifier performed best both in training (99.3% accuracy on the held-out test split) and on the out-of-sample validation data, where all 20 predictions were correct.