This analysis presents the results of building a machine-learning model that predicts the manner in which an exercise was performed (the classe outcome, classes A through E) from accelerometer data.
The algorithms applied were classification trees (rpart) and random forests (randomForest).
The best model was obtained using the random forest algorithm, which achieved 99.34% accuracy (an out-of-sample error rate of 0.66%) on the testing data set.
Overfitting was addressed by careful data cleansing and by using cross-validation within the train() function.
# global settings for knitr
library(knitr)
opts_chunk$set(message = FALSE,
               warning = FALSE,
               tidy = TRUE,
               echo = FALSE,
               fig.height = 3,
               fig.width = 4)
# load required libraries
suppressMessages(library(caret))
suppressMessages(library(rpart))
suppressMessages(library(randomForest))
Significant data cleansing was required; see the comments in the code below for details.
# read files located in same directory as script; one we'll split into
# training and test data sets, the other we'll reserve for validation
# testing
pml_train <- read.csv("pml-training.csv", header = TRUE,
                      na.strings = c("", "NA"))
validation <- read.csv("pml-testing.csv", header = TRUE,
                       na.strings = c("", "NA"))
# partition pml_train into training and testing data sets
set.seed(32343)
inTrain <- createDataPartition(y = pml_train$classe, p = 0.6, list = FALSE)
training <- pml_train[inTrain, ]
testing <- pml_train[-inTrain, ]
# filter out new_window rows (summary rows for a time frame?) from all three
# data sets
training <- training[training$new_window != "yes", ]
testing <- testing[testing$new_window != "yes", ]
validation <- validation[validation$new_window != "yes", ]
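The new_window == "yes" rows appear to be the only rows where the derived per-window summary columns are populated, which is why they are dropped. A quick check (illustrative, not part of the original analysis, and assuming the avg_roll_belt column from the raw data):
# illustrative check: summary columns such as avg_roll_belt are NA except
# on new_window == "yes" rows
with(pml_train, table(new_window, missing_avg = is.na(avg_roll_belt)))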
# filter out covariates with near zero variance; most values in these
# columns are NA's; use training data to determine which covariates will be
# filtered out
skip_columns <- nearZeroVar(training)
training <- training[, -skip_columns]
testing <- testing[, -skip_columns]
validation <- validation[, -skip_columns]
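For reference, nearZeroVar() can also return its per-column diagnostics rather than just the column indices; an illustrative sketch, not part of the original analysis:
# run before the filtering step above: saveMetrics = TRUE returns
# freqRatio, percentUnique, and the zeroVar/nzv flags for every column
nzv_metrics <- nearZeroVar(training, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])   # columns flagged for removal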
# remove index number, subject name, and timestamp columns; instructions
# for the project specifically say to use the accelerometer data (only);
# including these columns would contribute to overfitting
omit_columns <- c(1:6)
training <- training[, -omit_columns]
testing <- testing[, -omit_columns]
validation <- validation[, -omit_columns]
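An illustrative check (not in the original code) of what the hard-coded indices are dropping:
# run just before the removal above: confirm the dropped columns are
# bookkeeping fields rather than sensor readings
names(training)[omit_columns]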
# split data sets into predictor vectors and outcome vector; this is a
# recommended optimization for the train() method
training_predictors <- training[, -53]
training_outcome <- training[, 53]
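A defensive check (hypothetical, not in the original code) makes the hard-coded column index less fragile:
# guard: column 53 should be the classe outcome after the filtering above
stopifnot(names(training)[53] == "classe")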
# the optimal mtry value was obtained from a previous tuning run of the
# model; passing it in directly saves time on subsequent runs
set.seed(32343)
mtryGrid <- expand.grid(mtry = 2)
rf <- train(x = training_predictors, y = training_outcome, method = "rf",
            metric = "Accuracy",
            trControl = trainControl(method = "cv", number = 10),
            tuneGrid = mtryGrid, proximity = TRUE)
rf
## Random Forest
##
## 11528 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10375, 10375, 10376, 10375, 10375, 10376, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9913249 0.9890249 0.002350955 0.002974092
##
## Tuning parameter 'mtry' was held constant at a value of 2
##
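Though not shown in the original report, caret's varImp() gives a quick view of which covariates the forest relies on most; a brief sketch:
# rank covariates by random forest importance and plot the top ten
imp <- varImp(rf)
plot(imp, top = 10)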
# show confusion matrix for testing data only
pred <- predict(rf, testing)
confusionMatrix(pred, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2190 8 0 0 0
## B 2 1481 15 0 0
## C 0 3 1322 16 0
## D 0 0 5 1240 1
## E 0 0 0 1 1404
##
## Overall Statistics
##
## Accuracy : 0.9934
## 95% CI : (0.9913, 0.9951)
## No Information Rate : 0.2851
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9916
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9926 0.9851 0.9865 0.9993
## Specificity 0.9985 0.9973 0.9970 0.9991 0.9998
## Pos Pred Value 0.9964 0.9887 0.9858 0.9952 0.9993
## Neg Pred Value 0.9996 0.9982 0.9968 0.9974 0.9998
## Prevalence 0.2851 0.1941 0.1746 0.1635 0.1828
## Detection Rate 0.2849 0.1926 0.1720 0.1613 0.1826
## Detection Prevalence 0.2859 0.1948 0.1744 0.1621 0.1828
## Balanced Accuracy 0.9988 0.9949 0.9911 0.9928 0.9996
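The 0.66% out-of-sample error quoted in the synopsis is simply one minus the testing-set accuracy; the arithmetic, using the confusion matrix object above:
# out-of-sample error = 1 - accuracy on the held-out testing set
cm <- confusionMatrix(pred, testing$classe)
1 - cm$overall[["Accuracy"]]   # 1 - 0.9934 = 0.0066, i.e. 0.66%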
The model was then applied to the validation data set and the resulting predictions were submitted to Coursera.
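Because echo = FALSE is set globally, the submission chunk itself is not rendered above. The sketch below is a reconstruction, assuming the pml_write_files() helper distributed with the course submission instructions; it predicts classe for the validation cases and writes one answer file per problem.
# predict classe for the validation cases
answers <- as.character(predict(rf, validation))

# helper reconstructed from the course submission instructions: writes
# one text file per prediction, named problem_id_<i>.txt
pml_write_files <- function(x) {
    n <- length(x)
    for (i in 1:n) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE,
                    row.names = FALSE, col.names = FALSE)
    }
}
pml_write_files(answers)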
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel splines stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] plyr_1.8.1 gbm_2.1 survival_2.37-7
## [4] randomForest_4.6-10 rpart_4.1-9 caret_6.0-41
## [7] ggplot2_1.0.0 lattice_0.20-29 knitr_1.9
##
## loaded via a namespace (and not attached):
## [1] BradleyTerry2_1.0-6 brglm_0.5-9 car_2.0-24
## [4] class_7.3-12 codetools_0.2-10 colorspace_1.2-4
## [7] compiler_3.1.2 digest_0.6.8 e1071_1.6-4
## [10] evaluate_0.5.5 foreach_1.4.2 formatR_1.0
## [13] grid_3.1.2 gtable_0.1.2 gtools_3.4.1
## [16] htmltools_0.2.6 iterators_1.0.7 lme4_1.1-7
## [19] MASS_7.3-37 Matrix_1.1-5 mgcv_1.8-4
## [22] minqa_1.2.4 munsell_0.4.2 nlme_3.1-119
## [25] nloptr_1.0.4 nnet_7.3-9 pbkrtest_0.4-2
## [28] proto_0.3-10 quantreg_5.11 Rcpp_0.11.4
## [31] reshape2_1.4.1 rmarkdown_0.5.1 scales_0.2.4
## [34] SparseM_1.6 stringr_0.6.2 tools_3.1.2
## [37] yaml_2.1.13