This analysis presents the results of building a machine-learning model that predicts the manner in which an exercise was performed (the classe outcome, classes A through E) from accelerometer data.
The algorithms applied were classification trees (rpart) and random forests (randomForest).
The best model was obtained using the random forest algorithm, which achieved 99.34% accuracy (an out-of-sample error rate of 0.66%) on the testing data set.
Overfitting was addressed by careful data cleansing and by using cross-validation within the train() function.
# global settings for knitr
library(knitr)
opts_chunk$set(message = FALSE,
               warning = FALSE,
               tidy = TRUE,
               echo = FALSE,
               fig.height = 3,
               fig.width = 4)
# load required libraries
suppressMessages(library(caret))
suppressMessages(library(rpart))
suppressMessages(library(randomForest))
Significant data cleansing was required; see the comments in the code below for details.
# read files located in same directory as script; one we'll split into
# training and test data sets, the other we'll reserve for validation
# testing
pml_train <- read.csv("pml-training.csv", header = TRUE,
                      na.strings = c("", "NA"))
validation <- read.csv("pml-testing.csv", header = TRUE,
                       na.strings = c("", "NA"))
# partition pml_train into training and testing data sets
set.seed(32343)
inTrain <- createDataPartition(y = pml_train$classe, p = 0.6, list = FALSE)
training <- pml_train[inTrain, ]
testing <- pml_train[-inTrain, ]
# filter out new_window rows (summary rows for a time frame?) from all three
# data sets
training <- training[training$new_window != "yes", ]
testing <- testing[testing$new_window != "yes", ]
validation <- validation[validation$new_window != "yes", ]
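The new_window == "yes" rows appear to be the only rows where the derived per-window summary columns are populated, which is why they are dropped. A quick check (illustrative, not part of the original analysis, and assuming the avg_roll_belt column from the raw data):
# illustrative check: summary columns such as avg_roll_belt are NA except
# on new_window == "yes" rows
with(pml_train, table(new_window, missing_avg = is.na(avg_roll_belt)))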
# filter out covariates with near zero variance; most values in these
# columns are NA's; use training data to determine which covariates will be
# filtered out
skip_columns <- nearZeroVar(training)
training <- training[, -skip_columns]
testing <- testing[, -skip_columns]
validation <- validation[, -skip_columns]
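For reference, nearZeroVar() can also return its per-column diagnostics rather than just the column indices; an illustrative sketch, not part of the original analysis:
# run before the filtering step above: saveMetrics = TRUE returns
# freqRatio, percentUnique, and the zeroVar/nzv flags for every column
nzv_metrics <- nearZeroVar(training, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])   # columns flagged for removal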
# remove index number, subject name, and timestamp columns; instructions
# for the project specifically say to use the accelerometer data (only);
# including these columns would contribute to overfitting
omit_columns <- c(1:6)
training <- training[, -omit_columns]
testing <- testing[, -omit_columns]
validation <- validation[, -omit_columns]
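An illustrative check (not in the original code) of what the hard-coded indices are dropping:
# run just before the removal above: confirm the dropped columns are
# bookkeeping fields rather than sensor readings
names(training)[omit_columns]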
# split data sets into predictor vectors and outcome vector; this is a
# recommended optimization for the train() method
training_predictors <- training[, -53]
training_outcome <- training[, 53]
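A defensive check (hypothetical, not in the original code) makes the hard-coded column index less fragile:
# guard: column 53 should be the classe outcome after the filtering above
stopifnot(names(training)[53] == "classe")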
# the optimal mtry value was obtained from a previous tuning run of the
# model; passing it in directly saves time on subsequent runs
set.seed(32343)
mtryGrid <- expand.grid(mtry = 2)
rf <- train(x = training_predictors, y = training_outcome, method = "rf",
            metric = "Accuracy",
            trControl = trainControl(method = "cv", number = 10),
            tuneGrid = mtryGrid, proximity = TRUE)
rf
## Random Forest
##
## 11528 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 10375, 10375, 10376, 10375, 10375, 10376, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9913249 0.9890249 0.002350955 0.002974092
##
## Tuning parameter 'mtry' was held constant at a value of 2
##
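Though not shown in the original report, caret's varImp() gives a quick view of which covariates the forest relies on most; a brief sketch:
# rank covariates by random forest importance and plot the top ten
imp <- varImp(rf)
plot(imp, top = 10)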
# show confusion matrix for testing data only
pred <- predict(rf, testing)
confusionMatrix(pred, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2190 8 0 0 0
## B 2 1481 15 0 0
## C 0 3 1322 16 0
## D 0 0 5 1240 1
## E 0 0 0 1 1404
##
## Overall Statistics
##
## Accuracy : 0.9934
## 95% CI : (0.9913, 0.9951)
## No Information Rate : 0.2851
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9916
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9926 0.9851 0.9865 0.9993
## Specificity 0.9985 0.9973 0.9970 0.9991 0.9998
## Pos Pred Value 0.9964 0.9887 0.9858 0.9952 0.9993
## Neg Pred Value 0.9996 0.9982 0.9968 0.9974 0.9998
## Prevalence 0.2851 0.1941 0.1746 0.1635 0.1828
## Detection Rate 0.2849 0.1926 0.1720 0.1613 0.1826
## Detection Prevalence 0.2859 0.1948 0.1744 0.1621 0.1828
## Balanced Accuracy 0.9988 0.9949 0.9911 0.9928 0.9996
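The 0.66% out-of-sample error quoted in the synopsis is simply one minus the testing-set accuracy; the arithmetic, using the confusion matrix object above:
# out-of-sample error = 1 - accuracy on the held-out testing set
cm <- confusionMatrix(pred, testing$classe)
1 - cm$overall[["Accuracy"]]   # 1 - 0.9934 = 0.0066, i.e. 0.66%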
The model was then applied to the validation data set and the resulting predictions were submitted to Coursera.
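Because echo = FALSE is set globally, the submission chunk itself is not rendered above. The sketch below is a reconstruction, assuming the pml_write_files() helper distributed with the course submission instructions; it predicts classe for the validation cases and writes one answer file per problem.
# predict classe for the validation cases
answers <- as.character(predict(rf, validation))

# helper reconstructed from the course submission instructions: writes
# one text file per prediction, named problem_id_<i>.txt
pml_write_files <- function(x) {
    n <- length(x)
    for (i in 1:n) {
        filename <- paste0("problem_id_", i, ".txt")
        write.table(x[i], file = filename, quote = FALSE,
                    row.names = FALSE, col.names = FALSE)
    }
}
pml_write_files(answers)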
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel splines stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] plyr_1.8.1 gbm_2.1 survival_2.37-7
## [4] randomForest_4.6-10 rpart_4.1-9 caret_6.0-41
## [7] ggplot2_1.0.0 lattice_0.20-29 knitr_1.9
##
## loaded via a namespace (and not attached):
## [1] BradleyTerry2_1.0-6 brglm_0.5-9 car_2.0-24
## [4] class_7.3-12 codetools_0.2-10 colorspace_1.2-4
## [7] compiler_3.1.2 digest_0.6.8 e1071_1.6-4
## [10] evaluate_0.5.5 foreach_1.4.2 formatR_1.0
## [13] grid_3.1.2 gtable_0.1.2 gtools_3.4.1
## [16] htmltools_0.2.6 iterators_1.0.7 lme4_1.1-7
## [19] MASS_7.3-37 Matrix_1.1-5 mgcv_1.8-4
## [22] minqa_1.2.4 munsell_0.4.2 nlme_3.1-119
## [25] nloptr_1.0.4 nnet_7.3-9 pbkrtest_0.4-2
## [28] proto_0.3-10 quantreg_5.11 Rcpp_0.11.4
## [31] reshape2_1.4.1 rmarkdown_0.5.1 scales_0.2.4
## [34] SparseM_1.6 stringr_0.6.2 tools_3.1.2
## [37] yaml_2.1.13