Excerpt taken verbatim from course instructions:
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
Why you made the choices you did.
For this task, I chose to use Random Forest and Extreme Gradient Boosting.
These correspond to using train in the caret package with the options method="rf" and method="xgbTree" respectively.
The primary reason for choosing these two is that they perform well for classification and lend themselves well to ensemble learning. I intend to stack these two classifiers to (hopefully) improve their accuracy.
In particular, tree-based methods like Random Forest and Extreme Gradient Boosting tend to work well when predictors are numerous, measured on different scales, and related to the response in nonlinear ways, all without requiring much manual feature engineering.
Feature engineering is hard in the best of cases; without domain knowledge, it is harder still (maybe exponentially so). Random Forest and Extreme Gradient Boosting are convenient ways to confront or bypass these issues.
How you used cross validation.
To train my model, I used repeated cross-validation (CV). Tree-based methods are notorious for being low bias and high variance. Repeated cross-validation, which repeats K-fold CV several times and then averages the results, keeps the low bias while reducing the variance of the resulting performance estimate.
Below I set the number of folds to be 10 in each iteration of k-fold CV and repeat k-fold CV 5 times.
#### Load packages and set defaults for repeated cross-validation ####
library(data.table)   # fread
library(caret)        # train, trainControl, createDataPartition, nearZeroVar, ...
library(doParallel)   # parallel backend for caret
trControl <- caret::trainControl(
    method        = "repeatedcv",  # repeated K-fold CV
    number        = 10,            # K = 10 folds
    repeats       = 5,             # repeat K-fold CV 5 times
    verboseIter   = FALSE,
    allowParallel = TRUE
)
Below I download the training data.
#### Download training data ####
urlTrainingData <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
trainDF <- fread(urlTrainingData)
stopifnot( 'classe' %in% names(trainDF) )
# Convert response to factor variable
trainDF$classe <- factor( trainDF$classe )
As is, the data needs to be preprocessed: it contains many predictors with near zero-variance. I drop these variables from the dataset, i.e. near zero-variance predictors won't be used in the model.
I also drop columns with missing values. Imputation would have been another option, but choosing a good imputation method is itself a delicate art and can introduce its own set of problems. Dropping the columns outright seemed more robust.
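For reference, had I gone the imputation route, caret's preProcess supports several methods, including k-nearest-neighbour imputation. A minimal sketch (not run in this analysis, and assuming the columns passed in are numeric):
# Not run: impute missing values instead of dropping the columns
# preObj    <- preProcess( trainDF, method = "knnImpute" )
# imputedDF <- predict( preObj, trainDF )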
#### Data cleaning ####
# Drop the row index (V1) and the timestamp columns
dropCols <- c( 'V1', 'raw_timestamp_part_1', 'raw_timestamp_part_2', 'cvtd_timestamp' )
trainDF[ , (dropCols) := NULL ]
stopifnot( !any( dropCols %in% names(trainDF) ) )
# Show examples of near zero-variance predictors
nzvSummary <- nearZeroVar( trainDF, saveMetrics = TRUE )
nzvDropped <- nzvSummary[ nzvSummary$nzv == TRUE, ]
head( nzvDropped, 10 )
## freqRatio percentUnique zeroVar nzv
## new_window 47.330 0.010193 FALSE TRUE
## kurtosis_roll_belt 1921.600 2.023239 FALSE TRUE
## kurtosis_picth_belt 600.500 1.615534 FALSE TRUE
## kurtosis_yaw_belt 47.330 0.010193 FALSE TRUE
## skewness_roll_belt 2135.111 2.013047 FALSE TRUE
## skewness_roll_belt.1 600.500 1.722556 FALSE TRUE
## skewness_yaw_belt 47.330 0.010193 FALSE TRUE
## max_yaw_belt 640.533 0.346550 FALSE TRUE
## min_yaw_belt 640.533 0.346550 FALSE TRUE
## amplitude_yaw_belt 50.042 0.020385 FALSE TRUE
tail( nzvDropped, 10 )
## freqRatio percentUnique zeroVar nzv
## amplitude_yaw_forearm 59.677 0.015289 FALSE TRUE
## avg_roll_forearm 27.667 1.641015 FALSE TRUE
## stddev_roll_forearm 87.000 1.630823 FALSE TRUE
## var_roll_forearm 43.500 1.625726 FALSE TRUE
## avg_pitch_forearm 83.000 1.651208 FALSE TRUE
## stddev_pitch_forearm 41.500 1.646112 FALSE TRUE
## var_pitch_forearm 83.000 1.651208 FALSE TRUE
## avg_yaw_forearm 83.000 1.651208 FALSE TRUE
## stddev_yaw_forearm 85.000 1.641015 FALSE TRUE
## var_yaw_forearm 85.000 1.641015 FALSE TRUE
# Print number of near zero-variance predictors dropped
nrow( nzvDropped )
## [1] 60
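For context on the table above: nearZeroVar flags a predictor when its frequency ratio (most common value over the second most common) exceeds freqCut and its percentage of unique values falls below uniqueCut. The call below is equivalent to the one above, with caret's default cutoffs spelled out:
# Same diagnostic with the default cutoffs made explicit
nzvSummary <- nearZeroVar( trainDF, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE )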
# Drop the near zero-variance columns
nzvColumnsIndex <- nearZeroVar( trainDF )
trainCleanDF <- trainDF[ , -nzvColumnsIndex, with = FALSE ]
# Drop columns with NAs: no imputation of missing values
dfNA <- trainCleanDF[ , lapply( .SD, anyNA), .SDcols = names(trainCleanDF) ]
indexNA <- unlist( dfNA )
indexFullCols <- indexNA[ indexNA == FALSE ]
keepCols <- names( indexFullCols )
trainCleanDF <- trainCleanDF[ , keepCols, with = FALSE ]
This shows that I kept 54 predictors out of the 159 candidates in the original dataset.
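A quick way to verify that count (subtracting 1 for the response classe):
# Number of predictors remaining after cleaning
ncol( trainCleanDF ) - 1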
How you built your model.
I split the (preprocessed) dataset into 3 parts: 70% training, 15% validation and 15% testing.
#### Split dataset into training and test set ####
# Split dataset into 3 parts: 70% training, 15% validation and 15% testing
# Split dataset into 70% training set and 30% rest
inTrain <- createDataPartition( trainCleanDF$classe, p = 0.7 )[[1]]
training <- trainCleanDF[ inTrain, ]
rest <- trainCleanDF[ -inTrain, ]
# Split the remaining 30% evenly into testing and validation sets
inTesting <- createDataPartition( rest$classe, p = 0.5 )[[1]]
testing <- rest[ inTesting, ]
validation <- rest[ -inTesting, ]
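As a quick sanity check, the three splits should come out to roughly 70/15/15:
# Verify split proportions
round( c( training = nrow(training), testing = nrow(testing), validation = nrow(validation) ) / nrow(trainCleanDF), 2 )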
I first train the two classifiers (Random Forest and Extreme Gradient Boosting) on the training set, using repeated CV for both.
#### Train classifiers ####
# Set up formula
formulaTrees <- classe ~ .
# Set seed for reproducibility
set.seed( 123456 )
# Register parallel backend (4 workers)
registerDoParallel( cores = 4 )
getDoParWorkers()
## [1] 4
fitRF <- train(formulaTrees, data = training, trControl = trControl, method = "rf" )
fitRF # Inspect repeated cv
## Random Forest
##
## 13737 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 12363, 12365, 12362, 12364, 12363, 12364, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.99383 0.99219
## 30 0.99713 0.99637
## 58 0.99595 0.99488
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 30.
#### Train Extreme Gradient Boosting ####
set.seed( 123456 )
registerDoParallel( cores = 4 )
getDoParWorkers()
## [1] 4
fitGBM <- train(formulaTrees, data = training, trControl = trControl, method = "xgbTree" )
fitGBM # Inspect repeated cv
## eXtreme Gradient Boosting
##
## 13737 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 12363, 12365, 12362, 12364, 12363, 12364, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree nrounds Accuracy Kappa
## 0.3 1 0.6 50 0.81539 0.76602
## 0.3 1 0.6 100 0.88479 0.85419
## 0.3 1 0.6 150 0.91477 0.89215
## 0.3 1 0.8 50 0.81794 0.76927
## 0.3 1 0.8 100 0.88577 0.85542
## 0.3 1 0.8 150 0.91614 0.89389
## 0.3 2 0.6 50 0.95615 0.94452
## 0.3 2 0.6 100 0.98902 0.98611
## 0.3 2 0.6 150 0.99607 0.99503
## 0.3 2 0.8 50 0.95570 0.94395
## 0.3 2 0.8 100 0.98936 0.98654
## 0.3 2 0.8 150 0.99640 0.99545
## 0.3 3 0.6 50 0.99020 0.98761
## 0.3 3 0.6 100 0.99828 0.99783
## 0.3 3 0.6 150 0.99936 0.99919
## 0.3 3 0.8 50 0.99116 0.98882
## 0.3 3 0.8 100 0.99859 0.99821
## 0.3 3 0.8 150 0.99927 0.99908
## 0.4 1 0.6 50 0.84818 0.80777
## 0.4 1 0.6 100 0.90726 0.88264
## 0.4 1 0.6 150 0.93598 0.91900
## 0.4 1 0.8 50 0.84985 0.80988
## 0.4 1 0.8 100 0.90797 0.88355
## 0.4 1 0.8 150 0.93587 0.91885
## 0.4 2 0.6 50 0.97355 0.96654
## 0.4 2 0.6 100 0.99461 0.99319
## 0.4 2 0.6 150 0.99835 0.99792
## 0.4 2 0.8 50 0.97454 0.96779
## 0.4 2 0.8 100 0.99473 0.99333
## 0.4 2 0.8 150 0.99828 0.99783
## 0.4 3 0.6 50 0.99537 0.99414
## 0.4 3 0.6 100 0.99913 0.99890
## 0.4 3 0.6 150 0.99946 0.99932
## 0.4 3 0.8 50 0.99557 0.99440
## 0.4 3 0.8 100 0.99929 0.99910
## 0.4 3 0.8 150 0.99948 0.99934
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.8 and min_child_weight = 1.
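caret searched its default tuning grid here. An explicit grid can be supplied via tuneGrid instead; the values below are purely illustrative, not the settings used above:
# Illustrative custom tuning grid for method = "xgbTree"
xgbGrid <- expand.grid(
    nrounds          = c( 100, 150, 200 ),
    max_depth        = 3,
    eta              = c( 0.3, 0.4 ),
    gamma            = 0,
    colsample_bytree = 0.8,
    min_child_weight = 1
)
# Not run:
# fitGBM <- train( formulaTrees, data = training, trControl = trControl,
#                  method = "xgbTree", tuneGrid = xgbGrid )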
Comparing these two classifiers shows that Extreme Gradient Boosting performs better when using accuracy as the performance measure.
#### Compare models ####
rValues <- resamples( list( xgb = fitGBM, rf = fitRF ) )
bwplot( rValues, metric = "Accuracy", main = "rf vs xgboost" )
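summary( rValues ) gives a numeric view of the same comparison:
# Tabulate resampled Accuracy and Kappa for both models
summary( rValues )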
I then stack/combine these classifiers into an ensemble model on the validation set. Again I use repeated CV.
#### Evaluate whether stacking classifiers (ensemble) is worthwhile ####
# Predict models
predRF <- predict( fitRF, newdata = validation )
predGBM <- predict( fitGBM, newdata = validation )
# Create dataset to use for tuning parameters of ensemble model
stackingDF <- data.frame( predRF = predRF, predGBM = predGBM, classe = validation$classe )
# Show observations where classifiers' predictions differ
stackingDF[ stackingDF$predRF != stackingDF$predGBM, ]
## predRF predGBM classe
## 957 A B B
## 1070 C B B
## 1289 B A B
## 1701 B C C
## 1957 C D D
## 2128 C D D
## 2130 C D D
## 2756 E D E
# Stack predictions
formulaEnsemble <- classe ~ predRF + predGBM
fitEnsemble <- train( formulaEnsemble, data = stackingDF, trControl = trControl, method = "xgbTree" )
Looking at the 8 observations where the classifiers disagree, Extreme Gradient Boosting matches the true classe in 6 of them versus 2 for Random Forest, so it is clearly the stronger of the two here. This diminishes the benefits of using an ensemble model: Extreme Gradient Boosting alone performs well enough.
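This can also be quantified directly on the validation set:
# Validation-set accuracy of each base classifier
postResample( predRF, validation$classe )["Accuracy"]
postResample( predGBM, validation$classe )["Accuracy"]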
What you think the expected out of sample error is.
I finally use the testing set I devised to assess the expected out-of-sample error. I’ll use Extreme Gradient Boosting as my final model. As we saw, stacking the classifiers is unlikely to boost model accuracy by much in this case.
#### Assess the expected out-of-sample error ####
# Accuracy on test sample
predFinal <- predict( fitGBM, newdata = testing )
accFinal <- postResample( predFinal, testing$classe )["Accuracy"]
accFinal
## Accuracy
## 0.99966
This step shows that the final model – Extreme Gradient Boosting – has an accuracy of 0.99966 on the testing dataset I devised. I expect similar performance on the testing dataset provided for the assignment.
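Equivalently, the expected out-of-sample error is one minus this accuracy, i.e. roughly 0.03%:
# Expected out-of-sample error rate
1 - accFinal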
#### Predict labels on testing data provided ####
# Download testing dataset
urlTestingData <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testDF <- fread(urlTestingData)
# Show predictions on testing dataset
predAssignment <- predict( fitGBM, newdata = testDF )
predAssignment
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
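If each prediction needs to be submitted as its own file, something along these lines would do; the problem_id_ naming scheme is an assumption, not part of the code above:
# Hypothetical: write each prediction to its own text file for submission
for ( i in seq_along( predAssignment ) ) {
    writeLines( as.character( predAssignment[i] ), con = paste0( "problem_id_", i, ".txt" ) )
}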
Please run the script run-analysis.R (see the GitHub repo) to replicate this analysis.
Please see the gh-pages branch for the rendered version of this report.
Also, here are the package/session details from my machine at the time I ran/wrote the report:
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10240)
##
## locale:
## [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.1252
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] randomForest_4.6-12 xgboost_0.4-4 doParallel_1.0.10
## [4] iterators_1.0.8 foreach_1.4.3 knitr_1.13
## [7] caret_6.0-70 ggplot2_2.1.0 lattice_0.20-33
## [10] data.table_1.9.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 formatR_1.4 nloptr_1.0.4
## [4] plyr_1.8.4 tools_3.3.0 digest_0.6.9
## [7] lme4_1.1-12 evaluate_0.9 gtable_0.2.0
## [10] nlme_3.1-127 mgcv_1.8-12 Matrix_1.2-6
## [13] yaml_2.1.13 SparseM_1.7 stringr_1.0.0
## [16] MatrixModels_0.4-1 stats4_3.3.0 grid_3.3.0
## [19] nnet_7.3-12 rmarkdown_1.0 minqa_1.2.4
## [22] reshape2_1.4.1 car_2.1-2 magrittr_1.5
## [25] scales_0.4.0 codetools_0.2-14 htmltools_0.3.5
## [28] MASS_7.3-45 splines_3.3.0 pbkrtest_0.4-6
## [31] colorspace_1.2-6 quantreg_5.26 stringi_1.1.1
## [34] munsell_0.4.3 chron_2.3-47
And that’s that. Thank you.