Excerpt taken verbatim from course instructions:
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
Why you made the choices you did.
For this task, I chose to use Random Forest and Extreme Gradient Boosting.
These correspond to using train in the caret package with the options method="rf" and method="xgbTree" respectively.
The primary reason for choosing these two is that they perform well for classification and lend themselves well to ensemble learning. I intend to stack these two classifiers to (hopefully) improve their accuracy.
In particular, tree-based methods like Random Forest and Extreme Gradient Boosting tend to work well when predictors are numerous, measured on different scales, and related to the response in nonlinear ways, all without requiring much manual feature engineering.
Feature engineering is hard in the best of cases; without domain knowledge, it is harder still (maybe exponentially so). Random Forest and Extreme Gradient Boosting are convenient ways to confront or bypass these issues.
How you used cross validation.
To train my model, I used repeated cross-validation (CV). Tree-based methods are notorious for being low bias and high variance. Repeated cross-validation, which repeats K-fold CV several times and then averages the results, keeps the low bias while reducing the variance of the resulting performance estimate.
Below I set the number of folds to be 10 in each iteration of k-fold CV and repeat k-fold CV 5 times.
#### Load packages and set defaults for repeated cross-validation ####
library(data.table)   # fread
library(caret)        # train, trainControl, createDataPartition, nearZeroVar, ...
library(doParallel)   # parallel backend for caret
trControl <- caret::trainControl(
    method        = "repeatedcv",  # repeated K-fold CV
    number        = 10,            # K = 10 folds
    repeats       = 5,             # repeat K-fold CV 5 times
    verboseIter   = FALSE,
    allowParallel = TRUE
)
Below I download the training data.
#### Download training data ####
urlTrainingData <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
trainDF <- fread(urlTrainingData)
stopifnot( 'classe' %in% names(trainDF) )
# Convert response to factor variable
trainDF$classe <- factor( trainDF$classe )
As is, the data needs to be preprocessed: it contains many predictors with near zero-variance. I drop these variables from the dataset, i.e. near zero-variance predictors won't be used in the model.
I also drop columns with missing values. Imputation would have been another option, but choosing a good imputation method is itself a delicate art and can introduce its own set of problems. Dropping the columns outright seemed more robust.
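For reference, had I gone the imputation route, caret's preProcess supports several methods, including k-nearest-neighbour imputation. A minimal sketch (not run in this analysis, and assuming the columns passed in are numeric):
# Not run: impute missing values instead of dropping the columns
# preObj    <- preProcess( trainDF, method = "knnImpute" )
# imputedDF <- predict( preObj, trainDF )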
#### Data cleaning ####
# Drop the row index (V1) and the timestamp columns
dropCols <- c( 'V1', 'raw_timestamp_part_1', 'raw_timestamp_part_2', 'cvtd_timestamp' )
trainDF[ , (dropCols) := NULL ]
stopifnot( !any( dropCols %in% names(trainDF) ) )
# Show examples of near zero-variance predictors
nzvSummary <- nearZeroVar( trainDF, saveMetrics = TRUE )
nzvDropped <- nzvSummary[ nzvSummary$nzv == TRUE, ]
head( nzvDropped, 10 )
## freqRatio percentUnique zeroVar nzv
## new_window 47.330 0.010193 FALSE TRUE
## kurtosis_roll_belt 1921.600 2.023239 FALSE TRUE
## kurtosis_picth_belt 600.500 1.615534 FALSE TRUE
## kurtosis_yaw_belt 47.330 0.010193 FALSE TRUE
## skewness_roll_belt 2135.111 2.013047 FALSE TRUE
## skewness_roll_belt.1 600.500 1.722556 FALSE TRUE
## skewness_yaw_belt 47.330 0.010193 FALSE TRUE
## max_yaw_belt 640.533 0.346550 FALSE TRUE
## min_yaw_belt 640.533 0.346550 FALSE TRUE
## amplitude_yaw_belt 50.042 0.020385 FALSE TRUE
tail( nzvDropped, 10 )
## freqRatio percentUnique zeroVar nzv
## amplitude_yaw_forearm 59.677 0.015289 FALSE TRUE
## avg_roll_forearm 27.667 1.641015 FALSE TRUE
## stddev_roll_forearm 87.000 1.630823 FALSE TRUE
## var_roll_forearm 43.500 1.625726 FALSE TRUE
## avg_pitch_forearm 83.000 1.651208 FALSE TRUE
## stddev_pitch_forearm 41.500 1.646112 FALSE TRUE
## var_pitch_forearm 83.000 1.651208 FALSE TRUE
## avg_yaw_forearm 83.000 1.651208 FALSE TRUE
## stddev_yaw_forearm 85.000 1.641015 FALSE TRUE
## var_yaw_forearm 85.000 1.641015 FALSE TRUE
# Print number of near zero-variance predictors dropped
nrow( nzvDropped )
## [1] 60
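For context on the table above: nearZeroVar flags a predictor when its frequency ratio (most common value over the second most common) exceeds freqCut and its percentage of unique values falls below uniqueCut. The call below is equivalent to the one above, with caret's default cutoffs spelled out:
# Same diagnostic with the default cutoffs made explicit
nzvSummary <- nearZeroVar( trainDF, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE )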
# Drop the near zero-variance columns
nzvColumnsIndex <- nearZeroVar( trainDF )
trainCleanDF <- trainDF[ , -nzvColumnsIndex, with = FALSE ]
# Drop columns with NAs: no imputation of missing values
dfNA <- trainCleanDF[ , lapply( .SD, anyNA), .SDcols = names(trainCleanDF) ]
indexNA <- unlist( dfNA )
indexFullCols <- indexNA[ indexNA == FALSE ]
keepCols <- names( indexFullCols )
trainCleanDF <- trainCleanDF[ , keepCols, with = FALSE ]
This shows that I kept 54 predictors out of the 159 candidates in the original dataset.
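A quick way to verify that count (subtracting 1 for the response classe):
# Number of predictors remaining after cleaning
ncol( trainCleanDF ) - 1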
How you built your model.
I split the (preprocessed) dataset into 3 parts: 70% training, 15% validation and 15% testing.
#### Split dataset into training and test set ####
# Split dataset into 3 parts: 70% training, 15% validation and 15% testing
# Split dataset into 70% training set and 30% rest
inTrain <- createDataPartition( trainCleanDF$classe, p = 0.7 )[[1]]
training <- trainCleanDF[ inTrain, ]
rest <- trainCleanDF[ -inTrain, ]
# Split the remaining 30% evenly into testing and validation sets
inTesting <- createDataPartition( rest$classe, p = 0.5 )[[1]]
testing <- rest[ inTesting, ]
validation <- rest[ -inTesting, ]
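As a quick sanity check, the three splits should come out to roughly 70/15/15:
# Verify split proportions
round( c( training = nrow(training), testing = nrow(testing), validation = nrow(validation) ) / nrow(trainCleanDF), 2 )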
I first train the two classifiers (Random Forest and Extreme Gradient Boosting) on the training set, using repeated CV for both.
#### Train classifiers ####
# Set up formula
formulaTrees <- classe ~ .
# Set seed for reproducibility
set.seed( 123456 )
# Register parallel backend (4 workers)
registerDoParallel( cores = 4 )
getDoParWorkers()
## [1] 4
fitRF <- train(formulaTrees, data = training, trControl = trControl, method = "rf" )
fitRF # Inspect repeated cv
## Random Forest
##
## 13737 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 12363, 12365, 12362, 12364, 12363, 12364, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.99383 0.99219
## 30 0.99713 0.99637
## 58 0.99595 0.99488
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 30.
#### Train Extreme Gradient Boosting ####
set.seed( 123456 )
registerDoParallel( cores = 4 )
getDoParWorkers()
## [1] 4
fitGBM <- train(formulaTrees, data = training, trControl = trControl, method = "xgbTree" )
fitGBM # Inspect repeated cv
## eXtreme Gradient Boosting
##
## 13737 samples
## 54 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 12363, 12365, 12362, 12364, 12363, 12364, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree nrounds Accuracy Kappa
## 0.3 1 0.6 50 0.81539 0.76602
## 0.3 1 0.6 100 0.88479 0.85419
## 0.3 1 0.6 150 0.91477 0.89215
## 0.3 1 0.8 50 0.81794 0.76927
## 0.3 1 0.8 100 0.88577 0.85542
## 0.3 1 0.8 150 0.91614 0.89389
## 0.3 2 0.6 50 0.95615 0.94452
## 0.3 2 0.6 100 0.98902 0.98611
## 0.3 2 0.6 150 0.99607 0.99503
## 0.3 2 0.8 50 0.95570 0.94395
## 0.3 2 0.8 100 0.98936 0.98654
## 0.3 2 0.8 150 0.99640 0.99545
## 0.3 3 0.6 50 0.99020 0.98761
## 0.3 3 0.6 100 0.99828 0.99783
## 0.3 3 0.6 150 0.99936 0.99919
## 0.3 3 0.8 50 0.99116 0.98882
## 0.3 3 0.8 100 0.99859 0.99821
## 0.3 3 0.8 150 0.99927 0.99908
## 0.4 1 0.6 50 0.84818 0.80777
## 0.4 1 0.6 100 0.90726 0.88264
## 0.4 1 0.6 150 0.93598 0.91900
## 0.4 1 0.8 50 0.84985 0.80988
## 0.4 1 0.8 100 0.90797 0.88355
## 0.4 1 0.8 150 0.93587 0.91885
## 0.4 2 0.6 50 0.97355 0.96654
## 0.4 2 0.6 100 0.99461 0.99319
## 0.4 2 0.6 150 0.99835 0.99792
## 0.4 2 0.8 50 0.97454 0.96779
## 0.4 2 0.8 100 0.99473 0.99333
## 0.4 2 0.8 150 0.99828 0.99783
## 0.4 3 0.6 50 0.99537 0.99414
## 0.4 3 0.6 100 0.99913 0.99890
## 0.4 3 0.6 150 0.99946 0.99932
## 0.4 3 0.8 50 0.99557 0.99440
## 0.4 3 0.8 100 0.99929 0.99910
## 0.4 3 0.8 150 0.99948 0.99934
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.8 and min_child_weight = 1.
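caret searched its default tuning grid here. An explicit grid can be supplied via tuneGrid instead; the values below are purely illustrative, not the settings used above:
# Illustrative custom tuning grid for method = "xgbTree"
xgbGrid <- expand.grid(
    nrounds          = c( 100, 150, 200 ),
    max_depth        = 3,
    eta              = c( 0.3, 0.4 ),
    gamma            = 0,
    colsample_bytree = 0.8,
    min_child_weight = 1
)
# Not run:
# fitGBM <- train( formulaTrees, data = training, trControl = trControl,
#                  method = "xgbTree", tuneGrid = xgbGrid )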
Comparing these two classifiers shows that Extreme Gradient Boosting performs better when using accuracy as the performance measure.
#### Compare models ####
rValues <- resamples( list( xgb = fitGBM, rf = fitRF ) )
bwplot( rValues, metric = "Accuracy", main = "rf vs xgboost" )
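summary( rValues ) gives a numeric view of the same comparison:
# Tabulate resampled Accuracy and Kappa for both models
summary( rValues )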
I then stack/combine these classifiers into an ensemble model on the validation set. Again I use repeated CV.
#### Evaluate whether stacking classifiers (ensemble) is worthwhile ####
# Predict models
predRF <- predict( fitRF, newdata = validation )
predGBM <- predict( fitGBM, newdata = validation )
# Create dataset to use for tuning parameters of ensemble model
stackingDF <- data.frame( predRF = predRF, predGBM = predGBM, classe = validation$classe )
# Show observations where classifiers' predictions differ
stackingDF[ stackingDF$predRF != stackingDF$predGBM, ]
## predRF predGBM classe
## 957 A B B
## 1070 C B B
## 1289 B A B
## 1701 B C C
## 1957 C D D
## 2128 C D D
## 2130 C D D
## 2756 E D E
# Stack predictions
formulaEnsemble <- classe ~ predRF + predGBM
fitEnsemble <- train( formulaEnsemble, data = stackingDF, trControl = trControl, method = "xgbTree" )
Looking at the 8 observations where the classifiers disagree, Extreme Gradient Boosting matches the true classe in 6 of them versus 2 for Random Forest, so it is clearly the stronger of the two here. This diminishes the benefits of using an ensemble model: Extreme Gradient Boosting alone performs well enough.
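This can also be quantified directly on the validation set:
# Validation-set accuracy of each base classifier
postResample( predRF, validation$classe )["Accuracy"]
postResample( predGBM, validation$classe )["Accuracy"]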
What you think the expected out of sample error is.
I finally use the testing set I devised to assess the expected out-of-sample error. I’ll use Extreme Gradient Boosting as my final model. As we saw, stacking the classifiers is unlikely to boost model accuracy by much in this case.
#### Assess the expected out-of-sample error ####
# Accuracy on test sample
predFinal <- predict( fitGBM, newdata = testing )
accFinal <- postResample( predFinal, testing$classe )["Accuracy"]
accFinal
## Accuracy
## 0.99966
This step shows that the final model – Extreme Gradient Boosting – has an accuracy of 0.99966 on the testing dataset I devised. I expect similar performance on the testing dataset provided for the assignment.
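Equivalently, the expected out-of-sample error is one minus this accuracy, i.e. roughly 0.03%:
# Expected out-of-sample error rate
1 - accFinal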
#### Predict labels on testing data provided ####
# Download testing dataset
urlTestingData <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
testDF <- fread(urlTestingData)
# Show predictions on testing dataset
predAssignment <- predict( fitGBM, newdata = testDF )
predAssignment
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
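If each prediction needs to be submitted as its own file, something along these lines would do; the problem_id_ naming scheme is an assumption, not part of the code above:
# Hypothetical: write each prediction to its own text file for submission
for ( i in seq_along( predAssignment ) ) {
    writeLines( as.character( predAssignment[i] ), con = paste0( "problem_id_", i, ".txt" ) )
}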
Please run the script run-analysis.R (see the GitHub repo) to replicate this analysis.
Please see the gh-pages branch for the rendered version of this report.
Also, here are the package/session details from my machine at the time I ran/wrote the report:
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10240)
##
## locale:
## [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.1252
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] randomForest_4.6-12 xgboost_0.4-4 doParallel_1.0.10
## [4] iterators_1.0.8 foreach_1.4.3 knitr_1.13
## [7] caret_6.0-70 ggplot2_2.1.0 lattice_0.20-33
## [10] data.table_1.9.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 formatR_1.4 nloptr_1.0.4
## [4] plyr_1.8.4 tools_3.3.0 digest_0.6.9
## [7] lme4_1.1-12 evaluate_0.9 gtable_0.2.0
## [10] nlme_3.1-127 mgcv_1.8-12 Matrix_1.2-6
## [13] yaml_2.1.13 SparseM_1.7 stringr_1.0.0
## [16] MatrixModels_0.4-1 stats4_3.3.0 grid_3.3.0
## [19] nnet_7.3-12 rmarkdown_1.0 minqa_1.2.4
## [22] reshape2_1.4.1 car_2.1-2 magrittr_1.5
## [25] scales_0.4.0 codetools_0.2-14 htmltools_0.3.5
## [28] MASS_7.3-45 splines_3.3.0 pbkrtest_0.4-6
## [31] colorspace_1.2-6 quantreg_5.26 stringi_1.1.1
## [34] munsell_0.4.3 chron_2.3-47
And that’s that. Thank you.