Executive Summary

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Goal

The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Approach

I intend to evaluate each of the variables in the dataset to determine which are not useful for predicting classe.
Once the data are in the proper format, after the transformations described below, I'll apply four different models to determine which fits best.
I will then test each model against the validation set to look for over-fitting.
I will then use that model to predict on the testing set.
I have relied heavily on the work of Max Kuhn to streamline the model testing process through caret.
http://www.edii.uclm.es/~useR-2013/Tutorials/kuhn/user_caret_2up.pdf
Applied Predictive Modeling; Kuhn & Johnson

Pre-processing the Data

Load Libraries

setwd("C:/Users/Kier/Documents/Analytics Course/07_PracticalMachineLearning")
suppressMessages(library(caret))
suppressMessages(library(ggplot2))
suppressMessages(library(plyr))
suppressMessages(library(tidyverse))
suppressMessages(library(rattle))
suppressMessages(library(partykit))
suppressMessages(library(randomForest))

Time-saving RDS files

Some of these models take a long time to build, so instead of re-running the training process I will load the fitted models from RDS files. The code used to build each model is shown in its section if you care to run it on your own.

tr_fit_rpart <- readRDS("tr_fit_rpart.RDS")
tr_fit_rf <- readRDS("tr_fit_rf.RDS")
tr_fit_svm_caret <- readRDS("tr_fit_svm_caret.RDS")
tr_fit_gbm <- readRDS("tr_fit_gbm.RDS")
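
For reference, here is a sketch of how each fitted object would have been cached with saveRDS() after the corresponding training calls shown later in this report (file names match the readRDS() calls above):

# Cache fitted models so the report can be re-knit without re-training
saveRDS(tr_fit_rpart, "tr_fit_rpart.RDS")
saveRDS(tr_fit_rf, "tr_fit_rf.RDS")
saveRDS(tr_fit_svm_caret, "tr_fit_svm_caret.RDS")
saveRDS(tr_fit_gbm, "tr_fit_gbm.RDS")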

Load data locally

training_raw <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testing_raw <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))
# Thankfully the na.strings argument exists to deal with those #DIV/0! values.

Split the raw training data into training and validation sets.

set.seed(317) 
in_train <- createDataPartition(training_raw$classe, p = 0.7, list = FALSE)
validation_raw <- training_raw[-in_train, ]
tr1 <- training_raw[in_train, ]

Identify and remove the variables that are mostly NA

set_for_removal <- tr1[,colSums(is.na(tr1))/nrow(tr1) >= 0.50]
names_for_removal <- names(set_for_removal)
cols_for_removal <- which(names(tr1) %in% names_for_removal)
tr2 <- tr1[, -cols_for_removal]
# Down to 60 variables

Get rid of variables with near zero variance

nsv <- nearZeroVar(tr2)
tr3 <- tr2[, -nsv]
# Down to 59 variables

Remove the first six variables (row index, user name, and timestamp/window columns). They are identifiers rather than sensor measurements, so they have very little value for prediction.

tr4 <- tr3[, -c(1:6)]
# down to 53 variables

Ready for pre-processing

Remove our result variable so it doesn’t get preprocessed too.

trainX <- tr4[, names(tr4) != "classe"]

Center & Scale

trainPreProcValues <- preProcess(trainX, method = c("center", "scale"))

Apply the preProcValues model to trainX with predict()

trainScaled <- predict(trainPreProcValues, trainX)
# This is really nice.  It does all the work for you.

Look for, and remove, highly correlated variables

correlations <- cor(trainScaled)
high_corr <- findCorrelation(correlations, cutoff = .75)
trainFiltered <- trainScaled[, -high_corr]

Bind the classe variable back onto the transformed dataset.

training <- cbind(classe = tr4$classe, trainFiltered)

Process the validation set the same way.

validation_classe <- validation_raw$classe
v1 <- validation_raw[, -cols_for_removal]
v2 <- v1[, -nsv]
v3 <- v2[, -c(1:6)]
valX <- v3[, names(v3) != "classe"]
valPreProcValues <- preProcess(valX, method = c("center", "scale"))
valScaled <- predict(valPreProcValues, valX)
valFiltered <- valScaled[, -high_corr]
validation <- cbind(classe = validation_classe, valFiltered)

Process the testing set the same way

problem_id <- testing_raw$problem_id
te1 <- testing_raw[, -cols_for_removal]
te2 <- te1[, -nsv]
te3 <- te2[, -c(1:6, 59)]
testX <- te3
testPreProcValues <- preProcess(testX, method = c("center", "scale"))
testScaled <- predict(testPreProcValues, testX)
testFiltered <- testScaled[, -high_corr]
testing <- testFiltered %>%
    mutate(classe = NA_character_) # Add the classe variable to test since it doesn't currently exist.

Remove intermediate variables

rm(te1, te2, te3, tr1, tr2, tr3, tr4, testScaled, trainScaled, 
   set_for_removal, testX, trainX, v1, v2, v3, valScaled, valX, 
   cols_for_removal, names_for_removal, nsv, testPreProcValues, 
   trainPreProcValues, valPreProcValues, testFiltered, valFiltered, 
   trainFiltered, correlations)

Model Evaluation

Kuhn recommends starting with black-box models like SVM and GBM and then seeing whether any simpler models produce similar results. Black-box models tend to produce better results at the expense of interpretability, while simpler models are more interpretable and sometimes produce very similar results.

Each of the models will use the same cross-validation controller.

In this case caret will run 10-fold cross-validation on the training set, repeated three times. This takes longer but gives more stable estimates.

cvCtrl <- trainControl(method = "repeatedcv", repeats = 3, savePred=TRUE)
# method = "repeatedcv" with the default number = 10 gives 10-fold CV, repeated 3 times

Support Vector Machines (SVM)

This one has the best name by far. It also produces some good results.

tr_fit_svm_caret <- train(classe ~ ., data = training, 
                          method = "svmRadial", 
                          tuneGrid = data.frame(.C = c(.25, .5, 1),
                                                .sigma = .05),
                          trControl = cvCtrl)

I like to apply the model to both the training and validation sets to compare the results. A large gap in accuracy between the two suggests the model is over-fitting the data it was trained on.

tr_pred_svm_caret <- suppressMessages(predict(tr_fit_svm_caret, training))
confusionMatrix(tr_pred_svm_caret, training$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3905   98    0    0    0
##          B    1 2548   24    0    0
##          C    0   12 2351  161   31
##          D    0    0   19 2087   41
##          E    0    0    2    4 2453
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9714          
##                  95% CI : (0.9685, 0.9741)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9638          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9997   0.9586   0.9812   0.9267   0.9715
## Specificity            0.9900   0.9977   0.9820   0.9948   0.9995
## Pos Pred Value         0.9755   0.9903   0.9202   0.9721   0.9976
## Neg Pred Value         0.9999   0.9901   0.9960   0.9858   0.9936
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1855   0.1711   0.1519   0.1786
## Detection Prevalence   0.2914   0.1873   0.1860   0.1563   0.1790
## Balanced Accuracy      0.9949   0.9782   0.9816   0.9608   0.9855
# Accuracy 97.14%; Kappa 0.9638
val_pred_svm_caret <- predict(tr_fit_svm_caret, validation)
confusionMatrix(val_pred_svm_caret, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1668   76    3    0    3
##          B    1  909   19    0    0
##          C    1   90  989   86   31
##          D    1   26   15  874   29
##          E    3   38    0    4 1019
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9276          
##                  95% CI : (0.9207, 0.9341)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9084          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.7981   0.9639   0.9066   0.9418
## Specificity            0.9805   0.9958   0.9572   0.9856   0.9906
## Pos Pred Value         0.9531   0.9785   0.8262   0.9249   0.9577
## Neg Pred Value         0.9985   0.9536   0.9921   0.9818   0.9869
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2834   0.1545   0.1681   0.1485   0.1732
## Detection Prevalence   0.2974   0.1579   0.2034   0.1606   0.1808
## Balanced Accuracy      0.9885   0.8969   0.9606   0.9461   0.9662
# Accuracy 92.76%; Kappa 0.9084

This is not a bad way to start. Predicting over 90% correct on the validation set is very promising.

Generalized Boosting Model (GBM)

This is another black-box modeling package. Let’s see how it does…

tr_fit_gbm <- train(classe ~ ., data = training, 
                 method = "gbm", 
                 trControl = cvCtrl,
                 verbose = FALSE)
tr_pred_gbm <- suppressMessages(predict(tr_fit_gbm, training))
confusionMatrix(tr_pred_gbm, training$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3852   79    1    5   11
##          B   32 2490  101   20   26
##          C   12   76 2263   83   31
##          D    9    8   30 2129   21
##          E    1    5    1   15 2436
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9587         
##                  95% CI : (0.9553, 0.962)
##     No Information Rate : 0.2843         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9478         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9862   0.9368   0.9445   0.9454   0.9648
## Specificity            0.9902   0.9838   0.9822   0.9941   0.9980
## Pos Pred Value         0.9757   0.9329   0.9181   0.9690   0.9910
## Neg Pred Value         0.9945   0.9848   0.9882   0.9893   0.9921
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2804   0.1813   0.1647   0.1550   0.1773
## Detection Prevalence   0.2874   0.1943   0.1794   0.1599   0.1789
## Balanced Accuracy      0.9882   0.9603   0.9633   0.9697   0.9814
# Accuracy is 95.87% and Kappa is 0.9478
val_pred_gbm <- predict(tr_fit_gbm, validation)
confusionMatrix(val_pred_gbm, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1637   97   23   18    4
##          B    9  872   54   28   23
##          C   10  108  861   34   33
##          D   17   46   60  858   22
##          E    1   16   28   26 1000
## 
## Overall Statistics
##                                         
##                Accuracy : 0.8884        
##                  95% CI : (0.88, 0.8963)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.8585        
##  Mcnemar's Test P-Value : < 2.2e-16     
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9779   0.7656   0.8392   0.8900   0.9242
## Specificity            0.9663   0.9760   0.9619   0.9705   0.9852
## Pos Pred Value         0.9202   0.8844   0.8231   0.8554   0.9337
## Neg Pred Value         0.9910   0.9455   0.9659   0.9783   0.9830
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2782   0.1482   0.1463   0.1458   0.1699
## Detection Prevalence   0.3023   0.1675   0.1777   0.1704   0.1820
## Balanced Accuracy      0.9721   0.8708   0.9006   0.9303   0.9547
# Accuracy is 88.84% and Kappa is 0.8585.

Accuracies of ~96% and ~89%, respectively.

Recursive Partitioning (rpart)

tr_fit_rpart <- train(classe ~ ., data = training, method = "rpart",
                      tuneLength = 50,
                      trControl = cvCtrl)
# Predict on training set
train_pred_rpart <- suppressMessages(predict.train(tr_fit_rpart, newdata = training))
confusionMatrix(train_pred_rpart, training$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3803   50   14   27   19
##          B   51 2495   67   27   43
##          C   20   52 2261   50   42
##          D   17   34   34 2132   38
##          E   15   27   20   16 2383
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9517         
##                  95% CI : (0.948, 0.9553)
##     No Information Rate : 0.2843         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.939          
##  Mcnemar's Test P-Value : 0.000863       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9736   0.9387   0.9437   0.9467   0.9438
## Specificity            0.9888   0.9830   0.9855   0.9893   0.9930
## Pos Pred Value         0.9719   0.9299   0.9324   0.9455   0.9683
## Neg Pred Value         0.9895   0.9853   0.9881   0.9895   0.9874
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2768   0.1816   0.1646   0.1552   0.1735
## Detection Prevalence   0.2849   0.1953   0.1765   0.1642   0.1792
## Balanced Accuracy      0.9812   0.9609   0.9646   0.9680   0.9684
# Predict on validation set
val_pred_rpart <- predict.train(tr_fit_rpart, newdata = validation)
confusionMatrix(val_pred_rpart, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1525  121   36   39    6
##          B   85  738   84   34   41
##          C   25  109  751   30   51
##          D   22   62  128  834   27
##          E   17  109   27   27  957
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8165          
##                  95% CI : (0.8064, 0.8263)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7678          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9110   0.6479   0.7320   0.8651   0.8845
## Specificity            0.9520   0.9486   0.9558   0.9514   0.9625
## Pos Pred Value         0.8830   0.7515   0.7774   0.7773   0.8417
## Neg Pred Value         0.9642   0.9182   0.9441   0.9730   0.9737
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2591   0.1254   0.1276   0.1417   0.1626
## Detection Prevalence   0.2935   0.1669   0.1641   0.1823   0.1932
## Balanced Accuracy      0.9315   0.7983   0.8439   0.9083   0.9235

Accuracies of ~95% and ~82%, respectively.

Random Forest

I had a hard time getting this to run through caret, so I used the randomForest package directly.

tr_fit_rf = randomForest(classe ~ ., data=training)
# Predict on training set
train_pred_rf <- predict(tr_fit_rf, training)
confusionMatrix(train_pred_rf, training$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3906    0    0    0    0
##          B    0 2658    0    0    0
##          C    0    0 2396    0    0
##          D    0    0    0 2252    0
##          E    0    0    0    0 2525
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
# Predict on validation set
val_pred_rf <- predict(tr_fit_rf, validation)
confusionMatrix(val_pred_rf, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1664   49    8    5    0
##          B    4 1042   18    1    0
##          C    2   30  974   25    8
##          D    0    1   26  933    2
##          E    4   17    0    0 1072
## 
## Overall Statistics
##                                           
##                Accuracy : 0.966           
##                  95% CI : (0.9611, 0.9705)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.957           
##  Mcnemar's Test P-Value : 3.456e-13       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9940   0.9148   0.9493   0.9678   0.9908
## Specificity            0.9853   0.9952   0.9866   0.9941   0.9956
## Pos Pred Value         0.9641   0.9784   0.9374   0.9699   0.9808
## Neg Pred Value         0.9976   0.9799   0.9893   0.9937   0.9979
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2828   0.1771   0.1655   0.1585   0.1822
## Detection Prevalence   0.2933   0.1810   0.1766   0.1635   0.1857
## Balanced Accuracy      0.9897   0.9550   0.9680   0.9810   0.9932

Interesting. Accuracies of 100% and ~97%, respectively. The perfect training accuracy is expected, since the forest is re-predicting the same rows it was grown on; the ~97% validation accuracy is the more meaningful number.
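
As a sanity check, the randomForest object carries its own out-of-bag (OOB) error estimate, which should land close to the validation error; printing the fitted object reports it, and varImpPlot() shows which sensor variables matter most (a quick sketch, no re-fitting needed):

print(tr_fit_rf)    # includes the "OOB estimate of error rate"
varImpPlot(tr_fit_rf, n.var = 15, main = "Random forest variable importance")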

Choosing the model

SVM and Random Forest produced the highest validation accuracies (see the summary sketched below). Let's apply each of them to the testing set and see whether their predictions agree.
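
A convenience sketch that reuses the prediction objects created above to line up the validation accuracy of each model:

# Validation accuracy for each model, pulled from its confusion matrix
val_preds <- list(svm = val_pred_svm_caret, gbm = val_pred_gbm,
                  rpart = val_pred_rpart, rf = val_pred_rf)
round(sapply(val_preds, function(p)
    unname(confusionMatrix(p, validation$classe)$overall["Accuracy"])), 4)

Now apply both front-runners to the testing set: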

(test_pred_svm_caret <- predict(tr_fit_svm_caret, testing))
##  [1] B A A A A E D B A A A C B A E E A B B B
## Levels: A B C D E
(test_pred_rf <- predict(tr_fit_rf, testing))
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  E  A  A  A  D  B  D  B  A  A  B  C  D  A  E  E  E  B  E  B 
## Levels: A B C D E
confusionMatrix(test_pred_rf, test_pred_svm_caret)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction A B C D E
##          A 6 0 0 0 0
##          B 1 3 0 0 1
##          C 0 0 1 0 0
##          D 1 1 0 1 0
##          E 1 2 0 0 2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.65            
##                  95% CI : (0.4078, 0.8461)
##     No Information Rate : 0.45            
##     P-Value [Acc > NIR] : 0.05803         
##                                           
##                   Kappa : 0.5286          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6667   0.5000     1.00   1.0000   0.6667
## Specificity            1.0000   0.8571     1.00   0.8947   0.8235
## Pos Pred Value         1.0000   0.6000     1.00   0.3333   0.4000
## Neg Pred Value         0.7857   0.8000     1.00   1.0000   0.9333
## Prevalence             0.4500   0.3000     0.05   0.0500   0.1500
## Detection Rate         0.3000   0.1500     0.05   0.0500   0.1000
## Detection Prevalence   0.3000   0.2500     0.05   0.1500   0.2500
## Balanced Accuracy      0.8333   0.6786     1.00   0.9474   0.7451

Only 13 of the 20 test cases produced the same prediction from SVM and Random Forest.
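
The agreement count can be verified directly (a one-line sketch using the prediction vectors above):

# Number of test cases where the two models agree
sum(as.character(test_pred_rf) == as.character(test_pred_svm_caret))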

Conclusion

In this case I would choose Random Forest as the model because of its accuracy scores and because its results are easier to interpret through variable importance. Based on the 96.6% validation accuracy, the expected out-of-sample error is roughly 3-4%. The model could be refined further by tuning it (for example, fewer trees or a different mtry) to remove some complexity while producing similar results; a sketch follows.
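
A hypothetical refinement sketch using tuneRF() from the randomForest package to search for a leaner mtry before refitting a smaller forest; the parameter values here are illustrative, not the settings used above:

set.seed(317)
predictors <- training[, names(training) != "classe"]
# Search over mtry using the out-of-bag error as the criterion
tuned <- tuneRF(x = predictors, y = training$classe,
                ntreeTry = 200, stepFactor = 1.5, improve = 0.01,
                trace = FALSE, plot = FALSE)
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
# Refit a smaller forest with the tuned mtry
tr_fit_rf_small <- randomForest(classe ~ ., data = training,
                                mtry = best_mtry, ntree = 200)

For reference, the chosen predictions for the 20 test cases come from the original random forest fit: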

test_pred_rf
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  E  A  A  A  D  B  D  B  A  A  B  C  D  A  E  E  E  B  E  B 
## Levels: A B C D E