library(dplyr)
library(data.table)
library(caret)
library(randomForest)
library(rpart)
library(gbm)
library(e1071)
# Read the data, treating Excel division errors, blanks, and "NA" as missing
testing<-fread("pml-testing.csv",na.strings=c("#DIV/0!","", " ","NA"))
training<-fread("pml-training.csv",na.strings=c("#DIV/0!","", " ","NA"))

First, I look at and clean the data

I find that a number of columns are more than 50% missing. Although those variables would help prediction where present, the test set is missing the same ones, so they can't be used anyway. I remove them and am left with 19622 complete observations. Finally, I split the data into training and validation sets.

# Inspect the raw data and the share of missing values in each column
head(training)
colMeans(is.na(training))
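
To count how many columns cross that 50% threshold:

sum(colMeans(is.na(training))>.5)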

# Keep only columns that are less than 50% NA (each set filtered on its own missingness)
trainingTrim<-training[,.SD,.SDcols=function(x) mean(is.na(x))<.5]
testingTrim<-testing[,.SD,.SDcols=function(x) mean(is.na(x))<.5]
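
Since each table is trimmed on its own missingness, the kept columns could in principle diverge between the two; a quick sanity check shows which columns differ (only the outcome and id columns should):

setdiff(names(trainingTrim),names(testingTrim))
setdiff(names(testingTrim),names(trainingTrim))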

# 80/20 stratified split on classe: trainTrain to fit, validTrain to evaluate
victorsVector<-createDataPartition(trainingTrim$classe,p=.8)[[1]]
trainTrain<-trainingTrim[victorsVector]
validTrain<-trainingTrim[-victorsVector]
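
createDataPartition stratifies on classe, so the class proportions should be nearly identical across the two splits; a quick check:

round(rbind(train=prop.table(table(trainTrain$classe)),
            valid=prop.table(table(validTrain$classe))),3)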

Make some models

I fit three models suited to multiclass classification: a classification tree (rpart), a gradient boosting machine (gbm), and a neural network (nnet). In each case, I normalize the numeric variables by pre-processing with the "center" and "scale" options. I then predict on the held-out validation set and check each model's performance with a confusion matrix.

# The first five columns are row ids, user names, and timestamps, so train on columns 6:60
mod1norm<-train(classe~.,data=trainTrain[,.SD,.SDcols=6:60],
            preProcess=c("center","scale"),method="rpart")
mod2norm<-train(classe~.,data=trainTrain[,.SD,.SDcols=6:60],
            preProcess=c("center","scale"),method="gbm")
mod3norm<-train(classe~.,data=trainTrain[,.SD,.SDcols=6:60],
            preProcess=c("center","scale"),method="nnet")

rpPred<-predict(mod1norm,newdata=validTrain[,.SD,.SDcols=6:60])
gbmPred<-predict(mod2norm,newdata=validTrain[,.SD,.SDcols=6:60])
nnPred<-predict(mod3norm,newdata=validTrain[,.SD,.SDcols=6:60])
confusionMatrix(rpPred,validTrain$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1025  325  314  289  103
##          B   22  235   22  109   96
##          C   65  199  348  245  188
##          D    0    0    0    0    0
##          E    4    0    0    0  334
## 
## Overall Statistics
##                                           
##                Accuracy : 0.495           
##                  95% CI : (0.4793, 0.5108)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3397          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9185   0.3096  0.50877   0.0000  0.46325
## Specificity            0.6327   0.9213  0.78481   1.0000  0.99875
## Pos Pred Value         0.4985   0.4855  0.33301      NaN  0.98817
## Neg Pred Value         0.9513   0.8476  0.88325   0.8361  0.89205
## Prevalence             0.2845   0.1935  0.17436   0.1639  0.18379
## Detection Rate         0.2613   0.0599  0.08871   0.0000  0.08514
## Detection Prevalence   0.5241   0.1234  0.26638   0.0000  0.08616
## Balanced Accuracy      0.7756   0.6155  0.64679   0.5000  0.73100
confusionMatrix(gbmPred,validTrain$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1112    3    0    0    0
##          B    3  746    3    0    1
##          C    0    8  676    6    1
##          D    1    2    3  636    6
##          E    0    0    2    1  713
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9898          
##                  95% CI : (0.9861, 0.9927)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9871          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9829   0.9883   0.9891   0.9889
## Specificity            0.9989   0.9978   0.9954   0.9963   0.9991
## Pos Pred Value         0.9973   0.9907   0.9783   0.9815   0.9958
## Neg Pred Value         0.9986   0.9959   0.9975   0.9979   0.9975
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2835   0.1902   0.1723   0.1621   0.1817
## Detection Prevalence   0.2842   0.1919   0.1761   0.1652   0.1825
## Balanced Accuracy      0.9977   0.9903   0.9918   0.9927   0.9940
confusionMatrix(nnPred,validTrain$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 969 103  71  78  15
##          B  70 539  97  46 131
##          C  19  28 459  79  91
##          D  55  27  25 378  82
##          E   3  62  32  62 402
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7002          
##                  95% CI : (0.6856, 0.7145)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.619           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8683   0.7101   0.6711  0.58787   0.5576
## Specificity            0.9049   0.8913   0.9330  0.94238   0.9503
## Pos Pred Value         0.7840   0.6104   0.6790  0.66667   0.7166
## Neg Pred Value         0.9453   0.9276   0.9307  0.92104   0.9051
## Prevalence             0.2845   0.1935   0.1744  0.16391   0.1838
## Detection Rate         0.2470   0.1374   0.1170  0.09635   0.1025
## Detection Prevalence   0.3151   0.2251   0.1723  0.14453   0.1430
## Balanced Accuracy      0.8866   0.8007   0.8020  0.76512   0.7540
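
To compare the three at a glance, the validation accuracies can be pulled straight from the confusion matrices:

sapply(list(rpart=rpPred,gbm=gbmPred,nnet=nnPred),
       function(p) confusionMatrix(p,validTrain$classe)$overall["Accuracy"])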

Pick my model

Although my intention was to build an ensemble of the three models above, the GBM outperformed the other two by such a wide margin (99.0% validation accuracy versus 70.0% for the neural network and 49.5% for the tree) that I decided to use it alone.
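
For reference, a row-wise majority vote over the three validation-set predictions would have looked something like this (a sketch of the abandoned ensemble; three-way disagreements fall back to the GBM prediction):

# Majority vote across the three models; defer to the GBM when all disagree
votes<-data.frame(rp=rpPred,gbm=gbmPred,nn=nnPred)
ensPred<-factor(apply(votes,1,function(r){
  tab<-sort(table(r),decreasing=TRUE)
  if(tab[1]>1) names(tab)[1] else r["gbm"]
}),levels=levels(gbmPred))
confusionMatrix(ensPred,validTrain$classe)

Given the GBM's 99% standalone accuracy, a vote that includes two much weaker models would be unlikely to improve on it, which supports using the GBM alone.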

# Predict the 20 test cases with the GBM model
finalPred<-predict(mod2norm,newdata=testingTrim[,.SD,.SDcols=6:60])
finalPred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

These are the final predictions for the project. They scored 20/20 on the quiz, which lines up with the GBM's ~99% validation accuracy.