library(plyr)   # gbm needs plyr; loading it before dplyr keeps dplyr's verbs on top
library(dplyr)
library(data.table)
library(caret)
library(randomForest)
library(rpart)
library(gbm)
library(e1071)
# Read the data, treating blanks and spreadsheet division errors as NA
testing <- fread("pml-testing.csv", na.strings = c("#DIV/0!", "", " ", "NA"))
training <- fread("pml-training.csv", na.strings = c("#DIV/0!", "", " ", "NA"))
I find that a number of columns are more than 50% missing. Those variables might predict well when present, but the test set is missing them too, so I remove them and am left with 19622 complete observations. Finally, I split the data into training and validation sets.
head(training)             # eyeball the raw data
colMeans(is.na(training))  # proportion of missing values per column
trainingTrim <- training[, .SD, .SDcols = colMeans(is.na(training)) < 0.5]  # keep columns <50% NA
testingTrim <- testing[, .SD, .SDcols = colMeans(is.na(testing)) < 0.5]
trainingTrim[, classe := factor(classe)]  # caret expects a factor outcome; fread reads it as character
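Before splitting, it is worth confirming that the two trimmed tables kept the same predictor columns; the only differences should be the outcome in training and the id column in testing:
setdiff(names(trainingTrim), names(testingTrim))  # expect just "classe"
setdiff(names(testingTrim), names(trainingTrim))  # expect just "problem_id" (the test set's row id)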
set.seed(1234)  # any fixed seed; makes the 80/20 split reproducible
victorsVector <- createDataPartition(trainingTrim$classe, p = .8)[[1]]
trainTrain <- trainingTrim[victorsVector]
validTrain <- trainingTrim[-victorsVector]
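As a sanity check on the partition: createDataPartition samples within each level of classe, so the class proportions in the two pieces should match closely.
# Class proportions should be nearly identical across the split
round(rbind(train = prop.table(table(trainTrain$classe)),
            valid = prop.table(table(validTrain$classe))), 3)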
I fit three models appropriate to this classification task: a classification tree (rpart), a gradient boosting machine (gbm), and a neural network (nnet). In each case, I normalize the numeric variables by pre-processing with the “center” and “scale” parameters. I then predict on the held-out validation set and check each model’s performance with a confusion matrix.
# Columns 6:60 drop the row id, user name, and timestamp fields,
# leaving the sensor measurements plus the outcome classe
mod1norm <- train(classe ~ ., data = trainTrain[, .SD, .SDcols = 6:60],
                  preProcess = c("center", "scale"), method = "rpart")
mod2norm <- train(classe ~ ., data = trainTrain[, .SD, .SDcols = 6:60],
                  preProcess = c("center", "scale"), method = "gbm",
                  verbose = FALSE)  # silence gbm's fitting log
mod3norm <- train(classe ~ ., data = trainTrain[, .SD, .SDcols = 6:60],
                  preProcess = c("center", "scale"), method = "nnet",
                  trace = FALSE)    # silence nnet's iteration trace
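One thing worth noting: with no trControl argument, train defaults to 25 bootstrap resamples, which is slow for gbm and nnet on roughly 15,700 training rows. A cross-validated control, sketched below with illustrative settings (the mod2cv name and the choice of 5 folds are hypothetical, not what was run above), would be considerably faster:
# 5-fold cross-validation instead of the default 25 bootstrap resamples
ctrl <- trainControl(method = "cv", number = 5)
mod2cv <- train(classe ~ ., data = trainTrain[, .SD, .SDcols = 6:60],
                preProcess = c("center", "scale"), method = "gbm",
                trControl = ctrl, verbose = FALSE)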
# Predict each model on the held-out validation set
rpPred <- predict(mod1norm, newdata = validTrain[, .SD, .SDcols = 6:60])
gbmPred <- predict(mod2norm, newdata = validTrain[, .SD, .SDcols = 6:60])
nnPred <- predict(mod3norm, newdata = validTrain[, .SD, .SDcols = 6:60])
confusionMatrix(rpPred, validTrain$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1025 325 314 289 103
## B 22 235 22 109 96
## C 65 199 348 245 188
## D 0 0 0 0 0
## E 4 0 0 0 334
##
## Overall Statistics
##
## Accuracy : 0.495
## 95% CI : (0.4793, 0.5108)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3397
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9185 0.3096 0.50877 0.0000 0.46325
## Specificity 0.6327 0.9213 0.78481 1.0000 0.99875
## Pos Pred Value 0.4985 0.4855 0.33301 NaN 0.98817
## Neg Pred Value 0.9513 0.8476 0.88325 0.8361 0.89205
## Prevalence 0.2845 0.1935 0.17436 0.1639 0.18379
## Detection Rate 0.2613 0.0599 0.08871 0.0000 0.08514
## Detection Prevalence 0.5241 0.1234 0.26638 0.0000 0.08616
## Balanced Accuracy 0.7756 0.6155 0.64679 0.5000 0.73100
confusionMatrix(gbmPred, validTrain$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 3 0 0 0
## B 3 746 3 0 1
## C 0 8 676 6 1
## D 1 2 3 636 6
## E 0 0 2 1 713
##
## Overall Statistics
##
## Accuracy : 0.9898
## 95% CI : (0.9861, 0.9927)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9871
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9829 0.9883 0.9891 0.9889
## Specificity 0.9989 0.9978 0.9954 0.9963 0.9991
## Pos Pred Value 0.9973 0.9907 0.9783 0.9815 0.9958
## Neg Pred Value 0.9986 0.9959 0.9975 0.9979 0.9975
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1902 0.1723 0.1621 0.1817
## Detection Prevalence 0.2842 0.1919 0.1761 0.1652 0.1825
## Balanced Accuracy 0.9977 0.9903 0.9918 0.9927 0.9940
confusionMatrix(nnPred, validTrain$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 969 103 71 78 15
## B 70 539 97 46 131
## C 19 28 459 79 91
## D 55 27 25 378 82
## E 3 62 32 62 402
##
## Overall Statistics
##
## Accuracy : 0.7002
## 95% CI : (0.6856, 0.7145)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.619
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8683 0.7101 0.6711 0.58787 0.5576
## Specificity 0.9049 0.8913 0.9330 0.94238 0.9503
## Pos Pred Value 0.7840 0.6104 0.6790 0.66667 0.7166
## Neg Pred Value 0.9453 0.9276 0.9307 0.92104 0.9051
## Prevalence 0.2845 0.1935 0.1744 0.16391 0.1838
## Detection Rate 0.2470 0.1374 0.1170 0.09635 0.1025
## Detection Prevalence 0.3151 0.2251 0.1723 0.14453 0.1430
## Balanced Accuracy 0.8866 0.8007 0.8020 0.76512 0.7540
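Pulling the overall accuracies out of the three confusion matrices makes the gap explicit (about 0.50 for rpart, 0.99 for gbm, and 0.70 for nnet above):
# Extract each model's overall validation accuracy
sapply(list(rpart = rpPred, gbm = gbmPred, nnet = nnPred),
       function(p) confusionMatrix(p, validTrain$classe)$overall["Accuracy"])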
Although my intention was to combine the three models above into an ensemble, the GBM outperformed the other two by such a margin that I decided to use it alone.
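For reference, the stacking I had in mind would have looked something like the sketch below (never run; the object names and the random-forest combiner are illustrative). A fair evaluation of the stack would also need a further hold-out set, since it is trained on the validation predictions.
# Sketch: learn a combiner over the three models' validation predictions
stackDF <- data.frame(rp = rpPred, gbm = gbmPred, nn = nnPred,
                      classe = validTrain$classe)
modStack <- train(classe ~ ., data = stackDF, method = "rf")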
finalPred <- predict(mod2norm, newdata = testingTrim[, .SD, .SDcols = 6:60])
finalPred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These are the final predictions for the project. They scored 20/20 on the course quiz, which suggests the validation-set accuracy estimate held up out of sample.