This report demonstrates statistical methods in R for predicting an outcome using two machine learning algorithms: a decision tree (rpart) and a random forest (rf).
I will use the data from the paper cited below, which captured correct and incorrect executions of the Unilateral Dumbbell Biceps Curl, and fit models to that data to predict the manner in which members of the test set performed the exercise (the classe variable).
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
The first step, of course, is obtaining the data.
# Download the training and testing data if not already present
filename <- "pml_training.csv"
if (!file.exists(filename)) {
    fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
    download.file(fileURL, filename)
}
filename2 <- "pml_testing.csv"
if (!file.exists(filename2)) {
    fileURL2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
    download.file(fileURL2, filename2)
}
training <- read.csv(filename)
testing <- read.csv(filename2)
# In R >= 4.0 read.csv leaves strings as character, but caret's train() and
# confusionMatrix() need a factor outcome, so convert classe explicitly
training$classe <- as.factor(training$classe)
A quick view of the training data shows quite a few columns dominated by NA values and blank cells that would throw off our algorithms. The first 7 columns (identifiers and timestamps) also look non-predictive. I will remove the columns that are more than 90% NA or blank, along with the first 7 columns, and apply the same removal to the test data now so we don’t forget later.
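Before removing anything, we can quantify how widespread the missingness is; a minimal sketch (the naFrac helper name is introduced here just for illustration, and the 90% threshold matches the cutoff used in the cleaning step below):
# Fraction of entries in each column that are NA or blank
naFrac <- colMeans(is.na(training) | training == "")
# Number of columns exceeding the 90% removal cutoff
sum(naFrac > 0.9)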
# Drop the first 7 non-predictive columns
preObjTrain <- training[, -c(1:7)]
# Flag columns that are more than 90% NA or blank
colToRemove <- which(colSums(is.na(preObjTrain) | preObjTrain == "") > 0.9 * dim(preObjTrain)[1])
preObjTrain <- preObjTrain[, -colToRemove]
# Apply the identical column removal to the test set
preObjTest <- testing[, -c(1:7)]
preObjTest <- preObjTest[, -colToRemove]
Even though we already have a training set and a testing set, to estimate the out-of-sample error I’ll split the training set 70/30 into a training subset and a validation subset.
library(caret)
# Fix the random seed so the partition is reproducible (seed value is arbitrary)
set.seed(1234)
inTrain <- createDataPartition(preObjTrain$classe, p = .7, list = FALSE)
train <- preObjTrain[inTrain, ]
validate <- preObjTrain[-inTrain, ]
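A one-line check confirms the split came out roughly 70/30:
# Proportion of rows that landed in the training subset
nrow(train) / nrow(preObjTrain)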
I’m going to use two methods to fit our training model. The first is a decision tree using the “rpart” method. Decision trees work reasonably well when the classes separate cleanly on a small number of variables, and they let you make pretty tree plots that I like a lot.
First I’ll set up parallel processing and 5-fold cross-validation so the training doesn’t take too long.
library(parallel)
library(doParallel)
# Leave one core free so the machine stays responsive
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
# 5-fold cross-validation, with folds trained in parallel
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
Now we’ll fit the decision tree model.
library(rpart)
modFit1 <- train(classe ~ ., data = train, method = "rpart", trControl = fitControl)
We can plot the resulting tree.
library(rattle)
library(rpart.plot)
fancyRpartPlot(modFit1$finalModel)
Let’s use this model to predict on the validate set we made.
predictMod1 <- predict(modFit1, validate)
confusionMatrix(validate$classe, predictMod1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1052 3 325 291 3
## B 202 205 200 532 0
## C 26 16 691 293 0
## D 55 5 280 624 0
## E 11 3 214 358 496
##
## Overall Statistics
##
## Accuracy : 0.5213
## 95% CI : (0.5085, 0.5342)
## No Information Rate : 0.3565
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4036
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7816 0.88362 0.4041 0.2974 0.99399
## Specificity 0.8630 0.83478 0.9198 0.9102 0.89120
## Pos Pred Value 0.6284 0.17998 0.6735 0.6473 0.45841
## Neg Pred Value 0.9302 0.99431 0.7903 0.7005 0.99938
## Prevalence 0.2287 0.03942 0.2906 0.3565 0.08479
## Detection Rate 0.1788 0.03483 0.1174 0.1060 0.08428
## Detection Prevalence 0.2845 0.19354 0.1743 0.1638 0.18386
## Balanced Accuracy 0.8223 0.85920 0.6619 0.6038 0.94259
The accuracy of the model was very poor, only about 52%.
confusionMatrix(validate$classe, predictMod1)$overall[1]
## Accuracy
## 0.5213254
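Since the validation set was held out of training, one minus this accuracy is our estimate of the out-of-sample error, roughly 48% here:
# Estimated out-of-sample error for the decision tree
1 - confusionMatrix(validate$classe, predictMod1)$overall[1]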
Time to break out the big guns. The random forest algorithm may take a long time to train, but it should also give us the best accuracy for this type of problem.
modFit2 <- train(classe ~ ., data = train, method = "rf", trControl = fitControl)
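With the expensive training finished, the parallel cluster should be shut down so caret returns to sequential processing; a cleanup sketch using the cluster object created earlier:
# Stop the worker processes and re-register sequential execution
stopCluster(cluster)
registerDoSEQ()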
Again, let’s predict on the validate set using this model.
predictMod2 <- predict(modFit2, validate)
confusionMatrix(validate$classe, predictMod2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 1 0 0 1
## B 10 1126 3 0 0
## C 0 2 1021 3 0
## D 0 0 14 950 0
## E 0 0 1 6 1075
##
## Overall Statistics
##
## Accuracy : 0.993
## 95% CI : (0.9906, 0.995)
## No Information Rate : 0.2858
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9912
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9941 0.9973 0.9827 0.9906 0.9991
## Specificity 0.9995 0.9973 0.9990 0.9972 0.9985
## Pos Pred Value 0.9988 0.9886 0.9951 0.9855 0.9935
## Neg Pred Value 0.9976 0.9994 0.9963 0.9982 0.9998
## Prevalence 0.2858 0.1918 0.1766 0.1630 0.1828
## Detection Rate 0.2841 0.1913 0.1735 0.1614 0.1827
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9968 0.9973 0.9908 0.9939 0.9988
This model fit very well and resulted in an accuracy over 99%!
confusionMatrix(validate$classe, predictMod2)$overall[1]
## Accuracy
## 0.9930331
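As an optional follow-up, caret’s varImp() will rank which sensor readings the fitted forest leaned on most heavily (output omitted here):
# Variable importance for the fitted random forest
varImp(modFit2)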
Now that we have a model that tests well against our validation set, we can use it to predict the test set.
predict(modFit2, preObjTest)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E