Summary

This report demonstrates statistical methods in R for predicting an outcome using two machine learning algorithms: a decision tree (rpart) and a random forest (rf).

I will use the data from the paper cited below, which captured correct and incorrect executions of the Unilateral Dumbbell Biceps Curl, and fit models to that data in order to predict the manner in which members of the test set did the exercise (the classe variable).

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Data Acquisition

The first step is, of course, obtaining the data.

# Download the training and testing data if not already present
filename <- "pml_training.csv"
if(!file.exists(filename)){
        fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
        download.file(fileURL, filename)
}
filename2 <- "pml_testing.csv"
if(!file.exists(filename2)){
        fileURL2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
        download.file(fileURL2, filename2)
}
training <- read.csv(filename)
testing <- read.csv(filename2)

Data Cleaning

A quick view of the training data shows quite a few columns that consist almost entirely of NA values and blank cells, which would throw off our algorithms. The first 7 columns (row index, user name, timestamps, and window markers) also aren’t predictive. I will remove the first 7 columns along with every column that is more than 90% NA or blank, and apply the same transformation to the test data so the two sets stay consistent.
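
As a quick sanity check (my addition, not part of the original analysis), here is one way to take that look and count the offending columns:

dim(training)
# How many columns are more than 90% NA or blank?
sum(colSums(is.na(training) | training == "") > 0.9 * nrow(training))

With that confirmed, here is the cleanup itself: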

# Drop the first 7 bookkeeping columns
preObjTrain <- training[,-c(1:7)]
# Flag columns that are more than 90% NA or blank
colToRemove <- which(colSums(is.na(preObjTrain)|preObjTrain=="")>0.9*dim(preObjTrain)[1])
preObjTrain <- preObjTrain[,-colToRemove]
# Apply the identical column removal to the test set
preObjTest <- testing[,-c(1:7)]
preObjTest <- preObjTest[,-colToRemove]
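
One caveat worth flagging: as of R 4.0, read.csv() no longer converts strings to factors by default, so depending on your R version the classe column may come in as character. A minimal precaution before modeling:

# Ensure the outcome is a factor (expected by caret's classification metrics)
preObjTrain$classe <- factor(preObjTrain$classe)
# Confirm both sets ended up with the same reduced set of columns
dim(preObjTrain)
dim(preObjTest)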

Data Splitting

Even though we already have a testing set and a training set, in order to estimate out-of-sample error I’ll split the training set into a 70% training partition and a 30% validation partition.

library(caret)
inTrain <- createDataPartition(preObjTrain$classe, p=.7, list=FALSE)
train <- preObjTrain[inTrain,]
validate <- preObjTrain[-inTrain,]
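
If you want to confirm that the stratified split preserved the class balance (an optional check, not part of the original analysis):

# Class proportions should be nearly identical in both partitions
prop.table(table(train$classe))
prop.table(table(validate$classe))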

Model Fitting

I’m going to use two methods to fit models to our training data. The first is a decision tree using the “rpart” method. Decision trees work reasonably well when the classes can be separated by simple splits on the predictors. They also let you make pretty tree plots, which I like a lot.

First I’ll set up parallel processing and 5-fold cross-validation so the algorithms don’t take too long to train.

library(parallel)
library(doParallel)
# Leave one core free for the operating system
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
# 5-fold cross-validation, run in parallel
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

Now we’ll fit the decision tree model.

library(rpart)
modFit1 <- train(classe ~ ., data = train, method = "rpart", trControl = fitControl)

We can see the pretty tree plot.

library(rattle)
library(rpart.plot)
fancyRpartPlot(modFit1$finalModel)

Let’s use this model to predict the classe values in the validation set we made.

predictMod1 <- predict(modFit1, validate)
confusionMatrix(validate$classe, predictMod1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1052    3  325  291    3
##          B  202  205  200  532    0
##          C   26   16  691  293    0
##          D   55    5  280  624    0
##          E   11    3  214  358  496
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5213          
##                  95% CI : (0.5085, 0.5342)
##     No Information Rate : 0.3565          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4036          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7816  0.88362   0.4041   0.2974  0.99399
## Specificity            0.8630  0.83478   0.9198   0.9102  0.89120
## Pos Pred Value         0.6284  0.17998   0.6735   0.6473  0.45841
## Neg Pred Value         0.9302  0.99431   0.7903   0.7005  0.99938
## Prevalence             0.2287  0.03942   0.2906   0.3565  0.08479
## Detection Rate         0.1788  0.03483   0.1174   0.1060  0.08428
## Detection Prevalence   0.2845  0.19354   0.1743   0.1638  0.18386
## Balanced Accuracy      0.8223  0.85920   0.6619   0.6038  0.94259

The accuracy of this model was very poor, only about 52%.

confusionMatrix(validate$classe, predictMod1)$overall[1]
##  Accuracy 
## 0.5213254
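
Since the stated goal was to estimate out-of-sample error, it’s worth making that estimate explicit; this one-liner (my addition) derives it from the validation accuracy:

# Estimated out-of-sample error rate for the decision tree (roughly 48%)
1 - as.numeric(confusionMatrix(validate$classe, predictMod1)$overall[1])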

Time to break out the big guns. The random forest algorithm will take much longer to train, but it should also give us the best accuracy for this type of problem.

modFit2 <- train(classe ~ ., data = train, method = "rf", trControl = fitControl)
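
With both models now trained, we can shut down the parallel cluster. This cleanup step wasn’t in the original code, but it’s standard practice with doParallel; registerDoSEQ() comes from the foreach package that doParallel loads.

# Release the worker processes and return caret to sequential processing
stopCluster(cluster)
registerDoSEQ()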

Let’s use this model to predict the validation set as well.

predictMod2 <- predict(modFit2, validate)
confusionMatrix(validate$classe, predictMod2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    1    0    0    1
##          B   10 1126    3    0    0
##          C    0    2 1021    3    0
##          D    0    0   14  950    0
##          E    0    0    1    6 1075
## 
## Overall Statistics
##                                          
##                Accuracy : 0.993          
##                  95% CI : (0.9906, 0.995)
##     No Information Rate : 0.2858         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9912         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9941   0.9973   0.9827   0.9906   0.9991
## Specificity            0.9995   0.9973   0.9990   0.9972   0.9985
## Pos Pred Value         0.9988   0.9886   0.9951   0.9855   0.9935
## Neg Pred Value         0.9976   0.9994   0.9963   0.9982   0.9998
## Prevalence             0.2858   0.1918   0.1766   0.1630   0.1828
## Detection Rate         0.2841   0.1913   0.1735   0.1614   0.1827
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9968   0.9973   0.9908   0.9939   0.9988

This model fit very well and resulted in an accuracy over 99%!

confusionMatrix(validate$classe, predictMod2)$overall[1]
##  Accuracy 
## 0.9930331
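
If you’re curious which sensor readings drive the prediction, caret’s varImp() will rank the predictors; a quick optional look:

# Rank predictors by random forest variable importance
varImp(modFit2)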

Test Set Prediction

Now that we have a model that performs well against our validation set, we can use it to predict the test set.

predict(modFit2, preObjTest)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E