Introduction

Using data provided by http://groupware.les.inf.puc-rio.br/har [1], we will build a model to predict how well a dumbbell biceps curl is performed. The data were collected using accelerometers worn at several points on the body. Each repetition of the exercise falls into one of five categories:

Class A - exactly according to the specification
Class B - throwing the elbows to the front
Class C - lifting the dumbbell only halfway
Class D - lowering the dumbbell only halfway
Class E - throwing the hips to the front

Getting the data

library(caret)
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv","training.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv","testing.csv")

train <- read.csv("training.csv")
test <- read.csv("testing.csv")

#Convert all columns to same class in both datasets
for(i in seq_len(ncol(test))){
    if(class(train[,i]) == "factor"){
      test[,i] <- as.factor(test[,i])
    } else {
      class(test[,i]) <- class(train[,i])
    }
}

#Blanks to NAs
train[train==""] <- NA
test[test==""] <- NA

#Determine unnecessary variables
var <- nearZeroVar(train)
train <- train[,-var]
test <- test[,-var]

#Remove individual identifiers
train <- train[,-c(1:5)]
test <- test[,-c(1:5)]

#Determine variables that are mostly NA
nearNA <- sapply(train, FUN = function(x) mean(is.na(x)) > 0.9)
train <- train[,!nearNA]
test <- test[,!nearNA]
dim(train)
## [1] 19622    54
dim(test)
## [1] 20 54
table(train$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
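
As a quick sanity check (not shown in the original output), one can confirm that the cleaned training and test sets carry the same predictor columns. The only expected mismatch is the final column, assuming the test file distributed with the assignment replaces classe with a problem_id column.

#Sanity check: predictor columns should line up between the two sets
p <- ncol(train)
identical(names(train)[-p], names(test)[-p])  #expect TRUE
setdiff(names(train), names(test))            #expect "classe"
setdiff(names(test), names(train))            #expect "problem_id"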

Setup Parallel Processing


The dataset is quite large, so we set up a three-core cluster to speed up the random forest training.

library(parallel)
library(doParallel)
cluster <- makeCluster(3)
registerDoParallel(cluster)
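
As a variation (not used in this analysis), the cluster size can be derived from the machine rather than hard-coded; parallel::detectCores() reports the number of available cores.

#Sketch (not run): size the cluster from the hardware, leaving one core free.
#detectCores() can return NA on some platforms, hence max(1, ..., na.rm = TRUE).
n_cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)
#cluster <- makeCluster(n_cores); registerDoParallel(cluster)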

Build the model

library(caret)
set.seed(123)
#3-fold CV with parallel processing
ctrl <- trainControl(method="cv",
                    number=3,
                    allowParallel=TRUE)
#Random Forest model build
fit <- train(classe~.,
             data=train,
             method="rf",
             trControl=ctrl
             )

Turn off Cluster

stopCluster(cluster)
registerDoSEQ()

Assess the model

library(caret)
fit
## Random Forest 
## 
## 19622 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 13081, 13082, 13081 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9952095  0.9939402
##   27    0.9974009  0.9967123
##   53    0.9955152  0.9943270
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
fit$resample
##    Accuracy     Kappa Resample
## 1 0.9977068 0.9970994    Fold1
## 2 0.9975539 0.9969059    Fold3
## 3 0.9969419 0.9961316    Fold2
confusionMatrix.train(fit)
## Cross-Validated (3 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 28.4  0.1  0.0  0.0  0.0
##          B  0.0 19.3  0.0  0.0  0.0
##          C  0.0  0.0 17.4  0.1  0.0
##          D  0.0  0.0  0.0 16.3  0.1
##          E  0.0  0.0  0.0  0.0 18.3
##                             
##  Accuracy (average) : 0.9974
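
The cross-validated accuracy of 0.9974 implies an estimated out-of-sample error of roughly 0.26%. Since the train object keeps its resampling results, this estimate can be pulled out directly; the snippet below is a small sketch using the fit object built above.

#Estimated out-of-sample error implied by the cross-validated accuracy
1 - max(fit$results$Accuracy)    #accuracy of the selected mtry: about 1 - 0.9974 = 0.0026
1 - mean(fit$resample$Accuracy)  #averaged across the three folds, essentially the same
#Optionally, inspect which predictors the forest relies on most
varImp(fit)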

Predict Values

library(caret)
pred <- predict(fit,test[,-length(test)])
write.csv(pred,"predictions.csv")
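
As an optional extra, the predictions can be written out next to the test set's problem identifiers instead of on their own. This assumes the problem_id column from pml-testing.csv survived the cleaning steps above (it occupies the same position as classe in the training set, so it should); the file name is just illustrative.

#Sketch: keep each prediction next to its problem ID
results <- data.frame(problem_id = test$problem_id, predicted = pred)
write.csv(results, "predictions_with_ids.csv", row.names = FALSE)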

Explanation

How I built the model - I used the caret package with the random forest algorithm because it tends to perform very well on multi-class classification problems like this one and, combined with cross-validation, gives a realistic estimate of the out-of-sample error.

How I used cross-validation - I defined 3-fold cross-validation for caret to use when training the random forest. I used only 3 folds because of the size of the training data; with a more manageable dataset I would have used 10 (see the sketch below).
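
For reference, switching to 10-fold cross-validation would only require changing the number argument in trainControl; a sketch (not run here):

#Sketch (not run): the same control object with 10 folds instead of 3
ctrl10 <- trainControl(method = "cv", number = 10, allowParallel = TRUE)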

Expected out-of-sample error - Because the accuracy above comes from cross-validation rather than from resubstitution on the training data, it is far more realistic than the accuracy of a single, highly fit tree. Even so, truly unseen data will generally score a little lower than any estimate derived from the training set. So my expected accuracy is at best 99.74% (an expected out-of-sample error of roughly 1 - 0.9974 = 0.26%), which is excellent but likely optimistic.

Citations

[1] Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Advances in Artificial Intelligence - SBIA 2012, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Read more: http://groupware.les.inf.puc-rio.br/har#dataset