1. Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity. These types of devices are part of the quantified self movement. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, collected while they performed barbell lifts correctly and incorrectly, in order to predict the manner in which each lift was performed. This is the “classe” variable in the data sets, which takes one of five values (A to E).

We would like to thank the ‘Human Activity Recognition’ group of the Informatics Department at the Pontifical Catholic University of Rio de Janeiro for making this dataset available for this study.

2. Data loading, reduction and exploration

Though the assignment explicitly provides both training and test sets, for the purpose of this study we will partition only the provided training set to generate our own test set, and the original test set will be used as a validation set.

The training/test set and the validation set are distributed as two CSV files, pml-training.csv and pml-testing.csv. We download both locally to avoid network overhead, and explore and process the training set first. Note that the classe variable must be converted into a factor, and that doing this on the original training set means the test set generated from it will have it as a factor as well.

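# Treat spreadsheet error markers, literal "NA" and empty fields as missing values.
dat <- read.csv('./pml-training.csv',
                header = TRUE,
                sep = ",",
                na.strings = c("#DIV/0!", "NA", ""),
                strip.white = TRUE,
                stringsAsFactors = FALSE)
validation <- read.csv('./pml-testing.csv',
                header = TRUE,
                sep = ",",
                na.strings = c("#DIV/0!", "NA", ""),
                strip.white = TRUE,
                stringsAsFactors = FALSE)
# The outcome must be a factor for classification.
dat$classe <- as.factor(dat$classe)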
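To make the effect of the na.strings argument concrete, here is a small sketch on an inline CSV string. The four rows and the column name kurtosis_demo are invented for illustration and are not taken from the dataset. It shows that "#DIV/0!", "NA" and empty fields all load as NA, so the column is read as numeric rather than character.

```r
# Toy CSV, invented for illustration only (not real rows from the dataset).
csv_text <- 'kurtosis_demo,classe
1.5,A
#DIV/0!,B
,A
NA,C'

demo <- read.csv(text = csv_text,
                 header = TRUE,
                 na.strings = c("#DIV/0!", "NA", ""),
                 strip.white = TRUE,
                 stringsAsFactors = FALSE)

# Convert the outcome to a factor, as done for dat$classe.
demo$classe <- as.factor(demo$classe)

str(demo)                          # kurtosis_demo is numeric, classe is a factor
sum(is.na(demo$kurtosis_demo))     # number of values treated as missing
```

Without the na.strings argument, the "#DIV/0!" entry would force the whole column to be read as text.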

As explained throughout the course, the first and foremost step in building a statistical learning model is to partition the data. We will use 70% of the original training set for training and the remaining 30% as the test set. Note that the validation set will not be modified beyond this point.

library(caret) # provides createDataPartition, trainControl, train and confusionMatrix
inTrain <- createDataPartition(y = dat$classe, p = 0.7, list = FALSE)
training <- dat[inTrain, ] # Don't forget the commas!
test <- dat[-inTrain, ] # Don't forget the commas!
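A rough base-R sketch of the kind of stratified split that createDataPartition is used for here: it samples 70% of the rows within each class so that class proportions are preserved. The class counts are obtained by adding the per-class totals of the training and test confusion matrices shown later in this report. Rounding each class up with ceiling() is an assumption about caret's behaviour, although it does reproduce the 13737/5885 split reported below. The seed is arbitrary.

```r
# Per-class row counts of the full training file, recovered from the
# confusion matrices printed later in this report (training + test totals).
class_counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)
y <- factor(rep(names(class_counts), times = class_counts))

set.seed(1)  # arbitrary seed, only so that the sketch is repeatable
p <- 0.7

# Sample 70% of the row indices within each class (rounded up: an assumption).
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = ceiling(p * length(idx)))
}), use.names = FALSE)
test_idx <- setdiff(seq_along(y), train_idx)

length(train_idx)  # 13737
length(test_idx)   # 5885

# Class proportions are nearly identical in both partitions.
round(prop.table(table(y[train_idx])), 4)
round(prop.table(table(y[test_idx])), 4)
```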

We end up with 13737 rows for the training set, 5885 for the test set, and 20 for the validation set. Upon first inspection, we find that many columns consist mostly of NA observations, so we should only consider those columns that carry enough information. We keep the variables in which at least 30% of the observations are non-NA. This and subsequent transformations are applied only to the training set, since both the test and validation sets must remain untouched.

redTraining <- training[,colMeans(!is.na(training)) >= 0.3]
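The filter above can be checked on a toy data frame. The column names and values below are invented for illustration. colMeans(!is.na(...)) gives the share of non-missing values per column, and only columns at or above the 0.3 threshold are kept.

```r
# Toy data frame, invented for illustration: 10 rows, three columns.
toy <- data.frame(
  dense   = 1:10,                       # 100% non-NA -> kept
  sparse  = c(1, 2, rep(NA, 8)),        #  20% non-NA -> dropped
  partial = c(1, 2, 3, 4, rep(NA, 6))   #  40% non-NA -> kept
)

# Share of non-missing observations in each column.
non_na_share <- colMeans(!is.na(toy))

# Keep only the columns with at least 30% non-missing values.
kept <- toy[, non_na_share >= 0.3, drop = FALSE]
names(kept)  # "dense" "partial"
```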

We also remove the X column since, being just a row id, it takes a different value on every row and would distort our prediction. We’ll also remove the user_name column since, being the subject’s name, it doesn’t add information to the model. Finally, we remove the *timestamp* and the *window* columns, since they are time-related variables and we do not want a prediction model on a time series. Together these are the first seven columns.

finalTraining <- redTraining[,-c(1:7)]

With this we have gone from 160 columns to 53: 52 predictors plus the classe outcome. We can now proceed to feature and model selection.
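Dropping columns by position works here, but it silently depends on the column order. As an alternative sketch, the same columns can be selected by name pattern. The data frame and the column names demo_timestamp and demo_window are hypothetical stand-ins, not the actual names in the dataset.

```r
# Hypothetical stand-in for the leading columns of the training data.
toy <- data.frame(X = 1:3,
                  user_name = c("a", "b", "c"),
                  demo_timestamp = c(10, 20, 30),
                  demo_window = c(1, 1, 2),
                  roll_belt = c(1.4, 1.5, 1.6),
                  classe = factor(c("A", "B", "A")))

# Flag the row id, the subject name and any time-related column by name.
drop <- grepl("^X$|^user_name$|timestamp|window", names(toy))

reduced <- toy[, !drop, drop = FALSE]
names(reduced)  # "roll_belt" "classe"
```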

3. Model building

We choose a Random Forest model because it has built-in feature selection. According to the caret package documentation, its feature selection is coupled with the parameter estimation algorithm, making it faster than searching for the features externally.

Also, even though some authors state that RFs do not require cross-validation, we will nonetheless train our RF with a single pass of 5-fold cross-validation in order to reduce any potential bias and to get a more accurate estimate of the out-of-sample error.

library(doMC) # parallel backend that provides registerDoMC
registerDoMC(cores = 3) # Register 3 cores for parallel processing.
tControl <- trainControl(method = 'cv', number = 5, allowParallel = TRUE) # train control for 5-fold cross-validation.
ptm <- proc.time() # start the timer
rfModel <- train(classe ~ ., data = finalTraining, method = 'rf', trControl = tControl, importance = TRUE) # model building
finalTime <- proc.time() - ptm # time spent training
rfModel

## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10989, 10990, 10990, 10990, 10989 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9903911  0.9878443  0.001736499  0.002197175
##   27    0.9895903  0.9868326  0.001533900  0.001939395
##   52    0.9809274  0.9758708  0.002719702  0.003442569
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
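The "Summary of sample sizes" line above follows directly from splitting 13737 rows into 5 folds: each model is fitted on four folds and evaluated on the remaining one. The sketch below assigns folds at random in base R. It is a simplification, since it does not stratify by class as caret is expected to do, and the seed is arbitrary.

```r
n <- 13737   # training rows reported above
k <- 5       # number of folds

set.seed(1)  # arbitrary seed, only so that the sketch is repeatable

# Assign every row to one of k folds of (almost) equal size, in random order.
fold <- sample(rep(seq_len(k), length.out = n))

held_out  <- tabulate(fold, nbins = k)  # rows held out in each fold
fit_sizes <- n - held_out               # rows used to fit each model

sort(fit_sizes)  # 10989 10989 10990 10990 10990
```

The five cross-validated accuracies are then averaged for each value of mtry, which is where the Accuracy and Accuracy SD columns come from.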

We can also see that the Random Forest algorithm is computationally expensive from the timing below: on an Intel quad-core i7 with 16GB RAM and an SSD, the model took about 12 minutes to train (714 seconds elapsed), even though we registered 3 cores for parallel execution. Training on a single core increases that time to 37 minutes.

finalTime

##     user   system  elapsed 
## 1114.083    9.257  714.307
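As a quick check of the arithmetic, the timing values printed above can be converted by hand. The numbers are copied from the output. Reading a CPU-to-wall-clock ratio above 1 as a sign that several cores were busy is an interpretation, not something the output states.

```r
# Values copied from the proc.time() difference printed above (seconds).
timing <- c(user = 1114.083, system = 9.257, elapsed = 714.307)

# Wall-clock training time in minutes.
elapsed_min <- unname(timing["elapsed"]) / 60
round(elapsed_min, 1)  # 11.9

# CPU seconds consumed per second of wall-clock time.
cpu_to_wall <- unname((timing["user"] + timing["system"]) / timing["elapsed"])
round(cpu_to_wall, 2)  # 1.57

# Speed-up relative to the 37-minute single-core run mentioned in the text.
round(37 / elapsed_min, 1)  # 3.1
```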

3.1 Selected variables

According to the plot, RF’s built-in feature selection has determined that the variables that contribute most to decreasing model impurity are roll_belt, yaw_belt, magnet_dumbbell_z and magnet_dumbbell_y. This suggests that it would be possible to build a random forest with only these variables at the expense of a small decrease in accuracy and a small increase in OOB error.

3.2 Accuracy and in-sample error

trainingConf <- rfModel$finalModel$confusion
trainingConf

##      A    B    C    D    E  class.error
## A 3903    2    0    0    1 0.0007680492
## B   13 2641    4    0    0 0.0063957863
## C    0   17 2376    3    0 0.0083472454
## D    0    0   34 2216    2 0.0159857904
## E    0    0    1    9 2515 0.0039603960

From the fitted model above, we can see that we’ve achieved an in-sample accuracy of 99.29% for mtry = 2 (the number of variables sampled as candidate splits at each node), with an in-sample error of 0.7%.

4. Model performance

We will now apply the model to the test set and explore its performance by looking at the out-of-sample error and the accuracy. Remember that our process of stripping variables was only applied to the training set.
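Predicting on a data set that still carries the removed columns works because formula-based models look up predictors by name and ignore the rest. The sketch below shows this with lm on the built-in mtcars data, used purely as a stand-in. That caret's predict method for a model trained through the formula interface behaves the same way is assumed here, not demonstrated.

```r
# Fit on a reduced set of columns of the built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars[, c("mpg", "wt", "hp")])

# Predict on rows that still carry every original column ...
pred_full <- predict(fit, newdata = mtcars[1:5, ])

# ... and on rows restricted to the predictors actually used.
pred_reduced <- predict(fit, newdata = mtcars[1:5, c("wt", "hp")])

# The extra columns make no difference to the predictions.
all.equal(pred_full, pred_reduced)  # TRUE
```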

4.1 Prediction accuracy and error on test set

pred <- predict(rfModel, test)
testConf <- confusionMatrix(pred, test$classe)
testConf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673   10    0    0    0
##          B    1 1125    8    0    0
##          C    0    4 1018   18    0
##          D    0    0    0  945    1
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9927               
##                  95% CI : (0.9902, 0.9947)     
##     No Information Rate : 0.2845               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9908               
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9877   0.9922   0.9803   0.9991
## Specificity            0.9976   0.9981   0.9955   0.9998   0.9998
## Pos Pred Value         0.9941   0.9921   0.9788   0.9989   0.9991
## Neg Pred Value         0.9998   0.9971   0.9983   0.9962   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1912   0.1730   0.1606   0.1837
## Detection Prevalence   0.2860   0.1927   0.1767   0.1607   0.1839
## Balanced Accuracy      0.9985   0.9929   0.9938   0.9900   0.9994
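The headline figures in this output can be recomputed by hand from the confusion matrix, which helps when reading them. The sketch below copies the matrix printed above (rows are predictions, columns are reference classes) and derives the overall accuracy, the error rate, and the per-class sensitivity and positive predictive value.

```r
classes <- c("A", "B", "C", "D", "E")

# Confusion matrix copied from the output above.
cm <- matrix(c(1673,   10,    0,    0,    0,
                  1, 1125,    8,    0,    0,
                  0,    4, 1018,   18,    0,
                  0,    0,    0,  945,    1,
                  0,    0,    0,    1, 1081),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Overall accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)      # 0.9927
round(1 - accuracy, 4)  # 0.0073 (out-of-sample error)

# Sensitivity: correct predictions over the true members of each class.
sensitivity <- diag(cm) / colSums(cm)
round(sensitivity, 4)   # 0.9994 0.9877 0.9922 0.9803 0.9991

# Positive predictive value: correct predictions over all predictions of a class.
ppv <- diag(cm) / rowSums(cm)
round(ppv, 4)           # 0.9941 0.9921 0.9788 0.9989 0.9991
```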

As we can see, we’ve achieved an accuracy of 99.27% and an out-of-sample error of 0.73% on the test set. These figures fall closely in line with the in-sample accuracy and error. The general expectation is for the out-of-sample error to be noticeably greater than the in-sample error. This closeness is curious, but not entirely unexpected, and could be due to any of the following:

The training set has more diverse cases and hence accounts for more variance.
The cross-validation, as expected, reduces the bias while marginally increasing the variance.
It is possible that there is no significant difference (in the statistical sense) between our prediction’s 0.73% error and the model’s 0.7% in-sample error.
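The last explanation can be probed with a two-sample test of proportions. The sketch below is one possible check, not part of the original analysis. It takes the misclassification counts from the off-diagonal cells of the two confusion matrices printed in this report and treats the two samples as independent, which is a simplifying assumption.

```r
# Misclassified cases in the test set: total minus the diagonal of its matrix.
test_errors <- 5885 - (1673 + 1125 + 1018 + 945 + 1081)

# Misclassified cases in the training confusion matrix, summed row by row.
train_errors <- 3 + 17 + 20 + 36 + 10

# Test whether the two error proportions differ.
res <- prop.test(x = c(test_errors, train_errors), n = c(5885, 13737))
res$p.value
```

A p-value well above the usual thresholds would be consistent with there being no statistically significant difference between the two error rates.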

4.2 Prediction on an entirely new dataset

We have called this dataset the ‘validation set’, and even though it does not include the correct classes, we will nonetheless apply our model to it.

blindPred <- predict(rfModel, validation)
blindPred

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

We will convert the predictions to the character data type and generate 20 text files, one per case, each containing the individual prediction for submission purposes.

finalBlindPred <- as.character(blindPred)

pml_write_files <- function(x) {
        # Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
        for (i in seq_along(x)) {
                filename <- paste0("problem_id_", i, ".txt")
                write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
        }
}

pml_write_files(finalBlindPred)
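To try the file-writing step without cluttering the working directory, a variant with an explicit output directory can be used. The function write_predictions and its dir argument are hypothetical, introduced only for this sketch. It writes the first three predictions shown above to a temporary folder and reads them back.

```r
# Hypothetical variant of pml_write_files with an explicit output directory.
write_predictions <- function(x, dir) {
  for (i in seq_along(x)) {
    writeLines(x[i], file.path(dir, paste0("problem_id_", i, ".txt")))
  }
  invisible(length(x))
}

# Temporary folder, so the working directory is left untouched.
out_dir <- file.path(tempdir(), "pml_demo")
dir.create(out_dir, showWarnings = FALSE)

demo_pred <- c("B", "A", "B")  # first three predictions shown above
write_predictions(demo_pred, out_dir)

# Read the files back to confirm that each holds a single prediction.
read_back <- vapply(seq_along(demo_pred), function(i) {
  readLines(file.path(out_dir, paste0("problem_id_", i, ".txt")))
}, character(1))

identical(read_back, demo_pred)  # TRUE
```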

The accuracy of these predictions is withheld by the author.

Predicting barbell lifts from ‘quantified self’ devices

J.S. Ramos

September 26, 2015