Executive Summary

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In one study, six participants were asked to perform barbell lifts in five different ways, one correct and four incorrect. This project uses the data collected for that study to build a model that predicts whether a person is performing barbell lifts correctly.

Data Download and Transformation

Training data: pml-training.csv

Test data: pml-testing.csv

The data come from the study cited below:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Getting and Transforming the Data

Note: Libraries and code used for this project can be found in the Appendix.

Download the training and test CSV files.
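
The download step itself is not shown in the Appendix, so here is a minimal sketch of it. The URL values are placeholders (the actual file locations are not reproduced here), and the local file names match the ones read in the Appendix.

train_url <- "<training-data-url>"   ## Placeholder; substitute the actual training CSV URL
test_url  <- "<test-data-url>"       ## Placeholder; substitute the actual test CSV URL
if (!file.exists("pml-training.csv")) download.file(train_url, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(test_url, "pml-testing.csv")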

## 'data.frame':    19622 obs. of  16 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name           : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1: int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2: int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp      : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window          : chr  "no" "no" "no" "no" ...
##  $ num_window          : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ kurtosis_roll_belt  : chr  "" "" "" "" ...
##  $ kurtosis_picth_belt : chr  "" "" "" "" ...
##  $ kurtosis_yaw_belt   : chr  "" "" "" "" ...
##  $ skewness_roll_belt  : chr  "" "" "" "" ...
##  $ skewness_roll_belt.1: chr  "" "" "" "" ...
##  $ classe              : chr  "A" "A" "A" "A" ...

The training data has 19,622 observations and 160 variables (a sample of the str() output is above). The variable we are trying to predict is classe, which takes one of five values, “A” through “E” (classe “A” indicates a lift that was performed correctly). Note that several columns consist mostly of NAs and blanks. Also, columns 1-7 contain identifying and timing data that is irrelevant to classe, so they will be removed before fitting the models. In addition, only sensors with a complete set of observations will be used to predict classe, so columns containing NAs or blanks will be dropped as well. Only the training data will be transformed.
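
The column-dropping code is in the Appendix; as a compact illustration of the idea, the share of missing or blank values in each column can be computed directly. This is only a sketch: empty_share, keep, and cleaned are names introduced here, not objects used in the Appendix.

empty_share <- colMeans(is.na(HARtraindata) | HARtraindata == "") ## Fraction of NA/blank values per column
table(empty_share == 0)                 ## Columns are either fully populated or mostly empty
keep <- names(empty_share)[empty_share == 0]
cleaned <- HARtraindata[, keep]         ## Keep only fully populated columns
cleaned <- cleaned[, -(1:7)]            ## Drop the identifying and timing columns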

The test data set has 20 observations and 160 variables. This dataset does not have the classe variable. Instead, it has a variable named “problem_id” that corresponds to a question number on the Quiz.
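
A quick check confirms this difference between the two files (a sketch using the data frames loaded in the Appendix):

dim(HARtestdata)                                  ## 20 rows, 160 columns
setdiff(names(HARtraindata), names(HARtestdata))  ## "classe" appears only in the training data
setdiff(names(HARtestdata), names(HARtraindata))  ## "problem_id" appears only in the test data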

Creating and Comparing Models

I will split the training data into three partitions: training, testing, and validation. The training partition will be used to train the models, and the testing partition will be used to check their accuracy. The validation partition will be set aside and used to evaluate the final model’s performance on new data.
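
The partitioning code is in the Appendix; a quick sanity check of the resulting sets could look like the sketch below, which uses the training, testing, and validation objects created there.

sapply(list(training = training, testing = testing, validation = validation), nrow) ## Rows in each partition
round(prop.table(table(training$classe)), 3)      ## createDataPartition preserves the classe balance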

The first model I will try is a random forest using 10-fold cross-validation, with all 52 remaining variables as predictors for classe.

##   mtry  Accuracy     Kappa  AccuracySD     KappaSD
## 1    2 0.9854436 0.9815837 0.003836823 0.004855200
## 2   27 0.9830187 0.9785167 0.003702356 0.004684092
## 3   52 0.9728314 0.9656308 0.008626003 0.010912480

The model is a little more than 98% accurate, with the best cross-validated accuracy at mtry = 2 and mtry = 27 close behind. Random forests with cross-validation are not especially prone to overfitting, but I still want to reduce the number of predictors to cut complexity and processing time. The variable importance plot (code in the Appendix) shows the top 30 variables in order of importance; I will use the top 27 (echoing the mtry = 27 setting above) to create a new model.
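
The Appendix ranks the predictors with varImpPlot() and varImp() on the final randomForest object; a similar sketch using caret's interface on the train object (vi and top27 are names introduced here) would be:

vi <- varImp(fitrf)    ## Scaled variable importance from the caret fit
plot(vi, top = 27)     ## Plot the 27 highest-ranked predictors
top27 <- rownames(vi$importance)[order(vi$importance$Overall, decreasing = TRUE)][1:27]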

##   mtry  Accuracy     Kappa  AccuracySD     KappaSD
## 1    2 0.9816845 0.9768260 0.005978643 0.007568638
## 2   14 0.9842317 0.9800531 0.004780459 0.006048510
## 3   27 0.9710112 0.9633283 0.006020296 0.007610006

The model with fewer predictors performs essentially as well (a best cross-validated accuracy of 98.4% vs. 98.5%). Since simpler models are preferred, I will use the new RF model for predictions.

For comparison, I will fit a GBM classifier on the same 27 variables with 10-fold cross-validation. Let’s see how its cross-validated accuracy compares to the random forest model’s.

## Stochastic Gradient Boosting 
## 
## 8244 samples
##   27 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 7417, 7422, 7420, 7420, 7421, 7419, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7371441  0.6669407
##   1                  100      0.8015528  0.7490614
##   1                  150      0.8385470  0.7957775
##   2                   50      0.8427988  0.8010216
##   2                  100      0.8961657  0.8686049
##   2                  150      0.9175120  0.8956247
##   3                   50      0.8867007  0.8565814
##   3                  100      0.9290384  0.9101894
##   3                  150      0.9498985  0.9365966
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.

The GBM model is not as accurate as the new RF model, but it is close at about 95%. Let’s see how each performs on new data.

Predictions Using Testing Data

The testing data that was split from the original training data set will be used to compare the models’ out-of-sample accuracy.

RF Results using Test Data

##  Accuracy     Kappa 
## 0.9865283 0.9829561

GBM Results using Test Data

##  Accuracy     Kappa 
## 0.9510286 0.9380312

Again, there is very little difference between the two models, and both actually did slightly better on this new data. There is not much to be gained by further tweaking the models, so I will use the validation data set for a final comparison.
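
Overall accuracy hides where each model makes its mistakes; a per-class breakdown is available from caret's confusionMatrix. A minimal sketch using the prediction objects from the Appendix:

confusionMatrix(predrf, factor(testing$classe))$table    ## RF confusion matrix on the testing split
confusionMatrix(predgbm, factor(testing$classe))$byClass[, c("Sensitivity", "Specificity")] ## Per-class GBM rates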

RF Results using Validation Data

##  Accuracy     Kappa 
## 0.9882753 0.9851667

GBM Results using Validation Data

##  Accuracy     Kappa 
## 0.9573492 0.9460308

Once again, both models improved slightly: Random Forest improved to almost 99%, and GBM increased to almost 96%.
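
The estimated out-of-sample error is simply one minus the validation accuracy; a one-line sketch using the validation predictions from the Appendix:

1 - defaultSummary(data.frame(obs = factor(validation$classe), pred = predrfval))["Accuracy"] ## About 1.2% for the RF model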

Predicting classe of Original Test Data

Out of curiosity, I will use both models to predict the classe of the 20 test cases. The accuracy of both is extremely high, so I suspect there will be very little difference in the predictions, if any. However, only the predictions from the random forest model will be used for the Quiz. Let’s see the results.

Final Predictions

RF   GBM   Qnum
B    B        1
A    A        2
B    B        3
A    A        4
A    A        5
E    E        6
D    D        7
B    B        8
A    A        9
A    A       10
B    B       11
C    C       12
B    B       13
A    A       14
E    E       15
E    E       16
A    A       17
B    B       18
B    B       19
B    B       20
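
The agreement between the two sets of predictions can be confirmed directly (a sketch using the rftest and gbmtest objects from the Appendix):

mean(as.character(rftest) == as.character(gbmtest)) ## 1 means the models agree on all 20 cases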

Conclusion

Although the random forest model using 27 predictors has higher cross-validated and out-of-sample accuracy than the GBM model, the two agree on all 20 test case predictions, which leads me to believe they will perform comparably on new data. Both models improved slightly with every new data set, so I expect the predictions on the test cases to be about 99% accurate.

Appendix: Code Used

knitr::opts_chunk$set(echo = FALSE)
## LIBRARIES
library(caret)
library(gbm)
library(randomForest)
library(tidyverse)
## DATA
HARtraindata <- read.csv("pml-training.csv")
HARtestdata <-read.csv("pml-testing.csv")
str(HARtraindata[,c(1:10,12:16,160)])
## REMOVE UNNECESSARY COLUMNS
colnas <- unique(which(is.na(HARtraindata),arr.ind=TRUE)[,2]) ## Indexes of columns containing NAs
traindata <- subset(HARtraindata,select=-colnas) ## Remove columns with NAs
cblk <- which(colSums(traindata=="")==0) ## Indexes of columns with no blank values
traindata <- subset(traindata,select=cblk) ## Keep only columns with no blanks
traindata <- traindata[,c(8:60)] ##Remove first 7 columns
## SPLIT INTO SETS
set.seed(1234)
inBuild <- createDataPartition(y=traindata$classe,p=0.7,list=FALSE)
validation<-traindata[-inBuild,]
buildData <- traindata[inBuild,]
inTrain <- createDataPartition(y=buildData$classe, p=.6,list=FALSE)
training <- buildData[inTrain,]
testing <- buildData[-inTrain,]
## RANDOM FOREST
set.seed(1631)
fitrf <- train(classe~., data=training, method="rf", ntree=200, trControl=trainControl(method="cv", number=10))
## RF RESULTS
fitrf$results
## IMPORTANT VARIABLE PLOT
varImpPlot(fitrf$finalModel)
## CHOOSE ONLY TOP 27 VARIABLES
vimp <- varImp(fitrf$finalModel)
vimp <- arrange(vimp,desc(Overall)) 
vnames <- row.names(vimp)
vnames <- vnames[1:27] ## Put top 27 rownames into a character vector
newtraindata <- subset(training,select=vnames)
newtraindata <- cbind(newtraindata,classe=training$classe) ##add classe to new data set
## RANDOM FOREST WITH 27 PREDICTORS
newrf <- train(classe~., method="rf", data=newtraindata, ntree=200, trControl=trainControl(method="cv", number=10))
## NEW RF RESULTS
newrf$results
## GBM WITH 27 PREDICTORS
fitgbm <- train(classe~.,method="gbm", data=newtraindata, trControl=trainControl(method="cv",number=10))
## GBM RESULTS
fitgbm
## PREDICTIONS ON TEST SET
predrf <- predict(newrf,testing)
predgbm<- predict(fitgbm,testing)

## EVALUATE PREDICTIONS
predrfdf <- data.frame(obs=factor(testing$classe),pred=predrf)
defaultSummary(predrfdf)
predgbmdf <- data.frame(obs=factor(testing$classe), pred=predgbm)
defaultSummary(predgbmdf)
## PREDICT USING VALIDATION SET
predrfval <- predict(newrf,validation)
predgbmval <- predict(fitgbm,validation)
## EVALUATE VALIDATION PREDICTIONS
defaultSummary(data.frame(obs=factor(validation$classe),pred=predrfval))
defaultSummary(data.frame(obs=factor(validation$classe),pred=predgbmval))
## PREDICTING TEST CASES
rftest <- predict(newrf,HARtestdata)
gbmtest <- predict(fitgbm,HARtestdata)
finalpreds <- data.frame(RF=rftest,GBM=gbmtest)

finalpreds <- cbind(finalpreds,Qnum=HARtestdata$problem_id)
knitr::kable(finalpreds, caption="Final Predictions")