Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly quantify is how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The aim of this paper is to build a model that accurately predicts the class (A to E) of an exercise, i.e. whether the movement was performed correctly, and then to apply that model to the 20 provided test cases.

Loading Data

First we’ll load the training data from its CSV file, along with the 20 test cases.

#Load train and test data
train<-read.csv("pml-training.csv")
test<-read.csv("pml-testing.csv")
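
As a side note, the blank strings and “#DIV/0!” artefacts we’ll encounter below could instead be converted to NA at load time. A minimal alternative sketch (not used below, so that the cleaning steps that follow match the data as loaded above):

#Alternative: map blanks and spreadsheet error strings to NA on load
train_alt<-read.csv("pml-training.csv",na.strings=c("NA","","#DIV/0!"))
test_alt<-read.csv("pml-testing.csv",na.strings=c("NA","","#DIV/0!"))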

Cleaning Data

After loading the training and test sets, we can get a glimpse of the data:

str(train)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_belt     : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt      : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt.1    : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_arm      : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_arm        : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_arm       : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_arm      : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_arm        : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_dumbbell  : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

It can be noted that there are a lot of NA values and empty strings (""), so we’ll endeavour to clean up the data.

Firstly, we’re going to remove the first 7 variables/columns (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window and num_window), as they only contain participant information and timestamps, so they add no predictive value:

#remove first seven variables/columns since they hold names/timestamps etc.
train<-train[,-(1:7)]
test<-test[,-(1:7)]

Now we’re going to remove any variables where more than 90% of observations are NA or empty strings (""):

#remove variables/columns where more than 90% of observations are NA or empty
train<-train[,-(which(colSums(is.na(train) | train=="") > 0.9*dim(train)[1]))]
test<-test[,-(which(colSums(is.na(test) | test=="") > 0.9*dim(test)[1]))]
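
As a sanity check, both sets should now be down to 53 columns (52 predictors plus classe in the training set and problem_id in the test set):

#Sanity check the remaining dimensions after filtering
dim(train)
dim(test)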

Furthermore, we’re going to check whether any variables have zero or near-zero variance, as those will not help in building our models:

#Check (and remove) variables with zero variance (or near zero)
#nearZeroVar() is provided by the caret package, so load it first
suppressPackageStartupMessages(require(caret))
nearZeroVar(train)
## integer(0)
nearZeroVar(test)
## integer(0)

Both calls returned zero variables with zero (or near-zero) variance; as such, we’re not going to remove any further variables, and this will be the data set we work with.

One final check before moving on: after mangling the data as above, we want to make sure that the training set has the same variables as the 20 test cases. Note that we exclude the last column of each set (classe in the training set and problem_id in the test set), as we know these are the only differences:

#make sure the variables/columns are the same across testing and training sets after mangling the data
identical(names(test[,-53]),names(train[,-53]))
## [1] TRUE

It has returned TRUE, therefore we can be assured that the sets have identical variables after data cleaning.
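
An equivalent check with setdiff() would also reveal exactly where the two sets differ; a quick sketch:

#Alternative check: the only differing columns should be the label columns
setdiff(names(train),names(test)) #expected: "classe"
setdiff(names(test),names(train)) #expected: "problem_id"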

Partitioning Data

We’re going to split our training set into training data and testing data, using a 75%/25% split respectively.

#Load required libraries
suppressPackageStartupMessages(require(caret))
suppressPackageStartupMessages(require(randomForest))
suppressPackageStartupMessages(require(gbm))
suppressPackageStartupMessages(require(rattle))

#Set seed
set.seed(456)

#Split data into training and testing sets
inTrain<-createDataPartition(y=train$classe,p=0.75,list=FALSE)
Training<-train[inTrain,]
Testing<-train[-inTrain,]
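
Since createDataPartition() stratifies on classe, the class proportions should be roughly preserved in both subsets; a quick sketch to confirm:

#Check that class proportions are preserved by the stratified split
round(prop.table(table(Training$classe)),3)
round(prop.table(table(Testing$classe)),3)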

Cross Validation

We’re going to use K-fold cross-validation when building/training our models, setting K to 5; cross-validation gives a more reliable estimate of out-of-sample error and helps guard against overfitting.

trcontrol<-trainControl(method="cv", number=5)

Model Building

To build our models, we’re going to use the train function from the caret package. We’re going to train three models: a Classification Tree (method "rpart"), Gradient Boosting (method "gbm"), and a Random Forest (method "rf").

Each model will be judged on its accuracy; the model with the best accuracy will be selected and used to predict the 20 test cases provided.

Classification Tree

We’ll train a classification tree with K-fold cross-validation:

fit_CT<-train(classe ~ . , data=Training, method="rpart", trControl=trcontrol)
fancyRpartPlot(fit_CT$finalModel)

Following training, we’ll use the testing data we created to measure the error/accuracy of our model:

pred_CT<-predict(fit_CT, newdata = Testing)

Finally, we’re going to produce a confusion matrix to measure the accuracy of the Classification Tree:

confu_CT<-confusionMatrix(Testing$classe, pred_CT)
confu_CT
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1260   20  110    0    5
##          B  397  328  224    0    0
##          C  413   16  426    0    0
##          D  376  137  291    0    0
##          E  136  108  253    0  404
## 
## Overall Statistics
##                                          
##                Accuracy : 0.4931         
##                  95% CI : (0.479, 0.5072)
##     No Information Rate : 0.5265         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3369         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4880  0.53859  0.32669       NA  0.98778
## Specificity            0.9419  0.85541  0.88083   0.8361  0.88943
## Pos Pred Value         0.9032  0.34563  0.49825       NA  0.44839
## Neg Pred Value         0.6233  0.92895  0.78316       NA  0.99875
## Prevalence             0.5265  0.12418  0.26591   0.0000  0.08340
## Detection Rate         0.2569  0.06688  0.08687   0.0000  0.08238
## Detection Prevalence   0.2845  0.19352  0.17435   0.1639  0.18373
## Balanced Accuracy      0.7149  0.69700  0.60376       NA  0.93860

We can see that the accuracy stands at 49.3% (an estimated out-of-sample error of 50.7%), which is pretty low. As such, we’ll move on to our next model.
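
As a side note, the accuracy (and hence the estimated out-of-sample error) can be pulled straight from the confusionMatrix object instead of being read off the printout; a small sketch:

#Extract accuracy and estimated out-of-sample error programmatically
acc_CT<-confu_CT$overall["Accuracy"]
1-acc_CT #estimated out-of-sample error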

Gradient Boosting

In similar fashion to the above, we’re going to train the new model (with K-fold cross-validation), test it against the testing data, and create a confusion matrix to inspect its accuracy:

fit_GBM<-train(classe ~ . , data=Training, method="gbm", trControl=trcontrol, verbose=FALSE)
pred_GBM<-predict(fit_GBM,newdata = Testing)
confu_GBM<-confusionMatrix(Testing$classe,pred_GBM)
confu_GBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1379   11    4    0    1
##          B   38  881   28    1    1
##          C    0   19  825   11    0
##          D    0    2   18  776    8
##          E    2   13    9    6  871
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9649          
##                  95% CI : (0.9594, 0.9699)
##     No Information Rate : 0.2894          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9556          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9718   0.9514   0.9333   0.9773   0.9886
## Specificity            0.9954   0.9829   0.9925   0.9932   0.9925
## Pos Pred Value         0.9885   0.9283   0.9649   0.9652   0.9667
## Neg Pred Value         0.9886   0.9886   0.9854   0.9956   0.9975
## Prevalence             0.2894   0.1888   0.1803   0.1619   0.1796
## Detection Rate         0.2812   0.1796   0.1682   0.1582   0.1776
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9836   0.9672   0.9629   0.9853   0.9906

We can see that there’s a huge improvement using the Gradient Boosting method, pulling accuracy up to 96.5% (conversely, the out-of-sample error is 3.5%). Still, we want to see if we can improve the accuracy even further, which brings us to our next model.
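
For the record, the boosting parameters selected by cross-validation can be inspected directly; the exact values depend on caret’s default tuning grid:

#Inspect the tuning parameters selected by 5-fold cross-validation
#(for gbm: n.trees, interaction.depth, shrinkage and n.minobsinnode)
fit_GBM$bestTune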

Random Forest

We’re going to follow the same steps as above, this time using Random Forest:

fit_RT<-train(classe ~ . , data=Training, method="rf",trControl=trcontrol)
pred_RT<-predict(fit_RT,newdata = Testing)
confu_RT<-confusionMatrix(Testing$classe,pred_RT)
confu_RT
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B   11  933    5    0    0
##          C    0    3  850    2    0
##          D    0    0    6  798    0
##          E    0    0    3    1  897
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9937         
##                  95% CI : (0.991, 0.9957)
##     No Information Rate : 0.2867         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.992          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9922   0.9968   0.9838   0.9963   1.0000
## Specificity            1.0000   0.9960   0.9988   0.9985   0.9990
## Pos Pred Value         1.0000   0.9831   0.9942   0.9925   0.9956
## Neg Pred Value         0.9969   0.9992   0.9965   0.9993   1.0000
## Prevalence             0.2867   0.1909   0.1762   0.1633   0.1829
## Detection Rate         0.2845   0.1903   0.1733   0.1627   0.1829
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9961   0.9964   0.9913   0.9974   0.9995

It can be noted that the accuracy improved even further, to 99.4% (conversely, our estimated out-of-sample error dropped to 0.6%).
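
As an aside, caret’s varImp() shows which sensor readings the forest relies on most heavily (output omitted here):

#Inspect the most influential predictors of the random forest
varImp(fit_RT)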

Conclusion

Having tested the various models and compared their accuracy, we’re going to use the model produced by the Random Forest method to predict the 20 test cases, which will serve as the answers to the quiz questions. It should be noted, however, that the Random Forest took significantly longer to train than the Classification Tree and Gradient Boosting methods, so there is a trade-off between accuracy and training time to consider.
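
To summarise the comparison, the validation accuracies can be lined up side by side using the confusion-matrix objects built above; a small sketch:

#Summarise estimated validation accuracy per model
data.frame(Model=c("Classification Tree","GBM","Random Forest"),
           Accuracy=c(confu_CT$overall["Accuracy"],
                      confu_GBM$overall["Accuracy"],
                      confu_RT$overall["Accuracy"]))

Finally, we apply the Random Forest model to the 20 provided test cases: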

pred_final<-predict(fit_RT,newdata = test)
pred_final
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Data source

The data was sourced from this website: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har