Introduction

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants are used to predict the manner in which they perform unilateral dumbbell biceps curls. The 5 possible methods are:

A: exactly according to the specification B: throwing the elbows to the front C: lifting the dumbbell only halfway D: lowering the dumbbell only halfway E: throwing the hips to the front

Setup, load packages and data

The library packages required and datasets were first loaded as below:

knitr::opts_chunk$set(warning=F)
set.seed(77)
library(corrplot)
library(rattle)
library(randomForest)
library(caret)
library(rpart.plot)
train_val <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")

Examine Data

The data structure were then examined:

names(train_val)
names(test)
# str(train_val)
# str(test)

Handle Missing Data

It was found that the variables have either none missing entries or are mostly unfilled.

apply(X=train_val, MARGIN=2, FUN=function(x) {sum(is.na(x))})
##                        X                user_name     raw_timestamp_part_1 
##                        0                        0                        0 
##     raw_timestamp_part_2           cvtd_timestamp               new_window 
##                        0                        0                        0 
##               num_window                roll_belt               pitch_belt 
##                        0                        0                        0 
##                 yaw_belt         total_accel_belt       kurtosis_roll_belt 
##                        0                        0                        0 
##      kurtosis_picth_belt        kurtosis_yaw_belt       skewness_roll_belt 
##                        0                        0                        0 
##     skewness_roll_belt.1        skewness_yaw_belt            max_roll_belt 
##                        0                        0                    19216 
##           max_picth_belt             max_yaw_belt            min_roll_belt 
##                    19216                        0                    19216 
##           min_pitch_belt             min_yaw_belt      amplitude_roll_belt 
##                    19216                        0                    19216 
##     amplitude_pitch_belt       amplitude_yaw_belt     var_total_accel_belt 
##                    19216                        0                    19216 
##            avg_roll_belt         stddev_roll_belt            var_roll_belt 
##                    19216                    19216                    19216 
##           avg_pitch_belt        stddev_pitch_belt           var_pitch_belt 
##                    19216                    19216                    19216 
##             avg_yaw_belt          stddev_yaw_belt             var_yaw_belt 
##                    19216                    19216                    19216 
##             gyros_belt_x             gyros_belt_y             gyros_belt_z 
##                        0                        0                        0 
##             accel_belt_x             accel_belt_y             accel_belt_z 
##                        0                        0                        0 
##            magnet_belt_x            magnet_belt_y            magnet_belt_z 
##                        0                        0                        0 
##                 roll_arm                pitch_arm                  yaw_arm 
##                        0                        0                        0 
##          total_accel_arm            var_accel_arm             avg_roll_arm 
##                        0                    19216                    19216 
##          stddev_roll_arm             var_roll_arm            avg_pitch_arm 
##                    19216                    19216                    19216 
##         stddev_pitch_arm            var_pitch_arm              avg_yaw_arm 
##                    19216                    19216                    19216 
##           stddev_yaw_arm              var_yaw_arm              gyros_arm_x 
##                    19216                    19216                        0 
##              gyros_arm_y              gyros_arm_z              accel_arm_x 
##                        0                        0                        0 
##              accel_arm_y              accel_arm_z             magnet_arm_x 
##                        0                        0                        0 
##             magnet_arm_y             magnet_arm_z        kurtosis_roll_arm 
##                        0                        0                        0 
##       kurtosis_picth_arm         kurtosis_yaw_arm        skewness_roll_arm 
##                        0                        0                        0 
##       skewness_pitch_arm         skewness_yaw_arm             max_roll_arm 
##                        0                        0                    19216 
##            max_picth_arm              max_yaw_arm             min_roll_arm 
##                    19216                    19216                    19216 
##            min_pitch_arm              min_yaw_arm       amplitude_roll_arm 
##                    19216                    19216                    19216 
##      amplitude_pitch_arm        amplitude_yaw_arm            roll_dumbbell 
##                    19216                    19216                        0 
##           pitch_dumbbell             yaw_dumbbell   kurtosis_roll_dumbbell 
##                        0                        0                        0 
##  kurtosis_picth_dumbbell    kurtosis_yaw_dumbbell   skewness_roll_dumbbell 
##                        0                        0                        0 
##  skewness_pitch_dumbbell    skewness_yaw_dumbbell        max_roll_dumbbell 
##                        0                        0                    19216 
##       max_picth_dumbbell         max_yaw_dumbbell        min_roll_dumbbell 
##                    19216                        0                    19216 
##       min_pitch_dumbbell         min_yaw_dumbbell  amplitude_roll_dumbbell 
##                    19216                        0                    19216 
## amplitude_pitch_dumbbell   amplitude_yaw_dumbbell     total_accel_dumbbell 
##                    19216                        0                        0 
##       var_accel_dumbbell        avg_roll_dumbbell     stddev_roll_dumbbell 
##                    19216                    19216                    19216 
##        var_roll_dumbbell       avg_pitch_dumbbell    stddev_pitch_dumbbell 
##                    19216                    19216                    19216 
##       var_pitch_dumbbell         avg_yaw_dumbbell      stddev_yaw_dumbbell 
##                    19216                    19216                    19216 
##         var_yaw_dumbbell         gyros_dumbbell_x         gyros_dumbbell_y 
##                    19216                        0                        0 
##         gyros_dumbbell_z         accel_dumbbell_x         accel_dumbbell_y 
##                        0                        0                        0 
##         accel_dumbbell_z        magnet_dumbbell_x        magnet_dumbbell_y 
##                        0                        0                        0 
##        magnet_dumbbell_z             roll_forearm            pitch_forearm 
##                        0                        0                        0 
##              yaw_forearm    kurtosis_roll_forearm   kurtosis_picth_forearm 
##                        0                        0                        0 
##     kurtosis_yaw_forearm    skewness_roll_forearm   skewness_pitch_forearm 
##                        0                        0                        0 
##     skewness_yaw_forearm         max_roll_forearm        max_picth_forearm 
##                        0                    19216                    19216 
##          max_yaw_forearm         min_roll_forearm        min_pitch_forearm 
##                        0                    19216                    19216 
##          min_yaw_forearm   amplitude_roll_forearm  amplitude_pitch_forearm 
##                        0                    19216                    19216 
##    amplitude_yaw_forearm      total_accel_forearm        var_accel_forearm 
##                        0                        0                    19216 
##         avg_roll_forearm      stddev_roll_forearm         var_roll_forearm 
##                    19216                    19216                    19216 
##        avg_pitch_forearm     stddev_pitch_forearm        var_pitch_forearm 
##                    19216                    19216                    19216 
##          avg_yaw_forearm       stddev_yaw_forearm          var_yaw_forearm 
##                    19216                    19216                    19216 
##          gyros_forearm_x          gyros_forearm_y          gyros_forearm_z 
##                        0                        0                        0 
##          accel_forearm_x          accel_forearm_y          accel_forearm_z 
##                        0                        0                        0 
##         magnet_forearm_x         magnet_forearm_y         magnet_forearm_z 
##                        0                        0                        0 
##                   classe 
##                        0

The variables with mostly missing values in the test set were first dropped:

train_val <- train_val[, colSums(is.na(test)) == 0]
test <- test[, colSums(is.na(test)) == 0] 

Preprocess Data

The cleaning was followed by variables that have very few unique values relative to the number of sample:

nzv_vars <- nearZeroVar(train_val)
train_val <- train_val[, -nzv_vars]
test <- test[, -nzv_vars]

And also variables that only serve for identification purposes:

train_val <- train_val[, -(1:5)]
test <- test[, -(1:5)]

The data type of the variable to be predicted was converted into a categorical variable:

class(train_val$classe)
## [1] "character"
train_val$classe <- factor(train_val$classe)

Create Training and Validation Datasets

The cleaned train dataset was split into datasets for training (70%) and validation (30%).

train_id <- createDataPartition(y=train_val$classe, p=0.70, list=F)
train <- train_val[train_id, ]
validation <- train_val[-train_id, ]

Correlation Analysis

The correlations between all variables except response variable in the train set were then visualized:

corr_mat <- cor(train[, -ncol(train)])
corrplot(corr=corr_mat, method="color", type="lower", order="FPC", tl.cex=0.5, tl.col=rgb(0, 0, 0))

And the variables with correlations higher than 0.8 were listed as below:

findCorrelation(x=corr_mat, cutoff=0.8, names=T)
##  [1] "accel_belt_z"     "roll_belt"        "accel_belt_y"     "accel_dumbbell_z"
##  [5] "accel_belt_x"     "pitch_belt"       "accel_arm_x"      "accel_dumbbell_x"
##  [9] "magnet_arm_y"     "gyros_forearm_y"  "gyros_dumbbell_x" "gyros_dumbbell_z"
## [13] "gyros_arm_x"

Models Training

A Random Forest and a Generalized Boosted Model were chosen for the activity recognition prediction due to their robustness to correlated covariates and outliers. Besides, the feature selection could also be taken care of automatically. The model that achieves better accuracy using 5-fold cross validation will be used to predict the test set classe. The results were visualized using a confusion matrix.

Random Forest

The random forest model was trained as below:

controlRF <- trainControl(method="cv", number=5)
modelRF <- train(classe~., data=train, method="rf", trControl=controlRF)
modelRF$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.27%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3904    1    0    0    1 0.0005120328
## B    8 2648    2    0    0 0.0037622272
## C    0    6 2386    4    0 0.0041736227
## D    0    0    7 2244    1 0.0035523979
## E    0    1    0    6 2518 0.0027722772

The model was then evaluated using the validation dataset:

predictRF <- predict(object=modelRF, newdata=validation)
confmatRF <- confusionMatrix(predictRF, validation$classe)
confmatRF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    3    0    0    0
##          B    0 1133    0    0    0
##          C    0    2 1026    3    0
##          D    0    1    0  961    1
##          E    0    0    0    0 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9983          
##                  95% CI : (0.9969, 0.9992)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9979          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9947   1.0000   0.9969   0.9991
## Specificity            0.9993   1.0000   0.9990   0.9996   1.0000
## Pos Pred Value         0.9982   1.0000   0.9952   0.9979   1.0000
## Neg Pred Value         1.0000   0.9987   1.0000   0.9994   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1925   0.1743   0.1633   0.1837
## Detection Prevalence   0.2850   0.1925   0.1752   0.1636   0.1837
## Balanced Accuracy      0.9996   0.9974   0.9995   0.9982   0.9995

The estimated out-of-sample error was:

round(1 - as.numeric(confmatRF$overall['Accuracy']), 4)
## [1] 0.0017

The confusion matrix was visualized below:

plot(confmatRF$table, col=confmatRF$byClass, main=paste("RF Accuracy:", round(confmatRF$overall['Accuracy'], 4)))

Generalized Boosted Model

The GBM model was trained as below:

controlGBM <- trainControl(method="cv", number=5)
modelGBM <- train(classe~., data=train, method="gbm", trControl=controlGBM, verbose=F)
modelGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.

And similarly evaluated using the validation dataset:

predictGBM <- predict(object=modelGBM, newdata=validation)
confmatGBM <- confusionMatrix(predictGBM, validation$classe)
confmatGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670   11    0    1    0
##          B    3 1117    6    5    4
##          C    0   10 1017    7    2
##          D    1    1    3  948    6
##          E    0    0    0    3 1070
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9893          
##                  95% CI : (0.9863, 0.9918)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9865          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9807   0.9912   0.9834   0.9889
## Specificity            0.9972   0.9962   0.9961   0.9978   0.9994
## Pos Pred Value         0.9929   0.9841   0.9817   0.9885   0.9972
## Neg Pred Value         0.9990   0.9954   0.9981   0.9968   0.9975
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1898   0.1728   0.1611   0.1818
## Detection Prevalence   0.2858   0.1929   0.1760   0.1630   0.1823
## Balanced Accuracy      0.9974   0.9884   0.9937   0.9906   0.9941

The estimated out-of-sample error was:

round(1 - as.numeric(confmatGBM$overall['Accuracy']), 4)
## [1] 0.0107

The confusion matrix was visualized below:

plot(confmatGBM$table, col=confmatGBM$byClass, main=paste("GBM Accuracy:", round(confmatGBM$overall['Accuracy'], 4)))

Test Set Prediction

Since the Random Forest model yields higher accuracy, it was used to predict the classe of the test set:

predict(object=modelRF, newdata=test)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E