Executive Summary

As part of Course 7, Practical Machine Learning, in the Data Science Specialization on Coursera, the goal of this project is to predict the manner in which participants performed an exercise: the ‘classe’ variable in the dataset below. We will clean and explore the data to choose the variables to predict with. This report outlines how the models are built, estimates the out-of-sample error using a validation set, and explains how the best model is chosen. Lastly, we use this model to predict 20 test cases.

Background

This human activity recognition research has traditionally focused on discriminating between different activities, i.e. predicting “which” activity was performed at a specific point in time (as with the Daily Living Activities dataset). The approach proposed for the Weight Lifting Exercises dataset is to investigate “how (well)” an activity was performed by the wearer. The “how (well)” investigation has received little attention so far, even though it potentially provides useful information for a large variety of applications, such as sports training.

In this work (see the paper referenced below) we first define quality of execution and investigate three aspects that pertain to qualitative activity recognition: the problem of specifying correct execution, the automatic and robust detection of execution mistakes, and how to provide feedback on the quality of execution to the user. We tried out an on-body sensing approach (the dataset used here), but also an “ambient sensing approach” using Microsoft Kinect (dataset still unavailable).

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).

Data Loading and Cleaning

Setup and Libraries

library(caret)
library(dplyr)

Loading the Data

url_train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

train_raw <- read.csv(url_train)
test_raw <- read.csv(url_test)

dim(train_raw); dim(test_raw)
## [1] 19622   160
## [1]  20 160

Non-predictive Variables

View(train_raw)   # opens the interactive data viewer (RStudio sessions only)

Viewing the data shows that the first six columns (row index, user name, and timestamp/window fields) are identifiers rather than sensor measurements, so we can remove them.

train_data <- train_raw[,-c(1:6)]
test_data <- test_raw[, -c(1:6)]

dim(train_data); dim(test_data)
## [1] 19622   154
## [1]  20 154

Missing Values

We will need to address any NAs in the data.

table(sapply(train_data, function(x) mean(is.na(x))))
## 
##                 0 0.979308938946081 
##                87                67

This is an interesting result: 87 features have no NAs, while 67 are more than 97% NA. Let’s exclude the latter for now.
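For reference (an optional check), we can list the high-NA columns before dropping them; in this dataset they turn out to be the per-window summary statistics (columns with avg_, var_, stddev_, max_, min_, and amplitude_ prefixes):

high_NA <- names(train_data)[sapply(train_data, function(x) mean(is.na(x))) > 0.9]
head(high_NA)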

remove_NA <- sapply(train_data, function(x) mean(is.na(x))) == 0
table(remove_NA)
## remove_NA
## FALSE  TRUE 
##    67    87
train_data <- train_data[, remove_NA]
test_data <- test_data[, remove_NA]
dim(train_data); dim(test_data)
## [1] 19622    87
## [1] 20 87

Zero-Variance Variables

Let’s also remove the zero-variance features from the modeling.
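Before dropping them, we can optionally see which columns caret flags; saveMetrics = TRUE returns per-column diagnostics (here, mostly the sparse character summary columns that passed the NA filter as empty strings):

near_zero_info <- nearZeroVar(train_data, saveMetrics = TRUE)
head(rownames(near_zero_info)[near_zero_info$nzv])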

near_zero <- nearZeroVar(train_data)
train_data <- train_data[, -near_zero]
test_data <- test_data[, -near_zero]

dim(train_data); dim(test_data)
## [1] 19622    54
## [1] 20 54

Lastly, let’s convert classe to a factor variable.

train_data$classe <- as.factor(train_data$classe)

Partitioning the Data into Training and Validation Sets

Now, with the data cleaned, we can split our train_data set into a training set and a validation set to build our models against.

set.seed(4242)   # make the split reproducible

inTrain <- createDataPartition(train_data$classe, p = 3/4, list = FALSE)
train_set <- train_data[inTrain, ]
valid_set <- train_data[-inTrain, ]

dim(train_set); dim(valid_set)
## [1] 14718    54
## [1] 4904   54
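Because createDataPartition samples within each level of classe, the class proportions should be nearly identical across the two splits; a quick sanity check:

round(prop.table(table(train_set$classe)), 3)
round(prop.table(table(valid_set$classe)), 3)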

Feature Engineering

This is a large dataset with many features. To optimize for speed, let’s perform a Principal Component Analysis to see how many components are needed to capture the majority of the information, in case we need to improve model performance.

PCA Analysis

train_set_num <- train_set %>%
        select(-classe) %>%
        mutate_all(as.numeric)

# Defensive only: no NAs should remain after the cleaning steps above
train_set_num[is.na(train_set_num)] <- 0

pca <- prcomp(train_set_num)

# Each component's share of the total standard deviation; the conventional
# "proportion of variance explained" would use pca$sdev^2 instead
qplot(1:length(pca$sdev), pca$sdev / sum(pca$sdev),
      ylab = "Share of Total sdev", xlab = "Principal Component")

From this chart, it appears we reach 99% of the cumulative total somewhere between 30 and 40 components.

cumsum(pca$sdev / sum(pca$sdev))[30:40]
##  [1] 0.9808430 0.9842486 0.9874429 0.9896621 0.9913573 0.9929876 0.9944913
##  [8] 0.9958561 0.9967019 0.9974892 0.9982421

It looks like 35 components get us to 99%, so if we run into performance issues, we can use this as a constraint on the models.
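If that constraint becomes necessary, caret can apply PCA inside train() itself; a minimal sketch (not run here), assuming we cap the predictors at 35 components via preProcOptions:

# Sketch only: preProcess = "pca" projects the predictors onto principal
# components before each CV fold is fit; pcaComp fixes the count at 35
pca_control <- trainControl(method = "cv", number = 3,
                            preProcOptions = list(pcaComp = 35))
# model_pca <- train(classe ~ ., data = train_set, method = "gbm",
#                    preProcess = "pca", trControl = pca_control, verbose = FALSE)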

Model Creation

Now it’s time to build and train our models. With limited computing power on a single core, we will not utilize a Random Forest model. Instead, we will train two models, then tune the better-performing one to improve the results:

  • Recursive Partitioning Tree (rpart)
  • Gradient Boosted Machine (gbm) with Cross-Validation
  • Gradient Boosted Machine (gbm), user-tuned

We will utilize a confusion matrix to compare the results on the validation set.

Recursive Partitioning Tree (rpart)

set.seed(4242)

model_rpart <- train(classe ~ ., data = train_set, method = "rpart")
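For a quick look at what the tree learned, the fitted object can be plotted (a sketch assuming the rpart.plot package is available):

library(rpart.plot)

# model_rpart$finalModel is a plain rpart object
rpart.plot(model_rpart$finalModel)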

Gradient Boosted Machine with Cross-Validation (gbm)

set.seed(4242)

train_control <- trainControl(method = "cv", number = 3)


model_gbm <- train(classe ~ ., data = train_set, method = "gbm", 
                   trControl = train_control, verbose = FALSE)


model_gbm$bestTune
##   n.trees interaction.depth shrinkage n.minobsinnode
## 9     150                 3       0.1             10
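We can also inspect the cross-validated accuracy across the default tuning grid; train() retains the combination with the highest mean CV accuracy:

# One row per tuning combination tried under 3-fold CV
model_gbm$results[, c("n.trees", "interaction.depth", "Accuracy")]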

Gradient Boosted Machine, User-Tuned (gbm)

With more compute power and time, we could set the tuning grid to a range of values, e.g. (see the sketch after this list):

  • interaction.depth = c(3, 5, 7)
  • n.trees = c(150, 175, 200, 225)
  • shrinkage = c(0.075, 0.10, 0.125)
  • n.minobsinnode = c(7, 10, 12, 15)
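As a sketch, the full grid would be built the same way, covering 3 x 4 x 3 x 4 = 144 combinations, each refit under 3-fold cross-validation (too slow for this single-core run):

gbm_grid_full <- expand.grid(interaction.depth = c(3, 5, 7),
                             n.trees = c(150, 175, 200, 225),
                             shrinkage = c(0.075, 0.10, 0.125),
                             n.minobsinnode = c(7, 10, 12, 15))
nrow(gbm_grid_full)   # 144 candidate models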

Here, we illustrate a single user-defined combination.

set.seed(4242)

train_control <- trainControl(method = "cv", number = 3)
gbm_grid <- expand.grid(interaction.depth = 5,
                        n.trees = 175,
                        shrinkage = 0.075,
                        n.minobsinnode = 10)


model_gbm_user <- train(classe ~ ., data = train_set, method = "gbm", 
                   trControl = train_control, tuneGrid = gbm_grid, verbose = FALSE)

Model Selection

Let’s evaluate each model against the validation data.

predict_rpart <- predict(model_rpart, newdata = valid_set)
predict_gbm <- predict(model_gbm, newdata = valid_set)
predict_gbm_user <- predict(model_gbm_user, newdata = valid_set)

CM_rpart <- confusionMatrix(predict_rpart, valid_set$classe)
CM_gbm <- confusionMatrix(predict_gbm, valid_set$classe)
CM_gbm_user <- confusionMatrix(predict_gbm_user, valid_set$classe)

CM_rpart$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1273  392  401  342   96
##          B   15  326   34  148   60
##          C  105  231  420  272  197
##          D    0    0    0    0    0
##          E    2    0    0   42  548
CM_gbm$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1388    8    0    0    1
##          B    7  926   12    2    4
##          C    0   12  842    8    3
##          D    0    3    1  794    7
##          E    0    0    0    0  886
CM_gbm_user$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1393    3    0    0    0
##          B    2  942    2    0    1
##          C    0    4  853    8    2
##          D    0    0    0  796    2
##          E    0    0    0    0  896
CM_rpart$overall[1]; CM_gbm$overall[1]; CM_gbm_user$overall[1]
##  Accuracy 
## 0.5234502
##  Accuracy 
## 0.9861338
## Accuracy 
## 0.995106
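These validation accuracies translate directly into estimated out-of-sample errors (1 - accuracy):

# Estimated out-of-sample error for each model on the held-out data
sapply(list(rpart = CM_rpart, gbm = CM_gbm, gbm_user = CM_gbm_user),
       function(cm) 1 - cm$overall["Accuracy"])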

Prediction

Now it’s time to use our best model, model_gbm_user, to predict values for our test set. Given a validation accuracy of 99.5%, we expect an out-of-sample error of roughly 0.5%.

predict_final <- predict(model_gbm_user, newdata = test_data)

predict_final
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Acknowledgements

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.