Introduction

Human activity recognition research has traditionally focused on discriminating between different activities, i.e. predicting “which” activity was performed at a specific point in time (as with the Daily Living Activities dataset).
The approach we propose for the Weight Lifting Exercises dataset is to investigate “how (well)” an activity was performed by the wearer. This “how (well)” question has so far received little attention, even though it potentially provides useful information for a large variety of applications, such as sports training.
In this work (see the paper) we first define quality of execution and investigate three aspects that pertain to qualitative activity recognition: the problem of specifying correct execution, the automatic and robust detection of execution mistakes, and how to provide feedback on the quality of execution to the user. We tried out an on-body sensing approach (dataset here), but also an “ambient sensing” approach using the Microsoft Kinect (dataset still unavailable).
Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Purpose

The purpose of this exercise is to construct a machine learning model that can effectively and accurately predict which class of workout (i.e. which execution mistake, if any) was performed, using the gyroscope, accelerometer, and magnetometer readings from the belt, arm, forearm, and dumbbell sensors.

Dataset

Publication citation

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science., pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Library Information

R version

R version 3.3.3 (2017-03-06) - Another Canoe - svn 72310

Packages used

Package    Version
caret      6.0-78
doSNOW     1.0.16
dplyr      0.7.4
foreach    1.4.4
gplots     3.0.1
iterators  1.0.9
snow       0.4-2
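
The versions above can be reproduced with a short snippet; a minimal sketch, assuming the listed packages are installed:

# Print the R version and the version of each package used here
cat(R.version.string, "\n")
for (p in c("caret", "doSNOW", "dplyr", "foreach", "gplots", "iterators", "snow")) {
    cat(p, as.character(packageVersion(p)), "\n")
}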

Building the model

General and Model Selection

There are many variables consisting mostly of NA values; they carry no predictive information and will not be selected.
There are many observations, so a 50/50 split into training and evaluation datasets still leaves ample data for both.
The model is for classification with five classes, so ordinary two-class logistic regression cannot be used directly; a random forest is used instead.
There is also considerable multicollinearity, which would make any linear model difficult and time consuming to fit, and interpretability would be lost if principal components were generated and used.
Cross-validation (5-fold) was used to train the model, which was then tested against the evaluation data.
Because training is processor-intensive and slow, parallel processing is used.

Library Load

library(dplyr)   # data manipulation
library(caret)   # model training and evaluation
library(doSNOW)  # parallel backend for foreach (also loads foreach and snow)
library(gplots)  # heatmap.2 for the confusion matrix heatmap

Data Import

# Read the raw training data; keep character columns as factors
train.import <- read.csv("C:/R/Datasets/pml-training.csv", stringsAsFactors = T)

Data Clean-up

Select only the 48 raw sensor measurements plus the classe outcome; the derived summary variables are mostly NA and are dropped.

train2 <- train.import %>%
    select(roll_belt, pitch_belt, yaw_belt,
           gyros_belt_x, gyros_belt_y, gyros_belt_z,
           accel_belt_x, accel_belt_y, accel_belt_z,
           magnet_belt_x, magnet_belt_y, magnet_belt_z,
           roll_arm, pitch_arm, yaw_arm,
           gyros_arm_x, gyros_arm_y, gyros_arm_z,
           accel_arm_x, accel_arm_y, accel_arm_z,
           magnet_arm_x, magnet_arm_y, magnet_arm_z,
           roll_dumbbell, pitch_dumbbell, yaw_dumbbell,
           gyros_dumbbell_x, gyros_dumbbell_y, gyros_dumbbell_z,
           accel_dumbbell_x, accel_dumbbell_y, accel_dumbbell_z,
           magnet_dumbbell_x, magnet_dumbbell_y, magnet_dumbbell_z,
           roll_forearm, pitch_forearm, yaw_forearm,
           gyros_forearm_x, gyros_forearm_y, gyros_forearm_z,
           accel_forearm_x, accel_forearm_y, accel_forearm_z,
           magnet_forearm_x, magnet_forearm_y, magnet_forearm_z,
           classe)
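
As an aside, a similar column set could be derived programmatically instead of by hand. A sketch (not the selection actually used above), assuming the raw import in train.import; note it would also retain bookkeeping columns (row id, user, timestamps, window markers) that would still need dropping:

# Fraction of missing or empty values per column
na.frac <- sapply(train.import, function(x) mean(is.na(x) | x == ""))
# Columns with no missing values at all
complete.cols <- names(na.frac[na.frac == 0])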

Remove extreme outlier in gyros_dumbbell_z

train2 <- train2[-5373,]
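
For reference, an extreme outlier like this can be located programmatically; a sketch of one way the row index could have been found:

# Row with the largest absolute z-axis dumbbell gyroscope reading
which.max(abs(train.import$gyros_dumbbell_z))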

Create data partitions and new training and testing sets

set.seed(12345)
train.set <- createDataPartition(y = train2$classe, p = 0.5, list = FALSE)
training <- train2[train.set, ]
evaluation <- train2[-train.set, ]
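
A quick sanity check (output omitted): createDataPartition samples within each level of classe, so the class proportions should be nearly identical in both halves.

prop.table(table(training$classe))
prop.table(table(evaluation$classe))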

Investigate multicollinearity

Find correlated variables

# Absolute pairwise correlations among the 48 predictors (column 49 is classe)
my.cor.var <- abs(cor(training[,-49]))

Zero out the diagonal (each variable is perfectly correlated with itself)

diag(my.cor.var) <- 0

Identify highly correlated variables

Highly correlated variables:

which(my.cor.var > 0.85, arr.ind = TRUE)
##                  row col
## accel_belt_y       8   1
## accel_belt_z       9   1
## accel_belt_x       7   2
## magnet_belt_x     10   2
## pitch_belt         2   7
## magnet_belt_x     10   7
## roll_belt          1   8
## accel_belt_z       9   8
## roll_belt          1   9
## accel_belt_y       8   9
## pitch_belt         2  10
## accel_belt_x       7  10
## gyros_arm_y       17  16
## gyros_arm_x       16  17
## accel_dumbbell_z  33  27
## yaw_dumbbell      27  33
cor.col <- as.vector(which(my.cor.var > 0.85, arr.ind = TRUE)[,2])
cor.columns <- names(train2[,cor.col])
cor.rows <- rownames(which(my.cor.var > 0.85, arr.ind = TRUE))
cor.df <- data.frame(cor.rows, cor.columns)
# Columns selected more than once get a ".1" suffix; strip it with an
# anchored regex so other characters in the names are not touched
cor.df$cor.columns <- gsub("\\.1$", "", cor.df$cor.columns)
cor.df <- arrange(cor.df, cor.rows, cor.columns)
cor.df
##            cor.rows      cor.columns
## 1      accel_belt_x    magnet_belt_x
## 2      accel_belt_x       pitch_belt
## 3      accel_belt_y     accel_belt_z
## 4      accel_belt_y        roll_belt
## 5      accel_belt_z     accel_belt_y
## 6      accel_belt_z        roll_belt
## 7  accel_dumbbell_z     yaw_dumbbell
## 8       gyros_arm_x      gyros_arm_y
## 9       gyros_arm_y      gyros_arm_x
## 10    magnet_belt_x     accel_belt_x
## 11    magnet_belt_x       pitch_belt
## 12       pitch_belt     accel_belt_x
## 13       pitch_belt    magnet_belt_x
## 14        roll_belt     accel_belt_y
## 15        roll_belt     accel_belt_z
## 16     yaw_dumbbell accel_dumbbell_z
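
caret also offers findCorrelation() as a ready-made alternative to the manual inspection above; a sketch for comparison only, since the correlated predictors were deliberately kept for the random forest:

# Columns caret would suggest dropping at the same 0.85 cutoff
findCorrelation(cor(training[, -49]), cutoff = 0.85, names = TRUE)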

Training the model

Configure and start parallel processing cluster

# Create a two-worker SOCK cluster and register it as the foreach backend
my.cpu.cl <- makeCluster(2, type="SOCK")
registerDoSNOW(my.cpu.cl)

Set train control options and train the random forest

set.seed(12345)
# 5-fold cross-validation, printing progress for each fold
tr.control <- trainControl(method = "cv", number = 5, verboseIter = T)

# tuneLength = 5 tries five values of mtry; the best CV accuracy wins
rf.mod <- train(classe ~., data = training, method = "rf",
                trControl = tr.control,
                tuneLength = 5)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 13 on full training set

Stop and de-register the parallel processing cluster

The cluster must also be de-registered so that foreach falls back to sequential execution.

stopCluster(my.cpu.cl)
registerDoSEQ()

Validate the model’s accuracy

  1. Predict classes for the evaluation dataset using the random forest model
  2. Create a separate error data frame
  3. Compare the predicted values with the actual values in the evaluation dataset
  4. Generate a frequency table of the errors and calculate the accuracy percentage from it

Model diagnostics

View the model diagnostics, paying attention to the Accuracy and Kappa values.
Both are high: above 0.98 for every value of mtry tried, peaking at mtry = 13.

rf.mod
## Random Forest 
## 
## 9812 samples
##   48 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 7849, 7850, 7850, 7850, 7849 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9868522  0.9833668
##   13    0.9900121  0.9873646
##   25    0.9874643  0.9841409
##   36    0.9861393  0.9824646
##   48    0.9860377  0.9823359
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 13.
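
As an extra diagnostic (not part of the original output), the selected forest's own out-of-bag error estimate and the predictor importances can also be inspected:

# Out-of-bag error estimate of the final randomForest fit (output not shown)
rf.mod$finalModel

# Relative importance of each predictor (output not shown)
varImp(rf.mod)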

Predict on evaluation dataset:

rf.pred <- predict(rf.mod, newdata = evaluation)
rf.error <- data.frame(Evaluation = evaluation$classe, Predicted = rf.pred)
rf.error$test <- rf.error$Evaluation==rf.error$Predicted
table(rf.error$test)[2]/nrow(evaluation)
##      TRUE 
## 0.9916403

The model is therefore about 99.16% accurate in predicting the classes in the evaluation dataset.
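
The same figure can be computed more directly:

# Proportion of correct predictions on the evaluation set
mean(rf.pred == evaluation$classe)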

Out of Sample Error Rate

# caret's confusionMatrix takes predictions first (data), then truth (reference)
rf.cmat <- confusionMatrix(rf.pred, evaluation$classe)
rf.ooser <- (1 - rf.cmat$overall[["Accuracy"]]) * 100

The out of sample error rate is 0.84%.
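
Beyond overall accuracy, the caret confusion matrix object also carries per-class statistics; a quick peek (output omitted):

# Per-class sensitivity and specificity
rf.cmat$byClass[, c("Sensitivity", "Specificity")]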

Confusion Matrix

The following confusion matrix compares the predicted classes (rows) with the actual classes in the evaluation dataset (columns):

conf.matrix <- as.matrix(table(rf.error$Predicted, rf.error$Evaluation))
conf.matrix
##    
##        A    B    C    D    E
##   A 2784   14    0    0    0
##   B    5 1883    7    0    0
##   C    0    1 1695   28    7
##   D    0    0    9 1578    9
##   E    0    0    0    2 1787

Confusion Matrix Heatmap

# Colour ramp from light (few observations) to dark (many)
hm.cols <- colorRampPalette(c("lightyellow","maroon"))(256)
heatmap.2(conf.matrix, Rowv = FALSE, Colv = FALSE,
          dendrogram = "none",
          xlab = "Evaluation",
          ylab = "Predicted",
          main = "Confusion Matrix Heatmap",
          trace = "none",
          density.info = "histogram",
          cellnote = conf.matrix,
          notecex = 1.5,
          notecol = "black",
          col = hm.cols)