Human activity recognition research has traditionally focused on discriminating between different activities, i.e. predicting “which” activity was performed at a specific point in time. This project investigates a dataset that attempts to measure “how (well)” an activity was performed, as captured by various sensors attached to different parts of the subject’s body. The “how (well)” question has received little attention so far, even though it potentially provides useful information for a wide variety of applications, such as sports training.
Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).
Tracking devices were mounted on the users’ glove, armband, lumbar belt, and dumbbell, each containing accelerometer, gyroscope, and magnetometer sensors.
For more detail, please refer to the original study, “Qualitative Activity Recognition of Weight Lifting Exercises”.
We begin by loading the necessary packages and the data.
library(caret); library(ISLR); library(ggplot2); library(rgl); library(rpart); library(randomForest); library(corrplot)
## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
pml_training <- read.csv("./data/pml-training.csv")
pml_testing <- read.csv("./data/pml-testing.csv")
The data has already been split into a training set and a testing set. Our aim here is to train a model on part of the training set, estimate its out-of-sample accuracy on a held-out portion, and finally apply it to the testing set to generate our predictions.
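As an aside, caret also supports k-fold cross-validation directly via trainControl. The following is a sketch only: the 5-fold choice is an illustrative assumption, and the model fit is left commented out since we instead use the simple 70/30 hold-out split defined further below.
ctrl <- trainControl(method = "cv", number = 5)
# Sketch only (not run): the `training` data frame is created later
# rfCv <- train(classe ~ ., data = training, method = "rf", trControl = ctrl)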
dim(pml_training)
## [1] 19622 160
The training set consists of 19622 observations of 160 variables. A total of 6 subjects participated in the study, each performing the exercise in 5 different fashions, measured by four wearable devices.
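As a quick sanity check on this structure, a cross-tabulation of subjects against classes shows how the observations are distributed (output omitted here):
# Cross-tabulate participants against the five execution classes
table(pml_training$user_name, pml_training$classe)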
with(pml_training, plot3d(accel_arm_x, accel_arm_y, accel_arm_z, col = as.numeric(classe),
     size = 1, xlab = "X-axis", ylab = "Y-axis", zlab = "Z-axis"))
with(pml_training, plot3d(accel_forearm_x, accel_forearm_y, accel_forearm_z, col = as.numeric(classe),
     size = 1, xlab = "X-axis", ylab = "Y-axis", zlab = "Z-axis"))
with(pml_training, plot3d(accel_dumbbell_x, accel_dumbbell_y, accel_dumbbell_z, col = as.numeric(classe),
     size = 1, xlab = "X-axis", ylab = "Y-axis", zlab = "Z-axis"))
with(pml_training, plot3d(accel_belt_x, accel_belt_y, accel_belt_z, col = as.numeric(classe),
     size = 1.5, xlab = "X-axis", ylab = "Y-axis", zlab = "Z-axis"))
Plotting the acceleration from the sensors in three dimensions makes it easier to visualize the captured data.
Here the colors follow R’s default palette applied to the classes: black = Class A, red = Class B, green = Class C, blue = Class D, and light blue = Class E.
That is, all colors other than black represent some kind of error in execution.
The figures were generated locally and included in the output file.
Looking at the acceleration in three dimensions in Fig 1, it is quite easy to see that black, the correct execution of the movement, has significantly lower variance than the other colors (representing incorrect motions). For the most part, the black points are clustered tightly together while the other colors spread out widely. Notably, in the belt acceleration plot it is clear that light blue, representing “throwing the hips to the front”, shows significant comparative spread, as would be expected.
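For readers without an OpenGL-capable setup for rgl, a flat 2D projection of the same belt data can be sketched with ggplot2, which is already loaded; this is an illustrative alternative, not the figure referenced above:
# Sketch: 2D projection of belt acceleration, colored by class,
# as a static alternative to the interactive rgl plots
ggplot(pml_training, aes(x = accel_belt_x, y = accel_belt_y, colour = classe)) +
    geom_point(size = 1) +
    labs(x = "X-axis", y = "Y-axis")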
We start by removing features with near-zero variance.
nsv <- nearZeroVar(pml_training, saveMetrics = T)
reduced.pml_training <- pml_training[, names(pml_training)[!nsv$nzv]]
dim(reduced.pml_training)
## [1] 19622 100
The new dataset has 100 columns. Of these, we remove those with a high rate (> 50%) of NA values.
highNas <- colSums(is.na(reduced.pml_training)) > nrow(reduced.pml_training)/2
lowNAs <- names(reduced.pml_training)[!highNas]
lowNAs
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "num_window"
## [7] "roll_belt" "pitch_belt" "yaw_belt"
## [10] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [13] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [16] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [19] "magnet_belt_z" "roll_arm" "pitch_arm"
## [22] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [25] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [28] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [31] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [34] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [37] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [40] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [43] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [46] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [49] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [52] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [55] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [58] "magnet_forearm_z" "classe"
From this list we disregard the first 6 columns, as they are identifiers and timestamps rather than sensor measurements, and end up with a final vector of features that can be applied to the training set.
features <- lowNAs[7:59]
reduced.training <- pml_training[, features]
We split the data 70/30 between training and testing.
inTrain <- createDataPartition(y = reduced.training$classe, p = 0.70, list = F)
training <- reduced.training[inTrain,]
testing <- reduced.training[-inTrain,]
Before modeling, we examine the correlations between the remaining 52 predictors.
corrMat <- cor(training[, -53])
corrplot(corrMat, method = "square", order = "FPC", type = "lower", tl.cex = 0.6, tl.col = rgb(0, 0, 0), outline = F)
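The plot reveals a handful of strongly correlated predictor pairs. If one wanted to prune these before modeling, caret's findCorrelation flags candidates for removal; the sketch below uses an arbitrary 0.90 cutoff, and we do not actually prune here, since random forests cope well with correlated features:
# Sketch: flag predictors with pairwise correlation above 0.90
# (not removed here; random forests tolerate correlated features)
highCorr <- findCorrelation(corrMat, cutoff = 0.90)
names(training)[highCorr]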
We now fit a random forest model to the training split and evaluate it on the held-out testing split.
set.seed(123)
rfModel <- randomForest(classe ~ ., data = training, proximity = T)
predictionRF <- predict(rfModel, newdata = testing)
confusionMatrix(predictionRF, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 3 0 0 0
## B 1 1135 0 0 0
## C 0 1 1026 8 1
## D 0 0 0 956 1
## E 1 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.997
## 95% CI : (0.996, 0.998)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.997
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.999 0.996 1.000 0.992 0.998
## Specificity 0.999 1.000 0.998 1.000 1.000
## Pos Pred Value 0.998 0.999 0.990 0.999 0.999
## Neg Pred Value 1.000 0.999 1.000 0.998 1.000
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.193 0.174 0.162 0.184
## Detection Prevalence 0.285 0.193 0.176 0.163 0.184
## Balanced Accuracy 0.999 0.998 0.999 0.996 0.999
An out-of-sample accuracy of 99.7% is promising indeed. Given that it was fairly easy to see a difference between correctly executed movements and incorrect ones, we can be fairly confident that this model will do well on the test set.
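The implied out-of-sample error estimate is simply one minus this accuracy. It can be computed directly from the held-out predictions, and printing the fitted model additionally reports the random forest's internal out-of-bag (OOB) error estimate as a cross-check:
# Estimated out-of-sample error rate on the 30% held-out split
mean(predictionRF != testing$classe)
# Printing the model reports the internal OOB error estimate
rfModel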
Now let’s take a look at the variable importance as measured by our Random Forest model.
varImpPlot(rfModel, pch = 20, cex = 0.8, main = "Variable Importance" )
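The same ranking can be extracted numerically; importance() returns the mean decrease in Gini impurity for each predictor:
# Top 10 predictors by mean decrease in Gini impurity
imp <- importance(rfModel)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)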
We are now finally ready to apply our predictive model to the test data.
predict(rfModel, newdata = pml_testing)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These results have been verified as correct.
These initial results show strong promise for the possibility of giving accurate feedback on exercise form based on data from wearable sensors. An interesting follow-up study would be to randomly select participants who do not know the purpose of the study, have exercise professionals score their form, and use this data to build a more sophisticated model with the potential to detect errors in form under normal exercise conditions.