Human Activity Recognition - How Well Did They Do That Exercise?

Summary

With the advent of accelerometer-based smartphone apps and devices like Xbox Kinect, human activity recognition is in the spotlight. It’s possible to track exercise using a wearable device connected to a smartphone, and to swing a controller to emulate tennis on the Nintendo Wii. Many of these devices track which motion is being performed, but is it possible to detect how well it is done? Using a motion dataset, we investigate whether it’s feasible to use machine learning to accurately distinguish between variations of one exercise from wearable sensor measurements. We build a random forest model and show that it is possible to predict how well an exercise is performed with an accuracy higher than 97%.

Introduction

The dataset was collected from 6 individuals performing sets of 10 repetitions of a one-arm dumbell biceps curl, with various degrees of mistakes. Measurements were taken using 4 devices, worn on the forearm, arm, dumbell, and belt. Each observation is labeled as one of 5 classes, with only one (“A”) being the correct execution. More information is available at the research website. The training data were obtained from here.

Data Processing

Starting from the dowloaded dataset, we remove columns with missing values and unused features (like subject names). We then partition the “training” data into train and cross-validation sets. The cross-validation set will be used to check model performance and estimate out-of-sample error.

Load necessary libraries:

library(caret)
library(doParallel)

Load data, check dimensions, and find the number of columns with missing values. Missing values are coded as NA or a blank:

train_full <- read.csv("data/pml-training.csv",na.strings = c("NA", ""))
test <- read.csv("data/pml-testing.csv",na.strings = c("NA", ""))

rbind(dim(train_full),dim(test))
##       [,1] [,2]
## [1,] 19622  160
## [2,]    20  160

Find the number of columns with any missing values:

c(sum(colSums(is.na(train_full))!=0), sum(colSums(is.na(test))!=0))
## [1] 100 100

Using unique, we also see that the number of missing values per column is always one of two numbers, which is indicative of a systemic issue (we won’t investigate this further here):

unique(colSums(is.na(train_full)))
## [1]     0 19216

Remove the columns with NA entries:

train_full <- train_full[, colSums(is.na(train_full))==0]
test <- test[, colSums(is.na(test))==0]

Also remove other columns not used in the analysis and check dimensions again:

train_full <- subset(train_full, select=-c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,
                                           cvtd_timestamp,new_window,num_window))
test <- subset(test, select=-c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,
                                     cvtd_timestamp,new_window,num_window))
rbind(dim(train_full),dim(test))
##       [,1] [,2]
## [1,] 19622   53
## [2,]    20   53

The number of columns has been reduced to 53, with 52 features and one class column.

Partition the data into training and cross-validation sets:

set.seed(1234)
trainIndex = createDataPartition(train_full$classe, p = 0.70,list=FALSE)
train = train_full[trainIndex,]
xval = train_full[-trainIndex,]

The data is now ready for analysis.

Exploratory Analysis

In this section, we explore the data a little, and check variability and correlation.

Display some values from the data:

str(train)
## 'data.frame':    13737 obs. of  53 variables:
##  $ roll_belt           : num  1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 8.18 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0.02 0 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 -0.02 ...
##  $ accel_belt_x        : int  -22 -20 -22 -21 -21 -22 -22 -20 -21 -21 ...
##  $ accel_belt_y        : int  4 5 3 2 4 3 4 2 4 2 ...
##  $ accel_belt_z        : int  22 23 21 24 21 21 21 24 22 23 ...
##  $ magnet_belt_x       : int  -7 -2 -6 -6 0 -4 -2 1 -3 -5 ...
##  $ magnet_belt_y       : int  608 600 604 600 603 599 603 602 609 596 ...
##  $ magnet_belt_z       : int  -311 -305 -310 -302 -312 -311 -313 -312 -308 -317 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 21.5 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 0 ...
##  $ accel_arm_x         : int  -290 -289 -289 -289 -289 -289 -289 -288 -288 -290 ...
##  $ accel_arm_y         : int  110 110 111 111 111 111 111 109 110 110 ...
##  $ accel_arm_z         : int  -125 -126 -123 -123 -122 -125 -124 -122 -124 -123 ...
##  $ magnet_arm_x        : int  -369 -368 -372 -374 -369 -373 -372 -369 -376 -366 ...
##  $ magnet_arm_y        : int  337 344 344 337 342 336 338 341 334 339 ...
##  $ magnet_arm_z        : int  513 513 512 506 513 509 510 518 516 509 ...
##  $ roll_dumbbell       : num  13.1 12.9 13.4 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.6 -70.3 -70.4 -70.4 -70.8 ...
##  $ yaw_dumbbell        : num  -84.7 -85.1 -84.9 -84.9 -84.5 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 -0.02 0 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -233 -232 -232 -233 -234 -232 -234 -232 -235 -233 ...
##  $ accel_dumbbell_y    : int  47 46 48 48 48 47 46 47 48 47 ...
##  $ accel_dumbbell_z    : int  -269 -270 -269 -270 -269 -270 -272 -269 -270 -269 ...
##  $ magnet_dumbbell_x   : int  -555 -561 -552 -554 -558 -551 -555 -549 -558 -564 ...
##  $ magnet_dumbbell_y   : int  296 298 303 292 294 295 300 292 291 299 ...
##  $ magnet_dumbbell_z   : num  -64 -63 -60 -68 -66 -70 -74 -65 -69 -64 ...
##  $ roll_forearm        : num  28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 27.6 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -152 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.02 ...
##  $ gyros_forearm_y     : num  0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 -0.02 ...
##  $ gyros_forearm_z     : num  -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 196 189 189 193 195 193 193 190 193 ...
##  $ accel_forearm_y     : int  203 204 206 206 203 205 205 204 205 205 ...
##  $ accel_forearm_z     : int  -216 -213 -214 -214 -215 -215 -213 -214 -215 -214 ...
##  $ magnet_forearm_x    : int  -18 -18 -16 -17 -9 -18 -9 -16 -22 -17 ...
##  $ magnet_forearm_y    : num  661 658 658 655 660 659 660 653 656 657 ...
##  $ magnet_forearm_z    : num  473 469 469 473 478 470 474 476 473 465 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

All features are numeric. Many features are on different scales, so it a good idea to center and scale (subtract mean and divide by the standard deviation) before building a model on the data. This will be done at a later step.

Check for features with low variance or only one unique value:

nearZeroVar(train[,-53], saveMetrics=F)
## integer(0)

No features fit the low variance criteria. However, since there are 50+ features in the dataset, and movement is recorded along 3 axes, it is likely that some attributes are highly correlated. Create a correlation matrix and count number of features with correlation higher than 0.8:

M <- abs(cor(train[,-53]))
diag(M) <- 0 #for each variable, correlation with itself is 1. Set this to 0.
dim(which(M > 0.8, arr.ind=T))[1]
## [1] 38

There is a high number of highly-correlated features in the dataset. Principal component analysis can be used to project the features onto a lower-dimensional space.

First, find all principal components. The preProcess function in caret centers and scales the data automatically when method is set to pca:

preProc <- preProcess(train[,-53],method=("pca"),pcaComp=52)
train_PC <- predict(preProc,train[,-53])

Calculate and plot the variance explained by each component. Also plot the cutoff at which 95% and 99% of the variance are explained:

train_PC_sd <- apply(train_PC,2,sd)
var_explained <- train_PC_sd^2 / sum(train_PC_sd^2)

plot(var_explained,type="b",main="Variance explained by Principal Components",
     xlab="Components (sorted by decreasing variance explained)",
     ylab="Variance Explained")
abline(v = which(cumsum(var_explained)>=.95)[1],col="red")
abline(v = which(cumsum(var_explained)>=.99)[1],col="blue")
text(x=30,y=0.15,labels="95% explained",cex=0.7)
text(x=41,y=0.10,labels="99% explained",cex=0.7)

plot of chunk pcaPlot

The plot shows that the top 25 components explain 95%, and the top 36 explain 99% of the variance.

Rerun the PCA preprocessing, this time keeping enough components to explain 99% of the variance, and apply the PCA transformation to the crossvalidation and test sets:

preProc <- preProcess(train[,-53],method=("pca"),thresh=.99) 
train_PC <- predict(preProc,train[,-53])

xval_PC <- predict(preProc,xval[,-53])
test_PC <- predict(preProc,test[,-53])
dim(train_PC)
## [1] 13737    36

These features are used to build the model.

Model Building

We choose to build a random forest model. The method="oob" option in trainControl indicates using out-of-bag estimates for cross-validation.

To allow for multicore processing, we use makeCluster() and registerDoParallel() commands. Otherwise the train() function will use one core, and training will take much longer.

# cl <- makeCluster(detectCores()/2) #use half of available cores
cl <- makeCluster(6) #use 6 available cores
registerDoParallel(cl)

modFit <- train(train$classe ~ ., method = "rf", data = train_PC, 
                trControl = trainControl(method = "oob"))

Results

Find the accuracy on training and crossvalidation sets:

CM_train <- confusionMatrix(train$classe,predict(modFit,train_PC)) #train results
CM_xval <- confusionMatrix(xval$classe,predict(modFit,xval_PC)) #xval results
train_error = 1 - CM_train$overall[1]
xval_error = 1 - CM_xval$overall[1]

While the model perfectly fits the training set (in-sample error is 0), the cross-validation error is 0.017. We thus estimate that the out-of-sample error will be about 0.017. The cross-validation confusion matrix summarizes how classes were misclassified:

CM_xval$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    2    0    1    0
##          B   23 1107    8    0    1
##          C    0   10 1011    4    1
##          D    2    0   28  933    1
##          E    0    6    9    4 1063

It looks like class C was misclassified the most.

Conclusion

Using a random forest model on data preprocessed with principal component analysis, we showed that it’s possible to fairly accurately distinguish between variations of the same motion pattern.

Additional information

This analysis was written as the course project for Practical Machine Learning on Coursera.