With the advent of accelerometer-based smartphone apps and devices like Xbox Kinect, human activity recognition is in the spotlight. It’s possible to track exercise using a wearable device connected to a smartphone, and to swing a controller to emulate tennis on the Nintendo Wii. Many of these devices track which motion is being performed, but is it possible to detect how well it is done? Using a motion dataset, we investigate whether it’s feasible to use machine learning to accurately distinguish between variations of one exercise from wearable sensor measurements. We build a random forest model and show that it is possible to predict how well an exercise is performed with an accuracy higher than 97%.
The dataset was collected from 6 individuals performing sets of 10 repetitions of a one-arm dumbell biceps curl, with various degrees of mistakes. Measurements were taken using 4 devices, worn on the forearm, arm, dumbell, and belt. Each observation is labeled as one of 5 classes, with only one (“A”) being the correct execution. More information is available at the research website. The training data were obtained from here.
Starting from the dowloaded dataset, we remove columns with missing values and unused features (like subject names). We then partition the “training” data into train and cross-validation sets. The cross-validation set will be used to check model performance and estimate out-of-sample error.
Load necessary libraries:
library(caret)
library(doParallel)
Load data, check dimensions, and find the number of columns with missing values. Missing values are coded as NA or a blank:
train_full <- read.csv("data/pml-training.csv",na.strings = c("NA", ""))
test <- read.csv("data/pml-testing.csv",na.strings = c("NA", ""))
rbind(dim(train_full),dim(test))
## [,1] [,2]
## [1,] 19622 160
## [2,] 20 160
Find the number of columns with any missing values:
c(sum(colSums(is.na(train_full))!=0), sum(colSums(is.na(test))!=0))
## [1] 100 100
Using unique, we also see that the number of missing values per column is always one of two numbers, which is indicative of a systemic issue (we won’t investigate this further here):
unique(colSums(is.na(train_full)))
## [1] 0 19216
Remove the columns with NA entries:
train_full <- train_full[, colSums(is.na(train_full))==0]
test <- test[, colSums(is.na(test))==0]
Also remove other columns not used in the analysis and check dimensions again:
train_full <- subset(train_full, select=-c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,
cvtd_timestamp,new_window,num_window))
test <- subset(test, select=-c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,
cvtd_timestamp,new_window,num_window))
rbind(dim(train_full),dim(test))
## [,1] [,2]
## [1,] 19622 53
## [2,] 20 53
The number of columns has been reduced to 53, with 52 features and one class column.
Partition the data into training and cross-validation sets:
set.seed(1234)
trainIndex = createDataPartition(train_full$classe, p = 0.70,list=FALSE)
train = train_full[trainIndex,]
xval = train_full[-trainIndex,]
The data is now ready for analysis.
In this section, we explore the data a little, and check variability and correlation.
Display some values from the data:
str(train)
## 'data.frame': 13737 obs. of 53 variables:
## $ roll_belt : num 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 8.18 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 0.03 ...
## $ gyros_belt_y : num 0 0 0 0.02 0 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 -0.02 ...
## $ accel_belt_x : int -22 -20 -22 -21 -21 -22 -22 -20 -21 -21 ...
## $ accel_belt_y : int 4 5 3 2 4 3 4 2 4 2 ...
## $ accel_belt_z : int 22 23 21 24 21 21 21 24 22 23 ...
## $ magnet_belt_x : int -7 -2 -6 -6 0 -4 -2 1 -3 -5 ...
## $ magnet_belt_y : int 608 600 604 600 603 599 603 602 609 596 ...
## $ magnet_belt_z : int -311 -305 -310 -302 -312 -311 -313 -312 -308 -317 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 21.5 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ gyros_arm_x : num 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 0.02 ...
## $ gyros_arm_y : num -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 0 ...
## $ accel_arm_x : int -290 -289 -289 -289 -289 -289 -289 -288 -288 -290 ...
## $ accel_arm_y : int 110 110 111 111 111 111 111 109 110 110 ...
## $ accel_arm_z : int -125 -126 -123 -123 -122 -125 -124 -122 -124 -123 ...
## $ magnet_arm_x : int -369 -368 -372 -374 -369 -373 -372 -369 -376 -366 ...
## $ magnet_arm_y : int 337 344 344 337 342 336 338 341 334 339 ...
## $ magnet_arm_z : int 513 513 512 506 513 509 510 518 516 509 ...
## $ roll_dumbbell : num 13.1 12.9 13.4 13.4 13.4 ...
## $ pitch_dumbbell : num -70.6 -70.3 -70.4 -70.4 -70.8 ...
## $ yaw_dumbbell : num -84.7 -85.1 -84.9 -84.9 -84.5 ...
## $ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
## $ gyros_dumbbell_x : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gyros_dumbbell_y : num -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
## $ gyros_dumbbell_z : num 0 0 -0.02 0 0 0 0 0 0 0 ...
## $ accel_dumbbell_x : int -233 -232 -232 -233 -234 -232 -234 -232 -235 -233 ...
## $ accel_dumbbell_y : int 47 46 48 48 48 47 46 47 48 47 ...
## $ accel_dumbbell_z : int -269 -270 -269 -270 -269 -270 -272 -269 -270 -269 ...
## $ magnet_dumbbell_x : int -555 -561 -552 -554 -558 -551 -555 -549 -558 -564 ...
## $ magnet_dumbbell_y : int 296 298 303 292 294 295 300 292 291 299 ...
## $ magnet_dumbbell_z : num -64 -63 -60 -68 -66 -70 -74 -65 -69 -64 ...
## $ roll_forearm : num 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 27.6 ...
## $ pitch_forearm : num -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.8 ...
## $ yaw_forearm : num -153 -152 -152 -152 -152 -152 -152 -152 -152 -152 ...
## $ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
## $ gyros_forearm_x : num 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 0.02 ...
## $ gyros_forearm_y : num 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 -0.02 ...
## $ gyros_forearm_z : num -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 -0.02 ...
## $ accel_forearm_x : int 192 196 189 189 193 195 193 193 190 193 ...
## $ accel_forearm_y : int 203 204 206 206 203 205 205 204 205 205 ...
## $ accel_forearm_z : int -216 -213 -214 -214 -215 -215 -213 -214 -215 -214 ...
## $ magnet_forearm_x : int -18 -18 -16 -17 -9 -18 -9 -16 -22 -17 ...
## $ magnet_forearm_y : num 661 658 658 655 660 659 660 653 656 657 ...
## $ magnet_forearm_z : num 473 469 469 473 478 470 474 476 473 465 ...
## $ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
All features are numeric. Many features are on different scales, so it a good idea to center and scale (subtract mean and divide by the standard deviation) before building a model on the data. This will be done at a later step.
Check for features with low variance or only one unique value:
nearZeroVar(train[,-53], saveMetrics=F)
## integer(0)
No features fit the low variance criteria. However, since there are 50+ features in the dataset, and movement is recorded along 3 axes, it is likely that some attributes are highly correlated. Create a correlation matrix and count number of features with correlation higher than 0.8:
M <- abs(cor(train[,-53]))
diag(M) <- 0 #for each variable, correlation with itself is 1. Set this to 0.
dim(which(M > 0.8, arr.ind=T))[1]
## [1] 38
There is a high number of highly-correlated features in the dataset. Principal component analysis can be used to project the features onto a lower-dimensional space.
First, find all principal components. The preProcess function in caret centers and scales the data automatically when method is set to pca:
preProc <- preProcess(train[,-53],method=("pca"),pcaComp=52)
train_PC <- predict(preProc,train[,-53])
Calculate and plot the variance explained by each component. Also plot the cutoff at which 95% and 99% of the variance are explained:
train_PC_sd <- apply(train_PC,2,sd)
var_explained <- train_PC_sd^2 / sum(train_PC_sd^2)
plot(var_explained,type="b",main="Variance explained by Principal Components",
xlab="Components (sorted by decreasing variance explained)",
ylab="Variance Explained")
abline(v = which(cumsum(var_explained)>=.95)[1],col="red")
abline(v = which(cumsum(var_explained)>=.99)[1],col="blue")
text(x=30,y=0.15,labels="95% explained",cex=0.7)
text(x=41,y=0.10,labels="99% explained",cex=0.7)
The plot shows that the top 25 components explain 95%, and the top 36 explain 99% of the variance.
Rerun the PCA preprocessing, this time keeping enough components to explain 99% of the variance, and apply the PCA transformation to the crossvalidation and test sets:
preProc <- preProcess(train[,-53],method=("pca"),thresh=.99)
train_PC <- predict(preProc,train[,-53])
xval_PC <- predict(preProc,xval[,-53])
test_PC <- predict(preProc,test[,-53])
dim(train_PC)
## [1] 13737 36
These features are used to build the model.
We choose to build a random forest model. The method="oob" option in trainControl indicates using out-of-bag estimates for cross-validation.
To allow for multicore processing, we use makeCluster() and registerDoParallel() commands. Otherwise the train() function will use one core, and training will take much longer.
# cl <- makeCluster(detectCores()/2) #use half of available cores
cl <- makeCluster(6) #use 6 available cores
registerDoParallel(cl)
modFit <- train(train$classe ~ ., method = "rf", data = train_PC,
trControl = trainControl(method = "oob"))
Find the accuracy on training and crossvalidation sets:
CM_train <- confusionMatrix(train$classe,predict(modFit,train_PC)) #train results
CM_xval <- confusionMatrix(xval$classe,predict(modFit,xval_PC)) #xval results
train_error = 1 - CM_train$overall[1]
xval_error = 1 - CM_xval$overall[1]
While the model perfectly fits the training set (in-sample error is 0), the cross-validation error is 0.017. We thus estimate that the out-of-sample error will be about 0.017. The cross-validation confusion matrix summarizes how classes were misclassified:
CM_xval$table
## Reference
## Prediction A B C D E
## A 1671 2 0 1 0
## B 23 1107 8 0 1
## C 0 10 1011 4 1
## D 2 0 28 933 1
## E 0 6 9 4 1063
It looks like class C was misclassified the most.
Using a random forest model on data preprocessed with principal component analysis, we showed that it’s possible to fairly accurately distinguish between variations of the same motion pattern.
This analysis was written as the course project for Practical Machine Learning on Coursera.