Background

Human Activity Recognition (HAR) has received increasing attention in the computing research community in recent years. Wearable HAR devices and fitness trackers such as Fitbit make it easy and relatively inexpensive to collect large amounts of data about personal activities. Potential applications for such data include elderly monitoring, energy expenditure monitoring, and weight-loss assistance.

HAR research has largely focused on discriminating between different activities rather than on assessing how well an activity is performed. Similarly, users of wearable HAR devices can easily quantify how much of a particular activity they do, but not how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell to distinguish correct from incorrect weight-lifting technique. A machine learning model that can tell correct from incorrect execution could support a variety of applications, such as avoiding injuries, providing self-guided physical therapy, and increasing the efficiency of sports training.

The dataset for this project comes from http://groupware.les.inf.puc-rio.br/har

Reference: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Setup

library(ggplot2)
library(lattice)
library(Hmisc)
library(caret)
library(rpart)
library(randomForest)
library(foreach)
library(doParallel)
library(rpart.plot)
library(corrplot)
library(cluster)
library(fpc)
set.seed(1234)

Data Preparation

The labeled data file used here contains 19,622 observations (the 14,718 training rows plus 4,904 testing rows produced by the split below) from 6 healthy adults. Each participant performed 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different ways while their movements were recorded by four sensors located on the arm, forearm, belt, and dumbbell. The five ways are: correctly (A), throwing the elbows to the front (B), lifting the dumbbell only halfway (C), lowering the dumbbell only halfway (D), and throwing the hips to the front (E).
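For plot legends and tables it can help to keep the class codes and their meanings together. The following is a minimal sketch; the names `classe_labels` and `describe_classe` are illustrative and are not part of the dataset or of the analysis that follows.

```r
# Lookup from class code to the execution style described above.
# `classe_labels` is a hypothetical helper name, not part of the dataset.
classe_labels <- c(
  A = "correct execution",
  B = "throwing the elbows to the front",
  C = "lifting the dumbbell only halfway",
  D = "lowering the dumbbell only halfway",
  E = "throwing the hips to the front"
)

# Translate a vector of class codes (character or factor) into descriptions.
describe_classe <- function(codes) unname(classe_labels[as.character(codes)])

describe_classe(c("A", "E"))
```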

  1. Import and clean data
  2. Remove missing data
  3. Divide the dataset into a training set and a testing set
train <- read.csv("pml-training.csv", 
                  header = TRUE, 
                  na.strings = c("#DIV/0!"))

test <- read.csv("pml-testing.csv", 
                  header = TRUE, 
                  na.strings = c("#DIV/0!"))

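# convert the sensor columns (8 through the second-to-last) to numeric;
# the parentheses matter because `:` binds more tightly than `-`
for (i in 8:(ncol(train) - 1)) {train[, i] <- as.numeric(train[, i])}
for (i in 8:(ncol(test) - 1)) {test[, i] <- as.numeric(test[, i])}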

# keep only the columns that contain no missing values
datCols <- colnames(train[colSums(is.na(train)) == 0])
traindata <- train[datCols]
datCols <- colnames(test[colSums(is.na(test)) == 0])
testdata <- test[datCols]

# drop the first seven columns, which are not sensor measurements
cleaned_train <- traindata[-(1:7)]
cleaned_test <- testdata[-(1:7)]

# hold out 25% of the labeled data for validation, stratified by classe
idx <- createDataPartition(y = cleaned_train$classe, p = 0.75, list = FALSE)
training <- cleaned_train[idx, ]
testing <- cleaned_train[-idx, ]
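To make the stratified split concrete, the base R sketch below samples 75% of the row indices within each class, which is the general idea behind `createDataPartition` for a factor outcome. The function name `stratified_split` and the toy labels are assumptions for illustration; caret's own implementation differs in detail.

```r
# Sample a fraction p of row indices separately within each class,
# so the class proportions are roughly preserved in both parts.
# `stratified_split` is a hypothetical helper, not a caret function.
stratified_split <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # index into idx so that a class with a single row is handled safely
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1234)
toy_y <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))
train_idx <- stratified_split(toy_y, p = 0.75)

# Class proportions in both parts match the original 50/25/25 mix
prop.table(table(toy_y[train_idx]))
prop.table(table(toy_y[-train_idx]))
```

Sampling within each class keeps rare classes from being under-represented in either part, which a plain random split does not guarantee.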

Exploratory Visualizations

plotcluster(training[, -length(names(training))], training$classe)

Modeling

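Here we fit a random forest classifier to the HAR data. A random forest is an ensemble method: it grows many decision trees, each on a bootstrap sample of the training data and with a random subset of predictors considered at each split, and classifies a new observation by majority vote across the trees. The model uses 250 trees, and 5-fold cross-validation selects the number of predictors tried at each split (mtry).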
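The ensemble idea can be illustrated on a small scale. The sketch below is not the model used in this report: it bags 25 `rpart` trees on the built-in `iris` data and combines them by majority vote. A real random forest additionally restricts each split to a random subset of predictors, which this sketch omits; the tree count and variable names are arbitrary choices for illustration.

```r
library(rpart)

set.seed(1234)
n_trees <- 25          # arbitrary, small for illustration
n <- nrow(iris)

# Fit one tree per bootstrap sample and collect its predictions for every row.
# The result is an n x n_trees character matrix of class votes.
votes <- sapply(seq_len(n_trees), function(b) {
  boot_idx <- sample(n, n, replace = TRUE)
  fit <- rpart(Species ~ ., data = iris[boot_idx, ], method = "class")
  as.character(predict(fit, iris, type = "class"))
})

# Majority vote across trees for each observation
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

# Agreement with the true labels (measured on the training data itself,
# so this is optimistic and not an out-of-sample estimate)
mean(bagged == iris$Species)
```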

controlRF <- trainControl(method = "cv", number = 5)

modelRF <- train(classe ~ ., data = training, 
                 method = "rf", trControl = controlRF, ntree = 250)

modelRF
## Random Forest 
## 
## 14718 samples
##   119 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 11774, 11775, 11775, 11773, 11775 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##     2   0.8960462  0.8676885
##    60   0.9912356  0.9889129
##   119   0.9869550  0.9834955
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 60.
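The "Summary of sample sizes" line follows from 5-fold cross-validation: each fold in turn holds out about one fifth of the 14,718 training rows, and the model is fitted on the rest. The base R sketch below reproduces that bookkeeping under the assumption of simple random fold assignment. caret also balances the folds by class, so its sizes can differ by a row or two from these.

```r
set.seed(1234)
n <- 14718   # training rows reported in the model summary
k <- 5       # number of folds

# Assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# Rows available for fitting when each fold in turn is held out
fit_sizes <- n - as.vector(table(fold_id))
fit_sizes
```

Each fitting set therefore has roughly 14,718 × 4/5 ≈ 11,774 rows, in line with the sizes printed above.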

Out-of-sample Error

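After fitting the model, we estimate its out-of-sample performance on the held-out testing set. The confusionMatrix call passes the true labels as its first argument, so in the printed table the rows labeled "Prediction" hold the actual classes and the columns labeled "Reference" hold the model's predictions. Overall accuracy and kappa do not depend on this ordering, but the per-class statistics are computed with the two roles swapped.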

predictRF <- predict(modelRF, testing)
confusionMatrix(testing$classe, predictRF)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    7  939    3    0    0
##          C    0    7  846    2    0
##          D    0    1    8  794    1
##          E    0    0    2    0  899
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9937         
##                  95% CI : (0.991, 0.9957)
##     No Information Rate : 0.2859         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.992          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9950   0.9916   0.9849   0.9975   0.9989
## Specificity            1.0000   0.9975   0.9978   0.9976   0.9995
## Pos Pred Value         1.0000   0.9895   0.9895   0.9876   0.9978
## Neg Pred Value         0.9980   0.9980   0.9968   0.9995   0.9998
## Prevalence             0.2859   0.1931   0.1752   0.1623   0.1835
## Detection Rate         0.2845   0.1915   0.1725   0.1619   0.1833
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9975   0.9945   0.9913   0.9975   0.9992
(accuracy <- postResample(predictRF, testing$classe))
##  Accuracy     Kappa 
## 0.9936786 0.9920027
(error <- 1 - as.numeric(confusionMatrix(testing$classe, predictRF)$overall[1]))
## [1] 0.00632137

The estimated accuracy of the model is about 99.4%, so the estimated out-of-sample error is about 0.6%.
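These figures can be checked by hand from the confusion matrix. The sketch below copies the counts from the printed table and recomputes accuracy, error, Cohen's kappa, and the per-class sensitivities in base R; the variable names are illustrative.

```r
# Counts copied from the confusion matrix printed above
# (rows: first argument to confusionMatrix, columns: second argument)
cm <- matrix(
  c(1395,   0,   0,   0,   0,
       7, 939,   3,   0,   0,
       0,   7, 846,   2,   0,
       0,   1,   8, 794,   1,
       0,   0,   2,   0, 899),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)                        # number of held-out observations
accuracy <- sum(diag(cm)) / n       # share of cases on the diagonal
error <- 1 - accuracy               # estimated out-of-sample error

# Cohen's kappa: agreement beyond what the row and column totals
# would produce by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

# Per-class sensitivity as reported above: diagonal over column totals
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, error = error, kappa = kappa_hat), 4)
round(sensitivity, 4)
```

Of the 4,904 held-out observations, 4,873 fall on the diagonal, which gives the accuracy of 0.9937 and the error of 0.0063 reported above.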

Visualizations

Decision Tree Visualization

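# fit a single classification tree for illustration;
# this tree is separate from the random forest fitted above
tree <- rpart(classe ~ ., data = training, method = "class")
prp(tree)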

Clustering based on predicted category

# attach the predicted class to the testing data
predicted_testing <- cbind(testing, prediction = predictRF)
# plot the predictors only, dropping the actual and predicted class columns
predictor_cols <- setdiff(names(predicted_testing), c("classe", "prediction"))
plotcluster(predicted_testing[, predictor_cols], predicted_testing$prediction)

Clustering based on actual category

plotcluster(testing[, -length(names(testing))], testing$classe)

Predict for the Test Data Set

predictions <- predict(modelRF, cleaned_test[, -length(names(cleaned_test))])
predictions
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Conclusions and Next Steps

As the analysis above shows, the model’s predictions are very accurate (99.37% on the held-out testing set). The analysis indicates that, in addition to distinguishing between different types of activities, HAR data can also be used to distinguish between correct and incorrect ways of performing a given activity. Nevertheless, this project was conducted on a relatively small dataset (19,622 labeled observations, the 14,718 training rows plus 4,904 testing rows, from 6 participants), so the immediate next step is to replicate the analysis on larger data sets. Furthermore, other types of data (e.g., diet, environmental pollution, income, census, and education) could be combined with the HAR data to study how people can use wearable HAR devices to improve their health.

The future directions of this project include:

  1. Improve healthcare monitoring and fall detection for senior citizens and patients

  2. Provide self-guided physical therapy and weight-loss training programs

  3. Increase the efficiency and precision of sports training and avoid injuries