Quality Prediction of Quantified Self Movement

Motivation

Human Activity Recognition (HAR) has emerged as a key research area in recent years and is gaining increasing attention from the pervasive computing research community, especially for the development of context-aware systems. There are many potential applications for HAR, such as elderly monitoring, life log systems for monitoring energy expenditure and supporting weight-loss programs, and digital assistants for weight lifting exercises.

Project Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Data

The training data and the test data for this project are downloaded from the URLs given under Data Processing. The data for this project come from the study by Velloso et al. (2013) listed in the references.

Goal

The goal of the project is to predict the manner in which the participants did the exercise, since each individual would also want to know whether they performed the exercise correctly. The analysis aims to predict whether the exercise was done according to the specification or, if not, what went wrong, according to the following classes:

Class A: Exactly according to the specification
Class B: Throwing elbows to the front
Class C: Lifting the dumbbell only halfway
Class D: Lowering the dumbbell only halfway
Class E: Throwing the hips to the front.

Data Processing

Libraries Required :

library(RCurl)

## Loading required package: bitops

library(randomForest)

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

Downloading and reading the data :

url_1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

if(!file.exists("pml-training.csv")){
  download.file(url_1, destfile = "pml-training.csv")
  dateDownloaded <- date()
}

if(!file.exists("pml-testing.csv")){
  download.file(url_2, destfile = "pml-testing.csv")
  dateDownloaded <- date()
}

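# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the default changed.
train_data <- read.csv("pml-training.csv", na.strings=c("", "NA"), stringsAsFactors = TRUE)

test_data <- read.csv("pml-testing.csv", na.strings=c("", "NA"), stringsAsFactors = TRUE)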

We remove the index, user, and time information from the training data frame, since these columns should have no bearing on whether the barbell lifts are performed correctly.

train_data$X <- NULL
cols_to_remove <- c("user_name", "raw_timestamp_part_1",
                    "raw_timestamp_part_2", "cvtd_timestamp")
for (col in cols_to_remove) {
    train_data[, col] <- NULL
}

Many columns in the dataset consist mostly of missing values, so imputing them is not an option. We remove from the training data every feature that contains any missing value.

NAs <- apply(train_data,2,function(x) {sum(is.na(x))})
train_data <- train_data[,which(NAs == 0)]
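Before dropping columns, it can help to look at how much of each column is missing. The following is a minimal sketch on a toy data frame; the column `kurtosis_belt` and all values are hypothetical stand-ins, not taken from the real dataset.

```r
# Toy data frame standing in for train_data (hypothetical values).
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48, 1.45),
  kurtosis_belt = c(NA, NA, NA, 0.3),   # hypothetical, mostly missing column
  classe        = c("A", "A", "B", "B")
)

# Fraction of missing values in each column.
na_frac <- colMeans(is.na(toy))
print(na_frac)

# Keep only the columns with no missing values, as in the step above.
toy_complete <- toy[, na_frac == 0, drop = FALSE]
print(names(toy_complete))
```

A threshold such as `na_frac < 0.5` could be used instead of `na_frac == 0` if partially complete columns were worth keeping; the report keeps only fully observed columns.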

We also remove near-zero-variance predictors. These are features without many missing values that either have a single unique value (zero-variance predictors), or have few unique values relative to the number of samples together with a large ratio of the frequency of the most common value to the frequency of the second most common value.

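nsv <- nearZeroVar(train_data)
# Guard against an empty result: negative indexing with integer(0) drops every column.
if (length(nsv) > 0) train_data <- train_data[-nsv]
# Select the test predictors by name; positional indices no longer line up
# because columns were dropped from train_data only.
test_data <- test_data[, setdiff(names(train_data), "classe")]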
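The two diagnostics described above can be computed by hand to see why a column gets flagged. This is a minimal base R sketch, not caret's implementation; the helper `nzv_metrics` and the `flag` column are hypothetical, while the thresholds 95/5 and 10 are the documented defaults of `nearZeroVar` (`freqCut` and `uniqueCut`).

```r
# Hand-computed versions of the two near-zero-variance diagnostics.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common value's count.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values as a percentage of the number of samples.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Hypothetical indicator column: 98 "no" values and 2 "yes" values.
flag <- c(rep("no", 98), rep("yes", 2))
m <- nzv_metrics(flag)
print(m)

# Flag the column when both thresholds are crossed.
is_nzv <- m[["freq_ratio"]] > 95 / 5 && m[["pct_unique"]] < 10
print(is_nzv)
```

Here the frequency ratio is 98 / 2 and only 2 of 100 values are distinct, so the column would be treated as near-zero variance under these thresholds.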

The final set of predictors used for classification, together with the outcome classe, is as follows.

names(train_data)

##  [1] "num_window"           "roll_belt"            "pitch_belt"          
##  [4] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
##  [7] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [10] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [13] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [16] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [19] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [22] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [25] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [28] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [31] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [34] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [37] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [40] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [43] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [46] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [49] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [52] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

Modelling :

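We build a random forest classifier to predict the action class. To measure the accuracy of the model, we repeat a random 80:20 split ten times: on each repeat, 80% of the data is used for training the random forest and the remaining 20% is used for testing. Because the splits are drawn independently, the held-out sets can overlap, so this is repeated random sub-sampling rather than strict 10-fold cross-validation.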

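library(randomForest)
set.seed(1)
obs <- c()    # observed classes, pooled over all repeats
preds <- c()  # predicted classes, pooled over all repeats
for (i in 1:10) {
  # Draw a fresh random 80% of the rows for training; the rest is held out.
  n <- nrow(train_data)
  intrain <- sample(1:n, size = n * 0.8, replace = FALSE)
  train_cross <- train_data[intrain, ]
  test_cross <- train_data[-intrain, ]
  rf <- randomForest(classe ~ ., data = train_cross)
  # Here c() reduces the factors to their integer codes (1 to 5).
  obs <- c(obs, test_cross$classe)
  preds <- c(preds, predict(rf, test_cross))
}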
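For comparison, a partition-based k-fold scheme, in which every row is held out exactly once, could be set up as follows. This is a minimal sketch and not the procedure used in the loop above, which draws an independent random 80:20 split on each iteration; the row count `n` is a hypothetical stand-in for `nrow(train_data)`, and the model fitting step is left as a comment.

```r
set.seed(1)
n <- 20   # hypothetical number of rows; the report would use nrow(train_data)
k <- 10

# Assign every row to exactly one of k folds, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

held_out <- integer(0)
for (i in seq_len(k)) {
  test_idx  <- which(fold_id == i)   # rows held out in fold i
  train_idx <- which(fold_id != i)   # rows used for fitting in fold i
  # A model would be fitted on train_idx and evaluated on test_idx here.
  held_out <- c(held_out, test_idx)
}

# Each row is held out exactly once across the k folds.
print(all(sort(held_out) == seq_len(n)))
```

With k = 10, each fold trains on 90% of the rows rather than 80%, so the two schemes are not directly interchangeable.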

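The confusion matrix for the predictions pooled over the ten held-out sets is given below. Rows are predicted classes and columns are observed classes; the labels 1 to 5 are the integer codes of the factor levels A to E.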

conf_mat <- confusionMatrix(table(preds, obs))
conf_mat$table

##      obs
## preds     1     2     3     4     5
##     1 11099     7     0     0     0
##     2     1  7456    10     0     0
##     3     0     3  6836    32     0
##     4     0     0     3  6470     7
##     5     2     0     0     2  7322
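The overall accuracy can be recomputed directly from the counts in the table above. The sketch below copies those counts into a matrix; labelling the rows and columns A to E assumes that the integer codes 1 to 5 follow the alphabetical order of the factor levels, which is R's default.

```r
# Counts copied from the confusion matrix above (rows: predicted, columns: observed).
conf_counts <- matrix(
  c(11099,    7,    0,    0,    0,
        1, 7456,   10,    0,    0,
        0,    3, 6836,   32,    0,
        0,    0,    3, 6470,    7,
        2,    0,    0,    2, 7322),
  nrow = 5, byrow = TRUE,
  dimnames = list(preds = LETTERS[1:5], obs = LETTERS[1:5])
)

# Overall accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(conf_counts)) / sum(conf_counts)
print(round(100 * accuracy, 4))

# Per-class recall: correct predictions divided by observed cases in that class.
recall <- diag(conf_counts) / colSums(conf_counts)
print(round(recall, 4))
```

The diagonal holds 39,183 of the 39,250 pooled predictions, which gives an accuracy of about 99.83%.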

The proposed model classifies well: the accuracy is 99.8293%, and only 67 of the 39,250 held-out predictions are misclassified. Finally, we train the random forest on the whole dataset so that the classifier can be used to predict the class of an action, given the set of activity measurements.

model <- randomForest(classe ~ ., data=train_data)

References :

Velloso, Eduardo, Andreas Bulling, Hans Gellersen, Wallace Ugulino, and Hugo Fuks. 2013. “Qualitative Activity Recognition of Weight Lifting Exercises.” In Proceedings of the 4th Augmented Human International Conference, 116-123. AH ’13. New York, NY, USA: ACM.