Human Activity Recognition (HAR) has emerged as a key research area in recent years and is gaining increasing attention from the pervasive computing research community, especially for the development of context-aware systems. HAR has many potential applications, such as elderly monitoring, life-log systems for monitoring energy expenditure and supporting weight-loss programs, and digital assistants for weight-lifting exercises.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The training and test data for this project are available at the URLs used in the download code below. The data come from Velloso et al. (2013).
The goal of the project is to predict the manner in which the participants did the exercise, since each individual would also want to know whether they performed the exercise correctly. The analysis aims to predict whether the exercise was done according to the specification or, if not, what went wrong. The outcome is recorded in the classe variable.
Required libraries:
library(RCurl)
## Loading required package: bitops
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
Defining the URLs and downloading the data:
url_1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url_2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Download each file only if it is not already present locally
if (!file.exists("pml-training.csv")) {
  download.file(url_1, destfile = "pml-training.csv")
  dateDownloaded <- date()
}
if (!file.exists("pml-testing.csv")) {
  download.file(url_2, destfile = "pml-testing.csv")
  dateDownloaded <- date()
}
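# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the default changed to FALSE
train_data <- read.csv("pml-training.csv", na.strings = c("", "NA"), stringsAsFactors = TRUE)
test_data <- read.csv("pml-testing.csv", na.strings = c("", "NA"), stringsAsFactors = TRUE)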
We remove the index, user, and timestamp columns from the training data frame, since they carry no information about whether a barbell lift was performed correctly.
train_data$X <- NULL
cols_to_remove <- c("user_name", "raw_timestamp_part_1",
                    "raw_timestamp_part_2", "cvtd_timestamp")
for (col in cols_to_remove) {
  train_data[, col] <- NULL
}
Many columns in the dataset consist mostly of missing values, so imputing them is not an option. We drop every feature that contains at least one missing value from the training data.
NAs <- colSums(is.na(train_data))
train_data <- train_data[, NAs == 0]
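As a small illustration of the filtering step above, the sketch below applies the same idea to a toy data frame. The data frame `toy` and its columns are invented for this example and are not part of the dataset.

```r
# Toy data frame (hypothetical, for illustration only)
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 5),        # mostly missing
  c = c(0.1, 0.2, 0.3, 0.4)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns that have no missing values
toy_complete <- toy[, na_counts == 0, drop = FALSE]
names(toy_complete)
```

Here column `b` is dropped because it contains missing values, while `a` and `c` are kept. The `drop = FALSE` argument keeps the result a data frame even if only one column survives.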
We also remove near-zero-variance predictors. These are features that take a single unique value (zero variance), or that have few unique values relative to the number of samples together with a large ratio between the frequency of the most common value and the frequency of the second most common value.
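library(caret)
nsv <- nearZeroVar(train_data)
# Guard against an empty result: negative indexing with integer(0) would drop every column
if (length(nsv) > 0) train_data <- train_data[, -nsv]
# Align the test set by column name; the positions in nsv refer to the reduced training frame
test_data <- test_data[, intersect(names(train_data), names(test_data))]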
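To make the near-zero-variance rule concrete, here is a minimal base-R sketch of the two quantities it relies on. It is a hand-written approximation, not caret's implementation. The function name `is_near_zero_var` and the toy data are hypothetical, and the thresholds are assumed to match the documented defaults of `nearZeroVar` (`freqCut = 95/5`, `uniqueCut = 10`).

```r
# Hand-rolled sketch of the near-zero-variance rule (not caret's own code)
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  # Frequencies of the distinct values, most common first
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)   # single unique value: zero variance
  freq_ratio <- counts[1] / counts[2]     # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Toy data (hypothetical): one constant, one highly unbalanced, one varied column
toy <- data.frame(
  constant = rep(1, 100),
  rare     = c(rep("no", 98), rep("yes", 2)),
  varied   = seq_len(100)
)

flags <- vapply(toy, is_near_zero_var, logical(1))
toy_filtered <- toy[, !flags, drop = FALSE]
names(toy_filtered)
```

In this toy example `constant` is flagged because it has one unique value, `rare` is flagged because its frequency ratio is 98/2 = 49 with only 2% unique values, and `varied` is kept.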
The final set of columns used for classification (53 predictors plus the outcome classe) is as follows.
names(train_data)
## [1] "num_window" "roll_belt" "pitch_belt"
## [4] "yaw_belt" "total_accel_belt" "gyros_belt_x"
## [7] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [10] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
## [13] "magnet_belt_y" "magnet_belt_z" "roll_arm"
## [16] "pitch_arm" "yaw_arm" "total_accel_arm"
## [19] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [22] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [25] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [28] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [31] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [34] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [37] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [40] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [43] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
## [46] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
## [49] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
## [52] "magnet_forearm_y" "magnet_forearm_z" "classe"
We build a random forest classifier to predict the action class. To estimate the accuracy of the model, we repeat a random 80:20 hold-out split 10 times (repeated random sub-sampling rather than a strict 10-fold partition): in each round, 80% of the data is used to train the random forest and the remaining 20% is used for testing.
library(randomForest)
set.seed(1)
obs <- c()
preds <- c()
n <- nrow(train_data)
for (i in 1:10) {
  # Draw a fresh random 80% of the rows for training; the rest is held out
  intrain <- sample(1:n, size = floor(n * 0.8), replace = FALSE)
  train_cross <- train_data[intrain, ]
  test_cross <- train_data[-intrain, ]
  rf <- randomForest(classe ~ ., data = train_cross)
  # c() stores the integer codes of the factor levels, so classes are reported as 1-5
  obs <- c(obs, test_cross$classe)
  preds <- c(preds, predict(rf, test_cross))
}
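The loop above draws ten independent random 80:20 splits, so a given row can land in several held-out sets or in none. For comparison, the sketch below shows one way to build a strict k-fold partition in base R, in which every row is held out exactly once. The sizes `n` and `k` and the variable names are chosen for illustration only, and the model-fitting step is left as a comment.

```r
set.seed(1)
n <- 20   # hypothetical number of rows
k <- 5    # hypothetical number of folds

# Assign each row to one of k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

held_out <- integer(0)
for (fold in seq_len(k)) {
  test_idx  <- which(fold_id == fold)   # rows held out in this fold
  train_idx <- which(fold_id != fold)   # rows used for fitting
  # A model would be fitted on train_idx and evaluated on test_idx here
  held_out <- c(held_out, test_idx)
}

# Every row appears in exactly one held-out set
table(fold_id)
```

With a partition like this, the pooled predictions cover each observation once, which makes the pooled confusion matrix easier to interpret than one built from overlapping random splits.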
The confusion matrix for the predictions pooled over all 10 held-out sets is given below.
conf_mat <- confusionMatrix(table(preds, obs))
conf_mat$table
##      obs
## preds     1     2     3     4     5
##     1 11099     7     0     0     0
##     2     1  7456    10     0     0
##     3     0     3  6836    32     0
##     4     0     0     3  6470     7
##     5     2     0     0     2  7322
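The overall accuracy follows from this table as the sum of the diagonal divided by the total count. The sketch below re-enters the printed counts by hand to show the arithmetic; the names `conf_counts` and `accuracy` are chosen for this example.

```r
# Counts copied from the confusion matrix above (rows: predictions, columns: observations)
conf_counts <- matrix(
  c(11099,    7,    0,    0,    0,
        1, 7456,   10,    0,    0,
        0,    3, 6836,   32,    0,
        0,    0,    3, 6470,    7,
        2,    0,    0,    2, 7322),
  nrow = 5, byrow = TRUE,
  dimnames = list(preds = as.character(1:5), obs = as.character(1:5))
)

# Accuracy = correctly classified (diagonal) / all held-out predictions
accuracy <- sum(diag(conf_counts)) / sum(conf_counts)
round(100 * accuracy, 4)
```

The diagonal sums to 39,183 out of 39,250 predictions. The same figure is available directly from the caret object as conf_mat$overall["Accuracy"].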
The model classifies well: the overall accuracy is 99.8293% (39,183 of 39,250 held-out predictions correct), so only 67 instances are misclassified. Finally, we train the random forest on the whole training set so that the classifier can be used to predict the class of an action from a set of activity measurements.
model <- randomForest(classe ~ ., data=train_data)
Velloso, Eduardo, Andreas Bulling, Hans Gellersen, Wallace Ugulino, and Hugo Fuks. 2013. “Qualitative Activity Recognition of Weight Lifting Exercises.” In Proceedings of the 4th Augmented Human International Conference, 116-123. AH ’13. New York, NY, USA: ACM.