Human Activity Recognition (HAR) has emerged as a key research area in recent years and is receiving increasing attention from the pervasive computing research community, especially for the development of context-aware systems. HAR has many potential applications, such as elderly monitoring, life-log systems for monitoring energy expenditure and supporting weight-loss programs, and digital assistants for weight-lifting exercises.
Devices such as the Fitbit make it possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified-self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, I will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise.
library(caret)         # model training, tuning, and evaluation
library(rpart)         # decision trees
library(randomForest)  # random forest models
library(corrplot)      # correlation matrix visualization
library(rpart.plot)    # decision tree plotting
trainU <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testU <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("./data")) {
dir.create("./data")
}
trFile <- "./data/pml-training.csv"
teFile <- "./data/pml-testing.csv"
download.file(trainU, destfile=trFile, method="curl")
download.file(testU, destfile=teFile, method="curl")
After downloading the data from the data source, we can read the two CSV files into two data frames.
trdf <- read.csv("./data/pml-training.csv")
tedf <- read.csv("./data/pml-testing.csv")
dim(trdf)
## [1] 19622 160
dim(tedf)
## [1] 20 160
The training data set contains 19622 observations and 160 variables, while the testing data set contains 20 observations and 160 variables. The “classe” variable in the training set is the outcome to predict.
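As a quick sanity check (an addition to the original analysis), we can look at the distribution of the outcome; according to the data set documentation, class A corresponds to correct execution of the exercise and classes B-E to common mistakes.
table(trdf$classe)  # counts of the five activity classes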
In this step, we will clean the data by removing columns that contain missing values and dropping variables that are not useful for prediction.
sum(complete.cases(trdf))  # number of rows with no missing values
## [1] 406
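Only 406 of the 19622 rows are complete, which suggests the missing values are concentrated in a subset of columns. A quick check (an addition to the original analysis) shows that each column is either fully observed or almost entirely NA, so dropping the sparse columns loses little information.
naFrac <- colMeans(is.na(trdf))  # proportion of NAs in each column
table(round(naFrac, 2))          # columns are either complete (0) or nearly all NA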
trdf <- trdf[, colSums(is.na(trdf)) == 0]  # keep only columns with no missing values
tedf <- tedf[, colSums(is.na(tedf)) == 0]
classe <- trdf$classe                      # save the outcome before filtering columns
trdel <- grepl("^X|timestamp|window", names(trdf))
trdf <- trdf[, !trdel]                     # drop the row index, timestamp, and window columns
trnew <- trdf[, sapply(trdf, is.numeric)]  # keep numeric predictors only
trnew$classe <- factor(classe)             # reattach the outcome as a factor
tedel <- grepl("^X|timestamp|window", names(tedf))
tedf <- tedf[, !tedel]
tenew <- tedf[, sapply(tedf, is.numeric)]
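To verify the cleaning step, we can check the dimensions of the cleaned data sets (a quick check added here, not part of the original code):
dim(trnew)  # 52 numeric predictors plus classe
dim(tenew)  # 52 numeric predictors plus problem_id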
The cleaned training data set contains 19622 observations and 53 variables, while the testing data set contains 20 observations and 53 variables.
### Slice the data
The cleaned training set is split into a pure training data set (70%) and a validation data set (30%). The validation data set will be used to estimate the out-of-sample error of the final model.
set.seed(22600)  # for reproducibility
inTrain <- createDataPartition(trnew$classe, p=0.70, list=FALSE)
trdata <- trnew[inTrain, ]   # 70% training set
tedata <- trnew[-inTrain, ]  # 30% validation set
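createDataPartition samples in a stratified way within each level of classe, so both subsets should preserve the class proportions; the quick check below is an addition to the original analysis.
round(prop.table(table(trdata$classe)), 3)  # class proportions in the training split
round(prop.table(table(tedata$classe)), 3)  # class proportions in the validation split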
A predictive model for activity recognition is fit to the data using the Random Forest algorithm, because it automatically selects important variables and is robust to correlated covariates and outliers. 5-fold cross-validation is used when applying the algorithm.
controlRf <- trainControl(method="cv", 5)  # 5-fold cross-validation
modelRf <- train(classe ~ ., data=trdata, method="rf", trControl=controlRf, ntree=250)
modelRf
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9873974 0.9840508
## 27 0.9883630 0.9852742
## 52 0.9811783 0.9761821
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
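To see which predictors the model relies on most, we can inspect the variable importance that the random forest computes as a by-product. This check is an addition to the original analysis; varImp is part of caret.
varImp(modelRf)                 # importance scores (mean decrease in Gini impurity by default)
plot(varImp(modelRf), top=10)   # plot the ten most important predictors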
predictRf <- predict(modelRf, tedata)
confusionMatrix(predictRf, tedata$classe)  # predictions first, then the reference labels
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1667 5 2 0 0
## B 8 1127 4 0 0
## C 0 7 1015 4 0
## D 0 0 13 951 0
## E 0 0 1 1 1080
##
## Overall Statistics
##
## Accuracy : 0.9924
## 95% CI : (0.9898, 0.9944)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9903
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9952 0.9895 0.9807 0.9948 1.0000
## Specificity 0.9983 0.9975 0.9977 0.9974 0.9996
## Pos Pred Value 0.9958 0.9895 0.9893 0.9865 0.9982
## Neg Pred Value 0.9981 0.9975 0.9959 0.9990 1.0000
## Prevalence 0.2846 0.1935 0.1759 0.1624 0.1835
## Detection Rate 0.2833 0.1915 0.1725 0.1616 0.1835
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9968 0.9935 0.9892 0.9961 0.9998
acc <- postResample(predictRf, tedata$classe)  # accuracy and kappa on the validation set
acc
## Accuracy Kappa
## 0.9923534 0.9903278
s <- 1 - as.numeric(confusionMatrix(predictRf, tedata$classe)$overall[1])  # out-of-sample error
s
## [1] 0.007646559
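Equivalently, the out-of-sample error is just the misclassification rate on the validation set; this one-line check is an addition to the original analysis.
mean(predictRf != tedata$classe)  # fraction of validation predictions that are wrong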
So, the estimated accuracy of the model on the validation set is 99.24% and the estimated out-of-sample error is 0.76%.
Now, we apply the model to the original testing data set downloaded from the data source. We remove the problem_id column first.
final <- predict(modelRf, tenew[, -length(names(tenew))])  # drop problem_id (the last column)
final
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
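For the course submission, each of the 20 predictions is typically written to its own text file. The helper below is a hypothetical sketch (the function name and output directory are my own, not part of the original analysis):
# Hypothetical helper: write each prediction to its own file, one per problem id
writePredictions <- function(preds, dir = "./answers") {
  if (!dir.exists(dir)) dir.create(dir)
  for (i in seq_along(preds)) {
    fname <- file.path(dir, sprintf("problem_id_%d.txt", i))
    write.table(preds[i], file=fname, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
writePredictions(as.character(final))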
1. Correlation Matrix Visualization
corrPlot <- cor(trdata[, -length(names(trdata))])  # correlations among the 52 predictors
corrplot(corrPlot, method="color")
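As a follow-up to the robustness claim above, caret's findCorrelation can list the predictors involved in strong pairwise correlations. This check is an addition to the original analysis, and the 0.8 cutoff is an arbitrary choice:
highCorr <- findCorrelation(corrPlot, cutoff=0.8)  # indices of highly correlated predictors
colnames(corrPlot)[highCorr]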
2. Decision Tree Visualization
tree <- rpart(classe ~ ., data=trdata, method="class")  # fit a single classification tree
prp(tree)  # fast plot of the tree