Human Activity Recognition
Author: Nathan Smith
Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or simply because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, my goal was to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, and to train a model that could accurately identify the 5 different movements. I used a method called Random Forests to classify the movements.
Human Activity Recognition has emerged as a key research area and generates a massive amount of data. More information about the data set can be found at http://groupware.les.inf.puc-rio.br/har. We need to download the data from the web for this exercise; the code below does this, with a check so that re-running the code does not download the files again if they already exist.
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train <- "./data/pml-training.csv"
test <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
dir.create("./data")
}
if (!file.exists(train)) {
download.file(trainURL, destfile=train, method="curl")
}
if (!file.exists(test)) {
download.file(testURL, destfile=test, method="curl")
}
Now we will load the data into our workspace.
trainData <- read.csv("./data/pml-training.csv", header=TRUE)
testData <- read.csv("./data/pml-testing.csv", header=TRUE)
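Before trimming anything, it is worth a quick look at the raw training data. The sketch below is not part of the original analysis; it simply reports the dimensions and counts how many columns contain missing values, which motivates the cleaning that follows.
# sketch: quick look at the raw training data
dim(trainData)                          # rows and columns
sum(colSums(is.na(trainData)) > 0)      # number of columns containing NA values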
And fire up all the libraries we could possibly need during this analysis.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(RANN)
library(rpart)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(corrplot)
library(rpart.plot)
As with any dataset, there are variables we don't need. Here is the code I wrote to trim them out: identifier, timestamp, and window columns, columns full of NA values, and non-numeric factor variables (while preserving the classe outcome).
# removing anything with timestamp, X, and window
rm <- grep('^X|timestamp|window', names(trainData))
rmtest <- grep('^X|timestamp|window', names(testData))
trainSlim <- trainData[,-c(rm)]
testSlim <- testData[,-c(rmtest)]
# removing NA values
trainSlim <- trainSlim[, colSums(is.na(trainSlim)) == 0]
testSlim <- testSlim[, colSums(is.na(testSlim)) == 0]
# removing factor variables while preserving "classe" variable
trainClasse <- trainSlim$classe
trainNum <- trainSlim[,sapply(trainSlim, is.numeric)] # run down the columns and check if numeric
trainNum$classe <- trainClasse # append classe back to the set
testNum <- testSlim[,sapply(testSlim, is.numeric)] # run down the columns and check if numeric
dim(trainNum)
## [1] 19622 53
dim(testNum)
## [1] 20 53
## check that both sets have the same columns (except the last one: classe in the training set, problem_id in the test set)
sum(names(trainNum) == names(testNum))
## [1] 52
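If we want to be explicit about which columns differ, a quick check with setdiff in both directions (a sketch, not in the original code) should show only classe on the training side and problem_id on the test side.
# sketch: the only differing columns should be the outcome and the case id
setdiff(names(trainNum), names(testNum))    # expected: "classe"
setdiff(names(testNum), names(trainNum))    # expected: "problem_id"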
Before we test our model on the real test data, we should split our training set into an additional training and test set, so we can estimate the out-of-sample error. The output also produces a neat visualization of the correlation between all the variables in the dataset. The diagonals of the matrix are all 1, as each variable is perfectly correlated with itself, but notice the interesting patterns elsewhere: you can see clearly that there are bodily movements that happen in tandem, as we would expect.
###### Cross validation set up ############
# we need to separate the training set into parts so we can perform cross validation
set.seed(300)
inTrain <- createDataPartition(trainNum$classe, p=0.7, list=FALSE)
trainDP <- trainNum[inTrain,]
testDP <- trainNum[-inTrain,]
# checking out correlation between variables
tempDF <- subset(trainDP, select=-c(classe))
corrVis <- cor(tempDF)
corrplot(corrVis, method="circle", title="Correlation Plot of Explanatory Variables")
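As an aside, caret's findCorrelation function can flag the columns behind those strongly correlated pairs. I did not prune them for the model below; the sketch here, with an assumed cutoff of 0.9, is only for exploration.
# sketch: identify columns with pairwise absolute correlation above 0.9
highCorr <- findCorrelation(corrVis, cutoff = 0.9)
names(tempDF)[highCorr]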
I chose the Random Forest method because it can produce some of the most robust models available in a relatively short amount of code. I set the number of cross-validation folds to 5; even at a k that low it takes a decent amount of time to compute the trees, though k = 10 might give a slightly better error estimate. The code also fits and plots a single classification tree with rpart, purely as an illustration of the kind of splits the forest is built from.
##### Random Forest #########
# we will try random forests on this data to see if we can't get a robust prediction
# we will set k-fold equal to 5 because this dataset is relatively small
k <- 5
ctrl <- trainControl(method="cv", number=k) # k-fold cross validation; number (not repeats) sets the fold count for method="cv"
model <- train(classe~., data=trainDP, method="rf", trControl=ctrl)
# fit and plot a single rpart tree for illustration (a random forest cannot be drawn as one tree)
trees <- rpart(classe~., data=trainDP, method="class", cp=.02)
prp(trees, extra=8, box.col=c("pink","palegreen3", "chocolate2", "coral3", "aquamarine")[trees$frame$yval])
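To see which sensors the forest itself relies on, caret's varImp can be applied to the fitted model. The following is a sketch, not part of the original analysis, and the choice of 20 predictors to display is arbitrary.
# sketch: variable importance from the fitted random forest
imp <- varImp(model)
plot(imp, top = 20)    # plot the 20 most important predictors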
Now let's look at the cross-validation results and check the model's performance on the held-out validation set.
#### checking model performance
model
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 12364, 12364, 12363, 12364, 12363, 12362, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9924289 0.9904220 0.002065486 0.002613740
## 27 0.9922106 0.9901464 0.002004346 0.002536263
## 52 0.9849298 0.9809349 0.005199853 0.006580082
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
predictions <- predict(model, testDP)
confmat <- confusionMatrix(testDP$classe, predictions) # to see where we missed
confmat$table
## Reference
## Prediction A B C D E
## A 1673 1 0 0 0
## B 9 1126 4 0 0
## C 0 9 1015 2 0
## D 0 0 18 943 3
## E 0 0 0 3 1079
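The confusionMatrix object also carries per-class statistics, so if we want more than the raw table, a sketch like the following would pull out sensitivity and specificity for each of the five classes.
# sketch: per-class sensitivity and specificity from the confusion matrix
confmat$byClass[, c("Sensitivity", "Specificity")]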
And how do the accuracy measures look on this held-out test set?
accuracy <- postResample(predictions, testDP$classe)
# the estimated out of sample error is
outsample <- 1 - accuracy[1]
# the estimated accuracy is
acc <- accuracy[1]
The estimated out-of-sample error is 0.0083263 and the accuracy is 0.9916737, so this is a model with very strong predictive ability.
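As a sanity check, the same out-of-sample error can be computed directly as the misclassification rate on the validation set; this sketch is equivalent to 1 minus the accuracy above.
# sketch: misclassification rate on the held-out validation set
mean(predictions != testDP$classe)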
The ultimate test of our model accuracy, however, comes from testing on the real test data set. This model accurately predicted all 20 of the test cases.
######## PREDICTING ON TEST SET ############
testNum <- subset(testNum, select=-c(problem_id))
Finalpred <- predict(model, testNum)
Finalpred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
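Finally, if the predictions need to be saved, one simple approach is to write them out alongside their problem IDs. The file name is illustrative, and this sketch assumes the problem IDs are simply 1 through 20.
# sketch: save the 20 predictions with their problem ids (file name is illustrative)
write.csv(data.frame(problem_id = 1:20, prediction = Finalpred),
          "final-predictions.csv", row.names = FALSE)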