There are a host of new wearable electronic devices which can track the quantity of activity in which a person engages. An unresolved problem, however, is how to determine whether such activity is being done properly. In other words, we would like to know about the quality of activity/exercise, not just the quantity.
The goal of this project is to use data from accelerometer measurements of participants engaged in different classes of physical activity - bicep curls in this case - to predict whether the exercise is being done properly or improperly based solely on accelerometer measurements. The participants were each instructed to perform the exercise either properly (Class A) or in a way which replicated 4 common weightlifting mistakes (Classes B, C, D, and E).
library(caret)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(dplyr)
set.seed(3399)
urlTraining <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlValidation <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
#Read the Training data
fullData <- read.csv(urlTraining, header = T, na.strings = c("", "NA"))
#Read the Validation data
validation <- read.csv(urlValidation, header = T, na.strings = c("", "NA"))
#Near Zero Value analysis
nzv <- nearZeroVar(fullData,saveMetrics=TRUE)
One of the first things to note is that there are many dummy variables in these data. Several columns do not have measurements for each observation, but are rather summary statistics for one sliding window. The paper suggests that a “sliding window” of time is anywhere between 0.5 and 2.5 seconds of activity.
In the validation data which we would like to predict are just random draws of one observation at a particular time point. That being the case, we can easily omit many of the summary variables in this dataset as they are not useful for this particular prediction problem, and also seem to be mislabeled in some cases (e.g. the summary statistics are in the wrong columns). The inappropriate structure of these variables is borne out by the near zero variances of these variables (not shown).
While there are a lot of continuous and integer variables in these data, there is an interesting pattern that we can glean from the factor variables: most notably from the user_name,classe and cvtd_timestamp variables.
In the following exploratory plots, for example, we can see that the participants in this study all performed these trials in temporal order. They all started doing biceps curls the proper way (Class A), then proceeded with Class B, then C, etc. They are also all easily idenifiable by the num_window variable
qplota <- qplot(classe, cvtd_timestamp, data=fullData, color=user_name, size=I(3))
qplotb <- qplot(classe, num_window, data=fullData, color=user_name, size=I(3))
grid.arrange(qplota, qplotb, ncol=2)
This relationship is an artifact of the study design which would allow us to predict the validation data with great accuracy, but will fail if we are trying to accurately predict new data given only accelerometer measurements. Therefore I’ll further exclude all of the timestamp and username data if we want to accurately predict the exercise class based solely on accelerometer measurements.
fullData <- filter(fullData, new_window == "no")
fullData <- Filter(function(x)!all(is.na(x)), fullData)
fullData <- select(fullData, -X, -new_window, -user_name, -cvtd_timestamp,
-raw_timestamp_part_1, -raw_timestamp_part_2, -num_window)
# Also remove all the same columns from the test data set:
validation <- Filter(function(x)!all(is.na(x)), validation)
validation <- select(validation, -X, -new_window, -user_name, -cvtd_timestamp,
-raw_timestamp_part_1, -raw_timestamp_part_2, -num_window)
#Do the Near Zero Value analysis again
nzv <- nearZeroVar(fullData, saveMetrics=TRUE)
Examine these data for multicollinearity. I want to trim variables that are highly correlated
cor.matrix <- cor(fullData[sapply(fullData, is.numeric)])
c <- melt(cor.matrix)
qplot(x=Var1, y=Var2, data=c, fill=value, geom="tile") +
scale_fill_gradient2(limits=c(-1, 1))
While the data is split online into a “training” and “testing” set already, the “testing” set is really for the purposes of validation for the assignment. We will split the full data (what is labeled as the “training” data online) into training and testing sets in order to continue these analyses.
inTrain <- createDataPartition(y=fullData$classe, p=0.7, list=FALSE)
training <- fullData[inTrain, ]; testing <- fullData[-inTrain, ]
To begin with, I will try to build a model using classification trees. It’s somewhat unclear to me based on what I’ve learned so far in the class which model building method is most appropriate in certain situations. I need to learn a lot more about that on my own, however it seems like it might be best to build models without the assumptions of normality since that assumption is violated in some of these variables.
Here is a classification tree:
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
modfit_tree <- train(classe ~ ., method="rpart", data=training)
fancyRpartPlot(modfit_tree$finalModel)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
mtry <- tuneRF(training[,-46], training$classe, ntreeTry=500, stepFactor=1.5,improve=0.01,
plot=FALSE, trace=TRUE, dobest=FALSE)
## mtry = 6 OOB error = 0.61%
## Searching left ...
## mtry = 4 OOB error = 0.63%
## -0.03658537 0.01
## Searching right ...
## mtry = 9 OOB error = 0.54%
## 0.1097561 0.01
## mtry = 13 OOB error = 0.56%
## -0.04109589 0.01
modfit_rf <- randomForest(classe~.,data=training, mtry=9, ntree=500)
pred <- predict(modfit_rf, testing)
confusionMatrix(pred, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1641 8 0 0 0
## B 0 1102 2 0 0
## C 0 5 1001 7 1
## D 0 0 2 937 0
## E 0 0 0 0 1057
##
## Overall Statistics
##
## Accuracy : 0.9957
## 95% CI : (0.9936, 0.9972)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9945
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9883 0.9960 0.9926 0.9991
## Specificity 0.9981 0.9996 0.9973 0.9996 1.0000
## Pos Pred Value 0.9951 0.9982 0.9872 0.9979 1.0000
## Neg Pred Value 1.0000 0.9972 0.9992 0.9985 0.9998
## Prevalence 0.2847 0.1935 0.1744 0.1638 0.1836
## Detection Rate 0.2847 0.1912 0.1737 0.1626 0.1834
## Detection Prevalence 0.2861 0.1916 0.1760 0.1629 0.1834
## Balanced Accuracy 0.9990 0.9940 0.9966 0.9961 0.9995
The accuracy of this model on the testing dataset was 0.9957, which is a good estimate of the out-of-sample accuracy.
All that is left is to use this model to predict the classes of the validation data:
pred_val <- predict(modfit_rf, validation)
pred_val
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E