From Jeff Leek’s assignment: “Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset)”.
The analysis assumes that the files pml-testing.csv and pml-training.csv are downloaded in the working directory. If they are not, the following code can be used to do that:
The first step is to read the data files into R. Given the amount of NAs, it’s a good idea to label those from the beginning.
testing <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("NA",""))
training <- read.csv("pml-training.csv", header = TRUE, na.strings = c("NA",""))
dim(training); dim(testing)
## [1] 19622 160
## [1] 20 160
After inspection, there are columns that contain only NAs and variables that are not relevant for the purposes of this analysis. In addition, the very last column in the testing set contains the variable “problem id”, which is another variable that can be removed.
trainingNAfilter <- training[,(colSums(is.na(training)) == 0)]
testingNAfilter <- testing[,(colSums(is.na(testing)) == 0)]
trainingTidy <- trainingNAfilter[,-(1:6)]
testingTidy <- testingNAfilter[, -(1:6)]
testingTidy <- testingTidy[,-54]
dim(trainingTidy); dim(testingTidy)
## [1] 19622 54
## [1] 20 53
Finally, there are variables that are highly correlated with one another that can be removed in order to reduce the variance.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
corMatrix <- abs(cor(trainingTidy[, -54]))
diag(corMatrix) <- 0
corrplot(corMatrix, tl.cex = 0.6)
highCorrs <- findCorrelation(corMatrix, cutoff = .9)
train <- trainingTidy[, -c(highCorrs)]
test <- testingTidy[, -c(highCorrs)]
dim(train); dim(test)
## [1] 19622 47
## [1] 20 46
After tidying up, “train” and “test” can be used with no empty, irrelevant, or redundant columns.
Next, the “train” data are split into a training set and a validation set. The training set is used to build the model and the validation set will be used to check the accuracy of the model.
inTrain <- createDataPartition(y = train$classe, p = 0.6, list = FALSE)
trainSubset <- train[inTrain,]
validSubset <- train[-inTrain,]
dim(trainSubset); dim(validSubset)
## [1] 11776 47
## [1] 7846 47
An initial tree is built to test the predictive quality of all relevant variables.
library(tree)
set.seed(79)
model1 <- tree(classe~.,data=trainSubset)
summary(model1)
##
## Classification tree:
## tree(formula = classe ~ ., data = trainSubset)
## Variables actually used in tree construction:
## [1] "pitch_forearm" "magnet_belt_y" "accel_forearm_z"
## [4] "magnet_dumbbell_y" "roll_forearm" "magnet_dumbbell_z"
## [7] "accel_dumbbell_y" "num_window" "pitch_belt"
## [10] "yaw_belt" "total_accel_belt" "accel_forearm_x"
## [13] "magnet_arm_x" "accel_dumbbell_z"
## Number of terminal nodes: 27
## Residual mean deviance: 1.271 = 14930 / 11750
## Misclassification error rate: 0.241 = 2838 / 11776
plot(model1)
text(model1,cex =.6)
The tree is messy and may need some pruning but the accuracy of the model can be checked before that.
mod1Predict <- predict(model1,validSubset,type="class")
predMatrix <- with(validSubset,table(mod1Predict,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix))
## [1] 0.7481519
The model makes accurate predictions in 71% of the cases. A model with less variables that has predictive power over 50% is still better than a coin flip. The next step is to see if the number of variables can be reduced while maintaining decent predictive power.
cvTrain <- cv.tree(model1,FUN=prune.misclass)
plot(cvTrain)
It looks like there are noticeable gains in terms of misclassification around the 18 and the 13 variables. Pruning the tree at those variables will result in relatively small accuracy losses with significant gains in terms of variance. Taking the best 13 variables results in a small loss in accuracy but an uncluttered tree (see below) maintaining at more than 60% of predictive accuracy.
pruned <- prune.misclass(model1,best=13)
predict2 <- predict(pruned,validSubset,type="class")
predMatrix2 <- with(validSubset,table(predict2,classe))
sum(diag(predMatrix2))/sum(as.vector(predMatrix2))
## [1] 0.6348458
plot(pruned)
text(pruned,cex =.6)
The final step is to apply the model to the test set and predict the level of exercise for the twenty subjects from that set. Given the excessive computational resources that random forests require in order to use all potential predictors, the model presented here is efficient, but not sufficiently accurate.
predictFinal <- predict(pruned,test,type="class")
predictFinal
## [1] D A C A A C D C A A C C B A C C A A A B
## Levels: A B C D E
As a potentially more accurate alternative, a random forest model may be explored. The model relies on the steps laid out previously but takes advantage of the randomForest package. The confusion matrix below shows a lot more accurate predictions and less misclassifications in the validation set. Therefore, the test set predictions generated below are ultimately used.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
model2 <- randomForest(classe ~. , data=trainSubset, method="class")
mod2predict <- predict(model2, validSubset, type="class")
confusionMatrix(mod2predict, validSubset$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2231 3 0 0 0
## B 1 1513 4 0 0
## C 0 2 1364 18 0
## D 0 0 0 1268 3
## E 0 0 0 0 1439
##
## Overall Statistics
##
## Accuracy : 0.996
## 95% CI : (0.9944, 0.9973)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.995
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9996 0.9967 0.9971 0.9860 0.9979
## Specificity 0.9995 0.9992 0.9969 0.9995 1.0000
## Pos Pred Value 0.9987 0.9967 0.9855 0.9976 1.0000
## Neg Pred Value 0.9998 0.9992 0.9994 0.9973 0.9995
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1928 0.1738 0.1616 0.1834
## Detection Prevalence 0.2847 0.1935 0.1764 0.1620 0.1834
## Balanced Accuracy 0.9995 0.9980 0.9970 0.9928 0.9990
predict(model2, test)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E