Machine Learning: Activity Tracking

Summary

From Jeff Leek’s assignment: “Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset)”.

Preprocessing

The analysis assumes that the files pml-testing.csv and pml-training.csv are downloaded in the working directory. If they are not, the following code can be used to do that:

The first step is to read the data files into R. Given the amount of NAs, it’s a good idea to label those from the beginning.

testing <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("NA",""))
training <- read.csv("pml-training.csv", header = TRUE, na.strings = c("NA",""))
dim(training); dim(testing)

## [1] 19622   160

## [1]  20 160

After inspection, there are columns that contain only NAs and variables that are not relevant for the purposes of this analysis. In addition, the very last column in the testing set contains the variable “problem id”, which is another variable that can be removed.

trainingNAfilter <- training[,(colSums(is.na(training)) == 0)]
testingNAfilter <- testing[,(colSums(is.na(testing)) == 0)]
trainingTidy <- trainingNAfilter[,-(1:6)]
testingTidy <- testingNAfilter[, -(1:6)]
testingTidy <- testingTidy[,-54]
dim(trainingTidy); dim(testingTidy)

## [1] 19622    54

## [1] 20 53

Finally, there are variables that are highly correlated with one another that can be removed in order to reduce the variance.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(corrplot)
corMatrix <- abs(cor(trainingTidy[, -54]))
diag(corMatrix) <- 0
corrplot(corMatrix, tl.cex = 0.6)

highCorrs <- findCorrelation(corMatrix, cutoff = .9)
train <- trainingTidy[, -c(highCorrs)]
test <- testingTidy[, -c(highCorrs)]
dim(train); dim(test)

## [1] 19622    47

## [1] 20 46

After tidying up, “train” and “test” can be used with no empty, irrelevant, or redundant columns.

Building the Model

Next, the “train” data are split into a training set and a validation set. The training set is used to build the model and the validation set will be used to check the accuracy of the model.

inTrain <- createDataPartition(y = train$classe, p = 0.6, list = FALSE)
trainSubset <- train[inTrain,]
validSubset <- train[-inTrain,]
dim(trainSubset); dim(validSubset)

## [1] 11776    47

## [1] 7846   47

An initial tree is built to test the predictive quality of all relevant variables.

library(tree)
set.seed(79)
model1 <- tree(classe~.,data=trainSubset)
summary(model1)

## 
## Classification tree:
## tree(formula = classe ~ ., data = trainSubset)
## Variables actually used in tree construction:
##  [1] "pitch_forearm"     "magnet_belt_y"     "accel_forearm_z"  
##  [4] "magnet_dumbbell_y" "roll_forearm"      "magnet_dumbbell_z"
##  [7] "accel_dumbbell_y"  "num_window"        "pitch_belt"       
## [10] "yaw_belt"          "total_accel_belt"  "accel_forearm_x"  
## [13] "magnet_arm_x"      "accel_dumbbell_z" 
## Number of terminal nodes:  27 
## Residual mean deviance:  1.271 = 14930 / 11750 
## Misclassification error rate: 0.241 = 2838 / 11776

plot(model1)
text(model1,cex =.6)

The tree is messy and may need some pruning but the accuracy of the model can be checked before that.

mod1Predict <- predict(model1,validSubset,type="class")
predMatrix <- with(validSubset,table(mod1Predict,classe))
sum(diag(predMatrix))/sum(as.vector(predMatrix))

## [1] 0.7481519

The model makes accurate predictions in 71% of the cases. A model with less variables that has predictive power over 50% is still better than a coin flip. The next step is to see if the number of variables can be reduced while maintaining decent predictive power.

cvTrain <- cv.tree(model1,FUN=prune.misclass)
plot(cvTrain)

It looks like there are noticeable gains in terms of misclassification around the 18 and the 13 variables. Pruning the tree at those variables will result in relatively small accuracy losses with significant gains in terms of variance. Taking the best 13 variables results in a small loss in accuracy but an uncluttered tree (see below) maintaining at more than 60% of predictive accuracy.

pruned <- prune.misclass(model1,best=13)
predict2 <- predict(pruned,validSubset,type="class")
predMatrix2 <- with(validSubset,table(predict2,classe))
sum(diag(predMatrix2))/sum(as.vector(predMatrix2))

## [1] 0.6348458

plot(pruned)
text(pruned,cex =.6)

Applying the Model

The final step is to apply the model to the test set and predict the level of exercise for the twenty subjects from that set. Given the excessive computational resources that random forests require in order to use all potential predictors, the model presented here is efficient, but not sufficiently accurate.

predictFinal <- predict(pruned,test,type="class")
predictFinal

##  [1] D A C A A C D C A A C C B A C C A A A B
## Levels: A B C D E

Building and Applying a More Accurate Model

As a potentially more accurate alternative, a random forest model may be explored. The model relies on the steps laid out previously but takes advantage of the randomForest package. The confusion matrix below shows a lot more accurate predictions and less misclassifications in the validation set. Therefore, the test set predictions generated below are ultimately used.

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

model2 <- randomForest(classe ~. , data=trainSubset, method="class")
mod2predict <- predict(model2, validSubset, type="class")
confusionMatrix(mod2predict, validSubset$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2231    3    0    0    0
##          B    1 1513    4    0    0
##          C    0    2 1364   18    0
##          D    0    0    0 1268    3
##          E    0    0    0    0 1439
## 
## Overall Statistics
##                                           
##                Accuracy : 0.996           
##                  95% CI : (0.9944, 0.9973)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.995           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9996   0.9967   0.9971   0.9860   0.9979
## Specificity            0.9995   0.9992   0.9969   0.9995   1.0000
## Pos Pred Value         0.9987   0.9967   0.9855   0.9976   1.0000
## Neg Pred Value         0.9998   0.9992   0.9994   0.9973   0.9995
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1928   0.1738   0.1616   0.1834
## Detection Prevalence   0.2847   0.1935   0.1764   0.1620   0.1834
## Balanced Accuracy      0.9995   0.9980   0.9970   0.9928   0.9990

predict(model2, test)

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E