Predicting Motion Type

Executive Summary

This report is born out of a data science course on Coursera.com through Johns Hopkins University called Practical Machine Learning. The goal of the analysis is to build a prediction algorithm for the type of activity an individual is performing, using a set of human activity recognition data from the good folks at Groupware. The five distinct activities are sitting, sitting down, standing, standing up, and walking. The data, with deeper explanation, can be found here. The report can also be found on RPubs here.

Creating this prediction algorithm boiled down to two main challenges. First, cleaning the training dataset and selecting which features to use in the algorithm (the original set contained more than 19,000 observations and more than 190 features). Second, understanding and tuning the dozens of options available for the best machine learning algorithm.

This report details the method of data cleaning and feature selection, and ultimately uses a random forest algorithm. The final model has 500 trees, an out-of-bag error rate of 0.45% (awesome), and 99.6% accuracy on a partitioned test set (again, awesome). Since this is not a perfect model for prediction, the end of this report discusses possibilities for improvement.

Reading and Exploring the Data

Exploring unfamiliar data is a journey. This journey begins with basic 'what does this data look like' questions.

# download and read the data
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainURL, destfile = "trainingData.csv", method = "curl")
download.file(testURL, destfile = "testingData.csv", method = "curl")
trainingData <- read.csv("trainingData.csv", header=TRUE, na.strings = c("#DIV/0!"))
testingData <- read.csv("testingData.csv", header=TRUE, na.strings = c("#DIV/0!"))

# explore the data structure and presence of missing values
str(trainingData); str(testingData)
summary(trainingData); summary(testingData)

Pretty inconsistent data. There are many columns with nothing but NAs. To clean that up, we can convert all values to numeric (thereby coercing NAs in the empty columns) and then keep only the columns whose count of NA values is zero. Note that this took this author hours to figure out: because only "#DIV/0!" was declared as an NA string at read time, the file's literal "NA" entries came in as the string "NA" rather than true NA values. We also remove the metadata columns (1-7), since they will not help our machine learn.

# convert the measurement columns to numeric (the as.character step turns
# the literal "NA" strings into true NA values)
for (i in 8:(ncol(trainingData)-1)) {trainingData[,i] = as.numeric(as.character(trainingData[,i]))}
for (i in 8:(ncol(testingData)-1)) {testingData[,i] = as.numeric(as.character(testingData[,i]))}

# identify the complete (NA-free) feature columns, drop the metadata columns,
# and subset the training data to just those features
modelFeatures <- colnames(trainingData[colSums(is.na(trainingData)) == 0])[-(1:7)]
modelDataset <- trainingData[modelFeatures]
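
As an aside, a simpler route (not the one taken above) is to declare every NA token at read time, which sidesteps the string-"NA" gotcha entirely. A minimal sketch, assuming the same downloaded file:

# alternative sketch: declare all NA tokens up front so no coercion loop is needed
trainingData <- read.csv("trainingData.csv", header=TRUE,
                         na.strings = c("NA", "", "#DIV/0!"))

# sanity check on the cleaned model dataset
dim(modelDataset)  # expect 19622 rows and 53 columns (52 features plus classe)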

Building the Model

The data should now be good enough to build a prediction model. There are 52 features, 19,622 observations, and the prediction variable is 'classe'. So, which model? We tried a variety: we started with a random forest, then a Naive Bayes, then a gradient boosting model. A few key drivers influenced the decision to use random forest as the final model.

First, it is my understanding that Naive Bayes works best with large datasets. Even though the modelDataset has 19,622 observations, it comes from only four people, so it might not be representative of the entire population. Second, while both ensemble methods (gbm and random forest) worked nicely (>95% accuracy), random forest consistently performed better across various trials. Hence, we use a random forest model. The process looked like this:

# load the caret and randomForest libraries
library(caret)
library(randomForest)

# partition the training data between training and test set
set.seed(2222)
inTrain <- createDataPartition(y=modelDataset$classe, p=0.7, list=FALSE)
train <- modelDataset[inTrain,]
test <- modelDataset[-inTrain,]

# generate the model
model <- randomForest(classe ~ ., data = train, proximity = TRUE, keep.forest = TRUE)
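
The Naive Bayes and gbm trials mentioned above are not shown; a hedged sketch of how they might be reproduced with caret follows. The method names "nb" and "gbm" (which require the klaR and gbm packages) and the 5-fold setup are assumptions, not the author's exact code:

# illustrative model comparison via caret's resampling framework
ctrl <- trainControl(method = "cv", number = 5)
fitNB  <- train(classe ~ ., data = train, method = "nb",  trControl = ctrl)
fitGBM <- train(classe ~ ., data = train, method = "gbm", trControl = ctrl, verbose = FALSE)
fitRF  <- train(classe ~ ., data = train, method = "rf",  trControl = ctrl)
summary(resamples(list(NB = fitNB, GBM = fitGBM, RF = fitRF)))  # compare Accuracy and Kappa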

We partitioned the training set into a sub-training set and a testing set in order to estimate the out-of-sample error rate. Across the board, the default tuning settings of the randomForest package were used; there simply was no need for any preprocessing or tweaks. Using this simple cross-validation method (70% of the training data in the train subset, 30% in the test subset) shows that the final model is over 99% accurate on the testing set. Of course, the random forest method has its own internal cross-validation (the out-of-bag estimate), so using the word 'simple' might be misleading. Anyway, we present the results of the model.

model
## 
## Call:
##  randomForest(formula = classe ~ ., data = train, proximity = TRUE,      keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.45%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3905    1    0    0    0    0.000256
## B   15 2639    4    0    0    0.007148
## C    0   13 2379    4    0    0.007095
## D    0    0   19 2232    1    0.008881
## E    0    0    0    5 2520    0.001980
confusionMatrix(test$classe, predict(model, newdata = test))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    2 1134    3    0    0
##          C    0    6 1020    0    0
##          D    0    0   12  950    2
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                         
##                Accuracy : 0.996         
##                  95% CI : (0.994, 0.997)
##     No Information Rate : 0.285         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.994         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.999    0.995    0.986    0.999    0.998
## Specificity             1.000    0.999    0.999    0.997    1.000
## Pos Pred Value          1.000    0.996    0.994    0.985    0.999
## Neg Pred Value          1.000    0.999    0.997    1.000    1.000
## Prevalence              0.285    0.194    0.176    0.162    0.184
## Detection Rate          0.284    0.193    0.173    0.161    0.184
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       0.999    0.997    0.992    0.998    0.999
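
As a side note, the out-of-bag estimate quoted above can be pulled straight from the fitted object and set next to the holdout error. A minimal sketch, assuming the model and test objects from above (the "OOB" column name is from the randomForest package):

# compare the forest's internal OOB error with the holdout error estimate
oobError     <- model$err.rate[model$ntree, "OOB"]           # 0.45% per the output above
holdoutError <- mean(predict(model, newdata = test) != test$classe)
c(OOB = oobError, holdout = holdoutError)                    # both well under 1%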

Visualizing Model Results

To see the model in action, we use base plot in R. The plot shows that as the number of trees increases, the error rate in the model falls to nearly zero.

# simple base R plot
plot(model, main = "Visualizing Model Construction")

[Figure: "Visualizing Model Construction", showing the per-class and OOB error rates falling toward zero as trees are added]
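
Beyond the error trace, it can be instructive to see which features the forest leans on. This was not part of the original analysis, but the randomForest package offers a one-line view:

# variable importance (mean decrease in Gini) for the fitted forest
varImpPlot(model, main = "Variable Importance")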

Submitting Model Predictions

Lastly, the model needs to be used to predict 'classe' on our original testingData, and the predictions prepared for submission to the Coursera website.

# predict 'classe' for the 20 cases in the original test set
answers = predict(model, newdata = testingData)

# write each prediction to its own text file for the course submission page
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}

pml_write_files(answers)
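
As a quick sanity check (illustrative, not part of the original write-up), one can confirm there are exactly 20 predictions before writing the files:

# sanity check: the course test set has exactly 20 cases
stopifnot(length(answers) == 20)
answers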

Conclusion

This report analyzed a dataset with 5 classes (sitting down, standing up, standing, walking, and sitting) collected over 8 hours of activities from 4 healthy subjects. After cleaning the data, a random forest prediction model was built to predict which class of activity the individual was performing. The model achieved over 99% accuracy on the held-out test set, and on the submission to the course testingData it was 100% accurate (20 correct answers).

Further work on this model should include tweaking model parameters to increase the out-of-sample accuracy rate (if we assume it won't stay 100% accurate as a greater number of subjects is tested). It would also be wise to use a faster processor for building and testing model options, as this was a huge headache and time sink for the author of this report. Nonetheless, thanks for reading.
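
As a starting point for that parameter tweaking, one option (a sketch, assuming the train data frame from above) is the randomForest package's built-in search over mtry, the number of variables tried at each split (currently 7, per the model output):

# sketch: search for a better mtry around the default, judged by OOB error
set.seed(2222)
tuneRF(x = train[, names(train) != "classe"],
       y = train$classe,
       ntreeTry = 500, stepFactor = 1.5, improve = 0.01)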