Coursera Practical Machine Learning

Introduction

The following information was provided as a guide for this project:

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Objective

The goal of the project is to predict the manner in which an excercise was performed. To do this, Machine Learning techniques were used on a training set, and used to predict information about a testing set. We will be predicting on the variable “classe” in the data.

Data Processing

First the data is downloaded via the URL given above. During the analysis it was noted that several values in the data can be considered “NA”, so these are now passed to the function reading in the data to help with clean up and speed the project up.

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

library(randomForest)

## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

library(doParallel)

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

trainlink <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testlink <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training <- read.csv(url(trainlink), na.strings=c("NA","#DIV/0!",""))
test_grade<- read.csv(url(testlink), na.strings=c("NA","#DIV/0!",""))

A quick look at the data to see what we are dealing with, the output is hidden here due to the amount of space it takes up.

dim(training)
summary(training)
names(training)

It is easy to see that the first few columns have nothing to do with the movement we are predicting, so they are removed.

training <- training[,-c(1:7)]
test_grade <- test_grade[,-c(1:7)]
training$classe <- factor(training$classe)

Next we check for NA values to help pare down the data set we are using. To see if this is even possible I want to know how many NA values we have.

sum(is.na(training))

## [1] 1925102

To help decide which columns to remove, I first looked in to which columns had no NA observations. This eliminated many of the columns, leaving a little over 50 predictor values left to use during modeling.

notna <- sapply(training, function(x)all(!is.na(x)))
training <- training[,notna]
test_grade <- test_grade[,notna]

I also investigated removing additional variables based on the near zero variance function. To compare the two models, I created another training set and test grading set.

removecol <- nearZeroVar(training, saveMetrics=TRUE)
training2 <- training[,!removecol$nzv== TRUE]

test_grade2 <- test_grade[,!removecol$nzv== TRUE]

We then split our training data in to seperate training and testing sets. The testing set that was downloaded above is used to grade the project, but we need a testing set to test our model on before grading.

inTrain = createDataPartition(y=training$classe, p=0.6, list=FALSE)
trainset = training[inTrain,]
testset= training[-inTrain,]

Model Building

Now that the data has been cleaned, we will build a random forest model to predict the classe of each observation. The first model was built using the data set that contained the 50+ completed columns. By creating the control variable, we are able to use cross validation in the model, in this case 6-fold. I also experimented with the repeats valiable within the cross validation controls. I found that adding aditional iterations led to little increase in accuracy while greatly increasing computation time.

cl <- makeCluster(detectCores())
registerDoParallel(cl)
controls <- trainControl(method="cv", number = 6)
fitrf = train(classe ~ ., method = "rf", data=trainset, trControl = controls, allowParallel = TRUE)
pred <- predict(fitrf, testset)
stopCluster(cl)

The model was then used to predict the classe of the test set. It is easily seen that the model has an accuracy near 99%

confusionMatrix(pred, testset$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2228    9    0    0    1
##          B    3 1508   15    1    1
##          C    0    1 1350   12    0
##          D    0    0    3 1270    6
##          E    1    0    0    3 1434
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9929          
##                  95% CI : (0.9907, 0.9946)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.991           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9934   0.9868   0.9876   0.9945
## Specificity            0.9982   0.9968   0.9980   0.9986   0.9994
## Pos Pred Value         0.9955   0.9869   0.9905   0.9930   0.9972
## Neg Pred Value         0.9993   0.9984   0.9972   0.9976   0.9988
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2840   0.1922   0.1721   0.1619   0.1828
## Detection Prevalence   0.2852   0.1947   0.1737   0.1630   0.1833
## Balanced Accuracy      0.9982   0.9951   0.9924   0.9931   0.9969

The second model is also a random forest, but uses fewer predictors by eliminating those with near zero variance. Note: this model took much longer to run and offered lower or equal accuracy, so while the code is shown below to build and evaluate the model, the code was not ran as part of the report.

fitrf = train(classe ~ ., method = "rf", data=trainset, trControl = controls)
pred <- predict(fitrf, testset)
fitrf

confusionMatrix(pred, testset$classe)

The out of sample error for the best model is shown below

outsamerr <- (1 - as.numeric(confusionMatrix(testset$classe, pred)$overall[1]))
outsamerr

## [1] 0.007137395

Graded Predictions

This model was then used to predict against the 20 test cases that would be graded, caled test_grade here. The model was able to predict all 20 observations correctly. In case this project is used again in the future, the actual predicitions for the graded test set are not shown here.

Reference

It was interesting to note that the choosen model was able to obtain the highest accuracy without using all of the predictors. Based on this quick plot it appears roughly half of the predictors were used.

The data for this project come from : http://groupware.les.inf.puc-rio.br/har.

Coursera Practical Machine Learning - Excercise Prediction

Chris Dolan

December 16, 2015