Prediction Assignment Writeup

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Goal

The goal of this project is to predict the manner in which the participants did the exercise (the “classe” variable) and use a prediction model to predict 20 different test cases.

WLE Dataset

Data Observation

Get data from source

pmlTraining <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"), na.strings=c("NA","","#DIV/0!"))
pmlTesting <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"), na.strings=c("NA","","#DIV/0!"))
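The effect of the na.strings argument used above can be illustrated on a small example. The following is only a sketch: the CSV text and its values are made up for illustration and are not taken from the real files.

```r
# Toy CSV (made-up values) containing the three kinds of missing-value
# markers handled above: "NA", an empty field and "#DIV/0!".
csvText <- "roll_belt,kurtosis_roll_belt,max_roll_belt
1.41,NA,
1.42,#DIV/0!,5.2
1.48,-1.3,NA"

# read.csv uses comment.char = "" by default, so '#' does not start a comment.
toy <- read.csv(text = csvText, na.strings = c("NA", "", "#DIV/0!"))

str(toy)              # all three columns are parsed as numeric
colSums(is.na(toy))   # number of missing values per column
```

Without "#DIV/0!" in na.strings, the second column would be read as text rather than as numbers, which is why the marker is listed explicitly.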

The training data set is rather large with 19622 observations. The testing data set has 20 observations.

library(ggplot2)
library(caret)

## Loading required package: lattice

dim(pmlTraining)

## [1] 19622   160

dim(pmlTesting)

## [1]  20 160

Although the data set is large, not every column helps the prediction, and dropping uninformative columns also speeds up model fitting. After exploring the data set, the following observations are made:

The first seven predictors, which include the counter, the names of the participants and the timestamps, are not useful since they do not contribute to the prediction of classe.
There are a lot of NAs, which do not contribute to the prediction.
There are a lot of predictors, which increases the time required to fit the models. Predictors with near-zero variance carry little information, so removing them should speed up model fitting with little or no loss of accuracy.

There are four sensors for each subject, at the arm, forearm, belt and dumbbell. A pairs plot of the roll, pitch and yaw readings from these four sensors does not immediately show anything of interest.

featurePlot(x = pmlTraining[ , c("roll_belt", "pitch_belt", "yaw_belt",
                                 "roll_arm", "pitch_arm", "yaw_arm",
                                 "roll_forearm", "pitch_forearm", "yaw_forearm",
                                 "roll_dumbbell", "pitch_dumbbell", "yaw_dumbbell")],
            y = pmlTraining$classe, plot = "pairs")

Data Cleaning

Based on the observation above, the following data cleaning will be performed.

The first seven predictors will be removed.

pmlTraining = pmlTraining[,-c(1:7)]

Predictors with near-zero variance will be removed.

nsv = nearZeroVar( pmlTraining, saveMetrics=TRUE )
pmlTraining = pmlTraining[ , !nsv$nzv ]
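The idea behind the near-zero-variance filter can be sketched in base R. The helper isNearZeroVar below is hypothetical and only approximates the two criteria described in the caret documentation for nearZeroVar (a frequency ratio above freqCut = 95/5 combined with at most uniqueCut = 10 percent unique values, or a single distinct value). It is not caret's implementation, and the toy columns are made up.

```r
# Hypothetical helper: flags a column as near-zero-variance, assuming the
# default cut-offs documented for caret::nearZeroVar.
isNearZeroVar <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  n <- length(x)
  counts <- table(x[!is.na(x)])
  if (length(counts) <= 1) return(TRUE)         # one distinct value (or none)
  counts <- sort(counts, decreasing = TRUE)
  freqRatio <- counts[[1]] / counts[[2]]         # most common vs. runner-up
  percentUnique <- 100 * length(counts) / n
  freqRatio > freqCut && percentUnique <= uniqueCut
}

set.seed(1)
toy <- data.frame(
  mostly_zero = c(rep(0, 98), 1, 2),   # ratio 98:1, 3% unique values
  continuous  = rnorm(100),            # many distinct values
  constant    = rep(0, 100)            # a single value
)

flags <- vapply(toy, isNearZeroVar, logical(1))
print(flags)

# Keep only the columns that are not flagged.
kept <- toy[, !flags, drop = FALSE]
names(kept)
```

In this toy example only the continuous column survives the filter; the other two would add fitting time without adding information.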

Predictors containing NAs will be removed.

pmlTraining = pmlTraining [ , colSums(is.na(pmlTraining )) == 0]
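How the NA filter selects columns can be seen on a toy data frame. This is a minimal sketch; the column names and values are hypothetical.

```r
# Toy data frame (hypothetical values): two complete columns and two
# columns that contain missing values.
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48, 1.45),
  max_roll_belt = c(NA, NA, NA, 5.2),
  classe        = c("A", "A", "B", "B"),
  amplitude_yaw = c(NA, 0.1, 0.2, 0.3)
)

# Share of missing values in each column.
naShare <- colMeans(is.na(toy))
print(naShare)

# Keep only the columns without any missing value.
complete <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]
names(complete)
```

The drop = FALSE argument is an addition in this sketch; it keeps the result a data frame even if only one column remains.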

Models and Validation

Random Forest, Gradient Boosting and a Support Vector Machine with a radial kernel are selected for comparison.

For modelling and validation, the data will be split into a training set (70%) and a validation set (30%). The training set will be used to determine, by cross-validation, which model has the highest accuracy. The model with the highest accuracy will then be validated using the validation set.

inTrain = createDataPartition( y=pmlTraining$classe, p=0.7, list=FALSE)
myTrain = pmlTraining[ inTrain, ]
myTest  = pmlTraining[ -inTrain, ]
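For a factor outcome, createDataPartition samples within each class so that the class proportions are preserved in both parts. The base R sketch below illustrates that idea on made-up class counts; it is an approximation for illustration, not caret's implementation.

```r
set.seed(1927)  # any fixed seed makes the split reproducible

# Toy outcome with unbalanced classes (hypothetical counts).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then combine them.
idxByClass <- split(seq_along(classe), classe)
inTrainToy <- sort(unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

table(classe[inTrainToy])    # class counts in the 70% part
table(classe[-inTrainToy])   # class counts in the 30% part
```

Because the sampling is random, calling set.seed() before the partition is what makes the same rows land in each part on every run.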

Modelling will be based on 8-fold cross-validation, which gives an estimate of the out-of-sample error for each model and helps to guard against overfitting.

control <- trainControl(method="cv", number=8)
set.seed(1927)
modelRF <- train(classe~., data=myTrain, method="rf", trControl=control)

## Loading required package: randomForest

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

modelSVM <- train(classe~., data=myTrain, method="svmRadial", trControl=control)

## Loading required package: kernlab

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

modelGBM <- train(classe~., data=myTrain, method="gbm", trControl=control, verbose=FALSE)

## Loading required package: gbm

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: splines

## Loading required package: parallel

## Loaded gbm 2.1.1

## Loading required package: plyr
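The 8-fold cross-validation requested through trainControl can be pictured with a small base R sketch. Everything here is an assumption made for illustration: the built-in iris data stands in for the sensor data, and the hypothetical nearestCentroid function stands in for the real models. It only shows how the folds are formed and how the accuracies are averaged.

```r
set.seed(1927)
k <- 8

# Assign every row of the stand-in data set to one of k folds at random.
fold <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Hypothetical stand-in classifier: predict the class whose mean is closest.
nearestCentroid <- function(train, test) {
  centroids <- sapply(split(train[, 1:4], train$Species), colMeans)  # 4 x 3
  d <- apply(centroids, 2, function(mu) {
    rowSums(sweep(as.matrix(test[, 1:4]), 2, mu)^2)  # squared distances
  })
  colnames(centroids)[max.col(-d, ties.method = "first")]
}

# Hold out each fold once, fit on the other k - 1 folds, score on the fold.
accuracy <- sapply(seq_len(k), function(i) {
  train <- iris[fold != i, ]
  test  <- iris[fold == i, ]
  mean(nearestCentroid(train, test) == as.character(test$Species))
})

print(round(accuracy, 3))
mean(accuracy)   # cross-validated accuracy estimate
```

Each row is used for scoring exactly once, so the averaged accuracy is an estimate of how the model performs on data it was not fitted on.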

max(modelRF$results$Accuracy)

## [1] 0.9933752

max(modelSVM$results$Accuracy)

## [1] 0.9214542

max(modelGBM$results$Accuracy)

## [1] 0.9614905
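The three cross-validated accuracies printed above can be turned into expected out-of-sample error rates with a short calculation. The vector below simply copies the reported values; the names are chosen for this sketch.

```r
# Cross-validated accuracies as reported above.
cvAccuracy <- c(rf = 0.9933752, svmRadial = 0.9214542, gbm = 0.9614905)

# Expected out-of-sample error is one minus the accuracy.
cvError <- 1 - cvAccuracy
print(round(100 * cvError, 2))          # error in percent

bestModel <- names(which.max(cvAccuracy))
print(bestModel)
```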

Since Random Forest is the model with the highest cross-validated accuracy (0.9934), it will be evaluated on the validation set (myTest). With the Random Forest model, the out-of-sample error is expected to be less than 1%.

predictRF <- predict( modelRF, myTest )
confusionMatrix( predictRF, myTest$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    9    0    0    0
##          B    2 1129    3    1    0
##          C    0    1 1016    6    5
##          D    0    0    7  954    4
##          E    0    0    0    3 1073
## 
## Overall Statistics
##                                          
##                Accuracy : 0.993          
##                  95% CI : (0.9906, 0.995)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9912         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9912   0.9903   0.9896   0.9917
## Specificity            0.9979   0.9987   0.9975   0.9978   0.9994
## Pos Pred Value         0.9946   0.9947   0.9883   0.9886   0.9972
## Neg Pred Value         0.9995   0.9979   0.9979   0.9980   0.9981
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1918   0.1726   0.1621   0.1823
## Detection Prevalence   0.2856   0.1929   0.1747   0.1640   0.1828
## Balanced Accuracy      0.9983   0.9950   0.9939   0.9937   0.9955

From the result above, the accuracy of the Random Forest model on the validation set (0.993) is about the same as its cross-validated accuracy on the training set (0.9934), which suggests that the model is not overfitting and can be run on the test set to get the 20 predictions. The validation error of about 0.7% matches the expected out-of-sample error well.

The predictions of the model for the 20 test cases are given below.
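The overall accuracy and the per-class statistics can be recomputed by hand from the confusion matrix counts printed above. This is a small sketch; the matrix is typed in from the output, with predictions in rows and reference classes in columns.

```r
# Confusion matrix counts copied from the output above.
cm <- matrix(
  c(1672,    9,    0,   0,    0,
       2, 1129,    3,   1,    0,
       0,    1, 1016,   6,    5,
       0,    0,    7, 954,    4,
       0,    0,    0,   3, 1073),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all cases
sensitivity <- diag(cm) / colSums(cm)    # per reference class
posPredVal  <- diag(cm) / rowSums(cm)    # per predicted class

print(round(accuracy, 4))
print(round(100 * (1 - accuracy), 2))    # validation error in percent
print(round(sensitivity, 4))
print(round(posPredVal, 4))
```

The diagonal holds the correctly classified cases, so 5844 of the 5885 validation cases are predicted correctly.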

predict(modelRF, newdata=pmlTesting)

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E