Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. Six participants were asked to perform barbell lifts in five different ways: correctly (classe = A) and incorrectly in four distinct ways (classe = B, C, D, or E).
Using data from accelerometers on the belt, forearm, arm, and dumbbell of the six participants, I will build a machine learning algorithm to predict activity quality from the activity monitors.
Load the packages required for this analysis
library(caret)
library(randomForest)
Download training data
file <- 'http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv' # training data file
download.file(file, destfile="trainingData.csv")
totalData <- read.csv("./trainingData.csv")
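Note that the raw file encodes missing values in more than one way: besides NAs there are blank fields and Excel-style "#DIV/0!" strings. If those should be treated as NA from the start, read.csv can be told so directly; the call below is a sketch of that variant, not the call used above.
# Sketch: treat blanks and "#DIV/0!" strings as NA while reading
totalDataNA <- read.csv("./trainingData.csv", na.strings = c("NA", "", "#DIV/0!"))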
Which features from the raw data should be used to predict the variable classe? We want features that capture and explain the data, but we also need to balance information loss against summarization. When in doubt, err on the side of keeping more information and creating more features.
Looking at summary(totalData), I removed the identifier variables (user name, timestamps), which will not be used as predictors, and also removed all variables that have 19,216 rows of blank or NA values. Below is the reduced data set that will be used to fit a model to predict classe.
Eliminate identifiers and variables that are largely NA or blank
completeData <- totalData[,c(8:11, 37:49, 60:68, 84:86, 102, 113:124, 140, 151:160)]
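The hard-coded column indices above could also be derived programmatically. Below is a minimal sketch, assuming a 95% missingness threshold and that the first seven columns are the identifiers; both are assumptions, not taken from the analysis above.
# Sketch: drop columns that are mostly NA or blank, then the identifier columns
mostlyEmpty <- sapply(totalData, function(col) mean(is.na(col) | col == "") > 0.95)
completeDataAlt <- totalData[, !mostlyEmpty]
completeDataAlt <- completeDataAlt[, -(1:7)] # assumed identifier columns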
I split the training data into a 75% training set and a 25% validation set so that I can test the algorithm against a preliminary test set before running it against the final test set, which can only be used once.
Partition the training data into training and validation sets
inTrain <- createDataPartition(y=completeData$classe, p=0.75, list=FALSE)
training <- completeData[inTrain,]
validation <- completeData[-inTrain,]
dim(training)
## [1] 14718 53
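Since createDataPartition samples within each level of classe, the class proportions should be roughly preserved in both sets. A quick check along these lines (a sketch, not part of the original output):
# Sketch: confirm the stratified split preserved the classe distribution
round(prop.table(table(training$classe)), 3)
round(prop.table(table(validation$classe)), 3)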
I tried several methods to model the training set (all with cross-validation). The rpart method had very low accuracy, while the treebag (bagging) and gbm (boosting with trees) methods had very good accuracy, though still lower than random forest. Since random forest gives the best accuracy (at the cost of a much longer processing time), I will use a random forest model with 5-fold cross-validation.
set.seed(556)
modelFit <- train(classe ~ ., data=training, method="rf", trControl = trainControl(method = "cv", number = 5))
modelFit
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 11774, 11774, 11773, 11776, 11775
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 1 0.003 0.004
## 27 1 1 0.003 0.003
## 52 1 1 0.004 0.005
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
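The selected tuning parameter and its cross-validated accuracy can also be read directly off the train object; a minimal sketch:
# Sketch: inspect the winning tuning parameters and best CV accuracy
modelFit$bestTune
max(modelFit$results$Accuracy)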
finMod <- modelFit$finalModel
finMod
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.56%
## Confusion matrix:
## A B C D E class.error
## A 4183 1 1 0 0 0.0004779
## B 21 2823 4 0 0 0.0087781
## C 0 11 2549 7 0 0.0070121
## D 0 0 23 2387 2 0.0103648
## E 0 1 5 7 2693 0.0048041
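The OOB estimate quoted above can be extracted programmatically, since the randomForest object stores the error rate after each tree; a minimal sketch:
# Sketch: OOB error rate after the final (500th) tree
finMod$err.rate[finMod$ntree, "OOB"]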
set.seed(557)
modelFit2 <- train(classe ~ ., data=training, method="rpart", trControl = trainControl(method = "cv", number = 5))
## Loading required package: rpart
print(modelFit2$finalModel)
## n= 14718
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 14718 10530 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 13495 9323 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -34.15 1184 8 A (0.99 0.0068 0 0 0) *
## 5) pitch_forearm>=-34.15 12311 9315 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 436.5 10361 7436 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122.5 6430 3810 A (0.41 0.18 0.19 0.17 0.061) *
## 21) roll_forearm>=122.5 3931 2645 C (0.078 0.18 0.33 0.23 0.19) *
## 11) magnet_dumbbell_y>=436.5 1950 966 B (0.036 0.5 0.041 0.23 0.19) *
## 3) roll_belt>=130.5 1223 13 E (0.011 0 0 0 0.99) *
modelFit2
## CART
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 11775, 11775, 11773, 11774, 11775
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.03 0.5 0.36 0.01 0.02
## 0.06 0.5 0.29 0.06 0.10
## 0.11 0.3 0.02 0.03 0.05
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03484.
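The single CART tree printed above is small enough to visualize. A sketch using base graphics (the rattle package's fancyRpartPlot would be a prettier alternative):
# Sketch: plot the fitted classification tree
plot(modelFit2$finalModel, uniform = TRUE, margin = 0.1)
text(modelFit2$finalModel, use.n = TRUE, cex = 0.7)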
set.seed(558)
library(ipred)
library(plyr)
modelFit3 <- train(classe ~ ., data=training, method="treebag", trControl = trainControl(method = "cv", number = 5))
print(modelFit3$finalModel)
##
## Bagging classification trees with 25 bootstrap replications
modelFit3
## Bagged CART
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 11774, 11775, 11774, 11774, 11775
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 1 1 0.003 0.004
##
##
set.seed(559)
library(gbm)
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
modelFit4 <- train(classe ~ ., data=training, method="gbm", verbose=FALSE, trControl = trainControl(method = "cv", number = 5))
print(modelFit4$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 40 had non-zero influence.
modelFit4
## Stochastic Gradient Boosting
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 11775, 11773, 11775, 11775, 11774
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
## 1 50 0.8 0.7 0.008 0.010
## 1 100 0.8 0.8 0.008 0.010
## 1 150 0.9 0.8 0.006 0.007
## 2 50 0.9 0.8 0.008 0.010
## 2 100 0.9 0.9 0.004 0.006
## 2 150 0.9 0.9 0.008 0.010
## 3 50 0.9 0.9 0.005 0.007
## 3 100 0.9 0.9 0.010 0.013
## 3 150 1.0 0.9 0.009 0.011
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3 and shrinkage = 0.1.
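With all four models fit under 5-fold cross-validation, caret's resamples function can summarize the accuracy comparison described earlier in one table. This is a sketch; strictly speaking, a paired comparison would require the models to share the same resampling indices, which the separate seeds above do not guarantee.
# Sketch: compare cross-validated accuracy across the four models
results <- resamples(list(rf = modelFit, rpart = modelFit2,
                          treebag = modelFit3, gbm = modelFit4))
summary(results)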
List the 20 most important variables in the random forest model
varImp(modelFit, scale = FALSE)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 1493
## pitch_forearm 920
## yaw_belt 869
## pitch_belt 726
## magnet_dumbbell_z 697
## magnet_dumbbell_y 670
## roll_forearm 634
## accel_dumbbell_y 356
## accel_forearm_x 276
## magnet_dumbbell_x 276
## roll_dumbbell 273
## magnet_belt_z 251
## accel_dumbbell_z 239
## accel_belt_z 232
## magnet_forearm_z 229
## total_accel_dumbbell 222
## magnet_belt_y 197
## gyros_belt_z 192
## yaw_arm 177
## magnet_belt_x 173
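caret can plot this ranking directly as a dotplot; a minimal sketch:
# Sketch: dotplot of the top 20 predictors by importance
plot(varImp(modelFit, scale = FALSE), top = 20)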
The out of sample error is the error rate obtained when the fitted model is run on a new data set. Since I created a validation set, I can run the model against it to check accuracy, and then use those results to further refine the model before running it one time only on the test data set.
Run prediction on validation data and check accuracy
pred2 <- predict(modelFit, validation)
confusionMatrix(pred2, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1393 6 0 0 0
## B 1 939 8 0 0
## C 0 4 841 7 1
## D 0 0 5 796 1
## E 1 0 1 1 899
##
## Overall Statistics
##
## Accuracy : 0.993
## 95% CI : (0.99, 0.995)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.991
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.999 0.989 0.984 0.990 0.998
## Specificity 0.998 0.998 0.997 0.999 0.999
## Pos Pred Value 0.996 0.991 0.986 0.993 0.997
## Neg Pred Value 0.999 0.997 0.997 0.998 1.000
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.191 0.171 0.162 0.183
## Detection Prevalence 0.285 0.193 0.174 0.164 0.184
## Balanced Accuracy 0.998 0.994 0.990 0.994 0.999
The accuracy of the model on the validation data set is 99.3%, with a 95% confidence interval of (0.99, 0.995). The estimated out of sample error rate on the validation data set is therefore about 0.7%. With this error rate, no further refinements to the model are required.
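The same figure can be computed directly from the confusionMatrix object; a minimal sketch:
# Sketch: out of sample error = 1 - validation accuracy
1 - confusionMatrix(pred2, validation$classe)$overall["Accuracy"]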
Finally, run the model, which was developed on the training data set and tested against the validation data set, one time only against the test data set.
Run prediction on the test data
file2 <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv" #this is testing file
download.file(file2,destfile="testData.csv")
testData <- read.csv("./testData.csv")
testData <- testData[,c(8:11, 37:49, 60:68, 84:86, 102, 113:124, 140, 151:160)]
# names(testData) # optional check that the test columns match the training predictors
pred3 <- predict(modelFit, testData)
pred3
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
All 20 predictions were scored as correct in the submission portion of this project, for 100% accuracy.
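For the submission itself, each prediction is typically written to its own text file. The sketch below assumes a problem_id_N.txt naming convention, which is not specified above.
# Sketch: write one prediction per file for submission
# (the file-naming convention is an assumption)
for (i in seq_along(pred3)) {
  writeLines(as.character(pred3[i]), paste0("problem_id_", i, ".txt"))
}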