Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. Six participants were asked to perform barbell lifts in five different ways: correctly (classe = A) and incorrectly in four distinct ways (classe = B, C, D, or E).
Using data from accelerometers on the belt, forearm, arm, and dumbbell of the six participants, I will build a machine learning algorithm to predict activity quality from the activity monitors.
Load the packages required for this analysis
library(caret)
library(randomForest)
Download training data
file <- 'http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv' # training data file
download.file(file, destfile="trainingData.csv")
totalData <- read.csv("./trainingData.csv")
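Note that the raw file encodes missing values in more than one way: besides NAs there are blank fields and Excel-style "#DIV/0!" strings. If those should be treated as NA from the start, read.csv can be told so directly; the call below is a sketch of that variant, not the call used above.
# Sketch: treat blanks and "#DIV/0!" strings as NA while reading
totalDataNA <- read.csv("./trainingData.csv", na.strings = c("NA", "", "#DIV/0!"))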
Which features from the raw data should be used to predict the variable classe? We want features that capture and explain the data, but we also need to balance information loss against summarization. When in doubt, err on the side of keeping more information and creating more features.
Looking at summary(totalData), I removed the identifier variables (user name, timestamps), which will not be used as predictors, and also removed all variables that have 19,216 rows of blank or NA values. Below is the reduced data set that will be used to fit a model to predict classe.
Eliminate identifiers and variables that are largely NA or blank
completeData <- totalData[,c(8:11, 37:49, 60:68, 84:86, 102, 113:124, 140, 151:160)]
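The hard-coded column indices above could also be derived programmatically. Below is a minimal sketch, assuming a 95% missingness threshold and that the first seven columns are the identifiers; both are assumptions, not taken from the analysis above.
# Sketch: drop columns that are mostly NA or blank, then the identifier columns
mostlyEmpty <- sapply(totalData, function(col) mean(is.na(col) | col == "") > 0.95)
completeDataAlt <- totalData[, !mostlyEmpty]
completeDataAlt <- completeDataAlt[, -(1:7)] # assumed identifier columns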
I split the training data into a 75% training set and a 25% validation set so that I can test the algorithm against a preliminary test set before running it against the final test set, which can only be used once.
Partition the training data into training and validation sets
inTrain <- createDataPartition(y=completeData$classe, p=0.75, list=FALSE)
training <- completeData[inTrain,]
validation <- completeData[-inTrain,]
dim(training)
## [1] 14718 53
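Since createDataPartition samples within each level of classe, the class proportions should be roughly preserved in both sets. A quick check along these lines (a sketch, not part of the original output):
# Sketch: confirm the stratified split preserved the classe distribution
round(prop.table(table(training$classe)), 3)
round(prop.table(table(validation$classe)), 3)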
I tried several methods to model the training set (all with cross-validation). The rpart method had very low accuracy, while the treebag (bagging) and gbm (boosting with trees) methods had very good accuracy, though still lower than random forest. Since random forest gives the best accuracy (at the cost of a much longer processing time), I will use a random forest model with 5-fold cross-validation.
set.seed(556)
modelFit <- train(classe ~ ., data=training, method="rf", trControl = trainControl(method = "cv", number = 5))
modelFit
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 11774, 11774, 11773, 11776, 11775
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 1 0.003 0.004
## 27 1 1 0.003 0.003
## 52 1 1 0.004 0.005
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
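The selected tuning parameter and its cross-validated accuracy can also be read directly off the train object; a minimal sketch:
# Sketch: inspect the winning tuning parameters and best CV accuracy
modelFit$bestTune
max(modelFit$results$Accuracy)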
finMod <- modelFit$finalModel
finMod
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.56%
## Confusion matrix:
## A B C D E class.error
## A 4183 1 1 0 0 0.0004779
## B 21 2823 4 0 0 0.0087781
## C 0 11 2549 7 0 0.0070121
## D 0 0 23 2387 2 0.0103648
## E 0 1 5 7 2693 0.0048041
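The OOB estimate quoted above can be extracted programmatically, since the randomForest object stores the error rate after each tree; a minimal sketch:
# Sketch: OOB error rate after the final (500th) tree
finMod$err.rate[finMod$ntree, "OOB"]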
set.seed(557)
modelFit2 <- train(classe ~ ., data=training, method="rpart", trControl = trainControl(method = "cv", number = 5))
## Loading required package: rpart
print(modelFit2$finalModel)
## n= 14718
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 14718 10530 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 13495 9323 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -34.15 1184 8 A (0.99 0.0068 0 0 0) *
## 5) pitch_forearm>=-34.15 12311 9315 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 436.5 10361 7436 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122.5 6430 3810 A (0.41 0.18 0.19 0.17 0.061) *
## 21) roll_forearm>=122.5 3931 2645 C (0.078 0.18 0.33 0.23 0.19) *
## 11) magnet_dumbbell_y>=436.5 1950 966 B (0.036 0.5 0.041 0.23 0.19) *
## 3) roll_belt>=130.5 1223 13 E (0.011 0 0 0 0.99) *
modelFit2
## CART
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 11775, 11775, 11773, 11774, 11775
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.03 0.5 0.36 0.01 0.02
## 0.06 0.5 0.29 0.06 0.10
## 0.11 0.3 0.02 0.03 0.05
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03484.
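The single CART tree printed above is small enough to visualize. A sketch using base graphics (the rattle package's fancyRpartPlot would be a prettier alternative):
# Sketch: plot the fitted classification tree
plot(modelFit2$finalModel, uniform = TRUE, margin = 0.1)
text(modelFit2$finalModel, use.n = TRUE, cex = 0.7)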
set.seed(558)
library(ipred)
library(plyr)
modelFit3 <- train(classe ~ ., data=training, method="treebag", trControl = trainControl(method = "cv", number = 5))
print(modelFit3$finalModel)
##
## Bagging classification trees with 25 bootstrap replications
modelFit3
## Bagged CART
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 11774, 11775, 11774, 11774, 11775
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 1 1 0.003 0.004
##
##
set.seed(559)
library(gbm)
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
modelFit4 <- train(classe ~ ., data=training, method="gbm", verbose=FALSE, trControl = trainControl(method = "cv", number = 5))
print(modelFit4$finalModel)
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 40 had non-zero influence.
modelFit4
## Stochastic Gradient Boosting
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
##
## Summary of sample sizes: 11775, 11773, 11775, 11775, 11774
##
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
## 1 50 0.8 0.7 0.008 0.010
## 1 100 0.8 0.8 0.008 0.010
## 1 150 0.9 0.8 0.006 0.007
## 2 50 0.9 0.8 0.008 0.010
## 2 100 0.9 0.9 0.004 0.006
## 2 150 0.9 0.9 0.008 0.010
## 3 50 0.9 0.9 0.005 0.007
## 3 100 0.9 0.9 0.010 0.013
## 3 150 1.0 0.9 0.009 0.011
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3 and shrinkage = 0.1.
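With all four models fit under 5-fold cross-validation, caret's resamples function can summarize the accuracy comparison described earlier in one table. This is a sketch; strictly speaking, a paired comparison would require the models to share the same resampling indices, which the separate seeds above do not guarantee.
# Sketch: compare cross-validated accuracy across the four models
results <- resamples(list(rf = modelFit, rpart = modelFit2,
                          treebag = modelFit3, gbm = modelFit4))
summary(results)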
List the 20 most important variables in the random forest model
varImp(modelFit, scale = FALSE)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 1493
## pitch_forearm 920
## yaw_belt 869
## pitch_belt 726
## magnet_dumbbell_z 697
## magnet_dumbbell_y 670
## roll_forearm 634
## accel_dumbbell_y 356
## accel_forearm_x 276
## magnet_dumbbell_x 276
## roll_dumbbell 273
## magnet_belt_z 251
## accel_dumbbell_z 239
## accel_belt_z 232
## magnet_forearm_z 229
## total_accel_dumbbell 222
## magnet_belt_y 197
## gyros_belt_z 192
## yaw_arm 177
## magnet_belt_x 173
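caret can plot this ranking directly as a dotplot; a minimal sketch:
# Sketch: dotplot of the top 20 predictors by importance
plot(varImp(modelFit, scale = FALSE), top = 20)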
The out of sample error is the error rate obtained when the fitted model is run on a new data set. Since I created a validation set, I can run the model against it to check accuracy, and then use those results to further refine the model before running it one time only on the test data set.
Run prediction on validation data and check accuracy
pred2 <- predict(modelFit, validation)
confusionMatrix(pred2, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1393 6 0 0 0
## B 1 939 8 0 0
## C 0 4 841 7 1
## D 0 0 5 796 1
## E 1 0 1 1 899
##
## Overall Statistics
##
## Accuracy : 0.993
## 95% CI : (0.99, 0.995)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.991
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.999 0.989 0.984 0.990 0.998
## Specificity 0.998 0.998 0.997 0.999 0.999
## Pos Pred Value 0.996 0.991 0.986 0.993 0.997
## Neg Pred Value 0.999 0.997 0.997 0.998 1.000
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.191 0.171 0.162 0.183
## Detection Prevalence 0.285 0.193 0.174 0.164 0.184
## Balanced Accuracy 0.998 0.994 0.990 0.994 0.999
The accuracy of the model on the validation data set is 99.3%, with a 95% confidence interval of (0.99, 0.995). The estimated out of sample error rate on the validation data set is therefore about 0.7%. With this error rate, no further refinements to the model are required.
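The same figure can be computed directly from the confusionMatrix object; a minimal sketch:
# Sketch: out of sample error = 1 - validation accuracy
1 - confusionMatrix(pred2, validation$classe)$overall["Accuracy"]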
Finally, run the model, which was developed on the training data set and tested against the validation data set, one time only against the test data set.
Run prediction on the test data
file2 <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv" #this is testing file
download.file(file2,destfile="testData.csv")
testData <- read.csv("./testData.csv")
testData <- testData[,c(8:11, 37:49, 60:68, 84:86, 102, 113:124, 140, 151:160)]
# names(testData) # optional check that the test columns match the training predictors
pred3 <- predict(modelFit, testData)
pred3
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
All 20 predictions were scored as correct in the submission portion of this project, for 100% accuracy.
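For the submission itself, each prediction is typically written to its own text file. The sketch below assumes a problem_id_N.txt naming convention, which is not specified above.
# Sketch: write one prediction per file for submission
# (the file-naming convention is an assumption)
for (i in seq_along(pred3)) {
  writeLines(as.character(pred3[i]), paste0("problem_id_", i, ".txt"))
}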