Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal is to predict the manner in which they did the exercise (the classe outcome).
# Read the raw training and testing CSV files
training<-read.table("pml-training.csv", header=TRUE, sep=",")
testing<-read.table("pml-testing.csv", header=TRUE, sep=",")
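The raw pml CSV files often encode missing values as "NA", empty strings, and "#DIV/0!". If that holds for your copies of the files, an optional variant of the read step that converts those codes to NA is sketched below (not required for the column subset used in this analysis):
# Optional sketch: treat common missing-value codes as NA on import
# (assumes the files use "NA", "", and "#DIV/0!" for missing entries)
training<-read.csv("pml-training.csv", na.strings=c("NA", "", "#DIV/0!"))
testing<-read.csv("pml-testing.csv", na.strings=c("NA", "", "#DIV/0!"))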
library('caret')
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(gridExtra)
## Loading required package: grid
# Keep only the raw sensor measurements: accel, total_accel, roll, pitch, yaw, magnet, and gyro columns
trainingaccel<-grepl("^accel",names(training))
trainingtotal<-grepl("^total",names(training))
roll<-grepl("^roll",names(training))
pitch<-grepl("^pitch",names(training))
yaw<-grepl("^yaw",names(training))
magnet<-grepl("^magnet",names(training))
gyro<-grepl("^gyro",names(training))
acceldata<-training[ ,trainingaccel]
rolldata<-training[ ,roll]
pitchdata<-training[ ,pitch]
yawdata<-training[,yaw]
magnetdata<-training[,magnet]
gyrodata<-training[,gyro]
totaldata<-training[,trainingtotal]
# Combine the 52 selected predictors with the outcome (column 160) and name it 'Classe'
trainClasse<-cbind(acceldata,rolldata,pitchdata,yawdata,magnetdata,gyrodata,totaldata,training[ ,160])
colnames(trainClasse)[53]<-'Classe'
# Apply the same column selection to the testing set
testingaccel<-grepl("^accel",names(testing))
testingtotal<-grepl("^total",names(testing))
troll<-grepl("^roll",names(testing))
tpitch<-grepl("^pitch",names(testing))
tyaw<-grepl("^yaw",names(testing))
tmagnet<-grepl("^magnet",names(testing))
tgyro<-grepl("^gyro",names(testing))
tacceldata<-testing[ ,testingaccel]
trolldata<-testing[ ,troll]
tpitchdata<-testing[,tpitch]
tyawdata<-testing[,tyaw]
tmagnetdata<-testing[,tmagnet]
tgyrodata<-testing[,tgyro]
ttotaldata<-testing[,testingtotal]
# Combine the testing predictors with the final column (the problem id)
testClasse<-cbind(tacceldata,trolldata,tpitchdata,tyawdata,tmagnetdata,tgyrodata,ttotaldata,testing[ ,160])
colnames(testClasse)[53]<-'problem.id'
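Before modeling, a quick sanity check (a sketch using the objects built above) confirms that the 52 predictor columns selected from the training and testing sets line up:
# Sanity check: the 52 predictor names should match between the two sets
stopifnot(identical(names(trainClasse)[1:52], names(testClasse)[1:52]))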
# Split the cleaned training data 60/40 into a training subset and a validation subset, stratified by Classe
set.seed(400)
inTrain<-createDataPartition(trainClasse$Classe, p=.60)[[1]]
trainingsubset<-trainClasse[inTrain,]
testingsubset<-trainClasse[-inTrain,]
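Because createDataPartition stratifies the split on Classe, both subsets should preserve the class proportions of the full cleaned training set; a quick check is sketched below:
# Class proportions in the 60% training and 40% validation subsets
round(prop.table(table(trainingsubset$Classe)), 3)
round(prop.table(table(testingsubset$Classe)), 3)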
# Fit a classification tree (rpart) to the training subset
set.seed(400)
modFit<-train(Classe~., method="rpart", data=trainingsubset)
## Loading required package: rpart
# Plot the final classification tree
fancyRpartPlot(modFit$finalModel,cex=.5,under.cex=1,shadow.offset=0)
# Evaluate the tree on the held-out validation subset
classepredict<-predict(modFit,testingsubset)
confusionMatrix(testingsubset$Classe,classepredict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1349 233 479 166 5
## B 247 864 337 70 0
## C 41 55 1078 194 0
## D 72 183 679 352 0
## E 19 355 360 68 640
##
## Overall Statistics
##
## Accuracy : 0.5459
## 95% CI : (0.5348, 0.5569)
## No Information Rate : 0.3738
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4307
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7807 0.5112 0.3675 0.41412 0.99225
## Specificity 0.8557 0.8938 0.9410 0.86650 0.88863
## Pos Pred Value 0.6044 0.5692 0.7880 0.27372 0.44383
## Neg Pred Value 0.9325 0.8695 0.7136 0.92409 0.99922
## Prevalence 0.2202 0.2154 0.3738 0.10834 0.08221
## Detection Rate 0.1719 0.1101 0.1374 0.04486 0.08157
## Detection Prevalence 0.2845 0.1935 0.1744 0.16391 0.18379
## Balanced Accuracy 0.8182 0.7025 0.6543 0.64031 0.94044
Tested on the validation subset, this model achieves only 54.6% accuracy, modestly better than the no-information rate of 37.4%. The variables used in the tree include roll_belt, pitch_forearm, yaw_belt, magnet_dumbbell_z, pitch_belt, and magnet_dumbbell_x.
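The splitting variables listed above can be read directly off the fitted tree; one way to list them, assuming modFit is the caret/rpart fit created above, is:
# Variables actually used for splits in the final rpart tree (leaf rows excluded)
splitvars<-unique(as.character(modFit$finalModel$frame$var))
setdiff(splitvars, "<leaf>")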
The rpart model was not as accurate as we hoped, so we next fit a random forest model to see whether that method fits the data better.
# Fit a random forest with 4-fold cross-validation
set.seed(400)
modFit2<-train(Classe~., method="rf", trControl=trainControl(method="cv", number=4), data=trainingsubset)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
print(modFit2)
## Random Forest
##
## 11776 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
##
## Summary of sample sizes: 8832, 8830, 8833, 8833
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9869228 0.9834553 0.003447160 0.004361512
## 27 0.9875177 0.9842103 0.005664688 0.007164399
## 52 0.9827623 0.9781937 0.005106623 0.006457005
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
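The selected tuning parameter and its cross-validated accuracy can also be pulled out of the caret train object programmatically; a short sketch:
# Best mtry and its cross-validated accuracy and kappa
modFit2$bestTune
subset(modFit2$results, mtry == modFit2$bestTune$mtry, select=c(mtry, Accuracy, Kappa))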
varImp(modFit2)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 57.62
## yaw_belt 56.86
## pitch_belt 44.63
## magnet_dumbbell_z 43.16
## magnet_dumbbell_y 41.88
## roll_forearm 39.24
## accel_dumbbell_y 20.01
## accel_forearm_x 18.58
## magnet_dumbbell_x 18.37
## roll_dumbbell 17.88
## magnet_belt_z 16.15
## accel_belt_z 14.32
## magnet_forearm_z 13.41
## accel_dumbbell_z 13.32
## total_accel_dumbbell 13.00
## yaw_arm 11.60
## magnet_belt_y 11.01
## magnet_belt_x 10.85
## gyros_belt_z 10.44
# Out-of-sample performance: predict on the held-out validation subset
classepredict2<-predict(modFit2,testingsubset)
confusionMatrix(testingsubset$Classe,classepredict2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2231 0 0 0 1
## B 9 1501 8 0 0
## C 0 16 1349 3 0
## D 0 1 15 1270 0
## E 0 0 4 1 1437
##
## Overall Statistics
##
## Accuracy : 0.9926
## 95% CI : (0.9905, 0.9944)
## No Information Rate : 0.2855
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9906
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9960 0.9888 0.9804 0.9969 0.9993
## Specificity 0.9998 0.9973 0.9971 0.9976 0.9992
## Pos Pred Value 0.9996 0.9888 0.9861 0.9876 0.9965
## Neg Pred Value 0.9984 0.9973 0.9958 0.9994 0.9998
## Prevalence 0.2855 0.1935 0.1754 0.1624 0.1833
## Detection Rate 0.2843 0.1913 0.1719 0.1619 0.1832
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9979 0.9931 0.9887 0.9972 0.9993
The random forest model reaches 99.3% accuracy on the validation subset, far superior to the rpart model. Sensitivity and specificity are above 98% for every class. The five most important variables were roll_belt, pitch_forearm, yaw_belt, pitch_belt, and magnet_dumbbell_z.
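caret can also plot the ranked importances, which makes the drop-off after the top few variables easy to see; a minimal sketch:
# Dot plot of the 10 most important predictors in the random forest
plot(varImp(modFit2), top=10)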
The in-sample error is the error obtained when the model predicts the same training subset it was fit on. It will generally be lower than the error obtained when the model predicts a dataset it has not seen (the out-of-sample error).
# In-sample performance: predict on the same training subset the model was fit on
insamplepredict<-predict(modFit2,trainingsubset)
confusionMatrix(trainingsubset$Classe,insamplepredict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3348 0 0 0 0
## B 0 2279 0 0 0
## C 0 0 2054 0 0
## D 0 0 0 1930 0
## E 0 0 0 0 2165
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
The in-sample accuracy is a perfect 100%, while the out-of-sample accuracy on the held-out validation subset (shown earlier) is 99.3%, so the expected out-of-sample error is roughly 0.7%.
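That expected out-of-sample error can be computed as one minus the held-out accuracy; a short sketch using the validation predictions from above:
# Estimated out-of-sample error from the 40% validation subset (about 0.007)
oos<-confusionMatrix(testingsubset$Classe, classepredict2)
1 - unname(oos$overall["Accuracy"])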
# Predict the classe for the 20 test cases in pml-testing.csv
testinganswers<-predict(modFit2, newdata=testing)
print(testinganswers)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
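If each of the 20 predictions needs to be saved to its own text file for submission, a minimal sketch is below; the filename pattern and this submission format are assumptions, not part of the analysis above:
# Write each prediction to its own file (hypothetical filename pattern)
for (i in seq_along(testinganswers)) {
  write.table(as.character(testinganswers[i]),
              file=paste0("problem_id_", i, ".txt"),
              quote=FALSE, row.names=FALSE, col.names=FALSE)
}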
As the results above show, the random forest was a better model than rpart for predicting exercise quality. The exercise class depends on many sensor variables and the interactions between them, which the random forest captures well. The random forest model achieved over 99% accuracy and generalized well to the held-out subsample of the data.