Practical Machine Learning Course Project

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal is to predict the manner in which they did the exercise.

Loading the data and necessary packages

training<-read.table("pml-training.csv", header=TRUE, sep=",")
testing<-read.table("pml-testing.csv",header=TRUE, sep=",")
library('caret')
## Warning: package 'caret' was built under R version 3.1.3
## Loading required package: lattice
## Loading required package: ggplot2
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(gridExtra)
## Loading required package: grid

Cleaning the training data set

# Flag the sensor columns by name prefix; columns with any other prefix are dropped
trainingaccel<-grepl("^accel",names(training))
trainingtotal<-grepl("^total",names(training))
roll<-grepl("^roll",names(training))
pitch<-grepl("^pitch",names(training))
yaw<-grepl("^yaw",names(training))
magnet<-grepl("^magnet",names(training))
gyro<-grepl("^gyro",names(training))
# Extract each group of columns
acceldata<-training[ ,trainingaccel]
rolldata<-training[ ,roll]
pitchdata<-training[ ,pitch]
yawdata<-training[,yaw]
magnetdata<-training[,magnet]
gyrodata<-training[,gyro]
totaldata<-training[,trainingtotal]
# Combine the 52 predictors with the outcome (column 160), which becomes column 53
trainClasse<-cbind(acceldata,rolldata,pitchdata,yawdata,magnetdata,gyrodata,totaldata,training[ ,160])
colnames(trainClasse)[53]<-'Classe'
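
The seven prefix filters above can also be collapsed into a single pattern. The following is a minimal sketch on a made-up data frame; the toy columns and the helper name select_sensor_cols are illustrative assumptions, not part of the original analysis. Unlike the code above, it keeps the columns in their original order rather than grouping them by prefix, so column positions would differ from those in trainClasse.

```r
# Hypothetical helper: keep columns whose names start with a sensor prefix.
select_sensor_cols <- function(df) {
  keep <- grepl("^(accel|roll|pitch|yaw|magnet|gyro|total)", names(df))
  df[, keep, drop = FALSE]
}

# Toy data standing in for the real training set (values are invented).
toy <- data.frame(
  user_name          = c("a", "b"),
  roll_belt          = c(1.4, 1.5),
  kurtosis_roll_belt = c(NA, NA),
  gyros_arm_x        = c(0.1, 0.2),
  total_accel_belt   = c(3, 4),
  classe             = c("A", "B")
)

sensor_cols <- select_sensor_cols(toy)
names(sensor_cols)
```

The anchor `^` matters here: kurtosis_roll_belt contains "roll" but does not start with it, so it is excluded.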

Cleaning the testing data set

# Apply the same prefix-based selection to the testing file
testingaccel<-grepl("^accel",names(testing))
testingtotal<-grepl("^total",names(testing))
troll<-grepl("^roll",names(testing))
tpitch<-grepl("^pitch",names(testing))
tyaw<-grepl("^yaw",names(testing))
tmagnet<-grepl("^magnet",names(testing))
tgyro<-grepl("^gyro",names(testing))
tacceldata<-testing[ ,testingaccel]
trolldata<-testing[ ,troll]
tpitchdata<-testing[,tpitch]
tyawdata<-testing[,tyaw]
tmagnetdata<-testing[,tmagnet]
tgyrodata<-testing[,tgyro]
ttotaldata<-testing[,testingtotal]
# Column 160 of the testing file is kept as column 53 and renamed problem.id
testClasse<-cbind(tacceldata,trolldata,tpitchdata,tyawdata,tmagnetdata,tgyrodata,ttotaldata,testing[ ,160])
colnames(testClasse)[53]<-'problem.id'

Creating a training and testing subset

set.seed(400)
inTrain = createDataPartition(trainClasse$Classe, p = .60)[[1]]
trainingsubset = trainClasse[ inTrain,]
testingsubset = trainClasse[-inTrain,]
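
The 60/40 split above is stratified by class, so both subsets keep roughly the same class proportions. The sketch below illustrates that idea in base R on invented labels. The helper stratified_indices is hypothetical and is not caret's implementation of createDataPartition, so it will not reproduce the same row indices.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_indices <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    n_keep <- round(p * length(idx))
    # sample.int avoids the surprise of sample(x) when x has length one
    idx[sample.int(length(idx), n_keep)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(400)
# Invented outcome vector with unbalanced classes.
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))
in_train <- stratified_indices(y, p = 0.60)

table(y[in_train])   # per-class counts in the training part
table(y[-in_train])  # per-class counts in the held-out part
```

Sampling within each class, rather than over all rows at once, is what keeps a rare class from being under-represented in either subset.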

rpart Model

set.seed(400)
modFit<-train(Classe~.,method="rpart", data=trainingsubset)
## Loading required package: rpart
fancyRpartPlot(modFit$finalModel,cex=.5,under.cex=1,shadow.offset=0)

classepredict=predict(modFit,testingsubset)
confusionMatrix(testingsubset$Classe,classepredict)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1349  233  479  166    5
##          B  247  864  337   70    0
##          C   41   55 1078  194    0
##          D   72  183  679  352    0
##          E   19  355  360   68  640
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5459          
##                  95% CI : (0.5348, 0.5569)
##     No Information Rate : 0.3738          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4307          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7807   0.5112   0.3675  0.41412  0.99225
## Specificity            0.8557   0.8938   0.9410  0.86650  0.88863
## Pos Pred Value         0.6044   0.5692   0.7880  0.27372  0.44383
## Neg Pred Value         0.9325   0.8695   0.7136  0.92409  0.99922
## Prevalence             0.2202   0.2154   0.3738  0.10834  0.08221
## Detection Rate         0.1719   0.1101   0.1374  0.04486  0.08157
## Detection Prevalence   0.2845   0.1935   0.1744  0.16391  0.18379
## Balanced Accuracy      0.8182   0.7025   0.6543  0.64031  0.94044

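Tested on the testing subset, this model reaches 54.6% accuracy: better than the roughly 28% obtained by always guessing the most common class (A), but far too low to be useful. Because confusionMatrix() was called with the true labels as its first argument, the rows labeled Prediction hold the actual classes and the columns labeled Reference hold the model's predictions. The overall accuracy is unaffected by this ordering, but the per-class statistics should be read with it in mind. The variables used in the tree include roll_belt, pitch_forearm, yaw_belt, magnet_dumbbell_z, pitch_belt, and magnet_dumbbell_x.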
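
The reported accuracy can be checked by hand: it is the share of observations on the diagonal of the confusion matrix. A minimal sketch, with the counts copied from the rpart output above (the object name cm_rpart is only for illustration):

```r
# Counts copied from the rpart confusion matrix printed above.
cm_rpart <- matrix(
  c(1349, 233,  479, 166,   5,
     247, 864,  337,  70,   0,
      41,  55, 1078, 194,   0,
      72, 183,  679, 352,   0,
      19, 355,  360,  68, 640),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

correct  <- sum(diag(cm_rpart))  # observations where both labels agree
total    <- sum(cm_rpart)        # all observations in the testing subset
accuracy <- correct / total

round(accuracy, 4)
```

Here 4283 of 7846 observations lie on the diagonal, which gives the 0.5459 shown in the output.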

Random Forest Model

The rpart model was not as accurate as we hoped. We now fit a random forest, using 4-fold cross-validation to choose the mtry tuning parameter, to see whether that method fits the data better.

set.seed(400)
modFit2 <- train(Classe ~ ., method="rf",trControl=trainControl(method = "cv", number = 4), data=trainingsubset)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
print(modFit2)
## Random Forest 
## 
## 11776 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold) 
## 
## Summary of sample sizes: 8832, 8830, 8833, 8833 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9869228  0.9834553  0.003447160  0.004361512
##   27    0.9875177  0.9842103  0.005664688  0.007164399
##   52    0.9827623  0.9781937  0.005106623  0.006457005
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.
varImp(modFit2)
## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt             100.00
## pitch_forearm          57.62
## yaw_belt               56.86
## pitch_belt             44.63
## magnet_dumbbell_z      43.16
## magnet_dumbbell_y      41.88
## roll_forearm           39.24
## accel_dumbbell_y       20.01
## accel_forearm_x        18.58
## magnet_dumbbell_x      18.37
## roll_dumbbell          17.88
## magnet_belt_z          16.15
## accel_belt_z           14.32
## magnet_forearm_z       13.41
## accel_dumbbell_z       13.32
## total_accel_dumbbell   13.00
## yaw_arm                11.60
## magnet_belt_y          11.01
## magnet_belt_x          10.85
## gyros_belt_z           10.44
classepredict2=predict(modFit2,testingsubset)
confusionMatrix(testingsubset$Classe,classepredict2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2231    0    0    0    1
##          B    9 1501    8    0    0
##          C    0   16 1349    3    0
##          D    0    1   15 1270    0
##          E    0    0    4    1 1437
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9926          
##                  95% CI : (0.9905, 0.9944)
##     No Information Rate : 0.2855          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9906          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9960   0.9888   0.9804   0.9969   0.9993
## Specificity            0.9998   0.9973   0.9971   0.9976   0.9992
## Pos Pred Value         0.9996   0.9888   0.9861   0.9876   0.9965
## Neg Pred Value         0.9984   0.9973   0.9958   0.9994   0.9998
## Prevalence             0.2855   0.1935   0.1754   0.1624   0.1833
## Detection Rate         0.2843   0.1913   0.1719   0.1619   0.1832
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9979   0.9931   0.9887   0.9972   0.9993

The random forest model reaches 99.26% accuracy on the testing subset, far superior to the rpart model. Sensitivity and specificity are both above 98% for every class. The five most important variables are roll_belt, pitch_forearm, yaw_belt, pitch_belt, and magnet_dumbbell_z.
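
The per-class statistics can be recomputed from the printed table. The sketch below follows the definitions that reproduce the output above: sensitivity is the diagonal count divided by its column total, and specificity is the share of observations outside that column that also fall outside the matching row. The counts are copied from the random forest matrix; the object names are only for illustration.

```r
# Counts copied from the random forest confusion matrix printed above.
cm_rf <- matrix(
  c(2231,    0,    0,    0,    1,
       9, 1501,    8,    0,    0,
       0,   16, 1349,    3,    0,
       0,    1,   15, 1270,    0,
       0,    0,    4,    1, 1437),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

hits    <- diag(cm_rf)     # agreements per class
col_tot <- colSums(cm_rf)  # totals of the columns labeled Reference
row_tot <- rowSums(cm_rf)  # totals of the rows labeled Prediction
n       <- sum(cm_rf)

sensitivity <- hits / col_tot
specificity <- (n - row_tot - col_tot + hits) / (n - col_tot)

round(rbind(sensitivity, specificity), 4)
```

Class C has the lowest sensitivity, 1349 / 1376, which is still above 98%.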

In Sample & Out of Sample Error

The in-sample error is the error rate obtained when the model predicts the same training set it was fit on. It is expected to be lower than the out-of-sample error, the error rate on data the model has not seen, because the model has already adapted to the training observations.

insamplepredict=predict(modFit2,trainingsubset)
confusionMatrix(trainingsubset$Classe,insamplepredict)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3348    0    0    0    0
##          B    0 2279    0    0    0
##          C    0    0 2054    0    0
##          D    0    0    0 1930    0
##          E    0    0    0    0 2165
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
classepredict2=predict(modFit2,testingsubset)
confusionMatrix(testingsubset$Classe,classepredict2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2231    0    0    0    1
##          B    9 1501    8    0    0
##          C    0   16 1349    3    0
##          D    0    1   15 1270    0
##          E    0    0    4    1 1437
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9926          
##                  95% CI : (0.9905, 0.9944)
##     No Information Rate : 0.2855          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9906          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9960   0.9888   0.9804   0.9969   0.9993
## Specificity            0.9998   0.9973   0.9971   0.9976   0.9992
## Pos Pred Value         0.9996   0.9888   0.9861   0.9876   0.9965
## Neg Pred Value         0.9984   0.9973   0.9958   0.9994   0.9998
## Prevalence             0.2855   0.1935   0.1754   0.1624   0.1833
## Detection Rate         0.2843   0.1913   0.1719   0.1619   0.1832
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9979   0.9931   0.9887   0.9972   0.9993

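The random forest classifies every observation in the training subset correctly, an in-sample error of 0%. On the held-out testing subset it reaches 99.26% accuracy, which gives an estimated out-of-sample error of about 0.74%. The model is then applied to the 20 cases in the testing file.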
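
The out-of-sample error estimate is one minus the accuracy on the testing subset. The sketch below also attaches an exact binomial 95% interval using base R's binom.test; treating each held-out prediction as an independent trial is an assumption of this sketch. The counts are taken from the testing-subset confusion matrix.

```r
# Diagonal of the testing-subset confusion matrix for the random forest.
correct <- 2231 + 1501 + 1349 + 1270 + 1437
total   <- 7846  # sum of all cells in that matrix

oos_accuracy <- correct / total
oos_error    <- 1 - oos_accuracy

# Exact binomial 95% interval for the accuracy, converted to an error interval.
acc_ci   <- binom.test(correct, total)$conf.int
error_ci <- rev(1 - acc_ci)

round(c(error = oos_error, lower = error_ci[1], upper = error_ci[2]), 4)
```

With 58 misclassified observations out of 7846, the point estimate of the out-of-sample error is about 0.74%.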

testinganswers=predict(modFit2, newdata=testing)
print(testinganswers)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Conclusion

As shown above, the random forest was a far better model than rpart for predicting exercise quality. The five classes depend on many sensor variables and on the interactions between them, which may explain why a single tree performed poorly while an ensemble of trees performed well. The random forest reached 98.75% cross-validated accuracy during tuning and 99.26% accuracy on the held-out testing subset.