The project is the final course project for the Practical Machine Learning Course from the Johns Hopkins University Data Specialization on Coursera.
This project explores the use of machine learning to predict correct and incorrect exercise form using measured exercise data collected from various test subjects. The data set is large, including 19,622 observations in the training dataset and 20 additional observations in the test set. Each set includes 160 variables (159 predictors and the classification variable.
Using the random forests methodology and a training class of 20% of the training data, we are able to generate a prediction algorithm whichwith an accuracy of 99.26% when tested on the other 80% of the training set (the quiz set) 58 of the 159 original variables. The prediction algorithm achieved a 100% success rate on the 20 test observations. Using the alternative rpart methodology results in a much lower accuracy rate.
When attempted subsequently on a data set with sequential variables (timestamp, number, etc) using random forests, the accuracy rate remains high although the prediction model fails to accurately predict all of the final test classifications.
The personal activity training data is located in a .csv file called “pml-training.csv”. In this code chunk we load pml-training.csv into a data.frame object called “trainingset”. The test data set, containing 20 observations, is called into a variable called “testingset”.
trainingset<-data.frame(fread(input = "pml-training.csv",header = T,na.strings = "NA"))
testingset<-data.frame(fread(input = "pml-testing.csv",header = T,na.strings = "NA"))
Many of the variables included in the training data are not included in the test data set. The majority of these variables contain no or close to no data in the test set as well. We will remove those variables as they will not be useful in predicting the test variables, leaving 58 remaining predictor variables and 1 classificatoin variable.
valuablecols<-c(2:11,37:49,60:68,84:86,102,113:124,140,151:160)
trainingset<-trainingset[,valuablecols]
testingset<-testingset[,valuablecols]
rm(valuablecols)
Using the random forests package, we train a predictive model (“modFit”) using the training set, which contains 3,927 observations. Due to the long processing time for this segment, the code below calls a model fit previously evaluated using the the commented code instead of evaluating the code in the markdown itself.
Using the createDataPartition function, we next subset the data into a training set and a quiz set from the available data. Due to the large size of the dataset, we limit the training data set to only 20% of the full provided training set and leave the rest of the set available as a quiz set.
inTrain<-createDataPartition(trainingset$classe,
p=.2,
list=FALSE)
train<-trainingset[inTrain,]
quiz<-trainingset[-inTrain,]
rm(inTrain)
The first attempt uses the Random Forests approach and provides a very accurate model. The training algorithm uses the default settings and predicts the classe (classification variable) using all other variables.
# RF20<-train(classe ~ .,
# method="rf",
# data=train)
#
# saveRDS(modFitRF50,file = "rf modfit 20.RDS")
RF20<-readRDS(file="rf modfit 20.RDS")
# plot(RF20)
# RF20
The second attempt uses the rpart method, which provides a lower degree of accuracy. The training algorithm uses the default settings and predicts the classe (classification variable) using all other variables.
# RP20<-train(classe ~ .,
# method="rpart",
# data=train)
# View(train)
# saveRDS(RP20,file = "rpart modfit 20.RDS")
#
RP20<-readRDS(file="rpart modfit 20.RDS")
Using the VarImp function we determine the most significant variables for each training algorithm. First we will consider the Random Forests model by variable imporance.
varImp(RF20)
## rf variable importance
##
## only 20 most important variables shown (out of 80)
##
## Overall
## raw_timestamp_part_1 100.000
## num_window 48.003
## roll_belt 46.787
## pitch_forearm 27.932
## magnet_dumbbell_z 22.079
## pitch_belt 15.137
## magnet_dumbbell_y 13.820
## roll_forearm 12.834
## cvtd_timestamp30/11/2011 17:12 12.704
## yaw_belt 11.905
## cvtd_timestamp02/12/2011 14:58 11.695
## magnet_dumbbell_x 11.300
## magnet_belt_y 7.840
## cvtd_timestamp28/11/2011 14:15 7.125
## accel_dumbbell_y 6.873
## roll_dumbbell 6.482
## cvtd_timestamp02/12/2011 13:33 6.478
## accel_forearm_x 5.766
## accel_belt_z 5.363
## accel_dumbbell_z 5.029
Then the RPart by Variable Importance.
varImp(RP20)
## rpart variable importance
##
## only 20 most important variables shown (out of 91)
##
## Overall
## raw_timestamp_part_1 100.00
## pitch_forearm 86.29
## roll_belt 84.20
## roll_dumbbell 68.70
## magnet_dumbbell_y 64.77
## accel_belt_z 62.88
## cvtd_timestamp28/11/2011 14:13 62.07
## num_window 61.20
## accel_forearm_x 58.66
## total_accel_belt 47.41
## yaw_belt 46.02
## cvtd_timestamp30/11/2011 17:12 41.66
## cvtd_timestamp30/11/2011 17:11 40.52
## accel_dumbbell_y 39.55
## magnet_belt_y 36.12
## pitch_belt 35.14
## magnet_dumbbell_z 32.53
## cvtd_timestamp05/12/2011 11:24 32.27
## cvtd_timestamp02/12/2011 14:57 31.68
## cvtd_timestamp28/11/2011 14:14 31.22
Both training algorithms rely heavily on the raw_timestamp_part_1 variable, a variable representing the time the observation was taken. Random Forests uses num_window, an ordered variable, as well as roll_belt, pitch_forearm, magnet_dumbbell_z, and other observed spatial orientation variables. Rpart uses pitch_forearm, roll_belt, roll_dumbbell, and other variables.
First we analyze the correlation between the y variable and the timestamp and numeric variables. Both of these show a cascading pattern compared against the classifier variable, consistent with the reality that the exercise classifications were clumped together and performed sequentially by the test subjects.
featurePlot(x=trainingset[,c("raw_timestamp_part_1", "num_window")],
y=trainingset$classe,
plot="pairs")
For a more realistic analysis, it is important to disregard these variables, as we will discuss further below. This chart represents the correlation between the classifier variable and two of the top predictor variables aside from the synthetic, time and order variables above. Here we plot roll_belt and pitch_forearm. Both show some significant patterns with the classifier variables.
featurePlot(x=trainingset[,c("roll_belt","pitch_forearm")],
y=trainingset$classe,
plot="pairs")
We then test the model on the remaining 15,695 observations in the quiz set, finding an accuracy of 99.26% using the random forests method but only 72% for the rpart classification tree method.
Confusion matrices display a wide range of valuable data on a predictive test, including prediction accuracy, sensitivity, and specificity by class. As seen below, the RF model is highly accurate across all classes when sampled on the quiz set.
confusionMatrix(predict(RF20, quiz),quiz$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4454 12 0 0 0
## B 10 3012 20 0 0
## C 0 12 2704 17 0
## D 0 1 13 2555 5
## E 0 0 0 0 2880
##
## Overall Statistics
##
## Accuracy : 0.9943
## 95% CI : (0.993, 0.9954)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9927
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9978 0.9918 0.9879 0.9934 0.9983
## Specificity 0.9989 0.9976 0.9978 0.9986 1.0000
## Pos Pred Value 0.9973 0.9901 0.9894 0.9926 1.0000
## Neg Pred Value 0.9991 0.9980 0.9975 0.9987 0.9996
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2838 0.1919 0.1723 0.1628 0.1835
## Detection Prevalence 0.2845 0.1938 0.1741 0.1640 0.1835
## Balanced Accuracy 0.9983 0.9947 0.9929 0.9960 0.9991
In contrast the rpart method has only 72% accuracy and has less than 50% sensitivity for Class C.
confusionMatrix(predict(RP20, quiz),quiz$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3780 632 555 361 78
## B 275 1893 92 14 16
## C 140 426 1348 73 165
## D 201 86 724 1992 341
## E 68 0 18 132 2285
##
## Overall Statistics
##
## Accuracy : 0.7198
## 95% CI : (0.7127, 0.7269)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6434
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8468 0.6233 0.49251 0.7745 0.7920
## Specificity 0.8552 0.9686 0.93795 0.8970 0.9830
## Pos Pred Value 0.6992 0.8266 0.62639 0.5957 0.9129
## Neg Pred Value 0.9335 0.9147 0.89744 0.9530 0.9545
## Prevalence 0.2844 0.1935 0.17439 0.1639 0.1838
## Detection Rate 0.2408 0.1206 0.08589 0.1269 0.1456
## Detection Prevalence 0.3444 0.1459 0.13711 0.2131 0.1595
## Balanced Accuracy 0.8510 0.7960 0.71523 0.8357 0.8875
These accuracies are visualized below.
accurateRF<-predict(RF20,trainingset)==trainingset$classe
accurateRP<-predict(RP20,trainingset)==trainingset$classe
RFplot<-
qplot(classe,
pitch_forearm,
data=trainingset,
colour=predict(RF20,trainingset),
main="RF Model Accuracy: pitch_forearm"
)+theme(legend.position = "none")
RPplot<-
qplot(classe,
pitch_forearm,
data=trainingset,
colour=predict(RP20,trainingset),
main="RP Model Accuracy: pitch_forearm")+
theme(legend.position = "none")
grid.arrange(RFplot,RPplot, ncol=2)
Several variables which progressed sequentially with the data were important to the classification tree, including the raw time stamp and the num_window variable. Because the exercise methods were also sequential, this may represent an unfair training and testing advantage for this dataset relative to efforts to classify these behaviors observed outside of an experimental setting. In order to rectify this, we attempt to train and test a new model without the benefit of these sequential variables. This effort will be limited to the more accurate random forests method.
trainhard<-trainingset[,-(1:6)]
testhard<-testingset[,-(1:6)]
#
# inTrain<-createDataPartition(trainhard$classe,
# p=.2,
# list=FALSE)
#
# train<-trainhard[inTrain,]
# quiz<-trainhard[-inTrain,]
#
# rm(inTrain)
# start<-proc.time()
# RF20hardmode<-train(classe ~ .,
# method="rf",
# data=train)
# end<-proc.time()-start
# saveRDS(RF20hardmode,file = "rf modfit 20 hard.RDS")
RF20hardmode<-readRDS(file = "rf modfit 20 hard.RDS")
RFHardPredicts<-predict(RF20hardmode, newdata=quiz)
Notably, the hard mode model is comparable in its accuracy to the model fitted including the timestamp data, coming in at 97.9% accuracy on the quiz set. It also returns the same results as the RF model for the final test set.
confusionMatrix(predict(RF20hardmode, quiz),quiz$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4439 73 0 1 1
## B 15 2931 50 13 8
## C 7 27 2668 40 10
## D 2 5 19 2512 47
## E 1 1 0 6 2819
##
## Overall Statistics
##
## Accuracy : 0.9792
## 95% CI : (0.9769, 0.9814)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9737
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9944 0.9651 0.9748 0.9767 0.9771
## Specificity 0.9933 0.9932 0.9935 0.9944 0.9994
## Pos Pred Value 0.9834 0.9715 0.9695 0.9718 0.9972
## Neg Pred Value 0.9978 0.9916 0.9947 0.9954 0.9949
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2828 0.1867 0.1700 0.1601 0.1796
## Detection Prevalence 0.2876 0.1922 0.1753 0.1647 0.1801
## Balanced Accuracy 0.9939 0.9792 0.9842 0.9856 0.9882
cbind(predict(RF20, testingset),predict(RF20hardmode,testhard))
## [,1] [,2]
## [1,] 2 2
## [2,] 1 1
## [3,] 2 2
## [4,] 1 1
## [5,] 1 1
## [6,] 5 5
## [7,] 4 4
## [8,] 2 2
## [9,] 1 1
## [10,] 1 1
## [11,] 2 2
## [12,] 3 3
## [13,] 2 2
## [14,] 1 1
## [15,] 5 5
## [16,] 5 5
## [17,] 1 1
## [18,] 2 2
## [19,] 2 2
## [20,] 2 2
varImp(RF20hardmode)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.000
## pitch_forearm 57.201
## yaw_belt 52.211
## magnet_dumbbell_z 45.978
## roll_forearm 41.690
## pitch_belt 40.906
## magnet_dumbbell_y 39.930
## accel_dumbbell_y 23.363
## roll_dumbbell 19.135
## magnet_dumbbell_x 17.666
## accel_forearm_x 16.464
## magnet_belt_z 15.726
## magnet_belt_y 13.197
## accel_dumbbell_z 12.954
## accel_belt_z 12.496
## total_accel_dumbbell 10.976
## magnet_forearm_z 10.472
## gyros_belt_z 10.164
## yaw_dumbbell 9.668
## magnet_belt_x 9.437
Finally, we benchmark the various models against the testing set accuracy, finding that
correctset<-predict(RF20, testingset)
RF20set<-predict(RF20, testingset)
RP20set<-predict(RP20, testingset)
RF20hardSet<-predict(RF20hardmode, testhard)
Sets<-data.frame(RF20set,RP20set,RF20hardSet,correctset)
knitr::kable(Sets,row.names = 1:20,col.names = c("Random Forests", "RPart", "Random Forests Hard Mode", "Correct"))
## Warning in if (is.na(row.names)) row.names = has_rownames(x): the condition
## has length > 1 and only the first element will be used
## Warning in if (row.names) {: the condition has length > 1 and only the
## first element will be used
| Random Forests | RPart | Random Forests Hard Mode | Correct | |
|---|---|---|---|---|
| 1 | B | D | B | B |
| 2 | A | C | A | A |
| 3 | B | A | B | B |
| 4 | A | A | A | A |
| 5 | A | A | A | A |
| 6 | E | E | E | E |
| 7 | D | D | D | D |
| 8 | B | D | B | B |
| 9 | A | A | A | A |
| 10 | A | A | A | A |
| 11 | B | B | B | B |
| 12 | C | C | C | C |
| 13 | B | B | B | B |
| 14 | A | A | A | A |
| 15 | E | E | E | E |
| 16 | E | E | E | E |
| 17 | A | A | A | A |
| 18 | B | B | B | B |
| 19 | B | D | B | B |
| 20 | B | B | B | B |
paste("Random Forests Test Accuracy: ", sum(RF20set==correctset)/20*100, "%", sep="")
## [1] "Random Forests Test Accuracy: 100%"
paste("RPart Test Accuracy: ", sum(RP20set==correctset)/20*100, "%", sep="")
## [1] "RPart Test Accuracy: 75%"
paste("Random Forests Hard Mode Test Accuracy: ", sum(RF20hardSet==correctset)/20*100,"%", sep="")
## [1] "Random Forests Hard Mode Test Accuracy: 100%"