Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
This report explores how well accelerometer data collected from the belt, forearm, arm, and dumbbell of 6 participants can be used to classify whether the lifts were performed correctly or incorrectly. All data and information about the data sets are available from the website here: http://groupware.les.inf.puc-rio.br/har. Methods of analysis included non-linear classification modeling, cross validation, and a validation test set. All supporting plots are included within the analysis.
Classification Definitions
A - Exactly according to the specification
B - Throwing the elbows to the front
C - Lifting the dumbbell only halfway
D - Lowering the dumbbell only halfway
E - Throwing the hips to the front
This report analyzes the key predictor variables used to quantify how well the participants performed the routine. Any variable that did not have a direct correlation with the outcome, or that was missing a significant amount of data, was eliminated from the feature set. The non-linear classification models used were Recursive Partitioning, Random Forest, and Gradient Boosting.
Results of the analysis showed that Random Forest had the best accuracy, at 99.2% with a p-value of less than 2.2e-16; Gradient Boosting had an estimated accuracy of 96.1%, also with a p-value of less than 2.2e-16; and Recursive Partitioning had an estimated accuracy of only 49% with a p-value of 1.0. When the validation test data was scored against the models, both Random Forest and Gradient Boosting were 100% accurate. Further analysis is needed to determine whether there are any linear relationships between the variables, or whether variables can be combined to increase accuracy.
R version 3.4.4
caret 6.0.80
dplyr 0.7.4
elasticnet 1.1
gbm 2.1.3
ggplot2 2.2.1
rattle 5.1.0
rpart 4.1.13
# Load the required libraries
library(dplyr)
library(AppliedPredictiveModeling)
library(caret)
library(RGtk2)
library(rattle)
library(rpart)
library(rpart.plot)
Data was downloaded from Coursera and stored with the project so the current analysis can be reproduced. The original data sets were downloaded from the following locations:
Training Data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Validation Test Data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
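If the files are not already stored with the project, they can be fetched with a short sketch like the following (the destination file names match those read below):
# Download the data files once if they are not already present
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("./pml-training.csv")) download.file(trainUrl, destfile = "./pml-training.csv")
if (!file.exists("./pml-testing.csv")) download.file(testUrl, destfile = "./pml-testing.csv")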
# Read the raw data; treat "#DIV/0!" and empty strings as missing values so
# the NA-based column filtering below behaves as intended
raw_trainingData <- read.csv("./pml-training.csv", sep = ",", header = TRUE, fill = TRUE, na.strings = c("NA", "#DIV/0!", ""))
raw_testingData <- read.csv("./pml-testing.csv", sep = ",", header = TRUE, fill = TRUE, na.strings = c("NA", "#DIV/0!", ""))
dim(raw_trainingData); dim(raw_testingData)
## [1] 19622 160
## [1] 20 160
# Drop bookkeeping columns that are not sensor measurements
usefulFeatures <- !(names(raw_trainingData) %in% c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window"))
trainingData <- raw_trainingData[,usefulFeatures]
testingData <- raw_testingData[,usefulFeatures]
# Remove near-zero-variance predictors identified on the training data
nzvMetrics <- nearZeroVar(trainingData, saveMetrics = TRUE)
trainingData <- trainingData[, nzvMetrics$nzv == FALSE]
testingData <- testingData[, nzvMetrics$nzv == FALSE]
# Drop columns where 65% or more of the values are missing
usefulFeatures <- colMeans(is.na(trainingData)) < 0.65
trainingData <- trainingData[,usefulFeatures]
testingData <- testingData[,usefulFeatures]
dim(trainingData); dim(testingData)
## [1] 19622 53
## [1] 20 53
# Record the remaining feature names (52 predictors plus the classe outcome)
features <- colnames(trainingData)
features
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z" "roll_arm" "pitch_arm"
## [16] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [19] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [22] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [25] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [28] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [31] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [34] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [37] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [40] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [43] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [46] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [49] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [52] "magnet_forearm_z" "classe"
# Frequency count of each classe level in the full training data
summary(trainingData$classe)
## A B C D E
## 5580 3797 3422 3216 3607
# Set the seed for reproducibility
set.seed(50342)
inTrain <- createDataPartition(y=trainingData$classe, p=0.70, list=FALSE)
training <- trainingData[inTrain,]
testing <- trainingData[-inTrain,]
dim(training); dim(testing)
## [1] 13737 53
## [1] 5885 53
# Frequency counts per classe level in each partition
summary(training$classe)
## A B C D E
## 3906 2658 2396 2252 2525
summary(testing$classe)
## A B C D E
## 1674 1139 1026 964 1082
Define the training controls for use with cross validation. A k of 5 folds was selected as a balance between speed and accuracy; more folds could be used, but the additional training time does not justify the small increase in accuracy.
# Define 5-fold cross validation; named fitControl so it does not mask
# caret's trainControl() function
fitControl <- trainControl(method="cv", number=5)
modelRpart <- train(classe ~ ., data=training, method="rpart", trControl=fitControl)
fancyRpartPlot(modelRpart$finalModel)
predictedByRpart <- predict(modelRpart, newdata=testing)
varImp(modelRpart)
## rpart variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## pitch_forearm 100.00
## roll_forearm 72.10
## roll_belt 70.29
## magnet_dumbbell_y 50.09
## accel_belt_z 43.18
## magnet_belt_y 39.80
## yaw_belt 39.49
## total_accel_belt 35.94
## magnet_arm_x 26.90
## accel_arm_x 25.76
## roll_dumbbell 19.55
## magnet_dumbbell_z 18.04
## magnet_dumbbell_x 17.99
## accel_dumbbell_y 15.39
## roll_arm 14.34
## gyros_belt_z 0.00
## accel_forearm_x 0.00
## gyros_belt_x 0.00
## magnet_arm_z 0.00
## magnet_belt_x 0.00
confMatRpart <- confusionMatrix(testing$classe, predictedByRpart)
confMatRpart
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1526 23 119 0 6
## B 453 375 311 0 0
## C 486 38 502 0 0
## D 421 171 372 0 0
## E 163 146 291 0 482
##
## Overall Statistics
##
## Accuracy : 0.4902
## 95% CI : (0.4774, 0.5031)
## No Information Rate : 0.5181
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.334
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.5005 0.49801 0.3147 NA 0.98770
## Specificity 0.9478 0.85113 0.8779 0.8362 0.88883
## Pos Pred Value 0.9116 0.32924 0.4893 NA 0.44547
## Neg Pred Value 0.6383 0.92035 0.7751 NA 0.99875
## Prevalence 0.5181 0.12795 0.2710 0.0000 0.08292
## Detection Rate 0.2593 0.06372 0.0853 0.0000 0.08190
## Detection Prevalence 0.2845 0.19354 0.1743 0.1638 0.18386
## Balanced Accuracy 0.7242 0.67457 0.5963 NA 0.93827
The overall accuracy for Recursive Partitioning is 0.4902294, with a p-value of 0.9999911 and a classification error of 0.5097706.
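These figures need not be transcribed by hand; they can be pulled directly from the confusionMatrix object. A minimal sketch (Accuracy and AccuracyPValue are standard elements of caret's confusionMatrix output):
# Extract the accuracy, its p-value, and the classification error
rpartAccuracy <- confMatRpart$overall["Accuracy"]
rpartPValue <- confMatRpart$overall["AccuracyPValue"]
rpartError <- 1 - rpartAccuracy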
modelRf <- train(classe ~ ., data=training, method="rf", trControl=fitControl, verbose=FALSE)
predictedByRf <- predict(modelRf, newdata=testing)
varImp(modelRf)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 58.67
## yaw_belt 52.17
## pitch_belt 43.81
## magnet_dumbbell_y 41.85
## roll_forearm 41.85
## magnet_dumbbell_z 41.69
## accel_dumbbell_y 21.77
## roll_dumbbell 18.71
## magnet_dumbbell_x 16.91
## accel_forearm_x 15.91
## magnet_belt_z 15.55
## total_accel_dumbbell 14.23
## accel_belt_z 14.16
## magnet_forearm_z 13.84
## accel_dumbbell_z 13.82
## magnet_belt_y 12.67
## gyros_belt_z 10.21
## yaw_arm 10.15
## magnet_belt_x 9.84
confMatRf <- confusionMatrix(testing$classe, predictedByRf)
confMatRf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 1 0 0 0
## B 13 1125 1 0 0
## C 0 9 1009 8 0
## D 0 0 11 952 1
## E 0 0 1 1 1080
##
## Overall Statistics
##
## Accuracy : 0.9922
## 95% CI : (0.9896, 0.9943)
## No Information Rate : 0.2865
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9901
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9923 0.9912 0.9873 0.9906 0.9991
## Specificity 0.9998 0.9971 0.9965 0.9976 0.9996
## Pos Pred Value 0.9994 0.9877 0.9834 0.9876 0.9982
## Neg Pred Value 0.9969 0.9979 0.9973 0.9982 0.9998
## Prevalence 0.2865 0.1929 0.1737 0.1633 0.1837
## Detection Rate 0.2843 0.1912 0.1715 0.1618 0.1835
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9960 0.9941 0.9919 0.9941 0.9993
plot(modelRf$finalModel, main="Model Error for Random Forest")
The overall accuracy for Random Forest is 0.9921835, with a classification error of 0.0078165.
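As a cross-check on the held-out accuracy, the out-of-bag (OOB) error estimate can be read from the underlying randomForest fit. A minimal sketch, assuming the final model is a randomForest object exposing err.rate:
# OOB error rate after the final tree; this should be close to the
# held-out classification error reported above
tail(modelRf$finalModel$err.rate[, "OOB"], 1)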
modelBoosting <- train(classe ~ ., data=training, method="gbm", trControl=fitControl, verbose=FALSE)
predictedByBoosting <- predict(modelBoosting, newdata=testing)
#varImp(modelBoosting)
confMatBoosting <- confusionMatrix(testing$classe, predictedByBoosting)
confMatBoosting
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1649 12 4 5 4
## B 43 1058 36 2 0
## C 0 27 988 11 0
## D 0 1 41 918 4
## E 3 16 7 11 1045
##
## Overall Statistics
##
## Accuracy : 0.9614
## 95% CI : (0.9562, 0.9662)
## No Information Rate : 0.288
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9512
## Mcnemar's Test P-Value : 1.981e-11
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9729 0.9497 0.9182 0.9694 0.9924
## Specificity 0.9940 0.9830 0.9921 0.9907 0.9923
## Pos Pred Value 0.9851 0.9289 0.9630 0.9523 0.9658
## Neg Pred Value 0.9891 0.9882 0.9819 0.9941 0.9983
## Prevalence 0.2880 0.1893 0.1828 0.1609 0.1789
## Detection Rate 0.2802 0.1798 0.1679 0.1560 0.1776
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9834 0.9664 0.9552 0.9800 0.9924
The overall accuracy for Gradient Boosting is 0.9614274, with a classification error of 0.0385726.
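Because the three models were trained with the same 5-fold trainControl, their cross-validated accuracies can also be compared in one summary with caret's resamples(). A minimal sketch, assuming the resample folds line up across the fits:
# Collect the per-fold resample results and summarize the Accuracy metric
cvResults <- resamples(list(rpart = modelRpart, rf = modelRf, gbm = modelBoosting))
summary(cvResults)$statistics$Accuracy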
The validation test data set, containing 20 rows, is used to validate the accuracy of the models.
validationTest <- raw_testingData[,(names(raw_testingData) %in% features)]
validationPredictionRf <- predict(modelRf, newdata=validationTest, type="raw")
validationPredictionRfProbability <- predict(modelRf, newdata=validationTest, type="prob")
validationPredictionBoosting <- predict(modelBoosting, newdata=validationTest, type="raw")
validationPredictionBoostingProbability <- predict(modelBoosting, newdata=validationTest, type="prob")
validationPredictionRpart <- predict(modelRpart, newdata=validationTest, type="raw")
validationPredictionRpartProbability <- predict(modelRpart, newdata=validationTest, type="prob")
validationPredictionRf
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
validationPredictionBoosting
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
validationPredictionRpart
## [1] C A C A A C C A A A C C C A C A A A A C
## Levels: A B C D E
validationPredictionRfProbability
## A B C D E
## 1 0.028 0.824 0.116 0.026 0.006
## 2 0.990 0.002 0.000 0.006 0.002
## 3 0.172 0.708 0.060 0.012 0.048
## 4 0.956 0.000 0.020 0.024 0.000
## 5 0.974 0.014 0.010 0.000 0.002
## 6 0.006 0.030 0.036 0.004 0.924
## 7 0.010 0.012 0.048 0.908 0.022
## 8 0.034 0.750 0.050 0.140 0.026
## 9 1.000 0.000 0.000 0.000 0.000
## 10 0.994 0.000 0.000 0.004 0.002
## 11 0.038 0.800 0.108 0.032 0.022
## 12 0.004 0.060 0.846 0.038 0.052
## 13 0.006 0.988 0.000 0.000 0.006
## 14 1.000 0.000 0.000 0.000 0.000
## 15 0.004 0.004 0.014 0.014 0.964
## 16 0.006 0.008 0.000 0.002 0.984
## 17 0.966 0.000 0.000 0.000 0.034
## 18 0.028 0.870 0.002 0.084 0.016
## 19 0.100 0.878 0.000 0.016 0.006
## 20 0.000 1.000 0.000 0.000 0.000
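As a final sanity check (an illustrative step, not part of the original grading), the agreement between the two strongest models on the validation cases can be verified directly:
# TRUE when Random Forest and Gradient Boosting agree on all 20 cases
all(validationPredictionRf == validationPredictionBoosting)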