One thing that people regularly do is quantify how much of a particular activity they do, but they neglect to quantify how well they do it. In this project, our goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants to predict the manner in which they did the exercise. Our machine learning algorithm will be applied to the 20 test cases available in the test data.
In this data set the variable representing the outcome, and the variable that we want to predict, is “classe”, a factor variable with 5 levels(A,B,C,D,E). To gather this data, participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions: -exactly according to the specification (Class A) -throwing the elbows to the front (Class B) -lifting the dumbbell only halfway (Class C) -lowering the dumbbell only halfway (Class D) -throwing the hips to the front (Class E) Accelerometers located at various points about the participant and their dumbell recorded data which we will use to predict which “classe” the data comes from.
Using Practical Machine Learning techniques we will use the accelerometer data provided at the following links to train models which will predict the manner in which certain exercises were performed. #####Data: training: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv testing: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
To reproduce the results of this code certain packages must be installed and loaded.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
The data from this project comes from this source: http://groupware.les.inf.puc-rio.br/har Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Here we load the data sets into R, and tidy up the blank, NA and undefined values:
training <- read.csv(url('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'), na.strings=c("NA","#DIV/0!",""))
testing <- read.csv(url('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'), na.strings=c("NA","#DIV/0!",""))
The training data set must be split into a training and (preliminary) testing set at a 60/40 ratio respectively (seed set for reproducibility).
set.seed(111)
Predictor variables with little or no values in their columns must be removed.
training <-training[,colSums(is.na(training)) == 0]
testing <-testing[,colSums(is.na(testing)) == 0]
Here we partition the training data into training (mod_training) and validation (mod_testing) data sets.
inTrain <- createDataPartition(y=training$classe, p=0.6, list=FALSE)
mod_training <- training[inTrain, ]
mod_testing <- training[-inTrain, ]
finalmod_testset <- testing
The first seven columns of the data set are not accelerometor measurement data. These columns must be removed because only data measurements should be included in the machine learning algorithm. We perform this transformation on our final model test data set as well.
mod_training <- mod_training[,-c(1:7)]
mod_testing <- mod_testing[,-c(1:7)]
finalmod_testset <- finalmod_testset[,-c(1:7)]
In these models we will be using the accelerometer data to predict the quality of the exercise which is represented in the variable “classe”.
Decision Tree Model:
mod1 <- rpart(classe ~ ., data = mod_training, method = 'class')
Prediction:
pred1 <- predict(mod1, mod_testing, type = 'class')
Results:
confusionMatrix(pred1, mod_testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2055 236 44 85 64
## B 79 997 153 117 170
## C 41 125 1085 192 149
## D 26 120 86 812 88
## E 31 40 0 80 971
##
## Overall Statistics
##
## Accuracy : 0.7545
## 95% CI : (0.7448, 0.764)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6882
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9207 0.6568 0.7931 0.6314 0.6734
## Specificity 0.9236 0.9180 0.9217 0.9512 0.9764
## Pos Pred Value 0.8273 0.6577 0.6815 0.7173 0.8654
## Neg Pred Value 0.9670 0.9177 0.9547 0.9294 0.9300
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2619 0.1271 0.1383 0.1035 0.1238
## Detection Prevalence 0.3166 0.1932 0.2029 0.1443 0.1430
## Balanced Accuracy 0.9221 0.7874 0.8574 0.7913 0.8249
The accuracy of this Decision Tree algorithm has been determined to be 0.7545 and the Estimated Out of Sample Error to be 0.2455.
Random Forest Model:
mod2 <- randomForest(classe ~ ., data = mod_training, method = 'class')
Prediction:
pred2 <- predict(mod2, mod_testing, type = 'class')
Results:
confusionMatrix(pred2, mod_testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2230 20 0 0 0
## B 2 1493 6 0 0
## C 0 5 1362 18 2
## D 0 0 0 1267 1
## E 0 0 0 1 1439
##
## Overall Statistics
##
## Accuracy : 0.993
## 95% CI : (0.9909, 0.9947)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9911
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9835 0.9956 0.9852 0.9979
## Specificity 0.9964 0.9987 0.9961 0.9998 0.9998
## Pos Pred Value 0.9911 0.9947 0.9820 0.9992 0.9993
## Neg Pred Value 0.9996 0.9961 0.9991 0.9971 0.9995
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1903 0.1736 0.1615 0.1834
## Detection Prevalence 0.2868 0.1913 0.1768 0.1616 0.1835
## Balanced Accuracy 0.9978 0.9911 0.9959 0.9925 0.9989
The accuracy of this Random Forest algorithm has been determined to be 0.9929 and the Estimated Out of Sample Error to be 0.0071.
Because the accuracy of the Random Forest algorithm is greater than the accuracy of the Decision Tree algorithm we will proceed with the Random Forest Algorithm.
Here we apply our algorithm with cross validation. We use the mod_testing data set to cross validate our model. ####Prediction Model 3 Random Forest with Cross Validation Model:
mod3 <- train(classe ~ ., data = mod_training, method = "rf", trControl = trainControl(method = "cv", 5), ntree = 250)
Prediction:
pred3 <- predict(mod3, mod_testing)
Results:
confusionMatrix(mod_testing$classe, pred3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2225 3 3 0 1
## B 24 1486 8 0 0
## C 0 10 1351 7 0
## D 0 0 24 1260 2
## E 0 0 2 1 1439
##
## Overall Statistics
##
## Accuracy : 0.9892
## 95% CI : (0.9866, 0.9913)
## No Information Rate : 0.2866
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9863
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9893 0.9913 0.9733 0.9937 0.9979
## Specificity 0.9987 0.9950 0.9974 0.9960 0.9995
## Pos Pred Value 0.9969 0.9789 0.9876 0.9798 0.9979
## Neg Pred Value 0.9957 0.9979 0.9943 0.9988 0.9995
## Prevalence 0.2866 0.1911 0.1769 0.1616 0.1838
## Detection Rate 0.2836 0.1894 0.1722 0.1606 0.1834
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9940 0.9931 0.9854 0.9949 0.9987
The accuracy of this Random Forest algorithm has been determined to be 0.989 and the Estimated Out of Sample Error to be 0.011
Ultimately we must test our model on a set of data that has not been used at all throughout the building of this model. Fortunately we partitioned a set of testing data and saved it for our final model. Here we will attempt to use the model to make predictions with a test set it has never seen before.
Prediction:
pred4 <- predict(mod3, finalmod_testset)
pred4
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E