Devices such as Jawbone Up, Nike FuelBand, and Fitbit now make it possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The goal is to predict the manner in which the participants did the exercise, which is recorded as the “classe” variable in the training set. We train four models (Decision Tree, Random Forest, Gradient Boosted Trees, and Support Vector Machine) using k-fold cross-validation on the training set. We then predict on a validation set randomly selected from the training CSV data to obtain each model's accuracy and out-of-sample error rate. Based on those numbers, we choose the best model and use it to predict 20 cases from the test CSV set.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
library(caret)
library(randomForest)
library(rattle)
set.seed(20112021) #to make this analysis reproducible
training<-read.csv("pml-training.csv")
testing<-read.csv("pml-testing.csv")
# Drop the first seven columns, which hold metadata rather than sensor readings
training<-training[,-c(1:7)]
# Drop columns in which at least 90% of the values are NA
training<-training[,colMeans(is.na(training))<0.9]
# Drop near-zero-variance predictors; the guard keeps all columns when none are flagged
nzv<-nearZeroVar(training)
if(length(nzv)>0) training<-training[,-nzv]
# Dimensions of the cleaned training set
dim(training)
## [1] 19622 53
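The cleaning step above relies on negative indexing, which behaves surprisingly when the index vector is empty. The following base R sketch uses a small made-up data frame (the column names `x1`, `x2`, `x3` and the empty `drop_idx` vector are hypothetical, chosen only for illustration) to show the NA filter and why a length check before negative indexing is a reasonable precaution.

```r
# Toy data frame (hypothetical), standing in for the sensor data.
df <- data.frame(x1 = c(1, 2, 3, 4),
                 x2 = c(NA, NA, NA, NA),
                 x3 = c(5, NA, 7, 8))

# Keep columns in which fewer than 90% of the values are NA.
df <- df[, colMeans(is.na(df)) < 0.9, drop = FALSE]

# An empty index vector, as a column filter might return when nothing is flagged.
drop_idx <- integer(0)

# Pitfall: -integer(0) is still integer(0), so this selects zero columns.
broken <- df[, -drop_idx, drop = FALSE]

# Guarded version: only drop columns when there is something to drop.
safe <- if (length(drop_idx) > 0) df[, -drop_idx, drop = FALSE] else df

names(df)
ncol(broken)
ncol(safe)
```

In this report the filter does flag columns, so the unguarded form happens to work, but the guarded form is safer if the data change.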
We now split the cleaned training set into a training subset (70%) and a validation subset (30%).
inTrain<-createDataPartition(y=training$classe,p=0.7,list=FALSE)
trainSet<-training[inTrain,]
validSet<-training[-inTrain,]
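To make the idea behind the split concrete, here is a minimal base R sketch of a stratified 70/30 split, which samples within each class so that class proportions are preserved. It is only an approximation of what `createDataPartition` does, not its actual implementation, and it uses the built-in `iris` data as a stand-in; the names `in_train`, `train_set`, and `valid_set` are hypothetical.

```r
set.seed(1)  # arbitrary seed for this illustration
y <- iris$Species

# For each class, sample 70% of that class's row indices.
in_train <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.7 * length(idx)))]
}))

train_set <- iris[in_train, ]
valid_set <- iris[-in_train, ]

# Class counts in each subset; the proportions should be the same in both.
table(train_set$Species)
table(valid_set$Species)
```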
We consider four popular prediction models: Decision Tree, Random Forest, Gradient Boosted Trees, and Support Vector Machine (SVM).
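Each model in this report is fitted with `trainControl(method = "cv", number = 3)`, that is, 3-fold cross-validation. The sketch below shows the general mechanics in base R under simplifying assumptions: it uses the built-in `iris` data and a toy nearest-centroid classifier (the helpers `fit_centroids` and `predict_centroids` are hypothetical and are not part of caret), and it does not reproduce caret's tuning logic.

```r
set.seed(1)  # arbitrary seed for this illustration
k <- 3
n <- nrow(iris)

# Randomly assign every row to one of k folds of near-equal size.
fold <- sample(rep(seq_len(k), length.out = n))

# Toy classifier: one centroid (vector of column means) per class.
fit_centroids <- function(x, y) {
  sapply(split(x, y), colMeans)  # features in rows, classes in columns
}

# Assign each row to the class whose centroid is closest (squared Euclidean distance).
predict_centroids <- function(centroids, x) {
  x <- as.matrix(x)
  d <- sapply(colnames(centroids), function(cl) {
    rowSums(sweep(x, 2, centroids[, cl])^2)
  })
  colnames(d)[max.col(-d, ties.method = "first")]
}

# Fit on k-1 folds, evaluate on the held-out fold, repeat for every fold.
acc <- sapply(seq_len(k), function(i) {
  train_rows <- iris[fold != i, ]
  held_rows  <- iris[fold == i, ]
  cen  <- fit_centroids(train_rows[, 1:4], train_rows$Species)
  pred <- predict_centroids(cen, held_rows[, 1:4])
  mean(pred == held_rows$Species)
})

acc        # accuracy on each held-out fold
mean(acc)  # cross-validated accuracy estimate
```

The averaged held-out accuracy is the kind of resampling estimate that guides parameter selection during training; the separate validation set is then used for the final model comparison.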
Prediction Model
dtFit<-train(classe~.,data=trainSet, method="rpart",
  trControl= trainControl(method = "cv", number = 3, verboseIter = FALSE))
fancyRpartPlot(dtFit$finalModel)
Testing
dtPred<-predict(dtFit,validSet)
dtCM<-confusionMatrix(dtPred,as.factor(validSet$classe))
Prediction Model
rfFit<-train(classe~.,data=trainSet,method="rf",
  trControl=trainControl(method="cv",number=3,verboseIter = FALSE))
Testing
rfPred<-predict(rfFit,validSet)
rfCM<-confusionMatrix(rfPred,as.factor(validSet$classe))
rfCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 8 0 0 0
## B 1 1123 7 0 0
## C 2 4 1015 12 4
## D 0 4 4 952 5
## E 1 0 0 0 1073
##
## Overall Statistics
##
## Accuracy : 0.9912
## 95% CI : (0.9884, 0.9934)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9888
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9860 0.9893 0.9876 0.9917
## Specificity 0.9981 0.9983 0.9955 0.9974 0.9998
## Pos Pred Value 0.9952 0.9929 0.9788 0.9865 0.9991
## Neg Pred Value 0.9990 0.9966 0.9977 0.9976 0.9981
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1908 0.1725 0.1618 0.1823
## Detection Prevalence 0.2851 0.1922 0.1762 0.1640 0.1825
## Balanced Accuracy 0.9979 0.9921 0.9924 0.9925 0.9957
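The headline statistics in this output follow directly from the counts in the confusion matrix. As a check, the base R sketch below re-enters the Random Forest table by hand (rows are predictions, columns are reference classes) and recomputes the overall accuracy, the per-class sensitivity, and the no-information rate; the variable names are hypothetical.

```r
# Random Forest confusion matrix from the output above, entered row by row.
cm <- matrix(c(1670,    8,    0,   0,    0,
                  1, 1123,    7,   0,    0,
                  2,    4, 1015,  12,    4,
                  0,    4,    4, 952,    5,
                  1,    0,    0,   0, 1073),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over the true count of that class.
sensitivity <- diag(cm) / colSums(cm)

# No-information rate: share of the most frequent reference class.
nir <- max(colSums(cm)) / sum(cm)

round(accuracy, 4)
round(sensitivity, 4)
round(nir, 4)
```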
Prediction Model
gbtFit<-train(classe~.,data=trainSet,method="gbm",
  trControl=trainControl(method = "cv", number = 3, verboseIter = FALSE),
  verbose=FALSE)
Testing
gbtPred<-predict(gbtFit,validSet)
gbtCM<-confusionMatrix(gbtPred,as.factor(validSet$classe))
gbtCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1644 31 0 0 5
## B 15 1067 31 4 14
## C 6 33 982 26 12
## D 7 3 11 929 17
## E 2 5 2 5 1034
##
## Overall Statistics
##
## Accuracy : 0.9611
## 95% CI : (0.9558, 0.9659)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9508
##
## Mcnemar's Test P-Value : 3.173e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9821 0.9368 0.9571 0.9637 0.9556
## Specificity 0.9915 0.9865 0.9842 0.9923 0.9971
## Pos Pred Value 0.9786 0.9434 0.9273 0.9607 0.9866
## Neg Pred Value 0.9929 0.9849 0.9909 0.9929 0.9901
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2794 0.1813 0.1669 0.1579 0.1757
## Detection Prevalence 0.2855 0.1922 0.1799 0.1643 0.1781
## Balanced Accuracy 0.9868 0.9617 0.9706 0.9780 0.9764
Prediction Model
svmFit<-train(classe~.,data=trainSet,method="svmLinear",
  trControl=trainControl(method = "cv", number = 3, verboseIter = FALSE),
  verbose=FALSE)
Testing
svmPred<-predict(svmFit,validSet)
svmCM<-confusionMatrix(svmPred,as.factor(validSet$classe))
svmCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1519 149 87 56 52
## B 37 822 78 47 148
## C 55 76 800 99 73
## D 55 17 43 717 67
## E 8 75 18 45 742
##
## Overall Statistics
##
## Accuracy : 0.7816
## 95% CI : (0.7709, 0.7921)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7227
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9074 0.7217 0.7797 0.7438 0.6858
## Specificity 0.9183 0.9347 0.9376 0.9630 0.9696
## Pos Pred Value 0.8154 0.7261 0.7253 0.7976 0.8356
## Neg Pred Value 0.9615 0.9333 0.9527 0.9505 0.9320
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2581 0.1397 0.1359 0.1218 0.1261
## Detection Prevalence 0.3166 0.1924 0.1874 0.1528 0.1509
## Balanced Accuracy 0.9129 0.8282 0.8587 0.8534 0.8277
data.frame(Model=c("Decision Trees","Random Forests","Gradient Boosted Trees", "Support Vector Machine"),
Accuracy=c(dtCM$overall[1],rfCM$overall[1],gbtCM$overall[1],svmCM$overall[1])*100,
Out.of.Sample.Error=100-c(dtCM$overall[1],rfCM$overall[1],gbtCM$overall[1],svmCM$overall[1])*100
)
## Model Accuracy Out.of.Sample.Error
## 1 Decision Trees 49.29482 50.7051827
## 2 Random Forests 99.11640 0.8836024
## 3 Gradient Boosted Trees 96.10875 3.8912489
## 4 Support Vector Machine 78.16483 21.8351742
As the table shows, the Random Forest model achieves the highest validation accuracy (99.12%) and the lowest estimated out-of-sample error (0.88%).
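The out-of-sample error is simply one minus the validation accuracy, and an interval around it can be obtained from the raw counts. The sketch below uses the Random Forest counts from its confusion matrix (5833 correct out of 5885 validation cases) and base R's `binom.test` for an exact binomial interval. That this reproduces the interval printed by `confusionMatrix` digit for digit is an assumption and has not been verified here.

```r
correct <- 5833   # sum of the diagonal of the Random Forest confusion matrix
total   <- 5885   # number of validation cases

accuracy  <- correct / total
oos_error <- 1 - accuracy

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy.
ci_acc <- as.numeric(binom.test(correct, total)$conf.int)

# The matching interval for the error rate is the mirror image.
ci_err <- rev(1 - ci_acc)

round(100 * c(accuracy = accuracy, error = oos_error), 2)
ci_acc
ci_err
```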
We now predict the “classe” variable for the 20 cases in the test set with the Random Forest model.
testPred<-predict(rfFit,testing)
testPred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
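As a rough sanity check on these 20 predictions, the validation accuracy can be turned into an expected number of correct answers. The sketch below assumes that the test cases come from the same distribution as the validation set and that errors are independent across cases; both are simplifying assumptions, so the result is only indicative.

```r
accuracy <- 5833 / 5885   # Random Forest validation accuracy (correct / total)
n_test   <- 20            # number of test cases

# Expected number of correct predictions among the test cases.
expected_correct <- n_test * accuracy

# Probability that all test predictions are correct, assuming independent errors.
p_all_correct <- accuracy^n_test

expected_correct
p_all_correct
```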