Background

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

Overview

In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. This is the “classe” variable in the training set. We train four models (Decision Tree, Random Forest, Gradient Boosted Trees, and Support Vector Machine) using k-fold cross-validation on the training set. We then predict on a validation set randomly held out from the training csv data to estimate each model's accuracy and out-of-sample error rate. Based on those numbers, we select the best model and use it to predict 20 cases from the test csv set.

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Loading the data sets and required libraries

library(caret)        # model training, cross-validation, and confusion matrices
library(randomForest) # random forest backend used by caret's method="rf"
library(rattle)       # fancyRpartPlot() for visualising the decision tree
set.seed(20112021)    # to make this analysis reproducible


training<-read.csv("pml-training.csv")
testing<-read.csv("pml-testing.csv")

Cleaning the training data set

#removing the first seven columns, which are metadata (row index, user name, timestamps, window indicators) rather than sensor readings
training<-training[,-c(1:7)]

#removing columns having mostly NA values
training<-training[,colMeans(is.na(training))<0.9]

#removing near-zero-variance columns
training<-training[,-nearZeroVar(training)]

#dimensions of the cleaned training data set
dim(training)
## [1] 19622    53
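
Note that the fitted models select predictor columns from new data by name, so the raw test set can be used for prediction as-is. If you prefer to mirror the same column filtering on the test set, a minimal sketch (assuming the standard problem_id column in pml-testing.csv):

#optional: keep only the columns retained in training, plus problem_id
testing<-testing[,colnames(testing) %in% c(colnames(training),"problem_id")]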

We will now work only with the cleaned training set, splitting it into a training set (70%) and a validation set (30%).

inTrain<-createDataPartition(y=training$classe,p=0.7,list=FALSE)
trainSet<-training[inTrain,]
validSet<-training[-inTrain,]

Creating prediction models and testing on validation set

We will consider four intuitive and popular prediction models: Decision Trees, Random Forests, Gradient Boosted Trees, and SVM. All of them use the same 3-fold cross-validation settings, which could be defined once as sketched below.
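
As a small, optional refactoring (not part of the original analysis), the shared cross-validation control could be created once and passed to each train() call:

#shared 3-fold cross-validation control, matching the settings used below
ctrl<-trainControl(method = "cv", number = 3, verboseIter = FALSE)
#example usage: dtFit<-train(classe~., data=trainSet, method="rpart", trControl=ctrl)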

Decision Tree

Prediction Model

dtFit<-train(classe~.,data=trainSet, method="rpart",
             trControl= trainControl(method = "cv", number = 3, verboseIter = F))
fancyRpartPlot(dtFit$finalModel)

Testing

dtPred<-predict(dtFit,validSet)
dtCM<-confusionMatrix(dtPred,as.factor(validSet$classe))
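
The full confusion matrix for the decision tree is omitted here; its accuracy appears in the comparison table in the Conclusion. If needed, the overall validation accuracy can be checked directly:

dtCM$overall["Accuracy"] #overall accuracy of the decision tree on the validation set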

Random Forests

Prediction Model

rfFit<-train(classe~.,data=trainSet,method="rf",
             trControl=trainControl(method="cv",number=3,verboseIter = F))

Testing

rfPred<-predict(rfFit,validSet)
rfCM<-confusionMatrix(rfPred,as.factor(validSet$classe))
rfCM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670    8    0    0    0
##          B    1 1123    7    0    0
##          C    2    4 1015   12    4
##          D    0    4    4  952    5
##          E    1    0    0    0 1073
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9912          
##                  95% CI : (0.9884, 0.9934)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9888          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9860   0.9893   0.9876   0.9917
## Specificity            0.9981   0.9983   0.9955   0.9974   0.9998
## Pos Pred Value         0.9952   0.9929   0.9788   0.9865   0.9991
## Neg Pred Value         0.9990   0.9966   0.9977   0.9976   0.9981
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1908   0.1725   0.1618   0.1823
## Detection Prevalence   0.2851   0.1922   0.1762   0.1640   0.1825
## Balanced Accuracy      0.9979   0.9921   0.9924   0.9925   0.9957
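
As an optional diagnostic not shown in the original output, caret's varImp() reports which predictors contribute most to the random forest:

rfImp<-varImp(rfFit) #variable importance from the fitted random forest
plot(rfImp, top = 10) #plot the ten most influential predictors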

Gradient Boosted Trees

Prediction Model

gbtFit<-train(classe~.,data=trainSet,method="gbm",
              trControl=trainControl(method = "cv", number = 3, verboseIter = F),
              verbose=F)

Testing

gbtPred<-predict(gbtFit,validSet)
gbtCM<-confusionMatrix(gbtPred,as.factor(validSet$classe))
gbtCM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1644   31    0    0    5
##          B   15 1067   31    4   14
##          C    6   33  982   26   12
##          D    7    3   11  929   17
##          E    2    5    2    5 1034
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9611          
##                  95% CI : (0.9558, 0.9659)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9508          
##                                           
##  Mcnemar's Test P-Value : 3.173e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9821   0.9368   0.9571   0.9637   0.9556
## Specificity            0.9915   0.9865   0.9842   0.9923   0.9971
## Pos Pred Value         0.9786   0.9434   0.9273   0.9607   0.9866
## Neg Pred Value         0.9929   0.9849   0.9909   0.9929   0.9901
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2794   0.1813   0.1669   0.1579   0.1757
## Detection Prevalence   0.2855   0.1922   0.1799   0.1643   0.1781
## Balanced Accuracy      0.9868   0.9617   0.9706   0.9780   0.9764
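
The boosting hyperparameters chosen by cross-validation can be inspected if desired:

gbtFit$bestTune #n.trees, interaction.depth, shrinkage, n.minobsinnode selected by CV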

Support Vector Machine (SVM)

Prediction Model

svmFit<-train(classe~.,data=trainSet,method="svmLinear",
              trControl=trainControl(method = "cv", number = 3, verboseIter = F),
              verbose=F)

Testing

svmPred<-predict(svmFit,validSet)
svmCM<-confusionMatrix(svmPred,as.factor(validSet$classe))
svmCM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1519  149   87   56   52
##          B   37  822   78   47  148
##          C   55   76  800   99   73
##          D   55   17   43  717   67
##          E    8   75   18   45  742
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7816          
##                  95% CI : (0.7709, 0.7921)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7227          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9074   0.7217   0.7797   0.7438   0.6858
## Specificity            0.9183   0.9347   0.9376   0.9630   0.9696
## Pos Pred Value         0.8154   0.7261   0.7253   0.7976   0.8356
## Neg Pred Value         0.9615   0.9333   0.9527   0.9505   0.9320
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2581   0.1397   0.1359   0.1218   0.1261
## Detection Prevalence   0.3166   0.1924   0.1874   0.1528   0.1509
## Balanced Accuracy      0.9129   0.8282   0.8587   0.8534   0.8277
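
The linear SVM trails the tree-based models. Since linear SVMs are sensitive to predictor scale, one optional variation (not run here) would be to centre and scale the predictors through caret's preProcess argument:

#hypothetical variant: standardise predictors before fitting the linear SVM
svmFitScaled<-train(classe~.,data=trainSet,method="svmLinear",
                    preProcess=c("center","scale"),
                    trControl=trainControl(method = "cv", number = 3))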

Conclusion

data.frame(Model=c("Decision Trees","Random Forests","Gradient Boosted Trees", "Support Vector Machine"),
           Accuracy=c(dtCM$overall[1],rfCM$overall[1],gbtCM$overall[1],svmCM$overall[1])*100,
           Out.of.Sample.Error=100-c(dtCM$overall[1],rfCM$overall[1],gbtCM$overall[1],svmCM$overall[1])*100
           )
##                    Model Accuracy Out.of.Sample.Error
## 1         Decision Trees 49.29482          50.7051827
## 2         Random Forests 99.11640           0.8836024
## 3 Gradient Boosted Trees 96.10875           3.8912489
## 4 Support Vector Machine 78.16483          21.8351742

As we can clearly see, the Random Forests algorithm shows the highest accuracy (99.12%) and the lowest out-of-sample error (0.88%), so we use it for the final predictions.

Prediction on Test Set

Predicting the “classe” variable for the test set with the Random Forests model:

testPred<-predict(rfFit,testing)
testPred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
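
For readability, the 20 predictions can be paired with their problem IDs (assuming the standard problem_id column in pml-testing.csv):

#pair each predicted classe with its problem id from the test csv
data.frame(problem_id = testing$problem_id, predicted_classe = testPred)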