Devices such as Jawbone Up, Nike FuelBand, and Fitbit now make it possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. We will try to predict the manner in which they did the exercise (the classe variable) by building several prediction models, comparing them using cross-validation, and estimating the out-of-sample error.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.
library(caret);library(rpart.plot);library(randomForest);library(gbm)
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: rpart
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Loaded gbm 2.1.5
trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training_data<-read.csv(url(trainURL))
test_data<-read.csv(url(testURL))
dim(training_data)
dim(test_data)
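One detail worth checking at load time is how missing values are encoded. By default `read.csv` treats only the string `NA` as missing, so blank cells or spreadsheet error strings in a text column are read as ordinary values. The sketch below uses a small made-up CSV (not the project data; the column names are hypothetical) to show the effect of the `na.strings` argument. Whether the project files need the same treatment is an assumption to verify against the raw data.

```r
# Toy CSV (hypothetical data) with three encodings of "missing":
# the literal NA, an empty cell, and a spreadsheet error string.
csv_text <- "id,roll,kurtosis
1,1.5,NA
2,2.5,
3,3.5,#DIV/0!"

# Default behaviour: only the literal NA is treated as missing.
default_read <- read.csv(text = csv_text)

# Explicit list of strings that should be read as missing values.
strict_read <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# Count the missing values detected in the same column under each setting.
sum(is.na(default_read$kurtosis))
sum(is.na(strict_read$kurtosis))
```

With the stricter setting, columns that are mostly blank would be caught by a filter based on the fraction of NA values rather than relying on a later variance check.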
# 1. Remove variables having more than 95% NA values
na_col<-sapply(training_data,function(x)mean(is.na(x)))>0.95
training_data<-training_data[,na_col==FALSE]
test_data<-test_data[,na_col==FALSE]
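# 2. Remove variables having near-zero variance
# Identify the columns on the training data only and drop the same columns
# from the test data, so that both sets keep identical predictors.
trainNZV<-nearZeroVar(training_data)
if(length(trainNZV)>0){
training_data<-training_data[,-trainNZV]
test_data<-test_data[,-trainNZV]
}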
# 3. Remove variables that are not required in our analysis
training_data<-training_data[,-c(1:7)]
test_data<-test_data[,-c(1:7)]
dim(training_data)
## [1] 19622 52
dim(test_data)
## [1] 20 52
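The column filtering above relies on computing a logical mask from the training data and reusing it on the test data. A minimal sketch of that idea on toy data frames follows; `train_toy`, `test_toy`, and their columns are made up for illustration and only base R is used.

```r
# Toy training and test frames (hypothetical data, same column layout).
train_toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, NA),   # entirely missing
  c = c(1, NA, 3, 4)       # 25% missing
)
test_toy <- data.frame(a = c(5, 6), b = c(NA, 7), c = c(8, 9))

# Fraction of missing values per column, computed on the training data only.
na_frac <- sapply(train_toy, function(x) mean(is.na(x)))
keep <- na_frac <= 0.95

# Apply the same column mask to both sets so their columns stay aligned.
train_clean <- train_toy[, keep, drop = FALSE]
test_clean  <- test_toy[, keep, drop = FALSE]

names(train_clean)
names(test_clean)
```

Deriving the mask from the training set alone keeps the two data sets aligned and avoids letting the test data influence which predictors are kept.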
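# Stratified 60/40 split on classe. No seed is set before this call,
# so the exact split (and the accuracies below) can vary between runs.
inTrain<-createDataPartition(training_data$classe,p=0.6,list=FALSE)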
Training<-training_data[inTrain,]
Testing<-training_data[-inTrain,]
dim(Training)
## [1] 11776 52
dim(Testing)
## [1] 7846 52
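The 60/40 split above is stratified by the outcome, so each class keeps roughly the same share in both parts. The sketch below approximates that idea in base R on the built-in `iris` data. The helper `stratified_split` is hypothetical and is not caret's exact algorithm; it also shows a seed being set before sampling, which makes the split repeatable.

```r
# Base-R approximation of a stratified split (not caret's exact algorithm).
stratified_split <- function(y, p) {
  # Row indices grouped by class label.
  by_class <- split(seq_along(y), y)
  # Sample a fraction p of the rows within each class.
  picked <- lapply(by_class, function(idx) {
    idx[sample.int(length(idx), round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(352020)  # seed BEFORE sampling so the split is repeatable
in_train <- stratified_split(iris$Species, p = 0.6)

train_part <- iris[in_train, ]
test_part  <- iris[-in_train, ]

# Class counts in each part.
table(train_part$Species)
table(test_part$Species)
```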
set.seed(352020)
ModFit<-train(classe ~.,data=Training,method="rpart")
rpart.plot(ModFit$finalModel,roundint = FALSE)
PredFit<-predict(ModFit,Testing) # predicting on testset
CM<-confusionMatrix(PredFit,Testing$classe)
CM$overall["Accuracy"]
## Accuracy
## 0.4892939
# plot(CM$table,main="Decision Tree Prediction Accuracy= 48.9%")
The accuracy of this model is low, about 49%, so the estimated out-of-sample error is about 51%, which is too large to be acceptable.
# 2. Random Forest Model
set.seed(352020)
RFModFit <-train(classe ~.,data=Training, method="rf",ntree=100)
RFPredFit<-predict(RFModFit,Testing)
RFCM<-confusionMatrix(RFPredFit,Testing$classe)
RFCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2229 22 0 0 0
## B 3 1481 16 1 0
## C 0 14 1346 15 2
## D 0 0 6 1269 5
## E 0 1 0 1 1435
##
## Overall Statistics
##
## Accuracy : 0.989
## 95% CI : (0.9865, 0.9912)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9861
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9987 0.9756 0.9839 0.9868 0.9951
## Specificity 0.9961 0.9968 0.9952 0.9983 0.9997
## Pos Pred Value 0.9902 0.9867 0.9775 0.9914 0.9986
## Neg Pred Value 0.9995 0.9942 0.9966 0.9974 0.9989
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2841 0.1888 0.1716 0.1617 0.1829
## Detection Prevalence 0.2869 0.1913 0.1755 0.1631 0.1832
## Balanced Accuracy 0.9974 0.9862 0.9896 0.9926 0.9974
plot(RFCM$table,main="Random Forest Prediction Accuracy= 98.9%")
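The overall accuracy and the per-class sensitivity reported in the confusion matrix above can be recomputed by hand from the table. The sketch below copies the counts from that output into a base-R matrix; the variable names are illustrative, and the out-of-sample error is taken here simply as one minus the hold-out accuracy.

```r
# Random forest confusion table copied from the output above
# (rows = predicted class, columns = reference class).
classes <- c("A", "B", "C", "D", "E")
rf_table <- matrix(
  c(2229,   22,    0,    0,    0,
       3, 1481,   16,    1,    0,
       0,   14, 1346,   15,    2,
       0,    0,    6, 1269,    5,
       0,    1,    0,    1, 1435),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy: correctly classified cases (diagonal) over all cases.
accuracy <- sum(diag(rf_table)) / sum(rf_table)

# Estimated out-of-sample error on the hold-out set.
oos_error <- 1 - accuracy

# Sensitivity per class: correct predictions over the reference total.
sensitivity <- diag(rf_table) / colSums(rf_table)

round(accuracy, 4)
round(oos_error, 4)
round(sensitivity, 4)
```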
The accuracy of the random forest model is very high, about 98.9%, so the estimated out-of-sample error is only about 1.1%. This might be the best model in this case.
# 3. Gradient Boosting Model
set.seed(352020)
GBMModFit <-train(classe ~.,data=Training, method="gbm",verbose=FALSE)
GBMPredFit<-predict(GBMModFit,Testing)
GBMCM<-confusionMatrix(GBMPredFit,Testing$classe)
GBMCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2198 52 0 5 0
## B 23 1417 49 10 16
## C 8 42 1303 36 14
## D 2 1 10 1221 21
## E 1 6 6 14 1391
##
## Overall Statistics
##
## Accuracy : 0.9597
## 95% CI : (0.9551, 0.964)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.949
##
## Mcnemar's Test P-Value : 6.713e-08
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9848 0.9335 0.9525 0.9495 0.9646
## Specificity 0.9898 0.9845 0.9846 0.9948 0.9958
## Pos Pred Value 0.9747 0.9353 0.9287 0.9729 0.9810
## Neg Pred Value 0.9939 0.9840 0.9899 0.9901 0.9921
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2801 0.1806 0.1661 0.1556 0.1773
## Detection Prevalence 0.2874 0.1931 0.1788 0.1600 0.1807
## Balanced Accuracy 0.9873 0.9590 0.9685 0.9721 0.9802
plot(GBMCM$table,main="Gradient Boosting Prediction Accuracy= 95.97%")
This model reaches an accuracy of about 96%, with an estimated out-of-sample error of about 4%. Its accuracy is lower than that of the random forest model above.
# Random Forest Model with repeated cross-validation
RFcontrol<-trainControl(method="repeatedcv",number=5,repeats = 3)
set.seed(352020)
RFcustomFit <-train(classe ~.,data=Training, method="rf",trControl=RFcontrol,ntree=100)
RFPredcustom<-predict(RFcustomFit,Testing)
RFcustomCM<-confusionMatrix(RFPredcustom,Testing$classe)
RFcustomCM$overall["Accuracy"]
## Accuracy
## 0.9913332
trellis.par.set(caretTheme())
plot(RFcustomFit, metric = "Accuracy",main="RFM:Accuracy=99.13%")
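The `trainControl(method="repeatedcv",number=5,repeats = 3)` setting used above means the training rows are divided into 5 folds, and the whole procedure is repeated 3 times with different shuffles, giving 15 resamples. The base-R sketch below illustrates that bookkeeping only. The helper `repeated_folds` is hypothetical, it does not stratify by outcome, and it is not how caret builds its folds internally.

```r
# Base-R sketch of repeated k-fold resampling (illustrative, not caret internals).
repeated_folds <- function(n, k, repeats) {
  out <- list()
  for (r in seq_len(repeats)) {
    # Shuffle fold labels so each row lands in exactly one fold per repeat.
    fold_id <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      # Rows held out for evaluation in this fold of this repeat.
      out[[sprintf("Fold%d.Rep%d", f, r)]] <- which(fold_id == f)
    }
  }
  out
}

set.seed(352020)
holdouts <- repeated_folds(n = 100, k = 5, repeats = 3)

length(holdouts)         # number of resamples: k * repeats
lengths(holdouts)[1:5]   # held-out rows per fold in the first repeat
```

Each model is fitted on the rows outside a fold and scored on the held-out rows, and the resampled accuracies are averaged to choose tuning parameters.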
gbmcontrol<-trainControl(method="repeatedcv",number=5,repeats = 3)
set.seed(352020)
GBMcustomFit <-train(classe ~.,data=Training, method="gbm",trControl=gbmcontrol,verbose=FALSE)
GBMPredcustom<-predict(GBMcustomFit,Testing)
GBMcustomCM<-confusionMatrix(GBMPredcustom,Testing$classe)
GBMcustomCM$overall["Accuracy"]
## Accuracy
## 0.9562835
plot(GBMcustomFit,metric = "Accuracy",main="GBM:Accuracy=95.63%")
ldacontrol<-trainControl(method="repeatedcv",number=5,repeats = 3)
set.seed(352020)
ldaMod <- train(classe ~ ., data=Training, method = "lda",trControl=ldacontrol)
ldapredict<-predict(ldaMod,Testing)
ldaCM<-confusionMatrix(ldapredict,Testing$classe)
ldaCM$overall["Accuracy"]
## Accuracy
## 0.6918175
# plot(ldaCM$table,main="Linear Discriminant Analysis Accuracy= 69.18%")
This model reaches only 69.18% accuracy, with an estimated out-of-sample error of about 31%, which is not good enough for it to be considered as the best model.
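The hold-out accuracies reported so far can be collected in one table to make the comparison easier. The sketch below copies the values from the outputs in this document; the labels are informal, with "bootstrap" referring to caret's default resampling and "repeated CV" to the models fitted with `trainControl`.

```r
# Hold-out accuracies copied from the outputs reported in this document.
model_summary <- data.frame(
  model = c("rpart (bootstrap)", "rf (bootstrap)", "gbm (bootstrap)",
            "rf (repeated CV)", "gbm (repeated CV)", "lda (repeated CV)"),
  accuracy = c(0.4892939, 0.9890, 0.9597, 0.9913332, 0.9562835, 0.6918175)
)

# Estimated out-of-sample error is one minus the hold-out accuracy.
model_summary$oos_error <- 1 - model_summary$accuracy

# Sort from most to least accurate.
model_summary <- model_summary[order(-model_summary$accuracy), ]
model_summary
```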
Comparing all of the models above, fitted with bootstrapping and with repeated cross-validation, the random forest model has the highest accuracy and the lowest out-of-sample error, so it is our model of choice. We therefore apply the random forest model (RFModFit) to predict on test_data.
result<-predict(RFModFit,newdata=test_data)
result
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E