Overview

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. We try to predict the manner in which they did the exercise (the classe variable) by fitting several candidate models, evaluating each on a held-out part of the training data, and estimating its out-of-sample error.

Data Information

The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har

Data loading and attaching libraries

library(caret);library(rpart.plot);library(randomForest);library(gbm)

## Loading required package: lattice

## Loading required package: ggplot2

## Loading required package: rpart

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## Loaded gbm 2.1.5

trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training_data<-read.csv(url(trainURL))
test_data<-read.csv(url(testURL))
dim(training_data)
dim(test_data)
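The NA filter in the next section only sees values that R has actually read as NA. It is an assumption here, not something shown in the output above, that the raw files also mark missing values with empty strings and spreadsheet error strings such as `#DIV/0!`. If so, passing `na.strings` to `read.csv` makes all of them count as missing. A minimal sketch on inline toy data (the column values are made up):

```r
# Toy CSV standing in for the real file; the rows are hypothetical.
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,NA,B"

# Treat "NA", empty strings, and "#DIV/0!" as missing values.
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# Fraction of missing values per column.
na_fraction <- sapply(toy, function(x) mean(is.na(x)))
print(na_fraction)
```

With the real files, the same `na.strings` argument could be added to the two `read.csv(url(...))` calls; whether that changes the set of columns that survive cleaning depends on the file contents.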

Data Cleaning and Exploration

# 1. Remove variables having more than 95% NA values 
na_col<-sapply(training_data,function(x)mean(is.na(x)))>0.95
training_data<-training_data[,na_col==FALSE]
test_data<-test_data[,na_col==FALSE]

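# 2. Remove variables having near-zero variance
# (identified on the training set only, then dropped from both sets
#  so that training and test data keep exactly the same columns)
trainNZV<-nearZeroVar(training_data)
training_data<-training_data[,-trainNZV]
test_data<-test_data[,-trainNZV]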

# 3. Remove variables that are not required in our analysis
training_data<-training_data[,-c(1:7)]
test_data<-test_data[,-c(1:7)]
dim(training_data)

## [1] 19622    52

dim(test_data)

## [1] 20 52
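One way to guarantee that the training and test sets end up with the same predictors is to choose the columns on the training set only and then select them by name in both sets. The sketch below illustrates this on toy data frames; `train_toy`, `test_toy`, and their columns are hypothetical stand-ins, not the project data.

```r
# Toy stand-ins for training_data and test_data (values are made up).
train_toy <- data.frame(
  a = c(1, 2, 3, 4),
  mostly_na = c(NA, NA, NA, NA),
  b = c(0.5, 0.1, 0.9, 0.3),
  classe = factor(c("A", "B", "A", "C"))
)
test_toy <- data.frame(
  a = c(5, 6),
  mostly_na = c(NA, 1),
  b = c(0.2, 0.7),
  problem_id = 1:2
)

# Decide which columns to keep using the training set only.
na_fraction <- sapply(train_toy, function(x) mean(is.na(x)))
keep <- names(train_toy)[na_fraction <= 0.95]
predictors <- setdiff(keep, "classe")

# Apply the same choice to both sets, selecting by name rather than position.
train_clean <- train_toy[, c(predictors, "classe"), drop = FALSE]
test_clean <- test_toy[, predictors, drop = FALSE]

print(names(train_clean))
print(names(test_clean))
```

Selecting by name also avoids surprises when an earlier step has already removed a column and shifted the positions of the remaining ones.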

Data partitioning

inTrain<-createDataPartition(training_data$classe,p=0.6,list=FALSE)
Training<-training_data[inTrain,]
Testing<-training_data[-inTrain,]
dim(Training)

## [1] 11776    52

dim(Testing)

## [1] 7846   52
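The split above is drawn at random within each class, and the call is not preceded by `set.seed`, so the exact rows change from run to run. The following base R sketch shows the idea of a reproducible 60/40 stratified split on a toy outcome vector. It illustrates the principle only and is not caret's implementation of `createDataPartition`; the class counts are made up.

```r
set.seed(352020)  # fix the seed so the split is reproducible

# Toy outcome vector with the five class labels A to E.
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 40, 30, 30, 50)))

# Within each class, sample 60% of the row indices for training.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE))

# Overall and per-class share of rows assigned to training.
train_share <- length(in_train) / length(classe)
print(train_share)
print(table(classe[in_train]) / table(classe))
```

Because the sampling is done per class, the class proportions in the training part match those of the full data, which matters when the classes are not equally frequent.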

Model Building

1.Decision Tree Model

set.seed(352020)
ModFit<-train(classe ~.,data=Training,method="rpart")
rpart.plot(ModFit$finalModel,roundint = FALSE)

PredFit<-predict(ModFit,Testing) # predicting on testset
CM<-confusionMatrix(PredFit,Testing$classe)
CM$overall["Accuracy"]

##  Accuracy 
## 0.4892939

# plot(CM$table,main="Decision Tree Prediction Accuracy= 48.9%")

We see that the accuracy of this model is low, about 49%, so the estimated out-of-sample error is about 51%, which is too large to be acceptable.

2.Random Forest Model

set.seed(352020)
RFModFit <-train(classe ~.,data=Training, method="rf",ntree=100)
RFPredFit<-predict(RFModFit,Testing)
RFCM<-confusionMatrix(RFPredFit,Testing$classe)
RFCM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2229   22    0    0    0
##          B    3 1481   16    1    0
##          C    0   14 1346   15    2
##          D    0    0    6 1269    5
##          E    0    1    0    1 1435
## 
## Overall Statistics
##                                           
##                Accuracy : 0.989           
##                  95% CI : (0.9865, 0.9912)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9861          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9987   0.9756   0.9839   0.9868   0.9951
## Specificity            0.9961   0.9968   0.9952   0.9983   0.9997
## Pos Pred Value         0.9902   0.9867   0.9775   0.9914   0.9986
## Neg Pred Value         0.9995   0.9942   0.9966   0.9974   0.9989
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2841   0.1888   0.1716   0.1617   0.1829
## Detection Prevalence   0.2869   0.1913   0.1755   0.1631   0.1832
## Balanced Accuracy      0.9974   0.9862   0.9896   0.9926   0.9974

plot(RFCM$table,main="Random Forest Prediction Accuracy= 98.9%")
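The overall statistics printed above can be recomputed by hand from the confusion matrix, which makes explicit what accuracy and out-of-sample error mean here. The sketch below copies the table from the output; the variable names are illustrative only.

```r
# Random forest confusion matrix copied from the output above
# (rows = predictions, columns = reference classes).
classes <- c("A", "B", "C", "D", "E")
rf_table <- matrix(
  c(2229,   22,    0,    0,    0,
       3, 1481,   16,    1,    0,
       0,   14, 1346,   15,    2,
       0,    0,    6, 1269,    5,
       0,    1,    0,    1, 1435),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy <- sum(diag(rf_table)) / sum(rf_table)    # share of correct predictions
oos_error <- 1 - accuracy                          # estimated out-of-sample error
sensitivity <- diag(rf_table) / colSums(rf_table)  # per-class recall

print(round(accuracy, 4))
print(round(oos_error, 4))
print(round(sensitivity, 4))
```

The diagonal holds the correctly classified cases, so accuracy is 7760 / 7846, and the error estimate is its complement. It is an estimate of out-of-sample error because the Testing rows were not used to fit the model.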

The accuracy of the random forest model is very high, about 99%, and the estimated out-of-sample error is about 1%, so this might be the best model in this case.

3.Gradient Boosting Model

set.seed(352020)
GBMModFit <-train(classe ~.,data=Training, method="gbm",verbose=FALSE)
GBMPredFit<-predict(GBMModFit,Testing)
GBMCM<-confusionMatrix(GBMPredFit,Testing$classe)
GBMCM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2198   52    0    5    0
##          B   23 1417   49   10   16
##          C    8   42 1303   36   14
##          D    2    1   10 1221   21
##          E    1    6    6   14 1391
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9597         
##                  95% CI : (0.9551, 0.964)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.949          
##                                          
##  Mcnemar's Test P-Value : 6.713e-08      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9848   0.9335   0.9525   0.9495   0.9646
## Specificity            0.9898   0.9845   0.9846   0.9948   0.9958
## Pos Pred Value         0.9747   0.9353   0.9287   0.9729   0.9810
## Neg Pred Value         0.9939   0.9840   0.9899   0.9901   0.9921
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2801   0.1806   0.1661   0.1556   0.1773
## Detection Prevalence   0.2874   0.1931   0.1788   0.1600   0.1807
## Balanced Accuracy      0.9873   0.9590   0.9685   0.9721   0.9802

plot(GBMCM$table,main="Gradient Boosting Prediction Accuracy= 95.97%")
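The Kappa value in the output above compares the observed agreement with the agreement expected by chance given the row and column totals: kappa = (p_o - p_e) / (1 - p_e). As a worked check, the sketch below recomputes it from the gradient boosting confusion matrix copied from the output; the variable names are illustrative only.

```r
# Gradient boosting confusion matrix copied from the output above
# (rows = predictions, columns = reference classes).
classes <- c("A", "B", "C", "D", "E")
gbm_table <- matrix(
  c(2198,   52,    0,    5,    0,
      23, 1417,   49,   10,   16,
       8,   42, 1303,   36,   14,
       2,    1,   10, 1221,   21,
       1,    6,    6,   14, 1391),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(gbm_table)
observed <- sum(diag(gbm_table)) / n                            # p_o: accuracy
expected <- sum(rowSums(gbm_table) * colSums(gbm_table)) / n^2  # p_e: chance agreement
cohen_kappa <- (observed - expected) / (1 - expected)

print(round(observed, 4))
print(round(cohen_kappa, 3))
```

A Kappa close to the accuracy, as here, indicates that the high accuracy is not simply a by-product of one dominant class.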

This model reaches an accuracy of about 96%, with an estimated out-of-sample error of about 4%; its accuracy is lower than that of the random forest model above.

Random Forest Model with repeated cross-validation

RFcontrol<-trainControl(method="repeatedcv",number=5,repeats = 3)
set.seed(352020)
RFcustomFit <-train(classe ~.,data=Training, method="rf",trControl=RFcontrol,ntree=100)
RFPredcustom<-predict(RFcustomFit,Testing)
RFcustomCM<-confusionMatrix(RFPredcustom,Testing$classe)
RFcustomCM$overall["Accuracy"]

##  Accuracy 
## 0.9913332

trellis.par.set(caretTheme())
plot(RFcustomFit, metric = "Accuracy",main="RFM:Accuracy=99.13%")
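With `method="repeatedcv"`, `number=5`, and `repeats=3`, each candidate tuning value is fitted 15 times: 5 folds in each of 3 independent shufflings of the training rows. The sketch below shows how such resamples can be laid out in base R. It is a simplified illustration, not caret's code; in particular it shuffles rows without balancing the classes across folds, and the toy data size is made up.

```r
set.seed(352020)

n_rows <- 100   # toy data size (hypothetical)
k <- 5          # corresponds to number = 5
repeats <- 3    # corresponds to repeats = 3

# For each repeat, shuffle a balanced vector of fold labels over the rows.
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n_rows))
})

# Each (repeat, fold) pair is one resample: hold out that fold, fit on the rest.
resamples <- expand.grid(fold = seq_len(k), rep = seq_len(repeats))
resamples$n_holdout <- mapply(
  function(f, r) sum(fold_ids[[r]] == f),
  resamples$fold, resamples$rep
)

print(nrow(resamples))
print(head(resamples))
```

The accuracy reported per tuning value is then the average over these resamples, which is why repeated cross-validation gives a steadier estimate than a single 5-fold pass at the cost of more model fits.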

GBM with repeated cross-validation

gbmcontrol<-trainControl(method="repeatedcv",number=5,repeats = 3)
set.seed(352020)
GBMcustomFit <-train(classe ~.,data=Training, method="gbm",trControl=gbmcontrol,verbose=FALSE)
GBMPredcustom<-predict(GBMcustomFit,Testing)
GBMcustomCM<-confusionMatrix(GBMPredcustom,Testing$classe)
GBMcustomCM$overall["Accuracy"]

##  Accuracy 
## 0.9562835

plot(GBMcustomFit,metric = "Accuracy",main="GBM:Accuracy=95.63%")

Linear Discriminant Analysis

ldacontrol<-trainControl(method="repeatedcv",number=5,repeats = 3)
set.seed(352020)
ldaMod <- train(classe ~ ., data=Training, method = "lda",trControl=ldacontrol)
ldapredict<-predict(ldaMod,Testing)
ldaCM<-confusionMatrix(ldapredict,Testing$classe)
ldaCM$overall["Accuracy"]

##  Accuracy 
## 0.6918175

# plot(ldaCM$table,main=" Linear Discriminant Analysis Accuracy= 69.18%")

This model shows only 69.18% accuracy, with an estimated out-of-sample error of about 31%, which is not good enough for it to be considered as the best model.
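To compare the candidates side by side, the hold-out accuracies printed in the sections above can be collected into one table and sorted. The numbers below are copied from that output; the model labels and variable names are illustrative only.

```r
# Hold-out accuracies as printed in the outputs above.
results <- data.frame(
  model = c("Decision tree (rpart)",
            "Random forest",
            "Gradient boosting",
            "Random forest, repeated CV",
            "Gradient boosting, repeated CV",
            "LDA, repeated CV"),
  accuracy = c(0.4892939, 0.9890, 0.9597, 0.9913332, 0.9562835, 0.6918175),
  stringsAsFactors = FALSE
)

# Estimated out-of-sample error is the complement of hold-out accuracy.
results$oos_error <- 1 - results$accuracy

# Sort from most to least accurate and pick the top model.
results <- results[order(-results$accuracy), ]
best <- results$model[1]

print(results, row.names = FALSE)
print(best)
```

Both random forest fits come out ahead of the other models, and the difference between the two of them is small compared with the gap to gradient boosting.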

Conclusion

Comparing all of the above models, fitted with bootstrapping and with repeated cross-validation, the random forest model has the highest accuracy and the lowest estimated out-of-sample error, so it is the model of choice. We therefore apply it to predict the 20 cases in test_data.

result<-predict(RFModFit,newdata=test_data)
result

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Building Best Prediction Model Using Practical Machine learning: Coursera Project

Satindra Kathania

5/4/2020