Overview
Devices such as Fitbit and Nike FuelBand now make it easy to collect data about physical activity. Devices of this type are part of the quantified-self movement, a group of enthusiasts who regularly take measurements of themselves to improve their health and to find patterns in their behaviour.
The goal of this project is to use data from accelerometers on the belt, forearm, arm and dumbbell of 6 participants to predict the manner in which they did the exercise.
The exercise was performed in 5 different ways, recorded in the classe variable:
A: Exactly according to the specification
B: Throwing the elbows to the front
C: Lifting the dumbbell only halfway
D: Lowering the dumbbell only halfway
E: Throwing the hips to the front
The dataset used in this project is courtesy of Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements.
Required Libraries for this project
library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.2
## corrplot 0.84 loaded
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.0.2
library(rattle)
## Warning: package 'rattle' was built under R version 4.0.2
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(e1071)
## Warning: package 'e1071' was built under R version 4.0.2
Data Processing
Getting data
trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
Reading data
training<-read.csv(trainURL)
testing<-read.csv(testURL)
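CSV exports of this kind can contain empty fields and spreadsheet error tokens such as `#DIV/0!` alongside `NA`. The following is a minimal sketch, assuming you want all three read as missing values. The inline three-row CSV and the name `demo` are made up for illustration and stand in for the downloaded file.

```r
# Hypothetical three-row sample standing in for pml-training.csv
csvText <- 'roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,B
1.48,NA,C'

# Treat "NA", empty fields and "#DIV/0!" as missing values
demo <- read.csv(text = csvText, na.strings = c("NA", "", "#DIV/0!"))

# Number of missing values per column
colSums(is.na(demo))
```

Passing the same `na.strings` argument to the `read.csv(trainURL)` call would let the later NA-based column filter catch these columns directly.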
# Data Analysis
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
## names(training) and str(training) are commented out because their output is very long
## names(training)
## str(training)
Data Slicing
inTrain<-createDataPartition(training$classe, p=0.7, list=FALSE)
trainSet<-training[inTrain,]
testSet<-training[-inTrain,]
dim(trainSet)
## [1] 13737 160
dim(testSet)
## [1] 5885 160
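The partition is drawn at random, so the exact rows in each part (and the accuracies reported further down) change from run to run unless a seed is fixed. Here is a rough base-R sketch of a seeded, stratified 70/30 split. It approximates the idea behind `createDataPartition()` and is not caret's implementation. The seed value and the toy outcome vector are arbitrary assumptions.

```r
set.seed(1234)  # arbitrary seed, chosen only for illustration

# Toy outcome vector standing in for training$classe
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 40, 30, 30, 50)))

# Sample 70% of the row indices within each class
idxByClass <- split(seq_along(classe), classe)
inTrainToy <- unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

table(classe[inTrainToy])   # per-class counts in the training part
table(classe[-inTrainToy])  # per-class counts in the hold-out part
```

With caret, calling `set.seed()` immediately before `createDataPartition()` should make the split repeatable in the same way.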
Data cleaning
## In data cleaning, first, remove the columns that contain NA values
## (the columns to keep are chosen on the training part and applied to both parts)
noNA<-colSums(is.na(trainSet))==0
trainSet1<-trainSet[ , noNA]
testSet1<-testSet[ , noNA]
dim(trainSet1)
## [1] 13737 93
dim(testSet1)
## [1] 5885 93
## In data cleaning, second, remove the columns that have near zero variance
nearZero<-nearZeroVar(trainSet1)
trainSet2<-trainSet1[ , -nearZero]
testSet2<-testSet1[ , -nearZero]
dim(trainSet2)
## [1] 13737 59
dim(testSet2)
## [1] 5885 59
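To make the near-zero-variance filter less of a black box, here is a base-R approximation of the two statistics it is built on: the ratio of the most common value's frequency to the second most common, and the percentage of distinct values. The helper names `nzvStats` and `isNZV` are hypothetical, and the cutoffs mirror what I understand to be caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`).

```r
# Frequency ratio and percentage of unique values for one column
nzvStats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freqRatio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  percentUnique <- 100 * length(counts) / length(x)
  c(freqRatio = freqRatio, percentUnique = percentUnique)
}

# Flag a column when one value dominates AND there are few distinct values
isNZV <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  s <- nzvStats(x)
  s[["freqRatio"]] > freqCut && s[["percentUnique"]] < uniqueCut
}

# Toy columns: one almost constant, one well spread
almostConstant <- c(rep("no", 98), rep("yes", 2))
spread <- 1:100

nzvStats(almostConstant)
isNZV(almostConstant)
isNZV(spread)
```

The almost-constant column is flagged because a single value dominates and there are only two distinct values. The spread column passes because every value is distinct.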
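## In data cleaning, finally, remove the first seven columns, with the aim of dropping
## bookkeeping fields (row index, user name, timestamps, window number) that are not sensor measurements.
## Check names(trainSet2)[1:7] first: if nearZeroVar already removed one of these fields,
## the seventh column is a sensor variable.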
trainSet3<- trainSet2[ , -c(1:7)]
testSet3<-testSet2[ , -c(1:7)]
dim(trainSet3)
## [1] 13737 52
dim(testSet3)
## [1] 5885 52
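Dropping columns by position depends on what the earlier cleaning steps removed. If the near-zero-variance filter has already taken out one bookkeeping field, position 7 may hold a sensor reading. Here is a hedged sketch of dropping by name instead. The names in `idCols` are assumed from the CSV header and should be checked against `names(training)`. The toy data frame is made up.

```r
# Bookkeeping columns, assumed from the CSV header; adjust to names(training)
idCols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
            "cvtd_timestamp", "new_window", "num_window")

# Toy frame standing in for trainSet2 (new_window already removed upstream)
toy <- data.frame(X = 1:2, user_name = c("p1", "p2"), num_window = c(11, 12),
                  roll_belt = c(1.41, 1.42), classe = c("A", "B"))

# setdiff() keeps every column not listed in idCols and ignores listed
# names that are no longer present, so no sensor column is removed by accident
toyClean <- toy[, setdiff(names(toy), idCols), drop = FALSE]
names(toyClean)
```

Applying the same `setdiff()` expression to both parts keeps their columns aligned.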
Correlation analysis
## Checking the highly correlated variables prior to modelling
trainD<-sapply(trainSet3, is.numeric)
corMatrix<-cor(trainSet3[trainD])
corrplot(corMatrix, order="FPC", method="color",
tl.cex=0.45, tl.col="blue", number.cex=0.25)
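A colour plot shows where correlation is concentrated, but a list of the offending pairs is often easier to act on. Below is a small base-R sketch on simulated data. The 0.8 cutoff, the seed and the toy variables are arbitrary assumptions for illustration.

```r
set.seed(42)  # arbitrary seed for the toy data
x1 <- rnorm(200)
toyNum <- data.frame(x1 = x1,
                     x2 = x1 + rnorm(200, sd = 0.1),  # nearly a copy of x1
                     x3 = rnorm(200))                 # unrelated to x1

corToy <- cor(toyNum)

# Keep each pair once (upper triangle) and apply the cutoff
high <- which(abs(corToy) > 0.8 & upper.tri(corToy), arr.ind = TRUE)
highPairs <- data.frame(var1 = rownames(corToy)[high[, "row"]],
                        var2 = colnames(corToy)[high[, "col"]],
                        r    = corToy[high])
highPairs
```

The same idea applied to `corMatrix` would list the sensor pairs that carry nearly the same information. caret also provides `findCorrelation()` for suggesting columns to drop.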

Prediction model building
Here, we are using two models to predict the outcome variable.
1. Decision tree model
2. Random Forest model
To limit the effects of overfitting, the decision tree is tuned with 5-fold cross-validation. The random forest is fitted directly and reports an out-of-bag (OOB) error estimate instead.
We also compute a confusion matrix on the held-out set for each model to assess its accuracy.
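For readers unfamiliar with k-fold cross-validation, the sketch below spells out the loop by hand using `rpart` on the built-in `iris` data. It is only an illustration of the resampling idea, not what `train()` does internally. The seed and the choice of `iris` are assumptions.

```r
library(rpart)

set.seed(2020)  # arbitrary seed for the fold assignment
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Fit on k-1 folds, score on the remaining fold, repeat for every fold
foldAcc <- sapply(seq_len(k), function(i) {
  fit  <- rpart(Species ~ ., data = iris[folds != i, ], method = "class")
  pred <- predict(fit, newdata = iris[folds == i, ], type = "class")
  mean(pred == iris$Species[folds == i])
})

foldAcc        # accuracy on each held-out fold
mean(foldAcc)  # cross-validated accuracy estimate
```

Averaging over folds gives an accuracy estimate that does not rely on a single split.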
Method 1: Decision tree
## Model fit
DTcontrol<-trainControl(method="cv", number=5)
DTmodel<-train(classe~., data=trainSet3,
method= "rpart", trControl=DTcontrol)
fancyRpartPlot(DTmodel$finalModel,
sub="Classification Tree")

Prediction on test data set
DTpred<-predict(DTmodel, newdata=testSet3)
DTcfm<-confusionMatrix(table(DTpred,testSet3$classe))
DTcfm
## Confusion Matrix and Statistics
##
##
## DTpred    A    B    C    D    E
##      A 1528  499  487  435  246
##      B   20  364   28  182  211
##      C   90  240  413  120  264
##      D   22   36   98  227   43
##      E   14    0    0    0  318
##
## Overall Statistics
##
## Accuracy : 0.4843
## 95% CI : (0.4714, 0.4971)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3245
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9128  0.31958  0.40253  0.23548  0.29390
## Specificity            0.6041  0.90708  0.85306  0.95956  0.99709
## Pos Pred Value         0.4782  0.45217  0.36646  0.53286  0.95783
## Neg Pred Value         0.9457  0.84744  0.87116  0.86499  0.86242
## Prevalence             0.2845  0.19354  0.17434  0.16381  0.18386
## Detection Rate         0.2596  0.06185  0.07018  0.03857  0.05404
## Detection Prevalence   0.5429  0.13679  0.19150  0.07239  0.05641
## Balanced Accuracy      0.7585  0.61333  0.62780  0.59752  0.64549
Plot matrix results
plot(DTcfm$table, col=DTcfm$byClass,
main=paste("Decision Tree Accuracy =", round(DTcfm$overall['Accuracy'], digits=4)))

The decision tree reaches an accuracy of only 0.4843 on the held-out set, so its expected out-of-sample error is about 0.52 (1 - 0.4843 = 0.5157).
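The headline numbers can be recomputed by hand from the decision tree confusion matrix printed above. The sketch below does the arithmetic for accuracy, Cohen's kappa and the out-of-sample error. The variable names are illustrative only.

```r
# Decision tree confusion matrix copied from the output above
# (rows = predicted class, columns = actual class)
cm <- matrix(c(1528, 499, 487, 435, 246,
                 20, 364,  28, 182, 211,
                 90, 240, 413, 120, 264,
                 22,  36,  98, 227,  43,
                 14,   0,   0,   0, 318),
             nrow = 5, byrow = TRUE,
             dimnames = list(pred = LETTERS[1:5], actual = LETTERS[1:5]))

n         <- sum(cm)
accuracy  <- sum(diag(cm)) / n                     # observed agreement
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappaStat <- (accuracy - expected) / (1 - expected)
oosError  <- 1 - accuracy

round(c(accuracy = accuracy, kappa = kappaStat, oosError = oosError), 4)
```

This reproduces the reported accuracy of 0.4843 and kappa of 0.3245. The out-of-sample error estimate is the complement of the accuracy.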
Method 2: Random forest
## Model fit
RFmodel<-randomForest(as.factor(classe)~.,
data=trainSet3, ntree=500, importance=TRUE)
RFmodel
##
## Call:
## randomForest(formula = as.factor(classe) ~ ., data = trainSet3, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.62%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3898    6    0    1    1 0.002048131
## B   13 2639    6    0    0 0.007148232
## C    0   18 2377    1    0 0.007929883
## D    0    0   28 2222    2 0.013321492
## E    0    0    3    6 2516 0.003564356
Prediction on test data set
RFpred<-predict(RFmodel, newdata=testSet3)
RFcfm<-confusionMatrix(table(RFpred, testSet3$classe))
RFcfm
## Confusion Matrix and Statistics
##
##
## RFpred    A    B    C    D    E
##      A 1673    4    0    0    0
##      B    1 1128    6    0    0
##      C    0    7 1020   10    2
##      D    0    0    0  953    0
##      E    0    0    0    1 1080
##
## Overall Statistics
##
## Accuracy : 0.9947
## 95% CI : (0.9925, 0.9964)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9933
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9903   0.9942   0.9886   0.9982
## Specificity            0.9991   0.9985   0.9961   1.0000   0.9998
## Pos Pred Value         0.9976   0.9938   0.9817   1.0000   0.9991
## Neg Pred Value         0.9998   0.9977   0.9988   0.9978   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1917   0.1733   0.1619   0.1835
## Detection Prevalence   0.2850   0.1929   0.1766   0.1619   0.1837
## Balanced Accuracy      0.9992   0.9944   0.9951   0.9943   0.9990
Plot matrix results
plot(RFcfm$table, col=RFcfm$byClass,
main=paste("Random Forest Accuracy =", round(RFcfm$overall['Accuracy'], digits=4)))

The random forest reaches an accuracy of 0.9947 on the held-out set, so the estimated out-of-sample error is about 0.53% (1 - 0.9947), close to the OOB estimate of 0.62%.
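The out-of-sample error can be read directly off the held-out confusion matrix for the random forest. The sketch below recomputes it and an exact binomial interval for the accuracy. I am assuming the interval printed by `confusionMatrix()` is of this Clopper-Pearson type. The variable names are illustrative.

```r
# Random forest hold-out results from the confusion matrix above
correct <- 1673 + 1128 + 1020 + 953 + 1080   # diagonal of the matrix
total   <- 5885                               # rows in testSet3

rfAccuracy <- correct / total
rfOosError <- 1 - rfAccuracy

# Exact (Clopper-Pearson) binomial interval for the accuracy
rfCI <- binom.test(correct, total)$conf.int

round(c(accuracy = rfAccuracy, oosError = rfOosError), 4)
round(as.numeric(rfCI), 4)
```

With 31 of 5885 held-out rows misclassified, the error is small but not zero, and it is of the same order as the OOB estimate reported by the model.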
Applying the random forest model to the test data
finalPredict<-predict(RFmodel, newdata=testing)
finalPredict
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E