Overview

Devices such as Fitbit and Nike FuelBand make it easy to collect data about physical activity. Devices of this type are part of the quantified self movement, a group of enthusiasts who regularly take measurements of themselves to improve their health and to find patterns in their behaviour.

The goal of this project is to use data from accelerometers on the belt, forearm, arm and dumbbell of 6 participants to predict the manner in which they did the exercise.

The exercise was performed in 5 different ways, labelled A to E:

A: Exactly according to the specification

B: Throwing the elbows to the front

C: Lifting the dumbbell only halfway

D: Lowering the dumbbell only halfway

E: Throwing the hips to the front

The dataset used in this project is courtesy of Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H., “Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements”.

Required Libraries for this project

library(knitr)
library(caret)
## Warning: package 'caret' was built under R version 4.0.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.2
## corrplot 0.84 loaded
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.0.2
library(rattle)
## Warning: package 'rattle' was built under R version 4.0.2
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(e1071)
## Warning: package 'e1071' was built under R version 4.0.2

Data Processing

Getting data

trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"

testURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

Reading data

training<-read.csv(trainURL)

testing<-read.csv(testURL)
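
The raw files mark missing measurements in more than one way. The following is a minimal sketch, assuming that blank fields and the spreadsheet error string `#DIV/0!` should both count as missing. The report itself reads the files with the default settings, and the inline `csvText` sample is made up for illustration.

```r
# Made-up three-row sample that mimics the missing-value markers (assumption).
csvText <- "roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,NA,B"

# Treat "NA", empty fields and "#DIV/0!" as missing while reading.
sampleData <- read.csv(text = csvText,
                       na.strings = c("NA", "", "#DIV/0!"))

# Number of missing values per column.
colSums(is.na(sampleData))
```

Reading the data this way would let the NA filter in the cleaning step catch such columns directly, instead of leaving them to the near-zero-variance filter.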

# Data Analysis

dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160
## The names() and str() calls below are commented out because their output is very long.

## names(training)

## str(training)

Data Slicing

inTrain<-createDataPartition(training$classe, p=0.7, list=FALSE)

trainSet<-training[inTrain,]

testSet<-training[-inTrain,]

dim(trainSet)
## [1] 13737   160
dim(testSet)
## [1] 5885  160
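
The partition is drawn at random, so the rows that end up in each part change from run to run unless a seed is set first. The base-R sketch below illustrates the idea of a seeded, class-stratified 70/30 split on a toy outcome vector. It is not the `createDataPartition` implementation, and the seed value and the toy class counts are assumptions.

```r
set.seed(1234)  # arbitrary seed (assumption); any fixed value makes the split repeatable

# Toy outcome vector standing in for training$classe.
classe <- rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 20, 20))

# Sample 70% of the row indices within each class, then pool them.
rowsByClass <- split(seq_along(classe), classe)
inTrainToy <- sort(unlist(lapply(rowsByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

table(classe[inTrainToy])   # class counts in the 70% part
table(classe[-inTrainToy])  # class counts in the 30% part
```

Sampling within each class keeps the class proportions of the two parts close to those of the full data.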

Data cleaning

## First, remove the columns that contain NA values, using the same column selection for both sets

naFree<-colSums(is.na(trainSet))==0

trainSet1<-trainSet[ , naFree]

testSet1<-testSet[ , naFree]

dim(trainSet1)
## [1] 13737    93
dim(testSet1)
## [1] 5885   93
## Second, remove the columns that have near-zero variance

nearZero<-nearZeroVar(trainSet1)

trainSet2<-trainSet1[ , -nearZero]

testSet2<-testSet1[ , -nearZero]

dim(trainSet2)
## [1] 13737    59
dim(testSet2)
## [1] 5885   59
## Finally, remove the first seven variables, which have little impact on the outcome variable

trainSet3<- trainSet2[ , -c(1:7)]

testSet3<-testSet2[ , -c(1:7)]

dim(trainSet3)
## [1] 13737    52
dim(testSet3)
## [1] 5885   52
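
Dropping columns by position depends on which columns survived the earlier filters. A name-based selection, decided on the training part and then applied unchanged to the test part, is one way to make this step easier to check. The sketch below uses a toy data frame. Its column names follow the dataset's bookkeeping columns, but the `idCols` list is a hypothetical choice for illustration.

```r
# Toy data frame; column names mimic the dataset's bookkeeping columns (assumption).
toyTrain <- data.frame(X = 1:4,
                       user_name = c("a", "b", "a", "b"),
                       num_window = c(11, 11, 12, 12),
                       roll_belt = c(1.41, 1.42, 1.48, 1.45),
                       all_missing = NA,
                       classe = c("A", "A", "B", "B"))
toyTest <- toyTrain[c(2, 4), ]

# Decide which columns to keep using the training part only.
noNA <- names(toyTrain)[colSums(is.na(toyTrain)) == 0]
idCols <- c("X", "user_name", "num_window")  # hypothetical list of columns to drop
keep <- setdiff(noNA, idCols)

# Apply the identical selection to both parts.
toyTrainClean <- toyTrain[, keep]
toyTestClean <- toyTest[, keep]

names(toyTrainClean)
```

Printing the names that are dropped before removing them makes it easy to confirm that no sensor column is removed by accident.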

Correlation analysis

## Checking the highly correlated variables prior to modelling

trainD<-sapply(trainSet3, is.numeric)

corMatrix<-cor(trainSet3[trainD])

corrplot(corMatrix, order="FPC", method="color",
           tl.cex=0.45, tl.col="blue", number.cex=0.25)

Names of the highly correlated variables (absolute pairwise correlation above 0.75), which appear as dark-coloured cells in the correlation plot.

highCor<-findCorrelation(corMatrix, cutoff=0.75)

colnames(corMatrix)[highCor]
##  [1] "accel_belt_z"      "accel_dumbbell_z"  "accel_belt_y"     
##  [4] "accel_arm_y"       "total_accel_belt"  "accel_belt_x"     
##  [7] "pitch_belt"        "accel_dumbbell_y"  "magnet_dumbbell_x"
## [10] "magnet_dumbbell_y" "accel_arm_x"       "accel_dumbbell_x" 
## [13] "accel_arm_z"       "magnet_arm_y"      "magnet_belt_z"    
## [16] "accel_forearm_y"   "gyros_dumbbell_x"  "gyros_forearm_y"  
## [19] "gyros_dumbbell_z"  "gyros_arm_x"
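
To see which pairs of variables drive such a list, the correlation matrix can be scanned directly. The sketch below does this in base R on toy data. It only lists pairs above the cutoff and does not reproduce the rule `findCorrelation` uses to decide which member of a pair to flag. The toy variables and the seed are assumptions.

```r
set.seed(42)  # arbitrary seed for the toy data (assumption)

# Toy predictors: b is nearly a copy of a, c is independent noise.
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

toyCor <- cor(toy)

# List each pair once (upper triangle) whose absolute correlation exceeds 0.75.
highPairs <- which(abs(toyCor) > 0.75 & upper.tri(toyCor), arr.ind = TRUE)
data.frame(var1 = rownames(toyCor)[highPairs[, "row"]],
           var2 = colnames(toyCor)[highPairs[, "col"]],
           r = toyCor[highPairs])
```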

Prediction model building

Here, we are using two models to predict the outcome variable.

1. Decision tree model

2. Random Forest model

To limit overfitting and obtain a more reliable accuracy estimate, the decision tree is tuned with 5-fold cross-validation. The random forest reports its own out-of-bag (OOB) error estimate.

We also compute a confusion matrix on the held-out set for each model to assess its accuracy.
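
As an illustration of what 5-fold cross-validation does with the rows, the base-R sketch below assigns every row of a toy data set to one of five folds and shows which rows would be used for fitting and for scoring in each round. It is a sketch of the idea only, not of caret's internal resampling code. The row count and seed are assumptions.

```r
set.seed(2020)  # arbitrary seed (assumption)

n <- 20  # toy number of rows
k <- 5   # number of folds, matching number=5 in trainControl

# Shuffle fold labels so every row lands in exactly one fold.
foldId <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  holdOut <- which(foldId == fold)  # rows used to score the model
  fitRows <- which(foldId != fold)  # rows used to fit the model
  cat("fold", fold, ": fit on", length(fitRows),
      "rows, score on", length(holdOut), "rows\n")
}
```

Each row is scored exactly once by a model that never saw it during fitting, and the five scores are averaged.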

Method 1: Decision tree

## Model fit

DTcontrol<-trainControl(method="cv", number=5)

DTmodel<-train(classe~., data=trainSet3, 
               method=  "rpart", trControl=DTcontrol)

fancyRpartPlot(DTmodel$finalModel,
               sub="Classification Tree")

Prediction on the held-out test set

DTpred<-predict(DTmodel, newdata=testSet3)

DTcfm<-confusionMatrix(table(DTpred,testSet3$classe))

DTcfm
## Confusion Matrix and Statistics
## 
##       
## DTpred    A    B    C    D    E
##      A 1528  499  487  435  246
##      B   20  364   28  182  211
##      C   90  240  413  120  264
##      D   22   36   98  227   43
##      E   14    0    0    0  318
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4843          
##                  95% CI : (0.4714, 0.4971)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3245          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9128  0.31958  0.40253  0.23548  0.29390
## Specificity            0.6041  0.90708  0.85306  0.95956  0.99709
## Pos Pred Value         0.4782  0.45217  0.36646  0.53286  0.95783
## Neg Pred Value         0.9457  0.84744  0.87116  0.86499  0.86242
## Prevalence             0.2845  0.19354  0.17434  0.16381  0.18386
## Detection Rate         0.2596  0.06185  0.07018  0.03857  0.05404
## Detection Prevalence   0.5429  0.13679  0.19150  0.07239  0.05641
## Balanced Accuracy      0.7585  0.61333  0.62780  0.59752  0.64549

Plot matrix results

plot(DTcfm$table, col=DTcfm$byClass, 
      main=paste("Decision Tree Accuracy =", round(DTcfm$overall['Accuracy'], digits=4)))

The decision tree reaches an accuracy of only 0.4843 on the held-out set, so its expected out-of-sample error is large: 1 - 0.4843 = 0.5157, or about 52%.
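
The accuracy, the out-of-sample error and the per-class sensitivity can be recomputed by hand from the confusion matrix counts printed above. The sketch below does so in base R. The object names are illustrative and the counts are copied from the decision tree output.

```r
# Decision tree confusion matrix copied from the output above
# (rows = predicted class, columns = actual class).
dtTable <- matrix(c(1528, 499, 487, 435, 246,
                      20, 364,  28, 182, 211,
                      90, 240, 413, 120, 264,
                      22,  36,  98, 227,  43,
                      14,   0,   0,   0, 318),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(predicted = LETTERS[1:5],
                                  actual = LETTERS[1:5]))

accuracy <- sum(diag(dtTable)) / sum(dtTable)    # correct predictions / all predictions
outOfSampleError <- 1 - accuracy                 # share of held-out rows misclassified
sensitivity <- diag(dtTable) / colSums(dtTable)  # per-class share of actual cases found

round(accuracy, 4)
round(outOfSampleError, 4)
round(sensitivity, 4)
```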

Method 2: Random Forest

## Model fit

RFmodel<-randomForest(as.factor(classe)~.,
        data=trainSet3, ntree=500, importance=TRUE)

RFmodel
## 
## Call:
##  randomForest(formula = as.factor(classe) ~ ., data = trainSet3,      ntree = 500, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.62%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3898    6    0    1    1 0.002048131
## B   13 2639    6    0    0 0.007148232
## C    0   18 2377    1    0 0.007929883
## D    0    0   28 2222    2 0.013321492
## E    0    0    3    6 2516 0.003564356

Prediction on the held-out test set

RFpred<-predict(RFmodel, newdata=testSet3)

RFcfm<-confusionMatrix(table(RFpred, testSet3$classe))

RFcfm
## Confusion Matrix and Statistics
## 
##       
## RFpred    A    B    C    D    E
##      A 1673    4    0    0    0
##      B    1 1128    6    0    0
##      C    0    7 1020   10    2
##      D    0    0    0  953    0
##      E    0    0    0    1 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9925, 0.9964)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9933          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9903   0.9942   0.9886   0.9982
## Specificity            0.9991   0.9985   0.9961   1.0000   0.9998
## Pos Pred Value         0.9976   0.9938   0.9817   1.0000   0.9991
## Neg Pred Value         0.9998   0.9977   0.9988   0.9978   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1917   0.1733   0.1619   0.1835
## Detection Prevalence   0.2850   0.1929   0.1766   0.1619   0.1837
## Balanced Accuracy      0.9992   0.9944   0.9951   0.9943   0.9990

Plot matrix results

plot(RFcfm$table, col=RFcfm$byClass, 
      main=paste("Random Forest Accuracy =", round(RFcfm$overall['Accuracy'], digits=4)))

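The accuracy of the random forest model on the held-out set is very high (0.9947), so the expected out-of-sample error is small but not zero: 1 - 0.9947 = 0.0053, or about 0.53%. This is in line with the OOB error estimate of 0.62%.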
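
The held-out error of the random forest and an interval around its accuracy can be checked from the confusion matrix counts. The sketch below uses base R. The object names are illustrative, and using an exact binomial interval from `binom.test` is an assumption about how a comparable interval can be obtained, not a statement about how the reported CI was produced.

```r
# Random forest confusion matrix copied from the output above
# (rows = predicted class, columns = actual class).
rfTable <- matrix(c(1673,    4,    0,   0,    0,
                       1, 1128,    6,   0,    0,
                       0,    7, 1020,  10,    2,
                       0,    0,    0, 953,    0,
                       0,    0,    0,   1, 1080),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(predicted = LETTERS[1:5],
                                  actual = LETTERS[1:5]))

correct <- sum(diag(rfTable))  # rows classified correctly
total <- sum(rfTable)          # all held-out rows
heldOutError <- 1 - correct / total

# Exact binomial 95% interval for the accuracy (assumption, see lead-in).
accuracyCI <- binom.test(correct, total)$conf.int

round(100 * heldOutError, 2)  # held-out error in percent
round(accuracyCI, 4)
```

The held-out error can then be set side by side with the OOB estimate from the model summary. The two are computed on different rows, so small differences are expected.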

Applying the random forest model to the 20 test cases

finalPredict<-predict(RFmodel, newdata=testing)

finalPredict
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
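
For reporting, the 20 predictions can be collected in a small table and summarised by class. The sketch below rebuilds them from the output above. The `problemId` column name and the `submission` object are hypothetical names used only for this illustration.

```r
# Predictions copied from the output above, in problem order 1 to 20.
finalLabels <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
                 "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")

# problemId is a hypothetical column name (assumption).
submission <- data.frame(problemId = seq_along(finalLabels),
                         classe = factor(finalLabels, levels = LETTERS[1:5]))

table(submission$classe)  # how often each class was predicted
```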