R Markdown

Objective of the project

To predict the manner in which an exercise was performed (the classe variable), that is, whether it was done the correct way or an incorrect way.

Data sources

Data was downloaded using the read.csv() function from the following sources.

Training Source : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Test Source : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

A basic inspection using the summary() function showed that many variables consist mostly of NAs or blank strings (""). A sample summary for four such variables is reproduced below:

summary(training[, 12:15])
##  kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
##           :19216             :19216            :19216              :19216   
##  #DIV/0!  :   10    #DIV/0!  :   32     #DIV/0!:  406     #DIV/0!  :    9   
##  -1.908453:    2    47.000000:    4                       0.000000 :    4   
##  -0.016850:    1    -0.150950:    3                       0.422463 :    2   
##  -0.021024:    1    -0.684748:    3                       -0.003095:    1   
##  -0.025513:    1    -1.750749:    3                       -0.010002:    1   
##  (Other)  :  391    (Other)  :  361                       (Other)  :  389

As can be seen above, 19216 of the 19622 rows are blank in each of these variables. Such a high share of NAs or ""s makes these variables redundant for prediction purposes.
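One possible way to make these blanks explicit at load time is the na.strings argument of read.csv(), which can map both "" and "#DIV/0!" to NA so that a single is.na() check covers every kind of missing value. This is only a sketch of an alternative, not what the analysis here did; the three-row sample and its values are made up for illustration.

```r
# Hypothetical miniature of the raw file: blanks and "#DIV/0!" mark missing values
sample_csv <- 'roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,-1.908453,B'

# Treat "NA", empty strings and "#DIV/0!" as missing while reading
toy <- read.csv(text = sample_csv, na.strings = c("NA", "", "#DIV/0!"))

# Share of missing values per column
na_share <- colMeans(is.na(toy))
print(na_share)
```

With this option the kurtosis column is read as numeric rather than as text, and two of its three values are NA.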

Columns in which more than half of the values are NA or "" were removed. The first seven variables are not relevant to prediction (time, name, etc.), so they were removed too.

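# Proportion of blank ("") entries per column
x1 <- (training == "")
y1 <- colSums(x1, na.rm = TRUE) / nrow(training)
colindex1 <- which(y1 > 0.5)

# Proportion of NA entries per column
x <- is.na(training)
y <- colSums(x) / nrow(training)
colindex <- which(y > 0.5)

# Columns that are more than 50% blank or more than 50% NA
col1 <- union(colindex, colindex1)

# Drop the sparse columns from both datasets
training <- training[, -col1]
testing <- testing[, -col1]

# Drop the first seven columns (time, name, etc.), which are not relevant to prediction
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]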

dim(training)
## [1] 19622    53
dim(testing)
## [1] 20 53

Thus, both datasets were reduced to 53 columns; in the training set these are 52 predictors plus the outcome classe.
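The same filtering idea can be wrapped in a small helper. The following is a minimal sketch on a made-up data frame; the function name drop_sparse_columns, its threshold argument and the toy values are illustrative and not part of the original analysis.

```r
# Drop columns whose share of NA or "" values exceeds `threshold`
drop_sparse_columns <- function(df, threshold = 0.5) {
  # Both is.na(df) and (df == "") are logical matrices with one column per variable
  missing <- is.na(df) | df == ""
  df[, colMeans(missing) <= threshold, drop = FALSE]
}

# Hypothetical data: one mostly blank column and one mostly NA column
toy <- data.frame(
  roll_belt = c(1.41, 1.42, 1.48, 1.45),
  kurtosis_roll_belt = c("", "", "", "-1.908453"),
  max_roll_belt = c(NA, NA, NA, 5.2),
  classe = c("A", "A", "B", "B")
)

cleaned <- drop_sparse_columns(toy)
print(names(cleaned))
```

Whichever columns are kept for the training set should also be the ones kept for the test set, so that both remain aligned for prediction.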

Methodology

  1. Create Data Partition
  2. Train models for different methods
     a. Classification Trees
     b. Random Forests
     c. GBM
  3. Testing accuracy
  4. Final selection of model and prediction

Step 1 : Create Data Partition

The training data was split 80/20 into training and testing subsets, stratified by classe, using the following code.

set.seed(1313)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
trainIndex <-  createDataPartition(training$classe, p = 0.8, list = FALSE)
train1 <- training[trainIndex, ]
test1 <- training[-trainIndex, ]

This provided two sets of data: train1 to train the models and test1 to test their accuracy.
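To illustrate what a stratified split does, here is a rough base-R sketch that samples 80% of the rows within each class. It only approximates the idea behind createDataPartition() and is not caret's implementation; the outcome vector and the names idx_by_class and train_idx are hypothetical.

```r
set.seed(1313)

# Hypothetical outcome vector with unbalanced classes
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each class
idx_by_class <- split(seq_along(classe), classe)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), round(0.8 * length(i)))]
}))
train_idx <- sort(unname(train_idx))

# Class proportions are the same in both parts
print(prop.table(table(classe[train_idx])))
print(prop.table(table(classe[-train_idx])))
```

Sampling within each class keeps rare classes represented in both subsets, which a plain random split does not guarantee.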

Step 2 : Training the models

Three models were trained using classification trees, random forests and gradient boosting (gbm). Each was cross-validated using k-fold cross-validation with 5 folds. Using more than 5 folds increased the computation time without any gain in accuracy.
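As a rough picture of what 5-fold cross-validation does with the rows, the sketch below assigns each row of a hypothetical 20-row training set to one of five folds and lists which rows are fitted on and which are held out in each round. The variable names are illustrative; caret handles this internally through trainControl().

```r
set.seed(1313)
n <- 20  # hypothetical number of training rows
k <- 5   # number of folds, as used in this report

# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# In round j, fold j is held out and the other k - 1 folds are used for fitting
for (j in seq_len(k)) {
  held_out <- which(fold_id == j)
  fit_rows <- which(fold_id != j)
  cat("fold", j, ": fit on", length(fit_rows), "rows, validate on",
      length(held_out), "rows\n")
}
```

Each row is held out exactly once, so the cross-validated accuracy is averaged over k fits; this is why the computation time grows with the number of folds.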

Model 1 : Classification trees

controltrain <- trainControl(method = "cv", number = 5)
model1 <- train(classe ~ ., data = train1, method = "rpart", trControl = controltrain)
predict1 <- predict(model1, test1[,-53])
accuracy1 <- confusionMatrix(predict1, test1$classe)

For a quick look at the results, the fitted tree was plotted using rpart.plot.

[Figure: plot of the fitted classification tree]

The visualization above already raises serious questions about the accuracy of the model: no leaf predicts class D, so every observation whose true class is D is misclassified. The detailed accuracy figures are calculated later.
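A class that the model never predicts can also be spotted without a plot by counting predictions per class. The sketch below uses a made-up prediction vector; applied to this report, the same check would be run on predict1.

```r
# Hypothetical predictions from a tree that never outputs class D
class_levels <- c("A", "B", "C", "D", "E")
pred <- factor(c("A", "A", "B", "C", "E", "A", "C", "B"), levels = class_levels)

# Count predictions per class; zero counts mark classes the model never predicts
pred_counts <- table(pred)
unused <- names(pred_counts)[pred_counts == 0]
print(unused)
```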

Model 2 : Random Forests

Here the number of trees (ntree = 16) was chosen to keep the computation time reasonable without compromising the accuracy of the model. Increasing it to 32 did not lead to any significant benefit.

controltrain <- trainControl(method = "cv", number = 5)
model2 <- train(classe ~ ., data = train1, method = "rf", trControl = controltrain, ntree = 16)

predict2 <- predict(model2, test1[,-53])
accuracy2 <- confusionMatrix(predict2, test1$classe)

Model 3 : Gradient Boosting Method

controltrain <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
model3 <- train(classe ~ ., data = train1, method = "gbm", trControl = controltrain, verbose = TRUE)
predict3 <- predict(model3, test1[,-53])
accuracy3 <- confusionMatrix(predict3, test1$classe)

Step 3 : Testing accuracy

The accuracy of each model on the held-out set test1 was as follows (in the order the models were trained):

accuracy1$overall["Accuracy"]
##  Accuracy 
## 0.4901861
accuracy2$overall["Accuracy"]
##  Accuracy 
## 0.9920979
accuracy3$overall["Accuracy"]
##  Accuracy 
## 0.9648228

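Random forests achieved the highest accuracy on the held-out set (99.2%, versus 96.5% for GBM and 49.0% for the classification tree), which corresponds to an estimated out-of-sample error of about 0.8%. So, random forests was chosen as the method for making the requisite predictions.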
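For reference, accuracy is the share of held-out cases on the diagonal of the confusion matrix, and the estimated out-of-sample error is its complement. The sketch below computes both in base R for six made-up cases; the labels and names are hypothetical and stand in for the values that confusionMatrix() reports.

```r
# Hypothetical reference labels and predictions for six held-out rows
lv <- c("A", "B", "C")
truth <- factor(c("A", "A", "B", "B", "C", "C"), levels = lv)
pred  <- factor(c("A", "A", "B", "C", "C", "C"), levels = lv)

# Confusion matrix: rows are predictions, columns are reference classes
cm <- table(Prediction = pred, Reference = truth)

# Accuracy is the share of cases on the diagonal (here 5 of 6)
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

# Estimated out-of-sample error is the complement of accuracy
oos_error <- 1 - accuracy
print(oos_error)
```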

Step 4 : Making final prediction

The final prediction for the 20 test cases is as follows:

final_predict <- predict(model2, testing[, -53])
final_predict
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E