Setting the working directory

setwd("C:/Week4Assignment")
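
Before reading anything, it can help to confirm the data files are actually present in this directory (an optional check, not part of the original script):

# Optional sanity check: both CSVs should be in the working directory
file.exists(c('pml-training.csv', 'pml-testing.csv'))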

Loading required libraries

library(caret)
## Warning: package 'caret' was built under R version 3.6.3
library(ggplot2)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3

Reading training and test datasets

TrainingData <- read.csv('pml-training.csv', na.strings = c("NA", "#DIV/0!", ""))
TestData <- read.csv('pml-testing.csv', na.strings = c("NA", "#DIV/0!", ""))
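
As a quick optional check (for the standard Weight Lifting Exercises data, the training file has 19622 rows and 160 columns and the test file has 20 rows and 160 columns):

# Optional: confirm the expected shapes of both datasets
dim(TrainingData)
dim(TestData)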

Cleaning the training data

Removing columns in which at least 90% of the observations are NA

clnColumnIndex <- colSums(is.na(TrainingData))/nrow(TrainingData) < 0.90
ClTrainingData <- TrainingData[,clnColumnIndex]
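
In this dataset columns tend to be either nearly complete or almost entirely NA, so the retained columns should contain no missing values at all; a quick optional check:

# Optional: count retained columns that still contain any NA (expect 0)
sum(colSums(is.na(ClTrainingData)) > 0)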

Removing columns 1-7 from the training and test datasets, as they contain row indices, user names, timestamps, and window markers that are not useful for prediction.

ClTrainingData <- ClTrainingData[,-c(1:7)]
ClTestData <- TestData[,-c(1:7)]
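
For reference, the seven dropped columns are bookkeeping fields rather than sensor readings:

# The dropped columns: row index, user name, three timestamp fields,
# and the two window markers
names(TrainingData)[1:7]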

Partitioning the training data into a training set and a cross-validation set

inTrainIndex <- createDataPartition(ClTrainingData$classe, p=0.75)[[1]]
TrainTrainData <- ClTrainingData[inTrainIndex,]
TrainCrossValData <- ClTrainingData[-inTrainIndex,]
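
A quick optional look at the split sizes:

# Optional: roughly 75% / 25% of the cleaned training rows
dim(TrainTrainData)
dim(TrainCrossValData)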

Subsetting the test data to the same predictor columns as the cleaned training data

allNames <- names(ClTrainingData)
ClTestData <- TestData[,allNames[1:52]]
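
It is worth verifying that the test predictors line up exactly with the training predictors (an optional check):

# Optional: every test column name should match a training predictor name
all(names(ClTestData) == allNames[1:52])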

Machine Learning Algorithm - Decision Tree

Fit a decision tree (caret's 'rpart' method) on the training set, predict on the cross-validation set, and output the confusion matrix. The result is not ideal: accuracy is only about 50%.

decisionTreeMod <- train(classe ~ ., method='rpart', data=TrainTrainData)
decisionTreePrediction <- predict(decisionTreeMod, TrainCrossValData)
confusionMatrix(TrainCrossValData$classe, decisionTreePrediction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1271   15  107    0    2
##          B  399  308  242    0    0
##          C  403   17  435    0    0
##          D  366  145  293    0    0
##          E  139  123  218    0  421
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4965          
##                  95% CI : (0.4824, 0.5106)
##     No Information Rate : 0.5257          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3415          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4930  0.50658   0.3359       NA  0.99527
## Specificity            0.9467  0.85079   0.8836   0.8361  0.89288
## Pos Pred Value         0.9111  0.32455   0.5088       NA  0.46726
## Neg Pred Value         0.6275  0.92415   0.7876       NA  0.99950
## Prevalence             0.5257  0.12398   0.2641   0.0000  0.08626
## Detection Rate         0.2592  0.06281   0.0887   0.0000  0.08585
## Detection Prevalence   0.2845  0.19352   0.1743   0.1639  0.18373
## Balanced Accuracy      0.7199  0.67869   0.6098       NA  0.94408
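
If only the headline number is needed rather than the full printout, the accuracy can be pulled out of the confusionMatrix object directly (a small convenience, not part of the original script):

# Extract just the overall accuracy (about 0.50 here)
cmTree <- confusionMatrix(TrainCrossValData$classe, decisionTreePrediction)
cmTree$overall['Accuracy']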

Plotting the decision tree

rpart.plot(decisionTreeMod$finalModel)

Machine Learning Algorithm - Random Forest

set.seed(21243)
rfMod <- train(classe ~., method='rf', data=TrainTrainData, ntree=100)
rfPrediction <- predict(rfMod, TrainCrossValData)
confusionMatrix(TrainCrossValData$classe, rfPrediction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    8  940    1    0    0
##          C    0    0  852    3    0
##          D    0    0    8  794    2
##          E    0    0    0    2  899
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9951          
##                  95% CI : (0.9927, 0.9969)
##     No Information Rate : 0.2861          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9938          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9943   1.0000   0.9895   0.9937   0.9978
## Specificity            1.0000   0.9977   0.9993   0.9976   0.9995
## Pos Pred Value         1.0000   0.9905   0.9965   0.9876   0.9978
## Neg Pred Value         0.9977   1.0000   0.9978   0.9988   0.9995
## Prevalence             0.2861   0.1917   0.1756   0.1629   0.1837
## Detection Rate         0.2845   0.1917   0.1737   0.1619   0.1833
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9971   0.9989   0.9944   0.9957   0.9986
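
A side note on runtime: caret's default bootstrap resampling makes this random forest fit fairly slow. One common variation (not what was run above) is to use explicit k-fold cross-validation for the resampling:

# Optional variation: 5-fold CV resampling is typically much faster
# than caret's default 25-rep bootstrap
rfModCV <- train(classe ~ ., method='rf', data=TrainTrainData, ntree=100,
                 trControl=trainControl(method='cv', number=5))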

Prediction

Now let's predict on the 20 test cases using the random forest model

predict(rfMod, ClTestData)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
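
If the predictions need to be saved for submission, a minimal sketch (assuming pml-testing.csv has its usual problem_id column) is:

# Save the 20 predictions alongside their problem ids
answers <- data.frame(problem_id = TestData$problem_id,
                      predicted_classe = predict(rfMod, ClTestData))
write.csv(answers, 'predictions.csv', row.names = FALSE)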

Conclusion

The results show that the random forest clearly outperforms the decision tree model: it achieves over 99% accuracy on the held-out cross-validation set, so the expected out-of-sample error is below 1%.
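
As a small sketch of how that error estimate can be computed directly from the cross-validation results (reusing rfPrediction from above):

# Expected out-of-sample error estimated on the cross-validation set (~0.5%)
cmRF <- confusionMatrix(TrainCrossValData$classe, rfPrediction)
1 - as.numeric(cmRF$overall['Accuracy'])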