Setting up current working directory
setwd("C:/Week4Assignment")
Initializing
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
library(ggplot2)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
Reading training and test datasets
TrainingData <- read.csv('pml-training.csv', na.strings = c("NA", "#DIV/0!", ""))
TestData <- read.csv('pml-testing.csv', na.strings = c("NA", "#DIV/0!", ""))
Cleaning Training data
Removing columns in which 90% or more of the observations are NA
clnColumnIndex <- colSums(is.na(TrainingData))/nrow(TrainingData) < 0.90
ClTrainingData <- TrainingData[,clnColumnIndex]
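The same filter can be tried on a small data frame to see which columns survive. This is a minimal sketch with made-up values; the names `toy`, `naFraction`, and `toyClean` are illustrative and not part of the assignment code. `colMeans(is.na(x))` gives the same per-column NA fraction as `colSums(is.na(x)) / nrow(x)`.

```r
# Toy data frame with hypothetical values (not the assignment data)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  b = rep(NA_real_, 10),                  # 100% NA
  c = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10)   # 10% NA
)

# Fraction of NA values in each column
naFraction <- colMeans(is.na(toy))

# Keep only columns whose NA fraction is below 0.90
keep <- naFraction < 0.90
toyClean <- toy[, keep, drop = FALSE]

names(toyClean)
```

Using `drop = FALSE` keeps the result a data frame even if only one column passes the filter.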
Removing columns 1-7 (row index, user name, timestamps, and window identifiers) from the training and test datasets, as they are not used as predictors.
ClTrainingData <- ClTrainingData[,-c(1:7)]
ClTestData <- TestData[,-c(1:7)]
Partitioning the training data into a training set (75%) and a cross-validation set (25%)
inTrainIndex <- createDataPartition(ClTrainingData$classe, p=0.75)[[1]]
TrainTrainData <- ClTrainingData[inTrainIndex,]
TrainCrossValData <- ClTrainingData[-inTrainIndex,]
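The idea behind a class-stratified 75/25 split can be illustrated in base R. This is only a sketch of the concept, not caret's implementation of `createDataPartition`; the factor `y` and the class sizes are made up for the example.

```r
set.seed(1)

# Hypothetical outcome with unequal class sizes
y <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class separately
idx <- unlist(lapply(
  split(seq_along(y), y),
  function(i) sample(i, size = ceiling(0.75 * length(i)))
))

trainY   <- y[idx]
holdoutY <- y[-idx]

# Class proportions are preserved in both parts
table(trainY)
table(holdoutY)
```

Sampling within each class keeps the class proportions of the held-out set close to those of the full data, which makes the cross-validation accuracy easier to interpret.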
Restricting the test data set to the same predictor columns as the cleaned training data
allNames <- names(ClTrainingData)
ClTestData <- TestData[,allNames[1:52]]
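The hard-coded range `1:52` assumes the outcome is the last of 53 columns. A possible alternative, sketched here on hypothetical miniature data frames (`toyTrain`, `toyTest`, and their columns are invented for illustration), selects the predictors by name instead.

```r
# Hypothetical miniature training and test frames (not the assignment data)
toyTrain <- data.frame(x1 = 1:3, x2 = 4:6, classe = factor(c("A", "B", "A")))
toyTest  <- data.frame(id = 1:2, x1 = 7:8, x2 = 9:10, problem_id = 1:2)

# Predictors are every training column except the outcome
predictorNames <- setdiff(names(toyTrain), "classe")

# Stop early if the test data is missing any predictor
stopifnot(all(predictorNames %in% names(toyTest)))

# Keep only the predictor columns, in training order
toyTestAligned <- toyTest[, predictorNames, drop = FALSE]

names(toyTestAligned)
```

Selecting by name keeps working if the number or order of columns changes after cleaning.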
Machine Learning Algorithm - Decision Tree
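Fit a decision tree, predict on the cross-validation set, and output the confusion matrix. The result is poor: accuracy is only about 49.7%, and the tree never predicts class D. Note that the observed classes are passed as the first argument to confusionMatrix(), so the rows labeled "Prediction" hold the observed classes and the "Reference" columns hold the model's predictions.
# The fitting call is absent from the original listing; a caret rpart fit is assumed here,
# consistent with decisionTreeMod$finalModel being passed to rpart.plot() later on
decisionTreeMod <- train(classe ~ ., method='rpart', data=TrainTrainData)
decisionTreePrediction <- predict(decisionTreeMod, TrainCrossValData)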
confusionMatrix(TrainCrossValData$classe, decisionTreePrediction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1271 15 107 0 2
## B 399 308 242 0 0
## C 403 17 435 0 0
## D 366 145 293 0 0
## E 139 123 218 0 421
##
## Overall Statistics
##
## Accuracy : 0.4965
## 95% CI : (0.4824, 0.5106)
## No Information Rate : 0.5257
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3415
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4930 0.50658 0.3359 NA 0.99527
## Specificity 0.9467 0.85079 0.8836 0.8361 0.89288
## Pos Pred Value 0.9111 0.32455 0.5088 NA 0.46726
## Neg Pred Value 0.6275 0.92415 0.7876 NA 0.99950
## Prevalence 0.5257 0.12398 0.2641 0.0000 0.08626
## Detection Rate 0.2592 0.06281 0.0887 0.0000 0.08585
## Detection Prevalence 0.2845 0.19352 0.1743 0.1639 0.18373
## Balanced Accuracy 0.7199 0.67869 0.6098 NA 0.94408
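The headline figures in this output can be reproduced by hand from the table of counts. The sketch below copies the counts from the confusion matrix above into a matrix named `cm` (an illustrative name) and recomputes the accuracy and the no-information rate, taking the latter as the largest column share.

```r
# Counts copied from the decision tree confusion matrix above
# (rows: first argument to confusionMatrix, columns: second argument)
cm <- matrix(
  c(1271,  15, 107, 0,   2,
     399, 308, 242, 0,   0,
     403,  17, 435, 0,   0,
     366, 145, 293, 0,   0,
     139, 123, 218, 0, 421),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

# Accuracy: cases on the diagonal divided by all cases
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: share of the most frequent "Reference" class
nir <- max(colSums(cm)) / sum(cm)

round(accuracy, 4)  # 0.4965
round(nir, 4)       # 0.5257
```

Both values agree with the reported Accuracy and No Information Rate. The all-zero column D shows that no case was assigned to class D.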
Plotting the decision tree
rpart.plot(decisionTreeMod$finalModel)

Machine Learning Algorithm - Random Forest
set.seed(21243)
rfMod <- train(classe ~., method='rf', data=TrainTrainData, ntree=100)
rfPrediction <- predict(rfMod, TrainCrossValData)
confusionMatrix(TrainCrossValData$classe, rfPrediction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 8 940 1 0 0
## C 0 0 852 3 0
## D 0 0 8 794 2
## E 0 0 0 2 899
##
## Overall Statistics
##
## Accuracy : 0.9951
## 95% CI : (0.9927, 0.9969)
## No Information Rate : 0.2861
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9938
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9943 1.0000 0.9895 0.9937 0.9978
## Specificity 1.0000 0.9977 0.9993 0.9976 0.9995
## Pos Pred Value 1.0000 0.9905 0.9965 0.9876 0.9978
## Neg Pred Value 0.9977 1.0000 0.9978 0.9988 0.9995
## Prevalence 0.2861 0.1917 0.1756 0.1629 0.1837
## Detection Rate 0.2845 0.1917 0.1737 0.1619 0.1833
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9971 0.9989 0.9944 0.9957 0.9986
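The estimated out-of-sample error follows directly from these counts. The sketch below copies them into a matrix (`rfCm` is an illustrative name) and computes the error rate on the cross-validation set. The last line gives a rough expected number of misclassified cases among the 20 test cases, under the assumption that they resemble the cross-validation data.

```r
# Counts copied from the random forest confusion matrix above
rfCm <- matrix(
  c(1395,   0,   0,   0,   0,
       8, 940,   1,   0,   0,
       0,   0, 852,   3,   0,
       0,   0,   8, 794,   2,
       0,   0,   0,   2, 899),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

# Cases off the diagonal are misclassified
misclassified <- sum(rfCm) - sum(diag(rfCm))

# Estimated out-of-sample error rate
errorRate <- misclassified / sum(rfCm)

misclassified        # 24
round(errorRate, 4)  # 0.0049

# Rough expected number of errors among 20 new cases (assumes similar data)
round(20 * errorRate, 2)
```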
Prediction
Now let's predict the classes of the 20 test cases with the random forest model
predict(rfMod, ClTestData)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Conclusion
The results show that the random forest performs far better than the decision tree. On the held-out cross-validation set, the random forest reaches an accuracy of 99.5%, compared with 49.7% for the decision tree.