Setting the working directory

setwd("C:/Week4Assignment")
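
Before reading anything, it can help to confirm the data files are actually present in this directory (an optional check, not part of the original script):

# Optional sanity check: both CSVs should be in the working directory
file.exists(c('pml-training.csv', 'pml-testing.csv'))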

Loading required libraries

library(caret)
## Warning: package 'caret' was built under R version 3.6.3
library(ggplot2)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3

Reading training and test datasets

TrainingData <- read.csv('pml-training.csv', na.strings = c("NA", "#DIV/0!", ""))
TestData <- read.csv('pml-testing.csv', na.strings = c("NA", "#DIV/0!", ""))
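
As a quick optional check (for the standard Weight Lifting Exercises data, the training file has 19622 rows and 160 columns and the test file has 20 rows and 160 columns):

# Optional: confirm the expected shapes of both datasets
dim(TrainingData)
dim(TestData)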

Cleaning the training data

Removing columns in which at least 90% of the observations are NA

clnColumnIndex <- colSums(is.na(TrainingData))/nrow(TrainingData) < 0.90
ClTrainingData <- TrainingData[,clnColumnIndex]
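
In this dataset columns tend to be either nearly complete or almost entirely NA, so the retained columns should contain no missing values at all; a quick optional check:

# Optional: count retained columns that still contain any NA (expect 0)
sum(colSums(is.na(ClTrainingData)) > 0)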

Removing columns 1-7 from the training and test datasets, as they contain row indices, user names, timestamps, and window markers that are not useful for prediction.

ClTrainingData <- ClTrainingData[,-c(1:7)]
ClTestData <- TestData[,-c(1:7)]
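
For reference, the seven dropped columns are bookkeeping fields rather than sensor readings:

# The dropped columns: row index, user name, three timestamp fields,
# and the two window markers
names(TrainingData)[1:7]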

Partitioning the training data into a training set and a cross-validation set

inTrainIndex <- createDataPartition(ClTrainingData$classe, p=0.75)[[1]]
TrainTrainData <- ClTrainingData[inTrainIndex,]
TrainCrossValData <- ClTrainingData[-inTrainIndex,]
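
A quick optional look at the split sizes:

# Optional: roughly 75% / 25% of the cleaned training rows
dim(TrainTrainData)
dim(TrainCrossValData)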

Subsetting the test data to the same predictor columns as the cleaned training data

allNames <- names(ClTrainingData)
ClTestData <- TestData[,allNames[1:52]]
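
It is worth verifying that the test predictors line up exactly with the training predictors (an optional check):

# Optional: every test column name should match a training predictor name
all(names(ClTestData) == allNames[1:52])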

Machine Learning Algorithm - Decision Tree

Fit a decision tree (caret's 'rpart' method) on the training set, predict on the cross-validation set, and output the confusion matrix. The result is not ideal: accuracy is only about 50%.

decisionTreeMod <- train(classe ~ ., method='rpart', data=TrainTrainData)
decisionTreePrediction <- predict(decisionTreeMod, TrainCrossValData)
confusionMatrix(TrainCrossValData$classe, decisionTreePrediction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1271   15  107    0    2
##          B  399  308  242    0    0
##          C  403   17  435    0    0
##          D  366  145  293    0    0
##          E  139  123  218    0  421
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4965          
##                  95% CI : (0.4824, 0.5106)
##     No Information Rate : 0.5257          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3415          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4930  0.50658   0.3359       NA  0.99527
## Specificity            0.9467  0.85079   0.8836   0.8361  0.89288
## Pos Pred Value         0.9111  0.32455   0.5088       NA  0.46726
## Neg Pred Value         0.6275  0.92415   0.7876       NA  0.99950
## Prevalence             0.5257  0.12398   0.2641   0.0000  0.08626
## Detection Rate         0.2592  0.06281   0.0887   0.0000  0.08585
## Detection Prevalence   0.2845  0.19352   0.1743   0.1639  0.18373
## Balanced Accuracy      0.7199  0.67869   0.6098       NA  0.94408
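
If only the headline number is needed rather than the full printout, the accuracy can be pulled out of the confusionMatrix object directly (a small convenience, not part of the original script):

# Extract just the overall accuracy (about 0.50 here)
cmTree <- confusionMatrix(TrainCrossValData$classe, decisionTreePrediction)
cmTree$overall['Accuracy']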

Plotting the decision tree

rpart.plot(decisionTreeMod$finalModel)

Machine Learning Algorithm - Random Forest

set.seed(21243)
rfMod <- train(classe ~., method='rf', data=TrainTrainData, ntree=100)
rfPrediction <- predict(rfMod, TrainCrossValData)
confusionMatrix(TrainCrossValData$classe, rfPrediction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    8  940    1    0    0
##          C    0    0  852    3    0
##          D    0    0    8  794    2
##          E    0    0    0    2  899
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9951          
##                  95% CI : (0.9927, 0.9969)
##     No Information Rate : 0.2861          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9938          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9943   1.0000   0.9895   0.9937   0.9978
## Specificity            1.0000   0.9977   0.9993   0.9976   0.9995
## Pos Pred Value         1.0000   0.9905   0.9965   0.9876   0.9978
## Neg Pred Value         0.9977   1.0000   0.9978   0.9988   0.9995
## Prevalence             0.2861   0.1917   0.1756   0.1629   0.1837
## Detection Rate         0.2845   0.1917   0.1737   0.1619   0.1833
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9971   0.9989   0.9944   0.9957   0.9986
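
A side note on runtime: caret's default bootstrap resampling makes this random forest fit fairly slow. One common variation (not what was run above) is to use explicit k-fold cross-validation for the resampling:

# Optional variation: 5-fold CV resampling is typically much faster
# than caret's default 25-rep bootstrap
rfModCV <- train(classe ~ ., method='rf', data=TrainTrainData, ntree=100,
                 trControl=trainControl(method='cv', number=5))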

Prediction

Now let's predict on the 20 test cases using the random forest model

predict(rfMod, ClTestData)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
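
If the predictions need to be saved for submission, a minimal sketch (assuming pml-testing.csv has its usual problem_id column) is:

# Save the 20 predictions alongside their problem ids
answers <- data.frame(problem_id = TestData$problem_id,
                      predicted_classe = predict(rfMod, ClTestData))
write.csv(answers, 'predictions.csv', row.names = FALSE)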

Conclusion

The results show that the random forest clearly outperforms the decision tree model: it achieves over 99% accuracy on the held-out cross-validation set, so the expected out-of-sample error is below 1%.
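
As a small sketch of how that error estimate can be computed directly from the cross-validation results (reusing rfPrediction from above):

# Expected out-of-sample error estimated on the cross-validation set (~0.5%)
cmRF <- confusionMatrix(TrainCrossValData$classe, rfPrediction)
1 - as.numeric(cmRF$overall['Accuracy'])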