The goal of this project is to predict the manner in which a group of enthusiasts, who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks, carried out the exercises. Data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants are used for prediction. The “classe” variable in the training set is the outcome of the prediction, and the remaining variables are used as predictors. This report describes how the different models were built and how a test set derived from the training set was used to assess the accuracy of each model. The report also gives the expected out-of-sample error and explains the choices that were made. Finally, the prediction model with the highest accuracy is used to predict 20 test cases from the validation set.
The required library packages for the analysis are loaded into R.
library(caret)
library(rpart)
library(randomForest)
library(rattle)
library(gbm)
library(corrplot)
First, the data sets are downloaded and read into data frames: the training data and the validation data of 20 cases.
if(!file.exists("./DataDownload")){dir.create("./DataDownload")}
fileUrl = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile = "./DataDownload/trainingdataset.csv")
trainingData = read.csv("./DataDownload/trainingdataset.csv")
if(!file.exists("./DataDownload")){dir.create("./DataDownload")}
fileUrl = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl, destfile = "./DataDownload/testingdataset.csv")
testingData = read.csv("./DataDownload/testingdataset.csv")
Next, we remove columns that are not useful for prediction, either because they contain NAs or because, like the identifier and timestamp columns dropped below, they would encourage over-fitting.
training = trainingData[,colSums(is.na(trainingData))==0]
validation = testingData[,colSums(is.na(testingData))==0]
training = training[,-c(1:7)]
validation = validation[,-c(1:7)]
dim(training)
## [1] 19622 86
dim(validation)
## [1] 20 53
Next, the training data set is divided into a training set and a test set that will be used to build and evaluate the prediction models.
set.seed(123)
intrain = createDataPartition(training$classe,p=0.7,list=F)
trainData = training[intrain,]
testData = training[-intrain,]
Next, we identify and remove columns with near-zero variance, since predictors with almost no variability add little information and can cause errors in prediction; this reduces the data from 86 to 53 columns.
nzv = nearZeroVar(trainData)
trainData = trainData[,-nzv]
testData = testData[,-nzv]
dim(trainData)
## [1] 13737 53
dim(testData)
## [1] 5885 53
Next, we plot the correlations between the variables to get a clearer view of how strongly each variable is related to the others.
cor_plot = cor(trainData[,-53])
corrplot(cor_plot,order = "FPC", method = "color",type="upper",tl.cex=0.8,tl.col = rgb(0,0,0))
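To go beyond the visual impression, caret’s findCorrelation function can flag predictors whose pairwise correlation exceeds a chosen cutoff. A minimal sketch using the correlation matrix computed above (the 0.8 cutoff and the highCorr name are illustrative choices, not part of the original analysis):
# Indices of predictors with absolute pairwise correlation above 0.8
highCorr = findCorrelation(cor_plot, cutoff = 0.8)
# Names of the flagged predictors (first 52 columns of trainData match cor_plot)
names(trainData)[highCorr]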
First, a decision tree model is used. This method iteratively splits the data into groups on the predictor variables and evaluates the homogeneity within each group.
set.seed(1234)
treemod = rpart(classe~.,data=trainData,method="class")
fancyRpartPlot(treemod)
treepred = predict(treemod, testData, type="class")
treeconfmat = confusionMatrix(treepred,as.factor(testData$classe))
treeconfmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1552 174 18 60 6
## B 48 588 43 63 64
## C 39 220 888 100 148
## D 24 83 75 651 86
## E 11 74 2 90 778
##
## Overall Statistics
##
## Accuracy : 0.7573
## 95% CI : (0.7462, 0.7683)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6926
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9271 0.51624 0.8655 0.6753 0.7190
## Specificity 0.9387 0.95407 0.8957 0.9455 0.9631
## Pos Pred Value 0.8575 0.72953 0.6366 0.7084 0.8147
## Neg Pred Value 0.9701 0.89151 0.9693 0.9370 0.9383
## Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Detection Rate 0.2637 0.09992 0.1509 0.1106 0.1322
## Detection Prevalence 0.3076 0.13696 0.2370 0.1562 0.1623
## Balanced Accuracy 0.9329 0.73515 0.8806 0.8104 0.8411
plot(treeconfmat$table,col=treeconfmat$byClass,main=paste("Decision Tree Accuracy =", round(treeconfmat$overall["Accuracy"],4)))
The random forest method builds many classification trees, each on a bootstrap resample of the training data, and classifies new observations by aggregating the votes of all the trees.
trcontrol = trainControl(method = "cv",number = 3,verboseIter = F)
rfmod = train(classe~.,data=trainData,method="rf",trControl=trcontrol)
rfpred = predict(rfmod,testData)
RFconfmat = confusionMatrix(rfpred,as.factor(testData$classe))
RFconfmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 6 0 0 0
## B 2 1124 5 0 0
## C 0 9 1019 10 4
## D 0 0 2 954 5
## E 1 0 0 0 1073
##
## Overall Statistics
##
## Accuracy : 0.9925
## 95% CI : (0.99, 0.9946)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9905
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9868 0.9932 0.9896 0.9917
## Specificity 0.9986 0.9985 0.9953 0.9986 0.9998
## Pos Pred Value 0.9964 0.9938 0.9779 0.9927 0.9991
## Neg Pred Value 0.9993 0.9968 0.9986 0.9980 0.9981
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1910 0.1732 0.1621 0.1823
## Detection Prevalence 0.2850 0.1922 0.1771 0.1633 0.1825
## Balanced Accuracy 0.9984 0.9927 0.9942 0.9941 0.9957
plot(RFconfmat$table,col=RFconfmat$byClass,main=paste("Random Forest Accuracy =",round(RFconfmat$overall["Accuracy"],4)))
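For reference, the tuned random forest itself can be inspected, in the same way the boosted model is inspected below; a brief sketch (the exact output depends on the run and is not shown here):
# Value of mtry selected by cross-validation and the final fitted forest
rfmod$bestTune
rfmod$finalModel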
The generalized boosted model combines many weak predictors, weighting each according to its strength and adding them up to form a single strong predictor.
trcontrol2 = trainControl(method="repeatedcv",number = 5, repeats = 1)
gbmfit = train(classe~., data=trainData, method="gbm", trControl=trcontrol2,verbose=F)
gbmfit$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
gbmpred = predict(gbmfit, testData)
gbmconfmat = confusionMatrix(gbmpred, as.factor(testData$classe))
gbmconfmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1647 28 0 3 6
## B 17 1081 27 5 9
## C 5 28 984 29 21
## D 4 2 14 919 18
## E 1 0 1 8 1028
##
## Overall Statistics
##
## Accuracy : 0.9616
## 95% CI : (0.9564, 0.9664)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9514
##
## Mcnemar's Test P-Value : 4.129e-07
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9839 0.9491 0.9591 0.9533 0.9501
## Specificity 0.9912 0.9878 0.9829 0.9923 0.9979
## Pos Pred Value 0.9780 0.9491 0.9222 0.9603 0.9904
## Neg Pred Value 0.9936 0.9878 0.9913 0.9909 0.9889
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2799 0.1837 0.1672 0.1562 0.1747
## Detection Prevalence 0.2862 0.1935 0.1813 0.1626 0.1764
## Balanced Accuracy 0.9875 0.9684 0.9710 0.9728 0.9740
plot(gbmconfmat$table,col=gbmconfmat$byClass,main=paste("Generalized Boosted Model Accuracy =",round(gbmconfmat$overall["Accuracy"],4)))
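To make the comparison explicit, the test-set accuracies of the three models can be collected in one place, together with the expected out-of-sample error of the best model. A minimal sketch using the objects created above (the model_accuracy name is only illustrative):
# Collect test-set accuracies of the three models
model_accuracy = data.frame(
  Model = c("Decision Tree", "Random Forest", "GBM"),
  Accuracy = c(treeconfmat$overall["Accuracy"],
               RFconfmat$overall["Accuracy"],
               gbmconfmat$overall["Accuracy"]))
model_accuracy
# Expected out-of-sample error of the best model (random forest)
1 - RFconfmat$overall["Accuracy"]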
From this comparison and the model plots, the random forest method has the highest prediction accuracy (0.9925), corresponding to an expected out-of-sample error of about 0.75%. It is therefore used to predict the outcomes of the validation data set.
predict(rfmod, validation)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E