The goal of this project is to predict the manner in which a group of enthusiasts, who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks, carried out the exercises. Data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants are used for prediction. The “classe” variable in the training set is the outcome of the prediction, and the remaining variables are used as predictors. This report describes how the different models were built and how a test set derived from the training set was used to assess the accuracy of each model. The report also gives the expected out-of-sample error and explains the choices that were made. Finally, the prediction model with the highest accuracy is used to predict 20 test cases from the validation set.
The required library packages for the analysis are loaded into R.
library(caret)
library(rpart)
library(randomForest)
library(rattle)
library(gbm)
library(corrplot)
First, the data sets are downloaded and read into data frames: the training data and the validation data of 20 cases.
if(!file.exists("./DataDownload")){dir.create("./DataDownload")}
fileUrl = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile = "./DataDownload/trainingdataset.csv")
trainingData = read.csv("./DataDownload/trainingdataset.csv")
if(!file.exists("./DataDownload")){dir.create("./DataDownload")}
fileUrl = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl, destfile = "./DataDownload/testingdataset.csv")
testingData = read.csv("./DataDownload/testingdataset.csv")
Next, we remove columns that are not useful for prediction, either because they contain NAs or because, like the identifier and timestamp columns dropped below, they would encourage over-fitting.
training = trainingData[,colSums(is.na(trainingData))==0]
validation = testingData[,colSums(is.na(testingData))==0]
training = training[,-c(1:7)]
validation = validation[,-c(1:7)]
dim(training)
## [1] 19622 86
dim(validation)
## [1] 20 53
Next, the training data set is divided into a training set and a test set that will be used to build and evaluate the prediction models.
set.seed(123)
intrain = createDataPartition(training$classe,p=0.7,list=F)
trainData = training[intrain,]
testData = training[-intrain,]
Next, we identify and remove columns with near-zero variance, since predictors with almost no variability add little information and can cause errors in prediction; this reduces the data from 86 to 53 columns.
nzv = nearZeroVar(trainData)
trainData = trainData[,-nzv]
testData = testData[,-nzv]
dim(trainData)
## [1] 13737 53
dim(testData)
## [1] 5885 53
Next, we plot the correlations between the variables to get a clearer view of how strongly each variable is related to the others.
cor_plot = cor(trainData[,-53])
corrplot(cor_plot,order = "FPC", method = "color",type="upper",tl.cex=0.8,tl.col = rgb(0,0,0))
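To go beyond the visual impression, caret’s findCorrelation function can flag predictors whose pairwise correlation exceeds a chosen cutoff. A minimal sketch using the correlation matrix computed above (the 0.8 cutoff and the highCorr name are illustrative choices, not part of the original analysis):
# Indices of predictors with absolute pairwise correlation above 0.8
highCorr = findCorrelation(cor_plot, cutoff = 0.8)
# Names of the flagged predictors (first 52 columns of trainData match cor_plot)
names(trainData)[highCorr]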
First, a decision tree model is used. This method iteratively splits the data into groups on the predictor variables and evaluates the homogeneity within each group.
set.seed(1234)
treemod = rpart(classe~.,data=trainData,method="class")
fancyRpartPlot(treemod)
treepred = predict(treemod, testData, type="class")
treeconfmat = confusionMatrix(treepred,as.factor(testData$classe))
treeconfmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1552 174 18 60 6
## B 48 588 43 63 64
## C 39 220 888 100 148
## D 24 83 75 651 86
## E 11 74 2 90 778
##
## Overall Statistics
##
## Accuracy : 0.7573
## 95% CI : (0.7462, 0.7683)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6926
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9271 0.51624 0.8655 0.6753 0.7190
## Specificity 0.9387 0.95407 0.8957 0.9455 0.9631
## Pos Pred Value 0.8575 0.72953 0.6366 0.7084 0.8147
## Neg Pred Value 0.9701 0.89151 0.9693 0.9370 0.9383
## Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Detection Rate 0.2637 0.09992 0.1509 0.1106 0.1322
## Detection Prevalence 0.3076 0.13696 0.2370 0.1562 0.1623
## Balanced Accuracy 0.9329 0.73515 0.8806 0.8104 0.8411
plot(treeconfmat$table,col=treeconfmat$byClass,main=paste("Decision Tree Accuracy =", round(treeconfmat$overall["Accuracy"],4)))
The random forest method builds many classification trees, each on a bootstrap resample of the training data, and classifies new observations by aggregating the votes of all the trees.
trcontrol = trainControl(method = "cv",number = 3,verboseIter = F)
rfmod = train(classe~.,data=trainData,method="rf",trControl=trcontrol)
rfpred = predict(rfmod,testData)
RFconfmat = confusionMatrix(rfpred,as.factor(testData$classe))
RFconfmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 6 0 0 0
## B 2 1124 5 0 0
## C 0 9 1019 10 4
## D 0 0 2 954 5
## E 1 0 0 0 1073
##
## Overall Statistics
##
## Accuracy : 0.9925
## 95% CI : (0.99, 0.9946)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9905
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9868 0.9932 0.9896 0.9917
## Specificity 0.9986 0.9985 0.9953 0.9986 0.9998
## Pos Pred Value 0.9964 0.9938 0.9779 0.9927 0.9991
## Neg Pred Value 0.9993 0.9968 0.9986 0.9980 0.9981
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1910 0.1732 0.1621 0.1823
## Detection Prevalence 0.2850 0.1922 0.1771 0.1633 0.1825
## Balanced Accuracy 0.9984 0.9927 0.9942 0.9941 0.9957
plot(RFconfmat$table,col=RFconfmat$byClass,main=paste("Random Forest Accuracy =",round(RFconfmat$overall["Accuracy"],4)))
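For reference, the tuned random forest itself can be inspected, in the same way the boosted model is inspected below; a brief sketch (the exact output depends on the run and is not shown here):
# Value of mtry selected by cross-validation and the final fitted forest
rfmod$bestTune
rfmod$finalModel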
The generalized boosted model combines many weak predictors, weighting each according to its strength and adding them up to form a single strong predictor.
trcontrol2 = trainControl(method="repeatedcv",number = 5, repeats = 1)
gbmfit = train(classe~., data=trainData, method="gbm", trControl=trcontrol2,verbose=F)
gbmfit$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
gbmpred = predict(gbmfit, testData)
gbmconfmat = confusionMatrix(gbmpred, as.factor(testData$classe))
gbmconfmat
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1647 28 0 3 6
## B 17 1081 27 5 9
## C 5 28 984 29 21
## D 4 2 14 919 18
## E 1 0 1 8 1028
##
## Overall Statistics
##
## Accuracy : 0.9616
## 95% CI : (0.9564, 0.9664)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9514
##
## Mcnemar's Test P-Value : 4.129e-07
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9839 0.9491 0.9591 0.9533 0.9501
## Specificity 0.9912 0.9878 0.9829 0.9923 0.9979
## Pos Pred Value 0.9780 0.9491 0.9222 0.9603 0.9904
## Neg Pred Value 0.9936 0.9878 0.9913 0.9909 0.9889
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2799 0.1837 0.1672 0.1562 0.1747
## Detection Prevalence 0.2862 0.1935 0.1813 0.1626 0.1764
## Balanced Accuracy 0.9875 0.9684 0.9710 0.9728 0.9740
plot(gbmconfmat$table,col=gbmconfmat$byClass,main=paste("Generalized Boosted Model Accuracy =",round(gbmconfmat$overall["Accuracy"],4)))
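To make the comparison explicit, the test-set accuracies of the three models can be collected in one place, together with the expected out-of-sample error of the best model. A minimal sketch using the objects created above (the model_accuracy name is only illustrative):
# Collect test-set accuracies of the three models
model_accuracy = data.frame(
  Model = c("Decision Tree", "Random Forest", "GBM"),
  Accuracy = c(treeconfmat$overall["Accuracy"],
               RFconfmat$overall["Accuracy"],
               gbmconfmat$overall["Accuracy"]))
model_accuracy
# Expected out-of-sample error of the best model (random forest)
1 - RFconfmat$overall["Accuracy"]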
From this comparison and the model plots, the random forest method has the highest prediction accuracy (0.9925), corresponding to an expected out-of-sample error of about 0.75%. It is therefore used to predict the outcomes of the validation data set.
predict(rfmod, validation)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E