Hossam Saad

July 20, 2020

Overview

This document is the final report for the peer-assessed project of Coursera's Practical Machine Learning course, part of the Data Science Specialization. It was built in RStudio using knitr and is meant to be published in HTML format. The analysis serves as the basis for the course quiz and the prediction assignment write-up. The main goal of the project is to predict the manner in which 6 participants performed the exercise described below; this is the “classe” variable in the training set. The machine learning algorithm described here is applied to the 20 test cases available in the test data, and the predictions are submitted in the appropriate format to the Course Project Prediction Quiz for automated grading.

Loading Data

1- Data Source

The training data are available here:

[Training Set](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv)

The test data are available here:

[Test Set](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv)

2- Loading Required Packages

library(knitr)
## Warning: package 'knitr' was built under R version 3.6.3
library(caret)
## Warning: package 'caret' was built under R version 3.6.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.6.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.6.3
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.3
library(rattle)
## Warning: package 'rattle' was built under R version 3.6.3
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.3
## corrplot 0.84 loaded
library(gbm)
## Loaded gbm 2.1.8
library(survival)
## Warning: package 'survival' was built under R version 3.6.3
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
library(splines)
library(parallel)
library(pryr)
## Registered S3 method overwritten by 'pryr':
##   method      from
##   print.bytes Rcpp
set.seed(199)
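
The warnings above only report the R version each package was built under. If a cleaner report is preferred, the packages can be attached quietly; a minimal, optional sketch:

# attach packages without startup messages or warnings cluttering the report
suppressWarnings(suppressMessages({
    library(caret)
    library(randomForest)
    library(gbm)
}))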

3- Downloading and Cleaning the Data

TrainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
TestUrl  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
TrainFile <- "pml-training.csv"
TestFile  <- "pml-testing.csv"

# download the datasets
if(!file.exists(TrainFile))
{
    download.file(TrainUrl,destfile = TrainFile)
}
trainingData <- read.csv(TrainFile)
if(!file.exists(TestFile))
{
    download.file(TestUrl,destfile = TestFile)
}
testingData  <- read.csv(TestFile)

# partition the training dataset with caret into 70% training and 30% testing
inTrain  <- createDataPartition(trainingData$classe, p=0.7, list=FALSE)

TrainSet <- trainingData[inTrain, ]

TestSet  <- trainingData[-inTrain, ]
dim(TrainSet)
## [1] 13737   160
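
createDataPartition samples within each level of classe, so the class proportions should be nearly identical in both partitions; a quick optional check:

# class proportions should match closely between the two partitions
round(prop.table(table(TrainSet$classe)), 3)
round(prop.table(table(TestSet$classe)), 3)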

Remove near-zero-variance (NZV) variables

NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet  <- TestSet[, -NZV]
dim(TestSet)
## [1] 5885  105
dim(TrainSet)
## [1] 13737   105
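
nearZeroVar returns the column indices of predictors that are almost constant and therefore carry little information. The names of the dropped columns can still be recovered from the index vector; an optional check:

# names of the columns that were dropped as near-zero variance
head(names(trainingData)[NZV])
length(NZV)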

Remove variables that are mostly NA

NaVar    <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, NaVar==FALSE]
TestSet  <- TestSet[, NaVar==FALSE]
dim(TestSet)
## [1] 5885   59
dim(TrainSet)
## [1] 13737    59
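
After this step no remaining column should be dominated by missing values; an optional sanity check:

# the largest share of NAs in any remaining column should now be (close to) zero
max(sapply(TrainSet, function(x) mean(is.na(x))))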

Remove the first five variables (identification and timestamp columns)

TrainSet <- TrainSet[, -(1:5)]
TestSet  <- TestSet[, -(1:5)]
dim(TrainSet)
## [1] 13737    54
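
The columns removed here are the record identifier, the user name and the timestamp fields, which can be confirmed on the raw data (optional):

# the leading columns of the raw data are identifiers and timestamps, not sensor readings
names(trainingData)[1:5]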

4- Correlation Analysis

Let's first look at the correlations between the predictor variables before building the models.

corMatrix <- cor(TrainSet[, -54])
corrplot(corMatrix, order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.8, tl.col = rgb(0, 0, 0))
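
In the plot, the darker cells correspond to the more strongly correlated predictor pairs. To list them by name, caret's findCorrelation can be applied to the same correlation matrix (the 0.8 cutoff is just an illustrative choice):

# predictors with a pairwise absolute correlation above 0.8
highCorr <- findCorrelation(corMatrix, cutoff = 0.8)
names(TrainSet)[highCorr]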

Building the Prediction Models

Three popular methods will be applied to model the “classe” outcome (using the Train dataset), and the one with the highest accuracy when applied to the Test dataset will be used for the quiz predictions. The methods are Random Forest, Decision Tree and Generalized Boosted Model, as described below. A confusion matrix is plotted at the end of each analysis to better visualize the accuracy of each model.

1- Random Forests

Fitting the model

set.seed(199)
RandomForestControl <- trainControl(method="cv", number=3, verboseIter=FALSE)
modFitRandForest <- train(classe ~ ., data=TrainSet, method="rf",
                          trControl=RandomForestControl)
modFitRandForest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.28%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3904    1    0    0    1 0.0005120328
## B    9 2646    2    1    0 0.0045146727
## C    0    9 2386    1    0 0.0041736227
## D    0    0    9 2242    1 0.0044404973
## E    0    1    0    4 2520 0.0019801980

Now let's make predictions on the test dataset.

predictRandForest <- predict(modFitRandForest, newdata=TestSet)
confMatRandForest <- confusionMatrix(predictRandForest, TestSet$classe)
confMatRandForest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    6    0    0    0
##          B    0 1132    2    0    0
##          C    0    1 1024    7    0
##          D    0    0    0  957    1
##          E    0    0    0    0 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9971          
##                  95% CI : (0.9954, 0.9983)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9963          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9939   0.9981   0.9927   0.9991
## Specificity            0.9986   0.9996   0.9984   0.9998   1.0000
## Pos Pred Value         0.9964   0.9982   0.9922   0.9990   1.0000
## Neg Pred Value         1.0000   0.9985   0.9996   0.9986   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1924   0.1740   0.1626   0.1837
## Detection Prevalence   0.2855   0.1927   0.1754   0.1628   0.1837
## Balanced Accuracy      0.9993   0.9967   0.9982   0.9963   0.9995

Plotting a matrix of the results

png("plot1")
plot(confMatRandForest$table, col = confMatRandForest$byClass, 
     main = paste("Random Forest - Accuracy =",
                  round(confMatRandForest$overall['Accuracy'], 4)))

dev.off()
## png 
##   2
plot(confMatRandForest$table, col = confMatRandForest$byClass, 
     main = paste("Random Forest - Accuracy =",
                  round(confMatRandForest$overall['Accuracy'], 4)))
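
As an optional check, caret's varImp shows which predictors the random forest relies on most:

# rank the 20 most important predictors in the fitted random forest
plot(varImp(modFitRandForest), top = 20)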

2- Decision Tree

Fitting the model

set.seed(199)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)

Now let's make predictions on the test dataset.

predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1500  254   40  109   91
##          B   49  594   36   23   87
##          C   20   72  831  142   82
##          D   89  147   53  625  129
##          E   16   72   66   65  693
## 
## Overall Statistics
##                                           
##                Accuracy : 0.721           
##                  95% CI : (0.7093, 0.7324)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6451          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8961   0.5215   0.8099   0.6483   0.6405
## Specificity            0.8827   0.9589   0.9350   0.9151   0.9544
## Pos Pred Value         0.7523   0.7529   0.7245   0.5992   0.7599
## Neg Pred Value         0.9553   0.8931   0.9588   0.9300   0.9218
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2549   0.1009   0.1412   0.1062   0.1178
## Detection Prevalence   0.3388   0.1341   0.1949   0.1772   0.1550
## Balanced Accuracy      0.8894   0.7402   0.8725   0.7817   0.7974

Plotting a matrix of the results

png("plot2")
plot(confMatDecTree$table, col = confMatDecTree$byClass, 
     main = paste("Decision Tree - Accuracy =",
                  round(confMatDecTree$overall['Accuracy'], 4)))
dev.off()
## png 
##   2
plot(confMatDecTree$table, col = confMatDecTree$byClass, 
     main = paste("Decision Tree - Accuracy =",
                  round(confMatDecTree$overall['Accuracy'], 4)))

3- Generalized Boosted Model

Fitting the model

set.seed(199)
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modFitGBM  <- train(classe ~ ., data=TrainSet, method = "gbm",
                    trControl = controlGBM, verbose = FALSE)
modFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
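
The tuning parameters that caret selected for the boosted model (number of trees, interaction depth, shrinkage, minimum node size) can be inspected directly; an optional check:

# the cross-validated tuning parameters behind the final GBM fit
modFitGBM$bestTune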

Now let's make predictions on the test dataset.

predictGBM <- predict(modFitGBM, newdata=TestSet)
confMatGBM <- confusionMatrix(predictGBM, TestSet$classe)
confMatGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1668    6    0    1    0
##          B    6 1126   14    8    4
##          C    0    6 1010   16    0
##          D    0    1    1  935    5
##          E    0    0    1    4 1073
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9876          
##                  95% CI : (0.9844, 0.9903)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9843          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9886   0.9844   0.9699   0.9917
## Specificity            0.9983   0.9933   0.9955   0.9986   0.9990
## Pos Pred Value         0.9958   0.9724   0.9787   0.9926   0.9954
## Neg Pred Value         0.9986   0.9972   0.9967   0.9941   0.9981
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2834   0.1913   0.1716   0.1589   0.1823
## Detection Prevalence   0.2846   0.1968   0.1754   0.1601   0.1832
## Balanced Accuracy      0.9974   0.9909   0.9899   0.9842   0.9953

Plotting a matrix of the results

png("plot3")
plot(confMatGBM$table, col = confMatGBM$byClass, 
     main = paste("GBM - Accuracy =", round(confMatGBM$overall['Accuracy'], 4)))
dev.off()
## png 
##   2
plot(confMatGBM$table, col = confMatGBM$byClass, 
     main = paste("GBM - Accuracy =", round(confMatGBM$overall['Accuracy'], 4)))

Applying the Best Model to the Test Data

The accuracy of the three models on the Test dataset is:

Random Forest: 0.9971
Decision Tree: 0.7210
GBM: 0.9876

Since the Random Forest model has the highest accuracy, it will be applied to predict the 20 quiz results (the testing dataset), as shown below.
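As a cross-check, these accuracies can be pulled directly from the confusion matrix objects created above; a minimal sketch:

# collect the overall accuracy of each model from its confusionMatrix object
accuracy <- data.frame(
    Model    = c("Random Forest", "Decision Tree", "GBM"),
    Accuracy = c(confMatRandForest$overall["Accuracy"],
                 confMatDecTree$overall["Accuracy"],
                 confMatGBM$overall["Accuracy"])
)
accuracy[order(-accuracy$Accuracy), ]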

predictTEST <- predict(modFitRandForest, newdata=testingData)
predictTEST
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
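
If individual text files are needed for the quiz submission, something along these lines can be used; the file naming scheme below is an assumption, not part of the original assignment code:

# write each of the 20 predictions to its own text file (hypothetical naming scheme)
for (i in seq_along(predictTEST)) {
    write.table(predictTEST[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
}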