Introduction

This is a submission for the final project in Coursera’s Practical Machine Learning by Johns Hopkins University, third course in the Data Science: Statistics and Machine Learning Specialization.

People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise. Any of the other variables may be used as predictors. The report should describe how the model was built, how cross validation was used, what the expected out of sample error is, and why each choice was made. The final model is also used to predict 20 different test cases.

In this report, we trained three models: Random Forest, Decision Tree and Support Vector Machine (SVM). Each was tuned with k-fold cross validation on the training data. We split the pml-training data set into training and validation sets, and used the validation set to estimate the out of sample error. The pml-testing data set was set aside for the final quiz predictions.

Of the three models, the random forest had the highest validation accuracy, about 99.3%, and the smallest estimated out of sample error, about 0.7%. We therefore use this model for the final prediction.

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source:

http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.

library(caret)
library(ggplot2)
library(dplyr)
library(skimr)
library(naniar)
library(kernlab)
library(randomForest)
library(rattle)

Getting and cleaning data

trainingDF <- read.csv("data/pml-training.csv")
pmlTesting <- read.csv("data/pml-testing.csv")
dim(trainingDF);dim(pmlTesting)
## [1] 19622   160
## [1]  20 160
# View() (capital V, from utils) opens the interactive data viewer
View(head(trainingDF[complete.cases(trainingDF), ], 10))
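One thing worth checking at this stage is how blank cells are read. By default, `read.csv` treats only the string `NA` as missing, so empty strings and spreadsheet artifacts such as `#DIV/0!` (which, as far as we can tell, occur in this data set) leave a column as text instead of numeric. The sketch below uses a small made-up CSV, not the project data, to show what the `na.strings` argument changes.

```r
# Toy CSV (hypothetical values) with a blank cell and a "#DIV/0!" marker.
csv_text <- "roll,kurtosis,classe
1.5,,A
2.0,#DIV/0!,B
2.5,0.7,A"

# Default: only the literal string "NA" is treated as missing.
raw <- read.csv(text = csv_text)

# Declaring the extra markers as missing lets the column parse as numeric.
clean <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

is.numeric(raw$kurtosis)    # FALSE: the marker forces a text column
is.numeric(clean$kurtosis)  # TRUE
sum(is.na(clean$kurtosis))  # 2
```

Reading the project files this way would change which columns count as missing in the next step, so the counts reported below would differ.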

Check for the missing values

Removing unnecessary and missing variables.

The first seven columns hold the row index, participant name, timestamps, and window markers. They identify an observation rather than describe the movement, so we remove them.

pml_training <- trainingDF %>% select(-c(1:7))
pml_training %>% miss_var_summary()
## # A tibble: 153 x 3
##    variable             n_miss pct_miss
##    <chr>                 <int>    <dbl>
##  1 max_roll_belt         19216     97.9
##  2 max_picth_belt        19216     97.9
##  3 min_roll_belt         19216     97.9
##  4 min_pitch_belt        19216     97.9
##  5 amplitude_roll_belt   19216     97.9
##  6 amplitude_pitch_belt  19216     97.9
##  7 var_total_accel_belt  19216     97.9
##  8 avg_roll_belt         19216     97.9
##  9 stddev_roll_belt      19216     97.9
## 10 var_roll_belt         19216     97.9
## # ... with 143 more rows

About 67 variables are missing in nearly 98% of the rows. We eliminate these variables by keeping only the columns with less than 90% missing values.

pml_training.no.na <- pml_training %>% select(which(colMeans(is.na(.))<0.9))
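The filter above can be illustrated on a small example. This is a minimal sketch in base R on a made-up data frame (the column names and values are hypothetical), using the same 90% threshold.

```r
# Toy data frame (hypothetical): 20 rows, one column that is 95% missing.
toy <- data.frame(
  accel    = 1:20,
  max_roll = c(rep(NA, 19), 3.2),   # 19 of 20 values missing
  gyro     = c(NA, rep(0.5, 19))    # 1 of 20 values missing
)

# Fraction of missing values per column.
na_frac <- colMeans(is.na(toy))

# Keep columns that are less than 90% missing.
kept <- toy[, na_frac < 0.9, drop = FALSE]
names(kept)  # "accel" "gyro"
```

A column with only a few missing values, like `gyro` here, survives the filter, so it may still need imputation or row removal before modeling.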

Preprocessing

Removing zero and near zero variance predictors

nzvVars <- nearZeroVar(pml_training.no.na)
pmlDf <- pml_training.no.na[,-nzvVars]
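`nearZeroVar` flags a predictor when one value dominates and there are few distinct values. The sketch below is a simplified base R illustration of those two metrics on made-up data, using the cutoffs documented as caret's defaults (`freqCut = 95/5`, `uniqueCut = 10`). It is not caret's implementation, and the helper names are hypothetical. It also shows a guard worth considering: negative indexing with an empty index vector drops every column, so `df[, -idx]` is only safe when `idx` is non-empty.

```r
# Toy data (hypothetical): "flag" is almost constant, "signal" varies freely.
set.seed(1)
toy <- data.frame(
  flag   = c(rep(0, 99), 1),
  signal = rnorm(100)
)

# Count of the most common value divided by the count of the runner-up.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # constant column
  as.numeric(counts[1] / counts[2])
}

# Distinct values as a percentage of the number of rows.
pct_unique <- function(x) 100 * length(unique(x)) / length(x)

metrics <- data.frame(
  freqRatio     = sapply(toy, freq_ratio),
  percentUnique = sapply(toy, pct_unique)
)
flagged <- which(metrics$freqRatio > 95 / 5 & metrics$percentUnique <= 10)

# Guarded removal: only drop columns when something was flagged.
reduced <- if (length(flagged) > 0) toy[, -flagged, drop = FALSE] else toy
names(reduced)  # "signal"

# The pitfall the guard avoids: an empty negative index selects no columns.
ncol(toy[, -integer(0), drop = FALSE])  # 0
```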

Check for correlated data

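numDat <- select_if(pmlDf, is.numeric)
# names = TRUE returns column names, so the removal does not depend on
# numDat and pmlDf having their columns in the same positions
highCor <- findCorrelation(cor(numDat), cutoff = 0.9, names = TRUE)
filterPmlDf <- pmlDf[, !(names(pmlDf) %in% highCor)]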
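To see what a 0.9 cutoff picks up, the following base R sketch lists the highly correlated pairs in a made-up data set. It only finds the pairs. According to its documentation, `findCorrelation` goes one step further and removes, from each pair, the variable with the larger mean absolute correlation.

```r
# Toy data (hypothetical): x2 is nearly a copy of x1, x3 is independent noise.
set.seed(42)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.05),
  x3 = rnorm(200)
)

cm <- cor(toy)

# Look only above the diagonal so each pair is reported once.
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
high_pairs  # one row: x1 and x2
```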

Splitting data to training and validation sets

After cleaning and preprocessing, we split the data into training and validation sets. The test set ("pmlTesting") is left untouched for the final prediction.

inTrain <- createDataPartition(y=filterPmlDf$classe, p=0.75, list=FALSE)
training <- filterPmlDf[inTrain,]
validation <- filterPmlDf[-inTrain,]
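`createDataPartition` samples within each class of the outcome, so both parts keep roughly the same class mix. A minimal base R sketch of the same idea on the built-in `iris` data is shown here. The seed value is arbitrary and is only an assumption for illustration. Setting a seed before the split makes the partition repeatable from run to run.

```r
# Stratified 75/25 split in base R on the built-in iris data (illustrative only).
set.seed(2024)  # arbitrary seed so the split can be reproduced

# Sample 75% of the row indices within each class.
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = floor(0.75 * length(idx)))
}))

train_set <- iris[train_idx, ]
valid_set <- iris[-train_idx, ]

# Each class contributes 37 rows to training and 13 to validation.
table(train_set$Species)
table(valid_set$Species)
```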

Creating and Testing the Models

We fit three classification models (Random Forest, Decision Tree and SVM) and compare their accuracy on the validation set to see which algorithm fits the data best.

Modeling

Cross validation

To tune each model on patterns that generalize rather than on noise in a single sample, we use 5-fold cross validation within the training set.

train_control <- trainControl(method="cv", number=5)
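What `trainControl(method = "cv", number = 5)` asks `train` to do for each candidate tuning value can be sketched by hand. The example below is a simplified illustration on the built-in `iris` data with a single `rpart` tree as a stand-in model. The seed and variable names are assumptions made for the example.

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(123)  # arbitrary seed for a reproducible fold assignment
k <- 5

# Randomly assign each row to one of k folds of equal size (30 rows each here).
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_acc <- sapply(seq_len(k), function(i) {
  fit_data <- iris[fold_id != i, ]  # k - 1 folds for fitting
  held_out <- iris[fold_id == i, ]  # remaining fold for scoring
  fit  <- rpart(Species ~ ., data = fit_data, method = "class")
  pred <- predict(fit, held_out, type = "class")
  mean(pred == held_out$Species)
})

fold_acc        # accuracy on each held-out fold
mean(fold_acc)  # cross-validated accuracy estimate
```

Every row is scored exactly once by a model that never saw it, which is why the averaged accuracy is a less optimistic estimate than accuracy on the fitting data.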

Random Forest Model

set.seed(4578)
rfMod <- train(classe~., data=training, method="rf", trControl = train_control, tuneLength = 5)
rfPred <- predict(rfMod, validation)
cmRF <- confusionMatrix(rfPred, factor(validation$classe))
cmRF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1393    6    0    0    0
##          B    1  939    7    0    0
##          C    0    4  845   12    0
##          D    0    0    3  791    1
##          E    1    0    0    1  900
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9927          
##                  95% CI : (0.9899, 0.9949)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9907          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9986   0.9895   0.9883   0.9838   0.9989
## Specificity            0.9983   0.9980   0.9960   0.9990   0.9995
## Pos Pred Value         0.9957   0.9916   0.9814   0.9950   0.9978
## Neg Pred Value         0.9994   0.9975   0.9975   0.9968   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2841   0.1915   0.1723   0.1613   0.1835
## Detection Prevalence   0.2853   0.1931   0.1756   0.1621   0.1839
## Balanced Accuracy      0.9984   0.9937   0.9922   0.9914   0.9992
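The headline statistics in this output can be recomputed directly from the confusion matrix, which helps when reading the per-class rows. The sketch below copies the random forest table printed above into a matrix. The variable names are chosen for illustration.

```r
# Random forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(1393,   6,   0,   0,   0,
       1, 939,   7,   0,   0,
       0,   4, 845,  12,   0,
       0,   0,   3, 791,   1,
       1,   0,   0,   1, 900),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy    <- sum(diag(cm)) / sum(cm)  # share of correct predictions
sensitivity <- diag(cm) / colSums(cm)   # per class: correct / actual cases
pos_pred    <- diag(cm) / rowSums(cm)   # per class: correct / predicted cases

round(accuracy, 4)     # 0.9927
round(sensitivity, 4)  # A 0.9986, B 0.9895, C 0.9883, D 0.9838, E 0.9989
```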

Decision Tree

treeMod <- train(classe~., data=training, method="rpart", trControl = train_control, tuneLength = 5)

# Plot the tree
fancyRpartPlot(treeMod$finalModel)

Prediction:

predTrees <- predict(treeMod, validation)
cmTrees <- confusionMatrix(predTrees, factor(validation$classe))
cmTrees
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1247  262  122  129   97
##          B   22  381   34   17  133
##          C   87  215  596  132  262
##          D   39   91  103  465  117
##          E    0    0    0   61  292
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6079         
##                  95% CI : (0.594, 0.6216)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.499          
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8939  0.40148   0.6971  0.57836  0.32408
## Specificity            0.8262  0.94791   0.8281  0.91463  0.98476
## Pos Pred Value         0.6715  0.64906   0.4613  0.57055  0.82720
## Neg Pred Value         0.9514  0.86843   0.9283  0.91709  0.86618
## Prevalence             0.2845  0.19352   0.1743  0.16395  0.18373
## Detection Rate         0.2543  0.07769   0.1215  0.09482  0.05954
## Detection Prevalence   0.3787  0.11970   0.2635  0.16619  0.07198
## Balanced Accuracy      0.8600  0.67469   0.7626  0.74650  0.65442

Support Vector Machine

set.seed(1234)
svmMod <- train(classe~., data=training, method="svmRadial", trControl = train_control, tuneLength = 5, verbose = FALSE)

# Prediction
predSvm <- predict(svmMod, validation)

#Confusion matrix
cmSvm <- confusionMatrix(predSvm, factor(validation$classe))
cmSvm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1390   44    3    1    0
##          B    0  895    6    0    0
##          C    4    8  845   63    5
##          D    0    0    1  739   14
##          E    1    2    0    1  882
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9688          
##                  95% CI : (0.9635, 0.9735)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9605          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9431   0.9883   0.9192   0.9789
## Specificity            0.9863   0.9985   0.9802   0.9963   0.9990
## Pos Pred Value         0.9666   0.9933   0.9135   0.9801   0.9955
## Neg Pred Value         0.9986   0.9865   0.9975   0.9843   0.9953
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2834   0.1825   0.1723   0.1507   0.1799
## Detection Prevalence   0.2932   0.1837   0.1886   0.1538   0.1807
## Balanced Accuracy      0.9914   0.9708   0.9843   0.9577   0.9890

Accuracy and Out of Sample Error

DTree <- c(cmTrees$overall["Accuracy"],1-c(cmTrees$overall["Accuracy"]))
RF <- c(cmRF$overall["Accuracy"],1-c(cmRF$overall["Accuracy"]))
SVM <- c(cmSvm$overall["Accuracy"],1-c(cmSvm$overall["Accuracy"]))

Output <- rbind(DTree,RF,SVM)
colnames(Output) <- c("Accuracy","oo_S_Err")

Output <- Output %>% apply(.,2, round,3)
Output[order(-Output[,1]),]
##       Accuracy oo_S_Err
## RF       0.993    0.007
## SVM      0.969    0.031
## DTree    0.608    0.392

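The best model is the random forest, with a validation accuracy of 0.9926591 and an estimated out of sample error rate of 0.0073409. The 95% confidence interval for its accuracy reported above, 0.9899 to 0.9949, puts the error between roughly 0.5% and 1.0%. We consider this accurate enough to use on the test set.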
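The out of sample error and an interval around it can be reproduced from the counts in the random forest confusion matrix above. This sketch assumes, as the caret documentation describes, that the reported accuracy interval comes from an exact binomial test.

```r
# Counts taken from the random forest confusion matrix above:
# 4868 correct predictions out of 4904 validation rows.
correct <- 4868
total   <- 4904

accuracy  <- correct / total
oos_error <- 1 - accuracy

# Exact binomial 95% interval for accuracy, flipped to give one for the error.
acc_ci <- binom.test(correct, total)$conf.int
err_ci <- rev(1 - acc_ci)

round(c(accuracy = accuracy, oos_error = oos_error), 4)  # 0.9927 and 0.0073
round(err_ci, 4)  # lower and upper bound of the error rate
```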

Predictions on Test Set

We use the random forest model to predict the 20 cases in the test set, since it had the highest validation accuracy.

Random Forest Model Predictions on the test set

testPredRF <- predict(rfMod, pmlTesting)
print(testPredRF)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
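The unfiltered test set can be passed to `predict` because a model fitted through the formula interface looks up its predictors by name and ignores extra columns. A quick sanity check before predicting is to confirm that no required predictor is absent. The sketch below uses small hypothetical data frames in place of `training` and `pmlTesting`.

```r
# Hypothetical stand-ins for the training and test data frames.
train_df <- data.frame(roll_belt = c(1.4, 1.5), yaw_belt = c(-94, -93),
                       classe = c("A", "B"))
test_df  <- data.frame(roll_belt = 1.6, yaw_belt = -92,
                       pitch_arm = 0.3, problem_id = 1)

predictors <- setdiff(names(train_df), "classe")

# Predictors the model needs but the test set lacks (should be empty).
missing_cols <- setdiff(predictors, names(test_df))

# Extra test columns are harmless; they are simply not used.
extra_cols <- setdiff(names(test_df), predictors)

length(missing_cols)  # 0
extra_cols            # "pitch_arm" "problem_id"
```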

Appendix

Correlation matrix of variables in training set

library(psych)
cor.plot(numDat,xlas = 2)

Plotting the models

Random Forest model

plot(rfMod)

Decision Trees

plot(treeMod)

Support Vector Machine

plot(svmMod, plotType = "line")