Introduction

This document is the final report for the Peer Assessment project of Coursera’s Practical Machine Learning course, part of the Data Science Specialization. It was written in RStudio using knitr and is meant to be published in HTML format.

This analysis is the basis for the course quiz and the prediction assignment write-up. The main goal of the project is to predict the manner in which 6 participants performed an exercise, as described below (this is the classe variable in the training set). The machine learning algorithm described here is applied to the 20 test cases in the test data, and the predictions are submitted in the appropriate format to the Course Project Prediction Quiz for automated grading.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how the exercise was performed. The participants were asked to perform barbell lifts in five different ways: exactly according to specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

More information is available from the website here: http://groupware.les.inf.puc-rio.br/har

Data Sources

The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Loading Data and Libraries

Here, we downloaded the training and testing datasets from the given URLs and loaded them into R. The required libraries were also loaded.

training.Url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testing.Url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

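# Download each file only if it is not already present locally
if (!file.exists("./pml-training.csv")) download.file(training.Url, "./pml-training.csv")
if (!file.exists("./pml-testing.csv")) download.file(testing.Url, "./pml-testing.csv")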

training.df <- read.csv("./pml-training.csv")
testing.df <- read.csv("./pml-testing.csv")
library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.5
## Loading required package: lattice
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.0.5
library(rattle)
## Warning: package 'rattle' was built under R version 4.0.5
## Loading required package: tibble
## Warning: package 'tibble' was built under R version 4.0.5
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.5
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(corrplot)
## corrplot 0.90 loaded
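
One detail worth checking at load time: by default, read.csv treats only the string NA as missing, so empty cells or spreadsheet error strings are read as ordinary values. The sketch below uses a tiny made-up CSV (not the project data; the column values and the "#DIV/0!" marker are assumptions for illustration) to show how the na.strings argument changes what is counted as missing.

```r
# Toy CSV text; the values are invented for illustration only
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,NA,B"

# Default behaviour: only the literal string "NA" is treated as missing
default_df <- read.csv(text = csv_text)

# Explicit list of missing-value markers (assumed markers, adjust as needed)
strict_df <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# The second count is larger because blanks and "#DIV/0!" now count as NA
sum(is.na(default_df))
sum(is.na(strict_df))
```

Whether this matters for the project files depends on their contents, so it is worth comparing both counts on the real data before choosing the cleaning thresholds.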

Data Cleaning and Partitioning

The training dataset contains a large number of NA values (1,287,472 in total), which can affect our model. We therefore removed the columns in which 90% or more of the values are NA.

# Number of NA values in the dataset
sum(is.na(training.df))
## [1] 1287472
# Keeping only the columns in which fewer than 90% of the values are NA
training.df <- training.df[, colMeans(is.na(training.df)) < 0.9]

# removing irrelevant metadata
training.df <- training.df[, -c(1:7)]

dim(training.df)
## [1] 19622    86
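
To make the column filter above concrete, here is a minimal sketch on a toy data frame (the values are invented for illustration). It shows that colMeans(is.na(...)) gives the share of missing values per column, which is then compared with the 0.9 threshold.

```r
# Toy data frame; the values are invented for illustration only
toy <- data.frame(
  a = 1:10,                      # no missing values
  b = rep(NA_real_, 10),         # 100% missing
  c = c(NA, 2:10)                # 10% missing
)

# Share of NA values per column
na_share <- colMeans(is.na(toy))
na_share

# Keep the columns in which fewer than 90% of the values are NA
kept <- toy[, na_share < 0.9]
names(kept)
```

Column b is dropped, while column c survives despite containing one NA; the remaining NA values in surviving columns would still need to be handled by the model or by imputation.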

Next, we removed the near-zero-variance variables, since predictors that barely change carry little information for the model.

nvz <- nearZeroVar(training.df)
training.df <- training.df[,-nvz]

dim(training.df)
## [1] 19622    53
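
The idea behind this step can be sketched in base R. caret's nearZeroVar documents two default criteria: the ratio of the most common to the second most common value (freqCut = 95/5) and the percentage of distinct values (uniqueCut = 10). The helper nzv_metrics below is hypothetical and only mirrors those criteria on toy data; it is a simplification, not caret's implementation (for example, it reports an infinite ratio for constant columns).

```r
set.seed(1)

# Toy predictors; the values are invented for illustration only
toy <- data.frame(
  varied   = rnorm(100),            # many distinct values
  constant = rep(0, 100),           # a single value: zero variance
  nearly   = c(rep(0, 98), 1, 2)    # dominated by one value
)

# Hypothetical helper mirroring the two documented criteria
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  c(freqRatio = freq_ratio, percentUnique = pct_unique)
}

metrics <- t(sapply(toy, nzv_metrics))

# Flag a column when one value dominates AND there are few distinct values
flagged <- metrics[, "freqRatio"] > 95 / 5 & metrics[, "percentUnique"] < 10
names(flagged)[flagged]
```

In this toy example the constant and nearly constant columns are flagged, while the continuous column is kept.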

We partitioned the cleaned training data into two parts: a training set (70% of the data) used for the modeling process and a validation set (the remaining 30%) used to assess the models.

set.seed(12345)

# Partitioning based on the response column
inTrain <- createDataPartition(training.df$classe, p=0.7, list=FALSE)

trainSet <- training.df[inTrain,]
testSet <- training.df[-inTrain,]

dim(trainSet)
## [1] 13737    53
dim(testSet)
## [1] 5885   53
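
Partitioning on the response column means that the split is made within each class, so both parts keep roughly the same class proportions. The following base R sketch illustrates that idea on a toy factor; it is an assumption-laden illustration of stratified sampling, not caret's actual implementation of createDataPartition.

```r
set.seed(12345)

# Toy response with unbalanced classes; invented for illustration only
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class separately
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_y <- classe[in_train]
test_y  <- classe[-in_train]

# In this toy example the class proportions match in both parts
prop.table(table(train_y))
prop.table(table(test_y))
```

A purely random 70/30 split would usually give similar proportions on a dataset of this size, but stratifying removes the chance of a rare class being under-represented in either part.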

Correlation Analysis

Before proceeding to the modeling procedures, we performed a correlation analysis among the predictors, excluding the response variable (classe).

corMatrix <- cor(trainSet[, -53])   # Excluding the classe variable
corrplot(corMatrix, method = "color", type = "lower", order = "FPC",
         tl.cex = 0.8, tl.col = rgb(0,0,0))

Highly correlated pairs of variables, whether positively or negatively correlated, are shown in dark colours in the graph above.
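
Besides reading the plot, the strongly correlated pairs can be listed directly from the correlation matrix. The sketch below does this on toy data; the 0.8 cutoff and the variable names are assumptions chosen for illustration, not values used in this report.

```r
set.seed(1)

# Toy predictors; x2 is built to be strongly correlated with x1
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),
  x3 = rnorm(200)
)

cor_toy <- cor(toy)

# Look only at the upper triangle so that each pair is reported once
high <- which(abs(cor_toy) > 0.8 & upper.tri(cor_toy), arr.ind = TRUE)

data.frame(
  var1 = rownames(cor_toy)[high[, "row"]],
  var2 = colnames(cor_toy)[high[, "col"]],
  r    = cor_toy[high]
)
```

Applied to corMatrix, the same approach would give a compact list of candidate variables for removal or for a dimension-reduction step such as PCA.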

Prediction Model Building

Three classification methods are fitted to the training set, and the best one (the one with the highest accuracy when applied to the validation set, testSet) is used for the quiz predictions. The methods are Random Forest, Decision Tree, and Generalized Boosted Model. A confusion matrix is plotted at the end of each analysis to better visualize the accuracy of the models.

Method 1: Random Forest

set.seed(12345)

# model fit

controlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
modFitRandForest <- train(classe ~ ., data=trainSet, method="rf",
                          trControl=controlRF)

modFitRandForest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.68%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3899    5    0    0    2 0.001792115
## B   19 2630    9    0    0 0.010534236
## C    0   15 2373    8    0 0.009599332
## D    0    1   21 2227    3 0.011101243
## E    0    3    4    3 2515 0.003960396
predictRandForest <- predict(modFitRandForest, newdata = testSet)
confMatRandForest <- confusionMatrix(predictRandForest, as.factor(testSet$classe))
confMatRandForest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    7    0    0    0
##          B    1 1129    4    0    0
##          C    1    3 1019    7    1
##          D    0    0    3  956    1
##          E    0    0    0    1 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9951          
##                  95% CI : (0.9929, 0.9967)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9938          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9912   0.9932   0.9917   0.9982
## Specificity            0.9983   0.9989   0.9975   0.9992   0.9998
## Pos Pred Value         0.9958   0.9956   0.9884   0.9958   0.9991
## Neg Pred Value         0.9995   0.9979   0.9986   0.9984   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1918   0.1732   0.1624   0.1835
## Detection Prevalence   0.2853   0.1927   0.1752   0.1631   0.1837
## Balanced Accuracy      0.9986   0.9951   0.9954   0.9954   0.9990
# plot matrix results
plot(confMatRandForest$table, col=confMatRandForest$byClass,
     main = paste("Random Forest - Accuracy =",
                  round(confMatRandForest$overall["Accuracy"], 4)))
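
As a check on the reported statistics, the accuracy and kappa can be recomputed by hand from the Random Forest confusion matrix printed above. This is a minimal sketch that re-enters the table manually; the variable names are our own.

```r
# Random Forest confusion matrix from the output above
# (rows: predictions, columns: reference classes)
cm <- matrix(
  c(1672,    7,    0,   0,    0,
       1, 1129,    4,   0,    0,
       1,    3, 1019,   7,    1,
       0,    0,    3, 956,    1,
       0,    0,    0,   1, 1080),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
```

The result, 5856 correct out of 5885 cases, rounds to an accuracy of 0.9951 and a kappa of 0.9938, in line with the confusionMatrix output.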

Method 2: Decision Trees

# model fit
set.seed(12345)

modFitDecTree <- rpart(classe ~ ., data=trainSet, method="class")
fancyRpartPlot(modFitDecTree)

predictDecTree <- predict(modFitDecTree, newdata=testSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, as.factor(testSet$classe))

confMatDecTree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1532  176   28   48   41
##          B   54  585   57   64   76
##          C   35  154  819  134  126
##          D   25   76   58  631   56
##          E   28  148   64   87  783
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7392          
##                  95% CI : (0.7277, 0.7503)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6692          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9152  0.51361   0.7982   0.6546   0.7237
## Specificity            0.9304  0.94711   0.9076   0.9563   0.9319
## Pos Pred Value         0.8395  0.69976   0.6459   0.7459   0.7054
## Neg Pred Value         0.9650  0.89028   0.9552   0.9339   0.9374
## Prevalence             0.2845  0.19354   0.1743   0.1638   0.1839
## Detection Rate         0.2603  0.09941   0.1392   0.1072   0.1331
## Detection Prevalence   0.3101  0.14206   0.2155   0.1438   0.1886
## Balanced Accuracy      0.9228  0.73036   0.8529   0.8054   0.8278
plot(confMatDecTree$table, col=confMatDecTree$byClass,
     main = paste("Decision Tree - Accuracy =",
                  round(confMatDecTree$overall["Accuracy"], 4)))
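
The per-class statistics show where the decision tree struggles. As a sketch, sensitivity and positive predictive value can be recomputed from the confusion matrix printed above (re-entered manually here): sensitivity divides each diagonal entry by its column total, and positive predictive value divides it by its row total.

```r
# Decision tree confusion matrix from the output above
# (rows: predictions, columns: reference classes)
cm <- matrix(
  c(1532, 176,  28,  48,  41,
      54, 585,  57,  64,  76,
      35, 154, 819, 134, 126,
      25,  76,  58, 631,  56,
      28, 148,  64,  87, 783),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Share of each true class that was predicted correctly
sensitivity <- diag(cm) / colSums(cm)

# Share of each predicted class that was actually correct
pos_pred_value <- diag(cm) / rowSums(cm)

round(rbind(sensitivity, pos_pred_value), 4)
```

Class B has the lowest sensitivity: only 585 of the 1139 true Class B cases are recovered, with most of the rest predicted as Class A, C, or E.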

Method 3: Generalized Boosted Model

set.seed(12345)

controlGBM <- trainControl(method="repeatedcv", number=5, repeats=1)
modFitGBM <- train(classe ~ ., data=trainSet, method="gbm",
                   trControl=controlGBM, verbose=FALSE)

modFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 51 had non-zero influence.
predictGBM <- predict(modFitGBM, newdata = testSet)
confMatGBM <- confusionMatrix(predictGBM, as.factor(testSet$classe))

confMatGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1647   39    0    1    1
##          B   19 1066   38    4   14
##          C    4   33  979   38    6
##          D    4    0    8  915    8
##          E    0    1    1    6 1053
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9618          
##                  95% CI : (0.9565, 0.9665)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9516          
##                                           
##  Mcnemar's Test P-Value : 8.329e-08       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9839   0.9359   0.9542   0.9492   0.9732
## Specificity            0.9903   0.9842   0.9833   0.9959   0.9983
## Pos Pred Value         0.9757   0.9343   0.9236   0.9786   0.9925
## Neg Pred Value         0.9936   0.9846   0.9903   0.9901   0.9940
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2799   0.1811   0.1664   0.1555   0.1789
## Detection Prevalence   0.2868   0.1939   0.1801   0.1589   0.1803
## Balanced Accuracy      0.9871   0.9601   0.9688   0.9726   0.9858
plot(confMatGBM$table, col=confMatGBM$byClass,
     main = paste("GBM - Accuracy =",
                  round(confMatGBM$overall["Accuracy"], 4)))
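
Both the Random Forest and the GBM fits above rely on k-fold cross-validation through trainControl (3 folds and 5 folds, respectively). The base R sketch below illustrates the general idea of k-fold splitting on toy indices; it uses plain random folds and is not caret's implementation, which handles fold creation internally.

```r
set.seed(12345)

n <- 20   # toy number of observations
k <- 5    # number of folds

# Shuffle the row indices, then deal them into k folds of equal size
fold_id <- rep(seq_len(k), length.out = n)
folds   <- split(sample(n), fold_id)

# Each fold serves once as the hold-out set; the rest is used for fitting
for (i in seq_len(k)) {
  holdout <- folds[[i]]
  fit_on  <- setdiff(seq_len(n), holdout)
  cat("fold", i, "- fit on", length(fit_on), "rows, evaluate on",
      length(holdout), "rows\n")
}
```

Every observation is held out exactly once, so the averaged hold-out performance uses all of the training data without ever evaluating a model on rows it was fitted to.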

Applying the Selected Model to the Test Data

The accuracies of the three classification methods above, measured on the validation set, are:

  1. Random Forest: 0.9951
  2. Decision Tree: 0.7392
  3. Generalized Boosted Model: 0.9618

Since Random Forest achieved the highest accuracy, it is the model applied to the 20 cases in the test dataset (testing.df).
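
The comparison can be reproduced from the counts of correct predictions, that is, the diagonal sums of the three confusion matrices above. The sketch below also derives the estimated out-of-sample error as one minus the validation accuracy; the variable names are our own.

```r
# Correct predictions (diagonal sums) from the three confusion matrices above
correct <- c(RandomForest = 5856, DecisionTree = 4350, GBM = 5660)
n_validation <- 5885   # number of rows in testSet

accuracy <- correct / n_validation
out_of_sample_error <- 1 - accuracy

round(rbind(accuracy, out_of_sample_error), 4)

# Name of the model with the highest validation accuracy
names(which.max(accuracy))
```

For Random Forest, 5856 / 5885 rounds to 0.9951, which corresponds to an estimated out-of-sample error of about 0.49%.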

predictTest <- predict(modFitRandForest, newdata = testing.df)
predictTest
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E