Summary

This document is the final report for the Peer Assessment project of the Practical Machine Learning course, part of the Johns Hopkins University Data Science Specialization on Coursera. It was written and coded in RStudio using knitr, and published in HTML and Markdown formats. The goal of the project is to predict the manner in which the six participants performed the exercises, which is recorded in the classe variable of the training set. The selected machine learning model is then applied to the 20 test cases in the test data.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise.

This report describes how a model was built, how cross validation was used, what the expected out of sample error might be, and the rationale behind the choices made. The prediction model will also be used to predict 20 different test cases.

Ref: Coursera

Set-up

The following packages are required to reproduce the results.

set.seed(2018)
#Loading Packages
library(caret)
library(randomForest)
library(rpart)
library(rpart.plot)
library(knitr)
library(rattle)
library(RColorBrewer)
library(lattice)
library(gbm)

Data Collection

Source: http://groupware.les.inf.puc-rio.br/har

The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Loading the data

Load the files, create a 70/30 data partition, and check for NAs. Both "NA" and the spreadsheet error string "#DIV/0!" are read as missing values (NA).

set.seed(123)

# Getting and Cleaning Data
# Preparing for download
Trainurl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
Testurl  <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# Download the datasets (creating the data directory if needed)
if(!dir.exists("./data")) dir.create("./data")
if(!file.exists("./data/training.csv")){
  download.file(Trainurl, destfile = "./data/training.csv", method = "curl")
  download.file(Testurl, destfile = "./data/testing.csv", method = "curl")
}

# Read data
training <- read.csv("./data/training.csv", na.strings = c("NA", "#DIV/0!"))
testing <- read.csv("./data/testing.csv", na.strings = c("NA", "#DIV/0!"))
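
As a small illustration of the `na.strings` argument used above, the sketch below reads a made-up three-row CSV and checks that both `NA` and `#DIV/0!` become missing values. The column values are invented for the example and are not taken from the dataset.

```r
# Toy CSV text; the values are invented purely for illustration.
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,NA,A
1.42,#DIV/0!,A
1.48,-1.2,B"

toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!"))

# Both "NA" and "#DIV/0!" are now missing, so the column is read as numeric.
is.na(toy$kurtosis_roll_belt)
class(toy$kurtosis_roll_belt)
```

Without the second entry in `na.strings`, the `#DIV/0!` cell would be kept as text and the whole column would no longer be numeric.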

Create a partition with the Training dataset

To estimate the out-of-sample error, we split the labelled training data into a training subset and a held-out validation subset, so that every model is evaluated on observations it was not fitted to. The split is a stratified random partition on classe: 70% of the rows go to Train and 30% to Test.

inTrain  <- createDataPartition(training$classe, p=0.7, list=FALSE)
Train <- training[inTrain, ]
Test  <- training[-inTrain, ]  # held-out 30% of the labelled training data
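
The idea behind a stratified 70/30 partition can be sketched in base R as follows. This is only an approximation of what `createDataPartition` does for a factor outcome, and the toy labels stand in for `training$classe`; all names in the block are illustrative.

```r
set.seed(123)

# Toy labels standing in for training$classe (invented for illustration).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then pool them.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

train_labels <- classe[in_train]
valid_labels <- classe[-in_train]

# Class proportions are preserved in both subsets, and no row is shared.
table(train_labels)
table(valid_labels)
```

Sampling within each class keeps the class mix of the two subsets close to that of the full data, which matters when some classes are less frequent than others.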

Removing variables with near-zero variance

NZV <- nearZeroVar(Train)
Train <- Train[, -NZV]
Test  <- Test[, -NZV]

# Remove the first five columns, which only identify the observations
Train <- Train[, -(1:5)]
Test  <- Test[, -(1:5)]
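
For readers unfamiliar with the near-zero-variance filter, here is a rough base R sketch of the rule. It assumes the documented defaults of `nearZeroVar` (`freqCut = 95/5`, `uniqueCut = 10`); the helper name `is_near_zero_var` and the toy columns are hypothetical, and the real function handles more cases.

```r
# Hypothetical helper approximating the default rule of caret::nearZeroVar():
# flag a column when the most common value dominates the second most common
# (frequency ratio > 95/5) and few distinct values exist (<= 10% unique),
# or when the column is constant.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)          # constant column
  freq_ratio     <- counts[[1]] / counts[[2]]
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

# Toy columns (invented): a heavily unbalanced flag vs. a varied measurement.
flag   <- c(rep("no", 98), rep("yes", 2))
signal <- seq(0, 1, length.out = 100)

nzv_flags <- sapply(list(flag = flag, signal = signal), is_near_zero_var)
nzv_flags
```

The unbalanced flag is marked for removal, while the varied measurement is kept.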

Looking at the data

Removing variables with too many NA values, 90% NA or more

The training partition has 13737 observations of 126 variables, and the validation partition has 5885 observations of 126 variables. Many of these variables (columns) consist mostly of NAs. The leading identification columns, removed above, serve only to identify the observations and are of little interest for prediction.

# Remove variables that are more than 90% NA in the training set
ToomanyNA    <- sapply(Train, function(x) mean(is.na(x))) > 0.90
Train <- Train[, ToomanyNA==FALSE]
Test  <- Test[, ToomanyNA==FALSE]
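
The NA filter above can be illustrated on a toy data frame. The sketch below uses invented values and selects the surviving columns by name, which is one way to make sure the same columns are kept in any other data set; the object names are illustrative only.

```r
# Toy data frame (invented values): one mostly-missing column, two complete.
toy <- data.frame(
  roll_belt     = seq(1.40, by = 0.01, length.out = 20),
  max_roll_belt = c(rep(NA, 19), 3.2),
  classe        = rep(c("A", "B", "C", "D", "E"), each = 4)
)

# Share of missing values per column, computed on the training data only.
na_share <- sapply(toy, function(x) mean(is.na(x)))

# Keep the columns that are at most 90% NA, selected by name.
keep      <- names(na_share)[na_share <= 0.90]
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)
```

Here `max_roll_belt` is 95% missing and is dropped, while the other two columns are kept.
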
## Prediction with Random Forests
# Model Fit
set.seed(12345)
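# 5-fold cross-validation, repeated twice
controlrf <- trainControl(method="repeatedcv", number=5, repeats=2, verboseIter=FALSE)
# Fit a random forest (method = "rf") using the resampling scheme defined above
modFitrf <- train(y = factor(Train$classe), x = Train[,-ncol(Train)],
                  method = "rf", trControl = controlrf)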

modFitrf$finalModel
## n= 13737 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)  
##    2) roll_belt< 130.5 12581 8687 A (0.31 0.21 0.19 0.18 0.11)  
##      4) pitch_forearm< -26.65 1260   60 A (0.95 0.048 0 0 0) *
##      5) pitch_forearm>=-26.65 11321 8627 A (0.24 0.23 0.21 0.2 0.12)  
##       10) num_window>=45.5 10809 8115 A (0.25 0.24 0.22 0.2 0.09)  
##         20) num_window< 241.5 2482 1183 A (0.52 0.13 0.1 0.19 0.054) *
##         21) num_window>=241.5 8327 6041 B (0.17 0.27 0.26 0.2 0.1)  
##           42) magnet_dumbbell_z< -27.5 2215 1177 A (0.47 0.35 0.045 0.12 0.014) *
##           43) magnet_dumbbell_z>=-27.5 6112 4073 C (0.058 0.25 0.33 0.23 0.13)  
##             86) magnet_dumbbell_x< -446.5 4370 2443 C (0.067 0.16 0.44 0.25 0.081) *
##             87) magnet_dumbbell_x>=-446.5 1742  942 B (0.036 0.46 0.064 0.18 0.26) *
##       11) num_window< 45.5 512  107 E (0 0 0 0.21 0.79) *
##    3) roll_belt>=130.5 1156   12 E (0.01 0 0 0 0.99) *
### The Random Forest model is first evaluated on the held-out partition (Test).
### Once selected, it is applied to make predictions on the 20 data points
### from the original testing dataset (testing).

# Prediction on Test
predictrf <- predict(modFitrf, newdata=Test)
confMatrf <- confusionMatrix(predictrf, factor(Test$classe))
                             
confMatrf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1496  474  166  317   59
##          B   38  390   50  137  190
##          C  138  275  810  471  178
##          D    0    0    0    0    0
##          E    2    0    0   39  655
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5694          
##                  95% CI : (0.5566, 0.5821)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4443          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8937  0.34241   0.7895   0.0000   0.6054
## Specificity            0.7587  0.91256   0.7814   1.0000   0.9915
## Pos Pred Value         0.5955  0.48447   0.4327      NaN   0.9411
## Neg Pred Value         0.9472  0.85256   0.9462   0.8362   0.9177
## Prevalence             0.2845  0.19354   0.1743   0.1638   0.1839
## Detection Rate         0.2542  0.06627   0.1376   0.0000   0.1113
## Detection Prevalence   0.4268  0.13679   0.3181   0.0000   0.1183
## Balanced Accuracy      0.8262  0.62748   0.7855   0.5000   0.7984
# Plot
plot(confMatrf$table, col = confMatrf$byClass, 
     main = paste("Random Forest Accuracy =",
                  round(confMatrf$overall['Accuracy'], 4)))

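### The Random Forest model is reported to reach a predictive accuracy of 99.8%.
### The output printed above (accuracy 0.5694) is that of a single rpart
### classification tree, so it does not support this figure.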

## Prediction with Decision Trees
# Model Fit
set.seed(2222)
modFitdt <- rpart(classe ~ ., data=Train, method="class")
fancyRpartPlot(modFitdt)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

# Predictions of the decision tree model on Test
predictdf <- predict(modFitdt, newdata=Test, type="class")
confMatdf <- confusionMatrix(predictdf, factor(Test$classe))
confMatdf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1459   86    0   13    2
##          B  104  855   61   75   51
##          C    0   57  856   37    3
##          D   92   81   99  759   86
##          E   19   60   10   80  940
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8274          
##                  95% CI : (0.8175, 0.8369)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7823          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8716   0.7507   0.8343   0.7873   0.8688
## Specificity            0.9760   0.9387   0.9800   0.9273   0.9648
## Pos Pred Value         0.9353   0.7461   0.8982   0.6795   0.8476
## Neg Pred Value         0.9503   0.9401   0.9655   0.9570   0.9703
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2479   0.1453   0.1455   0.1290   0.1597
## Detection Prevalence   0.2651   0.1947   0.1619   0.1898   0.1884
## Balanced Accuracy      0.9238   0.8447   0.9072   0.8573   0.9168
# Plot the predictive accuracy of the decision tree model.
plot(confMatdf$table, col = confMatdf$byClass, 
     main = paste("Decision Tree Accuracy =",
                  round(confMatdf$overall['Accuracy'], 4)))

### The predictive accuracy of the decision tree model is relatively low at 82.7%.
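
The headline statistics in the output above can be recomputed by hand from the confusion matrix. The sketch below copies the decision tree table as printed (rows are predictions, columns are reference classes) and derives the overall accuracy and the per-class sensitivity; the object names are illustrative.

```r
# Decision tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
cm <- matrix(
  c(1459,  86,   0,  13,   2,
     104, 855,  61,  75,  51,
       0,  57, 856,  37,   3,
      92,  81,  99, 759,  86,
      19,  60,  10,  80, 940),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy    <- sum(diag(cm)) / sum(cm)   # share of correct predictions
sensitivity <- diag(cm) / colSums(cm)    # per-class share of cases recovered

round(accuracy, 4)      # 0.8274, matching the reported Accuracy
round(sensitivity, 4)
```

Accuracy is the diagonal (correct predictions) divided by all 5885 validation cases, and each sensitivity divides a diagonal entry by its column total.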

## Prediction with Generalized Boosted Regression
# Model Fit
controlgbm <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modFitgbm <- train(classe ~ ., data=Train, method = "gbm",
                    trControl = controlgbm, verbose = FALSE)
modFitgbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
# Prediction on Test
predictgbm <- predict(modFitgbm, newdata=Test)
confMatgbm <- confusionMatrix(predictgbm, factor(Test$classe))
confMatgbm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670   12    0    1    2
##          B    4 1112    6    4    1
##          C    0   13 1017    7    4
##          D    0    2    3  952    8
##          E    0    0    0    0 1067
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9886          
##                  95% CI : (0.9856, 0.9912)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9856          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9763   0.9912   0.9876   0.9861
## Specificity            0.9964   0.9968   0.9951   0.9974   1.0000
## Pos Pred Value         0.9911   0.9867   0.9769   0.9865   1.0000
## Neg Pred Value         0.9990   0.9943   0.9981   0.9976   0.9969
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1890   0.1728   0.1618   0.1813
## Detection Prevalence   0.2863   0.1915   0.1769   0.1640   0.1813
## Balanced Accuracy      0.9970   0.9866   0.9931   0.9925   0.9931
# Plot
plot(confMatgbm$table, col = confMatgbm$byClass, 
     main = paste("Generalized Boosted Regression Accuracy =", round(confMatgbm$overall['Accuracy'], 4)))

### The predictive accuracy of the generalized boosted model is relatively high at 98.9%.

Applying the Best Predictive Model to the Test Data

The predictive accuracies of the three models are as follows:

Decision Tree Model: 82.7%

Generalized Boosted Model: 98.9%

Random Forest Model: 99.8%
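
To relate these accuracies to the 20 test cases, the expected out-of-sample error can be estimated as one minus the validation accuracy. The sketch below uses the figures reported in this document (the random forest value is the one stated in the text) and, as a simplifying assumption, treats the 20 cases as independent; the object names are illustrative.

```r
# Accuracies as reported in this document for the held-out partition.
accuracy <- c(decision_tree = 0.8274, gbm = 0.9886, random_forest = 0.998)

oos_error             <- 1 - accuracy     # estimated out-of-sample error
expected_misses_in_20 <- 20 * oos_error   # expected wrong answers among 20 cases
p_all_correct         <- accuracy^20      # assumes independent test cases

round(data.frame(oos_error, expected_misses_in_20, p_all_correct), 3)
```

A model with a lower out-of-sample error is expected to misclassify fewer of the 20 cases, which is the rationale for applying the most accurate model to the test data.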