The objective of this study is to predict the manner in which 6 participants performed the exercise. After downloading, cleaning, and preparing the data, models were built with 3 algorithms: (1) Classification Trees (CART), (2) Random Forests (RF), and (3) Generalized Boosted Regression Modeling (GBM). After all models were tested, the random forest model provided the best accuracy of the three. Model validation was performed in the final section.
The training dataset was then separated into 2 groups: (1) a training set (70% of the training data) and (2) a testing set (the remaining 30%), while the validation data was left unchanged.
library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.2
library(rattle)
## Warning: package 'rattle' was built under R version 3.6.2
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
# Download the raw data and read it in; pml-testing.csv serves as the
# 20-case validation set used for the quiz predictions
trainURL = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainURL, "./training.csv")
download.file(testURL, "./testing.csv")
raw.training <- read.csv("./training.csv")
validation <- read.csv("./testing.csv")
t(data.frame(Training = dim(raw.training), Validation = dim(validation),
row.names = c("Observations", "Variables")))
## Observations Variables
## Training 19622 160
## Validation 20 160
set.seed(12345)
inTrain <- createDataPartition(y = raw.training$classe, p = 0.7,
list = FALSE)
training <- raw.training[inTrain, ]
testing <- raw.training[-inTrain, ]
t(data.frame(Training = dim(training), Testing = dim(testing),
row.names = c("Observations", "Variables")))
## Observations Variables
## Training 13737 160
## Testing 5885 160
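As a quick check (not part of the original output), the stratified split can be verified by comparing the class proportions of the two sets; createDataPartition() stratifies on classe, so the distributions should be nearly identical:
# Compare class proportions between the training and testing sets
round(rbind(Training = prop.table(table(training$classe)),
            Testing  = prop.table(table(testing$classe))), 3)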
Because each dataset contained a large number of variables, the incomplete variables, which mostly contained NA or blank values or had very few unique values, needed to be removed. In addition, the first seven variables carried only identification information that was unlikely to contribute to the measurements.
# Remove the first seven (identification) variables:
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]
# Remove the incomplete variables (columns containing NA); the columns
# are chosen on the training set so both sets stay aligned:
complete.cols <- colSums(is.na(training)) == 0
training <- training[, complete.cols]
testing <- testing[, complete.cols]
# Remove variables with very few unique values (near zero variance),
# again selected on the training set only:
nzv <- nearZeroVar(training)
training <- training[, -nzv]
testing <- testing[, -nzv]
# Data for processing
t(data.frame(Training = dim(training), Testing = dim(testing),
row.names = c("Observations", "Variables")))
## Observations Variables
## Training 13737 53
## Testing 5885 53
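One extra sanity check (added here, not in the original output): because both column selections were derived from the training set, the two cleaned sets must share exactly the same variables.
# Both sets should keep exactly the same 53 columns
identical(names(training), names(testing))
## [1] TRUE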
Outliers could cause significant model error; therefore, they had to be identified and replaced with NA values before the models were built.
strip.outliers <- function(v, upperz = 3, lowerz = -3, replace = NA) {
    z <- scale(v) # z-scores for vector v
    # Replace values more than 3 standard deviations from the mean with NA
    v[!is.na(z) & (z > upperz | z < lowerz)] <- replace
    return(v)
}
training.out.rm <- data.frame(lapply(training[, -c(53)], strip.outliers))
training <- cbind(training.out.rm, classe = training$classe)
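As a quick sanity check (not shown in the original report), the number of measurements flagged as outliers can be counted directly:
# Total values replaced with NA by strip.outliers()
sum(is.na(training[, -53]))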
To reduce the number of predictors and the noise, all 52 predictor variables of the training set were analyzed with a correlation matrix and plot.
cor.mat <- cor(training[, -53], use = "complete.obs")
highlyCorrelated <- findCorrelation(cor.mat, cutoff = 0.9)
names(training)[highlyCorrelated]
## [1] "accel_belt_z" "roll_belt" "accel_belt_y" "magnet_belt_x"
## [5] "accel_belt_x" "gyros_arm_x"
The list shows the highly correlated variables. However, since only a few variables were highly correlated, a PCA (Principal Components Analysis) step was likely not required for this assignment.
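For reference, had the correlations been more widespread, caret's preProcess() could compress the predictors with PCA. A minimal sketch follows; the 0.95 variance threshold is an illustrative choice, not from the original analysis:
# Sketch only: build principal components retaining 95% of the variance;
# rows with NA (introduced by the outlier step) are dropped first,
# since PCA cannot handle missing values
pca.pre <- preProcess(na.omit(training[, -53]), method = "pca", thresh = 0.95)
train.pca <- predict(pca.pre, na.omit(training[, -53]))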
Three methods were applied to create prediction models from the training dataset; the model with the highest accuracy would then be applied to the validation dataset, which would also be used for the quiz predictions.
These 3 methods were: (1) Classification Trees: CART, (2) Random Forests: RF, and (3) Generalized Boosted Regression Modeling: GBM.
Note that a confusion matrix was computed at the end of each analysis to better visualize the accuracy of the models. Furthermore, 5-fold cross-validation was applied to limit the effects of overfitting and improve model efficiency, while all NA values were omitted.
trControl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
# Classification Trees
modFit.cart <- train(classe ~ ., method = "rpart", data = training,
trControl = trControl, na.action = na.omit)
# Random Forests
modFit.rf <- train(classe ~ ., method = "rf", data = training,
trControl = trControl, na.action = na.omit)
# Generalized Boosted Regression Modeling
modFit.gbm <- train(classe ~ ., method = "gbm", data = training,
trControl = trControl, na.action = na.omit)
# Predict on the held-out testing set and compute a confusion matrix
# for each model
pred.cart <- predict(modFit.cart, newdata = testing)
cm.cart <- confusionMatrix(pred.cart, testing$classe)
pred.rf <- predict(modFit.rf, newdata = testing)
cm.rf <- confusionMatrix(pred.rf, testing$classe)
pred.gbm <- predict(modFit.gbm, newdata = testing)
cm.gbm <- confusionMatrix(pred.gbm, testing$classe)
data.frame(Model = c("CART", "RF", "GBM"), Accuracy = c(cm.cart$overall[1],
cm.rf$overall[1], cm.gbm$overall[1]))
## Model Accuracy
## 1 CART 0.4897196
## 2 RF 0.9785896
## 3 GBM 0.9403568
According to the assessment of these 3 model fits and the out-of-sample error results, both RF and GBM were significantly better than the CART model, with random forests being slightly more accurate; the random forest model was therefore applied to the validation data.
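From the accuracy table above, the expected out-of-sample error of the chosen random forest model can be estimated as one minus its testing-set accuracy (added here for completeness):
# Estimated out-of-sample error for the random forest model
unname(1 - cm.rf$overall["Accuracy"])
## [1] 0.0214104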
# Apply the random forest model to the 20 validation cases
validMod <- predict(modFit.rf, newdata = validation)
validMod
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
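To see which sensor variables drive the random forest's decisions, caret's varImp() can be called on the fitted model (a supplementary check, not part of the original output):
# Rank predictors by the random forest's importance measure
varImp(modFit.rf)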
# Zero the diagonal, then list variable pairs with correlation above 0.9
diag(cor.mat) <- 0
which(cor.mat > 0.9, arr.ind = TRUE)
## row col
## total_accel_belt 4 1
## accel_belt_y 9 1
## roll_belt 1 4
## accel_belt_y 9 4
## magnet_belt_x 11 8
## roll_belt 1 9
## total_accel_belt 4 9
## accel_belt_x 8 11
# Correlation plot of the predictors, ordered by first principal component
corrplot(cor.mat, order = "FPC", method = "color", type = "upper",
         tl.cex = 0.6, tl.col = rgb(0, 0, 0))
# Decision tree from the CART model
fancyRpartPlot(modFit.cart$finalModel)
# Random forest accuracy by number of randomly selected predictors
plot(modFit.rf, main = "Accuracy of Random forest model by number of predictors")
# GBM accuracy across boosting iterations and tree depth
plot(modFit.gbm)