The objective of this study is to predict the manner in which 6 participants performed the exercise. After downloading, cleaning, and preparing the data, models were built with 3 algorithms: (1) Classification Trees (CART), (2) Random Forests (RF), and (3) Generalized Boosted Regression Modeling (GBM). After all models were tested, the random forest model provided the best accuracy of the three. Model validation was performed in the final section.
The training dataset was then separated into 2 groups: (1) a training set (70% of the training data) and (2) a testing set (the remaining 30%), while the validation data was left unchanged.
library(caret)
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.2
library(rattle)
## Warning: package 'rattle' was built under R version 3.6.2
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
# Download the raw data and read it in; pml-testing.csv serves as the
# 20-case validation set used for the quiz predictions
trainURL = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainURL, "./training.csv")
download.file(testURL, "./testing.csv")
raw.training <- read.csv("./training.csv")
validation <- read.csv("./testing.csv")
t(data.frame(Training = dim(raw.training), Validation = dim(validation),
row.names = c("Observations", "Variables")))
## Observations Variables
## Training 19622 160
## Validation 20 160
set.seed(12345)
inTrain <- createDataPartition(y = raw.training$classe, p = 0.7,
list = FALSE)
training <- raw.training[inTrain, ]
testing <- raw.training[-inTrain, ]
t(data.frame(Training = dim(training), Testing = dim(testing),
row.names = c("Observations", "Variables")))
## Observations Variables
## Training 13737 160
## Testing 5885 160
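As a quick check (not part of the original output), the stratified split can be verified by comparing the class proportions of the two sets; createDataPartition() stratifies on classe, so the distributions should be nearly identical:
# Compare class proportions between the training and testing sets
round(rbind(Training = prop.table(table(training$classe)),
            Testing  = prop.table(table(testing$classe))), 3)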
Because each dataset contained a large number of variables, the incomplete variables, which mostly contained NA or blank values or had very few unique values, needed to be removed. In addition, the first seven variables carried only identification information that was unlikely to contribute to the measurements.
# Remove the first seven (identification) variables:
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]
# Remove the incomplete variables (columns containing NA); the columns
# are chosen on the training set so both sets stay aligned:
complete.cols <- colSums(is.na(training)) == 0
training <- training[, complete.cols]
testing <- testing[, complete.cols]
# Remove variables with very few unique values (near zero variance),
# again selected on the training set only:
nzv <- nearZeroVar(training)
training <- training[, -nzv]
testing <- testing[, -nzv]
# Data for processing
t(data.frame(Training = dim(training), Testing = dim(testing),
row.names = c("Observations", "Variables")))
## Observations Variables
## Training 13737 53
## Testing 5885 53
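One extra sanity check (added here, not in the original output): because both column selections were derived from the training set, the two cleaned sets must share exactly the same variables.
# Both sets should keep exactly the same 53 columns
identical(names(training), names(testing))
## [1] TRUE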
Outliers could cause significant model error; therefore, they had to be identified and replaced with NA values before the models were built.
strip.outliers <- function(v, upperz = 3, lowerz = -3, replace = NA) {
    z <- scale(v) # z-scores for vector v
    # Replace values more than 3 standard deviations from the mean with NA
    v[!is.na(z) & (z > upperz | z < lowerz)] <- replace
    return(v)
}
training.out.rm <- data.frame(lapply(training[, -c(53)], strip.outliers))
training <- cbind(training.out.rm, classe = training$classe)
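As a quick sanity check (not shown in the original report), the number of measurements flagged as outliers can be counted directly:
# Total values replaced with NA by strip.outliers()
sum(is.na(training[, -53]))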
To reduce the number of predictors and the noise, all 52 predictor variables of the training set were analyzed with a correlation matrix and plot.
cor.mat <- cor(training[, -53], use = "complete.obs")
highlyCorrelated <- findCorrelation(cor.mat, cutoff = 0.9)
names(training)[highlyCorrelated]
## [1] "accel_belt_z" "roll_belt" "accel_belt_y" "magnet_belt_x"
## [5] "accel_belt_x" "gyros_arm_x"
The list shows the highly correlated variables. However, since only a few variables were highly correlated, a PCA (Principal Components Analysis) step was likely not required for this assignment.
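For reference, had the correlations been more widespread, caret's preProcess() could compress the predictors with PCA. A minimal sketch follows; the 0.95 variance threshold is an illustrative choice, not from the original analysis:
# Sketch only: build principal components retaining 95% of the variance;
# rows with NA (introduced by the outlier step) are dropped first,
# since PCA cannot handle missing values
pca.pre <- preProcess(na.omit(training[, -53]), method = "pca", thresh = 0.95)
train.pca <- predict(pca.pre, na.omit(training[, -53]))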
Three methods were applied to create prediction models from the training dataset; the model with the highest accuracy would then be applied to the validation dataset, which would also be used for the quiz predictions.
These 3 methods were: (1) Classification Trees: CART, (2) Random Forests: RF, and (3) Generalized Boosted Regression Modeling: GBM.
Note that a confusion matrix was computed at the end of each analysis to better visualize the accuracy of the models. Furthermore, 5-fold cross-validation was applied to limit the effects of overfitting and improve model efficiency, while all NA values were omitted.
trControl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
# Classification Trees
modFit.cart <- train(classe ~ ., method = "rpart", data = training,
trControl = trControl, na.action = na.omit)
# Random Forests
modFit.rf <- train(classe ~ ., method = "rf", data = training,
trControl = trControl, na.action = na.omit)
# Generalized Boosted Regression Modeling
modFit.gbm <- train(classe ~ ., method = "gbm", data = training,
trControl = trControl, na.action = na.omit)
# Predict on the held-out testing set and compute a confusion matrix
# for each model
pred.cart <- predict(modFit.cart, newdata = testing)
cm.cart <- confusionMatrix(pred.cart, testing$classe)
pred.rf <- predict(modFit.rf, newdata = testing)
cm.rf <- confusionMatrix(pred.rf, testing$classe)
pred.gbm <- predict(modFit.gbm, newdata = testing)
cm.gbm <- confusionMatrix(pred.gbm, testing$classe)
data.frame(Model = c("CART", "RF", "GBM"), Accuracy = c(cm.cart$overall[1],
cm.rf$overall[1], cm.gbm$overall[1]))
## Model Accuracy
## 1 CART 0.4897196
## 2 RF 0.9785896
## 3 GBM 0.9403568
According to the assessment of these 3 model fits and the out-of-sample error results, both RF and GBM were significantly better than the CART model, with random forests being slightly more accurate; the random forest model was therefore applied to the validation data.
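From the accuracy table above, the expected out-of-sample error of the chosen random forest model can be estimated as one minus its testing-set accuracy (added here for completeness):
# Estimated out-of-sample error for the random forest model
unname(1 - cm.rf$overall["Accuracy"])
## [1] 0.0214104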
# Apply the random forest model to the 20 validation cases
validMod <- predict(modFit.rf, newdata = validation)
validMod
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
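To see which sensor variables drive the random forest's decisions, caret's varImp() can be called on the fitted model (a supplementary check, not part of the original output):
# Rank predictors by the random forest's importance measure
varImp(modFit.rf)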
# Zero the diagonal, then list variable pairs with correlation above 0.9
diag(cor.mat) <- 0
which(cor.mat > 0.9, arr.ind = TRUE)
## row col
## total_accel_belt 4 1
## accel_belt_y 9 1
## roll_belt 1 4
## accel_belt_y 9 4
## magnet_belt_x 11 8
## roll_belt 1 9
## total_accel_belt 4 9
## accel_belt_x 8 11
# Correlation plot of the predictors, ordered by first principal component
corrplot(cor.mat, order = "FPC", method = "color", type = "upper",
         tl.cex = 0.6, tl.col = rgb(0, 0, 0))
# Decision tree from the CART model
fancyRpartPlot(modFit.cart$finalModel)
# Random forest accuracy by number of randomly selected predictors
plot(modFit.rf, main = "Accuracy of Random forest model by number of predictors")
# GBM accuracy across boosting iterations and tree depth
plot(modFit.gbm)