Prediction Assignment Writeup

1. Overview

Platform: Coursera
Course: Practical Machine Learning
Task: Week 4 Final Project
Location: Texas, USA
Link: www.coursera.org/learn/practical-machine-learning/

2. Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, to predict how each lift was performed (the classe variable).

3. Data Loading and Exploratory Analysis

a. Data Source

The training data and the testing data are below:

Training set (Download)
Test set (Download)

b. Environment Setup

if(!require("pacman")) install.packages("pacman")
pacman::p_load(knitr, corrplot, caret, randomForest, rattle, gbm)
options(digits = 3)
set.seed(123)

c. Data Loading and Cleaning

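# file names
train.file = "pml-training.csv"
test.file = "pml-testing.csv"

# load datasets from the working directory
train = read.csv(train.file)
test = read.csv(test.file)

# make sure the outcome is a factor (read.csv returns character columns by default in R >= 4.0)
train$classe = factor(train$classe)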

# partition the training data: 70% for model fitting, 30% held out for validation
in.train = createDataPartition(train$classe, p = 0.7, list = FALSE)
train.set = train[in.train, ]
test.set = train[-in.train, ]

# check dataset
dim(train.set)

## [1] 13737   160

First, we remove the near-zero-variance variables, the mostly-NA variables, and the identification variables from the training set. Importantly, the quiz test dataset (test) is not changed and is used only to generate the quiz results.

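# remove near-zero-variance variables
nzv = nearZeroVar(train.set)
# guard: indexing with an empty negative index would drop every column
if (length(nzv) > 0) train.set = train.set[, -nzv]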

# check dataset
dim(train.set)

## [1] 13737   111

Variables with near-zero variance carry almost no information, so they are removed before modeling.
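As a rough illustration of what such a filter checks, the sketch below computes two statistics in base R: the ratio between the counts of the two most common values, and the percentage of distinct values. The helper name nzv_stats and the toy columns are hypothetical, and the cutoffs are assumed to follow caret's documented defaults (freqCut = 95/5, uniqueCut = 10). It is a simplified sketch, not caret's implementation.

```r
# Simplified sketch of a near-zero-variance check (hypothetical helper, base R only)
nzv_stats = function(x, freq.cut = 95 / 5, unique.cut = 10) {
  counts = sort(table(x), decreasing = TRUE)
  # count of the most common value divided by count of the second most common
  freq.ratio = if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # distinct values as a percentage of all observations
  pct.unique = 100 * length(counts) / length(x)
  list(freq.ratio = freq.ratio,
       pct.unique = pct.unique,
       nzv = length(counts) <= 1 ||
             (freq.ratio > freq.cut && pct.unique <= unique.cut))
}

# toy columns (made-up data, not from the project)
mostly.constant = c(rep("no", 98), rep("yes", 2))
well.spread = rep(1:50, times = 2)

nzv_stats(mostly.constant)$nzv  # flagged: one value dominates
nzv_stats(well.spread)$nzv      # not flagged: values are evenly spread
```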

# remove variables with more than 95% missing values
near.na = sapply(train.set, function(x) mean(is.na(x))) > 0.95
train.set = train.set[, !near.na]

# check dataset
dim(train.set)

## [1] 13737    59

Variables that are mostly NA give the models almost nothing to learn from, so they are removed as well.
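To make the missing-value filter concrete, here is a small sketch on a made-up data frame (the object toy and its values are hypothetical). It computes the share of NA values per column and keeps the columns at or below the same 95% threshold.

```r
# toy data frame (made-up values, not the project data)
toy = data.frame(a = c(1, 2, 3, 4),
                 b = c(NA, NA, NA, NA),
                 c = c(1, NA, 3, 4))

# share of missing values in each column
na.share = sapply(toy, function(x) mean(is.na(x)))
na.share

# keep only the columns at or below the 95% threshold
kept = names(toy)[na.share <= 0.95]
kept
```

Column b is entirely missing and is dropped, while column c, with a single NA, is kept.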

# identification variable
train.set = train.set[, -c(1:5)]

# check dataset
dim(train.set)

## [1] 13737    54

Identification variables (the first five columns) say nothing about how an exercise was performed, so they are dropped too. After cleaning, the analysis-ready dataset has 53 independent variables (predictors) and 1 dependent variable (classe).

d. Exploratory Data Analysis

cor.matrix = cor(train.set[, -54])
corrplot(cor.matrix, order = "FPC", method = "color", type = "lower",
         tl.cex = 0.6, tl.col = rgb(0, 0, 0))

The correlations among the predictors are analysed before modeling. If many predictors were strongly correlated, a principal components analysis (PCA) could be performed as a preprocessing step to make the data more compact. However, the plot shows only a few strong correlations, so PCA is not applied.
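A correlation plot can be backed up with a numeric check. The sketch below lists the predictor pairs whose absolute correlation exceeds a cutoff. The toy data and the cutoff of 0.8 are assumptions chosen for illustration, not values taken from the analysis above.

```r
set.seed(1)
# toy predictors (made-up): x2 is built to track x1 closely, x3 is independent
x1 = rnorm(200)
toy = data.frame(x1 = x1,
                 x2 = x1 + rnorm(200, sd = 0.1),
                 x3 = rnorm(200))

cor.toy = cor(toy)

# positions of strongly correlated pairs; the upper triangle lists each pair once
cutoff = 0.8  # assumed threshold
idx = which(abs(cor.toy) > cutoff & upper.tri(cor.toy), arr.ind = TRUE)

high.pairs = data.frame(var1 = rownames(cor.toy)[idx[, "row"]],
                        var2 = colnames(cor.toy)[idx[, "col"]],
                        r = cor.toy[idx])
high.pairs
```

On the cleaned training data the same idea would start from cor.matrix. A short list of pairs supports skipping PCA, while a long one would argue for it.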

4. Model Building

We use three methods to build models, random forest, decision tree, and generalized boosted model, and also try stacking them together.

a. Random Forest

# train model with 3-fold cross-validation
mod.rf = train(classe ~ .,
               data = train.set,
               method = "rf",
               trControl = trainControl(method = "cv",
                                        number = 3))

# predict on the held-out validation set
pred.rf = predict(mod.rf, newdata = test.set)

# evaluate model (rows = prediction, columns = reference)
cm.rf = confusionMatrix(table(pred.rf, test.set$classe))

# plot confusion matrix
plot(cm.rf$table, col = cm.rf$byClass,
     main = "Random Forest",
     sub = paste("Accuracy =", round(cm.rf$overall[1], 4)),
     xlab = "Prediction",
     ylab = "Reference")
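The setting trainControl(method = "cv", number = 3) used above requests 3-fold cross-validation. The sketch below shows the basic idea in base R on a hypothetical 12-row dataset: every row is assigned to one of three folds, and each fold is held out once while the remaining rows are used for fitting. It assigns folds purely at random, which is a simplification and not necessarily how caret builds its folds.

```r
set.seed(123)
n = 12  # toy number of rows (hypothetical)
k = 3   # number of folds, matching number = 3 above

# assign each row to one of k folds at random, with equal fold sizes here
fold.id = sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held.out = which(fold.id == fold)         # rows used to assess the model
  fit.rows = setdiff(seq_len(n), held.out)  # rows used to fit the model
  cat("fold", fold, ": fit on", length(fit.rows),
      "rows, assess on", length(held.out), "rows\n")
}
```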

b. Decision Tree

# train model with 3-fold cross-validation
mod.dt = train(classe ~ .,
               data = train.set,
               method = "rpart",
               trControl = trainControl(method = "cv",
                                        number = 3,
                                        verboseIter = FALSE))
fancyRpartPlot(mod.dt$finalModel) # decision tree plot (optional)

# predict on the held-out validation set
pred.dt = predict(mod.dt, newdata = test.set)

# evaluate model (rows = prediction, columns = reference)
cm.dt = confusionMatrix(table(pred.dt, test.set$classe))

# plot confusion matrix
plot(cm.dt$table, col = cm.dt$byClass,
     main = "Decision Tree",
     sub = paste("Accuracy =", round(cm.dt$overall[1], 4)),
     xlab = "Prediction",
     ylab = "Reference")

c. Generalized Boosted Model

# train model with 3-fold cross-validation
mod.gbm = train(classe ~ .,
                data = train.set,
                method = "gbm",
                trControl = trainControl(method = "cv",
                                         number = 3),
                verbose = FALSE)

# predict on the held-out validation set
pred.gbm = predict(mod.gbm, newdata = test.set)

# evaluate model (rows = prediction, columns = reference)
cm.gbm = confusionMatrix(table(pred.gbm, test.set$classe))

# plot confusion matrix
plot(cm.gbm$table, col = cm.gbm$byClass,
     main = "Generalized Boosted Model",
     sub = paste("Accuracy =", round(cm.gbm$overall[1], 4)),
     xlab = "Prediction",
     ylab = "Reference")

5. Applying the Model to the Test Dataset

The accuracies of the three models on the held-out validation set are:

Random Forest (0.9986)
Decision Tree (0.5694)
Generalized Boosted Model (0.9845)
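Each accuracy above is the share of validation cases on the diagonal of the confusion matrix, and one minus that share estimates the out-of-sample error (for the random forest, 1 - 0.9986 = 0.0014). The sketch below shows the arithmetic on a made-up 3-class table. The counts are hypothetical and unrelated to the results above.

```r
# toy confusion table (made-up counts): rows = prediction, columns = reference
cm = matrix(c(50,  2,  0,
               1, 45,  3,
               0,  4, 48),
            nrow = 3, byrow = TRUE,
            dimnames = list(Prediction = c("A", "B", "C"),
                            Reference  = c("A", "B", "C")))

# accuracy = correctly classified cases / all cases
accuracy = sum(diag(cm)) / sum(cm)

# estimated out-of-sample error rate
error.rate = 1 - accuracy

round(c(accuracy = accuracy, error.rate = error.rate), 4)
```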

pred.test = predict(mod.rf, newdata = test)
pred.test

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

The random forest model has the highest validation accuracy, so it is used to predict the 20 quiz results shown above. We also tried stacking the models, but stacking did not work well for this classification problem and did not improve the accuracy.
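The stacking attempt itself is not shown here. As a loosely related illustration only, the sketch below combines three prediction vectors by simple majority vote. This is a plain voting ensemble rather than stacking in the strict sense (no meta-model is trained), and it is not the approach tried in this writeup. The vectors p1, p2, and p3 are hypothetical.

```r
# toy predictions from three hypothetical models for five cases
p1 = c("A", "B", "C", "A", "E")
p2 = c("A", "B", "D", "B", "E")
p3 = c("B", "B", "C", "C", "E")

votes = cbind(p1, p2, p3)

# most frequent label per case; ties fall back to the first model's prediction
majority = apply(votes, 1, function(v) {
  counts = table(v)
  winners = names(counts)[counts == max(counts)]
  if (length(winners) == 1) winners else v[[1]]
})
unname(majority)
```

Because ties fall back to the first column, the most accurate model would normally be placed first. When one model is far more accurate than the others, as the random forest is here, a vote can do little beyond reproducing its predictions.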