
1. Introduction

This report is the final project for the Coursera Practical Machine Learning course.

The data is from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

The data were collected from sensors (devices such as the Jawbone Up, Nike FuelBand, and Fitbit) attached to the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The objective of this project is to predict the manner in which the participants did the exercise. The variable to be predicted is called “classe”.

Our outcome variable “classe” is a factor variable with 5 levels. For this dataset, participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions:

  1. exactly according to the specification (Class A)

  2. throwing the elbows to the front (Class B)

  3. lifting the dumbbell only halfway (Class C)

  4. lowering the dumbbell only halfway (Class D)

  5. throwing the hips to the front (Class E)

The report covers how the models were built, how a held-out validation set was used for cross-validation, the expected out-of-sample error, and the predictions for 20 different test cases.

2. Load Dataset

# Load dataset. Both "" and "NA" mark missing values, so pass them as na.strings
train <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", header=TRUE, stringsAsFactors = TRUE, na.strings = c("","NA"))
test <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", header=TRUE, stringsAsFactors = TRUE, na.strings = c("","NA"))
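As a quick sanity check after loading (a minimal sketch), we can confirm the dimensions of both sets and the distribution of the outcome classes:

# Dimensions of the raw data and distribution of the outcome classes
dim(train); dim(test)
round(prop.table(table(train$classe)), 3)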

3. Data Exploration

First, check for any missing package dependencies and install them, then take a look at the data summaries and check each column for NA values.

# Check for missing dependencies and load necessary R packages
if(!require(caret)){install.packages('caret')}; library(caret)
if(!require(rattle)){install.packages('rattle')}; library(rattle)
if(!require(randomForest)){install.packages('randomForest')}; library(randomForest)
if(!require(MASS)){install.packages('MASS')}; library(MASS)
if(!require(ggplot2)){install.packages('ggplot2')}; library(ggplot2)


# Check summary of Train & Test data
# summary(train); str(train); head(train); summary(test); str(test); head(test)  

# Check NA count for each column in Train
# sapply(train, function(x) sum(is.na(x)))
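A compact way to see the NA pattern (a small sketch of the idea): count the NAs per column and tabulate the counts. Columns in this dataset tend to be either complete or almost entirely NA, which motivates dropping the NA-heavy columns in the next section.

# Tabulate NA counts per column: columns are either complete or mostly NA
na.counts <- colSums(is.na(train))
table(na.counts)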

4. Data Cleaning

The summary shows that many columns are mostly NA. These columns are removed, along with the first seven columns (row index, user name, timestamps, and window indicators), which are not sensor measurements.

# Remove NA columns for Train
train2 <- train[ , apply(train, 2, function(x) !any(is.na(x)))]

# Remove the first 7 identifier/timestamp columns; keep 52 predictors + classe
train2 <- train2[,8:60]
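A quick check that the cleaning worked as intended (the expected shape follows from dropping the NA-heavy columns and the first seven non-sensor columns):

# Confirm shape: expect 19622 rows and 53 columns (52 predictors + classe)
dim(train2)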

Next, split the train data into 60% for training the models and 40% for testing them. The model with the highest accuracy on the held-out testing split will be used to predict the final outcome for the 20 test cases.

# Create Index for training (seed set so the split is reproducible)
set.seed(123)
IndexTrain <- createDataPartition(y=train2$classe, p=0.6, list=FALSE)
training <- train2[IndexTrain,]
testing <- train2[-IndexTrain,]
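Note that createDataPartition samples within each level of classe, so both splits keep roughly the same class proportions as the full data. A quick check:

# Class proportions should be similar in both splits
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)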

5. Model Building

5.1. Decision Tree

Using the train function in the caret package with method=“rpart”, we fit a Decision Tree model to the training data.

# Train Tree Model
tree1 <- train(classe~., method="rpart", data=training)
tree1$finalModel
## n= 11776 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 11776 8428 A (0.28 0.19 0.17 0.16 0.18)  
##     2) roll_belt< 130.5 10773 7435 A (0.31 0.21 0.19 0.18 0.11)  
##       4) pitch_forearm< -33.95 924    5 A (0.99 0.0054 0 0 0) *
##       5) pitch_forearm>=-33.95 9849 7430 A (0.25 0.23 0.21 0.2 0.12)  
##        10) yaw_belt>=169.5 505   48 A (0.9 0.046 0 0.044 0.0059) *
##        11) yaw_belt< 169.5 9344 7093 B (0.21 0.24 0.22 0.2 0.13)  
##          22) magnet_dumbbell_z< -87.5 1236  537 A (0.57 0.29 0.049 0.076 0.023) *
##          23) magnet_dumbbell_z>=-87.5 8108 6115 C (0.16 0.23 0.25 0.22 0.14)  
##            46) pitch_belt< -42.95 464   77 B (0.0065 0.83 0.11 0.022 0.026) *
##            47) pitch_belt>=-42.95 7644 5703 C (0.16 0.2 0.25 0.24 0.15)  
##              94) magnet_dumbbell_x>=-447.5 3282 2289 B (0.17 0.3 0.095 0.24 0.19)  
##               188) roll_belt< 117.5 2058 1171 B (0.17 0.43 0.025 0.13 0.25) *
##               189) roll_belt>=117.5 1224  707 D (0.18 0.087 0.21 0.42 0.1) *
##              95) magnet_dumbbell_x< -447.5 4362 2732 C (0.16 0.12 0.37 0.24 0.11) *
##     3) roll_belt>=130.5 1003   10 E (0.01 0 0 0 0.99) *
# Plot Tree Model
fancyRpartPlot(tree1$finalModel, tweak=1.5)

# Predictions using Testing dataset
tree.pred <- predict(tree1, newdata = testing)

# ConfusionMatrix for Tree Model
tree.confuse <- confusionMatrix(tree.pred, testing$classe)
tree.confuse
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1374  253   38   81   22
##          B  240  880   51  202  352
##          C  471  324 1081  663  353
##          D  143   61  198  340   77
##          E    4    0    0    0  638
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5497          
##                  95% CI : (0.5386, 0.5608)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.435           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6156   0.5797   0.7902  0.26439  0.44244
## Specificity            0.9298   0.8665   0.7204  0.92698  0.99938
## Pos Pred Value         0.7771   0.5101   0.3738  0.41514  0.99377
## Neg Pred Value         0.8588   0.8958   0.9421  0.86538  0.88840
## Prevalence             0.2845   0.1935   0.1744  0.16391  0.18379
## Detection Rate         0.1751   0.1122   0.1378  0.04333  0.08132
## Detection Prevalence   0.2253   0.2199   0.3686  0.10438  0.08183
## Balanced Accuracy      0.7727   0.7231   0.7553  0.59568  0.72091

Based on the confusion matrix, the accuracy of the Decision Tree model is 0.5497069, which implies an expected out-of-sample error of about 1 − 0.5497 = 0.4503.

5.2. Random Forest

For the Random Forest model, the mtry parameter (the number of predictors sampled at each split) was tuned manually, and the value with the lowest out-of-bag (OOB) error was used to train the final model.

# Manually tune for optimal mtry: fit one forest per candidate value and
# record the out-of-bag (OOB) error after all 500 trees
oob.errs <- rep(0, 13)
for(m in 1:13){
    set.seed(123)
    rf <- randomForest(classe ~ ., data=training, mtry=m)
    oob.errs[m] <- rf$err.rate[500, "OOB"]  # OOB column of the error-rate matrix
}

# Plot OOB Error for each mtry
plot(1:13, oob.errs, type="b", xlab="mtry", ylab="OOB Error")

oob.errs
##  [1] 0.015030571 0.009595788 0.008661685 0.007387908 0.006623641
##  [6] 0.007218071 0.006878397 0.006623641 0.007218071 0.006453804
## [11] 0.006538723 0.006963315 0.007133152
optimal.mtry <- which.min(oob.errs)
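For this run the minimum OOB error occurs at mtry = 10, which matches the “No. of variables tried at each split” reported by the final model below.

optimal.mtry
## [1] 10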


# Train randomForest Model with optimal mtry
rf1 <- randomForest(classe~., data=training, mtry=optimal.mtry)
rf1
## 
## Call:
##  randomForest(formula = classe ~ ., data = training, mtry = optimal.mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 0.68%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3342    3    0    1    2 0.001792115
## B   20 2252    7    0    0 0.011847301
## C    0   10 2040    4    0 0.006815969
## D    0    0   20 1908    2 0.011398964
## E    0    0    3    8 2154 0.005080831
# Predictions using Testing dataset
rf.pred <- predict(rf1, newdata = testing)

# ConfusionMatrix for Random Forest Model
rf.confuse <- confusionMatrix(rf.pred, testing$classe)
rf.confuse
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2224   14    0    0    0
##          B    8 1503    4    0    0
##          C    0    1 1359   16    2
##          D    0    0    5 1266    1
##          E    0    0    0    4 1439
## 
## Overall Statistics
##                                           
##                Accuracy : 0.993           
##                  95% CI : (0.9909, 0.9947)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9911          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9901   0.9934   0.9844   0.9979
## Specificity            0.9975   0.9981   0.9971   0.9991   0.9994
## Pos Pred Value         0.9937   0.9921   0.9862   0.9953   0.9972
## Neg Pred Value         0.9986   0.9976   0.9986   0.9970   0.9995
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2835   0.1916   0.1732   0.1614   0.1834
## Detection Prevalence   0.2852   0.1931   0.1756   0.1621   0.1839
## Balanced Accuracy      0.9970   0.9941   0.9952   0.9918   0.9986

Based on the confusion matrix, the accuracy of the Random Forest model is 0.9929901, an expected out-of-sample error of about 0.7%, consistent with the OOB estimate of 0.68% above.

5.3. Gradient Boosting Model

For the Gradient Boosting model, the train function in the caret package was used with method=“gbm”. Setting verbose=FALSE suppresses the many iteration messages.

# Train Gradient Boosting Model
gbm <- train(classe~., method="gbm", data=training, verbose=FALSE)
gbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 43 had non-zero influence.
# Predictions using Testing dataset
gbm.pred <- predict(gbm, newdata = testing)

# ConfusionMatrix for GBM
gbm.confuse <- confusionMatrix(gbm.pred, testing$classe)
gbm.confuse
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2191   58    0    2    2
##          B   33 1420   32    5   15
##          C    4   35 1312   43   15
##          D    3    3   22 1227   19
##          E    1    2    2    9 1391
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9611          
##                  95% CI : (0.9566, 0.9653)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9508          
##  Mcnemar's Test P-Value : 6.701e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9816   0.9354   0.9591   0.9541   0.9646
## Specificity            0.9890   0.9866   0.9850   0.9928   0.9978
## Pos Pred Value         0.9725   0.9435   0.9312   0.9631   0.9900
## Neg Pred Value         0.9927   0.9845   0.9913   0.9910   0.9921
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2793   0.1810   0.1672   0.1564   0.1773
## Detection Prevalence   0.2872   0.1918   0.1796   0.1624   0.1791
## Balanced Accuracy      0.9853   0.9610   0.9720   0.9735   0.9812

Based on the confusion matrix, the accuracy of the Gradient Boosting model is 0.9611267, an expected out-of-sample error of about 3.9%.
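Behind the scenes, caret tunes gbm over a small default grid of boosting parameters (number of trees, interaction depth, shrinkage) using resampling and keeps the best combination. The selected values can be inspected (a minimal sketch; the exact values depend on the run):

# Inspect the tuning parameter combination caret selected
gbm$bestTune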

6. Model Selection

Based on the summary table below, the model with the best accuracy is the Random Forest model. This model will therefore be used to predict the final class for the 20 subjects in the test data.

# Create table for comparison of Accuracy
table1 <- data.frame(
  Model=c("Random Forest", "Gradient Boosting", "Decision Tree"),
  Accuracy=c(rf.confuse$overall[[1]], gbm.confuse$overall[[1]], tree.confuse$overall[[1]]),
  "ConfInv 95 Lower"=c(rf.confuse$overall[[3]], gbm.confuse$overall[[3]], tree.confuse$overall[[3]]),
  "ConfInv 95 Upper"=c(rf.confuse$overall[[4]], gbm.confuse$overall[[4]], tree.confuse$overall[[4]])
)

table1
##               Model  Accuracy ConfInv.95.Lower ConfInv.95.Upper
## 1     Random Forest 0.9929901        0.9908853        0.9947149
## 2 Gradient Boosting 0.9611267        0.9566112        0.9652953
## 3     Decision Tree 0.5497069        0.5386176        0.5607591

7. Test Data Prediction

Applying the trained Random Forest model to the original test data, we obtain the predicted classes shown below.

# Predict outcome on the original Testing data set using Random Forest model
predictfinal <- predict(rf1, newdata=test, type="class")
predictfinal
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
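Finally, for the course’s prediction quiz each answer is submitted as a single letter. A small helper like the one below (an illustrative sketch, not part of the original analysis) writes one text file per prediction:

# Hypothetical helper: write each of the 20 predictions to its own file
write_prediction_files <- function(preds) {
    for (i in seq_along(preds)) {
        write.table(preds[i], file=paste0("problem_id_", i, ".txt"),
                    quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}
write_prediction_files(as.character(predictfinal))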