The objective of this project is to recognize the quality of exercises performed while wearing activity-tracking devices such as the Fitbit or Nike FuelBand. The subjects were asked to perform the Dumbbell Biceps Curl in several different ways, and the sensors measured the orientation of the movements performed.
Here is the description of the meaning of the response values from the original paper: “…exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.”
First, we read the CSV files from the URLs provided on the assignment page.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(RCurl)
## Loading required package: bitops
set.seed(1111)
train <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",stringsAsFactors=FALSE, na.strings=c("NA","","#DIV/0!"))
test <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",stringsAsFactors=FALSE, na.strings=c("NA","","#DIV/0!"))
Next, we remove variables from the training dataset that do not contribute to the final model: 1) character variables such as ‘user_name’ and the date/timestamp variables, and 2) variables that have near-zero variance (using nearZeroVar()) or more than 90% missing values.
We then remove exactly the same variables from the testing dataset, as in the sketch below.
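The cleaning code itself was not included in this write-up; the following is a minimal sketch of how train.clean and test.clean could be built from the description above (the identifier/timestamp column names and the 90% threshold are assumptions):
# Drop identifier and timestamp columns (column names assumed from the raw CSV)
id.cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
             "cvtd_timestamp", "new_window")
train.clean <- train[, !(names(train) %in% id.cols)]
# Drop columns with more than 90% missing values
mostly.na <- colMeans(is.na(train.clean)) > 0.9
train.clean <- train.clean[, !mostly.na]
# Drop near-zero-variance columns
nzv <- nearZeroVar(train.clean)
if (length(nzv) > 0) train.clean <- train.clean[, -nzv]
# Keep the same predictor columns in the testing dataset
test.clean <- test[, names(test) %in% names(train.clean)]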
Once the mostly-missing and near-zero-variance variables are removed, we split the original training dataset into a training part (3/4) and a testing part (1/4). The training part is used to fit the model and the testing part is used to validate the fit.
# split the cleaned training data into training and testing subsets for validation
testindex <- sample(1:nrow(train.clean), nrow(train.clean) / 4)
train.ds <- train.clean[-testindex, ]
test.ds <- train.clean[testindex, ]
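As a design note, caret’s createDataPartition() could be used instead of sample() to make a stratified split that preserves the class proportions of classe in both subsets. A sketch (not used in this analysis):
# stratified 75/25 split on the response (alternative to the random split above)
inTrain <- createDataPartition(train.clean$classe, p = 0.75, list = FALSE)
train.ds <- train.clean[inTrain, ]
test.ds <- train.clean[-inTrain, ]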
For the machine learning algorithm, we start with a classification tree as our baseline. A classification tree is used to predict a categorical response. At the root, a predictor is chosen based on its importance and split into two parts; depending on the yes/no answer, we move to the left or right child node. At each node another predictor is chosen and the same process is repeated, continuing until we reach a leaf, at which point the final decision is made for that leaf.
We use the rpart method, through caret’s train() function, to build our classification tree:
fit.rp <- train(factor(classe) ~ ., method="rpart", data = train.ds,
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE)
)
## Loading required package: rpart
cm.rp = confusionMatrix(predict(fit.rp, test.ds), test.ds$classe)
cm.rp
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1247 245 130 141 35
## B 25 326 23 144 58
## C 139 379 662 489 262
## D 0 0 0 0 0
## E 3 0 0 42 555
##
## Overall Statistics
##
## Accuracy : 0.5688
## 95% CI : (0.5548, 0.5827)
## No Information Rate : 0.2883
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4496
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8819 0.34316 0.8123 0.0000 0.6099
## Specificity 0.8422 0.93679 0.6897 1.0000 0.9887
## Pos Pred Value 0.6935 0.56597 0.3428 NaN 0.9250
## Neg Pred Value 0.9463 0.85586 0.9486 0.8336 0.9175
## Prevalence 0.2883 0.19368 0.1662 0.1664 0.1855
## Detection Rate 0.2542 0.06646 0.1350 0.0000 0.1131
## Detection Prevalence 0.3666 0.11743 0.3937 0.0000 0.1223
## Balanced Accuracy 0.8620 0.63997 0.7510 0.5000 0.7993
The estimated out-of-sample error is correspondingly high.
# Out of Sample Error
OutofsampleError = 1- cm.rp$overall[1]
names(OutofsampleError ) = "OutofsampleError"
OutofsampleError
## OutofsampleError
## 0.4311927
Note that 10-fold cross-validation was applied (via trainControl) when fitting the tree. The confusion matrix shows an accuracy of about 0.57, which is quite low. We could try to improve the tree by pruning its branches or by tuning its parameters (a sketch of such tuning follows), but we decide instead to move on to another machine learning algorithm, Random Forest.
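For illustration only, here is a minimal sketch of how the tree’s complexity parameter (cp) could be tuned with caret; the tuneLength value is an assumption and this model was not fit in this analysis:
# try a larger grid of cp values instead of switching algorithms (sketch only)
fit.rp.tuned <- train(factor(classe) ~ ., method = "rpart", data = train.ds,
                      tuneLength = 10,
                      trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE))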
Random Forest uses a similar technique to classification/decision trees, but as an ensemble learning method: it grows many trees, each on a bootstrap sample of the data and considering only a small random subset of the predictors at each split, and then combines all of these ‘weak learners’ by majority vote to end up with a strong learner.
The model below creates 500 trees and randomly picks 7 candidate variables at each split (the randomForest defaults for this dataset).
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
rf.fit <- randomForest(factor(classe)~., data=train.ds,proximity=T )
rf.pred = predict(rf.fit,test.ds)
rf.cm = confusionMatrix(rf.pred, test.ds$classe)
rf.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1413 2 0 0 0
## B 0 948 0 0 0
## C 0 0 815 3 0
## D 0 0 0 812 1
## E 1 0 0 1 909
##
## Overall Statistics
##
## Accuracy : 0.9984
## 95% CI : (0.9968, 0.9993)
## No Information Rate : 0.2883
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9979
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9979 1.0000 0.9951 0.9989
## Specificity 0.9994 1.0000 0.9993 0.9998 0.9995
## Pos Pred Value 0.9986 1.0000 0.9963 0.9988 0.9978
## Neg Pred Value 0.9997 0.9995 1.0000 0.9990 0.9997
## Prevalence 0.2883 0.1937 0.1662 0.1664 0.1855
## Detection Rate 0.2881 0.1933 0.1662 0.1655 0.1853
## Detection Prevalence 0.2885 0.1933 0.1668 0.1657 0.1857
## Balanced Accuracy 0.9994 0.9989 0.9996 0.9974 0.9992
The estimated out-of-sample error is very low.
# Out of Sample Error
OutofsampleError = 1- rf.cm$overall[1]
names(OutofsampleError ) = "OutofsampleError"
OutofsampleError
## OutofsampleError
## 0.001630989
Using the variable importance feature of the randomForest() function, we look at a few plots that show the relationship between the predictors and the response. These plots show the relationships between the variables that Random Forest determined to have high importance; the colors correspond to the response values. Here we see that, for different ranges of the num_window variable, the response classes (A–E) are nicely clustered. Random Forest identifies these relationships and builds an ensemble model from them.
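The plotting code was not included in the write-up; a minimal sketch of how such plots could be produced is shown here (the choice of roll_belt as a second high-importance predictor is an assumption):
# Top predictors ranked by importance in the fitted forest
varImpPlot(rf.fit, n.var = 10, main = "Top 10 predictors by importance")
# Relationship between two high-importance predictors, coloured by the response
library(ggplot2)
qplot(num_window, roll_belt, data = train.ds, colour = factor(classe))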
The confusion matrix shows an accuracy of about 0.998, which is very high compared to the low accuracy of the classification tree. The per-class sensitivity and specificity are also very high. This shows that Random Forest combines many weak learners into a strong ensemble model.
Also, the additional test dataset provided for the course project submission is correctly predicted by the RF model (according to the submission feedback!).
predict(rf.fit,test.clean)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E