Introduction

Using data collected from six participants who wore accelerometers on a belt, forearm, arm, and dumbbell, the goal was to build a model that predicts how a participant executed a biceps curl exercise. According to the paper for which the data was originally collected (Velloso et al., 2013), there are five “Classes” describing the manner in which the participant executed the exercise: Class A corresponds to correct execution, while Classes B-E are various incorrect forms. Students in the Coursera class were provided a training data set containing all of the variables, including the class variable, as well as a testing data set without the class variable. The task was to predict the class for each observation in the testing data set using a model built from the training data set.

Classification Tree

After reading in the data, I first separated the training set into a sub-training set and a validation set using the createDataPartition function, so that models could be evaluated before being applied to the testing set. I then trimmed both data sets, removing the summary variables that consisted mostly of NA values, as well as a few other variables that were not useful for modeling (e.g. the participants' names and the timestamps). All remaining variables were numeric except the outcome variable, classe.
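In brief, the trimming and the 70/30 split look like this (excerpted from the appendix):

training_data_NA_RM <- training_data[, -grep("^kurtosis|^skewness|^max|^min|^amplitude|^var|^avg|^stddev",
                                             colnames(training_data))]
training_data_NA_RM <- training_data_NA_RM[, -c(1:6)]  # row id, name, timestamps, new_window

inTrain <- createDataPartition(y = training_data_NA_RM$classe, p = .7, list = FALSE)
training1 <- training_data_NA_RM[inTrain, ]
validation <- training_data_NA_RM[-inTrain, ]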

Using all of the remaining variables as predictors and classe as the outcome, I first fit a classification tree model. Then I used the varImp function from the caret package to gauge which variables were important enough to carry into a random forest model.
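The tree fit, the validation predictions, and the importance ranking shown below were produced as follows (excerpted from the appendix):

modFit <- train(classe ~ ., data = training1, method = "rpart")
pred <- predict(modFit, validation)
confusionMatrix(pred, validation$classe)
varImp(modFit, scale = FALSE)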

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1510  463  462  427  179
##          B   27  389   36  163  141
##          C  132  287  528  374  299
##          D    0    0    0    0    0
##          E    5    0    0    0  463
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4911          
##                  95% CI : (0.4782, 0.5039)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3352          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9020   0.3415  0.51462   0.0000  0.42791
## Specificity            0.6364   0.9227  0.77526   1.0000  0.99896
## Pos Pred Value         0.4965   0.5146  0.32593      NaN  0.98932
## Neg Pred Value         0.9423   0.8538  0.88324   0.8362  0.88573
## Prevalence             0.2845   0.1935  0.17434   0.1638  0.18386
## Detection Rate         0.2566   0.0661  0.08972   0.0000  0.07867
## Detection Prevalence   0.5167   0.1285  0.27528   0.0000  0.07952
## Balanced Accuracy      0.7692   0.6321  0.64494   0.5000  0.71344
## rpart variable importance
## 
##   only 20 most important variables shown (out of 53)
## 
##                      Overall
## pitch_forearm         1464.3
## roll_forearm          1069.3
## roll_belt             1062.1
## magnet_dumbbell_y      734.6
## accel_belt_z           645.2
## magnet_belt_y          615.6
## yaw_belt               590.6
## num_window             585.7
## total_accel_belt       526.3
## magnet_arm_x           396.6
## accel_arm_x            383.9
## roll_dumbbell          283.6
## magnet_dumbbell_z      271.7
## accel_dumbbell_y       233.9
## magnet_forearm_x         0.0
## gyros_forearm_x          0.0
## total_accel_dumbbell     0.0
## accel_forearm_y          0.0
## magnet_arm_z             0.0
## magnet_forearm_z         0.0

Random Forest Model

While the classification tree had poor predictive value (about 49% accuracy on the validation set), it provided information about potentially useful predictor variables. For the sake of parsimony and scalability, I selected the top variables ranked by importance in this model (eight distinct predictors; see the appendix) to use in a random forest model. Since training a random forest through caret is quite time consuming, I used 3-fold cross-validation with parallel processing to reduce the run time.
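The fit registers a parallel backend before training and releases it afterwards (excerpted from the appendix):

cluster <- makeCluster(detectCores() - 1)  # leave one core for the OS
registerDoParallel(cluster)
RFmodelFit <- train(classe ~ pitch_forearm + roll_forearm + roll_belt +
                      magnet_dumbbell_y + accel_belt_z + magnet_belt_y +
                      yaw_belt + num_window,
                    data = training1, method = "rf",
                    prox = TRUE, trControl = trainControl(method = "cv", number = 3,
                                                          allowParallel = TRUE))
stopCluster(cluster)
registerDoSEQ()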

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    1    0    0    0
##          B    0 1137    0    0    2
##          C    0    1 1026    0    0
##          D    0    0    0  964    1
##          E    0    0    0    0 1079
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9992         
##                  95% CI : (0.998, 0.9997)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9989         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9982   1.0000   1.0000   0.9972
## Specificity            0.9998   0.9996   0.9998   0.9998   1.0000
## Pos Pred Value         0.9994   0.9982   0.9990   0.9990   1.0000
## Neg Pred Value         1.0000   0.9996   1.0000   1.0000   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1932   0.1743   0.1638   0.1833
## Detection Prevalence   0.2846   0.1935   0.1745   0.1640   0.1833
## Balanced Accuracy      0.9999   0.9989   0.9999   0.9999   0.9986
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

According to the confusion matrix for the validation data set, accuracy was 99.9%. For the 20 observations in the testing data set, this model correctly identified all 20 classes.
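The 20 printed predictions above come from applying the fitted model to the trimmed testing data (from the appendix):

predict(RFmodelFit, testing_data_NA_RM)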

Summary

How I Built the Model

My approach was to build a quick model using a classification tree, then use the most important variables from that model to build a random forest. Another approach would be to build a random forest using all variables as predictors and then determine which were most important, but that would take much more time. The random forest model I used takes about ten minutes to run in R on a MacBook Pro.
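For comparison, the slower all-variables approach would look like the sketch below (the RFall name is illustrative; this was not run for this report):

RFall <- train(classe ~ ., data = training1, method = "rf",
               trControl = trainControl(method = "cv", number = 3,
                                        allowParallel = TRUE))
varImp(RFall)  # then rank all 53 predictors by importance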

Cross Validation

I used 3-fold cross-validation to reduce the processing time of the random forest model. caret's default resampling method is “boot” (i.e. bootstrapping, with 25 resamples), which requires a much longer processing time.
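Concretely, the only change from the default is the trControl argument (the ctrl name is just for illustration):

# 3-fold cross-validation in place of the default bootstrap resampling
ctrl <- trainControl(method = "cv", number = 3, allowParallel = TRUE)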

Expected Out of Sample Error

Below is an estimate of the out of sample error, computed as one minus the accuracy on the validation data set.
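It is simply the misclassification rate of the validation predictions (from the appendix):

out_of_sample_accuracy <- sum(pred == validation$classe) / length(pred)
out_of_sample_error <- 1 - out_of_sample_accuracy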

## [1] 0.0008496177

Given that the validation data set yielded an accuracy of about 99.9% and the model correctly predicted all 20 test cases, the expected out of sample error rate is very low (<1%). However, because the validation set was also used to compare models during development, this estimate may slightly understate the true out of sample error.

Reasoning Behind Model Creation

Most of the choices I made were for the sake of scalability. The random forest model has good accuracy, and the processing time isn't unreasonably long (about ten minutes). The small number of folds (3) keeps the variance of the cross-validation estimate low, at the expense of some additional bias.

Appendix (R code)

training <- "train.csv"
testing <- "test.csv"

#download the file
if(!file.exists(training)){
  fileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  download.file(fileURL, training, method = "curl")
}

if(!file.exists(testing)){
  fileURL1 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  download.file(fileURL1, testing, method = "curl")
}

set.seed(1)
training_data <- read.csv("train.csv")
testing_data <- read.csv("test.csv")
# ensure the outcome is a factor (read.csv no longer converts strings by default in R >= 4.0)
training_data$classe <- factor(training_data$classe)
library(dplyr)
library(parallel)
library(doParallel)
library(caret)
library(rpart.plot)
# keep only the raw sensor columns: drop the mostly-NA summary columns
training_data_NA_RM <- training_data[, -grep("^kurtosis|^skewness|^max|^min|^amplitude|^var|^avg|^stddev",
                                             colnames(training_data))]
# drop the row id, participant name, timestamp, and new_window columns
training_data_NA_RM <- training_data_NA_RM[, -c(1:6)]

testing_data_NA_RM <- testing_data[, -grep("^kurtosis|^skewness|^max|^min|^amplitude|^var|^avg|^stddev",
                                           colnames(testing_data))]
testing_data_NA_RM <- testing_data_NA_RM[, -c(1:6)]

inTrain <- createDataPartition(y=training_data_NA_RM$classe, p = .7, list=F)

training1 <- training_data_NA_RM[inTrain, ]
validation <- training_data_NA_RM[-inTrain, ]

modFit <- train(classe ~ ., data = training1, method = "rpart")
modFit$finalModel

rpart.plot(modFit$finalModel)

pred <- predict(modFit, validation)
confusionMatrix(pred, validation$classe)

importance <- varImp(modFit, scale = FALSE)  # printed in decreasing order of importance
print(importance)

cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

RFmodelFit <- train(classe ~ pitch_forearm + roll_forearm + roll_belt +
                      magnet_dumbbell_y + accel_belt_z + magnet_belt_y +
                      yaw_belt + num_window,
                    data = training1, method = "rf",
                    prox = TRUE, trControl = trainControl(method = "cv",
                                                          number = 3, allowParallel = TRUE))
stopCluster(cluster)
registerDoSEQ()


pred <- predict(RFmodelFit, validation)
confusionMatrix(pred, validation$classe)
predict(RFmodelFit, testing_data_NA_RM)

out_of_sample_accuracy <- sum(pred == validation$classe)/length(pred)
out_of_sample_error <- 1 - out_of_sample_accuracy
out_of_sample_error

References: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.