Executive Summary

This analysis uses data from personal fitness monitors such as the Jawbone Up, Nike FuelBand and Fitbit - specifically, readings from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The machine learning goal is to predict the manner in which they did the exercise; this is the “classe” variable in the training dataset. The analysis partitions the training data, reserving 30% of the observations for validation, and fits six machine learning models - random forest, linear discriminant analysis, naive Bayes, k-nearest neighbors, a gradient boosting machine and an rpart tree - then chooses the model with the lowest out of sample error rate to make the final predictions. A random forest model proves the most accurate - with an out of sample error rate of about 0.5% - so it is used to predict the 20 cases in the test set. Plots of the predictions and of variable importance for the chosen model are included in an appendix.

Processing

After loading the function libraries…

require(caret); require(randomForest); require(MASS); require(klaR); require(gbm); require(plyr); 
require(splines); require(rpart); require(knitr)

Input Data Preparation

Data is read from two separate .csv files - one for training and a second test file for the final prediction. In some cases the files have values that need to be converted to NA as they are read in order to keep the columns numeric. The training file has 19,622 rows and 159 columns plus the class outcome.

trainCsv <- read.csv("pml-training.csv", sep=",", na.strings=c("NA","#DIV/0!",""))
testCsv  <- read.csv("pml-testing.csv",  sep=",", na.strings=c("NA","#DIV/0!",""))
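
A quick check of the dimensions (a small sketch; the row and column counts quoted above come from this kind of check):

dim(trainCsv)                              #19,622 rows and 160 columns, including classe
dim(testCsv)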

The first columns include a row number, timestamps and other data that could confuse a model without providing information. Once these are removed, a number of columns still contain missing (NA) values. Attempts to use these columns were quite unsuccessful, so here we remove any column where 50% or more of the values are NA - and we do the same to the test set.

trainSet <- trainCsv[, -(1:7)]                                            #cannot be predictors(time, row)
trainSet <- trainSet[, colSums(is.na(trainSet)) < nrow(trainSet) * 0.5]   #Drop columns with 50% NA or more

testing  <- testCsv [, -(1:7)]                                            #cannot be predictors(time, row)
testing  <- testing [, colSums(is.na(testing))  < nrow(testing) * 0.5]    #Drop columns with 50% NA or more
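
A quick sanity check (a sketch) shows whether any NA values remain and which columns survive only in the training set:

sum(is.na(trainSet))                       #NAs remaining in the reduced training set
setdiff(names(trainSet), names(testing))   #columns kept in the training set but not the test set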

We partition the training file into a training set and a validation set, using a 70/30 split, to check predictions from the models. A seed is set for reproducibility. The partition function keeps a similar distribution of the ‘classe’ outcome in the training and validation sets.

set.seed(173)
inTrain <- createDataPartition(trainSet$classe, p=0.70, list=FALSE)       #Save 30% to validate the models
training <- trainSet[ inTrain, ]
validatn <- trainSet[-inTrain, ]
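
To confirm that the split preserves the class distribution, the classe proportions in the two partitions can be compared (a quick check):

round(rbind(training   = prop.table(table(training$classe)),
            validation = prop.table(table(validatn$classe))), 3)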

Fitting and selecting a machine learning model

Using the training partition we fit six different classification models, each with 10-fold cross validation.

tcontrol <- trainControl(method="cv", number=10, verboseIter = FALSE) 
modelRF  <- train(classe ~ ., data=training, method="rf",  trControl=tcontrol, ntree=200)
modelLDA <- train(classe ~ ., data=training, method="lda", trControl=tcontrol)
modelNB  <- train(classe ~ ., data=training, method="nb" , trControl=tcontrol)
modelKNN <- train(classe ~ ., data=training, method="knn", trControl=tcontrol)
modelGBM <- train(classe ~ ., data=training, method="gbm", trControl=tcontrol, verbose=FALSE)
modelRP  <- train(classe ~ ., data=training, method="rpart",trControl=tcontrol, tuneLength=10)
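
The cross-validation results of the six fits can also be compared side by side with caret's resamples() helper before any validation predictions are made (a sketch):

cvResults <- resamples(list(RF=modelRF, LDA=modelLDA, NB=modelNB,
                            KNN=modelKNN, GBM=modelGBM, RP=modelRP))
summary(cvResults)                         #cross-validated Accuracy and Kappa per model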

Each model can then be used to make predictions on the validation set, where we know the actual outcome.

pRF  <- predict(modelRF,  validatn)
pLDA <- predict(modelLDA, validatn)
pNB  <- predict(modelNB,  validatn)
pKNN <- predict(modelKNN, validatn)
pGBM <- predict(modelGBM, validatn)
pRP  <- predict(modelRP,  validatn)

Predictions from the validation set can be compared to the actual outcomes to create a confusion matrix for each model.

cmRF  <- confusionMatrix(validatn$classe, pRF)
cmLDA <- confusionMatrix(validatn$classe, pLDA)
cmNB  <- confusionMatrix(validatn$classe, pNB)
cmKNN <- confusionMatrix(validatn$classe, pKNN)
cmGBM <- confusionMatrix(validatn$classe, pGBM)
cmRP  <- confusionMatrix(validatn$classe, pRP)

Evaluating Model Results

For each model type we look at training accuracy (performance while building the model), validation accuracy (performance against a data set not used to train the model), validation kappa (measuring agreement between actual and predicted values on the validation set) and the out of sample error, which is one minus validation accuracy.

ModelType <- c("Random forest", "Linear discriminant","Naive Bayes", "K nearest neighbor", "Gradient boosting machine","Rpart tree")
TrainAccuracy <- c(max(modelRF$results$Accuracy),  max(modelLDA$results$Accuracy), 
                   max(modelNB$results$Accuracy),  max(modelKNN$results$Accuracy), 
                   max(modelGBM$results$Accuracy), max(modelRP$results$Accuracy))   #same order as ModelType
ValidationAccuracy <- c(cmRF$overall[1],  cmLDA$overall[1], cmNB$overall[1], 
                        cmKNN$overall[1], cmGBM$overall[1], cmRP$overall[1])
ValidationKappa    <- c(cmRF$overall[2],  cmLDA$overall[2], cmNB$overall[2], 
                        cmKNN$overall[2], cmGBM$overall[2], cmRP$overall[2])
OutOfSampleErr <- 1 - ValidationAccuracy
metrics <- data.frame(ModelType, TrainAccuracy, ValidationAccuracy, ValidationKappa, OutOfSampleErr)
kable(metrics, digits=5)
|ModelType                 | TrainAccuracy| ValidationAccuracy| ValidationKappa| OutOfSampleErr|
|:-------------------------|-------------:|------------------:|---------------:|--------------:|
|Random forest             |       0.99221|            0.99456|         0.99312|        0.00544|
|Linear discriminant       |       0.70190|            0.70127|         0.62218|        0.29873|
|Naive Bayes               |       0.73662|            0.74528|         0.67407|        0.25472|
|K nearest neighbor        |       0.89700|            0.91011|         0.88622|        0.08989|
|Gradient boosting machine |       0.96105|            0.96534|         0.95615|        0.03466|
|Rpart tree                |       0.67759|            0.67477|         0.58986|        0.32523|

The model built using the random forest method has the lowest out of sample error (0.0054376) and is chosen to make the final predictions. It is a little surprising that several models show slightly better validation accuracy than training accuracy. This can happen, although validation accuracy is usually lower than training accuracy because of some degree of overfitting.
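
The same choice can be made programmatically from the metrics table built above (a small sketch):

bestModel <- metrics[which.min(metrics$OutOfSampleErr), "ModelType"]
bestModel                                  #expected to be the random forest model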

Looking at the random forest model and its confusion matrix: the accuracy of 0.9945624 and kappa of 0.9931211 are both quite good and the p-value is very low, so this model appears to be doing a good job of predicting the outcome from the validation predictors (the plot in the Appendix shows the prediction errors). In the printed confusion matrix below, the sensitivity and specificity of the random forest model both look very good. The detection rate (the rate of true events also predicted to be events) closely matches the prevalence of the classes.

print(modelRF)
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 12363, 12363, 12363, 12362, 12365, 12363, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9922111  0.9901470  0.002000979  0.002532067
##   27    0.9916288  0.9894109  0.001688406  0.002135725
##   52    0.9861686  0.9825034  0.001497416  0.001895673
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
print(cmRF)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    0    0    0    2
##          B    8 1130    1    0    0
##          C    0    6 1019    1    0
##          D    0    0   12  951    1
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9946          
##                  95% CI : (0.9923, 0.9963)
##     No Information Rate : 0.2855          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9931          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9952   0.9947   0.9874   0.9979   0.9972
## Specificity            0.9995   0.9981   0.9986   0.9974   0.9998
## Pos Pred Value         0.9988   0.9921   0.9932   0.9865   0.9991
## Neg Pred Value         0.9981   0.9987   0.9973   0.9996   0.9994
## Prevalence             0.2855   0.1930   0.1754   0.1619   0.1842
## Detection Rate         0.2841   0.1920   0.1732   0.1616   0.1837
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9974   0.9964   0.9930   0.9976   0.9985
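
The claim that detection rate tracks class prevalence can be checked directly from the stored confusion matrix object (a sketch using caret's per-class statistics):

round(cmRF$byClass[, c("Sensitivity", "Specificity", "Prevalence", "Detection Rate")], 4)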

Predicting Test Values

We now use the random forest model to predict the 20 values in the test set.

pTesting <- predict(modelRF, testing)
pTesting
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
testing$classe <- pTesting               #include predictions in the testing frame

Thanks to Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; and Fuks, H. for making the exercise data available. Thanks also to Jason at Machine Learning Mastery for the examples of recursive feature elimination (RFE) used in the appendix. The use of multiple machine learning techniques was greatly simplified by the caret package. The R markdown for this analysis can be found on GitHub.

Appendix

We can plot the predictions for the validation set using two of the predictors, pitch_forearm and roll_belt. Dots show correct validation predictions, using color to identify the class. An incorrect prediction is shown with an X in the color of the incorrectly predicted classe. Squares show the predictions for the test set.

equalPredictions <- (validatn$classe == pRF)
neq <- validatn[!equalPredictions, ] #incorrect predictions in the validation set
ggplot(data=validatn, aes(x=roll_belt, y=pitch_forearm, colour=pRF))+ 
    geom_point()+ 
    geom_point(data=neq, aes(x=roll_belt, y=pitch_forearm, colour=classe), size=9,shape=4)+
    geom_point(data=testing, aes(x=roll_belt, y=pitch_forearm, colour=classe), size=7,shape=0)+
    ggtitle("Predicted values in Random Forest model plotted across two predictors")

The very first attempt to fit a model used nearly all the columns. This did not work well due to all the missing values. Removing columns with 50% or more missing values worked well to get to better models quickly. Looking at the remaining 52 predictors in a correlation matrix, the modelling may have worked because they are not too highly correlated.

require(corrplot)
## Loading required package: corrplot
corTrain <- cor(trainSet[ ,-53], use="complete.obs")
corrplot(corTrain, type = "upper", order = "hclust", tl.col = "black", tl.srt = 60, tl.cex=0.8, diag=FALSE)
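
As a rough numeric check on that claim, caret's findCorrelation() can count the predictors involved in highly correlated pairs (a sketch using a 0.8 cutoff):

highCor <- findCorrelation(corTrain, cutoff = 0.8)   #columns flagged as highly correlated
length(highCor)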

We can look at variable importance in the random forest model:

varImpRF <- varImp(modelRF, scale=FALSE)
plot(varImpRF, top=30, main="Variable importance - Random Forest model - top 30 predictors")
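
The most important predictors can also be listed by name (a sketch, assuming the importance scores come back in a single Overall column, as varImp returns for this model type):

impScores <- varImpRF$importance
head(impScores[order(-impScores$Overall), , drop=FALSE], 10)   #top ten predictors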

Another way to look at variable importance is through recursive feature elimination (RFE). From this perspective the random forest model shows a rapid decline in predictor importance:

rfeCtrl <- rfeControl(functions=rfFuncs, method="cv", number=10)                      
refResults <- rfe(training[,1:52], training[,53], sizes=c(1:50), rfeControl=rfeCtrl)      
plot(refResults, type=c("g", "o"), main="Recursive Feature Elimination (RFE) using a Random Forest model")
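
The predictors selected by RFE can be listed as well (a sketch; predictors() works directly on the rfe result):

head(predictors(refResults), 10)           #first ten of the predictors retained by RFE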