Prior to the analysis, some variables were removed from the dataset because they consisted primarily of NA values or blanks. This cleaning is summarized in a separate file, MLCourseProjectCleaning, which can be viewed here: MLCourseProjectCleaning. I refer my grader to a separate file because I wanted to document the cleaning thoroughly while still leaving the full 5 page maximum for this main analysis. To summarize the cleaning, the dataset was reduced to 52 predictor variables, and all integer variables were converted to numeric variables.
The training data is available here and the test data is available here. The cleaning script looks for the files pml-training.csv and pml-testing.csv in the working directory, and then writes their cleaned versions to the same directory as .Rda files. In this analysis, the seed is set to 1234 and we will be using the caret package.
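For reference, here is a minimal setup sketch; the names of the cleaned .Rda files are assumptions on my part, since the actual names are documented in the cleaning script.

## Load caret, load the cleaned datasets, and set the seed for reproducibility
## (the .Rda file names are assumed; see MLCourseProjectCleaning for the actual names)
library(caret)
load("pml-training-clean.Rda")   ## assumed to create the train dataframe
load("pml-testing-clean.Rda")    ## assumed to create the test dataframe
set.seed(1234)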
The first step in building a prediction model will be to split the training data into two pieces. The first piece will be strictly for training, and the second piece will be for cross validation. We will build a few different prediction models and compare their performance on the cross validation set.
## Partition the training data into (strictly) training and cross validation
inTrain <- createDataPartition(train$classe, p=.75, list=FALSE)
training <- train[inTrain,]
crossval <- train[-inTrain,]
## Remove the original training dataframe so we do not get confused later, and inTrain while we're at it
rm(train)
rm(inTrain)
The next step will be to process our training data down to a smaller set of variables. This will help algorithms like random forests run more quickly. Let’s see how many principal components we need in order to capture 95% of the variance in our dataset.
preProcess(training[,-53], method="pca", thresh=.95)
##
## Call:
## preProcess.default(x = training[, -53], method = "pca", thresh = 0.95)
##
## Created from 14718 samples and 52 variables
## Pre-processing: principal component signal extraction, scaled, centered
##
## PCA needed 25 components to capture 95 percent of the variance
Since 95% is a good target to shoot for, we will go ahead and specify that 25 components be used. This preprocessing will need to be done on the training, cross validation, and testing datasets.
## Perform PCA on the training, cross validation, and test datasets
preProc <- preProcess(training[,-53], method="pca", pcaComp=25)
trainingPC <- predict(preProc, training[,-53])
crossvalPC <- predict(preProc, crossval[,-53])
testPC <- predict(preProc, test[,-53])
rm(preProc)
Next, let’s build a random forest and a neural network on the training data. The random forest was chosen for its reputation for accuracy. I also chose a neural network out of curiosity, to see how it compared to the random forest, and because it was the only other model I have experience with that is suited to this classification task.
## Random forest, with some extra parameters passed to help with the speed
modelRF <- train(x=trainingPC, y=training$classe, method="rf", trControl = trainControl(method = "cv", number = 4, allowParallel = TRUE, verboseIter = FALSE))
## Neural network
modelNN <- train(x=trainingPC, y=training$classe, method="nnet")
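Note that allowParallel = TRUE only speeds things up if a parallel backend has been registered first. A sketch using the doParallel package (an assumption on my part; the original run may well have been sequential) would look like this:

## Register a parallel backend so caret can use multiple cores during training
## (assumes the doParallel package is installed; not part of the original analysis)
library(doParallel)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
## ... run the train() calls shown above ...
stopCluster(cl)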
Let’s have a look at how accurate our models are on the training set:
modelRF
## Random Forest
##
## 14718 samples
## 25 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
##
## Summary of sample sizes: 11038, 11038, 11038, 11040
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 1 1 0.003 0.004
## 10 1 1 0.003 0.003
## 20 1 0.9 0.003 0.004
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
modelNN
## Neural Network
##
## 14718 samples
## 25 predictors
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 14718, 14718, 14718, 14718, 14718, 14718, ...
##
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa Accuracy SD Kappa SD
## 1 0 0.4 0.2 0.02 0.01
## 1 1e-04 0.4 0.2 0.02 0.02
## 1 0.1 0.4 0.2 0.02 0.01
## 3 0 0.5 0.4 0.03 0.03
## 3 1e-04 0.5 0.4 0.03 0.04
## 3 0.1 0.5 0.4 0.03 0.05
## 5 0 0.6 0.5 0.03 0.04
## 5 1e-04 0.6 0.5 0.03 0.03
## 5 0.1 0.6 0.5 0.02 0.03
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 1e-04.
It appears we have a clear winner in the random forest. However, accuracy on the training set is not a sure indicator of accuracy outside of it. Before we move on to testing on the cross validation set, let’s have a look at a plot and see if anything interesting stands out:
plot(trainingPC[,2] ~ trainingPC[,1],
type="p",
col=training$classe,
xlab="Principal Component 1",
ylab="Principal Component 2",
main= "Plot of two Principal Components, colored by Classe")
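Since the points are colored by classe but the plot has no key, a legend could be added as well (a small addition that was not part of the original code):

## Optional legend mapping the default palette colors to the classe levels
legend("topright", legend=levels(training$classe), col=1:5, pch=1)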
In this plot it is easy to see the 5 distinct groupings. It is reassuring to see their separation here, and presumably the other principal components are creating separation in other dimensions. Next, let’s take a look at how the models stack up against each other on the cross validation set in terms of predictive power. I expect the random forest to outperform the neural network, though it is not yet possible to say exactly how well each model will do.
So here is how the two models performed:
## Prints the results of the model predictions
confusionMatrix(predict(modelRF, crossvalPC), crossval$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1391 16 2 3 0
## B 2 927 17 2 2
## C 1 5 826 27 6
## D 1 0 8 771 9
## E 0 1 2 1 884
##
## Overall Statistics
##
## Accuracy : 0.979
## 95% CI : (0.974, 0.982)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.973
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.997 0.977 0.966 0.959 0.981
## Specificity 0.994 0.994 0.990 0.996 0.999
## Pos Pred Value 0.985 0.976 0.955 0.977 0.995
## Neg Pred Value 0.999 0.994 0.993 0.992 0.996
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.189 0.168 0.157 0.180
## Detection Prevalence 0.288 0.194 0.176 0.161 0.181
## Balanced Accuracy 0.996 0.986 0.978 0.977 0.990
confusionMatrix(predict(modelNN, crossvalPC), crossval$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1117 179 159 97 106
## B 116 488 143 139 120
## C 68 131 491 154 97
## D 73 62 35 394 66
## E 21 89 27 20 512
##
## Overall Statistics
##
## Accuracy : 0.612
## 95% CI : (0.598, 0.626)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.506
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.801 0.5142 0.574 0.4900 0.568
## Specificity 0.846 0.8690 0.889 0.9424 0.961
## Pos Pred Value 0.674 0.4851 0.522 0.6254 0.765
## Neg Pred Value 0.914 0.8817 0.908 0.9041 0.908
## Prevalence 0.284 0.1935 0.174 0.1639 0.184
## Detection Rate 0.228 0.0995 0.100 0.0803 0.104
## Detection Prevalence 0.338 0.2051 0.192 0.1285 0.136
## Balanced Accuracy 0.823 0.6916 0.732 0.7162 0.765
Lucky for us, the random forest had an outstanding performance on the cross validation set! The neural network, however, did not perform particularly well, so we will discard that model for now. Here is a simple visual representation of how the random forest performed. In fact, there is very little to see precisely because the predictions were so accurate.
## Simple plot to aid in visualization of the prediction results
plot(predict(modelRF, crossvalPC), crossval$classe,
col=c("red","blue","green","grey","orange"),
xlab="Predicted Classe",
ylab="True Classe",
main="Plot of True Class by Predicted Classe")
Notice the solid, consistent colors in each bar. This indicates that the vast majority of the predictions were correct. We can quickly see that classe E was the easiest class to predict, while the other bars have thin slices of other colors present. For example, some A predictions were actually B, some B predictions were actually C, and some C predictions were actually D.
It appears that this type of physical activity lends itself well to classification by a machine learning algorithm. We were able to predict the movements very accurately using a random forest, and since it achieved an accuracy rate of over 97% on a cross validation set of almost 5000 observations, it is fair to expect a similar level of accuracy on new observations.
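For reference, the expected out-of-sample error implied by the cross validation results can be pulled straight from the confusion matrix object (a small addition for clarity):

## Estimated out-of-sample error: one minus the cross validation accuracy
cvAccuracy <- confusionMatrix(predict(modelRF, crossvalPC), crossval$classe)$overall["Accuracy"]
1 - cvAccuracy   ## roughly 0.021, i.e. an expected error rate of about 2%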
On the submission portion of the assignment, I scored 19/20 on the first pass. For the question I missed, I was able to guess the correct answer based on the plot of true vs. predicted classe: my prediction was A, and since that was incorrect, the next best guess was blue, in other words B. In fact, 19/20 is a fairly expected result. Under the 97.88% accuracy rate we saw on the cross validation set, the chance of getting exactly 19 out of 20 correct is 28.2%, so this outcome is not at all surprising.
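That 28.2% figure comes straight from the binomial density and can be checked directly before plotting the full distribution:

## Probability of exactly 19 correct out of 20, given per-question accuracy of .9788
dbinom(19, size=20, prob=.9788)   ## approximately 0.282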
## Plot of binomial distribution for questions to get right out of 20
plot(0:20, dbinom(0:20, size=20, prob=.9788),
type="h", lwd=15, xlab="Number of questions answered correctly",
ylab="Chance of that occurring",
main="Binomial density plot for answers correct out of 20",
sub="Probability of success is .9788")