This document outlines the machine learning analysis performed on a data set of exercise movements recorded by accelerometers.
The goal of the analysis is to develop a model that predicts how well a particular set of exercises was performed.
The training data is available here: Training Data.
The test data is available here: Test Data.
We can plot the outcome variable we're interested in, classe, to see its distribution. Its five levels (A to E) occur with roughly even frequency.
library(caret)
library(rattle)
training <- read.csv("C:/Users/tkasten/Desktop/Coursera/pml-training.csv")
testing <- read.csv("C:/Users/tkasten/Desktop/Coursera/pml-testing.csv")
# The outcome we're interested in
plot(training$classe)
The training data consists of over 19K observations of 160 features. A quick review of the training data shows a lot of columns where the values are mostly NA or empty. We can see this using the following code:
# For each column sum up the number of NA or empty values
missing_data_sums <- colSums(is.na(training) | training == "")
hist(missing_data_sums)
# Columns where virtually all of the ~19,622 rows are missing or empty
empty_cols <- names(missing_data_sums[missing_data_sums > 19000])
Let’s remove the columns from our training set that have very little data:
training_2 <- training[, !(colnames(training) %in% empty_cols)]
This leaves us with 60 columns that are mostly populated.
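We can confirm the remaining column count with a quick check (a minimal sketch):
# Count the columns remaining after dropping the mostly-empty ones
ncol(training_2)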
One variable I'd like to take out is "X", since it is just a row index and carries no predictive information:
training_2 <- training_2[ , -which(names(training_2) %in% c("X"))]
Finally, let’s shuffle the data to ensure that any previous ordering doesn’t affect our results:
# Shuffle the rows so any original ordering (e.g. by subject or time window) doesn't bias the split
training_2 <- training_2[sample(nrow(training_2)), ]
The training data set still has a lot of rows. Let's split off a smaller training set and a separate cross-validation data set.
Starting with a training data set of size N = 1000 and a cross-validation data set also of N = 1000, we can quickly see which machine learning algorithms work and which ones to focus on.
# Subset of this for actual training
training_3 <- training_2[1:1000,]
# Take the next 1000 entries for cross-validation
cross_val <- training_2[1001:2000,]
Once we’ve seen some initial results, we can increase the size of the training data set (but not the cross-validation one) to see how accuracy improves.
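One way to do this (a sketch; the exact row ranges for the larger runs are an assumption, as they are not shown in the original code) is to draw the additional training rows from beyond the cross-validation block, so the two sets never overlap:
# Example: grow the training set to N = 5000 while keeping cross_val (rows 1001-2000) untouched
training_3 <- training_2[c(1:1000, 2001:6000), ]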
We will try four different methods with different training set sizes:

- Linear Discriminant Analysis (lda)
- Recursive partitioning / decision trees (rpart)
- Random Forests (rf)
- Gradient Boosted Machines (gbm)
Random Forests and GBM were chosen in particular because they handle data sets with a large number of features very well.
For each method we measure the accuracy of the model on the cross-validation data set.
The code for each run is exactly the same. Only the size of the training set is different. Simple timings using Sys.time() are used to show how long each run took.
# LDA
startLDA <- Sys.time()
modelLDA <- train(classe ~ ., method="lda", data=training_3)
Sys.time() - startLDA  # training time
predsLDA <- predict(modelLDA, newdata = cross_val)
crosstabLDA <- table(predsLDA, cross_val$classe)
# Accuracy (%) on the 1000-row cross-validation set
accuracyLDA <- (sum(diag(crosstabLDA)) * 100)/1000
# Decision tree (rpart)
startRpart <- Sys.time()
modelRpart <- train(classe ~ ., method="rpart", data=training_3)
Sys.time() - startRpart  # training time
predsRpart <- predict(modelRpart, newdata = cross_val)
crosstabRpart <- table(predsRpart, cross_val$classe)
accuracyRpart <- (sum(diag(crosstabRpart)) * 100)/1000
# Random Forest
startRF <- Sys.time()
modelRF <- train(classe ~ ., method="rf", data=training_3)
Sys.time() - startRF  # training time
predsRF <- predict(modelRF, newdata = cross_val)
crosstabRF <- table(predsRF, cross_val$classe)
accuracyRF <- (sum(diag(crosstabRF)) * 100)/1000
# Gradient Boosted Machine (gbm)
startGBM <- Sys.time()
modelGBM <- train(classe ~ ., method="gbm", data=training_3)
Sys.time() - startGBM  # training time
predsGBM <- predict(modelGBM, newdata = cross_val)
crosstabGBM <- table(predsGBM, cross_val$classe)
accuracyGBM <- (sum(diag(crosstabGBM)) * 100)/1000
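As a cross-check on the accuracy calculation, caret's confusionMatrix() reports accuracy (plus kappa and per-class statistics) directly; a minimal sketch using the Random Forest predictions:
# Accuracy and per-class statistics for the RF model on the cross-validation set
confusionMatrix(predsRF, cross_val$classe)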
For each machine learning method, the table below shows the accuracy (in %) on the cross-validation set for the different training data set sizes.
| Method vs N | 1000 | 2000 | 5000 | 10000 |
|---|---|---|---|---|
| LDA | 81.7 | 84.3 | 86.3 | 86.2 |
| RPart | 51.9 | 52.5 | 49.5 | 46.8 |
| RF | 96.4 | 99.0 | 99.5 | 99.7 |
| GBM | 96.8 | 98.9 | 99.4 | 99.6 |
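These accuracies can also be plotted against N; a minimal sketch built from the values in the table above:
# Accuracy (%) on the cross-validation set, copied from the table above
sizes <- c(1000, 2000, 5000, 10000)
acc <- cbind(LDA   = c(81.7, 84.3, 86.3, 86.2),
             RPart = c(51.9, 52.5, 49.5, 46.8),
             RF    = c(96.4, 99.0, 99.5, 99.7),
             GBM   = c(96.8, 98.9, 99.4, 99.6))
# One line per method: accuracy vs. training set size
matplot(sizes, acc, type = "b", pch = 1, lty = 1,
        xlab = "Training set size (N)", ylab = "Accuracy (%)")
legend("bottomright", legend = colnames(acc), col = 1:4, lty = 1, pch = 1)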
The RPart results are not great, and get worse with increasing N, whereas the others do improve.
For each machine learning method we show the predictions on the 20-case test set for each training data set size.
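Each set of predictions comes from applying the fitted model to the test set; a minimal sketch for the Random Forest model:
# Predict the classe of the 20 test cases with the fitted Random Forest model
predict(modelRF, newdata = testing)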
LDA predictions:

| N | Prediction |
|---|---|
| 1000 | B A B A A E D B A A B C B A E E A B B B |
| 2000 | B B B A A E D B A A B C B A E E A B B B |
| 5000 | B B B A A E D C A A B C B A E E A B B B |
| 10000 | B B B A A E D C A A B C B A E E A B B B |
The predictions are consistent with increasing N.
RPart predictions:

| N | Prediction |
|---|---|
| 1000 | A A A A A C C C A A A C C A C C A A A C |
| 2000 | A A A A A C C C A A C C C A C C A A A C |
| 5000 | A A C A A C C C A A C C C A C C A A A C |
| 10000 | A A A A C C C C A A C C C A C C C A A C |
The predictions look odd - no Bs, Ds or Es.
Random Forest predictions:

| N | Prediction |
|---|---|
| 1000 | B A A A A E D B A A B C B A E E A B B B |
| 2000 | B A A A A E D B A A B C B A E E A B B B |
| 5000 | B A B A A E D B A A B C B A E E A B B B |
| 10000 | B A B A A E D B A A B C B A E E A B B B |
The predictions here are very consistent, even with N relatively small.
GBM predictions:

| N | Prediction |
|---|---|
| 1000 | B A B A A E D B A A B C B A E E A B B B |
| 2000 | B A B A A E D B A A B C B A E E A B B B |
| 5000 | B A B A A E D B A A B C B A E E A B B B |
| 10000 | B A B A A E D B A A B C B A E E A B B B |
Again, the predictions here are very consistent, even with N relatively small.
The RF and GBM predictions agree with each other across all values of N (apart from a single test case at N = 1000 and 2000), so we can conclude that N = 1000 is already good enough for our purposes with these algorithms. LDA needs a much larger N before its predictions settle down.
From a performance point of view, Random Forests is the slowest algorithm to run; GBM is roughly twice as fast while achieving about the same accuracy.
We conclude that a GBM model, fitted via caret with the gbm R package, works very well for accurately predicting how well the subjects performed their exercises, based on the training data provided.