Introduction

This is the course project submission for Week 4 of the Coursera Practical Machine Learning course.

This report uses data collected from activity-tracking devices such as the Jawbone Up, Nike FuelBand, and Fitbit, using a supplied training set to build a machine learning model that predicts different classes of activity.

The objective is to use the model developed on the training set to predict activity classifications for a set of test data supplied as part of the assignment. The test data has the same measurement columns as the training data, apart from the activity-type classification.

Each activity has a classification from A to E. Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

Further information about the study is provided here: http://groupware.les.inf.puc-rio.br/har

Summary Findings

Using a Random Forest algorithm to fit a model worked very well, achieving a prediction accuracy of 99% against the validation data set and 20 out of 20 (100%) correct predictions against the test data set, as scored by the Coursera Week 4 quiz set against this data-set.

The R Caret package Random Forest method was very CPU intensive and took the longest to run, but it was very effective even when trained against a 10% sample of the training data. Once the desired result was obtained the model was saved; no further investigation was performed into whether the run-time of the Random Forest fit could be reduced - by running in parallel, by reducing the data sample size further, or by supplying specific mtry and ntree parameters - without losing too much accuracy.

Data

Data Source

The data was downloaded into the local working directory where the code listed in this document was executed.

The training data used for this project was downloaded from here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data used for this project was downloaded from here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
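
To make the analysis reproducible, the two files can be fetched into the working directory along the following lines (a sketch only; the destination file names match those used in the load step below):

# Download the training and test CSV files into the working directory (if not already present)
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainURL, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testURL,  destfile = "pml-testing.csv")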

The data was loaded into the R environment as follows:

testing <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)
training <- read.csv("pml-training.csv", stringsAsFactors = FALSE)

The dimensions of the test and training data sets were compared:

dim(testing)
## [1]  20 160
dim(training)
## [1] 19622   160

Both data sets have 160 columns, with 19622 observations in the training set and only 20 in the test set.

It was noted that the same data column names were provided for both sets of data except for the final column, column 160.

setdiff(names(testing),names(training))
## [1] "problem_id"
setdiff(names(training),names(testing))
## [1] "classe"

The “classe” column in the training set contains classification identifiers A - E which identify how the exercise was performed, with A being “correct” and B - E being different types of mistakes.

table(training$classe)
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

The “problem_id” column in the test set contains a set of numeric identifiers from 1 to 20:

range(testing$problem_id)
## [1]  1 20

Data Inspection and Cleanup

Using head and summary on the training data-set, the following points were noted:

  • Many of the columns contain a large number of NA values
  • Apart from the first 7 columns, which are timestamp, user and execution ID identifiers, the columns are numeric measurements
  • Some columns contain a large number of empty ("") values.

The data-sets were modified to remove columns with missing or NA data and the first 7 columns were also removed as follows:

### Data Cleanup - remove cols with more than 25% NA or empty ("") values ###
rtotal <- dim(training)[1]
napct   <- sapply(training, function(y) sum(is.na(y)) / rtotal)              # proportion of NA values per column
nullpct <- sapply(training, function(y) sum(y == "", na.rm = TRUE) / rtotal) # proportion of empty values per column

to.remove <- c(names(napct[napct > 0.25]), names(nullpct[nullpct > 0.25]))

training <- training[ , !(names(training) %in% to.remove)]
testing  <- testing[ , !(names(testing) %in% to.remove)]

# Drop the first 7 identifier columns - keep only the numeric measurements plus classe/problem_id
training <- training[, 8:60]
testing  <- testing[, 8:60]

This reduced each data-set to 53 columns of observations.
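
A quick dimension check confirms the shape of the cleaned data (output not shown; 53 columns are expected in each set):

# Confirm the cleaned data sets now have 53 columns each
dim(training)   # expected: 19622 rows, 53 columns
dim(testing)    # expected: 20 rows, 53 columns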

Given the large number of columns (dimensions), the level of correlation and duplication in the data was considered. This was done by creating a matrix of absolute correlation values for each pair of columns and plotting it as a heatmap in Figure 1, shown below:

M <- round(abs(cor(training[1:52])),2)
diag(M) <- 0  # zero the diags as these have correlation =1
heatmap(M , main="Figure 1: Training Data Correlation")

On the basis of this, it was decided not to try to reduce the dimensions of the data further, as there are only small pockets of significant correlation, and instead to work with the full remaining 52 columns of data.
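
As a cross-check on that decision, the Caret package's findCorrelation() function can be used to count how many columns exceed a high pairwise correlation cutoff (a sketch only; the 0.9 cutoff is an arbitrary choice):

library(caret)   # for findCorrelation()
# Flag columns whose absolute pairwise correlation with another column exceeds 0.9
highCor <- findCorrelation(cor(training[1:52]), cutoff = 0.9)
length(highCor)                   # number of columns flagged as highly correlated
names(training[1:52])[highCor]    # which columns they are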

Approach

The problem consists of taking multiple numeric measurements for each observation to predict one of five classifications. This appears to be a classic problem to solve with tree classification or the Random Forest algorithm.

All processing was done with the assistance of the R Caret machine learning package.
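
The packages below are assumed to be installed and loaded before the remaining code is run; the seed value is arbitrary and is shown only so that the sampling steps are repeatable:

library(caret)          # train(), createDataPartition(), confusionMatrix()
library(randomForest)   # back-end used by caret's method = "rf"
library(rpart)          # back-end used by caret's method = "rpart"
set.seed(12321)         # arbitrary seed for reproducible sampling and resampling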

Before starting, the training data was split into a model-build ("build") set and a validation ("valid") set. This allows the model to be validated, giving an estimate of its accuracy, before running predictions against the test set.

# Create a Validation data-set
index <- createDataPartition(y=training$classe, p=0.7, list=FALSE)
build <- training[index, ]
valid <- training[-index,]

Model Development

Caret rpart Decision Tree

The first model to be tested was the Caret package rpart method, to apply a decision-tree algorithm to the data:
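
A minimal fitting call, assuming Caret's default resampling settings, is:

# Fit a single decision tree to the model-build data using caret's rpart method
treefit <- train(classe ~ ., method = "rpart", data = build)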

This was quick to execute, but as the validation results below show it was unsuccessful, achieving less than 50% prediction accuracy against the validation set:

pred <- predict(treefit, valid)            # predict classe values for the validation set
print(confusionMatrix(valid$classe, pred)) # compare predictions against the actual classe values
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1509   34  128    0    3
##          B  478  383  278    0    0
##          C  475   33  518    0    0
##          D  430  174  360    0    0
##          E  163  157  270    0  492
## 
## Overall Statistics
##                                          
##                Accuracy : 0.4931         
##                  95% CI : (0.4803, 0.506)
##     No Information Rate : 0.5191         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3375         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4939  0.49040  0.33333       NA  0.99394
## Specificity            0.9417  0.85188  0.88271   0.8362  0.89054
## Pos Pred Value         0.9014  0.33626  0.50487       NA  0.45471
## Neg Pred Value         0.6329  0.91614  0.78679       NA  0.99938
## Prevalence             0.5191  0.13271  0.26406   0.0000  0.08411
## Detection Rate         0.2564  0.06508  0.08802   0.0000  0.08360
## Detection Prevalence   0.2845  0.19354  0.17434   0.1638  0.18386
## Balanced Accuracy      0.7178  0.67114  0.60802       NA  0.94224

The assumption is that the decision tree method did not perform well because there is not a clear linear boundary between the observation values and the classifications.

Caret rf Random Forest

The second model to be tested was the Caret package rf method, using the Random Forest algorithm. An initial attempt was made to fit an rf model against the full "build" training set, but this ran for several hours without completing and was eventually abandoned.

One option would have been to use the Caret package's parallel processing support via the doParallel package (a sketch of this is shown after the model fit below), but first a simpler approach was taken - just sampling 10% of the “build” training data-set:

# Set build to a 10% sample - just to save time on the RF build
n = round(length(build$classe)/10)
i <- sample(nrow(build),n)
build <- build[i, ]
# Build Random Forest Model
rffit <- train(classe ~ ., method="rf", data=build)
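
The parallel option mentioned above was not pursued for this report, but a minimal sketch of it, assuming the doParallel package is installed, would look something like this (rffit_par is a hypothetical name used only to avoid clashing with the model fitted above):

library(doParallel)

# Register a parallel back-end using all but one CPU core
cl <- makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# allowParallel lets caret spread its resampling iterations across the workers
ctrl <- trainControl(allowParallel = TRUE)
rffit_par <- train(classe ~ ., method = "rf", data = build, trControl = ctrl)  # hypothetical parallel fit

stopCluster(cl)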

Fitting the model on the 10% sample took about 2 hours to run in single-threaded mode on a PC laptop with an Intel i5 2.3 GHz processor. Validating the model against the validation set, as shown below, gave a very good accuracy estimate of about 99%:

pred <- predict(rffit, valid )      # predict classe values
print(confusionMatrix(valid$classe, pred)) # validate model predictions against validation data
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    0    1    0    0
##          B    2 1134    3    0    0
##          C    0    0 1023    3    0
##          D    0    0    5  959    0
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9975          
##                  95% CI : (0.9958, 0.9986)
##     No Information Rate : 0.2846          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9968          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   1.0000   0.9913   0.9958   1.0000
## Specificity            0.9998   0.9989   0.9994   0.9990   0.9998
## Pos Pred Value         0.9994   0.9956   0.9971   0.9948   0.9991
## Neg Pred Value         0.9995   1.0000   0.9981   0.9992   1.0000
## Prevalence             0.2846   0.1927   0.1754   0.1636   0.1837
## Detection Rate         0.2843   0.1927   0.1738   0.1630   0.1837
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9993   0.9995   0.9953   0.9974   0.9999

On the basis of this, it was decided to use this Random Forest model to predict the “classe” values for the 20 observations in the test data-set.

Final Model Predictions

The Random Forest model built against the sample of training data was applied to the supplied test data set, which has no “classe” values:

testPred <- predict(rffit, testing)
testing$classe <- testPred

The test-predictions are listed against the test “Problem IDs” below:

print(testing[, c("problem_id", "classe")])
##    problem_id classe
## 1           1      B
## 2           2      A
## 3           3      B
## 4           4      A
## 5           5      A
## 6           6      E
## 7           7      D
## 8           8      B
## 9           9      A
## 10         10      A
## 11         11      B
## 12         12      C
## 13         13      B
## 14         14      A
## 15         15      E
## 16         16      E
## 17         17      A
## 18         18      B
## 19         19      B
## 20         20      B