The goal of this project’s analysis is to predict classes of exercise based upon self-quantified movement data. In the original study the authors recruited a group of health enthusiasts who regularly measure their exercise to improve their health, using devices such as the Jawbone Up, Nike FuelBand, and Fitbit. The data for the current study were collected to determine how well participants performed barbell lifts. Participants were asked to perform the lifts both correctly and incorrectly in 5 different ways, and these methods were quantified using 160 features gathered from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. More information is provided at http://groupware.les.inf.puc-rio.br/har.
Both the training and testing sets were kindly provided by the authors of the published study:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
The full training data set contains 19622 observations of 160 features. It is unlikely that all of these features are necessary to make accurate predictions, so a practical approach is taken to reducing the dimensionality of the data. We will do this with some pre-processing before splitting the “training” data into a training set and a validation set.
library(gridExtra)
## Loading required package: grid
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(RColorBrewer)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(ggplot2)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(rpart.plot)
library(AppliedPredictiveModeling)
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
#load the training data for pre-processing (the testing data is loaded later, before prediction)
TrainingData <- read.csv("/Users/Work/Desktop/PracticalML/pml-training.csv", header=TRUE, stringsAsFactors=FALSE)
#dimensions of training data
dim(TrainingData)
## [1] 19622 160
First we will check for features with missing data, as missing values can reduce the accuracy of predictions. Pre-processing in this instance includes: (1) removing features with missing or blank values and keeping only the accelerometer-derived features, (2) confirming that the retained features have non-trivial variance and removing highly correlated features, and (3) fitting and comparing prediction models. Prior to step 3, (i) the data are split into training and validation sets using a seed of 1234 so the analysis is reproducible, and (ii) a data visualization is performed.
isMissing <- sapply(TrainingData, function(x) any(is.na(x) | x== "")) #flag features containing any NA or blank values
isPredictor <- !isMissing & grepl("belt|[^(fore) ] arm |dumbell | forearm", names(isMissing)) #keep complete features whose names match the sensor pattern (in this data only the belt features match)
featureCandidates <- names(isMissing)[isPredictor]
dataInclude <- c("classe", featureCandidates)
featureData <- TrainingData[dataInclude] #now all data is selected for 13 features
dim(featureData) #14 columns, 19622 observations
## [1] 19622 14
str(featureData) #make sure "classe" is still present
## 'data.frame': 19622 obs. of 14 variables:
## $ classe : chr "A" "A" "A" "A" ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt: int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
#confirm numerically that none of the selected features has near-zero variance
nzv <- nearZeroVar(featureData, saveMetrics=TRUE)
summary(nzv)
## freqRatio percentUnique zeroVar nzv
## Min. :1.006 Min. :0.02548 Mode :logical Mode :logical
## 1st Qu.:1.059 1st Qu.:0.71731 FALSE:14 FALSE:14
## Median :1.072 Median :1.18999 NA's :0 NA's :0
## Mean :1.103 Mean :2.63079
## 3rd Qu.:1.101 3rd Qu.:2.16339
## Max. :1.470 Max. :9.97350
featureData2 <- TrainingData[featureCandidates]
descrCor <- cor(featureData2)
perfectCorFeatures <- sum(abs(descrCor[upper.tri(descrCor)]) > 0.999)
perfectCorFeatures
## [1] 0
highCorFeatures <- findCorrelation(descrCor, cutoff = 0.75)
highCorFeatures
## [1] 3 1 10 4 11 8 12
classe <- featureData$classe
filteredFeatures <- cbind(classe, (featureData[,-(findCorrelation(descrCor, cutoff=0.75))]))
#creation of Training and Validation data using 5 resamples
set.seed(1234)
trainIndex <- createDataPartition(filteredFeatures$classe, p=0.6, list=FALSE, times=5) #here we created a 60/40% split with 5 resamplings
dataTrain <- filteredFeatures[trainIndex,]
dataValidate <- filteredFeatures[-trainIndex,]
Steps 1 and 2 of the dimensionality reduction are complete. We have now created training and validation data sets containing only 7 features plus classe (the classification of exercise performance) for our prediction model, compared with the 160 features in the raw data: a substantial reduction in dimensionality. Note that because createDataPartition was called with times=5, the five resampled 60% partitions are stacked into a single training set, while the validation set contains only the observations that fall outside all five partitions.
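As a quick sanity check (added here, not part of the original output), the dimensions of the two sets and the retained feature names can be inspected:
#sanity check: confirm the stacked training set, the small validation set, and the retained features
dim(dataTrain)
dim(dataValidate)
names(filteredFeatures)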
Next we can visualize the training data to see whether any patterns emerge that might suggest a classification algorithm will work well. For the final analysis we will test both a classification tree and a random forest, since accuracy will be graded and our model is relatively simple.
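The plotting code itself is not reproduced in this report. As a rough sketch only, assuming caret’s featurePlot interface (the original figure may have been produced differently), per-feature density plots grouped by classe could be generated like this:
#sketch (not the original plotting code): density of each retained feature, coloured by classe
featurePlot(x = dataTrain[, setdiff(names(dataTrain), "classe")],
            y = as.factor(dataTrain$classe),
            plot = "density",
            scales = list(x = list(relation = "free"), y = list(relation = "free")),
            auto.key = list(columns = 5))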
From the visualization, one can see that some features clearly differ depending upon the classe of exercise, suggesting that a classification tree might be appropriate. However, random forest models still tend to be more accurate, so both will be tested.
The next step is to take the visualized and partitioned data and develop our algorithms. The R package “caret” has a helpful feature that allows the data to be centered and scaled during model building. After the models are built, both are saved to file for later use on the test data. To decide between the two models, each is evaluated on the validation data set created above in the pre-processing steps.
A confusion matrix will help determine which algorithm is chosen to apply to the test data.
clusters <- makeCluster(detectCores() -1)
registerDoParallel(clusters)
ctrl <- trainControl(classProbs = TRUE, savePredictions = TRUE, allowParallel = TRUE) #note: ctrl is not passed to train() below, so caret's default bootstrap resampling (25 reps) is used
method <- "rpart"
system.time(ModelClassTree <- train(classe ~., preProcess=c("center", "scale"), data=dataTrain, method="rpart"))
## user system elapsed
## 6.420 0.222 40.072
ModelClassTree
## CART
##
## 58880 samples
## 7 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered, scaled
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 58880, 58880, 58880, 58880, 58880, 58880, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.01765543 0.4005427 0.17991174 0.01402429 0.02602497
## 0.03490745 0.3763033 0.14061146 0.01168593 0.01902099
## 0.11525866 0.3366784 0.07971013 0.04004040 0.06103125
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01765543.
predictedClassTree <- predict(ModelClassTree, dataValidate)
cmCT <- confusionMatrix(predictedClassTree, dataValidate$classe)
cmCT
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 48 40 35 32 17
## B 0 0 0 0 0
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 3 24
##
## Overall Statistics
##
## Accuracy : 0.3618
## 95% CI : (0.295, 0.4328)
## No Information Rate : 0.2412
## P-Value [Acc > NIR] : 9.441e-05
##
## Kappa : 0.1642
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.000 0.0000 0.0000 0.5854
## Specificity 0.1788 1.000 1.0000 1.0000 0.9810
## Pos Pred Value 0.2791 NaN NaN NaN 0.8889
## Neg Pred Value 1.0000 0.799 0.8241 0.8241 0.9012
## Prevalence 0.2412 0.201 0.1759 0.1759 0.2060
## Detection Rate 0.2412 0.000 0.0000 0.0000 0.1206
## Detection Prevalence 0.8643 0.000 0.0000 0.0000 0.1357
## Balanced Accuracy 0.5894 0.500 0.5000 0.5000 0.7832
stopCluster(clusters)
varImp(ModelClassTree)
## rpart variable importance
##
## Overall
## roll_belt 100.000
## magnet_belt_y 57.857
## total_accel_belt 47.489
## magnet_belt_z 40.443
## gyros_belt_x 15.527
## gyros_belt_y 9.975
## accel_belt_x 0.000
ModelClassTree$finalModel
## n= 58880
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 58880 42140 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 1.052489 53943 37243 A (0.31 0.21 0.19 0.18 0.11)
## 4) roll_belt>=-1.037945 51382 34737 A (0.32 0.22 0.2 0.17 0.086) *
## 5) roll_belt< -1.037945 2561 1035 E (0.021 0.009 0 0.37 0.6) *
## 3) roll_belt>=1.052489 4937 40 E (0.0081 0 0 0 0.99) *
#estimated error rate :
save(ModelClassTree, file="ModelClassTree.RData")
The confusion matrix for our first prediction algorithm shows an overall accuracy of 36% with a 95% confidence interval of (29.5%, 43.3%). The balanced accuracy for classe E is 78%, but for classes A-D the balanced accuracy is only 50-59%.
This is clearly not a great model! Looking back at the visualization results and the variable importance, it is clear why this may be the case: roll_belt has an overall importance of 100% while the other features range from 0 to 57.9%.
The model could be improved by reassessing the features, but, again, CART models tend to be less accurate, so a random forest model is assessed next.
Error rates are calculated as 1 minus the overall accuracy of the prediction algorithm. For the CART model presented here the error rate is 1 - 0.36 = 0.64, or 64%. The confidence interval for the error rate is (1 - 0.433, 1 - 0.295) = (0.567, 0.705), or 56.7%-70.5%.
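Rather than transcribing these numbers by hand, the same quantities can be pulled straight from the confusionMatrix object; the small helper below is an addition to the original analysis and simply inverts the accuracy estimates:
#added helper: error rate and its 95% CI from a caret confusionMatrix object
errorRate <- function(cm) {
  acc <- cm$overall
  c(error = unname(1 - acc["Accuracy"]),
    lower = unname(1 - acc["AccuracyUpper"]),
    upper = unname(1 - acc["AccuracyLower"]))
}
errorRate(cmCT) #the same helper can be applied to cmRF once the random forest is fit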
To improve accuracy while maintaining scalability (the speed at which our algorithm runs), we will next train and validate a random forest model on the training and validation data, respectively.
#########Train the Prediction Model Random Forest##############
clusters <- makeCluster(detectCores() -1)
registerDoParallel(clusters)
ctrl <- trainControl(classProbs = TRUE, savePredictions = TRUE, allowParallel = TRUE)
method <- "rf"
system.time(ModelRF <- train(classe ~., preProcess=c("center", "scale"), data=dataTrain, method="rf"))
## user system elapsed
## 46.498 2.182 1016.540
ModelRF
## Random Forest
##
## 58880 samples
## 7 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered, scaled
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 58880, 58880, 58880, 58880, 58880, 58880, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9542179 0.9420650 0.001697132 0.002166066
## 4 0.9577325 0.9465110 0.001978630 0.002515431
## 7 0.9555228 0.9437178 0.002137422 0.002713068
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
predictedRF <- predict(ModelRF, dataValidate)
cmRF <- confusionMatrix(predictedRF, dataValidate$classe)
cmRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 41 6 6 5 2
## B 1 23 6 2 0
## C 4 7 18 7 1
## D 2 4 5 21 1
## E 0 0 0 0 37
##
## Overall Statistics
##
## Accuracy : 0.7035
## 95% CI : (0.6348, 0.766)
## No Information Rate : 0.2412
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.627
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8542 0.5750 0.51429 0.6000 0.9024
## Specificity 0.8742 0.9434 0.88415 0.9268 1.0000
## Pos Pred Value 0.6833 0.7188 0.48649 0.6364 1.0000
## Neg Pred Value 0.9496 0.8982 0.89506 0.9157 0.9753
## Prevalence 0.2412 0.2010 0.17588 0.1759 0.2060
## Detection Rate 0.2060 0.1156 0.09045 0.1055 0.1859
## Detection Prevalence 0.3015 0.1608 0.18593 0.1658 0.1859
## Balanced Accuracy 0.8642 0.7592 0.69922 0.7634 0.9512
varImp(ModelRF)
## rf variable importance
##
## Overall
## roll_belt 100.000
## magnet_belt_z 45.717
## magnet_belt_y 33.961
## accel_belt_x 31.999
## gyros_belt_x 26.862
## gyros_belt_y 5.184
## total_accel_belt 0.000
stopCluster(clusters)
#estimated error rate :
save(ModelRF, file="ModelRF.RData")
The confusion matrix for our second algorithm shows an overall accuracy of 70.4% and per-classe balanced accuracies of 70-95%. This is a large jump from the accuracy of the classification tree model.
The overall error rate is 1 - 0.704 = 29.6%.
The confidence interval for the error rate is (1 - 0.766, 1 - 0.635), or 23.4%-36.5%.
Because of its better accuracy while maintaining scalability, the random forest algorithm will be used to predict results on our test data, and those predictions are written out as single-letter answer files for submission.
########Predict with chosen model on test data############
TestingData <- read.csv("/Users/Work/Desktop/PracticalML/pml-testing.csv", header=TRUE, stringsAsFactors=FALSE)
load(file="ModelRF.RData", verbose=FALSE)
predictFinal <- predict(ModelRF, TestingData)
predictCT <- predict(ModelClassTree, TestingData)
#Write files for submission
pml_write_files = function(x){
  n = length(x)
  for (i in 1:n){
    filename = paste0("problem_id_", i, ".txt") #paste0 avoids spaces in the file names
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(predictFinal)
pml_write_files(predictCT) #note: this overwrites the random forest files above, since the file names are identical
Reducing the dimensionality of the data may have affected the accuracy of the classification tree model; however, the reduction was kept in order to preserve speed and scalability. As an added way to improve performance, the parallel and doParallel packages were used and the code was run on a multi-core cluster.
The output was submitted to the project site for testing, and about 90% of the answers scored correctly on the test data set. This suggests either that there is some slight skew (i.e. more of the classes with higher accuracy were present in the test data) or that the random forest model slightly under-fit the training data. The test data set contains only 20 observations, so a larger test set would be helpful for a final diagnosis.
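One purely exploratory check (added here, assuming the prediction objects above are still in the workspace) is to tabulate the predicted classes and the agreement between the two models on the 20 test cases, which would show whether the easier-to-predict classes dominate the test set:
#added diagnostic: distribution of predicted classes and agreement between the two models
table(predictFinal)
table(RandomForest = predictFinal, ClassTree = predictCT)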
For the full code, see the GitHub repository containing the HTML, R Markdown, and raw R files.