The goal of this project’s analysis is to predict classes of exercise based upon self-quantified movement data. In the original study the authors recruited a group of health enthusiasts who regularly measure their exercise to improve their health, using devices such as the Jawbone Up, Nike FuelBand, and Fitbit. The data for the current study were collected to determine how well participants performed barbell lifts. Participants were asked to perform the lifts both correctly and incorrectly in 5 different ways, and these methods were quantified using 160 features gathered from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. More information is provided at http://groupware.les.inf.puc-rio.br/har.
Both the training and testing sets were kindly provided by the authors of the published study:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
The full training data set contains 19622 observations of 160 features. It is unlikely that all of these features are necessary to make accurate predictions, so a practical approach is taken to reducing the dimensionality of the data. We will do this with some pre-processing before splitting the “training” data into a training set and a validation set.
library(gridExtra)
## Loading required package: grid
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(RColorBrewer)
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(ggplot2)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(rpart.plot)
library(AppliedPredictiveModeling)
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
#load the training data for pre-processing (the testing data is loaded later, before prediction)
TrainingData <- read.csv("/Users/Work/Desktop/PracticalML/pml-training.csv", header=TRUE, stringsAsFactors=FALSE)
#dimensions of training data
dim(TrainingData)
## [1] 19622 160
First we will check for features with missing data, as missing values can reduce the accuracy of predictions. Pre-processing in this instance includes: (1) removing features with missing or blank values and keeping only the accelerometer-derived features, (2) confirming that the retained features have non-trivial variance and removing highly correlated features, and (3) fitting and comparing prediction models. Prior to step 3, (i) the data are split into training and validation sets using a seed of 1234 so the analysis is reproducible, and (ii) a data visualization is performed.
isMissing <- sapply(TrainingData, function(x) any(is.na(x) | x== "")) #flag features containing any NA or blank values
isPredictor <- !isMissing & grepl("belt|[^(fore) ] arm |dumbell | forearm", names(isMissing)) #keep complete features whose names match the sensor pattern (in this data only the belt features match)
featureCandidates <- names(isMissing)[isPredictor]
dataInclude <- c("classe", featureCandidates)
featureData <- TrainingData[dataInclude] #now all data is selected for 13 features
dim(featureData) #14 columns, 19622 observations
## [1] 19622 14
str(featureData) #make sure "classe" is still present
## 'data.frame': 19622 obs. of 14 variables:
## $ classe : chr "A" "A" "A" "A" ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt: int 3 3 3 3 3 3 3 3 3 3 ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
#confirm numerically that none of the selected features has near-zero variance
nzv <- nearZeroVar(featureData, saveMetrics=TRUE)
summary(nzv)
## freqRatio percentUnique zeroVar nzv
## Min. :1.006 Min. :0.02548 Mode :logical Mode :logical
## 1st Qu.:1.059 1st Qu.:0.71731 FALSE:14 FALSE:14
## Median :1.072 Median :1.18999 NA's :0 NA's :0
## Mean :1.103 Mean :2.63079
## 3rd Qu.:1.101 3rd Qu.:2.16339
## Max. :1.470 Max. :9.97350
featureData2 <- TrainingData[featureCandidates]
descrCor <- cor(featureData2)
perfectCorFeatures <- sum(abs(descrCor[upper.tri(descrCor)]) > 0.999)
perfectCorFeatures
## [1] 0
highCorFeatures <- findCorrelation(descrCor, cutoff = 0.75)
highCorFeatures
## [1] 3 1 10 4 11 8 12
classe <- featureData$classe
filteredFeatures <- cbind(classe, (featureData[,-(findCorrelation(descrCor, cutoff=0.75))]))
#creation of Training and Validation data using 5 resamples
set.seed(1234)
trainIndex <- createDataPartition(filteredFeatures$classe, p=0.6, list=FALSE, times=5) #here we created a 60/40% split with 5 resamplings
dataTrain <- filteredFeatures[trainIndex,]
dataValidate <- filteredFeatures[-trainIndex,]
Steps 1 and 2 of the dimensionality reduction are complete. We have now created training and validation data sets containing only 7 features plus classe (the classification of exercise performance) for our prediction model, compared with the 160 features in the raw data: a substantial reduction in dimensionality. Note that because createDataPartition was called with times=5, the five resampled 60% partitions are stacked into a single training set, while the validation set contains only the observations that fall outside all five partitions.
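As a quick sanity check (added here, not part of the original output), the dimensions of the two sets and the retained feature names can be inspected:
#sanity check: confirm the stacked training set, the small validation set, and the retained features
dim(dataTrain)
dim(dataValidate)
names(filteredFeatures)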
Next we can visualize the training data to see whether any patterns emerge that might suggest a classification algorithm will work well. For the final analysis we will test both a classification tree and a random forest, since accuracy will be graded and our model is relatively simple.
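The plotting code itself is not reproduced in this report. As a rough sketch only, assuming caret’s featurePlot interface (the original figure may have been produced differently), per-feature density plots grouped by classe could be generated like this:
#sketch (not the original plotting code): density of each retained feature, coloured by classe
featurePlot(x = dataTrain[, setdiff(names(dataTrain), "classe")],
            y = as.factor(dataTrain$classe),
            plot = "density",
            scales = list(x = list(relation = "free"), y = list(relation = "free")),
            auto.key = list(columns = 5))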
From the visualization, one can see that some features clearly differ depending upon the classe of exercise, suggesting that a classification tree might be appropriate. However, random forest models still tend to be more accurate, so both will be tested.
The next step is to take the visualized and partitioned data and develop our algorithms. The R package “caret” has a helpful feature that allows the data to be centered and scaled during model building. After the models are built, both are saved to file for later use on the test data. To decide between the two models, each is evaluated on the validation data set created above in the pre-processing steps.
A confusion matrix will help determine which algorithm is chosen to apply to the test data.
clusters <- makeCluster(detectCores() -1)
registerDoParallel(clusters)
ctrl <- trainControl(classProbs = TRUE, savePredictions = TRUE, allowParallel = TRUE) #note: ctrl is not passed to train() below, so caret's default bootstrap resampling (25 reps) is used
method <- "rpart"
system.time(ModelClassTree <- train(classe ~., preProcess=c("center", "scale"), data=dataTrain, method="rpart"))
## user system elapsed
## 6.420 0.222 40.072
ModelClassTree
## CART
##
## 58880 samples
## 7 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered, scaled
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 58880, 58880, 58880, 58880, 58880, 58880, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.01765543 0.4005427 0.17991174 0.01402429 0.02602497
## 0.03490745 0.3763033 0.14061146 0.01168593 0.01902099
## 0.11525866 0.3366784 0.07971013 0.04004040 0.06103125
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01765543.
predictedClassTree <- predict(ModelClassTree, dataValidate)
cmCT <- confusionMatrix(predictedClassTree, dataValidate$classe)
cmCT
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 48 40 35 32 17
## B 0 0 0 0 0
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 3 24
##
## Overall Statistics
##
## Accuracy : 0.3618
## 95% CI : (0.295, 0.4328)
## No Information Rate : 0.2412
## P-Value [Acc > NIR] : 9.441e-05
##
## Kappa : 0.1642
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.000 0.0000 0.0000 0.5854
## Specificity 0.1788 1.000 1.0000 1.0000 0.9810
## Pos Pred Value 0.2791 NaN NaN NaN 0.8889
## Neg Pred Value 1.0000 0.799 0.8241 0.8241 0.9012
## Prevalence 0.2412 0.201 0.1759 0.1759 0.2060
## Detection Rate 0.2412 0.000 0.0000 0.0000 0.1206
## Detection Prevalence 0.8643 0.000 0.0000 0.0000 0.1357
## Balanced Accuracy 0.5894 0.500 0.5000 0.5000 0.7832
stopCluster(clusters)
varImp(ModelClassTree)
## rpart variable importance
##
## Overall
## roll_belt 100.000
## magnet_belt_y 57.857
## total_accel_belt 47.489
## magnet_belt_z 40.443
## gyros_belt_x 15.527
## gyros_belt_y 9.975
## accel_belt_x 0.000
ModelClassTree$finalModel
## n= 58880
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 58880 42140 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 1.052489 53943 37243 A (0.31 0.21 0.19 0.18 0.11)
## 4) roll_belt>=-1.037945 51382 34737 A (0.32 0.22 0.2 0.17 0.086) *
## 5) roll_belt< -1.037945 2561 1035 E (0.021 0.009 0 0.37 0.6) *
## 3) roll_belt>=1.052489 4937 40 E (0.0081 0 0 0 0.99) *
#estimated error rate :
save(ModelClassTree, file="ModelClassTree.RData")
The confusion matrix for our first prediction algorithm shows an overall accuracy of 36% with a 95% confidence interval of (29.5%, 43.3%). The balanced accuracy for classe E is 78%, but for classes A-D the balanced accuracy is only 50-59%.
This is clearly not a great model! Looking back at the visualization results and the variable importance, it is clear why this may be the case: roll_belt has an overall importance of 100% while the other features range from 0 to 57.9%.
The model could be improved by reassessing the features, but, again, CART models tend to be less accurate, so a random forest model is assessed next.
Error rates are calculated as 1 minus the overall accuracy of the prediction algorithm. For the CART model presented here the error rate is 1 - 0.36 = 0.64, or 64%. The confidence interval for the error rate is (1 - 0.433, 1 - 0.295) = (0.567, 0.705), or 56.7%-70.5%.
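Rather than transcribing these numbers by hand, the same quantities can be pulled straight from the confusionMatrix object; the small helper below is an addition to the original analysis and simply inverts the accuracy estimates:
#added helper: error rate and its 95% CI from a caret confusionMatrix object
errorRate <- function(cm) {
  acc <- cm$overall
  c(error = unname(1 - acc["Accuracy"]),
    lower = unname(1 - acc["AccuracyUpper"]),
    upper = unname(1 - acc["AccuracyLower"]))
}
errorRate(cmCT) #the same helper can be applied to cmRF once the random forest is fit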
To improve accuracy while maintaining scalability (the speed at which our algorithm runs), we will next train and validate a random forest model on the training and validation data, respectively.
#########Train the Prediction Model Random Forest##############
clusters <- makeCluster(detectCores() -1)
registerDoParallel(clusters)
ctrl <- trainControl(classProbs = TRUE, savePredictions = TRUE, allowParallel = TRUE)
method <- "rf"
system.time(ModelRF <- train(classe ~., preProcess=c("center", "scale"), data=dataTrain, method="rf"))
## user system elapsed
## 46.498 2.182 1016.540
ModelRF
## Random Forest
##
## 58880 samples
## 7 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered, scaled
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 58880, 58880, 58880, 58880, 58880, 58880, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9542179 0.9420650 0.001697132 0.002166066
## 4 0.9577325 0.9465110 0.001978630 0.002515431
## 7 0.9555228 0.9437178 0.002137422 0.002713068
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
predictedRF <- predict(ModelRF, dataValidate)
cmRF <- confusionMatrix(predictedRF, dataValidate$classe)
cmRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 41 6 6 5 2
## B 1 23 6 2 0
## C 4 7 18 7 1
## D 2 4 5 21 1
## E 0 0 0 0 37
##
## Overall Statistics
##
## Accuracy : 0.7035
## 95% CI : (0.6348, 0.766)
## No Information Rate : 0.2412
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.627
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8542 0.5750 0.51429 0.6000 0.9024
## Specificity 0.8742 0.9434 0.88415 0.9268 1.0000
## Pos Pred Value 0.6833 0.7188 0.48649 0.6364 1.0000
## Neg Pred Value 0.9496 0.8982 0.89506 0.9157 0.9753
## Prevalence 0.2412 0.2010 0.17588 0.1759 0.2060
## Detection Rate 0.2060 0.1156 0.09045 0.1055 0.1859
## Detection Prevalence 0.3015 0.1608 0.18593 0.1658 0.1859
## Balanced Accuracy 0.8642 0.7592 0.69922 0.7634 0.9512
varImp(ModelRF)
## rf variable importance
##
## Overall
## roll_belt 100.000
## magnet_belt_z 45.717
## magnet_belt_y 33.961
## accel_belt_x 31.999
## gyros_belt_x 26.862
## gyros_belt_y 5.184
## total_accel_belt 0.000
stopCluster(clusters)
#estimated error rate :
save(ModelRF, file="ModelRF.RData")
The confusion matrix for our second algorithm shows an overall accuracy of 70.4% and per-classe balanced accuracies of 70-95%. This is a large jump from the accuracy of the classification tree model.
The overall error rate is 1 - 0.704 = 29.6%.
The confidence interval for the error rate is (1 - 0.766, 1 - 0.635), or 23.4%-36.5%.
Because of its better accuracy while maintaining scalability, the random forest algorithm will be used to predict results on our test data, and those predictions are written out as single-letter answer files for submission.
########Predict with chosen model on test data############
TestingData <- read.csv("/Users/Work/Desktop/PracticalML/pml-testing.csv", header=TRUE, stringsAsFactors=FALSE)
load(file="ModelRF.RData", verbose=FALSE)
predictFinal <- predict(ModelRF, TestingData)
predictCT <- predict(ModelClassTree, TestingData)
#Write files for submission
pml_write_files = function(x){
  n = length(x)
  for (i in 1:n){
    filename = paste0("problem_id_", i, ".txt") #paste0 avoids spaces in the file names
    write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(predictFinal)
pml_write_files(predictCT) #note: this overwrites the random forest files above, since the file names are identical
Reducing the dimensionality of the data may have affected the accuracy of the classification tree model; however, the reduction was kept in order to preserve speed and scalability. As an added way to improve performance, the parallel and doParallel packages were used and the code was run on a multi-core cluster.
The output was submitted to the project site for testing, and about 90% of the answers scored correctly on the test data set. This suggests either that there is some slight skew (i.e. more of the classes with higher accuracy were present in the test data) or that the random forest model slightly under-fit the training data. The test data set contains only 20 observations, so a larger test set would be helpful for a final diagnosis.
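One purely exploratory check (added here, assuming the prediction objects above are still in the workspace) is to tabulate the predicted classes and the agreement between the two models on the 20 test cases, which would show whether the easier-to-predict classes dominate the test set:
#added diagnostic: distribution of predicted classes and agreement between the two models
table(predictFinal)
table(RandomForest = predictFinal, ClassTree = predictCT)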
For the full code, see the GitHub repository containing the HTML, R Markdown, and raw R files.