This study compares machine learning approaches for evaluating and predicting how well a physical exercise is performed, based on proper weight lifting technique. Our expectation, based on the information provided on the web site, is that an effective predictive model can be built; our goal is to produce a model with an error rate below 1.0%.
The web site, referenced below, provides details of a study of six participants performing dumbbell lifting exercises. The quality of execution, that is, “how (well)” an activity was performed, was measured using sensors on wearable devices and exercise equipment.
Read more: http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201#ixzz3jY4OxP33
The data was captured and evaluated, with each execution classified into one of five categories. The categories described in the study are:
| Category | Description |
|---|---|
| A | exactly according to the specification |
| B | throwing the elbows to the front |
| C | lifting the dumbbell only halfway |
| D | lowering the dumbbell only halfway |
| E | throwing the hips to the front |
Only category A corresponds to the correct execution of each exercise. The other categories capture exercise technique errors.
Taken from the assignment listing: Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
options(warn=-1)
# Clean the Environment
rm(list = ls(all = TRUE))
#Setting the working directory - This is specific to your system
setwd('~/Dropbox/Coursera/MachineLearning')
# Load the classification and regression training library
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
The function below was taken from the project assignment page, and will be used to create the files for the submission portion of the assignment.
# Function from the assignment to write files for submission
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("submit/problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
In this section the data is read and processed. Based on empty fields (NA) and sparsely populated categories, the number of dimensions is reduced significantly.
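As a small illustration of the NA-based column filter used in this section, the sketch below applies the same idea to a toy data frame. The column values are made up and only stand in for the sensor data; `colSums(is.na(...))` is used here as a compact equivalent of counting missing values per column.

```r
# Toy data frame standing in for the sensor data (values are made up)
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48),
  kurtosis_belt = c(NA, NA, 5.2),   # sparsely populated summary column
  max_roll_belt = c(NA, NA, NA),    # entirely missing
  classe        = c("A", "A", "B")
)

# Count missing values per column, then keep only the complete columns
naCount  <- colSums(is.na(toy))
complete <- toy[, naCount == 0, drop = FALSE]

print(naCount)
print(names(complete))
```

Only `roll_belt` and `classe` survive the filter in this toy case, because every other column contains at least one NA.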
# Read the training and testing sets
training <- read.csv(file="pml-training.csv", header=TRUE, as.is = TRUE, stringsAsFactors = FALSE, sep=',', na.strings=c('NA','','#DIV/0!'))
testing <- read.csv(file="pml-testing.csv", header=TRUE, as.is = TRUE, stringsAsFactors = FALSE, sep=',', na.strings=c('NA','','#DIV/0!'))
training$classe <- as.factor(training$classe)
# Remove every column that contains at least one NA
NAidx <- apply(training,2,function(x) {sum(is.na(x))})
training <- training[,which(NAidx == 0)]
NAidx <- apply(testing,2,function(x) {sum(is.na(x))})
testing <- testing[,which(NAidx == 0)]
# Preprocess: keep only columns of class "numeric" (integer and character columns are excluded)
vec <- which(lapply(training[,], class) %in% "numeric")
# Impute with k-nearest neighbors (caret's default k = 5), then center and scale
preObj <-preProcess(training[,vec],method=c('knnImpute', 'center', 'scale'))
trainSet <- predict(preObj, training[,vec])
trainSet$classe <- training$classe
testSet <-predict(preObj,testing[,vec])
# Remove near-zero-variance predictors, if any
nearZ <- nearZeroVar(trainSet,saveMetrics=TRUE)
trainSet <- trainSet[,nearZ$nzv==FALSE]
nearZ <- nearZeroVar(testSet,saveMetrics=TRUE)
testSet <- testSet[,nearZ$nzv==FALSE]
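The preprocessing above learns its parameters on the training columns and then applies the same object to the test columns through `predict(preObj, ...)`. A minimal base R sketch of the centering and scaling part of that idea follows; the toy values and the names `trainX`, `testX`, `mu`, and `sigma` are assumptions for illustration, and the k-nearest-neighbor imputation step is not reproduced.

```r
# Toy training and test columns (made-up values)
trainX <- data.frame(x = c(2, 4, 6, 8))
testX  <- data.frame(x = c(5, 10))

# Estimate centering and scaling parameters on the training data only
mu    <- colMeans(trainX)
sigma <- apply(trainX, 2, sd)

# Apply the same parameters to both sets
trainScaled <- scale(trainX, center = mu, scale = sigma)
testScaled  <- scale(testX,  center = mu, scale = sigma)

print(round(colMeans(trainScaled), 10))  # training mean is 0 after centering
print(testScaled)
```

Reusing the training mean and standard deviation keeps the test values on the scale the model was fitted on, rather than rescaling the test set by its own statistics.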
Cross validation will help estimate the accuracy of the prediction model. We partition the data, training on 80% and holding out the remaining 20% (the crossValidation set) to check the models after they are built and before the test predictions are made.
# Create cross validation set
set.seed(33833)
inTrain = createDataPartition(trainSet$classe, p = 0.8, list=FALSE)
training = trainSet[inTrain,]
crossValidation = trainSet[-inTrain,]
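For a factor outcome, `createDataPartition` samples within each class so that the class proportions are roughly preserved in both parts. The sketch below approximates that idea in base R on a made-up outcome vector; it is not caret's actual implementation, and the class counts are assumptions chosen for the example.

```r
set.seed(33833)

# Toy outcome with unbalanced classes (made-up counts)
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each class
idxByClass <- split(seq_along(classe), classe)
inTrain <- sort(unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE))

print(table(classe[inTrain]))   # training part
print(table(classe[-inTrain]))  # held-out part
```

Each class contributes 80% of its rows to the training part, so the 50/30/20 mix of the toy outcome carries over to both subsets.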
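Two models are built with caret's train function: Model A, a random forest tuned with 5-fold cross validation, and Model B, which was intended as a generalized linear model. Note that the Model B call passes model="glm" rather than method="glm", so train falls back to its default method, random forest, as the fitB$finalModel output below confirms. For each model we inspect the final fit and a variable importance ranking (this is a variable importance analysis, not a principal components analysis).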
# Train with random forest and trainControl using 5 fold cross validation.
ctrl <- trainControl(method='cv', number=5 )
fitA <- train(classe ~., method="rf", data=training, trControl=ctrl)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
# Intended as a generalized linear model; because model= is not train()'s method= argument, the default method (random forest) is fitted
fitB <- train(classe ~., model="glm", data=training, preProcess=c("center", "scale"))
# Compare the OOB error estimates and the variable importance of the two models
# Model A
fitA$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.62%
## Confusion matrix:
## A B C D E class.error
## A 4461 2 0 0 1 0.000672043
## B 17 3007 14 0 0 0.010204082
## C 0 13 2710 15 0 0.010226443
## D 0 0 23 2546 4 0.010493587
## E 0 0 0 8 2878 0.002772003
varImp(fitA)
## rf variable importance
##
## only 20 most important variables shown (out of 27)
##
## Overall
## roll_belt 100.00
## yaw_belt 76.79
## magnet_dumbbell_z 63.89
## pitch_forearm 62.65
## pitch_belt 57.25
## roll_forearm 46.62
## roll_dumbbell 41.64
## roll_arm 29.84
## yaw_dumbbell 29.71
## gyros_belt_z 28.52
## gyros_dumbbell_y 28.05
## magnet_forearm_z 27.06
## yaw_arm 26.64
## pitch_dumbbell 24.66
## magnet_forearm_y 23.36
## yaw_forearm 20.81
## pitch_arm 14.79
## gyros_arm_y 11.58
## gyros_arm_x 11.26
## gyros_dumbbell_x 10.66
# Model B
fitB$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, model = "glm")
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.62%
## Confusion matrix:
## A B C D E class.error
## A 4462 2 0 0 0 0.0004480287
## B 17 3011 9 0 1 0.0088874259
## C 0 16 2705 16 1 0.0120525931
## D 0 0 22 2547 4 0.0101049359
## E 0 0 0 9 2877 0.0031185031
varImp(fitB)
## rf variable importance
##
## only 20 most important variables shown (out of 27)
##
## Overall
## roll_belt 100.00
## yaw_belt 82.81
## magnet_dumbbell_z 66.12
## pitch_forearm 63.78
## pitch_belt 59.84
## roll_forearm 47.05
## roll_dumbbell 44.59
## roll_arm 31.72
## gyros_belt_z 30.77
## yaw_dumbbell 29.80
## yaw_arm 28.31
## gyros_dumbbell_y 27.99
## magnet_forearm_z 27.33
## pitch_dumbbell 24.57
## magnet_forearm_y 23.65
## yaw_forearm 20.18
## pitch_arm 15.69
## gyros_arm_x 11.54
## gyros_arm_y 11.53
## gyros_dumbbell_x 10.66
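The Call line in the Model B output above shows `model = "glm"` being forwarded to `randomForest`. The toy sketch below illustrates how R's `...` mechanism can let a misnamed argument pass without an error. `toyTrain` and `toyFit` are hypothetical stand-ins written for this example and are not caret's functions.

```r
# Hypothetical stand-ins, NOT caret's real functions: the outer function
# picks an algorithm from `method` and forwards everything else via `...`
toyFit <- function(data, ...) {
  list(extraArgs = list(...))
}

toyTrain <- function(data, method = "rf", ...) {
  fit <- toyFit(data, ...)
  list(method = method, extraArgs = fit$extraArgs)
}

misnamed <- toyTrain(data = 1:3, model = "glm")   # `model` is swallowed by `...`
intended <- toyTrain(data = 1:3, method = "glm")  # `method` selects the algorithm

print(misnamed$method)
print(names(misnamed$extraArgs))
print(intended$method)
```

In the misnamed call the default `method` stays in effect and the stray `model` argument simply travels on to the inner function, which matches the pattern seen in the Model B call.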
And now we predict and examine the errors to compare our models.
# Training set accuracy - Model A
trainingPred <- predict(fitA, training)
confusionMatrix(trainingPred, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4464 0 0 0 0
## B 0 3038 0 0 0
## C 0 0 2738 0 0
## D 0 0 0 2573 0
## E 0 0 0 0 2886
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
# Training set accuracy - Model B
trainingPred <- predict(fitB, training)
confusionMatrix(trainingPred, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4464 0 0 0 0
## B 0 3038 0 0 0
## C 0 0 2738 0 0
## D 0 0 0 2573 0
## E 0 0 0 0 2886
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
# Cross validation set accuracy - Model A
cvPred <- predict(fitA, crossValidation)
confusionMatrix(cvPred, crossValidation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1114 1 0 0 0
## B 2 756 2 0 0
## C 0 2 679 7 1
## D 0 0 3 636 0
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9954
## 95% CI : (0.9928, 0.9973)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9942
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9960 0.9927 0.9891 0.9986
## Specificity 0.9996 0.9987 0.9969 0.9991 1.0000
## Pos Pred Value 0.9991 0.9947 0.9855 0.9953 1.0000
## Neg Pred Value 0.9993 0.9991 0.9985 0.9979 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2840 0.1927 0.1731 0.1621 0.1835
## Detection Prevalence 0.2842 0.1937 0.1756 0.1629 0.1835
## Balanced Accuracy 0.9989 0.9974 0.9948 0.9941 0.9993
# Cross validation set accuracy - Model B
cvPred <- predict(fitB, crossValidation)
confusionMatrix(cvPred, crossValidation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1114 1 0 0 0
## B 2 757 1 0 0
## C 0 1 680 6 1
## D 0 0 3 637 0
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9962
## 95% CI : (0.9937, 0.9979)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9952
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9974 0.9942 0.9907 0.9986
## Specificity 0.9996 0.9991 0.9975 0.9991 1.0000
## Pos Pred Value 0.9991 0.9961 0.9884 0.9953 1.0000
## Neg Pred Value 0.9993 0.9994 0.9988 0.9982 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2840 0.1930 0.1733 0.1624 0.1835
## Detection Prevalence 0.2842 0.1937 0.1754 0.1631 0.1835
## Balanced Accuracy 0.9989 0.9982 0.9958 0.9949 0.9993
The error rates of both final models on the held-out data, 0.46% for Model A (accuracy 0.9954) and 0.38% for Model B (accuracy 0.9962), are below the 1.0% goal stated earlier. The test data is therefore used to predict the category of each case, and the result files are created for submission.
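The link between a confusion matrix and the error rate can be checked by hand. The sketch below copies the Model A counts for the held-out data from the output above and recomputes the accuracy in base R; the variable names are assumptions for this example.

```r
# Model A confusion matrix on the held-out data, copied from the output above
# (rows are predictions, columns are reference classes)
cmA <- matrix(
  c(1114,   1,   0,   0,   0,
       2, 756,   2,   0,   0,
       0,   2, 679,   7,   1,
       0,   0,   3, 636,   0,
       0,   0,   0,   0, 720),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy  <- sum(diag(cmA)) / sum(cmA)  # share of correct predictions
errorRate <- 1 - accuracy

print(round(accuracy, 4))
print(round(100 * errorRate, 2))         # error rate in percent
```

With 3905 of 3923 cases on the diagonal, the accuracy rounds to 0.9954, which agrees with the value reported by `confusionMatrix`.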
#Predictions on the real testing set
# Predictions from Model A
testingPred <- predict(fitA, testSet)
testingPred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Predictions from Model B
testingPred <- predict(fitB, testSet)
testingPred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Create the output directory if it does not already exist
dir.create("submit", showWarnings = FALSE)
pml_write_files(testingPred)
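Both models built to predict exercise form from movement data have an error rate of less than 1.0% on the held-out data, which was the goal stated initially. The predicted results are identical for Model A and Model B, which is unsurprising given that the fitB$finalModel output shows both were fitted as random forests.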