Summary

This study compares machine learning approaches to evaluate and predict physical exercise outcomes based on proper weight lifting technique. Our expectation, based on the information provided on the web site, is that an effective predictive model can be built; our goal is to produce a model with an error rate below 1.0%.

The web site, referenced below, provides details of a study of six participants performing dumbbell lifting exercises. The quality of execution, the “how (well)” an activity was performed, was measured using sensors on wearable devices and exercise equipment.

Read more: http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201

The data was captured and evaluated, with each execution classified into one of five categories. The categories described in the study are:

Category	Value
A	exactly according to the specification
B	throwing the elbows to the front
C	lifting the dumbbell only halfway
D	lowering the dumbbell only halfway
E	throwing the hips to the front

Only category A corresponds to the correct execution of each exercise. The other categories capture exercise technique errors.

Background

Taken from the assignment listing: Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.

options(warn=-1)
# Clean the Environment
rm(list = ls(all = TRUE))
#Setting the working directory - This is specific to your system
setwd('~/Dropbox/Coursera/MachineLearning')
# Load the classification and regression training library
library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

The function below was taken from the project assignment page, and will be used to create the files for the submission portion of the assignment.

# Function from the assignment to write files for submission
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("submit/problem_id_",i,".txt")
        write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
    }
}
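
Note that pml_write_files assumes the submit directory already exists. As a minimal sketch of the same idea, the variant below takes the output directory as an argument and creates it if needed. The name write_predictions, the out_dir parameter, and the temporary directory are illustrative assumptions, not part of the assignment.

```r
# Hypothetical variant of pml_write_files() with an explicit output directory
write_predictions <- function(x, out_dir) {
  # Create the directory if it is missing; do not warn if it already exists
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(x)) {
    filename <- file.path(out_dir, paste0("problem_id_", i, ".txt"))
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
  invisible(list.files(out_dir))
}

# Usage with three made-up predictions, written to a temporary directory
demo_dir <- file.path(tempdir(), "submit_demo")
written <- write_predictions(c("B", "A", "E"), demo_dir)
third <- readLines(file.path(demo_dir, "problem_id_3.txt"))
print(third)
```

Each file holds a single line with the predicted category, which is the layout the submission step expects.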

Processing the Data

In this section the data is read and processed. Based on empty fields (NA) and sparsely populated categories, the number of dimensions is reduced significantly.

# Read the training and testing sets
training <- read.csv(file="pml-training.csv", header=TRUE, as.is = TRUE, stringsAsFactors = FALSE, sep=',', na.strings=c('NA','','#DIV/0!'))
testing <- read.csv(file="pml-testing.csv", header=TRUE, as.is = TRUE, stringsAsFactors = FALSE, sep=',', na.strings=c('NA','','#DIV/0!'))

training$classe <- as.factor(training$classe)

# Keep only the columns that contain no NA values
NAidx <- colSums(is.na(training))
training <- training[,which(NAidx == 0)]
NAidx <- colSums(is.na(testing))
testing <- testing[,which(NAidx == 0)]

# Preprocess: keep only columns of class "numeric" (integer columns are dropped)
vec <- which(lapply(training[,], class) %in% "numeric")
# knnImpute (caret default: 5 nearest neighbors), then center and scale
preObj <- preProcess(training[,vec], method=c('knnImpute', 'center', 'scale'))
trainSet <- predict(preObj, training[,vec])
trainSet$classe <- training$classe
# Apply the transformation learned on the training data to the test data
testSet <- predict(preObj, testing[,vec])

# Remove near-zero-variance predictors, if any
nearZ <- nearZeroVar(trainSet,saveMetrics=TRUE)
trainSet <- trainSet[,nearZ$nzv==FALSE]
nearZ <- nearZeroVar(testSet,saveMetrics=TRUE)
testSet <- testSet[,nearZ$nzv==FALSE]
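
The two column filters used in this section (drop columns containing NA, then keep columns of class "numeric") can be checked on a toy data frame. This is only a sketch: the data frame toy and its column names are invented for illustration and are not taken from the project data.

```r
# Toy data: one integer column, one complete numeric column,
# one numeric column with NAs, and one character column
toy <- data.frame(
  id    = 1:4,
  roll  = c(1.5, 2.5, 3.5, 4.5),
  kurt  = c(NA, 0.1, NA, 0.3),
  label = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Step 1: keep columns without any NA
complete <- toy[, colSums(is.na(toy)) == 0]
print(names(complete))

# Step 2: keep columns whose class is exactly "numeric";
# integer and character columns do not match
numericCols <- which(sapply(complete, class) %in% "numeric")
print(names(complete)[numericCols])
```

The second step is stricter than it may look: a column read as integer is excluded as well, which is one reason the number of predictors drops so sharply.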

Prepare for Cross Validation

Cross validation will help estimate the accuracy of the prediction model. We hold out 20% of the training data as a validation set (called the cross validation set below); the models are built on the remaining 80% and then evaluated on the held-out rows.

# Create cross validation set
set.seed(33833)

inTrain = createDataPartition(trainSet$classe, p = 0.8, list=FALSE)
training = trainSet[inTrain,]
crossValidation = trainSet[-inTrain,]
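
For a factor outcome, createDataPartition samples within each class so that the class proportions are roughly preserved in both parts. The base R sketch below illustrates that idea on made-up labels; it is an approximation for illustration, not caret's implementation, and the names labels, inTrainSketch, trainLabels and holdoutLabels are hypothetical.

```r
set.seed(33833)

# Made-up outcome with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Draw 80% of the row indices separately within each class
inTrainSketch <- unlist(lapply(
  split(seq_along(labels), labels),
  function(idx) sample(idx, size = round(0.8 * length(idx)))
))

trainLabels   <- labels[inTrainSketch]
holdoutLabels <- labels[-inTrainSketch]

print(table(trainLabels))
print(table(holdoutLabels))
```

Sampling per class keeps rare categories from being under-represented in the held-out rows, which matters when per-class sensitivity is reported.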

Building the Models

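Two models are trained and compared. Model A is a random forest tuned with 5-fold cross validation. Model B was intended to be a generalized linear model, but the call passes model="glm" rather than method="glm"; train() therefore falls back to its default method, random forest, as the fitB$finalModel output confirms. Model B is thus a second random forest, fitted with centering and scaling and with train()'s default bootstrap resampling. For each model we examine the out-of-bag (OOB) error estimate and the variable importance ranking.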

# Train with random forest and trainControl using 5 fold cross validation.
ctrl <- trainControl(method='cv', number=5 )
fitA <- train(classe ~., method="rf", data=training, trControl=ctrl)

## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

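# Intended as a generalized linear model; because the argument is model= rather
# than method=, train() uses its default method ("rf") and passes model="glm"
# on to randomForest()
fitB <- train(classe ~., model="glm", data=training, preProcess=c("center", "scale"))
# Compare the estimated error rates and variable importance of the two models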

# Model A
fitA$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.62%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 4461    2    0    0    1 0.000672043
## B   17 3007   14    0    0 0.010204082
## C    0   13 2710   15    0 0.010226443
## D    0    0   23 2546    4 0.010493587
## E    0    0    0    8 2878 0.002772003
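
The OOB error rate and the class.error column can be reproduced by hand from the confusion matrix printed above, where rows are the actual classes and columns the predicted ones. The sketch below re-enters the counts; the object names oob, classError and oobError are chosen here for illustration.

```r
# Counts copied from the Model A OOB confusion matrix (rows = actual class)
oob <- matrix(c(4461,    2,    0,    0,    1,
                  17, 3007,   14,    0,    0,
                   0,   13, 2710,   15,    0,
                   0,    0,   23, 2546,    4,
                   0,    0,    0,    8, 2878),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

# Per-class error: share of each row that falls off the diagonal
classError <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all cases that fall off the diagonal
oobError <- 1 - sum(diag(oob)) / sum(oob)

print(round(classError, 6))
print(round(100 * oobError, 2))
```

With 97 misclassified cases out of 15699, the overall rate comes to about 0.62%, matching the reported estimate.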

varImp(fitA)

## rf variable importance
## 
##   only 20 most important variables shown (out of 27)
## 
##                   Overall
## roll_belt          100.00
## yaw_belt            76.79
## magnet_dumbbell_z   63.89
## pitch_forearm       62.65
## pitch_belt          57.25
## roll_forearm        46.62
## roll_dumbbell       41.64
## roll_arm            29.84
## yaw_dumbbell        29.71
## gyros_belt_z        28.52
## gyros_dumbbell_y    28.05
## magnet_forearm_z    27.06
## yaw_arm             26.64
## pitch_dumbbell      24.66
## magnet_forearm_y    23.36
## yaw_forearm         20.81
## pitch_arm           14.79
## gyros_arm_y         11.58
## gyros_arm_x         11.26
## gyros_dumbbell_x    10.66

# Model B
fitB$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, model = "glm") 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.62%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4462    2    0    0    0 0.0004480287
## B   17 3011    9    0    1 0.0088874259
## C    0   16 2705   16    1 0.0120525931
## D    0    0   22 2547    4 0.0101049359
## E    0    0    0    9 2877 0.0031185031
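
The Call line in the output above shows randomForest(..., model = "glm"): the model argument was not interpreted as a choice of method, so the default method was used and the extra argument was simply passed along. The sketch below illustrates this R argument-matching behavior with a hypothetical function, fit_sketch, standing in for train(); it is not caret's code.

```r
# Hypothetical stand-in for train(): 'method' has a default value and
# unrecognised named arguments are collected by '...'
fit_sketch <- function(formula, data, method = "rf", ...) {
  extras <- list(...)
  list(method = method, forwarded = names(extras))
}

toyData <- data.frame(y = 1:3, x = 3:1)

# 'model' is not a prefix of 'method', so it is not matched to it
res <- fit_sketch(y ~ ., data = toyData, model = "glm")

print(res$method)     # the default is kept
print(res$forwarded)  # the stray argument ends up in '...'
```

No error or warning is raised in this situation, which is why the mistake is easy to miss until the fitted object is inspected.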

varImp(fitB)

## rf variable importance
## 
##   only 20 most important variables shown (out of 27)
## 
##                   Overall
## roll_belt          100.00
## yaw_belt            82.81
## magnet_dumbbell_z   66.12
## pitch_forearm       63.78
## pitch_belt          59.84
## roll_forearm        47.05
## roll_dumbbell       44.59
## roll_arm            31.72
## gyros_belt_z        30.77
## yaw_dumbbell        29.80
## yaw_arm             28.31
## gyros_dumbbell_y    27.99
## magnet_forearm_z    27.33
## pitch_dumbbell      24.57
## magnet_forearm_y    23.65
## yaw_forearm         20.18
## pitch_arm           15.69
## gyros_arm_x         11.54
## gyros_arm_y         11.53
## gyros_dumbbell_x    10.66

Predictions and Errors

And now we predict and examine the errors to compare our models.

# Training set accuracy - Model A
trainingPred <- predict(fitA, training)
confusionMatrix(trainingPred, training$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4464    0    0    0    0
##          B    0 3038    0    0    0
##          C    0    0 2738    0    0
##          D    0    0    0 2573    0
##          E    0    0    0    0 2886
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

# Training set accuracy - Model B
trainingPred <- predict(fitB, training)
confusionMatrix(trainingPred, training$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4464    0    0    0    0
##          B    0 3038    0    0    0
##          C    0    0 2738    0    0
##          D    0    0    0 2573    0
##          E    0    0    0    0 2886
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

# Cross validation set accuracy - Model A
cvPred <- predict(fitA, crossValidation)
confusionMatrix(cvPred, crossValidation$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1114    1    0    0    0
##          B    2  756    2    0    0
##          C    0    2  679    7    1
##          D    0    0    3  636    0
##          E    0    0    0    0  720
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9954          
##                  95% CI : (0.9928, 0.9973)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9942          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9960   0.9927   0.9891   0.9986
## Specificity            0.9996   0.9987   0.9969   0.9991   1.0000
## Pos Pred Value         0.9991   0.9947   0.9855   0.9953   1.0000
## Neg Pred Value         0.9993   0.9991   0.9985   0.9979   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2840   0.1927   0.1731   0.1621   0.1835
## Detection Prevalence   0.2842   0.1937   0.1756   0.1629   0.1835
## Balanced Accuracy      0.9989   0.9974   0.9948   0.9941   0.9993
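
The headline statistics for Model A on the held-out data follow directly from the confusion matrix above. As a worked check, the sketch below recomputes accuracy, Cohen's kappa and per-class sensitivity from the printed counts; the object names are chosen here for illustration.

```r
# Counts copied from the Model A hold-out confusion matrix
# (rows = prediction, columns = reference)
cm <- matrix(c(1114,   1,   0,   0,   0,
                  2, 756,   2,   0,   0,
                  0,   2, 679,   7,   1,
                  0,   0,   3, 636,   0,
                  0,   0,   0,   0, 720),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5]))

n <- sum(cm)

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / n

# Kappa: accuracy corrected for the agreement expected by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappaStat <- (accuracy - expected) / (1 - expected)

# Sensitivity: correct predictions per reference class
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(kappaStat, 4))
print(round(sensitivity, 4))
print(round(100 * (1 - accuracy), 2))  # error rate in percent
```

The error rate is one minus the accuracy, about 0.46% here, which is the figure to compare against the 1.0% goal.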

# Cross validation set accuracy - Model B
cvPred <- predict(fitB, crossValidation)
confusionMatrix(cvPred, crossValidation$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1114    1    0    0    0
##          B    2  757    1    0    0
##          C    0    1  680    6    1
##          D    0    0    3  637    0
##          E    0    0    0    0  720
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9962          
##                  95% CI : (0.9937, 0.9979)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9952          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9974   0.9942   0.9907   0.9986
## Specificity            0.9996   0.9991   0.9975   0.9991   1.0000
## Pos Pred Value         0.9991   0.9961   0.9884   0.9953   1.0000
## Neg Pred Value         0.9993   0.9994   0.9988   0.9982   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2840   0.1930   0.1733   0.1624   0.1835
## Detection Prevalence   0.2842   0.1937   0.1754   0.1631   0.1835
## Balanced Accuracy      0.9989   0.9982   0.9958   0.9949   0.9993

Results and Submission

The cross validation error rates of both final models, 0.46% for Model A and 0.38% for Model B, are below the 1.0% goal stated earlier. The models are therefore used to predict the category of each test case, and the result files are created for submission.

#Predictions on the real testing set
# Predictions from Model A
testingPred <- predict(fitA, testSet)
testingPred

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

# Predictions from Model B
testingPred <- predict(fitB, testSet)
testingPred

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
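
Rather than comparing the two printed vectors by eye, the agreement can be checked programmatically. The sketch below re-enters the 20 predictions shown above; in the actual session one would keep the two predict() results in separate variables (predA and predB are names assumed here, since the report reuses testingPred for both).

```r
# Predictions copied from the output above
shown <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
           "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
predA <- factor(shown, levels = LETTERS[1:5])
predB <- factor(shown, levels = LETTERS[1:5])

# TRUE only if values, levels and order all match
sameResult <- identical(predA, predB)
print(sameResult)

# Positions where the models disagree (empty if they agree everywhere)
print(which(predA != predB))

# Distribution of the predicted categories
print(table(predA))
```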

# Create the output directory if needed (portable across operating systems)
dir.create("submit", showWarnings = FALSE)
pml_write_files(testingPred)

Conclusion

Both of the models built to predict exercise form from movement data have an error rate of less than 1.0% on the held-out data, which was the goal stated initially. The predicted results are identical for Model A and Model B, which is not surprising given that both were fitted as random forests.

Practical Machine Learning Project

Max K. Goff

August 21, 2015