Introduction

The data used in this report come from the research published in the following paper:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Read more at this link: http://groupware.les.inf.puc-rio.br/har

The researchers of this experiment have generously made their data publicly available. In this study, 6 participants wore accelerometers on the belt, forearm, arm, and dumbbell. They were asked to perform barbell lifts correctly and incorrectly in 5 different manners. The goal is to predict the manner in which the exercise was performed for the 20 test cases in the validation set. This manner is the “classe” variable in the training set.

Loading and Preprocessing the Data

This report makes use of the libraries loaded below. The analysis is conducted on the data sets made publicly available at the URLs shown in the code.

#Load necessary libraries
library(caret); library(e1071); library(knitr); library(rpart); library(randomForest); library(dplyr); library(mlbench)

#Load data
training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
validation <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
#Analyze Data
head(training); str(training)
head(validation); str(validation)
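
As an optional refinement (not part of the original analysis; the local file names below are illustrative), the files can be cached locally so that repeated runs do not re-download them. On R 4.0 and later, read.csv no longer converts strings to factors, so the “classe” outcome should also be converted explicitly before modeling:

#Optional: download the files once and read them from disk on later runs
trainurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
validurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainurl, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(validurl, "pml-testing.csv")
training <- read.csv("pml-training.csv")
validation <- read.csv("pml-testing.csv")
#caret's train() expects a factor outcome; read.csv stopped creating factors in R 4.0
training$classe <- as.factor(training$classe)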

The output from the “Analyze Data” chunk has been left out due to its length. What it shows is that in the validation set, a number of variables contain no values at all. Those variables will be of no use to us in a prediction model, so we need to filter them out of the training set as well. The first 4 columns of the filtered data set (the row index and the timestamp/window fields) are descriptive variables, so they also cannot be considered as predictors.

#Remove the descriptive variables and unnecessary columns that have no entries in the validation set.
#Columns that are entirely NA in validation are read in as logical, so keeping only the numeric
#columns drops them (classe is retained because the numeric problem_id sits in that position)
keyvars <- sapply(validation, is.numeric)
training57 <- training[,keyvars]
#Drop the first 4 remaining columns (row index, timestamps, window number), which are descriptive
training53 <- training57[,-c(1:4)]
dim(training53)
## [1] 19622    53

The cleaned-up data set has 53 columns: 52 potential predictors plus the classe outcome.
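
As a quick sanity check (a small addition, assuming the objects created above), we can confirm that no missing values remain and that the outcome survived the filtering:

#Sanity check: expect zero NAs and classe still present
sum(is.na(training53))
"classe" %in% names(training53)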

Building the Prediction Model

Feature elimination

Some of the variables may be strongly correlated with each other, which adds redundancy and can reduce the accuracy of the prediction model, so I will find and eliminate the variables that are strongly correlated with each other. I will use a cutoff of r > 0.75 to determine which variables are strongly correlated. Of each highly correlated pair of variables, I will remove the variable with the largest mean absolute correlation.

#Create correlation matrix on data except classe variable
correlationMatrix <- cor(training53[,-53])

#Get the column indices of the features that are strongly correlated with one another
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)

#New data set without variables that had the largest mean absolute correlation
training32 <- training53[,-c(highlyCorrelated)]
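
A quick check (again assuming the objects above) shows how many predictors the correlation filter flagged; the object name reflects the expected result of 32 remaining columns (31 predictors plus classe):

#Number of variables removed and dimensions of the reduced data set
length(highlyCorrelated)
dim(training32)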

Training the model

To train a model, I will partition the data into a training set containing 60% of the observations and a testing set containing the remaining 40%. I will use 3-fold cross validation; as Prof. Gutierrez-Osuna notes, “For large datasets, even 3-Fold Cross Validation will be quite accurate.” http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf

I will then use the random forests method to train the model.

#Set seed for reproducibility
set.seed(12121)

#Create training and test partitions
intrain <- createDataPartition(y = training32$classe, p = 0.6, list = FALSE)
finaltrainingset <- training32[intrain,]
finaltestingset <- training32[-intrain,]

#Use 3 fold cross validation
control <- trainControl(method = "cv", number = 3)

#Create predictive model by random forests method
fitmod <- train(classe~., data = finaltrainingset, method = "rf", trControl = control) 
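
Before touching the testing partition, the cross-validated resampling accuracy can be inspected on the fitted object itself (output omitted here for the same length reasons as the exploratory chunks):

#Inspect the cross-validation results and the chosen tuning parameter (mtry)
print(fitmod)
fitmod$results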

Evaluating the results and out-of-sample error

Now we are ready to see the results of the prediction on the testing set.

#Use model to predict "classe" variable on the testing partition
pred <- predict(fitmod, finaltestingset)

#Print confusion matrix to see results of predictions
confusionMatrix(pred, finaltestingset$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2230   11    0    0    0
##          B    2 1500    9    0    0
##          C    0    7 1352   31    0
##          D    0    0    7 1255    6
##          E    0    0    0    0 1436
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9907          
##                  95% CI : (0.9883, 0.9927)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9882          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9881   0.9883   0.9759   0.9958
## Specificity            0.9980   0.9983   0.9941   0.9980   1.0000
## Pos Pred Value         0.9951   0.9927   0.9727   0.9897   1.0000
## Neg Pred Value         0.9996   0.9972   0.9975   0.9953   0.9991
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2842   0.1912   0.1723   0.1600   0.1830
## Detection Prevalence   0.2856   0.1926   0.1772   0.1616   0.1830
## Balanced Accuracy      0.9986   0.9932   0.9912   0.9870   0.9979

The confusion matrix shows an accuracy of 99.07% on the held-out testing partition, so the estimated out-of-sample error is 1 - 0.9907 = 0.93%.
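
The same figure can also be computed directly rather than read off the printout (a small addition, assuming the objects above):

#Estimated out-of-sample error = 1 - accuracy on the held-out partition
cm <- confusionMatrix(pred, finaltestingset$classe)
1 - as.numeric(cm$overall["Accuracy"])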

Variable importance

#Find the variable importance and print plot
important <- varImp(fitmod)
plot(important)

It is worth noting that some variables (e.g. yaw_belt) are of high importance and some (e.g. gyros_arm_z) are not. Other models could be retrained leaving out the less important variables if one desired to do so.
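
To see the ranking numerically rather than graphically, the print method for varImp objects accepts a top argument (a sketch, assuming the important object created above):

#List the 10 most important predictors
print(important, top = 10)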

Final Predictions

predict(fitmod, validation)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

These predictions scored 20/20 when submitted to the course quiz.

This report was completed as a course project for Practical Machine Learning, the 8th course in the Data Science Specialization offered through Coursera.