Executive Summary

The goal of this project is to predict whether a person is performing a dumbbell curl correctly, using only data collected from on-body sensors.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Building the model

Model selection

I decided to use a random forest model for this exercise because it has a reputation for performing well on classification problems. The variable classe is the outcome to be predicted.

classe variable

+ A - exercise performed correctly
+ B - throwing elbows to the front
+ C - lifting dumbbell halfway
+ D - lowering dumbbell halfway
+ E - throwing hips to front

The model predicts all five classes; the evaluation focuses on how well it identifies class A occurrences, the correctly performed exercise.

Cross validation

The training set is split 75% for training and 25% for testing (a single hold-out split, using p=0.75 in createDataPartition).

Model evaluation

The model will be evaluated using the confusion matrix statistics:

+ Sensitivity - the true positive rate
+ Specificity - the true negative rate

Install packages and load data.

Download the data and place it in the working directory.

library(caret)
library(dplyr)
train <- read.csv('pml-training.csv', na.strings = c("NA","#DIV/0!",""))
test <- read.csv('pml-testing.csv', na.strings = c("NA","#DIV/0!",""))
set.seed(101)

Cleaning the data

There are a lot of missing values, so we will keep only the columns that have more than 90% of their data available.

# keep columns with more than 90% non-missing values (this drops 100 columns)
train.noNA <- train[colSums(!is.na(train)) > (nrow(train) * .9)]
test <- test[colSums(!is.na(test)) > (nrow(test) * .9)]
print("Any missing training data?")

## [1] "Any missing training data?"

any(is.na(train.noNA))

## [1] FALSE

print("Any missing test data?")

## [1] "Any missing test data?"

any(is.na(test))

## [1] FALSE
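
The column filter above can be tried on a small made-up data frame. This is a minimal sketch: `toy` and its columns are hypothetical and not part of the project data. Note that the comparison is a strict `>`, so a column with exactly 90% of its values present is dropped.

```r
# Hypothetical toy data: 10 rows, with 0, 2 and 1 missing values per column
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  b = c(NA, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  c = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Count non-missing values per column and compare with 90% of the row count
keep <- colSums(!is.na(toy)) > nrow(toy) * 0.9

# drop = FALSE keeps the result a data frame even if one column survives
toy.kept <- toy[, keep, drop = FALSE]
names(toy.kept)
```

In the project data the distinction between "more than" and "at least" 90% should not matter if the columns are either complete or mostly empty, but it is worth knowing when reusing the filter elsewhere.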

Remove the first 7 columns (row index, participant name, timestamps, and window identifiers) because they are bookkeeping fields rather than sensor measurements.

# Remove the first 7 columns because they are not needed for calculation
train.clean <- train.noNA[,-c(1:7)]

Remove highly correlated predictors (absolute pairwise correlation above 0.75).

# get only numeric columns
num.cols <- sapply(train.clean, is.numeric)
# make correlation matrix
cor.data <- cor(train.clean[, num.cols])
# identify correlated predictors for removal, by name so the selection
# does not depend on column positions matching between cor.data and train.clean
high.cor <- findCorrelation(cor.data, cutoff = .75, names = TRUE)
train.cor <- train.clean[, !(names(train.clean) %in% high.cor)]
train.cor$classe <- as.factor(train.cor$classe)
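
To make the 0.75 cutoff concrete, the sketch below flags highly correlated pairs in a small made-up data frame using base R only. The data and the names `toy`, `toy.cor`, `high.pairs`, and `flagged` are hypothetical. It shows only the flagging step; findCorrelation additionally decides which member of each pair to drop, which this sketch does not attempt to reproduce.

```r
# Hypothetical toy predictors: x2 is an exact multiple of x1, x3 is unrelated
toy <- data.frame(
  x1 = 1:10,
  x2 = 2 * (1:10),
  x3 = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

toy.cor <- cor(toy)

# Blank out the diagonal and lower triangle so each pair is counted once
toy.cor[lower.tri(toy.cor, diag = TRUE)] <- NA

# Positions of pairs whose absolute correlation exceeds the cutoff
high.pairs <- which(abs(toy.cor) > 0.75, arr.ind = TRUE)

# Translate positions back into variable names
flagged <- data.frame(
  var1 = rownames(toy.cor)[high.pairs[, "row"]],
  var2 = colnames(toy.cor)[high.pairs[, "col"]]
)
flagged
```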

Split the cleaned training set into a training part and a hold-out testing part for validation.

# split train.cor into training and testing sets
inTrain <- createDataPartition(y=train.cor$classe,p=0.75,list=FALSE)
training <- train.cor[inTrain,]
testing <- train.cor[-inTrain,]
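
When the outcome is a factor, createDataPartition samples within each class so that the class proportions are roughly preserved in both parts. The base R sketch below mimics that idea on made-up data; it is an illustration under that assumption, not caret's actual implementation, and `toy`, `idx.by.class`, and `in.train` are hypothetical names.

```r
set.seed(101)

# Hypothetical toy data with unbalanced classes
toy <- data.frame(
  x = seq_len(200),
  classe = factor(rep(c("A", "B", "C", "D"), times = c(80, 60, 40, 20)))
)

# Row indices grouped by class
idx.by.class <- split(seq_len(nrow(toy)), toy$classe)

# Draw 75% of the rows from every class separately
in.train <- unlist(
  lapply(idx.by.class, function(idx) sample(idx, size = ceiling(length(idx) * 0.75))),
  use.names = FALSE
)

toy.training <- toy[in.train, ]
toy.testing <- toy[-in.train, ]

# Class proportions should match between the two parts
prop.table(table(toy.training$classe))
prop.table(table(toy.testing$classe))
```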

Build random forest model

# random forest
library(randomForest)
modelfit.rf <- randomForest(classe ~ ., data = training)

Predict on testing set

# predict
pred.rf <- predict(modelfit.rf,testing)

View confusion matrix

Based on the confusion matrix summary, for class A the sensitivity is 0.9986, the specificity is 0.9977, and the balanced accuracy is 0.9981. The overall accuracy across all five classes is 0.991.

cmatrix <- confusionMatrix(pred.rf,testing$classe)
cmatrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1393    8    0    0    0
##          B    1  936   10    0    0
##          C    0    4  838    8    0
##          D    0    0    5  795    3
##          E    1    1    2    1  898
## 
## Overall Statistics
##                                          
##                Accuracy : 0.991          
##                  95% CI : (0.988, 0.9935)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9886         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9986   0.9863   0.9801   0.9888   0.9967
## Specificity            0.9977   0.9972   0.9970   0.9980   0.9988
## Pos Pred Value         0.9943   0.9884   0.9859   0.9900   0.9945
## Neg Pred Value         0.9994   0.9967   0.9958   0.9978   0.9993
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2841   0.1909   0.1709   0.1621   0.1831
## Detection Prevalence   0.2857   0.1931   0.1733   0.1637   0.1841
## Balanced Accuracy      0.9981   0.9918   0.9886   0.9934   0.9977
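
The class A statistics can be recomputed by hand from the counts in the confusion matrix above. The sketch below uses base R only; the variable names (`cm`, `tp`, `fn`, `fp`, `tn`, and so on) are chosen here for illustration.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(1393,   8,   0,   0,   0,
       1, 936,  10,   0,   0,
       0,   4, 838,   8,   0,
       0,   0,   5, 795,   3,
       1,   1,   2,   1, 898),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Treat class A as "positive" and everything else as "negative"
tp <- cm["A", "A"]            # predicted A, actually A
fn <- sum(cm[, "A"]) - tp     # actually A, predicted something else
fp <- sum(cm["A", ]) - tp     # predicted A, actually something else
tn <- sum(cm) - tp - fn - fp  # neither predicted nor actually A

sens.A <- tp / (tp + fn)             # true positive rate
spec.A <- tn / (tn + fp)             # true negative rate
bal.acc.A <- (sens.A + spec.A) / 2   # balanced accuracy
accuracy <- sum(diag(cm)) / sum(cm)  # overall accuracy, all classes

round(c(sensitivity = sens.A, specificity = spec.A,
        balanced = bal.acc.A, accuracy = accuracy), 4)
```

The rounded values agree with the Class: A column and the overall accuracy reported by confusionMatrix.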

View model fit

The model fit plot shows that the error drops from about 0.15 to below 0.025 as the number of trees grows to 500.

# plot model fit
plot(modelfit.rf)

Final prediction

The random forest model performed well on the hold-out testing set (overall accuracy 0.991), so we will use it to predict the 20 cases in our test set.

final.pred <- predict(modelfit.rf, test, type = "class")
print(final.pred)

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

# add final predict to test set
test <- cbind(test,final.pred)

Peer-graded Assignment: Prediction Assignment Writeup

Wayne Tipton

January 7, 2018