The goal of this project is to predict whether a person is performing a dumbbell curl properly, using only data collected from on-body sensors.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform the lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
I decided to use a random forest model for this exercise because it has a reputation as a strong performer. The variable classe is the outcome being predicted.
classe variable
+ A - exercise performed correctly
+ B - throwing elbows to the front
+ C - lifting dumbbell halfway
+ D - lowering dumbbell halfway
+ E - throwing hips to the front
The model predicts all five classes; the main interest is in how well it identifies classe A occurrences (correct form).
The training set is split 75% for training and 25% for testing.
The model will be evaluated with a confusion matrix, using two statistics:
+ Sensitivity - the true positive rate
+ Specificity - the true negative rate
Download the data and place it in the working directory.
library(caret)
library(dplyr)
train <- read.csv('pml-training.csv', na.strings = c("NA","#DIV/0!",""))
test <- read.csv('pml-testing.csv', na.strings = c("NA","#DIV/0!",""))
set.seed(101)
There are a lot of missing values, so we will keep only the columns in which more than 90% of the values are present.
# keep columns with more than 90% of values present (drops 100 columns)
train.noNA <- train[colSums(!is.na(train))>(nrow(train)*.9)]
test <- test[colSums(!is.na(test))>(nrow(test)*.9)]
print("Any missing training data?")
## [1] "Any missing training data?"
any(is.na(train.noNA))
## [1] FALSE
print("Any missing test data?")
## [1] "Any missing test data?"
any(is.na(test))
## [1] FALSE
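The column filter above can be illustrated on a small data frame. This is a minimal sketch with made-up values; the names `toy`, `toy.kept`, and `new.data` are hypothetical and not part of the report. It also shows one way to select the same columns from a second data set by name, which avoids relying on both data sets having missing values in the same columns.

```r
# Toy data frame (hypothetical values) with one mostly-empty column
toy <- data.frame(
  roll   = c(1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.4, 1.0, 0.9),
  sparse = c(NA, NA, NA, 5, NA, NA, NA, NA, NA, NA),
  pitch  = c(10, 12, 11, 13, 9, 10, 12, 11, 10, 12)
)

# Count non-missing values per column and compare with 90% of the rows,
# using the same expression as the report
keep <- colSums(!is.na(toy)) > (nrow(toy) * 0.9)
toy.kept <- toy[, keep, drop = FALSE]
print(names(toy.kept))

# Hypothetical second data set: select the kept columns by name
new.data <- data.frame(roll = 1.1, sparse = NA, pitch = 11)
new.kept <- new.data[, intersect(names(toy.kept), names(new.data)), drop = FALSE]
print(names(new.kept))
```

Selecting by name is a design choice for the sketch; the report filters each data set independently.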
Remove the first 7 columns (row index, user name, timestamps, and window identifiers) because they are not needed for calculation.
# Remove the first 7 columns because they are not needed for calculation
train.clean <- train.noNA[,-c(1:7)]
Remove highly correlated predictors, using a correlation cutoff of .75.
# get only numeric columns
num.cols <- sapply(train.clean, is.numeric)
# make correlation matrix
cor.data <- cor(train.clean[, num.cols])
# identify correlated predictors for removal
high.cor <- findCorrelation(cor.data, cutoff = .75)
# drop by name, because high.cor indexes the numeric-only columns of cor.data
train.cor <- train.clean[, !(names(train.clean) %in% colnames(cor.data)[high.cor])]
train.cor$classe <- as.factor(train.cor$classe)
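To see what "highly correlated" means at a .75 cutoff, the sketch below lists the variable pairs whose absolute correlation exceeds it, using base R only. The data are simulated and the names `x1`, `x2`, `x3`, and `pairs.df` are hypothetical. This is not the procedure `findCorrelation` uses to decide which member of a pair to drop; it only shows which pairs would trigger a removal.

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1
x3 <- rnorm(n)                 # independent of the others
toy <- data.frame(x1, x2, x3)

cor.toy <- abs(cor(toy))
# Zero out the diagonal and lower triangle so each pair is counted once
cor.toy[lower.tri(cor.toy, diag = TRUE)] <- 0
high.pairs <- which(cor.toy > 0.75, arr.ind = TRUE)

# Name each pair that exceeds the cutoff
pairs.df <- data.frame(
  var1    = rownames(cor.toy)[high.pairs[, "row"]],
  var2    = colnames(cor.toy)[high.pairs[, "col"]],
  abs.cor = cor.toy[high.pairs]
)
print(pairs.df)
```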
Split the training set into a training portion (75%) and a hold-out portion (25%) for validation.
# split train.cor into training and testing sets
inTrain <- createDataPartition(y=train.cor$classe,p=0.75,list=FALSE)
training <- train.cor[inTrain,]
testing <- train.cor[-inTrain,]
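When the outcome is a factor, `createDataPartition` samples within each class so that the class mix is similar in both pieces. The base R sketch below imitates that idea on made-up labels; it is an approximation for illustration, not caret's implementation, and `labels`, `idx.by.class`, and `in.train` are hypothetical names.

```r
set.seed(101)  # fixed only so the sketch is reproducible
# Hypothetical labels with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class
idx.by.class <- split(seq_along(labels), labels)
in.train <- sort(unlist(lapply(idx.by.class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE))

# Class proportions are the same in the training and hold-out pieces
print(prop.table(table(labels[in.train])))
print(prop.table(table(labels[-in.train])))
```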
# random forest
library(randomForest)
modelfit.rf <- randomForest(classe ~.,training)
# predict
pred.rf <- predict(modelfit.rf,testing)
Based on the confusion matrix summary for class A, sensitivity is .9986, specificity is .9977, and balanced accuracy is .9981. Overall accuracy on the hold-out set is .991.
cmatrix <- confusionMatrix(pred.rf,testing$classe)
cmatrix
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1393    8    0    0    0
##          B    1  936   10    0    0
##          C    0    4  838    8    0
##          D    0    0    5  795    3
##          E    1    1    2    1  898
##
## Overall Statistics
##
##                Accuracy : 0.991
##                  95% CI : (0.988, 0.9935)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9886
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9986   0.9863   0.9801   0.9888   0.9967
## Specificity            0.9977   0.9972   0.9970   0.9980   0.9988
## Pos Pred Value         0.9943   0.9884   0.9859   0.9900   0.9945
## Neg Pred Value         0.9994   0.9967   0.9958   0.9978   0.9993
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2841   0.1909   0.1709   0.1621   0.1831
## Detection Prevalence   0.2857   0.1931   0.1733   0.1637   0.1841
## Balanced Accuracy      0.9981   0.9918   0.9886   0.9934   0.9977
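The class A statistics can be recomputed by hand from the counts in the confusion matrix above, which makes the definitions of sensitivity and specificity concrete. The sketch below uses base R only; the variable names (`cm`, `tp`, `fn`, `fp`, `tn`) are my own and the counts are copied from the printed output.

```r
# Confusion matrix counts copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(
  c(1393,   8,   0,   0,   0,
       1, 936,  10,   0,   0,
       0,   4, 838,   8,   0,
       0,   0,   5, 795,   3,
       1,   1,   2,   1, 898),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

cls <- "A"
tp <- cm[cls, cls]
fn <- sum(cm[, cls]) - tp   # class A cases predicted as another class
fp <- sum(cm[cls, ]) - tp   # other classes predicted as A
tn <- sum(cm) - tp - fn - fp

sensitivity <- tp / (tp + fn)          # true positive rate
specificity <- tn / (tn + fp)          # true negative rate
balanced.accuracy <- (sensitivity + specificity) / 2
accuracy <- sum(diag(cm)) / sum(cm)    # overall, across all classes

print(round(c(sensitivity = sensitivity, specificity = specificity,
              balanced.accuracy = balanced.accuracy, accuracy = accuracy), 4))
```

Changing `cls` to another letter gives the corresponding column of the "Statistics by Class" table.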
The model fit plot shows the error rate dropping from about .15 to below .025 as the forest grows to 500 trees.
# plot model fit
plot(modelfit.rf)
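The numbers behind that plot are stored in the fitted object's `err.rate` matrix, which has one row per tree and an "OOB" column for the out-of-bag error. The sketch below reads them directly. It fits a small forest on the built-in `iris` data as a stand-in, so the values will differ from the report's; `rf.toy`, `oob`, and `settled.at` are hypothetical names, and the 0.01 tolerance is an arbitrary choice.

```r
library(randomForest)

set.seed(101)  # fixed only so the sketch is reproducible
# iris is a stand-in data set, not the weight lifting data
rf.toy <- randomForest(Species ~ ., data = iris, ntree = 200)

# Out-of-bag error of the forest built from the first k trees, k = 1..200
oob <- rf.toy$err.rate[, "OOB"]
print(oob[c(1, 10, 50, 200)])

# First tree count after which the OOB error stays within 0.01
# of its final value
final.err <- oob[length(oob)]
outside <- which(abs(oob - final.err) > 0.01)
settled.at <- if (length(outside) == 0) 1 else max(outside) + 1
print(settled.at)
```

A check like this can suggest whether fewer trees would have been enough, though the default of 500 is cheap for data of this size.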
The random forest model performed well on the hold-out set, so we will use it to predict the test set.
final.pred <- predict(modelfit.rf, test, type = "class")
print(final.pred)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# add final predict to test set
test <- cbind(test,final.pred)