Background
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. Human activity recognition research has traditionally focused on discriminating between different activities, i.e. predicting “which” activity was performed at a specific point in time (as with the Daily Living Activities dataset). In this work we first define quality of execution and then consider three aspects of it: specifying correct execution, automatic and robust detection of execution mistakes, and how to provide feedback on the quality of execution to the user.
Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
• Class A: exactly according to the specification
• Class B: throwing the elbows to the front
• Class C: lifting the dumbbell only halfway
• Class D: lowering the dumbbell only halfway
• Class E: throwing the hips to the front
Goal
The goal of this project is to predict the manner in which the participants performed the exercise, recorded in the classe variable. This report describes how we built our model and how we used cross validation.
Libraries
The R libraries used for this analysis are:
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(gridExtra)
library(dplyr)
library(lattice)
library(ggplot2)
library(cluster)
Data Loading
The data for this project originated from the following source: http://groupware.les.inf.puc-rio.br/har.
Initial loading and reading of the data is as follows:
URLtraining <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URLtesting <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url = URLtraining, destfile = "training.csv",method = "curl")
download.file(url = URLtesting, destfile = "testing.csv",method = "curl")
Training <- read.csv("training.csv",header = TRUE, stringsAsFactors=FALSE, na.strings=c("NA","#DIV/0!",""))
Testing <- read.csv("testing.csv",header = TRUE, stringsAsFactors=FALSE, na.strings=c("NA","#DIV/0!",""))
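The effect of the na.strings argument can be checked on a small inline example. The sketch below uses made-up CSV text (an assumption for illustration, not the project data) to show that "NA", "#DIV/0!", and empty fields are all read as missing values.

```r
# Made-up CSV text standing in for the real files (assumption for illustration).
csv_text <- "a,b\n1,#DIV/0!\n,2\nNA,3"

toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE,
                na.strings = c("NA", "#DIV/0!", ""))

# All three markers become NA, so both columns are parsed as numbers, not text.
print(toy)
print(sapply(toy, class))
print(colSums(is.na(toy)))
```

Without "#DIV/0!" in na.strings, a column containing that marker would be read as text and would need to be converted before modelling.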
Data Pre-processing
Next, we reduce the data. Columns that contain missing values (NA) are dropped, the first seven columns (row index, user name, timestamps, and window identifiers), which are not sensor measurements, are removed, and the remaining predictors are checked for near-zero variance.
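# Drop every column that contains at least one missing value
Training <- Training[, colSums(is.na(Training)) == 0]
Testing <- Testing[, colSums(is.na(Testing)) == 0]
# Drop the first seven columns (row index, user, timestamps, window identifiers)
Training <- Training[, -c(1:7)]
Testing <- Testing[, -c(1:7)]
# Near-zero variance check; separate names so the second result does not overwrite the first
nzvTraining <- nearZeroVar(Training, saveMetrics = TRUE)
nzvTesting <- nearZeroVar(Testing, saveMetrics = TRUE)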
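The code above filters Training and Testing independently, which works only if the same columns happen to be complete in both files. One possible alternative, sketched here on made-up toy data frames, is to choose the columns from the training data and reuse that selection for the test data. The names toy_train, toy_test, complete_cols, and predictors are hypothetical and used only for illustration.

```r
# Toy stand-ins for the Training and Testing data frames (made-up values).
toy_train <- data.frame(id = 1:4,
                        roll = c(0.1, 0.2, 0.3, 0.4),
                        kurt = c(NA, 1.5, NA, NA),
                        classe = c("A", "B", "A", "C"),
                        stringsAsFactors = FALSE)
toy_test <- data.frame(id = 5:6,
                       roll = c(0.5, 0.6),
                       kurt = c(NA_real_, NA_real_),
                       problem_id = 1:2)

# Keep only the training columns that have no missing values.
complete_cols <- names(toy_train)[colSums(is.na(toy_train)) == 0]
toy_train <- toy_train[, complete_cols]

# Reuse the training predictors (everything except the outcome) for the test
# data, so both data frames share exactly the same predictor columns.
predictors <- setdiff(complete_cols, "classe")
toy_test <- toy_test[, predictors]

print(names(toy_train))
print(names(toy_test))
```

Selecting by name rather than by position also keeps the two data frames aligned if the column order ever differs between files.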
Slicing the data:
We split the cleaned “Training” data into a new training subset (70%) and a held-out validation subset (30%), which is named “testing” in the code. The held-out subset is reserved for validating the models in later steps.
set.seed(42)
inTrain <- createDataPartition(y=Training$classe, p = 0.7, list = FALSE)
training <- Training[inTrain,]
testing <- Training[-inTrain,]
dim(training)
## [1] 13737 53
dim(testing)
## [1] 5885 53
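To make the idea of a stratified 70/30 split concrete, here is a minimal base R sketch that samples within each class so that class proportions are preserved. The helper stratified_split is hypothetical and is not caret's implementation; createDataPartition may differ in details such as rounding.

```r
# Hypothetical helper (not part of caret): stratified split in base R.
stratified_split <- function(y, p = 0.7) {
  # Row indices grouped by class label.
  idx_by_class <- split(seq_along(y), y)
  # Sample a fraction p of the rows within each class.
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
y <- rep(c("A", "B", "C"), times = c(50, 30, 20))  # made-up labels
in_train <- stratified_split(y, p = 0.7)

# Class proportions are the same in both parts of the split.
print(table(y[in_train]))
print(table(y[-in_train]))
```

A plain random sample of rows would give similar proportions on average, but sampling within each class avoids an unlucky split in which a rare class is under-represented.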
Locating Relevant Features
The following bar chart shows how many observations fall into each class in the training set:
# classe is a discrete label, so a bar chart of counts is used rather than a density
ggplot(training, aes(x = classe, fill = classe)) + geom_bar()
Data Modeling: rpart model
We used an rpart model to build a classification tree for activity recognition because it selects important variables automatically and is generally robust to correlated covariates and outliers.
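set.seed(42)
treeFit <- rpart(classe ~ ., method = "class", data = training)
# Predictions on the training subset itself, so the accuracy below is in-sample
treePredict <- predict(treeFit, training, type = "class")
# classe was read as character; confusionMatrix() expects factors
confusionmatrix_rp <- confusionMatrix(treePredict, factor(training$classe))
confusionmatrix_rp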
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3488 393 32 98 34
## B 99 1472 210 173 184
## C 115 362 1953 365 323
## D 131 192 138 1406 136
## E 73 239 63 210 1848
##
## Overall Statistics
##
## Accuracy : 0.7401
## 95% CI : (0.7327, 0.7474)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6711
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8930 0.5538 0.8151 0.6243 0.7319
## Specificity 0.9433 0.9399 0.8973 0.9480 0.9478
## Pos Pred Value 0.8623 0.6885 0.6264 0.7019 0.7596
## Neg Pred Value 0.9569 0.8977 0.9583 0.9279 0.9401
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2539 0.1072 0.1422 0.1024 0.1345
## Detection Prevalence 0.2945 0.1556 0.2270 0.1458 0.1771
## Balanced Accuracy 0.9182 0.7468 0.8562 0.7862 0.8399
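The headline statistics in the output above can be recomputed by hand from the confusion matrix, which helps in reading the table. The sketch below copies the counts from the rpart output (rows are predictions, columns are reference classes) and uses only base R. The variable names are chosen for illustration.

```r
# Confusion matrix copied from the rpart output above
# (rows = predicted class, columns = reference class).
cm <- matrix(c(3488,  393,   32,   98,   34,
                 99, 1472,  210,  173,  184,
                115,  362, 1953,  365,  323,
                131,  192,  138, 1406,  136,
                 73,  239,   63,  210, 1848),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

n <- sum(cm)

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement.
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions divided by the reference totals.
sensitivity <- diag(cm) / colSums(cm)

print(round(c(accuracy = accuracy, kappa = cohen_kappa), 4))
print(round(sensitivity, 4))
```

The total of 13737 observations matches the size of the training subset, which confirms that this matrix describes in-sample performance rather than performance on the held-out 30%.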
Prediction with Decision Trees
fancyRpartPlot(treeFit, main = "Decision Tree",
sub = "Rpart Decision Tree To Predict Classe", cex=0.3, cex.main = 2)
Data Modeling: Random Forest Model
Because the rpart tree reached only about 74% accuracy, and the classes appear to depend on finer gradations in the predictors than a single tree captures, we tested a random forest model to see whether it suits this project better.
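set.seed(40)
# A fresh 70/30 split of the cleaned Training data
inTrain <- createDataPartition(Training$classe, p = 0.70, list = FALSE)
trainData <- Training[inTrain, ]
testData <- Training[-inTrain, ]
# classe was read as character, so it is converted to a factor for classification
fit <- randomForest(as.factor(classe) ~ ., data = trainData, importance = TRUE, proximity = TRUE)
# Predictions on trainData itself, so the accuracy below is in-sample
prediction_rf <- predict(fit, trainData, type = "response")
confusionmatrix_rf <- confusionMatrix(prediction_rf, as.factor(trainData$classe))
confusionmatrix_rf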
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3906 0 0 0 0
## B 0 2658 0 0 0
## C 0 0 2396 0 0
## D 0 0 0 2252 0
## E 0 0 0 0 2525
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
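The confusion matrix above is computed on trainData, the same rows the forest was grown from, so perfect accuracy is expected and says little about new data. A minimal sketch of the in-sample versus held-out comparison follows. It uses the built-in iris data as a stand-in (an assumption, so the block runs without the project files); in this project the analogous step would be to predict on testData.

```r
library(randomForest)

# iris is used here only as a small stand-in for the cleaned Training data.
set.seed(40)
n <- nrow(iris)
train_idx <- sample.int(n, size = round(0.7 * n))
toy_train <- iris[train_idx, ]
toy_valid <- iris[-train_idx, ]

toy_fit <- randomForest(Species ~ ., data = toy_train, ntree = 200)

# In-sample accuracy: predictions on the rows the forest was grown from.
acc_train <- mean(predict(toy_fit, toy_train) == toy_train$Species)

# Held-out accuracy: predictions on rows the forest has never seen.
acc_valid <- mean(predict(toy_fit, toy_valid) == toy_valid$Species)

print(c(in_sample = acc_train, held_out = acc_valid))
```

The held-out figure is the one to report as an estimate of out-of-sample accuracy.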
The variable importance plot below shows which predictors contribute most to the random forest.
varImpPlot(fit, main="Random Forest Variable Importance")
The plot below shows the error curves of the forest: the out-of-bag (OOB) error and the per-class error rates as a function of the number of trees.
plot(fit)
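The curves drawn by plot(fit) come from the err.rate matrix stored in the fitted object, so the final out-of-bag error can also be read off numerically. The sketch below again uses iris as a stand-in (an assumption for illustration); toy_fit and oob_final are hypothetical names.

```r
library(randomForest)

set.seed(40)
toy_fit <- randomForest(Species ~ ., data = iris, ntree = 200)

# err.rate has one row per number of trees: the OOB error, then one column per class.
print(colnames(toy_fit$err.rate))

# OOB error after all trees have been grown (the right-hand end of the curve).
oob_final <- toy_fit$err.rate[toy_fit$ntree, "OOB"]
print(oob_final)
```

Because each tree is evaluated on the observations left out of its bootstrap sample, the OOB error is an estimate of out-of-sample error that does not require a separate validation set.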
Summary
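The random forest was a better model than rpart for predicting exercise quality. It reached over 99% accuracy, against about 74% for the rpart tree, and fitted other subsamples of the data well. Note that the confusion matrices shown above are computed on the training subsets, so they overstate out-of-sample performance; the held-out 30% gives the more reliable estimate.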
More generally, when evaluating movement-tracking devices it is important to consider how the data are gathered, which errors are predictable, and how good the measurements are. This project gives us an idea of the kind of information about exercise quality that can be collected and analysed with this type of device.