The objective of this project is to recognize the quality of exercises performed while wearing activity-tracking devices such as the Fitbit or Nike FuelBand. The subjects were asked to perform the Dumbbell Biceps Curl in several different ways, and the sensors measured the orientation of the movements performed.
Here is the description of the meaning of the response values from the original paper: “…exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.”
First, we read the CSV files from the URLs provided on the assignment page.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(RCurl)
## Loading required package: bitops
set.seed(1111)
train <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",stringsAsFactors=FALSE, na.strings=c("NA","","#DIV/0!"))
test <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",stringsAsFactors=FALSE, na.strings=c("NA","","#DIV/0!"))
Next, we remove variables from the training dataset that do not contribute to the final model: 1) character variables such as ‘user_name’ and the date/timestamp variables, and 2) variables that have near-zero variance (using nearZeroVar()) or more than 90% missing values.
We then remove exactly the same variables from the testing dataset, as in the sketch below.
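The cleaning code itself was not included in this write-up; the following is a minimal sketch of how train.clean and test.clean could be built from the description above (the identifier/timestamp column names and the 90% threshold are assumptions):
# Drop identifier and timestamp columns (column names assumed from the raw CSV)
id.cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
             "cvtd_timestamp", "new_window")
train.clean <- train[, !(names(train) %in% id.cols)]
# Drop columns with more than 90% missing values
mostly.na <- colMeans(is.na(train.clean)) > 0.9
train.clean <- train.clean[, !mostly.na]
# Drop near-zero-variance columns
nzv <- nearZeroVar(train.clean)
if (length(nzv) > 0) train.clean <- train.clean[, -nzv]
# Keep the same predictor columns in the testing dataset
test.clean <- test[, names(test) %in% names(train.clean)]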
Once the mostly-missing and near-zero-variance variables are removed, we split the original training dataset into a training part (3/4) and a testing part (1/4). The training part is used to fit the model and the testing part is used to validate the fit.
# split the cleaned training data into training and testing subsets for validation
testindex <- sample(1:nrow(train.clean), nrow(train.clean) / 4)
train.ds <- train.clean[-testindex, ]
test.ds <- train.clean[testindex, ]
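As a design note, caret’s createDataPartition() could be used instead of sample() to make a stratified split that preserves the class proportions of classe in both subsets. A sketch (not used in this analysis):
# stratified 75/25 split on the response (alternative to the random split above)
inTrain <- createDataPartition(train.clean$classe, p = 0.75, list = FALSE)
train.ds <- train.clean[inTrain, ]
test.ds <- train.clean[-inTrain, ]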
For the machine learning algorithm, we start with a classification tree as our baseline. A classification tree is used to predict a categorical response. At the root, a predictor is chosen based on its importance and split into two parts; depending on the yes/no answer, we move to the left or right child node. At each node another predictor is chosen and the same process is repeated, continuing until we reach a leaf, at which point the final decision is made for that leaf.
We use the rpart method, through caret’s train() function, to build our classification tree:
fit.rp <- train(factor(classe) ~ ., method="rpart", data = train.ds,
trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE)
)
## Loading required package: rpart
cm.rp = confusionMatrix(predict(fit.rp, test.ds), test.ds$classe)
cm.rp
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1247 245 130 141 35
## B 25 326 23 144 58
## C 139 379 662 489 262
## D 0 0 0 0 0
## E 3 0 0 42 555
##
## Overall Statistics
##
## Accuracy : 0.5688
## 95% CI : (0.5548, 0.5827)
## No Information Rate : 0.2883
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4496
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8819 0.34316 0.8123 0.0000 0.6099
## Specificity 0.8422 0.93679 0.6897 1.0000 0.9887
## Pos Pred Value 0.6935 0.56597 0.3428 NaN 0.9250
## Neg Pred Value 0.9463 0.85586 0.9486 0.8336 0.9175
## Prevalence 0.2883 0.19368 0.1662 0.1664 0.1855
## Detection Rate 0.2542 0.06646 0.1350 0.0000 0.1131
## Detection Prevalence 0.3666 0.11743 0.3937 0.0000 0.1223
## Balanced Accuracy 0.8620 0.63997 0.7510 0.5000 0.7993
The estimated out-of-sample error is correspondingly high.
# Out of Sample Error
OutofsampleError = 1- cm.rp$overall[1]
names(OutofsampleError ) = "OutofsampleError"
OutofsampleError
## OutofsampleError
## 0.4311927
Note that 10-fold cross-validation was applied (via trainControl) when fitting the tree. The confusion matrix shows an accuracy of about 0.57, which is quite low. We could try to improve the tree by pruning its branches or by tuning its parameters (a sketch of such tuning follows), but we decide instead to move on to another machine learning algorithm, Random Forest.
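For illustration only, here is a minimal sketch of how the tree’s complexity parameter (cp) could be tuned with caret; the tuneLength value is an assumption and this model was not fit in this analysis:
# try a larger grid of cp values instead of switching algorithms (sketch only)
fit.rp.tuned <- train(factor(classe) ~ ., method = "rpart", data = train.ds,
                      tuneLength = 10,
                      trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE))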
Random Forest uses a similar technique to classification/decision trees, but as an ensemble learning method: it grows many trees, each on a bootstrap sample of the data and considering only a small random subset of the predictors at each split, and then combines all of these ‘weak learners’ by majority vote to end up with a strong learner.
The model below creates 500 trees and randomly picks 7 candidate variables at each split (the randomForest defaults for this dataset).
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
rf.fit <- randomForest(factor(classe)~., data=train.ds,proximity=T )
rf.pred = predict(rf.fit,test.ds)
rf.cm = confusionMatrix(rf.pred, test.ds$classe)
rf.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1413 2 0 0 0
## B 0 948 0 0 0
## C 0 0 815 3 0
## D 0 0 0 812 1
## E 1 0 0 1 909
##
## Overall Statistics
##
## Accuracy : 0.9984
## 95% CI : (0.9968, 0.9993)
## No Information Rate : 0.2883
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9979
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9979 1.0000 0.9951 0.9989
## Specificity 0.9994 1.0000 0.9993 0.9998 0.9995
## Pos Pred Value 0.9986 1.0000 0.9963 0.9988 0.9978
## Neg Pred Value 0.9997 0.9995 1.0000 0.9990 0.9997
## Prevalence 0.2883 0.1937 0.1662 0.1664 0.1855
## Detection Rate 0.2881 0.1933 0.1662 0.1655 0.1853
## Detection Prevalence 0.2885 0.1933 0.1668 0.1657 0.1857
## Balanced Accuracy 0.9994 0.9989 0.9996 0.9974 0.9992
The estimated out-of-sample error is very low.
# Out of Sample Error
OutofsampleError = 1- rf.cm$overall[1]
names(OutofsampleError ) = "OutofsampleError"
OutofsampleError
## OutofsampleError
## 0.001630989
Using the variable importance feature of the randomForest() function, we look at a few plots that show the relationship between the predictors and the response. These plots show the relationships between the variables that Random Forest determined to have high importance; the colors correspond to the response values. Here we see that, for different ranges of the num_window variable, the response classes (A–E) are nicely clustered. Random Forest identifies these relationships and builds an ensemble model from them.
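The plotting code was not included in the write-up; a minimal sketch of how such plots could be produced is shown here (the choice of roll_belt as a second high-importance predictor is an assumption):
# Top predictors ranked by importance in the fitted forest
varImpPlot(rf.fit, n.var = 10, main = "Top 10 predictors by importance")
# Relationship between two high-importance predictors, coloured by the response
library(ggplot2)
qplot(num_window, roll_belt, data = train.ds, colour = factor(classe))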
The confusion matrix shows an accuracy of about 0.998, which is very high compared to the low accuracy of the classification tree. The per-class sensitivity and specificity are also very high. This shows that Random Forest combines many weak learners into a strong ensemble model.
Also, the additional test dataset provided for the course project submission is correctly predicted by the RF model (according to the submission feedback!).
predict(rf.fit,test.clean)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E