This report is the final project for the Coursera Practical Machine Learning course.
The data is from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
The data were collected from sensor devices such as the Jawbone Up, Nike FuelBand, and Fitbit, attached to the belt, forearm, arm, and dumbbell of 6 participants. The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The objective of this project is to predict the manner in which the participants did the exercise. The variable I am predicting is called “classe”.
Our outcome variable “classe” is a factor variable with 5 levels. For this dataset, participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions:
exactly according to the specification (Class A)
throwing the elbows to the front (Class B)
lifting the dumbbell only halfway (Class C)
lowering the dumbbell only halfway (Class D)
throwing the hips to the front (Class E)
The report will cover how the models are built, cross validation, the out-of-sample error, and the prediction of the outcome for 20 different test cases.
# Load datasets. Note that "" and "NA" both mark missing values in this dataset
train <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", header=TRUE, stringsAsFactors = TRUE, na.strings = c("","NA"))
test <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", header=TRUE, stringsAsFactors = TRUE, na.strings = c("","NA"))
First, let’s check for any missing package dependencies and install those that are not available, then take a look at the data summary and check each column for NA values.
# Check for missing dependencies and load necessary R packages
if(!require(caret)){install.packages('caret')}; library(caret)
if(!require(rattle)){install.packages('rattle')}; library(rattle)
if(!require(randomForest)){install.packages('randomForest')}; library(randomForest)
if(!require(MASS)){install.packages('MASS')}; library(MASS)
if(!require(ggplot2)){install.packages('ggplot2')}; library(ggplot2)
# Check summary of Train & Test data
# summary(train); str(train); head(train); summary(test); str(test); head(test)
# Check NA for each columns in Train
#sapply(train,function(x) sum(is.na(x)))
Notice that there are many columns with NA values. We remove these columns, as well as the identifier and timestamp columns.
# Remove NA columns for Train
train2 <- train[ , apply(train, 2, function(x) !any(is.na(x)))]
# Remove the identifier, timestamp, and window columns (columns 1-7)
train2 <- train2[,8:60]
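As a quick sanity check (a minimal sketch, not part of the original output), we can confirm how many columns remain after cleaning and inspect the distribution of the “classe” outcome.
# Sanity check: dimensions of the cleaned data and the class distribution
dim(train2)
table(train2$classe)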
Split the train data into 60% for training the models and 40% for testing them. The model with the highest accuracy on the held-out testing set will be used to predict the final outcome for the 20 test cases.
# Create Index for training
IndexTrain <- createDataPartition(y=train2$classe, p=0.6, list=FALSE)
training <- train2[IndexTrain,]
testing <- train2[-IndexTrain,]
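Note that createDataPartition samples randomly, so calling set.seed() before the split (any fixed value, e.g. 123, an illustrative choice rather than the value used in the original run) would make the partition reproducible. A quick check of the resulting split:
# Verify the approximate 60/40 split and that class proportions are preserved
nrow(training); nrow(testing)
round(prop.table(table(training$classe)), 3)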
Using the train function in the caret package, we set method=“rpart” and train the Decision Tree model with the training data.
# Train Tree Model
tree1 <- train(classe~., method="rpart", data=training)
tree1$finalModel
## n= 11776
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 11776 8428 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 10773 7435 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.95 924 5 A (0.99 0.0054 0 0 0) *
## 5) pitch_forearm>=-33.95 9849 7430 A (0.25 0.23 0.21 0.2 0.12)
## 10) yaw_belt>=169.5 505 48 A (0.9 0.046 0 0.044 0.0059) *
## 11) yaw_belt< 169.5 9344 7093 B (0.21 0.24 0.22 0.2 0.13)
## 22) magnet_dumbbell_z< -87.5 1236 537 A (0.57 0.29 0.049 0.076 0.023) *
## 23) magnet_dumbbell_z>=-87.5 8108 6115 C (0.16 0.23 0.25 0.22 0.14)
## 46) pitch_belt< -42.95 464 77 B (0.0065 0.83 0.11 0.022 0.026) *
## 47) pitch_belt>=-42.95 7644 5703 C (0.16 0.2 0.25 0.24 0.15)
## 94) magnet_dumbbell_x>=-447.5 3282 2289 B (0.17 0.3 0.095 0.24 0.19)
## 188) roll_belt< 117.5 2058 1171 B (0.17 0.43 0.025 0.13 0.25) *
## 189) roll_belt>=117.5 1224 707 D (0.18 0.087 0.21 0.42 0.1) *
## 95) magnet_dumbbell_x< -447.5 4362 2732 C (0.16 0.12 0.37 0.24 0.11) *
## 3) roll_belt>=130.5 1003 10 E (0.01 0 0 0 0.99) *
# Plot Tree Model
fancyRpartPlot(tree1$finalModel, tweak=1.5)
# Predictions using Testing dataset
tree.pred <- predict(tree1, newdata = testing)
# ConfusionMatrix for Tree Model
tree.confuse <- confusionMatrix(tree.pred, testing$classe)
tree.confuse
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1374 253 38 81 22
## B 240 880 51 202 352
## C 471 324 1081 663 353
## D 143 61 198 340 77
## E 4 0 0 0 638
##
## Overall Statistics
##
## Accuracy : 0.5497
## 95% CI : (0.5386, 0.5608)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.435
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6156 0.5797 0.7902 0.26439 0.44244
## Specificity 0.9298 0.8665 0.7204 0.92698 0.99938
## Pos Pred Value 0.7771 0.5101 0.3738 0.41514 0.99377
## Neg Pred Value 0.8588 0.8958 0.9421 0.86538 0.88840
## Prevalence 0.2845 0.1935 0.1744 0.16391 0.18379
## Detection Rate 0.1751 0.1122 0.1378 0.04333 0.08132
## Detection Prevalence 0.2253 0.2199 0.3686 0.10438 0.08183
## Balanced Accuracy 0.7727 0.7231 0.7553 0.59568 0.72091
Based on the confusionMatrix, we can see that the accuracy of the Decision Tree model is 0.5497069.
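The accuracy, and the implied out-of-sample error estimate, can be pulled directly from the confusionMatrix object, as in this short sketch:
# Accuracy on the held-out testing set and the implied out-of-sample error
tree.confuse$overall[["Accuracy"]]
1 - tree.confuse$overall[["Accuracy"]]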
For the Random Forest model, mtry was tuned manually: the value with the lowest out-of-bag (OOB) error is used to train the final model.
# Manually tune mtry: fit a forest for each candidate value and record its
# out-of-bag (OOB) error after the default 500 trees (stored in mse.rfs)
mse.rfs <- rep(0, 13)
for(m in 1:13){
  set.seed(123)
  rf <- randomForest(classe ~ ., data=training, mtry=m)
  # err.rate[500] is the OOB error rate at the final (500th) tree
  mse.rfs[m] <- rf$err.rate[500]
}
# Plot OOB Error for each mtry
plot(1:13, mse.rfs, type="b", xlab="mtry", ylab="OOB Error")
mse.rfs
## [1] 0.015030571 0.009595788 0.008661685 0.007387908 0.006623641
## [6] 0.007218071 0.006878397 0.006623641 0.007218071 0.006453804
## [11] 0.006538723 0.006963315 0.007133152
optimal.mtry <- which.min(mse.rfs)
# Train randomForest Model with optimal mtry
rf1 <- randomForest(classe~., data=training, mtry=optimal.mtry)
rf1
##
## Call:
## randomForest(formula = classe ~ ., data = training, mtry = optimal.mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 0.68%
## Confusion matrix:
## A B C D E class.error
## A 3342 3 0 1 2 0.001792115
## B 20 2252 7 0 0 0.011847301
## C 0 10 2040 4 0 0.006815969
## D 0 0 20 1908 2 0.011398964
## E 0 0 3 8 2154 0.005080831
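As an alternative to the manual out-of-bag loop above, caret can tune mtry with k-fold cross-validation. The sketch below is illustrative only; the fold count and tuning grid are assumptions, not part of the original analysis.
# Sketch: tune mtry with 5-fold cross-validation via caret (illustrative)
ctrl <- trainControl(method = "cv", number = 5)
rf.cv <- train(classe ~ ., data = training, method = "rf",
               trControl = ctrl, tuneGrid = data.frame(mtry = c(2, 5, 10, 13)))
rf.cv$bestTune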
# Predictions using Testing dataset
rf.pred <- predict(rf1, newdata = testing)
# ConfusionMatrix for Random Forest Model
rf.confuse <- confusionMatrix(rf.pred, testing$classe)
rf.confuse
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2224 14 0 0 0
## B 8 1503 4 0 0
## C 0 1 1359 16 2
## D 0 0 5 1266 1
## E 0 0 0 4 1439
##
## Overall Statistics
##
## Accuracy : 0.993
## 95% CI : (0.9909, 0.9947)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9911
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9901 0.9934 0.9844 0.9979
## Specificity 0.9975 0.9981 0.9971 0.9991 0.9994
## Pos Pred Value 0.9937 0.9921 0.9862 0.9953 0.9972
## Neg Pred Value 0.9986 0.9976 0.9986 0.9970 0.9995
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1916 0.1732 0.1614 0.1834
## Detection Prevalence 0.2852 0.1931 0.1756 0.1621 0.1839
## Balanced Accuracy 0.9970 0.9941 0.9952 0.9918 0.9986
Based on the confusionMatrix, we can see that the accuracy of the Random Forest model is 0.9929901.
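For additional insight (not part of the original report), the fitted forest’s variable importance can be inspected; this is a brief sketch using randomForest’s built-in Gini-based measure.
# Top predictors by MeanDecreaseGini from the fitted random forest
imp <- importance(rf1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)
varImpPlot(rf1, n.var = 10)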
For the Gradient Boosting model, the train function in the caret package was used with method=“gbm”. Setting verbose=FALSE suppresses the fitting messages.
# Train Gradient Boosting Model
gbm <- train(classe~., method="gbm", data=training, verbose=FALSE)
gbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 43 had non-zero influence.
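The tuning parameters that caret’s default grid search settled on can be read from the train object (a short sketch):
# Tuning parameters selected by caret for the boosted model
gbm$bestTune
plot(gbm)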
# Predictions using Testing dataset
gbm.pred <- predict(gbm, newdata = testing)
# ConfusionMatrix for Gradient Boosting Model
gbm.confuse <- confusionMatrix(gbm.pred, testing$classe)
gbm.confuse
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2191 58 0 2 2
## B 33 1420 32 5 15
## C 4 35 1312 43 15
## D 3 3 22 1227 19
## E 1 2 2 9 1391
##
## Overall Statistics
##
## Accuracy : 0.9611
## 95% CI : (0.9566, 0.9653)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9508
## Mcnemar's Test P-Value : 6.701e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9816 0.9354 0.9591 0.9541 0.9646
## Specificity 0.9890 0.9866 0.9850 0.9928 0.9978
## Pos Pred Value 0.9725 0.9435 0.9312 0.9631 0.9900
## Neg Pred Value 0.9927 0.9845 0.9913 0.9910 0.9921
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2793 0.1810 0.1672 0.1564 0.1773
## Detection Prevalence 0.2872 0.1918 0.1796 0.1624 0.1791
## Balanced Accuracy 0.9853 0.9610 0.9720 0.9735 0.9812
Based on the confusionMatrix, we can see that the accuracy of the Gradient Boosting model is 0.9611267.
Based on the summary table below, we can see that the model with the best accuracy is the Random Forest model. This model will be used to predict the final class for the 20 cases in the test data.
# Create table for comparison of Accuracy
table1 <- data.frame(
  Model = c("Random Forest", "Gradient Boosting", "Decision Tree"),
  Accuracy = c(rf.confuse$overall[[1]], gbm.confuse$overall[[1]], tree.confuse$overall[[1]]),
  "ConfInv 95 Lower" = c(rf.confuse$overall[[3]], gbm.confuse$overall[[3]], tree.confuse$overall[[3]]),
  "ConfInv 95 Upper" = c(rf.confuse$overall[[4]], gbm.confuse$overall[[4]], tree.confuse$overall[[4]])
)
table1
## Model Accuracy ConfInv.95.Lower ConfInv.95.Upper
## 1 Random Forest 0.9929901 0.9908853 0.9947149
## 2 Gradient Boosting 0.9611267 0.9566112 0.9652953
## 3 Decision Tree 0.5497069 0.5386176 0.5607591
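The expected out-of-sample error for each model can be estimated as one minus its accuracy on the held-out 40% testing set; the sketch below reuses the confusionMatrix objects computed above.
# Estimated out-of-sample error = 1 - held-out accuracy
data.frame(
  Model = c("Random Forest", "Gradient Boosting", "Decision Tree"),
  OutOfSampleError = 1 - c(rf.confuse$overall[["Accuracy"]],
                           gbm.confuse$overall[["Accuracy"]],
                           tree.confuse$overall[["Accuracy"]])
)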
Applying the trained Random Forest model to the original test data, we obtain the predicted classes shown below.
# Predict outcome on the original Testing data set using Random Forest model
predictfinal <- predict(rf1, newdata=test, type="class")
predictfinal
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
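For the course submission, each prediction can be written to its own text file. The helper below is a hypothetical sketch (the function name and file naming scheme are assumptions, not from the original report).
# Hypothetical helper: write one prediction per file for submission
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(predictfinal)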