fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile = "./pml_training.csv")
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl,destfile = "pml_testing.csv")
## Load the required packages
lib_names <- c("tidyverse", "caret", "corrplot", "gbm", "forecast", "rattle", "stringr", "rebus", "rpart")
invisible(lapply(lib_names, library, character.only = TRUE))
The classe variable describes how the exercise was performed: (Class A) exactly according to the specification, (Class B) throwing the elbows to the front, (Class C) lifting the dumbbell only halfway, (Class D) lowering the dumbbell only halfway, and (Class E) throwing the hips to the front.
df <- read.csv("pml_training.csv")
test <- read.csv("pml_testing.csv")
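As a quick sanity check (an addition to the original analysis), we can look at how the five classes described above are distributed in the training data:

## Proportion of observations per class in the raw training data
round(prop.table(table(df$classe)), 3)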
First, we will remove variables that are irrelevant to the study, such as names and timestamps. We also use a near-zero-variance filter to weed out more variables. Lastly, we could substitute the features containing many NAs with 0, but that would just create more near-zero-variance variables, so we delete these features from our dataset instead.
## Irrelevant variables: drop the row index, user names, and timestamps
pattern <- c("X", "user_name", "timestamp")
irrIndex <- str_detect(names(df), pattern = or1(pattern))
df <- df[, !irrIndex]
test <- test[, !irrIndex]
## Near-zero-variance variables
nzvIndex <- nzv(df, saveMetrics = TRUE)$nzv
df <- df[, !nzvIndex]
test <- test[, !nzvIndex]
## Drop columns containing NA values
naIndex <- colSums(is.na(df)) != 0
df <- df[, !naIndex]
test <- test[, !naIndex]
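To justify dropping these columns rather than imputing them, we can inspect how incomplete they actually are (a small check added for illustration; it re-reads the raw file, since df has already been cleaned at this point):

## Share of NA values in each NA-containing column of the raw data;
## these columns are mostly empty, so zero-imputation would just
## produce more near-zero-variance variables
raw <- read.csv("pml_training.csv")
naShare <- colMeans(is.na(raw))
summary(naShare[naShare > 0])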
Now we will split df into training (0.8) and validation (0.2) sets. To keep the paper short, we create only a single split rather than resampling repeatedly.
inTrain <- createDataPartition(df$classe, p = 0.8, list = FALSE)
training <- df[inTrain, ]
validation <- df[-inTrain, ]
rm(inTrain)
Our cross-validation setup is as follows:

1. Training set: 15699 observations
2. Validation set: 3923 observations
3. Test set: 20 observations
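These counts can be confirmed directly from the objects (a quick check, not part of the original write-up):

## Verify the split sizes quoted above
c(training = nrow(training), validation = nrow(validation), test = nrow(test))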
First, we want to get a general feel for how the variables relate to one another using the corrplot package.
corrplot(cor(training[, -54]), tl.cex = 0.5)
Not many of the variables are strongly correlated with one another.
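To back this up quantitatively, caret's findCorrelation() lists the predictors involved in pairwise correlations above a chosen cutoff. A sketch using a 0.9 cutoff (an addition, not part of the original analysis):

## Predictors with at least one pairwise |correlation| above 0.9
highCorr <- findCorrelation(cor(training[, -54]), cutoff = 0.9)
names(training)[highCorr]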
We will attempt to fit three models: a classification tree (since classe is a factor, not a numeric variable), a random forest, and a boosting model.
#### Classification Tree
modTree <- train(classe ~ ., method = "rpart", data = training)
fancyRpartPlot(modTree$finalModel)
predTree <- predict(modTree, validation)
cmTree <- confusionMatrix(predTree, validation$classe)
statTree <- c(Accuracy = cmTree$overall[["Accuracy"]],
              out_of_sample_error = 1 - cmTree$overall[["Accuracy"]])
statTree
## Accuracy out_of_sample_error
## 0.5019118 0.4980882
We can see that the classification tree completely misses Class D, which brings its accuracy down substantially.
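This is visible in the per-class sensitivities from the confusion matrix computed above:

## Per-class sensitivity; the near-zero value for Class D shows
## the tree essentially never predicts that class
round(cmTree$byClass[, "Sensitivity"], 3)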
#### Random Forest

modForest <- train(classe ~ ., method = "rf", data = training)
modForest
## Random Forest
##
## 15699 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 15699, 15699, 15699, 15699, 15699, 15699, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9929378 0.9910685
## 27 0.9961222 0.9950963
## 53 0.9921772 0.9901079
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
predForest <- predict(modForest, validation)
cmForest <- confusionMatrix(predForest, validation$classe)
statForest <- c(Accuracy = cmForest$overall[["Accuracy"]],
                out_of_sample_error = 1 - cmForest$overall[["Accuracy"]])
statForest
## Accuracy out_of_sample_error
## 0.9992352791 0.0007647209
We can see that the random forest method gives us almost 100% accuracy. However, the big trade-off is running time: it took much longer to train than the classification tree.
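If training time is a concern, one way to cut it (a sketch under assumed settings, not what was run above) is to replace caret's default 25 bootstrap resamples with a small k-fold cross-validation and time the fit:

## Hypothetical faster setup: 5-fold CV instead of 25 bootstrap reps;
## modForestCV is an illustrative name, not used elsewhere in this report
ctrl <- trainControl(method = "cv", number = 5)
system.time(
  modForestCV <- train(classe ~ ., method = "rf", data = training,
                       trControl = ctrl)
)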
#### Boosting

The final method, boosting, should give almost the same accuracy as the random forest, and it suffers from the same long running time.
modBoost <- train(classe ~ ., method = "gbm", data = training, verbose = FALSE)
modBoost
## Stochastic Gradient Boosting
##
## 15699 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 15699, 15699, 15699, 15699, 15699, 15699, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7581808 0.6930412
## 1 100 0.8304891 0.7853292
## 1 150 0.8688239 0.8339597
## 2 50 0.8823704 0.8510098
## 2 100 0.9383950 0.9220207
## 2 150 0.9617667 0.9516057
## 3 50 0.9302655 0.9117023
## 3 100 0.9689074 0.9606430
## 3 150 0.9840545 0.9798203
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
predBoost <- predict(modBoost, validation)
cmBoost <- confusionMatrix(predBoost, validation$classe)
statBoost <- c(Accuracy = cmBoost$overall[["Accuracy"]],
               out_of_sample_error = 1 - cmBoost$overall[["Accuracy"]])
statBoost
## Accuracy out_of_sample_error
## 0.98827428 0.01172572
Our comparison between the methods is as follows:
rbind(statTree,statForest,statBoost)
## Accuracy out_of_sample_error
## statTree 0.5019118 0.4980881978
## statForest 0.9992353 0.0007647209
## statBoost 0.9882743 0.0117257201
Based on the accuracy rates, we choose the random forest as the final model for predicting on the test set.
predict(modForest, test)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E