This exercise is part of Human Activity Recognition and its purpose is to build a predictive model that classifies the “quality” of weight lifting exercises. Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes.
The data set used in this analysis comes from the following publication:
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence (SBIA 2012), Advances in Artificial Intelligence, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6. The dataset is licensed under the Creative Commons license (CC BY-SA). Read more: http://groupware.les.inf.puc-rio.br/har
# Packages used throughout the analysis
library(caret)       # modelling framework (train, createDataPartition, nearZeroVar, ...)
library(ggplot2)     # plotting
library(doParallel)  # parallel backend for caret
library(ranger)      # fast Random Forest implementation
library(corrplot)    # correlation matrix visualisation
if (!file.exists("pml-training.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile = "pml-training.csv")
}
if (!file.exists("pml-testing.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile = "pml-testing.csv")
}
testing <- read.csv("pml-testing.csv", sep = ",", na.strings = c("", "NA"))
training <- read.csv("pml-training.csv", sep = ",", na.strings = c("", "NA"))
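As a quick sanity check (added here, not part of the original write-up), the dimensions and class balance of the raw data can be inspected; the training file is expected to contain 19,622 rows and 160 columns, the testing file 20 rows:
# Quick sanity check of the raw data
dim(training)   # expected: 19622 rows x 160 columns
dim(testing)    # expected: 20 rows x 160 columns
table(training$classe)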
Removing the first six columns (row index, user name, timestamps and the new_window flag), as they are not useful for prediction
training<- training[,7:160]
testing <- testing[,7:160]
Removing columns that contain missing values (NA)
non_na <- apply(is.na(training), 2, sum)==0
training_no_na <- training[,non_na]
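A quick check (my addition) of how many columns survive the missing-value filter; in this data set the summary columns are almost entirely NA, so roughly 54 of the 154 columns are expected to remain:
# Columns kept vs. dropped by the missing-value filter
sum(non_na)        # columns with no missing values
sum(!non_na)       # columns dropped
dim(training_no_na)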
Identifying and removing variables with near-zero variance
# Compute near-zero-variance diagnostics on the numeric predictors only
nearzv <- nearZeroVar(training_no_na[sapply(training_no_na, is.numeric)], saveMetrics = TRUE)
# Drop any flagged predictors while keeping the outcome column "classe"
training_nzv <- training_no_na[, !(names(training_no_na) %in% rownames(nearzv)[nearzv$nzv])]
rm(training_no_na)
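For completeness, the number of flagged near-zero-variance predictors can be checked (again my addition); given that the model summaries below report 46 predictors, this step is expected to remove few, if any, columns:
# Number of predictors flagged as near-zero-variance (expected to be small or zero here)
sum(nearzv$nzv)
dim(training_nzv)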
# Correlation matrix of the remaining numeric predictors (outcome excluded)
CorMatrix <- cor(training_nzv[, setdiff(names(training_nzv), "classe")])
corrplot(CorMatrix, method = "color", type = "lower", order = "hclust", tl.cex = 0.65, tl.col = "black", tl.srt = 45)
Removing variables that are highly correlated with one another (|r| > 0.9)
cor_to_be_removed <- findCorrelation(CorMatrix, cutoff = .9, verbose = TRUE)
training_corr <- training_nzv[,-cor_to_be_removed]
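As a sketch not present in the original report, it is worth confirming how many predictors the correlation filter removed; with a cutoff of 0.9, seven variables appear to be dropped, leaving 46 predictors plus the outcome, consistent with the model summaries below:
# How many predictors the |r| > 0.9 filter removed
length(cor_to_be_removed)  # expected: 7
dim(training_corr)         # expected: 47 columns (46 predictors + classe)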
set.seed(12334)
inTrain <- createDataPartition(training_corr$classe, p=.7, list = FALSE)
training.set <- training_corr[inTrain, ]
testing.set <- training_corr[-inTrain, ]
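A brief check of the split sizes (added for clarity); with p = 0.7 the training set should hold about 13,737 rows and the hold-out set about 5,885, matching the sample sizes reported in the model and confusion-matrix output below:
# Partition sizes: ~70% for model training, ~30% held out for validation
dim(training.set)  # expected: 13737 rows
dim(testing.set)   # expected: 5885 rows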
For the base model I decided to use the Random Forest algorithm for two main reasons: its well-established accuracy on high-dimensional data sets, and the fact that averaging many decorrelated trees reduces variance without a substantial increase in bias. I use the “ranger” package to fit the random forest, which keeps the computation time low, and the “doParallel” package to run the work in parallel across multiple CPU cores.
For the model we apply 5-fold cross-validation, which provides an honest estimate of out-of-sample accuracy and is used to select the mtry tuning parameter.
# Register a parallel backend using all available cores
nr_core <- detectCores()
cl <- makeCluster(nr_core)
registerDoParallel(cl)
# 5-fold cross-validation, run in parallel
train_Control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
set.seed(12334)
rfmodel_base <- train(classe ~ ., data = training.set, method = "ranger", metric = "Accuracy", trControl = train_Control)
stopCluster(cl)
# Re-create the parallel backend for the Gradient Boosting (GBM) model
nr_core <- detectCores()
cl <- makeCluster(nr_core)
registerDoParallel(cl)
set.seed(12334)
# GBM model fitted with caret's default resampling (25 bootstrap repetitions)
gbmmodel <- train(classe ~ ., data = training.set, method = "gbm")
stopCluster(cl)
Printing the Random Forest model
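The summary below was presumably produced by printing the fitted caret object:
print(rfmodel_base)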
## Random Forest
##
## 13737 samples
## 46 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10991, 10989, 10989, 10989
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9923561 0.9903295
## 24 0.9972340 0.9965014
## 46 0.9930847 0.9912535
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 24.
As we see above, the model was tuned with 5-fold cross-validation and the most accurate results are obtained with mtry = 24, at about 99.72% accuracy.
Final Random Forest model parameters
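The printout below is presumably the final ranger fit stored inside the caret object:
rfmodel_base$finalModel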
## Ranger result
##
## Call:
## ranger(.outcome ~ ., data = x, mtry = param$mtry, write.forest = TRUE, probability = classProbs, ...)
##
## Type: Classification
## Number of trees: 500
## Sample size: 13737
## Number of independent variables: 46
## Mtry: 24
## Target node size: 1
## Variable importance mode: none
## OOB prediction error: 0.19 %
Printing the Gradient Boosting model (GBM)
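Similarly, the GBM summary below was presumably produced by printing the fitted caret object:
print(gbmmodel)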
## Stochastic Gradient Boosting
##
## 13737 samples
## 46 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7490008 0.6812767
## 1 100 0.8213709 0.7736779
## 1 150 0.8621840 0.8254501
## 2 50 0.8810945 0.8493220
## 2 100 0.9361134 0.9191034
## 2 150 0.9599103 0.9492421
## 3 50 0.9307183 0.9122329
## 3 100 0.9690489 0.9608150
## 3 150 0.9837512 0.9794337
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
The GBM model with n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10 achieved about 98.38% resampling accuracy.
The Random Forest model therefore shows higher estimated accuracy (about 99.72%) than the Gradient Boosting model (about 98.38%), so it is selected for the final predictions.
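Note that the two figures above come from different resampling schemes (5-fold cross-validation for the Random Forest, 25 bootstrap repetitions for GBM). As a sketch of a like-for-like check, the GBM model can also be scored on the same 30% hold-out set used for the Random Forest below; pred_gbm is a new object introduced here, not part of the original analysis:
# Hold-out performance of the GBM model (sketch)
pred_gbm <- predict(gbmmodel, newdata = testing.set)
postResample(pred_gbm, testing.set$classe)  # hold-out Accuracy and Kappa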
pred_rf <- predict(rfmodel_base, newdata = testing.set)
confusionMatrix(pred_rf, testing.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 1 0 0 0
## B 0 1135 1 0 0
## C 0 3 1023 2 0
## D 0 0 2 961 0
## E 4 0 0 1 1082
##
## Overall Statistics
##
## Accuracy : 0.9976
## 95% CI : (0.996, 0.9987)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.997
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9965 0.9971 0.9969 1.0000
## Specificity 0.9998 0.9998 0.9990 0.9996 0.9990
## Pos Pred Value 0.9994 0.9991 0.9951 0.9979 0.9954
## Neg Pred Value 0.9991 0.9992 0.9994 0.9994 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1929 0.1738 0.1633 0.1839
## Detection Prevalence 0.2839 0.1930 0.1747 0.1636 0.1847
## Balanced Accuracy 0.9987 0.9981 0.9980 0.9982 0.9995
accuracy <- sum(pred_rf == testing.set$classe) / length(testing.set$classe)
error <- 1 - accuracy
error
## [1] 0.002378929
predicted_outcome <- predict(rfmodel_base, newdata = testing)
print(predicted_outcome)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
To meet the course project requirements we built and trained two models, based on the Random Forest and Gradient Boosting algorithms respectively. The final choice was made on the accuracy criterion, so the Random Forest based model was selected to predict the test cases. On the hold-out set the selected model reaches 99.76% accuracy, i.e. an estimated out-of-sample error rate of about 0.24% (0.002378929). It was applied to the 20-case test data set and the submitted predictions were all graded correct on Coursera (100% accuracy).
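The submission step itself is not shown in the report; below is a minimal sketch of how the 20 predictions could be written out, one text file per problem id, as the course submission page expects (the helper name write_predictions is mine, not from the original analysis):
# Write each of the 20 predictions to its own text file for submission (sketch)
write_predictions <- function(pred) {
  for (i in seq_along(pred)) {
    writeLines(as.character(pred[i]), paste0("problem_id_", i, ".txt"))
  }
}
write_predictions(predicted_outcome)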