This exercise is part of Human Activity Recognition and its purpose is to build a predictive model that classifies the “quality” of weight lifting exercises. Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes.
The data set used in this analysis comes from the following publication:
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence (SBIA 2012), Advances in Artificial Intelligence, Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6. The dataset is licensed under the Creative Commons license (CC BY-SA). Read more: http://groupware.les.inf.puc-rio.br/har
# Packages used throughout the analysis
library(caret)       # modelling framework (train, createDataPartition, nearZeroVar, ...)
library(ggplot2)     # plotting
library(doParallel)  # parallel backend for caret
library(ranger)      # fast Random Forest implementation
library(corrplot)    # correlation matrix visualisation
if (!file.exists("pml-training.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile = "pml-training.csv")
}
if (!file.exists("pml-testing.csv")) {
download.file("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile = "pml-testing.csv")
}
testing <- read.csv("pml-testing.csv", sep = ",", na.strings = c("", "NA"))
training <- read.csv("pml-training.csv", sep = ",", na.strings = c("", "NA"))
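As a quick sanity check (added here, not part of the original write-up), the dimensions and class balance of the raw data can be inspected; the training file is expected to contain 19,622 rows and 160 columns, the testing file 20 rows:
# Quick sanity check of the raw data
dim(training)   # expected: 19622 rows x 160 columns
dim(testing)    # expected: 20 rows x 160 columns
table(training$classe)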
Removing the first six columns (row index, user name, timestamps and the new_window flag), as they are not useful for prediction
training<- training[,7:160]
testing <- testing[,7:160]
Removing columns that contain missing values (NA)
non_na <- apply(is.na(training), 2, sum)==0
training_no_na <- training[,non_na]
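A quick check (my addition) of how many columns survive the missing-value filter; in this data set the summary columns are almost entirely NA, so roughly 54 of the 154 columns are expected to remain:
# Columns kept vs. dropped by the missing-value filter
sum(non_na)        # columns with no missing values
sum(!non_na)       # columns dropped
dim(training_no_na)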
Identifying and removing variables with near-zero variance
# Compute near-zero-variance diagnostics on the numeric predictors only
nearzv <- nearZeroVar(training_no_na[sapply(training_no_na, is.numeric)], saveMetrics = TRUE)
# Drop any flagged predictors while keeping the outcome column "classe"
training_nzv <- training_no_na[, !(names(training_no_na) %in% rownames(nearzv)[nearzv$nzv])]
rm(training_no_na)
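For completeness, the number of flagged near-zero-variance predictors can be checked (again my addition); given that the model summaries below report 46 predictors, this step is expected to remove few, if any, columns:
# Number of predictors flagged as near-zero-variance (expected to be small or zero here)
sum(nearzv$nzv)
dim(training_nzv)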
# Correlation matrix of the remaining numeric predictors (outcome excluded)
CorMatrix <- cor(training_nzv[, setdiff(names(training_nzv), "classe")])
corrplot(CorMatrix, method = "color", type = "lower", order = "hclust", tl.cex = 0.65, tl.col = "black", tl.srt = 45)
Removing variables that are highly correlated with one another (|r| > 0.9)
cor_to_be_removed <- findCorrelation(CorMatrix, cutoff = .9, verbose = TRUE)
training_corr <- training_nzv[,-cor_to_be_removed]
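As a sketch not present in the original report, it is worth confirming how many predictors the correlation filter removed; with a cutoff of 0.9, seven variables appear to be dropped, leaving 46 predictors plus the outcome, consistent with the model summaries below:
# How many predictors the |r| > 0.9 filter removed
length(cor_to_be_removed)  # expected: 7
dim(training_corr)         # expected: 47 columns (46 predictors + classe)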
set.seed(12334)
inTrain <- createDataPartition(training_corr$classe, p=.7, list = FALSE)
training.set <- training_corr[inTrain, ]
testing.set <- training_corr[-inTrain, ]
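A brief check of the split sizes (added for clarity); with p = 0.7 the training set should hold about 13,737 rows and the hold-out set about 5,885, matching the sample sizes reported in the model and confusion-matrix output below:
# Partition sizes: ~70% for model training, ~30% held out for validation
dim(training.set)  # expected: 13737 rows
dim(testing.set)   # expected: 5885 rows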
For the base model I decided to use the Random Forest algorithm for two main reasons: its well-established accuracy on high-dimensional data sets, and the fact that averaging many decorrelated trees reduces variance without a substantial increase in bias. I use the “ranger” package to fit the random forest, which keeps the computation time low, and the “doParallel” package to run the work in parallel across multiple CPU cores.
For the model we apply 5-fold cross-validation, which provides an honest estimate of out-of-sample accuracy and is used to select the mtry tuning parameter.
# Register a parallel backend using all available cores
nr_core <- detectCores()
cl <- makeCluster(nr_core)
registerDoParallel(cl)
# 5-fold cross-validation, run in parallel
train_Control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
set.seed(12334)
rfmodel_base <- train(classe ~ ., data = training.set, method = "ranger", metric = "Accuracy", trControl = train_Control)
stopCluster(cl)
# Re-create the parallel backend for the Gradient Boosting (GBM) model
nr_core <- detectCores()
cl <- makeCluster(nr_core)
registerDoParallel(cl)
set.seed(12334)
# GBM model fitted with caret's default resampling (25 bootstrap repetitions)
gbmmodel <- train(classe ~ ., data = training.set, method = "gbm")
stopCluster(cl)
Printing the Random Forest model
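The summary below was presumably produced by printing the fitted caret object:
print(rfmodel_base)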
## Random Forest
##
## 13737 samples
## 46 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10991, 10989, 10989, 10989
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9923561 0.9903295
## 24 0.9972340 0.9965014
## 46 0.9930847 0.9912535
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 24.
As we see above, the model was tuned with 5-fold cross-validation and the most accurate results are obtained with mtry = 24, at about 99.72% accuracy.
Final Random Forest model parameters
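The printout below is presumably the final ranger fit stored inside the caret object:
rfmodel_base$finalModel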
## Ranger result
##
## Call:
## ranger(.outcome ~ ., data = x, mtry = param$mtry, write.forest = TRUE, probability = classProbs, ...)
##
## Type: Classification
## Number of trees: 500
## Sample size: 13737
## Number of independent variables: 46
## Mtry: 24
## Target node size: 1
## Variable importance mode: none
## OOB prediction error: 0.19 %
Printing the Gradient Boosting model (GBM)
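Similarly, the GBM summary below was presumably produced by printing the fitted caret object:
print(gbmmodel)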
## Stochastic Gradient Boosting
##
## 13737 samples
## 46 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7490008 0.6812767
## 1 100 0.8213709 0.7736779
## 1 150 0.8621840 0.8254501
## 2 50 0.8810945 0.8493220
## 2 100 0.9361134 0.9191034
## 2 150 0.9599103 0.9492421
## 3 50 0.9307183 0.9122329
## 3 100 0.9690489 0.9608150
## 3 150 0.9837512 0.9794337
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
The GBM model with n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10 achieved about 98.38% resampling accuracy.
The Random Forest model therefore shows higher estimated accuracy (about 99.72%) than the Gradient Boosting model (about 98.38%), so it is selected for the final predictions.
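Note that the two figures above come from different resampling schemes (5-fold cross-validation for the Random Forest, 25 bootstrap repetitions for GBM). As a sketch of a like-for-like check, the GBM model can also be scored on the same 30% hold-out set used for the Random Forest below; pred_gbm is a new object introduced here, not part of the original analysis:
# Hold-out performance of the GBM model (sketch)
pred_gbm <- predict(gbmmodel, newdata = testing.set)
postResample(pred_gbm, testing.set$classe)  # hold-out Accuracy and Kappa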
pred_rf <- predict(rfmodel_base, newdata = testing.set)
confusionMatrix(pred_rf, testing.set$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 1 0 0 0
## B 0 1135 1 0 0
## C 0 3 1023 2 0
## D 0 0 2 961 0
## E 4 0 0 1 1082
##
## Overall Statistics
##
## Accuracy : 0.9976
## 95% CI : (0.996, 0.9987)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.997
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9965 0.9971 0.9969 1.0000
## Specificity 0.9998 0.9998 0.9990 0.9996 0.9990
## Pos Pred Value 0.9994 0.9991 0.9951 0.9979 0.9954
## Neg Pred Value 0.9991 0.9992 0.9994 0.9994 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1929 0.1738 0.1633 0.1839
## Detection Prevalence 0.2839 0.1930 0.1747 0.1636 0.1847
## Balanced Accuracy 0.9987 0.9981 0.9980 0.9982 0.9995
accuracy <- sum(pred_rf == testing.set$classe) / length(testing.set$classe)
error <- 1 - accuracy
error
## [1] 0.002378929
predicted_outcome <- predict(rfmodel_base, newdata = testing)
print(predicted_outcome)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
To meet the course project requirements we built and trained two models, based on the Random Forest and Gradient Boosting algorithms respectively. The final choice was made on the accuracy criterion, so the Random Forest based model was selected to predict the test cases. On the hold-out set the selected model reaches 99.76% accuracy, i.e. an estimated out-of-sample error rate of about 0.24% (0.002378929). It was applied to the 20-case test data set and the submitted predictions were all graded correct on Coursera (100% accuracy).
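The submission step itself is not shown in the report; below is a minimal sketch of how the 20 predictions could be written out, one text file per problem id, as the course submission page expects (the helper name write_predictions is mine, not from the original analysis):
# Write each of the 20 predictions to its own text file for submission (sketch)
write_predictions <- function(pred) {
  for (i in seq_along(pred)) {
    writeLines(as.character(pred[i]), paste0("problem_id_", i, ".txt"))
  }
}
write_predictions(predicted_outcome)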