For this machine learning project we are given two data sets collected from accelerometers placed on the belt, forearm, arm, and dumbbell of 6 research study participants. The training data contains the accelerometer measurements together with a label (A-E) identifying how well the participant performed the activity. The testing data contains accelerometer measurements only, without the label.
The goal of this project is to predict, from the accelerometer measurements alone, whether the exercise was done properly or improperly. The participants were instructed to perform the exercise either properly (Class A) or in a way that replicated 4 common weightlifting mistakes (Classes B, C, D, and E).
The question is: can we correctly predict each participant's exercise manner, i.e. the classe (A-E), from the accelerometer data? To answer it, we fit several Machine Learning (ML) algorithms on ‘trainData’ and apply them to the given ‘test dataset’ to predict the ‘classe’ level.
Below I give the code, with a comment describing each step of the ML process. I used four machine learning algorithms: Classification Tree, lda, gbm and random forest. Each model is trained with cross-validation configured through caret's ‘trainControl’ function. After each model run I report its ‘accuracy rate’. I also used the ‘plotly’ package, which turns a ggplot2 chart into an interactive one; readers are encouraged to hover over the plot to see the distribution of the classe variable in detail. A great ggplot2 addition!
These findings help us analyse and predict the manner in which participants did their exercise regime.
# Necessary library loaded
library(easypackages)
## Warning: package 'easypackages' was built under R version 3.3.3
suppressMessages(libraries("formattable", "dplyr", "tidyr", "ggplot2"))
## Warning: package 'formattable' was built under R version 3.3.3
## Warning: package 'dplyr' was built under R version 3.3.3
## Warning: package 'tidyr' was built under R version 3.3.3
## Warning: package 'ggplot2' was built under R version 3.3.3
# loading and reading data file from my desktop
trainDataSet <- read.csv("pml-training.csv", na.strings = c("", "NA"), header = TRUE)
testDataSet <- read.csv("pml-testing.csv", na.strings = c("", "NA"), header = TRUE)
# data dimension with row and columns
rbind ( trainDataSet = dim(trainDataSet), testDataSet = dim(testDataSet) )
## [,1] [,2]
## trainDataSet 19622 160
## testDataSet 20 160
# displaying classe(A-E) elements summarized by user name
trainDataSet %>% count(classe, user_name) %>% spread(classe, n) %>% formattable(align = 'l')
| user_name | A | B | C | D | E |
|---|---|---|---|---|---|
| adelmo | 1165 | 776 | 750 | 515 | 686 |
| carlitos | 834 | 690 | 493 | 486 | 609 |
| charles | 899 | 745 | 539 | 642 | 711 |
| eurico | 865 | 592 | 489 | 582 | 542 |
| jeremy | 1177 | 489 | 652 | 522 | 562 |
| pedro | 640 | 505 | 499 | 469 | 497 |
# a visual of classe-vs-user in a bar-graph
suppressMessages(library(plotly))
## Warning: package 'plotly' was built under R version 3.3.3
plot.01 = ggplot(trainDataSet, aes(x=classe, fill=user_name)) + geom_bar() + xlab("Classe in bar segments") + ylab("User performances") + ggtitle("Sequence of classe by user")
ggplotly(plot.01, width = 750, height = 500) %>% highlight(on = "plotly_hover", color = "white")
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Plot analysis: The plot shows that ‘Classe A’ has the most observations for every participant, while classes B to E each have fewer. Class A is the biceps curl done the proper way; Classes B to E are the incorrect variants. The plot gives the total count for each classe level, with each bar split by user. Roll over the bar graph to see how many observations each user contributed to each ‘classe’ level.
The table below rescales these counts to proportions within each user, showing what share of each user's observations falls in each class (A-E).
# percentile projection of classe elements by user name
trainDataSet %>% count(classe, user_name) %>% group_by(user_name) %>% mutate(n=percent(n/sum(n),0))%>% spread(classe, n) %>% formattable(align = 'l')
## Warning: package 'bindrcpp' was built under R version 3.3.3
## Warning in mutate_impl(.data, dots): Vectorizing 'formattable' elements may
## not preserve their attributes
| user_name | A | B | C | D | E |
|---|---|---|---|---|---|
| adelmo | 0.2993320 | 0.1993834 | 0.1927030 | 0.1323227 | 0.1762590 |
| carlitos | 0.2679949 | 0.2217224 | 0.1584190 | 0.1561697 | 0.1956941 |
| charles | 0.2542421 | 0.2106900 | 0.1524321 | 0.1815611 | 0.2010747 |
| eurico | 0.2817590 | 0.1928339 | 0.1592834 | 0.1895765 | 0.1765472 |
| jeremy | 0.3459730 | 0.1437390 | 0.1916520 | 0.1534392 | 0.1651969 |
| pedro | 0.2452107 | 0.1934866 | 0.1911877 | 0.1796935 | 0.1904215 |
This density plot visualizes the distribution of the classe (A-E) variable, which portrays the overall exercise manner of all participants.
if(FALSE){
# drawing a density plot to see the classe-elements distribution
trainDataSet %>% ggplot(aes(classe)) + geom_density(fill="salmon")-> plot.02
ggplotly(plot.02, width = 700, height = 450) %>% highlight(on = "plotly_hover", color = "red")
}
Note: Class A has the highest density peak (about 2.5), while the peaks of the other classes (B-E) are lower, falling toward about 0.5.
suppressMessages(library(caret))
## Warning: package 'caret' was built under R version 3.3.3
# Create Data Partition with 0.75 is training and 0.25 test dataset
inTrain <- createDataPartition(trainDataSet$classe, p=0.75, list=FALSE)
TrainSet <- trainDataSet[inTrain, ]
TestSet <- trainDataSet[-inTrain,]
# quick data-dimension after data partition
rbind ( TrainSet = dim(TrainSet), TestSet = dim(TestSet) )
## [,1] [,2]
## TrainSet 14718 160
## TestSet 4904 160
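The 75/25 split above is stratified by ‘classe’, and no seed is set before it, so the rows that land in each part change from run to run. The following is a minimal base-R sketch of the same idea on a toy label vector; `stratified_split` and `toy_classe` are hypothetical names used only for illustration, and the helper approximates, rather than reproduces, what `createDataPartition` does.

```r
# Hypothetical helper (not part of caret): sample a share p of the row
# indices within each class, so class proportions are preserved.
stratified_split <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Fixing the seed before the split makes the partition repeatable
set.seed(2017)
toy_classe <- factor(rep(c("A", "B", "C", "D", "E"),
                         times = c(40, 28, 24, 24, 24)))
in_train <- stratified_split(toy_classe, p = 0.75)

# Class counts in the training and hold-out parts
table(toy_classe[in_train])
table(toy_classe[-in_train])
```

Calling `set.seed()` immediately before `createDataPartition` in the report would have the same effect on the real split.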
#> **Note: some machine learning algorithms do not accept 'NA' values inside the data set, so we will handle the 'NA' values first.
# checking how many columns contain 'NA' values, tabulated by the share of 'NA' per column
table (NA_Value_Percent <- round(colMeans(is.na(TrainSet)), 2))
##
## 0 0.98
## 60 100
Note: We see that 100 variables have about 98% of their values missing (‘NA’), and only 60 variables are complete. A variable that is 98% ‘NA’ contributes almost nothing to the analysis.
# so we eliminate all columns where more than 96% of the values are 'NA'
All_NA_columns <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.96
# removing those columns from both 'TrainSet' and 'TestSet', using the mask computed on 'TrainSet'
TrainSet <- TrainSet[, !All_NA_columns]
TestSet <- TestSet [, !All_NA_columns]
# a quick view of how many 'variable-columns' left after 'NA-elimination' process
rbind(TrainSet = dim(TrainSet), TestSet = dim(TestSet))
## [,1] [,2]
## TrainSet 14718 60
## TestSet 4904 60
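As an illustration of the ‘NA’ filter above, here is a small sketch on a toy data frame. The column values, the names `toy`, `toy_new` and `keep`, and the 0.5 threshold are assumptions chosen for the toy; the report itself uses 0.96 on the real data.

```r
# Toy data: one complete sensor column, one mostly-missing summary column
toy <- data.frame(
  roll_belt  = c(1.41, 1.42, 1.48, 1.45, 1.43),
  kurtosis_x = c(NA, NA, NA, NA, 0.3),
  classe     = c("A", "A", "B", "B", "C")
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Keep columns whose NA share is at or below the (toy) threshold
keep <- na_share <= 0.5
toy_clean <- toy[, keep, drop = FALSE]

# The mask is computed once and reused on any other data set with the same
# columns, so both sets end up with identical predictors
toy_new <- data.frame(roll_belt = 1.50, kurtosis_x = NA, classe = "A")
toy_new_clean <- toy_new[, keep, drop = FALSE]

names(toy_clean)
```

Computing the mask on the training part only, as the report does, avoids letting the hold-out rows influence which predictors are kept.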
# covariates variability check by setting 'saveMetrics = TRUE', return a data frame with predictor info
nzv <- nearZeroVar(TrainSet, saveMetrics = TRUE)
head(nzv)
## freqRatio percentUnique zeroVar nzv
## X 1.000000 100.00000000 FALSE FALSE
## user_name 1.106084 0.04076641 FALSE FALSE
## raw_timestamp_part_1 1.066667 5.68691398 FALSE FALSE
## raw_timestamp_part_2 1.250000 88.70091045 FALSE FALSE
## cvtd_timestamp 1.037273 0.13588803 FALSE FALSE
## new_window 47.735099 0.01358880 FALSE TRUE
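The `freqRatio` and `percentUnique` columns can be reproduced by hand, which makes the `nzv` flag easier to read. The sketch below is a simplified base-R approximation, not caret's implementation: `nzv_metrics` and the toy vectors are hypothetical, and the cut-offs 95/19 and 10 are caret's documented defaults for `freqCut` and `uniqueCut`.

```r
# Hypothetical helper approximating the nearZeroVar metrics for one column
nzv_metrics <- function(x, freq_cut = 95 / 19, unique_cut = 10) {
  n <- length(x)
  counts <- sort(as.vector(table(x[!is.na(x)])), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common
  freq_ratio <- if (length(counts) < 2) 0 else counts[1] / counts[2]
  # Distinct values as a percentage of all rows
  percent_unique <- 100 * length(counts) / n
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    nzv = (freq_ratio > freq_cut && percent_unique <= unique_cut) ||
      length(counts) <= 1
  )
}

# A heavily unbalanced two-level column, similar in spirit to 'new_window'
new_window_like <- rep(c("no", "yes"), times = c(96, 2))
# A roughly balanced column, similar in spirit to 'user_name'
user_like <- rep(c("adelmo", "carlitos", "charles"), times = c(34, 32, 32))

rbind(new_window = nzv_metrics(new_window_like),
      user_name  = nzv_metrics(user_like))
```

A column is flagged only when one value dominates and there are few distinct values, which is why the unbalanced column is flagged while the balanced one is not.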
Analysis: Most variables are not flagged as near-zero-variance (nzv = FALSE), so we do not eliminate covariates on that basis. The one flagged variable shown here, ‘new_window’, is among the first seven columns, which we remove next: these hold the row index, user name, timestamps and window markers rather than sensor measurements.
TrainSet <- TrainSet[, -(1:7)]
TestSet <- TestSet [, -(1:7)]
# final dataSet dimension after all irrelevant column elimination
rbind ( TrainSet = dim(TrainSet), TestSet = dim(TestSet))
## [,1] [,2]
## TrainSet 14718 53
## TestSet 4904 53
Here I use several Machine Learning algorithms in search of the highest model accuracy: Decision Tree, Linear Discriminant Analysis (lda), Gradient Boosting Method (gbm) and Random Forest (rf). Cross-validation is configured through the ‘trainControl’ function, with the number of folds given for each model. I used parallel processing to reduce the processing time of the ‘gbm’ and ‘rf’ models. I also plotted the confusion matrix to visualize the per-class accuracy, for the ‘rf’ model only.
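To make the fold sizes in the model summaries easier to interpret, here is a rough base-R sketch of how k-fold cross-validation divides the 14718 training rows. It assigns folds purely at random, which is an assumption made for simplicity; caret builds its folds with stratification on the outcome, so its sizes can differ by a row or two.

```r
set.seed(1)
n <- 14718  # rows in TrainSet, as reported above
k <- 10     # number of folds

# Give every row a fold label 1..k, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once; the model is fit on the remaining rows
held_out_sizes <- as.vector(table(fold_id))
train_sizes <- n - held_out_sizes

range(train_sizes)
```

Each model is therefore fit k times on roughly 90% of the rows, and the reported cross-validated accuracy is the average over the k held-out folds.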
# setting seed and loading library 'rattle' for decision tree
suppressMessages(library(rattle));set.seed(666)
## Warning: package 'rattle' was built under R version 3.3.3
# designing the tree using 'rpart' method
control_dt <- trainControl(method="cv", number = 10)
model_Tree <- train(classe~., method = "rpart", data = TrainSet, trControl = control_dt)
## Loading required package: rpart
## Warning: package 'rpart' was built under R version 3.3.3
# displaying brief model summary
print(model_Tree, digits = 4)
## CART
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 13246, 13245, 13245, 13246, 13248, 13247, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03370 0.5085 0.35816
## 0.05956 0.4517 0.26912
## 0.11450 0.3225 0.05826
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0337.
# displaying 'model_Tree' node and leaf detail
print(model_Tree$finalModel, digits = 4)
## n= 14718
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 14718 10530 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 13492 9317 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.95 1221 8 A (0.99 0.0066 0 0 0) *
## 5) pitch_forearm>=-33.95 12271 9309 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 439.5 10414 7510 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 120.5 6404 3820 A (0.4 0.18 0.19 0.17 0.064) *
## 21) roll_forearm>=120.5 4010 2711 C (0.08 0.18 0.32 0.23 0.18) *
## 11) magnet_dumbbell_y>=439.5 1857 896 B (0.031 0.52 0.044 0.22 0.19) *
## 3) roll_belt>=130.5 1226 10 E (0.0082 0 0 0 0.99) *
# visualizing the decision tree with all detail 'leaf-palletes'
fancyRpartPlot(model_Tree$finalModel)
# running the 'rpart' model on 'TestSet' data and measure model accuracy rate
Test_pred <- predict(model_Tree, newdata = TestSet)
confusionMatrix(Test_pred, TestSet$classe)$overall['Accuracy']
## Accuracy
## 0.4967374
Upshot: The accuracy rate of the ‘rpart’ model on ‘TestSet’ data is about 0.497, which is low, so other models need to be explored.
suppressMessages(library(MASS));set.seed(459)
# setting 'trainControl' feature for the 'lda' model with 8-fold cross-validation method
control_lda <- trainControl(method="cv", number = 8)
model_lda <- train(classe~., trControl = control_lda, method="lda", data=TrainSet)
# displaying brief model summary
print(model_lda, digits = 4)
## Linear Discriminant Analysis
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (8 fold)
## Summary of sample sizes: 12879, 12877, 12878, 12879, 12878, 12879, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7038 0.6253
# using predict method to verify the model with 'TestSet' data and display model accuracy
lda_pred <- predict(model_lda, TestSet)
confusionMatrix(lda_pred, TestSet$classe)$overall['Accuracy']
## Accuracy
## 0.7061582
Upshot: With the ‘lda’ model the accuracy rate rises to about 0.706 on ‘TestSet’ data.
Note: The ‘gbm’ and Random Forest (rf) models are computationally intensive, so I decided to use parallel processing to reduce computation time. It cut the ML code processing time by almost 60% (about 12 minutes).
# all necessary library for 'gbm' model including (parallel and doParallel) for faster processing
suppressMessages(libraries("gbm", "plyr", "dplyr", "survival", "parallel", "doParallel"));set.seed(9515)
## Warning: package 'gbm' was built under R version 3.3.3
## Warning: package 'survival' was built under R version 3.3.3
## Warning: package 'doParallel' was built under R version 3.3.3
## Warning: package 'foreach' was built under R version 3.3.3
## Warning: package 'iterators' was built under R version 3.3.3
# leaving a single core for the operating system and registering the cluster
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
#> ** Note: 'trainControl' uses the repeated cross-validation method; 'number' sets the number of folds for k-fold cross-validation (no 'repeats' is given, so the folds are run once), and 'allowParallel = TRUE' tells caret to use the cluster we registered in the previous step.
control_gbm <- trainControl(method = "repeatedcv", number = 10, allowParallel = TRUE)
model_gbm <- train(classe~., preProcess= c("center", "scale"), trControl = control_gbm, method="gbm", data=TrainSet, verbose = FALSE)
# displaying brief model summary
print(model_gbm, digits = 4)
## Stochastic Gradient Boosting
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 13246, 13245, 13246, 13246, 13247, 13245, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7542 0.6885
## 1 100 0.8207 0.7730
## 1 150 0.8536 0.8147
## 2 50 0.8560 0.8175
## 2 100 0.9069 0.8822
## 2 150 0.9325 0.9146
## 3 50 0.8959 0.8682
## 3 100 0.9412 0.9256
## 3 150 0.9626 0.9527
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
# applying 'gbm' model on 'TestSet' data
gbm_pred <- predict(model_gbm, TestSet)
# confusion Matrix summary statistics with model 'accuracy' rate
print(confusionMatrix(gbm_pred, TestSet$classe), digits = 4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1360 21 0 0 2
## B 18 899 25 3 4
## C 9 28 819 23 9
## D 5 1 10 774 14
## E 3 0 1 4 872
##
## Overall Statistics
##
## Accuracy : 0.9633
## 95% CI : (0.9576, 0.9684)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9536
## Mcnemar's Test P-Value : 6.435e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9749 0.9473 0.9579 0.9627 0.9678
## Specificity 0.9934 0.9874 0.9830 0.9927 0.9980
## Pos Pred Value 0.9834 0.9473 0.9223 0.9627 0.9909
## Neg Pred Value 0.9901 0.9874 0.9910 0.9927 0.9928
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2773 0.1833 0.1670 0.1578 0.1778
## Detection Prevalence 0.2820 0.1935 0.1811 0.1639 0.1794
## Balanced Accuracy 0.9842 0.9673 0.9704 0.9777 0.9829
confusionMatrix(gbm_pred, TestSet$classe)$overall['Accuracy']
## Accuracy
## 0.9632953
Upshot: Accuracy increases considerably, to 0.963, compared with the ‘lda’ model (0.706).
# loading library, setting seed and 'registering-parallel-processing'
suppressMessages(library(randomForest));set.seed(969)
## Warning: package 'randomForest' was built under R version 3.3.3
registerDoParallel(cluster)
# setting control feature with method 'repeatedcv' (a single pass of 9 folds, since 'repeats' is not set) and adding parallel processing cluster
Control_Rfo <- trainControl(method = "repeatedcv", number = 9, allowParallel = TRUE, verboseIter = FALSE)
# running 'rf' model with preprocessing method and predefined control feature
model_Rfo <- train(classe~., method = "rf", preProcess=c("center", "scale"), data=TrainSet, trControl = Control_Rfo)
# displaying brief model summary
print(model_Rfo, digits = 4)
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (52), scaled (52)
## Resampling: Cross-Validated (9 fold, repeated 1 times)
## Summary of sample sizes: 13083, 13082, 13082, 13083, 13083, 13083, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9920 0.9899
## 27 0.9926 0.9906
## 52 0.9854 0.9815
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
# Evaluating the model on 'TestSet' data and calculating confusionMatrix
Rfo_pred <- predict(model_Rfo, TestSet)
confusion_Rfo <- confusionMatrix(Rfo_pred, TestSet$classe)
# confusion Matrix summary statistics with 'accuracy' rate (reusing the object computed above)
print(confusion_Rfo, digits = 4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1391 6 0 0 0
## B 2 939 3 1 0
## C 0 3 848 9 4
## D 0 1 4 792 1
## E 2 0 0 2 896
##
## Overall Statistics
##
## Accuracy : 0.9923
## 95% CI : (0.9894, 0.9945)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9902
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9971 0.9895 0.9918 0.9851 0.9945
## Specificity 0.9983 0.9985 0.9960 0.9985 0.9990
## Pos Pred Value 0.9957 0.9937 0.9815 0.9925 0.9956
## Neg Pred Value 0.9989 0.9975 0.9983 0.9971 0.9988
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2836 0.1915 0.1729 0.1615 0.1827
## Detection Prevalence 0.2849 0.1927 0.1762 0.1627 0.1835
## Balanced Accuracy 0.9977 0.9940 0.9939 0.9918 0.9967
confusion_Rfo$overall['Accuracy']
## Accuracy
## 0.9922512
# plotting the 'confusion Matrix' of "Random Forest" model for classe-steps verification
plot(confusion_Rfo$table, col = confusion_Rfo$byClass, main = paste("Random Forest Model Accuracy =",
round(confusion_Rfo$overall['Accuracy'], 4)))
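The headline statistics can be re-derived directly from the random forest confusion table printed above, which is a useful sanity check. The sketch below re-enters that table by hand; `rf_table` is a name introduced here for illustration only.

```r
# Random forest confusion table from the output above
# (rows = predicted class, columns = reference class)
rf_table <- matrix(
  c(1391,   6,   0,   0,   0,
       2, 939,   3,   1,   0,
       0,   3, 848,   9,   4,
       0,   1,   4, 792,   1,
       2,   0,   0,   2, 896),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(rf_table)) / sum(rf_table)

# Sensitivity per class: correct predictions over the reference (column) totals
sensitivity <- diag(rf_table) / colSums(rf_table)

# Positive predictive value per class: correct over the predicted (row) totals
pos_pred_value <- diag(rf_table) / rowSums(rf_table)

round(accuracy, 4)
round(sensitivity, 4)
round(pos_pred_value, 4)
```

The rounded values should line up with the "Accuracy", "Sensitivity" and "Pos Pred Value" rows reported by `confusionMatrix`.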
Upshot: The random forest model gives by far the best ‘accuracy rate’ on ‘TestSet’ (0.9923) and therefore the lowest ‘out-of-sample error’ (about 0.008).
Random Forest Model out of sample error: (1 - 0.9922512) = 0.008
Gradient Boosting Model out of sample error: (1 - 0.9632953) = 0.037
Linear Discriminant Analysis out of sample error: (1 - 0.7061582) = 0.294
Classification or Decision tree out of sample error: (1 - 0.4967374) = 0.503
Note: Each run of these algorithms produces slightly different accuracy rates and tree plots, since steps such as the data partition are random.
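The out-of-sample error is one minus the accuracy measured on the held-out ‘TestSet’. The sketch below recomputes it from the accuracies printed for this run and, as an extra not in the original analysis, attaches an exact binomial interval to the random forest error. The counts 4866 (correct) and 4904 (total) are read off the confusion table, and using `binom.test` for the interval is my assumption about a reasonable way to quantify the uncertainty.

```r
# Hold-out accuracies printed for this run
test_accuracy <- c(rf = 0.9922512, gbm = 0.9632953,
                   lda = 0.7061582, rpart = 0.4967374)

# Out-of-sample error = 1 - accuracy
oos_error <- 1 - test_accuracy
round(oos_error, 3)

# Exact binomial 95% interval for the random forest accuracy,
# converted to an interval for the error rate
rf_ci <- as.numeric(binom.test(4866, 4904)$conf.int)
rf_error_ci <- rev(1 - rf_ci)
round(rf_error_ci, 4)
```

Because accuracy changes slightly between runs, an interval gives a more honest picture than a single figure.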
Finally, I apply three of the models (‘rf’, ‘gbm’, ‘lda’) to the 20 test cases (‘testDataSet’) provided with the project instructions, to predict their classe levels.
print(predict(model_Rfo, newdata = testDataSet))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
print(predict(model_gbm, newdata = testDataSet))
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Analysis: Remarkably, the ‘random-forest’ and ‘gbm’ models make exactly the same predictions on ‘testDataSet’, which is consistent with their similarly high accuracy on ‘TestSet’.
print(predict(model_lda, newdata = testDataSet))
## [1] B A B C C E D D A A D A B A E A A B B B
## Levels: A B C D E
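To quantify how far the ‘lda’ predictions depart from the ‘rf’ and ‘gbm’ predictions on the 20 cases, the printed vectors can be compared directly. The sketch below re-enters them by hand; `rf_20` and `lda_20` are names introduced here for illustration. Agreement between models is not proof of correctness, since the true labels for these cases are not available.

```r
lv <- c("A", "B", "C", "D", "E")

# Predictions copied from the output above (rf and gbm are identical)
rf_20  <- factor(strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]],
                 levels = lv)
lda_20 <- factor(strsplit("B A B C C E D D A A D A B A E A A B B B", " ")[[1]],
                 levels = lv)

# Share of the 20 cases on which the two models agree
agreement <- mean(rf_20 == lda_20)

# Which test cases they disagree on
disagree_at <- which(rf_20 != lda_20)

agreement
disagree_at

# Cross-tabulation of the two sets of predictions
table(rf = rf_20, lda = lda_20)
```

The agreement rate is in line with the roughly 0.71 hold-out accuracy of the ‘lda’ model.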
# finally shutting down the parallel-processing cluster
stopCluster(cluster)
# returning 'R' to single-threaded (sequential) processing
registerDoSEQ()