Coursera - Machine Learning algorithms

Project Report: Machine Learning Algorithms

We are given two data sets collected from accelerometers placed on the belt, forearm, arm, and dumbbell of 6 research study participants. The training data carry a label (classe, A-E) identifying the quality of the activity the participant was performing. The testing data contain the same accelerometer measurements without the label.

The goal of this project is to predict, from accelerometer measurements alone, whether the exercise is being done properly or improperly. The participants were instructed to perform the exercise either properly (Class A) or in a way which replicated 4 common weightlifting mistakes (Classes B, C, D, and E).

The question is whether we can correctly predict the manner in which each participant performed the exercise (classe A-E) from the accelerometer data. To that end, we fit several machine learning (ML) algorithms on the training data and then apply them to the given test dataset to predict the ‘classe’ level.

1. Project write-up sequence:

Below, each step of the ML workflow is shown as code with a short description. I used four machine learning algorithms: classification tree, lda, gbm, and random forest. Cross-validation was applied to all four models through the ‘trainControl’ function. At the end of each model run, I report the measured accuracy rate. I also used ‘plotly’, a relatively new package that turns a ggplot2 figure into an interactive one; hovering over the plot shows the distribution of the classe variable in detail.

These findings help us analyse and predict the manner in which participants performed their exercises.

2. Data loading, visual overview and manipulation:

# Necessary library loaded
library(easypackages)

## Warning: package 'easypackages' was built under R version 3.3.3

suppressMessages(libraries("formattable", "dplyr", "tidyr", "ggplot2"))

## Warning: package 'formattable' was built under R version 3.3.3

## Warning: package 'dplyr' was built under R version 3.3.3

## Warning: package 'tidyr' was built under R version 3.3.3

## Warning: package 'ggplot2' was built under R version 3.3.3

# reading the data files from the working directory; empty strings and "NA" are both treated as missing
trainDataSet <- read.csv("pml-training.csv", na.strings = c("", "NA"), header = TRUE)
testDataSet  <- read.csv("pml-testing.csv",  na.strings = c("", "NA"), header = TRUE)

# data dimension with row and columns
rbind ( trainDataSet = dim(trainDataSet), testDataSet = dim(testDataSet) )

##               [,1] [,2]
## trainDataSet 19622  160
## testDataSet     20  160

2.a. Counts of classe (A-E) by user, with a ‘plotly’ bar chart

# displaying classe(A-E) elements summarized by user name
trainDataSet %>% count(classe, user_name) %>% spread(classe, n) %>% formattable(align = 'l')

user_name	A	B	C	D	E
adelmo	1165	776	750	515	686
carlitos	834	690	493	486	609
charles	899	745	539	642	711
eurico	865	592	489	582	542
jeremy	1177	489	652	522	562
pedro	640	505	499	469	497

# a visual of classe-vs-user in a bar-graph
suppressMessages(library(plotly))

## Warning: package 'plotly' was built under R version 3.3.3

plot.01 = ggplot(trainDataSet, aes(x=classe, fill=user_name)) + geom_bar() + xlab("Classe in bar segments") + ylab("User performances") + ggtitle("Sequence of classe by user")

ggplotly(plot.01, width = 750, height = 500) %>% highlight(on = "plotly_hover", color = "white")

## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Plot analysis: The plot shows that every participant has more Class A observations (the proper execution) than observations of any other class, while classes B to E each occur less often. The stacked bars show how many observations each user contributed to each class, and the table above gives the exact counts. Hover over the bar graph to see the count for each user and class.

2.b. Percentage of each ‘classe’ by user

This table rescales the counts so that each user's row sums to 100%, showing what share of that user's observations falls in each class (A-E).

# share of each classe element within each user name
trainDataSet %>% count(classe, user_name) %>% group_by(user_name) %>% mutate(n=percent(n/sum(n),0))%>% spread(classe, n) %>% formattable(align = 'l')

## Warning: package 'bindrcpp' was built under R version 3.3.3

## Warning in mutate_impl(.data, dots): Vectorizing 'formattable' elements may
## not preserve their attributes

user_name	A	B	C	D	E
adelmo	0.2993320	0.1993834	0.1927030	0.1323227	0.1762590
carlitos	0.2679949	0.2217224	0.1584190	0.1561697	0.1956941
charles	0.2542421	0.2106900	0.1524321	0.1815611	0.2010747
eurico	0.2817590	0.1928339	0.1592834	0.1895765	0.1765472
jeremy	0.3459730	0.1437390	0.1916520	0.1534392	0.1651969
pedro	0.2452107	0.1934866	0.1911877	0.1796935	0.1904215
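
The row shares above can also be reproduced without dplyr. The following is a minimal base R sketch, assuming the counts are typed in by hand from the table in section 2.a; the object names ‘counts’ and ‘shares’ are illustrative and not part of the original analysis.

```r
# Counts per user and classe, copied from the table in section 2.a
counts <- matrix(
  c(1165, 776, 750, 515, 686,
     834, 690, 493, 486, 609,
     899, 745, 539, 642, 711,
     865, 592, 489, 582, 542,
    1177, 489, 652, 522, 562,
     640, 505, 499, 469, 497),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    user_name = c("adelmo", "carlitos", "charles", "eurico", "jeremy", "pedro"),
    classe    = c("A", "B", "C", "D", "E")
  )
)

# margin = 1 divides every cell by its row total, so each user's row sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 4))
```

Each row of ‘shares’ sums to 1, so multiplying by 100 gives the percentage scale described in the text.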

2.c. Density projection of ‘classe’ variable

This density plot is meant to visualize the distribution of the classe (A-E) variable across all participants. The chunk is wrapped in if(FALSE), so it is not run when the report is rendered.

if(FALSE){
# drawing a density plot to see the classe-elements distribution
trainDataSet %>%  ggplot(aes(classe)) + geom_density(fill="salmon")-> plot.02
ggplotly(plot.02, width = 700, height = 450) %>% highlight(on = "plotly_hover", color = "red")
}

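Note: In the density plot, classe A peaks at about 2.5 and the other classes (B-E) fall off toward about 0.5. Because classe is categorical, the density scale has no direct interpretation; the counts in section 2.a are the more reliable view of the class distribution.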

3. DataSet Partition and Exploratory data Cleaning:

suppressMessages(library(caret))

## Warning: package 'caret' was built under R version 3.3.3

# partitioning the labelled data: 75% for training, 25% held out for testing
inTrain <- createDataPartition(trainDataSet$classe, p=0.75, list=FALSE)
TrainSet <- trainDataSet[inTrain, ]
TestSet  <- trainDataSet[-inTrain,]

# quick data-dimension after data partition
rbind ( TrainSet = dim(TrainSet), TestSet = dim(TestSet) )

##           [,1] [,2]
## TrainSet 14718  160
## TestSet   4904  160
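
The 75/25 split above is drawn within each class, so that both parts keep roughly the same class mix. The idea can be sketched in base R as follows, assuming a small made-up label vector; ‘toy_classe’ and ‘toy_in_train’ are hypothetical names, and this is an illustration of stratified sampling rather than the exact procedure used by ‘createDataPartition’.

```r
set.seed(1)

# Made-up labels with an uneven class mix
toy_classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(40, 20, 20, 10, 10)))

# Row indices grouped by class
idx_by_class <- split(seq_along(toy_classe), toy_classe)

# Draw 75% of the rows of every class separately
toy_in_train <- unlist(
  lapply(idx_by_class, function(idx) sample(idx, size = ceiling(0.75 * length(idx)))),
  use.names = FALSE
)

train_labels <- toy_classe[toy_in_train]
test_labels  <- toy_classe[-toy_in_train]

print(table(train_labels))
print(table(test_labels))
```

Sampling per class, instead of over all rows at once, keeps rare classes from being under-represented in either part by chance.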

#> **Note: some machine learning algorithms do not accept 'NA' values in the data set, so we handle the 'NA' entries first.

# tabulating the fraction of 'NA' values per column (rounded to 2 decimals)
table (NA_Value_Percent <- round(colMeans(is.na(TrainSet)), 2))

## 
##    0 0.98 
##   60  100

Note: The table shows that 100 variables are about 98% ‘NA’, while only 60 variables are complete. A variable that is 98% ‘NA’ carries too little information to contribute to the analysis.

# so we'd eliminate all variable-columns, where more than 96% of the input are 'NA'
All_NA_columns <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.96

# removing columns with 96% 'NA' only input from both 'Train and Test' dataset
TrainSet <- TrainSet[, All_NA_columns == FALSE]
TestSet  <- TestSet [, All_NA_columns == FALSE]

# a quick view of how many 'variable-columns' left after 'NA-elimination' process
rbind(TrainSet = dim(TrainSet), TestSet = dim(TestSet))

##           [,1] [,2]
## TrainSet 14718   60
## TestSet   4904   60
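
The ‘NA’ filter above can be illustrated on a tiny made-up data frame. This is a minimal sketch: the column values and the 0.5 cut-off are assumptions chosen so the toy example is easy to follow (the report itself uses 0.96), and ‘toy’ and ‘toy_clean’ are hypothetical names.

```r
# Toy data: one sparse column (80% NA) next to two complete columns
toy <- data.frame(
  roll_belt  = c(1.4, 1.5, 1.3, 1.6, 1.2),
  kurtosis_x = c(NA, NA, NA, NA, 0.7),
  classe     = c("A", "A", "B", "C", "E")
)

# Fraction of missing values per column
na_share <- colMeans(is.na(toy))
print(na_share)

# Keep only the columns whose NA share is at or below the cut-off
keep      <- na_share <= 0.5
toy_clean <- toy[, keep, drop = FALSE]
print(names(toy_clean))
```

As in the report, the mask should be computed on the training part only and then applied unchanged to the held-out part, so both keep the same columns.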

3.a. Covariates variation check

# covariates variability check by setting 'saveMetrics = TRUE', return a data frame with predictor info
nzv <- nearZeroVar(TrainSet, saveMetrics = TRUE)
head(nzv)

##                      freqRatio percentUnique zeroVar   nzv
## X                     1.000000  100.00000000   FALSE FALSE
## user_name             1.106084    0.04076641   FALSE FALSE
## raw_timestamp_part_1  1.066667    5.68691398   FALSE FALSE
## raw_timestamp_part_2  1.250000   88.70091045   FALSE FALSE
## cvtd_timestamp        1.037273    0.13588803   FALSE FALSE
## new_window           47.735099    0.01358880   FALSE  TRUE

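Analysis: Among the first columns, only ‘new_window’ is flagged as near-zero-variance (nzv = TRUE); the others are not, so we do not eliminate covariates on that basis. For further simplification we remove the first seven columns (the row index, user name, timestamps, and window indicators, which include ‘new_window’), because they are not relevant sensor measurements.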

TrainSet <- TrainSet[, -(1:7)]
TestSet  <- TestSet [, -(1:7)]

# final dataSet dimension after all irrelevant column elimination
rbind ( TrainSet = dim(TrainSet), TestSet = dim(TestSet))

##           [,1] [,2]
## TrainSet 14718   53
## TestSet   4904   53
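
The two metrics reported by the near-zero-variance check in section 3.a can be approximated in a few lines of base R. This is a rough sketch, not caret's implementation: ‘nzv_sketch’ is a hypothetical helper, and the cut-offs 95/5 and 10 are, to my knowledge, the defaults of ‘nearZeroVar’ and should be treated as assumptions here.

```r
# Hypothetical helper approximating the freqRatio / percentUnique metrics
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common one
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Number of distinct values as a percentage of the sample size
  percent_unique <- 100 * length(counts) / length(x)
  list(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    nzv           = freq_ratio > freq_cut && percent_unique < unique_cut
  )
}

# A flag that is almost always "no", similar in spirit to 'new_window'
flag <- c(rep("no", 96), rep("yes", 4))
res  <- nzv_sketch(flag)
print(res)

# A column where every value is distinct is not flagged
print(nzv_sketch(seq_len(100))$nzv)
```

A column is flagged only when one value dominates and there are few distinct values overall, which is why ‘new_window’ is flagged while the timestamp columns are not.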

4. Machine Learning Algorithms with Cross Validation:

Here, I used several machine learning algorithms in search of a highly accurate model: Decision Tree, Linear Discriminant Analysis (lda), Gradient Boosting (gbm), and Random Forest (rf). Cross-validation is configured through the ‘trainControl’ function, with the number of folds given for each model. I used parallel processing to reduce the processing time of the ‘gbm’ and ‘rf’ models. For the ‘rf’ model only, I also plotted the confusion matrix to visualize the accuracy for each classe level.
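
To make the fold sizes reported in the model summaries easier to read, here is a minimal base R sketch of how k-fold cross-validation divides the rows. It assumes plain random fold assignment; caret stratifies its folds by class, so its sizes can differ by a row or two. The names ‘fold_id’ and ‘training_sizes’ are illustrative.

```r
set.seed(42)

n <- 14718  # rows in the training partition
k <- 10     # number of folds

# Assign every row to one of k folds, as evenly as possible, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once; the model is fitted on the remaining rows
held_out_sizes <- as.integer(table(fold_id))
training_sizes <- n - held_out_sizes
print(training_sizes)
```

Each of the k fits therefore sees about 90% of the training partition, and the reported cross-validated accuracy is the average over the k held-out folds.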

Model.01: Decision (Classification) Tree

# setting seed and loading library 'rattle' for decision tree 
suppressMessages(library(rattle));set.seed(666)

## Warning: package 'rattle' was built under R version 3.3.3

# designing the tree using 'rpart' method
control_dt <- trainControl(method="cv", number = 10)
model_Tree <- train(classe~., method = "rpart", data = TrainSet, trControl = control_dt)

## Loading required package: rpart

## Warning: package 'rpart' was built under R version 3.3.3

# displaying brief model summary
print(model_Tree, digits = 4)

## CART 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13246, 13245, 13245, 13246, 13248, 13247, ... 
## Resampling results across tuning parameters:
## 
##   cp       Accuracy  Kappa  
##   0.03370  0.5085    0.35816
##   0.05956  0.4517    0.26912
##   0.11450  0.3225    0.05826
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.0337.

# displaying 'model_Tree' node and leaf detail
print(model_Tree$finalModel, digits = 4)

## n= 14718 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 14718 10530 A (0.28 0.19 0.17 0.16 0.18)  
##    2) roll_belt< 130.5 13492  9317 A (0.31 0.21 0.19 0.18 0.11)  
##      4) pitch_forearm< -33.95 1221     8 A (0.99 0.0066 0 0 0) *
##      5) pitch_forearm>=-33.95 12271  9309 A (0.24 0.23 0.21 0.2 0.12)  
##       10) magnet_dumbbell_y< 439.5 10414  7510 A (0.28 0.18 0.24 0.19 0.11)  
##         20) roll_forearm< 120.5 6404  3820 A (0.4 0.18 0.19 0.17 0.064) *
##         21) roll_forearm>=120.5 4010  2711 C (0.08 0.18 0.32 0.23 0.18) *
##       11) magnet_dumbbell_y>=439.5 1857   896 B (0.031 0.52 0.044 0.22 0.19) *
##    3) roll_belt>=130.5 1226    10 E (0.0082 0 0 0 0.99) *
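
The terminal nodes printed above already explain why this tree performs poorly. The sketch below copies the five leaves (marked with *) by hand from the printout; in each row, ‘n’ is the number of training rows in the leaf, ‘loss’ is how many of them do not belong to the predicted class, and ‘yval’ is the predicted class. The data frame ‘leaves’ is an illustrative construction, not an object produced by rpart.

```r
# Terminal nodes copied from the printed tree
leaves <- data.frame(
  node = c(4, 20, 21, 11, 3),
  n    = c(1221, 6404, 4010, 1857, 1226),
  loss = c(8, 3820, 2711, 896, 10),
  yval = c("A", "A", "C", "B", "E")
)

# The leaves partition the whole training set
total_rows <- sum(leaves$n)

# Accuracy on the training rows implied by the leaf losses (about 0.494)
training_accuracy <- 1 - sum(leaves$loss) / total_rows
print(round(training_accuracy, 4))

# Classes that no leaf ever predicts
never_predicted <- setdiff(c("A", "B", "C", "D", "E"), leaves$yval)
print(never_predicted)
```

No leaf predicts class D, so every class D observation is misclassified by this tree, and two large leaves (nodes 20 and 21) are wrong for most of their rows.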

# visualizing the decision tree with its node and leaf details
fancyRpartPlot(model_Tree$finalModel)

# running the 'rpart' model on 'TestSet' data and measure model accuracy rate
Test_pred <- predict(model_Tree, newdata = TestSet)
confusionMatrix(Test_pred, TestSet$classe)$overall['Accuracy']

##  Accuracy 
## 0.4967374

Upshot: The accuracy rate of the ‘rpart’ model on ‘TestSet’ data is 0.497, which is too low to be useful, so other models need to be explored.

Model.02: Linear Discriminant Analysis (lda)

suppressMessages(library(MASS));set.seed(459)

# setting 'trainControl' feature for the 'lda' model with 8-fold cross-validation method
control_lda <- trainControl(method="cv", number = 8)
model_lda  <- train(classe~., trControl = control_lda, method="lda", data=TrainSet)

# displaying brief model summary
print(model_lda, digits = 4)

## Linear Discriminant Analysis 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (8 fold) 
## Summary of sample sizes: 12879, 12877, 12878, 12879, 12878, 12879, ... 
## Resampling results:
## 
##   Accuracy  Kappa 
##   0.7038    0.6253

# using predict method to verify the model with 'TestSet' data and display model accuracy
lda_pred <- predict(model_lda, TestSet)
confusionMatrix(lda_pred, TestSet$classe)$overall['Accuracy']

##  Accuracy 
## 0.7061582

Upshot: The accuracy rate of the ‘lda’ model rises to 0.706 on ‘TestSet’ data.

Model.03: Gradient Boosting Method (gbm)

Note: Because the ‘gbm’ and Random Forest (rf) models are computationally intensive, I decided to use parallel processing to reduce computation time. Parallel processing cut the ML code's run time by almost 60% (about 12 minutes).

# all necessary library for 'gbm' model including (parallel and doParallel) for faster processing
suppressMessages(libraries("gbm", "plyr", "dplyr", "survival", "parallel", "doParallel"));set.seed(9515)

## Warning: package 'gbm' was built under R version 3.3.3

## Warning: package 'survival' was built under R version 3.3.3

## Warning: package 'doParallel' was built under R version 3.3.3

## Warning: package 'foreach' was built under R version 3.3.3

## Warning: package 'iterators' was built under R version 3.3.3

# leaving a single core for the operating system and registering the cluster
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
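
# The cluster set-up above assumes a machine with at least two cores. The following is a
# small defensive sketch using only the 'parallel' package: it guards against
# detectCores() returning NA or 1, and always stops the cluster, even if the work fails.
# The names 'n_workers', 'demo_cluster' and 'squares' are illustrative only.

```r
library(parallel)

# detectCores() may return NA on some platforms, and 1 on single-core machines
n_cores   <- detectCores()
n_workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

demo_cluster <- makeCluster(n_workers)

# Run a trivial job on the workers; 'finally' shuts the cluster down in every case
squares <- tryCatch(
  parLapply(demo_cluster, 1:4, function(i) i^2),
  finally = stopCluster(demo_cluster)
)
print(unlist(squares))
```
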

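#> ** Note: in 'trainControl', 'number' sets the number of folds for k-fold cross-validation. With method 'repeatedcv' and no 'repeats' argument, the folds are run a single time (see "repeated 1 times" in the summary), so this is equivalent to plain k-fold cross-validation. Setting 'allowParallel = TRUE' tells caret to use the cluster we registered in the previous step.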

control_gbm <- trainControl(method = "repeatedcv", number = 10, allowParallel = TRUE)
model_gbm <- train(classe~., preProcess= c("center", "scale"), trControl = control_gbm, method="gbm", data=TrainSet, verbose = FALSE)

# displaying brief model summary
print(model_gbm, digits = 4)

## Stochastic Gradient Boosting 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 13246, 13245, 13246, 13246, 13247, 13245, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy  Kappa 
##   1                   50      0.7542    0.6885
##   1                  100      0.8207    0.7730
##   1                  150      0.8536    0.8147
##   2                   50      0.8560    0.8175
##   2                  100      0.9069    0.8822
##   2                  150      0.9325    0.9146
##   3                   50      0.8959    0.8682
##   3                  100      0.9412    0.9256
##   3                  150      0.9626    0.9527
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
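
The ‘center’ and ‘scale’ pre-processing used in the model call can be sketched in base R as follows. The example assumes two made-up predictor columns; the point it illustrates is that the means and standard deviations are estimated on the training rows and then reused unchanged on new rows. The object names are hypothetical.

```r
# Made-up training and test predictors
train_x <- data.frame(roll_belt = c(1, 2, 3, 4, 5), pitch_arm = c(10, 20, 30, 40, 50))
test_x  <- data.frame(roll_belt = c(3, 6),          pitch_arm = c(30, 60))

# Estimate the centering and scaling constants on the training data only
centers <- colMeans(train_x)
scales  <- apply(train_x, 2, sd)

# Apply the same constants to both parts
train_scaled <- scale(train_x, center = centers, scale = scales)
test_scaled  <- scale(test_x,  center = centers, scale = scales)

print(round(train_scaled, 3))
print(round(test_scaled, 3))
```

Tree-based learners such as ‘gbm’ and ‘rf’ split on the ordering of values within a predictor, so centering and scaling is not expected to change their results much; it matters more for distance-based or linear methods.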

# applying 'gbm' model on 'TestSet' data
gbm_pred <- predict(model_gbm, TestSet)

# confusion Matrix summary statistics with model 'accuracy' rate 
print(confusionMatrix(gbm_pred, TestSet$classe), digits = 4)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1360   21    0    0    2
##          B   18  899   25    3    4
##          C    9   28  819   23    9
##          D    5    1   10  774   14
##          E    3    0    1    4  872
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9633          
##                  95% CI : (0.9576, 0.9684)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9536          
##  Mcnemar's Test P-Value : 6.435e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9749   0.9473   0.9579   0.9627   0.9678
## Specificity            0.9934   0.9874   0.9830   0.9927   0.9980
## Pos Pred Value         0.9834   0.9473   0.9223   0.9627   0.9909
## Neg Pred Value         0.9901   0.9874   0.9910   0.9927   0.9928
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2773   0.1833   0.1670   0.1578   0.1778
## Detection Prevalence   0.2820   0.1935   0.1811   0.1639   0.1794
## Balanced Accuracy      0.9842   0.9673   0.9704   0.9777   0.9829

confusionMatrix(gbm_pred, TestSet$classe)$overall['Accuracy']

##  Accuracy 
## 0.9632953

Upshot: Accuracy increases considerably, to 0.963, compared with the ‘lda’ model (0.706).
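
The summary statistics above follow directly from the confusion matrix. As a minimal sketch, the ‘gbm’ table is re-entered by hand below (rows are predictions, columns are the reference classes) and the overall accuracy and per-class sensitivity are recomputed; ‘cm’ is an illustrative name.

```r
# 'gbm' confusion matrix copied from the output above
cm <- matrix(
  c(1360,  21,   0,   0,   2,
      18, 899,  25,   3,   4,
       9,  28, 819,  23,   9,
       5,   1,  10, 774,  14,
       3,   0,   1,   4, 872),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over all true members of the class
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The same two lines apply to any of the confusion matrices in this report.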

Model.04: Random Forest (rf)

# loading library, setting seed and 'registering-parallel-processing'
suppressMessages(library(randomForest));set.seed(969)

## Warning: package 'randomForest' was built under R version 3.3.3

registerDoParallel(cluster)

# setting control feature with method 'repeatedcv' (9 folds, run once) and enabling the parallel processing cluster
Control_Rfo <- trainControl(method = "repeatedcv", number = 9, allowParallel = TRUE, verboseIter = FALSE)

# running 'rf' model with preprocessing method and predefined control feature
model_Rfo  <- train(classe~., method = "rf", preProcess=c("center", "scale"),  data=TrainSet, trControl = Control_Rfo)

# displaying brief model summary
print(model_Rfo, digits = 4)

## Random Forest 
## 
## 14718 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Cross-Validated (9 fold, repeated 1 times) 
## Summary of sample sizes: 13083, 13082, 13082, 13083, 13083, 13083, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa 
##    2    0.9920    0.9899
##   27    0.9926    0.9906
##   52    0.9854    0.9815
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 27.

# Evaluating the model on 'TestSet' data and calculating confusionMatrix
Rfo_pred <- predict(model_Rfo, TestSet)
confusion_Rfo <- confusionMatrix(Rfo_pred, TestSet$classe)

# confusion Matrix summary statistics with 'accuracy' rate (reusing the object computed above)
print(confusion_Rfo, digits = 4)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1391    6    0    0    0
##          B    2  939    3    1    0
##          C    0    3  848    9    4
##          D    0    1    4  792    1
##          E    2    0    0    2  896
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9923          
##                  95% CI : (0.9894, 0.9945)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9902          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9971   0.9895   0.9918   0.9851   0.9945
## Specificity            0.9983   0.9985   0.9960   0.9985   0.9990
## Pos Pred Value         0.9957   0.9937   0.9815   0.9925   0.9956
## Neg Pred Value         0.9989   0.9975   0.9983   0.9971   0.9988
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2836   0.1915   0.1729   0.1615   0.1827
## Detection Prevalence   0.2849   0.1927   0.1762   0.1627   0.1835
## Balanced Accuracy      0.9977   0.9940   0.9939   0.9918   0.9967

confusion_Rfo$overall['Accuracy']

##  Accuracy 
## 0.9922512

# plotting the 'confusion Matrix' of "Random Forest" model for classe-level verification
plot(confusion_Rfo$table, col = confusion_Rfo$byClass, main = paste("Random Forest Model Accuracy =",
            round(confusion_Rfo$overall['Accuracy'], 4)))

Upshot: The random forest model gives by far the best accuracy rate on ‘TestSet’ data (0.9923) and therefore the lowest out-of-sample error (about 0.008).

Out-Of-Sample error calculation:

Random Forest Model out of sample error: (1 - 0.9922512) = 0.008

Gradient Boosting Model out of sample error: (1 - 0.9632953) = 0.037

Linear Discriminant Analysis out of sample error: (1 - 0.7061582) = 0.294

Classification or Decision tree out of sample error: (1 - 0.4967374) = 0.503

Note: Each run of these algorithms produces slightly different accuracy rates and tree plots, because the data partition is drawn without a fixed seed.
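
The out-of-sample error is simply one minus the accuracy on the held-out ‘TestSet’. A minimal sketch, using the test-set accuracies printed in the model sections of this report (the vector name ‘test_accuracy’ is illustrative):

```r
# Test-set accuracies as printed in the model sections above
test_accuracy <- c(rf = 0.9922512, gbm = 0.9632953, lda = 0.7061582, rpart = 0.4967374)

# Out-of-sample error estimate for each model
oos_error <- 1 - test_accuracy
print(round(oos_error, 3))

# Model with the smallest estimated error
best_model <- names(which.min(oos_error))
print(best_model)
```
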

Applying ML-models on 20 test-case data set:

Only three of the models (‘rf’, ‘gbm’, ‘lda’) are applied to the 20 test cases (‘testDataSet’) provided with the project instructions, to predict the classe level of each case.

print(predict(model_Rfo, newdata = testDataSet))

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

print(predict(model_gbm, newdata = testDataSet))

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Analysis: The ‘random-forest’ and ‘gbm’ models make exactly the same predictions on all 20 cases of ‘testDataSet’, which is consistent with the high accuracy both reached on ‘TestSet’.

print(predict(model_lda, newdata = testDataSet))

##  [1] B A B C C E D D A A D A B A E A A B B B
## Levels: A B C D E
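
The agreement between the models can be quantified directly from the printed predictions. In this small sketch the two prediction vectors are copied by hand from the outputs above (the ‘gbm’ vector is identical to the ‘rf’ one); ‘rf_cases’ and ‘lda_cases’ are illustrative names.

```r
# Predictions for the 20 test cases, copied from the printed output
rf_cases  <- strsplit("B A B A A E D B A A B C B A E E A B B B", " ")[[1]]
lda_cases <- strsplit("B A B C C E D D A A D A B A E A A B B B", " ")[[1]]

# Case-by-case agreement between the two models
agree <- rf_cases == lda_cases
print(sum(agree))      # number of cases where both models agree
print(which(!agree))   # test cases where 'lda' differs from 'rf'
```

The two models agree on 14 of the 20 cases, a rate of 0.70 that is in line with the roughly 0.71 accuracy of ‘lda’ on ‘TestSet’.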

# finally shutting down the parallel-processing cluster
stopCluster(cluster)
# returning 'R' to sequential (single-threaded) processing
registerDoSEQ()