Our variables hardly correlate with each other, which makes sense because they are recordings of electrical signals from several different sensors.
From the correlation plot, we would expect the best model to train, in terms of speed and accuracy, to be Naive Bayes, because it assumes the variables are independent, and the worst performer to be the Decision Tree, since a single tree relies on interactions between variables that our data barely has. The Random Forest should have some advantage: because it is built from several decision trees, it leverages their combined power to cover a single tree's weakness.
Model Training
In this project, we will model our variables using the Naive Bayes, Decision Tree, and Random Forest algorithms with the caret package.
This model training is computationally intensive: caret runs a loop over every cross-validation resample, and each iteration in R works on its own copy of the data. To reduce the computation time, we can configure caret to use 4 of our available cores with the doMC package, as shown below.
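A minimal sketch of that setup (assuming the doMC package is installed; doMC only works on Unix-like systems):
library(doMC)
registerDoMC(cores = 4)   # caret's train() will now run its resampling loop on 4 cores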
First of all, we will check the proportions of our target variable to see how balanced the observations are.
round(prop.table(table(df$X65)), 2)
##
## PAPER ROCK SCISSOR SIGN
## 0.25 0.25 0.25 0.25
Our target variable is perfectly balanced across the dataset, which helps ensure good prediction quality when we train our models on it.
Next, we will split the df dataset into train and test variables, with 80% of the observations going to train and the rest to test.
inTraining <- createDataPartition(df$X65, p = .8, list = FALSE)
train <- df[ inTraining,]
test <- df[-inTraining,]
The train variable will be used for training our models and the test variable for testing their predictions.
By default, the train() function without any resampling arguments re-runs the model over 25 bootstrap samples, so we will create a control object that uses repeated cross-validation with 5 folds and 3 repeats.
fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
Naive Bayes
nb_mod <- train(X65~.,
data = train,
method = "nb",
trControl = fitControl)
Decision Tree
dt_mod <- train(X65~.,
data = train,
method = "rpart",
trControl = fitControl)
Random Forest
rf_mod <- train(X65~.,
data = train,
method = "rf",
trControl = fitControl)
Model Evaluation
Now we will use the predict() function on our test dataset and confusionMatrix(), with the test labels as the reference, to see the results.
Naive Bayes
## Naive Bayes
##
## 9344 samples
## 64 predictor
## 4 classes: 'PAPER', 'ROCK', 'SCISSOR', 'SIGN'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 7475, 7474, 7475, 7477, 7475, 7475, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.8878425 0.8504687
## TRUE 0.9077484 0.8770033
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
nb_test <- predict(nb_mod, test)
confusionMatrix(nb_test, test$X65)
## Confusion Matrix and Statistics
##
## Reference
## Prediction PAPER ROCK SCISSOR SIGN
## PAPER 536 10 12 20
## ROCK 6 548 1 60
## SCISSOR 17 0 556 40
## SIGN 29 24 11 464
##
## Overall Statistics
##
## Accuracy : 0.9015
## 95% CI : (0.8886, 0.9133)
## No Information Rate : 0.2519
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8686
##
## Mcnemar's Test P-Value : 0.00000227
##
## Statistics by Class:
##
## Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity 0.9116 0.9416 0.9586 0.7945
## Specificity 0.9759 0.9618 0.9675 0.9634
## Pos Pred Value 0.9273 0.8911 0.9070 0.8788
## Neg Pred Value 0.9704 0.9802 0.9861 0.9336
## Prevalence 0.2519 0.2494 0.2485 0.2502
## Detection Rate 0.2296 0.2348 0.2382 0.1988
## Detection Prevalence 0.2476 0.2635 0.2626 0.2262
## Balanced Accuracy 0.9438 0.9517 0.9631 0.8790
Our Naive Bayes model seems to perform quite well: the in-sample accuracy is 0.907, the out-of-sample accuracy is 0.901, and it is very fast to train. The small difference between in-sample and out-of-sample accuracy shows that the model is not overfitting the train dataset.
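As a quick check on the "fast to train" claim (not part of the original output), caret stores the elapsed training time inside the fitted object:
nb_mod$times$everything   # user/system/elapsed time for the whole train() call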
Why does Naive Bayes perform so well on our dataset? Naive Bayes treats each predictor as independent of every other variable in the model, and our variables have little to no correlation with each other.
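To make that assumption explicit: for each gesture class y, Naive Bayes scores an observation roughly as
P(y | x_1, ..., x_64) ∝ P(y) × P(x_1 | y) × ... × P(x_64 | y)
so any correlation between the 64 sensor readings is ignored by construction, and when the predictors are nearly uncorrelated this simplification costs us very little accuracy.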
Decision Tree
## CART
##
## 9344 samples
## 64 predictor
## 4 classes: 'PAPER', 'ROCK', 'SCISSOR', 'SIGN'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 7474, 7475, 7476, 7476, 7475, 7476, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.09314637 0.4449933 0.2610775
## 0.09700959 0.4030115 0.2050581
## 0.11303477 0.3272341 0.1035685
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.09314637.
dt_test <- predict(dt_mod, test)
confusionMatrix(dt_test,test$X65)
## Confusion Matrix and Statistics
##
## Reference
## Prediction PAPER ROCK SCISSOR SIGN
## PAPER 0 0 0 0
## ROCK 24 383 1 41
## SCISSOR 564 199 579 543
## SIGN 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.4122
## 95% CI : (0.3921, 0.4325)
## No Information Rate : 0.2519
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.2176
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity 0.0000 0.6581 0.9983 0.0000
## Specificity 1.0000 0.9623 0.2554 1.0000
## Pos Pred Value NaN 0.8530 0.3072 NaN
## Neg Pred Value 0.7481 0.8944 0.9978 0.7498
## Prevalence 0.2519 0.2494 0.2485 0.2502
## Detection Rate 0.0000 0.1641 0.2481 0.0000
## Detection Prevalence 0.0000 0.1924 0.8076 0.0000
## Balanced Accuracy 0.5000 0.8102 0.6268 0.5000
(ggplot(dt_mod) +
labs(title = "Accuracy Graph", subtitle = "Decision Tree") +
theme(axis.title.y = element_text(angle = 90))) %>%
ggplotly() %>%
layout(title = list(text = paste0("Accuracy Graph",
"<br>",
"<sup>",
"Decision Tree",
"</sup>")))
Our decision tree model’s in-sample accuracy is bad, and it is even worse out of sample. One way to interpret this is that a decision tree relies on interactions between predictors, while in our case the variables have close to no interaction.
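One way to see this directly (a quick sketch, not shown in the original run) is to inspect the pruned rpart tree that caret selected; with cp = 0.093 it likely keeps only a few splits, which is why two of the classes are never predicted at all.
dt_mod$finalModel                          # print the fitted rpart tree and its splits
rpart.plot::rpart.plot(dt_mod$finalModel)  # optional plot, assuming the rpart.plot package is installed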
(ggplot(varImp(dt_mod)) +
labs(title = "Variable Importance",
subtitle = "Decison Tree") +
theme(axis.title.y = element_text(angle = 90),
axis.text.y = element_blank())) %>%
ggplotly() %>%
layout(title = list(text = paste0("Variable Importance",
"<br>",
"<sup>",
"Decision Tree",
"</sup>")))
Take a look at our decision tree model’s variable importances: only a few variables are deemed to hold any importance in the model, and the others hold no importance at all. This is what we call a rigid model; the variable importances fall into three clusters, one between 80-100, one between 40-60, and one at 0.
Random Forest
## Random Forest
##
## 9344 samples
## 64 predictor
## 4 classes: 'PAPER', 'ROCK', 'SCISSOR', 'SIGN'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 7475, 7475, 7475, 7477, 7474, 7476, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9348606 0.9131479
## 33 0.9132068 0.8842689
## 64 0.9114596 0.8819389
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
(ggplot(rf_mod) +
labs(title = "Accuracy Graph", subtitle = "Random Forest") +
theme(axis.title.y = element_text(angle = 90))) %>%
ggplotly() %>%
layout(title = list(text = paste0("Accuracy Graph",
"<br>",
"<sup>",
"Random Forest",
"</sup>")))
rf_test <- predict(rf_mod, test)
confusionMatrix(rf_test,test$X65)
## Confusion Matrix and Statistics
##
## Reference
## Prediction PAPER ROCK SCISSOR SIGN
## PAPER 560 8 13 29
## ROCK 5 567 1 58
## SCISSOR 7 0 560 17
## SIGN 16 7 6 480
##
## Overall Statistics
##
## Accuracy : 0.9284
## 95% CI : (0.9172, 0.9386)
## No Information Rate : 0.2519
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.9046
##
## Mcnemar's Test P-Value : 0.000000001463
##
## Statistics by Class:
##
## Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity 0.9524 0.9742 0.9655 0.8219
## Specificity 0.9714 0.9635 0.9863 0.9834
## Pos Pred Value 0.9180 0.8986 0.9589 0.9430
## Neg Pred Value 0.9838 0.9912 0.9886 0.9430
## Prevalence 0.2519 0.2494 0.2485 0.2502
## Detection Rate 0.2399 0.2429 0.2399 0.2057
## Detection Prevalence 0.2614 0.2704 0.2502 0.2181
## Balanced Accuracy 0.9619 0.9688 0.9759 0.9027
Our random forest performed as the best model so far for the muscle electric signal readings: because it is a forest built from several decision trees, its performance is very high, with the drawback of a long training time. We ended up with 0.934 in-sample accuracy and 0.928 out-of-sample accuracy.
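If we want the three caret models lined up over their cross-validation resamples in one place (a small sketch, not part of the original run), caret's resamples() helper summarises Accuracy and Kappa side by side:
model_comparison <- resamples(list(NaiveBayes   = nb_mod,
                                   DecisionTree = dt_mod,
                                   RandomForest = rf_mod))
summary(model_comparison)   # per-model distribution of cross-validated Accuracy and Kappa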
(ggplot(varImp(rf_mod)) +
labs(title = "Variable Importance", subtitle = "Random Forest") +
theme(axis.title.y = element_text(angle = 90),
axis.text.y = element_blank())) %>%
ggplotly() %>%
layout(title = list(text = paste0("Feature Importance",
"<br>",
"<sup>",
"Random Forest",
"</sup>")))
Remember the variable importance in our decision tree? In our random forest, the importances again fall into clusters: one where the values nearly reach 100, one between 40-60, and one between 0-20.
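To read the actual numbers behind that plot, we can pull the importance table out of varImp() and sort it (a small sketch, assuming the random forest importances sit in the Overall column, as they do for caret's "rf" method by default):
rf_imp <- varImp(rf_mod)$importance                       # data frame of scaled importances
head(rf_imp[order(-rf_imp$Overall), , drop = FALSE], 10)  # top 10 sensors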
Model Optimization
In this model optimization part, we will use the Python library scikit-learn (called from R via reticulate) to help us train our Random Forest classifier faster.
We separate the independent and dependent variables out of our train and test datasets so we can pass them to the Python model.
train_X <- as.matrix(train[,-65])
test_X <- as.matrix(test[,-65])
train_y <- as.matrix(train[,65])
test_y <- as.matrix(test[,65])
use_python("/home/jevian/anaconda3/bin/python")
Now we will use StandardScaler from sklearn.preprocessing to scale our data, RepeatedKFold to implement K-fold cross-validation with 5 folds and 3 repeats, and RandomizedSearchCV from sklearn.model_selection to sample 24 iterations in search of the most promising range for the best parameters. We train a RandomForestClassifier from sklearn.ensemble with n_estimators values of 100, 200, 350, and 500, max_features of auto, sqrt, and log2, max_depth values of 2, 8, 16, 32, and 64, and a criterion of gini or entropy. We will use 3 of our cores for the randomized search and 2 cores for each fit. Don't forget to set random_state to 0 in the model, the CV splitter, and the randomized search for reproducibility.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([
('scale', StandardScaler()),
('rfc', RandomForestClassifier())
])
parameter = {
'rfc__n_estimators': [100, 200, 350, 500],
'rfc__max_features': ['auto', 'sqrt', 'log2'],
'rfc__max_depth': [2, 8, 16, 32, 64],
'rfc__criterion': ['gini', 'entropy'],
'rfc__n_jobs': [2],
'rfc__random_state': [0]
}
rkfcv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
GSCV_rfc = RandomizedSearchCV(estimator=pipe, n_iter=24, verbose=2, param_distributions=parameter, cv=rkfcv, scoring='accuracy', n_jobs=3, random_state=0)
GSCV_rfc.fit(r.train_X, np.ravel(r.train_y))
Now that we have a range for the best parameters, we will search for the best parameters again using GridSearchCV from sklearn.model_selection, this time using values near the best_params_ of our randomized-search model.
print('Train-set Accuracy: ', GSCV_rfc.score(r.train_X, np.ravel(r.train_y)))
## Train-set Accuracy: 1.0
print('Test-set Accuracy: ', GSCV_rfc.score(r.test_X, np.ravel(r.test_y)))
## Test-set Accuracy: 0.921165381319623
print("Best: %f using %s" % (GSCV_rfc.best_score_, GSCV_rfc.best_params_))
## Best: 0.928581 using {'rfc__random_state': 0, 'rfc__n_jobs': 2, 'rfc__n_estimators': 500, 'rfc__max_features': 'log2', 'rfc__max_depth': 64, 'rfc__criterion': 'entropy'}
Our model fits the training dataset perfectly and has 0.921 accuracy on the test set. We will reuse the parameters from this model and search for the best max_depth between 50 and 72 (and n_estimators between 500 and 1000) using grid search.
from sklearn.model_selection import GridSearchCV
params_grid = {
'rfc__n_estimators': [500, 625, 750, 875, 1000],
'rfc__max_features': ['log2'],
'rfc__max_depth': [50, 52, 54, 58, 60, 62, 64, 66, 68, 70, 72],
'rfc__criterion': ['entropy'],
'rfc__n_jobs': [2],
'rfc__random_state': [0]
}
final = GridSearchCV(estimator=pipe, verbose=2, param_grid=params_grid, cv=rkfcv, scoring='accuracy', n_jobs=3)
final.fit(r.train_X, np.ravel(r.train_y))
print('Train-set Accuracy: ', final.score(r.train_X, np.ravel(r.train_y)))
## Train-set Accuracy: 1.0
print('Test-set Accuracy: ', final.score(r.test_X, np.ravel(r.test_y)))
## Test-set Accuracy: 0.9215938303341902
print('In-sample Accuracy: ', final.best_score_)
## In-sample Accuracy: 0.9296156838213476
GSCV_rfc_pred = GSCV_rfc.predict(r.test_X)
pred_y = final.predict(r.test_X)
Finally, we evaluate the classification predictions using R again.
confusionMatrix(as.factor(py$GSCV_rfc_pred), test$X65)
## Confusion Matrix and Statistics
##
## Reference
## Prediction PAPER ROCK SCISSOR SIGN
## PAPER 550 7 15 29
## ROCK 7 567 1 39
## SCISSOR 11 0 539 22
## SIGN 20 8 25 494
##
## Overall Statistics
##
## Accuracy : 0.9212
## 95% CI : (0.9095, 0.9318)
## No Information Rate : 0.2519
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8949
##
## Mcnemar's Test P-Value : 0.0005433
##
## Statistics by Class:
##
## Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity 0.9354 0.9742 0.9293 0.8459
## Specificity 0.9708 0.9732 0.9812 0.9697
## Pos Pred Value 0.9151 0.9235 0.9423 0.9031
## Neg Pred Value 0.9781 0.9913 0.9767 0.9496
## Prevalence 0.2519 0.2494 0.2485 0.2502
## Detection Rate 0.2356 0.2429 0.2309 0.2117
## Detection Prevalence 0.2575 0.2631 0.2451 0.2344
## Balanced Accuracy 0.9531 0.9737 0.9552 0.9078
rf_pred <- py$pred_y
confusionMatrix(as.factor(rf_pred), test$X65)
## Confusion Matrix and Statistics
##
## Reference
## Prediction PAPER ROCK SCISSOR SIGN
## PAPER 551 7 16 28
## ROCK 5 566 1 38
## SCISSOR 12 0 538 22
## SIGN 20 9 25 496
##
## Overall Statistics
##
## Accuracy : 0.9216
## 95% CI : (0.9099, 0.9322)
## No Information Rate : 0.2519
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8955
##
## Mcnemar's Test P-Value : 0.001605
##
## Statistics by Class:
##
## Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity 0.9371 0.9725 0.9276 0.8493
## Specificity 0.9708 0.9749 0.9806 0.9691
## Pos Pred Value 0.9153 0.9279 0.9406 0.9018
## Neg Pred Value 0.9786 0.9907 0.9762 0.9507
## Prevalence 0.2519 0.2494 0.2485 0.2502
## Detection Rate 0.2361 0.2425 0.2305 0.2125
## Detection Prevalence 0.2579 0.2614 0.2451 0.2356
## Balanced Accuracy 0.9539 0.9737 0.9541 0.9092
We ended up with only a 0.0004 improvement in out-of-sample accuracy from the exhaustive grid search over the randomized search.