library(readr)     # read CSV files
library(dplyr)     # data wrangling
library(magrittr)  # useful piping operators
library(tidyr)     # tidying data
library(ggplot2)   # main plotting package
library(ggfortify) # PCA plotting
library(plotly)    # interactive plots
library(patchwork) # easily arrange multiple plots
library(caret)     # short for Classification And Regression Training
set.seed(11)       # for reproducibility of results
theme_set(hrbrthemes::theme_ipsum()) # set plotting theme

Introduction

EMG: anatomical and physiological background

EMG stands for electromyography. It is the study of muscle electrical signals. EMG is sometimes referred to as myoelectric activity. Muscle tissue conducts electrical potentials similar to the way nerves do and the name given to these electrical signals is the muscle action potential. Surface EMG is a method of recording the information present in these muscle action potentials.

When detecting and recording the EMG signal, there are two main issues of concern that influence the fidelity of the signal. The first is the signal-to-noise ratio. That is, the ratio of the energy in the EMG signals to the energy in the noise signal. In general, noise is defined as electrical signals that are not part of the desired EMG signal.
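As a rough illustration (a toy sketch on made-up vectors, not part of any recording pipeline), the signal-to-noise ratio is commonly expressed in decibels as ten times the log of the power ratio:

# Toy illustration only: hypothetical signal and noise vectors.
signal <- sin(seq(0, 2 * pi, length.out = 200)) # stand-in for an EMG trace
noise  <- rnorm(200, sd = 0.1)                  # stand-in for baseline noise
10 * log10(mean(signal^2) / mean(noise^2))      # SNR in decibels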

The other issue is distortion of the signal, meaning that the relative contribution of any frequency component in the EMG signal should not be altered. Two types of electrodes have been used to acquire muscle signals: invasive and non-invasive electrodes.

When EMG is acquired from electrodes mounted directly on the skin, the signal is a composite of all the muscle fiber action potentials occurring in the muscles underlying the skin. These action potentials occur at random intervals, so at any one moment the EMG signal may be either a positive or a negative voltage.

Individual muscle fiber action potentials are sometimes acquired using wire or needle electrodes placed directly in the muscle. The combination of the muscle fiber action potentials from all the muscle fibers of a single motor unit is the motor unit action potential (MUAP) which can be detected by a skin surface electrode (non-invasive) located near this field, or by a needle electrode (invasive) inserted in the muscle (Reaz, Hussain & Mohd-Yasin, 2006).

Electrical Noise and Factors Affecting EMG Signal

The amplitude range of the EMG signal is 0-10 mV (-5 to +5 mV) prior to amplification. EMG signals acquire noise while traveling through different tissues, so it is important to understand the characteristics of this electrical noise. The electrical noise affecting EMG signals can be categorized into the following types:

  • Inherent noise in electronic equipment: All electronic equipment generates noise. This noise cannot be eliminated; using high-quality electronic components can only reduce it.

  • Ambient noise: Electromagnetic radiation is the source of this kind of noise. The surfaces of our bodies are constantly inundated with electromagnetic radiation, and it is virtually impossible to avoid exposure to it on the surface of the earth. Ambient noise may have an amplitude one to three orders of magnitude greater than the EMG signal.

  • Motion artifact: When motion artifact is introduced to the system, the information is skewed, causing irregularities in the data. There are two main sources of motion artifact: 1) the electrode interface and 2) the electrode cable. Motion artifact can be reduced by proper design of the electronic circuitry and set-up.

  • Inherent instability of the signal: The amplitude of EMG is random in nature. The EMG signal is affected by the firing rate of the motor units, which, in most conditions, fire in the frequency region of 0 to 20 Hz. This kind of noise is unwanted, and removing it is important.

The factors that mainly affect the EMG signal can also be classified. This classification is set out so that EMG signal analysis algorithms can be optimized and equipment can be designed in a consistent manner. The factors affecting the EMG signal fall into three basic categories:

  • Causative Factors: These directly affect the signal. Causative factors can be divided into two classes:

    • Extrinsic: These are due to electrode structure and placement. Factors like the area of the detection surfaces, the shape of the electrodes, the distance between the electrode detection surfaces, the location of the electrodes with respect to the motor points in the muscle, the location of the electrodes on the muscle surface with respect to the lateral edge of the muscle, and the orientation of the detection surfaces with respect to the muscle fibers mainly affect the EMG signal.

    • Intrinsic: These are physiological, anatomical, and biochemical factors, such as the number of active motor units, fiber type composition, blood flow, fiber diameter, the depth and location of active fibers, and the amount of tissue between the surface of the muscle and the electrode.

  • Intermediate Factors: These are physical and physiological phenomena influenced by one or more causative factors. They include the band-pass filtering aspects of the electrode along with its detection volume, the superposition of action potentials in the detected EMG signal, and the conduction velocity of the action potentials that propagate along the muscle fiber membrane. Even crosstalk from nearby muscles can contribute to these Intermediate Factors.

  • Deterministic Factors: These are influenced by the Intermediate Factors. The number of active motor units, the motor unit firing rate, and the mechanical interaction between muscle fibers have a direct bearing on the information in the EMG signal and the recorded force. The amplitude, duration, and shape of the motor unit action potentials can also be responsible.

The quality of the EMG signal can be maximized in the following ways:

  • The signal-to-noise ratio should be as high as possible, so that the signal carries the maximum amount of EMG information with the minimum amount of noise contamination.

  • The distortion of the EMG signal should be kept minimal: no unnecessary filtering, no distortion of signal peaks, and notch filters are not recommended.

During EMG signal processing, only positive values are analyzed. When half-wave rectification is performed, all negative data are discarded and the positive data are kept. During full-wave rectification, the absolute value of each data point is used. Full-wave rectification is usually preferred (Reaz, Hussain & Mohd-Yasin, 2006).
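In code, the two rectification schemes are one-liners; here is a toy sketch on a made-up vector of samples:

emg <- c(-2.3, 1.1, -0.4, 3.2, -1.8) # hypothetical samples in mV
half_wave <- pmax(emg, 0)            # half-wave: negative samples zeroed out
full_wave <- abs(emg)                # full-wave: absolute value of each sample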

Objective

In this project, we will try to classify the muscle's electrical response signals with three classification algorithms, namely Naive Bayes, Decision Tree, and Random Forest.

Data

  • Source: Classify gestures by reading muscle activity.

  • License: CC0: Public Domain

  • Visibility: Public

  • Dataset Owner: Kirill Yashuk

  • Expected Update Frequency: Not specified

  • Last updated: 2018-12-10

  • Date created: 2018-12-08

  • Current version: Version 2

Context

Kirill Yashuk: My friends and I are creating an open source prosthetic control system which would enable prosthetic devices to have multiple degrees of freedom. (https://github.com/cyber-punk-me)


The system is built of several components. It connects a muscle activity (EMG, electromyography) sensor to a user's Android/Android Things app. The app collects data, then a server builds a TensorFlow model specifically for this user. After that, the model can be downloaded and executed on the device to control motors or other appendages.

This dataset can be used to map user residual muscle gestures to certain actions of a prosthetic such as open/close hand or rotate wrist.

For reference, please watch a video on this topic: Living with a mind-controlled robot arm.

Content

Four classes of motion were recorded from a MYO armband with the help of our app (https://github.com/cyber-punk-me/nukleos). The MYO armband has 8 sensors placed on the skin surface, each of which measures the electrical activity produced by the muscles beneath.

Each dataset line has 8 consecutive readings of all 8 sensors, so 64 columns of EMG data. The last column is the gesture that was being made while recording the data (classes 0-3). So each line has the following structure:

[8sensors][8sensors][8sensors][8sensors][8sensors][8sensors][8sensors][8sensors][GESTURE_CLASS]

Data was recorded at 200 Hz, which means that each line covers 8 × (1/200) s = 40 ms of recording time.

A classifier given these 64 numbers should predict the gesture class (0-3). The gesture classes were: rock - 0, scissors - 1, paper - 2, ok - 3. The rock, paper, and scissors gestures are like in the game of the same name, and the OK sign is the index finger touching the thumb with the rest of the fingers spread. The gestures were selected pretty much randomly.

Each gesture was recorded 6 times for 20 seconds each. Recording started with the gesture already prepared and held, and stopped while the gesture was still being held, so in total there are 120 seconds of each gesture held in a fixed position. All recordings come from the same right forearm within a short timespan. Every recording of a given gesture class was concatenated into a .csv file with the corresponding name (0-3).
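As a quick sanity check on the expected file sizes (a rough estimate; the actual files may contain slightly fewer lines):

# 6 recordings x 20 s = 120 s per gesture, sampled at 200 Hz,
# grouped 8 readings per line:
6 * 20 * 200 / 8 # ~3000 expected lines per gesture class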

Inspiration

Be one of the real cyber punks inventing electronic appendages. Let’s help people who really need it. There’s a lot of work and cool stuff ahead =)

Methodology

We load each dataset into a variable named after the corresponding hand gesture:

rock <- read_csv("0.csv", col_names = FALSE)
scissor <- read_csv("1.csv", col_names = FALSE)
paper <- read_csv("2.csv", col_names = FALSE)
okay <- read_csv("3.csv", col_names = FALSE)

Data Exploration

We start our journey by visualizing the data. Our data are muscle responses recorded while someone makes a certain gesture with their hand: the muscles respond to the brain's stimulation to form each shape, which results in different muscle electrical signal values.

# Helper that melts a gesture data frame and draws its signal profile;
# alpha is a constant, so it is set outside aes() rather than mapped.
plot_gesture <- function(data, title) {
  data %>%
    pivot_longer(everything(),
                 names_to = "key",
                 values_to = "value") %>%
    ggplot(aes(key, value, col = value)) +
    geom_line(alpha = 0.8) +
    labs(title = title) +
    theme(axis.ticks = element_blank(),
          axis.line = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank(),
          axis.title.x.bottom = element_blank(),
          axis.title.y.left = element_blank(),
          legend.position = "none")
}

p1 <- plot_gesture(rock, "Rock gesture")
p2 <- plot_gesture(scissor, "Scissor gesture")
p3 <- plot_gesture(paper, "Paper gesture")
p4 <- plot_gesture(okay, "Okay sign")
(p1 + p2) / (p3 + p4)

This plot is a visual representation of the muscle responses.

We can see clearly from the plot that the electrical activity in response to the nerves' stimulation of the muscles differs when we make a fist, open our hand, make the okay sign, or make the scissor gesture.

The band is a bit thicker when we clench our fist and thinner when we make the scissor gesture, but looks much the same when we make the okay sign. It is pretty hard to see the difference by eye, so it will be the task of our machine learning algorithms to crunch the numbers and tell the classes apart using the values in our data.

Data Preprocessing

We will join the datasets and convert our gesture category into a factor.

df <- rock %>% 
  full_join(scissor) %>%
  full_join(paper) %>%
  full_join(okay)

df %<>%
  mutate(X65 = as.factor(ifelse(X65 == 0, "ROCK",
                                ifelse(X65 == 1, "SCISSOR",
                                       ifelse(X65 == 2, "PAPER", "SIGN")))))
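As a side note, the same recoding can be written with dplyr::case_when, which reads more cleanly than nested ifelse calls. This is an equivalent alternative to the mutate above, not an extra step to run after it:

# Alternative to the nested ifelse above (run instead of it, not after it,
# since X65 is already a factor once the original mutate has run).
df <- df %>%
  mutate(X65 = factor(case_when(
    X65 == 0 ~ "ROCK",
    X65 == 1 ~ "SCISSOR",
    X65 == 2 ~ "PAPER",
    TRUE     ~ "SIGN"
  )))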

Let's take a look at the first rows of our joined dataset.

df %>%
  rmarkdown::paged_table()

Now, we check for missing values resulting from our join.

sapply(df, function(x)sum(is.na(x)))
##  X1  X2  X3  X4  X5  X6  X7  X8  X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## X41 X42 X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 X57 X58 X59 X60 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## X61 X62 X63 X64 X65 
##   0   0   0   0   0

There are no missing values in our joined dataset. Next, we run a principal component analysis (PCA) to get a feel for the structure of the data.

prcomp(df[,-65], scale. = TRUE) %>%
  autoplot(data = df, 
           colour = 'X65', 
           shape = 'X65', 
           alpha = 0.7, 
           loadings = TRUE, 
           scale = 0) +
  ggtitle("Principal Component Analysis") +
  theme(
    axis.title.y = element_text(angle = 90)
  )

We could eliminate variables that have almost no variance from our dataset with the nearZeroVar() function, but it is best to keep all of the variables, since the percentage of variance explained by PC1 and PC2 is quite low and the data come from sensors on the same hand. We will proceed to the modeling step with X65 as our target variable.
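For reference, the check would look like the following sketch (we do not actually drop any columns here):

nzv <- nearZeroVar(df[, -65]) # indices of (near) zero-variance predictors
length(nzv)                   # how many columns would be dropped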

Data Modeling

p_mat <- ggcorrplot::cor_pmat(df[,-65])
(ggcorrplot::ggcorrplot(cor(df[1:64]),
                   type = "lower",
                   colors = RColorBrewer::brewer.pal(n = 3, name = "RdBu"),
                   method = "square",
                   outline.color = "white",
                   tl.cex = 0.5,
                   insig = "blank",
                   p.mat = p_mat) +
  labs(title = "Correlation Matrix Plot"))

Our variables are hardly correlated with each other, which makes sense because they are records of electrical signals from several different sensors.

From the correlation plot, we can predict that the best model to train, in terms of speed and accuracy, would be Naive Bayes, because it assumes independence between variables, and that the worst performer would be the Decision Tree. The Random Forest should have an advantage: it is built from several decision trees, so it leverages their combined power to cover a single tree's weakness, namely its reliance on interactions between variables.

Model Training

In this project, we will model our variables using the Naive Bayes, Decision Tree, and Random Forest algorithms via the caret package.

This model training is computationally intensive: caret runs its cross-validation resamples in a loop, and these fits are independent of one another. To reduce the computation time, we configure caret to use 4 of our available cores with the doMC package.

doMC::registerDoMC(4)

First of all, we check the proportions of our target variable to see how balanced our observations are.

round(prop.table(table(df$X65)), 2)
## 
##   PAPER    ROCK SCISSOR    SIGN 
##    0.25    0.25    0.25    0.25

Our target variable is perfectly balanced across the dataset, so training on it should not bias the model toward any one class.

Next, we split our df dataset into train and test sets, with 80% of the observations going to train and the rest to test.

inTraining <- createDataPartition(df$X65, p = .8, list = FALSE)

train <- df[ inTraining,]
test  <- df[-inTraining,]

The train set is used for training our models, and the test set will be used for testing their predictions.

By default, the train() function resamples the model over 25 bootstrap samples, so we assign a new control object to use the cross-validation method with 5 folds and 3 repeats.

fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

Naive Bayes

nb_mod <- train(X65~., 
               data = train, 
               method = "nb",
               trControl = fitControl)

Decision Tree

dt_mod <- train(X65~., 
               data = train, 
               method = "rpart",
               trControl = fitControl)

Random Forest

rf_mod <- train(X65~., 
               data = train, 
               method = "rf", 
               trControl = fitControl)

Model Evaluation

Now, we use the predict() function on our test dataset and confusionMatrix() with test$X65 as the reference to see the results.

Naive Bayes

nb_mod
## Naive Bayes 
## 
## 9344 samples
##   64 predictor
##    4 classes: 'PAPER', 'ROCK', 'SCISSOR', 'SIGN' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 7475, 7474, 7475, 7477, 7475, 7475, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.8878425  0.8504687
##    TRUE      0.9077484  0.8770033
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.
nb_test <- predict(nb_mod, test)
confusionMatrix(nb_test, test$X65)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction PAPER ROCK SCISSOR SIGN
##    PAPER     536   10      12   20
##    ROCK        6  548       1   60
##    SCISSOR    17    0     556   40
##    SIGN       29   24      11  464
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9015               
##                  95% CI : (0.8886, 0.9133)     
##     No Information Rate : 0.2519               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8686               
##                                                
##  Mcnemar's Test P-Value : 0.00000227           
## 
## Statistics by Class:
## 
##                      Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity                0.9116      0.9416         0.9586      0.7945
## Specificity                0.9759      0.9618         0.9675      0.9634
## Pos Pred Value             0.9273      0.8911         0.9070      0.8788
## Neg Pred Value             0.9704      0.9802         0.9861      0.9336
## Prevalence                 0.2519      0.2494         0.2485      0.2502
## Detection Rate             0.2296      0.2348         0.2382      0.1988
## Detection Prevalence       0.2476      0.2635         0.2626      0.2262
## Balanced Accuracy          0.9438      0.9517         0.9631      0.8790

Our Naive Bayes model seems to perform quite well: the in-sample accuracy is 0.907 and the out-of-sample accuracy is 0.901, and it is very fast to train. The small gap between in-sample and out-of-sample accuracy shows that our model is not overfitting the train dataset.

Why does Naive Bayes perform so well on our dataset? Naive Bayes considers each predictor variable to be independent of every other variable in the model, and our variables have little to no correlation with each other.
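Concretely, Naive Bayes picks the gesture class that maximizes the class prior times a product of per-predictor likelihoods,

$$\hat{y} = \arg\max_{k}\; P(C_k) \prod_{i=1}^{64} P(x_i \mid C_k),$$

a factorization that is exact only when the predictors are conditionally independent within each class, which our near-zero correlations approximate well.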

Decision Tree

dt_mod
## CART 
## 
## 9344 samples
##   64 predictor
##    4 classes: 'PAPER', 'ROCK', 'SCISSOR', 'SIGN' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 7474, 7475, 7476, 7476, 7475, 7476, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.09314637  0.4449933  0.2610775
##   0.09700959  0.4030115  0.2050581
##   0.11303477  0.3272341  0.1035685
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.09314637.
dt_test <- predict(dt_mod, test)
confusionMatrix(dt_test,test$X65)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction PAPER ROCK SCISSOR SIGN
##    PAPER       0    0       0    0
##    ROCK       24  383       1   41
##    SCISSOR   564  199     579  543
##    SIGN        0    0       0    0
## 
## Overall Statistics
##                                                
##                Accuracy : 0.4122               
##                  95% CI : (0.3921, 0.4325)     
##     No Information Rate : 0.2519               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.2176               
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity                0.0000      0.6581         0.9983      0.0000
## Specificity                1.0000      0.9623         0.2554      1.0000
## Pos Pred Value                NaN      0.8530         0.3072         NaN
## Neg Pred Value             0.7481      0.8944         0.9978      0.7498
## Prevalence                 0.2519      0.2494         0.2485      0.2502
## Detection Rate             0.0000      0.1641         0.2481      0.0000
## Detection Prevalence       0.0000      0.1924         0.8076      0.0000
## Balanced Accuracy          0.5000      0.8102         0.6268      0.5000
(ggplot(dt_mod) +
  labs(title = "Accuracy Graph", subtitle = "Decision Tree") +
  theme(axis.title.y = element_text(angle = 90))) %>%
  ggplotly() %>%
  layout(title = list(text = paste0("Accuracy Graph",
                                    "<br>",
                                    "<sup>",
                                    "Decision Tree",
                                    "</sup>")))

Our decision tree model's in-sample accuracy is bad, and it is even worse out-of-sample. One way to interpret this is that a decision tree relies on interactions between predictors, while in our case the variables have close to no interaction.
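To see how shallow the selected tree actually is, we could draw it (a sketch, assuming the rpart.plot package is installed; it is not used elsewhere in this project):

# Hypothetical diagnostic: plot the final rpart tree. At the selected cp
# the tree keeps very few splits, which is consistent with the PAPER and
# SIGN classes never being predicted.
rpart.plot::rpart.plot(dt_mod$finalModel)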

(ggplot(varImp(dt_mod)) +
  labs(title = "Variable Importance", 
       subtitle = "Decison Tree") +
  theme(axis.title.y = element_text(angle = 90),
        axis.text.y = element_blank())) %>%
  ggplotly() %>%
  layout(title = list(text = paste0("Variable Importance",
                                    "<br>",
                                    "<sup>",
                                    "Decision Tree",
                                    "</sup>")))

Taking a look at our decision tree model's variable importance, only a few variables are deemed to hold any importance in the model, and the others hold none at all. This is what we call a rigid model: the variables' importance falls into three clusters, one with importance between 80 and 100, one between 40 and 60, and one at 0.

Random Forest

rf_mod
## Random Forest 
## 
## 9344 samples
##   64 predictor
##    4 classes: 'PAPER', 'ROCK', 'SCISSOR', 'SIGN' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 7475, 7475, 7475, 7477, 7474, 7476, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9348606  0.9131479
##   33    0.9132068  0.8842689
##   64    0.9114596  0.8819389
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
(ggplot(rf_mod) +
  labs(title = "Accuracy Graph", subtitle = "Random Forest") +
  theme(axis.title.y = element_text(angle = 90))) %>%
  ggplotly() %>%
  layout(title = list(text = paste0("Accuracy Graph",
                                    "<br>",
                                    "<sup>",
                                    "Random Forest",
                                    "</sup>")))
rf_test <- predict(rf_mod, test)
confusionMatrix(rf_test,test$X65)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction PAPER ROCK SCISSOR SIGN
##    PAPER     560    8      13   29
##    ROCK        5  567       1   58
##    SCISSOR     7    0     560   17
##    SIGN       16    7       6  480
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9284               
##                  95% CI : (0.9172, 0.9386)     
##     No Information Rate : 0.2519               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9046               
##                                                
##  Mcnemar's Test P-Value : 0.000000001463       
## 
## Statistics by Class:
## 
##                      Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity                0.9524      0.9742         0.9655      0.8219
## Specificity                0.9714      0.9635         0.9863      0.9834
## Pos Pred Value             0.9180      0.8986         0.9589      0.9430
## Neg Pred Value             0.9838      0.9912         0.9886      0.9430
## Prevalence                 0.2519      0.2494         0.2485      0.2502
## Detection Rate             0.2399      0.2429         0.2399      0.2057
## Detection Prevalence       0.2614      0.2704         0.2502      0.2181
## Balanced Accuracy          0.9619      0.9688         0.9759      0.9027

Our random forest performed best so far on the muscle electrical signal readings. Because a random forest is built from several decision trees, its performance is very high, with the drawback of a long training time. We ended up with 0.934 in-sample accuracy and 0.928 out-of-sample accuracy.

(ggplot(varImp(rf_mod)) +
  labs(title = "Variable Importance", subtitle = "Random Forest") +
  theme(axis.title.y = element_text(angle = 90),
        axis.text.y = element_blank())) %>%
  ggplotly() %>%
  layout(title = list(text = paste0("Feature Importance",
                                    "<br>",
                                    "<sup>",
                                    "Random Forest",
                                    "</sup>")))

Remember the variable importance in our decision tree? In our random forest there are clusters too: variables whose importance nearly reaches 100, a group between 40 and 60, and a group between 0 and 20.

Model Optimization

In this model optimization part, we will use the Python library scikit-learn to help us train our Random Forest classifier faster.

We separate the independent and dependent variables of the training and testing data as matrices, so they can be passed to Python.

train_X <- as.matrix(train[,-65])
test_X <- as.matrix(test[,-65])

train_y <- as.matrix(train[,65])
test_y <- as.matrix(test[,65])
library(reticulate)
use_python("/home/jevian/anaconda3/bin/python")

Now we will use StandardScaler from sklearn.preprocessing to scale our data, RepeatedKFold to implement 5-fold cross-validation with 3 repeats, and RandomizedSearchCV from sklearn.model_selection, sampling 24 iterations to find the most promising range for the best parameters. We train a RandomForestClassifier from sklearn.ensemble with n_estimators values of 100, 200, 350, and 500, max_depth values of 2, 8, 16, 32, and 64, and criterion values of gini and entropy. We use 3 of our cores for the randomized search and 2 cores for each fit. Don't forget to set random_state to 0 in the model, the CV splitter, and the randomized search for reproducibility.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([
  ('scale', StandardScaler()),
  ('rfc', RandomForestClassifier())
])

parameter = {
  'rfc__n_estimators': [100, 200, 350, 500],
  'rfc__max_features': ['auto', 'sqrt', 'log2'],
  'rfc__max_depth': [2, 8, 16, 32, 64],
  'rfc__criterion': ['gini', 'entropy'],
  'rfc__n_jobs': [2],
  'rfc__random_state': [0]
}

rkfcv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

GSCV_rfc = RandomizedSearchCV(estimator=pipe, n_iter=24, verbose=2, param_distributions=parameter, cv=rkfcv, scoring='accuracy', n_jobs=3, random_state=0)
GSCV_rfc.fit(r.train_X, np.ravel(r.train_y))

Now that we have a range for the best parameters, we will search for the best parameters again using GridSearchCV from sklearn.model_selection, this time using values near the best_params_ from our last model.

print('Train-set Accuracy: ', GSCV_rfc.score(r.train_X, np.ravel(r.train_y)))
## Train-set Accuracy:  1.0
print('Test-set Accuracy: ', GSCV_rfc.score(r.test_X, np.ravel(r.test_y)))
## Test-set Accuracy:  0.921165381319623
print("Best: %f using %s" % (GSCV_rfc.best_score_, GSCV_rfc.best_params_))
## Best: 0.928581 using {'rfc__random_state': 0, 'rfc__n_jobs': 2, 'rfc__n_estimators': 500, 'rfc__max_features': 'log2', 'rfc__max_depth': 64, 'rfc__criterion': 'entropy'}

Our model fits the training dataset perfectly and has 0.921 accuracy on the test set. We will reuse the parameters from this model and grid-search max_depth values from 50 to 72, together with larger n_estimators values.

from sklearn.model_selection import GridSearchCV
params_grid = {
  'rfc__n_estimators': [500, 625, 750, 875, 1000],
  'rfc__max_features': ['log2'],
  'rfc__max_depth': [50, 52, 54, 58, 60, 62, 64, 66, 68, 70, 72],
  'rfc__criterion': ['entropy'],
  'rfc__n_jobs': [2],
  'rfc__random_state': [0]
}

final = GridSearchCV(estimator=pipe, verbose=2, param_grid=params_grid, cv=rkfcv, scoring='accuracy', n_jobs=3)
final.fit(r.train_X, np.ravel(r.train_y))
print('Train-set Accuracy: ', final.score(r.train_X, np.ravel(r.train_y)))
## Train-set Accuracy:  1.0
print('Test-set Accuracy: ', final.score(r.test_X, np.ravel(r.test_y)))
## Test-set Accuracy:  0.9215938303341902
print('In-sample Accuracy: ', final.best_score_)
## In-sample Accuracy:  0.9296156838213476
GSCV_rfc_pred = GSCV_rfc.predict(r.test_X)
pred_y = final.predict(r.test_X) 

Finally, we evaluate the classification predictions using R again.

confusionMatrix(as.factor(py$GSCV_rfc_pred), test$X65)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction PAPER ROCK SCISSOR SIGN
##    PAPER     550    7      15   29
##    ROCK        7  567       1   39
##    SCISSOR    11    0     539   22
##    SIGN       20    8      25  494
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9212               
##                  95% CI : (0.9095, 0.9318)     
##     No Information Rate : 0.2519               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8949               
##                                                
##  Mcnemar's Test P-Value : 0.0005433            
## 
## Statistics by Class:
## 
##                      Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity                0.9354      0.9742         0.9293      0.8459
## Specificity                0.9708      0.9732         0.9812      0.9697
## Pos Pred Value             0.9151      0.9235         0.9423      0.9031
## Neg Pred Value             0.9781      0.9913         0.9767      0.9496
## Prevalence                 0.2519      0.2494         0.2485      0.2502
## Detection Rate             0.2356      0.2429         0.2309      0.2117
## Detection Prevalence       0.2575      0.2631         0.2451      0.2344
## Balanced Accuracy          0.9531      0.9737         0.9552      0.9078
rf_pred <- py$pred_y
confusionMatrix(as.factor(rf_pred), test$X65)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction PAPER ROCK SCISSOR SIGN
##    PAPER     551    7      16   28
##    ROCK        5  566       1   38
##    SCISSOR    12    0     538   22
##    SIGN       20    9      25  496
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9216               
##                  95% CI : (0.9099, 0.9322)     
##     No Information Rate : 0.2519               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8955               
##                                                
##  Mcnemar's Test P-Value : 0.001605             
## 
## Statistics by Class:
## 
##                      Class: PAPER Class: ROCK Class: SCISSOR Class: SIGN
## Sensitivity                0.9371      0.9725         0.9276      0.8493
## Specificity                0.9708      0.9749         0.9806      0.9691
## Pos Pred Value             0.9153      0.9279         0.9406      0.9018
## Neg Pred Value             0.9786      0.9907         0.9762      0.9507
## Prevalence                 0.2519      0.2494         0.2485      0.2502
## Detection Rate             0.2361      0.2425         0.2305      0.2125
## Detection Prevalence       0.2579      0.2614         0.2451      0.2356
## Balanced Accuracy          0.9539      0.9737         0.9541      0.9092

We ended up with a 0.0004 improvement in our out-of-sample accuracy from the exhaustive grid search.

Conclusion

result <- resamples(list(NB = nb_mod, 
                         DT = dt_mod, 
                         RF = rf_mod))
summary(result)
## 
## Call:
## summary.resamples(object = result)
## 
## Models: NB, DT, RF 
## Number of resamples: 15 
## 
## Accuracy 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## NB 0.9004283 0.9058320 0.9074371 0.9077484 0.9085072 0.9213483    0
## DT 0.3991439 0.4110256 0.4663102 0.4449933 0.4739088 0.4943820    0
## RF 0.9255889 0.9309791 0.9358289 0.9348606 0.9389721 0.9432548    0
## 
## Kappa 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## NB 0.8672434 0.8744489 0.8765899 0.8770033 0.8780122 0.8951335    0
## DT 0.2005151 0.2160009 0.2892978 0.2610775 0.2994264 0.3267971    0
## RF 0.9007860 0.9079755 0.9144387 0.9131479 0.9186293 0.9243403    0
(ggplot(result) +
  labs(title = "Model Comparison",subtitle = "Accuracy")) %>%
  ggplotly() %>%
  layout(title = list(text = paste0("Model Comparison",
                                    "<br>",
                                    "<sup>",
                                    "Average Accuracy",
                                    "</sup>")))
result$timings

The performance of a classification algorithm is highly related to the characteristics of the dataset. In our dataset, the correlations between variables are weak and the variables seem to be independent of each other, so Naive Bayes, which assumes independence of the variables, is at an advantage, while the decision tree performs poorly because its underlying reliance on variable interactions is not met. Yet our random forest, which is constructed from decision trees, successfully outperforms Naive Bayes. Why? Because it combines multiple decision trees to arrive at a better decision through a voting mechanism, at the cost of a much longer training time.

Our optimized model from the grid search has slightly higher accuracy than the one from the randomized search, but took much more time to train. The slight increase in accuracy may be deemed significant in some cases, but in the end it is a matter of trade-off: it comes back to how much time and resources we have and are willing to spend training our model.

Reference

Reaz, M. B., Hussain, M. S., & Mohd-Yasin, F. (2006). Techniques of EMG signal analysis: detection, processing, classification and applications. Biological procedures online, 8, 11–35. (https://doi.org/10.1251/bpo115)