1 Initialization

1.1 Library Call & Function Definition

library(dplyr)
library(mxnet)
library(keras)
library(ggplot2)
library(plotly)
library(caret)
library(e1071)

optimizer_list <- 
  c("adadelta", "adagrad", "adam", "adamax", "nadam", "rmsprop", "sgd")

# Plot the in-train accuracy per epoch for every optimizer,
# using the training histories saved as history_<optimizer>_3.RDS.
plot_acc_it <- function(){
  history_df <- data.frame(optimizer=character(),
                  epoch=numeric(),
                  acc=numeric())
  for (i in seq_along(optimizer_list)) {
    h <- readRDS(paste0("history_", optimizer_list[i], "_3.RDS"))
    nr <- data.frame(
      optimizer=optimizer_list[i],
      epoch=seq_along(h$metrics$acc),
      acc=round(h$metrics$acc*100,2))  # accuracy in percent
    history_df <- rbind(history_df, nr)
  }

  ggplotly(ggplot(data=history_df, aes(x=epoch, y=acc, color=optimizer))+
    geom_line() + geom_point())
}

# Plot the test-set accuracy of every saved model (model_<optimizer>_3.hdf5).
# Relies on fm_test_images and fm_test_labels, which are created in section 1.2.
plot_acc <- function(){
  model_df <- data.frame(optimizer=character(),
                  acc=numeric())
  for (i in seq_along(optimizer_list)) {
    m <- load_model_hdf5(paste0("model_", optimizer_list[i], "_3.hdf5"))
    e <- evaluate(m, fm_test_images, fm_test_labels, verbose=0)
    nr <- data.frame(
      optimizer=optimizer_list[i],
      acc=round(e$acc*100, 2))  # accuracy in percent
    model_df <- rbind(model_df, nr)
  }

  ggplotly(
  ggplot(
    data=model_df, 
    aes(x=reorder(optimizer, acc),  # order the bars by accuracy
        y=acc, 
        fill=optimizer,
        text=paste("optimizer: ", optimizer, "\nacc: ", acc, "%" , sep="")
        )
    )+geom_col() + coord_flip()+labs(x = "Optimizer", y = "Acc"), 
  tooltip="text")
}

1.2 Data Pre-Processing

1.2.1 Read Data Source

We’re going to use the Fashion MNIST dataset, loaded here from local CSV files (train.csv and test.csv).

fm_train <- read.csv("train.csv")
fm_test <- read.csv("test.csv")

1.2.2 Splitting images and label data

fm_train <- data.matrix(fm_train)
fm_train_images <- fm_train[,-1]
fm_train_labels <- fm_train[,1]

fm_test <- data.matrix(fm_test)
fm_test_images <- fm_test[,-1]
fm_test_labels <- fm_test[,1]

1.2.3 Scale the data value (divide by 255)

fm_train_images <- fm_train_images / 255
fm_test_images <- fm_test_images / 255
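
The split and scaling steps above can be sketched on a toy matrix. This is only an illustration in base R: the `toy` data and its values are made up, and the real files have 784 pixel columns rather than two.

```r
# Toy stand-in for the CSV data: column 1 is the label, the rest are pixels.
# (Hypothetical values; the real files have 784 pixel columns.)
toy <- data.matrix(data.frame(
  label  = c(3, 0),
  pixel1 = c(0, 255),
  pixel2 = c(128, 51)
))

toy_images <- toy[, -1, drop = FALSE]  # everything except the label column
toy_labels <- toy[, 1]                 # label column only

toy_images <- toy_images / 255         # rescale pixel values to [0, 1]

print(range(toy_images))
print(toy_labels)
```

Dividing by 255 keeps every input in the range 0 to 1, which is the scale the dense layers below are fed with.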

2 Keras

2.1 Setup the layers

model <- keras_model_sequential()
model %>%
  layer_dense(units = 512, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = 256, activation = 'relu') %>%
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 10, activation = 'softmax')
summary(model)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 512)                   401920      
## ___________________________________________________________________________
## dropout_1 (Dropout)              (None, 512)                   0           
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 256)                   131328      
## ___________________________________________________________________________
## dropout_2 (Dropout)              (None, 256)                   0           
## ___________________________________________________________________________
## dense_3 (Dense)                  (None, 10)                    2570        
## ===========================================================================
## Total params: 535,818
## Trainable params: 535,818
## Non-trainable params: 0
## ___________________________________________________________________________
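
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is made up for this sketch.

```r
# Parameters of a dense layer = inputs * units (weights) + units (biases).
dense_params <- function(n_in, n_out) n_in * n_out + n_out

layer_params <- c(
  dense_1 = dense_params(784, 512),
  dense_2 = dense_params(512, 256),
  dense_3 = dense_params(256, 10)
)

print(layer_params)       # the dropout layers add no parameters
print(sum(layer_params))  # should match "Total params" in the summary
```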

2.2 Compile the model

I compiled the model with each of the pre-defined optimizers (adadelta, adagrad, adam, adamax, nadam, rmsprop, and sgd); the call below shows the adam case. Apart from the optimizer, the following attributes were used when compiling:

1. loss -> sparse_categorical_crossentropy
2. metrics -> accuracy

model %>% compile(
  optimizer = optimizer_adam(), 
  loss = 'sparse_categorical_crossentropy',
  metrics = c('accuracy')
)
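
To make the loss concrete, here is a minimal base R sketch of what sparse categorical cross-entropy computes for integer labels: the negative log of the probability assigned to the true class, averaged over the samples. The function name `sparse_cce` and the probabilities are hypothetical, and this is an illustration of the formula rather than the keras implementation.

```r
# Sparse categorical cross-entropy for integer labels (0-based, as in the CSV).
# probs: one row per sample, one column per class (each row sums to 1).
sparse_cce <- function(probs, labels) {
  picked <- probs[cbind(seq_along(labels), labels + 1)]  # prob. of the true class
  mean(-log(picked))
}

# Hypothetical softmax outputs for two samples and three classes.
probs  <- rbind(c(0.7, 0.2, 0.1),
                c(0.1, 0.1, 0.8))
labels <- c(0, 2)

loss <- sparse_cce(probs, labels)
print(loss)
```

The "sparse" variant is what lets the labels stay as plain integers (0 to 9) instead of being one-hot encoded first.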

2.3 Train the model

For every optimizer, I trained the model with epochs set to 100 and a validation split of 0.2, and used callback_early_stopping to monitor the training accuracy and stop the training as soon as it fails to improve for 2 consecutive epochs.

history <- model %>% 
  fit(fm_train_images, fm_train_labels, epochs=100, validation_split = 0.2,
      # spell out the full argument names instead of relying on partial matching
      callbacks=list(callback_early_stopping(monitor = "acc", mode = "max", patience=2)))
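
The patience rule described above can be sketched in a few lines of base R. This is an illustration of the idea only, not the keras internals; the function name `early_stop_epoch` and the accuracy curve are made up.

```r
# Return the epoch at which training would stop, given per-epoch accuracies.
# Mimics a "max"-mode patience rule; an illustration, not the keras internals.
early_stop_epoch <- function(acc, patience = 2) {
  best <- -Inf
  wait <- 0
  for (epoch in seq_along(acc)) {
    if (acc[epoch] > best) {
      best <- acc[epoch]   # improvement: remember it and reset the counter
      wait <- 0
    } else {
      wait <- wait + 1     # no improvement over the best value so far
      if (wait >= patience) return(epoch)
    }
  }
  length(acc)              # rule never triggered: all epochs were run
}

# Hypothetical accuracy curve: peaks at epoch 4, then stalls for two epochs.
acc <- c(0.80, 0.85, 0.88, 0.90, 0.89, 0.90, 0.91)
print(early_stop_epoch(acc, patience = 2))
```

In this toy curve the run stops before the later improvement at epoch 7 is ever seen, which is the trade-off of a small patience value.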

2.4 Comparing the models trained

We’re going to plot the training steps of every model, one per optimizer, and compare their in-train accuracy.

plot_acc_it()

By viewing the plot, we can see that sgd learns the slowest: it needed 54 epochs to reach its maximum in-train accuracy of 90.51%.
Meanwhile, adamax has the best in-train accuracy of all the optimizers, 91.55% at 44 epochs.
Nadam does not seem well suited to this kind of data: it only reached 85.57% accuracy at 21 epochs.

Next, we’re going to compare the accuracy on the unseen test data.

plot_acc()

As we can see from the chart above, adamax has the highest accuracy (89.68%), while nadam has the lowest (87.62%). These results are in line with the previous plot, so for this dataset we can tentatively conclude that this keras sequential model performs best with adamax as the optimizer.

3 MXNet

3.1 Setup the Layers

We’re going to use the same dense layer configuration as the keras model (512, 256, and 10 units), although without the dropout layers.

m1.data  <- mx.symbol.Variable("data") 
m1.fc1 <- mx.symbol.FullyConnected(m1.data, name="fc1", num_hidden=512)
m1.act1 <- mx.symbol.Activation(m1.fc1, name="activation1", act_type="relu")
m1.fc2 <- mx.symbol.FullyConnected(m1.act1, name="fc2", num_hidden=256)
m1.act2 <- mx.symbol.Activation(m1.fc2, name="activation2", act_type="relu")
m1.fc3 <- mx.symbol.FullyConnected(m1.act2, name="fc3", num_hidden=10)
m1.softmax <- mx.symbol.SoftmaxOutput(m1.fc3, name="softMax")

3.2 Train the Model

Due to insufficient time, I only managed to train the MXNet model with the default optimizer (sgd).

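# Assumption: fm_train_images_t (not defined earlier) is the transposed image
# matrix, since array.layout = "colmajor" expects one sample per column.
fm_train_images_t <- t(fm_train_images)

log <- mx.metric.logger$new()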
startime <- proc.time() 
mx.set.seed(10)

mx_model <- mx.model.FeedForward.create(
  m1.softmax,  
  X = fm_train_images_t, 
  y = fm_train_labels,
  ctx = mx.cpu(),
  num.round = 20,  
  array.batch.size = 80,
  momentum = 0.95,
  array.layout="colmajor",
  learning.rate = 0.01,
  eval.metric = mx.metric.accuracy,
  epoch.end.callback = mx.callback.log.train.metric(1,log)
)
print(paste("Training took:", round((proc.time() - startime)[3],2),"seconds"))
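
The training call above uses `array.layout="colmajor"`, which means one sample per column, while the matrices built in section 1.2 hold one sample per row. A small base R sketch with a made-up 3 x 4 matrix shows the difference; the variable names are hypothetical.

```r
# Toy image matrix: 3 samples (rows) x 4 pixels (columns), i.e. row-major.
x_rowmajor <- matrix(seq_len(12) / 12, nrow = 3, ncol = 4, byrow = TRUE)

# Column-major layout: one sample per column, so dim = (features, samples).
x_colmajor <- t(x_rowmajor)

print(dim(x_rowmajor))
print(dim(x_colmajor))

# The first sample is row 1 in one layout and column 1 in the other.
print(all(x_rowmajor[1, ] == x_colmajor[, 1]))
```

This is why the prediction step later can pass the untransposed test matrix as long as it declares `array.layout = "rowmajor"`.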

3.3 Model Graph

mx_model <- mx.model.load("mxnet_model", iteration = 20)
graph.viz(mx_model$symbol)

3.4 In-Train Accuracy

log <- readRDS("mxnet_log.RDS")
ggplotly(ggplot(
  data = data.frame(round=1:length(log$train), acc=round(log$train*100,2)),
  aes(x=round, y=acc)
  ) + 
  geom_line(col="red") + 
  geom_point(col="blue"))

Looking at the plot above, we can see that MXNet with sgd reaches a higher in-train accuracy than any of the keras models, 92.5%, in just 20 rounds.

3.5 Unseen-Data Test Accuracy

mx_preds <- data.frame(
  t(predict(mx_model, fm_test_images, array.layout = "rowmajor"))
)
mx_preds$best <- as.factor(apply(mx_preds[,1:10], 1, which.max) - 1)
mx_preds$actual <- as.factor(fm_test_labels)
confusionMatrix(mx_preds$best, mx_preds$actual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 820   0  15  10   0   0 132   0   3   0
##          1   6 984   2  12   0   0   8   0   0   0
##          2  17   2 795   9  39   1 111   0   8   1
##          3  76  12   9 926  25   0  51   0   1   0
##          4   7   0 145  31 925   0 123   0   9   0
##          5   2   1   1   1   0 925   0   3   2   5
##          6  64   1  30   7   9   0 569   0   3   0
##          7   0   0   0   0   0  53   0 946   1  29
##          8   8   0   3   4   2   3   6   0 972   0
##          9   0   0   0   0   0  18   0  51   1 965
## 
## Overall Statistics
##                                                
##                Accuracy : 0.8827               
##                  95% CI : (0.8762, 0.8889)     
##     No Information Rate : 0.1                  
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.8697               
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.8200   0.9840   0.7950   0.9260   0.9250   0.9250
## Specificity            0.9822   0.9969   0.9791   0.9807   0.9650   0.9983
## Pos Pred Value         0.8367   0.9723   0.8087   0.8418   0.7460   0.9840
## Neg Pred Value         0.9800   0.9982   0.9773   0.9917   0.9914   0.9917
## Prevalence             0.1000   0.1000   0.1000   0.1000   0.1000   0.1000
## Detection Rate         0.0820   0.0984   0.0795   0.0926   0.0925   0.0925
## Detection Prevalence   0.0980   0.1012   0.0983   0.1100   0.1240   0.0940
## Balanced Accuracy      0.9011   0.9904   0.8871   0.9533   0.9450   0.9617
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.5690   0.9460   0.9720   0.9650
## Specificity            0.9873   0.9908   0.9971   0.9922
## Pos Pred Value         0.8331   0.9193   0.9739   0.9324
## Neg Pred Value         0.9537   0.9940   0.9969   0.9961
## Prevalence             0.1000   0.1000   0.1000   0.1000
## Detection Rate         0.0569   0.0946   0.0972   0.0965
## Detection Prevalence   0.0683   0.1029   0.0998   0.1035
## Balanced Accuracy      0.7782   0.9684   0.9846   0.9786

On unseen data, however, the MXNet model only reaches an accuracy of 88.27%, which is even lower than the average accuracy of the keras models. It seems the MXNet model is overfitting.
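
The headline numbers in the caret output can be re-derived from the confusion matrix itself. The sketch below copies the matrix from the output above into base R; the variable names are made up for illustration.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = actual class).
cm <- matrix(c(
  820,   0,  15,  10,   0,   0, 132,   0,   3,   0,
    6, 984,   2,  12,   0,   0,   8,   0,   0,   0,
   17,   2, 795,   9,  39,   1, 111,   0,   8,   1,
   76,  12,   9, 926,  25,   0,  51,   0,   1,   0,
    7,   0, 145,  31, 925,   0, 123,   0,   9,   0,
    2,   1,   1,   1,   0, 925,   0,   3,   2,   5,
   64,   1,  30,   7,   9,   0, 569,   0,   3,   0,
    0,   0,   0,   0,   0,  53,   0, 946,   1,  29,
    8,   0,   3,   4,   2,   3,   6,   0, 972,   0,
    0,   0,   0,   0,   0,  18,   0,  51,   1, 965
), nrow = 10, byrow = TRUE,
dimnames = list(prediction = as.character(0:9), reference = as.character(0:9)))

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)    # per actual class
precision   <- diag(cm) / rowSums(cm)    # "Pos Pred Value" per predicted class

print(accuracy)
print(round(sensitivity, 4))
print(round(precision, 4))
```

Most of the lost accuracy comes from class 6, which is recognised only 569 times out of 1000 and is mostly confused with classes 0, 2, and 4.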

4 Summary

Running the keras models seems friendlier to the hardware than MXNet, but MXNet gives a more promising in-train accuracy and is easier to tune than keras. Furthermore, we could probably tune the MXNet model further by using another optimization method or by adjusting the learning rate or weight decay, in order to avoid overfitting.

Apart from that tuning, I wouldn’t recommend other layer configurations: I already tried adding and removing layers and/or neurons, but this brought no significant improvement, only longer processing time.
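
As a rough way to quantify the overfitting discussed above, the in-train and test accuracies quoted in the text can be put side by side. This sketch only includes the models for which both numbers are reported; the data frame and column names are made up.

```r
# Accuracies (in %) quoted in the text above; only models with both numbers.
results <- data.frame(
  model = c("keras + adamax", "keras + nadam", "mxnet + sgd"),
  train = c(91.55, 85.57, 92.50),
  test  = c(89.68, 87.62, 88.27)
)

# Positive gap: the model does better on training data than on unseen data.
results$gap <- round(results$train - results$test, 2)
print(results[order(-results$gap), ])
```

The MXNet model shows the largest gap between in-train and test accuracy of the three, which is consistent with the overfitting suspected in section 3.5.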