library(dplyr)
library(mxnet)
library(keras)
library(ggplot2)
library(plotly)
library(caret)
library(e1071)
optimizer_list <-
  c("adadelta", "adagrad", "adam", "adamax", "nadam", "rmsprop", "sgd")

# Plot the in-training accuracy per epoch for every optimizer,
# read from the saved training histories.
plot_acc_it <- function() {
  history_df <- data.frame(optimizer = character(),
                           epoch = numeric(),
                           acc = numeric())
  for (i in seq_along(optimizer_list)) {
    h <- readRDS(paste0("history_", optimizer_list[i], "_3.RDS"))
    nr <- data.frame(
      optimizer = optimizer_list[i],
      epoch = seq_along(h$metrics$acc),
      acc = round(h$metrics$acc * 100, 2))
    history_df <- rbind(history_df, nr)
  }
  ggplotly(ggplot(data = history_df, aes(x = epoch, y = acc, color = optimizer)) +
             geom_line() + geom_point())
}

# Plot the test-set accuracy of every saved model as a horizontal bar chart.
plot_acc <- function() {
  model_df <- data.frame(optimizer = character(),
                         acc = numeric())
  for (i in seq_along(optimizer_list)) {
    m <- load_model_hdf5(paste0("model_", optimizer_list[i], "_3.hdf5"))
    e <- evaluate(m, fm_test_images, fm_test_labels, verbose = 0)
    nr <- data.frame(
      optimizer = optimizer_list[i],
      acc = round(e$acc * 100, 2))
    model_df <- rbind(model_df, nr)
  }
  ggplotly(
    ggplot(
      data = model_df,
      aes(x = reorder(optimizer, acc),
          y = acc,
          fill = optimizer,
          text = paste0("optimizer: ", optimizer, "\nacc: ", acc, "%"))
    ) + geom_col() + coord_flip() + labs(x = "Optimizer", y = "Acc"),
    tooltip = "text")
}

We’re going to use the Fashion MNIST dataset (the same one that ships with keras), read here from CSV files.
fm_train <- read.csv("train.csv")
fm_test <- read.csv("test.csv")

fm_train <- data.matrix(fm_train)
fm_train_images <- fm_train[,-1]
fm_train_labels <- fm_train[,1]
fm_test <- data.matrix(fm_test)
fm_test_images <- fm_test[,-1]
fm_test_labels <- fm_test[,1]

fm_train_images <- fm_train_images / 255
fm_test_images <- fm_test_images / 255

model <- keras_model_sequential()
model %>%
layer_dense(units = 512, activation = 'relu', input_shape = c(784)) %>%
layer_dropout(rate = 0.5) %>%
layer_dense(units = 256, activation = 'relu') %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 10, activation = 'softmax')
summary(model)

## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ===========================================================================
## dense_1 (Dense)                  (None, 512)                   401920
## ___________________________________________________________________________
## dropout_1 (Dropout)              (None, 512)                   0
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 256)                   131328
## ___________________________________________________________________________
## dropout_2 (Dropout)              (None, 256)                   0
## ___________________________________________________________________________
## dense_3 (Dense)                  (None, 10)                    2570
## ===========================================================================
## Total params: 535,818
## Trainable params: 535,818
## Non-trainable params: 0
## ___________________________________________________________________________
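
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit, and the dropout layers add none. A minimal sketch of that arithmetic in base R (`dense_params` is a hypothetical helper for illustration, not part of keras):

```r
# Hypothetical helper: parameters of a fully connected layer
# = (inputs * units) weights + units biases.
dense_params <- function(n_inputs, n_units) {
  n_inputs * n_units + n_units
}

# Layer sizes taken from the model definition: 784 -> 512 -> 256 -> 10
layer_params <- c(
  dense_1 = dense_params(784, 512),
  dense_2 = dense_params(512, 256),
  dense_3 = dense_params(256, 10)
)

print(layer_params)

# Sum over the three dense layers, to compare with "Total params"
total_params <- sum(layer_params)
print(total_params)
```

Almost all of the parameters sit in the first dense layer, because it connects all 784 input pixels to 512 units.
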
I compiled the model with each of the predefined optimizers (adadelta, adagrad, adam, adamax, nadam, rmsprop, and sgd). The compile call is shown for adam; the other runs only change the optimizer argument. Apart from the optimizer, the other arguments passed to compile were:
1. loss -> sparse_categorical_crossentropy
2. metrics -> accuracy
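
As a rough illustration of what the sparse_categorical_crossentropy loss measures: for each sample it is the negative log of the probability the model assigns to the true class, with labels given as integers (0 to 9 here) rather than one-hot vectors. The sketch below is plain base R with made-up probabilities; `sparse_cce` is a hypothetical helper, not the keras implementation, and it ignores details such as probability clipping:

```r
# Hypothetical helper: mean sparse categorical cross-entropy.
# probs:  matrix, one row per sample, one column per class (rows sum to 1)
# labels: integer class ids starting at 0, like the Fashion MNIST label column
sparse_cce <- function(probs, labels) {
  # Pick, for each row, the probability of its true class
  # (+1 because R indexes from 1 while the labels start at 0)
  p_true <- probs[cbind(seq_along(labels), labels + 1)]
  mean(-log(p_true))
}

# Made-up predictions for two samples over three classes
probs <- rbind(c(0.7, 0.2, 0.1),
               c(0.1, 0.1, 0.8))
labels <- c(0, 2)

loss <- sparse_cce(probs, labels)
print(loss)
```

The loss shrinks toward 0 as the probability assigned to the correct class approaches 1.
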
model %>% compile(
optimizer = optimizer_adam(),
loss = 'sparse_categorical_crossentropy',
metrics = c('accuracy')
)

For every optimizer, I trained the model for up to 100 epochs with a validation split of 0.2, using callback_early_stopping to monitor the training accuracy and stop as soon as it fails to improve for 2 consecutive epochs.

history <- model %>%
  fit(fm_train_images, fm_train_labels, epochs = 100, validation_split = 0.2,
      callbacks = list(callback_early_stopping(monitor = "acc", mode = "max", patience = 2)))

We’re going to plot the training history of every model and compare the in-training accuracy across optimizers.

plot_acc_it()

The plot shows that sgd learns the slowest: it needed 54 epochs to reach its maximum in-training accuracy of 90.51%.
Meanwhile, adamax has the best in-training accuracy of all the optimizers, 91.55% at 44 epochs.
Nadam does not seem well suited to this data: it only reaches 85.57% accuracy, at 21 epochs.
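
The different stopping epochs follow from the early-stopping rule used during training. A minimal base-R sketch of that rule, assuming it behaves like a patience counter that resets whenever the monitored accuracy beats its best value so far (`stop_epoch` is a hypothetical helper and the accuracy values are made up, not taken from the saved histories):

```r
# Hypothetical helper: epoch at which training would stop, given the
# accuracy after each epoch and a patience (epochs without improvement).
stop_epoch <- function(acc, patience = 2) {
  best <- -Inf
  wait <- 0
  for (epoch in seq_along(acc)) {
    if (acc[epoch] > best) {
      # New best accuracy: remember it and reset the counter
      best <- acc[epoch]
      wait <- 0
    } else {
      # No improvement this epoch
      wait <- wait + 1
      if (wait >= patience) return(epoch)
    }
  }
  # Never triggered: training runs for all epochs
  length(acc)
}

# Made-up accuracy curve: epochs 3 and 4 do not beat the best value from epoch 2
acc <- c(0.80, 0.85, 0.84, 0.85, 0.86)
stopped_at <- stop_epoch(acc, patience = 2)
print(stopped_at)
```

With such a short patience, a brief plateau is enough to end training early, which is one possible reason an optimizer stops after relatively few epochs.
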
Next, we’re going to compare accuracy on the unseen test data.

plot_acc()

As the chart above shows, adamax has the highest test accuracy at 89.68%, while nadam has the lowest at 87.62%. These results agree with the previous plot, so for this dataset and this Keras sequential model, adamax looks like the best choice of optimizer.

We’re going to use the same dense-layer configuration as the Keras model (the dropout layers are not included here).
m1.data <- mx.symbol.Variable("data")
m1.fc1 <- mx.symbol.FullyConnected(m1.data, name="fc1", num_hidden=512)
m1.act1 <- mx.symbol.Activation(m1.fc1, name="activation1", act_type="relu")
m1.fc2 <- mx.symbol.FullyConnected(m1.act1, name="fc2", num_hidden=256)
m1.act2 <- mx.symbol.Activation(m1.fc2, name="activation2", act_type="relu")
m1.fc3 <- mx.symbol.FullyConnected(m1.act2, name="fc3", num_hidden=10)
m1.softmax <- mx.symbol.SoftmaxOutput(m1.fc3, name="softMax")

Due to limited time, I only managed to train the MXNet model with the default optimizer (sgd).

# Assumption: fm_train_images_t is not defined in the code shown. Its name and
# array.layout = "colmajor" (one sample per column) suggest it is the
# transposed training matrix.
fm_train_images_t <- t(fm_train_images)

log <- mx.metric.logger$new()
startime <- proc.time()
mx.set.seed(10)
mx_model <- mx.model.FeedForward.create(
  m1.softmax,
  X = fm_train_images_t,
  y = fm_train_labels,
  ctx = mx.cpu(),
  num.round = 20,
  array.batch.size = 80,
  momentum = 0.95,
  array.layout = "colmajor",
  learning.rate = 0.01,
  eval.metric = mx.metric.accuracy,
  epoch.end.callback = mx.callback.log.train.metric(1, log)
)
print(paste("Training took:", round((proc.time() - startime)[3], 2), "seconds"))

mx_model <- mx.model.load("mxnet_model", iteration = 20)
graph.viz(mx_model$symbol)

log <- readRDS("mxnet_log.RDS")
ggplotly(ggplot(
  data = data.frame(round=1:length(log$train), acc=round(log$train*100,2)),
  aes(x=round, y=acc)
) +
  geom_line(col="red") +
  geom_point(col="blue"))

mx_preds <- data.frame(
t(predict(mx_model, fm_test_images, array.layout = "rowmajor"))
)
mx_preds$best <- as.factor(apply(mx_preds[,1:10], 1, which.max) - 1)
mx_preds$actual <- as.factor(fm_test_labels)
confusionMatrix(mx_preds$best, mx_preds$actual)

## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 820   0  15  10   0   0 132   0   3   0
##          1   6 984   2  12   0   0   8   0   0   0
##          2  17   2 795   9  39   1 111   0   8   1
##          3  76  12   9 926  25   0  51   0   1   0
##          4   7   0 145  31 925   0 123   0   9   0
##          5   2   1   1   1   0 925   0   3   2   5
##          6  64   1  30   7   9   0 569   0   3   0
##          7   0   0   0   0   0  53   0 946   1  29
##          8   8   0   3   4   2   3   6   0 972   0
##          9   0   0   0   0   0  18   0  51   1 965
##
## Overall Statistics
##
##                Accuracy : 0.8827
##                  95% CI : (0.8762, 0.8889)
##     No Information Rate : 0.1
##     P-Value [Acc > NIR] : < 0.00000000000000022
##
##                   Kappa : 0.8697
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.8200   0.9840   0.7950   0.9260   0.9250   0.9250
## Specificity            0.9822   0.9969   0.9791   0.9807   0.9650   0.9983
## Pos Pred Value         0.8367   0.9723   0.8087   0.8418   0.7460   0.9840
## Neg Pred Value         0.9800   0.9982   0.9773   0.9917   0.9914   0.9917
## Prevalence             0.1000   0.1000   0.1000   0.1000   0.1000   0.1000
## Detection Rate         0.0820   0.0984   0.0795   0.0926   0.0925   0.0925
## Detection Prevalence   0.0980   0.1012   0.0983   0.1100   0.1240   0.0940
## Balanced Accuracy      0.9011   0.9904   0.8871   0.9533   0.9450   0.9617
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity            0.5690   0.9460   0.9720   0.9650
## Specificity            0.9873   0.9908   0.9971   0.9922
## Pos Pred Value         0.8331   0.9193   0.9739   0.9324
## Neg Pred Value         0.9537   0.9940   0.9969   0.9961
## Prevalence             0.1000   0.1000   0.1000   0.1000
## Detection Rate         0.0569   0.0946   0.0972   0.0965
## Detection Prevalence   0.0683   0.1029   0.0998   0.1035
## Balanced Accuracy      0.7782   0.9684   0.9846   0.9786
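
The headline numbers in this output can be reproduced from the confusion matrix alone: accuracy is the diagonal sum over the total, and each class's sensitivity is its diagonal entry over its column (reference) total. A short base-R sketch using the counts printed above (the variable names are only for illustration):

```r
# Confusion matrix copied from the output above:
# rows = predicted class, columns = reference (actual) class
cm <- matrix(c(
  820,   0,  15,  10,   0,   0, 132,   0,   3,   0,
    6, 984,   2,  12,   0,   0,   8,   0,   0,   0,
   17,   2, 795,   9,  39,   1, 111,   0,   8,   1,
   76,  12,   9, 926,  25,   0,  51,   0,   1,   0,
    7,   0, 145,  31, 925,   0, 123,   0,   9,   0,
    2,   1,   1,   1,   0, 925,   0,   3,   2,   5,
   64,   1,  30,   7,   9,   0, 569,   0,   3,   0,
    0,   0,   0,   0,   0,  53,   0, 946,   1,  29,
    8,   0,   3,   4,   2,   3,   6,   0, 972,   0,
    0,   0,   0,   0,   0,  18,   0,  51,   1, 965
), nrow = 10, byrow = TRUE,
dimnames = list(as.character(0:9), as.character(0:9)))

# Overall accuracy: correctly classified samples over all samples
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

# Per-class sensitivity (recall): correct predictions over actual members
sensitivity <- diag(cm) / colSums(cm)
print(round(sensitivity, 4))

# Class with the lowest sensitivity
weakest <- names(which.min(sensitivity))
print(weakest)
```

Class 6 stands out as the weak spot: most of its misses are predicted as classes 0, 2, and 4, which account for 132, 111, and 123 of its 1000 test samples.
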

However, on the unseen test data the MXNet model only reaches an accuracy of 88.27%, which is lower than the average accuracy of the Keras models. The MXNet model appears to be overfitting.
Running the Keras model seems lighter on the hardware than MXNet, but MXNet gives a more promising in-training accuracy and is easier to tune. The MXNet model could probably be improved further by trying another optimizer, or by tuning the learning rate or weight decay to reduce overfitting.
Apart from that tuning, I would not recommend changing the layer configuration: I already tried adding and removing layers and neurons, and the changes did not improve the results significantly; they only increased the processing time.
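
To make the learning rate and weight decay knobs mentioned above concrete, here is a toy base-R sketch of an SGD update with momentum and weight decay, applied to a one-parameter quadratic loss instead of the network. It assumes the common formulation in which weight decay adds `wd * w` to the gradient; `sgd_momentum` and the loss are hypothetical and for illustration only, not MXNet's implementation. The learning rate and momentum defaults mirror the values used in the training call.

```r
# Hypothetical helper: SGD with momentum and weight decay on a scalar weight.
# grad_fn returns the gradient of the loss at w.
sgd_momentum <- function(grad_fn, w0 = 0, learning_rate = 0.01,
                         momentum = 0.95, wd = 0, steps = 2000) {
  w <- w0
  v <- 0
  for (i in seq_len(steps)) {
    # Weight decay contributes wd * w to the gradient
    g <- grad_fn(w) + wd * w
    # Velocity keeps a decaying memory of past gradient steps
    v <- momentum * v - learning_rate * g
    w <- w + v
  }
  w
}

# Toy loss (w - 3)^2, whose gradient is 2 * (w - 3)
grad_fn <- function(w) 2 * (w - 3)

w_plain <- sgd_momentum(grad_fn)            # no weight decay
w_decay <- sgd_momentum(grad_fn, wd = 0.5)  # with weight decay

print(c(plain = w_plain, decay = w_decay))
```

Without weight decay the weight settles at the loss minimum of 3; with `wd = 0.5` it settles at about 2.4, closer to zero. That shrinking of the weights is the regularizing effect that can help against overfitting.
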