This is a demo of an end-to-end implementation of deep neural networks (DNNs), a subclass of machine learning (artificial intelligence), in R using the R interface to Keras, a high-level neural networks API developed in Python. In this demo, we apply DNN models to two data sets: the MNIST data set and a loan default data set. This demo is organized as follows:
In Section Why Keras?, we provide an overview of why Keras is important when dealing with DNN models, and we list some alternatives to Keras. In Section Why R interface?, we highlight what makes R and the R interface to Keras very useful tools to work with. The installation procedure is also documented along with the required R code.
We demonstrate the important exploratory data analysis steps of the model fitting process in Section Preparing the Data. Two data sets are studied: the MNIST and credit default loan data sets.
There are various deep learning frameworks available today that enable developers, academics, and practitioners to turn ideas into results; to name a few: TensorFlow, Theano, Caffe, and Keras.
However, there are several advantages of using Keras over other frameworks, which include:
It is worth highlighting that Keras has, after TensorFlow, the strongest adoption among all deep learning frameworks in both industry and the research community (including CERN and NASA).
This statistic is based on the total mentions of deep learning frameworks in scientific papers uploaded to the preprint server arXiv.org.
Thanks to its user-friendly nature, R has a strong presence across many industries and in academia. From a data science perspective, R has numerous packages that help implement deep learning models, just as it does for other machine learning models. For an overview of deep learning packages in R, just click here. No danger, I promise! :)
Because of the advantages of Keras over other frameworks and the user-friendly nature of R, there exist two R interfaces to Keras: the kerasR package and RStudio's keras package.
Question: What does it mean when a package, such as RStudio's keras package, is "an interface" to another package, the Python Keras?
Answer: In short, an R interface means that you, as a user, can enjoy the flexibility and user-friendly features of R while at the same time having access to the full power of the Python Keras package. The best of both worlds!
We demonstrate how to install RStudio’s keras package in the next tab.
To install RStudio's keras package, first install the R package from CRAN as follows:
install.packages("keras")
In the next step, we need to install the TensorFlow and Keras libraries, because the Keras R interface uses the TensorFlow backend engine by default. To install both libraries together, we use install_keras().
library(keras)
install_keras()
The install_keras() function has several arguments, as shown in its signature:
install_keras(method = c("auto", "virtualenv", "conda"),
              conda = "auto", tensorflow = "default",
              extra_packages = c("tensorflow-hub"))
As can be seen from the previous chunk of code, there are three methods to install Keras and TensorFlow when using the install_keras() function. It is worth mentioning that the only supported installation method on Windows is "conda".
The default version of TensorFlow is the CPU version; however, if you wish to enjoy your GPU, you are welcome to change the configuration and specify tensorflow = "gpu", as sketched below.
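The following is a minimal sketch of these non-default installations; the argument values come straight from the install_keras() signature shown above:

library(keras)
install_keras(method = "conda")    # the only supported method on Windows
install_keras(tensorflow = "gpu")  # GPU build of TensorFlow instead of the default CPU build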
It is highly recommended to visit custom installation if you wish to further customize your installation of Keras and TensorFlow.
Similar to any other predictive model, the DTE (define, train, evaluate) process in Keras takes the following steps:
A compact flowchart of the process is shown in the following figure.
In this section, a detailed breakdown of the basic built-in functions and procedures in the Keras library used in the aforementioned steps is provided. Supplementary functions, including those used for regularization techniques and parameter tuning, among others, are discussed in the Supplementary functions subsection.
Define
model = keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784))
Note 1: %>% is the pipe operator in R. In simple terms, the pipe operator helps you avoid opening and closing lots of parentheses when you write your code, which makes the code more readable. If you are interested in knowing more about the pipe operator in R, see pipe.
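As a quick illustration of the point above, the two calls below are equivalent; the pipe simply passes its left-hand side as the first argument of the function on its right:

# without the pipe
layer_dense(model, units = 256, activation = 'relu', input_shape = c(784))
# with the pipe
model %>% layer_dense(units = 256, activation = 'relu', input_shape = c(784))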
summary(model)
Compile
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
Fit
Fitted_model = model %>% fit(
x_train, y_train,
epochs = 30, batch_size = 128,
validation_split = 0.2
)
score = model %>% evaluate(
x_test, y_test
)
cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')
To generate predictions on new data, we can use predict_classes() as follows:
model %>% predict_classes(x_test)
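If class probabilities are needed instead of hard labels, predict() can be used as well; a minimal sketch (output shapes depend on the model at hand):

probs  = model %>% predict(x_test)          # matrix of class probabilities, one row per sample
labels = model %>% predict_classes(x_test)  # integer vector of predicted class indices
head(labels)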
This subsection covers the functions in the R interface to Keras that are used for regularization techniques, parameter tuning, and related tasks.
layer_dense(units = 128, activation = 'relu',
kernel_regularizer = regularizer_l2(l = 0.001))
regularizer_l2(l = 0.001) specifies \(l2\) regularization with regularization factor \(l = 0.001\).
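For completeness, the keras package also provides \(l1\) and combined \(l1\)/\(l2\) penalties; a short sketch of the corresponding regularizers:

regularizer_l1(l = 0.001)                   # l1 (lasso-style) penalty
regularizer_l1_l2(l1 = 0.001, l2 = 0.001)   # combined l1 and l2 penalties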
layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = .4)
As can be seen from the above code, the layer_dropout() function is added after the layer we want to regularize in our model. The rate argument of layer_dropout() specifies the probability/fraction of input units that are randomly set to zero during training.
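Putting it together, a minimal self-contained sketch of a dense layer followed by dropout:

model = keras_model_sequential()
model %>%
  layer_dense(units = 128, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4)  # randomly zero 40% of the incoming units during training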
Parameter tuning is another step in the process of training a model, carried out with the hope of improving the metrics as much as we can.
To perform hyperparameter tuning in KerasRStudio, we can implement all three methods of hyperparameter tuning: manual, grid, and Bayesian hyperparameter optimization. In this subsection, we explain how to specify the parameters for which we need to perform the tuning process and how to call them in the body of the model. The final step, which is to call the tuning method (mainly grid or Bayesian), is discussed and implemented in the Hyperparameters tuning section.
Regardless of which method we use to tune the parameters, we first need to identify the parameters to be tuned and assign flags to them. Depending on the class of the parameter, there are four different types of flags in KerasRStudio.
flag_numeric(name, default, description = NULL)
flag_integer(name, default, description = NULL)
flag_boolean(name, default, description = NULL)
flag_string(name, default, description = NULL)
In all four flag types, the argument name specifies the name of the parameter (which can then be referenced when training the model), the argument default specifies its default value, and the description argument provides an explanation of the parameter.
The following example demonstrates how the flags are used:
FLAGS = flags(
flag_numeric("dropout1", 0.4),
flag_numeric("dropout2", 0.3),
flag_string("activation1", "relu"),
flag_string("activation2", "relu"),
flag_string("activation3", "softmax")
)
The defined flags then need to be referenced in the body of the model in the DTE process. The following is an example of how the aforementioned flags are used:
model = keras_model_sequential()
model %>%
layer_dense(units = 256, activation = FLAGS$activation1, input_shape = c(784)) %>%
layer_dropout(rate = FLAGS$dropout1) %>%
layer_dense(units = 128, activation = FLAGS$activation2,
kernel_regularizer = regularizer_l2(l = 0.001)) %>%
layer_dropout(rate = FLAGS$dropout2) %>%
layer_dense(units = 10, activation = FLAGS$activation3)
In this section, we provide a plain implementation of a multilayer perceptron (MLP) neural network using Keras in R. The fundamental functions required to perform the DTE task are presented in the Fundamental functions in Keras section. Overfitting, regularization, and other model tuning issues are discussed in separate sections.
In the following piece of code, we put together the whole DTE process for a plain multilayer perceptron.
model = keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dense(units = 10, activation = 'softmax')
summary(model)
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
Fitted_model = model %>% fit(
x_train, y_train,
epochs = 30, batch_size = 128,
validation_split = 0.2
)
plot(Fitted_model)
score <- model %>% evaluate(
x_test, y_test,
verbose = 0
)
cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')
This section is devoted to demonstrating the overfitting problem and the DTE implementation of regularized versions of the baseline model. Two regularization techniques are discussed: \(l1\)/\(l2\) regularization and dropout. We graphically show how to compare the performance of these models.
We break the DTE implementation of the regularized versions of the baseline model down into four parts. In the first part, we provide the baseline DTE implementation; in the second, we demonstrate how DTE is implemented with the \(l2\) regularization technique; in the third, we provide the DTE implementation of the dropout technique; and in the last part, the combined methods are implemented.
It is worth highlighting that at each step (when implementing the regularization) we graphically compare the loss and other metrics between the baseline model and its regularized version.
model = keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dense(units = 10, activation = 'softmax')
model %>% summary()
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
Baseline_history = model %>% fit(
x_train, y_train,
epochs = 30, batch_size = 128,
view_metrics = TRUE,
validation_split = 0.2
)
l2_model <- keras_model_sequential()
l2_model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dense(units = 128, activation = 'relu',
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%  # l2 penalty on this layer's weights
  layer_dense(units = 10, activation = 'softmax')
l2_model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.001),
  metrics = c('accuracy')
)
l2_history <- l2_model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 30,
  verbose = 1,
  view_metrics = TRUE,
  validation_split = 0.2
)
library(tibble)   # rownames_to_column()
library(dplyr)    # mutate()
library(tidyr)    # gather()
library(ggplot2)  # ggplot()

comparison_l2 = data.frame(
  Baseline_train = Baseline_history$metrics$loss,
  Baseline_val = Baseline_history$metrics$val_loss,
  l2_train = l2_history$metrics$loss,
  l2_val = l2_history$metrics$val_loss
) %>%
  rownames_to_column() %>%
  mutate(rowname = as.integer(rowname)) %>%
  gather(key = "type", value = "value", -rowname)
ggplot(comparison_l2, aes(x = rowname, y = value, color = type)) +
geom_line() +
xlab("epoch") +
ylab("loss")
drop_model <- keras_model_sequential()
drop_model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = .4) %>%
  layer_dense(units = 10, activation = 'softmax')
drop_model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_rmsprop(lr = 0.001),
metrics = c('accuracy')
)
drop_history <- drop_model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 30,
verbose = 1,
view_metrics = TRUE,
validation_split = 0.2
)
comparison_drop = data.frame(
  Baseline_train = Baseline_history$metrics$loss,
  Baseline_val = Baseline_history$metrics$val_loss,
  drop_train = drop_history$metrics$loss,
  drop_val = drop_history$metrics$val_loss
) %>%
  rownames_to_column() %>%
  mutate(rowname = as.integer(rowname)) %>%
  gather(key = "type", value = "value", -rowname)
ggplot(comparison_drop, aes(x = rowname, y = value, color = type)) +
geom_line() +
xlab("epoch") +
ylab("loss")
dropl2_model <- keras_model_sequential()
dropl2_model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 128, activation = 'relu',
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = .4) %>%
  layer_dense(units = 10, activation = 'softmax')
dropl2_model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.001),
  metrics = c('accuracy')
)
dropl2_history <- dropl2_model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 30,
  view_metrics = TRUE,
  verbose = 1,
  validation_split = 0.2
)
comparison_dropl2 = data.frame(
  Baseline_train = Baseline_history$metrics$loss,
  Baseline_val = Baseline_history$metrics$val_loss,
  dropl2_train = dropl2_history$metrics$loss,
  dropl2_val = dropl2_history$metrics$val_loss
) %>%
  rownames_to_column() %>%
  mutate(rowname = as.integer(rowname)) %>%
  gather(key = "type", value = "value", -rowname)
ggplot(comparison_dropl2, aes(x = rowname, y = value, color = type)) +
geom_line() +
xlab("epoch") +
ylab("loss")
In this section, we provide the full implementation of the hyperparameter tuning process using flags in RStudio. We implement both grid search and Bayesian hyperparameter optimization techniques. The important functions to call for this purpose are discussed in the Supplementary functions subsection.
Note 2: It is imperative to call the appropriate type of flag when defining the hyperparameters in the model. If the parameter is numeric, we need to call flag_numeric(); if it is a string parameter, we must call flag_string(); and so on. We refer to flag for further information on these functions and their arguments.
The following chunk of code demonstrates the full implementation of the flag labeling, as well as the calls to the flags in the body of the model, before any tuning technique is applied. For later reference, we save this R script as "mnist_mlp.R".
rm(list = ls())
FLAGS = flags(
flag_numeric("dropout1", 0.4),
flag_numeric("dropout2", 0.3),
flag_string("activation1", "relu"),
flag_string("activation2", "relu"),
flag_string("activation3", "softmax")
)
model = keras_model_sequential()
model %>%
layer_dense(units = 256, activation = FLAGS$activation1, input_shape = c(784)) %>%
layer_dropout(rate = FLAGS$dropout1) %>%
layer_dense(units = 128, activation = FLAGS$activation2,
kernel_regularizer = regularizer_l2(l = 0.001)) %>%
layer_dropout(rate = FLAGS$dropout2) %>%
layer_dense(units = 10, activation = FLAGS$activation3)
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
history = model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 20,
view_metrics = TRUE,
verbose = 1,
validation_split = 0.2
)
score = model %>% evaluate(
x_test, y_test,
verbose = 0
)
cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')
To perform the tuning process under the grid approach in KerasRStudio, we call the tuning_run() function from the tfruns library as follows:
library(tfruns)

runs <- tuning_run("mnist_mlp.R", sample = 0.3, flags = list(
  dropout1 = c(0.2, 0.4, 0.5),
  dropout2 = c(0.1, 0.3, 0.5),
  activation1 = c("relu", "softmax", "sigmoid"),
  activation2 = c("relu", "softmax", "sigmoid"),
  activation3 = c("relu", "softmax", "sigmoid")
))
As can be seen from the previous chunk of code, the tuning_run() function takes several arguments, among which are the file name, sample, and flags. The file name provides the path to the training script (in our example, "mnist_mlp.R"). The sample argument specifies the sampling rate for flag combinations: sometimes the combination of different flags with multiple values makes the tuning process computationally expensive, in which case sample makes the tuning run on a random subset of the flag combinations instead of all of them. The flags argument specifies the list of all parameter names together with their candidate values.
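Once the runs are finished, they can be inspected with the ls_runs() function from tfruns; a quick sketch (the metric column name, metric_val_acc below, may vary across Keras versions):

ls_runs(order = metric_val_acc, decreasing = TRUE)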
Another approach to the hyperparameter tuning process is the Bayesian approach. This approach can be employed in RStudio using CloudML and the R interface to Google CloudML, the cloudml package. To install the cloudml package as well as the Google Cloud SDK, we refer to CloudML.
Once the required packages are installed, we set up the training configuration for later use in the cloudml-related functions. The following script demonstrates how to create a training configuration file. We name it "tuning.yml" for later reference.
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: val_acc
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: dropout1
        type: DOUBLE
        minValue: 0.2
        maxValue: 0.5
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: dropout2
        type: DOUBLE
        minValue: 0.1
        maxValue: 0.5
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: activation1
        type: CATEGORICAL
        categoricalValues: [relu, softmax, sigmoid]
      - parameterName: activation2
        type: CATEGORICAL
        categoricalValues: [relu, softmax, sigmoid]
      - parameterName: activation3
        type: CATEGORICAL
        categoricalValues: [relu, softmax, sigmoid]
Note 3: As is clear from the tuning.yml file, several parameters need to be specified when defining the training configuration file for a tuning job. For instance, hyperparameterMetricTag specifies the metric to optimize (either maximize or minimize) when training the model; for Keras, the available tags are "acc", "loss", "val_acc", and "val_loss". The type parameter can take one of "INTEGER", "DOUBLE", "CATEGORICAL", or "DISCRETE". For other configurations, we refer to training config.
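As an illustration of the DISCRETE type, a hypothetical batch_size entry would look as follows (batch_size is not part of our tuning.yml and is shown only to demonstrate the discreteValues field):

- parameterName: batch_size
  type: DISCRETE
  discreteValues: [64, 128, 256]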
The next step is to submit a hyperparameter tuning job to CloudML. This can be done by calling the cloudml_train() function as follows:
library(cloudml)

cloudml_train("mnist_mlp.R", config = "tuning.yml")
As can be seen from the previous line of code, we need to specify the training configuration when we call the cloudml_train() function for the tuning job.
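Once the job completes, its trials can be inspected and collected locally; a sketch using the job functions from the cloudml package (here the submission is captured in a job object):

job <- cloudml_train("mnist_mlp.R", config = "tuning.yml")
job_trials(job)   # data frame of trials with their hyperparameters and objective values
job_collect(job)  # download the job's run artifacts to the local runs directory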
In this subsection, we provide an overview of how to employ the rBayesianOptimization package in R to tune hyperparameters using the Bayesian approach.
The full implementation of hyperparameter tuning using the rBayesianOptimization package does not require assigning any flags to the hyperparameters.
To apply the BayesianOptimization() function, we first define a function (with the hyperparameters as inputs) that performs the DTE process, as follows:
rm(list = ls())

library(keras)
library(data.table)  # setDT()
library(rsample)     # initial_split(), training(), testing()

training_credit = function(initParams){
urlToData = "https://assets.datacamp.com/production/course_1025/datasets/loan_data_ch1.rds"
savePath = tempfile(fileext = ".rds")
download.file(urlToData, destfile = savePath, mode = "wb")
loanData = readRDS(savePath)
# Clean up
rm(urlToData, savePath)
# Convert default status to factor
setDT(loanData)
loanData[, `:=`(loan_status, factor(loan_status))]
loanData= as.data.frame.matrix(loanData)
for(i in c(3,5)){
loanData[which(is.na(loanData[,i])), i] = mean(loanData[,i], na.rm = TRUE)
}
loanData$loan_status = factor(loanData$loan_status)
loanData$loan_status = as.numeric(loanData$loan_status)
loanData$grade = factor(loanData$grade)
loanData$grade = as.numeric(loanData$grade)
loanData$home_ownership = factor(loanData$home_ownership)
loanData$home_ownership = as.numeric(loanData$home_ownership)
maxs = apply(loanData, 2, max)
mins = apply(loanData, 2, min)
scaled = as.data.frame(scale(loanData,
center = mins, scale = maxs - mins))
splitData = initial_split(scaled, prop = 2/3, strata = "loan_status")
trainData = training(splitData)
testData = testing(splitData)  # rsample's testing() returns the rows held out from training
x_train = trainData[-1]
x_train = data.matrix(x_train)
y_train = trainData$loan_status
y_train = data.matrix(y_train)
x_test = testData[-1]
x_test = data.matrix(x_test)
y_test = testData$loan_status
y_test = data.matrix(y_test) # need to convert dataframe to matrix
# Define Model --------------------------------------------------------------
model = keras_model_sequential()
model %>%
layer_dense(units = 20, activation = 'sigmoid', input_shape = c(7)) %>%
layer_dropout(rate = initParams$dropout1) %>%
layer_dense(units = 10, activation = 'sigmoid',
kernel_regularizer = regularizer_l2(l = 0.001)) %>%
layer_dropout(rate = initParams$dropout2) %>%
layer_dense(units = 1, activation = 'sigmoid')
summary(model)
model %>% compile(
loss = 'binary_crossentropy',
#optimizer = optimizer_rmsprop(lr = 0.001),
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
# Training & Evaluation ----------------------------------------------------
history = model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 20,
view_metrics = TRUE,
verbose = 1,
#callbacks = callback_tensorboard("logs/run_a"),
validation_split = 0.2
)
return(history$metrics$val_acc[20])  # validation accuracy at the final (20th) epoch
}
In the next step, we define another function that maximizes/minimizes our desired metric. This function is then fed into the BayesianOptimization() function to perform the tuning process.
# initParams holds the default hyperparameter values; the defaults below are
# assumed starting values, chosen to match the flag defaults used earlier.
initParams = list(dropout1 = 0.4, dropout2 = 0.3)

maximizeACC = function(dropout1, dropout2) {
  replaceParams = list(dropout1 = dropout1, dropout2 = dropout2)
  updatedParams = modifyList(initParams, replaceParams)
  score = training_credit(updatedParams)
  results = list(Score = score, Pred = 0)
  return(results)
}
library(rBayesianOptimization)

boundsParams = list(dropout1 = c(0.1, 0.7), dropout2 = c(0.1, 0.7))
Final_calibrated = BayesianOptimization(maximizeACC, bounds = boundsParams,
                                        init_grid_dt = as.data.table(boundsParams),
                                        init_points = 10, n_iter = 30, acq = "ucb",
                                        kappa = 2.576, eps = 0, verbose = FALSE)
tail(Final_calibrated$History)
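The best hyperparameters found and the corresponding score can then be read off the returned object:

Final_calibrated$Best_Par    # named vector with the best dropout1 and dropout2 values
Final_calibrated$Best_Value  # the best validation accuracy observed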