1 Deep Learning in R

This is a demo of an end-to-end implementation of deep neural networks (DNN), a subclass of machine learning (artificial intelligence) models, in R, using the R interface to Keras, a high-level neural networks API developed in Python. In this demo, we apply DNN models to a loan default data set. The demo is organized as follows:

  • In Section Why Keras?, we provide an overview of why Keras is important when dealing with DNN models; alternatives to Keras are also listed. In Section Why R interface?, we highlight what makes R and the R interface to Keras such useful tools to work with. The installation procedure is also documented, along with the required R code.

  • In Section Data Preparation, we demonstrate the important exploratory data analysis steps of the model fitting process on a credit default loan data set.

  • Section Keras model training and evaluation is devoted to the intermediary steps between the inputs and the predicted outputs of a DNN model. That is, we show how to use the R interface to Keras to define a model, train it, and evaluate it (DTE) on the test set. We break this part of the demo into four related subsections as follows:
    • In Subsection Fundamental functions in Keras, we list the important fundamental functions needed to implement DTE for a feed-forward DNN using keras;
    • In Subsection Baseline Model, we provide a full implementation of a plain DNN model on our data set without employing any regularization techniques;
    • In Subsection Multiple models with regularization, we demonstrate the overfitting problem and perform DTE on the baseline model and on regularized versions of it. Two regularization techniques are discussed: \(l1\)/\(l2\) regularization and dropout. We graphically show how to compare the performance of these models;
    • How to apply hyperparameter tuning in Keras is another question we address in Subsection Hyperparameters tuning.
  • Model diagnosis metrics/tests including PDP, ALE plot and PFI are discussed in Section Model interpretability/diagnosis.

  • In the last section, Feature engineering/selection, we present one class of feature selection approaches, the so-called filter-based techniques. An entropy-based algorithm, Fast Correlation Based Feature Selection (FCBF), is discussed and its R implementation is presented as well.

1.1 Why Keras?

There are various deep learning frameworks available today that enable developers, academics, and practitioners to turn ideas into results: Tensorflow, Theano, Caffe and Keras, to name a few.

However, there are several advantages of using Keras over other frameworks which include:

  • Keras is designed for human beings, not machines; it minimizes the number of user actions required for common use cases;
  • Keras is easy to learn and to use, hence allowing users to be faster and more productive; and
  • Models developed in Keras can be easily deployed across R and Python web apps such as Shiny or Flask.

It is worth highlighting that Keras, after Tensorflow, has the strongest adoption among all other deep learning frameworks in the industry and the research community (including CERN and NASA).

These statistics are based on the total number of mentions of deep learning frameworks in scientific papers uploaded to the preprint server arXiv.org.

1.2 Why R interface?

1.2.1 Keras in R

Due to its user-friendly nature, R has a strong influence across different industries and academia. From a data science perspective, R has numerous packages that help implement deep learning models, just as for other machine learning models. For an overview of deep learning packages in R, just try to click here. No danger, I promise! :)

Thanks to the advantages of Keras over other frameworks and the user-friendly nature of R, there exist two R interfaces to Keras: the kerasR package and RStudio’s keras package.

Question: What does it mean when a package, such as the RStudio’s keras package, is “an interface” to another package, the Python Keras?

Answer: In short, an R interface means that you, as a user, can enjoy the flexibility and user-friendly features of R while having access to the full power of the Python Keras package. The best of both worlds!

We demonstrate how to install RStudio’s keras package in the next tab.

1.2.2 RStudio’s keras package installation

To install RStudio’s keras package, first install the R package from CRAN as follows:

install.packages("keras")

In the next step we need to install the TensorFlow and Keras libraries, because the Keras R interface uses the TensorFlow backend engine by default. To install both libraries together, we use install_keras().

library(keras)
install_keras()

The install_keras() function has several arguments, as follows:

install_keras(method = c("auto", "virtualenv", "conda"),
  conda = "auto", tensorflow = "default",
  extra_packages = c("tensorflow-hub"))

As can be seen from the previous chunk of code, there are three methods to install Keras and TensorFlow with the install_keras() function. It is worth mentioning that the only supported installation method on Windows is “conda”.

The default version of TensorFlow is the CPU version; however, if you wish to use your GPU, you are welcome to change the configuration and specify tensorflow = "gpu".
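
For example, a GPU installation would look like the following (a sketch based on the install_keras() documentation):

library(keras)
install_keras(tensorflow = "gpu")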

It is highly recommended to visit custom installation if you wish to do a custom installation of Keras and Tensorflow.

2 Data Preparation
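
The script below prepares the loan data end to end. It assumes the following packages are loaded (this list is inferred from the functions the script calls, so treat it as an assumption and load whichever subset your session is missing):

library(keras)      # deep learning interface
library(caret)      # preProcess() for bagged-tree imputation
library(Matrix)     # sparse.model.matrix() for dummy encoding
library(rsample)    # initial_split(), training() for the stratified split
library(DMwR)       # SMOTE() for class rebalancing
library(dplyr)      # setdiff() on data frames
library(data.table) # setDT() in the commented-out download code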

# urlToData = "https://assets.datacamp.com/production/course_1025/datasets/loan_data_ch1.rds"
# savePath = tempfile(fileext = ".rds")
# download.file(urlToData, destfile = savePath, mode = "wb")
# loanData = readRDS(savePath)
# # Clean up
# rm(urlToData, savePath)
# # Convert default status to factor
# setDT(loanData)
# #loanData[, `:=`(loan_status, factor(loan_status))]
# loanData= as.data.frame.matrix(loanData) 

setwd("C:\\Users\\Hassan Omidi\\Desktop\\ML and AI\\Implementation_part\\Credit_Keras_RStudio\\Credit_Keras_RStudio")

loanData = read.csv("loanData.csv", header = TRUE, sep = ",")[-1]

# theme_set(theme_bw())
# plot_missing(loanData)
# plot_bar(loanData)
# plot_boxplot(loanData, "loan_status")

# for(i in c(3,5)){
#   loanData[which(is.na(loanData[,i])), i] = mean(loanData[,i], na.rm = TRUE)
# }



# for(i in c(3,5)){
#   loanData[which(is.na(loanData[,i])), i] = preProcess(method="bagImpute", loanData[, i])
# }

str(loanData)
set.seed(12345)

# Log-transform annual income to reduce its skewness
loanData$annual_inc = log(loanData$annual_inc)

# Impute missing values with bagged-tree imputation (preProcess() from caret)
loanData.pre = preProcess(loanData, method = "bagImpute")
loanData.bagged = predict(loanData.pre, loanData)
loanData = loanData.bagged



( nidx <- grep(paste(c("numeric","integer"), collapse="|"), sapply(loanData, class)) )

maxs = apply(loanData[,nidx], 2, max) 
mins = apply(loanData[,nidx], 2, min)


# cormat <- round(cor(Data_to_numeric),2)
# head(cormat)
# melted_cormat <- melt(cormat)
# head(melted_cormat)
# 
# 
# ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
#   geom_tile()




# Min-max scale the numeric columns to [0, 1]
loanData[, nidx] = as.data.frame(scale(loanData[,nidx], 
                                       center = mins, scale = maxs - mins))



# Treat grade as an ordered factor (G = worst, ..., A = best) and home_ownership as a nominal factor
Levels = c("G","F","E","D","C","B","A")

loanData$grade = factor(loanData$grade, levels = Levels, ordered = TRUE)
loanData$home_ownership = factor(loanData$home_ownership)
#loanData$loan_status = factor(loanData$loan_status)



# Dummy-encode the predictors (the response loan_status is excluded by the formula)
Data_to_numeric = sparse.model.matrix(loan_status ~ .-1, data=loanData)
Data_to_numeric = as.matrix(Data_to_numeric)
Data_to_numeric = as.data.frame(Data_to_numeric)

# Replace the default grade dummies with a cumulative (thermometer) encoding of the ordered grades
out = matrix(0, nrow = length(loanData$grade), ncol = 7)

for (i in 1:length(loanData$grade)) {
  out[i, 1:as.numeric(loanData$grade[i])] = 1
}
colnames(out) = paste("grade", Levels, sep = "")

Data_to_numeric$gradeG = out[,1]
Data_to_numeric$gradeF = out[,2]
Data_to_numeric$gradeE = out[,3]
Data_to_numeric$gradeD = out[,4]
Data_to_numeric$gradeC = out[,5]
Data_to_numeric$gradeB = out[,6]
Data_to_numeric$gradeA = out[,7]

# Put the response back as the first column
Data_to_numeric = cbind(loanData$loan_status,Data_to_numeric)
colnames(Data_to_numeric)[1] = "loan_status"


# Stratified 2/3 - 1/3 train/test split on loan_status (initial_split() from rsample)
splitData = initial_split(Data_to_numeric, prop = 2/3, strata = "loan_status")
trainData = training(splitData)
testData = setdiff(Data_to_numeric,trainData)

# Class distribution before rebalancing
as.data.frame(table(trainData$loan_status))

# Rebalance the training set with SMOTE (loan_status must be a factor)
trainData$loan_status = as.factor(trainData$loan_status)

trainData = SMOTE(loan_status~., trainData, perc.over = 300, perc.under=130)

# Class distribution after rebalancing
as.data.frame(table(trainData$loan_status))

# Shuffle the training rows
set.seed(12)

trainData = trainData[sample(1:nrow(trainData)), ]

# Design matrices for Keras (drop the response column)
x_train = as.matrix(trainData[-1])
x_test = as.matrix(testData[-1])

# Test labels as a plain numeric vector
y_test = as.vector(t(testData[1]))

# loan_status became a factor for SMOTE; map the training labels back to a numeric 0/1 vector
y_train = as.numeric(trainData$loan_status == 1)

We save this script as Input_data.R.

3 Keras model training and evaluation

Similar to any other predictive model, the DTE process in Keras takes the following steps:

  • Defining the model, which can be broken into:
    • providing the number of layers and neurons for each layer; and
    • adding any regularization technique to avoid overfitting
  • Compiling the model, which includes:
    • defining the optimizer;
    • providing the loss function; and
    • identifying the metrics
  • Fitting the model, which involves:
    • introducing the number of batches;
    • the number of epochs; and
    • the validation split
  • Evaluating the model; for instance:
    • evaluating the model on the test data set; and
    • demonstrating the plots.
  • Predicting the classes/probabilities for the test data set.

A compact flowchart of the process is shown in the following figure.

3.1 Fundamental functions in Keras

3.1.1 Basic functions

In this section, a detailed breakdown of the required basic built-in functions and procedures in the Keras library, which are used in the aforementioned steps, is given. Supplementary functions, including those used for regularization techniques and hyperparameter tuning, among others, are discussed in the Supplementary functions subsection.

Define

  • The first step in the implementation of a multilayer perceptron neural network is to define a model. By model we mean the way the layers are organized. For our training we use the sequential model. It can be defined as follows:
model = keras_model_sequential()
  • Once a sequential model has been defined, we can add layers to it (input, hidden and output layers) by calling the layer_dense() function as follows:
 model %>%
  layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15))

Note 1: The %>% is the pipe operator in R. In simple terms, the pipe operator helps you avoid opening and closing lots of parentheses when you write your code, which makes the code more readable. If you are interested in knowing more about the pipe operator in R, see pipe.
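
For example, the following two calls are equivalent; the pipe simply passes its left-hand side as the first argument of the function on its right:

summary(keras_model_sequential())
keras_model_sequential() %>% summary()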

  • The common arguments used in practice in the layer_dense() function for deep learning models are (a short sketch using several of them follows the summary() call below):
    • units: a positive integer representing the dimension of the output space of the layer;
    • activation: the name of the activation function to use. For a list of activation functions available in Keras see activation;
    • bias_initializer: the initializer for the bias vector. For a list of initializers available in Keras see initializer;
    • weights: the initial weights for the layer;
    • input_shape: the dimensionality of the input in the first layer of the model. This argument is only used when defining the first layer of the model.
  • After adding customized layers and neurons to the model, we can see the details of the model in a more organized way by calling summary() function as follows:
summary(model)
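
For illustration, here is a hedged sketch of a first layer that sets several of the arguments listed above explicitly (the particular values are arbitrary examples, not recommendations):

model %>%
  layer_dense(units = 20,
              activation = "relu",
              bias_initializer = initializer_zeros(),
              input_shape = c(15))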

Compile

  • In this part, we need to compile the previously defined model with an appropriate loss function, optimization technique, and so on. The compile() function, which is the appropriate function to employ for this purpose, is called as follows:
model %>% compile(
  loss = 'binary_crossentropy',
  #optimizer = optimizer_rmsprop(lr = 0.001),
  optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
  metrics = c('accuracy')
)
  • The common arguments used in practice in the compile() function for deep learning models are:
    • optimizer: specifies the optimization technique (e.g., SGD, Adam, etc.) used for updating the weights and biases. For a list of the various optimization techniques in Keras, please visit optimizer and optimizerRStudio;
    • loss: specifies the name of an objective function or an objective function itself (e.g., binary_crossentropy, categorical_crossentropy, etc.). For a list of the various loss functions in Keras, please visit loss or lossRStudio;
    • metrics: specifies the list of metrics to be evaluated by the model during training and testing (e.g., binary_accuracy, etc.). A list of the various metrics in Keras can be found in metrics and kerasRStudio. A custom metric can also be defined and used in the metrics argument of compile(). It is also worth mentioning that more than one metric can be used at the same time when training and testing a model (see the sketch below); please visit kerasRStudio for a test case.
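
For instance, a compile() call tracking more than one metric at once could look like this (a sketch; the metric names are standard Keras strings):

model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_adam(lr = 0.001),
  metrics = c('accuracy', 'binary_accuracy')
)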

Fit

  • So far, we have not discussed the training data sets or training configurations such as the number of epochs and the batch size. To specify these arguments and other training-related configurations, we employ the fit() function in Keras. The fit() function can be employed as follows:
Fitted_model = model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 20,
  view_metrics = TRUE,
  #callbacks = callback_tensorboard("logs/run_a"),
  validation_split = 0.2)
  • The full list of arguments for the fit() function can be found in fit(). However, the common arguments mostly used in practice are:
    • x, y: the training data and the label data, respectively;
    • batch_size: the number of samples per gradient update. The default is 32;
    • epochs: the number of epochs to train the model;
    • callbacks: a list of callbacks to be called during training;
    • validation_split: a float between 0 and 1. The model sets apart the last fraction of the x and y data provided and uses it as a validation set;
    • validation_data: data provided to evaluate the model loss and metrics at the end of each epoch. If provided, it overrides validation_split (see the sketch after this list).
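
As a sketch of the last point, an explicit validation set can be passed instead of a split (x_val and y_val are hypothetical hold-out arrays assumed to have been prepared separately):

Fitted_model = model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 20,
  validation_data = list(x_val, y_val)
)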

Evaluate

  • The final step in the modeling process is to evaluate the trained model on the test data and to predict on new data. The Keras function used for the evaluation is evaluate(), which can be called as follows:
score = model %>% evaluate(
  x_test, y_test
   )

cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')

To generate predictions on new data, we can use predict_classes as follows:

model %>% predict_classes(x_test)
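
Note that predict_classes() has been removed in more recent releases of the Keras/TensorFlow stack; if it is unavailable in your installation, an equivalent workaround (an assumption on our part, not part of the original script) is to threshold the predicted probabilities yourself:

# Predicted probabilities, thresholded at 0.5 to obtain class labels
pred_prob = model %>% predict(x_test)
pred_class = as.integer(pred_prob > 0.5)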

3.1.2 Supplementary functions

This subsection deals with the functions in the R interface to Keras which are used for regularization techniques, parameter tuning, and other tasks.

  • The first regularization technique is \(l1\)/\(l2\) regularization. Depending on which norm we use in the penalty function, we call either the \(l1\)-related or the \(l2\)-related function inside layer_dense(). The argument of layer_dense() dealing with this type of regularization is kernel_regularizer, which can be used as follows:
layer_dense(units = 10, activation = 'sigmoid',
              kernel_regularizer = regularizer_l2(l = 0.001))

regularizer_l2(l = 0.001) specifies \(l2\) regularization with regularization factor \(l = 0.001\).

  • The other regularization technique which can be employed in KerasRStudio is the dropout technique. It can be added through layer_dropout() as follows:
layer_dense(units = 10, activation = 'sigmoid'
              ) %>%
  layer_dropout(rate = .4) %>%

As can be seen from the above code, the layer_dropout() function is appended to the layer we want to add to our model. The rate argument in layer_dropout() specifies the probability/fraction of input neurons that are randomly set to zero.

Tuning the hyperparameters is another step in the process of training a model, with the hope of improving the metrics as much as we can.

To perform hyperparameter tuning in KerasRStudio, we can implement all three methods of hyperparameter tuning: manual, grid, and Bayesian hyperparameter optimization. In this subsection, we explain how to specify the parameters to be tuned and how to call them in the body of the model. The final step, calling the tuning method (mainly grid or Bayesian), is discussed and implemented in the Hyperparameters tuning section.

Regardless of which method we use to tune the parameters, we need to identify the parameters we want to tune and assign flags to them. Depending on the class of the parameter, there are four different types of flags in KerasRStudio.

  • These flags are:
    • flag_numeric(name, default, description = NULL)

    • flag_integer(name, default, description = NULL)

    • flag_boolean(name, default, description = NULL)

    • flag_string(name, default, description = NULL)

In all four flag types, the name argument specifies the name of the parameter (which can then be referenced when training the model), the default argument specifies the default value of that parameter, and the description argument provides an explanation of it.

The following example demonstrates how the flags are used:

FLAGS = flags(
  flag_numeric("dropout1", 0.4),
  flag_numeric("dropout2", 0.3),
  flag_string("activation1", "relu"),
  flag_string("activation2", "relu")
)

The defined flags then need to be referenced in the body of the model in the DTE process. The following is an example of how the aforementioned flags are used:

model = keras_model_sequential()
model %>%
  layer_dense(units = 20, activation = FLAGS$activation1, input_shape = c(15)) %>%
  layer_dropout(rate = FLAGS$dropout1) %>%
  layer_dense(units = 10, activation = FLAGS$activation2,
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = FLAGS$dropout2) %>%
  layer_dense(units = 1, activation = "sigmoid")

3.2 Baseline model

In this section we provide a plain implementation of a multilayer perceptron neural network using Keras in R. The fundamental functions required to perform the DTE task are presented in the Fundamental functions in Keras section. Handling overfitting, regularization, and other model tuning are discussed in separate sections.

In the following piece of code we put the whole DTE process together for a plain multilayer perceptron neural network.

source("C:\\Users\\Hassan Omidi\\Desktop\\ML and AI\\Implementation_part\\Credit_Keras_RStudio\\Credit_Keras_RStudio\\Input_data.R")

# Model definition-------------------------------------
model = keras_model_sequential()
model %>%
  layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
  layer_dense(units = 10, activation = 'sigmoid') %>%
  layer_dense(units = 1, activation = 'sigmoid')

summary(model)

model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
  metrics = c('accuracy')
)


Fitted_model = model %>% fit(
  x_train, y_train,
  epochs = 20, batch_size = 128,
  validation_split = 0.2
  )


plot(Fitted_model)

score <- model %>% evaluate(
  x_test, y_test,
  verbose = 0
)

cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')

3.3 Multiple models with regularization

This section is devoted to demonstrating the overfitting problem and the DTE implementation of regularized versions of the baseline model. Two regularization techniques are discussed: \(l1\)/\(l2\) regularization and dropout. We graphically show how to compare the performance of these models.

We break the DTE implementation of the regularized versions of the baseline model into four parts. In the first part we provide the baseline implementation of the DTE process; in the second part we demonstrate how DTE is implemented with the \(l2\) regularization technique; in the third part we provide the DTE implementation of the dropout regularization technique; and in the last part the combined methods are implemented.

It is worth highlighting that at each step (when implementing the regularization) we graphically compare the loss and the other metrics between the baseline model and its regularized version.

  • Baseline model
model = keras_model_sequential()
model %>%
  layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
  layer_dense(units = 10, activation = 'sigmoid') %>%
  layer_dense(units = 1, activation = 'sigmoid')

summary(model)

model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
  metrics = c('accuracy')
)


# Keep the baseline training history for the comparisons below
Baseline_history = model %>% fit(
  x_train, y_train,
  epochs = 20, batch_size = 128,
  validation_split = 0.2
  )
  • Regularized model using \(l2\) technique
l2_model <- keras_model_sequential()
l2_model %>%
  layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
  layer_dense(units = 10, activation = 'sigmoid',
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dense(units = 1, activation = 'sigmoid')

l2_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.001),
  metrics = c('accuracy')
)

l2_history <- l2_model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 20,
  view_metrics = TRUE,
  validation_split = 0.2
)

# Compare training/validation loss of the baseline and the l2-regularized model
# (rownames_to_column() is from tibble, gather() from tidyr)
comparison_l2 = data.frame(
  Baseline_train= Baseline_history$metrics$loss,
  Baseline_val = Baseline_history$metrics$val_loss,
  l2_train = l2_history$metrics$loss,
  l2_val = l2_history$metrics$val_loss
)%>%rownames_to_column()
  comparison_l2$rowname = as.integer(comparison_l2$rowname)
comparison_l2 = comparison_l2%>%
  gather(key = "type", value = "value", -rowname)

ggplot(comparison_l2, aes(x = rowname, y = value, color = type)) +
  geom_line() +
  xlab("epoch") +
  ylab("loss")
  • Regularized model using dropout technique
drop_model <- keras_model_sequential()
drop_model %>%
  layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 10, activation = 'sigmoid'
              ) %>%
  layer_dropout(rate = .4) %>%
  layer_dense(units = 1, activation = 'sigmoid')

drop_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.001),
  metrics = c('accuracy')
)

drop_history <- drop_model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 20,
  view_metrics = TRUE,
  validation_split = 0.2
)

comparison_drop = data.frame(
  Baseline_train= Baseline_history$metrics$loss,
  Baseline_val = Baseline_history$metrics$val_loss,
  drop_train = drop_history$metrics$loss,
  drop_val = drop_history$metrics$val_loss
)%>%rownames_to_column()
  comparison_drop$rowname = as.integer(comparison_drop$rowname)
comparison_drop = comparison_drop%>%
  gather(key = "type", value = "value", -rowname)

ggplot(comparison_drop, aes(x = rowname, y = value, color = type)) +
  geom_line() +
  xlab("epoch") +
  ylab("loss")
  • Regularized model using both the \(l2\) and dropout techniques
dropl2_model <- keras_model_sequential()
dropl2_model %>%
  layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
  layer_dropout(rate = .3) %>%
  layer_dense(units = 10, activation = 'sigmoid',
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = .4) %>%
  layer_dense(units = 1, activation = 'sigmoid')

dropl2_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.001),
  metrics = c('accuracy')
)

dropl2_history <- dropl2_model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 20,
  view_metrics = TRUE,
  validation_split = 0.2
)


comparison_dropl2 = data.frame(
  Baseline_train= Baseline_history$metrics$loss,
  Baseline_val = Baseline_history$metrics$val_loss,
  dropl2_train = dropl2_history$metrics$loss,
  dropl2_val = dropl2_history$metrics$val_loss
)%>%rownames_to_column()
  comparison_dropl2$rowname = as.integer(comparison_dropl2$rowname)
comparison_dropl2 = comparison_dropl2%>%
  gather(key = "type", value = "value", -rowname)

ggplot(comparison_dropl2, aes(x = rowname, y = value, color = type)) +
  geom_line() +
  xlab("epoch") +
  ylab("loss")

3.4 Hyperparameters tuning

In this section we provide the full implementation of the hyperparameter tuning process using flags in RStudio. We implement both grid search and Bayesian hyperparameter optimization. The important functions for this purpose are discussed in the Supplementary functions subsection.

Note 2: It is imperative to call the appropriate type of flag when defining the hyperparameters in the model. If the parameter is numeric, we need to call flag_numeric(); if it is a string parameter, we must call flag_string(); and so on. We refer to flag for further information on these functions and their arguments.

The following chunk of code demonstrates the full implementation of the flag labeling as well as the flags’ use in the body of the model, before any tuning technique is applied. So that it can be referenced by the tuning functions below, we save this R script as “Credit_hyperparameter.R”.

rm(list = ls())

FLAGS = flags(
  flag_numeric("dropout1", 0.4),
  flag_numeric("dropout2", 0.3),
  flag_string("activation1", "relu"),
  flag_string("activation2", "relu")
)

model = keras_model_sequential()
model %>%
  layer_dense(units = 20, activation = FLAGS$activation1, input_shape = c(15)) %>%
  layer_dropout(rate = FLAGS$dropout1) %>%
  layer_dense(units = 10, activation = FLAGS$activation2,
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = FLAGS$dropout2) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
  metrics = c('accuracy')
)


history = model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 20,
  view_metrics = TRUE,
  validation_split = 0.2
)


score = model %>% evaluate(
  x_test, y_test,
  verbose = 0
)

cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')

3.4.1 Grid approach

To perform the tuning process under the grid approach in KerasRStudio, we call the tuning_run() function from the tfruns library as follows:

runs <- tuning_run("Credit_hyperparameter.R", sample = 0.3, flags = list(
   dropout1 = c(0.2, 0.4, 0.5),
   dropout2 = c(0.1, 0.3, 0.5),
   activation1 = c("relu", "softmax", "sigmoid"),
   activation2 = c("relu", "softmax", "sigmoid")
   ))

As can be seen from the previous chunk of code, the tuning_run() function takes several arguments, among which are the file name, sample, and flags. The file name provides the path to the training script (in our example, “Credit_hyperparameter.R”), and sample specifies the sampling rate over flag combinations. Sometimes the combination of different flags with multiple values makes the tuning process quite computationally expensive; in that case the sample argument makes the tuning run on a random sample of the flag combinations instead of all of them. The flags argument of tuning_run() specifies the list of all flag names together with the candidate values for each.
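
Once the runs complete, the data frame returned by tuning_run() can be inspected to find the best flag combination. A minimal sketch (the metric column name, here assumed to be metric_val_acc, depends on which metrics were tracked during training):

# Sort runs by validation accuracy and look at the best ones
runs = runs[order(runs$metric_val_acc, decreasing = TRUE), ]
head(runs)

# Open the full tfruns report for the best run
view_run(runs$run_dir[1])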

3.4.2 Bayesian approach (CloudML)

Another approach to the hyperparameter tuning process is the Bayesian approach. It can be employed in RStudio using CloudML and the R interface to Google CloudML, the cloudml package. To install the “cloudml” package as well as the “Google Cloud SDK”, we refer to CloudML.

Once the required packages are installed, we set up the training configuration for later use in the cloudml-related functions. The following script demonstrates how to create a training configuration file. We name it “tuning.yml” for later reference.

trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: val_acc
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: dropout1
        type: DOUBLE
        minValue: 0.2
        maxValue: 0.5
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: dropout2
        type: DOUBLE
        minValue: 0.1
        maxValue: 0.5
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: activation1
        type: CATEGORICAL
        categoricalValues: [relu, softmax, sigmoid]
      - parameterName: activation2
        type: CATEGORICAL
        categoricalValues: [relu, softmax, sigmoid]

Note 3: It is clear from the tuning.yml file that there are several parameters which need to be specified when defining the training configuration file for a tuning job. For instance, hyperparameterMetricTag specifies the metric to optimize (either maximize or minimize) when training the model; for Keras, the options are “acc”, “loss”, “val_acc” and “val_loss”. The type parameter can take one of “INTEGER”, “DOUBLE”, “CATEGORICAL” or “DISCRETE”. For other configurations we refer to training config.

The next step is to submit a hyperparameter tuning job using CloudML. This can be done by calling the cloudml_train() function as follows:

cloudml_train("Credit_hyperparameter.R", config = "tuning.yml")

As can be seen from the previous line of code, we need to specify the training configuration when we call the cloudml_train() function for the tuning job.

3.4.3 Bayesian optimization in R

In this subsection, we provide an overview of how to employ the rBayesianOptimization package in R to tune hyperparameters using the Bayesian approach.

The full implementation of hyperparameter tuning using the rBayesianOptimization package does not require any flag labeling of the hyperparameters.

To apply the BayesianOptimization() function, we first define a function (with the hyperparameters as inputs) which performs the DTE process, as follows:

rm(list = ls())

setwd("C:\\Users\\Hassan Omidi\\Desktop\\ML and AI\\Implementation_part\\Credit_Keras_RStudio\\Credit_Keras_RStudio")
source("C:\\Users\\Hassan Omidi\\Desktop\\ML and AI\\Implementation_part\\Credit_Keras_RStudio\\Credit_Keras_RStudio\\Input_data.R")


training_credit = function(initParams){

# Define Model --------------------------------------------------------------

model = keras_model_sequential()
model %>%
  layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
  layer_dropout(rate = initParams$dropout1) %>%
  layer_dense(units = 10, activation = 'sigmoid',
              kernel_regularizer = regularizer_l2(l = 0.001)) %>%
  layer_dropout(rate = initParams$dropout2) %>%
  layer_dense(units = 1, activation = 'sigmoid')

summary(model)

model %>% compile(
  loss = 'binary_crossentropy',
  #optimizer = optimizer_rmsprop(lr = 0.001),
  optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
  metrics = c('accuracy')
)



# Training & Evaluation ----------------------------------------------------

history = model %>% fit(
  x_train, y_train,
  batch_size = 128,
  epochs = 20,
  view_metrics = TRUE,
  verbose = 0,
  #callbacks = callback_tensorboard("logs/run_a"),
  validation_split = 0.2
)


score = model %>% evaluate(
  x_test, y_test,
  verbose = 0
)


# Predicted classes on the test set
A = model %>% predict_classes(x_test)
A = as.factor(A)
y_test2 = as.factor(y_test)

# Confusion matrix (confusionMatrix() is from the caret package)
con = confusionMatrix(A, y_test2, positive="1")


#return(history$metrics$val_acc[20])
# Score to maximize: the number of correctly predicted defaults (true positives)
return(con$table[2,2])
}

In the next step, we define another function that maximizes/minimizes our desired metric. This function is then fed into the BayesianOptimization() function to perform the tuning process.

initParams = list ( dropout1 = 0.1, dropout2 = 0.1)
maximizeACC = function(dropout1, dropout2) {
  
  replaceParams = list ( dropout1 = dropout1, dropout2 = dropout2)
  updatedParams = modifyList(initParams, replaceParams)
  
  score = training_credit(updatedParams)
  results = list (Score = score,  Pred = 0)
  return(results)
}


# Search bounds for each hyperparameter
boundsParams = list (dropout1 = c(0.1, 0.7), dropout2 = c(0.1, 0.7))

# Bayesian optimization over the dropout rates (BayesianOptimization() is from rBayesianOptimization)
Final_calibrated = BayesianOptimization(maximizeACC, bounds = boundsParams, 
  init_grid_dt = as.data.table(boundsParams), 
    init_points = 10, n_iter = 30, acq = "ucb", 
  kappa = 2.576, eps = 0, verbose = FALSE)


tail(Final_calibrated$History)

4 Model interpretability/diagnosis

Once an ML model is fitted to the training data, one of the important questions to pose is:

Question: How do the features influence the predictions of the underlying fitted machine learning model?

Answer: There are several ways to assess the impact of a feature on the fitted model, among which are:

  • Partial Dependence Plot (PDP): This plot shows the marginal effect of a feature on the outcome of a fitted model. Mathematically, the marginal distribution of all features except the one under study is used when the PDP is plotted;
  • Accumulated Local Effects (ALE) Plot: This plot again shows the effect of a feature on the outcome of a fitted model. However, one of the main differences between the ALE plot and the PDP is that it uses the conditional distribution rather than the marginal distribution when calculating the average effect of the feature;
  • Permutation feature importance (PFI): PFI provides a score for each feature after permuting that feature’s values. The idea behind this concept is that a feature is important for the underlying model if permuting its values increases the model error.

Note 4: Each of the aforementioned tests/techniques comes with advantages and disadvantages. In the following subsections, we describe these techniques, explain their advantages/disadvantages, and present the R implementation of each.

4.1 Partial dependence plot (PDP)

PDP essentially shows the effect of a given feature, at different values of that feature, on the model’s predicted outputs when averaging over all other features. The plot is quite intuitive: the curve at a certain value of the underlying feature represents the average prediction of the model under the assumption that all data points take on that value of the feature. Furthermore, the plot is easy to interpret and, when the features are uncorrelated, it simply shows on average how the prediction changes when the given feature is changed.

The partial dependence function is defined as follows:

\[ \tilde{f} (x) = \int_{y \in Y} \tilde{f}(x,\tilde{y}) d\mathbb{P}(\tilde{y}),\] where \(x\) is the fixed value for the underlying feature, \(\tilde{y}\) is the vector of all features but the fixed one and \(\tilde{f}\) is the machine learning model being studied.

For each value of the fixed feature, we calculate \(\tilde{f} (x)\) and then plot them to get PDP for that given feature.
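
To make the averaging explicit, here is a minimal sketch of how a single point of the partial dependence curve could be computed by hand for the fitted Keras model (the data frame, feature name and value below are assumptions for illustration only):

# Partial dependence of the model output at one value of one feature:
# fix that feature at `value` for every observation, predict, then average
pdp_point = function(model, X, feature, value) {
  X_mod = X
  X_mod[, feature] = value                 # force every row to the chosen value
  mean(predict(model, as.matrix(X_mod)))   # average prediction over the data
}

# Example (hypothetical): average prediction when loan_amnt is fixed at 0.5
# pdp_point(model, as.data.frame(x_train), "loan_amnt", 0.5)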

Caveat 1: The partial dependence function/plot is based on the assumption that the fixed feature is uncorrelated with the other features in the model. This is a strong assumption, and in many practical cases it can be strongly violated. In the case of correlated features, PDP can be misleading and the inference derived from it might result in mis-selection of features. The following plot demonstrates the case where two features, \(X_1\) and \(X_2\), are correlated. In fact, PDP falsely assumes that the conditional distribution of \(X_2\) at \(X_1 = 0.75\) is the same as the marginal distribution of \(X_2\). This tends to assign weight to unlikely combinations of \(X_1\) and \(X_2\).

Correlated features and PDP

Implementation of PDP in R (pdp package)

Partial dependence plots can be drawn in R using the pdp package. The three main functions to call are partial(), autoplot() and grid.arrange() (the latter from the gridExtra package). In simple words, we first select the feature whose effect on the model’s outputs we want to study; this is done by calling partial(). In the next step we pass the object returned by partial() to autoplot(). Finally, we pass the output of autoplot() to grid.arrange() to draw the PDP for the selected feature.

We recommend visiting this page if you are interested in a practical implementation of the pdp package. In the following, we provide a block of code producing pdp plots for the “grade” and “age” features of the credit default model.

# Partial dependence of the fitted model on "grade" and "age"
Model_grade = partial(fitted_model, pred.var = c("grade"), chull = TRUE)
Model_grade_plot = autoplot(Model_grade, contour = TRUE)

Model_age = partial(fitted_model, pred.var = c("age"), chull = TRUE)
Model_age_plot = autoplot(Model_age, contour = TRUE)

# Arrange both PDPs in one figure (grid.arrange() is from gridExtra)
grid.arrange(Model_grade_plot, Model_age_plot)

4.2 Accumulated local effects (ALE) plot

One of the main problems with PDP, highlighted in the previous subsection, is the underlying assumption about the distribution of the features. The assumption that features are uncorrelated poses a serious danger to the final inference about feature importance when the features are in fact correlated.

To circumvent this problem, the accumulated local effects plot was introduced. The ALE plot is an alternative to PDP; however, it uses the conditional distribution of the features and averages differences in predictions rather than the predictions themselves. In a nutshell, ALE plots show how the model predictions change within a small window around a fixed value of the underlying feature, for all data instances in that window.

We recommend visiting ALE plot if you are interested in further information about the ALE plot and the theory behind it, accompanied by different examples.

Implementation of ALE plots in R (ALEPlot package)

ALE plots can be created using the ALEPlot package in R. The main function to employ in this package is ALEPlot(). The arguments frequently used in this function are:

  • X, the data frame of predictor variables to which the model was fit;
  • X.model, the fitted learning model;
  • pred.fun, a user-defined function with two arguments: the X.model and the data on which X.model is applied to predict the response variable;
  • J, a numeric scalar or two-length vector of indices of the features for which the ALE plot will be calculated;
  • K, the number of intervals into which the feature range is divided for the ALE plot calculation.

In the following we provide the code for an ALE plot in the credit risk case, using the “grade” feature.

DAT = data.frame(loan_status = loanData$loan_status,
                 loan_amnt = loanData$loan_amnt, int_rate = loanData$int_rate,
                 grade = loanData$grade, emp_length = loanData$emp_length,
                 home_ownership = loanData$home_ownership,
                 annual_inc = loanData$annual_inc, age = loanData$age)

# Prediction wrapper that ALEPlot() calls on the fitted Keras model
prediction.DAT = function(X.model, newdata) as.numeric(predict_classes(X.model, newdata))

par(mfrow = c(2,4))

# ALE plot for the third predictor column of DAT ("grade"), using K = 50 intervals
ALE.grade = ALEPlot(DAT[,2:8], fitted_model, pred.fun = prediction.DAT, J = 3, K = 50, NA.plot = TRUE)

4.3 Permutation feature importance (PFI)

This subsection is devoted to studying the concept of feature importance in machine learning models. The concept tries to shed light on a black-box model, aiming at understanding the importance (contribution) of the features underlying the fitted model.

Permutation feature importance (PFI), or model reliance, in simple terms measures a feature’s importance by calculating the increase in the fitted model’s prediction error after permuting that feature. A feature is considered important if the model error increases when the feature’s values are permuted; otherwise, the feature is considered unimportant.

Algorithm underlying PFI

The following algorithm is based on Breiman (2001) and Fisher, Rudin, and Dominici (2018).

Let’s assume that the trained model \(\tilde{f}\) is applied to the feature matrix \(X\) with the target vector \(Y\) and the error measure \(L(Y,\tilde{Y})\).

To calculate PFI for each variable:

  • Estimate the original model error, denoted \(e_{orig}(\tilde{f}) = L(Y, \tilde{f}(X))\);
  • For each feature, \(i = 1, \dots, n\):
    • Generate feature matrix \(X_{perm_i}\) by permuting feature \(X_i\) in \(X\);
    • Estimate the model error under the new feature matrix denoted as \(e_{perm_i}(\tilde{f}) = L(Y, \tilde{f}(X_{perm_i}))\);
    • Then calculate the relative PFI for feature \(i\) as \(FI_i = \frac{e_{perm_i}(\tilde{f})}{e_{orig}(\tilde{f})}\) or the absolute one as \(FI_i = e_{perm_i}(\tilde{f}) - e_{orig}(\tilde{f})\).
  • This procedure can be followed to calculate FI for all relevant features of the model. Once completed, sort the variables by descending FI.

The following block of code provides the implementation of PFI in R.

PFI = function(X, Y, f){
  # Original model error (loss on the unpermuted data)
  score_orig = (f %>% evaluate(X, Y, verbose = 0))$loss
  
  FI = numeric(ncol(X))
  for (i in 1:ncol(X)){
    # Permute the values of feature i, keeping the column order intact
    Xi = X
    Xi[, i] = sample(Xi[, i], replace = FALSE)
    
    # Model error after permutation, relative to the original error
    score_perm = (f %>% evaluate(Xi, Y, verbose = 0))$loss
    FI[i] = score_perm / score_orig
  }
  
  PFI_final = data.frame("Features" = colnames(X), "Scores" = FI)
  # Sort the features by descending importance
  return(PFI_final[order(PFI_final$Scores, decreasing = TRUE), ])
}

Should We Compute Importance on Training or Test Data?

There is no definite answer!

Personally I prefer TEST data.

5 Feature engineering/selection

There are various ways to engineer an optimal set of features that represent the relationship between the response variable and the inputs well.

In very general, yet accurate, terms, we call a feature good for a model if it is relevant to the model output but not redundant with (i.e., highly dependent on) any of the other relevant features.

The feature selection methods are broadly categorized into three types:

  • Filter methods;
  • Wrapper methods; and
  • Embedded methods.

In this note, we only cover filter-based methods and leave the other two approaches to the reader for further investigation.

5.1 Filter-Based techniques

Filter methods basically use ranking techniques as the principal criterion for feature selection. That is, candidate variables/features are assigned a score by an appropriate metric/ranking criterion, and the variables with a score below a predetermined threshold are removed.

Some of the well-known techniques/tests which fall under the filter methods’ umbrella are listed in the following table.

This table lists tests for checking the relation between the input variables and the response variable; however, the tests can also be used to check the relations between features.

One of the pitfalls of the tests listed above is that they do not consider the redundancy between features when selecting the optimal set of features. In other words, they only consider the relevance of the selected feature to the response variable. To overcome this limitation, a correlation-based method (based on information entropy) called Fast Correlation Based Feature Selection (FCBF) was introduced in the paper by Yu and Liu.

In the remainder of this section, we provide a fuller explanation of the FCBF algorithm and its underlying theoretical framework.

Correlation-Based Measure (FCBF algorithm)

This approach was developed to overcome the shortcomings of the previously highlighted tests and techniques. FCBF relies on an association/correlation measure based on the concept of entropy. To present FCBF, we first need a quick overview of some prerequisite notation and concepts. Entropy, in simple terms, measures the average rate of uncertainty of a given random variable. In mathematical terms:

\[ H(X) = -\sum_{i} \mathbb{P}(x_i) log_2(\mathbb{P}(x_i)),\] where \(X\) is the underlying variable which takes different values, \(x_i\). The definition above can be extended to a continuous random variable as well.

Applying conditional and joint distributions between two variables, \(X\) and \(Y\), we can define the entropy of \(X\) after observing values of \(Y\) as follows:

\[H(X|Y) = -\sum_{j} \mathbb{P}(y_j) \sum_i \mathbb{P}(x_i|y_j) log_2(\mathbb{P}(x_i|y_j)),\] Following conditional entropy we define another measure called Information gain as follows:

\[IG(X|Y ) = H(X) - H(X|Y).\]

Based on the equation above, the higher the information gain (IG), the more correlated the two variables are. The last concept/measure needed before defining FCBF is the symmetrical uncertainty (SU), defined as follows:

\[ SU(X,Y) = 2 \left[\frac{IG(X|Y)}{H(X) + H(Y)}\right].\] SU takes values between 0 and 1, with 1 indicating that knowledge of the value of either variable completely predicts the value of the other, and 0 indicating that \(X\) and \(Y\) are independent. The metric used in the definition of the FCBF method is SU.
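
As a quick illustration of these quantities, the following sketch computes the entropy, information gain, and symmetrical uncertainty of two discrete vectors from their empirical frequencies (the helper function names and the toy data are our own, purely for illustration):

# Empirical entropy of a discrete vector (base 2)
entropy = function(x) {
  p = table(x) / length(x)
  -sum(p * log2(p))
}

# Conditional entropy H(X | Y)
cond_entropy = function(x, y) {
  py = table(y) / length(y)
  sum(sapply(names(py), function(v) py[[v]] * entropy(x[y == v])))
}

# Symmetrical uncertainty SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y))
symmetrical_uncertainty = function(x, y) {
  ig = entropy(x) - cond_entropy(x, y)
  2 * ig / (entropy(x) + entropy(y))
}

# Toy example: a feature that perfectly determines the class gives SU = 1
x = c("a", "a", "b", "b")
y = c(0, 0, 1, 1)
symmetrical_uncertainty(x, y)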

In the remainder of this part we provide a high level description of the method and refer the reader to the article FCBF for a thorough understanding of the algorithm.

Define the concept of Predominant Correlation as follows:

The correlation between a feature \(F_i\) and the class \(C\) is predominant if and only if \(SU_{i,c}\geq \delta,\) and \(\forall F_j \in S' (j\neq i),\) there exists no \(F_j\) such that \(SU_{j,i} \geq SU_{i,c}\), where \(S'\) is the set of features whose correlation with the class C is at least \(\delta\).

If there exists such an \(F_j\) for a feature \(F_i\), we call it a redundant peer of \(F_i\), and we use \(S_{P_{i}}\) to denote the set of all redundant peers of \(F_i\). If \(S_{P_{i}}\) is non-empty for a given \(F_i \in S'\), we divide it into two parts as follows:

\[S^{+}_{P_{i}} = \{ F_j | F_j \in S_{P_i}, SU_{j,c} > SU_{i,c}\},\] \[S^{-}_{P_{i}} = \{ F_j | F_j \in S_{P_i}, SU_{j,c} \leq SU_{i,c}\}.\]

A feature is predominant to the class, if and only if its correlation to the class is predominant or can become predominant after removing its redundant peers.

The following figure demonstrates the FCBF algorithm based on SU metrics and the concept of predominancy of features.

FCBF Algorithm

Implementation of FCBF algorithm in R (Biocomb package)

The FCBF algorithm is implemented in the Biocomb package in R. We can call the select.fast.filter() function from the Biocomb package to select good features in the sense described above. The main arguments of select.fast.filter() are as follows:

  • matrix: a dataset, i.e. a matrix of features whose last column holds the class labels. The class labels can be numerical or character values; the maximal number of classes is ten;

  • disc.method: a method for feature discretization, including minimal description length (MDL) and equal frequency, among others;

  • threshold: a numeric value for the correlation of features with class or within features.

select.fast.filter(matrix,disc.method,threshold)

An example of how to use the function above is provided in the following block of code.

library(Biocomb)

# data_set: a data frame of features whose last column holds the class labels
# (placeholder; substitute your own prepared data)
disc.method = "MDL"
threshold = 0.5

select.fast.filter(data_set, disc.method, threshold)