This is a demo of an end-to-end implementation of deep neural networks (DNNs), a subclass of machine learning (artificial intelligence) models, in R, using the R interface to Keras, a high-level neural networks API developed in Python. In this demo, we apply DNN models to a loan default data set. The demo is organized as follows:
In Section Why Keras?, we provide an overview of why Keras is important when dealing with DNN models, and we list alternatives to Keras. In Section Why R interface?, we highlight what makes R and the R interface to Keras a very useful tool to work with. The installation procedure is also documented, along with the required R code.
In Section Data Preparation, we demonstrate the important steps of exploratory data analysis within the model-fitting process, using a credit default loans data set.
Model diagnosis metrics/tests, including PDP, ALE plots, and PFI, are discussed in Section Model interpretability/diagnosis.
There are various deep learning frameworks available today that enable developers, academics, and practitioners to turn ideas into results; to name a few: Tensorflow, Theano, Caffe, and Keras.
However, there are several advantages of using Keras over other frameworks, which include:
It is worth highlighting that Keras has, after Tensorflow, the strongest adoption among deep learning frameworks in both industry and the research community (including CERN and NASA).
Due to its user-friendly nature, R has a strong influence across industry and academia. From a data science perspective, R has numerous packages that help implement deep learning models, just as for other machine learning models. For an overview of deep learning packages in R, just try to click here. No danger, I promise! :)
Given these advantages of Keras over other frameworks and the user-friendly nature of R, there exist two R interfaces to Keras: the kerasR package and RStudio's keras package.
Question: What does it mean when a package, such as RStudio's keras package, is "an interface" to another package, the Python Keras?
Answer: In short, an R interface means that you, as a user, can enjoy the flexibility and user-friendly features of R and, at the same time, have access to the full power of the Python Keras package. The best of both worlds!
We demonstrate how to install RStudio’s keras package in the next tab.
To install RStudio’s keras package, first install the R package from CRAN as follows:
install.packages("keras")
In the next step, we need to install the TensorFlow and Keras libraries, because the Keras R interface uses the TensorFlow backend engine by default. To install both libraries together, we use install_keras().
library(keras)
install_keras()
The install_keras() function has several arguments, as follows:
install_keras(method = c("auto", "virtualenv", "conda"),
conda = "auto", tensorflow = "default",
extra_packages = c("tensorflow-hub"))
As can be seen from the previous chunk of code, there are three methods to install Keras and Tensorflow when using the install_keras() function. It is worth mentioning that the only supported installation method on Windows is "conda".
The default version of Tensorflow is the CPU version; however, if you wish to enjoy your GPU, you are welcome to change the configuration and specify tensorflow = "gpu".
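For example, a GPU installation would look as follows:
library(keras)
install_keras(tensorflow = "gpu") # installs the GPU build of TensorFlow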
It is highly recommended to visit custom installation if you wish to do a custom installation of Keras and Tensorflow.
# urlToData = "https://assets.datacamp.com/production/course_1025/datasets/loan_data_ch1.rds"
# savePath = tempfile(fileext = ".rds")
# download.file(urlToData, destfile = savePath, mode = "wb")
# loanData = readRDS(savePath)
# # Clean up
# rm(urlToData, savePath)
# # Convert default status to factor
# setDT(loanData)
# #loanData[, `:=`(loan_status, factor(loan_status))]
# loanData= as.data.frame.matrix(loanData)
setwd("C:\\Users\\Hassan Omidi\\Desktop\\ML and AI\\Implementation_part\\Credit_Keras_RStudio\\Credit_Keras_RStudio")
loanData = read.csv("loanData.csv", header = TRUE, sep = ",")[-1]
# theme_set(theme_bw())
# plot_missing(loanData)
# plot_bar(loanData)
# plot_boxplot(loanData, "loan_status")
# for(i in c(3,5)){
# loanData[which(is.na(loanData[,i])), i] = mean(loanData[,i], na.rm = TRUE)
# }
# for(i in c(3,5)){
# loanData[which(is.na(loanData[,i])), i] = preProcess(method="bagImpute", loanData[, i])
# }
str(loanData)
set.seed(12345)
loanData$annual_inc = log(loanData$annual_inc)
loanData.pre = preProcess(loanData, method="bagImpute")
loanData.bagged = loanData
loanData.bagged = predict(loanData.pre, loanData.bagged)
loanData = loanData.bagged
( nidx <- grep(paste(c("numeric","integer"), collapse="|"), sapply(loanData, class)) )
maxs = apply(loanData[,nidx], 2, max)
mins = apply(loanData[,nidx], 2, min)
# cormat <- round(cor(Data_to_numeric),2)
# head(cormat)
# melted_cormat <- melt(cormat)
# head(melted_cormat)
#
#
# ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
# geom_tile()
loanData[, nidx] = as.data.frame(scale(loanData[,nidx],
center = mins, scale = maxs - mins))
Levels = c("G","F","E","D","C","B","A")
loanData$grade = factor(loanData$grade, levels = Levels, ordered = TRUE)
loanData$home_ownership = factor(loanData$home_ownership)
#loanData$loan_status = factor(loanData$loan_status)
Data_to_numeric = sparse.model.matrix(loan_status ~ .-1, data=loanData)
Data_to_numeric = as.matrix(Data_to_numeric)
Data_to_numeric = as.data.frame(Data_to_numeric)
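# Thermometer-style encoding of the ordered grade factor: a loan at grade level k
# gets 1s in its first k grade indicator columns (ordered G, F, E, D, C, B, A)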
out = matrix(0, nrow = length(loanData$grade), ncol = 7)
for (i in 1:length(loanData$grade)) {
out[i, 1:as.numeric(loanData$grade[i])] = 1
}
colnames(out) = paste("grade", Levels, sep = "")
Data_to_numeric$gradeG = out[,1]
Data_to_numeric$gradeF = out[,2]
Data_to_numeric$gradeE = out[,3]
Data_to_numeric$gradeD = out[,4]
Data_to_numeric$gradeC = out[,5]
Data_to_numeric$gradeB = out[,6]
Data_to_numeric$gradeA = out[,7]
Data_to_numeric = cbind(loanData$loan_status,Data_to_numeric)
colnames(Data_to_numeric)[1] = "loan_status"
splitData = initial_split(Data_to_numeric, prop = 2/3, strata = "loan_status")
trainData = training(splitData)
testData = testing(splitData)
as.data.frame(table(trainData$loan_status))
trainData$loan_status = as.factor(trainData$loan_status)
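# Balance the training classes with SMOTE: perc.over = 300 creates three synthetic
# minority (default) cases per original one; perc.under = 130 samples 1.3 majority
# cases for each synthetic minority case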
trainData = SMOTE(loan_status~., trainData, perc.over = 300, perc.under=130)
as.data.frame(table(trainData$loan_status))
set.seed(12)
trainData = trainData[sample(1:nrow(trainData)), ]
x_train = trainData[-1]
x_train = as.matrix(x_train)
x_test = testData[-1]
x_test = as.matrix(x_test)
y_train = as.numeric(as.character(trainData$loan_status)) # 0/1 numeric target vector
y_test = as.vector(t(testData[1]))
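A quick sanity check on the resulting objects (a sketch; the exact dimensions depend on the split and the SMOTE output):
dim(x_train); length(y_train)
dim(x_test); length(y_test)
table(y_train) # the two classes should now be roughly balanced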
We save this script as Input_data.R.
Similar to any other predictive model, the DTE (define-train-evaluate) process in Keras takes the following steps;
a compact flowchart of the process is shown in the following figure.
In this section, a detailed breakdown of the basic built-in functions and procedures of the Keras library used in the aforementioned steps is introduced. Supplementary functions, including those used for regularization techniques and the parameter tuning task, among others, are discussed in the Supplementary functions subsection.
Define
model = keras_model_sequential()
model %>%
layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15))
Note 1: %>% is the pipe operator in R. In simple terms, the pipe operator helps you avoid opening and closing lots of parentheses when writing code, which makes the code more readable. If you are interested in knowing more about the pipe operator in R, see pipe.
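For instance, the following two calls are equivalent (the pipe passes its left-hand side as the first argument of the call on its right):
round(exp(1), 2)    # classic nested call
exp(1) %>% round(2) # the same computation written with the pipe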
summary(model)
Compile
model %>% compile(
loss = 'binary_crossentropy',
#optimizer = optimizer_rmsprop(lr = 0.001),
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
Fit
Fitted_model = model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 20,
view_metrics = TRUE,
#callbacks = callback_tensorboard("logs/run_a"),
validation_split = 0.2)
score = model %>% evaluate(
x_test, y_test
)
cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')
To generate predictions on new data, we can use predict_classes as follows:
model %>% predict_classes(x_test)
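If predicted default probabilities are needed instead of hard class labels, the companion function from the same version of the keras package can be used:
model %>% predict_proba(x_test) # probabilities of the positive (default) class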
This subsection deals with the functions in R interface to Keras which are used for regularization techniques, tuning parameters task, among others.
layer_dense(units = 10, activation = 'sigmoid',
kernel_regularizer = regularizer_l2(l = 0.001))
regularizer_l2(l = 0.001) specifies l2 regularization with regularization factor \(l = 0.001\).
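Concretely, the penalty this regularizer adds to the layer's contribution to the training loss is
\[ l \sum_{k} w_k^2, \qquad l = 0.001, \]
where \(w_k\) are the entries of the layer's kernel (weight matrix).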
layer_dense(units = 10, activation = 'sigmoid') %>%
layer_dropout(rate = .4)
As can be seen from the above code, the layer_dropout() function is attached to the layer we want to add to our model. The rate argument in layer_dropout() specifies the fraction of input units that is randomly set to zero during training.
Parameter tuning is another step in the process of training a model, with the hope of improving the metrics as much as we can.
To perform hyperparameter tuning in KerasRStudio, we can implement all three methods of hyperparameter tuning: manual, grid, and Bayesian hyperparameter optimization. In this subsection, we explain how to specify the parameters to be tuned and how to call them in the body of the model. The final step, calling the tuning method (mainly grid or Bayesian), is discussed and implemented in the Hyperparameters tuning section.
Regardless of which method we use to tune the parameters, we need to identify the parameters to be tuned and assign flags to them. Depending on the class of the parameter, there are four different types of flags in KerasRStudio.
flag_numeric(name, default, description = NULL)
flag_integer(name, default, description = NULL)
flag_boolean(name, default, description = NULL)
flag_string(name, default, description = NULL)
In all four flag types, the argument name specifies the name of the parameter (which can be referenced when training the model), the argument default specifies the parameter's default value, and the description argument provides an explanation of the parameter.
The following example demonstrates how the flags are used:
FLAGS = flags(
flag_numeric("dropout1", 0.4),
flag_numeric("dropout2", 0.3),
flag_string("activation1", "relu"),
flag_string("activation2", "relu"),
)
The defined flags then need to be called in the body of the model in the DTE process. The following is an example of how the aforementioned flags are used:
model = keras_model_sequential()
model %>%
layer_dense(units = 20, activation = FLAGS$activation1, input_shape = c(15)) %>%
layer_dropout(rate = FLAGS$dropout1) %>%
layer_dense(units = 10, activation = FLAGS$activation2,
kernel_regularizer = regularizer_l2(l = 0.001)) %>%
layer_dropout(rate = FLAGS$dropout2) %>%
layer_dense(units = 1, activation = "sigmoid")
In this section, we provide a plain implementation of a multilayer perceptron neural network using Keras in R. The fundamental functions required to perform the DTE task are presented in the Fundamental functions in Keras section. Treating overfitting (regularization) and other model tuning are discussed in separate sections.
In the following piece of code, we put together the whole DTE process for a plain multilayer perceptron neural network.
source("C:\\Users\\Hassan Omidi\\Desktop\\ML and AI\\Implementation_part\\Credit_Keras_RStudio\\Credit_Keras_RStudio\\Input_data.R")
# Model definition-------------------------------------
model = keras_model_sequential()
model %>%
layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
layer_dense(units = 10, activation = 'sigmoid') %>%
layer_dense(units = 1, activation = 'sigmoid')
summary(model)
model %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
Fitted_model = model %>% fit(
x_train, y_train,
epochs = 20, batch_size = 128,
validation_split = 0.2
)
plot(Fitted_model)
score <- model %>% evaluate(
x_test, y_test,
verbose = 0
)
cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')
This section is devoted to demonstrating the overfitting problem and the DTE implementation of regularized versions of the baseline model. Two regularization techniques are discussed: \(l_1\)/\(l_2\) regularization and dropout. We graphically show how to compare the performance of these models.
We break down the DTE implementation of the regularized versions of the baseline model in four pieces. In the first part we provide the baseline implementation of the DTE process and in the second part we demonstrate how DTE is implemented using \(l2\) regularization technique. In the third part we provide the DTE implementation of dropout regularization technique for neural networks. In the last part, the combined methods are implemented.
It is worth highlighting that at each step (when implementing the regularization) we graphically compare the loss and other metrics between the baseline and the regularized version of it.
model = keras_model_sequential()
model %>%
layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
layer_dense(units = 10, activation = 'sigmoid') %>%
layer_dense(units = 1, activation = 'sigmoid')
summary(model)
model %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
Baseline_history = model %>% fit(
x_train, y_train,
epochs = 20, batch_size = 128,
validation_split = 0.2
)
l2_model <- keras_model_sequential()
l2_model %>%
layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
layer_dense(units = 10, activation = 'sigmoid',
kernel_regularizer = regularizer_l2(l = 0.001)) %>%
layer_dense(units = 1, activation = 'sigmoid')
l2_model %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_rmsprop(lr = 0.001),
metrics = c('accuracy')
)
l2_history <- l2_model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 20,
view_metrics = TRUE,
validation_split = 0.2
)
library(tibble)  # rownames_to_column()
library(tidyr)   # gather()
library(ggplot2)
comparison_l2 = data.frame(
Baseline_train= Baseline_history$metrics$loss,
Baseline_val = Baseline_history$metrics$val_loss,
l2_train = l2_history$metrics$loss,
l2_val = l2_history$metrics$val_loss
)%>%rownames_to_column()
comparison_l2$rowname = as.integer(comparison_l2$rowname)
comparison_l2 = comparison_l2%>%
gather(key = "type", value = "value", -rowname)
ggplot(comparison_l2, aes(x = rowname, y = value, color = type)) +
geom_line() +
xlab("epoch") +
ylab("loss")
drop_model <- keras_model_sequential()
drop_model %>%
layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
layer_dropout(rate = .3) %>%
layer_dense(units = 10, activation = 'sigmoid'
) %>%
layer_dropout(rate = .4) %>%
layer_dense(units = 1, activation = 'sigmoid')
drop_model %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_rmsprop(lr = 0.001),
metrics = c('accuracy')
)
drop_history <- drop_model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 20,
view_metrics = TRUE,
validation_split = 0.2
)
comparison_drop = data.frame(
Baseline_train= Baseline_history$metrics$loss,
Baseline_val = Baseline_history$metrics$val_loss,
drop_train = drop_history$metrics$loss,
drop_val = drop_history$metrics$val_loss
)%>%rownames_to_column()
comparison_drop$rowname = as.integer(comparison_drop$rowname)
comparison_drop = comparison_drop%>%
gather(key = "type", value = "value", -rowname)
ggplot(comparison_drop, aes(x = rowname, y = value, color = type)) +
geom_line() +
xlab("epoch") +
ylab("loss")
dropl2_model <- keras_model_sequential()
dropl2_model %>%
layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
layer_dropout(rate = .3) %>%
layer_dense(units = 10, activation = 'sigmoid',
kernel_regularizer = regularizer_l2(l = 0.001)) %>%
layer_dropout(rate = .4) %>%
layer_dense(units = 1, activation = 'sigmoid')
dropl2_model %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_rmsprop(lr = 0.001),
metrics = c('accuracy')
)
dropl2_history <- dropl2_model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 20,
view_metrics = TRUE,
validation_split = 0.2
)
comparison_dropl2 = data.frame(
Baseline_train= Baseline_history$metrics$loss,
Baseline_val = Baseline_history$metrics$val_loss,
dropl2_train = dropl2_history$metrics$loss,
dropl2_val = dropl2_history$metrics$val_loss
)%>%rownames_to_column()
comparison_dropl2$rowname = as.integer(comparison_dropl2$rowname)
comparison_dropl2 = comparison_dropl2%>%
gather(key = "type", value = "value", -rowname)
ggplot(comparison_dropl2, aes(x = rowname, y = value, color = type)) +
geom_line() +
xlab("epoch") +
ylab("loss")
In this section we provide the full implementation of hyperparameters tuning process using flags in RStudio. We implement both grid search and Bayesian hyperparameter optimization techniques. The important functions to call for this purpose are discussed in Supplementary functions subsection.
Note 2: It is imperative to call the appropriate type of flag when defining the hyperparameters in the model. If the parameter is numeric, we need to call flag_numeric(); if it is a string parameter, we must call flag_string(); and so on. We refer to flag for further information on these functions and their arguments.
The following chunk of code demonstrates the full implementation of the flag labeling, as well as how the flags are called in the body of the model, before a tuning technique is applied. For later recall, we save the following R script as "Credit_hyperparameter.R".
rm(list = ls())
library(keras)
library(tfruns) # provides flags()/flag_numeric() and friends
# Make the script self-contained when launched by tuning_run(): load the
# prepared data from the Data Preparation section (path assumed local)
source("Input_data.R")
FLAGS = flags(
flag_numeric("dropout1", 0.4),
flag_numeric("dropout2", 0.3),
flag_string("activation1", "relu"),
flag_string("activation2", "relu"),
)
model = keras_model_sequential()
model %>%
layer_dense(units = 20, activation = FLAGS$activation1, input_shape = c(15)) %>%
layer_dropout(rate = FLAGS$dropout1) %>%
layer_dense(units = 10, activation = FLAGS$activation2,
kernel_regularizer = regularizer_l2(l = 0.001)) %>%
layer_dropout(rate = FLAGS$dropout2) %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
history = model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 20,
view_metrics = TRUE,
validation_split = 0.2
)
score = model %>% evaluate(
x_test, y_test,
verbose = 0
)
cat('Test loss:', score$loss, '\n')
cat('Test accuracy:', score$acc, '\n')
To perform the tuning process under the grid approach in KerasRStudio, we call the tuning_run() function from the tfruns library as follows:
runs <- tuning_run("Credit_hyperparameter.R", sample = 0.3, flags = list(
dropout1 = c(0.2, 0.4, 0.5),
dropout2 = c(0.1, 0.3, 0.5),
activation1 = c("relu", "softmax", "sigmoid"),
activation2 = c("relu", "softmax", "sigmoid")
))
As can be seen from the previous chunk of code, the tuning_run() function takes several arguments, among which are the file name, sample, and flags. The file name provides the path to the training script (in our example, "Credit_hyperparameter.R"). The sample argument specifies the sampling rate for flag combinations: combining several flags with multiple values each can make the tuning process computationally expensive, in which case sample makes the tuning run on a random fraction of the combinations instead of all of them. The flags argument specifies the list of all parameter names with their candidate flag values.
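Once the runs complete, the returned data frame can be sorted to locate the best flag combination. A sketch (the column name metric_val_acc follows tfruns' naming convention for the validation accuracy):
library(tfruns)
runs_sorted = runs[order(runs$metric_val_acc, decreasing = TRUE), ]
head(runs_sorted)
view_run(runs_sorted$run_dir[1]) # inspect the best run in detail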
Another approach to the hyperparameter tuning process is the Bayesian approach. This approach can be employed in RStudio using CloudML and the R interface to Google CloudML, the cloudml package. To install the cloudml package as well as the Google Cloud SDK, we refer to CloudML.
Once the required packages are installed, we set up the training configuration for later use in cloudml-related functions. The following script demonstrates how to create a training configuration file. We name it "tuning.yml" for later recall.
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MAXIMIZE
    hyperparameterMetricTag: val_acc
    maxTrials: 10
    maxParallelTrials: 2
    params:
      - parameterName: dropout1
        type: DOUBLE
        minValue: 0.2
        maxValue: 0.5
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: dropout2
        type: DOUBLE
        minValue: 0.1
        maxValue: 0.5
        scaleType: UNIT_LINEAR_SCALE
      - parameterName: activation1
        type: CATEGORICAL
        categoricalValues: [relu, softmax, sigmoid]
      - parameterName: activation2
        type: CATEGORICAL
        categoricalValues: [relu, softmax, sigmoid]
Note 3: It is clear from the tuning.yml file that several parameters need to be specified when defining the training configuration file for a tuning job. For instance, hyperparameterMetricTag specifies the metric to optimize (either maximize or minimize) when training the model; for Keras, the available tags are "acc", "loss", "val_acc" and "val_loss". The type parameter can take one of INTEGER, DOUBLE, CATEGORICAL or DISCRETE. For other configurations, we refer to training config.
The next step is to submit a hyperparameter tuning job using CloudML. This task can be done by calling the cloudml_train() function as follows:
cloudml_train("Credit_hyperparameter.R", config = "tuning.yml")
As can be seen from the previous line of code, we need to specify the training configuration when we call the cloudml_train() function for the tuning job.
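After submission, the job can be monitored and its results retrieved from within R; a sketch using the cloudml package's job functions:
library(cloudml)
job_status()  # poll the progress of the most recently submitted job
job_collect() # download the trial results once the job completes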
In this subsection, we provide an overview of how to employ the rBayesianOptimization package in R when tuning hyperparameters using the Bayesian approach.
The full implementation of hyperparameter tuning using the rBayesianOptimization package does not require any flag labeling of the hyperparameters.
To apply the BayesianOptimization() function, we first define a function (with the hyperparameters as inputs) that performs the DTE process, as follows:
rm(list = ls())
# Libraries needed below (assumption: all are installed)
library(keras)
library(caret)                  # confusionMatrix()
library(data.table)             # as.data.table()
library(rBayesianOptimization)  # BayesianOptimization()
setwd("C:\\Users\\Hassan Omidi\\Desktop\\ML and AI\\Implementation_part\\Credit_Keras_RStudio\\Credit_Keras_RStudio")
source("C:\\Users\\Hassan Omidi\\Desktop\\ML and AI\\Implementation_part\\Credit_Keras_RStudio\\Credit_Keras_RStudio\\Input_data.R")
training_credit = function(initParams){
# Define Model --------------------------------------------------------------
model = keras_model_sequential()
model %>%
layer_dense(units = 20, activation = 'sigmoid', input_shape = c(15)) %>%
layer_dropout(rate = initParams$dropout1) %>%
layer_dense(units = 10, activation = 'sigmoid',
kernel_regularizer = regularizer_l2(l = 0.001)) %>%
layer_dropout(rate = initParams$dropout2) %>%
layer_dense(units = 1, activation = 'sigmoid')
summary(model)
model %>% compile(
loss = 'binary_crossentropy',
#optimizer = optimizer_rmsprop(lr = 0.001),
optimizer = optimizer_adam(lr = 0.001, beta_1 = 0.9, beta_2 = 0.999),
metrics = c('accuracy')
)
# Training & Evaluation ----------------------------------------------------
history = model %>% fit(
x_train, y_train,
batch_size = 128,
epochs = 20,
view_metrics = TRUE,
verbose = 0,
#callbacks = callback_tensorboard("logs/run_a"),
validation_split = 0.2
)
score = model %>% evaluate(
x_test, y_test,
verbose = 0
)
A = model %>% predict_classes(x_test)
A = as.factor(A)
y_test2 = y_test
y_test2 = as.factor(y_test2)
con = confusionMatrix(A, y_test2, positive="1")
#return(history$metrics$val_acc[20])
return(con$table[2,2])
}
In the next step, we define another function that maximizes/minimizes our desired metric. This function is then fed into the BayesianOptimization() function to perform the tuning process.
initParams = list ( dropout1 = 0.1, dropout2 = 0.1)
maximizeACC = function(dropout1, dropout2) {
replaceParams = list ( dropout1 = dropout1, dropout2 = dropout2)
updatedParams = modifyList(initParams, replaceParams)
score = training_credit(updatedParams)
results = list (Score = score, Pred = 0)
return(results)
}
boundsParams = list (dropout1 = c(0.1, 0.7), dropout2 = c(0.1, 0.7))
Final_calibrated = BayesianOptimization(maximizeACC, bounds = boundsParams,
init_grid_dt = as.data.table(boundsParams),
init_points = 10, n_iter = 30, acq = "ucb",
kappa = 2.576, eps = 0, verbose = FALSE)
tail(Final_calibrated$History)
Once an ML model is fitted to the training data, one of the important questions to pose is:
Question: How do the features influence the predictions of the underlying fitted machine learning model?
Answer: There are several ways to assess the impact of a feature on the fitted model, among which are:
- partial dependence plots (PDP);
- accumulated local effects (ALE) plots;
- permutation feature importance (PFI).
Note 4: Each of the aforementioned tests/techniques comes with advantages and disadvantages. In the following subsections, we describe these techniques, explain their advantages/disadvantages, and provide their R implementation.
PDP essentially shows the effect of a given feature, at different values of that feature, on the model's predictions, averaging over all other features. The plot is quite intuitive: the curve at a certain value of the underlying feature represents the average prediction of the model under the assumption that all data points take on that value of the feature. Furthermore, the plot is easy to interpret, and when the features are uncorrelated it simply shows how, on average, the prediction changes when the given feature is changed.
The partial dependence function is defined as follows:
\[ \tilde{f} (x) = \int_{y \in Y} \tilde{f}(x,\tilde{y}) d\mathbb{P}(\tilde{y}),\] where \(x\) is the fixed value for the underlying feature, \(\tilde{y}\) is the vector of all features but the fixed one and \(\tilde{f}\) is the machine learning model being studied.
For each value of the fixed feature, we calculate \(\tilde{f} (x)\) and then plot them to get PDP for that given feature.
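To make the definition concrete, the following sketch estimates the partial dependence by brute force, using the scaled test matrix and the fitted model from the earlier sections (the feature name and grid are illustrative):
# Estimate the partial dependence of the model on one feature over a grid of values
pd_estimate = function(model, X, feature, grid) {
  sapply(grid, function(v) {
    Xv = X
    Xv[, feature] = v          # fix the feature at v for every instance
    mean(predict(model, Xv))   # average the predictions over all other features
  })
}
grid = seq(0, 1, by = 0.1)     # features were min-max scaled to [0, 1]
pd_age = pd_estimate(model, x_test, "age", grid)
plot(grid, pd_age, type = "l", xlab = "age (scaled)", ylab = "average prediction")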
Caveat 1: The partial dependence function/plot is based on the assumption that the fixed feature is uncorrelated with the other features in the model. This is a strong assumption and, in many practical cases, can be strongly violated. In the case of correlated features, PDP can be misleading, and inference derived from it may result in mis-selection of features. The following plot demonstrates the case where two features, \(X_1\) and \(X_2\), are correlated. In fact, PDP falsely assumes that the conditional distribution of \(X_2\) at \(X_1 = 0.75\) is the same as the marginal distribution of \(X_2\). This tends to assign weight to unlikely combinations of \(X_1\) and \(X_2\).
Implementation of PDP in R (pdp package)
Partial dependence plots can be drawn in R by calling the pdp package. The three main functions to call in this package are partial(), autoplot(), and grid.arrange(). In simple terms, we first select the feature whose effect on the model's output we want to study, by calling the partial() function. Next, we pass the object returned by partial() to the autoplot() function. Finally, we pass the output of autoplot() to grid.arrange() to draw the PDP for the selected feature.
We recommend visiting this page if interested in a practical implementation of the pdp package. In the following, we provide a block of code implementing PDP plots for the "grade" and "age" features of the credit default model.
Model_grade = partial(fitted_model, pred.var = c("grade"), chull = TRUE)
Model_grade_plot = autoplot(Model_grade, contour = TRUE)
Model_age = partial(fitted_model, pred.var = c("age"), chull = TRUE)
Model_age_plot = autoplot(Model_age, contour = TRUE)
grid.arrange(Model_grade_plot, Model_age_plot)
One of the main problems with PDP, highlighted in the previous subsection, is its underlying assumption about the distribution of the features. The assumption that the features are uncorrelated poses a serious danger to inference about feature importance when the features are, in fact, correlated.
To circumvent this problem, the accumulated local effects (ALE) plot was introduced as an alternative to PDP. The ALE plot uses the conditional distribution of the features and averages differences in predictions rather than the predictions themselves. In a nutshell, ALE plots show how the model predictions change within a small window around a fixed value of the underlying feature, for all data instances in that window.
We recommend visiting ALE plot if interested in further information about ALE plots and the theory behind them, accompanied by different examples.
ALE plots can be created using the ALEPlot package in R. The main function in this package is ALEPlot(). Frequently used arguments of this function are:
J, a numeric scalar or two-length vector of indices of the features for which the ALE plot will be calculated.
In the following, we provide the code for an ALE plot in the credit risk case, using the "grade" feature.
DAT = data.frame(loan_status = loanData$loan_status,
loan_amnt = loanData$loan_amnt, int_rate = loanData$int_rate,
grade = loanData$grade, emp_length = loanData$emp_length,
home_ownership = loanData$home_ownership,
annual_inc = loanData$annual_inc, age = loanData$age)
Neural.DAT = model # the fitted Keras model from the DTE section
prediction.DAT = function(X.model, newdata) as.numeric(predict_classes(X.model, newdata))
par(mfrow = c(2,4))
ALE.grade = ALEPlot(DAT[,2:8], Neural.DAT, pred.fun = prediction.DAT, J = 3, K = 50, NA.plot = TRUE)
This subsection is devoted to the concept of feature importance in machine learning models. The concept tries to shed light on a black-box model, aiming to understand the importance (contribution) of the features underlying the fitted model.
Permutation feature importance (PFI), or model reliance, in simple terms measures a feature's importance by calculating the increase in the fitted model's prediction error after permuting the feature's values. A feature is considered important if the model error increases when the feature's values are permuted; otherwise, the feature is considered unimportant.
Algorithm underlying PFI
The following algorithm is based on Breiman (2001) and Fisher, Rudin, and Dominici (2018).
Let’s assume that the trained model, \(\tilde{f}\) is applied on the feature matrix \(X\) with the target vector \(Y\) and the error measure \(L(Y,\tilde{Y})\).
To calculate PFI for each feature:
1. Estimate the original model error \(e_{orig} = L(Y, \tilde{f}(X))\).
2. Permute the values of the feature in \(X\), breaking its association with \(Y\), and compute the error on the permuted data, \(e_{perm} = L(Y, \tilde{f}(X_{perm}))\).
3. Report the feature's importance as the ratio \(e_{perm}/e_{orig}\) (or the difference \(e_{perm} - e_{orig}\)).
4. Sort the features by descending importance.
The following block of code provides the implementation of PFI in R.
PFI = function(X, Y, f){
  # Error of the fitted model on the intact data (evaluate() returns a list
  # with $loss and $acc; we use the loss as the error measure L)
  score_orig = (f %>% evaluate(X, Y, verbose = 0))$loss
  score = numeric(ncol(X))
  for (i in 1:ncol(X)){
    Xi = X
    # Permute feature i in place so the column order seen by the model is kept
    Xi[, i] = sample(X[, i], replace = FALSE)
    score[i] = (f %>% evaluate(Xi, Y, verbose = 0))$loss
  }
  # Importance of each feature: ratio of permuted error to original error
  FI = score / score_orig
  PFI_final = data.frame(Features = colnames(X), Scores = FI)
  return(PFI_final[order(PFI_final$Scores, decreasing = TRUE), ])
}
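A sketch of applying the function to the held-out data and the fitted model from the earlier sections:
pfi_results = PFI(x_test, y_test, model)
head(pfi_results)
# Taller bars mean larger loss inflation after permutation, i.e. more important features
barplot(pfi_results$Scores, names.arg = pfi_results$Features, las = 2,
        ylab = "permutation importance (loss ratio)")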
Should We Compute Importance on Training or Test Data?
There is no definite answer!
Personally I prefer TEST data.
There are various ways to engineer an optimal set of features that represent well the relationship between the response variable and the inputs.
In very general yet accurate terms, we call a feature good for a model if it is relevant to the model output but not redundant with (highly dependent on) any of the other relevant features.
The feature selection methods are broadly categorized into three types: filter, wrapper, and embedded methods.
In this note, we only cover filter-based methods and leave the other two approaches to the reader for further investigation.
Filter methods basically use ranking techniques as the principal criterion for feature selection: candidate variables/features are assigned a score by an appropriate metric/ranking criterion, and the variables scoring below a predetermined threshold are removed.
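As a minimal illustration of a filter method (a sketch assuming the Data_to_numeric object from the data-preparation script and a hypothetical threshold), features can be scored by their absolute correlation with the response:
# Score each feature by |Pearson correlation| with the response, then threshold
scores = sapply(Data_to_numeric[, -1],
                function(x) abs(cor(x, Data_to_numeric$loan_status)))
sort(scores, decreasing = TRUE)
keep = names(scores)[scores > 0.1] # hypothetical threshold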
Some of the well-known techniques/tests which fall under the filter methods’ umbrella are listed in the following table.
One of the pitfalls of the tests listed above is that they do not consider the redundancy between features when selecting the optimal set; in other words, they only consider the relevance of the selected feature to the response variable. To overcome this limitation, a correlation-based method (based on information entropy) called Fast Correlation-Based Filter (FCBF) was introduced in the paper by Yu and Liu.
In the remainder of this section, we provide a fuller explanation of the FCBF algorithm and its underlying theoretical framework.
Correlation-Based Measure (FCBF algorithm)
This approach was developed to overcome the shortcomings of the previously highlighted tests and techniques. FCBF is an association/correlation measure based on the concept of entropy. To present FCBF, we first need a quick overview of some prerequisite notation and concepts. Entropy, in simple terms, measures the average rate of uncertainty of a given random variable. In mathematical terms:
\[ H(X) = -\sum_{i} \mathbb{P}(x_i) \log_2 \mathbb{P}(x_i),\] where \(X\) is the underlying variable, which takes different values \(x_i\). The definition above can be extended to a continuous random variable as well.
Applying conditional and joint distributions between two variables, \(X\) and \(Y\), we can define the entropy of \(X\) after observing values of \(Y\) as follows:
\[H(X|Y) = -\sum_{j} \mathbb{P}(y_j) \sum_i \mathbb{P}(x_i|y_j) \log_2 \mathbb{P}(x_i|y_j).\] Following the conditional entropy, we define another measure, called information gain, as follows:
\[IG(X|Y ) = H(X) - H(X|Y).\]
Based on the equation above, the higher the information gain (IG), the more correlated the two variables are. The last concept/measure needed before defining FCBF is symmetrical uncertainty (SU), defined as follows:
\[ SU(X,Y) = 2 \left[\frac{IG(X|Y)}{H(X) + H(Y)}\right].\] SU takes values between 0 and 1, with 1 indicating that knowledge of the value of either variable completely predicts the value of the other, and 0 indicating that \(X\) and \(Y\) are independent. The underlying metric used in the definition of the FCBF method is SU.
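For intuition, a small sketch computing entropy, conditional entropy, and SU for two discrete vectors (illustrative values only):
entropy = function(x) {
  p = table(x) / length(x)
  -sum(p * log2(p))
}
cond_entropy = function(x, y) {
  py = table(y) / length(y)
  sum(sapply(names(py), function(v) py[[v]] * entropy(x[y == v])))
}
SU = function(x, y) {
  ig = entropy(x) - cond_entropy(x, y) # information gain IG(X|Y)
  2 * ig / (entropy(x) + entropy(y))
}
x = c(0, 0, 1, 1, 0, 1); y = c(0, 0, 1, 1, 1, 1)
SU(x, y) # close to 1 when one variable predicts the other, 0 if independent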
In the remainder of this part, we provide a high-level description of the method and refer the reader to the article FCBF for a thorough understanding of the algorithm.
Define the concept of Predominant Correlation as follows:
The correlation between a feature \(F_i\) and the class \(C\) is predominant if and only if \(SU_{i,c}\geq \delta,\) and \(\forall F_j \in S' (j\neq i),\) there exists no \(F_j\) such that \(SU_{j,i} \geq SU_{i,c}\), where \(S'\) is the set of features whose correlation with the class C is at least \(\delta\).
If such an \(F_j\) exists for a feature \(F_i\), we call it a redundant peer of \(F_i\), and we let \(S_{P_{i}}\) denote the set of all redundant peers of \(F_i\). If \(S_{P_{i}} \neq \emptyset\) for a given \(F_i \in S'\), we divide it into two parts as follows:
\[S^{+}_{P_{i}} = \{ F_j | F_j \in S_{P_i}, SU_{j,c} > SU_{i,c}\},\] \[S^{-}_{P_{i}} = \{ F_j | F_j \in S_{P_i}, SU_{j,c} \leq SU_{i,c}\}.\]
A feature is predominant to the class, if and only if its correlation to the class is predominant or can become predominant after removing its redundant peers.
The following figure demonstrates the FCBF algorithm based on the SU metric and the concept of predominance of features.
The FCBF algorithm is implemented in the Biocomb package in R. We can call the select.fast.filter() function in the Biocomb package to select good features in the sense highlighted previously. The general arguments of the select.fast.filter() function are as follows:
matrix: a data set, a matrix of features with the last column holding the class labels. Class labels can be numerical or character values; the maximal number of classes is ten;
disc.method: a method for feature discretization including, minimal description length (MDL), equal frequency, among others;
threshold: a numeric value for the correlation of features with class or within features.
select.fast.filter(matrix,disc.method,threshold)
An example of how to use the function above is provided in the following block of code.
library(Biocomb)
# data_set: a matrix/data frame of features whose last column holds the class labels
disc.method = "MDL"
threshold = 0.5
select.fast.filter(data_set, disc.method, threshold)