library(tidyverse)
library(gridExtra)
Artificial neural networks (ANN), from now on neural networks, are computer systems that perform tasks in a way vaguely analogous to the human brain. The resemblance lies in that, like the human brain, neural networks are networks in which nodes transmit information to and from the nodes they are connected to. Here we will discuss how to apply neural networks to classification and numerical prediction tasks.
This figure presents the structure of a neural network that aims to predict an outcome \(y\) from features \(x_1, \dots, x_5\). Between outcome and features lie three hidden layers of nodes (sometimes called neurons). Features are included in the input layer, and outcomes in the output layer. In this neural network, the hidden layers are structured as dense layers: all nodes of a dense layer are connected to the nodes of the previous layer, where previous means closer to the input layer. Neural networks like the one presented in the figure are structured as a sequence of dense layers, so they are called sequential networks. The architecture of sequential networks is defined by:
The approach of neural networks to model fitting is to define a feature for each node of the hidden layers, and to obtain predictions from the output layer. The resulting set of features will hopefully predict the outcome more precisely than the original features. These new features are obtained as nonlinear interactions of the nodes to which each node is connected. In a shallow network there is a single hidden layer, so hidden nodes are obtained from interactions of the features in the input layer. In a deep network there are several layers of hidden nodes, so the values of deeper layers are obtained from the nodes of previous hidden layers, depending on the network architecture.
We are doing deep learning when we apply neural networks with many layers of hidden nodes to a classification or numerical prediction task. Deep learning may lead to very accurate predictions if we have a very large dataset. An adequate number of nodes in each layer will be proportional to the number of features.
The availability of large datasets has made it possible to use deep learning for tasks like image or speech recognition. Research in this field has led to specific implementations of neural networks:
The applications of neural networks go beyond prediction tasks:
The values of variables of hidden and output layers are obtained in a two-step process.
First, we obtain a score \(z\) for each variable. This score is obtained from the values of the previous nodes, each multiplied by a weight \(w\), possibly adding an intercept \(b\). Loosely stated, this means that \(z = wx + b\) for nodes of the first hidden layer, and \(z = wh + b\) for nodes of the next layers. The \(z\) for the output layer is obtained from the values of the last hidden layer. The neural network has parameters:
Once the score \(z\) is obtained, we obtain the value of the variable by applying an activation function. Activation functions model the nonlinear relationship between variables, and can be different for each node of the hidden and output layers.
Let’s go back to the figure and see how we can obtain the values of some of the variables, assuming that all intercepts are zero. Let’s start with \(h_{12}\), a node belonging to the first hidden layer. The score \(z_{12}\) is obtained as:
\[\begin{align} z_{12} &= w_{0112}x_{1} + w_{0212}x_{2} + \dots + w_{0512}x_{5} \end{align}\]
and then we obtain the variable value applying its specific activation function \(f_{12}\):
\[\begin{align} h_{12} &= f_{12} \left( z_{12} \right) \end{align}\]
Similarly, the value of \(h_{23}\) is obtained as:
\[\begin{align} z_{23} &= w_{1123}h_{11} + w_{1223}h_{12} + w_{1323}h_{13} \\ h_{23} &= f_{23} \left( z_{23} \right) \end{align}\]
And the outcome is obtained as:
\[\begin{align} z_{10} &= w_{2110}h_{21} + w_{2210}h_{22} + w_{2310}h_{23} \\ y &= f_{10} \left( z_{10} \right) \end{align}\]
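To make the notation concrete, here is a minimal base R sketch of the computation of a single hidden node, with made-up inputs and weights, and the intercepts set to zero as above:
x <- c(0.2, -1.5, 0.7, 3.1, -0.4)   # toy input with five features, as in the figure
w_12 <- c(0.5, -0.3, 0.8, 0.1, -0.6) # made-up weights connecting the input layer to node h_12
z_12 <- sum(w_12 * x)                # score: weighted sum of the inputs (intercept assumed zero)
h_12 <- max(0, z_12)                 # node value: apply an activation function, here relu
h_12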
The figure presents some usual activation functions.
For binary classification tasks, we need a probability of classification, so we will use a sigmoid activation function for the output layer. For numerical prediction, we usually use a linear activation function in the output layer.
As for the hidden layers, the most usual choice is the rectified linear unit (relu) activation function, especially when using many nodes in each layer. relu allows us to account for nonlinear effects like other activation functions such as tanh, but it behaves better when applying gradient descent: tanh has a low value of the derivative for extreme values of the input, and this may lead to slow convergence of model fit.
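For reference, these activation functions are easy to define and compare in base R (tanh() is already built in):
relu <- function(z) pmax(0, z)            # rectified linear unit
sigmoid <- function(z) 1 / (1 + exp(-z))  # maps any score to (0, 1)
z <- seq(-4, 4, by = 0.1)
plot(z, relu(z), type = "l", ylim = c(-1, 4), ylab = "activation")
lines(z, sigmoid(z), lty = 2)
lines(z, tanh(z), lty = 3)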
Training a neural network consists of finding the values of the neural network parameters that minimize a loss function. For binary classification problems, the usual choice is the cross-entropy function:
\[\begin{align} J \left( \mathbf{w}, \mathbf{b} \right) &= - \frac{1}{N} \sum_{i=1}^N \left( y_i \log\left(\hat{y}_i\right) + \left(1-y_i\right) \log\left(1 - \hat{y}_i\right) \right) \end{align}\]
where \(\hat{y}_i\) is the estimated probability that observation \(i\) belongs to the category defined by \(y_i=1\). In this context, \(N\) is the number of observations.
For numerical prediction, the most common loss function is mean squared error, sometimes called L2 loss:
\[\begin{align} J \left( \mathbf{w}, \mathbf{b} \right) &= \frac{1}{N} \sum_{i=1}^N \left( y_i - \hat{y}_i\right)^2 \end{align}\]
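Both loss functions can be written as short R functions, where y are the observed values and y_hat the predictions:
cross_entropy <- function(y, y_hat) {
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))  # y in {0, 1}, y_hat in (0, 1)
}
mse <- function(y, y_hat) {
  mean((y - y_hat)^2)                               # mean squared error (L2 loss)
}
cross_entropy(c(1, 0, 1), c(0.9, 0.2, 0.6))
mse(c(10, 12), c(9.5, 13))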
The standard way of minimizing the loss function is through gradient descent. It is an iterative algorithm in which we update the parameters at each step \(t\) as:
\[\begin{align} w_t &\leftarrow w_{t-1} - \alpha\frac{\partial J}{\partial w} \end{align}\]
The parameter \(\alpha\) is the learning rate of the gradient descent. As the value of the partial derivatives depends on the variable values, each iteration of the gradient descent has two stages: a forward propagation, in which we use the current parameters to obtain the values of all variables and the predicted outcomes, and a backward propagation, in which we obtain the partial derivatives of the loss function with respect to each parameter, going from the output layer back to the input layer.
Once forward and backward propagation are performed, we can update the values of the network parameters for this iteration.
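As an illustration of the update rule, here is a minimal sketch of gradient descent for a single-weight linear model with the MSE loss; in a real network the derivative would come from the backward propagation:
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.3)       # data generated with a true weight of 2
w <- 0                                  # initial value of the weight
alpha <- 0.1                            # learning rate
for (t in 1:50) {
  y_hat <- w * x                        # forward propagation: predictions with the current w
  grad <- -2 * mean((y - y_hat) * x)    # backward propagation: dJ/dw for the MSE loss
  w <- w - alpha * grad                 # parameter update
}
w                                       # close to the true value of 2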
In the straightforward implementation of gradient descent we have just described, we use all observations to obtain the variable values in the forward propagation, and to get the partial derivatives in the backward propagation. This ensures a smooth convergence of the algorithm, although it can make the optimization process slow, especially for large datasets.
A way to speed up the process is to apply mini-batch gradient descent. We split the dataset into small subsets, called mini-batches. Then, we perform each step of forward and backward propagation using only the elements of a different mini-batch each time. Once we have used all mini-batches, we have completed an epoch of the optimization process. In the following epoch, we use the same mini-batches to keep iterating. When using mini-batches, we usually fix the length of the optimization process by specifying a number of epochs.
Experts recommend using mini-batches with train datasets of more than 2000 observations. Usually mini-batch sizes are expressed in powers of two (16, 32, 64, …) to optimize memory usage. When mini-batch size is equal to one, the fitting process is called stochastic gradient descent.
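As an illustration, here is a sketch of how a train set could be split into mini-batches of size 32; keras does this internally through the batch_size argument of fit, as we will see later:
n <- 2000                                           # hypothetical train set size
batch_size <- 32
idx <- sample(n)                                    # shuffle the observation indices
batches <- split(idx, ceiling(seq_along(idx) / batch_size))
length(batches)                                     # mini-batches used in each epoch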
A common problem of effective prediction algorithms is that they tend to overfit the train set, so predictions on validation and test sets can be hindered. A usual way to reduce overfitting is to eliminate the nodes of a layer with a given probability. We call this practice dropout regularization, and this probability the dropout rate. Dropout is not applied when making predictions at test time.
When defining the dropout rate, we can consider setting a higher rate in layers with many nodes, where overfitting is more likely to take place. Dropout is a technique to reduce overfitting, and should not be used by default.
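As a conceptual illustration, here is a base R sketch of (inverted) dropout applied to the values of a small hidden layer during one training step; the values and the rate are made up:
set.seed(1)
h <- c(1.2, 0.4, 0.9, 0.0, 2.1)          # values of a hidden layer with five nodes
rate <- 0.4                              # dropout rate
keep <- rbinom(length(h), 1, 1 - rate)   # 1 = kept, 0 = dropped for this training step
h * keep / (1 - rate)                    # kept values are rescaled (inverted dropout)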
In some cases, if we set too many epochs in the gradient descent algorithm, the neural network can overfit the train set. We can check this by fitting the model with a validation set, different from the train and test sets. If after a number of epochs we see that the model overfits the train set, we can consider an early stopping strategy, so that we stop the training process at a lower number of epochs.
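With keras, which we introduce in the next section, early stopping can also be automated with a callback that monitors the validation loss. A minimal sketch, assuming a compiled model object and train data x_train and y_train:
model %>% fit(
  x_train, y_train,
  epochs = 500,
  validation_split = 0.2,
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 20)  # stop if val_loss does not improve for 20 epochs
  )
)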
keras is an open-source neural network library written in Python. keras is designed as a user-friendly, modular, and extensible platform to enable fast experimentation with deep neural networks. It can be considered a user-friendly interface to lower-level libraries like TensorFlow, the neural network library used by Google for deep learning, and to other neural network frameworks.
The link to the keras website is https://keras.io/, and the link to the R interface to keras is https://keras.rstudio.com/.
keras in R
The best way to install keras in R is through the RStudio repository:
devtools::install_github("rstudio/keras")
The keras R interface uses the TensorFlow backend engine by default. To install both the core keras library and the TensorFlow backend, use:
library(keras)
install_keras()
If you do not have any Python installation, install_keras() will set up a minimal installation to run keras and TensorFlow.
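After the installation, we can check that keras and its backend are reachable from R; is_keras_available() is a helper of the keras R package:
library(keras)
is_keras_available()   # TRUE when keras and its backend can be loaded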
Let’s perform a binary classification job with keras. The job will be predicting whether a red wine is good or bad using the WineQuality dataset available in the BAdatasets package. We start by loading the required libraries:
library(BAdatasets)
library(tidyverse)
library(caret)
library(keras)
As keras is Python-based, you need to feed all variables as numeric. This means that here we will define the good target variable as numeric instead of as a factor. We use createDataPartition from caret to obtain the train and test sets.
red <- WineQuality$red
red <- red %>% mutate(good = ifelse(quality >= 6, 1, 0)) #factor must be encoded as numeric
red <- red %>% select(-quality)
set.seed(2020)
inTrain <- createDataPartition(red$good, p=0.8, list = FALSE)
red_train <- red %>% slice(inTrain)
red_test <- red %>% slice(-inTrain)
A common practice in neural networks is scaling the variables of the input layer. The scale() function does this job and, at the same time, transforms the data frame into a matrix, which is the required input for keras.
For classification tasks, keras requires a dummy variable for each category. We get this using to_categorical, specifying that we have two different categories.
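As a quick illustration of what to_categorical does, applied to a short vector of labels:
to_categorical(c(0, 1, 1, 0), 2)   # one column per category: the first column flags 0, the second flags 1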
The variable transformations for the train and test sets are:
x_train <- red_train %>%
select(-good) %>%
scale() #we get a matrix doing this
y_train <- to_categorical(red_train$good, 2)
x_test <- red_test %>%
select(-good) %>%
scale()
y_test <- to_categorical(red_test$good, 2)
Let’s define the neural network using the keras syntax.
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = ncol(x_train)) %>%
layer_dropout(rate = 0.4) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dropout(rate = 0.3) %>%
layer_dense(units = 2, activation = 'sigmoid')
From this code, we see that:
The network is sequential, so we use keras_model_sequential() to define it.
The input layer is not declared explicitly: the first hidden layer receives the number of features through the input_shape parameter. We have defined a dropout rate of 0.4 for this layer.
Before fitting the model, we compile it specifying the type of loss function, the optimizer and the metric used for evaluating performance. The RMSprop optimizer is an improved version of gradient descent (see https://keras.io/api/optimizers/rmsprop/ for details).
model %>% compile(
loss = 'binary_crossentropy',
optimizer = optimizer_rmsprop(),
metrics = c('accuracy')
)
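Before fitting, we can inspect the architecture and the number of parameters of each layer with summary():
summary(model)   # layers, output shapes and number of parameters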
We use fit to estimate the network parameters. Here we establish that the feature data are x_train and the output data y_train. We are using mini-batches with a batch_size of 256, and we run through all mini-batches 100 times, the number of epochs. We use 20% of the train data as the validation split. If you run this code from an R script, you can set verbose = 1 to follow the evolution of the fitting process.
model %>% fit(
x_train, y_train,
epochs = 100,
batch_size = 256,
validation_split = 0.2,
verbose=0
)
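The fit function returns a training history, which we have not stored here. If we assign it to an object, we can plot the evolution of loss and accuracy across epochs for the train and validation data; note that calling fit again continues training the same model from its current weights. A sketch:
history <- model %>% fit(
  x_train, y_train,
  epochs = 100, batch_size = 256,
  validation_split = 0.2, verbose = 0
)
plot(history)   # loss and accuracy per epoch, for train and validation data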
Once the model is fitted, we use evaluate to obtain the loss and performance metrics for the train and test sets. We can also use the confusionMatrix function of caret to obtain model metrics on the test set.
model %>% evaluate(x_train, y_train)
## loss accuracy
## 0.4152052 0.8210937
model %>% evaluate(x_test, y_test)
## loss accuracy
## 0.4977671 0.7648903
keraspred <- model %>% predict_classes(x_test)
confusionMatrix(as.factor(keraspred), as.factor(red_test$good), positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 110 42
## 1 33 134
##
## Accuracy : 0.7649
## 95% CI : (0.7144, 0.8103)
## No Information Rate : 0.5517
## P-Value [Acc > NIR] : 2.12e-15
##
## Kappa : 0.5275
##
## Mcnemar's Test P-Value : 0.3556
##
## Sensitivity : 0.7614
## Specificity : 0.7692
## Pos Pred Value : 0.8024
## Neg Pred Value : 0.7237
## Prevalence : 0.5517
## Detection Rate : 0.4201
## Detection Prevalence : 0.5235
## Balanced Accuracy : 0.7653
##
## 'Positive' Class : 1
##
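predict_classes() returns, for each observation, the category with the highest predicted score. The same predictions can be obtained from the raw output of predict(), which is a matrix with one column per category; a minimal sketch:
probs <- model %>% predict(x_test)    # one column of predicted scores per category
pred <- max.col(probs) - 1            # category with the highest score (0 or 1)
table(pred, red_test$good)            # rows: predictions, columns: observed values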
Let’s now apply the keras framework to the BostonHousing dataset available from mlbench.
We prepare the data by removing the tax feature, which is highly correlated with other features. We have a categorical feature chas, which we need to transform into a numeric variable. Once that is settled, we can define the train and test sets.
library(mlbench)
data("BostonHousing")
bh2 <- BostonHousing %>% select(-tax)
bh2 <- bh2 %>% mutate(chas=as.numeric(chas)) #transforming factor chas
set.seed(2020)
inTrain <- createDataPartition(bh2$medv, p=0.8, list=FALSE)
bh2_train <- bh2 %>% slice(inTrain)
bh2_test <- bh2 %>% slice(-inTrain)
Let’s scale the data and turn it into matrices:
bh_train <- bh2_train %>%
select(-medv) %>%
scale()
label_train <- bh2_train$medv
bh_test <- bh2_test %>%
select(-medv) %>%
scale()
label_test <- bh2_test$medv
Let’s build bh_model, a neural network model for this problem. We have two hidden layers of 64 nodes each, activated with relu. The output layer has a single node, and uses a linear activation function. We have also compiled the model, choosing the mean absolute error to track model evolution.
bh_model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu",
input_shape = ncol(bh_train)) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 1)
bh_model %>% compile(
loss = "mse",
optimizer = optimizer_rmsprop(),
metrics = list("mean_absolute_error"))
Let’s now fit the model, storing the fit into a fit_bh variable. If you are doing this job from a script, you can set verbose = 1 to examine the fit evolution. As we have a small dataset, we have kept a small validation split of 10% of the train set. The dataset is too small to use mini-batches, so here the epochs variable gives the number of iterations.
fit_bh <- bh_model %>% fit(
bh_train,
label_train,
epochs = 500,
validation_split = 0.1,
verbose = 0
)
Let’s see the results of model fit:
bh_model %>% evaluate(bh_train, label_train)
## loss mean_absolute_error
## 5.962403 1.443220
bh_model %>% evaluate(bh_test, label_test)
## loss mean_absolute_error
## 17.615368 2.978366
It seems that the model is overfitting the train set. Let’s look at the training evolution plot to see what happened:
plot(fit_bh)
We can see that using too many iterations in the optimization process leads to a small improvement of performance on the train set, but to a large deterioration of performance on the validation set. We can reduce overfitting with early stopping, that is, reducing the number of iterations used to train the model. In this context, it can be convenient to set epochs = 100. We also add a small dropout after each hidden layer. Before fitting the model again, we need to remove the previous results with rm(bh_model).
rm(bh_model)
bh_model <- keras_model_sequential() %>%
layer_dense(units = 64, activation = "relu",
input_shape = ncol(bh_train)) %>%
layer_dropout(rate = 0.1) %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dropout(rate = 0.1) %>%
layer_dense(units = 1)
bh_model %>% compile(
loss = "mse",
optimizer = optimizer_rmsprop(),
metrics = list("mean_absolute_error"))
fit_bh <- bh_model %>% fit(
bh_train,
label_train,
epochs = 100,
validation_split = 0.2,
verbose = 1
)
Now we see that we have stopped training before overfitting:
plot(fit_bh)
Results seem better now for the test set:
bh_model %>% evaluate(bh_train, label_train)
## loss mean_absolute_error
## 9.995400 2.177307
bh_model %>% evaluate(bh_test, label_test)
## loss mean_absolute_error
## 13.849889 2.901489
We can also examine model performance with the postResample function of caret.
keras_train <- bh_model %>% predict(bh_train)
keras_test <- bh_model %>% predict(bh_test)
postResample(keras_train[,1], bh2_train$medv)
## RMSE Rsquared MAE
## 3.161550 0.882244 2.177308
postResample(keras_test[,1], bh2_test$medv)
## RMSE Rsquared MAE
## 3.7215440 0.8527924 2.9014884
An alternative implementation of this job can be seen at https://keras.rstudio.com/articles/tutorial_basic_regression.html.