suppressMessages(library(readr))
suppressMessages(library(keras))
suppressMessages(library(DT))

Introduction

This chapter introduces a densely connected deep neural network with two hidden layers and an output layer. It serves as a first example of the concepts that follow directly from the preceding chapter.

A densely connected network is the simplest form of neural network. Each node in the hidden and output layers is connected to every node in the layer that precedes it.

New concepts are introduced as well. The first considers one-hot-encoding of the binary target variable. There is also a new activation function, softmax, which will be demonstrated here, but described in more detail in a following chapter.

One of the most important concepts in machine learning is also shown in this demonstration. It pertains to splitting the data into two sets. The first is a training set. This is the data that will be passed to the neural network. The second is the test set. This data is kept from the network. Once the network has learned from the training data, it can be tested against the unseen data in the test set so as to determine how well the learning performed.

Data

With the working directory (folder) set to the directory in which this markdown file is saved, the read_csv() function is used below to import a .csv file, which resides in the same directory (thereby negating the need to type the full address of the file).

The file consists of \(50000\) observations with \(10\) feature variables and a binary target variable.

The read_csv() function is part of the readr package that was imported above. It has subtle differences from the standard read.csv() function. It creates a tibble, which is different from a standard data.frame, the latter resulting from importing with read.csv(). A tibble displays large datasets better than a data.frame. It also never uses row names and does not store variables as special attributes.
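As a brief illustration (a sketch, not part of the original analysis, and assuming the .csv file used below is present in the working directory), the two functions can be compared directly; the classes returned differ, with read_csv() producing a tibble and read.csv() a plain data.frame.

# A minimal sketch comparing the two import functions on the same file
tbl <- read_csv("SimulatedBinaryClassificationDataset.csv")
df <- read.csv("SimulatedBinaryClassificationDataset.csv")
class(tbl)  # includes "tbl_df", i.e. a tibble
class(df)   # "data.frame"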

The code below reads the .csv file and then displays a random \(1\)% of it in a table using the DT package. This display is intended for output to a web page. It uses the datatable() function. The data.set object is passed as argument, with square brackets denoting a row, column address. The rows are selected at random using the sample() function. The first argument of this function states the total number of rows of the full dataset. The second states explicitly not to replace rows once they have been selected. The last argument gives the size, which is 0.01 of the total number of rows. Note the closing parenthesis, ), followed by a comma. There is only a space after the comma, which is shorthand for indicating all the columns.

data.set <- read_csv("SimulatedBinaryClassificationDataset.csv",
                 col_names = TRUE)
## Parsed with column specification:
## cols(
##   Var1 = col_double(),
##   Var2 = col_double(),
##   Var3 = col_double(),
##   Var4 = col_double(),
##   Var5 = col_double(),
##   Var6 = col_double(),
##   Var7 = col_double(),
##   Var8 = col_double(),
##   Var9 = col_double(),
##   Var10 = col_double(),
##   Target = col_integer()
## )
datatable(data.set[sample(nrow(data.set),
                          replace = FALSE,
                          size = 0.01 * nrow(data.set)), ])

The summary() function provides descriptive statistics for each of the variables.

summary(data.set)
##       Var1                Var2                Var3          
##  Min.   :-4.404214   Min.   :-4.413886   Min.   :-4.374043  
##  1st Qu.:-0.680166   1st Qu.:-0.677792   1st Qu.:-0.671550  
##  Median :-0.000612   Median :-0.005248   Median : 0.004296  
##  Mean   :-0.001853   Mean   :-0.005588   Mean   : 0.005143  
##  3rd Qu.: 0.676390   3rd Qu.: 0.667976   3rd Qu.: 0.675082  
##  Max.   : 3.766234   Max.   : 4.562115   Max.   : 3.826255  
##       Var4               Var5                Var6          
##  Min.   :-4.25136   Min.   :-6.562910   Min.   :-4.015382  
##  1st Qu.:-0.68108   1st Qu.:-0.712045   1st Qu.:-1.061773  
##  Median :-0.01094   Median : 0.082631   Median :-0.063422  
##  Mean   :-0.00506   Mean   : 0.001955   Mean   :-0.004981  
##  3rd Qu.: 0.67401   3rd Qu.: 1.303538   3rd Qu.: 1.001450  
##  Max.   : 3.82499   Max.   : 3.347969   Max.   : 6.161397  
##       Var7                Var8                Var9          
##  Min.   :-4.004546   Min.   :-4.003598   Min.   :-3.601650  
##  1st Qu.:-0.675130   1st Qu.:-0.668504   1st Qu.:-1.023422  
##  Median :-0.001151   Median : 0.012026   Median :-0.244104  
##  Mean   :-0.000119   Mean   : 0.006157   Mean   : 0.002998  
##  3rd Qu.: 0.668163   3rd Qu.: 0.677135   3rd Qu.: 1.002847  
##  Max.   : 3.878217   Max.   : 4.202026   Max.   : 5.387780  
##      Var10              Target      
##  Min.   :-3.84684   Min.   :0.0000  
##  1st Qu.:-0.73220   1st Qu.:0.0000  
##  Median : 0.07816   Median :1.0000  
##  Mean   :-0.00422   Mean   :0.5001  
##  3rd Qu.: 0.82045   3rd Qu.:1.0000  
##  Max.   : 3.58177   Max.   :1.0000

Preparing the data (preprocessing)

The data structure that exists after importing must be prepared before it can be passed to a neural network. Several steps are involved in this preparation and are given below.

Transformation into a matrix

The data structure is transformed into a mathematical matrix using the as.matrix() function before removing the variable (column) names.

# Cast dataframe as a matrix
data.set <- as.matrix(data.set)

# Remove column names
dimnames(data.set) = NULL

Train and test split

The dataset, which now exists as a matrix, must be split into a training and a test set as mentioned in the introduction. There are various ways in R to perform this split. One such method is shown below.

The set.seed() function, with its arbitrary argument 123, ensures that the random numbers generated to split the data will follow the same pattern when the code is re-executed later or by others. This simply ensures reproducibility for the sake of this text.

The code creates an object named indx. The sample() function creates a random sample. The first argument, 2, defines the sample space: the count starts at \(1\) and goes up in steps of \(1\) to this value. The sample will thus be selected from a sample space of only two elements, \(\left\{1,2\right\}\).

The next argument stipulates how many samples are required for the indx object. In this instance, it is set to the number of rows in the dataset, thereby ensuring that there is a number, either \(1\) or \(2\), for each row.

The replace = TRUE argument stipulates that the elements \(\left\{1,2\right\}\) are replaced after each round of randomly selecting a \(1\) or a \(2\), thereby ensuring a random sample of more than just two elements.

The prob = c() argument gives the probability that each respective element in the sample space has of being selected during each round.

# Split for train and test data
set.seed(123)
indx <- sample(2,
               nrow(data.set),
               replace = TRUE,
               prob = c(0.9, 0.1)) # Makes index with values 1 and 2

The probability of a \(1\) being selected is set at \(90\)% and of a \(2\) being selected at only \(10\)%. Note that these values are chosen to sum to \(100\)%. These numbers, \(1\) and \(2\), are assigned to each row and thereby allow for the split along these two values. The split will therefore create a sub-dataset that contains roughly \(90\)% of the original dataset and another that contains the remaining \(10\)%. This is a choice that the designer of the neural network must make.

The first sub-dataset is ultimately going to be the training set that is passed to the network from which it will learn the optimum parameters (so as to minimize the cost function). The second will be the test set against which the learned parameters will be tested. Generally, the larger the original dataset, the smaller the second set can be. There are two forces at play. The training set must be as large as possible to maximize the learning phase. The test set, though, must be big enough to be representative of the data as a whole. This ensures generalization to real-world data for which each network is ultimately designed.

In a tiny dataset containing only \(200\) samples, a \(10\)% test set contains only \(20\) samples, which might not be representative. In the case of the dataset used in this chapter, \(10\)% comprises roughly \(5000\) samples (roughly, as the precise number of \(2\)s is random), still leaving about \(45000\) for training. The approximately \(5000\) samples should be quite enough to be representative for testing, whilst the \(45000\) should be enough to maximize learning.
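As a quick check (a sketch added here, not part of the original analysis), the counts and proportions produced by the split can be inspected with table().

table(indx)              # number of rows assigned a 1 or a 2
prop.table(table(indx))  # proportions, roughly 0.9 and 0.1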

The code below is very compact, but achieves a lot. It creates two objects named x_train and x_test. It is customary to use an x when referring to the matrix of feature variables. The _train and _test suffixes differentiate the two objects according to their ultimate roles.

The square bracket notation references addressing. Each value in a matrix has an address, given by its row number and then its column number, separated by a comma. The code takes the randomly created \(1\) and \(2\) values in the indx object and selects the rows where indx has a value of 1 to go into the x_train object (and those where it has a value of 2 to go into x_test). The column index 1:10 is shorthand for columns \(1\) through \(10\), i.e. only the \(10\) feature variables.

# Select only the feature variables
# Take rows with index = 1
x_train <- data.set[indx == 1, 1:10]
x_test <- data.set[indx == 2, 1:10]
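As a sanity check (a sketch, not part of the original code), the dimensions of the two feature matrices can be confirmed with dim(); the row counts should reflect the approximate 90/10 split and both objects should have 10 columns.

dim(x_train)  # roughly 45000 rows and 10 columns
dim(x_test)   # roughly 5000 rows and 10 columns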

Processing the target variable

The target variable must be split in a similar way. A separate object, y_test_actual, is created to hold the ground-truth (actual) target values of the test set for later use. Note the use of indexing (row, column), indicating that these values belong to the test set and that only the last column, 11 (the target), is included.

y_test_actual <- data.set[indx == 2, 11]

This chapter will use the softmax activation function in the output layer (see below). This requires the target variable to be one-hot-encoded. The concept is quite simple. Since there are only two elements in the sample space of the target variable in this example, \(0\) and \(1\), two variables are created by one-hot-encoding. The names for these variables are natural numbers starting at \(0\). Consider a target variable whose elements are not \(0\) and \(1\), e.g. benign and malignant. The target variable will be a list containing one of these two elements for each subject. One-hot-encoding will then create two variables, named \(0\) and \(1\). The designer of the network might choose to encode benign as \(\left\{ 1,0 \right\}\) and malignant as \(\left\{ 0,1 \right\}\). In this case the first variable, \(0\), references benign and the second, \(1\), references malignant. If a particular subject is benign, the first variable contains a \(1\) and the second contains a \(0\).

It should be clear then why this encoding is referred to as one-hot-encoding. A number of dummy variables are created, the number being equal to the number of elements in the sample space of the target variable. For any given subject, a \(1\) will be introduced for the particular dummy variable and \(0\) for the rest.
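The idea can be demonstrated on a toy target vector (a minimal sketch using the to_categorical() function introduced below).

# Each element becomes a row with a single 1 in the column for its class
to_categorical(c(0, 1, 1, 0))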

The target variable of the training and test sets can be one-hot-encoded using the Keras function to_categorical(). Note the use of addressing and the column specified as \(11\), the target variable.

# Using similar indices to correspond to the training and test set
y_train <- to_categorical(data.set[indx == 1, 11])
y_test <- to_categorical(data.set[indx == 2, 11])

The code below shows the first ten actual target values of the test set and then the corresponding one-hot-encoded equivalent. It uses cbind() to bind the data (listed as arguments) as columns. The first column holds the actual first \(10\) samples and columns two and three are the encoded equivalent.

cbind(y_test_actual[1:10],
      y_test[1:10, ])
##       [,1] [,2] [,3]
##  [1,]    1    0    1
##  [2,]    1    0    1
##  [3,]    1    0    1
##  [4,]    1    0    1
##  [5,]    1    0    1
##  [6,]    0    1    0
##  [7,]    0    1    0
##  [8,]    1    0    1
##  [9,]    0    1    0
## [10,]    0    1    0

Creating the model

With the data prepared, the next step involves the design of the actual deep neural network. The code below saves the network as an object named model. Just as the function() keyword denotes that an object is a function rather than a normal object, model is specified to be a keras_model_sequential() object.

Keras has two network creation types. The first, used here, allows for the creation of one layer after the next. There is also a functional API that allows for much finer control over the design of the network.

Once the model has been instantiated (created), the layers can be added. There is more than one way to do this. In this example, the layers are specified by their type, layer_dense(), each containing all their specifications.

Note the use of the pipe symbol, %>%. It passes what is on its left as the first argument to what is on its right (the next line in this case).

The first hidden layer is then a densely connected layer. Names can be specified (optional, with no illegal characters such as spaces). It contains 10 nodes and uses the relu activation function. In this first hidden layer, the shape of the input vector must be specified. This represents the number of feature variables. Since the forward propagation step involves the inner product of tensors, the dimensions specified must be correct. If not, the tensor multiplication cannot occur.

The current layer is passed to the next hidden layer, again via the pipe symbol. This second hidden layer also contains 10 nodes and uses the relu activation function. The input size of this layer need not be specified (for the sake of the dimensionality required for the tensor multiplications), as it is inferred from the preceding layer.

The last layer is the output layer. It contains two nodes since the target was one-hot-encoded. It specifies the softmax activation function. This function provides a probability to each of the output nodes (\(0\) and \(1\)), such that the probabilities (of the two in this case) sum to one. Activation functions will be covered in more depth in a later chapter.
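As a minimal sketch (not part of the model code), the softmax calculation can be written out for a toy vector of two raw node outputs; the resulting values are positive and sum to one.

# Softmax turns raw node outputs into probabilities that sum to one
softmax <- function(z) exp(z) / sum(exp(z))
softmax(c(2.0, -1.0))       # approximately 0.95 and 0.05
sum(softmax(c(2.0, -1.0)))  # 1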

# Creating the model
model <- keras_model_sequential()

model %>% 
  layer_dense(name = "DeepLayer1",
              units = 10,
              activation = "relu",
              input_shape = c(10)) %>% 
  layer_dense(name = "DeepLayer2",
              units = 10,
              activation = "relu") %>% 
  layer_dense(name = "OutputLayer",
              units = 2,
              activation = "softmax")

summary(model)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## DeepLayer1 (Dense)               (None, 10)                    110         
## ___________________________________________________________________________
## DeepLayer2 (Dense)               (None, 10)                    110         
## ___________________________________________________________________________
## OutputLayer (Dense)              (None, 2)                     22          
## ===========================================================================
## Total params: 242
## Trainable params: 242
## Non-trainable params: 0
## ___________________________________________________________________________

The summary() function provides a summary of the model. There are three columns in the summary, the first giving the layer name (as optionally specified when the network was created) and its type. All of the layers are densely connected layers in this example. The Output Shape column specifies the output shape (after tensor multiplication, bias addition, and activation, i.e. forward propagation). The Param # column indicates the number of parameters (weights and biases) that the specific layer must learn. For the first hidden layer, since there were \(10\) input nodes (feature variables) connected to \(10\) nodes in that layer, this results in \(10 \times 10 = 100\) weights plus the column vector of bias values, of which there are also \(10\), giving \(110\) parameters. The next two layers follow a similar explanation.
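These counts can be verified with a line of arithmetic (a sketch, not part of the model code).

# Weights (inputs x nodes) plus one bias per node in each layer
c(DeepLayer1 = 10 * 10 + 10,   # 110
  DeepLayer2 = 10 * 10 + 10,   # 110
  OutputLayer = 10 * 2 + 2)    # 22, for a total of 242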

The model for this chapter is depicted below, showing all 242 parameter (weight and bias) values that are to be optimized (minimizing the cost function) through backpropagation and gradient descent.

Compiling the model

Before fitting the training data (passing the training data to the model), the model requires compilation. The loss function, optimizer, and metrics are specified during this step. In this example, categorical cross-entropy is used as the loss function (since the one-hot-encoded target turns this into a multi-class classification problem). A standard ADAM optimizer is used for gradient descent and accuracy is used as the metric.

This loss function is different from the mean-squared-error used in preceding chapters. Gradient descent optimizers will be discussed in a following chapter.
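As a small illustration (not part of the model code), the categorical cross-entropy for a single sample can be computed by hand, assuming a one-hot target of c(0, 1) and predicted probabilities of c(0.2, 0.8).

# Cross-entropy penalizes low probability assigned to the true class
y_true <- c(0, 1)
y_hat <- c(0.2, 0.8)
-sum(y_true * log(y_hat))  # about 0.22; smaller when the prediction is more confident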

# Compiling the model
model %>% compile(loss = "categorical_crossentropy",
                  optimizer = "adam",
                  metrics = c("accuracy"))

Fitting the data

The training set can now be fitted (passed) to the compiled model. In addition, a validation set is created during the training and is set to comprise a fraction of \(0.1\) of the training data. This represents another split in the data similar to the initial train and test split. It allows for determining the accuracy of the model as it trains. Discrepancies between the loss and accuracy of the training and the validation gives clues as to how to change the hyperparameters during the re-design phase and will be discussed in a following chapter.

The fitted model is saved in a computer variable named history. Ten epochs are run, with a mini-batch size of \(256\), yet more concepts for a following chapter.

When using Keras in RStudio, two live plots are created in the Viewer tab. The top shows the loss values for the training and validation sets. The bottom plot shows the accuracy of the two sets.

history <- model %>% 
  fit(x_train,
      y_train,
      epochs = 10,
      batch_size = 256,
      validation_split = 0.1,
      verbose = 2)

A simple plot can be created to show the loss and the accuracy over the epochs.

plot(history)

Model evaluation

The test feature and target sets can be used to evaluate the model. The results show the overall loss and accuracy by using the evaluate() function. It takes two arguments referencing the feature and target test sets.

model %>% 
  evaluate(x_test,
           y_test)
## $loss
## [1] 0.1593915
## 
## $acc
## [1] 0.9584102

A confusion matrix can be constructed. A computer variable is created to store the predicted classes given the test set, x_test. It passes this dataset through the model and uses the learned parameters to predict an output expressed as probability for each of the two output nodes (and ultimately a choice between a predicted \(0\) or \(1\), depending on which has the highest probability).

In the code below, a table is created using the initially saved ground-truth values, y_test_actual. The result is a confusion matrix showing how many times \(0\) and \(1\) were correctly and incorrectly predicted.

pred <- model %>% 
  predict_classes(x_test)

table(Predicted = pred,
      Actual = y_test_actual)
##          Actual
## Predicted    0    1
##         0 2416  161
##         1   42 2262
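As a short follow-up (a sketch, not part of the original code), the overall accuracy can be recomputed from this confusion matrix; the value should match the accuracy reported by evaluate() above.

cm <- table(Predicted = pred, Actual = y_test_actual)
sum(diag(cm)) / sum(cm)  # correct predictions divided by all predictions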

The predict_proba() function creates probabilities for each of the two classes for each of the test cases. The class with the highest probability is chosen as the predicted target class.

prob <- model %>% 
  predict_proba(x_test)

The code chunk below prints the first \(5\) probabilities. Since there are only two classes in the sample space of the target, only the probability of predicting a 0 (the first column of prob) is used. For the sake of simplicity, this value is subtracted from \(1\) so as to give the probability of predicting a 1. A 0 is predicted when this probability is less than \(0.5\) and a 1 is predicted when it is greater than or equal to \(0.5\).

1 - prob[1:5]
## [1] 0.9901748 0.9995959 0.7805094 0.9879969 0.9988921

Since all these values are greater than or equal to \(0.5\), all of the first five predictions are for a 1.
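The link between these probabilities and the predicted classes can be made explicit with a base R sketch (max.col() is not part of the original code): the column with the highest probability in each row of prob gives the predicted class.

# max.col() returns 1 or 2, so 1 is subtracted to map back to classes 0 and 1
pred_from_prob <- max.col(prob) - 1
all(pred_from_prob == pred)  # should be TRUE, matching predict_classes()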

The predicted values and the ground-truth values can be printed by combining them using cbind(). This function binds data into columns. The first column shows the probability of a \(1\). The second column shows the first \(10\) predictions and the last column shows the actual values (saved as an object earlier in this chapter).

cbind(1 - prob[1:10],
      pred[1:10],
      y_test_actual[1:10])
##             [,1] [,2] [,3]
##  [1,] 0.99017481    1    1
##  [2,] 0.99959592    1    1
##  [3,] 0.78050940    1    1
##  [4,] 0.98799686    1    1
##  [5,] 0.99889208    1    1
##  [6,] 0.13785559    0    0
##  [7,] 0.05658281    0    0
##  [8,] 0.99525713    1    1
##  [9,] 0.04575014    0    0
## [10,] 0.06233585    0    0

Note that subjects 6, 7, 9, and 10 have probabilities for a 1 of less than 0.5 and hence a 0 is predicted. All of the first \(10\) subjects have correct predictions.

Conclusion

This chapter introduced very important concepts in machine learning. The first was the preparation of data. This is a required step before the data can be passed to a network and the network's accuracy tested. It is important, when testing, to use data that the network has never seen.

The Keras package allows for the easy construction of a network, with simple, clear syntax.