suppressMessages(library(readr))
suppressMessages(library(keras))
suppressMessages(library(DT))
This chapter introduces a densely connected deep neural network with two hidden layers and an output layer. It serves as a first example of the concepts that follow directly from the preceding chapter.
A densely connected network is the simplest form of a network. Each node in every hidden layer and in the output layer is connected to every node in the layer that precedes it.
New concepts are introduced as well. The first is one-hot-encoding of the binary target variable. There is also a new activation function, softmax, which is demonstrated here and described in more detail in a following chapter.
One of the most important concepts in machine learning is also shown in this demonstration: splitting the data into two sets. The first is the training set, which is the data that will be passed to the neural network. The second is the test set, which is withheld from the network. Once the network has learned from the training data, it can be tested against the unseen data in the test set to determine how well the learning performed.
With the working directory (folder) set to the directory in which this markdown file is saved, the read_csv()
function is used below to import a .csv
file, which resides in the same directory (therefore negating the need to type the full path to the file).
The file consists of \(50000\) observations with \(10\) feature variables and a binary target variable.
The read_csv()
function is part of the readr
package that was imported above. It has subtle differences from the standard read.csv()
function. It creates a tibble, which differs from the standard data.frame
produced by read.csv()
. A tibble displays large datasets more sensibly than a data.frame, never uses row names, and does not store variables as special attributes.
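As a brief illustration (the tibble package is a dependency of readr and is therefore already installed), the difference in class can be seen directly. This is only a sketch with made-up values, not part of the chapter's dataset.
# A tibble carries extra classes on top of data.frame and prints more compactly
library(tibble)
df  <- data.frame(a = 1:3, b = c("x", "y", "z"))
tbl <- tibble(a = 1:3, b = c("x", "y", "z"))
class(df)    # "data.frame"
class(tbl)   # "tbl_df" "tbl" "data.frame"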
The code below reads the .csv
file and then displays a random \(1\)% of it in a table using the DT
package. This display is intended for output to a web page. It uses the datatable()
function. The data.set
object is passed as argument with square brackets denoting a row,column
address. The rows are selected at random using the sample()
function. The first argument of this function states the total number of rows of the full dataset. The second states explicitly not to replace rows once they have been selected. The last argument shows the size, which is 0.01
of the total number of rows. Note the closing parenthesis, )
followed by a comma. There is only a space after the comma, which is shorthand for indicating all the columns.
data.set <- read_csv("SimulatedBinaryClassificationDataset.csv",
col_names = TRUE)
## Parsed with column specification:
## cols(
## Var1 = col_double(),
## Var2 = col_double(),
## Var3 = col_double(),
## Var4 = col_double(),
## Var5 = col_double(),
## Var6 = col_double(),
## Var7 = col_double(),
## Var8 = col_double(),
## Var9 = col_double(),
## Var10 = col_double(),
## Target = col_integer()
## )
datatable(data.set[sample(nrow(data.set),
replace = FALSE,
size = 0.01 * nrow(data.set)), ])
The summary()
function provides descriptive statistics for each of the variables.
summary(data.set)
## Var1 Var2 Var3
## Min. :-4.404214 Min. :-4.413886 Min. :-4.374043
## 1st Qu.:-0.680166 1st Qu.:-0.677792 1st Qu.:-0.671550
## Median :-0.000612 Median :-0.005248 Median : 0.004296
## Mean :-0.001853 Mean :-0.005588 Mean : 0.005143
## 3rd Qu.: 0.676390 3rd Qu.: 0.667976 3rd Qu.: 0.675082
## Max. : 3.766234 Max. : 4.562115 Max. : 3.826255
## Var4 Var5 Var6
## Min. :-4.25136 Min. :-6.562910 Min. :-4.015382
## 1st Qu.:-0.68108 1st Qu.:-0.712045 1st Qu.:-1.061773
## Median :-0.01094 Median : 0.082631 Median :-0.063422
## Mean :-0.00506 Mean : 0.001955 Mean :-0.004981
## 3rd Qu.: 0.67401 3rd Qu.: 1.303538 3rd Qu.: 1.001450
## Max. : 3.82499 Max. : 3.347969 Max. : 6.161397
## Var7 Var8 Var9
## Min. :-4.004546 Min. :-4.003598 Min. :-3.601650
## 1st Qu.:-0.675130 1st Qu.:-0.668504 1st Qu.:-1.023422
## Median :-0.001151 Median : 0.012026 Median :-0.244104
## Mean :-0.000119 Mean : 0.006157 Mean : 0.002998
## 3rd Qu.: 0.668163 3rd Qu.: 0.677135 3rd Qu.: 1.002847
## Max. : 3.878217 Max. : 4.202026 Max. : 5.387780
## Var10 Target
## Min. :-3.84684 Min. :0.0000
## 1st Qu.:-0.73220 1st Qu.:0.0000
## Median : 0.07816 Median :1.0000
## Mean :-0.00422 Mean :0.5001
## 3rd Qu.: 0.82045 3rd Qu.:1.0000
## Max. : 3.58177 Max. :1.0000
The data structure that exists after importing must be prepared before it can be passed to a neural network. Several steps are involved in this preparation and are given below.
The data structure is transformed into a mathematical matrix using the as.matrix()
function, after which the variable (column) names are removed.
# Cast dataframe as a matrix
data.set <- as.matrix(data.set)
# Remove column names
dimnames(data.set) = NULL
The dataset, which now exists as a matrix, must be split into a training and a test set as mentioned in the introduction. There are various ways in R
to perform this split. One such method is shown below.
The set.seed()
function, with its arbitrary argument 123
, ensures that the random numbers generated to split the data follow a pattern that is repeated when the code is re-executed later or by others. This simply ensures reproducibility for the sake of this text.
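A small sketch of this idea (using an arbitrary draw with runif(), separate from the split below): with the same seed, re-running the same call yields identical values.
# With the same seed, the same random draw is produced every time
set.seed(123)
runif(3)
set.seed(123)
runif(3)   # identical to the values above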
The code creates an object named indx
. The sample()
function creates a random sample. The first argument, 2
, indicates that sampling is from the integers \(1\) up to \(2\). The sample will thus be selected from a sample space of only two elements, \(\left\{1,2\right\}\).
The next argument stipulates how many samples are required for the indx
object. In this instance, it is set to the number of rows in the dataset, thereby ensuring that there is a number, either \(1\) or \(2\), for each row.
The replace = TRUE
argument stipulates that the elements \(\left\{1,2\right\}\) are replaced after each round of randomly selecting a \(1\) or a \(2\), thereby ensuring a random sample of more than just two elements.
The prob = c()
argument gives the probability that each respective element in the sample space has of being selected during each round.
# Split for train and test data
set.seed(123)
indx <- sample(2,
nrow(data.set),
replace = TRUE,
prob = c(0.9, 0.1)) # Makes index with values 1 and 2
The probability of a \(1\) being selected is set at \(90\)% and of a \(2\) at only \(10\)%. Note that these values must sum to \(100\)%. These numbers, \(1\) and \(2\), are assigned to each row and thereby allow for the split along these two values. The split will therefore create a sub-dataset that contains roughly \(90\)% of the original dataset and another that contains the remaining \(10\)%. This is a choice that the designer of the neural network must make.
The first sub-dataset is ultimately going to be the training set that is passed to the network from which it will learn the optimum parameters (so as to minimize the cost function). The second will be the test set against which the learned parameters will be tested. Generally, the larger the original dataset, the smaller the second set can be. There are two forces at play. The training set must be as large as possible to maximize the learning phase. The test set, though, must be big enough to be representative of the data as a whole. This ensures generalization to real-world data for which each network is ultimately designed.
In a tiny dataset containing only \(200\) samples, a \(10\)% test set contains only \(20\) samples, which might not be representative. In the case of the dataset used in this chapter, \(10\)% comprises roughly \(5000\) samples (the precise number of \(2\)s is random), still leaving about \(45000\) for training. The approximately \(5000\) samples should be quite enough to be representative for testing, whilst the \(45000\) should be enough to maximize learning.
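A quick check (not part of the original output) confirms the approximate sizes of the two groups.
# Counts and proportions of rows assigned index 1 (train) and 2 (test)
table(indx)
prop.table(table(indx))   # roughly 0.9 and 0.1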
The code below is very compact, but achieves a lot. It creates two objects named x_train
and x_test
. It is customary to use an x
when referring to the matrix of feature variables. The _train
and _test
suffixes differentiate the two objects according to their ultimate roles.
The square bracket notation references addressing. Each value in a matrix has an address, given by its row number and then its column number, separated by a comma. The code then takes the list of randomly created \(1\) and \(2\) values from the indx
object and selects those rows where the indx
object has a value of 1
to go into the x_train
object. The columns are 1:10
, shorthand for columns \(1\) through \(10\), i.e. only the \(10\) feature variables.
# Select only the feature variables
# Take rows with index = 1
x_train <- data.set[indx == 1, 1:10]
x_test <- data.set[indx == 2, 1:10]
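A simple sanity check (not in the original text) verifies the shapes that the network's input layer will later expect.
# Roughly 45000 x 10 for training and 5000 x 10 for testing
dim(x_train)
dim(x_test)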
The target variable must be split in a similar way. A separate object, y_test_actual
, is created to hold the ground-truth (actual) target values of the test set for later use. Note the use of indexing (row, column), indicating that these belong to the test set and that only the last column, 11
(the target) is included.
y_test_actual <- data.set[indx == 2, 11]
This chapter will use the softmax
activation function in the output layer (see below). This requires the target variable to be one-hot-encoded. The concept is quite simple. Since there are only two elements in the sample space of the target variable in this example, \(0\) and \(1\), two variables are created by one-hot-encoding. The names for these variables are natural numbers starting at \(0\). Consider a target variable whose elements are not \(0\) or \(1\), e.g. benign and malignant. The target variable will be a list of these two elements, one for each subject. One-hot-encoding will then create two variables, named \(0\) and \(1\). The designer of the network might choose to encode benign as \(\left\{ 1,0 \right\}\) and malignant as \(\left\{ 0,1 \right\}\). In this case the first variable, \(0\), references benign and the second, \(1\), references malignant. If a particular subject is benign, the first variable, \(0\), contains a \(1\) and the second variable, \(1\), contains a \(0\).
It should be clear then why this encoding is referred to as one-hot-encoding. A number of dummy variables are created, equal to the number of elements in the sample space of the target variable. For any given subject, a \(1\) is entered for the applicable dummy variable and a \(0\) for the rest.
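To make this concrete, the minimal sketch below (using a short, made-up target vector y and base R only) builds the two dummy variables by hand; the Keras helper introduced next does the same thing.
# A small, made-up illustration of one-hot-encoding a binary target
y <- c(0, 1, 1, 0)
cbind(`0` = as.integer(y == 0),
      `1` = as.integer(y == 1))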
The target variable of the training and test sets can be one-hot-encoded using the Keras
function to_categorical()
. Note the use of addressing and the column specified as \(11\), the target variable.
# Using similar indices to correspond to the training and test set
y_train <- to_categorical(data.set[indx == 1, 11])
y_test <- to_categorical(data.set[indx == 2, 11])
The code below shows the first ten actual target values of the test set and then the corresponding one-hot-encoded equivalents. It uses cbind()
to bind the data (listed as arguments) as columns. The first column holds the actual first \(10\) samples and columns two and three are the encoded equivalents.
cbind(y_test_actual[1:10],
y_test[1:10, ])
## [,1] [,2] [,3]
## [1,] 1 0 1
## [2,] 1 0 1
## [3,] 1 0 1
## [4,] 1 0 1
## [5,] 1 0 1
## [6,] 0 1 0
## [7,] 0 1 0
## [8,] 1 0 1
## [9,] 0 1 0
## [10,] 0 1 0
With the data prepared, the next step involves the design of the actual deep neural network. The code below saves the network as an object named model
. Just as the function()
keyword denotes that an object is a function rather than a normal object, model
is specified to be a keras_model_sequential()
object.
Keras
has two ways of creating a network. The sequential type, used here, allows layers to be added one after the next. There is also a functional API
that allows for much finer control over the design of the network.
Once the model has been instantiated (created), the layers can be added. There is more than one way to do this. In this example, the layers are specified by their type, layer_dense()
, each containing all their specifications.
Note the use of the pipe symbol, %>%
. It passes what is on its left as the first argument to what is on its right (the next line in this case).
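As a quick illustration (using sqrt() purely as an example function, unrelated to the model), the two lines below are equivalent.
# The pipe passes its left-hand side as the first argument of the call on its right
sqrt(16)
16 %>% sqrt()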
The first hidden layer is then a densely connected layer. Names can be specified (optional, with no illegal characters such as spaces). It contains 10
nodes and uses the relu
activation function. In this first hidden layer, the shape of the input vector must be specified. This represents the number of feature variables. Since the forward propagation step involves the inner product of tensors, the dimensions specified must be correct. If not, the tensor multiplication cannot occur.
The current layer is passed to the next hidden layer, again via the pipe symbol. This second hidden layer also contains 10
nodes and uses the relu
activation function. The input size of this layer need not be specified (for the sake of the dimensionality required for the tensor multiplications), as it will be inferred.
The last layer is the output layer. It contains two nodes since the target was one-hot-encoded. It specifies the softmax
activation function. This function assigns a probability to each of the output nodes (\(0\) and \(1\)), such that the probabilities (two in this case) sum to one. Activation functions will be covered in more depth in a later chapter.
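As a small sketch (with two made-up output-node values z), the softmax calculation can be done directly; note that the two results sum to one.
# Softmax applied to two made-up output-node values
z <- c(2.0, 0.5)
exp(z) / sum(exp(z))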
# Creating the model
model <- keras_model_sequential()
model %>%
layer_dense(name = "DeepLayer1",
units = 10,
activation = "relu",
input_shape = c(10)) %>%
layer_dense(name = "DeepLayer2",
units = 10,
activation = "relu") %>%
layer_dense(name = "OutputLayer",
units = 2,
activation = "softmax")
summary(model)
## ___________________________________________________________________________
## Layer (type) Output Shape Param #
## ===========================================================================
## DeepLayer1 (Dense) (None, 10) 110
## ___________________________________________________________________________
## DeepLayer2 (Dense) (None, 10) 110
## ___________________________________________________________________________
## OutputLayer (Dense) (None, 2) 22
## ===========================================================================
## Total params: 242
## Trainable params: 242
## Non-trainable params: 0
## ___________________________________________________________________________
The summary()
function provides a summary of the model. There are three columns in the summary, the first giving the layer name (as optionally specified when the network was created) and its type. All of the layers are densely connected layers in this example. The Output Shape column specifies the output shape (after tensor multiplication, bias addition, and activation, i.e. forward propagation). The Param # column indicates the number of parameters (weights and biases) that the specific layer must learn. For the first hidden layer, since there were \(10\) input nodes (the feature variables) connected to \(10\) nodes in the layer, that results in \(10 \times 10 = 100\) weights plus the column vector of bias values, of which there are also \(10\), resulting in \(110\) parameters. The next two layers follow a similar explanation.
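The counts in the Param # column can be reproduced with simple arithmetic.
# Reproducing the parameter counts reported by summary(model)
10 * 10 + 10   # DeepLayer1: 10 x 10 weights plus 10 biases = 110
10 * 10 + 10   # DeepLayer2: 110
10 * 2 + 2     # OutputLayer: 22
110 + 110 + 22 # Total: 242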
The model for this chapter is depicted below, showing all 242 parameter (weight and bias) values that are to be optimized (by minimizing the cost function) through backpropagation and gradient descent.
Before fitting the training data (passing the training data to the model), the model requires compilation. The loss function, optimizer, and metrics are specified during this step. In this example, categorical cross-entropy is used as the loss function (appropriate here because the target was one-hot-encoded and the output layer uses softmax). A standard ADAM optimizer is used for gradient descent and accuracy is used as the metric.
This loss function is different from the mean-squared-error used in preceding chapters. Gradient descent optimizers will be discussed in a following chapter.
# Compiling the model
model %>% compile(loss = "categorical_crossentropy",
optimizer = "adam",
metrics = c("accuracy"))
The training set can now be fitted (passed) to the compiled model. In addition, a validation set is created during the training and is set to comprise a fraction of \(0.1\) of the training data. This represents another split in the data, similar to the initial train and test split. It allows for determining the accuracy of the model as it trains. Discrepancies between the loss and accuracy of the training and the validation sets give clues as to how to change the hyperparameters during the re-design phase and will be discussed in a following chapter.
The fitted model is saved in a computer variable named history
. Ten epochs are run, with a mini-batch size of \(256\), yet more concepts for a following chapter.
When using Keras
in RStudio, two live plots are created in the Viewer tab. The top shows the loss values for the training and validation sets. The bottom plot shows the accuracy of the two sets.
history <- model %>%
fit(x_train,
y_train,
epochs = 10,
batch_size = 256,
validation_split = 0.1,
verbose = 2)
A simple plot can be created to show the loss and the accuracy over the epochs.
plot(history)
The test feature and target sets can be used to evaluate the model. The results show the overall loss and accuracy by using the evaluate()
function. It takes two arguments referencing the feature and target test sets.
model %>%
evaluate(x_test,
y_test)
## $loss
## [1] 0.1593915
##
## $acc
## [1] 0.9584102
A confusion matrix can be constructed. A computer variable is created to store the predicted classes given the test set, x_test
. It passes this dataset through the model and uses the learned parameters to predict an output, expressed as a probability for each of the two output nodes (and ultimately a choice between a predicted \(0\) or \(1\), depending on which has the higher probability).
In the code below, a table is created using the initially saved ground-truth values, y_test_actual
. The result is a confusion matrix showing how many times \(0\) and \(1\) were correctly and incorrectly predicted.
pred <- model %>%
predict_classes(x_test)
table(Predicted = pred,
Actual = y_test_actual)
## Actual
## Predicted 0 1
## 0 2416 161
## 1 42 2262
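As a quick check (not part of the original output), the overall accuracy reported by evaluate() can be recovered from this confusion matrix.
# Correct predictions (the diagonal) divided by all predictions
(2416 + 2262) / (2416 + 161 + 42 + 2262)   # approximately 0.958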
The predict_proba()
function creates probabilities for each of the two classes for each of the test cases. The class with the higher probability is chosen as the predicted target class.
prob <- model %>%
predict_proba(x_test)
The code chunk below prints the first \(5\) probabilities. Since there are only two classes in the sample space of the target, only the first probability (for predicting a 0
) is shown. For the sake of simplicity, this value is subtracted from \(1\) so as to give an indication of whether a 0
or a 1
is predicted. The former is predicted when the probability is less than \(0.5\) and the latter when the probability is greater than or equal to \(0.5\).
1 - prob[1:5]
## [1] 0.9901748 0.9995959 0.7805094 0.9879969 0.9988921
Since all these values are greater than or equal to \(0.5\), all of the first five predictions are for 1
.
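A manual check (not in the original text) reproduces these predicted classes by thresholding the class-\(1\) probabilities at \(0.5\).
# Threshold the probability of class 1 at 0.5; this should reproduce pred[1:5]
as.integer((1 - prob[1:5, 1]) >= 0.5)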
The predicted values and the ground-truth values can be printed by combining them using cbind()
. This function binds data into columns. The first column shows the probability for a \(1\). The second column shows the first \(10\) predictions and the last column shows the actual values (saved as an object earlier in this chapter).
cbind(1 - prob[1:10],
pred[1:10],
y_test_actual[1:10])
## [,1] [,2] [,3]
## [1,] 0.99017481 1 1
## [2,] 0.99959592 1 1
## [3,] 0.78050940 1 1
## [4,] 0.98799686 1 1
## [5,] 0.99889208 1 1
## [6,] 0.13785559 0 0
## [7,] 0.05658281 0 0
## [8,] 0.99525713 1 1
## [9,] 0.04575014 0 0
## [10,] 0.06233585 0 0
Note that subjects 6,7,9,10
have probabilities for 1
of less than 0.5
and hence a 0
is predicted. All of the first \(10\) subjects have correct predictions.
This chapter introduced very important concepts in machine learning. The first was the preparation of data. This is a required step before the data can be passed to a network and the network's accuracy tested. It is important to test with data that the network has never seen.
The Keras
package allows for the easy construction of a network, with simple, clear syntax.