Shallow Neural Net from Scratch Using R

0.1 Introduction

A lot of deep learning frameworks often abstract away the mechanics behind training a neural network. While that has the advantage of building deep learning models at a very fast speed, it is equally important to slow down and understand how they work. In this 2-part series, we’ll dig deep and build our own neural-net from scratch. This will help us understand, at a basic level, how those big frameworks work. The network we’ll build will contain a single hidden layer and perform binary classification using a vectorized implementation of backpropagation, all written in base-R.

In this 2-part series, we will describe in detail what a single-layer neural network is, how it works, and the equations used to describe it. We will see what kind of data preparation is required to be able to use it with a neural-network. Then, we will implement a neural-net step-by-step from scratch and examine the output at each step. Finally, to see how our neural-net fares, we will describe a few metrics used for classification problems and use them.

We will also compare our neural-net with a logistic regression model and see how the decision boundaries differ. By the end of this series, you will have a deeper understanding of the maths behind neural-networks and the ability to implement it yourself from scratch!

In this first part, we’ll go through the dataset we are going to use, the pre-processing involved, train-test split, and the describe in detail the architecture of the model. Then we’ll build our neural-net chunk-by-chunk. It will involve writing functions for initializing parameters and running forward propagation.

In the second part, we’ll implement backpropagation by writing functions to calculate gradients and update the weights. Finally, we’ll make predictions on the test data and see how accurate our model is using metrics such as Accuracy, Recall, Precision, and F1-score. We’ll compare our neural-net with a logistic regression model andv visualize the difference in the decision boundaries produced by these models.

0.1.1 Set Seed

Before we start, let’s set a seed value to ensure reproducibility of the results.

set.seed(69)

0.1.2 Architecture Definition

To understand the matrix multiplications better and keep the numbers digestible, we will describe a very simple 3-layer neural net ie. a neural-net with a single hidden layer. The \(1^{st}\) layer will take in the inputs and the \(3^{rd}\) layer will spit out an output.

The input layer will have 2 (input) neurons, the hidden layer 4 (hidden) neurons, and the output layer 1 (output) neuron.

Our input layer has 2 neurons because we’ll be passing 2 features (columns of a dataframe) as the input. A single output neuron because we’re performing binary classification. This means two output classes - 0 and 1. Our output will actually be a probability (a number that lies between 0 and 1). We’ll define a threshold for rounding off this probability to 0 or 1. For instance, this threshold can be 0.5.

In a deep neural-net, multiple hidden layers are stacked together (hence the name “deep”). Each hidden layer can contain any number of neurons you want.

In this series, we’re implementing a single-layer neural-net which, as the name suggests, contains a single hidden layer.

n_x: the size of the input layer (set this to 2).
n_h: the size of the hidden layer (set this to 4).
n_y: the size of the output layer (set this to 1).

Figure 1: Single layer NNet Architecture. Credits: deeplearning.ai

Neural networks flow from left to right, ie. input to output. In the above example, we have 2 features (2 columns from the input dataframe) that arrive at the input neurons from the first-row of the input dataframe. These 2 numbers are then multiplied by a set of weights (randomly initialised at first and later optimised).

An activation function is then applied on the result of this multiplication. This new set of numbers becomes the neurons in our hidden layer. These neurons are again multiplied by another set of weights (randomly initialised) with an activation function applied to this result. The final result we obtain is a single number. This is the prediction of our neural-net. It’s a number that lies between 0 and 1.

Once we have a prediction, we then compare it to the true output. To optimize the weights in order to make our predictions more accurate (because right now our input is being multiplied by random weights to give a random prediction), we need to first calculate how far-off is our prediction from the actual value. Once we have this loss, we calculate the gradients with respect to each weight.

The gradients tell us the amount by which we need to increase or decrease each weight parameter in order to minimise the loss. All the weights in the network are updated as we repeat the entire process with the second input sample (second row).

After all the input samples have been used to optimise weights, we say that 1 epoch has passed. We repeat this process for multiple number of epochs till our loss stops decreasing.

At this point you might be wondering what an activation function is. An activation function adds non-linearity to our network and enables it to learn complex features. If you look closely, a neural network consists of a bunch multiplications and additions. It’s linear and we know that a linear classification model will not be able to learn complex features in high dimensions.

Here are a few popular activation functions -

Figure 2: Sigmoid Activation Function. Credits - analyticsindiamag

We will use tanh() and sigmoid() activation functions in our neural net. Because tanh() is already available in base-R, we will implement the sigmoid() function ourselves later on.

0.1.3 Dry Run

For now, let’s see how the numbers flow through the above described neural-net by writing out the equations for a single sample (1 input row).

For one input sample \(x^{(i)}\) where \(i\) is the row-number:

First, we calculate the output \(Z\) from the input \(x\). We will tune the parameters \(W\) and \(b\). Here, the superscript in square brackets tell us the layer number and the one in parenthesis tell use the neuron number. For instance \(z^{[1] (i)}\) is the output from the \(i^{{th}}\) neuron of the \(1^{{st}}\) layer.

\[z^{[1] (i)} = W^{[1]} x^{(i)} + b^{[1] (i)}\tag{1}\]

Then we’ll pass this value through the tanh() activation function to get \(a\).

\[a^{[1] (i)} = \tanh(z^{[1] (i)})\tag{2}\]

After that, we’ll calculate the value for the final output layer using the hidden layer values.

\[z^{[2] (i)} = W^{[2]} a^{[1] (i)} + b^{[2] (i)}\tag{3}\]

Finally, we’ll pass this value through the sigmoid() activation function and obtain our output probability. \[\hat{y}^{(i)} = a^{[2] (i)} = \sigma(z^{ [2] (i)})\tag{4}\]

To obtain our prediction class from output probabilities, we round off the values as follows. \[y^{(i)}_{prediction} = \begin{cases} 1 & \mbox{if } a^{[2](i)} > 0.5 \\ 0 & \mbox{otherwise } \end{cases}\tag{5}\]

Once, we have the prediction probabilities, we’ll compute the loss in order to tune our parameters (\(w\) and \(b\) can be adjusted using gradient-descent).

Given the predictions on all the examples, we will compute the cost \(J\) the cross-entropy loss as follows:
\[J = - \frac{1}{m} \sum\limits_{i = 0}^{m} \large\left(\small y^{(i)}\log\left(\hat{y}^{(i)}\right) + (1-y^{(i)})\log\left(1- \hat{y}^{ (i)}\right) \large \right) \small \tag{6}\]

Once we have our loss, we need to calculate the gradients. I’ve calculated them for you so you don’t differentiate anything. We’ll directly use these values -

\(dZ^{[2]} = A^{[2]} - Y\)
\(dW^{[2]} = \frac{1}{m} dZ^{[2]}A^{[1]^T}\)
\(db^{[2]} = \frac{1}{m}\sum dZ^{[2]}\)
\(dZ^{[1]} = W^{[2]^T} * g^{[1]'} Z^{[1]}\) where \(g\) is the activation function.
\(dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}\)
\(db^{[1]} = \frac{1}{m}\sum dZ^{[1]}\)

Now that we have the gradients, we will update the weights. We’ll multiply these gradients with a number known as the learning rate. The learning rate is represented by \(\alpha\).

\(W^{[2]} = W^{[2]} - \alpha * dW^{[2]}\)
\(b^{[2]} = b^{[2]} - \alpha * db^{[2]}\)
\(W^{[1]} = W^{[1]} - \alpha * dW^{[1]}\)
\(b^{[1]} = b^{[1]} - \alpha * db^{[1]}\)

This process is repeated multiple times until our model converges ie. we have a learnt a good set of weights that fit our data well.

0.2 Load and Visualize the Data

Since, the goal of the series is to understand how neural-networks work behind the scene, we’ll use a small dataset so that our focus is on building our neural-net.

We’ll use a planar dataset that looks like a flower. The output classes cannot be seperated accurately using a straight line.

0.2.1 Construct Dataset

df <- read.csv("../../data/planar_flower.csv")

Let’s shuffle our dataset so that our model is invariant to the order of samples. This is good for generalisation and will help increase performance on unseen (test) data.

df <- df[sample(nrow(df)), ]

df

0.2.2 Visualize Data

We have 400 samples where 200 belong each class.

Here’s a scatter plot between our input variables. As you can see, the output classes are not easily separable.

Here’s a bar-plot that tells us class-distribution. We have 400 rows where the classes are equally distributed (200 rows in each class).

0.2.3 Train-Test Split

Now that we have our dataset prepared, let’s go ahead and split it into train and test sets.

We’ll put 80% of our data into our train set and the remaining 20% into our test set.

To keep the focus on the neural-net, we’ll not be using a validation set here.

train_test_split_index <- 0.8 * nrow(df)

0.2.3.1 Train Dataset

Because we’ve already shuffled the dataset above, we can go ahead and extract the first 80% rows into train set.

train <- df[1:train_test_split_index,]
train

Let’s visualize the number of samples per class in our train set to ensure that there isn’t a major class imbalance.

0.2.3.2 Test Dataset

Having extracted the train-set, we can now safely select last 20% rows of the shuffled dataset to be our test set.

test <- df[(train_test_split_index+1): nrow(df),]
test

Here’s the class distribution of our test set.

0.3 Preprocess

Neural networks work best when the input values are standardized. So, we’ll scale all the values to to have their mean=0 and standard-deviation=1.

Standardizing input values speeds up the training and ensures faster convergence.

To standardize the input values, we’ll use the scale() function in R. Note that we’re standardizing the input values (X) only and not the output values (y).

X_train <- scale(train[, c(1:2)])

y_train <- train$y
dim(y_train) <- c(length(y_train), 1) # add extra dimension to vector

X_test <- scale(test[, c(1:2)])

y_test <- test$y
dim(y_test) <- c(length(y_test), 1) # add extra dimension to vector

The output below tells us the shape and size of our input data.

## Shape of X_train (row, column): 
##  320 2 
## Shape of y_train (row, column) : 
##  320 1 
## Number of training samples: 
##  320 
## Shape of X_test (row, column): 
##  80 2 
## Shape of y_test (row, column) : 
##  80 1 
## Number of testing samples: 
##  80

Because neural-nets are made up of a bunch matrix multiplications, let’s convert our input and output to matrices from dataframes. While dataframes are a good way to represent data in a tabular form, we choose to convert to a matrix type because matrices are smaller than an equivalent dataframe and often speed up the computations.

We will also change the shape of X and y by taking its transpose. This will make the matrix calculations slightly more intuitive as we’ll see in the second part. There’s really no difference though. Some of you might find this way better, while others might prefer the non-transposed way. I feel this this makes more sense.

We’re going to use the as.matrix() method to construct out matrix. We’ll fill out matrix row-by-row.

X_train <- as.matrix(X_train, byrow=TRUE)
X_train <- t(X_train)
y_train <- as.matrix(y_train, byrow=TRUE)
y_train <- t(y_train)

X_test <- as.matrix(X_test, byrow=TRUE)
X_test <- t(X_test)
y_test <- as.matrix(y_test, byrow=TRUE)
y_test <- t(y_test)

Here are the shapes of our matrices after taking the transpose.

## Shape of X_train: 
##  2 320 
## Shape of y_train: 
##  1 320 
## Shape of X_test: 
##  2 80 
## Shape of y_test: 
##  1 80

0.4 Build a neural-net

Now that we’re done processing our data, let’s move on to building our neural net. As discussed above, we will broadly follow the steps outlined below.

Define the neural net architecture.
Initialize the model’s parameters from a random-uniform distribution.
Loop:
- Implement forward propagation.
- Compute loss.
- Implement backward propagation to get the gradients.
- Update parameters.

0.4.1 Get layer sizes

A neural-network optimizes certain parameters to get to the right output. These parameters are initialised randomly. However, the size of these matrices is dependent upon the number of layers in different layers of neural-net.

To generate matrices with random parameters, we need to first obtain the size (number of neurons) of all the layers in our neural-net. We’ll write a function to do that. Let’s denote n_x, n_h, and n_y as the number of neurons in input layer, hidden layer, and output layer respectively.

We will obtain these shapes from our input and output data matrices created above.

dim(X)[1] gives us \(2\) because the shape of X is (2, 320). We do the same for dim(y)[1].

getLayerSize <- function(X, y, hidden_neurons, train=TRUE) {
  n_x <- dim(X)[1]
  n_h <- hidden_neurons
  n_y <- dim(y)[1]   
  
  size <- list("n_x" = n_x,
               "n_h" = n_h,
               "n_y" = n_y)
  
  return(size)
}

As we can see below, the number of neurons is decided based on shape of the input and output matrices.

layer_size <- getLayerSize(X_train, y_train, hidden_neurons = 4)
layer_size

## $n_x
## [1] 2
## 
## $n_h
## [1] 4
## 
## $n_y
## [1] 1

0.4.2 Initialise parameters

Before we start training our parameters, we need to initialize them. Let’s initialize the parameters based on random uniform distribution.

The function initializeParameters() takes as argument an input matrix and a list which contains the layer sizes ie. number of neurons. The function returns the trainable parameters W1, b1, W2, b2.

Our neural-net has 3 layers, which gives us 2 sets of parameter. The first set is W1 and b1. The second set is W2 and b2. Note that these parameters exist as matrices.

These random weights matrices W1, b1, W2, b2 are created based on the layer sizes of the different layers (n_x, n_h, and n_y).

The sizes of these weights matrices are -

W1 = (n_h, n_x)
b1 = (n_h, 1)
W2 = (n_y, n_h)
b2 = (n_y, 1)

initializeParameters <- function(X, list_layer_size){

    m <- dim(data.matrix(X))[2]
    
    n_x <- list_layer_size$n_x
    n_h <- list_layer_size$n_h
    n_y <- list_layer_size$n_y
        
    W1 <- matrix(runif(n_h * n_x), nrow = n_h, ncol = n_x, byrow = TRUE) * 0.01
    b1 <- matrix(rep(0, n_h), nrow = n_h)
    W2 <- matrix(runif(n_y * n_h), nrow = n_y, ncol = n_h, byrow = TRUE) * 0.01
    b2 <- matrix(rep(0, n_y), nrow = n_y)
    
    params <- list("W1" = W1,
                   "b1" = b1, 
                   "W2" = W2,
                   "b2" = b2)
    
    return (params)
}

For our network, the size of our weight matrices are as follows. Remember that, number of input neurons n_x = 2, hidden neurons n_h = 4, and output neuron n_y = 1. layer_size is calculate above.

init_params <- initializeParameters(X_train, layer_size)
lapply(init_params, function(x) dim(x))

## $W1
## [1] 4 2
## 
## $b1
## [1] 4 1
## 
## $W2
## [1] 1 4
## 
## $b2
## [1] 1 1

0.4.3 Define the sigmoid function

We’ll implement the sigmoid() activation function that we’ll use at the output layer.

sigmoid <- function(x){
    return(1 / (1 + exp(-x)))
}

\[S(x) = \frac {1} {1 + e^{-x}}\]

0.4.4 Define the tanh function

tanh() function is already present in R. Here’s what its equation looks like.

\[T(x) = \frac {e^x - e^{-x}} {e^x + e^{-x}}\]

0.4.5 Forward Propagation

Now onto defining the forward propagation. The function forwardPropagation() takes as arguments the input matrix X, the parameters list params, and the list of layer_sizes.

We extract the layers sizes and weights from the respective functions defined above.

To perform matrix multiplication, we use the %*% operator.

Before we perform the matrix multiplications, we need to reshape the parameters b1 and b2. Why do we do this? Let’s find out.

Note that, the parameter shapes are -

W1: (4, 2)
b1: (4, 1)
W2: (1, 4)
b2 : (1, 1)

And the layers sizes are:

n_x = 2
n_h = 4
n_y = 1

Finally, shape of input matrix \(X\) (input layer):

X: (2, 320)

If we talk about the input => hidden; the hidden layer obtained by the equation A1 = activation(y1) = W1 %*% X + b1, would be as follows:

For the matrix multiplication of W1 and X, their shapes are correct by default: (4, 2) x (2, 320). The shape of the output matrix W1 %*% X is (4, 320).
Now, b1 is of shape (4, 1). Since, W1 %*% X is of the shape (4, 320), we need to repeat it b1 320 times, one for each input sample We do that using the command rep(b1, m) where m is calculated as dim(X)[2] which selects the second dimension of the shape of X.
The shape of A1 is (4, 320).

In the case of hidden => output; the output obtained by the equation y2 = W2 %*% A1 + b2, would be as follows:

To shapes of W2 and A1 are correct for us to perform matrix multiplication on them. W2 is (1, 4) and A1 is (4, 320). The output W2 %*% A1 has the shape (1, 320). b2 has a shape of (1, 1). We will again repeat b2 like we did above. So, b2 now becomes (1, 320).
The shape of A2 is now (1, 320).

We use the tanh() activation for the hidden layer and sigmoid() activation for the output layer.

forwardPropagation <- function(X, params, list_layer_size){
    
    m <- dim(X)[2]
    n_h <- list_layer_size$n_h
    n_y <- list_layer_size$n_y
    
    W1 <- params$W1
    b1 <- params$b1
    W2 <- params$W2
    b2 <- params$b2
    
    b1_new <- matrix(rep(b1, m), nrow = n_h)
    b2_new <- matrix(rep(b2, m), nrow = n_y)
    
    Z1 <- W1 %*% X + b1_new
    A1 <- sigmoid(Z1)
    Z2 <- W2 %*% A1 + b2_new
    A2 <- sigmoid(Z2)
    
    cache <- list("Z1" = Z1,
                  "A1" = A1, 
                  "Z2" = Z2,
                  "A2" = A2)

    return (cache)
}

Even though we only need the value A2 for forward propagation, you’ll notice we return all other calculated values as well. We do this because these values will be needed during backpropagation. Saving them here will reduce the the time it takes for backpropagation because we don’t have to calculate it again.

Another thing to notice is the Z and A of a particular layer will always have the same shape. This is because A = activation(Z) which does not change the shape of Z. An activation function only introduces non-linearity in a network.

fwd_prop <- forwardPropagation(X_train, init_params, layer_size)
lapply(fwd_prop, function(x) dim(x))

## $Z1
## [1]   4 320
## 
## $A1
## [1]   4 320
## 
## $Z2
## [1]   1 320
## 
## $A2
## [1]   1 320

0.4.6 Compute cost

We will use Binary Cross Entropy loss function (aka log loss). Here, \(y\) is the true label and \(\hat{y}\) is the predicted output.

\[ cost = - 1/N\sum_{i=1}^{N} y_{i}\log(\hat{y}_{i}) + (1 - y_{i})(\log(1 - \hat{y}_{i})) \]

The computeCost() function takes as arguments the input matrix \(X\), the true labels \(y\) and a cache. cache is the output of the forward pass that we calculated above. To calculate the error, we will only use the final output A2 from the cache.

computeCost <- function(X, y, cache) {
    
    m <- dim(X)[2]
    
    A2 <- cache$A2

    logprobs <- (log(A2) * y) + (log(1-A2) * (1-y))
    cost <- -sum(logprobs/m)
    
    return (cost)
}

cost <- computeCost(X_train, y_train, fwd_prop)
cost

## [1] 0.6930115

0.4.7 Backpropagation

Now comes the best part of this all: Backpropagation!

We’ll write a function that will calculate the gradient of the loss function with respect to the parameters. Generally, in a deep network, we have something like the following.

Figure 3: Backpropagation with cache. Credits: deeplearning.ai

The above figure has 2 hidden layers. During backpropagation (red boxes), we use the output cached during forward propagation (purple boxes).

Our neural net has only 1 hidden layer. More specifically, we have the following -

Figure 4: Backpropagation for a single layer. Credits: deeplearning.ai

To compute backpropagation, we write a function that takes as arguments an input matrix X, the train labels y, the output activations from the forward pass as cache, and a list of layer_sizes.

The three outputs \((dW^{[l]}, db^{[l]}, dA^{[l-1]})\) are computed using the input \(dZ^{[l]}\) where \(l\) is the layer number.

We first differentiate the loss function with respect to the weight \(W\) of the current layer.

\[ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}\]

Then we differentiate the loss function with respect to the bias \(b\) of the current layer. \[ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}\]

Once we have these, we calculate the derivative of the previous layer with respect to \(A\), the output + activation from the previous layer.

\[ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}\]

Because we only have a single hidden layer, we first calculate the gradients for the final (output) layer and then the middle (hidden) layer. In other words, the gradients for the weights that lie between the output and hidden layer are calculated first. Using this (and chain rule), gradients for the weights that lie between the hidden and input layer are calculated next.

Finally we return a list of gradient matrices.

These gradients tell us the the small value by which we should increase/decrease our weights such that the loss decreases.

Here are the equations for the gradients. I’ve calculated them for you so you don’t differentiate anything. We’ll directly use these values -

\(dZ^{[2]} = A^{[2]} - Y\)
\(dW^{[2]} = \frac{1}{m} dZ^{[2]}A^{[1]^T}\)
\(db^{[2]} = \frac{1}{m}\sum dZ^{[2]}\)
\(dZ^{[1]} = W^{[2]^T} * g^{[1]'} Z^{[1]}\) where \(g\) is the activation function.
\(dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}\)
\(db^{[1]} = \frac{1}{m}\sum dZ^{[1]}\)

backwardPropagation <- function(X, y, cache, params, list_layer_size){
    
    m <- dim(X)[2]
    
    n_x <- list_layer_size$n_x
    n_h <- list_layer_size$n_h
    n_y <- list_layer_size$n_y

    A2 <- cache$A2
    A1 <- cache$A1
    W2 <- params$W2

    
    dZ2 <- A2 - y
    dW2 <- 1/m * (dZ2 %*% t(A1)) 
    db2 <- matrix(1/m * sum(dZ2), nrow = n_y)
    db2_new <- matrix(rep(db2, m), nrow = n_y)
    
    dZ1 <- (t(W2) %*% dZ2) * (1 - A1^2)
    dW1 <- 1/m * (dZ1 %*% t(X))
    db1 <- matrix(1/m * sum(dZ1), nrow = n_h)
    db1_new <- matrix(rep(db1, m), nrow = n_h)
    
    grads <- list("dW1" = dW1, 
                  "db1" = db1,
                  "dW2" = dW2,
                  "db2" = db2)
    
    return(grads)
}

As you can see below, the shapes of the gradients are the same as their corresponding weights ie. W1 has the same shape as dW1 and so on. This is important because we are going to use these gradients to update our actual weights.

back_prop <- backwardPropagation(X_train, y_train, fwd_prop, init_params, layer_size)
lapply(back_prop, function(x) dim(x))

## $dW1
## [1] 4 2
## 
## $db1
## [1] 4 1
## 
## $dW2
## [1] 1 4
## 
## $db2
## [1] 1 1

0.4.8 Update Parameters

From the gradients calculated by the backwardPropagation(), we update our weights using the updateParameters() function. The updateParameters() function takes as arguments the gradients, network parameters, and a learning rate.

Why a learning rate? Because sometimes the weight updates (gradients) are too large and because of that we miss the minima completely. Learning rate is a hyper-parameter that is set by us, the user, to control the impact of weight updates.

The value of learning rate lies between \(0\) and \(1\). This learning rate is multiplied with the gradients before being subtracted from the weights.

The weights are updated as follows where the learning rate is defined by \(\alpha\).

\(W^{[2]} = W^{[2]} - \alpha * dW^{[2]}\)
\(b^{[2]} = b^{[2]} - \alpha * db^{[2]}\)
\(W^{[1]} = W^{[1]} - \alpha * dW^{[1]}\)
\(b^{[1]} = b^{[1]} - \alpha * db^{[1]}\)

Updated parameters are returned by updateParameters() function. We take the gradients, weight parameters, and a learning rate as the input. grads and params are calculated above while we choose the learning_rate.

updateParameters <- function(grads, params, learning_rate){

    W1 <- params$W1
    b1 <- params$b1
    W2 <- params$W2
    b2 <- params$b2
    
    dW1 <- grads$dW1
    db1 <- grads$db1
    dW2 <- grads$dW2
    db2 <- grads$db2
    
    
    W1 <- W1 - learning_rate * dW1
    b1 <- b1 - learning_rate * db1
    W2 <- W2 - learning_rate * dW2
    b2 <- b2 - learning_rate * db2
    
    updated_params <- list("W1" = W1,
                           "b1" = b1,
                           "W2" = W2,
                           "b2" = b2)
    
    return (updated_params)
}

As we can see, the weights still maintain their original shape. This means we’ve done things correctly till this point.

update_params <- updateParameters(back_prop, init_params, learning_rate = 0.01)
lapply(update_params, function(x) dim(x))

## $W1
## [1] 4 2
## 
## $b1
## [1] 4 1
## 
## $W2
## [1] 1 4
## 
## $b2
## [1] 1 1

0.5 Train the model

Now that we have all our components, let’s go ahead write a function that will train our model.

We will use all the functions we have written above in the following order.

Run forward propagation
Calculate loss
Calculate gradients
Update parameters
Repeat

This trainModel() function takes as arguments the input matrix X, the true labels y, and the number of epochs.

Get the sizes for layers and initialize random parameters.
Initialize a vector called cost_history which we’ll use to store the calculated loss value per epoch.
Run a for-loop:
- Run forward prop.
- Calculate loss.
- Update parameters.
- Replace the current parameters with updated parameters.

This function returns the updated parameters which we’ll use to run our model inference. It also returns the cost_history vector.

trainModel <- function(X, y, num_iteration, hidden_neurons, lr){
    
    layer_size <- getLayerSize(X, y, hidden_neurons)
    init_params <- initializeParameters(X, layer_size)
    
    cost_history <- c()

    for (i in 1:num_iteration) {
        fwd_prop <- forwardPropagation(X, init_params, layer_size)
        cost <- computeCost(X, y, fwd_prop)
        back_prop <- backwardPropagation(X, y, fwd_prop, init_params, layer_size)
        update_params <- updateParameters(back_prop, init_params, learning_rate = lr)
        init_params <- update_params

        cost_history <- c(cost_history, cost)
        
        if (i %% 10000 == 0) cat("Iteration", i, " | Cost: ", cost, "\n")
    }
    
    model_out <- list("updated_params" = update_params,
                      "cost_hist" = cost_history)

    return (model_out)
}

Now that we’ve defined our function to train, let’s run it!

We’re going to train our model, with 40 hidden neurons, for 60000 epochs with a learning rate of 0.9.

We will print out the loss after every 10000 epochs.

EPOCHS = 60000
HIDDEN_NEURONS = 40
LEARNING_RATE = 0.9

train_model <- trainModel(X_train, y_train, hidden_neurons = HIDDEN_NEURONS, num_iteration = EPOCHS, lr = LEARNING_RATE)

## Iteration 10000  | Cost:  0.3711597 
## Iteration 20000  | Cost:  0.246912 
## Iteration 30000  | Cost:  0.2576157 
## Iteration 40000  | Cost:  0.2900886 
## Iteration 50000  | Cost:  0.2074697 
## Iteration 60000  | Cost:  0.2858471

0.6 Logistic Regression

Before we go ahead and test our neural-net, let’s quickly train a simple logistic regression model so that we can compare its performance with our neural-net. Since, a logistic regression model can learn only linear boundaries, it will not fit the data well. A neural-network on the other hand will.

We’ll use the glm() function in R to build this model.

lr_model <- glm(y ~ x1 + x2, data = train)
lr_model

## 
## Call:  glm(formula = y ~ x1 + x2, data = train)
## 
## Coefficients:
## (Intercept)           x1           x2  
##     0.51697      0.00889     -0.05207  
## 
## Degrees of Freedom: 319 Total (i.e. Null);  317 Residual
## Null Deviance:       79.95 
## Residual Deviance: 76.41     AIC: 457.8

Let’s now make generate predictions of the logistic regression model on the test set.

lr_pred <- round(as.vector(predict(lr_model, test[, 1:2])))
lr_pred

##  [1] 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1
## [39] 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0
## [77] 1 1 1 0

0.7 Test the model

Finally, it’s time to make predictions. To do that -

First get the layer sizes.
Run forward propagation.
Return the prediction.

During inference time, we do not need to perform backpropagation as you can see below. We only perform forward propagation and return the final output from our neural network.

Note that instead of randomly initializing parameters, we’re using the trained parameters here.

makePrediction <- function(X, y, hidden_neurons){
    layer_size <- getLayerSize(X, y, hidden_neurons)
    params <- train_model$updated_params
    fwd_prop <- forwardPropagation(X, params, layer_size)
    pred <- fwd_prop$A2
    
    return (pred)
}

After obtaining our output probabilities (Sigmoid), we round-off those to obtain output labels.

y_pred <- makePrediction(X_test, y_test, HIDDEN_NEURONS)
y_pred <- round(y_pred)

Here are the true labels and the predicted labels.

## Neural Net: 
##  1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 1 1 0 1

## Ground Truth: 
##  0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1

## Logistic Reg: 
##  1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 0

0.7.1 Decision Boundaries

In the following visualization, we’ve plotted our test-set predictions on top of the decision boundaries.

0.7.1.1 Neural Net

As we can see, our neural-net was able to learn the non-linear decision boundary and has produced accurate results.

0.7.1.2 Logistic Regression

On the other hand Logistic Regression, with it’s linear decision boundary, could not fit the data very well.

0.7.2 Confusion Matrix

A confusion matrix is often used to describe the performance of a classifier. It is defined as -

\[\mathbf{Confusion Matrix} = \left[\begin{array} {rr} True Negative & False Positive \\ False Negative & True Positive \end{array}\right] \]

Let’s go over the basic terms used in a confusion matrix through an example. Consider the case where we were trying to predict if an email was spam or not.

True Positive: Email was predicted to be spam and it actually was spam.
True Negative: Email was predicted as not-spam and it actually was not-spam.
False Positive: Email was predicted to be spam but it actually was not-spam.
False Negative: Email was predicted to be not-spam but it actually was spam.

tb_nn <- table(y_test, y_pred)
tb_lr <- table(y_test, lr_pred)

cat("NN Confusion Matrix: \n")

## NN Confusion Matrix:

tb_nn

##       y_pred
## y_test  0  1
##      0 33 11
##      1  7 29

cat("\nLR Confusion Matrix: \n")

## 
## LR Confusion Matrix:

tb_lr

##       lr_pred
## y_test  0  1
##      0 14 30
##      1 18 18

0.7.3 Accuracy Metrics

We’ll calculate the Precision, Recall, F1 Score, Accuracy. These metrics, derived from the confusion matrix, are defined as -

Precision is defined as the number of true positives over the number of true positives plus the number of false positives.

\[Precision = \frac {True Positive}{True Positive + False Positive} \]

Recall is defined as the number of true positives over the number of true positives plus the number of false negatives.

\[Recall = \frac {True Positive}{True Positive + False Negative} \]

F1-score is the harmonic mean of precision and recall.

\[F1 Score = 2 \times \frac {Precision \times Recall}{Precision + Recall} \]

Accuracy gives us the percentage of the all correct predictions out total predictions made.

\[Accuracy = \frac {True Positive + True Negative} {True Positive + False Positive + True Negative + False Negative} \]

To better understand these terms, let’s continue the example of “email-spam” we used above.

If our model had a precision of 0.6, that would mean when it predicts an email as spam, then it is correct 60% of the time.
If our model had a recall of 0.8, then it would mean our model correctly classifies 80% of all spam.
The F-1 score is way we combine the precision and recall together. A perfect F1-score is represented with a value of 1, and worst with 0

Now that we have an understanding of the accuracy metrics, let’s actually calculate them. We’ll define a function that takes as input the confusion matrix. Then based on the above formulas, we’ll calculate the metrics.

calculate_stats <- function(tb, model_name) {
  acc <- (tb[1] + tb[4])/(tb[1] + tb[2] + tb[3] + tb[4])
  recall <- tb[4]/(tb[4] + tb[3])
  precision <- tb[4]/(tb[4] + tb[2])
  f1 <- 2 * ((precision * recall) / (precision + recall))
  
  cat(model_name, ": \n")
  cat("\tAccuracy = ", acc*100, "%.")
  cat("\n\tPrecision = ", precision*100, "%.")
  cat("\n\tRecall = ", recall*100, "%.")
  cat("\n\tF1 Score = ", f1*100, "%.\n\n")
}

Here are the metrics for our neural-net and logistic regression.

## Neural Network : 
##  Accuracy =  77.5 %.
##  Precision =  80.55556 %.
##  Recall =  72.5 %.
##  F1 Score =  76.31579 %.

## Logistic Regression : 
##  Accuracy =  40 %.
##  Precision =  50 %.
##  Recall =  37.5 %.
##  F1 Score =  42.85714 %.

As we can see, the logistic regression performed horribly because it cannot learn non-linear boundaries. Neural-nets on the other hand, are able to learn non-linear boundaries and as a result, have fit our complex data very well.

0.8 Conclusion

In this 2 part series, we’ve built a neural net from scratch with a vectorized implementation of backpropagation. We went through the entire life-cycle of training a model; right from data pre-processing to model evaluation.

Along the way, we learnt about the mathematics that makes a neural-network. We went over basic concepts of linear algebra and calculus and implemented them as functions. We saw how to initialize weights, perform forward propagation, gradient descent, and back-propagation.

We learnt about the ability of a neural-net to fit to non-linear data and understood the importance of the role activation-functions play in it. We trained a neural-net and compared it’s performance to a logistic-regression model. We visualized the decision boundaries of both these models and saw how a neural-net was able to fit better than logistic regression. We learnt about metrics like Precision, Recall, F1-Score, and Accuracy by evaluating our models against them.

By the end of this blog, you should have a pretty solid understanding of how neural-networks are built.

I hope you had as much fun reading this as I had in writing this.

If I’ve made a mistake somewhere, I’d love to hear about it so I can correct it. Suggestions and constructive criticism are welcome. :)

Akshaj Verma