rm(list=ls()) #clear workspace
I wanted to set myself the challenge of building a neural network from scratch, relying solely on base R. I think the best way to fully get to grips with an algorithm is to code it from the ground up - one is left with little wiggle room for winging it. Moreover, I wanted to tackle the well-known MNIST (“Modified National Institute of Standards and Technology”) digit classification problem. The dataset is a collection of handwritten digits: each row represents a single digit, and each column holds the grey-scale value (0-255) of one pixel. I kept asking myself the question: could I design a neural network algorithm to achieve at least 90% accuracy on simple digit recognition? For the vast majority of the project, I found myself scribbling away with good old-fashioned pen and paper - obtaining the key derivatives was by far the trickiest aspect of the process! To simplify matters, I made the neural network almost as small as possible - in this way, I was free to focus on the key mathematics involved.
This project is condensed into a two-part series: the first part (this document) provides a broad introduction to neural networks and includes the code I employed to build the neural network. The second part shows the working I used to obtain the key derivatives - it is mathematically heavy and assumes prior knowledge of multivariable calculus and linear algebra. For the mathematically inclined, I include a link at the bottom of the page for the second part.
Whilst I used a few libraries in this post, they were not used for the purposes of building either the neural network function or the associated predictive function. I used the dslabs library to source the digits and their respective pixellated translations. Plus, I employed one of my favourite visualisation packages, highcharter, to visualise the movement of the loss function. Okay, let me take you through the building process!
If you have RStudio and are interested, here is a neat function
that will help ensure you have the prerequisite libraries to replicate
my results.
library_sort <- function(x){
  for(library in x){
    # require() returns TRUE invisibly if it was able to load the package
    if(!require(library, character.only = TRUE)){
      # if the package is unavailable, then install it
      install.packages(library, dependencies = TRUE)
      # load the package after installing
      require(library, character.only = TRUE)
    }
  }
}
# Then try/install packages...
library_sort( c("dslabs" , "caret" , "tidyverse", "rafalib","matrixStats","highcharter") )
Categorical data are defined by label values - unfortunately, not all machine learning models are designed to work in harmony with categorical data. Most tree-based models can learn from categorical data without applying transformations. However, neural networks - like most machine learning models - require the input and output to be numeric. Sometimes, integer coding (1, 2, 3…) can be recruited when categorical data can be ranked - in other words, when the data is inherently ordinal. One-hot encoding is recruited when there is no clear order in the data being examined: all elements are 0s, with the exception of the element that corresponds to the actual category being defined (which is labelled as a 1). Neural networks cannot distinguish between character labels - however, the algorithm can discriminate between 1s and 0s. The dependent variable (the column containing the classification of digits) required one-hot encoding. I put together a function that performs one-hot encoding:
one_hot_convert <- function(Y){
  Y <- as.matrix(Y)  #ensure matrix form so nrow() behaves
  categories <- sort(unique(as.vector(Y)))
  matrix_hot <- matrix(0,
                       ncol = length(categories),
                       nrow = nrow(Y),
                       dimnames = list(NULL, categories))
  for(i in 1:nrow(Y)){
    for(j in seq_along(categories)){
      if(Y[i] == categories[j]){matrix_hot[i, j] <- 1}
    }
  }
  return(matrix_hot)
}
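As a quick sanity check, here is how the function behaves on a small, made-up set of labels (chosen purely for illustration):
one_hot_convert(matrix(c(2, 0, 1, 2)))
##      0 1 2
## [1,] 0 0 1
## [2,] 1 0 0
## [3,] 0 1 0
## [4,] 0 0 1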
After obtaining the raw training data from the dslabs library, I did a little spring cleaning before splitting the data into a training and test set.
if(!exists("mnist"))mnist<-read_mnist()
x<-mnist$train$images
y<-mnist$train$labels
combined<-cbind(x,y)
df<-cbind(x,y)
df<-data.frame(df)
#keep all single digits [0-9]
df<-df%>% filter(y != 10)
#omit cases with missing values
df<- na.omit(df)
ynew<-df[,ncol(df)]
#remove zero variance columns
removeZeroVar <- function(df){
df[, !sapply(df, function(x) min(x) == max(x))]
}
#apply zero variance removal function
reduced<-removeZeroVar(df)
#partition data into training/test sets
n = nrow(reduced)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = reduced[trainIndex ,]
test = reduced[-trainIndex ,]
#Y training before one-hot encoding
YORIG<-train[,ncol(train)]
#Add 1 to the Y values (for intuitive index comparison later)
YORIG<-YORIG+1
#training design matrix
X<-as.matrix(train[,-ncol(train)])
#apply one-hot encoding to the dependent variable
Y<-as.matrix(train[,ncol(train)])
Y<-Y+1
Y<-one_hot_convert(Y)
Let’s set aside and prep the test set
XTEST<-as.matrix(test[,-ncol(test)])
YORIG_TEST<-test[,ncol(test)]
#will be used for index matching later
YORIG_TEST<-YORIG_TEST+1
It is no exaggeration to claim that the proliferation of artificial neural networks (NNs) over the last decade or so has revolutionised the field of artificial intelligence (AI). Neural networks have enjoyed phenomenal success in the domains of image recognition and natural language processing. Moreover, they are proving to be of great utility in a wide range of other fields - and there is every indication that this trend is set to continue in the coming years.
One of the best ways to grasp what is going on in a neural network is to view its internal workings via a computational graph. I will make a slight break with convention by visualising the activation layers (a0, a1, a2) and the z product layers separately. Below, the feedforward phase is illustrated:
htmltools::img(src = knitr::image_uri("C:/Users/gamal/Downloads/20220307_233706.jpg" ), alt = 'logo')
W1 : first weight matrix
B1 : first bias matrix
W2 : second weight matrix
B2 : second bias matrix
The superscripts, appearing at the top right of letters, signify the layer of the network. For each weight matrix, the subscript notation jk indicates the weight from the kth node in the (l-1)th layer to the jth node in the lth layer.
The feedforward calculations run as follows:
The inputs from the entry layer a0 are multiplied by a group of weights interlinking each input to a node in the adjoining hidden layer (as we move rightwards).
The respective weighted inputs are summed, after which a bias is added. This produces the pre-activation signal for the following layer.
An activation function acts upon the pre-activation signal; the transformed output forms the feed-forward activations that head towards the next layer.
Again, the newly created activated signals are multiplied by the second set of weights joining the hidden layer to the adjacent output layer, before being summed and a bias added (Z2).
Thereafter, the output activation function - the softmax in this case - is applied to Z2 to produce the network's predictions (aka Yhat/a[2]).
The predicted outputs are then evaluated against the truth labels (actual outcomes): the average error/cost is then evaluated.
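To make these steps concrete, here is a minimal sketch of a single forward pass on made-up dimensions (the matrix sizes and values are purely illustrative; the real helpers and weight initialisation appear further below):
set.seed(1)
#toy dimensions: 5 examples, 4 input features, 3 hidden nodes, 2 output classes
A0 <- matrix(rnorm(5 * 4), nrow = 5)         #input activations (a0)
W1 <- matrix(rnorm(4 * 3), nrow = 4) * 0.01  #first weight matrix
B1 <- matrix(0, nrow = 1, ncol = 3)          #first bias
W2 <- matrix(rnorm(3 * 2), nrow = 3) * 0.01  #second weight matrix
B2 <- matrix(0, nrow = 1, ncol = 2)          #second bias
Z1 <- A0 %*% W1 + matrix(rep(B1, each = 5), nrow = 5)  #weighted sum plus bias
A1 <- tanh(Z1)                                         #hidden layer activations (a1)
Z2 <- A1 %*% W2 + matrix(rep(B2, each = 5), nrow = 5)  #second weighted sum plus bias
Yhat <- exp(Z2) / rowSums(exp(Z2))                     #softmax predictions (a2)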
The final output of the neural network can be mathematically expressed as a nested composite function. The computational graph shows us that ‘C’ is a function of a[2], a[2] is a function of Z[2], and Z[2] is a function of W[2], B[2] and a[1]. Therefore, in order to determine the partial derivative of the cost function with respect to W[2], the chain rule can be employed.
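Written out with the chain rule (reading the terms loosely as the appropriate matrix derivatives), that statement becomes:
\[
\frac{\partial C}{\partial W^{[2]}} \;=\; \frac{\partial C}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial W^{[2]}}
\]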
Once the loss/cost has been computed, the goal of the neural network is to improve upon the calculated loss - in other words, to reduce loss. In this project, I use the terms loss and cost interchangeably; however, they are frequently differentiated. Loss is generally the evaluated error in relation to a single observation, whereas the cost function often refers to the average of the calculated loss functions. In a single epoch, loss is evaluated for every observation, whereas the cost is evaluated once.
For every potential set of weights/biases, there is an associated loss. The weights of the network are initialized randomly, or pseudo-randomly. Thus, we cannot expect the model to perform optimally after the first round - we are essentially blindfolded at the outset. The overall goal of the neural network is to discover parameters, in terms of weights and biases, that minimize the cost function. We need a reliable strategy that will help us discover parameters that will minimise the overall loss of the network - backpropagation to the rescue!
Backpropagation allows us to compute the derivatives/gradients of the loss function with respect to differing points throughout the network; we can discover how much the cost changes at particular locations of the loss curve. By employing multi-variable calculus, backpropagation provides us with a way of calculating the derivatives of the cost function, in a backwards fashion, by systematically recruiting the chain rule. This method enables us to effectively descend the gradients with respect to all of the respective weights. A big shout-out to Leibniz and Newton for their co-discovery of calculus!
Each partial derivative represents the instantaneous rate of change of the loss/cost function with respect to the weight in question, assuming that every other weight remains constant. We should bear in mind that we are updating all of the weights simultaneously. Gradient descent is a first-order iterative optimization algorithm that is used to locate local/global minima of differentiable functions. Partial derivatives provide us with the direction of steepest ascent; we move a step in the opposite direction - in other words, in the direction of steepest descent - and by doing so, we move toward a local/global minimum of the cost function. From time to time the loss will increase; however, this is generally unlikely to cause problems.
The addition of a bias helps shift the result of the activation function to the left or right. In addition, it keeps the model learning/training when all the input features are set to 0.
For simplicity, I employ a simple structure comprising one input layer, one hidden layer, and an output layer. Initially, I used a sigmoid activation function in the hidden layer; however, I quickly ran into issues. The sigmoid function's derivative is confined to the (0, 0.25] range: unfortunately, this inclines the gradient toward zero as the layers increase. Ultimately, this often results in the deactivation of neurons, and the learning process of the neural network is thwarted.
By recruiting the sigmoid function, it is likely that I encountered the classic issue of vanishing gradients. I then exchanged the sigmoid function for a choice of tanh/leaky ReLU activation functions - this simple substitution worked wonders! The ReLU activation function has a number of notable advantages over other activation functions: it tends to converge faster than tanh/sigmoid activation functions. Plus, it is computationally cheap.
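A quick numerical illustration of the point about the sigmoid's derivative (the sigmoid helpers here are hypothetical and used only for this check - they are not part of the network below):
#hypothetical helpers, only for illustration
sigmoid <- function(x) 1 / (1 + exp(-x))
d_sigmoid <- function(x) sigmoid(x) * (1 - sigmoid(x))
max(d_sigmoid(seq(-10, 10, by = 0.01)))
## [1] 0.25
#the backpropagated signal can only shrink through sigmoid layers,
#whereas the leaky ReLU derivative (defined later) is 1 for positive inputs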
Backpropagation repeatedly alters the weights/biases of the interconnections with the aim of minimizing the discrepancy between the predicted output and the truth labels. We are constantly posing the question: how does the loss alter when we adjust our weights? Generally, we continue until the loss converges, or the improvement in loss/cost plateaus.
The learning rate, or alpha term, is multiplied by the respective partial derivative of the loss function. In order to acquire an updated weight/bias, the resultant product is subtracted from the current value. The learning rate regulates the step size of the algorithm's movement toward a minimum: it effectively scales the size of our weight updates. Choosing the size of the learning rate is somewhat arbitrary; however, if it is calibrated too low, the training procedure will progress extremely slowly. On the other hand, if it is set too high, the algorithm can behave erratically and overshoot the desired minimum.
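In symbols, each update subtracts the learning rate times the corresponding partial derivative (stated generically here for any weight matrix W or bias B):
\[
W \leftarrow W - \alpha\,\frac{\partial C}{\partial W}, \qquad B \leftarrow B - \alpha\,\frac{\partial C}{\partial B}
\]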
Normalizing our initial input data to a standard scale improves the learning speed of the network. Generally, convergence is quicker if the mean of each input variable over the training set is close to zero. Conversely, if all elements of an input vector are positive, all of the weight updates flowing into a given node will share the same sign, which results in an inefficient learning process.
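For illustration, base R's scale() standardises each column to zero mean and unit variance; the SCALE() helper I define below instead uses a single global mean/sd for the whole matrix, which is arguably adequate here since every feature is a pixel on the same 0-255 scale.
#a minimal sketch: column-wise standardisation with base R (not the helper used below)
X_std <- scale(X)  #centre each column on 0 and divide by its standard deviation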
In terms of regularisation, I have opted to use Tikhonov regularization. In the neural network literature, this is often referred to as L² regularization or weight decay (as it decays the respective weights towards zero). The penalty term is simply the squared Euclidean norm of the weight matrix of the hidden layer (or the sum of all such squared norms, in the case of multiple hidden layers, and including the output layer) of the network. The parameter lambda (λ) adjusts the strength of the regularization.
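Matching the cost computed inside the network function further below (where Reg_lev plays the role of λ and N is the number of training examples), the quantity being minimised is:
\[
C = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k} y_{ik}\,\log \hat{y}_{ik} \;+\; \frac{\lambda}{2}\left(\lVert W^{[1]}\rVert^{2} + \lVert W^{[2]}\rVert^{2}\right)
\]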
Regularization is aimed at minimising the generalization error - that is, the algorithm's performance on unseen/new data. Regularization is not used to improve training error. As in other supervised machine learning/deep learning tasks, we need to skillfully navigate the bias-variance tradeoff. Potentially, the algorithm may overfit (high variance) - this phenomenon can occur when the model resembles the training data too closely. On the flip side, if the model underfits (high bias), it may not have sufficiently captured the core underlying patterns.
In the lingo of neural networks, one epoch refers to a complete dataset being passed through the forward and backward phases of the network precisely one time. In instances where a complete dataset cannot be passed through the neural network in one go, the dataset can be partitioned into mini-batches. The number of cases/exemplars in one mini-batch is referred to as the batch size.
Determining the number of epochs required for a neural network is somewhat arbitrary and should be monitored by the analyst. When the batch size happens to be the entire dataset, the number of epochs is equivalent to the number of iterations.
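The network below trains on the full batch in every epoch, but for completeness, here is one way the training rows could be carved into mini-batches (the batch size of 128 is an arbitrary illustrative choice):
#a minimal sketch - not used by the full-batch network below
batch_size <- 128
shuffled <- sample(nrow(X))  #shuffle the row indices
batches <- split(shuffled, ceiling(seq_along(shuffled) / batch_size))
length(batches)              #iterations per epoch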
Here, I created some functions that are recruited by the neural network.
#function for rounding and tidying numbers.
tidy_round <- function(x) format(x, digits = 2, big.mark = ",")
#simple scale function (note: uses one global mean/sd across the whole matrix)
SCALE <- function(x) {
  x_mean <- mean(x, na.rm = TRUE)
  x_sd <- sd(x, na.rm = TRUE)
  scaledX <- (x - x_mean) / x_sd
  return(scaledX)
}
#leaky ReLU activation
leaky_relu <- function(a) {
  pmax(a * .01, a)
}
#derivative of the leaky ReLU
d_leaky <- function(x){
  ifelse(x > 0, 1, .01)
}
#calculate softmax (row-wise)
softmax <- function(x) exp(x) / rowSums(exp(x))
#bias matrix expansion helper (replicates the bias row across n rows)
row.replicate <- function(x, n){
  matrix(rep(x, each = n), nrow = n)
}
#column sums, used for the bias derivatives
ADD_cols <- function(x){apply(x, 2, sum)}
#tanh derivative
d_tanh <- function(x){1 - (tanh(x)^2)}
#activation derivative (dispatch on the chosen activation)
choice_derivative <- function(Z, activation_choice){
  if(activation_choice == "tanh"){
    d_tanh(Z)
  } else if (activation_choice == "leaky_relu"){
    d_leaky(Z)
  }
}
#activation choice for the feed-forward pass
forward_activation <- function(Z, activation_choice){
  if(activation_choice == "tanh"){
    tanh(Z)
  } else if (activation_choice == "leaky_relu"){
    leaky_relu(Z)
  }
}
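One caveat worth flagging: the softmax helper above exponentiates raw scores directly, which can overflow for very large inputs. A common, numerically safer variant (not the one used in this post) subtracts each row's maximum first, which leaves the result unchanged:
#numerically stable alternative (equivalent output; not used below)
softmax_stable <- function(x){
  shifted <- x - apply(x, 1, max)  #subtract each row's maximum
  exp(shifted) / rowSums(exp(shifted))
}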
After many meanderings, I have finally put together a generic neural
network. Here, I use tanh as
the default activation function.
NN_Solver <- function(X, Y, SCALE = FALSE, learn_rate = 0.4, activation_choice = "tanh", NODES_TO_FIT, Reg_lev = 0.01, epochs){
  if (anyNA(X) || anyNA(Y))
    stop("Please address missing values")
  if (length(X) < 2 || length(Y) < 2)
    stop("There are insufficient observations")
  if (nrow(X) != nrow(Y))
    stop("'X' and 'Y' must contain the same number of rows")
  if (!is.numeric(as.matrix(X)))
    stop("all columns should be numeric")
  if (!is.matrix(X)) {X <- as.matrix(X)}
  if (!is.matrix(Y)) {Y <- as.matrix(Y)}
  if (SCALE) {X <- SCALE(X)}
  NEXAMPS <- nrow(X)   #quantity of cases
  NCLASSES <- ncol(Y)  #quantity of output classes
  NFEATURES <- ncol(X) #quantity of features
  W1 <- matrix(rnorm(NFEATURES * NODES_TO_FIT), nrow = NFEATURES) * 0.01   #first weight matrix
  B1 <- matrix(0, nrow = 1, ncol = NODES_TO_FIT)                           #first bias
  W2 <- matrix(rnorm(NODES_TO_FIT * NCLASSES), nrow = NODES_TO_FIT) * 0.01 #second weight matrix
  B2 <- matrix(0, nrow = 1, ncol = NCLASSES)                               #second bias
  cost_history <- c() #vector to contain error history, for plotting later
  for (i in 0:epochs){
    #forward propagation phase
    Z1 <- X %*% W1 + row.replicate(B1, NEXAMPS)
    Z1 <- matrix(Z1, nrow = NEXAMPS)
    #apply leaky ReLU/tanh
    A1 <- forward_activation(Z1, activation_choice)
    Z2 <- A1 %*% W2 + row.replicate(B2, NEXAMPS)
    #apply softmax
    Quasiprobs <- softmax(Z2)
    #calculate categorical cross-entropy error
    entropy_loss <- mean(-rowSums(Y * log(Quasiprobs)))
    #calculate the L2 penalty
    L2_loss <- 0.5 * Reg_lev * sum(W1 * W1) + 0.5 * Reg_lev * sum(W2 * W2)
    #regularized loss
    error <- entropy_loss + L2_loss
    #select highest predicted likelihood
    estimated_class <- apply(Quasiprobs, MARGIN = 1, FUN = which.max)
    #truth value
    actual_class <- apply(Y, MARGIN = 1, FUN = which.max)
    #calculate overall accuracy
    accuracy <- mean(estimated_class == actual_class)
    #capture progressing cost
    cost_history <- c(cost_history, error)
    #report every 10th cycle (%% is the modulo operator: the remainder of division)
    if (i %% 10 == 0){
      print(paste("Cycle", i, ': Error', error, ' : accuracy:', tidy_round(accuracy)))}
    #backpropagation phase (apply derivatives)
    CL_CZ2 <- Quasiprobs - Y                                    #partial C / partial Z2
    CL_CZ2 <- CL_CZ2 / NEXAMPS                                  #evaluate average derivative
    CL_B2 <- ADD_cols(CL_CZ2)                                   #partial C / B2
    CL_W2 <- t(A1) %*% CL_CZ2                                   #partial C / W2
    CL_X2 <- CL_CZ2 %*% t(W2)                                   #partial C / A1
    CL_CZ1 <- CL_X2 * choice_derivative(Z1, activation_choice)  #partial C / Z1
    CL_CW1 <- t(X) %*% CL_CZ1                                   #partial C / W1
    CL_B1 <- ADD_cols(CL_CZ1)                                   #partial C / B1
    CL_W2 <- CL_W2 + Reg_lev * W2                               #regularized derivative C / W2
    CL_CW1 <- CL_CW1 + Reg_lev * W1                             #regularized derivative C / W1
    W2 <- W2 - learn_rate * CL_W2                               #gradient descent step, W2
    B2 <- B2 - learn_rate * CL_B2                               #gradient descent step, B2
    W1 <- W1 - learn_rate * CL_CW1                              #gradient descent step, W1
    B1 <- B1 - learn_rate * CL_B1                               #gradient descent step, B1
  }
  key_metrics <- list("W2" = W2,
                      "B2" = B2,
                      "W1" = W1,
                      "B1" = B1,
                      "accuracy" = accuracy,
                      "cost_hist" = cost_history)
  return(key_metrics)
}
Finally, it was time for the fun part - this neural network was now ready to do its thing! I trained the neural network for 700 epochs.
set.seed(123)
TRAINED_MODEL <- NN_Solver(X, Y, SCALE=TRUE, learn_rate = 0.2, NODES_TO_FIT=40, Reg_lev=0.01, epochs=700)
## [1] "Cycle 0 : Error 2.3177841402353 : accuracy: 0.12"
## [1] "Cycle 10 : Error 1.79690527322953 : accuracy: 0.62"
## [1] "Cycle 20 : Error 1.12760417463118 : accuracy: 0.75"
## [1] "Cycle 30 : Error 0.829581065414324 : accuracy: 0.84"
## [1] "Cycle 40 : Error 0.683339604139169 : accuracy: 0.87"
## [1] "Cycle 50 : Error 0.601725961190896 : accuracy: 0.88"
## [1] "Cycle 60 : Error 0.552529141188573 : accuracy: 0.89"
## [1] "Cycle 70 : Error 0.520855926350999 : accuracy: 0.9"
## [1] "Cycle 80 : Error 0.499146444959498 : accuracy: 0.9"
## [1] "Cycle 90 : Error 0.483428752981504 : accuracy: 0.9"
## [1] "Cycle 100 : Error 0.471523166390466 : accuracy: 0.91"
## [1] "Cycle 110 : Error 0.462170293168145 : accuracy: 0.91"
## [1] "Cycle 120 : Error 0.45460426287667 : accuracy: 0.91"
## [1] "Cycle 130 : Error 0.448336900592417 : accuracy: 0.91"
## [1] "Cycle 140 : Error 0.44304369086073 : accuracy: 0.92"
## [1] "Cycle 150 : Error 0.43850076945238 : accuracy: 0.92"
## [1] "Cycle 160 : Error 0.434548635322093 : accuracy: 0.92"
## [1] "Cycle 170 : Error 0.431070449766368 : accuracy: 0.92"
## [1] "Cycle 180 : Error 0.427978602582879 : accuracy: 0.92"
## [1] "Cycle 190 : Error 0.425206121444476 : accuracy: 0.92"
## [1] "Cycle 200 : Error 0.422701001519115 : accuracy: 0.92"
## [1] "Cycle 210 : Error 0.420422339248197 : accuracy: 0.93"
## [1] "Cycle 220 : Error 0.418337606815972 : accuracy: 0.93"
## [1] "Cycle 230 : Error 0.416420669369226 : accuracy: 0.93"
## [1] "Cycle 240 : Error 0.414650306755711 : accuracy: 0.93"
## [1] "Cycle 250 : Error 0.413009095794207 : accuracy: 0.93"
## [1] "Cycle 260 : Error 0.411482561438331 : accuracy: 0.93"
## [1] "Cycle 270 : Error 0.410058532943485 : accuracy: 0.93"
## [1] "Cycle 280 : Error 0.408726656851416 : accuracy: 0.93"
## [1] "Cycle 290 : Error 0.407478029220668 : accuracy: 0.93"
## [1] "Cycle 300 : Error 0.406304917684225 : accuracy: 0.93"
## [1] "Cycle 310 : Error 0.40520055047097 : accuracy: 0.93"
## [1] "Cycle 320 : Error 0.404158954924963 : accuracy: 0.93"
## [1] "Cycle 330 : Error 0.403174832470047 : accuracy: 0.93"
## [1] "Cycle 340 : Error 0.40224346041093 : accuracy: 0.93"
## [1] "Cycle 350 : Error 0.401360613516814 : accuracy: 0.93"
## [1] "Cycle 360 : Error 0.400522500201872 : accuracy: 0.94"
## [1] "Cycle 370 : Error 0.399725709515011 : accuracy: 0.94"
## [1] "Cycle 380 : Error 0.398967166226286 : accuracy: 0.94"
## [1] "Cycle 390 : Error 0.398244092122213 : accuracy: 0.94"
## [1] "Cycle 400 : Error 0.397553972234098 : accuracy: 0.94"
## [1] "Cycle 410 : Error 0.396894525152899 : accuracy: 0.94"
## [1] "Cycle 420 : Error 0.396263676865372 : accuracy: 0.94"
## [1] "Cycle 430 : Error 0.395659537715546 : accuracy: 0.94"
## [1] "Cycle 440 : Error 0.395080382186611 : accuracy: 0.94"
## [1] "Cycle 450 : Error 0.394524631239659 : accuracy: 0.94"
## [1] "Cycle 460 : Error 0.393990836959451 : accuracy: 0.94"
## [1] "Cycle 470 : Error 0.393477669259172 : accuracy: 0.94"
## [1] "Cycle 480 : Error 0.392983904395354 : accuracy: 0.94"
## [1] "Cycle 490 : Error 0.392508415045575 : accuracy: 0.94"
## [1] "Cycle 500 : Error 0.392050161706724 : accuracy: 0.94"
## [1] "Cycle 510 : Error 0.391608185181307 : accuracy: 0.94"
## [1] "Cycle 520 : Error 0.391181599933853 : accuracy: 0.94"
## [1] "Cycle 530 : Error 0.390769588120472 : accuracy: 0.94"
## [1] "Cycle 540 : Error 0.390371394123313 : accuracy: 0.94"
## [1] "Cycle 550 : Error 0.389986319458907 : accuracy: 0.94"
## [1] "Cycle 560 : Error 0.389613717974089 : accuracy: 0.94"
## [1] "Cycle 570 : Error 0.389252991291438 : accuracy: 0.94"
## [1] "Cycle 580 : Error 0.38890358451149 : accuracy: 0.94"
## [1] "Cycle 590 : Error 0.388564982213051 : accuracy: 0.94"
## [1] "Cycle 600 : Error 0.388236704808581 : accuracy: 0.94"
## [1] "Cycle 610 : Error 0.387918305305147 : accuracy: 0.94"
## [1] "Cycle 620 : Error 0.387609366494938 : accuracy: 0.94"
## [1] "Cycle 630 : Error 0.387309498560309 : accuracy: 0.94"
## [1] "Cycle 640 : Error 0.387018337037308 : accuracy: 0.94"
## [1] "Cycle 650 : Error 0.386735541049158 : accuracy: 0.94"
## [1] "Cycle 660 : Error 0.386460791704654 : accuracy: 0.94"
## [1] "Cycle 670 : Error 0.386193790558891 : accuracy: 0.94"
## [1] "Cycle 680 : Error 0.385934258053964 : accuracy: 0.94"
## [1] "Cycle 690 : Error 0.385681931890279 : accuracy: 0.94"
## [1] "Cycle 700 : Error 0.38543656531804 : accuracy: 0.94"
Here, I plot the variation in error as training progressed.
number_epochs<-length(TRAINED_MODEL$cost_hist)
newcol<-1:number_epochs
costChange<-TRAINED_MODEL$cost_hist
dfprep<-cbind(newcol,costChange)
dataframe<-data.frame(dfprep)
Time for plotting (here I used the highcharter package, which makes it easy to obtain high-calibre interactive visuals):
highchart() %>%
hc_add_series(data = dataframe,
type = "line",
hcaes(x = newcol,
y = costChange
)) %>%
hc_xAxis(title = list(text="Epochs")) %>%
hc_yAxis(title = list(text="Cost")) %>%
hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
hc_legend(align = "right",
verticalAlign = "top")
Here, I put together a prediction function that enables the neural network to make new predictions on unseen data.
Neural_Predictor <- function(TRAINED_MODEL, X, activation_choice = "tanh"){
  #number of cases
  N <- nrow(X)
  #commence forward activation
  Z1 <- X %*% TRAINED_MODEL$W1 + row.replicate(TRAINED_MODEL$B1, N)
  Z1 <- matrix(Z1, nrow = N)
  A1 <- forward_activation(Z1, activation_choice)
  Z2 <- A1 %*% TRAINED_MODEL$W2 + row.replicate(TRAINED_MODEL$B2, N)
  #identify the index (class) with the highest value for each row
  pred_class <- apply(Z2, 1, which.max)
  return(pred_class)
}
Let’s see how accurately the trained model classified data from the test set:
PRED_CLASS <- Neural_Predictor(X=XTEST, TRAINED_MODEL, activation_choice = "tanh") #run model with optimised parameters
mean_round_prep<-mean(PRED_CLASS == YORIG_TEST)
Percentage=paste('Accuracy percentage:',paste0(round(mean_round_prep*100,2),"%"))
Percentage
## [1] "Accuracy percentage: 89.84%"
I wanted a little more nuance in the quality of the predictions: how well did the model perform in relation to specific digits?
library(caret)
confusionMatrix(as.factor(PRED_CLASS), reference = as.factor(YORIG_TEST))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8 9 10
## 1 1735 0 15 10 2 76 22 20 8 12
## 2 0 1820 1 2 2 6 0 11 6 2
## 3 4 14 1625 34 10 15 13 47 10 1
## 4 4 12 16 1640 4 108 0 31 11 29
## 5 1 2 24 0 1528 18 10 19 2 22
## 6 2 1 1 15 0 1221 3 0 2 1
## 7 8 1 38 9 16 36 1720 0 4 1
## 8 0 4 8 6 1 0 0 1596 0 16
## 9 24 179 55 89 57 189 44 21 1697 61
## 10 6 2 3 9 105 15 0 135 6 1589
##
## Overall Statistics
##
## Accuracy : 0.8984
## 95% CI : (0.8939, 0.9028)
## No Information Rate : 0.1131
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8871
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.97253 0.8943 0.90985 0.90408 0.88580 0.72506
## Specificity 0.98982 0.9981 0.99087 0.98672 0.99398 0.99847
## Pos Pred Value 0.91316 0.9838 0.91653 0.88410 0.93973 0.97994
## Neg Pred Value 0.99696 0.9867 0.99008 0.98922 0.98797 0.97236
## Prevalence 0.09911 0.1131 0.09922 0.10078 0.09583 0.09356
## Detection Rate 0.09639 0.1011 0.09028 0.09111 0.08489 0.06783
## Detection Prevalence 0.10556 0.1028 0.09850 0.10306 0.09033 0.06922
## Balanced Accuracy 0.98118 0.9462 0.95036 0.94540 0.93989 0.86176
## Class: 7 Class: 8 Class: 9 Class: 10
## Sensitivity 0.94923 0.84894 0.97194 0.91638
## Specificity 0.99302 0.99783 0.95576 0.98272
## Pos Pred Value 0.93835 0.97854 0.70240 0.84973
## Neg Pred Value 0.99431 0.98265 0.99686 0.99101
## Prevalence 0.10067 0.10444 0.09700 0.09633
## Detection Rate 0.09556 0.08867 0.09428 0.08828
## Detection Prevalence 0.10183 0.09061 0.13422 0.10389
## Balanced Accuracy 0.97112 0.92338 0.96385 0.94955
Sensitivity captures the true positive rate (TPR) - it measures the proportion of cases in which a particular digit is present and the neural network correctly predicted that digit. Class 1 refers to the digit 0, class 2 refers to the digit 1, and so forth.
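For a given class, sensitivity is computed as:
\[
\text{Sensitivity} = \frac{TP}{TP + FN}
\]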
For the mathematically inclined, please follow this link. Therein, I demonstrate how to obtain the key derivatives used in the neural network function featured in this article.