A neural network is a network of neurons modeled mathematically, using learning algorithms inspired by the brain to store information. An artificial neural network is made up of artificial neurons, where each neuron is referred to as a node. So we can say a neural network is either a biological neural network, comprised of biological neurons, or an artificial neural network, which can solve many interesting problems in artificial intelligence. The strengths of the connections between biological neurons are replicated in artificial neural networks as weights between nodes. A positive weight denotes an excitatory connection, and a negative weight an inhibitory one. All inputs are adjusted by a weight and summed in a linear combination. Finally, an activation function is applied to control the amplitude of the output; it also provides the non-linearity needed to capture the non-linear behavior of most natural systems.
McCulloch and Pitts (1943) created a computational model for neural networks based on mathematics and algorithms. They called this model threshold logic. The model paved the way for neural network research to split into two distinct approaches. One approach focused on biological processes in the brain and the other focused on the application of neural networks to artificial intelligence. (Source: Wikipedia)
Rosenblatt (1958) created the perceptron, an algorithm for pattern recognition based on a two-layer learning computer network using simple addition and subtraction. With mathematical notation, Rosenblatt also described circuitry not in the basic perceptron, such as the exclusive-or circuit, a circuit whose mathematical computation could not be processed until after the backpropagation algorithm was created by Werbos. (Source: Wikipedia)
Here’s the diagram of a perceptron:
In the modern sense, the perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input x (a real-valued vector) to an output value \(f(x)\) (a single binary value):

\[f(x)=\begin{cases}1 & \text{if } w.x+b>0\\0 & \text{otherwise}\end{cases}\]
where w is a vector of real-valued weights, \(w.x\) is the dot product \(\sum_{i = 1}^{m} w_{i}x_{i}\) (m being the number of inputs to the perceptron), and b is the bias. The bias shifts the decision boundary away from the origin and does not depend on any input value.
The value of \(f(x)\) (0 or 1) is used to classify x as either a positive or a negative instance, in the case of a binary classification problem.
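To make this concrete, here is a minimal sketch of the threshold function in Python; the weights, bias, and input below are made-up example values:

```python
import numpy as np

# A minimal sketch of the perceptron threshold function:
# output 1 if w.x + b > 0, and 0 otherwise.
def predict(w: np.ndarray, x: np.ndarray, b: float) -> int:
    return 1 if np.dot(w, x) + b > 0 else 0

# Made-up example weights, input, and bias.
w = np.array([0.5, -0.6])
x = np.array([1.0, 0.2])
print(predict(w, x, b=0.1))  # prints 1: a positive instance
```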
The perceptron learning algorithm does not terminate if the learning set is not linearly separable: if the vectors are not linearly separable, learning will never reach a point where all vectors are classified properly. The most famous example of the perceptron’s inability to handle linearly non-separable vectors is the Boolean exclusive-or problem. The Boolean function XOR, short for eXclusive OR, has a true output only if exactly one of the inputs A or B is true. That is, when the inputs are both false, or both true, the output is false. The following table summarises this:

| A | B | A XOR B |
|---|---|---------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Now let’s take a look at a plot of this function on a grid with the outputs coloured:
It is, in fact, impossible for a single straight line to successfully divide the red from the green regions of the Boolean XOR. That is, a simple linear classifier can’t learn the Boolean XOR if presented with training data governed by the XOR function. We’ve just illustrated a major limitation of the simple linear classifier: it is of no use if the underlying problem is not separable by a straight line.
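We can watch this failure happen. The sketch below applies the classic perceptron learning rule to the four XOR examples; the epoch limit is an arbitrary assumption, and the error count never reaches zero however long we train:

```python
import numpy as np

# The perceptron learning rule applied to the XOR training set.
# Because XOR is not linearly separable, the loop below never
# reaches an epoch with zero misclassifications.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR targets

w = np.zeros(2)
b = 0.0
for epoch in range(1000):          # arbitrary epoch limit
    errors = 0
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0
        update = target - pred     # perceptron learning rule
        w = w + update * xi
        b = b + update
        errors += int(update != 0)
    if errors == 0:
        break                      # never happens for XOR

print(w, b, errors)                # errors is still non-zero
```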
The diagram below, which uses two straight lines to separate out the different regions, suggests the fix: we use multiple classifiers working together. That’s an idea central to neural networks.
Biological neurons transmit an electrical signal from one end to the other, from the dendrites along the axons to the terminals. These signals are then passed from one neuron to another. This is how our body senses light, sound, touch, pressure, heat and so on. Signals from specialized sensory neurons are transmitted along our nervous system to our brain, which itself is mostly made of neurons too.
Here is an image of interconnected biological neurons:
The electrical signals are collected by the dendrites, and these combine to form a stronger electrical signal. If the signal is strong enough to pass the threshold, the neuron fires a signal down the axon towards the terminals to pass onto the next neuron’s dendrites.
A biological neuron doesn’t produce an output that is a simple linear function of the input. Observations suggest that neurons don’t react readily, but instead suppress the input until it has grown so large that it triggers an output. We can think of this as a threshold that must be reached before any output is produced. A function that takes the input signal and generates an output signal while taking into account some kind of threshold is called an activation function. Mathematically, there are many activation functions that could achieve this effect. The most frequently used is known as the logistic or sigmoid function:
\[S(x)= \frac{1}{1+e^{-x}}\]
One way to replicate this natural behavior in an artificial model is to have layers of neurons, each connected to every neuron in the preceding and subsequent layers. The following diagram shows the connected nodes, but this time a weight is shown associated with each connection. A low weight will de-emphasize a signal, and a high weight will amplify it.
Here, the weight \(w_{ij}\) is simply the weight associated with the signal passing from node i in one layer to node j in the next layer.
We can express all the calculations that go into working out the combined moderated signal, x, into each node of the second layer using matrix multiplication. And this can be expressed as concisely as:
\[X=W.I\]
That is, W is the matrix of weights, I is the matrix of inputs, and X is the resultant matrix of combined moderated signals. The activation function simply applies a threshold and squishes the response to be more like that seen in biological neurons. So the final output from the second layer is:
\[O=sigmoid (X)\]
Here O is a matrix, which contains all the outputs from the final layer of the neural network.
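Here is a minimal sketch of these two steps for a single layer, with made-up weights and inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up example values: a 2x2 weight matrix W and a column
# matrix of inputs I.
W = np.array([[0.9, 0.3],
              [0.2, 0.8]])
I = np.array([[1.0],
              [0.5]])

X = W @ I          # combined moderated signals, X = W.I
O = sigmoid(X)     # outputs of the layer, O = sigmoid(X)
print(O)
```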
The following diagram shows an example neural network with 3 layers, each with 3 nodes. In order to keep the diagram clear, not all the weights are marked.
The first layer is the input layer, and the final layer is the output layer. The middle layer is called the hidden layer.
We can see the three inputs are 0.9, 0.1 and 0.8. So the input matrix I is:

\[I=\begin{bmatrix}0.9\\0.1\\0.8\end{bmatrix}\]
The weight matrix for input layer to hidden layer is:
We can see the weight between the first input node and the first node of the middle hidden layer is \(W_{1,1}=0.9\), just as in the network diagram above. Similarly we can see the weight for the link between the second node of the input and the second node of the hidden layer is \(W_{2,2}=0.8\) from the diagram.
The following diagram shows the second matrix containing the weights between nodes of the hidden layer and the output layer.
So now, we have got the weight matrices for both the input-to-hidden layer and hidden-to-final layer.
The combined moderated input to hidden layer can be expressed and calculated by this formulation:
\[X_{H}=W_{input-to-hidden}.I\]
Here \(I\) is the output from input layer and \(X_H\) matrix is the moderated input to the hidden layer.
Let’s visualize these combined moderated inputs into the second hidden layer.
Now we will apply the sigmoid activation function to those moderated input signals at each node of the hidden layer. Applying this activation function mimics the non-linear inner workings of our biological neurons.
\[O_{H}=sigmoid(X_H)\]
Here, \(O_H\) is the output of the hidden layer after the activation function is applied to \(X_H\).
Let’s update the diagram with the newly found information.
Now, we have one more layer to go: the final layer. Remember that the output \(O_H\) from the hidden layer now acts as the new input signal for the final layer, and we need to moderate these signals before presenting them to the nodes of the final layer.
\[X_O=W_{hidden-to-final}.O_H\]
\(X_O\) is the combined moderated inputs for each node of the final output layer.
Let’s update our previous diagram of the feed forward network again.
Now, all we have to do is apply the sigmoid activation function to those moderated inputs into the nodes of the final layer, and we get our final output from the neural net we have designed so far.
Work done! We got our final outputs from the neural net. Let’s also update this on the diagram.
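Here is a sketch of the complete forward pass in Python. The inputs 0.9, 0.1 and 0.8 come from the example above; the full weight matrices only appear in the diagrams, so the values below are assumed example weights, chosen to agree with the \(W_{1,1}=0.9\) and \(W_{2,2}=0.8\) entries mentioned earlier:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

I = np.array([[0.9], [0.1], [0.8]])        # inputs from the example

# Assumed example weights (only W11 = 0.9 and W22 = 0.8 are
# quoted in the text; the rest are made up for illustration).
W_input_to_hidden = np.array([[0.9, 0.3, 0.4],
                              [0.2, 0.8, 0.2],
                              [0.1, 0.5, 0.6]])
W_hidden_to_final = np.array([[0.3, 0.7, 0.5],
                              [0.6, 0.5, 0.2],
                              [0.8, 0.1, 0.9]])

X_H = W_input_to_hidden @ I    # moderated inputs to the hidden layer
O_H = sigmoid(X_H)             # hidden layer outputs
X_O = W_hidden_to_final @ O_H  # moderated inputs to the final layer
O = sigmoid(X_O)               # final outputs of the network
print(O)
```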
Now, we can use the output from the neural network and compare it with the training example to work out an error. We need to use that error to refine the neural network itself so that it improves its outputs. But how do we update link weights when more than one node contributes to an output and its error? The following illustrates this problem.
One good approach is to split the errors proportionally among all the contributing nodes. Here’s a simple diagram to get the idea.
The link weights are 3.0 and 1.0. If we split the error in proportion to the weights, the larger weight will receive 3/4 of the blame/error, and the smaller weight will receive 1/4.
We simply take those errors associated with the output of the hidden layer nodes, \(e_h\), and split those again proportionately across the preceding links between the input and hidden layers, \(W_{ih}\). The next diagram shows this logic.
It does not really matter how many layers we have; we’d repeatedly apply this same idea to each layer, working backwards from the final output layer. The flow of error information makes intuitive sense, and we can appreciate why this is called error backpropagation.
We need an error for the hidden layer nodes so we can use it to update the weights in the preceding layer. We call these errors \(e_h\). But there is a caveat: how do we actually measure the error for each of the nodes in the hidden layer? We don’t have any target values for the hidden layer nodes to compare their outputs against. Remember, though, that we have previously worked out the proportional blame/errors for each of the links connecting the hidden layer nodes to the output layer.
So the error in the first hidden node is the sum of the split errors in all the links connecting forward from that same node. We can write it down like this:
\[e_{h,1}=e_{o,1}*\frac{W_{11}}{W_{11}+W_{21}}+e_{o,2}*\frac{W_{12}}{W_{12}+W_{22}}\tag{1}\]
Here, \(e_{h,1}\) is the error measured for the first node of the hidden layer. \(e_{o,1}\) and \(e_{o,2}\) are the errors measured from the two nodes of the output layer. We can observe all the theory in action in the following diagram.
Let’s try to vectorize the whole process.
We can write down the errors from the output layer as

\[e_o=\begin{bmatrix}e_{o,1}\\e_{o,2}\end{bmatrix}\]

So, following equation (1) and its counterpart for the second hidden node, we have the following matrix for the hidden layer:

\[e_h=\begin{bmatrix}\frac{W_{11}}{W_{11}+W_{21}} & \frac{W_{12}}{W_{12}+W_{22}}\\\frac{W_{21}}{W_{11}+W_{21}} & \frac{W_{22}}{W_{12}+W_{22}}\end{bmatrix}.\begin{bmatrix}e_{o,1}\\e_{o,2}\end{bmatrix}\]

From equation (1), we can see the denominators in each term on the right-hand side are merely normalizing factors. So our matrix representation would be much simpler if we could get rid of those pesky normalizing factors, and we can do that in practice: the errors still get divided up in proportion to the link weights. Here’s the new representation for the errors in the hidden layer:

\[e_h=\begin{bmatrix}W_{11} & W_{12}\\W_{21} & W_{22}\end{bmatrix}.\begin{bmatrix}e_{o,1}\\e_{o,2}\end{bmatrix}\]
Now, looking carefully we can easily spot that this new weight matrix is the transpose of the original weight matrix. So we can write down the matrix approach to propagating the errors back mathematically:
\[e_h=W^T.e_o\]
Here, \(e_h\) is the error measured for the hidden layer nodes and \(e_o\) is the error from the output layer nodes.
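In code, propagating the errors back is a single matrix product. A minimal sketch with made-up weights and output errors:

```python
import numpy as np

# Made-up hidden-to-output weights and output-node errors.
W = np.array([[2.0, 3.0],
              [1.0, 4.0]])
e_o = np.array([[0.8],
                [0.5]])

e_h = W.T @ e_o   # e_h = W^T . e_o, errors at the hidden nodes
print(e_h)
```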
We have propagated the errors back into each layer of the neural network. These errors will guide how we adjust the link weights to improve the network’s overall answer. We could simply try random combinations of weights until we find a good combination that gives high precision, but this kind of brute-force or algebraic method costs a huge amount of computation, which explodes in no time with additional layers. It is simply not feasible.
Gradient descent comes to the rescue from this time and space complexity. If we imagine a complex landscape as a mathematical function, the gradient descent algorithm gives us a way to find the minimum without needing to understand the complex function completely or work it out mathematically.
The gradient descent algorithm works as follows: starting from some point on the function, we work out the slope (the gradient) at that point, and then take a small step in the direction opposite to the slope. We repeat this until the steps carry us down to a minimum.
A negative gradient means we increase the weight and a positive gradient means we decrease the weight. Diagrams are given for reference.
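A minimal sketch of this procedure on a simple one-dimensional function, \(y=(x-3)^2\), whose minimum we already know sits at \(x=3\):

```python
# Gradient descent on y = (x - 3)^2. The derivative dy/dx is
# 2 * (x - 3); stepping against it walks downhill to the minimum.
def gradient(x):
    return 2 * (x - 3)

x = 0.0         # arbitrary starting point
alpha = 0.1     # learning rate (step size)
for _ in range(100):
    x -= alpha * gradient(x)   # negative slope -> x increases,
                               # positive slope -> x decreases

print(x)        # very close to 3.0, the minimum
```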
Gradient descent has a tendency to get stuck in the wrong valley when the complex mathematical function has many valleys; the wrong valley is one that is not the global minimum. This 3-dimensional representation shows a complex function with many valleys. If we choose different starting points, we may end up in a wrong valley rather than the global minimum.
To avoid ending up in the wrong valley, or the wrong function minimum, we train neural networks several times, starting from different points on the hill, to ensure we don’t always end up in the wrong valley. Different starting points mean choosing different starting parameters, and in the case of neural networks this means choosing different starting link weights. The following illustrates three different goes at gradient descent, with one of them ending in the wrong valley.
The following table compares a couple of candidate error functions, using target and actual output values for three nodes.
Here, we see the first candidate for an error function is quite simple: (Target-Actual). But this is of little use as an error function, because positive and negative errors cancel each other out; the summation of the errors in the table is zero.
Now, what about the second error function, |Target-Actual|? The summation does not add up to zero, but it poses a problem: the derivative of this error function isn’t continuous near the minimum, which makes gradient descent work poorly, because we can bounce around the V-shaped valley this error function has.
The third option is to take the square of the difference \((target-actual)^2\).
Choosing this as an error function has a couple of advantages:

- The algebra needed to work out the slope for gradient descent is straightforward.
- The function is smooth and continuous, with no gaps or abrupt jumps, so gradient descent behaves well everywhere.
- The gradient gets smaller as we approach the minimum, which reduces the risk of overshooting it.
Let’s start with an illustration to showcase what we are trying to achieve.
Mathematically we can express it like this.
\[\frac{\partial E}{\partial W_{jk}}\]
For now, we will focus on the weights between the hidden and the final output layer. The following diagram shows this area of interest highlighted.
We can expand the error function and write it like this.
\[\frac{\partial E}{\partial W_{jk}}=\frac {\partial}{\partial W_{jk}}\sum_{n}(t_{n}-o_{n})^2\]
We can simplify the above expression a little. Note that the output at node k, \(o_{k}\), only depends on the weights \(W_{jk}\) of the links feeding into it, so the derivative of every other term in the sum is zero. Hence, we can omit the summation symbol from the expression and write it as follows:
\[\frac{\partial E}{\partial W_{jk}}=\frac {\partial}{\partial W_{jk}}(t_{k}-o_{k})^2\]
\(t_{k}\) does not depend on \(W_{jk}\). Only the output \(o_{k}\) depends on \({W_{jk}}\). So, we can apply the chain rule to differentiate the error function w.r.t \(W_{jk}\).
\[\frac{\partial E}{\partial W_{jk}}= \frac{\partial E}{\partial o_{k}}.\frac{\partial o_{k}}{\partial W_{jk}}\]
which gives:
\[\frac{\partial E}{\partial W_{jk}}=-2(t_{k}-o_{k}).\frac{\partial o_{k}}{\partial W_{jk}}\]
The \(o_{k}\) in the previous equation is the output of the node k which is the sigmoid function applied to the weighted sum of the connected incoming signals. So we can rewrite the equation as
\[\frac{\partial E}{\partial W_{jk}}=-2(t_{k}-o_{k}).\frac{\partial }{\partial W_{jk}}sigmoid(\sum_{j}W_{jk}.o_{j})\tag{2}\]
For simplicity, we will express the sigmoid function as \(S(x)\).
The sigmoid function has a special property: when differentiated, it takes a very simple form. That’s another good reason why the sigmoid is so popular as an activation function in neural networks.
\[\frac{\partial }{\partial x}S(x)=S(x).(1-S(x))\]
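We can check this property numerically with a quick sketch, comparing a central-difference estimate of the slope against \(S(x).(1-S(x))\) at an arbitrary point:

```python
import numpy as np

def S(x):
    return 1.0 / (1.0 + np.exp(-x))

x, h = 0.7, 1e-6                            # arbitrary point, tiny step
numeric = (S(x + h) - S(x - h)) / (2 * h)   # central-difference slope
analytic = S(x) * (1 - S(x))                # the closed-form derivative
print(numeric, analytic)                    # the two agree closely
```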
So, following from equation (2),
\[\frac{\partial E}{\partial W_{jk}}=-2(t_{k}-o_{k}).S(\sum_{j}W_{jk}.o_{j}).(1-S(\sum_{j}W_{jk}.o_{j})).\frac{\partial }{\partial W_{jk}}(\sum_{j}W_{jk}.o_{j})\]
\(\frac{\partial }{\partial W_{jk}}(\sum_{j}W_{jk}.o_{j})=o_{j}\), so the derivative becomes,
\[\frac{\partial E}{\partial W_{jk}}=-2(t_{k}-o_{k}).S(\sum_{j}W_{jk}.o_{j}).(1-S(\sum_{j}W_{jk}.o_{j})).o_{j}\]
We can still express this equation in a much simpler form, since we know \(S(\sum_{j}W_{jk}.o_{j})=o_{k}\). We can also get rid of the factor of 2, as we are only interested in the direction of the slope of the error function so we can descend it.
\[\frac{\partial E}{\partial W_{jk}}=-(t_{k}-o_{k}).o_{k}.(1-o_{k}).o_{j}\]
We can simplify the above expression a bit more by replacing \((t_{k}-o_{k})\) with \(e_{k}\), where \(e_{k}\) is the error at output node \(k\).
\[\frac{\partial E}{\partial W_{jk}}=-e_{k}.o_{k}.(1-o_{k}).o_{j}\]
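As a sketch, this error slope can be computed for all the hidden-to-output weights at once as an outer product; every value below is a made-up example:

```python
import numpy as np

e_k = np.array([[0.8], [0.5]])   # errors at the output nodes
o_k = np.array([[0.7], [0.6]])   # outputs of the output nodes
o_j = np.array([[0.4], [0.9]])   # outputs of the hidden nodes

# dE/dW_jk = -e_k * o_k * (1 - o_k) * o_j, one entry per weight.
dE_dW = -(e_k * o_k * (1 - o_k)) @ o_j.T
print(dE_dW)                     # a 2x2 matrix of error slopes
```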
So, we have formulated the expression to refine the weights between the hidden and output layers. But we are still left to figure out the derivatives, or the error slopes, for the weights between the input and hidden layers. We could start over with all the algebra again!
Rather than do that, we can simply take note of the symmetry of the previous expression and write down the new expression for the error slopes between the input and hidden layers.
\[\frac{\partial E}{\partial W_{ij}}=-e_{j}.o_{j}.(1-o_{j}).o_{i}\]
Finally, we have found all the magic recipes needed to update each weight in the neural network after each training example.
\[W_{jk}^{new}=W_{jk}^{old}-\alpha*\frac {\partial E}{\partial W_{jk}}\]
The new updated weight \(W_{jk}\) is the old weight adjusted by the negative of the error slope we just worked out. It’s negative because we want to decrease the weight if we have a positive slope, and increase it if we have a negative slope, as we saw earlier. The symbol alpha \(\alpha\), is a factor which moderates the strength of these changes to make sure we don’t overshoot. It’s often called a learning rate.
We can express the weight update in the following matrix form to implement it in a programming language.
\[\Delta W_{jk}= \alpha*E_{k}*O_{k}*(1-O_{k}).O_{j}^T\]
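A minimal sketch of this matrix update with made-up values; note that \(\Delta W\) already carries the minus sign from the error slope, so it is added to the old weights:

```python
import numpy as np

alpha = 0.3                      # learning rate
W = np.array([[0.3, 0.7],
              [0.6, 0.5]])       # old hidden-to-output weights (made up)
E = np.array([[0.8], [0.5]])     # output errors
O = np.array([[0.7], [0.6]])     # output-layer outputs
O_j = np.array([[0.4], [0.9]])   # hidden-layer outputs

delta_W = alpha * (E * O * (1 - O)) @ O_j.T
W = W + delta_W                  # step the weights down the error slope
print(W)
```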
We know that as the inputs to the network become larger, the activation function gets flatter.
A flat activation function means a tiny gradient, and a tiny gradient limits the ability to learn. This is called saturating a neural network. That means we should try to keep the inputs small. But not too small! The expression we derived also depends on the incoming signal \(O_{j}\), so we should not make it too small either.
A good suggestion is to rescale the inputs into the range 0.0 to 1.0. It is also usual to add a small offset to the inputs to avoid zero-valued inputs, which would jeopardize the ability to learn by zeroing the weight-update expression through \(O_{j}=0\).
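A sketch of this kind of rescaling, assuming raw inputs on a 0 to 255 scale (an arbitrary example, as for greyscale pixel values):

```python
import numpy as np

# Rescale raw 0-255 values into 0.01 to 1.0: divide by the maximum,
# shrink slightly, then add a small offset so no input is exactly zero.
raw = np.array([0.0, 128.0, 255.0])
scaled = (raw / 255.0) * 0.99 + 0.01
print(scaled)   # [0.01  0.50...  1.0]
```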
From our discussion earlier, it follows that just as we should not use large input values, we should not use large initial weights either, even though they are randomized.
A rule of thumb is to sample from a normal distribution with mean zero and a standard deviation equal to the inverse of the square root of the number of links into a node. But this is not always the case; there are other, more sophisticated methods for choosing the randomized initial values of the weights.
We should not assign a zero value to any initial weight, as it would kill the input signal.
One standard method for initializing weights for sigmoid or tanh activation functions is to use “glorot” or “xavier” initialization. The xavier initialization method is calculated as a random number with a uniform probability distribution between \(-\frac {1}{\sqrt n}\) and \(\frac {1}{\sqrt n}\), where n is the number of inputs to the node.
\[W_{initial}=U[-\frac {1}{\sqrt n},\frac {1}{\sqrt n}]\]
The normalized xavier initialization method is calculated as a random number with a uniform probability distribution between the range \(-\frac {\sqrt 6}{\sqrt {n+m}}\) and \(\frac {\sqrt 6}{\sqrt {n+m}}\), where n is the number of inputs to the node or the number of nodes in the previous layer and m is the number of outputs from the layer or the number of nodes in the current layer.
\[W_{initial}=U[-\frac {\sqrt 6}{\sqrt {n+m}},\frac {\sqrt 6}{\sqrt {n+m}}]\]
The current standard approach for initialization of the weights of neural network layers and nodes that use the rectified linear (ReLU) activation function is called he initialization.
The he initialization method is calculated as a random number with a Normal probability distribution with a mean of 0.0 and a standard deviation of \(\sqrt\frac {2}{n}\), where n is the number of inputs to the node.
\[W_{initial} \sim N(0.0,\frac {2}{n})\]
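Here is a sketch of all three initialization schemes in numpy; the layer sizes n (inputs) and m (outputs) are made-up examples:

```python
import numpy as np

rng = np.random.default_rng()
n, m = 100, 50   # made-up layer sizes: n inputs, m outputs

# Xavier/glorot (uniform): U[-1/sqrt(n), 1/sqrt(n)]
w_xavier = rng.uniform(-1/np.sqrt(n), 1/np.sqrt(n), size=(m, n))

# Normalized xavier (uniform): U[-sqrt(6)/sqrt(n+m), sqrt(6)/sqrt(n+m)]
limit = np.sqrt(6) / np.sqrt(n + m)
w_xavier_norm = rng.uniform(-limit, limit, size=(m, n))

# He (normal), for ReLU layers: mean 0.0, standard deviation sqrt(2/n)
w_he = rng.normal(0.0, np.sqrt(2.0 / n), size=(m, n))

print(w_xavier.std(), w_xavier_norm.std(), w_he.std())
```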