We will use a single library in this tutorial, namely the plotly library. This will allow us to visualize the process of gradient descent and to zoom into the plots.
library(plotly)
No doubt you learned about straight lines and parabolas at school. The quintessential straight line is \(y = x\) or \(f \left( x \right) = x\). Plugging in a value for \(x\) produces the exact same value for \(y\).
Below we use \(x\) values between \(-1\) and \(5\) to plot this line.
# x values from -1 to 5 and the corresponding y values on the line y = x
x <- -1:5
y <- x

p1 <- plot_ly(x = ~x,
              y = ~y,
              type = "scatter",
              mode = "lines") %>%
  layout(title = "A straight line with a slope of 1 going through the origin",
         xaxis = list(title = "x axis"),
         yaxis = list(title = "y axis"))
p1
The quintessential parabola is \(y = x^2\). Below we create enough points to produce a smooth plot.
# enough x values between -3 and 3 to produce a smooth parabola y = x^2
x <- seq(from = -3,
         to = 3,
         by = 0.1)
y <- x^2

p1 <- plot_ly(x = ~x,
              y = ~y,
              type = "scatter",
              mode = "lines") %>%
  layout(title = "A parabola",
         xaxis = list(title = "x axis"),
         yaxis = list(title = "y axis"))
p1
For those familiar with calculus, we can calculate the slope (the rise over the run) at every point on a curve. This is straightforward for a straight line: the slope is constant, so every change in \(x\) produces the same proportional change in \(y\).
The slope is also nothing other than the first derivative. The equation for the first derivative of a single variable function \(f \left( x \right)\) is given in (1) below.
\[f' \left( x \right) = \frac{df}{dx} = \lim_{\Delta x \rightarrow 0} \frac{f \left( x + \Delta x\right) - f \left( x \right)}{\Delta x} \tag{1}\]
Skipping a long explanation, for a polynomial such as \(f \left( x \right) = x^n\), we would have the first derivative of \(f\) with respect to \(x\) given as \(f' \left( x \right) = n x^{n-1}\). This means that if \(f \left( x \right) = x^2\) then \(f' \left(x \right) = 2x\). At every point of the function \(f \left( x \right)\) the slope is \(2x\).
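As a quick numerical check, we can approximate equation (1) directly in R by using a very small \(\Delta x\). The value of \(0.0001\) below is an arbitrary small step chosen just for this illustration.

# approximate the derivative of f(x) = x^2 at x = 2 using equation (1)
f <- function(x) x^2
delta_x <- 0.0001
(f(2 + delta_x) - f(2)) / delta_x  # approximately 4, matching f'(2) = 2 * 2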
Below we plot the tangent line at \(x = 2\), that is, the straight line through the point \(\left( 2, 4 \right)\) with slope \(f' \left( 2 \right) = 2 \left( 2 \right) = 4\).
# the tangent line at x = 2: slope 4 through the point (2, 4)
xd <- seq(from = 0,
          to = 3,
          by = 0.1)
yd <- 4 * xd - 4

p2 <- plot_ly(x = ~x,
              y = ~y,
              type = "scatter",
              mode = "lines",
              name = "Parabola") %>%
  add_trace(x = ~xd,
            y = ~yd,
            type = "scatter",
            mode = "lines",
            name = "Derivative") %>%
  layout(title = "A parabola and its derivative at x=2",
         xaxis = list(title = "x axis"),
         yaxis = list(title = "y axis"))
p2
Given our simple parabola above, it is easy to see that \(y\) reaches a minimum at \(x = 0\). For a more complicated function, the location of the minimum might not be this obvious.
We will use the simple case of our parabola to illustrate how to find the minimum by starting at an arbitrary point such as \(x_0 = 2\). This process is termed gradient descent.
At \(x_0\) the slope is \(4\), as we have already seen. We can see that we need to move our \(x\) value to the left to get to the minimum. We will call this next \(x\) value \(x_1\). We can build a formula to update to this \(x_1\). First, we define a small step size, called a learning rate, \(h\). We will make it quite small, something like \(h = 0.01\).
Now we multiply our current slope (let's call it \(D\)) by the learning rate, i.e. \(4 \times 0.01 = 0.04\). We subtract this from \(x_0\) to get \(x_1\). For any step in this algorithm we therefore have equation (2).
\[ x_{i+1} = x_i - D \cdot h \tag{2}\]
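A single update step of equation (2) in R reproduces this arithmetic. The names x0, D, and h below are our own and simply mirror the symbols above.

# one gradient descent update step for f(x) = x^2, starting at x0 = 2
x0 <- 2
D <- 2 * x0       # the slope f'(x) = 2x at x0
h <- 0.01         # the learning rate
x1 <- x0 - D * h
x1                # 1.96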
In our case \(x_1 = 2 - 0.04 = 1.96\). At this point the derivative (slope) is \(f' \left( 1.96 \right) = 3.92\), and the corresponding point on the parabola is \(\left( 1.96, 1.96^2 \right) = \left( 1.96, 3.8416 \right)\). We update our graph as follows.
# the new tangent line at x = 1.96: slope 3.92 through the point (1.96, 3.8416)
xd <- seq(from = 0,
          to = 3,
          by = 0.1)
yd <- 3.92 * xd - 3.8416

p3 <- plot_ly(x = ~x,
              y = ~y,
              type = "scatter",
              mode = "lines",
              name = "Parabola") %>%
  add_trace(x = ~xd,
            y = ~yd,
            type = "scatter",
            mode = "lines",
            name = "Derivative") %>%
  layout(title = "A parabola and its derivative at x=1.96",
         xaxis = list(title = "x axis"),
         yaxis = list(title = "y axis"))
p3
Since we are plotting using the plotly
library, we can zoom in and see that we have indeed moved towards a point where the function is at a minimum.
We can continue with this algorithm until we reach a point that has a slope arbitrarily close to zero. Remember that a smooth function can only have a minimum at a point where the slope is zero.
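As a sketch of the full algorithm for our parabola, we can loop over equation (2) until the slope is smaller than some tolerance. The tolerance of \(0.0001\) and the cap of \(10000\) iterations below are arbitrary choices for this illustration.

# gradient descent on f(x) = x^2 until the slope is arbitrarily close to zero
x_i <- 2              # starting point x_0
h <- 0.01             # learning rate
for (i in 1:10000) {
  D <- 2 * x_i        # slope f'(x) = 2x at the current point
  if (abs(D) < 0.0001) break
  x_i <- x_i - D * h  # the update step from equation (2)
}
x_i  # very close to the minimum at x = 0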
Gradient descent is at the heart of deep learning. There, the function is much more complex than our simple parabola above. For one, it is multivariable (a function of many variables). The process remains the same, though. We simply use partial derivatives with respect to each variable individually.
The aim of gradient descent is to find the minimum across all the variables. We will learn that these variables are weights and biases that minimize a cost function. The latter is a function generated by our deep learning algorithm and is a measure of how far its output is from the correct values for a given set of weight and bias values. It is constructed such that it has minima in multidimensional space. We update the weight and bias values just as we did above, so as to minimize the cost function.
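To see what this looks like with more than one variable, we can use a toy stand-in for a cost function, say \(f \left( w, b \right) = w^2 + b^2\) (chosen purely for illustration, not generated by any deep learning algorithm). The partial derivatives are \(2w\) and \(2b\), and each variable gets its own update.

# gradient descent on the toy cost function f(w, b) = w^2 + b^2
w <- 2             # starting value for the weight
b <- -3            # starting value for the bias
h <- 0.01          # learning rate
for (i in 1:2000) {
  dw <- 2 * w      # partial derivative of f with respect to w
  db <- 2 * b      # partial derivative of f with respect to b
  w <- w - dw * h
  b <- b - db * h
}
c(w, b)  # both values approach the minimum at (0, 0)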
With a small learning rate this can be a laborious (time and computer resource intensive) task. We do not want to make the learning rate too large, though. This might lead to an overshoot when we do get close to the minimum.
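We can see this overshoot with our parabola. With a deliberately bad learning rate of \(h = 1.1\) (an arbitrary value chosen to make the point), each step of equation (2) jumps to the other side of the minimum and ends up further away than before.

# a learning rate that is far too large for f(x) = x^2
x_i <- 2
h <- 1.1                      # deliberately too large
for (i in 1:10) {
  x_i <- x_i - (2 * x_i) * h  # each step overshoots and flips the sign of x
}
x_i  # roughly 12.4, so the iterates move away from the minimum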
We also need to run enough iterations of this algorithm to approach the minimum. This too takes time and resources.
The last common problem we will mention here is that of local minima, where the algorithm terminates at a local minimum, which is not the global minimum.
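As a small illustration, consider the arbitrary function \(f \left( x \right) = x^4 - 3 x^2 + x\), which has both a local and a global minimum. Its derivative is \(f' \left( x \right) = 4 x^3 - 6 x + 1\), and the starting point determines which minimum the algorithm settles in.

# gradient descent on f(x) = x^4 - 3x^2 + x from two different starting points
descend <- function(x_i, h = 0.01, steps = 5000) {
  for (i in 1:steps) {
    D <- 4 * x_i^3 - 6 * x_i + 1  # the derivative f'(x)
    x_i <- x_i - D * h            # the update step from equation (2)
  }
  x_i
}
descend(2)   # settles near the local minimum at roughly x = 1.13
descend(-2)  # settles near the global minimum at roughly x = -1.30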
All of these concepts will become clear as we learn more about the cost function and back propagation.