This chapter puts the concepts of the preceding chapter to good use. The aim is to expand intuition about deep learning by becoming more intimately familiar with the idea of learning the values of parameters. The values that the parameters ultimately take bring the predicted values as close as possible to the real target values. The actual values of the latter are referred to as the ground truth.
The act of learning is expressed in mathematical form. This means that the ultimate goal is to create a function for which a minimum value can be calculated. Understandably, and for now, this should make very little sense!
Linear regression with a single feature variable provides the simplest example to clear up this understanding and forms the basis for this chapter.
The emphasis is on a single feature variable predicting a target variable. Equation (1) below is taken from the preceding chapter and shows how a single value in the target variable set is predicted (calculated) from the corresponding feature variable value.
Probably the most difficult aspect of equation (1) is ridding the memory of school algebra, where \(x\) and \(y\) were the variables. In equation (1) they are, in fact, constants. Each pair of values (a feature and target variable value pair) consists of constants. It is \(\beta_0\) and \(\beta_1\) that are the variables. Given a value pair of \(\left(2, 4 \right)\) and replacing \(\beta_0\) and \(\beta_1\) with the more familiar school variables (not to be confused with the \(x\) and \(y\) in equation (1)), the equation would read \(4 = x + 2y\). In more common form, and through algebraic manipulation, this is the same as \(y = -\frac{1}{2}x + 2\).
\[ \hat{y}_i \left( x_i \right) = \beta_0 + \beta_1 x_i \tag{1} \]
Equation (1) is not plucked from the air. It is a linear equation (a straight line) that aims to draw a straight line through the set of points in a graph (representing the value pairs) and thereby serves as a model. From this model, and given appropriate values for \(\beta_0\) and \(\beta_1\), a future value of the target variable can be predicted (calculated) given a value for the feature variable.
Remember too that \(\hat{y}_i\) is the predicted value and that \(i\) takes on counting values from \(1\) to \(n\), where \(n\) is the number of samples. The actual corresponding ground truth (target) value for pair \(i\) is \(y_i\).
In real life, given values for \(\beta_0\) and \(\beta_1\), every predicted value, \(\hat{y}_i\), will be slightly different from the ground truth value, \(y_i\). The squared error is a way of quantifying this error (the difference between the two values). The error can be calculated for every value pair, \(\left( x_i , y_i \right)\). In deep learning, this error is referred to as the loss function, \(L\), given in equation (2).
\[ L \left( x_i \right) = {\left[ \hat{y}_i \left( x_i \right) - y_i \right]}^{2} \tag{2} \]
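As a quick illustration of equations (1) and (2) in R, the code below calculates the predicted value and the squared error for a single, hypothetical pair. The chosen values for \(\beta_0\), \(\beta_1\), and the pair itself are arbitrary and serve only to show the arithmetic.
b0 <- 1.0 # Arbitrary (hypothetical) value for beta_0
b1 <- 0.5 # Arbitrary (hypothetical) value for beta_1
x.i <- 2 # A single feature variable value
y.i <- 4 # The corresponding ground truth target value
y.hat.i <- b0 + b1 * x.i # Equation (1): the predicted value
(y.hat.i - y.i)^2 # Equation (2): the squared error for this pair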
The loss function is calculated for each pair in the \(n\)-sample dataset. There are many ways to combine this loss function over all of the \(n\) samples. One way is to average over all the errors, i.e. summing all the \(n\) errors and dividing by \(n\). This is shown in equation (3).
\[ C \left( \beta_0 , \beta_1 \right) = \frac{1}{n} \sum_{i=1}^{n} L \left( x_i \right) \tag{3} \]
Substituting equation (2), and then equation (1), into equation (3) gives the complete cost function, shown in equation (4) below.
\[ C \left( \beta_0 , \beta_1 \right) = \frac{1}{n} \sum_{i=1}^{n} {\left[ \beta_0 + \beta_1 x_i - y_i \right]}^{2} \tag{4} \]
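A minimal sketch of equation (4) as an R function is shown below. The function name cost and its argument names are illustrative choices, not part of the derivation; the function will be reused later in the chapter.
cost <- function(b0, b1, x, y) {
  mean((b0 + b1 * x - y)^2) # Mean squared error over all n pairs, as in equation (4)
}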
The aim, as mentioned in the introduction, is to minimize the cost function by changing the parameters \(\beta_0\) and \(\beta_1\).
The best way to understand how the cost function is minimized is by example. Below are two computer variables (objects), feature.var and target.var. They represent a linear regression problem, where the aim is to solve for values of \(\beta_0\) and \(\beta_1\) so that the values in feature.var can be used to predict the values in target.var.
There are five pairs of values. The feature variable values are hard-coded and the target variable values are created by adding a random value to each of the five feature variable values.
set.seed(1234) # For reproducibility
feature.var <- c(1.3, 2.1, 2.9, 3.1, 3.3) # Five hard-coded values
target.var <- feature.var + round(rnorm(5,mean = 0,sd = 0.5),digits = 1) # Adding random noise
Below is a scatter plot of the five value pairs. The feature variable value of each marker (dot) is on the \(x\)-axis (independent variable) and the target variable value of each marker is on the \(y\)-axis (dependent variable).
library(plotly) # For the interactive scatter plot
p <- plot_ly(type = "scatter",
             mode = "markers",
             x = ~feature.var,
             y = ~target.var,
             marker = list(size = 14,
                           color = "rgba(255, 180, 190, 0.8)",
                           line = list(color = "rgba(150, 0, 0, 0.8)",
                                       width = 2))) %>%
  layout(title = "Scatter plot",
         xaxis = list(title = "Feature variable", zeroline = FALSE),
         yaxis = list(title = "Target variable", zeroline = FALSE))
p
The code chunk below shows the values of the two variables as row vectors.
feature.var
## [1] 1.3 2.1 2.9 3.1 3.3
target.var
## [1] 0.7 2.2 3.4 1.9 3.5
All five pairs of values can now be plugged into equation (4). This is shown in equation (5) below.
\[ C = \frac{1}{5} \times \left\{ { \left[ \beta_0 + \beta_1 \left( 1.3 \right) - 0.7 \right]}^{2} + { \left[ \beta_0 + \beta_1 \left( 2.1 \right) - 2.2 \right]}^{2} + { \left[ \beta_0 + \beta_1 \left( 2.9 \right) - 3.4 \right]}^{2} \\ + { \left[ \beta_0 + \beta_1 \left( 3.1 \right) - 1.9 \right]}^{2} + { \left[ \beta_0 + \beta_1 \left( 3.3 \right) - 3.5 \right]}^{2} \right\} \tag{5} \]
Simple algebraic manipulation results in equation (6).
\[ C = 6.55 - 4.68 {\beta}_{0} + {\beta}_{0}^{2} - 13.132 {\beta}_{1} + 5.08 {\beta}_{0} {\beta}_{1} + 7.002 {\beta}_{1}^{2} \tag{6} \]
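As a sanity check (assuming the cost() sketch from earlier in the chapter), the expanded polynomial in equation (6) returns the same value as the direct calculation of equation (4) for any choice of parameter values, for example \(\beta_0 = 1\) and \(\beta_1 = 1\).
b0 <- 1
b1 <- 1
cost(b0, b1, feature.var, target.var) # Direct calculation from equation (4)
6.55 - 4.68 * b0 + b0^2 - 13.132 * b1 + 5.08 * b0 * b1 + 7.002 * b1^2 # Equation (6); both return 1.82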
Note that this is an equation in two unknowns and that it can be graphed in 3D space, as shown in the figure below.
All of this brings us to a very simple conclusion. The mathematical concept of minimizing the error is simply finding the values for \(\beta_0\) and \(\beta_1\) that correspond to the lowest point in the 3D graph, called the global minimum.
Since the problem has been reduced to that of a mathematical function for which a global minimum must be found, partial differentiation with respect to each variable allows for the calculation of this minimum.
In this extremely simple example of a single feature variable with two unknowns, the global minimum is calculated using the two partial derivatives shown in equation (7) below.
\[ \frac{\partial C}{\partial \beta_0} = 2 \beta_0 + 5.08 \beta_1 - 4.68 \\ \frac{\partial C}{\partial \beta_1} = 5.08 \beta_0 + 14.004 \beta_1 - 13.132 \tag{7} \]
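These partial derivatives can be checked numerically with a small finite-difference approximation, again assuming the cost() sketch from earlier in the chapter. The step size h below is an arbitrary small value.
h <- 1e-6 # Small step for the finite-difference approximation
b0 <- 1 # Arbitrary point at which to check the derivatives
b1 <- 1
(cost(b0 + h, b1, feature.var, target.var) - cost(b0, b1, feature.var, target.var)) / h # Approximates 2*b0 + 5.08*b1 - 4.68
(cost(b0, b1 + h, feature.var, target.var) - cost(b0, b1, feature.var, target.var)) / h # Approximates 5.08*b0 + 14.004*b1 - 13.132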
Setting both partial derivatives equal to \(0\) results in two equations with two unknowns. These two equations are solved very easily through row-reduction of an augmented matrix, shown in equation (8).
\[ 2 \beta_0 + 5.08 \beta_1 - 4.68 = 0 \\ 5.08 \beta_0 + 14.004 \beta_1 - 13.132 = 0 \\ 2 \beta_0 + 5.08 \beta_1 = 4.68 \\ 5.08 \beta_0 + 14.004 \beta_1 = 13.132 \\ \begin{bmatrix} 2 & 5.08 & 4.68 \\ 5.08 & 14.004 & 13.132 \end{bmatrix} \tag{8} \]
Equation (9) shows the row-reduced form of the matrix above.
\[ \begin{bmatrix} 1 & 0 & -0.532267 \\ 0 & 1 & 1.13081 \end{bmatrix} \tag{9} \]
From this row-reduced augmented matrix, the final values for \(\beta_0\) and \(\beta_1\) are shown in equation (10) below.
\[ \beta_0 = -0.532267 \\ \beta_1 = 1.13081 \tag{10} \]
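The same system of two equations can also be solved numerically in R with the built-in solve() function, which provides a quick check on the row-reduction above.
A <- matrix(c(2, 5.08,
              5.08, 14.004),
            nrow = 2, byrow = TRUE) # Coefficient matrix from equation (8)
b <- c(4.68, 13.132) # Right-hand side from equation (8)
solve(A, b) # Returns the values of beta_0 and beta_1 shown in equation (10)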
As before, the lm() function in R shows the results, which are exactly those in equation (10).
model <- lm(target.var ~ feature.var)
summary(model)
##
## Call:
## lm(formula = target.var ~ feature.var)
##
## Residuals:
## 1 2 3 4 5
## -0.2378 0.3576 0.6529 -1.0733 0.3006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5323 1.2536 -0.425 0.700
## feature.var 1.1308 0.4737 2.387 0.097 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7859 on 3 degrees of freedom
## Multiple R-squared: 0.6551, Adjusted R-squared: 0.5401
## F-statistic: 5.698 on 1 and 3 DF, p-value: 0.097
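A simple way to confirm that these coefficient values do minimize the cost is to evaluate the cost() sketch from earlier in the chapter at the fitted values and at a few nearby points; the perturbation of 0.1 below is an arbitrary choice.
cost(-0.532267, 1.13081, feature.var, target.var) # Cost at the fitted parameter values
cost(-0.532267 + 0.1, 1.13081, feature.var, target.var) # Larger than the value above
cost(-0.532267, 1.13081 + 0.1, feature.var, target.var) # Also larger than the value at the minimum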
For this problem, and for those with more than a single feature variable, an alternative method for finding the global minimum is the process of gradient descent.
This process involves selecting an arbitrary (random) starting value for each parameter. To simplify the explanation, as was done above, and thereby maximize the likelihood of intuitive understanding, the problem can be reduced to a single parameter (rather than both \(\beta_0\) and \(\beta_1\)). Instead of a 3D graph, this results in a 2D graph. To simplify matters to the extreme, an example from school will suffice.
Consider then the equation \(y = x^2\). Again, these are not to be confused with \(x_i\) and \(y_i\) from above. In fact, here \(x\) represents only \(\beta_1\). The graph of this equation is shown below.
Clearly, the global minimum is at \(x=0\), i.e. \(y\) is at its lowest point when \(x=0\). The first derivative of \(y\) with respect to \(x\) is \(2x\). This gives the slope of the curve at any given value of \(x\). Starting at an arbitrary point, say \(x=-2\), gives a slope of \(2 \times \left( -2 \right) = -4\).
This is a rather steep (negative) slope. At the global minimum, the slope will be \(0\). Clearly, there is a need to step closer to the point \(x=0\). This is done by updating the point \(x=-2\) by subtracting a small value times the current slope. If this small value is \(0.01\) for the sake of argument, the update becomes \(- \left[ 0.01 \times \left( -4 \right) \right] = + 0.04\).
The new \(x\) value is now \(-2 + 0.04 = -1.96\). In later chapters, this use of the derivative to update the parameter values will reappear as part of backpropagation. The process is repeated many times until the minimum is reached. While it is simple to see from this contrived example where the global minimum is, this is not so trivial in multi-dimensional space with a complicated, convoluted graph. This process of gradient descent is a reliable method of searching for the minimum, thereby minimizing the cost function.
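A minimal sketch of this update rule for \(y = x^2\) is shown below. The step size of \(0.01\) and the number of iterations are illustrative choices rather than prescriptions.
x <- -2 # Arbitrary starting value
step.size <- 0.01 # The small value that multiplies the slope
for (i in 1:1000) {
  slope <- 2 * x # First derivative of y = x^2 at the current x
  x <- x - step.size * slope # Update x by stepping against the slope
}
x # Very close to the global minimum at x = 0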
The problem of predicting target variable values given feature variable values was reduced to the creation of a cost function for which a global minimum could be calculated. The global minimum represents the parameter values that bring the predicted values as close to the ground-truth target values as is possible. This process forms the bedrock of deep learning.