library(readr)
library(tibble)

Watch the video series on YouTube at https://www.youtube.com/watch?v=9-QYsN_knG4&list=PLsu0TcgLDUiIKPMXu1k_rItoTV8xPe1cj
Download the files on GitHub at https://github.com/juanklopper/Deep-learning-using-R

Introduction

With the knowledge of the preceding chapters, it is time to view regression as a shallow network. All the calculations remain exactly as before. Conceptualizing the process as a neural network is the aim of this chapter.

Multiple linear regression as a network

Whereas the previous chapter considered only a single feature variable, the dataset below contains three feature variables, all of continuous numerical data type, and a target variable of the same type.

# Import a spreadsheet file
df <- read_csv("MultipleLinearRegression.csv")
## Parsed with column specification:
## cols(
##   x1 = col_double(),
##   x2 = col_double(),
##   x3 = col_double(),
##   y = col_double()
## )
df
## # A tibble: 10 x 4
##       x1    x2    x3     y
##    <dbl> <dbl> <dbl> <dbl>
##  1  20.1  39.3   1.3 394. 
##  2  23.6  31.6   1.5 211. 
##  3  29.2  36.9   1.4 251. 
##  4  29.3  34.1   1.2  85.4
##  5  30    37.2   1.2 249. 
##  6  22.9  39.3   1.9  46  
##  7  25.1  33     1.3 252. 
##  8  27.7  36     2   315. 
##  9  24.7  34.5   1.3 120. 
## 10  24.2  39.8   1.5 110.

Note that there are \(10\) samples and \(3\) feature variables. The aim is to calculate values for \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\) that will minimize a specific cost function.

If each of the variables is seen as a column vector, expressed with an underline, \(\underline{x}_1\), \(\underline{x}_2\), and \(\underline{x}_3\), the requirement is to find values such that the predicted target is as close as possible to the ground-truth values in the column vector \(\underline{y}\), as seen in equation (1).

\[\beta_0 + \beta_1 \underline{x}_1 + \beta_2 \underline{x}_2 + \beta_3 \underline{x}_3 \approx \underline{y} \tag{1}\]

As before, the loss function is calculated for every sample (row in the table above). The notation changes to a superscript \(\left( i \right)\) to indicate each sample, and the loss function is given in equation (2).

\[L^{\left( i \right)} \left( \beta_0 , \beta_1 , \beta_2 , \beta_3 \right) = { \left( \beta_0 + \beta_1 x_1^{\left( i \right)} + \beta_2 x_2^{\left( i \right)} + \beta_3 x_3^{\left( i \right)} - y^{\left( i \right)} \right) }^{2} \tag{2}\]

The cost function averages all of the losses, and finally the process of gradient descent results in optimal values for \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\).
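
To make equations (1) and (2) concrete, the cost for an arbitrary set of parameter values can be computed directly. The sketch below is for illustration only; the chosen values for the \(\beta\) parameters are arbitrary and are certainly not the optimal ones.

# Arbitrary (illustrative) values for the parameters
b0 <- 100
b1 <- 1
b2 <- 1
b3 <- 1

# Predicted target for every sample, as in equation (1)
y_hat <- b0 + b1 * df$x1 + b2 * df$x2 + b3 * df$x3

# Squared loss for every sample, as in equation (2)
loss <- (y_hat - df$y)^2

# The cost is the average of all the losses
mean(loss)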

Using the lm() function in R provides a quick and easy way of calculating the optimal values.

model <- lm(y ~ x1 + x2 + x3,
            data = df)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -151.55  -96.65   22.22   56.09  164.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.7300   777.7247   0.441    0.675
## x1           -3.5850    14.2899  -0.251    0.810
## x2            0.2326    16.6230   0.014    0.989
## x3          -38.0145   166.3937  -0.228    0.827
## 
## Residual standard error: 134.4 on 6 degrees of freedom
## Multiple R-squared:  0.01687,    Adjusted R-squared:  -0.4747 
## F-statistic: 0.03433 on 3 and 6 DF,  p-value: 0.9906

The solution shows \(\beta_0 = 342.7\), \(\beta_1 = -3.6\), \(\beta_2 = 0.2\), and \(\beta_3 = -38.0\). (Note the poor Multiple R-squared value and the high p values. The data was created randomly, so even the best values for all the \(\beta\) parameters will still produce predictions that are well off the mark.)
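
As a quick check of the fit, the parameter values and the model's predicted values can be extracted with the standard coef(), predict(), and residuals() functions and placed next to the ground-truth targets.

# The optimal parameter values
coef(model)

# Predicted values next to the ground-truth target values
tibble(y = df$y,
       y_hat = predict(model),
       residual = residuals(model))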

Flow diagram

The above model can be represented as a flow diagram. This allows for the introduction of the concept of neurons, called perceptrons (more about these in upcoming chapters).

The diagram below expresses all of the calculations above. The feature variables are turned on their sides (transposed) and represent the input state of the network.

Following this is a layer of three neurons. The value that each of these neurons, called nodes, takes is the product of the corresponding input node and its associated parameter, called a weight.

The last layer is a single node that takes as input the sum of all three previous nodes plus the value held in \(\beta_0\), called the bias node. This is the predicted or output value, and it will be compared to the target value.
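
To make the diagram concrete, the sketch below calculates the three hidden node values and the output node value for the first row of data, reusing the coefficients found by lm() above purely as an illustration of the flow through the network.

# Parameter values (bias and weights) from the lm() solution
b <- coef(model)

# Input nodes: the feature values of the first sample
x <- unlist(df[1, c("x1", "x2", "x3")])

# Hidden nodes: each input value multiplied by its weight
hidden <- b[2:4] * x
hidden

# Output node: the sum of the hidden nodes plus the bias
b[1] + sum(hidden)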

And that is it! A neural network (of sorts) with a single deep (hidden) layer. This layer holds values that are different from the actual input values (the actual feature variable values for each subject or row).

As before, all of the rows will pass through the network and a cost function will be created, consisting of four unknowns, \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\).

This whole process as described is known as forward propagation. This is followed by the process of backpropagation. This uses the process of gradient descent to update the values of \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\). These new values are then used in another forward propagation. Each pair of forward propagation and backpropagation is known as an epoch.
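
A minimal sketch of this forward and backward loop, written directly in R, is given below. The learning rate, the number of epochs, and the standardization of the features (done here only to keep plain gradient descent numerically stable) are illustrative assumptions and not part of the description above; because of the standardization, the learned weights are on a different scale to the lm() coefficients.

# Reproducible random initialization
set.seed(42)

# Design matrix of standardized features and the target vector
X <- scale(as.matrix(df[, c("x1", "x2", "x3")]))
y <- df$y
n <- nrow(X)

# Initialization: random starting values for the weights and the bias
w <- matrix(rnorm(3, mean = 0, sd = 0.1), ncol = 1)
b0 <- 0

alpha  <- 0.01   # learning rate (assumed)
epochs <- 1000   # number of forward/backward passes (assumed)

for (epoch in seq_len(epochs)) {
  # Forward propagation: predicted values and the cost
  y_hat <- b0 + X %*% w
  cost  <- mean((y_hat - y)^2)   # would normally be monitored across epochs

  # Backpropagation: gradients of the cost with respect to the parameters
  error   <- as.numeric(y_hat - y)
  grad_w  <- (2 / n) * t(X) %*% error
  grad_b0 <- (2 / n) * sum(error)

  # Gradient descent: update the parameters
  w  <- w  - alpha * grad_w
  b0 <- b0 - alpha * grad_b0
}

b0
w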

During the first forward pass, the values for \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\), called the weights, are initialized. This is the process of providing each of the weights with a random value, allowing for the calculation of actual values for the hidden nodes and the output node. Backpropagation through gradient descent then updates all of these values with (hopefully) better ones. Below is a graph of the first row of data.

Conclusion

The process of forward propagation and backpropagation allows a neural network to learn the optimal values of its parameters such that the best possible prediction for a target variable can be made.

It is, simply stated, a very, very elegant process, transforming the idea of learning into mathematical functions.

The important take-aways from this chapter are two-fold:

  1. New terms are defined for old concepts introduced in the preceding chapters, i.e. the unknowns or parameters are now called weights.
  2. Hidden layers consist of nodes that hold the calculated values from which the weights can be learned through numerous epochs.