The R Markdown file is available at https://www.dropbox.com/sh/tpg8cj8kqzx147e/AACgWM0qOa9Y_wXdLrDt8JlKa?dl=0

Suppose that we have a dataset with two independent variables \(x_1\) and \(x_2\) and one dependent variable, class, indicating whether or not a student achieved a specific skill. A few rows of the data (stored in a data frame called skills):

   x1     x2  class
 7.96  -6.19  not achieved
 6.03  -8.11  not achieved
 2.05  -3.00  not achieved
 9.78   0.70  not achieved
-0.83   7.57  not achieved
-5.40   1.60  not achieved

The plot of the data points is as follows:
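
As a minimal sketch for reproducing this plot (assuming the data are already loaded into a data frame called skills with columns x1, x2, and class; the colours are arbitrary):

plot(skills$x1, skills$x2,
     col = ifelse(skills$class == 'achieved', 'blue', 'red'),
     pch = 19, xlab = 'x1', ylab = 'x2')
legend('topright', legend = c('achieved', 'not achieved'),
       col = c('blue', 'red'), pch = 19)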

First, let’s fit a simple logistic regression in R to predict class:

## logistic regression; class must be a factor (or 0/1) for family = binomial
lr_mod <- glm(class ~ x1 + x2, family = binomial, data = skills)

The classification accuracy of the logistic regression is 0.67. Let us also take a look at the boundary between the two classes.
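
A sketch of how the accuracy and the boundary can be obtained (the 0.5 cutoff is an assumption, and class is assumed to be a factor, whose second level is the one modelled by glm):

## classification accuracy at a 0.5 cutoff
p <- predict(lr_mod, type = 'response')  # probability of the second factor level
pred <- ifelse(p > 0.5, levels(skills$class)[2], levels(skills$class)[1])
mean(pred == skills$class)

## the boundary is the line where the linear predictor is zero:
## b0 + b1*x1 + b2*x2 = 0  =>  x2 = -(b0 + b1*x1)/b2
cf <- coef(lr_mod)
abline(a = -cf[1]/cf[3], b = -cf[2]/cf[3], lty = 2)  # add to the scatter plot above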

Artificial Neural Networks

The structure of the ANN is simple: it contains only a single hidden layer.

x <- data.matrix(skills[c('x1','x2')])  # predictors as a numeric matrix
y <- skills$class == 'achieved'         # TRUE if achieved, FALSE otherwise

Activation function

We use the sigmoid activation function: \[f(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]

sigmoid <- function(x) {
  1 / (1 + exp(-x))
}
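
A quick sanity check of the implementation:

sigmoid(c(-10, 0, 10))  # approaches 0, equals 0.5, approaches 1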

Forward propagation

First, compute the weighted linear combination of the inputs: \[ \mathbf z_1 = \mathbf{XW}_1 = \begin{bmatrix}\mathbf 1 & \mathbf x\end{bmatrix} \mathbf W_1\] Second, pass the weighted sums through the activation function to get the activation values of the hidden nodes: \[\mathbf h = \sigma(\mathbf z_1)\]

Third, compute the weighted linear combination of the hidden nodes: \[ \mathbf z_2 = \mathbf{hW}_2 = \begin{bmatrix}\mathbf 1 & \mathbf h\end{bmatrix} \mathbf W_2\]

Lastly, pass \(\mathbf z_2\) through the activation function to get the outputs: \[\hat{\mathbf y} = \sigma(\mathbf z_2)\]

feedforward <- function(x, w1, w2) {
  z1 <- cbind(1, x) %*% w1  # weighted sum of the inputs (bias column prepended)
  h <- sigmoid(z1)          # activation values of the hidden nodes
  z2 <- cbind(1, h) %*% w2  # weighted sum of the hidden nodes (bias column prepended)

  list(output = sigmoid(z2), h = h)  # predicted probabilities and hidden activations
}
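
For example, with randomly initialized weights (an illustrative sketch; the seed and the choice of 5 hidden nodes are arbitrary), feedforward returns one predicted probability per student and one activation value per hidden node:

set.seed(42)
w1_init <- matrix(rnorm(3 * 5), 3, 5)  # (1 bias + 2 inputs) x 5 hidden nodes
w2_init <- as.matrix(rnorm(5 + 1))     # (1 bias + 5 hidden nodes) x 1 output
ff <- feedforward(x, w1_init, w2_init)
dim(ff$output)  # nrow(x) x 1 matrix of predicted probabilities
dim(ff$h)       # nrow(x) x 5 matrix of hidden activations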

Backpropagation

First, measure the difference between the predictions \(\hat{\mathbf y}\) and the truth \(\mathbf y\) using the cross-entropy loss: \[L = -\sum_i\bigl(y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr)\] Second, to find the parameters \(\mathbf W_1\) and \(\mathbf W_2\) that minimize the loss function \(L\) by gradient descent, we use the chain rule to compute the gradient of the loss with respect to the two weight matrices.

For \(\mathbf W_2\): \[\frac{\partial L}{\partial \mathbf W_2} = \frac{\partial L}{\partial\hat{\mathbf y}}\frac{\partial\hat{\mathbf y}}{\partial\mathbf W_2}\] where \[\frac{\partial L}{\partial\hat{\mathbf y}} = -\left(\frac{\mathbf y}{\mathbf {\hat y}} - \frac{1 - \mathbf y}{1- \mathbf {\hat y}}\right)\] \[\frac{\partial\hat{\mathbf y}}{\partial\mathbf W_2} = \mathbf H^T \sigma(\mathbf {HW}_2)(1-\sigma(\mathbf {HW}_2)) = \mathbf H^T\mathbf {\hat y}(1-\mathbf {\hat y})\] For \(\mathbf W_1\): \[\frac{\partial L}{\partial \mathbf W_1} = \frac{\partial L}{\partial\hat{\mathbf y}}\frac{\partial\hat{\mathbf y}}{\partial\mathbf H}\frac{\partial \mathbf H}{\partial\mathbf W_1}\] where \[\frac{\partial\hat{\mathbf y}}{\partial\mathbf H} = \sigma(\mathbf {HW}_2)(1-\sigma(\mathbf {HW}_2))\mathbf W_2^T = \hat{\mathbf y}(1-\hat{\mathbf y})\mathbf W_2^T\] \[\frac{\partial \mathbf H}{\partial\mathbf W_1} = \mathbf X^T \begin{bmatrix} \mathbf 0 & \sigma(\mathbf{XW}_1)\bigl( 1 - \sigma(\mathbf{XW}_1) \bigr) \end{bmatrix} = \mathbf X^T \begin{bmatrix} \mathbf 0 & \mathbf H( 1 - \mathbf H) \end{bmatrix} \]

Multiplying these terms out, the gradients simplify to \[\frac{\partial L}{\partial \mathbf W_2} = -\mathbf H^T (\mathbf y - \mathbf{\hat{y}}), \qquad \frac{\partial L}{\partial \mathbf W_1} = -\mathbf X^T \bigl[(\mathbf y - \mathbf{\hat{y}}) \mathbf W_2^T \odot \mathbf H( 1 - \mathbf H)\bigr],\] where \(\odot\) denotes element-wise multiplication and the bias row of \(\mathbf W_2\) is dropped. Gradient descent moves the weights in the direction of the negative gradient, which is exactly what dw2 and dw1 compute in the code below.

backpropagate <- function(x, y, y_hat, h, w1, w2, learn_rate) {
  dw2 <- t(cbind(1, h)) %*% (y - y_hat)             # negative gradient w.r.t. W2
  dh  <- (y - y_hat) %*% t(w2[-1, , drop = FALSE])  # error propagated back to the hidden layer
  dw1 <- t(cbind(1, x)) %*% (h * (1 - h) * dh)      # negative gradient w.r.t. W1

  w1 <- w1 + learn_rate * dw1  # gradient-descent updates
  w2 <- w2 + learn_rate * dw2

  list(w1 = w1, w2 = w2)
}
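
As an optional sanity check (a sketch that is not part of the original post; xent, w1_chk, and w2_chk are names introduced only here), the analytic gradient can be compared against a finite-difference approximation of the cross-entropy loss defined above:

## cross-entropy loss, L = -sum(y*log(y_hat) + (1-y)*log(1-y_hat))
xent <- function(x, y, w1, w2) {
  y_hat <- feedforward(x, w1, w2)$output
  -sum(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

set.seed(1)
w1_chk <- matrix(rnorm(3 * 5), 3, 5)
w2_chk <- as.matrix(rnorm(6))
ff_chk <- feedforward(x, w1_chk, w2_chk)

## analytic gradient of L w.r.t. the first entry of W2 is -[H^T (y - y_hat)][1]
dw2 <- t(cbind(1, ff_chk$h)) %*% (y - ff_chk$output)
an_grad <- -dw2[1]

## finite-difference approximation of the same derivative
eps <- 1e-6
w2_eps <- w2_chk
w2_eps[1] <- w2_eps[1] + eps
num_grad <- (xent(x, y, w1_chk, w2_eps) - xent(x, y, w1_chk, w2_chk)) / eps

all.equal(an_grad, num_grad, tolerance = 1e-4)  # should be TRUE (up to numerical error)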

Training function

train <- function(x, y, hidden = 5, learn_rate = 1e-2, iterations = 1e4) {
  d <- ncol(x) + 1
  w1 <- matrix(rnorm(d * hidden), d, hidden) # initialize the weight matrix
  w2 <- as.matrix(rnorm(hidden + 1))
  for (i in 1:iterations) {
    ff <- feedforward(x, w1, w2) # calculate the forward propagation
    bp <- backpropagate(x, y,
                        y_hat = ff$output,
                        w1, w2,
                        h = ff$h,
                        learn_rate = learn_rate) # backpropagation to update the parameters
    w1 <- bp$w1
    w2 <- bp$w2
  }
  list(output = ff$output, w1 = w1, w2 = w2)
}

Train the ANN with 5 hidden nodes and visualize the results

## training model
nn_h5 <- train(x, y, hidden = 5, learn_rate = 1e-2, iterations = 10000)

## training accuracy
correct <- mean((nn_h5$output > .5) == y)

The training accuracy of the ANN with 5 hidden nodes is 0.705.
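
The decision regions learned by the network can be visualized by evaluating feedforward on a grid of points and colouring each point by its predicted class (a sketch; the grid resolution and colours are arbitrary, and the same code works for the 60-node model below):

grid <- expand.grid(x1 = seq(min(skills$x1) - 1, max(skills$x1) + 1, length.out = 100),
                    x2 = seq(min(skills$x2) - 1, max(skills$x2) + 1, length.out = 100))
grid_pred <- feedforward(data.matrix(grid), nn_h5$w1, nn_h5$w2)$output > .5
plot(grid$x1, grid$x2, pch = 15, cex = 0.3,
     col = ifelse(grid_pred, 'lightblue', 'pink'),
     xlab = 'x1', ylab = 'x2')
points(skills$x1, skills$x2, pch = 19,
       col = ifelse(skills$class == 'achieved', 'blue', 'red'))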

Let’s increase the nonlinearity by adding more hidden nodes (60):

## training model
nn_h60 <- train(x, y, hidden = 60, iterations = 100000)

## training accuracy
correct <- mean((nn_h60$output > .5) == y)

The training accuracy of the ANN with 60 hidden nodes is 1.