The R Markdown file is available from the following link: https://www.dropbox.com/sh/tpg8cj8kqzx147e/AACgWM0qOa9Y_wXdLrDt8JlKa?dl=0
Suppose that we have a dataset with two independent variables \(x_1\) and \(x_2\), and one dependent variable class indicating whether or not a student achieved a specific skill.
| x1 | x2 | class |
|---|---|---|
| 7.96 | -6.19 | not achieved |
| 6.03 | -8.11 | not achieved |
| 2.05 | -3.00 | not achieved |
| 9.78 | 0.70 | not achieved |
| -0.83 | 7.57 | not achieved |
| -5.40 | 1.60 | not achieved |
The plot of the data points is shown below.
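A minimal ggplot2 sketch that reproduces such a scatter plot (assuming the skills data frame is loaded and the ggplot2 package is available) would be:

library(ggplot2)
ggplot(skills, aes(x = x1, y = x2, colour = class)) +
  geom_point()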
First, let’s fit a simple logistic regression in R to predict class:
lr_mod <- glm(class ~ x1 + x2, family = binomial, data = skills)  # logistic regression model
The classification accuracy of the logistic regression is 0.67. Let us take a look at the decision boundary between the two classes.
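As a sketch of how the accuracy and the boundary could be reproduced (assuming the lr_mod and skills objects from above, the default alphabetical factor-level ordering so that the model predicts the probability of 'not achieved', and ggplot2 loaded):

pred <- predict(lr_mod, type = 'response') > 0.5  # predicted 'not achieved' (non-reference level)
mean(pred == (skills$class == 'not achieved'))    # classification accuracy
# The decision boundary is where the linear predictor is zero:
# b0 + b1 * x1 + b2 * x2 = 0  =>  x2 = -(b0 + b1 * x1) / b2
b <- coef(lr_mod)
ggplot(skills, aes(x = x1, y = x2, colour = class)) +
  geom_point() +
  geom_abline(intercept = -b[1] / b[3], slope = -b[2] / b[3])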
The structure of the ANN is simple: it contains only one hidden layer.
x <- data.matrix(skills[c('x1', 'x2')])  # predictor matrix
y <- skills$class == 'achieved'          # TRUE if the student achieved the skill
We use the sigmoid activation function: \[\sigma(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}\]
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}
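A convenient property of the sigmoid, used repeatedly in the backpropagation step below, is that its derivative can be written in terms of its own value: \[\sigma'(x) = \frac{\exp(-x)}{\bigl(1+\exp(-x)\bigr)^2} = \sigma(x)\bigl(1-\sigma(x)\bigr)\] This is why terms such as \(\hat{\mathbf y}(1-\hat{\mathbf y})\) and \(\mathbf h(1-\mathbf h)\) appear in the gradients derived below.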
First, compute the weighted linear combination of the inputs \[ \mathbf z_1 = \mathbf{XW}_1 = \begin{bmatrix}\mathbf 1 & \mathbf x\end{bmatrix} \mathbf W_1\] Second, pass the weighted sum through the activation function to get the activation values of the hidden nodes \[\mathbf h = \sigma(\mathbf z_1)\]
Third, compute the weighted linear combination of the hidden nodes: \[ \mathbf z_2 = \mathbf{hW}_2 = \begin{bmatrix}\mathbf 1 & \mathbf h\end{bmatrix} \mathbf W_2\]
Lastly, pass \(\mathbf z_2\) through the activation function to get the outputs: \[\hat{\mathbf y} = \sigma(\mathbf z_2)\]
feedforward <- function(x, w1, w2) {
  z1 <- cbind(1, x) %*% w1           # weighted sum of the inputs (bias column added)
  h  <- sigmoid(z1)                  # activations of the hidden nodes
  z2 <- cbind(1, h) %*% w2           # weighted sum of the hidden nodes (bias column added)
  list(output = sigmoid(z2), h = h)  # predicted probabilities and hidden activations
}
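As a quick sanity check (a sketch with arbitrary dimensions chosen to match the defaults used later), the function can be run with randomly initialised weights; every output should be a probability strictly between 0 and 1:

set.seed(1)
w1 <- matrix(rnorm(3 * 5), 3, 5)  # (2 inputs + bias) x 5 hidden nodes
w2 <- as.matrix(rnorm(5 + 1))     # (5 hidden nodes + bias) x 1 output
ff <- feedforward(x, w1, w2)
range(ff$output)                  # all values lie in (0, 1)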
First, compute the difference between the predictions \(\hat{\mathbf y}\) and the truth \(\mathbf y\) using the cross-entropy loss: \[L = -\sum_i\bigl(y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr)\] Second, to find the parameters \(\mathbf W_1\) and \(\mathbf W_2\) that minimize the loss function \(L\) by gradient descent, we use the chain rule to compute the gradient of the loss with respect to the two weight matrices.
For \(\mathbf W_2\): \[\frac{\partial L}{\partial \mathbf W_2} = \frac{\partial L}{\partial\hat{\mathbf y}}\frac{\partial\hat{\mathbf y}}{\partial\mathbf W_2}\] where \[\frac{\partial L}{\partial\hat{\mathbf y}} = -\left(\frac{\mathbf y}{\mathbf {\hat y}} - \frac{1 - \mathbf y}{1- \mathbf {\hat y}}\right)\] \[\frac{\partial\hat{\mathbf y}}{\partial\mathbf W_2} = \mathbf H^T \sigma(\mathbf {HW}_2)(1-\sigma(\mathbf {HW}_2)) = \mathbf H^T\mathbf {\hat y}(1-\mathbf {\hat y})\] For \(\mathbf W_1\): \[\frac{\partial L}{\partial \mathbf W_1} = \frac{\partial L}{\partial\hat{\mathbf y}}\frac{\partial\hat{\mathbf y}}{\partial\mathbf H}\frac{\partial \mathbf H}{\partial\mathbf W_1}\] where \[\frac{\partial\hat{\mathbf y}}{\partial\mathbf H} = \sigma(\mathbf {HW}_2)(1-\sigma(\mathbf {HW}_2))\mathbf W_2^T = \hat{\mathbf y}(1-\hat{\mathbf y})\mathbf W_2^T\] \[\frac{\partial \mathbf H}{\partial\mathbf W_1} = \mathbf X^T \begin{bmatrix} \mathbf 0 & \sigma(\mathbf{XW}_1)\bigl( 1 - \sigma(\mathbf{XW}_1) \bigr) \end{bmatrix} = \mathbf X^T \begin{bmatrix} \mathbf 0 & \mathbf h( 1 - \mathbf h) \end{bmatrix} \] The zero column appears because the bias column of \(\mathbf H = \begin{bmatrix}\mathbf 1 & \mathbf h\end{bmatrix}\) does not depend on \(\mathbf W_1\).
Multiplying these factors together, the gradients of the loss with respect to \(\mathbf W_2\) and \(\mathbf W_1\) simplify to \[\frac{\partial L}{\partial \mathbf W_2} = \mathbf H^T (\mathbf{\hat{y}} - \mathbf y), \qquad \frac{\partial L}{\partial \mathbf W_1} = \mathbf X^T \Bigl[\bigl((\mathbf{\hat{y}} - \mathbf y)\,\tilde{\mathbf W}_2^T\bigr) \circ \mathbf h(1 - \mathbf h)\Bigr],\] where \(\tilde{\mathbf W}_2\) is \(\mathbf W_2\) without its bias row and \(\circ\) denotes element-wise multiplication. The gradient descent update \(\mathbf W \leftarrow \mathbf W - \eta\,\partial L/\partial\mathbf W\) therefore adds \(\eta\) times the corresponding \((\mathbf y - \mathbf{\hat{y}})\) terms, which is exactly what the code below does.
backpropagate <- function(x, y, y_hat, h, w1, w2, learn_rate) {
  dw2 <- t(cbind(1, h)) %*% (y - y_hat)             # minus the gradient w.r.t. w2
  dh  <- (y - y_hat) %*% t(w2[-1, , drop = FALSE])  # error propagated back through w2 (bias row dropped)
  dw1 <- t(cbind(1, x)) %*% (h * (1 - h) * dh)      # minus the gradient w.r.t. w1
  w1 <- w1 + learn_rate * dw1                       # gradient descent step for w1
  w2 <- w2 + learn_rate * dw2                       # gradient descent step for w2
  return(list(w1 = w1, w2 = w2))
}
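To check that the code agrees with the derivation, a small finite-difference comparison could be run against one entry of \(\mathbf W_2\) (a sketch reusing the x, y, w1, and w2 objects from the sanity check above; the loss and numeric_grad helpers are introduced here only for this check):

loss <- function(y, y_hat) -sum(y * log(y_hat) + (1 - y) * log(1 - y_hat))  # cross-entropy loss

numeric_grad <- function(i, j, eps = 1e-6) {
  w2a <- w2; w2a[i, j] <- w2a[i, j] + eps
  w2b <- w2; w2b[i, j] <- w2b[i, j] - eps
  (loss(y, feedforward(x, w1, w2a)$output) -
     loss(y, feedforward(x, w1, w2b)$output)) / (2 * eps)
}

ff <- feedforward(x, w1, w2)
analytic <- t(cbind(1, ff$h)) %*% (ff$output - y)  # dL/dW2 from the derivation
numeric_grad(1, 1) - analytic[1, 1]                # should be close to zero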
train <- function(x, y, hidden = 5, learn_rate = 1e-2, iterations = 1e4) {
  d <- ncol(x) + 1                            # number of inputs plus the bias
  w1 <- matrix(rnorm(d * hidden), d, hidden)  # initialize the hidden-layer weight matrix
  w2 <- as.matrix(rnorm(hidden + 1))          # initialize the output-layer weights
  for (i in 1:iterations) {
    ff <- feedforward(x, w1, w2)              # forward propagation
    bp <- backpropagate(x, y,
                        y_hat = ff$output,
                        w1, w2,
                        h = ff$h,
                        learn_rate = learn_rate)  # backpropagation to update the parameters
    w1 <- bp$w1
    w2 <- bp$w2
  }
  list(output = ff$output, w1 = w1, w2 = w2)
}
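Putting everything together, the network can be trained on the skills data and its accuracy compared with the 0.67 of the logistic regression (a sketch; the seed, the 5 hidden nodes, and the 0.5 threshold are arbitrary choices):

set.seed(42)
nnet <- train(x, y, hidden = 5, iterations = 1e4)
mean((nnet$output > 0.5) == y)  # proportion of points classified correctly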