github.com/javierluraschi/deeplearning-sdss-2019
Why should I care about Deep Learning?
2012: ImageNet classification with deep convolutional neural networks
2016: AlphaGo wins a match against Lee Sedol, a 9-dan player and one of the best Go players in the world.
2017: HBO’s Silicon Valley airs its “hot dog or not” episode.
2018: OpenAI Five defeats Dota 2 players in the top 99.95th percentile.
2019: Tesla’s Autonomy Day presents its autonomous driving plan.
See arXiv:1803.01164
McCulloch & Pitts show that neurons can be combined to construct a Turing machine (using ANDs, ORs, & NOTs).
The perceptron: A probabilistic model for information storage and organization in the brain
\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]
Rosenblatt, with the image sensor of the Mark I Perceptron (…) it learned to differentiate between right and left after fifty attempts.
\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]
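A single perceptron with hand-picked weights can already implement simple logic gates. A minimal R sketch (not from the original slides) for AND:

# Sketch: one perceptron computing A AND B with hand-picked weights and bias
x <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = T)
w <- c(0.5, 0.5); b <- -0.5
ifelse(x %*% w + b > 0, 1, 0)   # 0 0 0 1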
Stacking multiple perceptrons in layers should allow us to find a solution to the XOR truth table.
# A  B  (1 for bias)           r = A XOR B
x <- matrix(c(0, 0, 1,         # 0
              0, 1, 1,         # 1
              1, 0, 1,         # 1
              1, 1, 1), ncol = 3, byrow = T)  # 0
r <- c(0, 1, 1, 0)

# Fill in weights for two hidden perceptrons and one output perceptron
w <- matrix(c(___, ___, ___,
              ___, ___, ___,
              ___, ___, ___), ncol = 3, byrow = T)

yh1 <- ifelse(x %*% w[1,] > 0, 1, 0)                   # first hidden perceptron
yh2 <- ifelse(x %*% w[2,] > 0, 1, 0)                   # second hidden perceptron
x3 <- matrix(c(yh1, yh2, c(1, 1, 1, 1)), ncol = 3)     # hidden outputs plus bias
ifelse(x3 %*% w[3,] > 0, 1, 0) == r                    # output perceptron

One possible solution: the first hidden perceptron computes OR, the second computes NAND, and the output perceptron ANDs them together.

# A  B  (1 for bias)           r = A XOR B
x <- matrix(c(0, 0, 1,         # 0
              0, 1, 1,         # 1
              1, 0, 1,         # 1
              1, 1, 1), ncol = 3, byrow = T)  # 0
r <- c(0, 1, 1, 0)
w <- matrix(c(0.5, -0.5,  0.5,
              0.5, -0.5,  0.5,
              0.0,  1.0, -0.5), ncol = 3, byrow = T)
yh <- ifelse(x %*% w[,1:2] > 0, 1, 0)                      # hidden layer: OR and NAND
x3 <- matrix(c(yh[,1], yh[,2], c(1, 1, 1, 1)), ncol = 3)   # add bias column
ifelse(x3 %*% w[,3] > 0, 1, 0) == r                        # output layer matches XOR

A Learning Algorithm for Boltzmann Machines
A function decreases fastest if one moves in the direction of the negative gradient.
\[ a_{n+1} = a_n - \gamma \nabla F(a_n) \]
Gradient descent requires differentiable functions, but the perceptron’s step activation is not differentiable:
\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]
Instead, we use the ReLU, which is differentiable almost everywhere:
\[ \max\left(0, \sum_{i=1}^m w_i x_i + b\right) \]
Let’s differentiate the L2 loss over a perceptron with ReLU, where \(\theta\) is the step function.
\[ L(w_1, w_2, b) = (f(w_1, w_2, b) - y)^2 = (\max(0, w_1 x_1 + w_2 x_2 + b) - y)^2 \]
\[ \frac{\partial L}{\partial w_1} = 2 \, (f(w_1, w_2, b) - y) \, \theta(w_1 x_1 + w_2 x_2 + b) \, x_1 \\ \frac{\partial L}{\partial w_2} = 2 \, (f(w_1, w_2, b) - y) \, \theta(w_1 x_1 + w_2 x_2 + b) \, x_2 \\ \frac{\partial L}{\partial b} = 2 \, (f(w_1, w_2, b) - y) \, \theta(w_1 x_1 + w_2 x_2 + b) \]
Then we can iterate by following the negative gradient,
\[ a_{n+1} = a_n - \gamma \nabla F(a_n) \]
w_1 = 0.1; w_2 = 0.2; b = 0.3; learn = 0.1

# A  B           r = A AND B
x <- matrix(c(0, 0,   # 0
              0, 1,   # 0
              1, 0,   # 0
              1, 1), ncol = 2, byrow = T)  # 1
r <- c(0, 0, 0, 1)

# f() returns the pre-activation; the ReLU is applied via pmax() in the updates
f <- function(w_1, w_2, b, x_1, x_2) w_1 * x_1 + w_2 * x_2 + b
step <- function(x) ifelse(x < 0, 0, 1)

# Fill in the gradient descent updates
for (i in 1:1000) {
  f_1 <- f(w_1, w_2, b, x[,1], x[,2])
  w_1 <- w_1 - ___
  w_2 <- w_2 - ___
  b   <- b   - ___
}
(f(w_1, w_2, b, x[,1], x[,2]) > 0.01) == as.logical(r)

One possible solution, following the gradients derived above:

w_1 = 0.1; w_2 = 0.2; b = 0.3; learn = 0.1
x <- matrix(c(0, 0, 1, 1,
              0, 1, 0, 1), nrow = 4)
r <- c(0, 0, 0, 1)
f <- function(w_1, w_2, b, x_1, x_2) w_1 * x_1 + w_2 * x_2 + b
step <- function(x) ifelse(x < 0, 0, 1)
for (i in 1:1000) {
  f_1 <- f(w_1, w_2, b, x[,1], x[,2])
  w_1 <- w_1 - sum(learn * (2 * (pmax(0, f_1) - r) * step(f_1) * x[,1]))
  w_2 <- w_2 - sum(learn * (2 * (pmax(0, f_1) - r) * step(f_1) * x[,2]))
  b   <- b   - sum(learn * (2 * (pmax(0, f_1) - r) * step(f_1)))
}
(f(w_1, w_2, b, x[,1], x[,2]) > 0.01) == as.logical(r)

1989: ALVINN: An Autonomous Land Vehicle in a Neural Network
The vanishing gradient problem occurs when the gradient becomes vanishingly small, effectively preventing a weight from changing its value.
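As a rough numeric illustration (not from the slides): the sigmoid derivative is at most 0.25, so backpropagating through many sigmoid layers multiplies many small factors together.

# Sketch: the gradient shrinks multiplicatively with depth
sigmoid_grad <- function(x) { s <- 1 / (1 + exp(-x)); s * (1 - s) }
prod(rep(sigmoid_grad(0), 10))   # 0.25^10, about 9.5e-07 after 10 layers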
A fast learning algorithm for deep belief nets
Reducing the dimensionality of data with neural networks
ImageNet Classification with Deep Convolutional Neural Networks
It used ReLU activations, dropout, data augmentation, and GPUs.
Computational graphs avoid having to compute gradients by hand.
CS231n Winter 2016, Lecture 4: Backpropagation, Neural Networks
Simple computational graph example.
We can then combine these into very complex graphs using the chain rule:
Define the computation graph for the sigmoid function.
\[ \sigma (x) = \frac{1 }{1 + e^{-x} } \]
graph <- list(
  forward = function(x) -1 * x,             # y1 = -x
  backward = function(x) -1,                # dy1/dx = -1
  node = list(
    forward = function(x) exp(x),           # y2 = exp(y1)
    backward = function(x) exp(x),          # dy2/dy1 = exp(y1)
    node = list(
      forward = function(x) x + 1,          # y3 = y2 + 1
      backward = function(x) 1,             # dy3/dy2 = 1
      node = list(
        forward = function(x) 1 / x,        # y4 = 1 / y3, the sigmoid output
        backward = function(x) -1 / x ^ 2   # dy4/dy3 = -1 / y3^2
      )
    )
  )
)

A simple mechanism that, once scaled, ends up looking like magic
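As a quick check (not part of the original slides), we can walk the graph above forward and multiply the local derivatives on the way back; run_graph is a hypothetical helper introduced here:

# Sketch: forward pass, then backward pass via the chain rule
run_graph <- function(node, x) {
  y <- node$forward(x)                  # this node's output
  if (is.null(node$node)) {
    list(output = y, grad = node$backward(x))
  } else {
    rest <- run_graph(node$node, y)     # feed the output into the next node
    list(output = rest$output,
         grad = rest$grad * node$backward(x))  # chain rule: multiply local gradients
  }
}
run_graph(graph, 2)   # output ~0.881 = sigmoid(2), grad ~0.105 = sigmoid(2) * (1 - sigmoid(2))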
github.com/rstudio/cheatsheets/raw/master/keras.pdf
What version do you have installed?
If your local environment is not working…
Also, you can find the version with:
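Presumably something like the following, assuming the tensorflow R package is installed:

tensorflow::tf_version()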
[1] ‘1.13’
Load ’em packages!
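The exact list isn’t shown here; based on the code that follows, a plausible set is:

library(keras)       # neural network layers, fit(), to_categorical()
library(tensorflow)  # TensorFlow backend, tf_version()
library(dplyr)       # %>%, select(), mutate()
library(rsample)     # initial_split() for train/test splits
library(yardstick)   # roc_auc()
library(rpart)       # ships the kyphosis data set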
Let’s look at the data we’re working with
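One quick way to peek at it (a sketch; the kyphosis data frame ships with the rpart package):

head(kyphosis)   # columns: Kyphosis, Age, Number, Start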
We’re going to predict whether kyphosis is present.
First, we’ll perform an initial split into training/testing of the dataset.
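A sketch of one way to do this with rsample (the actual proportion and seed aren’t shown here):

set.seed(42)                                           # hypothetical seed
data_split    <- initial_split(kyphosis, prop = 0.75)  # hypothetical proportion
training_data <- training(data_split)
testing_data  <- testing(data_split)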
Let’s build our favorite classification model!
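The slide code isn’t reproduced here; as one plausible baseline on the same three predictors, a logistic regression (the name model0 is ours):

model0 <- glm(Kyphosis ~ Age + Number + Start,
              data = training_data, family = binomial)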
Create a data frame with the predictions.
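A sketch matching that hypothetical baseline:

predicted <- testing_data %>%
  mutate(present = predict(model0, testing_data, type = "response"))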
Calculate AUC
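One way to do it with yardstick (a sketch; the original call and output format may have differed):

roc_auc(predicted, Kyphosis, present) %>% pull(.estimate)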
[1] 0.719
“Neural net”
Data prep
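The slide code isn’t reproduced here; a plausible sketch with hypothetical names train_x / train_y, keeping the three numeric predictors as a matrix and one-hot encoding the outcome for a two-unit softmax output:

train_x <- training_data %>%
  select(Age, Number, Start) %>%
  as.matrix()
train_y <- to_categorical(as.integer(training_data$Kyphosis == "present"))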
Fit the model
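Again a sketch (the layer sizes and training settings are assumptions); the code below only requires a model named model1 that outputs two columns of class probabilities:

model1 <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = 3) %>%
  layer_dense(units = 2, activation = "softmax")

model1 %>% compile(optimizer = "adam",
                   loss = "categorical_crossentropy",
                   metrics = "accuracy")

model1 %>% fit(train_x, train_y, epochs = 100, verbose = 0)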
predictions <- predict(model1, testing_data %>%
  select(Age, Number, Start) %>%
  as.matrix())

predicted <- testing_data %>%
  mutate(present = predictions[,2])

roc_auc(predicted, Kyphosis, present)

Try adding more layers to make this perform better!
Neural nets are easier to train when the predictors have similar magnitudes; see the last section of the notebook.
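For example (a sketch; the notebook’s own code may differ), the predictor matrix can be standardized before training:

train_x <- scale(train_x)   # center and scale each column
test_x  <- scale(testing_data %>% select(Age, Number, Start) %>% as.matrix(),
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))   # reuse the training statistics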