Sigrid Keydana, Trivadis
12/05/2017
Trivadis
My background
My passion
Where to find me
10 classes, 32 x 32 px, popular benchmarking dataset for image recognition
Let's just pick two classes, keeping it simple…
… and predict its class probability:
[1] "Predicted probability for ship: 0.9979"
… and see what the net says now:
[1] "Predicted probability for ship: 0.4167"
Let's step back from more complicated architectures for a second and look at straightforward logistic regression (with Keras):
model <- keras_model_sequential()
model %>%
layer_dense(units = 1, input_shape = 3072) %>%
layer_activation("sigmoid")
model %>% compile(optimizer = optimizer_sgd(lr = 0.001), loss = "binary_crossentropy")
When does this model say frog (coded as 0) and when ship (coded as 1)?
Logistic regression uses the sigmoid function to decide which class to output, 0 or 1.
\( sigmoid(x,w) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}} \)
When \( \mathbf{w}^T\mathbf{x} + b \) > 0, sigmoid(x,w) > 0.5, and we get “ship” (1).
When \( \mathbf{w}^T\mathbf{x} + b \) < 0, sigmoid(x,w) < 0.5, and we get “frog” (0).
\( sigmoid(x,w) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}} \)
somehow has to end up < 0.5!
Let's take a small excerpt of this dot product, \( \mathbf{w}^T\mathbf{x} \), and look at how we could make it smaller:
\( w_1 x_1 + w_2 x_2 + w_3 x_3 ... \)
This will get smaller when for all positive \( w_i \), we decrease \( x_i \), and for all negative \( w_i \), we increase \( x_i \).
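A tiny numeric sketch of this idea (the numbers are made up, not actual model weights):

w <- c(0.5, -0.3, 0.8)          # hypothetical weight excerpt
x <- c(0.2, 0.7, 0.4)           # hypothetical pixel values
sum(w * x)                      # 0.21

# decrease x_i where w_i is positive, increase x_i where w_i is negative
x_smaller <- x - 0.1 * sign(w)
sum(w * x_smaller)              # 0.05 - the dot product shrank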
Here's our frog…
To produce a ship, now we want the dot product to get bigger.
We extract the weights of the model and add them to the image, scaled by some factor.
# load the trained logistic regression model and extract its weight vector
model <- load_model_hdf5(model_name)
weights <- (model %>% get_weights())[[1]]
# add the scaled weights to the frog image and predict again
adv <- poor_frog + t(weights)/6
new_frog_prob <- model %>% predict_proba(adv)
print(paste("Predicted probability for ship: ", round(new_frog_prob,4)))
[1] "Predicted probability for ship: 0.6629"
How does this change the way our image looks?
The change is hardly visible, much less so than in the first example, which used a convnet and included nonlinearities.
… what can we do when we have deep neural networks?
And thus, several layers of weights in play?
Normal training makes the network better at producing the correct classification.
How about making it worse instead, or, put differently, giving it an alternative definition of what is correct?
In optimization, we normally subtract (a fraction of) the gradient iteratively to reach a (global or local) minimum of the cost function.
In this graphic, cost is a quadratic function, e.g., mean squared error:
In two-class classification (be it shallow logistic regression or a deep convnet), the loss to minimize is cross entropy:
\( - \sum_j{t_j log(y_j)} \)
resp.
\( - (t\:log(y) + (1-t)\:log(1-y)) \) for the binary case.
Normally we'd try to minimize that loss …
… now we maximize the error instead of minimizing it!
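A minimal sketch of what that means, using a single sigmoid unit and made-up numbers: one gradient step on the input is enough to increase the binary cross-entropy.

sigmoid <- function(z) 1 / (1 + exp(-z))
bce <- function(y, t) -(t * log(y) + (1 - t) * log(1 - y))

w <- 2; b <- -1                              # hypothetical weight and bias
x <- 1.5                                     # the "input image", here just a scalar
target <- 1                                  # true class

bce(sigmoid(w * x + b), target)              # loss before the step

grad_x <- (sigmoid(w * x + b) - target) * w  # dE/dx via the chain rule
x_adv <- x + 0.5 * grad_x                    # ascent: step *along* the gradient
bce(sigmoid(w * x_adv + b), target)          # loss has increased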
Let's step back for a second and see how this works in context.
Enter …
We'll gradually build up the gradient of the loss with respect to \( y_i \), the output of the last hidden layer, and the weights \( w_{ij} \), respectively.
Here
Step 1: Loss w.r.t. output (prediction)
At the output layer \( j \), we compare class prediction \( y_j \) and actual class \( t \), using binary cross entropy/logistic loss: \( - (t\:log(y) + (1-t)\:log(1-y)) \)
The gradient
\( \frac{dE}{dy_j} = - (\frac{t}{y} + (-1) \frac{1-t}{1-y}) = \frac{y-t}{y(1-y)} \)
describes how the error \( E \) changes as the prediction \( y_j \) changes.
Step 2: How does the prediction/output \( y_j \) change as the input \( z_j \) to the final neuron changes?
In the case of a logistic (sigmoid) neuron with output \( y_j \), this is described by the gradient
\( \frac{dy_j}{dz_j} = y_j(1-y_j) \).
Step 3a: How does the input \( z_j \) to the final layer change as the weight \( w_{ij} \) changes?
Here the gradient is
\( \frac{dz_j}{dw_{ij}} = y_i \),
that is, the output of layer \( i \).
Step 3b: How does the input \( z_j \) to the final layer change as the output of layer \( i \) changes?
Here we have to take into account all connections a neuron \( y_i \) has to the output layer: The gradient is
\( \sum_j \frac{dz_j}{dy_i} = \sum_j w_{ij} \),
that is, the sum of the weights going out of \( y_i \).
Now that we have the single gradients, we use the chain rule to back propagate the error:
The gradient of the loss w.r.t. \( y_i \) is
\( \frac{dE}{dy_i} = \frac{y-t}{y(1-y)} y_j(1-y_j) \sum_j w_{ij} \)
Accordingly, we get the gradient of the loss w.r.t. \( w_{ij} \) as
\( \frac{dE}{dw_{ij}} = \frac{y-t}{y(1-y)} y_j(1-y_j) y_i \)
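As a sanity check, here is a minimal sketch (toy numbers, a single sigmoid output neuron) comparing this analytic gradient, which simplifies to \( (y_j - t)\,y_i \), with a finite-difference approximation:

sigmoid <- function(z) 1 / (1 + exp(-z))
bce <- function(y, t) -(t * log(y) + (1 - t) * log(1 - y))

y_i    <- c(0.3, 0.6, 0.9)   # hypothetical outputs of the last hidden layer
w      <- c(0.5, -1.0, 2.0)  # hypothetical weights into the output neuron
target <- 1                  # true class t

loss_for <- function(w) bce(sigmoid(sum(w * y_i)), target)

y_j <- sigmoid(sum(w * y_i))
analytic <- (y_j - target) * y_i   # (y-t)/(y(1-y)) * y_j(1-y_j) * y_i, simplified

eps <- 1e-6
numeric <- sapply(seq_along(w), function(k) {
  w_eps <- w
  w_eps[k] <- w_eps[k] + eps
  (loss_for(w_eps) - loss_for(w)) / eps
})

round(analytic, 4)
round(numeric, 4)   # should agree with the analytic gradient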
We can use the same mechanics of loss calculation and backpropagation, just taking the gradient with respect to the input image instead of the weights.
So instead of the weights, we'll extract the gradient from the model and add part of it to the input image.
How do we find the best scaling factor?
As shown in Goodfellow et al.,
Explaining and harnessing adversarial examples,
we can just take the sign of the gradient, scale it by a factor \( \epsilon \), reflecting the “resolution” of the measurements, and add that to the input image:
\( x = x + \epsilon \: sign(\nabla_x J(\theta, x, y)) \)
Here's our frog again …
This time we get the gradient values, take their sign, and add that to the image, scaled by a factor \( \epsilon \):
K <- backend()   # access the Keras backend (TensorFlow) directly
input <- model$input; output <- model$output
# use the currently predicted class as the target for the loss
target <- model %>% predict_classes(poor_frog)
target_variable <- K$variable(target)
loss <- metric_binary_crossentropy(model$output, target_variable)
# gradient of that loss w.r.t. the input image, reduced to its sign
gradients <- K$gradients(loss, model$input)
get_grad_values <- K$`function`(list(model$input), gradients)
grads <- get_grad_values(list(poor_frog))[[1]] %>% sign()
# fast gradient sign step, scaled by epsilon = 0.02
adv <- poor_frog + grads * 0.02
new_frog_prob <- model %>% predict_proba(adv)
print(paste("Predicted probability for ship: ", round(new_frog_prob,4)))
[1] "Predicted probability for ship: 0.5916"
And how does our frog look now?
Images are.
Our little frog is just coded like this…
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.00000 0.00000 0 0 0 0.00000 0.00000 0.00000
[2,] 0.00000 0.00000 0 0 0 0.00000 0.00000 0.00000
[3,] 0.00392 0.00392 0 0 0 0.00000 0.00000 0.00000
[4,] 0.00392 0.00392 0 0 0 0.00000 0.00000 0.00000
[5,] 0.00392 0.00392 0 0 0 0.00000 0.00000 0.00000
[6,] 0.00000 0.00000 0 0 0 0.00392 0.00392 0.00392
[7,] 0.00000 0.00000 0 0 0 0.00392 0.00392 0.00392
[8,] 0.00000 0.00000 0 0 0 0.00392 0.00784 0.00784
Documents are.
In text classification, we have term-document matrices:
doc1 doc2 doc3 doc4 doc5 doc6 doc7 doc8 doc9
word1 2 1 1 1 0 0 0 0 1
word2 0 2 2 1 0 2 2 2 2
word3 0 0 0 0 0 0 0 0 0
word4 0 0 0 0 0 0 0 0 0
word5 1 2 2 2 2 2 2 2 2
word6 1 0 0 0 0 0 0 0 0
word7 0 0 0 0 0 0 0 0 0
word8 0 0 0 2 2 1 2 1 0
word9 0 0 0 0 0 0 0 0 0
word10 1 0 0 0 0 0 0 0 0
word11 2 2 2 2 2 2 2 2 2
word12 0 0 0 0 0 0 0 0 0
word13 0 0 0 0 0 0 0 0 0
word14 0 0 0 0 0 0 0 0 0
word15 1 1 1 1 1 1 1 1 1
word16 1 1 1 1 1 1 1 1 1
word17 0 0 0 0 0 0 0 0 0
word18 0 0 0 0 0 0 0 0 0
word19 0 0 0 0 0 0 0 0 0
word20 1 1 1 1 1 1 1 1 1
Even normal dataframes are.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Let's say we're building our own neural network, starting with an 8x4 input matrix X and a hidden layer of capacity (number of neurons) 4.
X <- matrix(1:32, nrow = 8, ncol = 4, byrow = TRUE)
X
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
[6,] 21 22 23 24
[7,] 25 26 27 28
[8,] 29 30 31 32
What shape do we need for the weights matrix going into that hidden layer?
The weights matrix is 4x4.
W <- matrix(rnorm(16), nrow = 4, ncol = 4)
Then matrix-multiplying X and W gives the aggregated inputs into each hidden neuron for each input sample.
X %*% W
[,1] [,2] [,3] [,4]
[1,] -0.10 1.20 0.70 0.217
[2,] -1.97 5.06 1.11 -0.561
[3,] -3.85 8.92 1.52 -1.340
[4,] -5.72 12.78 1.93 -2.118
[5,] -7.60 16.64 2.34 -2.897
[6,] -9.47 20.50 2.75 -3.675
[7,] -11.35 24.35 3.16 -4.454
[8,] -13.22 28.21 3.57 -5.232
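To make this a complete hidden layer, we would also add a per-neuron bias and apply a nonlinearity; a minimal sketch (the bias values here are arbitrary):

sigmoid <- function(z) 1 / (1 + exp(-z))
b <- rnorm(4)                            # one bias per hidden neuron
H <- sigmoid(sweep(X %*% W, 2, b, "+"))  # add bias column-wise, then squash
dim(H)                                   # 8 samples x 4 hidden activations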
LeNet: First successful application of convolutional neural networks by Yann LeCun, Yoshua Bengio et al.
We have an input image x …
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
and a convolution kernel k…
[,1] [,2]
[1,] 1 2
[2,] 3 4
We transform k into a doubly block-circulant matrix …
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 2 0 3 4 0 0 0 0
[2,] 0 1 2 0 3 4 0 0 0
[3,] 0 0 0 1 2 0 3 4 0
[4,] 0 0 0 0 1 2 0 3 4
and flatten x into a vector…
[1] 1 2 3 4 5 6 7 8 9
Now we can get the convolved image by matrix-multiplying k and x:
k %*% x
[,1]
[1,] 37
[2,] 47
[3,] 67
[4,] 77
This is the same result we get when performing the convolution “manually”:
y1 <- 1*1 + 2*2 + 3*4 + 4*5
y2 <- 1*2 + 2*3 + 3*5 + 4*6
y3 <- 1*4 + 2*5 + 3*7 + 4*8
y4 <- 1*5 + 2*6 + 3*8 + 4*9
Now just convert the vector back to a matrix:
[,1] [,2]
[1,] 37 47
[2,] 67 77
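As runnable R, the whole example could look like this (a sketch; the doubly block-circulant matrix is simply typed in by hand rather than constructed programmatically):

x_flat <- 1:9                         # the 3 x 3 image, flattened row by row
k_circ <- matrix(c(1, 2, 0, 3, 4, 0, 0, 0, 0,
                   0, 1, 2, 0, 3, 4, 0, 0, 0,
                   0, 0, 0, 1, 2, 0, 3, 4, 0,
                   0, 0, 0, 0, 1, 2, 0, 3, 4),
                 nrow = 4, byrow = TRUE)
conv <- k_circ %*% x_flat             # 37 47 67 77
matrix(conv, nrow = 2, byrow = TRUE)  # same result as the "manual" convolution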
QR, LU, SVD, NMF, eigen…
Motivations
semantic: find latent factors/components
technical: facilitate computations (e.g., Least Squares)
?lm
lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
Linear regression \( \hat y = \alpha + \beta_1 x_1 + \beta_2 x_2 ... + \beta_n x_n \)
has a naive solution using the normal equations
\( A^TAx = A^Tb \) (from \( A^T(b - A \hat x) = 0 \))
which means we need to find the inverse \( (A^TA)^{-1} \) to get \( x \).
library(lars)   # the diabetes dataset used here comes from the lars package
data(diabetes)
X <- diabetes$x
y <- diabetes$y
# add an intercept column and solve the normal equations directly
X_with_intercept <- cbind(rep(1, nrow(X)), X)
(beta_hat <- solve(t(X_with_intercept) %*% X_with_intercept) %*% t(X_with_intercept) %*% y)
[,1]
152.1
age -10.0
sex -239.8
bmi 519.8
map 324.4
tc -792.2
ldl 476.7
hdl 101.0
tch 177.1
ltg 751.3
glu 67.6
# A = QR
# QRx = b ### Q inverse = Q transpose
# Rx = Q'b ### x = backsubstitute(R, Q'b)
qr_solve <- function(A, b) {
A_qr <- qr(A)
R <- qr.R(A_qr)
Q <- qr.Q(A_qr)
backsolve(R, t(Q) %*% b)
}
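A quick check (assuming the diabetes objects from the normal-equations example above are still in the workspace): the QR route reproduces beta_hat.

all.equal(as.vector(qr_solve(X_with_intercept, y)), as.vector(beta_hat))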
Hope I've managed to spark your curiosity about calculus and linear algebra at least a little :-)
You could do a lot without them, but it'd only be half the fun ;-)
Questions?
Thanks for your attention!!