I'm a developer - why should I care about matrices or calculus?

Sigrid Keydana, Trivadis
12/05/2017

About me & my employer

 

Trivadis

  • DACH-based IT consulting and service company, from traditional technologies to big data/machine learning/data science

My background

  • from psychology/statistics via software development and database engineering to data science and ML/DL

My passion

  • machine learning and deep learning
  • data science and (Bayesian) statistics
  • explanation/understanding over prediction accuracy

Where to find me

 

 

Part 1: Why should I care about calculus?

CIFAR-10 dataset

 

10 classes, 32 x 32 px, popular benchmarking dataset for image recognition

 


Let's just pick two classes, keeping it simple…

Ship or frog?

 

  • We've trained a small convolutional neural network (CNN) from scratch to distinguish ships from frogs specifically.
  • After just a few epochs, it tells the two classes apart with more than 97% accuracy on the test set.
  • Now let's pick some ship …


 

… and predict its class probability:

[1] "Predicted probability for ship:  0.9979"

So let's make some "random" changes to that image ...

 

… and see what the net says now:

 


[1] "Predicted probability for ship:  0.4167"

 

What's going on?

 

Let's step back from more complicated architectures for a second and look at straightforward logistic regression (with Keras):

 

model <- keras_model_sequential()

model %>%
  layer_dense(units = 1, input_shape = 3072) %>%
  layer_activation("sigmoid")

model %>% compile(optimizer = optimizer_sgd(lr = 0.001), loss = "binary_crossentropy")

 

When does this model say frog (coded as 0) and when ship (coded as 1)?

Logistic regression decision logic

 

Logistic regression uses the sigmoid function to decide which class to output, 0 or 1.

\( sigmoid(x,w) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}} \)


When \( \mathbf{w}^T\mathbf{x} + b \) > 0, sigmoid(x,w) > 0.5, and we get “ship” (1).

When \( \mathbf{w}^T\mathbf{x} + b \) < 0, sigmoid(x,w) < 0.5, and we get “frog” (0).
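As a minimal sketch of this decision rule in R (with made-up weights \( \mathbf{w} \), bias \( b \), and a tiny three-value "image" \( \mathbf{x} \)):

sigmoid <- function(z) 1 / (1 + exp(-z))
w <- c(0.4, -0.2, 0.1)   # made-up weights
b <- -0.05               # made-up bias
x <- c(0.6, 0.3, 0.9)    # made-up "image"
z <- sum(w * x) + b      # w'x + b
p <- sigmoid(z)          # > 0.5 exactly when z > 0
ifelse(p > 0.5, "ship (1)", "frog (0)")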

So if we want the algorithm to say "0", for frog ...

 

 

\( sigmoid(x,w) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}} \)

somehow has to end up < 0.5!

 

 

Let's take a small excerpt of this dot product, \( \mathbf{w}^T\mathbf{x} \), and look at how we could make it smaller:

\( w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots \)

 

This gets smaller if we decrease \( x_i \) wherever \( w_i \) is positive, and increase \( x_i \) wherever \( w_i \) is negative.
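A quick numeric illustration (made-up \( \mathbf{w} \) and \( \mathbf{x} \)): nudging each \( x_i \) in the direction of \( sign(w_i) \) increases the dot product, nudging against it decreases it.

w <- c(0.5, -1.2, 0.8)
x <- c(0.3, 0.7, 0.1)
sum(w * x)                      # original dot product
sum(w * (x + 0.05 * sign(w)))   # larger
sum(w * (x - 0.05 * sign(w)))   # smaller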

Let's turn a cute frog into a ship...

Here's our frog…


To produce a ship, we now want the dot product to get bigger instead.

We extract the weights of the model and add them to the image, scaled by some factor.

model <- load_model_hdf5(model_name)
weights <- (model %>% get_weights())[[1]]
adv <- poor_frog + t(weights)/6
new_frog_prob <- model %>% predict_proba(adv)
print(paste("Predicted probability for ship: ", round(new_frog_prob,4)))
[1] "Predicted probability for ship:  0.6629"

 

How does this change the way our image looks?

The frog who turned ship (1)

 


 

The change is hardly visible - much less so than in the first example, which used a convnet and included nonlinearities.

OK, but that was just simple logistic regression...

 

… what can we do when we have deep neural networks?

And thus several layers of weights in play?

Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf

Training a network, the evil way

 

 

Normal training has the network get better at making the right classification.

How about we have it get worse instead, or, put differently, give it an alternative definition of what is right?

Minimizing the cost: Gradient Descent

 

In optimization, normally we iteratively subtract (a fraction of) the gradient to reach a minimum of the cost function.

In this graphic, cost is a quadratic function, e.g., mean squared error:

 

Source: Goodfellow et al. 2016, Deep Learning
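A minimal sketch of gradient descent in R, on a hypothetical one-dimensional quadratic cost \( f(x) = (x - 3)^2 \) (learning rate and starting point chosen arbitrarily):

f <- function(x) (x - 3)^2          # toy cost
grad_f <- function(x) 2 * (x - 3)   # its gradient
x <- 10                             # arbitrary starting point
for (i in 1:100) {
  x <- x - 0.1 * grad_f(x)          # subtract a fraction of the gradient
}
x                                   # close to 3, the minimizer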

Loss for logistic regression

 

In two-class classification (be it shallow logistic regression or a deep convnet), the loss to minimize is cross entropy:

\( -\sum_j t_j \log(y_j) \)

or, for the binary case,

\( -(t \log(y) + (1-t) \log(1-y)) \)
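As a quick illustration in R, plugging in the predicted ship probabilities from the convnet example above (with target \( t = 1 \) for "ship"):

bce <- function(t, y) -(t * log(y) + (1 - t) * log(1 - y))
bce(t = 1, y = 0.9979)   # the confident, correct prediction: tiny loss
bce(t = 1, y = 0.4167)   # after the "random" changes: much larger loss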

 

Normally we'd try to minimize that loss …

Being evil

 

… now we maximize the error instead of minimizing it!

 

Let's step back for a second and see how this works in context.

What a neural network needs for learning

 

  • a cost function: How different is my prediction from my target?
  • a way to learn (optimization): How do I adapt my weights such that my predictions get better?
  • a way to back propagate the cost: How do weights earlier in the network change the predictions (and thus, the cost)?

 

Enter …

Source: https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/lecture_1_0.pdf

Backpropagation!

 

  • basically, just the chain rule: \( \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx} \)
  • chained over several layers:

 

Source: https://colah.github.io/posts/2015-08-Backprop/
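A tiny numeric sanity check of the chain rule in R, with an arbitrary choice of functions (\( y = x^2 \), \( z = sin(y) \)):

x <- 1.3
h <- 1e-6
y <- function(x) x^2
z <- function(x) sin(y(x))
(z(x + h) - z(x)) / h      # numerical dz/dx
cos(y(x)) * 2 * x          # dz/dy * dy/dx, via the chain rule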

Learning weights by backpropagation (1)

 

We'll gradually build up the gradient of the loss with respect to \( y_i \), the output of the last hidden layer, and the weights \( w_{ij} \), respectively.

Here

  • \( i \) and \( j \) are layers of the network (\( j \) being the output layer)
  • \( y_l \) is the output of layer \( l \)
  • \( z_l \) is the aggregated input going into layer \( l \) (before the activation function is applied)

 

Learning weights by backpropagation (2)

 

 

Step 1: Loss w.r.t. output (prediction)

At the output layer \( j \), we compare the class prediction \( y_j \) and the actual class \( t \), using binary cross entropy/logistic loss: \( -(t \log(y_j) + (1-t) \log(1-y_j)) \)

The gradient

\( \frac{dE}{dy_j} = -\left(\frac{t}{y_j} - \frac{1-t}{1-y_j}\right) = \frac{y_j-t}{y_j(1-y_j)} \)

describes how the error \( E \) changes as the prediction \( y_j \) changes.

Learning weights by backpropagation (3)

 

 

Step 2: How does the prediction/output \( y_j \) change as the input \( z_j \) to the final neuron changes?

In the case of a logistic (sigmoid) neuron with output \( y_j \), this is described by the gradient

\( \frac{dy_j}{dz_j} = y_j(1-y_j) \).

Learning weights by backpropagation (4)

 

 

Step 3a: How does the input \( z_j \) to the final layer change as the weight \( w_{ij} \) changes?

Here the gradient is

\( \frac{dz_j}{dw_{ij}} = y_i \),

that is, the output of layer \( i \).

Learning weights by backpropagation (5)

 

 

Step 3b: How does the input \( z_j \) to the final layer change as the output of layer \( i \) changes?

Here we have to take into account all connections a neuron \( y_i \) has to the output layer: The gradient is

\( \sum_j \frac{dz_j}{dy_i} = \sum_j w_{ij} \),

that is, the sum of the weights going out of \( y_i \).

Learning weights by backpropagation: Putting it all together

 

 

Now that we have the individual gradients, we use the chain rule to backpropagate the error:

The gradient of the loss w.r.t. \( y_i \) is

\( \frac{dE}{dy_i} = \frac{y_j-t}{y_j(1-y_j)} \, y_j(1-y_j) \sum_j w_{ij} = (y_j - t) \sum_j w_{ij} \)

Accordingly, we get the gradient of the loss w.r.t. \( w_{ij} \) as

\( \frac{dE}{dw_{ij}} = \frac{y_j-t}{y_j(1-y_j)} \, y_j(1-y_j) \, y_i = (y_j - t) \, y_i \)
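As a sketch with made-up numbers (hypothetical outputs \( y_i \) of the last hidden layer and weights \( w \) into the single output neuron), we can check the formula for \( \frac{dE}{dw_{ij}} \) against a finite-difference approximation:

sigmoid <- function(z) 1 / (1 + exp(-z))
y_i <- c(0.2, 0.7, 0.5)    # made-up outputs of the last hidden layer
w <- c(0.1, -0.3, 0.8)     # made-up weights into the output neuron
t <- 1                     # target class ("ship")
loss <- function(w) {
  y_j <- sigmoid(sum(w * y_i))
  -(t * log(y_j) + (1 - t) * log(1 - y_j))
}
y_j <- sigmoid(sum(w * y_i))
(y_j - t) * y_i                      # analytic gradient dE/dw_ij
h <- 1e-6
sapply(seq_along(w), function(k) {   # numerical gradient, one weight at a time
  w_h <- w
  w_h[k] <- w_h[k] + h
  (loss(w_h) - loss(w)) / h
})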

Backprop, the evil way...

 

We can use the same mechanics of loss calculation and backpropagation, except that

  • we change the input, not the weights
  • we maximize the loss by adding, not subtracting, (some fraction of) the gradient

 

So instead of the weights, we'll extract the gradient from the model and add part of it to the input image.

How do we find the best scaling factor?

Enter: Fast Gradient Sign Method (FGSM)

 

As shown in Goodfellow et al.,

Explaining and harnessing adversarial examples,

we can just take the sign of the gradient, scale it by a factor \( \epsilon \), reflecting the “resolution” of the measurements, and add that to the input image:

 

\( x' = x + \epsilon \: sign(\nabla_x J(\theta, x, y)) \)

Frog FGSM

 

Here's our frog again …


This time we get the gradient values, take their sign, and add that to the image, scaled by a factor \( \epsilon \):

K <- backend()                                     # the Keras backend object, needed for K$variable / K$gradients below
input <- model$input
output <- model$output
target <- model %>% predict_classes(poor_frog)     # the model's current prediction, reused as target
target_variable <- K$variable(target)
loss <- metric_binary_crossentropy(model$output, target_variable)
gradients <- K$gradients(loss, model$input)        # gradient of the loss w.r.t. the input image
get_grad_values <- K$`function`(list(model$input), gradients)
grads <- get_grad_values(list(poor_frog))[[1]] %>% sign()

adv <- poor_frog + grads * 0.02
new_frog_prob <- model %>% predict_proba(adv)
print(paste("Predicted probability for ship: ", round(new_frog_prob,4)))
[1] "Predicted probability for ship:  0.5916"

And how does our frog look now?

The frog who turned ship (2)

 


 

 

Part 2: Why should I care about matrices?

Everything is a matrix (or tensor)

 

Images are.

 

Our little frog, for instance, is just coded like this (an 8 x 8 excerpt of its pixel values):

        [,1]    [,2] [,3] [,4] [,5]    [,6]    [,7]    [,8]
[1,] 0.00000 0.00000    0    0    0 0.00000 0.00000 0.00000
[2,] 0.00000 0.00000    0    0    0 0.00000 0.00000 0.00000
[3,] 0.00392 0.00392    0    0    0 0.00000 0.00000 0.00000
[4,] 0.00392 0.00392    0    0    0 0.00000 0.00000 0.00000
[5,] 0.00392 0.00392    0    0    0 0.00000 0.00000 0.00000
[6,] 0.00000 0.00000    0    0    0 0.00392 0.00392 0.00392
[7,] 0.00000 0.00000    0    0    0 0.00392 0.00392 0.00392
[8,] 0.00000 0.00000    0    0    0 0.00392 0.00784 0.00784

Everything is a matrix (or tensor)

 

Documents are.

 

In text classification, we have term-document matrices

       doc1 doc2 doc3 doc4 doc5 doc6 doc7 doc8 doc9
word1     2    1    1    1    0    0    0    0    1
word2     0    2    2    1    0    2    2    2    2
word3     0    0    0    0    0    0    0    0    0
word4     0    0    0    0    0    0    0    0    0
word5     1    2    2    2    2    2    2    2    2
word6     1    0    0    0    0    0    0    0    0
word7     0    0    0    0    0    0    0    0    0
word8     0    0    0    2    2    1    2    1    0
word9     0    0    0    0    0    0    0    0    0
word10    1    0    0    0    0    0    0    0    0
word11    2    2    2    2    2    2    2    2    2
word12    0    0    0    0    0    0    0    0    0
word13    0    0    0    0    0    0    0    0    0
word14    0    0    0    0    0    0    0    0    0
word15    1    1    1    1    1    1    1    1    1
word16    1    1    1    1    1    1    1    1    1
word17    0    0    0    0    0    0    0    0    0
word18    0    0    0    0    0    0    0    0    0
word19    0    0    0    0    0    0    0    0    0
word20    1    1    1    1    1    1    1    1    1
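A toy sketch of how such a matrix can be built in base R, using a hypothetical mini-corpus of three tokenized documents:

docs <- list(doc1 = c("ship", "frog", "ship"),
             doc2 = c("frog", "frog", "net"),
             doc3 = c("ship", "net"))
terms <- sort(unique(unlist(docs)))                               # the vocabulary
tdm <- sapply(docs, function(d) table(factor(d, levels = terms)))
tdm                                                               # term counts, terms x documents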

Everything is a matrix (or tensor)

 

Even ordinary data frames are.

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

First things first: Matrix multiplication (1)

 

Let's say we're building our own neural network, starting with an 8 x 4 input matrix X and a hidden layer of capacity (number of neurons) 4.

X <- matrix(1:32, nrow = 8, ncol = 4, byrow = TRUE)
X
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20
[6,]   21   22   23   24
[7,]   25   26   27   28
[8,]   29   30   31   32

 

What shape do we need for the weights matrix going into that hidden layer?

First things first: Matrix multiplication (2)

 

The weights matrix is 4 x 4.

W <- matrix(rnorm(16), nrow = 4, ncol = 4)

 

Then matrix-multiplying X and W gives the aggregated inputs into each hidden neuron for each input sample.

X %*% W
       [,1]  [,2] [,3]   [,4]
[1,]  -0.10  1.20 0.70  0.217
[2,]  -1.97  5.06 1.11 -0.561
[3,]  -3.85  8.92 1.52 -1.340
[4,]  -5.72 12.78 1.93 -2.118
[5,]  -7.60 16.64 2.34 -2.897
[6,]  -9.47 20.50 2.75 -3.675
[7,] -11.35 24.35 3.16 -4.454
[8,] -13.22 28.21 3.57 -5.232
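As a sketch of what comes next (ReLU is just one arbitrary choice of activation): add a per-neuron bias and apply a nonlinearity to obtain the hidden layer's output.

b <- rnorm(4)                    # one bias per hidden neuron
Z <- sweep(X %*% W, 2, b, "+")   # add b[j] to column j: aggregated input plus bias
H <- pmax(Z, 0)                  # ReLU activation, applied element-wise
dim(H)                           # 8 samples x 4 hidden neurons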

The magic behind convnets: Convolutions

 

LeNet: First successful application of convolutional neural networks by Yann LeCun, Yoshua Bengio et al.

Source: http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf

The convolution operation

 

Source: Goodfellow et al., Deep Learning.

Convolution as matrix multiplication (1)

 

We have an input image x …

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

and a convolution kernel k…

     [,1] [,2]
[1,]    1    2
[2,]    3    4

We transform k into a doubly block-circulant matrix

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    2    0    3    4    0    0    0    0
[2,]    0    1    2    0    3    4    0    0    0
[3,]    0    0    0    1    2    0    3    4    0
[4,]    0    0    0    0    1    2    0    3    4

and flatten x into a vector…

[1] 1 2 3 4 5 6 7 8 9

Convolution as matrix multiplication (2)

 

Now we can get the convolved image by matrix-multiplying the doubly block-circulant matrix and the flattened x:

k %*% x
     [,1]
[1,]   37
[2,]   47
[3,]   67
[4,]   77

This is the same result we get when performing the convolution “manually”:

y1 <- 1*1 + 2*2 + 3*4 + 4*5
y2 <- 1*2 + 2*3 + 3*5 + 4*6
y3 <- 1*4 + 2*5 + 3*7 + 4*8
y4 <- 1*5 + 2*6 + 3*8 + 4*9

Now just convert the vector back to a matrix:

     [,1] [,2]
[1,]   37   47
[2,]   67   77
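Putting the whole example into a few lines of R (reconstructing x and the doubly block-circulant matrix from above; x has to be flattened row by row, hence the t()):

x <- matrix(1:9, nrow = 3, byrow = TRUE)
k_circ <- rbind(c(1, 2, 0, 3, 4, 0, 0, 0, 0),
                c(0, 1, 2, 0, 3, 4, 0, 0, 0),
                c(0, 0, 0, 1, 2, 0, 3, 4, 0),
                c(0, 0, 0, 0, 1, 2, 0, 3, 4))
x_vec <- as.vector(t(x))                  # row-wise flattening: 1 2 3 ... 9
y_vec <- k_circ %*% x_vec                 # 37 47 67 77
matrix(y_vec, nrow = 2, byrow = TRUE)     # back to image shape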

The hidden magic: Matrix decompositions

 

QR, LU, SVD, NMF, eigen…

 

Motivations

  • semantic: find latent factors/components
    • topics (e.g., in document classification)
    • features (e.g., in image or audio data)
  • technical: facilitate computations (e.g., Least Squares)

?lm
lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

Applications: How it works

Finding topics in documents

Finding "facial features"

Soft clustering

Technical: Matrix factorizations for least squares

 

Linear regression \( \hat y = \alpha + \beta_1 x_1 + \beta_2 x_2 ... + \beta_n x_n \)

has a naive solution using the normal equations

\( A^TA\hat{x} = A^Tb \)     (from \( A^T(b - A\hat{x}) = 0 \))

which means we need to find the inverse \( (A^TA)^{-1} \) to get \( \hat{x} \).

Source: http://www.stat.wisc.edu/~ifischer/Intro_Stat/Lecture_Notes/APPENDIX/A2._Geometric_Viewpoint/A2.3_-_Least_Squares_Approximation.pdf

So what's bad about taking the inverse?

 

  • slow
  • numerical instability
  • sparse matrices have dense inverses (most of the time)

Least Squares via matrix inverse

 

library(lars)                                   # the 'diabetes' dataset (x, y) comes from the lars package
data(diabetes)
X <- diabetes$x
y <- diabetes$y
X_with_intercept <- cbind(rep(1, nrow(X)), X)   # prepend a column of 1s for the intercept

(beta_hat <- solve(t(X_with_intercept) %*% X_with_intercept) %*% t(X_with_intercept) %*% y)
      [,1]
     152.1
age  -10.0
sex -239.8
bmi  519.8
map  324.4
tc  -792.2
ldl  476.7
hdl  101.0
tch  177.1
ltg  751.3
glu   67.6

Least Squares via QR decomposition

 

Source: http://www.sharetechnote.com/html/Handbook_EngMath_Matrix_QRDecomposition.html

# A = QR
# QRx = b    ### Q inverse = Q transpose
# Rx = Q'b   ### x = backsubstitute(R, Q'b)

qr_solve <- function(A, b) {
  A_qr <- qr(A)
  R <- qr.R(A_qr)
  Q <- qr.Q(A_qr)
  backsolve(R, t(Q) %*% b)
}
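Applied to the diabetes data from the previous slide (assuming X_with_intercept and y are still in scope), this reproduces the same coefficients without ever forming \( (A^TA)^{-1} \):

qr_solve(X_with_intercept, y)   # same least-squares solution as above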

The end, kind of

 

I hope I've managed to spark your curiosity about calculus and linear algebra a bit :-)

There's a lot you could do without them, but it'd only be half the fun ;-)

 

 

Questions?

Thanks for your attention!!