Sigrid Keydana, Trivadis
2017/01/08
solving matrix computations
recognizing objects
understanding speech
talking
walking
driving
What's difficult for us is easy for the machine, and vice versa.
What are the correct pixel values for a “bike” feature?
Several “origins” in different fields, see e.g.
Kelley, Henry J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10), 947-954.
Bryson, Arthur E. (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard University Symposium on Digital Computers and Their Applications.
Werbos, Paul (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.
Rumelhart, David E., Hinton, Geoffrey E., & Williams, Ronald J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
So why the hype, ehm, success now?
We want to predict
\( f(\mathbf{x}; \mathbf{w}, b) = \mathbf{x}^T\mathbf{w} + b \)
\( f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T (\mathbf{W}^T\mathbf{x} + \mathbf{c}) + b \)
(X <- matrix(c(0,0,0,1,1,0,1,1), nrow = 4, ncol = 2, byrow = TRUE))
     [,1] [,2]
[1,]    0    0
[2,]    0    1
[3,]    1    0
[4,]    1    1
(W <- matrix(c(1,1,1,1), nrow = 2, ncol = 2, byrow = TRUE))
     [,1] [,2]
[1,]    1    1
[2,]    1    1
(c <- matrix(c(0,-1), nrow = 1, ncol = 2))
     [,1] [,2]
[1,]    0   -1
(matmul <- X %*% W)
     [,1] [,2]
[1,]    0    0
[2,]    1    1
[3,]    1    1
[4,]    2    2
(hidden <- matmul + rbind(c, c, c, c))
     [,1] [,2]
[1,]    0   -1
[2,]    1    0
[3,]    1    0
[4,]    2    1
\( f(\mathbf{x}; \mathbf{W}, \mathbf{c}, \mathbf{w}, b) = \mathbf{w}^T \max(0, \mathbf{W}^T\mathbf{x} + \mathbf{c}) + b \)
(hidden_relu <- data.frame(X1 = pmax(hidden[, 1], 0), X2 = pmax(hidden[, 2], 0), CLASS = c(0, 1, 1, 0)))
  X1 X2 CLASS
1  0  0     0
2  1  0     1
3  1  0     1
4  2  1     0
The remaining hidden-to-output transformation is linear, but that's all we need: after the ReLU, the classes are already linearly separable.
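To make that concrete, here is a minimal sketch of one such linear readout, continuing the R example above; the output weights and bias (w = (1, -2), b = 0) are chosen for illustration and are not part of the example itself.

w <- matrix(c(1, -2), nrow = 2, ncol = 1)   # assumed output weights
b <- 0                                      # assumed output bias
hidden_relu_mat <- pmax(hidden, 0)          # ReLU applied to the hidden layer from above
(output <- hidden_relu_mat %*% w + b)       # gives 0, 1, 1, 0 - exactly XOR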
How about a net with several layers?
The loss (or cost) function indicates the cost incurred by wrong predictions / misclassifications.
probably the best-known loss function in machine learning is mean squared error:
\( \frac{1}{n} \sum_{i=1}^{n}{(\hat{y}_i - y_i)^2} \)
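For instance, in R (with made-up numbers):

y     <- c(1.0, 2.0, 3.0)     # true values (made up)
y_hat <- c(1.1, 1.9, 3.3)     # predictions (made up)
mean((y_hat - y)^2)           # 0.0367: the mean squared error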
most of the time, in deep learning classification tasks we use cross entropy:
\( - \sum_j{t_j \log(y_j)} \)
This is the negative log probability of the right answer.
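Again as a quick R illustration with made-up numbers:

target <- c(0, 1, 0)          # one-hot target: the right answer is class 2
probs  <- c(0.2, 0.7, 0.1)    # predicted class probabilities
-sum(target * log(probs))     # 0.357 = -log(0.7), the negative log probability of class 2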
Playing around with the Convolution Matrix filter in GIMP is a great way to gain intuition.
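For a rough idea of what such a filter does without leaving R, here is a hand-rolled sketch (technically cross-correlation, as commonly implemented in deep learning); the helper convolve2d and the toy image are made up for illustration.

convolve2d <- function(img, kernel) {
  kh <- nrow(kernel); kw <- ncol(kernel)
  out <- matrix(0, nrow(img) - kh + 1, ncol(img) - kw + 1)
  for (i in seq_len(nrow(out)))
    for (j in seq_len(ncol(out)))
      out[i, j] <- sum(img[i:(i + kh - 1), j:(j + kw - 1)] * kernel)
  out
}
img <- matrix(c(0, 0, 1, 1,
                0, 0, 1, 1,
                0, 0, 1, 1,
                0, 0, 1, 1), nrow = 4, byrow = TRUE)    # toy image with a vertical edge
sobel_x <- matrix(c(-1, 0, 1,
                    -2, 0, 2,
                    -1, 0, 1), nrow = 3, byrow = TRUE)  # classic vertical edge detector
convolve2d(img, sobel_x)                                # large responses along the 0-to-1 edge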
Why add recurrence?
Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___
(see Stanford CS 224D Deep Learning for NLP Lecture Notes)
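To make the idea of recurrence concrete, here is a minimal sketch of a simple (Elman-style) recurrent step in R; the dimensions, weights and "word vectors" are random placeholders, not a trained model.

set.seed(42)
d_in     <- 4                                    # size of a word vector (placeholder)
d_hidden <- 3                                    # size of the hidden state (placeholder)
W_xh <- matrix(rnorm(d_hidden * d_in), d_hidden, d_in)
W_hh <- matrix(rnorm(d_hidden * d_hidden), d_hidden, d_hidden)
b_h  <- rep(0, d_hidden)

rnn_step <- function(h_prev, x) {
  # the new hidden state depends on the current input AND the previous state,
  # so information from earlier words ("Jane") can survive until it is needed
  tanh(W_xh %*% x + W_hh %*% h_prev + b_h)
}

h <- rep(0, d_hidden)                            # initial hidden state
sentence <- replicate(5, rnorm(d_in), simplify = FALSE)   # 5 made-up word vectors
for (x in sentence) h <- rnn_step(h, x)          # h now summarizes the whole sequence
h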