Introduction to Deep Learning

Kevin Kuo, Javier Luraschi

Welcome

Intros

  • Kevin Kuo @kevinykuo
  • Javier Luraschi @javierluraschi

Agenda

  • Welcome!
  • Introduction to Deep Learning
  • Introduction to TensorFlow
  • Introduction to Keras

Local Installation

Workshop Notebook

github.com/javierluraschi/deeplearning-sdss-2019

Why Deep Learning?

Why should I care about Deep Learning?

ImageNet

2012: ImageNet classification with deep convolutional neural networks

AlphaGo

2016: AlphaGo wins a match against Lee Sedol, a 9-dan professional and one of the best Go players in the world.

Popularizing Deep Learning

2017: HBO’s Silicon Valley show airs its “hot dog or not” episode.

Dota

2018: OpenAI Five defeats Dota 2 players ranked in the top 99.95th percentile.

Tesla

2019: Tesla’s Autonomy Day presents its autonomous driving plan.

Introduction to Deep Learning

A Comprehensive Survey on Deep Learning Approaches

See arXiv:1803.01164

1943: McCulloch and Pitts

McCulloch & Pitts show that neurons can be combined to construct a Turing machine (using ANDs, ORs, & NOTs).

1943: McCulloch and Pitts – Turing Machines

1958: Rosenblatt – The Perceptron

The perceptron: A probabilistic model for information storage and organization in the brain

1958: Rosenblatt – The Perceptron

Perceptron Diagram

Perceptron Model

\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]
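As a minimal sketch (not from the original slides), the perceptron can be written as a small R function; the weights w and bias b are left as placeholders to be chosen in the exercises that follow:

    # Minimal perceptron sketch: step activation over a weighted sum.
    perceptron <- function(x, w, b) {
      as.numeric(sum(w * x) + b > 0)
    }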

1958: Rosenblatt – Perceptron (OR)

\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]

1958: Rosenblatt – Solution (OR)

\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]
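One possible solution (the weights are illustrative; the slide’s diagram may use different values): with \(w_1 = w_2 = 1\) and \(b = -0.5\), the weighted sum is \(-0.5\) for input \((0, 0)\) and at least \(0.5\) otherwise, reproducing the OR truth table. Checking it with the perceptron sketch from above:

    # Weights below are one possible OR solution, not necessarily the slides'.
    perceptron(c(0, 0), w = c(1, 1), b = -0.5)  # 0
    perceptron(c(0, 1), w = c(1, 1), b = -0.5)  # 1
    perceptron(c(1, 0), w = c(1, 1), b = -0.5)  # 1
    perceptron(c(1, 1), w = c(1, 1), b = -0.5)  # 1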

1958: Rosenblatt – Perceptron (AND)

\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]

1958: Rosenblatt – Solution (AND)

\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]
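Similarly, one possible (illustrative) solution for AND is \(w_1 = w_2 = 1\) and \(b = -1.5\): the weighted sum exceeds zero only when both inputs are 1.

    # Only the (1, 1) input clears the -1.5 bias.
    perceptron(c(1, 1), w = c(1, 1), b = -1.5)  # 1
    perceptron(c(0, 1), w = c(1, 1), b = -1.5)  # 0 (likewise for (0, 0) and (1, 0))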

1958: Rosenblatt – Demo

Rosenblatt, with the image sensor of the Mark I Perceptron (…) it learned to differentiate between right and left after fifty attempts.

1958: Rosenblatt – Predictions

The perceptron was expected to walk, talk, see, write, reproduce itself, and be conscious of its existence, although (…) it learned to differentiate between right and left after fifty attempts.

1958: Rosenblatt – Principles of Neurodynamics 1/3

1958: Rosenblatt – Principles of Neurodynamics 2/3

1958: Rosenblatt – Principles of Neurodynamics 3/3

1969: Minsky and Papert – Book (1/3)

1969: Minsky and Papert – Book (2/3)

1969: Minsky and Papert – Exercise (XOR)

\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]

1969: Minsky and Papert – Book (3/3)

It ought to be possible to devise a training algorithm to optimize the weights in this using (…) we have not investigated this.

1969: Minsky and Papert – Diagram (XOR)

Using multiple layers of perceptrons should allow us to find a solution to the XOR truth table.

1969: Minsky and Papert – Layered (XOR)

1969: Minsky and Papert – Solution (XOR)
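One layered solution (with illustrative weights; the slide’s diagram may differ): XOR can be written as AND(OR(x₁, x₂), NAND(x₁, x₂)), so two perceptrons in a hidden layer feeding a third solve the table. A sketch reusing the perceptron function from above:

    # XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)); weights are illustrative.
    xor_net <- function(x) {
      h1 <- perceptron(x, w = c(1, 1),   b = -0.5)   # OR
      h2 <- perceptron(x, w = c(-1, -1), b = 1.5)    # NAND
      perceptron(c(h1, h2), w = c(1, 1), b = -1.5)   # AND
    }
    sapply(list(c(0, 0), c(0, 1), c(1, 0), c(1, 1)), xor_net)  # 0 1 1 0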

1985: Hinton – Gradient Descent

A Learning Algorithm for Boltzmann Machines

1985: Hinton – Gradient Descent Today

A function decreases fastest if one moves in the direction of the negative gradient.

\[ a_{n+1} = a_n - \gamma \nabla F(a_n) \]
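As a minimal sketch (not from the slides), here is the update rule applied in R to a toy function \(F(a) = (a - 3)^2\), whose gradient is \(2(a - 3)\); the learning rate and iteration count are arbitrary:

    # Gradient descent on F(a) = (a - 3)^2; should converge toward a = 3.
    a <- 0
    gamma <- 0.1
    for (n in 1:100) {
      a <- a - gamma * 2 * (a - 3)
    }
    a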

1985: Hinton – Gradient Descent

1985: Hinton – Stochastic Gradient Descent

1985: Hinton – Back-Propagation

1985: Hinton – Exercise (AND) – Differentiable

Gradient descent requires differentiable functions, but the perceptron uses a step activation:

\[ f(x) = \begin{cases} 1 & \sum_{i=1}^m w_i x_i + b > 0\\ 0 & \text{otherwise} \end{cases} \]

We instead use the ReLU, which is differentiable almost everywhere:

\[ \max\left(0, \sum_{i=1}^m w_i x_i + b\right) \]

1985: Hinton – Exercise (AND) – Differentiate

Let’s differentiate the squared (L2) loss of a perceptron with a ReLU activation, where \(\theta\) is the step function.

\[ L(w_1, w_2, b) = (f(w_1, w_2, b) - y)^2 = (\max(0, w_1 x_1 + w_2 x_2 + b) - y)^2 \]

\[ \frac{\partial L}{\partial w_1} = 2 \, (f(w_1, w_2, b) - y) \, \theta(w_1 x_1 + w_2 x_2 + b) \, x_1 \\ \frac{\partial L}{\partial w_2} = 2 \, (f(w_1, w_2, b) - y) \, \theta(w_1 x_1 + w_2 x_2 + b) \, x_2 \\ \frac{\partial L}{\partial b} = 2 \, (f(w_1, w_2, b) - y) \, \theta(w_1 x_1 + w_2 x_2 + b) \] Then we can iterate, following the direction of the negative gradient,

\[ a_{n+1} = a_n - \gamma \nabla F(a_n) \]

1985: Hinton – Exercise (AND)

1985: Hinton – Solution (AND)
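The notebook’s solution code is elided here; what follows is a minimal full-batch R sketch that applies the gradients derived above to the AND table. The initialization, learning rate, and number of iterations are arbitrary choices, not the workshop’s.

    # AND inputs and targets.
    x <- matrix(c(0, 0,
                  0, 1,
                  1, 0,
                  1, 1), ncol = 2, byrow = TRUE)
    y <- c(0, 0, 0, 1)

    w <- c(0.1, 0.1); b <- 0; gamma <- 0.1
    for (n in 1:500) {
      z <- as.vector(x %*% w + b)              # pre-activations
      f <- pmax(0, z)                          # ReLU outputs
      g <- 2 * (f - y) * as.numeric(z > 0)     # 2 (f - y) * theta(z)
      w <- w - gamma * as.vector(t(x) %*% g)   # gradient w.r.t. the weights
      b <- b - gamma * sum(g)                  # gradient w.r.t. the bias
    }
    round(pmax(0, as.vector(x %*% w + b)))     # approximately 0 0 0 1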

1985: Hinton – Applications

1989: ALVINN: An Autonomous Land Vehicle in a Neural Network

https://www.youtube.com/watch?v=ntIczNQKfjQ

1985: Hinton – Deep Networks

The vanishing gradient problem occurs when gradients become vanishingly small, effectively preventing the weights from changing their values.

2006: Hinton – Train One Layer

A fast learning algorithm for deep belief nets

2006: Hinton – Autoencoders

Reducing the dimensionality of data with neural networks

2006: Hinton – Dimensionality

Reducing the dimensionality of data with neural networks

2012: AlexNet

ImageNet Classification with Deep Convolutional Neural Networks

Used ReLU activations, dropout, data augmentation, and GPUs.

2016: Karpathy – Computational Graphs

Computational graphs avoid manually computing gradients.

CS231n Winter 2016, Lecture 4: Backpropagation, Neural Networks

2016: Karpathy – Simple Graph

Simple computational graph example.

2016: Karpathy – Complex Graph

These can then be combined into very complex graphs using the chain rule:
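A small worked example (a standard one, not necessarily the slide’s): take \( f(x, y, z) = (x + y)\,z \) with the intermediate node \( q = x + y \). Backpropagating through the graph with the chain rule gives

\[ \frac{\partial f}{\partial z} = q = x + y, \qquad \frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial x} = z \cdot 1 = z, \qquad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial y} = z \cdot 1 = z \]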

2016: Karpathy – Exercise (Graph)

Define the computation graph for the sigmoid function.

\[ \sigma (x) = \frac{1 }{1 + e^{-x} } \]

2016: Karpathy – Solution (Graph)
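The slide’s diagram is not reproduced here; one possible decomposition into elementary nodes is

\[ a = -x, \qquad b = e^{a}, \qquad c = 1 + b, \qquad \sigma = \frac{1}{c} \]

Multiplying the local gradients along the graph recovers the well-known result:

\[ \frac{d\sigma}{dx} = \left(-\frac{1}{c^2}\right) \cdot 1 \cdot e^{a} \cdot (-1) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,(1 - \sigma(x)) \]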

Introduction to TensorFlow

What is TensorFlow?

Slides from JJ’s rstudio::conf 2018 keynote

Why should R users care?

  • A new general purpose numerical computing library!
    • Hardware independent
    • Distributed execution
    • Large datasets
    • Automatic differentiation
  • Very general built-in optimization algorithms (SGD, Adam) that don’t require all data to be in RAM
  • Robust foundation for machine learning and deep learning applications
  • TensorFlow models can be deployed with a low-latency C++ runtime
  • R has a lot to offer as an interface language for TensorFlow

What is tensor “flow”?

Graph is generated from Code

What is deep learning?

A simple mechanism that, once scaled, ends up looking like magic

Keras Adoption

Keras for R cheatsheet

github.com/rstudio/cheatsheets/raw/master/keras.pdf

Introduction to Keras

Installing TensorFlow – Exercise

What version do you have installed?

Installing TensorFlow – Help!

If your local environment is not working…

rstd.io/class

Also, you can find the version from R.
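One way (a minimal sketch, assuming the tensorflow R package is installed):

    # Print the installed TensorFlow version from R.
    library(tensorflow)
    tf_version()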

[1] ‘1.13’

Installing Keras – Exercise
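The exercise’s exact instructions are elided here; a minimal install sketch (the notebook may do this differently):

    # Install the R package, then a Keras/TensorFlow backend.
    install.packages("keras")
    keras::install_keras()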

Kyphosis – Packages

Load ’em packages!
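The notebook’s package list is elided here; a plausible sketch, with packages assumed from the steps that follow:

    library(rpart)     # provides the kyphosis data set
    library(rsample)   # initial_split() for train/test splitting
    library(dplyr)     # data manipulation
    library(keras)     # neural networks
    library(Metrics)   # auc()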

Kyphosis – Quick View

Let’s look at the data we’re working with

Kyphosis – Quick View (Solution)

Let’s look at the data we’re working with
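A minimal way to peek at the data (assuming the kyphosis data set from rpart):

    # 81 rows, 4 columns: Kyphosis (absent/present), Age, Number, Start.
    head(kyphosis)
    summary(kyphosis)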

Kyphosis – Split

We’re going to predict whether kyphosis is present.

First, we’ll perform an initial split into training/testing of the dataset.

Kyphosis – Split (Solution)

We’re going to predict whether kyphosis is present.

First, we’ll perform an initial split into training/testing of the dataset.
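A sketch of one way to split, assuming rsample (the notebook’s seed and proportion may differ):

    set.seed(42)  # illustrative seed, not the workshop's
    kyphosis_split <- initial_split(kyphosis, prop = 0.75)
    train_data <- training(kyphosis_split)
    test_data  <- testing(kyphosis_split)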

Kyphosis – GLM

Let’s build our favorite classification model!

Kyphosis – GLM (Solved)

Let’s build our favorite classification model!
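The notebook’s code is elided; presumably the favorite classifier here is logistic regression, for example (variable names follow the split sketch above):

    # Logistic regression for kyphosis presence on the training split.
    glm_fit <- glm(Kyphosis ~ Age + Number + Start,
                   family = binomial(), data = train_data)
    summary(glm_fit)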

Kyphosis – Predictions

Create a data frame with the predictions.

Kyphosis – Predictions (Solved)

Create a data frame with the predictions.
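A sketch of assembling the predictions on the test split (the names are illustrative):

    # Predicted probability that kyphosis is present, alongside the truth.
    pred_df <- data.frame(
      truth    = test_data$Kyphosis,
      estimate = predict(glm_fit, newdata = test_data, type = "response")
    )
    head(pred_df)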

Kyphosis – Calculate AUC

Calculate AUC

Kyphosis – Calculate AUC (Solved)

Calculate AUC

[1] 0.719
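The notebook’s metric code is elided (only the output above survives); one way to compute the AUC, assuming the pred_df sketch above and the Metrics package:

    # Area under the ROC curve for the GLM predictions.
    auc(actual    = as.numeric(pred_df$truth == "present"),
        predicted = pred_df$estimate)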

Kyphosis – “Neural net”

Kyphosis – “Neural net” (Solution)

“Neural net”
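A sketch of a minimal Keras model: a single sigmoid unit over the three predictors, which is essentially logistic regression expressed as a network. The architecture and compile settings are illustrative, not necessarily the notebook’s.

    library(keras)

    # One dense sigmoid unit: logistic regression as a (tiny) neural network.
    model <- keras_model_sequential() %>%
      layer_dense(units = 1, activation = "sigmoid", input_shape = 3)

    model %>% compile(
      optimizer = "adam",
      loss      = "binary_crossentropy",
      metrics   = "accuracy"
    )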

Kyphosis – Data prep

Data prep

Kyphosis – Data prep (Solution)

Data prep
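Keras expects numeric matrices and a 0/1 response; a sketch using the split objects from the earlier sketches:

    # Predictors as a numeric matrix, response coded as 0/1.
    x_train <- as.matrix(train_data[, c("Age", "Number", "Start")])
    y_train <- as.numeric(train_data$Kyphosis == "present")
    x_test  <- as.matrix(test_data[, c("Age", "Number", "Start")])
    y_test  <- as.numeric(test_data$Kyphosis == "present")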

Kyphosis – Fit the model

Fit the model

Kyphosis – Fit the model (Solution)

Fit the model
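A sketch of the fitting call (the epochs, batch size, and validation split are arbitrary choices here):

    history <- model %>% fit(
      x_train, y_train,
      epochs = 50,
      batch_size = 16,
      validation_split = 0.2
    )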

Kyphosis – Predictions and AUC

Kyphosis – Predictions and AUC (Solution)

Try adding more layers to make this perform better!
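A sketch of scoring the test set and computing the AUC as before, plus one illustrative way to add capacity (layer sizes are arbitrary):

    # Predicted probabilities on the test set, then AUC.
    nn_probs <- predict(model, x_test)
    auc(actual = y_test, predicted = as.vector(nn_probs))

    # One way to add layers: a small ReLU hidden layer before the output.
    deeper <- keras_model_sequential() %>%
      layer_dense(units = 8, activation = "relu", input_shape = 3) %>%
      layer_dense(units = 1, activation = "sigmoid")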

Kyphosis – Data Preprocessing

Neural nets are easier to train when the predictors have similar magnitudes; see the last section of the notebook.
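A sketch of standardizing the predictors, reusing the training set’s statistics for the test set:

    # Center and scale; apply the training means/SDs to the test matrix.
    x_train_scaled <- scale(x_train)
    x_test_scaled  <- scale(
      x_test,
      center = attr(x_train_scaled, "scaled:center"),
      scale  = attr(x_train_scaled, "scaled:scale")
    )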

JJ’s Keynote

beta.rstudioconnect.com/ml-with-tensorflow-and-r