Deep Learning Tutorial: COBAL-EBEB, 2024

Nick Polson and Vadim Sokolov

Introduction

Review of deep learning methods which provide insight into structured high-dimensional data.
Deep learning uses layers of semi-affine input transformations to provide a predictive rule.
Applying these layers of transformations leads to a set of attributes (or, features)
To which probabilistic statistical methods can be applied.
The best of both worlds can be achieved: scalable prediction rules fortified with uncertainty quantification, where sparse regularization finds the features.

Why DL?

Deep learning provides a powerful pattern matching tool suitable for many AI applications.
Image recognition and text analysis are probably two of the deep learning’s most successful.
An image or a text as a high dimensional matrices and vectors, respectively.

What is DL?

In general, a neural network can be described as follows. Let \(f_1 , \ldots , f_L\) be given univariate activation functions for each of the \(L\) layers.
Activation functions are nonlinear transformations of weighted data.
A semi-affine activation rule is then defined by \[ f_l^{W,b} = f_l \left ( \sum_{j=1}^{N_l} W_{lj} X_j + b_l \right ) = f_l ( W_l X_l + b_l )\,, \]

\[ \hat{Y}(X) = F(X) = \left ( f_l^{W_1,b_1} \circ \ldots \circ f_L^{W_L,b_L} \right ) ( X)\,. \]

What is DL?

Similar to a classic basis decomposition, the deep approach uses univariate activation functions to decompose a high dimensional \(X\).

Let \(Z^{(l)}\) denote the \(l\)th layer, and so \(X = Z^{(0)}\). The final output \(Y\) can be numeric or categorical. The explicit structure of a deep prediction rule is then \[ \begin{aligned} \hat{Y} (X) & = W^{(L)} Z^{(L)} + b^{(L)} \\ Z^{(1)} & = f^{(1)} \left ( W^{(0)} X + b^{(0)} \right ) \\ Z^{(2)} & = f^{(2)} \left ( W^{(1)} Z^{(1)} + b^{(1)} \right ) \\ \ldots & \\ Z^{(L)} & = f^{(L)} \left ( W^{(L-1)} Z^{(L-1)} + b^{(L-1)} \right )\,. \end{aligned} \] Here \(W^{(l)}\) is a weight matrix and \(b^{(l)}\) are threshold or activation levels. Designing a good predictor depends crucially on the choice of univariate activation functions \(f^{(l)}\). The \(Z^{(l)}\) are hidden features which the algorithm will extract.

Deep approach employs hierarchical predictors comprising of a series of \(L\) nonlinear transformations applied to \(X\).

Deep Learning and Least Squares

The goal is to find the optimal value of \(\theta\) that minimizes the expected loss function
Given a training data set \(D = \{x_i,y_i\}_{i=1}^n\).
The loss function is a measure of discrepancy between the true value of \(y\) and the predicted value \(f(x,\theta)\).

Model Estimation

Deep learning as well as a large number of statistical problems, can be expressed in the form \[ \min l(x) + \phi(x). \] In learning \(l(x)\) is the negative log-likelihood and \(\phi(x)\) is a penalty function that regularizes the estimate. From the Bayesian perspective, the solution to this problem may be interpreted as a maximum a posteriori \[p(y\mid x) \propto \exp\{-l(x)\}, ~ p(x) \propto \exp\{-\phi(x)\}.\]

The loss function

The loss function is usually defined as the negative log-likelihood function of the model. \[ l(\theta) = - \sum_{i=1}^n \log p(y_i | x_i, \theta), \] where \(p(y_i | x_i, \theta)\) is the conditional distribution of \(y_i\) given \(x_i\) and \(\theta\).

The loss function

Thus, in the case of regression, we have \[ y_i = f(x_i,\theta) + \epsilon, ~ \epsilon \sim N(0,\sigma^2), \] Thus, the loss function is \[ l(\theta) = - \sum_{i=1}^n \log p(y_i | x_i, \theta) = \sum_{i=1}^n (y_i - f(x_i, \theta))^2, \]

Optimisation

Second order optimisation algorithms, such as BFGS used for traditional statistical models do not work well for deep learning models.
The number of parameters a DL model has is large and estimating second order derivatives (Hessian or Fisher information matrix) becomes prohibitive from both computational and memory use standpoints.
Instead, first order gradient descent methods are used for estimating parameters of a deep learning models.

Optimisation

The problem of parameter estimation (when likelihood belongs to the exponential family) is an optimisation problem

\[ \min_{\theta} l(\theta) := \dfrac{1}{n} \sum_{i=1}^n \log p(y_i, f(x_i, \theta)) \] where \(l\) is the negative log-likelihood of a sample, and \(\theta\) is the vector of parameters.

Optimisation: Gradient Descent

Iterative algorithm
Starts with an initial guess \(\theta^{0}\)
Updates the parameter vector \(\theta\) at each iteration \(t\) as follows:

\[ \theta^{t+1} = \theta^t - \alpha_t \nabla l(\theta^t). \] - \(\alpha_t\) is the learning rate (step size). In DL model estimation it is a first class citizen parameter.

Gradient Descent for Linear Regression

\[ y_i = \theta_0 + \theta_1 x_i + \epsilon_i, \] or in matrix form \[ y = X \theta + \epsilon, \]

\(\epsilon_i \sim N(0, \sigma^2)\)
\(X\) is the design matrix

Gradient Descent

The negative log-likelihood function \[ l(\theta) = \sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i)^2. \] The gradient then is \[ \nabla l(\theta) = -2 \sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i) \begin{pmatrix} 1 \\ x_i \end{pmatrix}. \] In matrix form, we have \[ \nabla l(\theta) = -2 X^T (y - X \theta). \]

Regression

Regression is simply a neural network which is wide and shallow.

Code

data(iris)
y = iris$Petal.Length
x = cbind(rep(1,150),iris$Petal.Width)
# initialize theta
theta <- matrix(c(0, 0), nrow = 2, ncol = 1)
# learning rate
alpha <- 0.0001
# number of iterations
n_iter <- 1000
# gradient descent
for (i in 1:n_iter) {
    # compute gradient
    grad <- -2 * t(x) %*% (y - x %*% theta)
    # update theta
    theta <- theta - alpha * grad
}

Our gradient descent finds the following coefficients

Intercept (\(\theta_1\))	Petal.Width (\(\theta_2\))
1.084125	2.229565

Plot

Let’s plot the data and model estimated using gradient descent

Code

plot(x[,2],y,pch=16, xlab="Petal.Width")
abline(theta[2],theta[1], lwd=3,col="red")

Compare

Let’s compare it to the standard estimation algorithm

Code

m = lm(Petal.Length~Petal.Width, data=iris)

(Intercept)	Petal.Width
1.083558	2.22994

GD:

Intercept (\(\theta_1\))	Petal.Width (\(\theta_2\))
1.084125	2.229565

The values found by gradient descent are very close to the ones found by the standard OLS algorithm.

Logistic Regression

GLM with a logit link function \[ \log \left(\frac{p}{1-p}\right) = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p, \]

The negative log-likelihood: \[ l(\theta) = - \sum_{i=1}^n \left[ y_i \log p_i + (1-y_i) \log (1-p_i) \right], \] where \(p_i = 1/\left(1 + \exp(-\theta_0 - \theta_1 x_{i1} - \ldots - \theta_p x_{ip})\right)\).

Derivative

The derivative of the negative log-likelihood function is \[ \nabla l(\theta) = - \sum_{i=1}^n \left[ y_i - p_i \right] \begin{pmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{ip} \end{pmatrix}. \] In matrix notations, we have \[ \nabla l(\theta) = - X^T (y - p). \]

Code

Let’s implement gradient descent algorithm now.

Code

y = ifelse(iris$Species=="setosa",1,0)
x = cbind(rep(1,150),iris$Sepal.Length)
lrgd  = function(x,y, alpha, n_iter) {
  theta <- matrix(c(0, 0), nrow = 2, ncol = 1)
  for (i in 1:n_iter) {
    # compute gradient
    p = 1/(1+exp(-x %*% theta))
    grad <- -t(x) %*% (y - p)
    # update theta
    theta <- theta - alpha * grad
  }
  return(theta)
}
theta = lrgd(x,y,0.005,20000)

The gradient descent parameters are

Intercept (\(\theta_1\))	Sepal.Length (\(\theta_2\))
27.74433	-5.160146

Plot

Code

par(mar=c(4,4,0,0), bty='n')
p = 1/(1+exp(-x %*% theta))
plot(x[,2],y,pch=16, xlab="Sepal.Length")
lines(x[,2],p,type='p', pch=16,col="red")

Compare

Let’s compare it to the standard estimation algorithm

Code

glm(y~x-1, family=binomial(link="logit"))


Call:  glm(formula = y ~ x - 1, family = binomial(link = "logit"))

Coefficients:
    x1      x2  
27.829  -5.176  

Degrees of Freedom: 150 Total (i.e. Null);  148 Residual
Null Deviance:      207.9 
Residual Deviance: 71.84    AIC: 75.84

Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a variant of the gradient descent algorithm. \[ \nabla l(\theta) \approx \dfrac{1}{|B|} \sum_{i \in B} \nabla l(y_i, f(x_i, \theta)), \] where \(B \in \{1,2,\ldots,n\}\) is the batch samples from the data set. This method can be interpreted as gradient descent using noisy gradients, which are typically called mini-batch gradients with batch size \(|B|\).

Stochastic Gradient Descent

In a small mini-batch regime, when \(|B| \ll n\) and typically \(|B| \in \{32,64,\ldots,1024\}\) it was shown that SGD converges faster than the standard gradient descent algorithm
It does converge to minimizers of strongly convex functions
It is robust to noise in the data
It can avoid saddle-points, which is often an issue with deep learning log-likelihood functions.

Implementation

Now, we implement SGD for logistic regression and compare performance for different batch sizes

Code

lrgd_minibatch  = function(x,y, alpha, n_iter, bs) {
  theta <- matrix(c(0, 0), nrow = 2, ncol = n_iter+1)
  n = length(y)
  for (i in 1:n_iter) {
    s = ((i-1)*bs+1)%%n
    e = min(s+bs-1,n)
    xl = x[s:e,]; yl = y[s:e]
    p = 1/(1+exp(-xl %*% theta[,i]))
    grad <- -t(xl) %*% (yl - p)
    # update theta
    theta[,i+1] <- theta[,i] - alpha * grad
  }
  return(theta)
}

Different Batch Sizes

Now run our SGD algorithm with different batch sizes.

Code

set.seed(92) # kuzy
ind = sample(150)
y = ifelse(iris$Species=="setosa",1,0)[ind] # shuffle data
x = cbind(rep(1,150),iris$Sepal.Length)[ind,] # shuffle data
nit=200000
lr = 0.01
th1 = lrgd_minibatch(x,y,lr,nit,5)
th2 = lrgd_minibatch(x,y,lr,nit,15)
th3 = lrgd_minibatch(x,y,lr,nit,30)

Stochastic Gradient Descent

We run it with 2^{5} iterations and the learning rate of 0.01 and plot the values of \(\theta_1\) every 1000 iteration.

Important points

We shuffle the data before using it.
The larger the batch size, the smaller number of iterations are required for convergence
The batch size does not change the number calculations required overall.
Let’s look at the same plot, but scale the x-axis according to the amount of computations

Different Batch Sizes

Code

plot(ind/1000,th1[1,ind], type='l', ylim=c(0,33), col=1, ylab=expression(theta[1]), xlab="Iteration")
abline(h=27.83, lty=2)
lines(ind/1000*3,th2[1,ind], type='l', col=2)
lines(ind/1000*6,th3[1,ind], type='l', col=3)
legend("bottomright", legend=c(5,15,30),col=1:3, lty=1, bty='n',title = "Batch Size")

Different Batch Sizes

There are several important considerations about choosing the batch size for SGD.

The larger the batch size, the more memory is required to store the data.
Parallelization is more efficient with larger batch sizes.
Third, the larger the batch size, the less noise in the gradient.

Feed-Forward ReLu Neural Networks

A simple neural network with one hidden layer
Motivate it using a problem of binary classification on a simulated data set.
We start by generating a simple dataset

Synthetic Data

Code

numSamples = 200 # total number of observations
radius = 10 # radius of the outer circle
noise = 0.0001 # amount of noise to be added to the data
d = matrix(0,ncol = 3, nrow = numSamples); # matrix to store our generated data

# Generate positive points inside the circle.
for (i in 1:(numSamples/2) ) {
  r = runif(1,0, radius * 0.4);
  angle = runif(1,0, 2 * pi);
  x = r * sin(angle);
  y = r * cos(angle);
  noiseX = runif(1,-radius, radius) * noise;
  noiseY = runif(1,-radius, radius) * noise;
  d[i,] = c(0,x,y)
}

# Generate negative points outside the circle.
for (i in (numSamples/2+1):numSamples ) {
  r = runif(1,radius * 0.8, radius);
  angle = runif(1,0, 2 * pi);
  x = r * sin(angle);
  y = r * cos(angle);
  noiseX = runif(1,-radius, radius) * noise;
  noiseY = runif(1,-radius, radius) * noise;
  d[i,] = c(1,x,y)
}
colnames(d) = c("label", "x1", "x2")

Synthetic Data

# Plot the training dataset
plot(d[,2],d[,3], col=d[,1]+2, pch=16, xlab="x1", ylab="x2")

Logistic Regression

Code

x1 = seq(-11,11,length.out = 100)
# Fit a logistic regression model
fit = glm(label~x1+x2, data=as.data.frame(d), family=binomial(link='logit'))
# Plot the training dataset
plot(d[,2],d[,3], col=d[,1]+2, pch=16, xlab="x1", ylab="x2")
th = fit$coefficients
# Plot the decision boundary
abline(-th[1]/th[3], -th[2]/th[3], col=2)

Need More Lines!

Code

plot(x1~x2, data=d,col=d[,1]+2, pch=16)
# Plot lines that separate once class (red) from another (green)
lines(x1, -x1 - 6); text(-4,-3,1)
lines(x1, -x1 + 6); text(4,3,2)
lines(x1,  x1 - 6); text(4,-3,3)
lines(x1,  x1 + 6); text(-3,4,4)

Model

Now, we do the same thing as in simple logistic regression and apply logistic function to each of those lines

Code

# Define sigmoid function
sigmoid  = function(z)
  return(exp(z)/(1+exp(z)))

# Define hidden layer of our neural network
features = function(x1,x2) {
  z1 =  6 + x1 + x2; a1 = sigmoid(z1)
  z2 =  6 - x1 - x2; a2 = sigmoid(z2)
  z3 =  6 - x1 + x2; a3 = sigmoid(z3)
  z4 =  6 + x1 - x2; a4 = sigmoid(z4)
  return(c(a1,a2,a3,a4))
}

Neural Network

Using the matrix notations, we have \[ z = \sigma(Wx + b), ~ W = \begin{bmatrix} 1 & 1 \\ -1 & -1 \\ -1 & 1 \\ 1 & -1 \end{bmatrix}, ~ b = \begin{bmatrix} 6 \\ 6 \\ 6 \\ 6 \end{bmatrix}, ~ \sigma(z) = \frac{1}{1+e^{-z}} \]

The model shown above is the first layer of our neural network. \[ \hat{y} = \sigma(w^Tz + b), ~ w = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, ~ b = -3.1, ~ \sigma(z) = \frac{1}{1+e^{-z}} \]

Output Layer

The output of the output layer is the probability of the positive class.

# Calculate prediction (classification) using our neural network
predict_prob = function(x){
  x1 = x[1]; x2 = x[2]
  z = features(x1,x2)
  # print(z)
  mu = sum(z) - 3.1
  # print(mu)
  sigmoid(mu)
}

Use the model

We can use our model to do the predictions now

Code

# Predict the probability of the positive class for a given point
predict_prob(c(0,0))

[1] 0.7089128

Code

predict_prob(c(0,10))

[1] 0.2565405

The model generates sensible predictions

Decision Boundary

Finally, let’s plot the decision boundary to see how well it separates the data.

Code

x1 = seq(-11,11,length.out = 100)
x2 = seq(-11,11,length.out = 100)
gr = as.matrix(expand.grid(x1,x2));

[1] 10000     2

Code

yhat = apply(gr,1,predict_prob)

[1] 10000

Decision Boundary

Code

# Plot everything together
image(x1,x2,matrix(yhat,ncol = 100), col = heat.colors(20,0.7))
lines(d[,2],d[,3], col=d[,1]+2, pch=16, xlab="x1", ylab="x2", type='p')
lines(x1, -x1 - 6); text(-4,-3,1)
lines(x1, -x1 + 6); text(4,3,2)
lines(x1,  x1 - 6); text(4,-3,3)
lines(x1,  x1 + 6); text(-3,4,4)

Automatic Differentiation (Backpropagation)

Symbolic differentiation uses a tree form representation of a function and applies chain rule to the tree to calculate the symbolic derivative of a given function.
A tree representation of of composition of affine and sigmoid functions (the first layer of our neural network).

Computational graph of the first layer of our neural network

Automatic differentiation

Another way: automatic differentiation (AD).
Similar to symbolic differentiation, AD recursively applies the chain rule and calculates the exact value of derivative
Avoids the problem of numerical instability.
The difference between AD and symbolic differentiation is that AD provides the value of derivative evaluated at a specific point, rather than an analytical representation of the derivative.
AD can differentiate complex functions which involve IF statements and loops, and AD can be implemented using either forward or backward mode.

Code

sigmoid = function(x,b,w){
  v1 = w*x;
  v2 = v1 + b
  v3 = 1/(1+exp(-v2))
}

Automatic Differentiation

Function calculations	Derivative calculations
1. `v1 = w*x = 6`	1. `dv1 = w = 3` (derivative of `v1` with respect to `x`)
2. `v2 = v1 + b = 11`	2. `dv2 = dv1 = 3` (derivative of `v2` with respect to `x`)
3. `v3 = 1/(1+exp(-v2)) = 0.99`	3. `dv3 = eps2exp(-v2)/(1+exp(-v2))*2 = 5e-05`

Variables dv1,dv2,dv3 correspond to partial (local) derivatives of each intermediate variables v1,v2,v3 with respect to \(x\), and are called dual variables.
Tracking for dual variables can either be implemented using source code modification tools that add new code for calculating the dual numbers or via operator overloading.

Automatic Differentiation

The reverse AD also applies chain rule recursively
Starts from the outer function, as shown in Table below.

Function calculations	Derivative calculations
1. `v1 = w*x = 6`	4. `dv1dx =w; dv1 = dv2dv1dx = 31.3e-05=5e-05`
2. `v2 = v1 + b = 11`	3. `dv2dv1 =1; dv2 = dv3*dv2dv1 = 1.3e-05`
3. `v3 = 1/(1+exp(-v2)) = 0.99`	2. `dv3dv2 = exp(-v2)/(1+exp(-v2))**2;`
4. `v4 = v3`	1. `dv4=1`

In DL this is called Back Propagation

Automatic Differentiation in Jax

Code

import jax.numpy as jnp
from jax import grad,jit
import pandas as pd
from jax import random
import matplotlib.pyplot as plt

def abline(slope, intercept):
    """Plot a line from slope and intercept"""
    axes = plt.gca()
    x_vals = jnp.array(axes.get_xlim())
    ylim = axes.get_xlim()
    y_vals = intercept + slope * x_vals
    plt.plot(x_vals, y_vals, '-'); plt.ylim(ylim)

d = pd.read_csv('circle.csv').values
x = d[:, 1:3]; y = d[:, 0]
def sigmoid(x):
    return 1 / (1 + jnp.exp(-x))
def predict(x, w1,b1,w2,b2):
    z = sigmoid(jnp.dot(x, w1)+b1)
    return sigmoid(jnp.dot(z, w2)+b2)[:,0]
def nll(x, y, w1,b1,w2,b2):
    yhat = predict(x, w1,b1,w2,b2)
    return -jnp.sum(y * jnp.log(yhat) + (1 - y) * jnp.log(1 - yhat))
@jit
def sgd_step(x, y, w1,b1,w2,b2, lr):
    grads = grad(nll,argnums=[2,3,4,5])(x, y, w1,b1,w2,b2)
    return w1 - lr * grads[0],b1 - lr * grads[1],w2 - lr * grads[2],b2 - lr * grads[3]
def accuracy(x, y, w1,b1,w2,b2):
    y_pred = predict(x, w1,b1,w2,b2)
    return jnp.mean((y_pred > 0.5) == y)
k = random.PRNGKey(0)
w1 = 0.1*random.normal(k,(2,4))
b1 = 0.01*random.normal(k,(4,))
w2 = 0.1*random.normal(k,(4,1))
b2 = 0.01*random.normal(k,(1,))

for i in range(1000):
    w1,b1,w2,b2 = sgd_step(x,y,w1,b1,w2,b2,0.003)
print(accuracy(x,y,w1,b1,w2,b2))

1.0

Code

fig, ax = plt.subplots()
ax.scatter(x[:,0], x[:,1], c=['r' if x==1 else 'g' for x in y],s=7); plt.xlabel("x1"); plt.ylabel("x2"); plt.xlim(-10,10)

<matplotlib.collections.PathCollection object at 0x12319a510>
Text(0.5, 0, 'x1')
Text(0, 0.5, 'x2')
(-10.0, 10.0)

Code

ax.spines['top'].set_visible(False)
# plt.scatter((x[:,1]*w1[1,0] - b1[0])/w1[0,0], x[:,1])
abline(w1[1,0]/w1[0,0],b1[0]/w1[0,0])
abline(w1[1,1]/w1[0,1],b1[1]/w1[0,1])
abline(w1[1,2]/w1[0,2],b1[2]/w1[0,2])
abline(w1[1,3]/w1[0,3],b1[3]/w1[0,3])
plt.show()

Discussion

Model estimation procedure and demonstrated that DL is an extension of a generalized linear model.
One goal of statistics is to build predictive models along with uncertainty and to develop an understanding of the data generating mechanism.
Data models are well studied in statistical literature, but often do not provide enough flexibility to learn the input-output relations.
Closed box predictive rules, such as trees and neural networks, are more flexible learners.
DL models have been almost exclusively used for problems of image analysis and natural language processing, more traditional data sets, which arise in finance, science and engineering, such as spatial and temporal data can be efficiently analyzed using deep learning.