Review of deep learning methods which provide insight into structured high-dimensional data.
Deep learning uses layers of semi-affine input transformations to provide a predictive rule.
Applying these layers of transformations leads to a set of attributes (or, features)
To which probabilistic statistical methods can be applied.
The best of both worlds can be achieved: scalable prediction rules fortified with uncertainty quantification, where sparse regularization finds the features.
Why DL?
Deep learning provides a powerful pattern matching tool suitable for many AI applications.
Image recognition and text analysis are probably two of the deep learning’s most successful.
An image or a text as a high dimensional matrices and vectors, respectively.
What is DL?
In general, a neural network can be described as follows. Let \(f_1 , \ldots , f_L\) be given univariate activation functions for each of the \(L\) layers.
Activation functions are nonlinear transformations of weighted data.
A semi-affine activation rule is then defined by \[
f_l^{W,b} = f_l \left ( \sum_{j=1}^{N_l} W_{lj} X_j + b_l \right ) = f_l ( W_l X_l + b_l )\,,
\]
Similar to a classic basis decomposition, the deep approach uses univariate activation functions to decompose a high dimensional \(X\).
Let \(Z^{(l)}\) denote the \(l\)th layer, and so \(X = Z^{(0)}\). The final output \(Y\) can be numeric or categorical. The explicit structure of a deep prediction rule is then \[
\begin{aligned}
\hat{Y} (X) & = W^{(L)} Z^{(L)} + b^{(L)} \\
Z^{(1)} & = f^{(1)} \left ( W^{(0)} X + b^{(0)} \right ) \\
Z^{(2)} & = f^{(2)} \left ( W^{(1)} Z^{(1)} + b^{(1)} \right ) \\
\ldots & \\
Z^{(L)} & = f^{(L)} \left ( W^{(L-1)} Z^{(L-1)} + b^{(L-1)} \right )\,.
\end{aligned}
\] Here \(W^{(l)}\) is a weight matrix and \(b^{(l)}\) are threshold or activation levels. Designing a good predictor depends crucially on the choice of univariate activation functions \(f^{(l)}\). The \(Z^{(l)}\) are hidden features which the algorithm will extract.
Deep approach employs hierarchical predictors comprising of a series of \(L\) nonlinear transformations applied to \(X\).
Deep Learning and Least Squares
The goal is to find the optimal value of \(\theta\) that minimizes the expected loss function
Given a training data set \(D = \{x_i,y_i\}_{i=1}^n\).
The loss function is a measure of discrepancy between the true value of \(y\) and the predicted value \(f(x,\theta)\).
Model Estimation
Deep learning as well as a large number of statistical problems, can be expressed in the form \[
\min l(x) + \phi(x).
\] In learning \(l(x)\) is the negative log-likelihood and \(\phi(x)\) is a penalty function that regularizes the estimate. From the Bayesian perspective, the solution to this problem may be interpreted as a maximum a posteriori \[p(y\mid x) \propto \exp\{-l(x)\}, ~ p(x) \propto \exp\{-\phi(x)\}.\]
The loss function
The loss function is usually defined as the negative log-likelihood function of the model. \[
l(\theta) = - \sum_{i=1}^n \log p(y_i | x_i, \theta),
\] where \(p(y_i | x_i, \theta)\) is the conditional distribution of \(y_i\) given \(x_i\) and \(\theta\).
The loss function
Thus, in the case of regression, we have \[
y_i = f(x_i,\theta) + \epsilon, ~ \epsilon \sim N(0,\sigma^2),
\] Thus, the loss function is \[
l(\theta) = - \sum_{i=1}^n \log p(y_i | x_i, \theta) = \sum_{i=1}^n (y_i - f(x_i, \theta))^2,
\]
Optimisation
Second order optimisation algorithms, such as BFGS used for traditional statistical models do not work well for deep learning models.
The number of parameters a DL model has is large and estimating second order derivatives (Hessian or Fisher information matrix) becomes prohibitive from both computational and memory use standpoints.
Instead, first order gradient descent methods are used for estimating parameters of a deep learning models.
Optimisation
The problem of parameter estimation (when likelihood belongs to the exponential family) is an optimisation problem
\[
\min_{\theta} l(\theta) := \dfrac{1}{n} \sum_{i=1}^n \log p(y_i, f(x_i, \theta))
\] where \(l\) is the negative log-likelihood of a sample, and \(\theta\) is the vector of parameters.
Optimisation: Gradient Descent
Iterative algorithm
Starts with an initial guess \(\theta^{0}\)
Updates the parameter vector \(\theta\) at each iteration \(t\) as follows:
\[
\theta^{t+1} = \theta^t - \alpha_t \nabla l(\theta^t).
\] - \(\alpha_t\) is the learning rate (step size). In DL model estimation it is a first class citizen parameter.
Gradient Descent for Linear Regression
\[
y_i = \theta_0 + \theta_1 x_i + \epsilon_i,
\] or in matrix form \[
y = X \theta + \epsilon,
\]
\(\epsilon_i \sim N(0, \sigma^2)\)
\(X\) is the design matrix
Gradient Descent
The negative log-likelihood function \[
l(\theta) = \sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i)^2.
\] The gradient then is \[
\nabla l(\theta) = -2 \sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i) \begin{pmatrix} 1 \\ x_i \end{pmatrix}.
\] In matrix form, we have \[
\nabla l(\theta) = -2 X^T (y - X \theta).
\]
Regression
Regression is simply a neural network which is wide and shallow.
Code
data(iris)y = iris$Petal.Lengthx =cbind(rep(1,150),iris$Petal.Width)# initialize thetatheta <-matrix(c(0, 0), nrow =2, ncol =1)# learning ratealpha <-0.0001# number of iterationsn_iter <-1000# gradient descentfor (i in1:n_iter) {# compute gradient grad <--2*t(x) %*% (y - x %*% theta)# update theta theta <- theta - alpha * grad}
Our gradient descent finds the following coefficients
Intercept (\(\theta_1\))
Petal.Width (\(\theta_2\))
1.084125
2.229565
Plot
Let’s plot the data and model estimated using gradient descent
Let’s compare it to the standard estimation algorithm
Code
glm(y~x-1, family=binomial(link="logit"))
Call: glm(formula = y ~ x - 1, family = binomial(link = "logit"))
Coefficients:
x1 x2
27.829 -5.176
Degrees of Freedom: 150 Total (i.e. Null); 148 Residual
Null Deviance: 207.9
Residual Deviance: 71.84 AIC: 75.84
Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a variant of the gradient descent algorithm. \[
\nabla l(\theta) \approx \dfrac{1}{|B|} \sum_{i \in B} \nabla l(y_i, f(x_i, \theta)),
\] where \(B \in \{1,2,\ldots,n\}\) is the batch samples from the data set. This method can be interpreted as gradient descent using noisy gradients, which are typically called mini-batch gradients with batch size \(|B|\).
Stochastic Gradient Descent
In a small mini-batch regime, when \(|B| \ll n\) and typically \(|B| \in \{32,64,\ldots,1024\}\) it was shown that SGD converges faster than the standard gradient descent algorithm
It does converge to minimizers of strongly convex functions
It is robust to noise in the data
It can avoid saddle-points, which is often an issue with deep learning log-likelihood functions.
Implementation
Now, we implement SGD for logistic regression and compare performance for different batch sizes
Code
lrgd_minibatch =function(x,y, alpha, n_iter, bs) { theta <-matrix(c(0, 0), nrow =2, ncol = n_iter+1) n =length(y)for (i in1:n_iter) { s = ((i-1)*bs+1)%%n e =min(s+bs-1,n) xl = x[s:e,]; yl = y[s:e] p =1/(1+exp(-xl %*% theta[,i])) grad <--t(xl) %*% (yl - p)# update theta theta[,i+1] <- theta[,i] - alpha * grad }return(theta)}
Different Batch Sizes
Now run our SGD algorithm with different batch sizes.
There are several important considerations about choosing the batch size for SGD.
The larger the batch size, the more memory is required to store the data.
Parallelization is more efficient with larger batch sizes.
Third, the larger the batch size, the less noise in the gradient.
Feed-Forward ReLu Neural Networks
A simple neural network with one hidden layer
Motivate it using a problem of binary classification on a simulated data set.
We start by generating a simple dataset
Synthetic Data
Code
numSamples =200# total number of observationsradius =10# radius of the outer circlenoise =0.0001# amount of noise to be added to the datad =matrix(0,ncol =3, nrow = numSamples); # matrix to store our generated data# Generate positive points inside the circle.for (i in1:(numSamples/2) ) { r =runif(1,0, radius *0.4); angle =runif(1,0, 2* pi); x = r *sin(angle); y = r *cos(angle); noiseX =runif(1,-radius, radius) * noise; noiseY =runif(1,-radius, radius) * noise; d[i,] =c(0,x,y)}# Generate negative points outside the circle.for (i in (numSamples/2+1):numSamples ) { r =runif(1,radius *0.8, radius); angle =runif(1,0, 2* pi); x = r *sin(angle); y = r *cos(angle); noiseX =runif(1,-radius, radius) * noise; noiseY =runif(1,-radius, radius) * noise; d[i,] =c(1,x,y)}colnames(d) =c("label", "x1", "x2")
Synthetic Data
# Plot the training datasetplot(d[,2],d[,3], col=d[,1]+2, pch=16, xlab="x1", ylab="x2")
Logistic Regression
Code
x1 =seq(-11,11,length.out =100)# Fit a logistic regression modelfit =glm(label~x1+x2, data=as.data.frame(d), family=binomial(link='logit'))# Plot the training datasetplot(d[,2],d[,3], col=d[,1]+2, pch=16, xlab="x1", ylab="x2")th = fit$coefficients# Plot the decision boundaryabline(-th[1]/th[3], -th[2]/th[3], col=2)
Need More Lines!
Code
plot(x1~x2, data=d,col=d[,1]+2, pch=16)# Plot lines that separate once class (red) from another (green)lines(x1, -x1 -6); text(-4,-3,1)lines(x1, -x1 +6); text(4,3,2)lines(x1, x1 -6); text(4,-3,3)lines(x1, x1 +6); text(-3,4,4)
Model
Now, we do the same thing as in simple logistic regression and apply logistic function to each of those lines
Using the matrix notations, we have \[
z = \sigma(Wx + b), ~ W = \begin{bmatrix} 1 & 1 \\ -1 & -1 \\ -1 & 1 \\ 1 & -1 \end{bmatrix}, ~ b = \begin{bmatrix} 6 \\ 6 \\ 6 \\ 6 \end{bmatrix}, ~ \sigma(z) = \frac{1}{1+e^{-z}}
\]
The model shown above is the first layer of our neural network. \[
\hat{y} = \sigma(w^Tz + b), ~ w = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, ~ b = -3.1, ~ \sigma(z) = \frac{1}{1+e^{-z}}
\]
Output Layer
The output of the output layer is the probability of the positive class.
# Calculate prediction (classification) using our neural networkpredict_prob =function(x){ x1 = x[1]; x2 = x[2] z =features(x1,x2)# print(z) mu =sum(z) -3.1# print(mu)sigmoid(mu)}
Use the model
We can use our model to do the predictions now
Code
# Predict the probability of the positive class for a given pointpredict_prob(c(0,0))
[1] 0.7089128
Code
predict_prob(c(0,10))
[1] 0.2565405
The model generates sensible predictions
Decision Boundary
Finally, let’s plot the decision boundary to see how well it separates the data.
Symbolic differentiation uses a tree form representation of a function and applies chain rule to the tree to calculate the symbolic derivative of a given function.
A tree representation of of composition of affine and sigmoid functions (the first layer of our neural network).
Computational graph of the first layer of our neural network
Automatic differentiation
Another way: automatic differentiation (AD).
Similar to symbolic differentiation, AD recursively applies the chain rule and calculates the exact value of derivative
Avoids the problem of numerical instability.
The difference between AD and symbolic differentiation is that AD provides the value of derivative evaluated at a specific point, rather than an analytical representation of the derivative.
AD can differentiate complex functions which involve IF statements and loops, and AD can be implemented using either forward or backward mode.
1. dv1 = w = 3 (derivative of v1 with respect to x)
2. v2 = v1 + b = 11
2. dv2 = dv1 = 3 (derivative of v2 with respect to x)
3. v3 = 1/(1+exp(-v2)) = 0.99
3. dv3 = eps2*exp(-v2)/(1+exp(-v2))**2 = 5e-05
Variables dv1,dv2,dv3 correspond to partial (local) derivatives of each intermediate variables v1,v2,v3 with respect to \(x\), and are called dual variables.
Tracking for dual variables can either be implemented using source code modification tools that add new code for calculating the dual numbers or via operator overloading.
Automatic Differentiation
The reverse AD also applies chain rule recursively
Starts from the outer function, as shown in Table below.
Function calculations
Derivative calculations
1. v1 = w*x = 6
4. dv1dx =w; dv1 = dv2*dv1dx = 3*1.3e-05=5e-05
2. v2 = v1 + b = 11
3. dv2dv1 =1; dv2 = dv3*dv2dv1 = 1.3e-05
3. v3 = 1/(1+exp(-v2)) = 0.99
2. dv3dv2 = exp(-v2)/(1+exp(-v2))**2;
4. v4 = v3
1. dv4=1
In DL this is called Back Propagation
Automatic Differentiation in Jax
Code
import jax.numpy as jnpfrom jax import grad,jitimport pandas as pdfrom jax import randomimport matplotlib.pyplot as pltdef abline(slope, intercept):"""Plot a line from slope and intercept""" axes = plt.gca() x_vals = jnp.array(axes.get_xlim()) ylim = axes.get_xlim() y_vals = intercept + slope * x_vals plt.plot(x_vals, y_vals, '-'); plt.ylim(ylim)d = pd.read_csv('circle.csv').valuesx = d[:, 1:3]; y = d[:, 0]def sigmoid(x):return1/ (1+ jnp.exp(-x))def predict(x, w1,b1,w2,b2): z = sigmoid(jnp.dot(x, w1)+b1)return sigmoid(jnp.dot(z, w2)+b2)[:,0]def nll(x, y, w1,b1,w2,b2): yhat = predict(x, w1,b1,w2,b2)return-jnp.sum(y * jnp.log(yhat) + (1- y) * jnp.log(1- yhat))@jitdef sgd_step(x, y, w1,b1,w2,b2, lr): grads = grad(nll,argnums=[2,3,4,5])(x, y, w1,b1,w2,b2)return w1 - lr * grads[0],b1 - lr * grads[1],w2 - lr * grads[2],b2 - lr * grads[3]def accuracy(x, y, w1,b1,w2,b2): y_pred = predict(x, w1,b1,w2,b2)return jnp.mean((y_pred >0.5) == y)k = random.PRNGKey(0)w1 =0.1*random.normal(k,(2,4))b1 =0.01*random.normal(k,(4,))w2 =0.1*random.normal(k,(4,1))b2 =0.01*random.normal(k,(1,))for i inrange(1000): w1,b1,w2,b2 = sgd_step(x,y,w1,b1,w2,b2,0.003)print(accuracy(x,y,w1,b1,w2,b2))
1.0
Code
fig, ax = plt.subplots()ax.scatter(x[:,0], x[:,1], c=['r'if x==1else'g'for x in y],s=7); plt.xlabel("x1"); plt.ylabel("x2"); plt.xlim(-10,10)
Model estimation procedure and demonstrated that DL is an extension of a generalized linear model.
One goal of statistics is to build predictive models along with uncertainty and to develop an understanding of the data generating mechanism.
Data models are well studied in statistical literature, but often do not provide enough flexibility to learn the input-output relations.
Closed box predictive rules, such as trees and neural networks, are more flexible learners.
DL models have been almost exclusively used for problems of image analysis and natural language processing, more traditional data sets, which arise in finance, science and engineering, such as spatial and temporal data can be efficiently analyzed using deep learning.