OLS, logistic regression & gradient descent

Author

Tim

Comparing OLS and gradient descent

Canonical multivariate OLS puts predictors (dimensions) in columns and samples in rows. When the response is also multivariate, we assume that the dimensions of the response are independent of each other. Let’s simulate some data to enforce a ground truth and verify that gradient descent arrives there.

Simulate gene expression

Since we are dealing with counts, we will model them with Poisson emissions. We will start by simulating a matrix of counts from 1000 cells for 5 genes.

X0 <- rbind(gene1=c(rpois(n=500, lambda=5), rpois(n=500, lambda=0.1)),
            gene2=rpois(n=1000, lambda=1),
            gene3=c(rpois(n=400, lambda=0.1), rpois(n=600, lambda=2)),
            gene4=rpois(n=1000, lambda=3),
            gene5=c(rpois(n=300, lambda=0.1), rpois(n=700, lambda=10)))
colnames(X0) <- paste0("cell", seq_len(ncol(X0)))
X0 <- t(X0)

Log-normalize to decouple mean and variance for Poisson random variables

Normalize the counts from each cell, with a target of 10K fragments:

X <- round(sweep(X0, 1, rowSums(X0) + 1, `/`) * 10000)

Take logs, as the raw counts are Poisson and have a mean-variance dependency. log1p adds a pseudo-count of 1 so that the result is always non-negative.

X <- log1p(X)

Add an intercept column.

This is the easiest way to set the mean in model fitting.

X <- cbind(intercept=1, X)

Simulate protein abundance

Let’s assume protein abundance follows linear combinations of gene abundance. Since we are enforcing a particular generating model, we can see how well we recover it for each protein by ordinary least squares or gradient descent. In some cases the protein is constitutively expressed at a low level, in which case the intercept coefficient is nonzero. We also create coefs for each gene:

B <- cbind(protein1=c(intercept=0.001,
                      gene1=0.25,
                      gene2=0,
                      gene3=0.5,
                      gene4=0,
                      gene5=0),
           protein2=c(intercept=0,
                      gene1=0.5,
                      gene2=0,
                      gene3=0,
                      gene4=0,
                      gene5=0.25),
           protein3=c(intercept=0,
                      gene1=0,
                      gene2=0,
                      gene3=0.25,
                      gene4=0,
                      gene5=0.5))

Observe that \(y_j = \beta_0 + \beta_{gene1}X_{gene1} + ... + \beta_{gene5}X_{gene5}\) for each protein \(y_j\), \(j \in \{1, 2, 3\}\).

This is exactly equivalent to saying \(y_j = X\beta_j\) and thus we will later estimate \(\beta_j\) as \(\widehat{\beta_j}\).
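As a quick sanity check (a sketch; matrix_form and by_hand are throwaway names, and only protein1’s nonzero coefficients are spelled out), the matrix form reproduces the term-by-term sum:

# sketch: column "protein1" of X %*% B equals the weighted sum of its gene columns
matrix_form <- as.vector(X %*% B[, "protein1"])
by_hand <- B["intercept", "protein1"] * X[, "intercept"] +
           B["gene1", "protein1"] * X[, "gene1"] +
           B["gene3", "protein1"] * X[, "gene3"]   # the only nonzero coefficients
stopifnot(all.equal(matrix_form, unname(by_hand)))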

Now we have what we need to generate a matrix of protein abundance. We assume ADT counts are Poisson, so we will reverse our variance-stabilizing transformation to yield Y0, the “raw” counts.

Y0 <- round(exp(X %*% B))

For regression, we will want to use the variance stabilized version, of course.

Y <- log1p(Y0)

How does this look?

heatmap(t(Y))

Fit coefficients by ordinary least squares

The Gauss-Markov theorem establishes that the best linear unbiased estimator for the coefficients of a linear system is the ordinary least squares (or OLS for short) solution. Specifically, if

\(y = X\beta + \epsilon\), with \(\epsilon \sim N(0, \sigma^2 I)\)

then for an N by K matrix \(X\) and K by 1 matrix \(\beta\),

\(\widehat\beta = (X^{T}X)^{-1}X^{T}y\) minimizes \(\sum_{i=1}^{N}(\widehat{y_i} - y_i)^{2}\),

and \(\widehat\beta\) is the best linear unbiased estimator of \(\beta\).

(It turns out that we can use the QR decomposition for a more stable inverse.)

If \(y\) is an N by M matrix, we solve for each of its M columns (they all share the same \((X^{T}X)^{-1}X^{T}\), so this can be done in a single matrix product), populating a K by M matrix \(\widehat\beta\). Either way, as long as the mild Gauss-Markov conditions are satisfied, \(\widehat\beta\) will minimize the mean squared error, so it should also be the solution found by gradient descent if \((y - \hat{y})^{2}\) is the cost function.

A visual explainer can be found at setosa.io. The OLS estimate is also the maximum likelihood estimator (MLE) for a linear regression, and a useful benchmark all around. Let’s make it a function, and use the more numerically stable QR decomposition to handle the inversion.

ols <- function(X, y) solve(crossprod(X)) %*% t(X) %*% y
ols_qr <- function(X, Y) qr.solve(X, Y)
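For the curious, here is a hedged sketch of what the QR route does under the hood (the name ols_qr_explicit is mine and is not used again): factor X into Q and R, then back-substitute, so no explicit inverse is ever formed. It should agree with qr.solve(X, Y) up to numerical noise.

# sketch: X[, pivot] = QR, so X'X b = X'y reduces to R b = Q'y,
# which back-substitution solves without forming an explicit inverse
ols_qr_explicit <- function(X, Y) {
  decomp <- qr(X)
  beta <- backsolve(qr.R(decomp), crossprod(qr.Q(decomp), Y))
  beta[decomp$pivot, ] <- beta      # undo any column pivoting
  rownames(beta) <- colnames(X)
  colnames(beta) <- colnames(Y)
  beta
}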

Since `qr.solve` accepts a matrix right-hand side, solving for all of the columns of Y at once, we do just that.

# Solve for X, Y:
B_ols <- ols(X, Y)
B_qr <- ols_qr(X, Y)
show(B_ols - B_qr)
               protein1      protein2      protein3
intercept -6.072920e-14  1.209310e-13 -1.038059e-13
gene1      2.053913e-15 -2.886580e-15  7.639722e-15
gene2     -6.964915e-16 -2.185535e-15 -9.217887e-16
gene3      1.221245e-15 -2.984375e-15 -5.551115e-17
gene4      4.420726e-15 -1.000982e-14  3.419140e-15
gene5      3.150258e-15  1.165734e-15  1.182388e-14
heatmap(t(B_qr))
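As another benchmark (a sketch; B_lm is my own name), base R’s lm() fits the same multivariate regression in one call, and its coefficients should match B_qr up to numerical noise:

# X already carries its own intercept column, so suppress lm()'s with `0 +`
B_lm <- coef(lm(Y ~ 0 + X))
rownames(B_lm) <- colnames(X)   # lm() prefixes the coefficient names with "X"
all.equal(B_lm, B_qr, check.attributes = FALSE)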

We can see how we did by computing the root mean squared error (RMSE), \(\sqrt{\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}(\widehat{Y}_{ij} - Y_{ij})^{2}}\).

rmse <- function(yhat, y) sqrt(sum((yhat - y)**2)/(nrow(y) * ncol(y)))
rmse(Y, X %*% B_qr)
[1] 0.07539937

Not too awful, all things considered. Now let’s try the same via gradient descent.

Gradient descent

Linear regression can be solved exactly via the QR decomposition, so typically we would not bother with gradient descent. However, the minute you step outside the classical assumptions (homoskedastic, uncorrelated errors; independent observations), the closed-form solution starts to break down. One way or another, we need to start from a “guess” (an initial value) and proceed towards a minimum-error estimate of the true parameters. The cost function can be the negative log of the joint probability of the data given the proposed model parameters, in which case minimizing it yields the maximum likelihood estimate, or it can be the error sum of squares, \(\sum^n_{i=1}{(X^{T}_i\beta - y_i)^2}\).

This process requires a step size \(\alpha\) and an update informed by the gradient of the cost with respect to the coefficients. The vector of partial derivatives for each predictor forms the gradient \(\nabla f(\beta | X, y)\), where \(f(\beta)\) is the MSE. The goal is to proceed as quickly as possible in the direction that reduces the cost. For OLS, \(\nabla f(\beta) = \frac{\partial}{\partial\beta}\frac{1}{n}\sum^n_{i=1}{(X^{T}_i\beta - y_i)^2} = \frac{2}{n}X^{T}(X\beta - y)\), and we update so that \(\beta_{t+1} = \beta_{t} - \alpha \nabla f(\beta_t)\).
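Before composing the routine, we can sanity-check that gradient formula against central finite differences. This is a throwaway sketch (grad_mse, num_grad, b_try, and f_mse are illustrative names not used later); note that the delta() helper below drops the constant factor of 2, which merely rescales \(\alpha\).

# sketch: compare the analytical gradient of the MSE to central finite differences
grad_mse <- function(b, X, y) 2 * t(X) %*% (X %*% b - y) / length(y)
num_grad <- function(f, b, h = 1e-6) {
  vapply(seq_along(b), function(k) {
    bp <- b; bp[k] <- bp[k] + h
    bm <- b; bm[k] <- bm[k] - h
    (f(bp) - f(bm)) / (2 * h)
  }, numeric(1))
}
b_try <- rnorm(ncol(X))
f_mse <- function(b) mean((X %*% b - Y[, "protein1"])^2)
# the two should agree to within finite-difference error, i.e. be very near zero
max(abs(as.vector(grad_mse(b_try, X, Y[, "protein1"])) - num_grad(f_mse, b_try)))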

A visual explainer for this can also be found at setosa.io. Since the OLS estimate is the maximum likelihood estimator (MLE) for a linear regression, it serves as a benchmark for our gradient descent function (as would base R’s glm()). We compose gradient descent from smaller functions.

# estimate of y from X and b
yhat <- function(X, b) X %*% b

# error of the estimate from X and b
eps <- function(y, X, b) yhat(X, b) - y

# mean squared error between two vectors (or matrices)
mse <- function(y1, y0) mean((y1 - y0)**2)

# root mean square of the change in coefficient estimates (convergence check)
rmsb <- function(b1, b0) sqrt(mse(b1, b0))

# gradient of the MSE (up to a constant factor) at the current guess `b`
delta <- function(y, X, b) drop(t(X) %*% eps(y, X, b)) / length(y)

# update function: step along this vector with stride length alpha
update <- function(y, X, b, alpha) return(b - (delta(y, X, b) * alpha))

# log an iteration with its update 
logupdate <- function(y, X, b, b0) c(b, mse=mse(yhat(X,b), y), rmsb=rmsb(b, b0))

# gradient descent function, applying all of the above bits 
# note that we are not scaling the features, which would make it faster
ols_gd <- function(y, X, alpha=1e-2, iter=1000, tol=1e-5, verbose=FALSE) {

  b0 <- rep(0, ncol(X))
  for (i in seq_len(iter)) {
    b <- update(y, X, b0, alpha)
    results <- logupdate(y, X, b, b0)
    names(results)[seq_along(b)] <- names(b)
    if (i == 1) {
      updates <- t(as.matrix(results))
    } else {
      updates <- rbind(updates, results)
    }
    if (verbose) message("Iteration ", i, ", ",
                         "MSE: ", updates[i, "mse"], ", ",
                         "RMS(b): ", updates[i, "rmsb"])
    if (updates[i, "rmsb"] < tol) {
      if (verbose) message("Converged after ", i, " iterations.")
      break
    }
    b0 <- b
  }
  updates <- data.frame(updates)
  updates$iter <- seq_len(nrow(updates))
  class(updates) <- "updates"
  attr(b, "updates") <- updates
  class(b) <- "results"
  return(b)

}

# test it 
invisible(apply(Y, 2, ols_gd, X))

It will also be handy to automate plotting for the resulting update logs.

# add plot function for updates
# (note: `col=j` is resolved in the calling environment, i.e. the loop index below)
plot.updates <- function(upd, ...) {
  with(upd, plot(iter, mse, type="b", col=j, lwd=3, ...))
}

# wrap it for results 
plot.results <- function(res, ...) plot(attr(res, "updates"), ...)

Now we can compare the gradient descent estimates with the stabilized OLS solution from before, which used the QR decomposition to avoid numerical trouble when inverting `t(X) %*% X`, and see whether we get about the same result.

# set up plots
par(mfrow=c(2, 3))

# plot convergence for each protein
for (j in seq_len(ncol(Y))) plot(ols_gd(Y[, j], X), main=colnames(Y)[j])

# run per column
B_gd <- apply(Y, 2, ols_gd, X)
rownames(B_gd) <- colnames(X)

# plot B, B_qr, and B_gd
responses <- colnames(Y)
estimates <- c("B", "B_qr", "B_gd")
for (j in seq_along(responses)) {
  protein <- responses[j]
  allests <- cbind(B, B_gd, B_qr)
  minest <- min(allests)
  maxest <- max(allests)
  plot(c(), main=protein,
       xlim=c(0, ncol(X) - 1), xlab="gene",
       ylim=c(minest, maxest), ylab="coef")
  for (k in seq_along(estimates)) {
    method <- estimates[k]
    est <- get(method)[, protein]
    points(cbind(x=seq_along(colnames(X))-1, y=est), pch=k, col=j)
  }
}

We do indeed. So why bother with gradient descent?

Maximum likelihood estimation

Suppose our objective is not to predict Y, but rather the probability that Y = 1. This immediately breaks the assumptions that prop up the Gauss-Markov theorem, and it comes up regularly when Y is an indicator for, e.g., membership in a group. Ordinary least squares is useless here: the linear predictor is unbounded, the errors cannot be homoskedastic, and on the log-odds scale the residuals can be infinite! A general solution to this problem is maximum likelihood.

(Trivia: training a one-layer neural net with a sigmoid output and cross-entropy loss is equivalent to logistic regression; swap in a softmax output and you get its multinomial generalization.)

Logistic regression

The canonical link for regression of Pr(y==1) on X is logit(p), defined as log(p / (1 - p)). Several good reasons exist for this choice (a short sketch of the link and its inverse follows the list):

  • logit(p), the log-odds of seeing Y = 1 given p, ranges from -Inf to +Inf

  • expit(z), the inverse of logit, ranges from 0 to 1 and thus maps XB to p

  • the first and second derivatives of the resulting log-likelihood are easy to calculate.
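Here is that short sketch; the definitions mirror the expit we use later when simulating, and probs is just a throwaway grid.

# sketch: the logit link and its inverse
logit <- function(p) log(p / (1 - p))
expit <- function(z) exp(z) / (1 + exp(z))
probs <- seq(0.01, 0.99, by = 0.01)
stopifnot(all.equal(expit(logit(probs)), probs))   # expit maps the log-odds back to (0, 1)
range(logit(probs))                                # heads toward -Inf/+Inf as p nears 0 or 1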

The (log) likelihood function

What is likelihood? It’s the joint probability of seeing the data we saw, given the model we propose. For a logistic regression, that model is a Bernoulli distribution.

If \(Y \sim Bernoulli(\pi)\), then the likelihood is \(L(\pi|y) = \prod^n_{i=1}{Pr(y_i|\pi_i)}\).

This function is easier to compute as the log-likelihood, \(l(\pi|y) = \sum^n_{i=1}{\log Pr(y_i|\pi_i)}\).

For \(l(\beta)\) with this canonical link, the vector of first partial derivatives (the score) \(\frac{\partial{l(\beta)}}{\partial{\beta}}\) is \(X^{T}(y - \hat{p})\), where \(\hat{p} = \frac{e^\eta}{1 + e^\eta}\) and \(\eta = X\widehat{\beta}\), and the matrix of second partial derivatives \(\frac{\partial^2{l(\beta)}}{\partial{\beta}\partial{\beta^{T}}}\) is \(-X^{T}WX\), where \(W = diag(\pi_i(1-\pi_i))\). Note that, through \((X^{T}WX)^{-1}\), the updates end up scaled inversely to these variance terms.
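As a sketch, those pieces translate directly into R. The names loglik_bern, score_bern, and info_bern are mine and are not used by the fitting function below.

# Bernoulli log-likelihood, score (first derivatives), and information
# (negative second derivatives) for a coefficient vector b, as derived above
loglik_bern <- function(y, X, b) {
  p <- 1 / (1 + exp(-(X %*% b)))            # expit(X %*% b)
  sum(y * log(p) + (1 - y) * log(1 - p))
}
score_bern <- function(y, X, b) {
  p <- 1 / (1 + exp(-(X %*% b)))
  t(X) %*% (y - p)                          # X'(y - p_hat)
}
info_bern <- function(X, b) {
  w <- as.vector(1 / (1 + exp(-(X %*% b))))
  w <- w * (1 - w)                          # diag(W) = pi_i * (1 - pi_i)
  t(X) %*% (X * w)                          # X' W X without building diag(W)
}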

With these two pieces, we can start iterating towards a numeric solution.

Our very own binary outcome

Let’s make this concrete by generating a binary outcome from the protein abundances in our previous example. Suppose the odds of a cell being malignant given \((intercept, protein1, protein2, protein3)\) are \((0, -2, 3, -1)\).

# log-odds coefs
bl <- c(0, -2, 3, -1)

# add an intercept
yy <- cbind(intercept=1, Y)

# untransformed 
eta <- yy %*% bl

# transformation
expit <- function(x) exp(x) / (1 + exp(x))

# transformed 
pee <- expit(eta)

# cheezy but useful helper function
rbernoulli <- function(...) rbinom(size=1, ...)

# simulate the response vector
yl <- rbernoulli(length(pee), pee)
names(yl) <- rownames(Y)

# tabulate: 
table(yl)
yl
  0   1 
471 529 

Now let’s solve for the coefficients numerically, specifically by Fisher scoring (a Newton-type cousin of gradient descent). For the canonical logit link, Fisher scoring produces the same updates as Newton-Raphson iteration, with the bonus feature of providing the covariance matrix of the coefficient estimates at convergence: it is the inverse of t(X) %*% W %*% X, and thus easily obtained with solve.

Logistic regression via gradient descent

lr_gd <- function(y, X, iter=1000, tol=1e-5) {

  y <- as.matrix(y)
  pi_i <- function(eta) return(exp(eta)/(1 + exp(eta)))

  # starting params 
  b0 <- matrix(0, ncol=1, nrow=ncol(X))
  p0 <- pi_i(X %*% b0)
  W0 <- diag(as.vector(p0 * (1 - p0)))
  I0 <- t(X) %*% W0 %*% X
  U0 <- t(X) %*% (y - p0)
  I <- I0
  U <- U0

  # iterate 
  for (i in seq_len(iter)) {
    b <- b0 + (solve(I) %*% U)
    if (all(abs(b - b0) < tol)) break
    p <- pi_i(X %*% b)
    W <- diag(as.vector(p * (1 - p)))
    I <- t(X) %*% W %*% X
    U <- t(X) %*% (y - p)
    b0 <- b
  }

  attr(b, "fitted") <- pi_i(X %*% b)
  attr(b, "vcov") <- solve(I)
  attr(b, "iter") <- i
  class(b) <- "results"
  return(b)

}

Add a couple convenience functions per usual:

print.results <- function(x, ...) print(x[,])
vcov.results <- function(object, ...) attr(object, "vcov")

Once again we will test out our function:

# test it 
bl_hat <- lr_gd(yl, yy)
bl_hat
 intercept   protein1   protein2   protein3 
-0.5942055 -2.1259283  3.1038991 -0.9097270 
bl
[1]  0 -2  3 -1
bl_hat - bl
  intercept    protein1    protein2    protein3 
-0.59420548 -0.12592831  0.10389911  0.09027302 
vcov(bl_hat)
            intercept    protein1    protein2    protein3
intercept  0.34969383  0.02450048 -0.05740108 -0.03414106
protein1   0.02450048  0.05856047 -0.08727168  0.02688160
protein2  -0.05740108 -0.08727168  0.14364397 -0.04788732
protein3  -0.03414106  0.02688160 -0.04788732  0.02846367
heatmap(vcov(bl_hat), revC=TRUE)

This is as it should be: the coefficient for protein2 is positive, and its estimate is anticorrelated with the estimates for protein1 and protein3, whose coefficients are negative. The remaining terms are only weakly correlated with the intercept, which in our generating model is 0. The rather large standard errors at convergence demonstrate something else of note: once we step away from linear models, we typically need a larger sample size to get the same tight bounds on our estimates that we would expect from OLS.
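As a final cross-check (a sketch; fit is my own object, and exact numbers will vary from run to run since we never set a seed), base R’s glm() fits the same model by iteratively reweighted least squares, which for the canonical logit link coincides with the Fisher scoring above, so the coefficients and covariance matrix should agree closely:

fit <- glm(yl ~ 0 + yy, family = binomial)    # yy already carries an intercept column
cbind(fisher = bl_hat[, 1], glm = unname(coef(fit)))
all.equal(unname(vcov(bl_hat)), unname(vcov(fit)), tolerance = 1e-4)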

Fun stuff

For certain values of fun. Given \(y_l \sim Bernoulli(expit(Y\beta_Y))\) and \(Y \sim X\beta_X\), can you model
\(y_l \sim X\)? Remember, the two have different activation and gradient functions. Any ideas?