OLS, logistic regression & gradient descent

Author

Tim

Comparing ordinary least squares and gradient descent for linear model fitting

Canonical multivariate ordinary least squares (OLS) has dimensions as columns and samples as rows. When the response is also multivariate (i.e., when the model produces a matrix rather than a vector of numbers), we assume that the dimensions of the response are independent of each other (if we want to, we can estimate the variance-covariance matrix that relates the columns of the response matrix, but that’s not a byproduct of OLS). Let’s simulate some data to enforce a ground truth and verify that gradient descent arrives there.
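As a quick illustration of that independence assumption, here is a minimal sketch (with throwaway toy matrices, unrelated to the simulation below) showing that the one-shot multivariate OLS solution is identical to fitting each response column on its own:

# toy demo: multivariate OLS is just column-wise OLS
set.seed(1)
X_toy <- cbind(intercept=1, x1=rnorm(50), x2=rnorm(50))
B_toy <- cbind(y1=c(1, 2, 0), y2=c(0, -1, 3))
Y_toy <- X_toy %*% B_toy + matrix(rnorm(100, sd=0.1), 50, 2)

# one matrix solve vs. one solve per response column
B_joint <- solve(crossprod(X_toy)) %*% crossprod(X_toy, Y_toy)
B_bycol <- apply(Y_toy, 2, function(y) solve(crossprod(X_toy)) %*% crossprod(X_toy, y))
all.equal(unname(B_joint), unname(B_bycol))  # TRUE: response columns are fit independently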

Simulate gene expression

Since we are dealing with counts, we will model them with Poisson emissions. We will start by simulating a matrix of mRNA counts for 1000 cells across 5 genes. Since we are hoping to fit models for proteins related to each gene later on, we will specify the “expression group” of each cell, then generate counts based on those groupings.

# let's set this up as a Poisson generating model, with one parameter
# that one parameter ("lambda") controls both the mean and the variance
intercept <- c(gene1=0.1, gene2=0, gene3=0.1, gene4=0, gene5=0.1)
specific <- c(gene1=4.9, gene2=1, gene3=1.9, gene4=3, gene5=9.9)

# the "high expressor" cells will have counts with a mean of intercept + specific
# the "low expressor" cells will have counts with a mean of just the intercept
coefs <- cbind(intercept, specific)
exprgroups <- rbind(low=c(1, 0), high=c(1,1))
(lambda <- exprgroups %*% t(coefs))
     gene1 gene2 gene3 gene4 gene5
low    0.1     0   0.1     0   0.1
high   5.0     1   2.0     3  10.0
# so, if a cell is a "high expressor," its counts for gene5 will have a mean specified by ["high", "gene5"], i.e. 10
# and if a cell is a "low expressor", its counts for gene1 will have a mean specified by ["low", "gene1"], i.e. 0.1
# notice that the lambda parameter is completely specified by an intercept term and a state-specific weight term.

# We will simulate this many cells:
ncells <- 1000

# And we will order them sequentially for reasons that will soon be obvious:
cells <- seq_len(ncells)

# The above allows us to simulate a multifactorial generating model by design:
highexp <- rbind(gene1 = as.integer(cells <= 500),    # first 500 cells
                 gene2 = as.integer(cells <= ncells), # i.e., always true
                 gene3 = as.integer(cells > 400),     # last 600 cells
                 gene4 = as.integer(cells <= ncells), # i.e., always true
                 gene5 = as.integer(cells > 300))     # last 700 cells

# Bolt on the intercept term for every cell because that's what we have to do 
design_ary <- array(1, dim=list(nrow(highexp), ncol(highexp), 2))
design_ary[, , 2] <- highexp
dimnames(design_ary) <- list(rownames(highexp), NULL, c("intercept","highexp"))

# Get the coefficients for each gene for each cell: 
genes <- rownames(highexp)
names(genes) <- genes
lambdas <- do.call(cbind, lapply(genes, function(x) design_ary[x, , ] %*% coefs[x, ]))
colnames(lambdas) <- genes

# Now we multiply through to get simulated counts. 
counts <- do.call(cbind, lapply(genes, function(x) rpois(n=nrow(lambdas), lambda=lambdas[, x])))

Check for normality

These counts look like so (divided into “high expressor” and “low expressor” categories):

library(tibble)
library(reshape2)
library(ggforce)
Loading required package: ggplot2
library(ggplot2)

dat <- tibble(melt(counts, varnames=c("cell","gene"), value.name="counts"))
dat$expgroup <- ifelse(melt(t(highexp))$value == 1, "high", "low")

ggplot(dat, aes(x=counts, fill=gene)) + geom_histogram(binwidth=1) + facet_grid(gene ~ expgroup) + theme_no_axes()

The Poisson distribution is a great way to simulate discrete events (counts of molecules, horse kicks, etc.), but it is most certainly not normally distributed: the variance depends upon the mean. In fact, a key feature of the Poisson distribution is that the mean is the variance. That breaks a lot of assumptions baked into linear model fitting, particularly homoskedasticity (i.e., “the variance does not depend upon the mean”). This is super obvious for the low-expressing cells in the figure above. Fortunately, there is a relatively easy solution to this problem.
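To see this mean-variance coupling directly, here is a minimal sketch (fresh Poisson draws, independent of the simulated counts above):

# for Poisson draws, the sample variance tracks the sample mean
for (lam in c(0.1, 1, 10, 100)) {
  x <- rpois(1e5, lambda=lam)
  message("lambda = ", lam,
          ": mean = ", round(mean(x), 2),
          ", variance = ", round(var(x), 2))
}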

Log-normalize to decouple mean and variance for Poisson random variables

Normalize the counts from each cell, with a target of 10K fragments:

X <- round(sweep(counts, 1, rowSums(counts) + 1, `/`) * 10000)

Take logs, as the raw counts are Poisson and have a mean-variance dependency. log1p adds a pseudo-count of 1 so that the result is always non-negative.

X <- log1p(X)
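As a quick sanity check (a sketch with fresh Poisson draws, separate from our simulated counts), the variance of the raw counts grows in lockstep with the mean, while after log1p it no longer does:

# variance before and after log1p, across a range of Poisson means
sapply(c(1, 10, 100, 1000), function(lam) {
  x <- rpois(1e5, lambda=lam)
  c(lambda=lam, var_raw=var(x), var_log1p=var(log1p(x)))
})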

Plot it again. It looks a bit goofy but at least it resembles a bell curve within each group.

dat2 <- tibble(melt(X, varnames=c("cell","gene"), value.name="lognormcounts"))
dat2$expgroup <- ifelse(melt(t(highexp))$value == 1, "high", "low")
ggplot(dat2, aes(x=lognormcounts, fill=gene)) + 
  geom_density() + 
  xlim(0.5, 10) + 
  facet_grid(gene ~ expgroup) + 
  theme_no_axes()
Warning: Removed 1587 rows containing non-finite values (`stat_density()`).

It’s hardly perfect, but the groups at least have a more symmetric distribution, regardless of their means.

Add an intercept column.

This is the easiest way to set the mean in model fitting. We did it above for the Poisson model of mRNA counts, too.

X <- cbind(intercept=1, X)

Simulate protein abundance

Let’s assume protein abundance is a linear combination of gene abundances. Since we are enforcing a particular generating model, we can see how well we recover it for each protein by ordinary least squares or gradient descent. In some cases the protein is constitutively expressed at a low level, in which case the intercept coefficient is nonzero. We specify a coefficient for each gene and each protein:

B <- cbind(protein1=c(intercept=0.001,
                      gene1=0.25,
                      gene2=0,
                      gene3=0.5,
                      gene4=0,
                      gene5=0),
           protein2=c(intercept=0,
                      gene1=0.5,
                      gene2=0,
                      gene3=0,
                      gene4=0,
                      gene5=0.25),
           protein3=c(intercept=0,
                      gene1=0,
                      gene2=0,
                      gene3=0.25,
                      gene4=0,
                      gene5=0.5))

Observe that $y_j = \beta_0 + \beta_{gene1}X_{gene1} + ... + \beta_{gene5}X_{gene5}$ for each protein $y_j$, $j \in \{1, 2, 3\}$.

This is exactly equivalent to saying $y_j = X\beta_j$, and thus we will later estimate $\beta_j$ as $\hat{\beta}_j$.

Now we have what we need to generate a matrix of protein abundance. We assume ADT counts are Poisson, so we will reverse our variance-stabilizing transformation to yield Y0, the “raw” counts.

Y0 <- round(exp(X %*% B))

For regression, we will want to use the variance stabilized version, of course.

Y <- log1p(Y0)

How does this look?

heatmap(t(Y))

Fit coefficients by ordinary least squares

The Gauss-Markov theorem establishes that the best linear unbiased estimator for the coefficients of a linear system is the ordinary least squares (or OLS for short) solution. Specifically, if

$y = X\beta + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$,

then for an $N \times K$ matrix $X$ and $K \times 1$ vector $\beta$,

$\hat{\beta} = (X^TX)^{-1}X^Ty$ minimizes $\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$,

and $\hat{\beta}$ is the best linear unbiased estimator of $\beta$.

(It turns out that we can use the QR decomposition for a more stable inverse.)

If $y$ is an $N \times M$ matrix, we must solve for each column iteratively and populate a $K \times M$ matrix $\hat{\beta}$. Nevertheless, as long as the mild Gauss-Markov conditions are satisfied, $\hat{\beta}$ will minimize the mean squared error, so it should also be the solution found by gradient descent if $(y - \hat{y})^2$ is the cost function.

A visual explainer can be found at setosa.io. The OLS estimate is also the maximum likelihood estimator (MLE) for a linear regression, and a useful benchmark all around. Let’s make it a function, and let’s use the more numerically stable QR decomposition to handle inversion.

ols <- function(X, y) solve(crossprod(X)) %*% t(X) %*% y
ols_qr <- function(X, Y) qr.solve(X, Y)

Since qr.solve() accepts a matrix right-hand side, solving for every response column at once, we do just that.

# Solve for X, Y:
B_ols <- ols(X, Y)
B_qr <- ols_qr(X, Y)
show(B_ols - B_qr)
               protein1      protein2      protein3
intercept -2.114975e-14  1.208755e-13 -1.114664e-13
gene1     -1.026956e-15 -6.994405e-15  4.486862e-15
gene2      5.894807e-16 -4.363914e-17  3.161534e-16
gene3     -1.387779e-15 -5.284618e-15 -4.246603e-15
gene4      3.372086e-15 -7.746408e-15  5.616167e-15
gene5      2.652392e-15  2.303713e-15  1.437739e-14
heatmap(t(B_qr))

We can see how we did by computing the root mean squared error (RMSE), $\sqrt{\frac{1}{NM}\sum_{i,j}(\hat{Y}_{ij} - Y_{ij})^2}$.

rmse <- function(yhat, y) sqrt(sum((yhat - y)**2)/(nrow(y) * ncol(y)))
rmse(Y, X %*% B_qr)
[1] 0.07385406

Not too awful, all things considered. Before trying the same thing via gradient descent, let’s sanity-check against lm().

Same thing but via lm()

Of course, we could just use good old lm() for this task, which will fit a linear model for each protein.

exprs <- X[, -1] # drop the explicit intercept
fit <- lm(Y ~ exprs)
(fitted <- round(coef(fit), 2))
            protein1 protein2 protein3
(Intercept)     0.34     0.14     0.73
exprsgene1      0.22     0.48    -0.01
exprsgene2      0.00     0.00     0.00
exprsgene3      0.45     0.00     0.24
exprsgene4      0.00     0.00     0.00
exprsgene5      0.01     0.25     0.43

As a reminder, here are the true generating values:

B
          protein1 protein2 protein3
intercept    0.001     0.00     0.00
gene1        0.250     0.50     0.00
gene2        0.000     0.00     0.00
gene3        0.500     0.00     0.25
gene4        0.000     0.00     0.00
gene5        0.000     0.25     0.50

A bit noisy, but not too awful.

Gradient descent

Linear regression can be solved exactly via QR decomposition, so typically we would not bother with gradient descent. However, the minute you step outside the classical assumptions (homoskedasticity, uncorrelated errors, i.i.d. observations), this starts to break down. One way or another, we need to start from a “guess” (an initial value) and proceed towards a minimum-cost estimate of the true parameters. The cost function can be the negative log of the joint probability of the data given the proposed model parameters, in which case we call the minimizer the maximum likelihood estimate, or it can be the error sum of squares, $\sum_{i=1}^{n}(X_i^T\beta - y_i)^2$.

This process requires a step size $\alpha$ and an update direction computed from $X$ and $y$. More generally, the vector of partial derivatives with respect to each coefficient forms the gradient $\nabla f(\beta \mid X, y)$, where $f(\beta)$ is the MSE. The goal is to proceed as quickly as possible in the direction that minimizes the cost. For OLS, $\nabla f(\beta)$ is $\frac{\partial}{\partial \beta}(X\beta - y)^2$, and we update so that $\beta_{t+1} = \beta_t - \alpha \nabla f(\beta_t)$.

A visual explainer for this can be found at setosa.io. Since OLS is also the maximum likelihood estimator (MLE) for a linear regression, it gives us a benchmark both for our gradient descent function and for what we’d expect from lm() or glm(). We compose gradient descent from smaller functions.

# estimate of y from X and b
yhat <- function(X, b) X %*% b

# error of the estimate from X and b
eps <- function(y, X, b) yhat(X, b) - y

# mean squared error (a function of eps)
mse <- function(y1, y0) mean((y1 - y0)**2)

# root mean squared error for the estimate
rmsb <- function(b1, b0) sqrt(mse(b1, b0))

# update vector for a given starting guess `b` 
delta <- function(y, X, b) t(t(X) %*% eps(y, X, b))[1,] / length(y)

# update function: step along this vector with stride length alpha
update <- function(y, X, b, alpha) return(b - (delta(y, X, b) * alpha))

# log an iteration with its update
logupdate <- function(y, X, b, b0) c(b, mse=mse(yhat(X,b), y), rmsb=rmsb(b, b0))

# gradient descent function, applying all of the above bits
# note that we are not scaling the features, which would make it faster
ols_gd <- function(y, X, alpha=1e-2, iter=1000, tol=1e-5, verbose=FALSE) {

  b0 <- rep(0, ncol(X))
  for (i in seq_len(iter)) {
    b <- update(y, X, b0, alpha)
    results <- logupdate(y, X, b, b0)
    names(results)[seq_along(b)] <- names(b)
    if (i == 1) updates <- t(as.matrix(results, nrow=1))
    if (i > 1) updates <- rbind(updates, logupdate(y, X, b, b0))
    if (verbose) message("Iteration ", i, ", ",
                         "MSE: ", updates[i, "mse"], ", ",
                         "RMS(b): ", updates[i, "rmsb"])
    if (updates[i, "rmsb"] < tol & verbose) message("Converged after ", i, " iterations.")
    if (updates[i, "rmsb"] < tol) break()
    b0 <- b
  }
  updates <- data.frame(updates)
  updates$iter <- seq_len(nrow(updates))
  class(updates) <- "updates"
  attr(b, "updates") <- updates
  class(b) <- "results"
  return(b)

}

# test it 
invisible(apply(Y, 2, ols_gd, X))

It will also be handy to automate plotting for the resulting update logs.

# add plot function for updates
# (note: col=j relies on a loop index `j` existing in the calling environment,
#  as in the per-protein loop below)
plot.updates <- function(upd, ...) {
  with(upd, plot(iter, mse, type="b", col=j, lwd=3, ...))
}

# wrap it for results 
plot.results <- function(res, ...) plot(attr(res, "updates"), ...)

We can compare this with the brute-force result of the stabilized OLS solution from earlier, which used the QR decomposition to avoid singularities when inverting `t(X) %*% X`, and see whether gradient descent arrives at about the same answer.

# set up plots
par(mfrow=c(2, 3))

# plot convergence for each protein
for (j in seq_len(ncol(Y))) plot(ols_gd(Y[, j], X), main=colnames(Y)[j])

# run per column
B_gd <- apply(Y, 2, ols_gd, X)
rownames(B_gd) <- colnames(X)

# plot B, B_qr, and B_gd
responses <- colnames(Y)
estimates <- c("B", "B_qr", "B_gd")
for (j in seq_along(responses)) {
  protein <- responses[j]
  allests <- cbind(B, B_gd, B_qr)
  minest <- min(allests)
  maxest <- max(allests)
  plot(c(), main=protein,
       xlim=c(0, ncol(X) - 1), xlab="gene",
       ylim=c(minest, maxest), ylab="coef")
  for (k in seq_along(estimates)) {
    method <- estimates[k]
    est <- get(method)[, protein]
    points(cbind(x=seq_along(colnames(X))-1, y=est), pch=k, col=j)
  }
}

We do indeed. So why bother with gradient descent?

Maximum likelihood estimation

Suppose our objective is not to predict Y itself, but rather the probability that Y=1. This immediately breaks the assumptions that prop up the Gauss-Markov theorem, and it comes up regularly when Y is an indicator for, e.g., membership in a group. Ordinary least squares is useless here: on the probability scale the predictions are unbounded, and on the log-odds scale the residuals can be infinite. A general solution to this problem is maximum likelihood.
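To see why, here is a minimal sketch with made-up toy data (not the proteins simulated above): a straight-line fit to a 0/1 outcome happily predicts “probabilities” below 0 and above 1.

# toy demo: OLS on a binary outcome yields fitted values outside [0, 1]
set.seed(2)
x_toy <- rnorm(200)
p_toy <- 1 / (1 + exp(-3 * x_toy))       # true probabilities
y_toy <- rbinom(200, size=1, prob=p_toy) # binary outcome
fit_ols <- lm(y_toy ~ x_toy)
range(fitted(fit_ols))                   # typically strays outside [0, 1]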

(Trivia: training a one layer neural net with softmax activation is equivalent to logistic regression.)

Logistic regression

The canonical link for regression of Pr(y==1) on X is logit(p), defined as log(p / (1 - p)). Several good reasons exist for this choice (sketched in code after the list):

  • logit(p), the log-odds of seeing Y = 1 given p, ranges from -Inf to +Inf

  • expit(z), the inverse of logit, ranges from 0 to 1 and thus maps XB to p

  • the first and second derivatives of this likelihood are easy to calculate.
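To make the first two points concrete, here is a small sketch (the helper names logit and expit are just illustrative; expit is defined again in the code further down):

# logit maps (0, 1) onto the whole real line; expit maps it back
logit <- function(p) log(p / (1 - p))
expit <- function(z) exp(z) / (1 + exp(z))

p <- seq(0.01, 0.99, by=0.01)
range(logit(p))               # widens towards (-Inf, Inf) as p approaches 0 or 1
all.equal(expit(logit(p)), p) # TRUE: expit inverts logit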

The (log) likelihood function

What is likelihood? It’s the joint probability of seeing the data we saw, given the model we propose. For a logistic regression, that model is a Bernoulli distribution.

If $Y \sim \mathrm{Bernoulli}(\pi)$, then the likelihood is $L(\pi \mid y) = \prod_{i=1}^{n} \Pr(y_i \mid \pi_i)$.

This function is easier to compute as the log-likelihood, $\ell(\pi \mid y) = \sum_{i=1}^{n} \log \Pr(y_i \mid \pi_i)$.
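In R, that log-likelihood is a one-liner (a sketch with a made-up probability just to show the bookkeeping; dbinom with size=1 is the Bernoulli probability mass function):

# Bernoulli log-likelihood of 0/1 outcomes y given fitted probabilities p
loglik_bernoulli <- function(y, p) sum(dbinom(y, size=1, prob=p, log=TRUE))

# the log-likelihood is (almost always) higher at the true probability than at a wrong one
y_demo <- rbinom(1000, size=1, prob=0.3)
loglik_bernoulli(y_demo, rep(0.3, 1000)) > loglik_bernoulli(y_demo, rep(0.7, 1000))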

For $\ell(\beta)$ with this canonical link, the vector of first partial derivatives (the score) $\frac{\partial \ell(\beta)}{\partial \beta}$ is $X^T(y - \hat{p})$, where $\hat{p} = \frac{e^{\eta}}{1 + e^{\eta}}$ and $\eta = X\hat{\beta}$, and the Fisher information (the negative of the matrix of second partial derivatives $\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^T}$) is $X^TWX$, where $\mathrm{diag}(W)_i = \hat{p}_i(1 - \hat{p}_i)$. Note that, through $(X^TWX)^{-1}$, the updates end up scaled inversely to these variances.

With these two pieces, we can start iterating towards a numeric solution.
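As a warm-up before wrapping this in a full function, here is a sketch of a single Fisher-scoring step on made-up toy data (X_fs and y_fs are hypothetical, not the proteins below):

# one Fisher-scoring step: b_new = b_old + solve(I) %*% U
set.seed(3)
X_fs <- cbind(intercept=1, x=rnorm(500))
y_fs <- rbinom(500, size=1, prob=1 / (1 + exp(-(0.5 + 1.5 * X_fs[, "x"]))))

b_old <- c(0, 0)                              # starting guess
eta   <- X_fs %*% b_old                       # linear predictor
p_hat <- exp(eta) / (1 + exp(eta))            # fitted probabilities
U     <- t(X_fs) %*% (y_fs - p_hat)           # score (first derivatives)
W     <- diag(as.vector(p_hat * (1 - p_hat))) # variance weights
I_mat <- t(X_fs) %*% W %*% X_fs               # Fisher information
(b_new <- b_old + solve(I_mat) %*% U)         # one step towards the MLE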

Our very own binary outcome

Let’s make this concrete by generating a binary outcome from the protein abundances in our previous example. Suppose the log-odds of a cell being malignant are a linear combination of (intercept, protein1, protein2, protein3) with coefficients (0, -2, 3, -1).

# log-odds coefs
bl <- c(0, -2, 3, -1)

# add an intercept
yy <- cbind(intercept=1, Y)

# untransformed 
eta <- yy %*% bl

# transformation
expit <- function(x) exp(x) / (1 + exp(x))

# transformed 
pee <- expit(eta)

# cheezy but useful helper function
rbernoulli <- function(...) rbinom(size=1, ...)

# simulate the response vector
yl <- rbernoulli(length(pee), pee)
names(yl) <- rownames(Y)

# tabulate: 
table(yl)
yl
  0   1 
468 532 

Now let’s solve for the coefficients iteratively, specifically by Fisher scoring (a second-order relative of gradient descent). For the canonical logit link, Fisher scoring produces results identical to Newton-Raphson iteration, with the bonus feature of providing the covariance matrix of the coefficient estimates at convergence: it is the inverse of t(X) %*% W %*% X, and thus easily obtained with solve().

Logistic regression via gradient descent

lr_gd <- function(y, X, iter=1000, tol=1e-5) {

  y <- as.matrix(y)
  pi_i <- function(eta) return(exp(eta)/(1 + exp(eta)))

  # starting params 
  b0 <- matrix(0, ncol=1, nrow=ncol(X))
  p0 <- pi_i(X %*% b0)
  W0 <- diag(as.vector(p0 * (1 - p0)))
  I0 <- t(X) %*% W0 %*% X
  U0 <- t(X) %*% (y - p0)
  I <- I0
  U <- U0

  # iterate 
  for (i in seq_len(iter)) {
    b <- b0 + (solve(I) %*% U)
    if (all(abs(b - b0) < tol)) break
    p <- pi_i(X %*% b)
    W <- diag(as.vector(p * (1 - p)))
    I <- t(X) %*% W %*% X
    U <- t(X) %*% (y - p)
    b0 <- b
  }

  attr(b, "fitted") <- pi_i(X %*% b)
  attr(b, "vcov") <- solve(I)
  attr(b, "iter") <- i
  class(b) <- "results"
  return(b)

}

Add a couple convenience functions per usual:

print.results <- function(x) print(x[,])
vcov.results <- function(x) attr(x, "vcov")

Once again we will test out our function:

# test it 
bl_hat <- lr_gd(yl, yy)
bl_hat
 intercept   protein1   protein2   protein3 
-0.4459375 -1.6434112  2.5578092 -0.8104543 
bl
[1]  0 -2  3 -1
bl_hat - bl
 intercept   protein1   protein2   protein3 
-0.4459375  0.3565888 -0.4421908  0.1895457 
vcov(bl_hat)
            intercept     protein1    protein2     protein3
intercept  0.33208743  0.020111058 -0.04263766 -0.040443067
protein1   0.02011106  0.031153470 -0.04144569  0.008310655
protein2  -0.04263766 -0.041445685  0.06621578 -0.017473990
protein3  -0.04044307  0.008310655 -0.01747399  0.016924482
heatmap(vcov(bl_hat), revC=TRUE)

This is as it should be: the coefficient for protein2 is positive and anticorrelated with those for protein1 and protein3, which are negative. None of the other terms is strongly correlated with the intercept term, which in our generating model is 0. The rather large standard errors at convergence demonstrate something else of note: once we step away from linear models, we need a larger sample size to get the same tight bounds on our estimates that we would expect from OLS.

Fun stuff

For certain values of fun. Given $y_l \sim \mathrm{Bernoulli}(\mathrm{expit}(Y\beta_Y))$ and $Y \approx X\beta_X$, can you model $y_l \sim X$? Remember, the two have different activation and gradient functions. Any ideas?