Online Logistic Regression

Ben Dilday

2017-11-21

Taylor expansion of log-likelihood

Suppose you have a log-likelihood function, \(l_0\), computed from a set of observations, \(X_0\), with maximum-likelihood parameters \(\beta_0\). If you make a subsequent set of observations, \(X_1\), with log-likelihood \(h\), then you can write the new log-likelihood, \(l_1\), as

\(l_1 = l_0 + h\).

This new log-likelihood will be maximized at a value \(\beta_1\), given by \(l_1'(\beta_1) = 0\). If the number of observations in \(X_1\) is small compared to the number in \(X_0\), then the change in the parameters, \(\delta \beta = \beta_1 - \beta_0\), will be small. In this case we can expand the derivative in a Taylor series around the point \(\beta_0\),

\(l_1'(\beta_1) = l_0'(\beta_1) + h'(\beta_1) \approx l_0'(\beta_0) + l_0''(\beta_0) \delta \beta + h'(\beta_0) + h''(\beta_0) \delta \beta\).

Noting that \(l_0'(\beta_0) = 0\) and that \(l_1'(\beta_1) = 0\) by definition, we have,

\(l_0''(\beta_0) \delta \beta + h'(\beta_0) + h''(\beta_0) \delta \beta = 0\)

or,

\(\delta \beta = \frac{- h'(\beta_0)}{l_0''(\beta_0) + h''(\beta_0)}\)

The above is appropriate for a single parameter. In the case where \(\beta\) is a vector of parameters, this generalizes to,

\((l_0''(\beta_0) + h''(\beta_0)) \cdot \delta \beta = -\nabla h(\beta_0)\),

where \(l''\) and \(h''\) can be interpreted as the Hessian matrices for \(l_0\) and \(h\) respectively.
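
As a sanity check on the algebra, here is a minimal generic sketch in R (the names update_beta, hess_l0, grad_h, and hess_h are illustrative, not from any package): given the Hessian of the existing log-likelihood and the gradient and Hessian of the new observations' log-likelihood, all evaluated at \(\beta_0\), the update is a single linear solve.

# Generic one-step update: solve (l0'' + h'') dbeta = -grad(h),
# then shift the old parameter estimate by dbeta.
update_beta <- function(beta0, hess_l0, grad_h, hess_h) {
  dbeta <- solve(hess_l0 + hess_h, -grad_h)
  beta0 + dbeta
}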

Application to binary logistic regression

In the case of a binary logistic regression, we can specify the derivatives in terms of the predicted probabilities, \(\mu_n\).

\(l = \sum_n \left[ y_n \log \mu_n + (1 - y_n) \log (1-\mu_n) \right]\)

\(\nabla l = X^T (y - \mu)\)

\(l'' = -X^T W X\), where \(W\) is the diagonal weight matrix with entries \(W_{nn} = \mu_n (1 - \mu_n)\).

The update rule becomes

\(\delta \beta = [X^T W X]^{-1} X_1^T (y_1 - \mu_1)\),

where \(X\) is the row-wise concatenation of \(X_0\) and \(X_1\), and the weights and residuals are evaluated at \(\beta_0\).

This has a similar form to the update rule for iteratively reweighted least squares (IRLS).
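
To make the correspondence concrete, here is a minimal sketch of these derivatives in R (logistic_grad_hess is an illustrative name; plogis is the base-R logistic function):

# Gradient and Hessian of the binary logistic log-likelihood,
# matching the formulas above: grad = X^T (y - mu), hess = -X^T W X,
# where W is the diagonal matrix with entries mu * (1 - mu).
logistic_grad_hess <- function(X, y, beta) {
  mu <- as.vector(plogis(X %*% beta))
  W <- diag(mu * (1 - mu))
  list(grad = t(X) %*% (y - mu),
       hess = -t(X) %*% W %*% X)
}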

Removing an observation

Instead of adding new observations, we can go the other direction and ask what the parameter values would be if we removed a set of observations. In this case, writing \(l_0 = l_1 - h\), we have,

\(l_0'(\beta_0) = l_1'(\beta_1) - l_1''(\beta_1) \delta \beta - h'(\beta_1) + h''(\beta_1) \delta \beta\)

Setting \(l_0'(\beta_0) = 0\) and \(l_1'(\beta_1) = 0\) and solving for \(\delta \beta\), the update rule keeps the same form, except that the residuals are taken with respect to only the held-out set of data, and the denominator involves the full data set minus the held-out set, viz.

\(\delta \beta = \frac{- h'(\beta_1)}{l_1''(\beta_1) - h''(\beta_1)}\)
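
In the logistic-regression case, everything is now evaluated at the full-data fit, \(\beta_1\). A minimal sketch in R (the function and argument names are illustrative, not from any package): X and W are the full-data model matrix and weight matrix, and X_h, y_h, mu_h are the held-out rows.

# Removal update, all quantities evaluated at the full-data fit beta1:
# delta_beta = [X^T W X - X_h^T W_h X_h]^{-1} X_h^T (y_h - mu_h);
# the held-out-fit coefficients are then approximately beta1 - delta_beta.
loo_delta_beta <- function(X, W, X_h, y_h, mu_h) {
  W_h <- diag(mu_h * (1 - mu_h), nrow=length(mu_h))
  solve(t(X) %*% W %*% X - t(X_h) %*% W_h %*% X_h,
        t(X_h) %*% (y_h - mu_h))
}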

Example in R

Here I generate data from a model with three parameters,

generate_bin_data <- function(rseed=101, npts=10000, noise_sigma=0.2) {
  set.seed(rseed)
  # three covariates, each drawn from a normal with sd 0.5
  xx <- matrix(rnorm(3 * npts, 0, 0.5), ncol=3)
  true_pars <- c(-0.1, -0.2, 0.5)
  # linear predictor, plus a little extra noise
  lam <- xx %*% true_pars
  noise <- rnorm(npts, 0, noise_sigma)
  zn <- lam + noise
  # logistic transform to probabilities, then draw binary outcomes
  z <- exp(zn) / (1 + exp(zn))
  y <- rbinom(npts, 1, z)

  df_fit <- data.frame(x1=xx[,1], x2=xx[,2], x3=xx[,3], y=y)
  df_fit$y <- as.factor(df_fit$y)
  df_fit
}

npts <- 10000
glm_data <- generate_bin_data(npts=npts)

Fit the model, leaving out the last observation (glm_mod0) and including it (glm_mod1),

glm_mod0 <- glm(y ~ ., data=glm_data[1:(npts-1),], family='binomial')
glm_mod1 <- glm(y ~ ., data=glm_data[1:npts,], family='binomial')

Get the difference in the coefficients,

coef0 <- coef(glm_mod0)
coef1 <- coef(glm_mod1)
dcoef <- coef1 - coef0

Get the model matrix

xx <- model.matrix(y ~ ., data=glm_data)
y <- glm_data$y

x1 <- matrix(xx[npts,], nrow=1)
y1 <- matrix(as.numeric(as.character(y[npts])), ncol=1)

Compute the weights. Note that the predictions are made using the model that was fit with the last observation held out, but they are applied to all of the observations, including the last one.

mu <- predict(glm_mod0, newdata=glm_data, type='response')
mu1 <- matrix(mu[npts], ncol=1)
ww <- Matrix::Diagonal(n=length(mu), x=mu * (1-mu))

Compute \(\delta \beta\) using the Taylor series approximation,

lhs_mat <- -t(xx) %*% ww %*% xx   # l'' over the full data set
rhs_mat <- t(x1) %*% (y1 - mu1)   # h', the gradient from the new observation
dbeta <- as.vector(solve(lhs_mat, -rhs_mat))

Finally, compare the two,

coef_compare <- data.frame(dcoef=dcoef, dbeta=dbeta)
coef_compare$compare <- coef_compare$dbeta - coef_compare$dcoef
print(coef_compare)
##                     dcoef         dbeta       compare
## (Intercept)  1.927206e-04  1.927187e-04 -1.876985e-09
## x1           1.365596e-05  1.365710e-05  1.142480e-09
## x2          -2.228601e-05 -2.228384e-05  2.167406e-09
## x3           1.550796e-04  1.550727e-04 -6.905886e-09