In this homework exercise, you will code the iteratively re-weighted least squares (IRLS) method for solving logistic regression. IRLS is an application of Newton's method, a commonly used (and well-studied) optimization method. Newton's method seeks the minimizer of a function \(f(\beta)\) by approximating it with a second-order Taylor series around the current value \(\beta_t\); its iterates are given by
\[ \beta_{t+1} = \beta_t - \frac{f'(\beta_t)}{f''(\beta_t)}, \] where \(\beta_t\) is the current iterate and \(f'(\beta_t)\) and \(f''(\beta_t)\) are the first and second derivatives of the function \(f(\cdot)\), respectively.
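To make the update concrete, here is a minimal sketch of Newton's method on a toy one-dimensional function; the function name newton_min, the toy function \(f(\beta) = (\beta-3)^2 + e^{\beta}\), and the iteration count are our own illustrative choices, not part of the assignment.
newton_min <- function(fp, fpp, beta, n.iter = 20){
  # fp and fpp are the first and second derivative functions of f
  for(t in 1:n.iter){
    beta <- beta - fp(beta)/fpp(beta)  # Newton update
  }
  beta
}
# toy example: f(beta) = (beta - 3)^2 + exp(beta)
fp  <- function(b) 2*(b - 3) + exp(b)  # f'(beta)
fpp <- function(b) 2 + exp(b)          # f''(beta)
newton_min(fp, fpp, beta = 0)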
In this exercise, \(f(\cdot)\) is the negative log-likelihood of the logistic regression model; its derivation is something you do not need to worry about.
The logistic regression model is used for classification problems in which the responses (dependent variables) \(y_i\) are Bernoulli random variables (coin flips). For simplicity, we consider the simple logistic regression problem, where each covariate (independent variable) is a scalar, denoted \(x_i\).
We characterize the distribution of the response, \(Y\), with the following formula: \[ Y|(X=x) \sim \text{Bernoulli}\left(p=\frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}\right), \] where \(x\) is the value of the covariate associated with \(Y\), and \(e\) is good old Euler's number.
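To see what data from this model look like, here is a minimal simulation sketch; the particular values of beta0, beta1, and n are arbitrary choices for illustration.
set.seed(123)
n <- 200
beta0 <- -1; beta1 <- 0.5
x <- runif(n, min = 0, max = 10)
p <- exp(beta0 + beta1*x)/(1 + exp(beta0 + beta1*x))  # P(Y = 1 | X = x)
y <- rbinom(n, size = 1, prob = p)                    # Bernoulli responses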
In practice, the quantities \(\beta_0\) and \(\beta_1\) are unknown and have to be estimated from data. Unfortunately, this estimation problem is a little more involved than ordinary linear regression and requires iterative optimization tools.
Perhaps surprisingly, the problem of fitting logistic regression is closely related to another problem from statistics: weighted linear regression. Weighted linear regression is used when the error of each observation may have a different variance, in other words, when heteroskedasticity is present.
We will skip the derivation and present only the results. If you are interested in the details, feel free to search online for “Generalized Least Squares”.
In usual linear regression, if we denote the vector of responses by \(\mathbf{y}\) and the matrix of covariates by \(\mathbf{X}\), then the least squares regression coefficients \(\hat{\beta}\) are given by the normal equations: \[ \hat{\beta}=\left(\mathbf{X}^T \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{y}, \] where \(\mathbf{X}^T\) is the transpose of the matrix \(\mathbf{X}\).
In weighted linear regression, if the variance of the residual for observation \(i\) is given by \(\sigma_i^2\), the coefficients are instead given by \[ \hat{\beta}_{\text{weighted}}=\left(\mathbf{X}^T \mathbf{D} \mathbf{X}\right)^{-1} \mathbf{X}^T \mathbf{D} \mathbf{y}, \] where \(\mathbf{D}\) is a diagonal matrix and \(\mathbf{D}_{i,i}=1/\sigma_i^{2}\).
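For reference, this closed form is the minimizer of the variance-weighted sum of squared residuals (a standard fact about weighted least squares): \[ \hat{\beta}_{\text{weighted}} = \arg\min_{\beta} \sum_{i=1}^{n} \frac{\left(y_i - \mathbf{x}_i^T \beta\right)^2}{\sigma_i^{2}}, \] where \(\mathbf{x}_i^T\) denotes the \(i\)-th row of \(\mathbf{X}\). Observations with larger error variance are therefore down-weighted.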
Below, you are given code for fitting (standard) linear regression using the normal equations. Adjust the second function so that it instead fits weighted linear regression.
fit_lm <- function(X, y){
  X <- cbind(1, X)        #Add intercept to the X matrix
  lhs <- t(X) %*% X       #compute X^T X
  rhs <- t(X) %*% y       #compute X^T y
  return(solve(lhs, rhs)) #compute inverse(X^T X) %*% t(X) %*% y
}
fit_wlm <- function(X, y, obs_var){
  #obs_var is a vector of variances for each observation
  X <- cbind(1, X)        #Add intercept to the X matrix
  D <- diag(1/obs_var)    #create the D matrix
  lhs <- t(X) %*% D %*% X #CHANGE THIS
  rhs <- t(X) %*% D %*% y #CHANGE THIS
  return(solve(lhs, rhs)) #compute inverse(X^T D X) %*% t(X) %*% D %*% y
}
Verify that your fit_wlm function works properly by comparing it with R’s lm. Run the following code; if your method works properly, you will see Your method works! as the output.
library(tidyverse)
n <- 100
x <- runif(n,min=0,max=10)
err_var <- (x+1)
y <- 1 + 2*x + rnorm(n,mean = 0,sd = sqrt(err_var))
lm_r <- lm(y~x,weights = 1/err_var) %>% coef
lm_yours <- fit_wlm(x,y,err_var)
lm_diff <- sum((lm_r-lm_yours)^2)
ifelse(lm_diff<=1e-6,"Your method works!\n","Check your code!\n") %>% cat
## Your method works!
We skip the derivation of the IRLS procedure and refer the interested reader to lecture notes at this link for the full derivation. One of the fundamental ideas in the derivation is that, in logistic regression, the variance of each observation depends on that observation’s covariates: the variance of a Bernoulli random variable is \(p(1-p)\), and \(p\) differs across observations.
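As a quick aside (ours, not part of the assignment), the Bernoulli variance \(p(1-p)\) is largest at \(p = 0.5\) and shrinks toward zero as \(p\) approaches 0 or 1, which is why observations with very confident predictions carry little weight in the working regression:
p <- c(0.05, 0.25, 0.5, 0.75, 0.95)
p*(1 - p)  # largest at p = 0.5, smallest near 0 and 1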
The IRLS algorithm steps for logistic regression are as follows, writing \(\mathbf{x}_i^T = (1, x_i)\) for the \(i\)-th row of the covariate matrix with intercept:

1. Initialize the coefficient vector \(\beta\) (for example, at zero).
2. Compute \(p_i = \frac{e^{\mathbf{x}_i^T \beta}}{1 + e^{\mathbf{x}_i^T \beta}}\) for each observation \(i\).
3. Compute the variances \(V_i = p_i(1 - p_i)\).
4. Compute the expected (working) responses \(z_i = \mathbf{x}_i^T \beta + \frac{y_i - p_i}{V_i}\).
5. Update \(\beta\) by fitting a weighted linear regression of \(z\) on \(x\), treating observation \(i\) as having variance \(1/V_i\).
6. Repeat steps 2-5 until \(\beta\) stops changing (up to a small convergence tolerance) or the maximum number of iterations is reached.
Next, we present incomplete code for fitting IRLS. You are asked to make changes so that the completed code fits logistic regression using IRLS.
beta <- c(0, 0)

compute_p <- function(x, beta){
  X <- cbind(1, x)
  p <- exp(X %*% beta)/(1 + exp(X %*% beta)) # CHANGE THIS so that this returns p_i (for step 2)
  return(p)
}

compute_var <- function(p){
  V <- p*(1 - p) # CHANGE THIS so that this returns V_i given p_i (for step 3)
  return(V)
}

compute_expected_resp <- function(x, beta, y, p, V){
  X <- cbind(1, x)
  z <- X %*% beta + (y - p)/V # CHANGE THIS so that this returns the expected responses (step 4)
  return(z)
}

did_we_converge <- function(beta0, beta1, conv.eps){
  # beta0 and beta1 are the previous and current coefficient vectors
  sum((beta0 - beta1)^2) <= conv.eps
}
IRLS <- function(X, y, max.iter = 100, conv.eps = 1e-12){
  beta <- rep(0, ncol(cbind(1, X))) #initialize beta
  beta_prev <- beta                 #beta_{t-1} (for comparisons)
  for(iter in 1:max.iter){
    p <- compute_p(X, beta) %>% as.numeric
    V <- compute_var(p) %>% as.numeric
    z <- compute_expected_resp(X, beta, y, p, V) %>% as.numeric
    beta <- fit_wlm(X, z, 1/V) #fit_wlm expects variances, and Var(z_i) = 1/V_i
    if(did_we_converge(beta_prev, beta, conv.eps)){
      break
    } else {
      beta_prev <- beta
    }
  }
  return(beta)
}
Finally, verify that your method works by comparing it against R’s glm implementation. As before, run the following code; if your method works properly, you will see Your method works! as the output.
data(iris)
iris_a <- iris %>% slice(51:150) %>%
  select(Sepal.Length, Species) %>%
  mutate(Species = 1*(Species == "virginica"))
glm_yours <- IRLS(X=iris_a[,1],y=iris_a[,2])
glm_r <- glm(Species~Sepal.Length,family = binomial,data=iris_a) %>% coef
glm_diff <- sum((glm_r-glm_yours)^2)
ifelse(glm_diff<=1e-6,"Your method works!\n","Check your code!\n") %>% cat
## Your method works!
Finally, check how quickly the method converges to the optimal coefficients by changing the max.iter parameter. In the space below, provide R code so that the final plot shows the difference between the coefficients fitted with glm and those from your IRLS method for each value of max.iter.
beta_iter <- matrix(NA,nrow=2,ncol=20)
#this is a matrix such that beta_iter[i,j] is the IRLS-fitted coefficient for beta[i] with max.iter=j
#you need to fill in the matrix yourself
## YOUR SOLUTION GOES HERE
for (j in 1:20){
  beta_iter[, j] <- IRLS(X = iris_a[,1], y = iris_a[,2], max.iter = j, conv.eps = 1e-12)
}
beta_diff <- beta_iter - matrix(glm_r, nrow = 2, ncol = 20)
beta_diff_sum <- apply(beta_diff^2, 2, sum)
plot(1:20, beta_diff_sum, xlab = "max.iter", ylab = "squared difference from glm coefficients")