Generalized linear regression model and General linear regression model

The generalized linear model (GLM) and the general linear regression model are usually treated as different regression methods; in fact, general linear regression is just a special case of the GLM. For the general linear regression model we can write the model as \[ y_i=\beta_0+\beta_1 x_{i1}+...+\beta_px_{ip}+\epsilon_i \]

where \(\epsilon_i\) has a certain distribution, usually normal \(N(0,\sigma^2)\). We treat \(\beta\) and \(x\) as constants here, so \(y_i\) can be seen to have the same type of distribution as \(\epsilon_i\), only with its mean shifted by the linear predictor.

For the generalized linear model (GLM), there is no error term \(\epsilon\) anymore; we directly assume that the response variable \(Y\) has a certain distribution such as the Poisson, Bernoulli, or Gamma distribution. For these distributions we assume that the parameter \(\mu\), i.e. \(E(Y)\), after some transformation is linear in the predictor variables. These transformations are called link functions and can be written as: \[g(\mu_i)=\beta_0+\beta_1 x_{i1}+...+\beta_px_{ip} \] where \(g\) is called the link function and usually we denote \[\eta_i= \beta_0+\beta_1 x_{i1}+...+\beta_px_{ip} \] (Note: \(i\) here means the \(i^{th}\) row, subject, data record or observation.)
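
Before moving on, here is a quick R illustration of the "special case" claim (the simulated data and object names are ours, purely illustrative): fitting the same model with lm and with glm using the Gaussian family and identity link returns identical coefficients.

set.seed(1)                         # simulated toy data
sim <- data.frame(x = rnorm(50))
sim$y <- 1 + 2 * sim$x + rnorm(50)
coef(lm(y ~ x, data = sim))                                         # general linear regression
coef(glm(y ~ x, family = gaussian(link = "identity"), data = sim))  # the same model fitted as a GLM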

Estimate coefficients for a Poisson regression

To calculate \(\beta\) and its confidence interval, the maximum likelihood estimation (MLE) method is often used. Usually we cannot find an explicit solution for the MLE, so this method is combined with the Newton-Raphson algorithm, which leads to the so-called iteratively re-weighted least squares method. Suppose our response variable \(Y\) has a Poisson distribution; then we can write:

\[ f(y_i;\mu_i)=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}\] The likelihood function is: \[\begin{align*} L=\frac{e^{-\mu_1}\mu_1^{y_1}}{y_1!}\times\frac{e^{-\mu_2}\mu_2^{y_2}}{y_2!}\times...\times\frac{e^{-\mu_n} \mu_n^{y_n}}{y_n!}=\frac{e^{-\sum \mu_i}\mu_1^{y_1}\mu_2^{y_2}...\mu_n^{y_n}}{y_1!y_2!...y_n!} \end{align*}\] Then the log likelihood function is: \[ \ell=-\sum_i \mu_i+\sum_i y_i \log\mu_i-\sum_i \log y_i! \tag{1} \] Now we need to choose a link function; here we choose \(\log\). Note that \(\log\mu_i\) is the coefficient of \(y_i\) in \((1)\): a link function that has the same form as the coefficient of \(y_i\) in the log likelihood function is called the canonical link function. We have: \[ g(\mu_i)=\log(\mu_i)=\beta_0+\beta_1 x_{i1}+...+\beta_px_{ip}=\bf{x}_i^T \bf{\beta} \tag{2} \]

Here \(\bf{x}_i^T\) is the row vector \((x_{i0},x_{i1}, x_{i2},...,x_{ip})\) with \(x_{i0}=1\), \(\bf{\beta}\) is the column vector \((\beta_0, \beta_1,\beta_2,...,\beta_p)\), \(i\) indexes the \(i^{th}\) subject in our data set, and \(p\) is the number of predictors, so each vector has \(p+1\) elements including the intercept. (In statistics a vector is by convention a column vector, therefore \(\bf{x_i}^T\) stands for a row vector.)

From \((2)\) we get \[\mu_i=e^{\bf{x}_i^T \bf{\beta} } \tag{3} \]

and putting \((3)\) back into \((1)\) gives \[ \ell=-\sum_i e^{\bf{x}_i^T \bf{\beta} }+\sum_i y_i \bf{x}_i^T \bf{\beta}-\sum_i \log y_i! \tag{4} \] Taking the derivative with respect to \(\beta_j\), where \(j=0,1,...,p\), we get: \[\begin{align*} \frac{\partial \ell}{\partial \beta_j}&=-\sum_i e^{\bf{x}_i^T \bf{\beta}}x_{ij}+\sum_i y_ix_{ij} \\&=\sum_i x_{ij}(y_i-e^{\bf{x}_i^T \bf{\beta} }) \end{align*}\] We have \(p+1\) such equations; setting them to zero, as long as the number of subjects \(n>p\), we can solve them for \(\beta_0,\beta_1,...,\beta_p\). We can write the above equations in matrix form as \[\nabla \ell(\bf{\beta};\textbf{y,x})=\bf{X}^T(\bf{y}-e^{\bf{X\beta}})\tag{5} \] We can also write \((5)\) as \[\nabla \ell(\bf{\beta};\textbf{y,x})=\bf{X}^T(\bf{y}-\hat{\bf{y}})=\bf{X}^T(\bf{y}-{\bf{\mu}}) \] We call \(\nabla \ell(\bf{\beta};\textbf{y,x})\) the score function (\(\mathcal{S}\)). From \((5)\) we see that \(\bf{\beta}\) is the only unknown; when the score function equals zero, the log likelihood \(\ell\) (and hence \(L\)) is maximized. We can then use the Newton-Raphson method to solve the equation: \[\nabla \ell(\bf{\beta};\textbf{y,x})=\bf{X}^T(\bf{y}-e^{\bf{X\beta}})=0 \tag{6}\]
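
As a quick sanity check of \((4)\) and \((5)\), here is a small R sketch; the function names and the toy values of X, y and \(\beta\) are ours and purely illustrative.

poisson_loglik <- function(beta, X, y) {
  eta <- X %*% beta
  sum(-exp(eta) + y * eta - lfactorial(y))  # log likelihood, equation (4)
}
poisson_score <- function(beta, X, y) {
  t(X) %*% (y - exp(X %*% beta))            # score, equation (5)
}
X_toy <- cbind(1, c(-1, 0, 1))              # intercept column plus one predictor
y_toy <- c(2, 6, 15)
poisson_loglik(c(1.5, 0.5), X_toy, y_toy)
poisson_score(c(1.5, 0.5), X_toy, y_toy)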

To use the Newton-Raphson method, we need to differentiate the score function \((5)\) once more. We get

\[H=\nabla^2 \ell(\bf{\beta};\textbf{y,x})=-\bf{X}^T\bf{W}\bf{X} \tag{7} \] where \(\bf{W}\) is an \(n\times n\) diagonal matrix with \(e^{\bf{x_i^T\beta}}\) as each diagonal element. Note that \((7)\) is the Hessian of the log likelihood function \((4)\) and \(e^{\bf{x_i^T\beta}}=\mu_i\), so \(\bf{W}\) is in fact the diagonal matrix of the Poisson variances, since for the Poisson distribution the variance equals \(\mu_i\).
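
A matching sketch for the Hessian \((7)\), reusing the toy objects defined above (again just an illustration, not the book's code):

poisson_hessian <- function(beta, X) {
  W <- diag(as.vector(exp(X %*% beta)))  # n x n diagonal matrix of Poisson variances
  -t(X) %*% W %*% X                      # Hessian, equation (7)
}
poisson_hessian(c(1.5, 0.5), X_toy)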

Combining \((5)\) and \((7)\), we can use the Newton-Raphson method to solve equation \((6)\): \[\begin{align*} \bf{\beta}^{(m+1)}&=\bf{\beta}^{(m)}-\frac{\nabla \ell(\bf{\beta^{(m)}};\textbf{y,x})}{\nabla^2 \ell(\bf{\beta^{(m)}};\textbf{y,x})} \\&=\bf{\beta}^{(m)}+\frac{\bf{X}^T(\bf{y}-e^{\bf{X\beta}^{(m)}})}{\bf{X}^T\bf{W}_{(m)}\bf{X}}\tag{8} \end{align*}\]

We could stop here: iterating \((8)\) with the Newton-Raphson method until convergence gives us \(\bf{\beta}\).

From equation \((8)\), we can rearrange terms and obtain the so-called iteratively re-weighted least squares (IRLS) algorithm.

Let \(z_{(m)}=W_{(m)}^{-1}(\bf{y}-e^{\bf{X\beta^{(m)}}})\Rightarrow (\bf{y}-e^{\bf{X\beta^{(m)}}})=W_{(m)}z_{(m)}\). Then \((8)\) becomes \[ \bf{\beta}^{(m+1)}=\bf{\beta}^{(m)}+\frac{\bf{X}^TW_{(m)}z_{(m)}}{\bf{X}^T\bf{W}_{(m)}\bf{X}}=\bf{\beta}^{(m)}+{(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^TW_{(m)}z_{(m)} \tag{9} \]

And \({(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^TW_{(m)}z_{(m)}\) is exactly the coefficient vector of a weighted least squares regression of \(z_{(m)}\) on \(X\) with weights \(\bf{W}_{(m)}\), i.e. we can use a weighted least squares regression to solve this part. Hence we have shown how to derive the iteratively re-weighted least squares (IRLS) algorithm from the Newton-Raphson algorithm; essentially, they are the same algorithm.

R code for the Newton-Raphson algorithm and the IRLS algorithm

We use a data set from the book An Introduction to Generalized Linear Models (3rd edition), page 67, for the illustration. We make a design matrix with a constant column added for the intercept. We also make a data frame for the built-in glm function in R.

y<-c(2,3,6,7,8,9,10,12,15)
x<-matrix(c(1,1,1,1,1,1,1,1,1,-1, -1, 0, 0, 0, 0, 1, 1, 1), nrow=9,ncol=2)
x1<-c(-1, -1, 0, 0, 0, 0, 1, 1, 1) #for glm function
data_4.3<-data.frame(y,x1)

poisson_Newton_Raphson<-function(x,y,b.init,tol=1e-8){
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    w <- diag(as.vector(exp(eta)), nrow=nrow(x), ncol=nrow(x))  # W: diagonal matrix of Poisson variances exp(eta)
    b.new <- b.old + solve(t(x)%*%w%*%x)%*%t(x)%*%(y-exp(eta))  # Newton-Raphson update (8)
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}
poisson_Newton_Raphson(x,y,rep(1,2))
##           [,1]
## [1,] 1.8892720
## [2,] 0.6697856
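
We can also recover the confidence interval mentioned at the beginning: at the MLE, the inverse of \(\bf{X}^T\bf{W}\bf{X}\) estimates the covariance matrix of \(\hat{\bf{\beta}}\), so the square roots of its diagonal are the standard errors. Here is a minimal sketch (the object names are ours); the standard errors should agree with the Std. Error column of the glm output below.

b.hat <- poisson_Newton_Raphson(x, y, rep(1, 2))
w.hat <- diag(as.vector(exp(x %*% b.hat)))        # W evaluated at beta-hat
se    <- sqrt(diag(solve(t(x) %*% w.hat %*% x)))  # standard errors of beta-hat
cbind(b.hat - 1.96 * se, b.hat + 1.96 * se)       # approximate 95% Wald confidence intervals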

Next, we use the glm function to fit a Poisson regression model to the data frame data_4.3.

m1<-glm(y ~ x1, family=poisson(link="log"),data=data_4.3)
summary(m1)
## 
## Call:
## glm(formula = y ~ x1, family = poisson(link = "log"), data = data_4.3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8472  -0.2601  -0.2137   0.5214   0.8788  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.8893     0.1421  13.294  < 2e-16 ***
## x1            0.6698     0.1787   3.748 0.000178 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18.4206  on 8  degrees of freedom
## Residual deviance:  2.9387  on 7  degrees of freedom
## AIC: 41.052
## 
## Number of Fisher Scoring iterations: 4

Comparing the results of our home-made function with the built-in glm function, we get the same estimates.

Next we go a little further and implement the so-called IRLS method, i.e. we use a weighted least squares regression to update the coefficients.

poisson_IRLS<-function(x,y,b.init,tol=1e-8){
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    w <- diag(as.vector(exp(eta)), nrow=nrow(x), ncol=nrow(x))  # diagonal weight matrix W
    z <- solve(w) %*% (y-exp(eta))  # working response z = W^{-1}(y - exp(eta))
    weight <- exp(eta)  # weight vector for the weighted least squares regression
    b.new <- b.old + lm(z ~ x - 1, weights = weight)$coef  # WLS update step from (9)
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}
poisson_IRLS(x,y,rep(1,2))
##        x1        x2 
## 1.8892720 0.6697856

We can see that the IRLS method gives the same results as the home-made Newton-Raphson function and the glm function.

Estimate coefficients for a Logistic regression

Suppose the response variable \(\bf{Y}\) has a Bernoulli distribution; then we write

\[f(y_i;p_i)=p_i^{y_i}(1-p_i)^{(1-y_i)} \] The likelihood function is: \[\begin{align*} L&=p_1^{y_1}(1-p_1)^{(1-y_1)}\times p_2^{y_2}(1-p_2)^{(1-y_2)}\times...\times p_n^{y_n}(1-p_n)^{(1-y_n)} \\&= \prod_{i=1}^n p_i^{y_i}(1-p_i)^{(1-y_i)} \end{align*}\] The log likelihood function is: \[ \ell=\sum_{i=1}^n y_i\log p_i+\sum_{i=1}^n (1-y_i)\log(1-p_i)\tag{10} \]
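
As a quick numerical illustration of \((10)\), a small R function (the name and the toy values of p and y are ours) can evaluate the Bernoulli log likelihood directly:

bernoulli_loglik <- function(p, y) {
  sum(y * log(p) + (1 - y) * log(1 - p))  # log likelihood, equation (10)
}
bernoulli_loglik(p = c(0.2, 0.7, 0.9), y = c(0, 1, 1))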

The link function for the logistic regression model is called the logit function, i.e.: \[g(p_i)=logit(p_i)=\log \frac{p_i}{1-p_i}=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+...+\beta_px_{ip}=\bf{x_i}^T\bf{\beta}\tag{11} \] From (11) we have \[\begin{align*} &\frac{p_i}{1-p_i}=e^{\bf{x_i}^T\bf{\beta}} \Rightarrow p_i=e^{\bf{x_i}^T\bf{\beta}}(1-p_i) \Rightarrow p_i=\frac{1}{1+e^{-\bf{x_i^T\beta}}}\tag{12} \end{align*}\] Now we put \((12)\) into \((10)\) and get: \[\begin{align*} \ell&=\sum_{i=1}^n \left\{y_i\log\frac{1}{1+e^{-\bf{x_i^T\beta}}}+(1-y_i)\log(1-\frac{1}{1+e^{-\bf{x_i^T\beta}}})\right\} \\&=\sum_{i=1}^n\left\{-y_i\log(1+e^{-\bf{x_i^T\beta}})+(1-y_i)\log\frac{e^{-\bf{x_i^T\beta}}}{1+e^{-\bf{x_i^T\beta}}}\right\} \\&=\sum_{i=1}^n\left\{-y_i\log(1+e^{-\bf{x_i^T\beta}})-(1-y_i){\bf{x_i^T\beta}}-(1-y_i)\log(1+e^{-\bf{x_i^T\beta}})\right\} \\&=\sum_{i=1}^n\left\{-y_i\log(1+e^{-\bf{x_i^T\beta}})-(1-y_i){\bf{x_i^T\beta}}-\log(1+e^{-\bf{x_i^T\beta}})+y_i\log(1+e^{-\bf{x_i^T\beta}})\right\} \\&=\sum_{i=1}^n\left\{-(1-y_i){\bf{x_i^T\beta}}-\log(1+e^{-\bf{x_i^T\beta}})\right\} \\&=\sum_{i=1}^n\left\{y_i{\bf{x_i^T\beta}}-\bf{x_i^T\beta}-\log(1+e^{-\bf{x_i^T\beta}})\right\} \tag{13} \end{align*}\] Taking the derivative with respect to \(\beta_j\), where \(j=0,1,...,p\), we get: \[\begin{align*} \frac{\partial \ell}{\partial \beta_j }&=\sum_{i=1}^n\left\{y_ix_{ij}-x_{ij}+\frac{1}{1+e^{-\bf{x_i^T\beta}}}e^{-\bf{x_i^T\beta}}x_{ij}\right\} \\&=\sum_{i=1}^nx_{ij}(y_i-1+\frac{e^{-\bf{x_i^T\beta}}}{1+e^{-\bf{x_i^T\beta}}}) \\&=\sum_{i=1}^nx_{ij}(y_i-\frac{1}{1+e^{-\bf{x_i^T\beta}}}) \end{align*}\] Again we can write \(\frac{\partial \ell}{\partial \beta_j }=\sum_{i=1}^nx_{ij}(y_i-\hat{y_i})=\sum_{i=1}^nx_{ij}(y_i-p_i)\)

Taking the derivative with respect to each of \(\beta_0, \beta_1,...,\beta_{p}\) gives us \(p+1\) such equations; as long as \(p<n\) we can solve them and calculate \(\bf{\beta}\). We write these equations in matrix form: \[\begin{align*} \nabla(\ell;\textbf{y,x})&=\bf{X}^T(\bf{y}-\frac{1}{1+e^{-\bf{X\beta}}})\tag{14} \\&=\bf{X}^T(\bf{y}-\bf{p}) \end{align*}\] This is the score function; in order to use the Newton-Raphson method we need to differentiate it with respect to \(\bf{\beta}\) to get the Hessian. \[\begin{align*} \bf{H}=\nabla^2(\ell;\textbf{y,x})&=-\bf{X}^T\left [-\frac{e^{-\bf{X\beta}}}{(1+e^{-\bf{X\beta}})^2} \right](-\bf{X}) \\&=-\bf{X}^T\left[\frac{1}{1+e^{-\bf{X\beta}}}\times \frac{e^{-\bf{X\beta}}}{1+e^{-\bf{X\beta}}}\right]\bf{X}\tag{15} \\&=-\bf{X}^T\left[\bf{p}(1-\bf{p})\right]\bf{X} \\&=-\bf{X}^T\bf{W}\bf{X} \end{align*}\] where \(\bf{W}\) is an \(n\times n\) diagonal matrix with the Bernoulli variances \(p_i(1-p_i)\) as the diagonal elements. Combining \((14)\) and \((15)\), we can use the Newton-Raphson method to solve: \[\nabla(\ell;\textbf{y,x})=0\] Using the Newton-Raphson method we have: \[\begin{align*} \bf{\beta}^{(m+1)}&=\bf{\beta}^{(m)}-\frac{\nabla \ell(\bf{\beta^{(m)}};\textbf{y,x})}{\nabla^2 \ell(\bf{\beta^{(m)}};\textbf{y,x})} \\&=\bf{\beta}^{(m)}+\frac{\bf{X}^T(\bf{y}-\frac{1}{1+e^{-\bf{X\beta}^{(m)}}})}{\bf{X}^T\bf{W}_{(m)}\bf{X}}\tag{16} \end{align*}\] For the IRLS we can go even a little further than \((9)\), starting from \((16)\). Let \(\bf{z}_{(m)}=\bf{W}_{(m)}^{-1}(\bf{y}-\frac{1}{1+e^{-\bf{X\beta}^{(m)}}})\)

\[\begin{align*} \bf{\beta}^{(m+1)}&=\bf{\beta}^{(m)}+{(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^TW_{(m)}z_{(m)} \tag{17} \\&={(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\left[\bf{X}^T\bf{W}_{(m)}\bf{X}\bf{\beta}^{(m)}+\bf{X}^TW_{(m)}z_{(m)}\right] \\&={(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T\bf{W}_{(m)}\left[\bf{X}\bf{\beta}^{(m)}+z_{(m)}\right]\tag{18} \end{align*}\] From \((18)\) we can see that \({(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T\bf{W}_{(m)}\left[\bf{X}\bf{\beta}^{(m)}+z_{(m)}\right]\) is the coefficient vector of the weighted least squares regression of \(\left[\bf{X}\bf{\beta}^{(m)}+z_{(m)}\right]\) on \(\bf{X}\) with weights \(\bf{W}_{(m)}\).
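
Before fitting real data, here is a minimal R sketch of the score \((14)\) and the Hessian \((15)\); the function names are ours, and the inputs are assumed to be a design matrix X with an intercept column, a 0/1 response y and a coefficient vector beta.

logistic_score <- function(beta, X, y) {
  p <- 1 / (1 + exp(-X %*% beta))    # p_i = 1/(1 + exp(-x_i^T beta)), equation (12)
  t(X) %*% (y - p)                   # score, equation (14)
}
logistic_hessian <- function(beta, X) {
  p <- as.vector(1 / (1 + exp(-X %*% beta)))
  -t(X) %*% diag(p * (1 - p)) %*% X  # Hessian -X^T W X, equation (15)
}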

Next let us run logistic regression models using a data set from the UCLA IDRE website. First, we will use the glm function and then we will use the self-made IRLS R function.

library(Matrix)
library(MASS)
#########glm build in function
df <- read.csv('https://stats.idre.ucla.edu/stat/data/binary.csv')
mylogit <- glm(admit ~ gre + gpa + rank, data = df, family = binomial(link = "logit"))
summary(mylogit)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5802  -0.8848  -0.6382   1.1575   2.1732  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.449548   1.132846  -3.045  0.00233 ** 
## gre          0.002294   0.001092   2.101  0.03564 *  
## gpa          0.777014   0.327484   2.373  0.01766 *  
## rank        -0.560031   0.127137  -4.405 1.06e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 459.44  on 396  degrees of freedom
## AIC: 467.44
## 
## Number of Fisher Scoring iterations: 4
#########IRLS home made function
y<-df[,1]
x<-as.matrix(cbind(rep(1,400),df[,c(2,3,4)])) # add a column of 1s for the intercept

IRLS_logistic <- function(x, y, b.init, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    p<-1/(1+exp(-eta))
    w<-diag(as.vector(p*(1-p)),nrow=nrow(x),ncol=nrow(x))#diagonal matrix
    weight<-p*(1-p) #vector weight for wls
    z<-solve(w)%*%(y-p)
    z.new<-x%*%b.old+z
    b.new <- lm(z.new ~ x - 1, weights = weight)$coef  # WLS regression
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}
IRLS_logistic (x, y, rep(0.001,4))
## xrep(1, 400)         xgre         xgpa        xrank 
##  -3.44954840   0.00229396   0.77701357  -0.56003139

We can see that the self-written IRLS function gives exactly the same results as the glm function.
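
As in the Poisson example, the standard errors reported by summary(mylogit) can be recovered from the IRLS output, since the inverse of \(\bf{X}^T\bf{W}\bf{X}\) evaluated at \(\hat{\bf{\beta}}\) estimates the covariance matrix of \(\hat{\bf{\beta}}\). A short sketch using the objects defined above (the results should match the Std. Error column up to convergence tolerance):

b.hat <- IRLS_logistic(x, y, rep(0.001, 4))
p.hat <- as.vector(1 / (1 + exp(-x %*% b.hat)))
w.hat <- diag(p.hat * (1 - p.hat))       # W evaluated at beta-hat
sqrt(diag(solve(t(x) %*% w.hat %*% x)))  # compare with the Std. Error column above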

References

1. Dobson, A. J., & Barnett, A. G. (2008). An Introduction to Generalized Linear Models (3rd ed.). Chapman and Hall/CRC.

2. Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.