The generalized linear model (GLM) and general linear regression are usually treated as different regression methods; in fact, general linear regression is just a special case of the GLM. For general linear regression we can write the model as \[ y_i=\beta_0+\beta_1 x_{i1}+...+\beta_px_{ip}+\epsilon_i \]
where \(\epsilon_i\) has a certain distribution, usually normal \(N(0,\sigma^2)\). Treating \(\bf{\beta}\) and \(\bf{x}\) as constants, \(y_i\) then has the same distribution as \(\epsilon_i\), shifted by the linear predictor.
For the generalized linear model (GLM) there is no error term \(\epsilon\) anymore; we directly assume that the response variable \(Y\) follows a certain distribution such as the Poisson, Bernoulli, or Gamma distribution. For these distributions we assume that the mean \(\mu\), i.e. \(E(Y)\), after some transformation is linear in the predictor variables. These transformations are called link functions and can be written as: \[g(\mu_i)=\beta_0+\beta_1 x_{i1}+...+\beta_px_{ip} \] where \(g\) is called the link function and usually we denote \[\eta_i= \beta_0+\beta_1 x_{i1}+...+\beta_px_{ip} \] (Note: \(i\) here means the \(i^{th}\) row, subject, data record or observation.)
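As a quick illustration that general linear regression is a special case of the GLM (a Gaussian response with the identity link), the following sketch fits the same model with lm and with glm on simulated data; the variable names are just placeholders, and the two calls give identical coefficient estimates.

# General linear regression as a GLM: Gaussian family, identity link
set.seed(1)
x.sim <- rnorm(50)
y.sim <- 1 + 2 * x.sim + rnorm(50)       # simulated toy data
coef(lm(y.sim ~ x.sim))                  # ordinary least squares
coef(glm(y.sim ~ x.sim, family = gaussian(link = "identity"))) # same estimates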
To calculate \(\bf{\beta}\) and its confidence interval, the maximum likelihood estimation (MLE) method is often used. Often we cannot find an explicit solution for the MLE, so this method is usually combined with the Newton-Raphson algorithm, which leads to the so-called iteratively reweighted least squares (IRLS) method. Suppose our response variable \(Y\) has a Poisson distribution; then we can write:
\[ f(y_i;\mu_i)=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}\] The likelihood function is: \[\begin{align*} L=\frac{e^{-\mu_1}\mu_1^{y_1}}{y_1!}\times\frac{e^{-\mu_2}\mu_2^{y_2}}{y_2!}\times...\times\frac{e^{-\mu_n} \mu_n^{y_n}}{y_n!}=\frac{e^{-\sum \mu_i}\mu_1^{y_1}\mu_2^{y_2}...\mu_n^{y_n}}{y_1!y_2!...y_n!} \end{align*}\] Then the log likelihood function is: \[ \ell=-\sum_i \mu_i+\sum_i y_i \log\mu_i-\sum_i \log y_i! \tag{1} \] Now we choose a link function; we take \(\log\) as the link function. Note that \(\log\mu_i\) is the coefficient of \(y_i\) in \((1)\); a link function that has the same form as the coefficient of \(y_i\) in the log likelihood is called the canonical link function. We have: \[ g(\mu_i)=\log(\mu_i)=\beta_0+\beta_1 x_{i1}+...+\beta_px_{ip}=\bf{x}_i^T \bf{\beta} \tag{2} \]
Here \(\bf{x}_i^T\) is the row vector \((x_{i0},x_{i1}, x_{i2},...,x_{ip})\) with \(x_{i0}=1\), \(\bf{\beta}\) is the column vector \((\beta_0, \beta_1,\beta_2,...,\beta_p)\), \(i\) indexes the \(i^{th}\) subject in our data set, and \(p\) is the number of predictors, so each vector has \(p+1\) elements including the intercept term. (In statistics a vector is by default a column vector, therefore \(\bf{x_i}^T\) stands for a row vector.)
From \((2)\) we get \[\mu_i=e^{\bf{x}_i^T \bf{\beta} } \tag{3} \]
and substituting \((3)\) back into \((1)\) gives \[ \ell=-\sum_i e^{\bf{x}_i^T \bf{\beta} }+\sum_i y_i \bf{x}_i^T \bf{\beta}-\sum_i \log y_i! \tag{4} \] Taking the derivative with respect to \(\beta_j\), where \(j=0,1,...,p\), we get: \[\begin{align*} \frac{\partial \ell}{\partial \beta_j}&=-\sum_i e^{\bf{x}_i^T \bf{\beta}}x_{ij}+\sum_i y_ix_{ij} \\&=\sum_i x_{ij}(y_i-e^{\bf{x}_i^T \bf{\beta} }) \end{align*}\] We have \(p+1\) such equations and set them all to zero; as long as the number of subjects \(n>p+1\) (and \(\bf{X}\) has full column rank) we can solve these equations for \(\beta_0,\beta_1,...,\beta_p\). We can write the above equations in matrix form: \[\nabla \ell(\bf{\beta};\textbf{y,x})=\bf{X}^T(\bf{y}-e^{\bf{X\beta}})\tag{5} \] We can also write \((5)\) as \[\nabla \ell(\bf{\beta};\textbf{y,x})=\bf{X}^T(\bf{y}-\hat{\bf{y}})=\bf{X}^T(\bf{y}-{\bf{\mu}}) \] We call \(\nabla \ell(\bf{\beta};\textbf{y,x})\) the score function (\(\mathcal{S}\)). From \((5)\) we see that \(\bf{\beta}\) is the only unknown; where the score function equals zero, the log likelihood \(\ell\) (and hence \(L\)) is maximized. We can use the Newton-Raphson method to solve the equation: \[\nabla \ell(\bf{\beta};\textbf{y,x})=\bf{X}^T(\bf{y}-e^{\bf{X\beta}})=0 \tag{6}\]
To use the Newton-Raphson method, we need to take the derivative of the score function \((5)\) once more. We get
\[H=\nabla^2 \ell(\bf{\beta};\textbf{y,x})=-\bf{X}^T\bf{W}\bf{X} \tag{7} \] where \(\bf{W}\) is an \(n\times n\) diagonal matrix with \(e^{\bf{x_i^T\beta}}\) as each diagonal element. Note that \((7)\) is the Hessian of the log likelihood function \((4)\) and \(e^{\bf{x_i^T\beta}}=\mu_i\), so \(\bf{W}\) is in fact the diagonal matrix of Poisson variances, since for the Poisson distribution the variance equals \(\mu_i\).
Combining \((5)\) and \((7)\), we can use the Newton-Raphson method to solve equation \((6)\): \[\begin{align*} \bf{\beta}^{(m+1)}&=\bf{\beta}^{(m)}-\left[\nabla^2 \ell(\bf{\beta^{(m)}};\textbf{y,x})\right]^{-1}\nabla \ell(\bf{\beta^{(m)}};\textbf{y,x}) \\&=\bf{\beta}^{(m)}+{(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T(\bf{y}-e^{\bf{X\beta}^{(m)}})\tag{8} \end{align*}\]
We could stop here: iterating \((8)\) until convergence solves \((6)\) and gives us \(\bf{\beta}\).
From equation \((8)\), with a little rearrangement, we obtain the so-called iteratively reweighted least squares (IRLS) algorithm.
Let \(z_{(m)}=\bf{W}_{(m)}^{-1}(\bf{y}-e^{\bf{X\beta^{(m)}}})\), so that \((\bf{y}-e^{\bf{X\beta^{(m)}}})=\bf{W}_{(m)}z_{(m)}\). Then \((8)\) becomes \[ \bf{\beta}^{(m+1)}=\bf{\beta}^{(m)}+{(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T\bf{W}_{(m)}z_{(m)} \tag{9} \]
And \({(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T\bf{W}_{(m)}z_{(m)}\) is exactly the coefficient vector of a weighted least squares regression of \(z_{(m)}\) on \(\bf{X}\) with weights \(\bf{W}_{(m)}\), i.e. we can use a weighted least squares regression to compute this part. Hence we have shown how the iteratively reweighted least squares (IRLS) algorithm is obtained from the Newton-Raphson algorithm; essentially they are the same algorithm.
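The claim that \({(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T\bf{W}_{(m)}z_{(m)}\) is just a weighted least squares fit is easy to check numerically. Here is a minimal sketch with made-up numbers (all object names are placeholders): the closed form and lm with weights return the same coefficients.

# Weighted least squares: closed form vs. lm with weights
set.seed(2)
X.toy <- cbind(1, rnorm(6))              # toy design matrix
z.toy <- rnorm(6)                        # toy working response
w.toy <- runif(6)                        # positive weights (diagonal of W)
W.toy <- diag(w.toy)
solve(t(X.toy) %*% W.toy %*% X.toy) %*% t(X.toy) %*% W.toy %*% z.toy # closed form
coef(lm(z.toy ~ X.toy - 1, weights = w.toy))                         # same numbers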
We use a data set from the book An Introduction to Generalized Linear Models (3rd edition), page 67, for illustration. We build a design matrix, adding a constant column for the intercept. We also make a data frame for the built-in glm function in R.
y <- c(2, 3, 6, 7, 8, 9, 10, 12, 15)
x <- matrix(c(1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, 0, 0, 0, 0, 1, 1, 1), nrow = 9, ncol = 2) # design matrix with intercept column
x1 <- c(-1, -1, 0, 0, 0, 0, 1, 1, 1) # predictor for the glm function
data_4.3 <- data.frame(y, x1)

poisson_Newton_Raphson <- function(x, y, b.init, tol = 1e-8) {
  change <- Inf
  b.old <- b.init
  while (change > tol) {
    eta <- x %*% b.old                                             # linear predictor
    w <- diag(as.vector(exp(eta)), nrow = nrow(x), ncol = nrow(x)) # W = diag(mu), see (7)
    b.new <- b.old + solve(t(x) %*% w %*% x) %*% t(x) %*% (y - exp(eta)) # Newton-Raphson update (8)
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

poisson_Newton_Raphson(x, y, rep(1, 2))
## [,1]
## [1,] 1.8892720
## [2,] 0.6697856
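Since the negative Hessian \(\bf{X}^T\bf{W}\bf{X}\) evaluated at the MLE is the information matrix (for the canonical link, observed and expected information coincide), its inverse gives the asymptotic covariance of \(\hat{\bf{\beta}}\). This is where the standard errors and Wald confidence intervals mentioned at the beginning come from. A minimal sketch, reusing x, y and the fitted coefficients from above; the standard errors should match the Std. Error column reported by glm below.

# Standard errors and 95% Wald confidence intervals from (X^T W X)^{-1}
b.hat <- poisson_Newton_Raphson(x, y, rep(1, 2))
W.hat <- diag(as.vector(exp(x %*% b.hat)), nrow = nrow(x), ncol = nrow(x))
se.b  <- sqrt(diag(solve(t(x) %*% W.hat %*% x)))   # asymptotic standard errors
cbind(estimate = as.vector(b.hat),
      lower    = as.vector(b.hat) - 1.96 * se.b,
      upper    = as.vector(b.hat) + 1.96 * se.b)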
Next, we use the glm function to fit a Poisson regression model using the data frame data_4.3.
m1<-glm(y ~ x1, family=poisson(link="log"),data=data_4.3)
summary(m1)
##
## Call:
## glm(formula = y ~ x1, family = poisson(link = "log"), data = data_4.3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8472 -0.2601 -0.2137 0.5214 0.8788
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8893 0.1421 13.294 < 2e-16 ***
## x1 0.6698 0.1787 3.748 0.000178 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18.4206 on 8 degrees of freedom
## Residual deviance: 2.9387 on 7 degrees of freedom
## AIC: 41.052
##
## Number of Fisher Scoring iterations: 4
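The confidence intervals for \(\bf{\beta}\) mentioned at the beginning can also be read off this fit. For example, confint.default(m1) returns Wald intervals (estimate \(\pm 1.96\times\) standard error), based on the same \({(\bf{X}^T\bf{W}\bf{X})}^{-1}\) covariance matrix, while confint(m1) returns profile-likelihood intervals.

confint.default(m1)   # Wald confidence intervals from the standard errors above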
Comparing the results of our home-made function with the built-in R glm function, we get the same estimates.
Next we go a little further and use the IRLS method, i.e. we use a weighted least squares regression to update the coefficients.
poisson_IRLS <- function(x, y, b.init, tol = 1e-8) {
  change <- Inf
  b.old <- b.init
  while (change > tol) {
    eta <- x %*% b.old                                             # linear predictor
    w <- diag(as.vector(exp(eta)), nrow = nrow(x), ncol = nrow(x)) # diagonal weight matrix W
    z <- solve(w) %*% (y - exp(eta))                               # working response z = W^{-1}(y - mu)
    weight <- exp(eta)                                             # weight vector for the least squares regression
    b.new <- b.old + lm(z ~ x - 1, weights = weight)$coef          # weighted least squares step, see (9)
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

poisson_IRLS(x, y, rep(1, 2))
## x1 x2
## 1.8892720 0.6697856
We can see that the IRLS method gives the same results as the home-made Newton-Raphson function and the glm function.
Now suppose the response variable \(\bf{Y}\) has a Bernoulli distribution; then we write
\[f(y_i;p_i)=p_i^{y_i}(1-p_i)^{(1-y_i)} \] The likelihood function is: \[\begin{align*} L&=p_1^{y_1}(1-p_1)^{(1-y_1)}\times p_2^{y_2}(1-p_2)^{(1-y_2)}\times...\times p_n^{y_n}(1-p_n)^{(1-y_n)} \\&= \prod_{i=1}^n p_i^{y_i}(1-p_i)^{(1-y_i)} \end{align*}\] The log likelihood function is: \[ \ell=\sum_{i=1}^n y_i\log p_i+\sum_{i=1}^n (1-y_i)\log(1-p_i)\tag{10} \]
The link function for the logistic regression model is the logit function, i.e.: \[g(p_i)=logit(p_i)=\log \frac{p_i}{1-p_i}=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+...+\beta_px_{ip}=\bf{x_i}^T\bf{\beta}\tag{11} \] From \((11)\) we have \[\begin{align*} &\frac{p_i}{1-p_i}=e^{\bf{x_i}^T\bf{\beta}} \Rightarrow p_i=e^{\bf{x_i}^T\bf{\beta}}(1-p_i) \Rightarrow p_i=\frac{1}{1+e^{-\bf{x_i^T\beta}}}\tag{12} \end{align*}\] Now, substituting \((12)\) into \((10)\), we get: \[\begin{align*} \ell&=\sum_{i=1}^n \left\{y_i\log\frac{1}{1+e^{-\bf{x_i^T\beta}}}+(1-y_i)\log(1-\frac{1}{1+e^{-\bf{x_i^T\beta}}})\right\} \\&=\sum_{i=1}^n\left\{-y_i\log(1+e^{-\bf{x_i^T\beta}})+(1-y_i)\log\frac{e^{-\bf{x_i^T\beta}}}{1+e^{-\bf{x_i^T\beta}}}\right\} \\&=\sum_{i=1}^n\left\{-y_i\log(1+e^{-\bf{x_i^T\beta}})-(1-y_i){\bf{x_i^T\beta}}-(1-y_i)\log(1+e^{-\bf{x_i^T\beta}})\right\} \\&=\sum_{i=1}^n\left\{-y_i\log(1+e^{-\bf{x_i^T\beta}})-(1-y_i){\bf{x_i^T\beta}}-\log(1+e^{-\bf{x_i^T\beta}})+y_i\log(1+e^{-\bf{x_i^T\beta}})\right\} \\&=\sum_{i=1}^n\left\{-(1-y_i){\bf{x_i^T\beta}}-\log(1+e^{-\bf{x_i^T\beta}})\right\} \\&=\sum_{i=1}^n\left\{y_i{\bf{x_i^T\beta}}-\bf{x_i^T\beta}-\log(1+e^{-\bf{x_i^T\beta}})\right\} \tag{13} \end{align*}\] Taking the derivative with respect to \(\beta_j\), where \(j=0,1,...,p\), we get: \[\begin{align*} \frac{\partial \ell}{\partial \beta_j }&=\sum_{i=1}^n\left\{y_ix_{ij}-x_{ij}+\frac{1}{1+e^{-\bf{x_i^T\beta}}}e^{-\bf{x_i^T\beta}}x_{ij}\right\} \\&=\sum_{i=1}^nx_{ij}(y_i-1+\frac{e^{-\bf{x_i^T\beta}}}{1+e^{-\bf{x_i^T\beta}}}) \\&=\sum_{i=1}^nx_{ij}(y_i-\frac{1}{1+e^{-\bf{x_i^T\beta}}}) \end{align*}\] Again we can write \(\frac{\partial \ell}{\partial \beta_j }=\sum_{i=1}^nx_{ij}(y_i-\hat{y_i})=\sum_{i=1}^nx_{ij}(y_i-p_i)\).
Doing this for every \(\beta_j\), \(j=0,1,...,p\), gives \(p+1\) equations; as long as \(p+1<n\) we can solve them and calculate \(\bf{\beta}\). We write these equations in matrix form: \[\begin{align*} \nabla \ell(\bf{\beta};\textbf{y,x})&=\bf{X}^T(\bf{y}-\frac{1}{1+e^{-\bf{X\beta}}})\tag{14} \\&=\bf{X}^T(\bf{y}-\bf{p}) \end{align*}\] This is the score function; in order to use the Newton-Raphson method we need to differentiate it with respect to \(\bf{\beta}\) to get the Hessian. \[\begin{align*} \bf{H}=\nabla^2 \ell(\bf{\beta};\textbf{y,x})&=-\bf{X}^T\left [-\frac{e^{-\bf{X\beta}}}{(1+e^{-\bf{X\beta}})^2} \right](-\bf{X}) \\&=-\bf{X}^T\left[\frac{1}{1+e^{-\bf{X\beta}}}\times \frac{e^{-\bf{X\beta}}}{1+e^{-\bf{X\beta}}}\right]\bf{X}\tag{15} \\&=-\bf{X}^T\left[\bf{p}(1-\bf{p})\right]\bf{X} \\&=-\bf{X}^T\bf{W}\bf{X} \end{align*}\] where \(\bf{W}\) is an \(n\times n\) diagonal matrix with the Bernoulli variances \(p_i(1-p_i)\) as the diagonal elements. Combining \((14)\) and \((15)\) we can use the Newton-Raphson method to solve: \[\nabla \ell(\bf{\beta};\textbf{y,x})=0\] Using the Newton-Raphson method we have: \[\begin{align*} \bf{\beta}^{(m+1)}&=\bf{\beta}^{(m)}-\left[\nabla^2 \ell(\bf{\beta^{(m)}};\textbf{y,x})\right]^{-1}\nabla \ell(\bf{\beta^{(m)}};\textbf{y,x}) \\&=\bf{\beta}^{(m)}+{(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T(\bf{y}-\frac{1}{1+e^{-\bf{X}\bf{\beta}^{(m)}}})\tag{16} \end{align*}\] We can go a little further than \((9)\), starting from \((16)\), to get the IRLS update. Let \(\bf{z}_{(m)}=\bf{W}_{(m)}^{-1}(\bf{y}-\frac{1}{1+e^{-\bf{X}\bf{\beta}^{(m)}}})\); then
\[\begin{align*} \bf{\beta}^{(m+1)}&=\bf{\beta}^{(m)}+{(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T\bf{W}_{(m)}z_{(m)} \tag{17} \\&={(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\left[\bf{X}^T\bf{W}_{(m)}\bf{X}\bf{\beta}^{(m)}+\bf{X}^T\bf{W}_{(m)}z_{(m)}\right] \\&={(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T\bf{W}_{(m)}\left[\bf{X}\bf{\beta}^{(m)}+z_{(m)}\right]\tag{18} \end{align*}\] From \((18)\) we can see that \({(\bf{X}^T\bf{W}_{(m)}\bf{X})}^{-1}\bf{X}^T\bf{W}_{(m)}\left[\bf{X}\bf{\beta}^{(m)}+z_{(m)}\right]\) is the coefficient vector of a weighted least squares regression of \(\left[\bf{X}\bf{\beta}^{(m)}+z_{(m)}\right]\) on \(\bf{X}\) with weights \(\bf{W}_{(m)}\).
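Before moving to real data, here is a small numerical sanity check (on simulated toy data; all object names are placeholders) that the analytic score \(\bf{X}^T(\bf{y}-\bf{p})\) in \((14)\) agrees with a finite-difference gradient of the log likelihood \((13)\).

# Check the analytic score X^T(y - p) against a numerical gradient of (13)
set.seed(3)
X.toy <- cbind(1, rnorm(20))
b.toy <- c(0.5, -1)
p.toy <- 1 / (1 + exp(-X.toy %*% b.toy))
y.toy <- rbinom(20, 1, p.toy)
loglik <- function(b) {                      # log likelihood (13)
  eta <- X.toy %*% b
  sum(y.toy * eta - eta - log(1 + exp(-eta)))
}
num.grad <- sapply(1:2, function(j) {        # central finite differences
  h <- 1e-6
  e <- replace(numeric(2), j, h)
  (loglik(b.toy + e) - loglik(b.toy - e)) / (2 * h)
})
rbind(analytic = as.vector(t(X.toy) %*% (y.toy - p.toy)), numerical = num.grad)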
Next let us run logistic regression models using a data set from the UCLA IDRE website. First, we will use the glm function and then we will use the self-made IRLS R function.
library(Matrix)
library(MASS)
#########glm build in function
df <- read.csv('https://stats.idre.ucla.edu/stat/data/binary.csv')
mylogit <- glm(admit ~ gre + gpa + rank, data = df, family = binomial(link = "logit"))
summary(mylogit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5802 -0.8848 -0.6382 1.1575 2.1732
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.449548 1.132846 -3.045 0.00233 **
## gre 0.002294 0.001092 2.101 0.03564 *
## gpa 0.777014 0.327484 2.373 0.01766 *
## rank -0.560031 0.127137 -4.405 1.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 459.44 on 396 degrees of freedom
## AIC: 467.44
##
## Number of Fisher Scoring iterations: 4
######### IRLS home-made function
y <- df[, 1]
x <- as.matrix(cbind(rep(1, 400), df[, c(2, 3, 4)])) # add a column of 1s for the intercept

IRLS_logistic <- function(x, y, b.init, tol = 1e-8) {
  change <- Inf
  b.old <- b.init
  while (change > tol) {
    eta <- x %*% b.old                     # linear predictor
    p <- 1 / (1 + exp(-eta))               # fitted probabilities
    w <- diag(as.vector(p * (1 - p)), nrow = nrow(x), ncol = nrow(x)) # diagonal weight matrix W
    weight <- as.vector(p * (1 - p))       # weight vector for the weighted least squares
    z <- solve(w) %*% (y - p)              # working response z = W^{-1}(y - p)
    z.new <- x %*% b.old + z               # X beta + z, see (18)
    b.new <- lm(z.new ~ x - 1, weights = weight)$coef # WLS regression
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

IRLS_logistic(x, y, rep(0.001, 4))
## xrep(1, 400) xgre xgpa xrank
## -3.44954840 0.00229396 0.77701357 -0.56003139
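As a final check of \((14)\), at the converged coefficients the score \(\bf{X}^T(\bf{y}-\bf{p})\) should be numerically zero; a minimal sketch reusing x, y and the IRLS fit from above:

b.hat <- IRLS_logistic(x, y, rep(0.001, 4))
p.hat <- 1 / (1 + exp(-x %*% b.hat))
t(x) %*% (y - p.hat)   # score at the solution: entries very close to zero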
We can see that the self-written IRLS function gives exactly the same results as the glm function.
1. Dobson, A. J., & Barnett, A. G. (2008). An Introduction to Generalized Linear Models (3rd ed.). Chapman and Hall/CRC.
2. Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.