Generalised Linear Models

Jake

26/06/2021

Generalised Linear Models

  • The goal of generalised linear models is to model a transformation of the mean response linearly. This allows many of the Gaussian model assumptions to be relaxed and opens up the possibility of using other distributions.
    • Consider the following setup, where \(g(\mu_i)\) is a smooth, invertible transformation of the mean response.

\[ \mathbb{E}[y_i]=\mu_i,\qquad g(\mu_i)=\mathbf{x}_i^\top\boldsymbol{\beta}\]

  • Distributions that can be easily modelled by the generalised linear model framework belong to the exponential family.
    • \(\theta_i = g(\mu_i)\), where \(g\) is known as the canonical link function
    • \(\phi\) is the scale parameter, which is possibly unknown
    • \(A_i\) are known weights and \(c(\cdot),h(\cdot)\) are known functions
    \[ f(y_i;\theta_i,\phi)=\exp\left(\frac{A_i(y_i\theta_i-c(\theta_i))}{\phi}+h\left(y_i,\frac{\phi}{A_i}\right)\right)\]
  • If \(y_i\) has a density in the exponential family, the following hold:

\[ \mathbb{E}[y_i]=c'(\theta_i) \]

\[ \mathbb{V}ar(y_i)=\frac{\phi}{A_i}c''(\theta_i) \]

\[ c''(\theta_i) = c''(\theta_i(\mu_i)) = \mathbb{V}(\mu_i)\quad\therefore\mathbb{V}ar(y_i)=\frac{\phi}{A_i}\mathbb{V}(\mu_i) \]

Beta Estimates

  • Beta estimates are found through maximum likelihood estimation, i.e. by maximising:

\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}f(y_i;\theta_i,\phi)\]

  • As these estimates are MLEs, they have nice properties: they are consistent, asymptotically unbiased and asymptotically normally distributed. This allows for inference on the parameters similar to the Gaussian model.

    • For large samples:

    \[ \boldsymbol{\hat{\beta}}\sim^{\text{approx}}N(\boldsymbol{\beta},\mathbf{C}^{-1})\]

  • Where \(\mathbf{C}\) is the information matrix (the negative Hessian of the log-likelihood), with \((i,j)\)th entry:

\[ \mathbf{C}_{ij}=-\frac{\partial^2\log\ell(\boldsymbol{\beta})}{\partial\beta_i\partial\beta_j}\]

  • The Wald hypothesis test relies on the asymptotic normality property and is therefore best suited to large samples.

\[ H_0: \beta_j = 0,\quad H_1:\beta_j\neq 0\]

  • With test statistic:

\[ Z = \frac{\hat{\beta}_j}{\hat{se}(\hat{\beta}_j)}\sim^{\text{approx}}N(0,1) \]

  • With p-value:

\[ p = P(|Z|\geq|z| \mid \beta_j=0)\]

Iteratively Weighted Least Squares

  • Iteratively weighted least squares was developed as an extension of the Newton-Raphson method; the algorithm is detailed in its own section at the end of these notes.

Scaled Deviance

  • We can calculate the deviance of a model by comparing the likelihood of a parsimonious model estimate against that of a saturated model estimate

    • A saturated model is one with a parameter for each observation.

    \[ \lambda(\beta) = -2\log\frac{\overbrace{\ell(\hat{\beta})}^{\text{Less}}}{\underbrace{\ell(\beta^*)}_{\text{Sat}}} \]

  • If we have a scale parameter \(\phi\neq 1\) then we have the scaled deviance:

\[ \frac{\lambda(\beta)}{\phi} \]

  • If we partition the beta vector we can test for the significance of one or more betas.

\[ \boldsymbol{\beta}_p = \left(\boldsymbol{\beta}^{(1)}_q,\ \boldsymbol{\beta}^{(2)}_{p-q}\right)^\top \]

\[ H_0: \boldsymbol{\beta}^{(2)}=\mathbf{0} \]

  • Assuming \(H_0\) is true, as \(n\rightarrow\infty\):
    • \(p\) is the total number of predictors
    • \(q\) is the number of predictors in \(\boldsymbol{\beta}^{(1)}\)

\[ \frac{\overbrace{\lambda(\boldsymbol{\beta}^{(1)})}^{\text{Less}}-\overbrace{\lambda(\boldsymbol{\beta})}^{\text{Full}}}{\phi}\sim\chi^2_{p-q}\]

  • This test can be used for multiple predictors at once or on a one-by-one basis.
    • It should yield similar results to the Wald test for individual coefficients, whose squared statistic is distributed \(\sim\chi^2_{1}\)
  • To get the relevant chi squared quantiles:
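For example, a sketch of the hidden chunk (the output below is the 95% quantile of \(\chi^2_1\)):

qchisq(0.95, df = 1)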
## [1] 3.841459
  • Can also do sequential tests for these single coefficients, or groups of coefficients in the case of factors, through the anova command:
    • Each row represents the change in deviance and df when each predictor is added
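A minimal sketch of the hidden chunk, assuming the logistic fit to the spam data (shown in full under Binomial Regression) is stored as fit.spam (our name, not the author's):

fit.spam <- glm(isjunk ~ ., family = binomial, data = spam)
anova(fit.spam)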
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: isjunk
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev
## NULL                         4600     6170.2
## freq.excl    1   806.85      4599     5363.3
## freq.dollar  1   896.40      4598     4466.9
## freq.hash    1    17.97      4597     4448.9
## average      1   221.82      4596     4227.1

Binomial Regression

  • Used for proportional data
    • A common case is when there is a binary outcome for a single trial, such as win/lose

\[ f(y_i) = \binom{n_i}{y_i}p^{y_i}(1-p)^{n_i-y_i}\]

  • Our goal is to model the probability of an event occurring, which is bounded between \((0,1)\), so we map onto this range with the logistic function and then invert it to obtain a linear model.

\[ P(x_i) = \frac{\exp(x_i)}{1+\exp(x_i)} = \frac{1}{1+\exp(-x_i)}\]

  • With inverse:

\[ x(p) = \log\left(\frac{p}{1-p}\right)\]

  • This forms the basis of our binomial model, where we model the log odds in a linear fashion:

\[ \underbrace{\log\left(\frac{P(x_i)}{1-P(x_i)}\right)}_{\text{Log odds}} = \mathbf{x}_i^\top\boldsymbol{\beta}\]

  • Bernoulli Regression in R (\(1\) trial, e.g. win or lose):
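A sketch of the hidden chunk, reconstructed from the Call in the output below (the object name fit.spam is ours):

fit.spam <- glm(isjunk ~ ., family = binomial, data = spam)
summary(fit.spam)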
## 
## Call:
## glm(formula = isjunk ~ ., family = binomial, data = spam)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.4904  -0.6549  -0.5781   0.5594   1.9608  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.96341    0.06573 -29.871   <2e-16 ***
## freq.excl    1.43767    0.11126  12.922   <2e-16 ***
## freq.dollar 12.13917    0.62319  19.479   <2e-16 ***
## freq.hash    0.27763    0.13236   2.097    0.036 *  
## average      0.19913    0.01739  11.448   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6170.2  on 4600  degrees of freedom
## Residual deviance: 4227.1  on 4596  degrees of freedom
## AIC: 4237.1
## 
## Number of Fisher Scoring iterations: 15
  • Binomial Regression in R (\(n_i\) trials for each datapoint \(y_i\)):
    • Note that \(n_i\) can differ between datapoints, so we standardise the response to a proportion and pass the \(n_i\) as weights to the glm function:
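A sketch of the hidden chunk, reconstructed from the Call below; prop is assumed to be the pre-computed proportion of successes, and fit.car is our name:

fit.car <- glm(prop ~ Car + Age, family = binomial, data = cardata, weights = Holders)
summary(fit.car)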
## 
## Call:
## glm(formula = prop ~ Car + Age, family = binomial, data = cardata, 
##     weights = Holders)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6  
##  0.76432  -0.72292  -0.08346  -0.52188   0.57918   0.02340  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.5124     0.2954 -15.276  < 2e-16 ***
## Carmedium     1.1830     0.2864   4.131 3.62e-05 ***
## Carsmall      1.9978     0.2824   7.075 1.50e-12 ***
## Age2          1.4890     0.1431  10.404  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 195.1110  on 5  degrees of freedom
## Residual deviance:   1.7221  on 2  degrees of freedom
## AIC: 39.204
## 
## Number of Fisher Scoring iterations: 4

Estimation of Parameters

  • We utilise maximum likelihood estimation
  • Bernoulli likelihood:

\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}P(x_i)^{y_i}(1-P(x_i))^{1-y_i}\]

  • Beta estimates, note this needs to be solved numerically:

\[ \sum^n_{i=1}y_i\mathbf{x}_i=\sum^n_{i=1}\mathbf{x}_i\left(1-\frac{\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}{1+\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}\right)\]

  • Binomial likelihood (omitting the binomial coefficient, which does not involve \(\boldsymbol{\beta}\)):

\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}P(x_i)^{y_i}(1-P(x_i))^{n_i-y_i} \]

  • Beta estimates, note this needs to be solved numerically:

\[ \sum^n_{i=1}y_i\mathbf{x}_i=\sum^n_{i=1}n_i\mathbf{x}_i\left(1-\frac{\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}{1+\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}\right)\]

Deviance

  • The deviance for the logistic regression:
    • The estimate for the saturated model is \(\hat{p}_i=\frac{y_i}{n_i}\)
    • The estimate for the parsimonious model is \(\hat{P}(x_i)=\frac{\hat{y}_i}{n_i}\)

\[ \lambda(\boldsymbol{\beta})=2\sum_i\left(y_i\log\frac{y_i}{\hat{y}_i}+(n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat{y}_i}\right)\right)\]

Residuals

  • Residuals aren’t as helpful in the binomial model (especially when \(n_i\) is small) as in the Gaussian model, because fewer distributional assumptions are placed on the model. They can still be useful for spotting outliers and the like.
    • Residuals should still have mean 0 and show no pattern
  • Pearson Residuals, where we utilise an estimate of the standard deviation for the binomial distribution:

\[ \frac{y_i-\hat{y}_i}{\sqrt{n_i\hat{P}(x_i)(1-\hat{P}(x_i))}}\]

  • In R:
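A sketch, with fit a placeholder name for the fitted binomial glm object:

residuals(fit, type = "pearson")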
##         1         2         3         4         5 
## 1.0496847 0.4089924 0.2690926 1.7007465 1.7031934
  • Deviance Residuals:

\[ d_i = \text{sign}(y_i-\hat{y}_i)\sqrt{e_i},\quad e_i=2\left(y_i\log\frac{y_i}{\hat{y}_i}+(n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat{y}_i}\right)\right)\]

  • In R:
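Similarly, with fit again a placeholder:

residuals(fit, type = "deviance")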
##         1         2         3         4         5 
## 1.2188620 0.5561867 0.3739230 1.6486732 1.6499691

Choosing Models and Transformations

Logistic Discrimination

  • We can extend the logistic regression model to more than two classes \((k>2)\), with:

\[ P_j(x_i) = P(y_i=j\mid x_i)\]

  • We choose a base class and use this as a pivot to compare the other classes to. The resulting coefficients are relative to the base case.
    • This can be disadvantageous if the classes aren’t independent, i.e. if the existence of one class influences the probability of another, such as if you had to make a choice between them.
    • We run \(k-1\) binary regressions using the relative probabilities.
  • The model for \(k\) classes:

\[ \log\left(\frac{P_j(x_i)}{P_0(x_i)}\right) = \mathbf{x}^\top_i\beta^j,\quad j=1,...,k-1\]

  • Noting that all the probabilities must add to 1:

\[ P_j(x_i)=\exp(\mathbf{x}^\top_i\beta^j)P_0(x_i)\]

\[ P_0(x_i)+P_0(x_i)\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)=1,\quad\therefore P_0(x_i)=\frac{1}{1+\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)}\]

  • Finally the probabilities for each class is:

\[ P_j(x_i)=\frac{\exp(\mathbf{x}^\top_i\beta^j)}{1+\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)}\]

  • In R, where each row is the set of parameters for the log odds of the respective class vs the base class:
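A sketch of the hidden chunk, reconstructed from the Call below; multinom comes from the nnet package, and fit.multi is our name:

library(nnet)
fit.multi <- multinom(Type ~ ., data = Cf, maxit = 250)
fit.multi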
## # weights:  12 (6 variable)
## initial  value 23.070858 
## iter  10 value 6.623970
## iter  20 value 6.214841
## iter  30 value 6.182968
## iter  40 value 6.172650
## iter  50 value 6.167699
## iter  60 value 6.162723
## iter  70 value 6.156685
## iter  80 value 6.155298
## iter  90 value 6.153807
## iter 100 value 6.152597
## iter 110 value 6.152041
## iter 120 value 6.151229
## final  value 6.151167 
## converged
## Call:
## multinom(formula = Type ~ ., data = Cf, maxit = 250)
## 
## Coefficients:
##   (Intercept) Tetrahydrocortisone Pregnanetriol
## b   -19.99566          -0.2450327      14.37357
## c   -28.83773           3.3561273      16.23923
## 
## Residual Deviance: 12.30233 
## AIC: 24.30233

Poisson Regression

  • A natural model for fitting count data, assuming independent observations.
    • The distributional assumptions are also the regression assumptions

\[ f(y_i) = \frac{\exp(-\lambda_i)\lambda_i^{y_i}}{y_i!},\quad y_i=0,1,2,\ldots\]

  • To create our Poisson model, we consider the time \((t_i)\) and a rate function, which is a transformation of the predictors and parameters (this is essentially the \(\lambda\) of the distribution itself, which we multiply by \(t_i\) to create a new \(\lambda\) for modelling purposes):

\[ \lambda_i = t_i\mu(x_i;\boldsymbol{\beta})\]

  • Using this model and the canonical link for the rate function (the distribution’s \(\lambda\), multiplied by \(t_i\) to get the mean response for the model), we get our final model:

\[ \log(\mu(x_i;\beta))=\mathbf{x}^\top_i\boldsymbol{\beta},\quad\therefore\mu(x_i;\boldsymbol{\beta})=\exp(\mathbf{x}^\top_i\boldsymbol{\beta})\]

  • Therefore if we consider our model and our goal to get a linear model:

    • We call \(\log(t_i)\) an offset, a term with a known coefficient of 1; offsets appear particularly often in Poisson regression.
    • \(\log\) is the canonical link function for the Poisson distribution

    \[ \lambda_i = t_i\exp(\mathbf{x}_i^\top\boldsymbol{\beta}),\quad\therefore\log(\lambda_i)=\log(t_i)+\mathbf{x}_i^\top\boldsymbol{\beta}\]

  • We can fit this Poisson linear model in R:
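A sketch of the hidden chunk, reconstructed from the Call below (ships is presumably the dataset from MASS; fit.ships is our name):

library(MASS)
fit.ships <- glm(incidents ~ type + year + period + offset(log(service)),
                 family = poisson, data = ships[ships$service != 0, ])
summary(fit.ships)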

## 
## Call:
## glm(formula = incidents ~ type + year + period + offset(log(service)), 
##     family = poisson, data = ships[ships$service != 0, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5348  -0.9319  -0.3686   0.4654   2.8833  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.079076   0.876149 -11.504  < 2e-16 ***
## typeB        -0.546090   0.178415  -3.061 0.002208 ** 
## typeC        -0.632631   0.329500  -1.920 0.054862 .  
## typeD        -0.232257   0.287979  -0.807 0.419951    
## typeE         0.405975   0.234933   1.728 0.083981 .  
## year          0.042247   0.012826   3.294 0.000988 ***
## period        0.023705   0.008091   2.930 0.003392 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 146.328  on 33  degrees of freedom
## Residual deviance:  59.375  on 27  degrees of freedom
## AIC: 171.24
## 
## Number of Fisher Scoring iterations: 5
  • Sometimes we may want to use other link functions, such as the identity link, for example if we wanted to model the following:

\[ \lambda_i = \beta_1T_{i1}+\beta_2T_{i2}\]

  • We can model this in R:
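A sketch of the hidden chunk, reconstructed from the Call below (Failure, Mode1 and Mode2 are assumed to be vectors in the workspace; fit.id is our name):

fit.id <- glm(Failure ~ Mode1 + Mode2 - 1, family = poisson(link = identity))
summary(fit.id)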
## 
## Call:
## glm(formula = Failure ~ Mode1 + Mode2 - 1, family = poisson(link = identity))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8920  -0.5212  -0.1373   0.5240   2.2700  
## 
## Coefficients:
##       Estimate Std. Error z value Pr(>|z|)    
## Mode1  0.16661    0.03486   4.779 1.76e-06 ***
## Mode2  0.09040    0.06301   1.435    0.151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance:    Inf  on 9  degrees of freedom
## Residual deviance: 7.5716  on 7  degrees of freedom
## AIC: 54.628
## 
## Number of Fisher Scoring iterations: 5
  • We can also use the poisson regression to approximate the binomial regression

    • This is the same idea as the Poisson approximation to the binomial distribution, which applies when \(n_i\) is large and \(P(x_i)\) is small:

    \[ \lambda_i=n_iP(x_i)\]

  • If we take logs we can get it into the familiar form with a new offset \(\log(n_i)\):

\[ \log(\lambda_i) = \log(n_i)+\log(P(x_i))\]

  • In R we model by doing the following:
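A sketch of the hidden chunk, reconstructed from the Call below (fit.pois is our name):

fit.pois <- glm(Cases ~ Age + Town + offset(log(Population)),
                family = poisson, data = skin)
summary(fit.pois)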
## 
## Call:
## glm(formula = Cases ~ Age + Town + offset(log(Population)), family = poisson, 
##     data = skin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2902  -0.3346   0.0000   0.3931   1.0927  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.69207    0.44922 -26.028  < 2e-16 ***
## Age25-34      2.62899    0.46746   5.624 1.87e-08 ***
## Age35-44      3.84558    0.45466   8.458  < 2e-16 ***
## Age45-54      4.59381    0.45103  10.185  < 2e-16 ***
## Age55-64      5.08638    0.45030  11.296  < 2e-16 ***
## Age65-74      5.64569    0.44975  12.553  < 2e-16 ***
## Age75-84      6.20317    0.45751  13.558  < 2e-16 ***
## Age85+        6.17568    0.45774  13.492  < 2e-16 ***
## Town          0.85269    0.05962  14.302  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2327.2912  on 14  degrees of freedom
## Residual deviance:    5.2089  on  6  degrees of freedom
## AIC: 110.19
## 
## Number of Fisher Scoring iterations: 4
  • This is very similar to what we would get using the binomial regression:
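A sketch of the hidden chunk, reconstructed from the Call below; response is assumed to be the proportion Cases/Population, and fit.bin is our name:

fit.bin <- glm(response ~ Age + Town, family = binomial, data = skin,
               weights = Population, maxit = 20)
summary(fit.bin)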
## 
## Call:
## glm(formula = response ~ Age + Town, family = binomial, data = skin, 
##     weights = Population, maxit = 20)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2830  -0.3355   0.0000   0.3927   1.0820  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.69364    0.44923 -26.030  < 2e-16 ***
## Age25-34      2.62915    0.46747   5.624 1.86e-08 ***
## Age35-44      3.84627    0.45467   8.459  < 2e-16 ***
## Age45-54      4.59538    0.45104  10.188  < 2e-16 ***
## Age55-64      5.08901    0.45031  11.301  < 2e-16 ***
## Age65-74      5.65031    0.44976  12.563  < 2e-16 ***
## Age75-84      6.20887    0.45756  13.570  < 2e-16 ***
## Age85+        6.18346    0.45783  13.506  < 2e-16 ***
## Town          0.85492    0.05969  14.322  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2330.4637  on 14  degrees of freedom
## Residual deviance:    5.1509  on  6  degrees of freedom
## AIC: 110.1
## 
## Number of Fisher Scoring iterations: 4

Iteratively Weighted Least Squares

  • A general algorithm is used to find the maximum likelihood estimates in GLMs, as many of the maximising equations do not have closed-form solutions.

  • We consider estimating \(\beta\) by fitting a linear model with \(g(y_i)\) as response

\[ g(\mu_i) = \mathbf{x}_i^\top\boldsymbol{\beta}\]

  • However, we often can’t compute \(g(y_i)\) (e.g. \(\log(0)\) for a zero count under the log link), so we consider the first-order Taylor series expansion about \(\mu_i\):

\[ z_i = g(\mu_i)+(y_i-\mu_i)g'(\mu_i) \approx g(y_i)\]

\[ \mathbb{E}[z_i]=g(\mu_i)=\mathbf{x}_i^\top\boldsymbol{\beta},\quad\text{ as }\mathbb{E}[y_i]=\mu_i \]

\[ \mathbb{V}ar(z_i)=\mathbb{V}ar(y_i)g'(\mu_i)^2\]

  • If the values \(z_i\) were known, we could estimate \(\beta\) by weighted least squares regression; however, we know neither \(z_i\) nor its variance.

  • Consider \(\mathbb{V}ar(\epsilon_i)=a_i\sigma^2\), where \(a_i\) is a known weight, and let \(W\) be the diagonal matrix with elements \(w_{ii}=\frac{1}{a_i}\). The weighted least squares estimate of \(\beta\) (if we knew \(z_i\)) is:

\[ \hat{\beta} = (X^\top WX)^{-1}X^\top W\mathbf{z}\]

  • As we do not know \(z_i\), we proceed iteratively:
    • Make an initial guess \(\beta^0\) for \(\beta\).

    • Form \(z_i^0\) for \(z_i\) assuming \(\beta^0\) is the true value of \(\beta\). We also form \(\mu_i^0\).

    • Do a weighted least squares regression of \(z_i^0\) on predictors with the ith diagonal element of weight matrix \(W\) given by:

      \[ w_{ii}=\frac{A_i}{\mathbb{V}(\mu_i^0)g'(\mu_i^0)^2},\quad w_{ii}\propto\frac{1}{\mathbb{V}ar(z_i)}\]

    • This gives us our new estimate \(\beta^1\). The process is repeated until the estimates converge; a minimal sketch of the algorithm is given below.
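A minimal R sketch of the algorithm for the Bernoulli/logit case, under the assumptions that X is an \(n\times p\) design matrix (including an intercept column) and y is a 0/1 response. For the logit link, \(g'(\mu)=\frac{1}{\mu(1-\mu)}\), \(\mathbb{V}(\mu)=\mu(1-\mu)\) and \(A_i=1\), so \(w_{ii}=\mu_i(1-\mu_i)\):

iwls_logit <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                      # initial guess beta^0
  for (it in seq_len(maxit)) {
    eta <- drop(X %*% beta)                    # linear predictor x_i' beta
    mu  <- 1 / (1 + exp(-eta))                 # mu^0 from the inverse link
    z   <- eta + (y - mu) / (mu * (1 - mu))    # working response z_i^0
    w   <- mu * (1 - mu)                       # w_ii = A_i / (V(mu) g'(mu)^2)
    # weighted least squares step: beta = (X'WX)^{-1} X'Wz
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break                       # stop once estimates are stable
  }
  beta
}

The estimates should agree with coef(glm(y ~ X - 1, family = binomial)), since glm itself fits by this scheme (Fisher scoring).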