Generalised Linear Models
- The goal of generalised linear models is to model a transformation of the mean response linearly. This allows many of the Gaussian model assumptions to be relaxed and opens up the possibility of using other distributions.
- Consider the following idea, where \(g(\mu_i)\) is a smooth, invertible transformation of the mean response.
\[ \mathbb{E}[y_i]=\mu_i,\qquad g(\mu_i)=\mathbf{x}_i^\top\boldsymbol{\beta}\]
- Distributions that can be easily modelled within the generalised linear model framework belong to the exponential family, with densities of the form:
\[ f(y_i;\theta_i,\phi)=\exp\left(\frac{A_i}{\phi}\left(y_i\theta_i-c(\theta_i)\right)+h\!\left(y_i,\tfrac{\phi}{A_i}\right)\right)\]
- The link \(g\) satisfying \(\theta_i = g(\mu_i)\) is known as the canonical link function
- \(\phi\) is the scale parameter, which is possibly unknown
- \(A_i\) are known weights and \(c(\cdot),h(\cdot)\) are known functions
- If \(y_i\) has a density in the exponential family, the following hold:
\[ \mathbb{E}[y_i]=c'(\theta_i) \]
\[ \mathbb{V}ar(y_i)=\frac{\phi}{A_i}c''(\theta_i) \]
\[ c''(\theta_i) = c''(\theta_i(\mu_i)) = \mathbb{V}(\mu_i)\quad\therefore\mathbb{V}ar(y_i)=\frac{\phi}{A_i}\mathbb{V}(\mu_i) \]
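- For example, the Poisson distribution fits this framework with \(A_i=1\) and \(\phi=1\) (a standard check, worked here for illustration):
\[ \log f(y_i) = y_i\theta_i - e^{\theta_i} - \log(y_i!),\qquad \theta_i=\log\lambda_i,\quad c(\theta_i)=e^{\theta_i},\quad h(y_i)=-\log(y_i!)\]
\[ \mathbb{E}[y_i]=c'(\theta_i)=\lambda_i,\qquad \mathbb{V}ar(y_i)=c''(\theta_i)=\lambda_i,\quad\therefore\mathbb{V}(\mu_i)=\mu_i\]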
Beta Estimates
- Beta estimates are found through maximum likelihood estimation:
\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}f(y_i;\theta_i,\phi)\]
- As these estimates are MLEs they have nice properties, such as consistency, asymptotic unbiasedness and asymptotic normality. This allows for inference on the parameters similar to the Gaussian model.
- For large samples:
\[ \boldsymbol{\hat{\beta}}\sim^{\text{approx}}N(\boldsymbol{\beta},\mathbf{C}^{-1})\]
where \(\mathbf{C}\) is the observed information matrix (the negative Hessian of the log-likelihood), with \((i,j)\)th entry:
\[ \mathbf{C}_{ij}=-\frac{\partial^2\log\ell(\boldsymbol{\beta})}{\partial\beta_i\partial\beta_j}\]
- The Wald hypothesis test utilises the asymptotic normality property, so it is best suited to large samples.
\[ H_0: \beta_j = 0,\quad H_1:\beta_j\neq 0\]
- With test statistic:
\[ Z = \frac{\hat{\beta}_j}{\hat{se}(\hat{\beta}_j)}\sim^{\text{approx}}N(0,1) \]
- With p-value:
\[ p = P(|Z|\geq|z| \mid \beta_j=0)\]
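A minimal sketch of computing the Wald statistics and p-values by hand, on simulated data (all names here are illustrative; summary() reports the same values as z value and Pr(>|z|)):
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(0.5 * d$x))
fit <- glm(y ~ x, family = binomial, data = d)
est <- coef(summary(fit))                      # Estimate, Std. Error, z value, Pr(>|z|)
z <- est[, "Estimate"] / est[, "Std. Error"]   # Wald statistics, matching est[, "z value"]
2 * pnorm(-abs(z))                             # two-sided p-values, matching est[, "Pr(>|z|)"]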
Iteratively Weighted Least Squares
- Iteratively weighted least squares is an extension of the Newton-Raphson method (covered in detail in a later section).
Scaled Deviance
- We can calculate the deviance of a model by comparing the likelihood of a parsimonious model estimate against that of a saturated model estimate.
- A saturated model is one where there is a parameter for each observation.
\[ \lambda(\beta) = -2\log\frac{\overbrace{\ell(\hat{\beta})}^{\text{Less}}}{\underbrace{\ell(\beta^*)}_{\text{Sat}}} \]
If we have a scale parameter \(\phi\neq 1\) then we have the scaled deviance:
\[ \frac{\lambda(\beta)}{\phi} \]
- If we partition the beta vector we can test for significance of one or more betas.
\[ \boldsymbol{\beta}_p = \left(\boldsymbol{\beta}^{(1)\top}_q,\ \boldsymbol{\beta}^{(2)\top}_{p-q}\right)^\top \]
\[ H_0: \boldsymbol{\beta}^{(2)}=0 \]
- \(p\) is the total number of predictors
- \(q\) is the number of predictors in \(\boldsymbol{\beta}^{(1)}\)
- Assuming \(H_0\) is true, as \(n\rightarrow\infty\):
\[ \frac{\overbrace{\lambda(\boldsymbol{\beta}^{(1)})}^{\text{Less}}-\overbrace{\lambda(\boldsymbol{\beta})}^{\text{Full}}}{\phi}\sim\chi^2_{p-q}\]
- This test can be used for multiple predictors at once or on a one by one basis.
- Should yield similar results to the Wald test for a single coefficient, whose squared statistic is approximately distributed \(\chi^2_{1}\)
- To get the relevant chi-squared quantiles, e.g. the 95% quantile of \(\chi^2_1\):
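qchisq(0.95, df = 1)  # 95% quantile of a chi-squared distribution with 1 df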
## [1] 3.841459
- We can also do sequential tests for single coefficients, or groups of coefficients in the case of factors, through the anova command:
- Each row represents the change in deviance and df when each predictor is added
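The table below was presumably produced by a call of this form (assuming the binomial fit to the spam data shown later, stored here as spam.glm):
spam.glm <- glm(isjunk ~ ., family = binomial, data = spam)
anova(spam.glm)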
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: isjunk
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 4600 6170.2
## freq.excl 1 806.85 4599 5363.3
## freq.dollar 1 896.40 4598 4466.9
## freq.hash 1 17.97 4597 4448.9
## average 1 221.82 4596 4227.1
Binomial Regression
- Used for proportion data (counts of successes out of \(n_i\) trials)
- A common case is when there is a binary outcome for a single trial, such as win/lose
\[ f(y_i) = \binom{n_i}{y_i}p^{y_i}(1-p)^{n_i-y_i}\]
- Our goal is to model the probability of an event occurring, which is bounded between \((0,1)\), so we use a function mapping \(x\) to this range and then invert it to obtain a linear model.
\[ P(x_i) = \frac{\exp(x_i)}{1+\exp(x_i)} = \frac{1}{1+\exp(-x_i)}\]
- With inverse:
\[ x(p) = \log\left(\frac{p}{1-p}\right)\]
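- In R these two maps are built in: plogis() is the logistic function and qlogis() is the log-odds (logit), e.g.:
plogis(0)          # 0.5
qlogis(plogis(2))  # recovers 2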
- This forms the basis of our binomial model, where we model the log odds in a linear fashion:
\[ \underbrace{\log\left(\frac{P(x_i)}{1-P(x_i)}\right)}_{\text{Log odds}} = \mathbf{x}_i^\top\boldsymbol{\beta}\]
- Bernoulli Regression in R (\(1\) trial, e.g. win or lose):
##
## Call:
## glm(formula = isjunk ~ ., family = binomial, data = spam)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.6549 -0.5781 0.5594 1.9608
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.96341 0.06573 -29.871 <2e-16 ***
## freq.excl 1.43767 0.11126 12.922 <2e-16 ***
## freq.dollar 12.13917 0.62319 19.479 <2e-16 ***
## freq.hash 0.27763 0.13236 2.097 0.036 *
## average 0.19913 0.01739 11.448 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6170.2 on 4600 degrees of freedom
## Residual deviance: 4227.1 on 4596 degrees of freedom
## AIC: 4237.1
##
## Number of Fisher Scoring iterations: 15
- Binomial Regression in R (\(n_i\) trials for each datapoint \(y_i\)):
- Note that \(n_i\) can be different for each datapoint, so we standardise and add weights to the glm function:
cardata = data.frame(Holders,Claims,Car,Age)
cardata$prop = cardata$Claims/cardata$Holders
cardata.glm = glm(prop~Car+Age, data = cardata,
family = binomial, weights = Holders)
summary(cardata.glm)
##
## Call:
## glm(formula = prop ~ Car + Age, family = binomial, data = cardata,
## weights = Holders)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## 0.76432 -0.72292 -0.08346 -0.52188 0.57918 0.02340
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.5124 0.2954 -15.276 < 2e-16 ***
## Carmedium 1.1830 0.2864 4.131 3.62e-05 ***
## Carsmall 1.9978 0.2824 7.075 1.50e-12 ***
## Age2 1.4890 0.1431 10.404 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 195.1110 on 5 degrees of freedom
## Residual deviance: 1.7221 on 2 degrees of freedom
## AIC: 39.204
##
## Number of Fisher Scoring iterations: 4
Estimation of Parameters
- We utilise maximum likelihood estimation
- Bernoulli likelihood:
\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}P(x_i)^{y_i}(1-P(x_i))^{1-y_i}\]
- Beta estimates, note this needs to be solved numerically:
\[ \sum^n_{i=1}y_i\mathbf{x}_i=\sum^n_{i=1}\mathbf{x}_i\left(1-\frac{\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}{1+\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}\right)=\sum^n_{i=1}\mathbf{x}_i\hat{P}(x_i)\]
- Binomial likelihood:
\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}P(x_i)^{y_i}(1-P(x_i))^{n_i-y_i} \]
- Beta estimates, note this needs to be solved numerically:
\[ \sum^n_{i=1}y_i\mathbf{x}_i=\sum^n_{i=1}n_i\mathbf{x}_i\left(1-\frac{\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}{1+\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}\right)=\sum^n_{i=1}n_i\mathbf{x}_i\hat{P}(x_i)\]
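As a check, a minimal sketch of solving the Bernoulli score equations numerically by minimising the negative log-likelihood (simulated data; all names are illustrative):
set.seed(1)
X <- cbind(1, rnorm(200))                      # intercept plus one simulated predictor
y <- rbinom(200, 1, plogis(X %*% c(-0.5, 1)))  # Bernoulli responses
negloglik <- function(beta) {
  eta <- X %*% beta
  -sum(y * eta - log(1 + exp(eta)))            # negative Bernoulli log-likelihood
}
optim(c(0, 0), negloglik)$par                  # close to coef(glm(y ~ X[, 2], family = binomial))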
Deviance
- The deviance for the logistic regression:
- Estimate for the saturated model is \(\hat{p}_i=\frac{y_i}{n_i}\)
- Estimate for the parsimonious model is \(\hat{P}(x_i)=\frac{\hat{y}_i}{n_i}\)
\[ \lambda(\boldsymbol{\beta})=2\sum_i\left(y_i\log\frac{y_i}{\hat{y}_i}+(n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat{y}_i}\right)\right)\]
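A quick numerical check of this formula against R's deviance() on simulated grouped data (names illustrative; assumes no zero cells, where the convention \(0\log 0=0\) would be needed):
set.seed(7)
n <- rep(50, 10)
x <- rnorm(10)
y <- rbinom(10, n, plogis(0.4 * x))            # grouped binomial responses
fit <- glm(cbind(y, n - y) ~ x, family = binomial)
yhat <- n * fitted(fit)                        # fitted counts
2 * sum(y * log(y / yhat) + (n - y) * log((n - y) / (n - yhat)))  # equals deviance(fit)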
Residuals
- Residuals aren’t as helpful in the binomial model (especially when \(n_i\) is small) as in the Gaussian model, because fewer distributional assumptions are placed on the model. They can still be useful for detecting outliers and the like.
- Residuals should still have mean 0 and no pattern
- Pearson Residuals, where we utilise an estimate of the standard deviation for the binomial distribution:
\[ \frac{y_i-\hat{y}_i}{\sqrt{n_i\hat{P}(x_i)(1-\hat{P}(x_i))}}\]
- In R, via residuals(fit, type = 'pearson') on a fitted model fit:
## 1 2 3 4 5
## 1.0496847 0.4089924 0.2690926 1.7007465 1.7031934
- Deviance Residuals:
\[ d_i = \text{sign}(y_i-\hat{y}_i)\sqrt{e_i},\quad e_i=2\left(y_i\log\frac{y_i}{\hat{y}_i}+(n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat{y}_i}\right)\right)\]
- In R, via residuals(fit, type = 'deviance'):
## 1 2 3 4 5
## 1.2188620 0.5561867 0.3739230 1.6486732 1.6499691
Choosing Models and Transformations
Alternative Link Functions
- The logit link function isn’t the only function that does the appropriate mapping; any smooth, invertible \(g\) from \((0,1)\) onto \(\mathbb{R}\) works in the following form:
\[ g(P(x_i))=\mathbf{x}_i^\top\boldsymbol{\beta}\]
- The probit function is another commonly used link, the inverse of the standard normal distribution function (CDF):
\[ \Phi^{-1}(P(x_i))=\mathbf{x}_i^\top\boldsymbol{\beta}\]
- In R we can change the link function to one of several predefined options:
cardata.glm2 = glm(prop~Car+Age, data = cardata,
family = binomial(link = 'probit'),
                        weights = Holders)
- There often isn’t much difference between these link functions.
- Another, albeit less common, link function is the complementary log-log (cloglog) function:
\[ \log(-\log(1-P(x_i)))=\mathbf{x}_i^\top\boldsymbol{\beta}\]
Logistic Discrimination
- We can extend the logistic regression model to many classes \((k>2)\), with:
\[ P_j(x_i) = P(y_i=j\mid x_i)\]
- We choose a base class and use this as a pivot to compare the other classes to. The resulting coefficients are relative to the base class.
  - This can be disadvantageous if the classes aren’t independent, i.e. if the existence of one class influences the probability of another, such as when a choice must be made between them.
  - We run \(k-1\) binary regressions in which the relative probabilities are used.
- The model for \(k\) classes:
\[ \log\left(\frac{P_j(x_i)}{P_0(x_i)}\right) = \mathbf{x}^\top_i\beta^j,\quad j=1,...,k-1\]
- Noting that all the probabilities must add to 1:
\[ P_j(x_i)=\exp(\mathbf{x}^\top_i\beta^j)P_0(x_i)\]
\[ P_0(x_i)+P_0(x_i)\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)=1,\quad\therefore P_0(x_i)=\frac{1}{1+\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)}\]
- Finally, the probabilities for each class are:
\[ P_j(x_i)=\frac{\exp(\mathbf{x}^\top_i\beta^j)}{1+\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)}\]
- In R, where each row is the set of parameters for the log odds of the respective class vs the base class:
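The fit below is presumably from a call of this form (multinom comes from the nnet package, matching the Call shown in the output; the object name Cf.mn is illustrative):
library(nnet)
Cf.mn <- multinom(Type ~ ., data = Cf, maxit = 250)
Cf.mn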
## # weights: 12 (6 variable)
## initial value 23.070858
## iter 10 value 6.623970
## iter 20 value 6.214841
## iter 30 value 6.182968
## iter 40 value 6.172650
## iter 50 value 6.167699
## iter 60 value 6.162723
## iter 70 value 6.156685
## iter 80 value 6.155298
## iter 90 value 6.153807
## iter 100 value 6.152597
## iter 110 value 6.152041
## iter 120 value 6.151229
## final value 6.151167
## converged
## Call:
## multinom(formula = Type ~ ., data = Cf, maxit = 250)
##
## Coefficients:
## (Intercept) Tetrahydrocortisone Pregnanetriol
## b -19.99566 -0.2450327 14.37357
## c -28.83773 3.3561273 16.23923
##
## Residual Deviance: 12.30233
## AIC: 24.30233
Poisson Regression
- A natural model to fit count data, assuming independent observations.
- The assumptions of the Poisson distribution (such as the variance equalling the mean) are also the assumptions of the regression
\[ f(y_i) = \frac{\exp(-\lambda_i)\lambda_i^{y_i}}{y_i!},\quad y_i=0,1,2,\ldots\]
- To create our Poisson model, we consider an exposure time \(t_i\) and a rate function \(\mu(x_i;\boldsymbol{\beta})\), a transformation of the predictors and parameters (this is essentially the \(\lambda\) of the distribution itself, which we multiply by \(t_i\) to create a new \(\lambda\) for modelling purposes):
\[ \lambda_i = t_i\mu(x_i;\boldsymbol{\beta})\]
- Using this model and the canonical link on the rate function, we get our final model:
\[ \log(\mu(x_i;\beta))=\mathbf{x}^\top_i\boldsymbol{\beta},\quad\therefore\mu(x_i;\boldsymbol{\beta})=\exp(\mathbf{x}^\top_i\boldsymbol{\beta})\]
Therefore, considering our model and our goal of a linear model:
- We call \(\log(t_i)\) an offset: a term with known coefficient \(1\), particularly associated with Poisson regression.
- The \(\log\) is the natural (canonical) link function for the Poisson distribution
\[ \lambda_i = t_i\exp(\mathbf{x}_i^\top\boldsymbol{\beta}),\quad\therefore\log(\lambda_i)=\log(t_i)+\mathbf{x}_i^\top\boldsymbol{\beta}\]
We can fit this Poisson linear model in R:
ships.glm <- glm(incidents~type+year+period+offset(log(service)),
data = ships[ships$service != 0,], family = poisson)
summary(ships.glm)
##
## Call:
## glm(formula = incidents ~ type + year + period + offset(log(service)),
## family = poisson, data = ships[ships$service != 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5348 -0.9319 -0.3686 0.4654 2.8833
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.079076 0.876149 -11.504 < 2e-16 ***
## typeB -0.546090 0.178415 -3.061 0.002208 **
## typeC -0.632631 0.329500 -1.920 0.054862 .
## typeD -0.232257 0.287979 -0.807 0.419951
## typeE 0.405975 0.234933 1.728 0.083981 .
## year 0.042247 0.012826 3.294 0.000988 ***
## period 0.023705 0.008091 2.930 0.003392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 146.328 on 33 degrees of freedom
## Residual deviance: 59.375 on 27 degrees of freedom
## AIC: 171.24
##
## Number of Fisher Scoring iterations: 5
- Sometimes we may want to use other link functions, e.g. the identity link, if we wanted to model the following:
\[ \lambda_i = \beta_1T_{i1}+\beta_2T_{i2}\]
- We can model this in R:
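Presumably fitted via a call of this form (the object name fail.glm is illustrative; Failure, Mode1 and Mode2 are workspace vectors per the Call below, and the - 1 drops the intercept to match the model above):
fail.glm <- glm(Failure ~ Mode1 + Mode2 - 1, family = poisson(link = identity))
summary(fail.glm)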
##
## Call:
## glm(formula = Failure ~ Mode1 + Mode2 - 1, family = poisson(link = identity))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8920 -0.5212 -0.1373 0.5240 2.2700
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Mode1 0.16661 0.03486 4.779 1.76e-06 ***
## Mode2 0.09040 0.06301 1.435 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: Inf on 9 degrees of freedom
## Residual deviance: 7.5716 on 7 degrees of freedom
## AIC: 54.628
##
## Number of Fisher Scoring iterations: 5
We can also use the Poisson regression to approximate the binomial regression
- This is the same idea as the Poisson approximation to the binomial distribution: we can use the following approximation if \(n_i\) is large and \(P(x_i)\) is small:
\[ \lambda_i=n_iP(x_i)\]
If we take logs we can get it into the familiar form with a new offset \(\log(n_i)\):
\[ \log(\lambda_i) = \log(n_i)+\log(P(x_i))\]
- In R we fit this as follows:
skin = read.table('skin.txt', head = T)
skin.glmpoi <- glm(Cases~Age+Town+offset(log(Population)),
data = skin, family = poisson)
summary(skin.glmpoi)
##
## Call:
## glm(formula = Cases ~ Age + Town + offset(log(Population)), family = poisson,
## data = skin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2902 -0.3346 0.0000 0.3931 1.0927
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.69207 0.44922 -26.028 < 2e-16 ***
## Age25-34 2.62899 0.46746 5.624 1.87e-08 ***
## Age35-44 3.84558 0.45466 8.458 < 2e-16 ***
## Age45-54 4.59381 0.45103 10.185 < 2e-16 ***
## Age55-64 5.08638 0.45030 11.296 < 2e-16 ***
## Age65-74 5.64569 0.44975 12.553 < 2e-16 ***
## Age75-84 6.20317 0.45751 13.558 < 2e-16 ***
## Age85+ 6.17568 0.45774 13.492 < 2e-16 ***
## Town 0.85269 0.05962 14.302 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2327.2912 on 14 degrees of freedom
## Residual deviance: 5.2089 on 6 degrees of freedom
## AIC: 110.19
##
## Number of Fisher Scoring iterations: 4
- This is very similar to what we would get using the binomial regression:
skin$response <- skin$Cases/skin$Population
skin.glmbin <- glm(response~Age+Town, data = skin,
weights = Population, maxit = 20,
family = binomial)
summary(skin.glmbin)
##
## Call:
## glm(formula = response ~ Age + Town, family = binomial, data = skin,
## weights = Population, maxit = 20)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2830 -0.3355 0.0000 0.3927 1.0820
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.69364 0.44923 -26.030 < 2e-16 ***
## Age25-34 2.62915 0.46747 5.624 1.86e-08 ***
## Age35-44 3.84627 0.45467 8.459 < 2e-16 ***
## Age45-54 4.59538 0.45104 10.188 < 2e-16 ***
## Age55-64 5.08901 0.45031 11.301 < 2e-16 ***
## Age65-74 5.65031 0.44976 12.563 < 2e-16 ***
## Age75-84 6.20887 0.45756 13.570 < 2e-16 ***
## Age85+ 6.18346 0.45783 13.506 < 2e-16 ***
## Town 0.85492 0.05969 14.322 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2330.4637 on 14 degrees of freedom
## Residual deviance: 5.1509 on 6 degrees of freedom
## AIC: 110.1
##
## Number of Fisher Scoring iterations: 4
Iteratively Weighted Least Squares
A general algorithm is used to find the maximum likelihood estimates in GLMs, as many of the maximising equations do not have closed-form solutions.
We consider estimating \(\boldsymbol{\beta}\) by fitting a linear model with \(g(y_i)\) as the response
\[ g(\mu_i) = \mathbf{x}_i^\top\boldsymbol{\beta}\]
- However we often can’t work with \(g(y_i)\) directly (it may be undefined, e.g. \(\log(0)\) for a zero count), so we consider the first-order Taylor series expansion of \(g(y_i)\) about \(\mu_i\):
\[ z_i = g(\mu_i)+(y_i-\mu_i)g'(\mu_i) \approx g(y_i)\]
\[ \mathbb{E}[z_i]=g(\mu_i)=x_i^\top\beta,\quad\text{ as }\mathbb{E}[y_i]=\mu_i \]
\[ \mathbb{V}ar(z_i)=\mathbb{V}ar(y_i)g'(\mu_i)^2\]
If the \(z_i\) and their variances were known, we could estimate \(\boldsymbol{\beta}\) by weighted least squares regression; however, \(z_i\) and \(\mathbb{V}ar(z_i)\) depend on the unknown \(\mu_i\).
Recall weighted least squares: considering \(\mathbb{V}ar(\epsilon_i)=a_i\sigma^2\) where \(a_i\) is a weight, and \(W\) the diagonal matrix with elements \(w_{ii}=\frac{1}{a_i}\), the estimate of \(\boldsymbol{\beta}\) (if we knew \(\mathbf{z}\)) is:
\[ \hat{\boldsymbol{\beta}} = (X^\top WX)^{-1}X^\top W\mathbf{z}\]
- As we do not know \(z_i\), we proceed iteratively:
1. Make an initial guess \(\boldsymbol{\beta}^{0}\) for \(\boldsymbol{\beta}\).
2. Form \(z_i^{0}\) for \(z_i\), assuming \(\boldsymbol{\beta}^{0}\) is the true value of \(\boldsymbol{\beta}\); also form \(\mu_i^{0}\).
3. Do a weighted least squares regression of \(z_i^{0}\) on the predictors, with the \(i\)th diagonal element of the weight matrix \(W\) given by:
\[ w_{ii}=\frac{A_i}{\mathbb{V}(\mu_i^0)g'(\mu_i^0)^2},\quad w_{ii}\propto\frac{1}{\mathbb{V}ar(z_i)}\]
4. This gives us a new estimate \(\boldsymbol{\beta}^{1}\). Repeat until the estimates are stable/converged.
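A minimal R sketch of this algorithm for Poisson regression with the log link, checked against glm (simulated data; all names are illustrative, and this is not the notes’ own code):
set.seed(42)
x <- rnorm(50)
X <- cbind(1, x)                                 # design matrix with intercept
y <- rpois(50, exp(0.5 + 0.8 * x))               # simulated Poisson responses
beta <- c(0, 0)                                  # step 1: initial guess beta^0
for (iter in 1:25) {
  eta <- as.vector(X %*% beta)
  mu <- exp(eta)                                 # current fitted means mu^0
  z <- eta + (y - mu) / mu                       # working response g(mu) + (y - mu) g'(mu)
  w <- mu                                        # weights 1 / (V(mu) g'(mu)^2) = mu here
  beta_new <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))  # step 3: WLS
  if (max(abs(beta_new - beta)) < 1e-8) break    # step 4: stop once stable
  beta <- beta_new
}
beta                                             # matches coef(glm(y ~ x, family = poisson))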