Generalised Linear Models
- The goal of generalised linear models is to model a transformation of the mean response linearly. This allows many of the Gaussian model assumptions to be relaxed and opens up the possibility of using other distributions.
- Consider the following idea, where \(g(\mu_i)\) is a smooth, invertible transformation of the mean response.
\[ \mathbb{E}[y_i]=\mu_i,\qquad g(\mu_i)=\mathbf{x}_i^\top\boldsymbol{\beta}\]
- Distributions that can be easily modelled within the generalised linear model framework belong to the exponential family, with densities of the form:
\[ f(y_i;\theta_i,\phi)=\exp\left(\frac{A_i}{\phi}\left(y_i\theta_i-c(\theta_i)\right)+h\!\left(y_i,\tfrac{\phi}{A_i}\right)\right)\]
- The link \(g\) satisfying \(\theta_i = g(\mu_i)\) is known as the canonical link function
- \(\phi\) is the scale parameter, which is possibly unknown
- \(A_i\) are known weights and \(c(\cdot),h(\cdot)\) are known functions
- If \(y_i\) has a density in the exponential family, the following hold:
\[ \mathbb{E}[y_i]=c'(\theta_i) \]
\[ \mathbb{V}ar(y_i)=\frac{\phi}{A_i}c''(\theta_i) \]
\[ c''(\theta_i) = c''(\theta_i(\mu_i)) = \mathbb{V}(\mu_i)\quad\therefore\mathbb{V}ar(y_i)=\frac{\phi}{A_i}\mathbb{V}(\mu_i) \]
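- For example, the Poisson distribution fits this framework with \(A_i=1\) and \(\phi=1\) (a standard check, worked here for illustration):
\[ \log f(y_i) = y_i\theta_i - e^{\theta_i} - \log(y_i!),\qquad \theta_i=\log\lambda_i,\quad c(\theta_i)=e^{\theta_i},\quad h(y_i)=-\log(y_i!)\]
\[ \mathbb{E}[y_i]=c'(\theta_i)=\lambda_i,\qquad \mathbb{V}ar(y_i)=c''(\theta_i)=\lambda_i,\quad\therefore\mathbb{V}(\mu_i)=\mu_i\]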
Beta Estimates
- Beta estimates are found through maximum likelihood estimation:
\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}f(y_i;\theta_i,\phi)\]
- As these estimates are MLEs they have nice properties, such as consistency, asymptotic unbiasedness and asymptotic normality. This allows for inference on the parameters similar to the Gaussian model.
- For large samples:
\[ \boldsymbol{\hat{\beta}}\sim^{\text{approx}}N(\boldsymbol{\beta},\mathbf{C}^{-1})\]
where \(\mathbf{C}\) is the observed information matrix (the negative Hessian of the log-likelihood), with \((i,j)\)th entry:
\[ \mathbf{C}_{ij}=-\frac{\partial^2\log\ell(\boldsymbol{\beta})}{\partial\beta_i\partial\beta_j}\]
- The Wald hypothesis test utilises the asymptotic normality property, so it is best suited to large samples.
\[ H_0: \beta_j = 0,\quad H_1:\beta_j\neq 0\]
- With test statistic:
\[ Z = \frac{\hat{\beta}_j}{\hat{se}(\hat{\beta}_j)}\sim^{\text{approx}}N(0,1) \]
- With p-value:
\[ p = P(|Z|\geq|z| \mid \beta_j=0)\]
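A minimal sketch of computing the Wald statistics and p-values by hand, on simulated data (all names here are illustrative; summary() reports the same values as z value and Pr(>|z|)):
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(0.5 * d$x))
fit <- glm(y ~ x, family = binomial, data = d)
est <- coef(summary(fit))                      # Estimate, Std. Error, z value, Pr(>|z|)
z <- est[, "Estimate"] / est[, "Std. Error"]   # Wald statistics, matching est[, "z value"]
2 * pnorm(-abs(z))                             # two-sided p-values, matching est[, "Pr(>|z|)"]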
Iteratively Weighted Least Squares
- Iteratively weighted least squares is an extension of the Newton-Raphson method (covered in detail in a later section).
Scaled Deviance
- We can calculate the deviance of a model by comparing the likelihood of a parsimonious model estimate against that of a saturated model estimate.
- A saturated model is one where there is a parameter for each observation.
\[ \lambda(\beta) = -2\log\frac{\overbrace{\ell(\hat{\beta})}^{\text{Less}}}{\underbrace{\ell(\beta^*)}_{\text{Sat}}} \]
If we have a scale parameter \(\phi\neq 1\) then we have the scaled deviance:
\[ \frac{\lambda(\beta)}{\phi} \]
- If we partition the beta vector we can test for significance of one or more betas.
\[ \boldsymbol{\beta}_p = \left(\boldsymbol{\beta}^{(1)\top}_q,\ \boldsymbol{\beta}^{(2)\top}_{p-q}\right)^\top \]
\[ H_0: \boldsymbol{\beta}^{(2)}=0 \]
- \(p\) is the total number of predictors
- \(q\) is the number of predictors in \(\boldsymbol{\beta}^{(1)}\)
- Assuming \(H_0\) is true, as \(n\rightarrow\infty\):
\[ \frac{\overbrace{\lambda(\boldsymbol{\beta}^{(1)})}^{\text{Less}}-\overbrace{\lambda(\boldsymbol{\beta})}^{\text{Full}}}{\phi}\sim\chi^2_{p-q}\]
- This test can be used for multiple predictors at once or on a one by one basis.
- Should yield similar results to the Wald test for a single coefficient, whose squared statistic is approximately distributed \(\chi^2_{1}\)
- To get the relevant chi-squared quantiles, e.g. the 95% quantile of \(\chi^2_1\):
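qchisq(0.95, df = 1)  # 95% quantile of a chi-squared distribution with 1 df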
## [1] 3.841459
- We can also do sequential tests for single coefficients, or groups of coefficients in the case of factors, through the anova command:
- Each row represents the change in deviance and df when each predictor is added
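The table below was presumably produced by a call of this form (assuming the binomial fit to the spam data shown later, stored here as spam.glm):
spam.glm <- glm(isjunk ~ ., family = binomial, data = spam)
anova(spam.glm)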
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: isjunk
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 4600 6170.2
## freq.excl 1 806.85 4599 5363.3
## freq.dollar 1 896.40 4598 4466.9
## freq.hash 1 17.97 4597 4448.9
## average 1 221.82 4596 4227.1
Binomial Regression
- Used for proportion data (counts of successes out of \(n_i\) trials)
- A common case is when there is a binary outcome for a single trial, such as win/lose
\[ f(y_i) = \binom{n_i}{y_i}p^{y_i}(1-p)^{n_i-y_i}\]
- Our goal is to model the probability of an event occurring, which is bounded between \((0,1)\), so we use a function mapping \(x\) to this range and then invert it to obtain a linear model.
\[ P(x_i) = \frac{\exp(x_i)}{1+\exp(x_i)} = \frac{1}{1+\exp(-x_i)}\]
- With inverse:
\[ x(p) = \log\left(\frac{p}{1-p}\right)\]
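- In R these two maps are built in: plogis() is the logistic function and qlogis() is the log-odds (logit), e.g.:
plogis(0)          # 0.5
qlogis(plogis(2))  # recovers 2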
- This forms the basis of our binomial model, where we model the log odds in a linear fashion:
\[ \underbrace{\log\left(\frac{P(x_i)}{1-P(x_i)}\right)}_{\text{Log odds}} = \mathbf{x}_i^\top\boldsymbol{\beta}\]
- Bernoulli Regression in R (\(1\) trial, e.g. win or lose):
##
## Call:
## glm(formula = isjunk ~ ., family = binomial, data = spam)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.4904 -0.6549 -0.5781 0.5594 1.9608
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.96341 0.06573 -29.871 <2e-16 ***
## freq.excl 1.43767 0.11126 12.922 <2e-16 ***
## freq.dollar 12.13917 0.62319 19.479 <2e-16 ***
## freq.hash 0.27763 0.13236 2.097 0.036 *
## average 0.19913 0.01739 11.448 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6170.2 on 4600 degrees of freedom
## Residual deviance: 4227.1 on 4596 degrees of freedom
## AIC: 4237.1
##
## Number of Fisher Scoring iterations: 15
- Binomial Regression in R (\(n_i\) trials for each datapoint \(y_i\)):
- Note that \(n_i\) can be different for each datapoint, so we standardise and add weights to the glm function:
cardata = data.frame(Holders,Claims,Car,Age)
cardata$prop = cardata$Claims/cardata$Holders
cardata.glm = glm(prop~Car+Age, data = cardata,
family = binomial, weights = Holders)
summary(cardata.glm)
##
## Call:
## glm(formula = prop ~ Car + Age, family = binomial, data = cardata,
## weights = Holders)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## 0.76432 -0.72292 -0.08346 -0.52188 0.57918 0.02340
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.5124 0.2954 -15.276 < 2e-16 ***
## Carmedium 1.1830 0.2864 4.131 3.62e-05 ***
## Carsmall 1.9978 0.2824 7.075 1.50e-12 ***
## Age2 1.4890 0.1431 10.404 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 195.1110 on 5 degrees of freedom
## Residual deviance: 1.7221 on 2 degrees of freedom
## AIC: 39.204
##
## Number of Fisher Scoring iterations: 4
Estimation of Parameters
- We utilise maximum likelihood estimation
- Bernoulli likelihood:
\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}P(x_i)^{y_i}(1-P(x_i))^{1-y_i}\]
- Beta estimates, note this needs to be solved numerically:
\[ \sum^n_{i=1}y_i\mathbf{x}_i=\sum^n_{i=1}\mathbf{x}_i\left(1-\frac{\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}{1+\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}\right)=\sum^n_{i=1}\mathbf{x}_i\hat{P}(x_i)\]
- Binomial likelihood:
\[ \ell(\boldsymbol{\beta})=\prod^n_{i=1}P(x_i)^{y_i}(1-P(x_i))^{n_i-y_i} \]
- Beta estimates, note this needs to be solved numerically:
\[ \sum^n_{i=1}y_i\mathbf{x}_i=\sum^n_{i=1}n_i\mathbf{x}_i\left(1-\frac{\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}{1+\exp(-\mathbf{x}_i^\top\boldsymbol{\hat{\beta}})}\right)=\sum^n_{i=1}n_i\mathbf{x}_i\hat{P}(x_i)\]
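As a check, a minimal sketch of solving the Bernoulli score equations numerically by minimising the negative log-likelihood (simulated data; all names are illustrative):
set.seed(1)
X <- cbind(1, rnorm(200))                      # intercept plus one simulated predictor
y <- rbinom(200, 1, plogis(X %*% c(-0.5, 1)))  # Bernoulli responses
negloglik <- function(beta) {
  eta <- X %*% beta
  -sum(y * eta - log(1 + exp(eta)))            # negative Bernoulli log-likelihood
}
optim(c(0, 0), negloglik)$par                  # close to coef(glm(y ~ X[, 2], family = binomial))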
Deviance
- The deviance for the logistic regression:
- Estimate for the saturated model is \(\hat{p}_i=\frac{y_i}{n_i}\)
- Estimate for the parsimonious model is \(\hat{P}(x_i)=\frac{\hat{y}_i}{n_i}\)
\[ \lambda(\boldsymbol{\beta})=2\sum_i\left(y_i\log\frac{y_i}{\hat{y}_i}+(n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat{y}_i}\right)\right)\]
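A quick numerical check of this formula against R's deviance() on simulated grouped data (names illustrative; assumes no zero cells, where the convention \(0\log 0=0\) would be needed):
set.seed(7)
n <- rep(50, 10)
x <- rnorm(10)
y <- rbinom(10, n, plogis(0.4 * x))            # grouped binomial responses
fit <- glm(cbind(y, n - y) ~ x, family = binomial)
yhat <- n * fitted(fit)                        # fitted counts
2 * sum(y * log(y / yhat) + (n - y) * log((n - y) / (n - yhat)))  # equals deviance(fit)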
Residuals
- Residuals aren’t as helpful in the binomial model (especially when \(n_i\) is small) as in the Gaussian model, because fewer distributional assumptions are placed on the model. They can still be useful for detecting outliers and the like.
- Residuals should still have mean 0 and no pattern
- Pearson Residuals, where we utilise an estimate of the standard deviation for the binomial distribution:
\[ \frac{y_i-\hat{y}_i}{\sqrt{n_i\hat{P}(x_i)(1-\hat{P}(x_i))}}\]
- In R, via residuals(fit, type = 'pearson') on a fitted model fit:
## 1 2 3 4 5
## 1.0496847 0.4089924 0.2690926 1.7007465 1.7031934
- Deviance Residuals:
\[ d_i = \text{sign}(y_i-\hat{y}_i)\sqrt{e_i},\quad e_i=2\left(y_i\log\frac{y_i}{\hat{y}_i}+(n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat{y}_i}\right)\right)\]
- In R, via residuals(fit, type = 'deviance'):
## 1 2 3 4 5
## 1.2188620 0.5561867 0.3739230 1.6486732 1.6499691
Choosing Models and Transformations
Alternative Link Functions
- The logit link function isn’t the only function that does the appropriate mapping; any smooth, invertible \(g\) from \((0,1)\) onto \(\mathbb{R}\) works in the following form:
\[ g(P(x_i))=\mathbf{x}_i^\top\boldsymbol{\beta}\]
- The probit function is another commonly used link, the inverse of the standard normal distribution function (CDF):
\[ \Phi^{-1}(P(x_i))=\mathbf{x}_i^\top\boldsymbol{\beta}\]
- In R we can change the link function to one of several predefined options:
cardata.glm2 = glm(prop~Car+Age, data = cardata,
family = binomial(link = 'probit'),
                        weights = Holders)
- There often isn’t much difference between these link functions.
- Another, albeit less common, link function is the complementary log-log (cloglog) function:
\[ \log(-\log(1-P(x_i)))=\mathbf{x}_i^\top\boldsymbol{\beta}\]
Logistic Discrimination
- We can extend the logistic regression model to many classes \((k>2)\), with:
\[ P_j(x_i) = P(y_i=j\mid x_i)\]
- We choose a base class and use this as a pivot to compare the other classes to. The resulting coefficients are relative to the base class.
  - This can be disadvantageous if the classes aren’t independent, i.e. if the existence of one class influences the probability of another, such as when a choice must be made between them.
  - We run \(k-1\) binary regressions in which the relative probabilities are used.
- The model for \(k\) classes:
\[ \log\left(\frac{P_j(x_i)}{P_0(x_i)}\right) = \mathbf{x}^\top_i\beta^j,\quad j=1,...,k-1\]
- Noting that all the probabilities must add to 1:
\[ P_j(x_i)=\exp(\mathbf{x}^\top_i\beta^j)P_0(x_i)\]
\[ P_0(x_i)+P_0(x_i)\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)=1,\quad\therefore P_0(x_i)=\frac{1}{1+\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)}\]
- Finally, the probabilities for each class are:
\[ P_j(x_i)=\frac{\exp(\mathbf{x}^\top_i\beta^j)}{1+\sum^{k-1}_{j=1}\exp(\mathbf{x}^\top_i\beta^j)}\]
- In R, where each row is the set of parameters for the log odds of the respective class vs the base class:
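The fit below is presumably from a call of this form (multinom comes from the nnet package, matching the Call shown in the output; the object name Cf.mn is illustrative):
library(nnet)
Cf.mn <- multinom(Type ~ ., data = Cf, maxit = 250)
Cf.mn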
## # weights: 12 (6 variable)
## initial value 23.070858
## iter 10 value 6.623970
## iter 20 value 6.214841
## iter 30 value 6.182968
## iter 40 value 6.172650
## iter 50 value 6.167699
## iter 60 value 6.162723
## iter 70 value 6.156685
## iter 80 value 6.155298
## iter 90 value 6.153807
## iter 100 value 6.152597
## iter 110 value 6.152041
## iter 120 value 6.151229
## final value 6.151167
## converged
## Call:
## multinom(formula = Type ~ ., data = Cf, maxit = 250)
##
## Coefficients:
## (Intercept) Tetrahydrocortisone Pregnanetriol
## b -19.99566 -0.2450327 14.37357
## c -28.83773 3.3561273 16.23923
##
## Residual Deviance: 12.30233
## AIC: 24.30233
Poisson Regression
- A natural model to fit count data, assuming independent observations.
- The assumptions of the Poisson distribution (such as the variance equalling the mean) are also the assumptions of the regression
\[ f(y_i) = \frac{\exp(-\lambda_i)\lambda_i^{y_i}}{y_i!},\quad y_i=0,1,2,\ldots\]
- To create our Poisson model, we consider an exposure time \(t_i\) and a rate function \(\mu(x_i;\boldsymbol{\beta})\), a transformation of the predictors and parameters (this is essentially the \(\lambda\) of the distribution itself, which we multiply by \(t_i\) to create a new \(\lambda\) for modelling purposes):
\[ \lambda_i = t_i\mu(x_i;\boldsymbol{\beta})\]
- Using this model and the canonical link on the rate function, we get our final model:
\[ \log(\mu(x_i;\beta))=\mathbf{x}^\top_i\boldsymbol{\beta},\quad\therefore\mu(x_i;\boldsymbol{\beta})=\exp(\mathbf{x}^\top_i\boldsymbol{\beta})\]
Therefore, considering our model and our goal of a linear model:
- We call \(\log(t_i)\) an offset: a term with known coefficient \(1\), particularly associated with Poisson regression.
- The \(\log\) is the natural (canonical) link function for the Poisson distribution
\[ \lambda_i = t_i\exp(\mathbf{x}_i^\top\boldsymbol{\beta}),\quad\therefore\log(\lambda_i)=\log(t_i)+\mathbf{x}_i^\top\boldsymbol{\beta}\]
We can fit this Poisson linear model in R:
ships.glm <- glm(incidents~type+year+period+offset(log(service)),
data = ships[ships$service != 0,], family = poisson)
summary(ships.glm)
##
## Call:
## glm(formula = incidents ~ type + year + period + offset(log(service)),
## family = poisson, data = ships[ships$service != 0, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5348 -0.9319 -0.3686 0.4654 2.8833
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.079076 0.876149 -11.504 < 2e-16 ***
## typeB -0.546090 0.178415 -3.061 0.002208 **
## typeC -0.632631 0.329500 -1.920 0.054862 .
## typeD -0.232257 0.287979 -0.807 0.419951
## typeE 0.405975 0.234933 1.728 0.083981 .
## year 0.042247 0.012826 3.294 0.000988 ***
## period 0.023705 0.008091 2.930 0.003392 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 146.328 on 33 degrees of freedom
## Residual deviance: 59.375 on 27 degrees of freedom
## AIC: 171.24
##
## Number of Fisher Scoring iterations: 5
- Sometimes we may want to use other link functions, e.g. the identity link, if we wanted to model the following:
\[ \lambda_i = \beta_1T_{i1}+\beta_2T_{i2}\]
- We can model this in R:
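Presumably fitted via a call of this form (the object name fail.glm is illustrative; Failure, Mode1 and Mode2 are workspace vectors per the Call below, and the - 1 drops the intercept to match the model above):
fail.glm <- glm(Failure ~ Mode1 + Mode2 - 1, family = poisson(link = identity))
summary(fail.glm)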
##
## Call:
## glm(formula = Failure ~ Mode1 + Mode2 - 1, family = poisson(link = identity))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8920 -0.5212 -0.1373 0.5240 2.2700
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## Mode1 0.16661 0.03486 4.779 1.76e-06 ***
## Mode2 0.09040 0.06301 1.435 0.151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: Inf on 9 degrees of freedom
## Residual deviance: 7.5716 on 7 degrees of freedom
## AIC: 54.628
##
## Number of Fisher Scoring iterations: 5
We can also use the Poisson regression to approximate the binomial regression
- This is the same idea as the Poisson approximation to the binomial distribution: we can use the following approximation if \(n_i\) is large and \(P(x_i)\) is small:
\[ \lambda_i=n_iP(x_i)\]
If we take logs we can get it into the familiar form with a new offset \(\log(n_i)\):
\[ \log(\lambda_i) = \log(n_i)+\log(P(x_i))\]
- In R we fit this as follows:
skin = read.table('skin.txt', head = T)
skin.glmpoi <- glm(Cases~Age+Town+offset(log(Population)),
data = skin, family = poisson)
summary(skin.glmpoi)
##
## Call:
## glm(formula = Cases ~ Age + Town + offset(log(Population)), family = poisson,
## data = skin)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2902 -0.3346 0.0000 0.3931 1.0927
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.69207 0.44922 -26.028 < 2e-16 ***
## Age25-34 2.62899 0.46746 5.624 1.87e-08 ***
## Age35-44 3.84558 0.45466 8.458 < 2e-16 ***
## Age45-54 4.59381 0.45103 10.185 < 2e-16 ***
## Age55-64 5.08638 0.45030 11.296 < 2e-16 ***
## Age65-74 5.64569 0.44975 12.553 < 2e-16 ***
## Age75-84 6.20317 0.45751 13.558 < 2e-16 ***
## Age85+ 6.17568 0.45774 13.492 < 2e-16 ***
## Town 0.85269 0.05962 14.302 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2327.2912 on 14 degrees of freedom
## Residual deviance: 5.2089 on 6 degrees of freedom
## AIC: 110.19
##
## Number of Fisher Scoring iterations: 4
- This is very similar to what we would get using the binomial regression:
skin$response <- skin$Cases/skin$Population
skin.glmbin <- glm(response~Age+Town, data = skin,
weights = Population, maxit = 20,
family = binomial)
summary(skin.glmbin)
##
## Call:
## glm(formula = response ~ Age + Town, family = binomial, data = skin,
## weights = Population, maxit = 20)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2830 -0.3355 0.0000 0.3927 1.0820
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.69364 0.44923 -26.030 < 2e-16 ***
## Age25-34 2.62915 0.46747 5.624 1.86e-08 ***
## Age35-44 3.84627 0.45467 8.459 < 2e-16 ***
## Age45-54 4.59538 0.45104 10.188 < 2e-16 ***
## Age55-64 5.08901 0.45031 11.301 < 2e-16 ***
## Age65-74 5.65031 0.44976 12.563 < 2e-16 ***
## Age75-84 6.20887 0.45756 13.570 < 2e-16 ***
## Age85+ 6.18346 0.45783 13.506 < 2e-16 ***
## Town 0.85492 0.05969 14.322 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2330.4637 on 14 degrees of freedom
## Residual deviance: 5.1509 on 6 degrees of freedom
## AIC: 110.1
##
## Number of Fisher Scoring iterations: 4
Iteratively Weighted Least Squares
A general algorithm is used to find the maximum likelihood estimates in GLMs, as many of the maximising equations do not have closed-form solutions.
We consider estimating \(\boldsymbol{\beta}\) by fitting a linear model with \(g(y_i)\) as the response
\[ g(\mu_i) = \mathbf{x}_i^\top\boldsymbol{\beta}\]
- However we often can’t work with \(g(y_i)\) directly (it may be undefined, e.g. \(\log(0)\) for a zero count), so we consider the first-order Taylor series expansion of \(g(y_i)\) about \(\mu_i\):
\[ z_i = g(\mu_i)+(y_i-\mu_i)g'(\mu_i) \approx g(y_i)\]
\[ \mathbb{E}[z_i]=g(\mu_i)=x_i^\top\beta,\quad\text{ as }\mathbb{E}[y_i]=\mu_i \]
\[ \mathbb{V}ar(z_i)=\mathbb{V}ar(y_i)g'(\mu_i)^2\]
If the \(z_i\) and their variances were known, we could estimate \(\boldsymbol{\beta}\) by weighted least squares regression; however, \(z_i\) and \(\mathbb{V}ar(z_i)\) depend on the unknown \(\mu_i\).
Recall weighted least squares: considering \(\mathbb{V}ar(\epsilon_i)=a_i\sigma^2\) where \(a_i\) is a weight, and \(W\) the diagonal matrix with elements \(w_{ii}=\frac{1}{a_i}\), the estimate of \(\boldsymbol{\beta}\) (if we knew \(\mathbf{z}\)) is:
\[ \hat{\boldsymbol{\beta}} = (X^\top WX)^{-1}X^\top W\mathbf{z}\]
- As we do not know \(z_i\), we proceed iteratively:
1. Make an initial guess \(\boldsymbol{\beta}^{0}\) for \(\boldsymbol{\beta}\).
2. Form \(z_i^{0}\) for \(z_i\), assuming \(\boldsymbol{\beta}^{0}\) is the true value of \(\boldsymbol{\beta}\); also form \(\mu_i^{0}\).
3. Do a weighted least squares regression of \(z_i^{0}\) on the predictors, with the \(i\)th diagonal element of the weight matrix \(W\) given by:
\[ w_{ii}=\frac{A_i}{\mathbb{V}(\mu_i^0)g'(\mu_i^0)^2},\quad w_{ii}\propto\frac{1}{\mathbb{V}ar(z_i)}\]
4. This gives us a new estimate \(\boldsymbol{\beta}^{1}\). Repeat until the estimates are stable/converged.
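A minimal R sketch of this algorithm for Poisson regression with the log link, checked against glm (simulated data; all names are illustrative, and this is not the notes’ own code):
set.seed(42)
x <- rnorm(50)
X <- cbind(1, x)                                 # design matrix with intercept
y <- rpois(50, exp(0.5 + 0.8 * x))               # simulated Poisson responses
beta <- c(0, 0)                                  # step 1: initial guess beta^0
for (iter in 1:25) {
  eta <- as.vector(X %*% beta)
  mu <- exp(eta)                                 # current fitted means mu^0
  z <- eta + (y - mu) / mu                       # working response g(mu) + (y - mu) g'(mu)
  w <- mu                                        # weights 1 / (V(mu) g'(mu)^2) = mu here
  beta_new <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))  # step 3: WLS
  if (max(abs(beta_new - beta)) < 1e-8) break    # step 4: stop once stable
  beta <- beta_new
}
beta                                             # matches coef(glm(y ~ x, family = poisson))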