
Generalized linear models (GLMs)

Why not linear models?

Linear models are (to me; other people might dispute this) the most useful applied statistical technique. However, they are not without their limitations. A couple of limitations that come to mind: additive response and error models don't make much sense when the outcome is binary (win/loss, diseased/not) or a count (web hits per day), and transforming the outcome can make the coefficients hard to interpret. Both of these cases are covered below.

Components of generalized linear models

GLMs were introduced in a 1972 RSSB paper by Nelder and Wedderburn.

GLMs involve three components: an exponential family distribution for the outcome, a linear predictor, and a link function that connects the mean of the outcome to the linear predictor.

Linear models

Assume that \(Y_i \sim N(\mu_i, \sigma^2)\) (the Gaussian distribution is an exponential family distribution.)

Define the linear predictor to be \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\).

Define the link function \(g\) so that \(g(\mu) = \eta\). This is always the case; the link function always links the mean of the \(Y\)'s to the linear predictor, which is why it is called the link function. For linear models the link is the identity, \(g(\mu) = \mu\).

This yields the same likelihood model as our additive error Gaussian linear model

\[ Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i} \] where \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\)
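As a quick sanity check, here is a minimal sketch (on simulated data, with made-up variable names) showing that a Gaussian GLM with the default identity link reproduces the ordinary least squares fit:

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)                  # simulated outcome
fitLm  <- lm(y ~ x)                          # additive error Gaussian linear model
fitGlm <- glm(y ~ x, family = gaussian)      # Gaussian GLM, identity link
rbind(lm = coef(fitLm), glm = coef(fitGlm))  # coefficients agree up to numerical precision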

Logistic regression

Assume that our outcome is Bernoulli, \(Y_i \sim Bernoulli(\mu_i)\) so that \(E[Y_i] = \mu_i\) where \(0\leq \mu_i \leq 1\).

We define the linear predictor \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\), exactly as in the linear regression model.

The link function that links the mean of the \(Y\)'s to the linear predictor is \(g(\mu) = \eta = \log\left( \frac{\mu}{1 - \mu}\right)\). \(g\) is the (natural) log odds, referred to as the logit.

Note then we can invert the logit function as \[ \mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} ~~~\mbox{and}~~~ 1 - \mu_i = \frac{1}{1 + \exp(\eta_i)} \] Thus (according to the Bernoulli likelihood) the likelihood for all of the \(\mu_i\)'s is \[ \prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1-y_i} = \exp\left(\sum_{i=1}^n y_i \eta_i \right) \prod_{i=1}^n (1 + \exp(\eta_i))^{-1} \]
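A quick numerical check of that simplification, on simulated values (everything here is made up purely for illustration):

set.seed(2)
eta <- rnorm(10)                          # linear predictors
mu  <- exp(eta) / (1 + exp(eta))          # inverse logit
y   <- rbinom(10, size = 1, prob = mu)    # Bernoulli outcomes
lik1 <- prod(mu^y * (1 - mu)^(1 - y))     # likelihood, direct form
lik2 <- exp(sum(y * eta)) * prod(1 / (1 + exp(eta)))  # simplified form
c(lik1, lik2)                             # identical up to rounding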

Poisson regression

Let's do a Poisson example:

Assume that \(Y_i \sim Poisson(\mu_i)\) so that \(E[Y_i] = \mu_i\) where \(0\leq \mu_i\)

We define linear predictor \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\)

And the link function \(g(\mu) = \eta = \log(\mu)\)

And recall that \(e^x\) is the inverse of \(\log(x)\) so that \[ \mu_i = e^{\eta_i} \]

Thus, the likelihood is \[ \prod_{i=1}^n (y_i !)^{-1} \mu_i^{y_i}e^{-\mu_i} \propto \exp\left(\sum_{i=1}^n y_i \eta_i - \sum_{i=1}^n \mu_i\right) \]
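Again a quick numerical check, on simulated values (illustrative only), that the Poisson log-likelihood equals \(\sum_i y_i \eta_i - \sum_i \mu_i\) once the \(y_i!\) term is dropped:

set.seed(3)
eta <- rnorm(10)
mu  <- exp(eta)                              # inverse of the log link
y   <- rpois(10, mu)
ll1 <- sum(dpois(y, mu, log = TRUE)) + sum(lfactorial(y))  # drop the y! term
ll2 <- sum(y * eta) - sum(mu)
c(ll1, ll2)                                  # identical up to rounding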

Variances and quasi-likelihood

It's interesting to note that in each of these cases, the only way in which the likelihood depends on the data is through

\[\sum_{i=1}^n y_i \eta_i = \sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k = \sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i \]

Thus we don't actually need the full data, only \(\sum_{i=1}^n X_{ik} y_i\). This simplification is a consequence of choosing so-called 'canonical' link functions; if we had chosen other link functions, this would not be the case.

(This has to be derived.) All of these models achieve their maximum at the root of the so-called normal equations, which is nothing but the solution of this equation: \[ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i \] where the \(W_i\) are the derivatives of the inverse of the link function.

For the linear model \(Var(Y_i) = \sigma^2\) is constant.

For Bernoulli case \(Var(Y_i) = \mu_i (1 - \mu_i)\)

For the Poisson case \(Var(Y_i) = \mu_i\).

In the latter cases, it is often relevant to have a more flexible variance model, even if it doesn’t correspond to an actual likelihood \[ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i (1 - \mu_i ) } W_i ~~~\mbox{and}~~~ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i} W_i \]

These are called the quasi-likelihood normal equations. The whole point is to relax the strict assumptions relating the means and the variances in those two models: we simply fit the standard normal equations, but throw an extra parameter \(\phi\) into the denominator. The resulting fits inherit most of the properties that GLMs have; they just don't happen to correspond to a specific likelihood model.
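In R these correspond to the quasibinomial and quasipoisson families. Here is a minimal sketch on simulated overdispersed counts (all of the names and numbers are made up for illustration):

set.seed(4)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 2)      # counts with variance > mean
fitPois  <- glm(y ~ x, family = poisson)
fitQuasi <- glm(y ~ x, family = quasipoisson)
rbind(poisson = coef(fitPois), quasi = coef(fitQuasi))  # same point estimates
summary(fitQuasi)$dispersion                # estimated phi, which rescales the standard errors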

These normal equations have to be solved iteratively, resulting in \(\hat \beta_k\) and, if included, \(\hat \phi\).
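For intuition, here is a deliberately bare-bones sketch of the iteratively reweighted least squares (Fisher scoring) iteration that glm uses internally, applied to logistic regression on simulated data (all names are illustrative):

set.seed(5)
n <- 500
X <- cbind(1, rnorm(n))                     # design matrix with an intercept
y <- rbinom(n, 1, plogis(X %*% c(-1, 2)))   # simulate from known coefficients
beta <- c(0, 0)
for (i in 1:25) {
  eta <- X %*% beta                         # linear predictor
  mu  <- plogis(eta)                        # inverse logit
  W   <- as.vector(mu * (1 - mu))           # Var(Y_i), the canonical-link weights
  z   <- eta + (y - mu) / W                 # working response
  beta <- drop(solve(t(X) %*% (W * X), t(X) %*% (W * z)))
}
cbind(irls = beta, glm = coef(glm(y ~ X[, 2], family = binomial)))  # should agree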

Predicted linear predictor responses can be obtained as \(\hat \eta = \sum_{k=1}^p X_k \hat \beta_k\)

Therefore predicted mean responses as \(\hat \mu = g^{-1}(\hat \eta)\)

Coefficients are interpreted as \[ g(E[Y | X_k = x_k + 1, X_{\sim k} = x_{\sim k}]) - g(E[Y | X_k = x_k, X_{\sim k}=x_{\sim k}]) = \beta_k \] or the change in the link function of the expected response per unit change in \(X_k\) holding other regressors constant.

For the iterative solving, variations on the Newton-Raphson algorithm are used.

Asymptotics are usually used for inference.

Finally many of the ideas from linear models can be brought over to GLMs.

Generalized linear models: binary outcomes

Frequently we care about outcomes that have two values, such as win/loss, success/failure, or alive/dead.

These are called binary, Bernoulli, or 0/1 outcomes. In clinical research, for example, the most common binary outcome is diseased versus not diseased.

A collection of exchangeable binary outcomes for the same covariate data is called a binomial outcome. Instead of analyzing the individual zeros and ones, we can analyze the count of successes out of the number of trials.

Example: Baltimore Ravens win/loss

Let's look at an example using the Ravens data. First download the data and look at the first few rows.

library(knitr)
library(printr)
URL <- "https://dl.dropboxusercontent.com/u/7710864/data/ravensData.rda"
#download.file(URL, destfile= "./data/ravensData.rda",method="curl")
load("./data/ravensData.rda")
kable(head(ravensData),align = 'c')
ravenWinNum   ravenWin   ravenScore   opponentScore
          1          W           24               9
          1          W           38              35
          1          W           28              13
          1          W           34              31
          1          W           44              13
          0          L           23              24

Linear regression

Imagine trying to fit a linear regression model to this. It wouldn't work very well: the additive error term doesn't make much sense, because if the errors are continuous, the modeled outcome won't be restricted to 0 and 1.

\[ RW_i = b_0 + b_1 RS_i + e_i \]

\(RW_i\) - 1 if a Ravens win, 0 if not

\(RS_i\) - Number of points Ravens scored

\(b_0\) - probability of a Ravens win if they score 0 points

\(b_1\) - increase in probability of a Ravens win for each additional point

\(e_i\) - residual variation due to everything we didn't measure

Here is the linear regression fit:

lmRavens <- lm(ravensData$ravenWinNum ~ ravensData$ravenScore)
summary(lmRavens)$coef
                        Estimate   Std. Error    t value    Pr(>|t|)
(Intercept)            0.2850317    0.2566432   1.110615   0.2813504
ravensData$ravenScore  0.0158992    0.0090590   1.755069   0.0962526

I don't think it's a particularly great thing to do, but I would also say it's not a horrible thing to do either. If you're interested in predicting a Ravens win versus a loss and you simply threshold the linear regression estimate, this isn't the worst thing in the world to do; it's just not very interpretable, and here we're interested in interpretation.

So, from a machine learning perspective, depending on the loss function, a linear regression might be perfectly reasonable for binary outcomes. But for interpretability you want something different, so let's talk about things like odds.

Odds

Odds give us a way to re-frame the model in a particularly interpretable way. We have the following quantities:

Binary Outcome 0/1

\[RW_i\]

Probability \((0,1)\): the probability that the Ravens won, given their score and the parameters:

\[\rm{Pr}(RW_i | RS_i, b_0, b_1 )\]

Odds \((0,\infty)\): the odds are the probability divided by one minus the probability:

\[\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\]

Log odds \((-\infty,\infty)\): the log odds are just the log of the odds; the log odds are also termed the logit:

\[\log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right)\]
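As a quick illustration of these three scales, here is a small sketch converting back and forth (the probabilities are arbitrary):

p     <- c(0.1, 0.5, 0.9)                 # arbitrary probabilities
odds  <- p / (1 - p)                      # (0, Inf)
logit <- log(odds)                        # (-Inf, Inf); equivalently qlogis(p)
pBack <- exp(logit) / (1 + exp(logit))    # inverse logit recovers p
rbind(p, odds, logit, pBack)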

Linear vs. logistic regression

So the Linear regression models this

\[ RW_i = b_0 + b_1 RS_i + e_i \]

or

\[ E[RW_i | RS_i, b_0, b_1] = b_0 + b_1 RS_i\]

and Logistic regression models

\[ \rm{Pr}(RW_i | RS_i, b_0, b_1) = \frac{\exp(b_0 + b_1 RS_i)}{1 + \exp(b_0 + b_1 RS_i)}\]

or

\[ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i \]

Interpreting Logistic Regression

So here is the equation for logistic regression, rewritten:

\[ \log\left(\frac{\rm{Pr}(RW_i | RS_i, b_0, b_1 )}{1-\rm{Pr}(RW_i | RS_i, b_0, b_1)}\right) = b_0 + b_1 RS_i \]

And the interpretations would be:

\(b_0\) - Log odds of a Ravens win if they score zero points

\(b_1\) - Log odds ratio of win probability for each point scored (compared to zero points). (By the way, we don’t have any other covariates here, but if we had other covariates, we’d have to add the phrase with all the covariates held fixed.)

\(\exp(b_1)\) - Odds ratio of win probability for each point scored (compared to zero points)

Interpreting odds

Now let's talk about interpreting odds. Odds are very interpretable, at least as interpretable as probabilities are, and I like to think of them in terms of the following gambling experiment.

Imagine that you are playing a game where you flip a coin with success probability \(p\). If it comes up heads, you win \(X\). If it comes up tails, you lose \(Y\). What should we set \(X\) and \(Y\) to for the game to be fair? By fair we mean:

\[ E[earnings]= X p - Y (1 - p) = 0 \]

By the way, your opponent's expected earnings would be the negative of \(E[earnings]\).

Which implies this equation:

\[\frac{Y}{X} = \frac{p}{1 - p}\]

Let's explore this equation:

The odds answer the question: "How much should you be willing to pay for a \(p\) probability of winning a dollar?"

That's why at a horse track, for example, they quote the odds against the horse; you use that to figure out your payout. If it's 10 to 1, the payout is 10 times the amount that you put in. So odds are uniquely interpretable.
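As a quick worked example with an arbitrary \(p\):

p <- 0.75
p / (1 - p)    # odds of 3: a fair game pays you $1 for every $3 you risk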

Visualizing fitting logistic regression curves

You can use the manipulate function in R to investigate the logistic regression curve.

library(manipulate)
x <- seq(-10, 10, length = 1000)
manipulate(
    plot(x, exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x)), 
         type = "l", lwd = 3, frame = FALSE),
    beta1 = slider(-2, 2, step = .1, initial = 2),
    beta0 = slider(-2, 2, step = .1, initial = 0)
    )

Ravens logistic regression

So, let's look at our logistic regression for the Ravens data and how we fit it. In the code below, family = "binomial" says that it's going to be a logistic regression.

logRegRavens <- glm(ravensData$ravenWinNum ~ ravensData$ravenScore,family="binomial")
summary(logRegRavens)
## 
## Call:
## glm(formula = ravensData$ravenWinNum ~ ravensData$ravenScore, 
##     family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7575  -1.0999   0.5305   0.8060   1.4947  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)           -1.68001    1.55412  -1.081     0.28
## ravensData$ravenScore  0.10658    0.06674   1.597     0.11
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24.435  on 19  degrees of freedom
## Residual deviance: 20.895  on 18  degrees of freedom
## AIC: 24.895
## 
## Number of Fisher Scoring iterations: 5

So -1.68 is the log odds of the Ravens winning when they score zero points. Of course that's negative: when they score nothing, you wouldn't expect them to win. If you wanted the actual odds of them winning, you would compute \(\exp(-1.68)\).

And 0.106 is the increase in the log odds for every point that they score. If we had included other variables, that interpretation would be holding those other variables constant.
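For example, using the coefficients from the summary above, we can put the fit back on the probability scale for a hypothetical score, say 30 points (30 is just an arbitrary illustration):

b <- coef(logRegRavens)
plogis(b[1] + b[2] * 30)    # inverse logit of the linear predictor; roughly 0.82 with the estimates above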

Ravens fitted values

So here are the fitted probabilities. The curve has to look like an S curve, which is what the inverse logit function looks like. If you plug in a score, this gives you the estimated probability of a win.

plot(ravensData$ravenScore,logRegRavens$fitted,pch=19,col="blue",
     xlab="Score",ylab="Prob Ravens Win")

Odds ratios and confidence intervals

Here we exponentiate the coefficients to put them on the odds (ratio) scale.

You can also exponentiate the confidence interval to get the upper and lower bounds on that scale, which is a good thing to do. I prefer to interpret the coefficients on the relative scale rather than on the log scale.

exp(logRegRavens$coeff)
##           (Intercept) ravensData$ravenScore 
##             0.1863724             1.1124694
exp(confint(logRegRavens))
                           2.5 %     97.5 %
(Intercept)            0.0056750   3.106384
ravensData$ravenScore  0.9962297   1.303304

ANOVA for logistic regression

The anova function also works for GLMs.

anova(logRegRavens,test="Chisq")
                       Df   Deviance   Resid. Df   Resid. Dev    Pr(>Chi)
NULL                   NA         NA          19     24.43457          NA
ravensData$ravenScore   1   3.539826          18     20.89475   0.0599117

If you fit multiple models, just like with linear models, anova adds them sequentially; ideally you are doing this with nested models.

The deviance here is the measure of model fit compared to the previous model: subtracting the residual deviances (24.43 - 20.89) gives the deviance of 3.54.

In this case, a chi-squared statistic of 3.54 with 1 degree of freedom is fairly far out in the tail; there's only about a 6% chance of getting one more extreme (\(P(\chi^2_1 > 3.54) \approx 0.06\)).
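We can verify that tail probability directly:

pchisq(3.5398, df = 1, lower.tail = FALSE)    # matches the Pr(>Chi) column above, about 0.06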

Interpreting Odds Ratios

Let's finish by talking a little bit about interpreting odds ratios.

Odds ratios are not probabilities, and odds themselves are of course not probabilities either, but they are a meaningful scale. People have a hard time interpreting probabilities, and an equivalently hard time interpreting odds; the one thing I do like about odds is the way they relate directly back to gambling, as in the example we covered earlier.

Whether you look at relative odds or relative probabilities, you'll find the same qualitative thing: a ratio of 1 means no difference between the groups.

You can also look at ratios of the probabilities themselves; that is called a relative risk. Relative risks often have a more direct interpretation than odds ratios, but they can be harder to estimate from a model.

Odds also have an interesting use in retrospective studies. What I mean by that is: imagine you're studying a treatment, or something like smoking, and you can't follow people prospectively over time because it takes too long. If you wanted to study smoking and lung cancer, you couldn't follow people for 15 years, determine exactly how much they smoked, and then determine who gets lung cancer; that's too hard.

A much easier approach would be to go to hospital records, collect the names of a group of people who died from lung cancer along with their information (the age at which they died and so on), find some controls, maybe matched controls, and look backwards, perhaps contacting their families to ascertain whether or not they smoked and how much.

Or you could go to living patients and get their smoking history; that would be another kind of study. In both of these cases, you're treating the outcome as a fixed, non-random quantity: for example, you sample 50 people who have lung cancer and 50 people who don't. Those counts are non-random, but you'd like to treat disease status as the outcome. The interesting fact is that the odds ratio is the same whether you condition on the exposure or on the outcome, so you can use odds ratios from such a design as if the disease were the outcome.
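Here is a minimal sketch of that fact with a made-up 2 x 2 table: the odds ratio is the same whether you compare the odds of disease across exposure groups or the odds of exposure across disease groups, which is what makes it usable in retrospective (case-control) designs.

# made-up counts: rows = exposure (smoker yes/no), columns = disease (cancer yes/no)
tab <- matrix(c(40, 10,
                20, 30), nrow = 2, byrow = TRUE,
              dimnames = list(smoker = c("yes", "no"), cancer = c("yes", "no")))
(tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])   # conditioning on exposure: 6
(tab[1, 1] / tab[2, 1]) / (tab[1, 2] / tab[2, 2])   # conditioning on outcome: also 6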

Wikipedia on Odds Ratio

Further resources

Count outcomes

Key ideas

Many data take the form of counts, and we can't necessarily model counts in the same way that we model other things with linear regression or binary regression. Examples include the number of calls to a call center, the number of cases of a disease, and the number of hits to a website.

Data may also be in the form of rates, by which we mean a count per unit time.

Percentages that arise from binomial processes can also be modeled by Poisson models, especially when the success probability is very low and the sample size is very large; this is the classical Poisson approximation to the binomial.

In these cases, linear regression with a transformation is an option.

Poisson distribution

The Poisson distribution is a useful model for counts and rates

Again, here a rate is a count per some monitoring time. Example uses of the Poisson distribution include modeling web traffic hits, incidence rates, approximating binomial probabilities with small \(p\) and large \(n\), and analyzing contingency table data.

The Poisson mass function

We say random variable \(X \sim Poisson(t\lambda)\) if \[ P(X = x) = \frac{(t\lambda)^x e^{-t\lambda}}{x!} \] For \(x = 0, 1, \ldots\).

The mean of the Poisson is \(E[X] = t\lambda\), thus \(E[X / t] = \lambda\)

The variance of the Poisson is \(Var(X) = t\lambda\).

The Poisson tends to a normal as \(t\lambda\) gets large, as shown in the code below:

par(mfrow = c(1, 3))
plot(0 : 10, dpois(0 : 10, lambda = 2), type = "h", frame = FALSE)
plot(0 : 20, dpois(0 : 20, lambda = 10), type = "h", frame = FALSE)
plot(0 : 200, dpois(0 : 200, lambda = 100), type = "h", frame = FALSE) 

Poisson distribution

Below we verify numerically that the mean and variance are equal:

x <- 0 : 10000; lambda = 3
mu <- sum(x * dpois(x, lambda = lambda))
sigmasq <- sum((x - mu)^2 * dpois(x, lambda = lambda))
c(mu, sigmasq)
## [1] 3 3

Example: Leek Group Website Traffic

Consider the daily counts to Jeff Leek’s web site

http://biostat.jhsph.edu/~jleek/

Since the unit of time is always one day, set \(t = 1\) and then the Poisson mean is interpreted as web hits per day. (If we set \(t = 24\), it would be web hits per hour.)

Website data

Below we load the data and create a new column for the Julian dates:

download.file("https://dl.dropboxusercontent.com/u/7710864/data/gaData.rda",destfile="./data/gaData.rda",method="curl")
load("./data/gaData.rda")
gaData$julian <- julian(gaData$date)
head(gaData)
      date   visits   simplystats   julian
2011-01-01        0             0    14975
2011-01-02        0             0    14976
2011-01-03        0             0    14977
2011-01-04        0             0    14978
2011-01-05        0             0    14979
2011-01-06        0             0    14980

http://skardhamar.github.com/rga/

Plot data

Here is a plot of the Julian dates against the number of visits:

plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")

Linear regression

We could model this as a linear regression:

\[ NH_i = b_0 + b_1 JD_i + e_i \]

\(NH_i\) - number of hits to the website

\(JD_i\) - day of the year (Julian day)

\(b_0\) - number of hits on Julian day 0 (1970-01-01)

\(b_1\) - increase in number of hits per unit day

\(e_i\) - variation due to everything we didn’t measure

Linear regression line

Here is the plot of the fit and it looks good!

plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
lm1 <- lm(gaData$visits ~ gaData$julian)
abline(lm1,col="red",lwd=3)

Aside: taking the log of the outcome

Taking the natural log of the outcome has a specific interpretation.

Consider the model

\[ \log(NH_i) = b_0 + b_1 JD_i + e_i \]

\(NH_i\) - number of hits to the website

\(JD_i\) - day of the year (Julian day)

\(b_0\) - log number of hits on Julian day 0 (1970-01-01)

\(b_1\) - increase in log number of hits per unit day

\(e_i\) - variation due to everything we didn’t measure

Exponentiating coefficients

\(e^{E[\log(Y)]}\) is the geometric mean of \(Y\).

With no covariates, this is estimated by \(e^{\frac{1}{n}\sum_{i=1}^n \log(y_i)} = (\prod_{i=1}^n y_i)^{1/n}\)
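A quick numeric check of that identity on arbitrary positive values:

y <- c(2, 8, 32)                                  # arbitrary positive data
c(exp(mean(log(y))), prod(y)^(1 / length(y)))     # both equal 8, the geometric mean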

When you take the natural log of outcomes and fit a regression model, your exponentiated coefficients estimate things about geometric means.

\(e^{\beta_0}\) estimated geometric mean hits on day 0

\(e^{\beta_1}\) estimated relative increase or decrease in geometric mean hits per day

There's a problem with logs when you have zero counts; adding a constant works:

round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$julian))), 5)
##   (Intercept) gaData$julian 
##       0.00000       1.00231

Linear vs. Poisson regression

Just to compare linear vs. Poisson regression:

Linear

\[ NH_i = b_0 + b_1 JD_i + e_i \]

or

\[ E[NH_i | JD_i, b_0, b_1] = b_0 + b_1 JD_i\]

Poisson/log-linear

\[ \log\left(E[NH_i | JD_i, b_0, b_1]\right) = b_0 + b_1 JD_i \]

or

\[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 + b_1 JD_i\right) \]

Multiplicative differences

Just to reiterate:



\[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 + b_1 JD_i\right) \]

Which is of course:



\[ E[NH_i | JD_i, b_0, b_1] = \exp\left(b_0 \right)\exp\left(b_1 JD_i\right) \]



Meaning that if \(JD_i\) is increased by one unit, \(E[NH_i | JD_i, b_0, b_1]\) is multiplied by \(\exp\left(b_1\right)\)

Poisson regression in R

Poisson regression in R is pretty simple:

plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
glm1 <- glm(gaData$visits ~ gaData$julian,family="poisson")
abline(lm1,col="red",lwd=3)
lines(gaData$julian,glm1$fitted,col="blue",lwd=3)

The GLM fit is in blue; note that it is non-linear (exponential) on the original scale.
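To put the slope on the multiplicative scale described earlier, exponentiate it (output not shown):

exp(coef(glm1))["gaData$julian"]    # estimated relative change in expected daily hits per one-day increase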

Mean-variance relationship?

There is some concern about the mean-variance relationship in this dataset: the Poisson model forces the variance to equal the mean, and the residual plot below is one way to eyeball whether that is plausible.

plot(glm1$fitted,glm1$residuals,pch=19,col="grey",ylab="Residuals",xlab="Fitted")
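If the plot suggests overdispersion, one option, tying back to the quasi-likelihood discussion earlier, is to refit with the quasipoisson family, which estimates a dispersion parameter and adjusts the standard errors accordingly (a sketch; glm1q is a name introduced here, output not shown):

glm1q <- glm(gaData$visits ~ gaData$julian, family = "quasipoisson")
summary(glm1q)$dispersion    # estimated dispersion parameter phi
summary(glm1q)$coef          # same point estimates, dispersion-adjusted standard errors

Another option is the model agnostic (sandwich) standard errors covered in the next section.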

Model agnostic standard errors

We can use sandwich variance estimates to get confidence intervals based on so-called model agnostic standard errors:

library(sandwich)
confint.agnostic <- function (object, parm, level = 0.95, ...)
{
    cf <- coef(object); pnames <- names(cf)
    if (missing(parm))
        parm <- pnames
    else if (is.numeric(parm))
        parm <- pnames[parm]
    a <- (1 - level)/2; a <- c(a, 1 - a)
    pct <- stats:::format.perc(a, 3)
    fac <- qnorm(a)
    ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm,
                                                               pct))
    ses <- sqrt(diag(sandwich::vcovHC(object)))[parm]
    ci[] <- cf[parm] + ses %o% fac
    ci
}

http://stackoverflow.com/questions/3817182/vcovhc-and-confidence-interval

Estimating confidence intervals

Below are confidence intervals computed in the usual way, as well as using the robust standard errors discussed above:

confint(glm1)
                    2.5 %        97.5 %
(Intercept)     -34.34658   -31.1597157
gaData$julian     0.00219     0.0023965
confint.agnostic(glm1)
                      2.5 %       97.5 %
(Intercept)     -36.3626746   -29.136997
gaData$julian     0.0020581     0.002528

Rates

Let's now talk about modeling rates. Here \(NHSS_i\) is the Simply Statistics hit count (the simplystats column in the data) and \(NH_i\) is the total number of hits to the site; we model the rate of Simply Statistics hits per total site hit:



\[ E[NHSS_i | JD_i, b_0, b_1]/NH_i = \exp\left(b_0 + b_1 JD_i\right) \]



\[ \log\left(E[NHSS_i | JD_i, b_0, b_1]\right) - \log(NH_i) = b_0 + b_1 JD_i \]



\[ \log\left(E[NHSS_i | JD_i, b_0, b_1]\right) = \log(NH_i) + b_0 + b_1 JD_i \]

Fitting rates in R

glm2 <- glm(gaData$simplystats ~ julian(gaData$date),offset=log(visits+1),
            family="poisson",data=gaData)
plot(julian(gaData$date),glm2$fitted,col="blue",pch=19,xlab="Date",ylab="Fitted Counts")
points(julian(gaData$date),glm1$fitted,col="red",pch=19)

and the same fit shown on the rate scale:

glm2 <- glm(gaData$simplystats ~ julian(gaData$date),offset=log(visits+1),
            family="poisson",data=gaData)
plot(julian(gaData$date),gaData$simplystats/(gaData$visits+1),col="grey",xlab="Date",
     ylab="Fitted Rates",pch=19)
lines(julian(gaData$date),glm2$fitted/(gaData$visits+1),col="blue",lwd=3)

More information

Reference

Adapted from the course on regression models offered by Brian Caffo at the Johns Hopkins Bloomberg School of Public Health.