Generalized Linear Models

Generalized Linear Models

  • In Linear models \(Y_i\) is normally distributed with mean \(\mu_i\) and variance \(\sigma^2\)

  • \[ Y_i \sim N(mean=\mu_i, var=\sigma) \]

  • \[ \mu_i = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i} ... \]

  • Assumptions: Linearity, homoscedasticity, normality

  • GLM’s link one response variable with predictor variables:

  • \[ Y_i \sim some \ distribution(\mu_i) \]

  • \[ link(\mu_i) = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i} \]

When do we use them?

Sometimes, the assumptions for a linear model are not met. We can sometimes do a transformation to make it linear, homoscedastic (equal variance).

Why not transform the data?

We can transform the data. If the data is exponential as in the following example:

Why not transform the data?

We can transform the data. If the data is exponential as in the following example:

Assumptions

  • Draw an “imaginary” line through the center of the plot. Is the variance equal?

  • Variance? Think how far each point is from your imaginary line! The variance should be the same independently of the value of x!

Assumptions

  • Draw an “imaginary” line through the center of the plot. Is the variance equal?

  • Variance? Think how far each point is from your imaginary line! The variance should be the same independently of the value of x!

  • Clearly! IT IS NOT HOMOSCEDASTIC! (let’s start calling this non-equal variance!), also not linear…

We can transform the data

dat$logy<-log(dat$y+0.00001)

We can transform the data

dat$logy<-log(dat$y+0.00001)

We can transform the data

dat$logy<-log(dat$y+0.00001)
  • Ooops! We have a problem! It is linear now, which is good… but there is an issue

  • What is it?

  • Yeah, we are transforming the variance as well! We need a different solution!

We can transform the data

However, at least it is linear! so, we can estimate a lm:

model<-lm(logy~x, data=dat)
predictedmodel<-predict.lm(model,dat,interval = "co")
datpred<-cbind(dat,predictedmodel)

We can transform the data

However, at least it is linear! so, we can estimate a lm:

model<-lm(logy~x, data=dat)
predictedmodel<-predict.lm(model,dat,interval = "co")
datpred<-cbind(dat,predictedmodel)

Distribution

Transformations will linearize the data, but if the response variable has a differentt distribution around the model than normal, then a glm is the way to go:

We can do a glm!

This a Poisson distribution… why?

  • Only integers

  • Counts

  • It could also be negative binomial, zero-inflated, etc… but in this case it is binomial!

We can do a glm!

This a Poisson distribution… Let’s do a Poisson GLM:

Glm’s have:

  1. Linear Predictor

  2. Link Function

  3. Probability Distribution

We can do a glm!

This a Poisson distribution… Let’s do a Poisson GLM:

Linear Predictor:

\[ \beta_0 + \beta_1x_1 \]

\[ \beta_0 + \beta_1flow \]

We can do a glm!

This a Poisson distribution… Let’s do a Poisson GLM:

Link function:

\[log(\lambda_i) = \beta_0 + \beta_1x_1 \]

\[ log(\lambda_i) = \beta_0 + \beta_1flow \]

Data

Generalized Linear Model

We can estimate a glm:

model<-glm(y~x, family=poisson, data=dat) 
summary(model)

Call:
glm(formula = y ~ x, family = poisson, data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.163085   0.038830   29.95   <2e-16 ***
x           0.332651   0.005078   65.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5820.68  on 299  degrees of freedom
Residual deviance:  263.37  on 298  degrees of freedom
AIC: 1652.3

Number of Fisher Scoring iterations: 4

Model

predictedmodel<-predict.glm(model,dat, se.fit = T) 
datpred2<-cbind(dat,predictedmodel)

It is essentially transforming it!

GLM equation: poisson

  • \[log(\mu_i) = \beta_0 + \beta_1x_i \]

  • \[ y_i \sim Poisson(\lambda = \mu_i)\]

  • \[ \mu_i = e^{\beta_0 + \beta_1x_i}\]

GLM equation: poisson

  • \[log(\mu_i) = \beta_0 + \beta_1x_i => log(\mu_i) = \beta_0 + \beta_1flow_i \]

  • \[ y_i \sim Poisson(\lambda = \mu_i) => Nfish_i \sim Poisson(\lambda = \mu_i)\]

  • \[ \mu_i = e^{\beta_0 + \beta_1x_i} => \mu_i = e^{\beta_0 + \beta_1flow_i} \]

Model Summary


Call:
glm(formula = y ~ x, family = poisson, data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.163085   0.038830   29.95   <2e-16 ***
x           0.332651   0.005078   65.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5820.68  on 299  degrees of freedom
Residual deviance:  263.37  on 298  degrees of freedom
AIC: 1652.3

Number of Fisher Scoring iterations: 4

Probability distribution

\[ Y_i \sim Poisson(\lambda_i) \]

Probability distribution

Not normally distributed! This is OK! This is the point of a glm! It maps the error to the actual distribution

Success!

Look at the error!

How to plot:

ggplot(datpred2,aes(x=x,y=log(y),ymin=fit-2*se.fit,ymax=fit+2*se.fit))+
  geom_point()+
  geom_line(aes(y=fit),lwd=1)+
    geom_ribbon(alpha=0.2)+
  xlab("flow")+
  ylab("log (N fish)")+
  ggtitle("Fish passage \n N of fish")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))

How to plot:

ggplot(datpred2,aes(x=x,y=y,ymin=exp(fit-2*se.fit),ymax=exp(fit+2*se.fit)))+
  geom_point()+
  geom_line(aes(y=exp(fit)),lwd=1)+
    geom_ribbon(alpha=0.2)+
  xlab("flow")+
  ylab("log (N fish)")+
  ggtitle("Fish passage \n N of fish")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))

Distributions

Normal Distribution

  • Linear Models

  • Continuous

  • Theoretically from -inf to inf

  • Bell curve shapes

  • 2 parameters: mean and variance

Poisson Distribution

  • It is discrete

  • Only positive

  • Usually counts (usually rate)

  • It has only one parameter

Poisson distributions

Binomial distributions

The binomial distribution is also a discrete probability distribution with support from 0 to \(N\). It is often used to model count data where there is an upper limit (e.g., # of heads from \(N\) coin flips)

Binomial distributions

The binomial distribution has two parameters \(N\) (# of trials) and \(p\) (probability of success). The mean is \(Np\) and the variance is \(np(1-p)\)

Binomial distributions

Binomial

Note that \(N\) is often fixed and known. We are interested in estimating \(p\)

bernoulli distribution

The Bernoulli distribution is a special case of the binomial distribution where \(N = 1\). What are the possible values of \(y\)?

The Bernoulli distribution is very useful for data where the response variable is binary in nature (e.g., dead/alive, presence/absence, ). The parameter we estimate is \(p\) - the probability of “success”

bernoulli distribution

rbinom(n = 10, size = 1, p = 0.1)
 [1] 0 0 0 0 0 0 1 0 0 0
rbinom(n = 10, size = 1, p = 0.5)
 [1] 0 0 0 1 0 0 1 0 1 1
rbinom(n = 10, size = 1, p = 0.9)
 [1] 0 1 1 0 1 1 1 0 1 1

How to run a glm?

Let’s go back to the Poisson example

Poisson glm code

model<-glm(y~x, data=dat,,family=poisson(link ="log"))

Call:
glm(formula = y ~ x, family = poisson(link = "log"), data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.163085   0.038830   29.95   <2e-16 ***
x           0.332651   0.005078   65.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5820.68  on 299  degrees of freedom
Residual deviance:  263.37  on 298  degrees of freedom
AIC: 1652.3

Number of Fisher Scoring iterations: 4

Poisson glm code

model<-glm(y~x, data=dat,,family=poisson(link ="log"))
predictedmodel<-predict.glm(model,dat,se.fit = T)
ci_lwr <- with(predictedmodel, exp(fit + qnorm(0.025)*se.fit))
ci_upr <- with(predictedmodel, exp(fit + qnorm(0.975)*se.fit))
datpred2<-cbind(dat,predictedmodel)
datpred2$fit2<-exp(datpred2$fit)
datpred2$lwr<-ci_lwr
datpred2$upr<-ci_upr

Binomial glm code

model<-glm(y~x, data=dat,,family=binomial(link = "logit"))

##

dat<-read.csv("parasitecod.csv")
model<-glm(Prevalence~Weight, data=dat,,family=binomial(link ="logit"))
predictedmodel<-predict.glm(model,dat,se.fit = T)
ci_lwr <- with(predictedmodel, plogis(fit + qnorm(0.025)*se.fit))
ci_upr <- with(predictedmodel, plogis(fit + qnorm(0.975)*se.fit))
datpred2<-cbind(dat,predictedmodel)
datpred2$fit2<-plogis(datpred2$fit)
datpred2$lwr<-ci_lwr
datpred2$upr<-ci_upr