Generalized Linear Models

In Linear models \(Y_i\) is normally distributed with mean \(\mu_i\) and variance \(\sigma^2\)
\[ Y_i \sim N(mean=\mu_i, var=\sigma) \]
\[ \mu_i = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i} ... \]
Assumptions: Linearity, homoscedasticity, normality
GLM’s link one response variable with predictor variables:
\[ Y_i \sim some \ distribution(\mu_i) \]
\[ link(\mu_i) = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i} \]

When do we use them?

Sometimes, the assumptions for a linear model are not met. We can sometimes do a transformation to make it linear, homoscedastic (equal variance).

Why not transform the data?

We can transform the data. If the data is exponential as in the following example:

Why not transform the data?

We can transform the data. If the data is exponential as in the following example:

Assumptions

Draw an “imaginary” line through the center of the plot. Is the variance equal?
Variance? Think how far each point is from your imaginary line! The variance should be the same independently of the value of x!

Assumptions

Draw an “imaginary” line through the center of the plot. Is the variance equal?
Variance? Think how far each point is from your imaginary line! The variance should be the same independently of the value of x!
Clearly! IT IS NOT HOMOSCEDASTIC! (let’s start calling this non-equal variance!), also not linear…

We can transform the data

dat$logy<-log(dat$y+0.00001)

We can transform the data

dat$logy<-log(dat$y+0.00001)

We can transform the data

dat$logy<-log(dat$y+0.00001)

Ooops! We have a problem! It is linear now, which is good… but there is an issue
What is it?
Yeah, we are transforming the variance as well! We need a different solution!

We can transform the data

However, at least it is linear! so, we can estimate a lm:

model<-lm(logy~x, data=dat)
predictedmodel<-predict.lm(model,dat,interval = "co")
datpred<-cbind(dat,predictedmodel)

We can transform the data

However, at least it is linear! so, we can estimate a lm:

model<-lm(logy~x, data=dat)
predictedmodel<-predict.lm(model,dat,interval = "co")
datpred<-cbind(dat,predictedmodel)

Distribution

Transformations will linearize the data, but if the response variable has a differentt distribution around the model than normal, then a glm is the way to go:

We can do a glm!

This a Poisson distribution… why?

Only integers
Counts
It could also be negative binomial, zero-inflated, etc… but in this case it is binomial!

We can do a glm!

This a Poisson distribution… Let’s do a Poisson GLM:

Glm’s have:

Linear Predictor
Link Function
Probability Distribution

We can do a glm!

This a Poisson distribution… Let’s do a Poisson GLM:

Linear Predictor:

\[ \beta_0 + \beta_1x_1 \]

\[ \beta_0 + \beta_1flow \]

We can do a glm!

This a Poisson distribution… Let’s do a Poisson GLM:

Link function:

\[log(\lambda_i) = \beta_0 + \beta_1x_1 \]

\[ log(\lambda_i) = \beta_0 + \beta_1flow \]

Data

Generalized Linear Model

We can estimate a glm:

model<-glm(y~x, family=poisson, data=dat) 
summary(model)


Call:
glm(formula = y ~ x, family = poisson, data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.163085   0.038830   29.95   <2e-16 ***
x           0.332651   0.005078   65.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5820.68  on 299  degrees of freedom
Residual deviance:  263.37  on 298  degrees of freedom
AIC: 1652.3

Number of Fisher Scoring iterations: 4

Model

predictedmodel<-predict.glm(model,dat, se.fit = T) 
datpred2<-cbind(dat,predictedmodel)

It is essentially transforming it!

GLM equation: poisson

\[log(\mu_i) = \beta_0 + \beta_1x_i \]
\[ y_i \sim Poisson(\lambda = \mu_i)\]
\[ \mu_i = e^{\beta_0 + \beta_1x_i}\]

GLM equation: poisson

\[log(\mu_i) = \beta_0 + \beta_1x_i => log(\mu_i) = \beta_0 + \beta_1flow_i \]
\[ y_i \sim Poisson(\lambda = \mu_i) => Nfish_i \sim Poisson(\lambda = \mu_i)\]
\[ \mu_i = e^{\beta_0 + \beta_1x_i} => \mu_i = e^{\beta_0 + \beta_1flow_i} \]

Model Summary


Call:
glm(formula = y ~ x, family = poisson, data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.163085   0.038830   29.95   <2e-16 ***
x           0.332651   0.005078   65.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5820.68  on 299  degrees of freedom
Residual deviance:  263.37  on 298  degrees of freedom
AIC: 1652.3

Number of Fisher Scoring iterations: 4

Link function

\[ ln ( \lambda_i )= \beta_0+\beta_1x_1 \iff \lambda_i =exp(\beta_0+\beta_1x_1) \]

Probability distribution

\[ Y_i \sim Poisson(\lambda_i) \]

Probability distribution

Not normally distributed! This is OK! This is the point of a glm! It maps the error to the actual distribution

Success!

Look at the error!

How to plot:

ggplot(datpred2,aes(x=x,y=log(y),ymin=fit-2*se.fit,ymax=fit+2*se.fit))+
  geom_point()+
  geom_line(aes(y=fit),lwd=1)+
    geom_ribbon(alpha=0.2)+
  xlab("flow")+
  ylab("log (N fish)")+
  ggtitle("Fish passage \n N of fish")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))

How to plot:

ggplot(datpred2,aes(x=x,y=y,ymin=exp(fit-2*se.fit),ymax=exp(fit+2*se.fit)))+
  geom_point()+
  geom_line(aes(y=exp(fit)),lwd=1)+
    geom_ribbon(alpha=0.2)+
  xlab("flow")+
  ylab("log (N fish)")+
  ggtitle("Fish passage \n N of fish")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))

Distributions

Normal Distribution

Linear Models
Continuous
Theoretically from -inf to inf
Bell curve shapes
2 parameters: mean and variance

Poisson Distribution

It is discrete
Only positive
Usually counts (usually rate)
It has only one parameter

Poisson distributions

Binomial distributions

The binomial distribution is also a discrete probability distribution with support from 0 to \(N\). It is often used to model count data where there is an upper limit (e.g., # of heads from \(N\) coin flips)

Binomial distributions

The binomial distribution has two parameters \(N\) (# of trials) and \(p\) (probability of success). The mean is \(Np\) and the variance is \(np(1-p)\)

Binomial distributions

Binomial

Note that \(N\) is often fixed and known. We are interested in estimating \(p\)

bernoulli distribution

The Bernoulli distribution is a special case of the binomial distribution where \(N = 1\). What are the possible values of \(y\)?

The Bernoulli distribution is very useful for data where the response variable is binary in nature (e.g., dead/alive, presence/absence, ). The parameter we estimate is \(p\) - the probability of “success”

bernoulli distribution

rbinom(n = 10, size = 1, p = 0.1)

 [1] 0 0 0 0 0 0 1 0 0 0

rbinom(n = 10, size = 1, p = 0.5)

 [1] 0 0 0 1 0 0 1 0 1 1

rbinom(n = 10, size = 1, p = 0.9)

 [1] 0 1 1 0 1 1 1 0 1 1

How to run a glm?

Let’s go back to the Poisson example

Poisson glm code

model<-glm(y~x, data=dat,,family=poisson(link ="log"))


Call:
glm(formula = y ~ x, family = poisson(link = "log"), data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.163085   0.038830   29.95   <2e-16 ***
x           0.332651   0.005078   65.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5820.68  on 299  degrees of freedom
Residual deviance:  263.37  on 298  degrees of freedom
AIC: 1652.3

Number of Fisher Scoring iterations: 4

Poisson glm code

model<-glm(y~x, data=dat,,family=poisson(link ="log"))
predictedmodel<-predict.glm(model,dat,se.fit = T)
ci_lwr <- with(predictedmodel, exp(fit + qnorm(0.025)*se.fit))
ci_upr <- with(predictedmodel, exp(fit + qnorm(0.975)*se.fit))
datpred2<-cbind(dat,predictedmodel)
datpred2$fit2<-exp(datpred2$fit)
datpred2$lwr<-ci_lwr
datpred2$upr<-ci_upr

Binomial glm code

model<-glm(y~x, data=dat,,family=binomial(link = "logit"))

dat<-read.csv("parasitecod.csv")
model<-glm(Prevalence~Weight, data=dat,,family=binomial(link ="logit"))
predictedmodel<-predict.glm(model,dat,se.fit = T)
ci_lwr <- with(predictedmodel, plogis(fit + qnorm(0.025)*se.fit))
ci_upr <- with(predictedmodel, plogis(fit + qnorm(0.975)*se.fit))
datpred2<-cbind(dat,predictedmodel)
datpred2$fit2<-plogis(datpred2$fit)
datpred2$lwr<-ci_lwr
datpred2$upr<-ci_upr