In linear models, \(Y_i\) is normally distributed with mean \(\mu_i\) and variance \(\sigma^2\):
\[ Y_i \sim N(\mu_i, \sigma^2) \]
\[ \mu_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots \]
Assumptions: linearity, homoscedasticity, normality
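To make this concrete, here is a minimal R sketch (all names and numbers are illustrative, not from the lesson's data) that simulates data satisfying these assumptions and fits the model with lm():

set.seed(1)
n  <- 100
x1 <- runif(n, 0, 10)
mu <- 2 + 0.5 * x1                 # beta_0 = 2, beta_1 = 0.5
y  <- rnorm(n, mean = mu, sd = 1)  # constant variance: sigma = 1
summary(lm(y ~ x1))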
GLMs link a response variable to predictor variables:
\[ Y_i \sim \text{some distribution}(\mu_i) \]
\[ link(\mu_i) = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i} \]
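In R, the distribution and its link are chosen through the family argument of glm(); a toy, self-contained sketch (all names and numbers are illustrative):

dat <- data.frame(x1 = 1:20, y = rpois(20, lambda = 5))
glm(y ~ x1, family = poisson(link = "log"), data = dat)  # count response
# Other common families: gaussian(link = "identity"), binomial(link = "logit")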
Sometimes the assumptions of a linear model are not met. We can sometimes transform the data to make it linear and homoscedastic (equal variance). For example, suppose the data are exponential, as in the following plot:
Draw an “imaginary” line through the center of the plot. Is the variance equal?
Variance? Think of how far each point is from your imaginary line! The variance should be the same regardless of the value of x!
Clearly, IT IS NOT HOMOSCEDASTIC (let’s start calling this unequal variance!), and it is not linear either…
Oops, we have a problem! It is linear now, which is good… but there is an issue.
What is it?
Yes: the log transformation rescales the variance as well! We need a different solution!
However, at least it is linear, so we can fit an lm:
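A minimal sketch of that lm, assuming a data frame dat with columns y and x holding the plotted data (the names are assumptions):

# Fit a straight line on the log scale of the response
fit_log <- lm(log(y) ~ x, data = dat)
summary(fit_log)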
Transformations can linearize the data, but if the response variable’s distribution around the model is not normal, a GLM is the way to go:
This is a Poisson distribution… why?
Only integers
Counts
It could also be negative binomial, zero-inflated, etc., but in this case it is Poisson!
This is a Poisson distribution… Let’s do a Poisson GLM:
GLMs have:
Linear Predictor
Link Function
Probability Distribution
Linear Predictor:
\[ \beta_0 + \beta_1 x_1 \]
\[ \beta_0 + \beta_1 \text{flow} \]
Link Function:
\[ \log(\lambda_i) = \beta_0 + \beta_1 x_1 \]
\[ \log(\lambda_i) = \beta_0 + \beta_1 \text{flow} \]
We can fit the GLM:
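In R this is one line; the call below matches the Call: shown in the summary (dat is assumed to hold the response y and the predictor x):

model <- glm(y ~ x, family = poisson, data = dat)  # log link is the default
summary(model)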
Call:
glm(formula = y ~ x, family = poisson, data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.163085 0.038830 29.95 <2e-16 ***
x 0.332651 0.005078 65.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 5820.68 on 299 degrees of freedom
Residual deviance: 263.37 on 298 degrees of freedom
AIC: 1652.3
Number of Fisher Scoring iterations: 4
\[ \log(\mu_i) = \beta_0 + \beta_1 x_i \]
\[ y_i \sim \text{Poisson}(\lambda = \mu_i) \]
\[ \mu_i = e^{\beta_0 + \beta_1 x_i} \]
\[ \log(\mu_i) = \beta_0 + \beta_1 x_i \Rightarrow \log(\mu_i) = \beta_0 + \beta_1 \text{flow}_i \]
\[ y_i \sim \text{Poisson}(\lambda = \mu_i) \Rightarrow \text{Nfish}_i \sim \text{Poisson}(\lambda = \mu_i) \]
\[ \mu_i = e^{\beta_0 + \beta_1 x_i} \Rightarrow \mu_i = e^{\beta_0 + \beta_1 \text{flow}_i} \]
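Plugging the estimates from the summary above into these equations gives the predicted fish count on the response scale; a worked sketch (the flow value is arbitrary):

b0 <- 1.163085        # intercept estimate from the summary
b1 <- 0.332651        # slope estimate from the summary
flow <- 2             # an arbitrary illustrative flow
exp(b0 + b1 * flow)   # expected count: exp(1.83), about 6.2 fish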
\[ \ln(\lambda_i) = \beta_0 + \beta_1 x_1 \iff \lambda_i = \exp(\beta_0 + \beta_1 x_1) \]
\[ Y_i \sim \text{Poisson}(\lambda_i) \]
Not normally distributed! This is OK! This is the point of a GLM: it maps the error to the actual distribution.
Look at the error!
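To see why, here is a self-contained simulation sketch (all numbers illustrative) in the spirit of this model: the mean curve rises with x, and the scatter around it is Poisson, so the spread grows with the mean:

set.seed(1)
x <- runif(300, 0, 10)
lambda <- exp(1.16 + 0.33 * x)   # mean on the response scale
y <- rpois(300, lambda)          # Poisson (not normal) scatter
plot(x, y)                       # variance increases with the mean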
Normal distribution (linear models):
Continuous
Theoretically from \(-\infty\) to \(\infty\)
Bell-curve shaped
Two parameters: mean and variance
Poisson distribution:
Discrete
Non-negative only
Usually counts (or rates)
Only one parameter, \(\lambda\)
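That one parameter, \(\lambda\), is both the mean and the variance, which is easy to check by simulation:

x <- rpois(1e5, lambda = 4)
mean(x)   # about 4
var(x)    # also about 4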
The binomial distribution is also a discrete probability distribution, with support from 0 to \(N\). It is often used to model count data where there is an upper limit (e.g., # of heads from \(N\) coin flips).
The binomial distribution has two parameters: \(N\) (# of trials) and \(p\) (probability of success). The mean is \(Np\) and the variance is \(Np(1-p)\).
Note that \(N\) is often fixed and known; we are interested in estimating \(p\).
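A quick numerical check of these moments (the N and p values are illustrative):

N <- 10; p <- 0.3
x <- rbinom(1e5, size = N, prob = p)
mean(x)   # about N * p = 3.0
var(x)    # about N * p * (1 - p) = 2.1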
The Bernoulli distribution is a special case of the binomial distribution where \(N = 1\). What are the possible values of \(y\)? Only 0 and 1.
The Bernoulli distribution is very useful for data where the response variable is binary in nature (e.g., dead/alive, presence/absence). The parameter we estimate is \(p\), the probability of “success”.
Let’s go back to the Poisson example:
Call:
glm(formula = y ~ x, family = poisson(link = "log"), data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.163085 0.038830 29.95 <2e-16 ***
x 0.332651 0.005078 65.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 5820.68 on 299 degrees of freedom
Residual deviance: 263.37 on 298 degrees of freedom
AIC: 1652.3
Number of Fisher Scoring iterations: 4
# Refit the Poisson GLM with an explicit log link
model <- glm(y ~ x, data = dat, family = poisson(link = "log"))
# Predictions on the link (log) scale, with standard errors
predictedmodel <- predict.glm(model, dat, se.fit = TRUE)
# 95% CI: build it on the link scale, then back-transform with exp()
ci_lwr <- with(predictedmodel, exp(fit + qnorm(0.025) * se.fit))
ci_upr <- with(predictedmodel, exp(fit + qnorm(0.975) * se.fit))
datpred2 <- cbind(dat, predictedmodel)
datpred2$fit2 <- exp(datpred2$fit)  # fitted means on the response scale
datpred2$lwr <- ci_lwr
datpred2$upr <- ci_upr
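The natural next step is to plot the fitted curve and its confidence band; a minimal base-R sketch, assuming dat has columns x and y as in the model formula:

ord <- order(datpred2$x)
plot(datpred2$x, datpred2$y, xlab = "x", ylab = "y")
lines(datpred2$x[ord], datpred2$fit2[ord], lwd = 2)  # fitted mean
lines(datpred2$x[ord], datpred2$lwr[ord], lty = 2)   # lower 95% limit
lines(datpred2$x[ord], datpred2$upr[ord], lty = 2)   # upper 95% limit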
## Binomial (logistic) GLM: parasite prevalence in cod as a function of weight
dat <- read.csv("parasitecod.csv")
# Fit a binomial GLM (logistic regression) with a logit link
model <- glm(Prevalence ~ Weight, data = dat, family = binomial(link = "logit"))
# Predictions on the link (logit) scale, with standard errors
predictedmodel <- predict.glm(model, dat, se.fit = TRUE)
# 95% CI: build it on the link scale, then back-transform with plogis()
ci_lwr <- with(predictedmodel, plogis(fit + qnorm(0.025) * se.fit))
ci_upr <- with(predictedmodel, plogis(fit + qnorm(0.975) * se.fit))
datpred2 <- cbind(dat, predictedmodel)
datpred2$fit2 <- plogis(datpred2$fit)  # fitted probabilities
datpred2$lwr <- ci_lwr
datpred2$upr <- ci_upr
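Note the parallel with the Poisson code above: there we back-transformed with exp() (the inverse of the log link), while here plogis() (the inverse of the logit link) maps the linear predictor and its confidence limits back to the probability scale. Computing the interval on the link scale first and back-transforming afterwards keeps the limits inside [0, 1].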