Generalized Models
library(tidyverse)
library(emmeans)
library(car)
library(agridat)
Example: Conducting a one-way ANOVA with non-normal data
Let’s load up the InsectSprays data
data("InsectSprays")# available from base R
Now we need to filter the data to just 4 treatments:
d <- InsectSprays %>% filter(spray=='A'|spray=='B'|spray=='C'|spray=='F') %>%
droplevels()
Let’s plot out data:
ggplot(d, aes(x=spray,y=count)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(height=0,width=.1) +
theme_classic() +
theme(axis.title = element_text(face= 'bold', size = 15))
hist(d$count)
Now let’s construct a linear model to examine the effect of the different sprays on insect counts:
lm1 <- lm(count~spray, data=d)
Anova(lm1, type=2) ## car::Anova will print out an ANOVA table
## Anova Table (Type II tests)
##
## Response: count
## Sum Sq Df F value Pr(>F)
## spray 1648.73 3 26.478 6.091e-10 ***
## Residuals 913.25 44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s check the residuals of the model:
hist(resid(lm1)) #' residuals should be normally distributed, even for glm
plot(resid(lm1)~fitted(lm1)) ## residuals should be evenly dispersed around 0 across the range of x's
abline(h=0) # funnel shapes or curvature is bad
qqPlot(resid(lm1)) ## residuals should line up pretty closely to the blue line
## [1] 45 46
boxplot(resid(lm1) ~ d$spray) ## variances should be homogeneous for each group
Let’s use emmeans:
emmeans(lm1, ~spray)
## spray emmean SE df lower.CL upper.CL
## A 14.50 1.32 44 11.849 17.15
## B 15.33 1.32 44 12.683 17.98
## C 2.08 1.32 44 -0.567 4.73
## F 16.67 1.32 44 14.016 19.32
##
## Confidence level used: 0.95
Note all the SE are the same and CL is off.
Log-linear model
Now let’s use a log-linear model to examine the effect of the different sprays on insect counts.
lm2 <- lm(log(count+1)~spray, data=d)
Anova(lm2, type=2) ## car::Anova will print out an ANOVA table testing
## Anova Table (Type II tests)
##
## Response: log(count + 1)
## Sum Sq Df F value Pr(>F)
## spray 29.3651 3 56.682 3.701e-15 ***
## Residuals 7.5984 44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s check residuals:
hist(resid(lm2)) ## residuals should be normally distributed, even for glm
plot(resid(lm2)~fitted(lm2)) + ## residuals should be evenly dispersed around 0 across the range of x's
abline(h=0) # funnel shapes or curvature is bad
## integer(0)
qqPlot and boxplot next:
qqPlot(resid(lm2)) ## residuals should line up pretty closely to the blue line
## [1] 27 25
boxplot(resid(lm2) ~ d$spray) ## variances should be homogeneous for each group
Let’s use emmeans again:
emmeans(lm2, ~spray) ## note that now all means are back-transformed
## spray emmean SE df lower.CL upper.CL
## A 2.697 0.12 44 2.455 2.94
## B 2.757 0.12 44 2.515 3.00
## C 0.953 0.12 44 0.711 1.19
## F 2.816 0.12 44 2.574 3.06
##
## Results are given on the log(mu + 1) (not the response) scale.
## Confidence level used: 0.95
To calculate back-transformed emmeans we can add additional arguments:
emmeans(lm2, ~spray, type='response') ## note that now all means are back-transformed
## spray response SE df lower.CL upper.CL
## A 13.83 1.779 44 10.65 17.9
## B 14.75 1.889 44 11.36 19.1
## C 1.59 0.311 44 1.04 2.3
## F 15.70 2.004 44 12.12 20.3
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log(mu + 1) scale
Generalized linear models
Now let’s use GLMs to examine the effect of the different sprays:
glm1 <- glm(count~spray, data=d, family='poisson')
glm() is a general function that conducts a generalized linear model and we must specify the ‘family’ (aka error distribution). The default is the ‘gaussian’ distribution (normal). All the model “calculations” are saved in an object we called ‘glm1’.
Anova(glm1, type=2) ## car::Anova will print out an ANOVA table testing
## Analysis of Deviance Table (Type II tests)
##
## Response: count
## LR Chisq Df Pr(>Chisq)
## spray 185.83 3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the ANOVA above, the null hypothesis that all group means are equal. The argument, type = 2, provides margin tests, which is usually better than the default Type I, especially for more complicated models. For GLMs, Anova returns a likelihood ratio test with a chi-sq value
summary(glm1) ## summary() will provide the model coefficients (ie. the "guts" of the model)
##
## Call:
## glm(formula = count ~ spray, family = "poisson", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3852 -0.9345 -0.1482 0.7048 2.6709
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.67415 0.07581 35.274 <2e-16 ***
## sprayB 0.05588 0.10574 0.528 0.597
## sprayC -1.94018 0.21389 -9.071 <2e-16 ***
## sprayF 0.13926 0.10367 1.343 0.179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 262.154 on 47 degrees of freedom
## Residual deviance: 76.323 on 44 degrees of freedom
## AIC: 273.93
##
## Number of Fisher Scoring iterations: 5
The coefficients allow you rebuild the means from the linear model. Rebuilding the model from the coefficients is not super helpful and the p-values aren’t very meaningful. Residual deviance should be about equal to the degrees of freedom. More than twice as high is problematic.
Now let’s check assumptions of model by examining residuals:
hist(resid(glm1)) ## residuals should be normally distributed, but don't need to be for GLMs
plot(resid(glm1)~fitted(glm1)) ## residuals should be evenly dispersed around 0 across the range of x's
abline(h=0) # funnel shapes or curvature is bad
qqPlot(resid(glm1)) ## calls from car package, residuals should line up pretty closely to the blue line
## [1] 27 23
# points that drift from line might be outliers
boxplot(resid(glm1) ~ d$spray) ## variances should be homogeneous for each group
Diagnosing more complex GLMs can be very difficult. Residuals are often NOT NORMALLY DISTRIBUTED. We will return to this later…
emmeans(glm1, ~spray) ## emmeans::emmmeans will rebuild the model for you
## spray emmean SE df asymp.LCL asymp.UCL
## A 2.674 0.0758 Inf 2.526 2.82
## B 2.730 0.0737 Inf 2.586 2.87
## C 0.734 0.2000 Inf 0.342 1.13
## F 2.813 0.0707 Inf 2.675 2.95
##
## Results are given on the log (not the response) scale.
## Confidence level used: 0.95
The emmeans code above will print off the means, SE, and confidence intervals for each treatment group. Note, the coefficients are on the log-scale (look at model).
emmeans(glm1, pairwise~spray, type='response') ## adding 'pairwise' will conduct pairwise contrasts -- ie. compare each group mean to the others
## $emmeans
## spray rate SE df asymp.LCL asymp.UCL
## A 14.50 1.099 Inf 12.50 16.82
## B 15.33 1.130 Inf 13.27 17.72
## C 2.08 0.417 Inf 1.41 3.08
## F 16.67 1.179 Inf 14.51 19.14
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df null z.ratio p.value
## A / B 0.946 0.1000 Inf 1 -0.528 0.9522
## A / C 6.960 1.4886 Inf 1 9.071 <.0001
## A / F 0.870 0.0902 Inf 1 -1.343 0.5352
## B / C 7.360 1.5688 Inf 1 9.364 <.0001
## B / F 0.920 0.0940 Inf 1 -0.816 0.8468
## C / F 0.125 0.0265 Inf 1 -9.803 <.0001
##
## P value adjustment: tukey method for comparing a family of 4 estimates
## Tests are performed on the log scale
Adding the argument ‘pairwise’ will conduct pairwise contrasts – ie. compare each group mean to the others. This automatically adjusts p-values using the ‘tukey’ adjust. This can be changed using ‘adjust=XX’ within the emmeans function. The type=‘response’ will back-transform (ie. exponentiate) to the original scale
Let’s compare residuals for normal, log-transformed, and poisson models:
par(mfrow=c(1,3))
boxplot(resid(lm1) ~ d$spray)
boxplot(resid(lm2) ~ d$spray)
boxplot(resid(glm1) ~ d$spray)
dev.off()
## null device
## 1
Let’s also compare means for normal, log-transformed, and poisson models:
emmeans(lm1, ~spray)
## spray emmean SE df lower.CL upper.CL
## A 14.50 1.32 44 11.849 17.15
## B 15.33 1.32 44 12.683 17.98
## C 2.08 1.32 44 -0.567 4.73
## F 16.67 1.32 44 14.016 19.32
##
## Confidence level used: 0.95
emmeans(lm2, ~spray, type='response')
## spray response SE df lower.CL upper.CL
## A 13.83 1.779 44 10.65 17.9
## B 14.75 1.889 44 11.36 19.1
## C 1.59 0.311 44 1.04 2.3
## F 15.70 2.004 44 12.12 20.3
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log(mu + 1) scale
emmeans(glm1, ~spray, type='response')
## spray rate SE df asymp.LCL asymp.UCL
## A 14.50 1.099 Inf 12.50 16.82
## B 15.33 1.130 Inf 13.27 17.72
## C 2.08 0.417 Inf 1.41 3.08
## F 16.67 1.179 Inf 14.51 19.14
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
Overdispersion
In this section we will go over how to deal with overdispersion.
Let’s load the necessary libraries:
library(tidyverse)
library(emmeans)
library(car)
library(agridat)
Load in and read about the beall.webworms dataset. The variables of interest are the y-count of webworms, spray-spray treatment, and lead-lead treatment. Don’t worry about the block or other variables for now.
data("beall.webworms")
d1 <- beall.webworms
head(d1)
## row col y block trt spray lead
## 1 1 1 1 B1 T1 N N
## 2 2 1 0 B1 T1 N N
## 3 3 1 1 B1 T1 N N
## 4 4 1 3 B1 T1 N N
## 5 5 1 6 B1 T1 N N
## 6 6 1 0 B2 T1 N N
Let’s examine and plot the data:
ggplot(d1, aes(x=spray, y=y, fill=lead)) +
geom_violin(scale="width", adjust=2) +
geom_point(position = position_jitterdodge(jitter.width=.5,
jitter.height=.1,
dodge.width = 1),
alpha=.1) +
theme_bw(base_size = 14)
Let’s now run a model with the interaction of spray and lead.
r3 <- glm(y ~ spray * lead, data=d1, family="poisson")
Examin the model summary:
summary(r3)
##
## Call:
## glm(formula = y ~ spray * lead, family = "poisson", data = d1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6733 -1.0046 -0.9081 0.6141 4.2771
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.33647 0.04688 7.177 7.12e-13 ***
## sprayY -1.02043 0.09108 -11.204 < 2e-16 ***
## leadY -0.49628 0.07621 -6.512 7.41e-11 ***
## sprayY:leadY 0.29425 0.13917 2.114 0.0345 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1955.9 on 1299 degrees of freedom
## Residual deviance: 1720.4 on 1296 degrees of freedom
## AIC: 3125.5
##
## Number of Fisher Scoring iterations: 6
Anova(r3)
## Analysis of Deviance Table (Type II tests)
##
## Response: y
## LR Chisq Df Pr(>Chisq)
## spray 188.707 1 < 2.2e-16 ***
## lead 42.294 1 7.853e-11 ***
## spray:lead 4.452 1 0.03485 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(r3, ~spray:lead, type='response')
## spray lead rate SE df asymp.LCL asymp.UCL
## N N 1.400 0.0656 Inf 1.277 1.535
## Y N 0.505 0.0394 Inf 0.433 0.588
## N Y 0.852 0.0512 Inf 0.758 0.959
## Y Y 0.412 0.0356 Inf 0.348 0.488
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
We need to load library(performance) to test for overdispersion:
library(performance)
check_overdispersion(r3) # overdispersion ratio calculator from RVAideMemoire
## # Overdispersion test
##
## dispersion ratio = 1.355
## Pearson's Chi-Squared = 1755.717
## p-value = < 0.001
Now let’s load the glmmTMB package to implement negative binomial distribution.
library(glmmTMB)
r4 <- glmmTMB(y ~ spray * lead, data=d1, family="nbinom2")
Anova(r4)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: y
## Chisq Df Pr(>Chisq)
## spray 125.5047 1 < 2.2e-16 ***
## lead 26.8005 1 2.256e-07 ***
## spray:lead 3.3942 1 0.06542 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(r4, ~spray:lead, type='response')
## spray lead response SE df lower.CL upper.CL
## N N 1.400 0.0855 1295 1.242 1.578
## Y N 0.505 0.0441 1295 0.425 0.599
## N Y 0.852 0.0611 1295 0.740 0.981
## Y Y 0.412 0.0391 1295 0.342 0.497
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
Let’s use the DHARMa package to simulate residuals for poisson and negative binomial models.
library(DHARMa)
We can interpret the simulated residuals very similarly to the raw residuals we have previously examined. The residuals should line up along the line in the QQ plot and there should be (roughly) equal scatter in the residuals among the groups
plot(simulateResiduals(r3)) ## plot simulated residuals
Histograms of the simulated residuals will be different than before. Here the simulated residuals should be flat. Its ok if the bars bump up and down, but they should be on average flat across the graph.The bars shouldn’t be peaked (eg. “normally” distributed) or U-shaped or increasing/decreasing.
hist(simulateResiduals(r3)) ## histogram should be flat
Now let’s look at residuals for negative binomial model:
plot(simulateResiduals(r4)) ## plot simulated residuals
hist(simulateResiduals(r4)) ## histogram should be flat
Binomial GLM
In this section we will run a GLM with a binomial error distribution. We load the following packages:
library(tidyverse)
library(emmeans)
library(car)
library(agridat)
library(DHARMa)
We will use the Titanic survival dataset for the binomial GLM.
## LOAD TITANIC SURVIVAL DATASET
data("TitanicSurvival")
t1 <- TitanicSurvival %>% filter(age>17) # filter out children
head(t1)
## survived sex age passengerClass
## Allen, Miss. Elisabeth Walton yes female 29 1st
## Allison, Mr. Hudson Joshua Crei no male 30 1st
## Allison, Mrs. Hudson J C (Bessi no female 25 1st
## Anderson, Mr. Harry yes male 48 1st
## Andrews, Miss. Kornelia Theodos yes female 63 1st
## Andrews, Mr. Thomas Jr no male 39 1st
Let’s quickly plot the data:
ggplot(t1, aes(x=passengerClass, y=survived, color=sex)) +
geom_jitter(height=.2, width=0.2)
Now we can construct a GLM to estimate survival as a function of sex and passengerClass while include Age as co-variate.
tglm1 <- glm(survived ~ sex * passengerClass + age, data=t1, family = binomial(link = "logit"))
Let’s look at the summary of the model:
## print off anova table
Anova(tglm1)
## Analysis of Deviance Table (Type II tests)
##
## Response: survived
## LR Chisq Df Pr(>Chisq)
## sex 270.225 1 < 2.2e-16 ***
## passengerClass 85.841 2 < 2.2e-16 ***
## age 8.419 1 0.003714 **
## sex:passengerClass 48.354 2 3.163e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## print off summary
summary(tglm1)
##
## Call:
## glm(formula = survived ~ sex * passengerClass + age, family = binomial(link = "logit"),
## data = t1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7672 -0.6349 -0.4503 0.4438 2.5268
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.440409 0.637328 6.967 3.23e-12 ***
## sexmale -4.105505 0.540235 -7.599 2.97e-14 ***
## passengerClass2nd -1.709631 0.609577 -2.805 0.00504 **
## passengerClass3rd -3.957654 0.562147 -7.040 1.92e-12 ***
## age -0.025352 0.008945 -2.834 0.00459 **
## sexmale:passengerClass2nd -0.203867 0.698880 -0.292 0.77051
## sexmale:passengerClass3rd 2.655924 0.597510 4.445 8.79e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1191.35 on 891 degrees of freedom
## Residual deviance: 752.38 on 885 degrees of freedom
## AIC: 766.38
##
## Number of Fisher Scoring iterations: 6
Now we can check residuals:
hist(resid(tglm1)) ## residuals should be normally distributed, even for glm
plot(resid(tglm1)~fitted(tglm1)) + ## residuals should be evenly dispersed around 0 across the range of x's
abline(h=0) # funnel shapes or curvature is bad
## integer(0)
qqPlot(resid(tglm1)) ## residuals should line up pretty closely to the blue line
## Allison, Mrs. Hudson J C (Bessi Harris, Mr. George
## 3 363
boxplot(resid(tglm1) ~ t1$passengerClass) ## variances should be homogeneous for each group
## simulate residuals
plot(simulateResiduals(tglm1)) ## plot simulated residuals
hist(simulateResiduals(tglm1)) ## histogram should be flat
To make sense of the model output let’s use the emmeans package:
emmeans(tglm1, pairwise ~ sex:passengerClass)
## $emmeans
## sex passengerClass emmean SE df asymp.LCL asymp.UCL
## female 1st 3.592 0.516 Inf 2.580 4.6034
## male 1st -0.514 0.192 Inf -0.890 -0.1371
## female 2nd 1.882 0.325 Inf 1.246 2.5183
## male 2nd -2.427 0.303 Inf -3.022 -1.8324
## female 3rd -0.366 0.203 Inf -0.764 0.0322
## male 3rd -1.815 0.170 Inf -2.149 -1.4814
##
## Results are given on the logit (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df z.ratio p.value
## female 1st - male 1st 4.106 0.540 Inf 7.599 <.0001
## female 1st - female 2nd 1.710 0.610 Inf 2.805 0.0567
## female 1st - male 2nd 6.019 0.602 Inf 9.997 <.0001
## female 1st - female 3rd 3.958 0.562 Inf 7.040 <.0001
## female 1st - male 3rd 5.407 0.551 Inf 9.810 <.0001
## male 1st - female 2nd -2.396 0.377 Inf -6.354 <.0001
## male 1st - male 2nd 1.913 0.364 Inf 5.262 <.0001
## male 1st - female 3rd -0.148 0.291 Inf -0.507 0.9959
## male 1st - male 3rd 1.302 0.270 Inf 4.826 <.0001
## female 2nd - male 2nd 4.309 0.444 Inf 9.700 <.0001
## female 2nd - female 3rd 2.248 0.383 Inf 5.872 <.0001
## female 2nd - male 3rd 3.698 0.367 Inf 10.088 <.0001
## male 2nd - female 3rd -2.061 0.362 Inf -5.699 <.0001
## male 2nd - male 3rd -0.612 0.344 Inf -1.776 0.4812
## female 3rd - male 3rd 1.450 0.255 Inf 5.678 <.0001
##
## Results are given on the log odds ratio (not the response) scale.
## P value adjustment: tukey method for comparing a family of 6 estimates
We can obtain back-transformed means:
emmeans(tglm1, pairwise ~ sex:passengerClass, type="response") ## type= does contrasts before back-transforming (more appropriate!)
## $emmeans
## sex passengerClass prob SE df asymp.LCL asymp.UCL
## female 1st 0.9732 0.0135 Inf 0.9296 0.990
## male 1st 0.3743 0.0450 Inf 0.2911 0.466
## female 2nd 0.8679 0.0372 Inf 0.7766 0.925
## male 2nd 0.0811 0.0226 Inf 0.0465 0.138
## female 3rd 0.4096 0.0491 Inf 0.3178 0.508
## male 3rd 0.1400 0.0205 Inf 0.1044 0.185
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
##
## $contrasts
## contrast odds.ratio SE df null z.ratio p.value
## female 1st / male 1st 60.6734 32.7779 Inf 1 7.599 <.0001
## female 1st / female 2nd 5.5269 3.3691 Inf 1 2.805 0.0567
## female 1st / male 2nd 411.1685 247.5444 Inf 1 9.997 <.0001
## female 1st / female 3rd 52.3344 29.4197 Inf 1 7.040 <.0001
## female 1st / male 3rd 223.0141 122.9249 Inf 1 9.810 <.0001
## male 1st / female 2nd 0.0911 0.0343 Inf 1 -6.354 <.0001
## male 1st / male 2nd 6.7768 2.4645 Inf 1 5.262 <.0001
## male 1st / female 3rd 0.8626 0.2514 Inf 1 -0.507 0.9959
## male 1st / male 3rd 3.6756 0.9915 Inf 1 4.826 <.0001
## female 2nd / male 2nd 74.3937 33.0509 Inf 1 9.700 <.0001
## female 2nd / female 3rd 9.4690 3.6250 Inf 1 5.872 <.0001
## female 2nd / male 3rd 40.3505 14.7906 Inf 1 10.088 <.0001
## male 2nd / female 3rd 0.1273 0.0460 Inf 1 -5.699 <.0001
## male 2nd / male 3rd 0.5424 0.1868 Inf 1 -1.776 0.4812
## female 3rd / male 3rd 4.2613 1.0878 Inf 1 5.678 <.0001
##
## P value adjustment: tukey method for comparing a family of 6 estimates
## Tests are performed on the log odds ratio scale
In order to plot we need to create new variable that is 0 or 1.
t1$surv <- if_else(t1$survived=='yes',1,0)
Now we can make a plot with regression lines:
ggplot(t1, aes(x=age, y=surv, color=sex)) +
geom_jitter(height=.1, width=0) +
geom_smooth(method="glm",
method.args=list(family="binomial"),
formula = y ~ x, se=F, lwd=1.5) +
facet_wrap(~passengerClass) +
theme_bw(base_size = 20)
We can improve the aesthetics of the plot:
tm <- emmeans(tglm1, ~ sex:passengerClass, type="response") %>% as.data.frame()
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
plot1 <- ggplot() +
geom_jitter(data=t1 %>% filter(sex=='female'),
aes(x=passengerClass, y=surv+.01),
height=0, width=.25, size=1,
alpha=.1, color="#E69F00") +
geom_jitter(data=t1 %>% filter(sex=='male'),
aes(x=passengerClass, y=surv-.01),
height=0, width=.25, size=1,
alpha=.1, color="#56B4E9") +
geom_errorbar(data=tm,
aes(x=passengerClass,
y=prob, ymin=(prob-SE),
ymax=(prob+SE), color=sex),
width=.2, lwd=1.25,
position = position_dodge(width=0.5)) +
## make bars thinner
geom_point(data=tm ,
aes(x=passengerClass, y=prob, color=sex),
size=5,
position=position_dodge(width=0.5)) +
scale_y_continuous('survival', labels = scales::percent) +
scale_color_manual(values=cbPalette) +
theme(panel.background = element_blank(),
panel.border = element_rect(color="black",
fill=NA, size=2)) +
theme(axis.ticks.length=unit(0.3, "cm"),
axis.text.x = element_text(margin=margin(5,5,5,5,"pt"),colour="black"),
axis.text.y = element_text(margin=margin(5,5,5,5,"pt"),colour="black")) + ## change axis tick marks to make them a little longer
#theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
theme(text = element_text(size=20))
plot1
plot2 <- ggplot() +
geom_jitter(data=t1 ,
aes(x=passengerClass, y=surv,
color=passengerClass),
height=.01, width=.35,
size=1, alpha=.2) +
geom_errorbar(data=tm ,
aes(x=passengerClass,
y=prob, ymin=(prob-SE),
ymax=(prob+SE),
color=passengerClass),
width=.2, lwd=1.25) + ## make bars thinner
geom_point(data=tm ,
aes(x=passengerClass,
y=prob, color=passengerClass), size=5) +
facet_wrap(~sex) +
scale_y_continuous('survival', labels = scales::percent) +
scale_color_manual(values=cbPalette) +
theme(panel.background = element_blank(),
panel.border = element_rect(color="black", fill=NA, size=2)) +
theme(axis.ticks.length=unit(0.3, "cm"),
axis.text.x = element_text(margin=margin(5,5,5,5,"pt"),colour="black"),
axis.text.y = element_text(margin=margin(5,5,5,5,"pt"),colour="black")) +
theme(text = element_text(size=20))
plot2
Gamma GLM
In this section we will learn about implementing the Gamma distribution in R. This distribution is helpful in for modeling a variety of data but is most frequently applied to data that is right skewed but not necessarily count data. The next several examples will illustrate how to generate Gamma distributed data based on several parameters as well as how to implement Gamma GLMs.
Example 1
Here we generate a distribution:
set.seed(15)
var1 <- rgamma(1000, shape = 2, scale = .5)
hist(var1, main='mean=1, scale=0.5')
Note that the mean of this distribution should be 1 because the mean is equal to product of the shape parameter and scale parameter. So, 2 multiplied by 0.5 should be 1.
set.seed(15)
g1 <- glmmTMB(var1 ~ 1, family=Gamma(link="inverse"))
g1
## Formula: var1 ~ 1
## AIC BIC logLik df.resid
## 1822.0994 1831.9149 -909.0497 998
##
## Number of obs: 1000
##
## Dispersion estimate for Gamma family (sigma^2): 0.483
##
## Fixed Effects:
##
## Conditional model:
## (Intercept)
## 0.9657
## mean should be 1 with inverse link function
1/0.9711
## [1] 1.02976
In the above example we run a Gamma GLM as an intercept only model. By running an intercept only model we assume no other predictors can predict the response variable. Therefore the output from this model would simply be the mean of the response, which should be ~1 (which it is after applying the right link function to the intercept).
g1a <- glmmTMB(var1 ~ 1, family=Gamma(link="log"))
g1a
## Formula: var1 ~ 1
## AIC BIC logLik df.resid
## 1822.0994 1831.9149 -909.0497 998
##
## Number of obs: 1000
##
## Dispersion estimate for Gamma family (sigma^2): 0.483
##
## Fixed Effects:
##
## Conditional model:
## (Intercept)
## 0.03491
## mean should be 1 with log link function
exp(0.02937)
## [1] 1.029806
The same is shown above but in this case we use a log link and so to back transform the intercept we use exponentiation.
Example 2
In the example below we generate another Gamma distributed variable but we change the shape and scale parameter. But because the mean is based on the product of the two, the mean remains as 1.
set.seed(15)
var1 <- rgamma(1000, shape = 4, scale = .25)
hist(var1, main='mean=1, scale=0.25')
# mean = a (shape) * b (rate)
# mean = 4 * .25 = 1.0
g1 <- glmmTMB(var1 ~ 1, family=Gamma(link="inverse"))
g1
## Formula: var1 ~ 1
## AIC BIC logLik df.resid
## 1281.0616 1290.8771 -638.5308 998
##
## Number of obs: 1000
##
## Dispersion estimate for Gamma family (sigma^2): 0.233
##
## Fixed Effects:
##
## Conditional model:
## (Intercept)
## 0.9711
## mean should be 1 with inverse link function
1/0.9711
## [1] 1.02976
g1a <- glmmTMB(var1 ~ 1, family=Gamma(link="log"))
g1a
## Formula: var1 ~ 1
## AIC BIC logLik df.resid
## 1281.0616 1290.8771 -638.5308 998
##
## Number of obs: 1000
##
## Dispersion estimate for Gamma family (sigma^2): 0.233
##
## Fixed Effects:
##
## Conditional model:
## (Intercept)
## 0.02937
## mean should be 1 with log link function
exp(0.02937)
## [1] 1.029806
We once again repeat the same exercise as example 1 and see that when the appropriate back transformation based on link functions are applied to the intercept, it results in the original mean.
Example 3
In this example we change the mean of the distribution to 0.5 by changing the shape and scale parameters to 1 and 0.5 respectively. We then repeat the previous examples and run the inverse and log link Gamma GLMs and observe how the intercepts are backtransformed to approximate the previously set mean value.
set.seed(15)
var1 <- rgamma(1000, shape = 1, scale = .5)
hist(var1, main='mean=0.5, scale=0.5')
# mean = a (shape) * b (rate)
# mean = 1 * .5 = 0.5
g1 <- glmmTMB(var1 ~ 1, family=Gamma(link="inverse"))
g1
## Formula: var1 ~ 1
## AIC BIC logLik df.resid
## 654.8480 664.6635 -325.4240 998
##
## Number of obs: 1000
##
## Dispersion estimate for Gamma family (sigma^2): 0.968
##
## Fixed Effects:
##
## Conditional model:
## (Intercept)
## 1.963
## mean should be 1 with inverse link function
1/g1$sdr$par.fixed[1]
## beta
## 0.5095401
g1a <- glmmTMB(var1 ~ 1, family=Gamma(link="log"))
g1a$sdr$par.fixed[1]
## beta
## -0.6742484
## mean should be 1 with log link function
exp(g1a$sdr$par.fixed[1])
## beta
## 0.5095393
Example 4
This example follows the previous examples but with the mean changed to 0.25.
set.seed(15)
var1 <- rgamma(1000, shape = .5, scale = 1)
hist(var1, main='mean=0.25, scale=1')
# mean = a (shape) * b (rate)
# mean = .5 * 1 = 0.5
g1 <- glmmTMB(var1 ~ 1, family=Gamma(link="inverse"))
g1
## Formula: var1 ~ 1
## AIC BIC logLik df.resid
## 202.40398 212.21949 -99.20199 998
##
## Number of obs: 1000
##
## Dispersion estimate for Gamma family (sigma^2): 1.97
##
## Fixed Effects:
##
## Conditional model:
## (Intercept)
## 2.005
## mean should be 1 with inverse link function
1/g1$sdr$par.fixed[1]
## beta
## 0.4986803
g1a <- glmmTMB(var1 ~ 1, family=Gamma(link="log"))
g1a
## Formula: var1 ~ 1
## AIC BIC logLik df.resid
## 202.40398 212.21949 -99.20199 998
##
## Number of obs: 1000
##
## Dispersion estimate for Gamma family (sigma^2): 1.97
##
## Fixed Effects:
##
## Conditional model:
## (Intercept)
## -0.6958
## mean should be 1 with log link function
exp(g1a$sdr$par.fixed[1])
## beta
## 0.4986805
Running Gamma GLMs
In this section we simulate data for two groups:
set.seed(25)
v1 <- rgamma(100, shape = 3, scale = .5) %>% as.data.frame()
colnames(v1) <- "var"
v1$group <- "one"
head(v1)
## var group
## 1 1.0881400 one
## 2 2.4520815 one
## 3 3.5583571 one
## 4 0.9405657 one
## 5 0.9225875 one
## 6 1.4339098 one
v2 <- rgamma(100, shape = 1, scale = .2) %>% as.data.frame()
colnames(v2) <- "var"
v2$group <- "two"
head(v2)
## var group
## 1 0.15108136 two
## 2 0.01761541 two
## 3 0.47914554 two
## 4 0.05403794 two
## 5 0.13555345 two
## 6 0.19157853 two
Then bind the two groups into one dataset:
dat1 <- rbind(v1,v2) #mean group 1 = 1.5, group 2 = 0.5
dat1 %>% mutate(obs=rep(1:100,2)) %>% group_by(obs) %>% pivot_wider(names_from = group,values_from = var) %>%
ungroup() %>% select(one,two)
## # A tibble: 100 × 2
## one two
## <dbl> <dbl>
## 1 1.09 0.151
## 2 2.45 0.0176
## 3 3.56 0.479
## 4 0.941 0.0540
## 5 0.923 0.136
## 6 1.43 0.192
## 7 0.559 0.185
## 8 1.20 0.0195
## 9 1.15 0.0725
## 10 0.854 0.00261
## # … with 90 more rows
We then check the data distributions:
hist(dat1$var)
ggplot(dat1, aes(x=var)) + geom_histogram(bins=8, fill="grey", color="black") +
facet_wrap(~group, scales="free") + theme_bw(base_size = 16)
Then we construct several different models and use AIC to compare models:
#### construct model w/ normal distribution
mod0 <- glmmTMB(var ~ group, data=dat1)
plot(simulateResiduals(mod0))
summary(mod0)
## Family: gaussian ( identity )
## Formula: var ~ group
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 408.0 417.9 -201.0 402.0 197
##
##
## Dispersion estimate for gaussian family (sigma^2): 0.437
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.49247 0.06610 22.58 <2e-16 ***
## grouptwo -1.28086 0.09349 -13.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(mod0, ~group, type="response")
## group emmean SE df lower.CL upper.CL
## one 1.492 0.0661 197 1.3621 1.623
## two 0.212 0.0661 197 0.0813 0.342
##
## Confidence level used: 0.95
#### construct model w/ log-normal distribution
mod0a <- glmmTMB(log(var) ~ group, data=dat1)
plot(simulateResiduals(mod0a))
summary(mod0a)
## Family: gaussian ( identity )
## Formula: log(var) ~ group
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 585.6 595.5 -289.8 579.6 197
##
##
## Dispersion estimate for gaussian family (sigma^2): 1.06
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2288 0.1030 2.221 0.0264 *
## grouptwo -2.3798 0.1457 -16.330 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(mod0a, ~group, type="response")
## group response SE df lower.CL upper.CL
## one 1.257 0.130 197 1.026 1.540
## two 0.116 0.012 197 0.095 0.143
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
#### construct model w/ Gamma distribution and inverse link
mod1 <- glmmTMB(var ~ group, data=dat1, family=Gamma(link = "inverse"))
plot(simulateResiduals(mod1))
summary(mod1)
## Family: Gamma ( inverse )
## Formula: var ~ group
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 160.7 170.6 -77.3 154.7 197
##
##
## Dispersion estimate for Gamma family (sigma^2): 0.693
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.67003 0.05577 12.01 <2e-16 ***
## grouptwo 4.05552 0.39726 10.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(mod1, ~group, type="response")
## group response SE df lower.CL upper.CL
## one 1.492 0.1242 197 1.282 1.786
## two 0.212 0.0176 197 0.182 0.253
##
## Confidence level used: 0.95
## Intervals are back-transformed from the inverse scale
#### construct model w/ Gamma distribution and log link
mod2 <- glmmTMB(var ~ group, data=dat1, family=Gamma(link = "log"))
plot(simulateResiduals(mod2))
summary(mod2)
## Family: Gamma ( log )
## Formula: var ~ group
## Data: dat1
##
## AIC BIC logLik deviance df.resid
## 160.7 170.6 -77.3 154.7 197
##
##
## Dispersion estimate for Gamma family (sigma^2): 0.693
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.40043 0.08323 4.811 1.5e-06 ***
## grouptwo -1.95342 0.11771 -16.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(mod2, ~group, type="response")
## group response SE df lower.CL upper.CL
## one 1.492 0.1242 197 1.27 1.759
## two 0.212 0.0176 197 0.18 0.249
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
AIC:
AIC(mod0,mod0a,mod1,mod2)
## df AIC
## mod0 3 408.0006
## mod0a 3 585.5705
## mod1 3 160.6724
## mod2 3 160.6724
Which model is the best ranked model?