Remember to set your working directory before you start. You also need to load today’s packages.
library(arm)
library(car)
library(effects)
library(GGally)
library(ggeffects)
library(ggplot2)
library(ggthemes)
library(lme4)
library(MuMIn)
library(performance)
library(predictmeans)
glm()
We will briefly cover two common forms of generalized linear model: logistic regression and Poisson regression. This tutorial covers key points from Ch. 13 of Quinn & Keough (2024).
The dataset we will use for logistic regression was published by Blas et al. (2007). It contains two variables recorded in 34 European white storks (Ciconia ciconia):
- cort: plasma corticosterone (ng/ml) measured in nestling storks after experimentally induced stress
- surv: whether the bird died (0) or survived (1) within the next five years of the study
In this analysis, cort is the predictor variable and surv is the response variable. Because we have a continuous predictor and a binary response variable, a logistic regression is a reasonable choice. Note: because surv is a categorical variable, we may be tempted to convert it to a factor. However, for this logistic regression, leave it as numeric (0 and 1).
First, we will read the data, summarize it, and plot it.
d <- read.csv("Data/storks.csv")
summary(d)
## cort surv
## Min. :26.00 Min. :0.0000
## 1st Qu.:34.90 1st Qu.:0.0000
## Median :40.75 Median :0.0000
## Mean :40.72 Mean :0.3824
## 3rd Qu.:47.40 3rd Qu.:1.0000
## Max. :61.10 Max. :1.0000
Next, let’s plot the data. To remind ourselves that surv
is the response variable, we will use coord_flip() to
create horizontal boxplots.
ggplot(data = d, aes(x = factor(surv, labels=c("No", "Yes")), y = cort)) +
geom_boxplot(outliers = FALSE) +
geom_jitter(width = 0.1, height = 0, alpha = 0.5) +
labs(x = "Survival", y = "Corticosterone (ng/ml)") +
coord_flip() +
theme_few()
It appears that storks with higher corticosterone levels may be more likely to die.
Next, let’s fit the model. By default, when you specify family = binomial, glm() will use a logit link.[1] Then we will plot the binned residuals using the binnedplot() function from the arm package.
stork.glm <- glm(surv ~ cort, family = binomial, data = d, na.action = "na.fail")
x <- predict(stork.glm)
y <- resid(stork.glm)
binnedplot(x, y, main = NULL)
Ideally, ~95% of the binned residuals would be within the grey lines, but because we have a smallish sample (only 34 observations), the binned residual plot is difficult to interpret. We will assume it is okay.
Next we will perform model selection using the AICc.
dredge(stork.glm)
## Fixed term is "(Intercept)"
## Global model call: glm(formula = surv ~ cort, family = binomial, data = d, na.action = "na.fail")
## ---
## Model selection table
## (Intrc) cort df logLik AICc delta weight
## 2 2.7030 -0.0798 2 -20.698 45.8 0.00 0.687
## 1 -0.4796 1 -22.617 47.4 1.58 0.313
## Models ranked by AICc(x)
The top-ranked model contains cort as a term. (Although
some people might argue that cort should be excluded
because the intercept-only model has \(\Delta\)AICc<2.) We
will keep it in our model so we can continue with the tutorial…
Next, show the coefficient table.
S(stork.glm, header = FALSE)
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.70304 1.74725 1.547 0.1219
## cort -0.07980 0.04368 -1.827 0.0677 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45.234 on 33 degrees of freedom
## Residual deviance: 41.396 on 32 degrees of freedom
##
## logLik df AIC BIC
## -20.70 2 45.40 48.45
##
## Number of Fisher Scoring iterations: 4
##
## Exponentiated Coefficients and Confidence Bounds
## Estimate 2.5 % 97.5 %
## (Intercept) 14.9250213 0.5721667 623.023934
## cort 0.9232967 0.8397103 1.000028
Overdispersion does not appear to be a problem: the residual deviance is 41.4 on 32 df, where a 1:1 ratio is expected. To more formally check for overdispersion, we can use the Pearson dispersion statistic. The easiest way to get this value is to refit the model using a quasi-binomial distribution, which estimates the dispersion parameter (rather than forcing it to be equal to 1 as in a binomial distribution). Values greater than 1.5 suggest potentially worrisome overdispersion.
stork.qglm <- glm(surv ~ cort, family = quasibinomial, d)
S(stork.qglm)$dispersion
## [1] 1.025242
Overdispersion is not a problem.[2]
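Equivalently, we could compute the Pearson dispersion statistic directly from the original binomial fit; a minimal sketch (the value should match the quasi-binomial estimate above):
# Pearson dispersion = sum of squared Pearson residuals / residual degrees of freedom
sum(residuals(stork.glm, type = "pearson")^2) / df.residual(stork.glm)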
Now we will calculate Tjur’s D, a pseudo-\(R^2\) value used as a measure of goodness
of fit for a logistic regression. It is equal to the difference in the
average predicted probability for the two groups. The
r2_tjur() function comes from the performance
package.
r2_tjur(stork.glm)
## Tjur's R2
## 0.1022956
On average, the predicted probability of survival among storks that actually survived is 10.2 percentage points higher than the predicted probability of survival among storks that died. This suggests that our model has low to moderate predictive ability.
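To see where this number comes from, we can compute Tjur’s D by hand from the fitted probabilities; a quick sketch:
# Difference in mean fitted probability between survivors and non-survivors
p.hat <- fitted(stork.glm)
mean(p.hat[d$surv == 1]) - mean(p.hat[d$surv == 0])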
The model suggests that for every unit increase in plasma corticosterone, the odds of survival decline multiplicatively by a factor of 0.92 (95% CI: 0.84-1.00); in other words, the odds of survival decrease by 8% (95% CI: 0%–16%).
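These numbers are just the exponentiated cort coefficient reported in the table above; to reproduce the odds ratio and the percent change yourself, a minimal sketch:
exp(coef(stork.glm)["cort"])               # odds ratio per 1 ng/ml increase in cort
100 * (exp(coef(stork.glm)["cort"]) - 1)   # percent change in the odds of survival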
Finally, we will plot the logistic regression. The easiest way to do
this is possibly with the ggpredict() function from the
ggeffects package.
pred <- ggpredict(stork.glm, terms = "cort [all]")
ggplot(pred, aes(x = x, y = predicted)) +
geom_point(data = d, aes(x = cort, y = surv), alpha = 0.5, size = 2) +
geom_line(colour = "blue") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
labs(x = "Corticosterone (ng/ml)", y = "Probability of survival") +
theme_few()
If we wanted to use this model to make predictions, we could validate it and make predictions as in Tutorial 8.
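For example, here is a minimal prediction sketch for a hypothetical bird with a corticosterone level of 45 ng/ml (the value 45 is arbitrary and used only for illustration):
# Predicted probability of survival for a hypothetical cort value
predict(stork.glm, newdata = data.frame(cort = 45), type = "response")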
The dataset we will use for Poisson regression was published by Beckerman et al. (2017). It contains two variables recorded in 50 Soay sheep (Ovis aries):
- fitness: lifetime reproductive success, measured as the number of offspring
- body.size: a standardized measure of body mass (in kg) for each ewe
In this analysis, body.size is the predictor variable and fitness is the response variable. Because we have a continuous predictor and a response variable that is a count, a Poisson regression is a reasonable choice.
First, we will read the data, summarize it, and plot it.
d <- read.csv("Data/sheep.csv")
summary(d)
## fitness body.size
## Min. : 0.00 Min. :4.785
## 1st Qu.: 3.00 1st Qu.:6.628
## Median : 4.00 Median :7.129
## Mean : 4.52 Mean :7.100
## 3rd Qu.: 5.00 3rd Qu.:7.728
## Max. :14.00 Max. :8.595
Next, let’s plot the data. To see whether a straight line (blue) fits the data, we will compare it to a LOESS curve (in magenta).
ggplot(data = d, aes(x = body.size, y = fitness)) +
geom_point() +
geom_smooth(colour = "magenta") +
geom_smooth(method=lm) +
labs(x = "Standardized body mass (kg)", y = "Lifetime reproductive success") +
theme_few()
It appears that ewes with higher body mass have higher lifetime reproductive success, and that the relationship may not be linear.
Next, let’s fit the model and plot the residuals. Because the response variable is a count, it is reasonable to try a Poisson regression. By default, when you specify family = poisson, glm() will use a log link. When you plot the fitted model, R automatically uses deviance residuals (and standardized Pearson residuals in the residuals-vs-leverage panel) rather than raw residuals.
sheep.glm <- glm(fitness ~ body.size, family = poisson, data = d, na.action = "na.fail")
opar <- par(mfrow = c(2, 2))
plot(sheep.glm)
par(opar)
The residuals do not look too bad.
Next we will perform model selection using the AICc.
dredge(sheep.glm)
## Fixed term is "(Intercept)"
## Global model call: glm(formula = fitness ~ body.size, family = poisson, data = d,
## na.action = "na.fail")
## ---
## Model selection table
## (Int) bdy.siz df logLik AICc delta weight
## 2 -2.422 0.5409 2 -103.424 211.1 0.00 1
## 1 1.509 1 -121.944 246.0 34.87 0
## Models ranked by AICc(x)
The top-ranked model contains body.size as a term. The
model without this term has a \(\Delta\)AICc of 34.9,
so it is obviously far worse.
Next, show the coefficient table.
S(sheep.glm, header = FALSE)
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.42203 0.69432 -3.488 0.000486 ***
## body.size 0.54087 0.09316 5.806 6.41e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 85.081 on 49 degrees of freedom
## Residual deviance: 48.040 on 48 degrees of freedom
##
## logLik df AIC BIC
## -103.42 2 210.85 214.67
##
## Number of Fisher Scoring iterations: 4
##
## Exponentiated Coefficients and Confidence Bounds
## Estimate 2.5 % 97.5 %
## (Intercept) 0.08874154 0.02211508 0.3363099
## body.size 1.71750081 1.43477775 2.0672237
Overdispersion does not appear to be a problem: the residual deviance is 48.0 on 48 df, where a 1:1 ratio is expected. To more formally check for overdispersion, we can use the Pearson dispersion statistic. The easiest way to get this value is to refit the model using a quasi-Poisson distribution, which estimates the dispersion parameter (rather than forcing it to be equal to 1 as in a Poisson distribution). Values greater than 1.5 suggest potentially worrisome overdispersion.
sheep.qglm <- glm(fitness ~ body.size, family = quasipoisson, d)
S(sheep.qglm)$dispersion
## [1] 0.9026887
Overdispersion is not a problem. If anything, we appear to have slight underdispersion.[3]
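As a cross-check, the performance package (already loaded) provides check_overdispersion(), which reports the dispersion ratio along with a test for overdispersion; a minimal sketch:
# Dispersion ratio and overdispersion test for the Poisson fit
check_overdispersion(sheep.glm)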
Next, we might like to calculate something like an \(R^2\) value… Unfortunately, whereas we could use Tjur’s D as a pseudo-\(R^2\) value for a logistic regression, no easily interpretable pseudo-\(R^2\) value exists for a Poisson regression.
The model suggests that for every unit increase in standardized body mass, the expected number of offspring increases multiplicatively by a factor of 1.72 (95% CI: 1.43–2.07); in other words, it increases by 72% (95% CI: 43%–107%).
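Again, the 1.72 factor is the exponentiated body.size coefficient from the table above; to reproduce it and the percent change, a quick sketch:
exp(coef(sheep.glm)["body.size"])               # multiplicative change in expected offspring
100 * (exp(coef(sheep.glm)["body.size"]) - 1)   # percent change per unit increase in body size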
Finally, we will plot the Poisson regression. The easiest way to do
this is possibly with the ggpredict() function from the
ggeffects package.
pred <- ggpredict(sheep.glm, terms = "body.size [all]")
ggplot(pred, aes(x = x, y = predicted)) +
geom_point(data = d, aes(x = body.size, y = fitness), alpha = 0.5, size = 2) +
geom_line(colour = "blue") +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
labs(x = "Standardized body mass (kg)", y = "Lifetime reproductive success") +
theme_few()
If we wanted to use this model to make predictions, we could validate it and make predictions as in Tutorial 8.
lmer()
The owls.txt dataset (described in Zuur et al., 2009) contains information on vocal-begging behaviour of nestling barn owls when their parents brought prey back to the nest. The researchers’ primary interest was in whether chicks respond differently to their father than to their mother. The columns in the dataset are:
- Nest: the nest ID. In total, 599 observations were made on 27 different nests.
- FoodTreatment: “Deprived” nests had prey removed; “Satiated” nests had prey added. The experiment was conducted over two nights; nests that were deprived on the first night were satiated on the second, and vice versa.
- SexParent: sex of the arriving parent (female or male)
- ArrivalTime: time that the parent arrived at the nest, measured between 21.5 (= 9:30 PM) and 29.5 (= 5:30 AM)
- SiblingNegotiation: number of calls made by the owl chicks in the 30 seconds prior to parent arrival
- BroodSize: total number of chicks in the nest
- NegPerChick: the response variable, calculated as SiblingNegotiation/BroodSize
Read in the data and convert to factors where appropriate. Then inspect the output of the summary() command (which is not shown because it is too wide for this page).
d <- read.delim(file = "Data/owls.txt")
d$Nest <- factor(d$Nest)
d$FoodTreatment <- factor(d$FoodTreatment)
d$SexParent <- factor(d$SexParent)
summary(d)
To increase interpretability of our random-slope model, we will center the arrival time within each nest. To match the analysis in Zuur et al. (2009) and the paper from which they took this data, we will log-transform our response variable to reduce heteroskedasticity.
d$ArrivalTime.mean <- ave(d$ArrivalTime, d$Nest)              # mean arrival time within each nest
d$ArrivalTime.centered <- d$ArrivalTime - d$ArrivalTime.mean  # deviation from the nest mean
d$NegPerChick.log10 <- log10(d$NegPerChick + 1)               # add 1 before logging to handle zeros
Next, let’s check how the variables are related to each other with a scatterplot matrix.
# Columns 2, 3, 8, 9, 10 = FoodTreatment, SexParent, ArrivalTime.mean,
# ArrivalTime.centered, NegPerChick.log10
ggpairs(d[, c(2, 3, 8, 9, 10)]) +
theme_few()
Looks pretty good to me. At a glance, FoodTreatment appears to have an effect on chick begging, while SexParent probably does not. The effect of arrival time is unclear.
The scatterplot matrix did not show the nest effects because Nest contains too many levels. So, let’s plot a separate regression of NegPerChick.log10 on centered arrival time for each nest:
ggplot(data = d, aes(x = ArrivalTime.centered, y = NegPerChick.log10)) +
facet_wrap(~Nest) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Arrival time (centered for each nest)",
y = expression(log[10]("Negotiations per chick"))) +
theme_few()
Does it look like there is substantial variation in slopes and intercepts among the different nests?
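If you would like a numeric answer rather than an eyeballed one, one option (a sketch, not part of the original analysis) is to fit a separate regression for each nest with lmList() from lme4 and examine the spread of the per-nest coefficients:
# Separate least-squares fit for each nest; a wide spread in the coefficients
# suggests that random intercepts and/or slopes are worth modelling
nest.fits <- lmList(NegPerChick.log10 ~ ArrivalTime.centered | Nest, data = d)
summary(coef(nest.fits))   # ranges of the per-nest intercepts and slopes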
Now we will begin our model-selection procedure.
We will start with a “beyond-optimal” model: a full model containing the maximal random-effects structure justified by the design. It also contains all fixed predictor variables and as many of their two-way and three-way interactions as possible (or that we believe might be important). There is usually no need to include interactions between four or more variables: they usually explain little and are difficult to interpret.
For us, our full model will be a “varying-intercepts-and-slopes”
model. (These are often called “random-slope” models.) This model will
allow each nest to differ in its average level of chick begging (i.e.,
the intercept varies among nests), and it allows the effect of
(centered) arrival time to vary in each nest (i.e., the slope varies
among nests). This model is specified in lmer() using
(ArrivalTime.centered | Nest).
We will also include all possible main effects, two-way interactions,
and three-way interactions between our predictor variables, using
(...)^3 notation. Notice that in the model below,
ArrivalTime.centered appears both as a fixed effect and as
a random effect. This is standard practice for random-slope models: The
fixed effect of ArrivalTime.centered estimates the average
effect across all nests. The random slope models deviations from that
average for each nest.
randomslope.lmer <- lmer(
NegPerChick.log10 ~ (SexParent + FoodTreatment + ArrivalTime.mean + ArrivalTime.centered)^3 +
(ArrivalTime.centered | Nest), data = d)
If you use plot() on an lmer object, it
will produce only a single residual plot for the residuals vs. fitted
values. The residplot() function in the
predictmeans package produces additional plots (including a
Q-Q plot for the random intercepts) that can be useful for assessing
whether the assumptions of the model are satisfied. We will colour the
data points in the “Residuals vs Fitted” plot based on the
FoodTreatment variable.
residplot(randomslope.lmer, group = "FoodTreatment")
We can also produce a Q-Q plot for the random slopes.
residplot(randomslope.lmer, group = "FoodTreatment", slope = TRUE)
We might also want to plot standardized residuals against each of our predictor variables. (Output not shown to save space.)
randomslope.resid <- residuals(randomslope.lmer, scaled = TRUE)
plot(randomslope.resid ~ SexParent, data = d)
plot(randomslope.resid ~ FoodTreatment, data = d)
plot(randomslope.resid ~ ArrivalTime.mean, data = d)
plot(randomslope.resid ~ ArrivalTime.centered, data = d)
plot(randomslope.resid ~ Nest, data = d)
In this case, our residuals are not too horrible. The straight line of residuals in the bottom row of plots arises from the lower boundary at NegPerChick = 0. For now, we will ignore any problems, but we might want to try other data transformations or modelling approaches, such as modelling the total number of calls from the nest with a (zero-inflated) Poisson distribution in a generalized linear mixed model, generalized estimating equation, generalized additive mixed model, etc.
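For instance, a very rough sketch of one such alternative (not pursued in this tutorial) is a Poisson GLMM for the total number of calls, with brood size as an offset; handling zero inflation would require another package (e.g., glmmTMB):
# Sketch only: Poisson GLMM for total calls, with log(BroodSize) as an offset
calls.glmm <- glmer(SiblingNegotiation ~ FoodTreatment + ArrivalTime.centered +
offset(log(BroodSize)) + (1 | Nest), family = poisson, data = d)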
We will first determine the optimal random-effects structure. To do
this, we will compare our “varying-intercepts-and-slopes” model to a
“varying-intercept” model. (These are often called “random-intercept”
models.) This model allows each nest to differ in its average level of
chick begging (i.e., the intercept varies among nests), but it does not
estimate the effect of (centered) arrival time for each nest. This model
is specified in lmer() using (1 | Nest).
We will also fit a fixed-effects model where the average level of
chick begging is assumed to be equal for all nests. This must be fitted
in lm(), not lmer() (which is for random- and
mixed-effects models only).
randomintercept.lmer <- lmer(
NegPerChick.log10 ~ (SexParent + FoodTreatment + ArrivalTime.mean + ArrivalTime.centered)^3 +
(1 | Nest), data = d)
fixedeffects.lm <- lm(
NegPerChick.log10 ~ (SexParent + FoodTreatment + ArrivalTime.mean + ArrivalTime.centered)^3,
data = d)
There are multiple ways to compare models that differ in their random effects; however, all must use REML (not ML). Currently, the preferred approach is to use likelihood ratio tests (LRTs) that have been corrected for “testing on the boundary” (i.e., we are testing against the possibility that variance components are equal to zero). This requires use of the following function to obtain a corrected p-value for the likelihood ratio test. The number of degrees of freedom for the test is the difference in the number of parameters estimated for each model.
REMLBoundaryTest <- function(fullerModel, reducedModel, df) {
  # Likelihood-ratio statistic computed from the REML log-likelihoods
  D <- -2 * c((logLik(reducedModel, REML = TRUE) - logLik(fullerModel, REML = TRUE)))
  # Boundary-corrected p-value: a 50:50 mixture of chi-squared distributions
  # with df and df - 1 degrees of freedom
  p.value <- (pchisq(D, df, lower.tail = FALSE) + pchisq(D, df - 1, lower.tail = FALSE)) / 2
  output <- c(D, df, p.value)
  names(output) <- c("test statistic", "df", "p-value")
  return(output)
}
Using this function, we will conduct two tests. First, we will test whether a random-intercept model is significantly better than a fixed-effects model. These models differ by one degree of freedom (for the random intercept).
REMLBoundaryTest(randomintercept.lmer, fixedeffects.lm, df = 1)
## test statistic df p-value
## 2.763626e+01 1.000000e+00 7.320434e-08
The test result suggests that the random-intercept model is significantly better than the fixed-effects model. Therefore, we will test whether the random-slope model is also a significant improvement over the random-intercept model. These models differ by two degrees of freedom (one for the random slope and one for the correlation between the random intercept and the random slope).[4]
REMLBoundaryTest(randomslope.lmer, randomintercept.lmer, df = 2)
## test statistic df p-value
## 1.495198e+01 2.000000e+00 3.384036e-04
The random-slope model is significantly better than the random-intercept model. The random-slope model appears to have the best random-effect structure among those tested.
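As a rough cross-check (a sketch, not the test used above), anova() with refit = FALSE compares the two REML fits with an uncorrected likelihood-ratio test; its p-value is conservative because it ignores the boundary problem:
anova(randomslope.lmer, randomintercept.lmer, refit = FALSE)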
Now that we have determined the optimal random effect structure, we
will next determine the optimal fixed-effects structure. To do this, we
need to use models fitted using ML, not REML (i.e.,
REML = FALSE). We will use MuMIn to dredge for
the best model, which requires that we tell it to fail if there are any
missing data in the variables used (i.e.,
na.action="na.fail").
randomslope.lmer.ml <- lmer(
NegPerChick.log10 ~ (SexParent + FoodTreatment + ArrivalTime.mean + ArrivalTime.centered)^3 +
(ArrivalTime.centered | Nest), data = d, REML = FALSE, na.action = "na.fail")
Rank models by AICc and save models with \(\Delta\)AICc < 4. You can also plot the table. (Because the model-selection table is too wide for this page, results are not shown).
(dd <- subset(dredge(randomslope.lmer.ml), delta < 4))
plot(dd)
If you run the code above, you will find that our best model contains terms for the intercept, centered arrival time, food treatment, and the interaction between centered arrival time and food treatment. Sex of the parent does not appear in the top model, although it does appear in the 2nd best model, which is almost tied with the best model (\(\Delta\)AICc = 0.11). Therefore, I would consider the importance of parental sex to be ambiguous; however, we will likely make better predictions by omitting it from the final model.
Now we can refit the top-ranked model using REML (and re-including any observations that had missing data).
bestmodel.lmer <- lmer(
NegPerChick.log10 ~ FoodTreatment * ArrivalTime.centered + (ArrivalTime.centered | Nest),
data = d)
Let’s check if our reduced model meets the assumptions, the same way we did for the full model above. (Output not shown to save space.)
residplot(bestmodel.lmer, group = "FoodTreatment")
residplot(bestmodel.lmer, slope = TRUE, group = "FoodTreatment")
bestmodel.resid <- residuals(bestmodel.lmer, scaled = TRUE)
plot(bestmodel.resid ~ FoodTreatment, data = d)
plot(bestmodel.resid ~ ArrivalTime.centered, data = d)
These plots look similar to the ones for the full model above, suggesting that simplifying the model did not negatively impact model adequacy.
The assumptions seem to be satisfied, so let’s look at the coefficient table and the variance components.
S(bestmodel.lmer, brief = TRUE)
##
## Estimates of Fixed Effects:
##                                              Estimate Std. Error        df    t value  Pr(>|t|)
## (Intercept)                                 4.115e-01  2.235e-02 2.934e+01  1.841e+01 1.153e-17
## FoodTreatmentSatiated                      -1.684e-01  1.968e-02 5.873e+02 -8.554e+00 1.035e-16
## ArrivalTime.centered                       -4.415e-02  9.468e-03 3.408e+01 -4.663e+00 4.649e-05
## FoodTreatmentSatiated:ArrivalTime.centered  2.316e-02  1.081e-02 5.509e+02  2.142e+00 3.266e-02
##
## Estimates of Random Effects (Covariance Components):
## Groups Name Std.Dev. Corr
## Nest (Intercept) 0.09099
## ArrivalTime.centered 0.03125 -0.48
## Residual 0.22499
##
## Number of obs: 599, groups: Nest, 27
##
## logLik df AIC BIC
## 2.52 8 10.96 46.13
And get the confidence intervals:
confint(bestmodel.lmer, oldNames = FALSE)
## 2.5 % 97.5 %
## sd_(Intercept)|Nest 0.057831380 0.13143076
## cor_ArrivalTime.centered.(Intercept)|Nest -0.998315638 0.15455504
## sd_ArrivalTime.centered|Nest 0.015236639 0.04851347
## sigma 0.211893295 0.23892094
## (Intercept) 0.367648012 0.45643566
## FoodTreatmentSatiated -0.207196073 -0.12942820
## ArrivalTime.centered -0.063199691 -0.02570132
## FoodTreatmentSatiated:ArrivalTime.centered 0.001712906 0.04446430
We can also obtain predicted values for the random effects (i.e., the intercept and slope for each nest). (Output not shown to save space.)
ranef(bestmodel.lmer)
And get the marginal R2 and conditional R2. Marginal R2 represents the proportion of variance in the response variable explained by the fixed effects only. Conditional R2 is the proportion of variance in the response variable explained by both the fixed effects and the random effects.
r.squaredGLMM(bestmodel.lmer)
## R2m R2c
## [1,] 0.1513617 0.3091264
Our final random-slope model explains 30.9% of the variance in chick
begging (fixed effects alone explain 15.1%) and contains
FoodTreatment, ArrivalTime.centered, and their
interaction as predictors.
Now, let’s plot the fixed effects in our top-ranked model using the
allEffects() function in the effects
package.
plot(allEffects(bestmodel.lmer),
main = NULL,
xlab = "Arrival time (centered for each nest)",
ylab = expression(log[10]("Negotiations per chick")))
Our results suggest that all else being equal, chicks beg less when they are food-satiated as opposed to food-deprived. On average, they beg less later in the night (relative to the centered arrival time for the nest). This effect of arrival time is more pronounced when chicks started as food-deprived. Therefore, the effect of arrival time likely represents chicks becoming satiated as parents provide them with more and more food during the night. There was variation among nests in the average amount of chick begging and in the change in chick begging throughout the night. This might be caused by variation in hunting success among nests.
The orthodont.csv dataset (originally distributed with the nlme package) contains information about the growth of 27 children (16 males, 11 females) from age 8 until age 14. Every two years, the distance between the pituitary and the pterygomaxillary fissure was measured for each child.
The dataset contains the following columns:
- distance: a numeric vector of distances from the pituitary to the pterygomaxillary fissure (mm), measured on x-ray images of the skull (the response variable)
- age: a numeric vector of ages of the subject (in years)
- Subject: a factor indicating the individual on which the measurement was made. Subjects are labelled F01 to F13 for the females and M01 to M16 for the males.
- Sex: a factor with levels Female and Male
Fit a linear mixed model that is appropriate for this dataset.
[1] To use a probit link, you can use family = binomial(link = probit).
[2] If we had overdispersion, we could use the quasi-binomial GLM, a beta-binomial GLM using the vglm() function in the VGAM package, or a generalized linear mixed model using the glmer() function in the lme4 package.
[3] If we had overdispersion, we could use the quasi-Poisson GLM, a negative-binomial GLM using the glm.nb() function in the MASS package, or a generalized linear mixed model using the glmer() or glmer.nb() functions in the lme4 package.
[4] To force the random intercept and random slope to be uncorrelated, you can use || instead of |. For instance, use (ArrivalTime.centered || Nest) instead of (ArrivalTime.centered | Nest).