This document uses the Orthodont data from the nlme package to introduce the concept of random effects in intercepts and slopes. This data set is similar to the fictitious orchard example previously discussed. In the orchard example the response variable was yield of almonds per tree and the covariate was tree trunk diameter. A random sample of 20 orchards was assigned to a control and an additional random sample of 20 orchards was assigned to a treatment. Twenty trees were selected randomly per orchard. In the Orthodont data the response variable is the distance from the pituitary to the pterygomaxillary fissure (mm) measured on X-ray images of the skulls of several kids (from 8 to 14 years old). There were 16 males and 11 females in the study. Each subject was measured every two years, and age was the covariate.
The orchard study was an example of a manipulative experiment where treatments were assigned at random. The Orthodont study is observational because gender was not assigned randomly but observed as it occurred. The following analogies can reasonably be stated to facilitate the understanding of the structure of the Orthodont data:
| Element | Orchard study | Orthodont |
|---|---|---|
| Grouping variable | Orchard | Subject (person) |
| Categorical predictor | Treatment | Sex |
| Covariate | Tree diameter | Age |
| Measured over | Trees | Time |
The data are interesting because the implicit linear model involves a categorical predictor (Sex), a continuous predictor (age), and random effects in both the intercepts and the slopes. It is the simplest model of this most complex type, because it has only one covariate, only one categorical predictor with just two levels, and one grouping factor.
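The code chunks below assume that several packages are attached. This list is inferred from the functions used in this document rather than stated in the original text; in particular, lmerTest is assumed because the anova() output reports Satterthwaite degrees of freedom.
# Packages assumed to be loaded for the code below
library(nlme)      # Orthodont data and lme()
library(lme4)      # lmer()
library(lmerTest)  # Satterthwaite tests in anova() for lmer fits
library(tidyverse) # %>%, as_tibble(), mutate(), ggplot2
library(emmeans)   # emmeans(), emtrends()
library(merTools)  # predictInterval()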
# Load data
data(Orthodont)
str(Orthodont)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 108 obs. of 4 variables:
## $ distance: num 26 25 29 31 21.5 22.5 23 26.5 23 22.5 ...
## $ age : num 8 10 12 14 8 10 12 14 8 10 ...
## $ Subject : Ord.factor w/ 27 levels "M16"<"M05"<"M02"<..: 15 15 15 15 3 3 3 3 7 7 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "outer")=Class 'formula' language ~Sex
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "formula")=Class 'formula' language distance ~ age | Subject
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Age"
## ..$ y: chr "Distance from pituitary to pterygomaxillary fissure"
## - attr(*, "units")=List of 2
## ..$ x: chr "(yr)"
## ..$ y: chr "(mm)"
## - attr(*, "FUN")=function (x)
## ..- attr(*, "source")= chr "function (x) max(x, na.rm = TRUE)"
## - attr(*, "order.groups")= logi TRUE
# Transform into tibble instead of groupedData, etc
orthod <- Orthodont %>%
as_tibble() %>%
mutate(c_age = age - mean(age)) # center age
# Explore data
str(orthod)
## tibble [108 × 5] (S3: tbl_df/tbl/data.frame)
## $ distance: num [1:108] 26 25 29 31 21.5 22.5 23 26.5 23 22.5 ...
## $ age : num [1:108] 8 10 12 14 8 10 12 14 8 10 ...
## $ Subject : Ord.factor w/ 27 levels "M16"<"M05"<"M02"<..: 15 15 15 15 3 3 3 3 7 7 ...
## $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
## $ c_age : num [1:108] -3 -1 1 3 -3 -1 1 3 -3 -1 ...
## - attr(*, "outer")=Class 'formula' language ~Sex
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "formula")=Class 'formula' language distance ~ age | Subject
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Age"
## ..$ y: chr "Distance from pituitary to pterygomaxillary fissure"
## - attr(*, "units")=List of 2
## ..$ x: chr "(yr)"
## ..$ y: chr "(mm)"
## - attr(*, "FUN")=function (x)
## ..- attr(*, "source")= chr "function (x) max(x, na.rm = TRUE)"
## - attr(*, "order.groups")= logi TRUE
summary(orthod)
## distance age Subject Sex c_age
## Min. :16.50 Min. : 8.0 M16 : 4 Male :64 Min. :-3.0
## 1st Qu.:22.00 1st Qu.: 9.5 M05 : 4 Female:44 1st Qu.:-1.5
## Median :23.75 Median :11.0 M02 : 4 Median : 0.0
## Mean :24.02 Mean :11.0 M11 : 4 Mean : 0.0
## 3rd Qu.:26.00 3rd Qu.:12.5 M07 : 4 3rd Qu.: 1.5
## Max. :31.50 Max. :14.0 M08 : 4 Max. : 3.0
## (Other):84
unique(orthod$Subject)
## [1] M01 M02 M03 M04 M05 M06 M07 M08 M09 M10 M11 M12 M13 M14 M15 M16 F01 F02 F03
## [20] F04 F05 F06 F07 F08 F09 F10 F11
## 27 Levels: M16 < M05 < M02 < M11 < M07 < M08 < M03 < M12 < M13 < ... < F11
length(unique(orthod$Subject))
## [1] 27
with(orthod, table(c_age, Subject))
## Subject
## c_age M16 M05 M02 M11 M07 M08 M03 M12 M13 M14 M09 M15 M06 M04 M01 M10 F10 F09
## -3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## Subject
## c_age F06 F01 F05 F07 F02 F08 F03 F04 F11
## -3 1 1 1 1 1 1 1 1 1
## -1 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1
with(orthod, table(c_age, Sex))
## Sex
## c_age Male Female
## -3 16 11
## -1 16 11
## 1 16 11
## 3 16 11
# Plot data
ggplot(
data = orthod,
mapping = aes(
x = age,
y = distance,
group = Subject,
color = Subject
)
) +
geom_line() +
geom_point() +
facet_wrap(~Sex) +
theme(legend.position = "none")
Distance from the pituitary to the pterygomaxillary fissure (mm) as a function of age (years) for 11 females and 16 males. Distances were measured every 2 years. Each line represents a Subject (Pinheiro and Bates 2000).
These data are grouped by Subject. We consider the subjects to be a random sample from the population of kids. As kids grow, they can differ from each other in two ways: each subject can have its own typical distance (its own intercept) and its own growth rate (its own slope).
Therefore, the model has to include Subject effects that are random for both the intercept and the slope. This example is part of the nlme package, which uses a syntax slightly different from that of lme4. We present both formulas. Age is centered by subtracting its average (11 years) so that the intercept is more meaningful: instead of being the distance at age 0, it is the distance at age 11, which is within the scope of the age data.¹
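In equation form, this model can be written as follows (one common parameterization, added here for orientation; \(i\) indexes subjects, \(j\) measurement occasions, and \(\text{Female}_i\) is an indicator that equals 1 for female subjects):
\[\text{distance}_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\,\text{c\_age}_{ij} + \beta_2\,\text{Female}_i + \beta_3\,\text{Female}_i\,\text{c\_age}_{ij} + \epsilon_{ij}\]
\[(b_{0i}, b_{1i}) \sim N(\mathbf{0}, \mathbf{G}), \qquad \epsilon_{ij} \sim N(0, \sigma^2)\]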
# nlme:
ortho_lme1 <- lme(
fixed = distance ~ Sex * c_age,
random = ~ 1 + c_age | Subject,
data = orthod
)
# lme4:
ortho_lmer1 <- lmer(distance ~ Sex * c_age +
(c_age | Subject),
data = orthod
)
Note that in the nlme syntax the formula is separated into two arguments, fixed and random, whereas in lme4 both fixed and random components are in a single formula. In lme4, random effects are specified inside parentheses. If no covariate has a random slope, the random effects reduce to 1 | Subject, where the 1 represents the intercept. When there is a random slope for a covariate, the intercept is implicit: 1 + c_age | Subject is the same as c_age | Subject. To remove the random intercept and keep only the random slope, use -1 + c_age | Subject.
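For reference, the three random-effects specifications just described look like this in lme4 (a sketch to illustrate the syntax only; the object names are arbitrary, and only the intercept-and-slope model is used in the rest of this document):
# Random intercept only
ortho_int <- lmer(distance ~ Sex * c_age + (1 | Subject), data = orthod)
# Random intercept and slope; the intercept is implicit
ortho_both <- lmer(distance ~ Sex * c_age + (1 + c_age | Subject), data = orthod)
# Random slope only, without a random intercept
ortho_slope <- lmer(distance ~ Sex * c_age + (-1 + c_age | Subject), data = orthod)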
For the rest of this document we use lme4::lmer() and related functions. In the future we will come back to nlme::lme() to make adjustments for data that violate the usual assumptions.
For the purpose of learning how to use linear mixed models, assume the goals of the study are to answer the following questions or complete the tasks:
Goal 2 is stated as we typically do, but it is ambiguous. As stated, it does not say which males and females are to be compared. We usually assume that the statement refers to comparing the means of populations. It would be clearer to state: “Determine if there is a difference in the mean distance between males and females.” This statement is clearer, but still not complete: distances vary with age, so at what age should we compare the means? A suitable age would be the average age in the experiment. However, if the growth rates differ between males and females (Goal 1), then the answer changes depending on age, because there is an interaction between sex and age. In such a case, the answer to the question as stated is “It depends on the age at which you make the comparison.”
Goals 1-4 refer to the means of the populations of females and males. Therefore, they do not involve the specific values calculated for random effects. The means average over all subjects in the population. Therefore, these estimates are affected only by the residual variation given by the MSE.
Goal 5 refers to specific subjects who were observed. Therefore, their predicted values include their values for the random effects. The prediction intervals include the uncertainty in the estimated mean line (MSE) as well as the uncertainty in the random effects for the specific individuals, but they do not include the variance among subjects.
Goal 6 refers to a random individual. Thus, no individual random effects are included, but the uncertainty due to the variance among individuals has to be included. These prediction intervals have the most uncertainty and the widest prediction bands.
anova(ortho_lmer1)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Sex 15.947 15.947 1 25 9.2921 0.005375 **
## c_age 151.024 151.024 1 25 87.9989 1.146e-09 ***
## Sex:c_age 8.785 8.785 1 25 5.1186 0.032614 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As usual, when there are interactions, the interaction test is considered first. There is a significant difference between the growth rates of females and males, indicated by the F test in the row of the anova() output labeled Sex:c_age. We can check the size and confidence interval of the difference to determine its practical significance. Let’s assume that a difference of 0.25 mm/year is clinically significant. Slopes can be estimated and compared with the emmeans::emtrends() function. In this case there are only two slopes, so the test from emtrends() is the same as the F test printed in the anova.
(sex_slopes <- emtrends(
object = ortho_lmer1,
specs = ~ Sex * c_age,
var = "c_age"
))
## Sex c_age c_age.trend SE df lower.CL upper.CL
## Male 0 0.784 0.086 25 0.607 0.961
## Female 0 0.480 0.104 25 0.266 0.693
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
pairs(sex_slopes) # Compare slopes
## contrast estimate SE df t.ratio p.value
## Male 0 - Female 0 0.305 0.135 25 2.262 0.0326
##
## Degrees-of-freedom method: kenward-roger
pairs(sex_slopes,
offset = c(-0.25)
) # Compare difference in slope to 0.25
## contrast estimate SE df t.ratio p.value
## Male 0 - Female 0 0.0548 0.135 25 0.407 0.6875
##
## Degrees-of-freedom method: kenward-roger
The first part of the output shows the estimated growth rates for males and females in mm/year. Second, the slopes are compared to each other with a t test. The rates are statistically significantly different. Finally, to determine whether the difference is clinically significant, we compare it to the practically significant difference of 0.25 mm/year. This is specified in the second use of the emmeans::pairs() function, where offset = c(-0.25) indicates that 0.25 should be subtracted from the difference in slopes and the result compared to 0. The difference in slopes between females and males is not statistically different from the clinically important value.²
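If an explicit confidence interval for the difference in slopes is wanted, it can be obtained by applying confint() to the pairwise comparison (a brief sketch; its output is not part of the original document):
# Confidence interval for the Male - Female difference in growth rates
confint(pairs(sex_slopes))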
More important than the statistical significance of the tests, we have found out that between ages 8 and 14, that part of the skull grows at about 0.78 mm/year \(\pm\) 0.086 mm/year for males and 0.48 mm/year \(\pm\) 0.104 mm/year for females. The greater standard error for females is caused by the fact that there were fewer females than males in the study.
The presence of an interaction makes it impossible to give a unique comparison of the absolute distance between females and males, because the answer depends on the age at which the comparison is made. That is why emmeans issues a warning. We assume that the interaction is 0 for the purpose of demonstrating the code for the comparison. The comparison is made at the average c_age, which is 0 because c_age is a centered variable.
# Compare distance assuming that interaction is 0
(sex_means <- emmeans(ortho_lmer1, ~Sex))
## NOTE: Results may be misleading due to involvement in interactions
## Sex emmean SE df lower.CL upper.CL
## Male 25.0 0.486 25 24.0 26.0
## Female 22.6 0.586 25 21.4 23.9
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
pairs(sex_means) # same as F test because there are only two levels
## contrast estimate SE df t.ratio p.value
## Male - Female 2.32 0.761 25 3.048 0.0054
##
## Degrees-of-freedom method: kenward-roger
The result indicates that at age 11, the distance for males is 2.32 mm longer than for females, which is significantly different from 0. The standard error of the difference is 0.761 mm.
Now, assume that we had a specific need to compare distances at 8 years of age. We can estimate and compare the estimated mean distances for males and females at that age. Because c_age is centered, when age = 8, c_age = 8 - 11 = -3.
(sex_means_8 <- emmeans(ortho_lmer1,
var = "c_age",
specs = ~Sex,
at = list(c_age = -3)
))
## NOTE: Results may be misleading due to involvement in interactions
## Sex emmean SE df lower.CL upper.CL
## Male 22.6 0.527 25 21.5 23.7
## Female 21.2 0.635 25 19.9 22.5
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
pairs(sex_means_8)
## contrast estimate SE df t.ratio p.value
## Male - Female 1.41 0.825 25 1.705 0.1006
##
## Degrees-of-freedom method: kenward-roger
At 8 years old, females and males are not significantly different in the distance between the two points in the skull. More interestingly, we estimate that the distance is 1.41 mm longer in males than in females, with a standard error of 0.825 mm.
The population estimates can be described both with parameter estimates and with a graph made using those estimates. It is interesting to have a general idea of how much uncertainty we have about the population lines for males and females, but also to see how much the lines vary from one individual to another. Keep in mind that the population line does not necessarily apply to any individual; it represents the central tendency of the population of growth lines. Always keep in mind that population values are characteristics of populations, not of individuals.³
In terms of parameter estimates, the lines are:
\[\widehat{\text{distance}} = 25.0\ (\pm 0.486) + 0.784\ (\pm 0.086) \times (\text{age} - 11) \qquad \text{for males}\] and \[\widehat{\text{distance}} = 22.6\ (\pm 0.586) + 0.480\ (\pm 0.104) \times (\text{age} - 11) \qquad \text{for females}\]
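The same intercepts and slopes can be recovered directly from the fixed-effect estimates, assuming the default treatment contrasts with Male as the reference level (a sketch; fe is a hypothetical helper object, not part of the original code):
# Fixed effects: (Intercept) and c_age describe the Male line;
# SexFemale and SexFemale:c_age are the Female adjustments.
fe <- fixef(ortho_lmer1)
c(
  male_intercept   = unname(fe["(Intercept)"]),
  female_intercept = unname(fe["(Intercept)"] + fe["SexFemale"]),
  male_slope       = unname(fe["c_age"]),
  female_slope     = unname(fe["c_age"] + fe["SexFemale:c_age"])
)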
To make a nice graph, we get estimated points on the lines and the corresponding confidence intervals. First we create a data frame to be used as the “newdata”, then we apply the merTools::predictInterval() function and organize the results for the graph. Finally we add thin lines representing the individual subjects.
The usual predict() method does not provide confidence intervals for lmer() models. This is because, for these models, the distribution of the estimated parameters is not known exactly and we cannot assume that the sample size is large enough for asymptotic normality. merTools::predictInterval() uses simulation to approximate confidence and prediction intervals without relying on the normality of the estimated quantities.
newd <- expand.grid(
c_age = -20:20 / 5,
Sex = c("Female", "Male"),
Subject = "M10" # not used, but needed for predictInterval
)
newd <- newd %>%
bind_cols(ortho_lmer1 %>%
predictInterval(
newdata = newd,
which = "fixed",
level = 0.95,
n.sims = 3000,
stat = "mean",
include.resid.var = FALSE,
)) %>%
mutate(
dist_hat = predict(ortho_lmer1,
newdata = newd,
re.form = NA
)
)
## Warning: executing %dopar% sequentially: no parallel backend registered
ggplot(
data = newd,
aes(
y = dist_hat,
x = c_age,
group = Sex,
color = Sex
)
) +
geom_line(size = 2) +
geom_ribbon(
aes(
ymax = upr,
ymin = lwr,
fill = Sex
),
alpha = 0.4,
linetype = 0
) +
geom_line(
data = orthod,
aes(
y = distance,
x = c_age,
group = Subject,
color = Sex
),
size = 0.5,
linetype = "dotted"
) +
xlab("Centered age in years (age - 11)") +
ylab("Distance in mm")
The resulting graph clearly represents the population lines and the subject data. It would also be informative to visualize the estimated lines for each of the subjects. For this we have to make Subject-level predictions that incorporate the random effects of Subject on the slope and intercept. These are called “conditional” predictions because they are for given Subjects.
# Create conditional predictions
new_subj_data <- expand.grid(
c_age = c(-3, 3),
Subject = unique(orthod$Subject)
) %>%
mutate(
Sex = ifelse(
grepl("M", Subject),
"Male",
"Female"
)
) %>%
mutate(dist_hat = predict(ortho_lmer1,
newdata = .,
re.form = ~ (c_age | Subject)
))
# We reuse most of the previous graph
ggplot(
data = newd,
aes(
y = dist_hat,
x = c_age,
group = Sex,
color = Sex
)
) +
geom_line(size = 2) +
geom_ribbon(
aes(
ymax = upr,
ymin = lwr,
fill = Sex
),
alpha = 0.4,
linetype = 0
) +
geom_line(
data = new_subj_data,
aes(
y = dist_hat,
x = c_age,
group = Subject,
color = Sex
),
alpha = 0.5,
size = 0.5,
linetype = "solid"
) +
geom_point(
data = orthod,
aes(
y = distance,
x = c_age,
group = Subject,
color = Sex
)
) +
xlab("Centered age in years (age - 11)") +
ylab("Distance in mm")
BLUE stands for best linear unbiased estimator. In this case we are calculating a confidence interval for a specific point on the thick (population level) red line in the previous graph. Imagine that we slice the graph vertically at c_age = 2 (age = 13) and get the height of the line and the extremes of the red ribbon.
newd_female_13 <- data.frame(
c_age = 2,
Sex = "Female",
Subject = "F01"
) # Needed for predictInterval but not used
ortho_lmer1 %>%
predictInterval(
newdata = newd_female_13,
which = "fixed",
level = 0.95,
n.sims = 3000,
stat = "mean",
include.resid.var = FALSE,
)
## fit upr lwr
## 1 23.60407 24.87121 22.33586
These lines are already sketched on the previous graph, but they did not include intervals to indicate their uncertainty. Lines for specific individuals that were part of the study would be much more interesting to the subjects and their parents than the population line. On the other hand, the population line and its confidence intervals would be more useful for a person working in public health. It would most likely be a poor choice to treat, for example, subject M13 using the population line, or to use the line for M13 to make public health decisions.
BLUP stands for best linear unbiased predictor. BLUPs include the uncertainty in the estimation of the random effects.
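The predicted random effects and the resulting subject-specific coefficients can be inspected directly (a short sketch; its output is not part of the original document):
# Each Subject's deviation from the population intercept and slope
head(ranef(ortho_lmer1)$Subject)
# Subject-specific coefficients (fixed effects plus random effects), i.e., the BLUP lines
head(coef(ortho_lmer1)$Subject)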
newd_M11_M13 <- expand.grid(
c_age = -20:20 / 5,
Sex = "Male",
Subject = c("M11", "M13")
)
newd_M11_M13 <- newd_M11_M13 %>%
bind_cols(ortho_lmer1 %>%
predictInterval(
newdata = newd_M11_M13,
which = "full",
level = 0.95,
n.sims = 3000,
stat = "mean",
include.resid.var = FALSE,
)) %>%
mutate(
dist_hat = predict(ortho_lmer1,
newdata = newd_M11_M13,
re.form = ~ (c_age | Subject)
)
)
ggplot(
data = newd_M11_M13,
aes(
y = dist_hat,
x = c_age,
group = Subject,
color = Subject
)
) +
geom_line(size = 1) +
geom_ribbon(
aes(
ymax = upr,
ymin = lwr,
fill = Subject
),
alpha = 0.4,
linetype = 0
) +
geom_point(
data = orthod %>%
dplyr::filter(Subject %in% c("M11", "M13")),
aes(
y = distance,
x = c_age,
group = Subject,
color = Subject
)
) +
xlab("Centered age in years (age - 11)") +
ylab("Distance in mm")
The graph shows the effects of shrinkage. It is clear that both lines could fit their data better if they were not restricted to be members of a family of lines whose parameters are random draws from a normal distribution. Unrestricted, the blue line would have a steeper slope and the red line would have a flatter slope. Subject M13 (blue line) has unusually fast growth relative to the population mean, and its line was “shrunk” towards the population line, a sort of “bet-hedging” strategy that incorporates population information into the estimate of the subject’s response. Conversely, subject M11 had unusually slow growth, and its slope was “shrunk” upwards towards the population mean. This is the impact of using Subject as a random-effect factor as opposed to a fixed effect. This effect is called “shrinkage” or “regularization,” and you can read more about its useful properties in the excellent post by M. Clark.
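One way to see the shrinkage numerically (a sketch that is not part of the original analysis) is to compare the subject-specific slopes from the mixed model with the slopes of separate per-subject regressions, which are not pulled towards the population mean:
# BLUP-based slopes for M11 and M13 from the mixed model
coef(ortho_lmer1)$Subject[c("M11", "M13"), "c_age"]
# Unrestricted slopes from separate ordinary least squares fits for the same subjects
orthod %>%
  dplyr::filter(Subject %in% c("M11", "M13")) %>%
  group_by(Subject) %>%
  summarise(ols_slope = coef(lm(distance ~ c_age))[["c_age"]])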
The question to be answered is: given all the data in the study and a new, unobserved female subject who is 8 years old, what is her distance now, and what will her growth line look like?
I will not include the calculations here because they are a bit lengthy. However, I want to bring to your attention the fact that making a prediction for a specific unobserved subject whose age is known involves a lot of uncertainty. The prediction is much less certain than an estimate of the mean for all subjects of that age and sex.
The question can be understood intuitively as follows: I am looking at the graph above where all individual subject lines are plotted. I have picked one of the subjects at age 8, but I do not tell you which one. What is your best guess of the distance for that subject at age 8, and how certain are you? Give me a range of values that will contain 95% of my choices as I repeat the choice at random.
The figure suggests that the interval should be centered on the thick red line for the population of females and that it should range approximately from 18 to 24 mm for an 8-year-old and from 20 to 27 mm for a 14-year-old.
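One way such an interval could be approximated (a sketch under stated assumptions, not the calculation omitted above) is to simulate responses for a new subject from the fitted model. With re.form = NA, simulate() draws new Subject random effects and new residuals, so the interval reflects variation among unobserved subjects, although it ignores the uncertainty in the fixed-effect estimates:
# Hypothetical new female subject at age 8 (c_age = -3)
new_female_8 <- data.frame(
  c_age = 8 - 11,
  Sex = factor("Female", levels = levels(orthod$Sex)),
  Subject = factor("new") # a subject not in the data
)
set.seed(123)
sims <- simulate(ortho_lmer1,
  nsim = 5000,
  newdata = new_female_8,
  re.form = NA,            # draw new random intercepts and slopes
  allow.new.levels = TRUE
)
# Approximate 95% prediction interval and median for the new subject's distance
quantile(unlist(sims), probs = c(0.025, 0.5, 0.975))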
I think this is important because CE Advisors create and consume information that typically focuses on the population, but they give advice to specific subjects, orchards, farms, etc. that typically have not participated in the studies. They are unobserved “subjects.” It would be incorrect to make statements about treatment effects, varieties, etc. for specific farms, orchards, or subjects using population-level confidence intervals.
1. Centering and standardizing predictor variables has multiple benefits in statistical modeling. In addition to making the intercept more meaningful and estimated with less variance, it reduces the collinearity between powers of the same predictor, as in polynomial regression.↩︎
2. Keep in mind that this “clinically” important value is made up. I do not know whether such a value has even been defined.↩︎
3. For example, consider a population of tosses of a fair coin where head = 1 and tail = 0. The mean of this population is 0.5, which does not describe any individual coin toss.↩︎