library(labelled) # labeling data
library(rstatix) # summary statistics
library(ggpubr) # convenient summary statistics and plots
library(GGally) # advanced plot
library(car) # useful for anova/wald test
library(Epi) # easy getting CI for model coef/pred
library(lme4) # linear mixed-effects models
library(lmerTest) # test for linear mixed-effects models
library(emmeans) # marginal means
library(multcomp) # CI for linear combinations of model coef
library(geepack) # generalized estimating equations
library(ggeffects) # marginal effects, adjusted predictions
library(gt) # nice tables
library(tidyverse) # for everything (data manipulation, visualization, coding, and more)
theme_set(theme_minimal() + theme(legend.position = "bottom")) # theme for ggplot
A review of Longitudinal Data Analysis in R
(Last compiled on Sat Dec 10, 2022)
Aim
The aim of this document is to review how to analyze longitudinal data in R.
The material presented in this document is taken from the lecture Regression models for dependent data by Nicola Orsini, where the Stata code and output have been translated to equivalent R code.
The methodologies are demonstrated by re-analyzing two working examples which are reported in Fitzmaurice GM, Laird NM and Ware JH. (2004) Applied Longitudinal Analysis, Wiley.
R packages
The following R packages have been used in this review. A short comment on how each package is used in this document follows the corresponding library() call. Be sure to install the packages before loading them.
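A minimal sketch for installing any missing packages (run once; the package list simply mirrors the library() calls above, and the object name pkgs is illustrative):
pkgs <- c("labelled", "rstatix", "ggpubr", "GGally", "car", "Epi", "lme4", "lmerTest",
          "emmeans", "multcomp", "geepack", "ggeffects", "gt", "tidyverse")
# install only the packages that are not already available
install.packages(setdiff(pkgs, rownames(installed.packages())))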
Longitudinal Data Analysis
The key feature of Longitudinal Data Analysis is the analysis of longitudinal data, i.e. repeated observations taken on individuals. Oftentimes, the primary goal is to characterize the change in response over time (or measurements), and factors that influence this change.
Observations that belong to the same subject (or group/cluster) tend to be more similar than observations from different groups. This correlation must be accounted for in order to obtain valid inference.
Longitudinal data is a special case of hierarchical data, where observations are nested within hierarchically ordered levels, such as measurements within individuals, or pupils within schools (and within regions). Useful R packages for analyzing these types of correlated data are listed in the dedicated CRAN Task View at https://cran.r-project.org/web/views/MixedModels.html.
Notation
Response variables
\(Y_{ij}\) is the response variable for \(i^{th}\) individual (\(i = 1, \dots, N\)) measured at time \(t_{ij}\) (\(j = 1,\dots,n\)).
\(\mu_j = E[Y_{ij}]\) is the mean of \(Y_{ij}\), which provides a measure of the location of the center of the distribution of \(Y_{ij}\).
\(\sigma^2 = \textrm{Var}[Y_{ij}]\) is the variance of \(Y_{ij}\), which provides a measure of the spread or dispersion of the values of \(Y_{ij}\) around their mean.
\(\sigma_{jk} = \textrm{Cov}[Y_{ij}, Y_{ik}]\) is the covariance of \(Y_{ij}\) and \(Y_{ik}\), which provides a measure of the linear dependence between the variables (\(\sigma_{jk} = 0\) corresponds to no linear dependence).
\(\rho_{jk} = \frac{\sigma_{jk}}{\sigma_{j} \sigma_{k}}\) is the correlation between \(Y_{ij}\) and \(Y_{ik}\), a measure of dependence free of scales. It ranges from -1 (perfect negative linear correlation) to 1 (perfect positive linear correlation).
Repeated measurements
\(Y_i = (Y_{i1}, \dots, Y_{in})^\top\) is the vector of repeated measures, with symmetric matrix of variances and covariances \(\textrm{Cov}(Y_i) = \Sigma_i\)
\[ \Sigma_i = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdot & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \sigma_{n1} & \cdot & \cdot & \sigma_n^2 \end{bmatrix} \]
A matrix of correlations can be easily obtained from the variance/covariance matrix.
\[ \textrm{Corr}(Y_i) = \rho_i = \begin{bmatrix} 1 & \rho_{12} & \cdot & \rho_{1n} \\ \rho_{21} & 1 & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \rho_{n1} & \cdot & \cdot & 1 \end{bmatrix} \]
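In R, this conversion can be done with cov2cor(); a small illustrative example with an arbitrary 2 × 2 covariance matrix:
# toy covariance matrix: variances 4 and 9, covariance 2
S <- matrix(c(4, 2, 2, 9), nrow = 2)
cov2cor(S) # correlation = 2 / (2 * 3) ≈ 0.33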
Based on empirical observations from longitudinal studies, the correlations among repeated measurements often present these features:
- correlations are positive;
- correlations decrease with increasing time separation;
- correlations between repeated measures rarely ever approach zero;
- correlation between a pair of repeated measures taken very closely together in time rarely approaches one.
Dental Growth Data
Data
This study was conducted in 16 boys and 11 girls, who at ages 8, 10, 12, and 14 had their distance (mm) from the center of the pituitary gland to the pteryomaxillary fissure measured. Changes in pituitary-pteryomaxillary distances during growth are important in orthodontic therapy.
The goals of the study were to describe the distance in boys and girls as simple functions of age, and then to compare the functions for boys and girls.
Reference: Potthoff, R.F. and Roy, S.W. (1964). A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika, 51, 313-326
The dental dataset contains the following variables:
id = a sequential number;
sex = sex, a factor with categories 0 = “Girl”, 1 = “Boy”;
y8 = Measure at age 8;
y10 = Measure at age 10;
y12 = Measure at age 12;
y14 = Measure at age 14.
Load data into R from the URL.
#load("data/dental.RData")
load(url("http://alecri.github.io/downloads/data/dental.RData"))
head(dental)# A tibble: 6 × 6
id sex y8 y10 y12 y14
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 Girl 21 20 21.5 23
2 2 Girl 21 21.5 24 25.5
3 3 Girl 20.5 24 24.5 26
4 4 Girl 23.5 24.5 25 26.5
5 5 Girl 21.5 23 22.5 23.5
6 6 Girl 20 21 21 22.5
The data is presented in a wide format, where repeated measurements are contained in separate variables (only one row for each individual).
Oftentimes, a long format, where repeated measurements are reported as separate rows, is preferable for the analysis. The pivot_longer function is useful for restructuring a dataset from a wide to a long format:
dental_long <- pivot_longer(dental, cols = starts_with("y"),
names_to = "measurement", values_to = "distance") %>%
mutate(
age = parse_number(measurement),
measurement = fct_inorder(paste("Measure at age", age))
) %>%
set_variable_labels(
age = "Age of the child at measurement",
measurement = "Label for time measurement",
distance = "Measurement"
)
head(dental_long)# A tibble: 6 × 5
id sex measurement distance age
<dbl> <fct> <fct> <dbl> <dbl>
1 1 Girl Measure at age 8 21 8
2 1 Girl Measure at age 10 20 10
3 1 Girl Measure at age 12 21.5 12
4 1 Girl Measure at age 14 23 14
5 2 Girl Measure at age 8 21 8
6 2 Girl Measure at age 10 21.5 10
Descriptive statistics
Mean response over time
group_by(dental_long, age) %>%
get_summary_stats(distance)# A tibble: 4 × 14
age variable n min max median q1 q3 iqr mad mean sd
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8 distance 27 16.5 27.5 22 21 23.2 2.25 1.48 22.2 2.43
2 10 distance 27 19 28 23 21.5 24.5 3 2.22 23.2 2.16
3 12 distance 27 19 31 24 23 26 3 2.22 24.6 2.82
4 14 distance 27 19.5 31.5 26 25 27.8 2.75 2.22 26.1 2.77
# … with 2 more variables: se <dbl>, ci <dbl>
Overall, the mean response increases with age. The variance of distance also increases with age.
The mean distance increases by 4 mm (26-22) from age 8 to age 14 years.
ggplot(dental_long, aes(measurement, distance, fill = measurement)) +
geom_boxplot() +
geom_jitter(width = 0.2) +
guides(fill = "none") +
labs(x = "", y = "Dental growth, mm")group_by(dental_long, sex, measurement) %>%
get_summary_stats(distance, show = c("mean", "sd"))# A tibble: 8 × 6
sex measurement variable n mean sd
<fct> <fct> <chr> <dbl> <dbl> <dbl>
1 Girl Measure at age 8 distance 11 21.2 2.12
2 Girl Measure at age 10 distance 11 22.2 1.90
3 Girl Measure at age 12 distance 11 23.1 2.37
4 Girl Measure at age 14 distance 11 24.1 2.44
5 Boy Measure at age 8 distance 16 22.9 2.45
6 Boy Measure at age 10 distance 16 23.8 2.14
7 Boy Measure at age 12 distance 16 25.7 2.65
8 Boy Measure at age 14 distance 16 27.5 2.08
The average response increases both among boys and girls.
ggplot(dental_long, aes(sex, distance, fill = measurement)) +
geom_boxplot() +
labs(x = "", y = "Dental growth, mm", fill = "")group_by(dental_long, sex, measurement) %>%
summarise(mean_distance = mean(distance), .groups = "drop") %>%
ggplot(aes(sex, mean_distance, fill = measurement, label = round(mean_distance))) +
geom_col(position = "dodge") +
geom_text(position = position_dodge(width = 0.9), vjust = -0.5) +
coord_flip() +
labs(x = "", y = "Mean Dental growth, mm", fill = "")Correlation of responses
# co-variance matrix
cov_obs <- select(dental, starts_with("y")) %>%
cov()
cov_obs y8 y10 y12 y14
y8 5.925926 3.285256 4.875356 4.039886
y10 3.285256 4.653846 3.858974 4.532051
y12 4.875356 3.858974 7.938746 6.197293
y14 4.039886 4.532051 6.197293 7.654558
# correlation matrix
cov2cor(cov_obs) y8 y10 y12 y14
y8 1.0000000 0.6255833 0.7108079 0.5998338
y10 0.6255833 1.0000000 0.6348775 0.7593268
y12 0.7108079 0.6348775 1.0000000 0.7949980
y14 0.5998338 0.7593268 0.7949980 1.0000000
A positive (linear) correlation is observed for consecutive measurements.
ggpairs(select(dental, starts_with("y")), lower = list(continuous = "smooth"))
ggpairs(dental, mapping = aes(colour = sex), columns = 3:6,
        lower = list(continuous = "smooth"))
Trajectories over time
Many alternative graphical representations are possible. Here are the most common ones.
group_by(dental_long, sex, age) %>%
summarise(mean = list(mean_ci(distance)), .groups = "drop") %>%
unnest_wider(mean) %>%
mutate(agex = age - .05 + .05*(sex == "Boy")) %>%
ggplot(aes(agex, y, col = sex, shape = sex)) +
geom_point() +
geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2) +
geom_line() +
labs(x = "Age, years", y = "Mean Dental growth, mm", shape = "Sex", col = "Sex")Plot of individual data over time
ggplot(dental_long, aes(age, distance, col = factor(id))) +
geom_point() +
geom_line() +
facet_wrap(~ id) +
labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
guides(col = guide_legend(nrow = 3))
Spaghetti plot
ggplot(dental_long, aes(age, distance, col = factor(id))) +
geom_line() +
labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
guides(col = guide_legend(nrow = 3))
Spaghetti plot by sex
ggplot(dental_long, aes(age, distance)) +
geom_line(aes(group = factor(id))) +
geom_smooth() +
facet_grid(~ sex)
Research question
In the next sections, we are going to answer the following three main questions:
- Time effect: What is the shape of the trajectory of the mean response over time?
- Group effect: What is the average difference between groups of individuals?
- Interaction between time and group: How does the relationship between the response and time vary across groups of individuals?
Linear Mixed Models
In these models, some subset of the regression coefficients vary randomly from one individual to another, accounting for natural heterogeneity across subjects. Individuals in the population are assumed to have their own subject-specific mean response trajectories over time.
The mean response is modeled as a combination of population characteristics (fixed effects) assumed to be shared by all individuals and subject-specific effects (random effects) that are unique to a particular individual. Linear Mixed Models are a particular type of hierarchical model containing both fixed and random effects.
Empty model
The equation of an empty mixed-effects model, also known as a one-way random-effects ANOVA model, is \(Y_{ij} = \beta_0 + u_{0i} + \epsilon_{ij}\), where \(\beta_0\) is the intercept, \(u_{0i}\) the random effect, and \(\epsilon_{ij}\) the residual error term.
The model has the following distributional assumptions:
- \(u_{0i} \sim N(0, \sigma_{u_0}^2)\)
- \(\epsilon_{ij} \sim N(0, \sigma^2)\)
- \(u_{0i} \perp \!\!\! \perp \epsilon_{ij}\)
Interpretation:
- \(\beta_0\) is the fixed effect representing the overall mean of the response variable \(Y_{ij}\)
- \(u_{0i}\) is a random variable of which we estimate the variance. It is the individual-specific deviation from \(\beta_0\) that can be predicted from the model and observed data.
The conditional mean response for a specific individual is \(E[Y_{ij} | u_{0i}] = \beta_0 + u_{0i}\).
The marginal mean response in the population (i.e., averaged over all individuals in population) is \(E[Y_{ij}] = \beta_0\)
For the model with randomly varying intercepts, the marginal covariance has the following compound symmetry pattern:
\[ \Sigma_i = \begin{bmatrix} \sigma_{u_0}^2 + \sigma^2 & \sigma_{u_0}^2 & \cdot & \sigma_{u_0}^2 \\ \sigma_{u_0}^2 & \sigma_{u_0}^2 + \sigma^2 & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \sigma_{u_0}^2 & \cdot & \cdot & \sigma_{u_0}^2 + \sigma^2 \end{bmatrix} \]
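This pattern follows directly from the independence assumptions above: for two measurements \(j \neq k\) on the same individual,
\[\textrm{Cov}(Y_{ij}, Y_{ik}) = \textrm{Cov}(u_{0i} + \epsilon_{ij}, u_{0i} + \epsilon_{ik}) = \textrm{Var}(u_{0i}) = \sigma_{u_0}^2, \qquad \textrm{Var}(Y_{ij}) = \sigma_{u_0}^2 + \sigma^2\]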
Intraclass correlation
The common correlation among repeated measures implicit in the random intercept model is known as intraclass correlation (ICC), which is defined as:
\[\textrm{ICC} = \frac{\sigma_{u_0}^2}{\sigma_{u_0}^2 + \sigma^2}\]
The ICC can be interpreted as the proportion of the total variance that is “explained” by between-subject variability. It can take any value between 0 and 1.
Estimation
\(\beta\) coefficients can be estimated using Maximum Likelihood (ML) estimation, which gives the estimates that maximize the log-likelihood, i.e. the probability of the observed data given the parameters.
Maximum Likelihood does not take into account the degrees of freedom used for estimating the fixed effects when estimating the variance components. This can lead to biased (underestimated) estimates of the variance components in data with a relatively small number of individuals.
The alternative Restricted Maximum Likelihood (REML) can overcome this limitation by maximizing a marginal or residual (invariant to fixed effects) likelihood.
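In lmer, REML is the default and ML estimates can be requested with REML = FALSE. A minimal sketch using the dental data loaded above (the object names are illustrative):
fit_reml <- lmer(distance ~ 1 + (1 | id), data = dental_long, REML = TRUE)
fit_ml <- lmer(distance ~ 1 + (1 | id), data = dental_long, REML = FALSE)
# compare the estimated variance components under the two methods
VarCorr(fit_reml)
VarCorr(fit_ml)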
Test linear hypotheses
Different functions and packages can be used for performing inference on linear combinations of model coefficients, which may be helpful for answering relevant scientific research questions. We are going to use the glht function (general linear hypothesis testing) in the multcomp package.
Additional packages and functions can be useful for visualizing confidence intervals for \(\beta\) coefficients (ci.lin in the Epi package, tidy), testing nested models (Anova in the car package), and estimating marginal means (emmeans package).
For example, we might want to test a subset of the fixed effects: \(H_0: \beta_0 = 0\) vs \(H_1: \beta_0 \neq 0\). A Wald-type test can be used for testing this type of hypothesis.
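For a single coefficient, the Wald statistic is the squared ratio of the estimate to its standard error, which under \(H_0\) approximately follows a \(\chi^2_1\) distribution:
\[W = \left(\frac{\hat{\beta}_0}{\textrm{SE}(\hat{\beta}_0)}\right)^2 \sim \chi^2_1\]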
Fitting model in R
Linear mixed-effects models can be fitted using the lmer function in the lme4 package.
The first part of the model formula specifies the fixed-effects part of the model, while the second part, in parentheses, specifies the random components.
lin_0 <- lmer(distance ~ 1 + (1 | id), data = dental_long)
summary(lin_0)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ 1 + (1 | id)
Data: dental_long
REML criterion at convergence: 515.4
Scaled residuals:
Min 1Q Median 3Q Max
-3.2400 -0.5277 -0.1073 0.4732 2.7687
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 3.752 1.937
Residual 4.930 2.220
Number of obs: 108, groups: id, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 24.0231 0.4297 26.0000 55.91 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ci.lin(lin_0) Estimate StdErr z P 2.5% 97.5%
(Intercept) 24.02315 0.4296605 55.91193 0 23.18103 24.86527
The estimated marginal mean of the dental distance is \(\beta_0 = 24.02\) mm.
The estimated variance of the random effect reflecting between-subject variability is 3.752; the estimated variance of the error term reflecting within-subject variability is 4.930. The correlation between any two repeated measures (ICC) is equal to \(3.75/(3.75 + 4.93) = 0.43\).
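The same ICC can be computed directly from the estimated variance components (a small sketch using the lin_0 fit above):
vc <- as.data.frame(VarCorr(lin_0))
# between-subject variance divided by total variance
vc$vcov[vc$grp == "id"] / sum(vc$vcov)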
Variance components can be tested using an ANOVA-like table that can be derived with the ranova function in the lmerTest package.
ranova(lin_0)ANOVA-like table for random-effects: Single term deletions
Model:
distance ~ (1 | id)
npar logLik AIC LRT Df Pr(>Chisq)
<none> 3 -257.68 521.36
(1 | id) 2 -269.14 542.28 22.92 1 1.689e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small \(p\)-value suggests evidence of between-individual heterogeneity, which supports choosing a mixed-effects model over a fixed-effects-only model.
Graphical presentation of the model
ggplot(dental_long, aes(id, distance)) +
geom_point(aes(col = measurement, shape = measurement)) +
geom_point(data = group_by(dental_long, id) %>%
summarise(distance = mean(distance), .groups = "drop"),
aes(col = "Mean", shape = "Mean"), size = 2.5) +
geom_hline(yintercept = mean(dental_long$distance)) +
scale_shape_manual(values = c(4, 19, 19, 19, 19)) +
labs(x = "Child id", y = "Dental growth, mm", col = "Measurement", shape = "Measurement")Time effect
Question: Is the mean response varying with time?
One modeling strategy is to use indicator variables to contrast mean responses at different occasions. In the dental growth example we have measurements of the response at age 8 (baseline), 10, 12, and 14 years-old:
\[E[Y_{ij}] = \beta_0 + \beta_1X_{ij1} + \beta_2X_{ij2} + \beta_3X_{ij3}\]
where \(X_{ij1}\), \(X_{ij2}\), and \(X_{ij3}\) are indicator variables for age 10, 12, and 14 years respectively. Measurement at the age of 8 years is the referent.
lin_age <- lmer(distance ~ measurement + (1 | id), data = dental_long)
summary(lin_age)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ measurement + (1 | id)
Data: dental_long
REML criterion at convergence: 443.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.7376 -0.5248 0.0153 0.4027 3.7212
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 4.465 2.113
Residual 2.078 1.442
Number of obs: 108, groups: id, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 22.1852 0.4923 43.3911 45.066 < 2e-16 ***
measurementMeasure at age 10 0.9815 0.3924 78.0000 2.501 0.0145 *
measurementMeasure at age 12 2.4630 0.3924 78.0000 6.277 1.80e-08 ***
measurementMeasure at age 14 3.9074 0.3924 78.0000 9.958 1.52e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) mMaa10 mMaa12
msrmntMaa10 -0.399
msrmntMaa12 -0.399 0.500
msrmntMaa14 -0.399 0.500 0.500
The mean dental distance at age 8 years (baseline) is 22.18 mm.
The difference between the mean responses of 10 and 8 years-old children is \(\beta_1 = 0.98\) mm.
The difference between the mean responses of 12 and 8 years-old children is \(\beta_2 = 2.46\) mm.
The difference between the mean responses of 14 and 8 years-old children is \(\beta_3 = 3.91\) mm.
The emmeans function can be useful for computing marginal means with corresponding confidence intervals.
tidy(emmeans(lin_age, "measurement"), conf.int = TRUE)# A tibble: 4 × 8
measurement estimate std.error df conf.low conf.high stati…¹ p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Measure at age 8 22.2 0.492 43.4 21.2 23.2 45.1 4.45e-38
2 Measure at age 10 23.2 0.492 43.4 22.2 24.2 47.1 7.05e-39
3 Measure at age 12 24.6 0.492 43.4 23.7 25.6 50.1 5.02e-40
4 Measure at age 14 26.1 0.492 43.4 25.1 27.1 53.0 4.41e-41
# … with abbreviated variable name ¹statistic
We can test whether the mean response is constant over time by testing the null hypothesis that all the regression coefficients used to model time are simultaneously equal to zero (\(H_0: \beta_1 = \beta_2 = \beta_3 = 0\)).
Anova(lin_age)Analysis of Deviance Table (Type II Wald chisquare tests)
Response: distance
Chisq Df Pr(>Chisq)
measurement 114.12 3 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results from the Wald-test suggest that the trajectory of the mean response over time is not flat.
We can present the estimated trajectory from the fitted model using the predict function.
dental_fit <- bind_cols(
dental_long, pred_age = predict(lin_age, re.form = ~ 0)
)
ggplot(dental_fit, aes(age, distance)) +
geom_line(aes(group = factor(id))) +
geom_point(aes(y = pred_age), col = "blue", size = 2) +
labs(x = "Age, years", y = "Dental growth, mm")Group effect
Question: Is the mean response varying in the two groups of individuals?
\[E[Y_{ij}] = \beta_0 + \beta_1X_{ij1} + \beta_2X_{ij2} + \beta_3X_{ij3} + \beta_4X_{ij4}\] where \(X_{ij1}\), \(X_{ij2}\), and \(X_{ij3}\) are indicator variables for age 10, 12, and 14 years respectively. And \(X_{ij4}\) is the indicator variable for sex (1 for boys and 0 for girls).
NB: The difference between the mean responses of boys and girls is \(E[Y_{ij} | X_{ij4} = 1] - E[Y_{ij} | X_{ij4} = 0] = \beta_4\), that is adjusted or controlled for time.
lin_agesex <- lmer(distance ~ measurement + sex + (1 | id), data = dental_long)
summary(lin_agesex)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ measurement + sex + (1 | id)
Data: dental_long
REML criterion at convergence: 433.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.8225 -0.5116 -0.0066 0.4169 3.6565
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 3.260 1.805
Residual 2.078 1.442
Number of obs: 108, groups: id, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 20.8098 0.6335 33.8026 32.850 < 2e-16 ***
measurementMeasure at age 10 0.9815 0.3924 78.0000 2.501 0.01447 *
measurementMeasure at age 12 2.4630 0.3924 78.0000 6.277 1.80e-08 ***
measurementMeasure at age 14 3.9074 0.3924 78.0000 9.958 1.52e-15 ***
sexBoy 2.3210 0.7614 25.0000 3.048 0.00538 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) mMaa10 mMaa12 mMaa14
msrmntMaa10 -0.310
msrmntMaa12 -0.310 0.500
msrmntMaa14 -0.310 0.500 0.500
sexBoy -0.712 0.000 0.000 0.000
ci.lin(lin_agesex) Estimate StdErr z P
(Intercept) 20.8097643 0.6334777 32.850032 1.137820e-236
measurementMeasure at age 10 0.9814815 0.3923780 2.501367 1.237147e-02
measurementMeasure at age 12 2.4629630 0.3923780 6.277017 3.451314e-10
measurementMeasure at age 14 3.9074074 0.3923780 9.958274 2.320555e-23
sexBoy 2.3210227 0.7614168 3.048294 2.301443e-03
2.5% 97.5%
(Intercept) 19.5681707 22.051358
measurementMeasure at age 10 0.2124348 1.750528
measurementMeasure at age 12 1.6939163 3.232010
measurementMeasure at age 14 3.1383607 4.676454
sexBoy 0.8286731 3.813372
The mean dental distance of boys is 2.3 mm (95% CI = 0.83-3.81) higher than that of girls, adjusting for time.
The emmeans function can be used for estimating the mean responses for different combinations of sex and time of measurement.
tidy(emmeans(lin_agesex, c("measurement", "sex")), conf.int = TRUE)# A tibble: 8 × 9
measurement sex estim…¹ std.e…² df conf.…³ conf.…⁴ stati…⁵ p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Measure at age 8 Girl 20.8 0.633 33.8 19.5 22.1 32.9 3.11e-27
2 Measure at age 10 Girl 21.8 0.633 33.8 20.5 23.1 34.4 6.84e-28
3 Measure at age 12 Girl 23.3 0.633 33.8 22.0 24.6 36.7 7.83e-29
4 Measure at age 14 Girl 24.7 0.633 33.8 23.4 26.0 39.0 1.07e-29
5 Measure at age 8 Boy 23.1 0.542 38.0 22.0 24.2 42.7 1.06e-33
6 Measure at age 10 Boy 24.1 0.542 38.0 23.0 25.2 44.5 2.25e-34
7 Measure at age 12 Boy 25.6 0.542 38.0 24.5 26.7 47.2 2.43e-35
8 Measure at age 14 Boy 27.0 0.542 38.0 25.9 28.1 49.9 3.11e-36
# … with abbreviated variable names ¹estimate, ²std.error, ³conf.low,
# ⁴conf.high, ⁵statistic
A graphical presentation of the fitted model can be easily obtained by
dental_fit$pred_agesex <- predict(lin_agesex, re.form = ~ 0)
ggplot(dental_fit, aes(age, distance)) +
geom_line(aes(group = factor(id))) +
geom_point(aes(y = pred_agesex, col = sex), size = 2) +
labs(x = "Age, years", y = "Dental growth, mm", col = "Sex")Interaction between time and group
Question: Is the change of the mean response over time varying according to group of individuals?
lin_agesexinter <- lmer(distance ~ measurement*sex + (1 | id), data = dental_long)
lin_agesexinterLinear mixed model fit by REML ['lmerModLmerTest']
Formula: distance ~ measurement * sex + (1 | id)
Data: dental_long
REML criterion at convergence: 423.4085
Random effects:
Groups Name Std.Dev.
id (Intercept) 1.813
Residual 1.405
Number of obs: 108, groups: id, 27
Fixed Effects:
(Intercept) measurementMeasure at age 10
21.1818 1.0455
measurementMeasure at age 12 measurementMeasure at age 14
1.9091 2.9091
sexBoy measurementMeasure at age 10:sexBoy
1.6932 -0.1080
measurementMeasure at age 12:sexBoy measurementMeasure at age 14:sexBoy
0.9347 1.6847
The mean distance among girls at age 8 was 21.2 mm.
The mean distance among girls at age 10 was 1.04 mm higher than at age 8.
The mean distance among girls at age 14 was 2.91 mm higher than at age 8.
The mean distance among boys at age 8 was 21.2 + 1.7 = 22.9 mm.
The mean distance among boys at age 10 was 1.04 - 0.11 = 0.93 mm higher than at age 8.
\(\dots\)
Testing the interaction between group and time
Anova(lin_agesexinter, type = 3)Analysis of Deviance Table (Type III Wald chisquare tests)
Response: distance
Chisq Df Pr(>Chisq)
(Intercept) 938.2060 1 < 2.2e-16 ***
measurement 25.6468 3 1.131e-05 ***
sex 3.5525 1 0.05946 .
measurement:sex 7.0847 3 0.06925 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Wald test detects some evidence of interaction between age and sex (\(p\) = 0.07), which suggests that the mean response profiles might not be parallel.
The marginal prediction for the combination of sex and time of measurement can be obtained as
tidy(emmeans(lin_agesexinter, c("measurement", "sex")), conf.int = TRUE)# A tibble: 8 × 9
measurement sex estim…¹ std.e…² df conf.…³ conf.…⁴ stati…⁵ p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Measure at age 8 Girl 21.2 0.692 46.1 19.8 22.6 30.6 2.78e-32
2 Measure at age 10 Girl 22.2 0.692 46.1 20.8 23.6 32.1 3.33e-33
3 Measure at age 12 Girl 23.1 0.692 46.1 21.7 24.5 33.4 6.17e-34
4 Measure at age 14 Girl 24.1 0.692 46.1 22.7 25.5 34.8 9.40e-35
5 Measure at age 8 Boy 22.9 0.573 46.1 21.7 24.0 39.9 2.22e-37
6 Measure at age 10 Boy 23.8 0.573 46.1 22.7 25.0 41.5 3.66e-38
7 Measure at age 12 Boy 25.7 0.573 46.1 24.6 26.9 44.9 1.15e-39
8 Measure at age 14 Boy 27.5 0.573 46.1 26.3 28.6 47.9 5.87e-41
# … with abbreviated variable names ¹estimate, ²std.error, ³conf.low,
# ⁴conf.high, ⁵statistic
The difference between the mean response among boys and girls depends on age.
The change in the mean response over time depends on sex
dental_fit$pred_agesexinter <- predict(lin_agesexinter, re.form = ~ 0)
ggplot(dental_fit, aes(age, distance)) +
geom_line(aes(group = factor(id))) +
geom_point(aes(y = pred_agesexinter, col = sex), size = 2) +
labs(x = "Age, years", y = "Dental growth, mm", col = "Sex")Parametric curve
We can fit parametric curves to longitudinal data using the actual time value when the measurement was taken. In many studies, the true underlying mean response process changes over time in a relatively smooth, monotonically increasing/decreasing pattern. In our example we could model the relationship between age and the mean response using a simple linear trend among boys and girls.
Let’s assume a linear relationship between the time variable and the mean response, allowing the linear trend to vary according to sex: \(E[Y_{ij}] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2\), where \(X_1\) is the indicator variable for sex, \(X_2\) is age, and \(X_1X_2\) the interaction term.
lin_agecsexinter <- lmer(distance ~ sex*age + (1 | id), data = dental_long)
summary(lin_agecsexinter)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ sex * age + (1 | id)
Data: dental_long
REML criterion at convergence: 433.8
Scaled residuals:
Min 1Q Median 3Q Max
-3.5980 -0.4546 0.0158 0.5024 3.6862
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 3.299 1.816
Residual 1.922 1.386
Number of obs: 108, groups: id, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 17.37273 1.18351 103.98636 14.679 < 2e-16 ***
sexBoy -1.03210 1.53742 103.98636 -0.671 0.5035
age 0.47955 0.09347 79.00000 5.130 2.02e-06 ***
sexBoy:age 0.30483 0.12142 79.00000 2.511 0.0141 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) sexBoy age
sexBoy -0.770
age -0.869 0.669
sexBoy:age 0.669 -0.869 -0.770
The intercept \(\beta_0 = 17.4\) is the mean distance among girls at age 0 (extrapolation, not interesting).
The coefficient of sex \(\beta_1 =-1.03\) is the difference between the mean distance of boys vs. girls at age 0 (extrapolation, not interesting).
The coefficient of age \(\beta_2 =0.48\) is the change in the mean response for every one year increment of age among girls.
The coefficient of the interaction term \(\beta_3 = 0.30\) represents the additional change in the regression coefficient of age for boys compared with girls (and, equivalently, how the difference between boys and girls changes with age).
Test linear hypotheses
K <- rbind(
c(0, 1, 0, 8),
c(0, 1, 0, 14),
c(0, 0, 1, 1),
c(0, 0, 2, 2),
c(1, 1, 8, 8),
c(1, 1, 14, 14),
c(0, 0, 14-8, 14-8)
)
rownames(K) <- c("Difference between boys and girls at age 8 years",
"Difference between boys and girls at age 14 years",
"Linear trend for 1 year increase of age among boys",
"Linear trend for 2 years increase of age among boys",
"Mean response at age 8 years among boys",
"Mean response at age 14 years among boys",
"Difference between age 14 and 8 among boys")
tidy(glht(lin_agecsexinter, linfct = K), conf.int = TRUE) %>%
gt() %>%
fmt_number(columns = -1, decimals = 2)
| contrast | null.value | estimate | conf.low | conf.high | statistic | adj.p.value |
|---|---|---|---|---|---|---|
| Difference between boys and girls at age 8 years | 0.00 | 1.41 | −0.71 | 3.52 | 1.67 | 0.31 |
| Difference between boys and girls at age 14 years | 0.00 | 3.24 | 1.12 | 5.35 | 3.83 | 0.00 |
| Linear trend for 1 year increase of age among boys | 0.00 | 0.78 | 0.59 | 0.98 | 10.12 | 0.00 |
| Linear trend for 2 years increase of age among boys | 0.00 | 1.57 | 1.18 | 1.96 | 10.12 | 0.00 |
| Mean response at age 8 years among boys | 0.00 | 22.62 | 21.27 | 23.96 | 41.98 | 0.00 |
| Mean response at age 14 years among boys | 0.00 | 27.32 | 25.97 | 28.67 | 50.71 | 0.00 |
| Difference between age 14 and 8 among boys | 0.00 | 4.71 | 3.54 | 5.87 | 10.12 | 0.00 |
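As a check on how the rows of K map to linear combinations of the coefficients (ordered as intercept, sex, age, interaction), the difference between boys and girls at age 14 corresponds to
\[\beta_1 + 14\beta_3 = -1.03 + 14 \times 0.305 \approx 3.24 \textrm{ mm,}\]
which matches the second row of the table above.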
Graphical presentation of the fitted model
pred_agecsexinter <- expand.grid(
age = seq(8, 14, .5),
sex = levels(dental_long$sex)
) %>%
bind_cols(pred = predict(lin_agecsexinter, newdata = ., re.form = ~ 0))
ggplot(pred_agecsexinter, aes(age, pred, col = sex)) +
geom_line() +
labs(x = "Age, years", y = "Dental growth, mm", col = "Sex")Prediction random effects
In many applications, inference is focused on fixed effects (the overall change of the response over time). However, we can also estimate (or predict) individual-specific response trajectories over time.
These are known as “best linear unbiased predictors” (BLUPs), or empirical Bayes predictions. The idea is to predict the random effect (the deviation from the population mean) for each individual from the realized, observed values of the response. The predicted BLUP response trajectory for the \(i\)-th individual is a weighted combination of the estimated population-averaged mean response profile and the subject’s observed response profile. This helps to explain why it is often said that the empirical BLUP “shrinks” the \(i\)-th subject’s predicted response profile towards the population-averaged mean.
Let’s assume a random-intercept model with only age (continuous) as covariate.
lin_agec <- lmer(distance ~ age + (1 | id), data = dental_long)
lin_agecLinear mixed model fit by REML ['lmerModLmerTest']
Formula: distance ~ age + (1 | id)
Data: dental_long
REML criterion at convergence: 447.0025
Random effects:
Groups Name Std.Dev.
id (Intercept) 2.115
Residual 1.432
Number of obs: 108, groups: id, 27
Fixed Effects:
(Intercept) age
16.7611 0.6602
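The empirical BLUPs of the random intercepts, and the implied subject-specific intercepts, can be extracted with ranef and coef (a minimal sketch):
# predicted deviations u0i from the population intercept, one row per child
head(ranef(lin_agec)$id)
# subject-specific coefficients: fixed effects plus predicted random effects
head(coef(lin_agec)$id)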
We will focus on two specific children (id 10 and 21).
sid <- c(10, 21)
expand.grid(
age = seq(8, 14, .5),
id = sid
) %>%
bind_cols(
indiv_pred = predict(lin_agec, newdata = .),
marg_pred = predict(lin_agec, newdata = ., re.form = ~ 0)
) %>%
left_join(
filter(dental_long, id %in% sid), by = c("id", "age")
) %>%
ggplot(aes(age, indiv_pred, group = id, col = factor(id))) +
geom_line() +
geom_point(aes(y = distance)) +
geom_line(aes(y = marg_pred, col = "Marginal"), lwd = 1.5) +
labs(x = "Age, years", y = "Dental growth, mm", col = "Curve")The figure above provides graphical representation of the random intercept model.
Marginal (overall) mean response over time in the population increases linearly with age (denoted by the solid thick blue line).
Conditional mean responses for the two specific individuals are represented with different colored thin solid lines.
Random intercept and slope
In a random intercept and slope model, the regression coefficient of time \(\beta_1\) is also allowed to vary randomly among individuals:
\[Y_{ij} = \beta_0 + \beta_1X_{ij1} + u_{0i} + u_{1i}X_{ij1} + \epsilon_{ij}\]
\[Y_{ij} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})X_{ij1} + \epsilon_{ij}\]
The assumptions about the random variables are:
\[\begin{equation} \begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim N \left(0,\begin{pmatrix} \sigma_0^2 & \sigma_{u01} \\ \sigma_{u10} & \sigma_1^2 \end{pmatrix} \right) \end{equation}\]
\[\epsilon_{ij} \sim N(0, \sigma^2)\]
The random effects and the residual random error are independent.
The conditional mean (individual-specific) is \(E[Y_{ij} | X_{ij}, u_{0i}, u_{1i}] = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})X_{ij1}\)
The marginal mean (population-average) is \(E[Y_{ij}| X_{ij}] = \beta_0 + \beta_1X_{ij1}\)
Fitting model in R
lin_agecr <- lmer(distance ~ age + (age | id), data = dental_long)
summary(lin_agecr)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ age + (age | id)
Data: dental_long
REML criterion at convergence: 442.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.2229 -0.4938 0.0073 0.4721 3.9161
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 5.41657 2.3274
age 0.05128 0.2264 -0.61
Residual 1.71616 1.3100
Number of obs: 108, groups: id, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 16.76111 0.77527 25.99911 21.620 < 2e-16 ***
age 0.66019 0.07126 25.99805 9.265 1.01e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
age -0.848
This model assumes that individuals vary not only in their baseline level of response (intercept), but also in terms of their changes (slope) in the mean response over time.
VarCorr(lin_agecr) Groups Name Std.Dev. Corr
id (Intercept) 2.32735
age 0.22645 -0.609
Residual 1.31002
as.data.frame(VarCorr(lin_agecr)) grp var1 var2 vcov sdcor
1 id (Intercept) <NA> 5.41656658 2.3273518
2 id age <NA> 0.05127907 0.2264488
3 id (Intercept) age -0.32118277 -0.6094248
4 Residual <NA> <NA> 1.71615750 1.3100219
There is a negative correlation (-0.61; covariance -0.32) between random intercepts and slopes: the greater the mean response at baseline (age 8 years), the smaller the rate of change (slope) over time is likely to be.
We can get the predicted value of the marginal (overall) and individual-specific mean response trajectories.
expand.grid(
age = seq(8, 14, .5),
id = unique(dental_long$id)
) %>%
bind_cols(
indiv_pred = predict(lin_agecr, newdata = .),
marg_pred = predict(lin_agecr, newdata = ., re.form = ~ 0)
) %>%
ggplot(aes(age, indiv_pred, group = id)) +
geom_line(col = "grey") +
geom_line(aes(y = marg_pred), col = "blue", lwd = 1.5) +
labs(x = "Age, years", y = "Dental growth, mm")Question: Is there any difference between boys and girls?
lin_agecsexinterr <- lmer(distance ~ age*sex + (age | id), data = dental_long)
summary(lin_agecsexinterr)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ age * sex + (age | id)
Data: dental_long
REML criterion at convergence: 432.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.1694 -0.3862 0.0070 0.4454 3.8490
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 5.77449 2.4030
age 0.03245 0.1801 -0.67
Residual 1.71663 1.3102
Number of obs: 108, groups: id, 27
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 17.3727 1.2281 25.0078 14.147 1.94e-13 ***
age 0.4795 0.1037 25.0113 4.625 9.85e-05 ***
sexBoy -1.0321 1.5953 25.0078 -0.647 0.5235
age:sexBoy 0.3048 0.1347 25.0113 2.263 0.0326 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) age sexBoy
age -0.880
sexBoy -0.770 0.678
age:sexBoy 0.678 -0.770 -0.880
There is evidence of significant interaction between sex and time in predicting dental distance (p for interaction = 0.032).
Graphical presentation
expand.grid(
age = seq(8, 14, .5),
id = unique(dental_long$id)
) %>%
left_join(select(dental, sex, id), by = "id") %>%
bind_cols(
indiv_pred = predict(lin_agecsexinterr, newdata = .),
marg_pred = predict(lin_agecsexinterr, newdata = ., re.form = ~ 0)
) %>%
ggplot(aes(age, indiv_pred, group = id)) +
geom_line(col = "grey") +
geom_line(aes(y = marg_pred, col = sex), lwd = 1.5) +
labs(x = "Age, measurements", y = "Mean Dental growth, mm")Generalized Estimation Equation
The geepack package contains functions for fitting population-averaged panel-data models. The geeglm function estimates generalized linear models with different within-group correlation structures for the panels. Have a look at the help page ?geeglm for the available correlation structures (corstr argument).
Some types of correlation structures:
independence: the correlation matrix is diagonal, that is, zeros off the main diagonal (no correlation among repeated measurements);
exchangeable: all pairs of measurements on the same individual are equally correlated, \(\rho_{jk} = \rho\) (compound symmetry);
ar1: the correlation depends on the time or distance between measurements \(j\) and \(k\), plausible for repeated measures where the correlation is known to decline over time, \(\rho_{jk} = \rho^{|j-k|}\);
unstructured: no assumptions about the correlations;
userdefined: Correlation is specified by the user.
Things to consider when you choose a correlation matrix:
In general, if the number of repeated measures per individual is small and the design is balanced and complete, an unstructured matrix is recommended. If the data are unbalanced, unequally spaced, or contain gaps (missing values), the more appropriate correlation structures are exchangeable and unstructured. Furthermore, when the order of the measurements does not matter, an exchangeable correlation matrix could be appropriate.
Let’s assume the following marginal model for the dental distance:
\[E[Y_{ij}| X_{ij}] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2\] where \(X_1\) is the indicator variable for sex (1 for boys and 0 for girls), \(X_2\) is age, and \(X_1X_2\) is the interaction between sex and age.
gee_inter <- geeglm(distance ~ sex*age, data = dental_long,
id = id, family = gaussian, corstr = "exchangeable")
summary(gee_inter)
Call:
geeglm(formula = distance ~ sex * age, family = gaussian, data = dental_long,
id = id, corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 17.37273 0.72521 573.869 < 2e-16 ***
sexBoy -1.03210 1.37779 0.561 0.4538
age 0.47955 0.06313 57.697 3.05e-14 ***
sexBoy:age 0.30483 0.11687 6.803 0.0091 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = exchangeable
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 4.905 1.015
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.6178 0.1312
Number of clusters: 27 Maximum cluster size: 4
tidy(gee_inter, conf.int = TRUE)# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 17.4 0.725 574. 0 16.0 18.8
2 sexBoy -1.03 1.38 0.561 4.54e- 1 -3.73 1.67
3 age 0.480 0.0631 57.7 3.05e-14 0.356 0.603
4 sexBoy:age 0.305 0.117 6.80 9.10e- 3 0.0758 0.534
The Wald test detects a statistically significant (\(p < 0.05\)) interaction between age and sex in predicting distance. That is, the fitted trends are not parallel: the linear patterns of change differ between boys and girls.
K <- rbind(
"mean response comparing boys vs. girls" = c(0, 1, 0, 0),
"mean distance increases every 1 year increase of age among girls" = c(0, 0, 1, 0),
"mean distance increases every 1 year increase of age among boys" = c(0, 0, 1, 1)
)
tidy(glht(gee_inter, linfct = K), conf.int = TRUE) %>%
gt() %>%
fmt_number(columns = -1, decimals = 2)
| contrast | null.value | estimate | conf.low | conf.high | statistic | adj.p.value |
|---|---|---|---|---|---|---|
| mean response comparing boys vs. girls | 0.00 | −1.03 | −4.25 | 2.19 | −0.75 | 0.77 |
| mean distance increases every 1 year increase of age among girls | 0.00 | 0.48 | 0.33 | 0.63 | 7.60 | 0.00 |
| mean distance increases every 1 year increase of age among boys | 0.00 | 0.78 | 0.55 | 1.01 | 7.98 | 0.00 |
Graphical presentation
pred_geeinter <- ggpredict(gee_inter, terms = c("age", "sex"))
ggplot(pred_geeinter, aes(x, predicted, col = group)) +
geom_line() +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high, fill = group), alpha = .2, col = NA) +
labs(x = "Age, years", y = "Dental growth, mm", col = "Sex", fill = "Sex")Other correlation structure
gee_inter_exch <- gee_inter
gee_inter_ind <- geeglm(distance ~ sex*age, data = dental_long,
id = id, family = gaussian, corstr = "independence")
gee_inter_unstr <- geeglm(distance ~ sex*age, data = dental_long,
id = id, family = gaussian, corstr = "unstructured")
gee_inter_ar1 <- geeglm(distance ~ sex*age, data = dental_long,
id = id, family = gaussian, corstr = "ar1")
# comparison
QIC(gee_inter_exch, gee_inter_ind, gee_inter_unstr, gee_inter_ar1) QIC QICu Quasi Lik CIC params QICC
gee_inter_exch 543.1 537.8 -264.9 6.651 4 545.9
gee_inter_ind 543.1 537.8 -264.9 6.651 4 544.9
gee_inter_unstr 542.6 537.8 -264.9 6.420 4 556.4
gee_inter_ar1 544.6 538.8 -265.4 6.864 4 547.4
Influence of Menarche on Changes in Body Fat
Data
Prospective study on body fat accretion in a cohort of 162 girls from the MIT Growth and Development Study. At the start of the study, all the girls were pre-menarcheal and non-obese.
All girls were followed over time according to a schedule of annual measurements until four years after menarche. The final measurement was scheduled on the fourth anniversary of their reported date of menarche.
At each examination, a measure of body fatness was obtained based on bioelectric impedance analysis.
Reference: Phillips, S.M., Bandini, L.G., Compton, D.V., Naumova, E.N. and Must, A. (2003). A longitudinal comparison of body composition by total body water and bioelectrical impedance in adolescent girls. Journal of Nutrition, 133, 1419-1425.
The bodyfat dataset contains the following variables:
id = a sequential number;
agec = Current age, years;
agem = Age at menarche, years;
time = Time relative to menarche, years;
pbf = Percent Body Fat;
occasion = Number of sequential measurement.
#load("data/bodyfat.RData")
load(url("http://alecri.github.io/downloads/data/bodyfat.RData"))
head(bodyfat)# A tibble: 6 × 6
id agec agem time pbf occasion
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 9.32 13.2 -3.87 7.94 1
2 1 10.3 13.2 -2.86 15.6 2
3 1 11.2 13.2 -1.95 13.5 3
4 1 12.2 13.2 -1 23.2 4
5 1 13.2 13.2 0.0500 10.5 5
6 1 14.2 13.2 1.05 20.5 6
Descriptive statistics
Trajectories over time
ggplot(bodyfat, aes(agec, pbf)) +
geom_line(aes(group = id)) +
geom_smooth(se = FALSE, size = 2) +
geom_vline(xintercept = 12.86, linetype = "dashed") +
labs(x = "Age, years", y = "Percent Body Fat")Let’s consider the hypothesis that percent body fat accretion increases linearly with age, but with different slopes before/after menarche. We assume that each girl has a piecewise linear spline growth curve with a knot at the time of menarche.
Each girl’s growth curve can be described with an intercept and two slopes, one slope for changes in response before menarche, another slope for changes in response after menarche. Note the knot is not a fixed age for all subjects.
Linear splines
If the simplest possible curve is a straight line, then one way to extend the curve is to have a sequence of joined line segments that produces a piecewise-linear pattern. Linear spline models provide a flexible way to accommodate many non-linear trends that cannot be approximated by simple polynomials in time.
Basic idea: divide the time axis into a series of segments and consider piecewise-linear trends, with different slopes but joined at fixed times. The locations where the lines are tied together are known as knots. The resulting piecewise-linear curve is called a spline.
A piece-wise linear mixed model can be written as
\[E[Y_{ij} | X_{ij}] = \beta_0 + \beta_1X_{ij1} + \beta_2X_{ij2}\]
\[E[Y_{ij} | X_{ij}, u_{i}] = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})X_{ij1} + (\beta_2 + u_{2i})X_{ij2}\] where \(X_{ij1}\) is the time since menarche of the \(j\)-th measurement on the \(i\)-th subject (i.e., \(X_{ij1} = 0\) at menarche); \(X_{ij2} = \textrm{max}(X_{ij1} - 0, 0)\) is the linear spline of \(X_{ij1}\) with a knot at 0.
Marginal model
- \(\beta_0\): is the population mean percent body fat at menarche (when \(X_{ij1} = 0\))
- \(\beta_1\): is the population slope (linear trend) for time before menarche (when \(X_{ij1} < 0\))
- \(\beta_2\): is the additional change in the population slope (linear trend) for time after menarche (when \(X_{ij1} > 0\))
- \(\beta_1 + \beta_2\): is the population slope (linear trend) for time after menarche (when \(X_{ij1} > 0\))
Conditional model
- \(\beta_0 + u_{0i}\): is the intercept for \(i\)-th subject and is the true percent body fat at menarche (when \(X_{ij1} = 0\))
- \(\beta_1 + u_{1i}\): is the \(i\)-th subject slope, or rate of change in percent body fat during the pre-menarcheal period
- \(\beta_1 + \beta_2 + u_{1i} + u_{2i}\): is the \(i\)-th subject slope, or rate of change during the post-menarcheal period. (when \(X_{ij1} > 0\))
Fitting model in R
bodyfat <- bodyfat %>%
mutate(timepost = pmax(time, 0))
lin_lspl0 <- lmer(pbf ~ time + timepost + (time + timepost | id), data = bodyfat)
summary(lin_lspl0)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: pbf ~ time + timepost + (time + timepost | id)
Data: bodyfat
REML criterion at convergence: 6062
Scaled residuals:
Min 1Q Median 3Q Max
-2.774 -0.590 -0.036 0.595 3.380
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 45.94 6.78
time 1.63 1.28 0.29
timepost 2.75 1.66 -0.54 -0.83
Residual 9.47 3.08
Number of obs: 1049, groups: id, 162
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 21.361 0.565 161.566 37.84 < 2e-16 ***
time 0.417 0.157 108.460 2.65 0.0091 **
timepost 2.047 0.228 132.682 8.98 2.3e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) time
time 0.351
timepost -0.515 -0.872
Based on the magnitude of \(\beta_2\), relative to its standard error, the slopes before and after menarche differ.
K <- rbind(
"population mean pre-menarcheal slope" = c(0, 1, 0),
"population mean post-menarcheal slope" = c(0, 1, 1)
)
tidy(glht(lin_lspl0, linfct = K), conf.int = TRUE)# A tibble: 2 × 7
contrast null.…¹ estim…² conf.…³ conf.…⁴ stati…⁵ adj.p…⁶
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 population mean pre-menarchea… 0 0.417 0.0674 0.767 2.65 0.0156
2 population mean post-menarche… 0 2.46 2.20 2.73 20.7 0
# … with abbreviated variable names ¹null.value, ²estimate, ³conf.low,
# ⁴conf.high, ⁵statistic, ⁶adj.p.value
The estimate of the population mean pre-menarcheal slope is 0.42 (95% CI = 0.067-0.767%).
The estimated slope is rather shallow and indicates that the annual rate of body fat accretion is less than 0.5%.
The estimate of the population mean post-menarcheal slope is (0.42 + 2.04) = 2.46, which indicates that the annual rate of growth in body fat is approximately 2.5% (95% CI = 2.2-2.7%), almost 6 times higher than pre-menarche.
Covariance structure
VarCorr(lin_lspl0) Groups Name Std.Dev. Corr
id (Intercept) 6.78
time 1.28 0.29
timepost 1.66 -0.54 -0.83
Residual 3.08
as.data.frame(VarCorr(lin_lspl0)) grp var1 var2 vcov sdcor
1 id (Intercept) <NA> 45.940 6.7779
2 id time <NA> 1.631 1.2771
3 id timepost <NA> 2.749 1.6580
4 id (Intercept) time 2.526 0.2919
5 id (Intercept) timepost -6.109 -0.5436
6 id time timepost -1.750 -0.8266
7 Residual <NA> <NA> 9.473 3.0779
Estimated variance of \(u_{1i}\) is 1.63, indicating substantial variability from girl to girl in rates of percent body fat accretion during the pre-menarcheal period.
Approximately 95% of girls have changes in percent body fat between -2.09% and 2.92%, i.e. \(0.42 \pm 1.96 × \sqrt{1.63}\).
The estimated variance of the slopes during the post-menarcheal period, \(\textrm{Var}(u_{1i} + u_{2i})\), is 0.88 (i.e., \(1.63 + 2.75 - 2 \times 1.75\)), indicating less variability in the slopes after menarche. Approximately 95% of girls have changes in percent body fat between 0.62% and 4.30%, i.e., \(2.46 \pm 1.96 \times \sqrt{0.88}\).
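The same quantity can be computed from the estimated covariance matrix of the random effects (a small sketch; the object name G is illustrative):
G <- VarCorr(lin_lspl0)$id # 3 x 3 covariance matrix of (intercept, time, timepost)
# Var(u1 + u2) = Var(u1) + Var(u2) + 2 Cov(u1, u2)
G["time", "time"] + G["timepost", "timepost"] + 2 * G["time", "timepost"]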
The mixed effects model can be used to obtain estimates of each girl’s growth trajectory over time (fixed plus realization of random effects). We next present graphically the estimated population mean growth curve and predicted (empirical BLUP) growth curves for two girls (id = 16 and id = 4).
A noticeable feature of the predicted growth curves is that there is more shrinkage towards the population mean curve when fewer data points are available.
sid <- c(4, 16)
p_lin_lspl0 <- expand.grid(
time = seq(-6, 4.1, .05),
id = sid
) %>%
mutate(timepost = pmax(time, 0)) %>%
bind_cols(
indiv_pred = predict(lin_lspl0, newdata = .),
marg_pred = predict(lin_lspl0, newdata = ., re.form = ~ 0)
) %>%
ggplot(aes(time, indiv_pred, group = id, col = factor(id))) +
geom_line(aes(y = marg_pred), col = "black", lwd = 1.5) +
geom_line(aes(linetype = "BLUP")) +
geom_point(data = filter(bodyfat, id %in% sid),
aes(y = pbf, shape = factor(id), col = factor(id))) +
guides(shape = "none") +
labs(x = "Time relative to menarche, years", y = "Percent Body Fat",
col = "Id", linetype = "Prediction")
p_lin_lspl0
The predicted growth curve for the girl with only 6 data points is pulled closer to the population mean curve. The predicted growth curve for the girl with 10 observations follows her data more closely.
This becomes more apparent when empirical BLUPs are compared to ordinary least squares (OLS) estimates based only on observations from each girl.
lm4 <- lm(pbf ~ time + timepost, data = filter(bodyfat, id == 4))
lm16 <- lm(pbf ~ time + timepost, data = filter(bodyfat, id == 16))
newd4_16 <- data.frame(
time = seq(-6, 4.1, .05),
timepost = pmax(seq(-6, 4.1, .05), 0)
) %>%
mutate(
pred4 = predict(lm4, newdata = .),
pred16 = predict(lm16, newdata = .)
) %>%
pivot_longer(cols = starts_with("pred"), names_to = "id", values_to = "pred") %>%
mutate(id = parse_number(id))
p_lin_lspl0 +
geom_line(data = newd4_16, aes(y = pred, linetype = "OLS"))