A review of Longitudinal Data Analysis in R

Author

Affiliation

Medical Epidemiology and Biostatistics
Karolinska Institutet

Published

December 10, 2022

(Last compiled on Sat Dec 10, 2022)

Aim

The aim of this document is to review how to analyze longitudinal data in R.
The material presented in this document is taken from the lecture Regression models for dependent data by Nicola Orsini, where the Stata code and output have been translated to equivalent R code.
The methodologies are demonstrated by re-analyzing two working examples which are reported in Fitzmaurice GM, Laird NM and Ware JH. (2004) Applied Longitudinal Analysis, Wiley.

R packages

The following R packages have been used in this review. A short comment on the use (for this document) of the package is presented after loading the package. Be sure to install the package before loading it.

library(labelled)   # labeling data
library(rstatix)    # summary statistics
library(ggpubr)     # convenient summary statistics and plots
library(GGally)     # advanced plot
library(car)        # useful for anova/wald test
library(Epi)        # easy getting CI for model coef/pred
library(lme4)       # linear mixed-effects models
library(lmerTest)   # test for linear mixed-effects models
library(emmeans)    # marginal means
library(multcomp)   # CI for linear combinations of model coef
library(geepack)    # generalized estimating equations
library(ggeffects)  # marginal effects, adjusted predictions
library(gt)         # nice tables

library(tidyverse)  # for everything (data manipulation, visualization, coding, and more)
theme_set(theme_minimal() + theme(legend.position = "bottom")) # theme for ggplot

Longitudinal Data Analysis

The key feature of Longitudinal Data Analysis is the analysis of longitudinal data, i.e. repeated observations taken on individuals. Oftentimes, the primary goal is to characterize the change in response over time (or measurements), and factors that influence this change.
Observations that belong to the same subject (or group/cluster) tend to be more similar than observations from different groups. This correlation must be accounted for in order to obtain valid inference.

Longitudinal data is a special case of hierarchical data, where observations are nested within different hierarchically ordered levels, such as measurements within individuals, or pupils within schools (and within regions). A broad task view presenting useful R packages for analyzing these type of correlated data can be found on the dedicated CRAN Task View at https://cran.r-project.org/web/views/MixedModels.html.

Notation

Response variables

$Y_{ij}$ is the response variable for $i^{th}$ individual ($i = 1, \dots, N$) measured at time $t_{ij}$ ($j = 1,\dots,n$).

$\mu_j = E[Y_{ij}]$ is the mean of $Y_{ij}$, which provides a measure of the location of the center of the distribution of $Y_{ij}$.
$\sigma^2 = \textrm{Var}[Y_{ij}]$ is the variance of $Y_{ij}$, which provides a measure of the spread or dispersion of the values of $Y_{ij}$ around their mean.
$\sigma_{jk} = \textrm{Cov}[Y_{ij}, Y_{ik}]$ is the covariance of $Y_{ij}$ and $Y_{ik}$, which provides a measure of the linear dependence between the variables ($\sigma_{jk} = 0$ corresponds to no linear dependence).
$\rho_{jk} = \frac{\sigma_{jk}}{\sigma_{j} \sigma_{k}}$ is the correlation between $Y_{ij}$ and $Y_{ik}$, a measure of dependence free of scales. It ranges from -1 (perfect negative linear correlation) to 1 (perfect positive linear correlation).

Repeated measurements

$Y_i = (Y_{i1}, \dots, Y_1{n})^\top$ is the vector of repeated measures, with symmetric matrix of variances and covariances $\textrm{Cov}(Y_i) = \Sigma_i$

\[ \Sigma_i = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdot & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \sigma_{n1} & \cdot & \cdot & \sigma_n^2 \end{bmatrix} \]

A matrix of correlations can be easily obtained from the variance/covariance matrix.

\[ \textrm{Corr}(Y_i) = \rho_i = \begin{bmatrix} 1 & \rho_{12} & \cdot & \rho_{1n} \\ \rho_{21} & 1 & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \rho_{n1} & \cdot & \cdot & 1 \end{bmatrix} \]

Based on empirical observations from longitudinal studies, the correlation among repeated measurements are often present these feature:

correlations are positive;
correlations decrease with increasing time separation;
correlations between repeated measures rarely ever approach zero;
correlation between a pair of repeated measures taken very closely together in time rarely approaches one.

Dental Growth Data

Data

This study was conducted in 16 boys and 11 girls, who at ages 8, 10, 12, and 14 had their distance (mm) from the center of the pituitary gland to the pteryomaxillary fissure measured. Changes in pituitary-pteryomaxillary distances during growth is important in orthodontal therapy.

The goals of the study were to describe the distance in boys and girls as simple functions of age, and then to compare the functions for boys and girls.

Reference: Potthoff, R.F. and Roy, S.W. (1964). A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika, 51, 313-326

The dental dataset contains the following variables:
id = a sequential number;
sex = sex, a factor with categories 0 = “Girl”, 1 = “Boy”;
y8 = Measure at age 8;
y10 = Measure at age 10;
y12 = Measure at age 12;
y14 = Measure at age 14.

Load data into R from the URL.

#load("data/dental.RData")
load(url("http://alecri.github.io/downloads/data/dental.RData"))

head(dental)

# A tibble: 6 × 6
     id sex      y8   y10   y12   y14
  <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1     1 Girl   21    20    21.5  23  
2     2 Girl   21    21.5  24    25.5
3     3 Girl   20.5  24    24.5  26  
4     4 Girl   23.5  24.5  25    26.5
5     5 Girl   21.5  23    22.5  23.5
6     6 Girl   20    21    21    22.5

The data is presented in a wide format where repeated measurements are contained in separate variables (only one raw for each individual).

Oftentimes, a longitudinal format, where repeated measurements are reported as separate rows, is preferable for the analysis. The pivot_longer function is useful for restruring a data from a wide to a long format:

dental_long <- pivot_longer(dental, cols = starts_with("y"), 
                            names_to = "measurement", values_to = "distance") %>% 
  mutate(
    age = parse_number(measurement),
    measurement = fct_inorder(paste("Measure at age", age))
  ) %>% 
  set_variable_labels(
    age = "Age of the child at measurement",
    measurement = "Label for time measurement",
    distance = "Measurement"
  )

head(dental_long)

# A tibble: 6 × 5
     id sex   measurement       distance   age
  <dbl> <fct> <fct>                <dbl> <dbl>
1     1 Girl  Measure at age 8      21       8
2     1 Girl  Measure at age 10     20      10
3     1 Girl  Measure at age 12     21.5    12
4     1 Girl  Measure at age 14     23      14
5     2 Girl  Measure at age 8      21       8
6     2 Girl  Measure at age 10     21.5    10

Descriptive statistics

Mean response over time

group_by(dental_long, age) %>% 
  get_summary_stats(distance)

# A tibble: 4 × 14
    age variable     n   min   max median    q1    q3   iqr   mad  mean    sd
  <dbl> <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     8 distance    27  16.5  27.5     22  21    23.2  2.25  1.48  22.2  2.43
2    10 distance    27  19    28       23  21.5  24.5  3     2.22  23.2  2.16
3    12 distance    27  19    31       24  23    26    3     2.22  24.6  2.82
4    14 distance    27  19.5  31.5     26  25    27.8  2.75  2.22  26.1  2.77
# … with 2 more variables: se <dbl>, ci <dbl>

Overall, the mean response increases with age. The variance of distance also increases with age.
The mean distance increases by 4 mm (26-22) from age 8 to age 14 years.

ggplot(dental_long, aes(measurement, distance, fill = measurement)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  guides(fill = "none") +
  labs(x = "", y = "Dental growth, mm")

group_by(dental_long, sex, measurement) %>% 
  get_summary_stats(distance, show = c("mean", "sd"))

# A tibble: 8 × 6
  sex   measurement       variable     n  mean    sd
  <fct> <fct>             <chr>    <dbl> <dbl> <dbl>
1 Girl  Measure at age 8  distance    11  21.2  2.12
2 Girl  Measure at age 10 distance    11  22.2  1.90
3 Girl  Measure at age 12 distance    11  23.1  2.37
4 Girl  Measure at age 14 distance    11  24.1  2.44
5 Boy   Measure at age 8  distance    16  22.9  2.45
6 Boy   Measure at age 10 distance    16  23.8  2.14
7 Boy   Measure at age 12 distance    16  25.7  2.65
8 Boy   Measure at age 14 distance    16  27.5  2.08

The average response increases both among boys and girls.

ggplot(dental_long, aes(sex, distance, fill = measurement)) +
  geom_boxplot() +
  labs(x = "", y = "Dental growth, mm", fill = "")

group_by(dental_long, sex, measurement) %>% 
  summarise(mean_distance = mean(distance), .groups = "drop") %>% 
  ggplot(aes(sex, mean_distance, fill = measurement, label = round(mean_distance))) +
  geom_col(position = "dodge") +
  geom_text(position = position_dodge(width = 0.9), vjust = -0.5) +
  coord_flip() +
  labs(x = "", y = "Mean Dental growth, mm", fill = "")

Correlation of responses

# co-variance matrix
cov_obs <- select(dental, starts_with("y")) %>% 
  cov()
cov_obs

          y8      y10      y12      y14
y8  5.925926 3.285256 4.875356 4.039886
y10 3.285256 4.653846 3.858974 4.532051
y12 4.875356 3.858974 7.938746 6.197293
y14 4.039886 4.532051 6.197293 7.654558

# correlation matrix
cov2cor(cov_obs)

           y8       y10       y12       y14
y8  1.0000000 0.6255833 0.7108079 0.5998338
y10 0.6255833 1.0000000 0.6348775 0.7593268
y12 0.7108079 0.6348775 1.0000000 0.7949980
y14 0.5998338 0.7593268 0.7949980 1.0000000

A positive (linear) correlation is observed for consecutive measurements.

ggpairs(select(dental, starts_with("y")), lower = list(continuous = "smooth"))

ggpairs(dental, mapping = aes(colour = sex), columns = 3:6,
        lower = list(continuous = "smooth"))

Trajectories over time

Many different alternative graphical representations are possible. Here the most common ones.

group_by(dental_long, sex, age) %>% 
  summarise(mean = list(mean_ci(distance)), .groups = "drop") %>% 
  unnest_wider(mean) %>% 
  mutate(agex = age - .05 + .05*(sex == "Boy")) %>% 
  ggplot(aes(agex, y, col = sex, shape = sex)) +
  geom_point() +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2) +
  geom_line() +
  labs(x = "Age, years", y = "Mean Dental growth, mm", shape = "Sex", col = "Sex")

Plot of individual data over time

ggplot(dental_long, aes(age, distance, col = factor(id))) +
  geom_point() +
  geom_line() +
  facet_wrap(~ id) +
  labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
  guides(col = guide_legend(nrow = 3))

Spaghetti plot

ggplot(dental_long, aes(age, distance, col = factor(id))) +
  geom_line() +
  labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
  guides(col = guide_legend(nrow = 3))

Spaghetti plot by sex

ggplot(dental_long, aes(age, distance)) +
  geom_line(aes(group = factor(id))) +
  geom_smooth() +
  facet_grid(~ sex)

Research question

In the next sections, we are going to answer the following three main questions:
- Time effect: What is the shape of the trajectory of the mean response over time?
- Group effect: What is the average difference between groups of individuals?
- Interaction between time and group: How the relationship between the response and time vary according to groups of individuals?

Linear Mixed Models

In these models, some subset of the regression coefficients vary randomly from one individual to another, accounting for natural heterogeneity across subjects. Individuals in population are assumed to have their own subject-specific mean response trajectories over time.

The mean response is modeled as a combination of population characteristics (fixed effects) assumed to be shared by all individuals, while subject-specific effects (random effects) are unique to a particular individual. Linear Mixed Models are a particular type of hierarchical models which contains both fixed and random effects.

Empty model

The equation of an empty mixed-effects model, also known as one-way random-effect ANOVA model, is $Y_{ij} = \beta_0 + u_{0i} + \epsilon_{ij}$, where $\beta_0$ is the intercept, $u_{0i}$ the random effect, and $\epsilon_{ij}$ is the residual error terms.

The model has the following distributional assumptions:

$u_{0i} \sim N(0, \sigma_{u_0}^2)$
$\epsilon_{ij} \sim N(0, \sigma^2)$
$u_{0i} \perp \!\!\! \perp \epsilon_{ij}$

Interpretation:

$\beta_0$ is the fixed effect representing the overall mean of the response variable $Y_{ij}$
$u_{0i}$ is a random variable of which we estimate the variance. It is the individual-specific deviation from $\beta_0$ that can be predicted from the model and observed data.

The conditional mean response for a specific individual is $E[Y_{ij} | u_{0i}] = \beta_0 + u_{0i}$.
The marginal mean response in the population (i.e., averaged over all individuals in population) is $E[Y_{ij}] = \beta_0$

For the model with randomly varying intercepts, the marginal covariance has the following compound symmetry pattern:

\[ \Sigma_i = \begin{bmatrix} \sigma_{u_0}^2 + \sigma^2 & \sigma_{u_0}^2 & \cdot & \sigma_{u_0}^2 \\ \sigma_{u_0}^2 & \sigma_{u_0}^2 + \sigma^2 & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \sigma_{u_0}^2 & \cdot & \cdot & \sigma_{u_0}^2 + \sigma^2 \end{bmatrix} \]

Intraclass correlation
The common correlation among repeated measures implicit in the random intercept model is known as intraclass correlation (ICC), which is defined as:

\[\textrm{ICC} = \frac{\sigma_{u_0}^2}{\sigma_{u_0}^2 + \sigma^2}\]

The ICC can be intepretated as the proportion of the total variance that is “explained” by between-subject variability is. It can take any value between 0 and 1.

Estimation
$\beta$ coefficients can be estimated using Maximum Likelihood (ML) estimation, which give estimates that maximize the log-likelihood, i.e. the probability for the observed the data given the parameters.
Maximum Likelihood is not taking account of the degrees of freedom used for estimating fixed effects when estimating variance components. This can lead to bias estimates (underestimation) for the variance components in data with relatively small number of individuals.
The alternative Restricted Maximum Likelihood (REML) can overcome this limitation by maximizing a marginal or residual (invariant to fixed effects) likelihood.

Test linear hypotheses

Different functions and packages can be used for performing inference on linear combination of model coefficients, which may be helpful for answering relevant scientific research questions. We are going to use the glht (general linear hypotheses testing) in the multcomp package.
Additional packages can be useful for visualizing confidence intervals for $\beta$ coefficients (ci.lin in the Epi package, tidy), testing nested models (Anova in the car package), and for estimating marginal meas (emmeans package).

For example, we might want to test for a subset of fixed effects: $H_0: _0 = 0 $ vs $H_1: _0 $. A walt-type test can be used for testing this type of hypotheses.

Fitting model in R

Linear mixed-effects models can be fitted using the lmer function in the lme4 package.
The first part of the formula model specify the fixed effects part of the model, while the second in parantheses specify the random components.

lin_0 <- lmer(distance ~ 1 + (1 | id), data = dental_long)
summary(lin_0)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ 1 + (1 | id)
   Data: dental_long

REML criterion at convergence: 515.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2400 -0.5277 -0.1073  0.4732  2.7687 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 3.752    1.937   
 Residual             4.930    2.220   
Number of obs: 108, groups:  id, 27

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)  24.0231     0.4297 26.0000   55.91   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ci.lin(lin_0)

            Estimate    StdErr        z P     2.5%    97.5%
(Intercept) 24.02315 0.4296605 55.91193 0 23.18103 24.86527

The estimated marginal mean of the dental distance is $\beta_0 = 24.02$ mm.
The estimated variance of the random-effect reflecting between-subject variability is 3.752; the estimated variance of the error term reflecting within-subject variability is 4.930. The correlation between any two repeated measures (ICC) is equal to $3.76/(3.76 + 4.93) = 0.43$.

Variance components can be tested using an ANOVA-like table that can be derived using the ranova function in the lmer test.

ranova(lin_0)

ANOVA-like table for random-effects: Single term deletions

Model:
distance ~ (1 | id)
         npar  logLik    AIC   LRT Df Pr(>Chisq)    
<none>      3 -257.68 521.36                        
(1 | id)    2 -269.14 542.28 22.92  1  1.689e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small $p$-value suggests evidence of between-individual heterogeneity, which support evidence for choosing a mixed-effects model instead of a only fixed-effects model.

Graphical presentation of the model

ggplot(dental_long, aes(id, distance)) +
  geom_point(aes(col = measurement, shape = measurement)) +
  geom_point(data = group_by(dental_long, id) %>%
               summarise(distance = mean(distance), .groups = "drop"),
             aes(col = "Mean", shape = "Mean"), size = 2.5) +
  geom_hline(yintercept = mean(dental_long$distance)) +
  scale_shape_manual(values = c(4, 19, 19, 19, 19)) +
  labs(x = "Child id", y = "Dental growth, mm", col = "Measurement", shape = "Measurement")

Time effect

Question: Is the mean response varying with time?

One modeling strategy is to use indicator variables to contrast mean responses at different occasions. In the dental growth example we have measurements of the response at age 8 (baseline), 10, 12, and 14 years-old:
\[E[Y_{ij}] = \beta_0 + \beta_1X_{ij1} + \beta_2X_{ij2} + \beta_3X_{ij3}\]

where $X_{ij1}$, $X_{ij2}$, and $X_{ij3}$ are indicator variables for age 10, 12, and 14 years respectively. Measurement at the age of 8 years is the referent.

lin_age <- lmer(distance ~ measurement + (1 | id), data = dental_long)
summary(lin_age)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ measurement + (1 | id)
   Data: dental_long

REML criterion at convergence: 443.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.7376 -0.5248  0.0153  0.4027  3.7212 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 4.465    2.113   
 Residual             2.078    1.442   
Number of obs: 108, groups:  id, 27

Fixed effects:
                             Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)                   22.1852     0.4923 43.3911  45.066  < 2e-16 ***
measurementMeasure at age 10   0.9815     0.3924 78.0000   2.501   0.0145 *  
measurementMeasure at age 12   2.4630     0.3924 78.0000   6.277 1.80e-08 ***
measurementMeasure at age 14   3.9074     0.3924 78.0000   9.958 1.52e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) mMaa10 mMaa12
msrmntMaa10 -0.399              
msrmntMaa12 -0.399  0.500       
msrmntMaa14 -0.399  0.500  0.500

The mean dental distance at age 8 years (baseline) is 22.18 mm.
The difference between the mean responses of 10 and 8 years-old children is $\beta_1 = 0.98$ mm.
The difference between the mean responses of 12 and 8 years-old children is $\beta_2 = 2.46$ mm.
The difference between the mean responses of 14 and 8 years-old children is $\beta_3 = 3.91$ mm.

The emmeans function can be usefull for computing marginal means with corresponding confidence intervals.

tidy(emmeans(lin_age, "measurement"), conf.int = TRUE)

# A tibble: 4 × 8
  measurement       estimate std.error    df conf.low conf.high stati…¹  p.value
  <chr>                <dbl>     <dbl> <dbl>    <dbl>     <dbl>   <dbl>    <dbl>
1 Measure at age 8      22.2     0.492  43.4     21.2      23.2    45.1 4.45e-38
2 Measure at age 10     23.2     0.492  43.4     22.2      24.2    47.1 7.05e-39
3 Measure at age 12     24.6     0.492  43.4     23.7      25.6    50.1 5.02e-40
4 Measure at age 14     26.1     0.492  43.4     25.1      27.1    53.0 4.41e-41
# … with abbreviated variable name ¹statistic

We can test whether the mean response is constant over time by testing the null hypothesis that all the regression coefficients used to model time are simultaneously equal to zero ($H_0: \beta_1 = \beta_2 = \beta_3 = 0$).

Anova(lin_age)

Analysis of Deviance Table (Type II Wald chisquare tests)

Response: distance
             Chisq Df Pr(>Chisq)    
measurement 114.12  3  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results from the Wald-test suggest that the trajectory of the mean response over time is not flat.
We can presented the estimated trajectory from the fitted model by using the predict function.

dental_fit <- bind_cols(
  dental_long, pred_age = predict(lin_age, re.form = ~ 0)
)
ggplot(dental_fit, aes(age, distance)) +
  geom_line(aes(group = factor(id))) +
  geom_point(aes(y = pred_age), col = "blue", size = 2) + 
  labs(x = "Age, years", y = "Dental growth, mm")

Group effect

Question: Is the mean response varying in the two groups of individuals?

\[E[Y_{ij}] = \beta_0 + \beta_1X_{ij1} + \beta_2X_{ij2} + \beta_3X_{ij3} + \beta_4X_{ij4}\] where $X_{ij1}$, $X_{ij2}$, and $X_{ij3}$ are indicator variables for age 10, 12, and 14 years respectively. And $X_{ij4}$ is the indicator variable for sex (1 for boys and 0 for girls).

NB: The difference between the mean responses of boys and girls is $E[Y_{ij} | X_{ij4} = 1] - E[Y_{ij} | X_{ij4} = 0] = \beta_4$, that is adjusted or controlled for time.

lin_agesex <- lmer(distance ~ measurement + sex + (1 | id), data = dental_long)
summary(lin_agesex)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ measurement + sex + (1 | id)
   Data: dental_long

REML criterion at convergence: 433.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.8225 -0.5116 -0.0066  0.4169  3.6565 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 3.260    1.805   
 Residual             2.078    1.442   
Number of obs: 108, groups:  id, 27

Fixed effects:
                             Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)                   20.8098     0.6335 33.8026  32.850  < 2e-16 ***
measurementMeasure at age 10   0.9815     0.3924 78.0000   2.501  0.01447 *  
measurementMeasure at age 12   2.4630     0.3924 78.0000   6.277 1.80e-08 ***
measurementMeasure at age 14   3.9074     0.3924 78.0000   9.958 1.52e-15 ***
sexBoy                         2.3210     0.7614 25.0000   3.048  0.00538 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) mMaa10 mMaa12 mMaa14
msrmntMaa10 -0.310                     
msrmntMaa12 -0.310  0.500              
msrmntMaa14 -0.310  0.500  0.500       
sexBoy      -0.712  0.000  0.000  0.000

ci.lin(lin_agesex)

                               Estimate    StdErr         z             P
(Intercept)                  20.8097643 0.6334777 32.850032 1.137820e-236
measurementMeasure at age 10  0.9814815 0.3923780  2.501367  1.237147e-02
measurementMeasure at age 12  2.4629630 0.3923780  6.277017  3.451314e-10
measurementMeasure at age 14  3.9074074 0.3923780  9.958274  2.320555e-23
sexBoy                        2.3210227 0.7614168  3.048294  2.301443e-03
                                   2.5%     97.5%
(Intercept)                  19.5681707 22.051358
measurementMeasure at age 10  0.2124348  1.750528
measurementMeasure at age 12  1.6939163  3.232010
measurementMeasure at age 14  3.1383607  4.676454
sexBoy                        0.8286731  3.813372

The difference between mean dental distances of boys vs. girls is 2.3 (95% CI = 0.83-3.81) mm higher than girls, adjusting for time.
The emmeans can be used for estimating the mean responses for different combinations of sex and time measurements.

tidy(emmeans(lin_agesex, c("measurement", "sex")), conf.int = TRUE)

# A tibble: 8 × 9
  measurement       sex   estim…¹ std.e…²    df conf.…³ conf.…⁴ stati…⁵  p.value
  <chr>             <chr>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
1 Measure at age 8  Girl     20.8   0.633  33.8    19.5    22.1    32.9 3.11e-27
2 Measure at age 10 Girl     21.8   0.633  33.8    20.5    23.1    34.4 6.84e-28
3 Measure at age 12 Girl     23.3   0.633  33.8    22.0    24.6    36.7 7.83e-29
4 Measure at age 14 Girl     24.7   0.633  33.8    23.4    26.0    39.0 1.07e-29
5 Measure at age 8  Boy      23.1   0.542  38.0    22.0    24.2    42.7 1.06e-33
6 Measure at age 10 Boy      24.1   0.542  38.0    23.0    25.2    44.5 2.25e-34
7 Measure at age 12 Boy      25.6   0.542  38.0    24.5    26.7    47.2 2.43e-35
8 Measure at age 14 Boy      27.0   0.542  38.0    25.9    28.1    49.9 3.11e-36
# … with abbreviated variable names ¹estimate, ²std.error, ³conf.low,
#   ⁴conf.high, ⁵statistic

A graphical presentation of the fitted model can be easily obtained by

dental_fit$pred_agesex <- predict(lin_agesex, re.form = ~ 0)
ggplot(dental_fit, aes(age, distance)) +
  geom_line(aes(group = factor(id))) +
  geom_point(aes(y = pred_agesex, col = sex), size = 2) + 
  labs(x = "Age, years", y = "Dental growth, mm", col = "Sex")

Interaction between time and group

Question: Is the change of the mean response over time varying according to group of individuals?

lin_agesexinter <- lmer(distance ~ measurement*sex + (1 | id), data = dental_long)
lin_agesexinter

Linear mixed model fit by REML ['lmerModLmerTest']
Formula: distance ~ measurement * sex + (1 | id)
   Data: dental_long
REML criterion at convergence: 423.4085
Random effects:
 Groups   Name        Std.Dev.
 id       (Intercept) 1.813   
 Residual             1.405   
Number of obs: 108, groups:  id, 27
Fixed Effects:
                        (Intercept)         measurementMeasure at age 10  
                            21.1818                               1.0455  
       measurementMeasure at age 12         measurementMeasure at age 14  
                             1.9091                               2.9091  
                             sexBoy  measurementMeasure at age 10:sexBoy  
                             1.6932                              -0.1080  
measurementMeasure at age 12:sexBoy  measurementMeasure at age 14:sexBoy  
                             0.9347                               1.6847

The mean distance among girls at age 8 was 21.2 mm
The mean distance among girls at age 10 was 1.04 mm higher than age 8.
The mean distance among girls at age 14 was 2.9.0 mm higher than age 8.
The mean distance among boys at age 8 was 21.2+1.7= 22.9 mm
The mean distance among boys at age 10 was 1.04-0.11=0.93 mm higher than age 8.
$\dots$

Testing the interaction between group and time ::: {.cell}

Anova(lin_agesexinter, type = 3)

Analysis of Deviance Table (Type III Wald chisquare tests)

Response: distance
                   Chisq Df Pr(>Chisq)    
(Intercept)     938.2060  1  < 2.2e-16 ***
measurement      25.6468  3  1.131e-05 ***
sex               3.5525  1    0.05946 .  
measurement:sex   7.0847  3    0.06925 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

:::

The Wald test detects some evidence of interaction between age and sex ($p$ = 0.07), which suggests that the mean response profiles might not be parallel.
The marginal prediction for the combination of sex and time of measurement can be obtained as

tidy(emmeans(lin_agesexinter, c("measurement", "sex")), conf.int = TRUE)

# A tibble: 8 × 9
  measurement       sex   estim…¹ std.e…²    df conf.…³ conf.…⁴ stati…⁵  p.value
  <chr>             <chr>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
1 Measure at age 8  Girl     21.2   0.692  46.1    19.8    22.6    30.6 2.78e-32
2 Measure at age 10 Girl     22.2   0.692  46.1    20.8    23.6    32.1 3.33e-33
3 Measure at age 12 Girl     23.1   0.692  46.1    21.7    24.5    33.4 6.17e-34
4 Measure at age 14 Girl     24.1   0.692  46.1    22.7    25.5    34.8 9.40e-35
5 Measure at age 8  Boy      22.9   0.573  46.1    21.7    24.0    39.9 2.22e-37
6 Measure at age 10 Boy      23.8   0.573  46.1    22.7    25.0    41.5 3.66e-38
7 Measure at age 12 Boy      25.7   0.573  46.1    24.6    26.9    44.9 1.15e-39
8 Measure at age 14 Boy      27.5   0.573  46.1    26.3    28.6    47.9 5.87e-41
# … with abbreviated variable names ¹estimate, ²std.error, ³conf.low,
#   ⁴conf.high, ⁵statistic

The difference between the mean response among boys and girls depends on age.
The change in the mean response over time depends on sex

dental_fit$pred_agesexinter <- predict(lin_agesexinter, re.form = ~ 0)
ggplot(dental_fit, aes(age, distance)) +
  geom_line(aes(group = factor(id))) +
  geom_point(aes(y = pred_agesexinter, col = sex), size = 2) + 
  labs(x = "Age, years", y = "Dental growth, mm", col = "Sex")

Parametric curve

We can fit parametric curves to longitudinal data using the actual time value when the measurement was taken. In many studies, the true underlying mean response process changes over time in a relatively smooth, monotonically increasing/decreasing pattern. In our example we could model the relationship between age and the mean response using a simple linear trend among boys and girls.

Let’s assume a linear relationship between the time variable and the mean response allowing the linear trend to vary according to sex $E[Y_{ij}] = \beta_0 + \beta_1X_1 + \beta_2X_2 + + \beta_3X_1X_2$, where $X_1$ is the indicator variable for sex, $X_2$ is age, and $X_1X_2$ the interaction term.

lin_agecsexinter <- lmer(distance ~ sex*age + (1 | id), data = dental_long)
summary(lin_agecsexinter)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ sex * age + (1 | id)
   Data: dental_long

REML criterion at convergence: 433.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5980 -0.4546  0.0158  0.5024  3.6862 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 3.299    1.816   
 Residual             1.922    1.386   
Number of obs: 108, groups:  id, 27

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)  17.37273    1.18351 103.98636  14.679  < 2e-16 ***
sexBoy       -1.03210    1.53742 103.98636  -0.671   0.5035    
age           0.47955    0.09347  79.00000   5.130 2.02e-06 ***
sexBoy:age    0.30483    0.12142  79.00000   2.511   0.0141 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
           (Intr) sexBoy age   
sexBoy     -0.770              
age        -0.869  0.669       
sexBoy:age  0.669 -0.869 -0.770

The intercept $\beta_0 = 17.4$ is the mean distance among girls at age 0 (extrapolation, not interesting).
The coefficient of sex $\beta_1 =-1.03$ is the difference between the mean distance of boys vs. girls at age 0 (extrapolation, not interesting).
The coefficient of age $\beta_2 =0.48$ is the change in the mean response for every one year increment of age among girls.
The coefficient of the interaction term $\beta_3 =0.30$ is represents the additional change of the regression coefficient of sex for a certain value of age and viceversa.

Test linear hypotheses

K <- rbind(
  c(0, 1, 0, 8),
  c(0, 1, 0, 14),
  c(0, 0, 1, 1),
  c(0, 0, 2, 2),
  c(1, 1, 8, 8),
  c(1, 1, 14, 14),
  c(0, 0, 14-8, 14-8)
) 
rownames(K) <- c("Difference between boys and girls at age 8 years",
                 "Difference between boys and girls at age 14 years",
                 "Linear trend for 1 year increase of age among boys",
                 "Linear trend for 2 years increase of age among boys",
                 "Mean response at age 8 years among boys",
                 "Mean response at age 14 years among boys",
                 "Difference between age 14 and 8 among boys")
tidy(glht(lin_agecsexinter, linfct =  K), conf.int = TRUE) %>% 
  gt() %>% 
  fmt_number(columns = -1,decimals = 2)

contrast	estimate	conf.low	conf.high	statistic	adj.p.value
Difference between boys and girls at age 8 years	1.41	−0.71	3.52	1.67	0.31
Difference between boys and girls at age 14 years	3.24	1.12	5.35	3.83	0.00
Linear trend for 1 year increase of age among boys	0.78	0.59	0.98	10.12	0.00
Linear trend for 2 years increase of age among boys	1.57	1.18	1.96	10.12	0.00
Mean response at age 8 years among boys	22.62	21.27	23.96	41.98	0.00
Mean response at age 14 years among boys	27.32	25.97	28.67	50.71	0.00
Difference between age 14 and 8 among boys	4.71	3.54	5.87	10.12	0.00

Graphical presentation of the fitted model

pred_agecsexinter <- expand.grid(
  age = seq(8, 14, .5),
  sex = levels(dental_long$sex)
) %>% 
  bind_cols(pred = predict(lin_agecsexinter, newdata = ., re.form = ~ 0))
ggplot(pred_agecsexinter, aes(age, pred, col = sex)) +
  geom_line() +
  labs(x = "Age, years", y = "Dental growth, mm", col = "Sex")

Prediction random effects

In many applications, inference is focused on fixed effects (the overall change of the response over time). However, we can also estimate (or predict) individual-specific response trajectories over time.

This is known as “best linear unbiased predictor” (or BLUP). They are also known as empirical Bayes predictions. The idea is to predict the random effect (deviation from population mean) for each individual from realized, observed value of the response. The predicted BLUP response trajectory for the $i$-th individual is a weighted combination of estimated population averaged mean response profile and the subject’s observed response profile. This helps to explains why it is often said that empirical BLUP “shrinks” the $i$-th subject’s predicted response profile towards population-averaged mean.

Let’s assume a random-intercept model with only age (continuous) as covariate.

lin_agec <- lmer(distance ~ age + (1 | id), data = dental_long)
lin_agec

Linear mixed model fit by REML ['lmerModLmerTest']
Formula: distance ~ age + (1 | id)
   Data: dental_long
REML criterion at convergence: 447.0025
Random effects:
 Groups   Name        Std.Dev.
 id       (Intercept) 2.115   
 Residual             1.432   
Number of obs: 108, groups:  id, 27
Fixed Effects:
(Intercept)          age  
    16.7611       0.6602

We will focus on two specific children (id 10 and 21).

sid <- c(10, 21)
expand.grid(
  age = seq(8, 14, .5),
  id = sid
) %>% 
  bind_cols(
    indiv_pred = predict(lin_agec, newdata = .),
    marg_pred = predict(lin_agec, newdata = ., re.form = ~ 0)
  ) %>% 
  left_join(
    filter(dental_long, id %in% sid), by = c("id", "age")
  ) %>% 
  ggplot(aes(age, indiv_pred, group = id, col = factor(id))) +
  geom_line() +
  geom_point(aes(y = distance)) +
  geom_line(aes(y = marg_pred, col = "Marginal"), lwd = 1.5) +
  labs(x = "Age, years", y = "Dental growth, mm", col = "Curve")

The figure above provides graphical representation of the random intercept model.
Marginal (overall) mean response over time in the population increases linearly with age (denoted by the solid thick blue line).
Conditional mean responses for the two specific individuals are represented with different colored thin solid lines.

Random intercept and slope

In a random intercept and slope, also the regression coefficient of time $\beta_1$ is allowed to vary randomly among individuals:
\[E[Y_{ij}] = \beta_0 + \beta_1X_{ij1} + u_{0i} + u_{1i}X_{ij1} + \epsilon_{ij}\]

\[E[Y_{ij}] = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})X_{ij1} + \epsilon_{ij}\]

The assumptions about the random variables are:

\[\begin{equation} \begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim N \left(0,\begin{pmatrix} \sigma_0^2 & \sigma_{u01} \\ \sigma_{u10} & \sigma_1^2 \end{pmatrix} \right) \end{equation}\]

\[\epsilon_{ij} \sim N(0, \sigma^2)\]

The random effects and the residual random error are independent.

The conditional mean (individual-specific) is $E[Y_{ij} | X_{ij}, u_{0i}, u_{1i}] = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})X_{ij1}$
The marginal mean (population-average) is $E[Y_{ij}| X_{ij}] = \beta_0 + \beta_1X_{ij1}$

Fitting model in R

lin_agecr <- lmer(distance ~ age + (age | id), data = dental_long)
summary(lin_agecr)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ age + (age | id)
   Data: dental_long

REML criterion at convergence: 442.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2229 -0.4938  0.0073  0.4721  3.9161 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 id       (Intercept) 5.41657  2.3274        
          age         0.05128  0.2264   -0.61
 Residual             1.71616  1.3100        
Number of obs: 108, groups:  id, 27

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept) 16.76111    0.77527 25.99911  21.620  < 2e-16 ***
age          0.66019    0.07126 25.99805   9.265 1.01e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
    (Intr)
age -0.848

This model assumes that individuals vary not only in their baseline level of response (intercept), but also in terms of their changes (slope) in the mean response over time.

VarCorr(lin_agecr)

 Groups   Name        Std.Dev. Corr  
 id       (Intercept) 2.32735        
          age         0.22645  -0.609
 Residual             1.31002

as.data.frame(VarCorr(lin_agecr))

       grp        var1 var2        vcov      sdcor
1       id (Intercept) <NA>  5.41656658  2.3273518
2       id         age <NA>  0.05127907  0.2264488
3       id (Intercept)  age -0.32118277 -0.6094248
4 Residual        <NA> <NA>  1.71615750  1.3100219

There is a negative correlation (-0.32) between random intercepts and slopes: the greater is the mean response at baseline (8 years), the lower is likely to be the increase (slope) of the rate of change over time.

We can get the predicted value of the marginal (overall) and individual-specific mean response trajectories.

expand.grid(
  age = seq(8, 14, .5),
  id = unique(dental_long$id)
) %>% 
  bind_cols(
    indiv_pred = predict(lin_agecr, newdata = .),
    marg_pred = predict(lin_agecr, newdata = ., re.form = ~ 0)
  ) %>% 
  ggplot(aes(age, indiv_pred, group = id)) +
  geom_line(col = "grey") +
  geom_line(aes(y = marg_pred), col = "blue", lwd = 1.5) +
  labs(x = "Age, years", y = "Dental growth, mm")

Question: Is there any difference between boys and girls?

lin_agecsexinterr <- lmer(distance ~ age*sex + (age | id), data = dental_long)
summary(lin_agecsexinterr)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: distance ~ age * sex + (age | id)
   Data: dental_long

REML criterion at convergence: 432.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.1694 -0.3862  0.0070  0.4454  3.8490 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 id       (Intercept) 5.77449  2.4030        
          age         0.03245  0.1801   -0.67
 Residual             1.71663  1.3102        
Number of obs: 108, groups:  id, 27

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)  17.3727     1.2281 25.0078  14.147 1.94e-13 ***
age           0.4795     0.1037 25.0113   4.625 9.85e-05 ***
sexBoy       -1.0321     1.5953 25.0078  -0.647   0.5235    
age:sexBoy    0.3048     0.1347 25.0113   2.263   0.0326 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
           (Intr) age    sexBoy
age        -0.880              
sexBoy     -0.770  0.678       
age:sexBoy  0.678 -0.770 -0.880

There is evidence of significant interaction between sex and time in predicting dental distance (p for interaction = 0.032).

Graphical presentation

expand.grid(
  age = seq(8, 14, .5),
  id = unique(dental_long$id)
) %>% 
  left_join(select(dental, sex, id), by = "id") %>% 
  bind_cols(
    indiv_pred = predict(lin_agecsexinterr, newdata = .),
    marg_pred = predict(lin_agecsexinterr, newdata = ., re.form = ~ 0)
  ) %>% 
  ggplot(aes(age, indiv_pred, group = id)) +
  geom_line(col = "grey") +
  geom_line(aes(y = marg_pred, col = sex), lwd = 1.5) +
  labs(x = "Age, measurements", y = "Mean Dental growth, mm")

Generalized Estimation Equation

The geepack package contains functions for fitting population-averaged panel-data models. The function geeglm allows to estimate generalized linear models with different within-group correlation structure for the panels. Have a look at the help page ?geeglm for the different within-group correlation structure (corstr argument).

Some type of correlations

independence: the correlation matrix is diagonal. That is 0 out of main diagonal. No correlation among repeated measurements;
exchangeable: all pair of measurements on the same individual are equally correlated $\rho_{jk} = \rho$ (compound symmetry);
ar1: Correlation depends on time or distance between measurements j and k. Plausible for repeated measures where correlation is known to decline over time $\rho_{jk} = \rho^{|j-k|}$;
unstructured: no assumptions about the correlations;
userdefined: Correlation is specified by the user.

Things to consider when you choose a correlation matrix:
In general if the number of repeated measures per individual is small in a balanced and complete design, then an unstructured matrix is recommended. If data are unbalanced, unequally spaced, and with gaps (missing) the more appropriate correlation matrices are exchangeable and unstructured. Furthermore, when the order of the measurement it doesn’t matter an exchangeable correlation matrix could be appropriate.

Let’s assume the following marginal model for the dental distance:

\[E[Y_{ij}| X_{ij}] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2\] where $X_1$ is the indicator variable for sex (1 for boys and 0 for girls), $X_2$ is age, and $X_1X_2$ is the interaction between sex and age.

gee_inter <- geeglm(distance ~ sex*age, data = dental_long,
                    id = id, family = gaussian, corstr = "exchangeable")
summary(gee_inter)


Call:
geeglm(formula = distance ~ sex * age, family = gaussian, data = dental_long, 
    id = id, corstr = "exchangeable")

 Coefficients:
            Estimate  Std.err    Wald Pr(>|W|)    
(Intercept) 17.37273  0.72521 573.869  < 2e-16 ***
sexBoy      -1.03210  1.37779   0.561   0.4538    
age          0.47955  0.06313  57.697 3.05e-14 ***
sexBoy:age   0.30483  0.11687   6.803   0.0091 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    4.905   1.015
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.6178  0.1312
Number of clusters:   27  Maximum cluster size: 4

tidy(gee_inter, conf.int = TRUE)

# A tibble: 4 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   17.4      0.725    574.    0         16.0       18.8  
2 sexBoy        -1.03     1.38       0.561 4.54e- 1  -3.73       1.67 
3 age            0.480    0.0631    57.7   3.05e-14   0.356      0.603
4 sexBoy:age     0.305    0.117      6.80  9.10e- 3   0.0758     0.534

The Wald-test detects a statistically significant ($p < 0.05$) interaction between age and sex in predicting distance. That is, the fitted trends are not parallel, the linear pattern of change differ comparing boys and girls.

K <- rbind(
  "mean response comparing boys vs. girls" = c(0, 1, 0, 0),
  "mean distance increases every 1 year increase of age among girls" = c(0, 0, 1, 0),
  "mean distance increases every 1 year increase of age among boys" = c(0, 0, 1, 1)
)
tidy(glht(gee_inter, linfct = K), conf.int = TRUE) %>% 
    gt() %>% 
  fmt_number(columns = -1,decimals = 2)

contrast	estimate	conf.low	conf.high	statistic	adj.p.value
mean response comparing boys vs. girls	−1.03	−4.25	2.19	−0.75	0.77
mean distance increases every 1 year increase of age among girls	0.48	0.33	0.63	7.60	0.00
mean distance increases every 1 year increase of age among boys	0.78	0.55	1.01	7.98	0.00

Graphical presentation

pred_geeinter <- ggpredict(gee_inter, terms = c("age", "sex"))
ggplot(pred_geeinter, aes(x, predicted, col = group)) + 
  geom_line() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high, fill = group), alpha = .2, col = NA) +
  labs(x = "Age, years", y = "Dental growth, mm", col = "Sex", fill = "Sex")

Other correlation structure

gee_inter_exch <- gee_inter 
gee_inter_ind <- geeglm(distance ~ sex*age, data = dental_long,
                        id = id, family = gaussian, corstr = "independence")
gee_inter_unstr <- geeglm(distance ~ sex*age, data = dental_long,
                          id = id, family = gaussian, corstr = "unstructured")
gee_inter_ar1 <- geeglm(distance ~ sex*age, data = dental_long,
                        id = id, family = gaussian, corstr = "ar1")
# comparison
QIC(gee_inter_exch, gee_inter_ind, gee_inter_unstr, gee_inter_ar1)

                  QIC  QICu Quasi Lik   CIC params  QICC
gee_inter_exch  543.1 537.8    -264.9 6.651      4 545.9
gee_inter_ind   543.1 537.8    -264.9 6.651      4 544.9
gee_inter_unstr 542.6 537.8    -264.9 6.420      4 556.4
gee_inter_ar1   544.6 538.8    -265.4 6.864      4 547.4

Influence of Menarche on Changes in Body Fat

Data

Prospective study on body fat accretion in a cohort of 162 girls from the MIT Growth and Development Study. At start of study, all the girls were pre-menarcheal and non-obese.
All girls were followed over time according to a schedule of annual measurements until four years after menarche. The final measurement was scheduled on the fourth anniversary of their reported date of menarche.
At each examination, a measure of body fatness was obtained based on bioelectric impedance analysis.

Reference: Phillips, S.M., Bandini, L.G., Compton, D.V., Naumova, E.N. and Must, A. (2003). A longitudinal comparison of body composition by total body water and bioelectrical impedance in adolescent girls. Journal of Nutrition, 133, 1419-1425.

The bodyfat contains the following variables:
id = a sequential number;
agec = Current age, years;
agem = Age at menarche, years;
time = Time relative to menarche, years;
pbf = Percent Body Fat;
occasion = Number of sequential measurement.

#load("data/bodyfat.RData")
load(url("http://alecri.github.io/downloads/data/bodyfat.RData"))

head(bodyfat)

# A tibble: 6 × 6
     id  agec  agem    time   pbf occasion
  <dbl> <dbl> <dbl>   <dbl> <dbl>    <dbl>
1     1  9.32  13.2 -3.87    7.94        1
2     1 10.3   13.2 -2.86   15.6         2
3     1 11.2   13.2 -1.95   13.5         3
4     1 12.2   13.2 -1      23.2         4
5     1 13.2   13.2  0.0500 10.5         5
6     1 14.2   13.2  1.05   20.5         6

Descriptive statistics

Trajectories over time

ggplot(bodyfat, aes(agec, pbf)) +
  geom_line(aes(group = id)) + 
  geom_smooth(se = FALSE, size = 2) +
  geom_vline(xintercept = 12.86, linetype = "dashed") +
  labs(x = "Age, years", y = "Percent Body Fat")

Let’s consider the hypothesis that percent body fat accretion increases linearly with age, but with different slopes before/after menarche. We assume that each girl has a piecewise linear spline growth curve with a knot at the time of menarche.
Each girl’s growth curve can be described with an intercept and two slopes, one slope for changes in response before menarche, another slope for changes in response after menarche. Note the knot is not a fixed age for all subjects.

Linear splines

If the simplest possible curve is a straight line, then one way to extend the curve is to have sequence of joined line segments that produces a piecewise linear pattern. Linear spline models provide flexible way to accommodate many non-linear trends that cannot be approximated by simple polynomials in time.

Basic idea: Divide time axis into series of segments and consider piecewise-linear trends, having different slopes but joined at fixed times. Locations where lines are tied together are known as knots. Resulting piecewise-linear curve is called a spline.

A piece-wise linear mixed model can be written as
\[E[Y_{ij | X_{ij}}] = \beta_0 + \beta_1X_{ij1} + \beta_2X_{ij2}]\]

\[E[Y_{ij} | X_{ij}, u_{i}] = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})X_{ij1} + (\beta_2 + u_{12})X_{ij2}]\] where $X_{ij1}$ is time since menarche of the $j$-th measurement on $i$-th subject (i.e., $X_{ij1} = 0$ at menarche); $X_{ij2} = \textrm{max}(X_{ij1} - 0, 0)$ is the linear spline of $X_{ij1}$ with a knot at 0.

Marginal model

$\beta_0$: is the population mean percent body fat at menarche (when $X_{ij1} = 0$)
$\beta_1$: is the population slope (linear trend) for time before menarche (when $X_{ij1} < 0$)
$\beta_2$: is the additional change in the population slope (linear trend) for time after menarche (when $X_{ij1} > 0$)
$\beta_1 + \beta_2$: is the population slope (linear trend) for time after menarche (when $X_{ij1} > 0$)

Conditional model

$\beta_0 + u_{0i}$: is the intercept for $i$-th subject and is the true percent body fat at menarche (when $X_{ij1} = 0$)
$\beta_1 + u_{1i}$: is the $i$-th subject slope, or rate of change in percent body fat during the pre-menarcheal period
$\beta_1 + \beta_2 + u_{1i} + u_{2i}$: is the $i$-th subject slope, or rate of change during the post-menarcheal period. (when $X_{ij1} > 0$)

Fitting model in R

bodyfat <- bodyfat %>%
  mutate(timepost = pmax(time, 0))
lin_lspl0 <- lmer(pbf ~ time + timepost + (time + timepost | id), data = bodyfat)
summary(lin_lspl0)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: pbf ~ time + timepost + (time + timepost | id)
   Data: bodyfat

REML criterion at convergence: 6062

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-2.774 -0.590 -0.036  0.595  3.380 

Random effects:
 Groups   Name        Variance Std.Dev. Corr       
 id       (Intercept) 45.94    6.78                
          time         1.63    1.28      0.29      
          timepost     2.75    1.66     -0.54 -0.83
 Residual              9.47    3.08                
Number of obs: 1049, groups:  id, 162

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)   21.361      0.565 161.566   37.84  < 2e-16 ***
time           0.417      0.157 108.460    2.65   0.0091 ** 
timepost       2.047      0.228 132.682    8.98  2.3e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr) time  
time      0.351       
timepost -0.515 -0.872

Based on magnitude of $beta_2$, relative to its standard error, slopes before and after menarche differ.

K <- rbind(
  "population mean pre-menarcheal slope" = c(0, 1, 0),
  "population mean post-menarcheal slope" = c(0, 1, 1)
)
tidy(glht(lin_lspl0, linfct = K), conf.int = TRUE)

# A tibble: 2 × 7
  contrast                       null.…¹ estim…² conf.…³ conf.…⁴ stati…⁵ adj.p…⁶
  <chr>                            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 population mean pre-menarchea…       0   0.417  0.0674   0.767    2.65  0.0156
2 population mean post-menarche…       0   2.46   2.20     2.73    20.7   0     
# … with abbreviated variable names ¹null.value, ²estimate, ³conf.low,
#   ⁴conf.high, ⁵statistic, ⁶adj.p.value

Estimate of the population mean pre-menarcheal slope is 0.42 (95% CI = 0.067-0.767%).
Estimated slope is rather shallow and indicates that the annual rate of body fat accretion is less that 0.5%.

Estimate of population mean post-menarcheal slope is (0.42+2.04) = 2.46, which indicates that annual rate of growth in body fat is approximately 2.5% (95% CI = 2.2-2.7%), almost 6 times higher than pre-menarche.

Covariance structure ::: {.cell}

VarCorr(lin_lspl0)

 Groups   Name        Std.Dev. Corr       
 id       (Intercept) 6.78                
          time        1.28      0.29      
          timepost    1.66     -0.54 -0.83
 Residual             3.08

as.data.frame(VarCorr(lin_lspl0))

       grp        var1     var2   vcov   sdcor
1       id (Intercept)     <NA> 45.940  6.7779
2       id        time     <NA>  1.631  1.2771
3       id    timepost     <NA>  2.749  1.6580
4       id (Intercept)     time  2.526  0.2919
5       id (Intercept) timepost -6.109 -0.5436
6       id        time timepost -1.750 -0.8266
7 Residual        <NA>     <NA>  9.473  3.0779

:::

Estimated variance of $u_{1i}$ is 1.63, indicating substantial variability from girl to girl in rates of percent body fat accretion during the pre-menarcheal period.
Approximately 95% of girls have changes in percent body fat between -2.09% and 2.92%, i.e. $0.42 \pm 1.96 × \sqrt{1.63}$.

Estimated variance of slopes during post-menarcheal period, $\textrm{Var}(u_{1i} + u_{2i})$ is 0.88 (or [1.63 + 2.75 −2 × 1.75]), indicating less variability in the slopes after menarche. Approximately 95% of girls have changes in percent body fat between 0.62% and 4.30%, i.e., $2.46 \pm 1.96 × \sqrt{0.88}$

The mixed effects model can be used to obtain estimates of each girl’s growth trajectory over time (fixed plus realization of random effects). We next present graphically the estimated population mean growth curve and predicted (empirical BLUP) growth curves for two girls (id = 16 and id = 4).

A noticeable feature of the predicted growth curves is that there is more shrinkage towards the population mean curve when fewer data points are available.

sid <- c(4, 16)
p_lin_lspl0 <- expand.grid(
  time = seq(-6, 4.1, .05),
  id = sid
) %>% 
  mutate(timepost = pmax(time, 0)) %>% 
  bind_cols(
    indiv_pred = predict(lin_lspl0, newdata = .),
    marg_pred = predict(lin_lspl0, newdata = ., re.form = ~ 0)
  ) %>% 
  ggplot(aes(time, indiv_pred, group = id, col = factor(id))) +
  geom_line(aes(y = marg_pred), col = "black", lwd = 1.5) +
  geom_line(aes(linetype = "BLUP")) +
  geom_point(data = filter(bodyfat, id %in% sid), 
             aes(y = pbf, shape = factor(id), col = factor(id))) +
  guides(shape = "none") +
  labs(x = "Time relative to menarche, years", y = "Percent Body Fat", 
       col = "Id", linetype = "Prediction")
p_lin_lspl0

Predicted growth curve for girl with only 6 data points is pulled closer to the population mean curve. Predicted growth curve for girl with 10 observation follows her data more closely.

This becomes more apparent when empirical BLUPs are compared to ordinary least squares (OLS) estimates based only on observations from each girl.

lm4 <- lm(pbf ~ time + timepost, data = filter(bodyfat, id == 4))
lm16 <- lm(pbf ~ time + timepost, data = filter(bodyfat, id == 16))

newd4_16 <- data.frame(
  time = seq(-6, 4.1, .05),
  timepost = pmax(seq(-6, 4.1, .05), 0)
  ) %>% 
  mutate(
    pred4 = predict(lm4, newdata = .),
    pred16 = predict(lm16, newdata = .)
  ) %>% 
  pivot_longer(cols = starts_with("pred"), names_to = "id", values_to = "pred") %>% 
  mutate(id = parse_number(id))
  
p_lin_lspl0 +
  geom_line(data = newd4_16, aes(y = pred, linetype = "OLS"))