Introduction
In this article, we will recreate the results of three case studies of the book Applied Longitudinal Analysis, 2nd Edition, by Garrett Fitzmaurice, Nan Laird & James Ware. In particular, we are interested in understanding the results of chapter 8: Linear Mixed-Effects Models. In general, linear mixed-effect models are powerful statistical tools that allow researchers to analyze data where observations are not independent, capturing both individual variations and overall trends. The case studies presented article showcase linear mixed-effects models in three distinct areas of health research: lung function growth in children, body fat accretion in girls, CD4 count changes in AIDS patients. Each study explores critical public health concerns using longitudinal data to understand the progression of health indicators over time. By applying these models, we can more accurately interpret the effects of various factors on health outcomes, leading to more informed decisions in public health and clinical practice.
Vocabulary
Longitudinal Data: Longitudinal data refers to data that is collected from the same subjects repeatedly over a period of time. This type of data allows researchers to observe changes and trends within the same individuals.
Random Effects: Components of a statistical model that account for variability at the individual level, often due to differences in subjects.
Fixed Effects: Components of a statistical model that capture the overall population effects that are consistent across all subjects.
Mixed-Effects Model: A statistical model that incorporates both fixed effects (common to all individuals) and random effects (specific to individual subjects), while also assuming a linear relationship between the variables being studied.
Best Linear Unbiased Predictor (BLUP): A statistical estimator used in mixed-effects models to predict random effects for individual subjects. BLUP provides the most accurate (best) prediction of random effects by minimizing error variance, while ensuring that the predictions are linear functions of the observed data and unbiased.
Why use Linear Mixed-Effect Models?
Handles Hierarchical Data: They are designed to analyze data that have a hierarchical or nested structure, such as repeated measurements taken from the same individuals over time.
Accounts for Individual Variability: By including random effects, these models can account for the variability between subjects, allowing for more accurate modeling of individual differences.
Suitable for Unbalanced Data: Linear mixed-effects models are effective in handling unbalanced data, where the number of observations varies across individuals.
Models Complex Relationships: They allow for the modeling of complex relationships between variables.
Reduces Bias: These models help reduce bias in estimates that could arise from not accounting for the correlation between repeated measures on the same individual.
Predicts Subject-Specific Trajectories: Linear mixed-effects models can predict subject-specific outcomes over time, making them useful for longitudinal studies.
Study # 1: Six Cities Study of Air Pollution and Healt
Research Questions
- How do age and height influence lung function growth in children and adolescents?
- What are the differences between the longitudinal (within-child) and cross-sectional (between-child) effects on lung function over time?
Data
Study Design: Longitudinal study aimed at characterizing lung growth in children and adolescents, focusing on changes in pulmonary function and factors influencing lung function growth.
Cohort: The study included 13,379 children born on or after 1967, from six different communities across the United States (Watertown, MA; Kingston and Harriman, TN; St. Louis, MO; Steubenville, OH; Portage, WI; and Topeka, KS).
Age Range: Participants were typically enrolled in the study between the ages of 6 and 7, with annual measurements taken until they graduated from high school or were lost to follow-up.
Measurements: Lung function was assessed using spirometry, which involves measuring the volume of air exhaled during a forced exhalation after maximal inspiration. Various metrics, including FEV1 (Forced Expiratory Volume in one second), were derived from the spirometric curve.
Data Collection: In addition to spirometry, a respiratory health questionnaire was completed annually by a parent or guardian, providing supplementary data on respiratory health and potential influencing factors.
Variables:
This data has 6 variables and 1994 observations.
id: a unique identifier assigned to each participant in the study, representing individual subjects
age: the current age of the participant at the time of the measurement
height: the current height of the participant, measured in centimeters, at the time of the measurement
age0: the participant’s age at the baseline measurement, representing their age when they first entered the study
height0: the participant’s height at the baseline measurement, representing their height when they first entered the study
logFEV1: the natural logarithm of the Forced Expiratory Volume in one second (FEV1), a measure of lung function
## Subject.ID Height Age Initial.Height Initial.Age Log.FEV1.
## 1 1 1.20 9.3415 1.2 9.3415 0.21511
## 2 1 1.28 10.3929 1.2 9.3415 0.37156
## 3 1 1.33 11.4524 1.2 9.3415 0.48858
## 4 1 1.42 12.4600 1.2 9.3415 0.75142
## 5 1 1.48 13.4182 1.2 9.3415 0.83291
## 6 1 1.50 15.4743 1.2 9.3415 0.89200
Figure 8.4:
# Plot the data
ggplot(random_girls, aes(x = Age, y = Log.FEV1., group = Subject.ID, color = as.factor(Subject.ID))) +
geom_line() +
geom_point(shape = 1) +
labs(
title = "Time Plot of log(FEV1/height) vs Age",
caption = "Source: Six Cities Study of Air Pollution and Health",
x = "Age (years)",
y = "log(FEV1/height)"
) +
theme_minimal() +
theme(legend.position = "none")df1 <- df1 %>% mutate(HeightGroup = cut(Height,
breaks = c(-Inf, 1.42, 1.60, Inf),
labels = c("Short", "Average", "Tall"),
include.lowest = TRUE))# Create the stratified plot
ggplot(df1, aes(x = Age, y = Log.FEV1., color = HeightGroup)) +
geom_point(alpha = 0.5) + # Scatter plot points
geom_smooth(method = 'lm', se = FALSE) + # Regression lines
labs(title = "Stratified Plot: Log(FEV1) vs Age by Height Group",
x = "Age (years)",
y = "Log(FEV1)",
color = "Height Group") +
theme_minimal() +
scale_color_manual(values = c("Short" = "purple", "Average" = "orange", "Tall" = "darkblue"))## `geom_smooth()` using formula = 'y ~ x'
# Exponential transformation of logFEV1
fev1.e <- exp(df1$Log.FEV1.)
# Box-and-whisker plot to reveal outliers using ggplot2
ggplot(df1, aes(y = Log.FEV1.) ) +
geom_boxplot() +
labs(title = "Boxplot of log(FEV1)", y = "log(FEV1)")## Subject.ID Height Age Initial.Height Initial.Age Log.FEV1. HeightGroup
## 1484 197 1.22 9.1362 1.22 9.1362 -0.69315 Short
The Model
- This model accounts for both the longitudinal effects (how age and height affect log(FEV1) over time) and the cross-sectional effects (how initial age and initial height affect log(FEV1)).
\[ E(Y_{ij} \mid b_i) = \beta_1 + \beta_2 \text{Age}_{ij} + \beta_3 \log(\text{Ht}_{ij}) + \beta_4 \text{Age}_{i1} + \beta_5 \log(\text{Ht}_{i1}) + b_{1i} + b_{2i} \text{Age}_{ij} \]
\(\beta_1\) (Fixed Intercept):
- Mathematical Meaning:
- \(\beta_1\) is the fixed intercept, representing the average starting point of the response variable across all individuals when all predictors (Age, log(Height), etc.) are zero.
- In a Plot:
- This determines the point where the population-average line (fixed effects line) intersects the y-axis. It is the baseline value of the response variable before considering any individual-specific deviations.
- \(\beta_2 \text{Age}_{ij}\)
(Fixed Effect of Age at Time \(j\)):
- Mathematical Meaning:
- \(\beta_2\) represents the fixed effect of age on the response variable. It shows how the response variable changes with age for the \(i\)-th individual at the \(j\)-th time point.
- In a Plot:
- This term would manifest as the slope of the population-average line (fixed effects line) in relation to age. A positive \(\beta_2\) means the response variable increases with age, while a negative \(\beta_2\) would indicate a decrease.
- Mathematical Meaning:
- \(\beta_3
\log(\text{Ht}_{ij})\) (Fixed Effect of Log(Height) at Time \(j\)):
- Mathematical Meaning:
- \(\beta_3\) represents the fixed effect of the logarithm of height on the response variable. It captures how changes in height (log-transformed) over time affect the response variable.
- In a Plot:
- This term would influence the slope of the line as height changes. If plotting against height, this term would modify the slope, showing how the response variable is expected to change as height increases.
- Mathematical Meaning:
- \(\beta_4 \text{Age}_{i1}\)
(Fixed Effect of Initial Age):
- Mathematical Meaning:
- \(\beta_4\) captures the fixed effect of the initial age (\(\text{Age}_{i1}\)) on the response variable. This term represents how the starting age of an individual influences the response variable.
- In a Plot:
- This term shifts the entire population-average line up or down based on the initial age of the individual. It modifies the intercept of the line for each individual depending on their starting age.
- Mathematical Meaning:
- \(\beta_5
\log(\text{Ht}_{i1})\) (Fixed Effect of Initial
Log(Height)):
- Mathematical Meaning:
- \(\beta_5\) represents the fixed effect of the initial log(height) on the response variable. It shows how the starting height of an individual influences the response variable.
- In a Plot:
- Similar to \(\beta_4\), this term would adjust the intercept of the population-average line based on the individual’s initial height.
- Mathematical Meaning:
- \(b_{1i}\) (Random
Intercept for Individual \(i\)):
- Mathematical Meaning:
- \(b_{1i}\) is the random intercept for the \(i\)-th individual. It captures the deviation of that specific individual’s intercept from the population average intercept \(\beta_1\).
- In a Plot:
- This term shifts the individual’s line up or down, representing the unique starting point for that individual. Each individual’s line would start at a different point on the y-axis based on \(b_{1i}\).
- Mathematical Meaning:
- \(b_{2i} \text{Age}_{ij}\)
(Random Slope for Age for Individual \(i\)):
- Mathematical Meaning:
- \(b_{2i}\) is the random slope for age for the \(i\)-th individual. It allows the effect of age on the response variable to vary between individuals, meaning each individual can have a different age-response relationship.
- In a Plot:
- This term changes the slope of the individual’s line. Some individuals might have a steeper increase or a flatter increase (or decrease) in the response variable with age, depending on their specific \(b_{2i}\) value.
- Mathematical Meaning:
# Filter the data to exclude outliers
filtered_data <- df1 %>% filter(Log.FEV1. > -0.5)
# Base model
lm_model <- lm(Log.FEV1. ~ Age + log(Height) + Initial.Age + log(Initial.Height), data = filtered_data)
# Fit the linear mixed-effects model
fm1 <- lmer(Log.FEV1. ~ Age + log(Height) + Initial.Age + log(Initial.Height) + (Age | Subject.ID), data = filtered_data)Compare Models Using Log-Likelihood:
The log-likelihood (
log Lik.) is a measure of how well a statistical model fits the data. The higher the log-likelihood, the better the model fits the data, assuming the models are comparable.The degrees of freedom (
df) represents the number of parameters estimated in the model. A model with more parameters (higher df) generally has the potential to fit the data better, but it also runs the risk of overfitting.
## 'log Lik.' 1500.868 (df=6)
## 'log Lik.' 2283.941 (df=9)
- In this case, the linear mixed-effect model with a log-likelihood of 2283.941 seems fits the data better than the base model with a log-likelihood of 1500.868.
Table 8.2:
# Summary of the model
#summary(fm1)
# Extract the fixed effects
fixed_effects <- summary(fm1)$coefficients
# Create a data frame for the table
table_data <- data.frame(
Variable = rownames(fixed_effects),
Estimate = round(fixed_effects[, "Estimate"], 4),
SE = round(fixed_effects[, "Std. Error"], 4),
z_value = round(fixed_effects[, "t value"], 2)
)
# Create the table using kable
knitr::kable(table_data,
col.names = c("Variable", "Estimate", "SE", "z"),
caption = "Table 1: Estimated regression coefficients (fixed effects) and standard errors for the log(FEV1) data from the Six Cities Study.")| Variable | Estimate | SE | z | |
|---|---|---|---|---|
| (Intercept) | (Intercept) | -0.2883 | 0.0387 | -7.45 |
| Age | Age | 0.0235 | 0.0014 | 16.86 |
| log(Height) | log(Height) | 2.2372 | 0.0435 | 51.39 |
| Initial.Age | Initial.Age | -0.0165 | 0.0075 | -2.21 |
| log(Initial.Height) | log(Initial.Height) | 0.2182 | 0.1455 | 1.50 |
Table 8.3:
## (Intercept) Age
## (Intercept) 1.22084048 -0.043260149
## Age -0.04326015 0.005010876
## attr(,"stddev")
## (Intercept) Age
## 0.110491650 0.007078754
## attr(,"correlation")
## (Intercept) Age
## (Intercept) 1.0000000 -0.5530976
## Age -0.5530976 1.0000000
Table 8.4:
Had trouble generating.
Table 8.5
fm2 <- lmer(Log.FEV1. ~ Age + log(Height) + Initial.Age + log(Initial.Height) + (1 + log(Height) | Subject.ID), data = df1)
summary(fm2)## Linear mixed model fit by REML ['lmerMod']
## Formula: Log.FEV1. ~ Age + log(Height) + Initial.Age + log(Initial.Height) +
## (1 + log(Height) | Subject.ID)
## Data: df1
##
## REML criterion at convergence: -4526.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -6.5346 -0.5034 0.0795 0.5679 2.8537
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject.ID (Intercept) 0.018813 0.13716
## log(Height) 0.075257 0.27433 -0.67
## Residual 0.003536 0.05946
## Number of obs: 1994, groups: Subject.ID, 300
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.274187 0.042218 -6.495
## Age 0.023201 0.001249 18.577
## log(Height) 2.258933 0.046442 48.640
## Initial.Age -0.023087 0.008019 -2.879
## log(Initial.Height) 0.347228 0.156489 2.219
##
## Correlation of Fixed Effects:
## (Intr) Age lg(Hg) Intl.A
## Age 0.046
## log(Height) -0.112 -0.852
## Initial.Age -0.821 -0.159 0.142
## lg(Intl.Hg) 0.364 0.220 -0.233 -0.811
Study # 2: Study of Influence of Menarche on Changes in Body Fat Accretion
Data
Prospective study on body fat accretion in a cohort of 162 girls from the MIT Growth and Development Study.
At start of study, all the girls were pre-menarcheal and non-obese.
All girls were followed over time according to a schedule of annual measurements until four years after menarche.
The final measurement was scheduled on the fourth anniversary of their reported date of menarche.
At each examination, a measure of body fatness was obtained based on bioelectric impedance analysis.
## id p.age m.age m.time pbf
## 1 1 9.32 13.19 -3.87 7.94
## 2 1 10.33 13.19 -2.86 15.65
## 3 1 11.24 13.19 -1.95 13.51
## 4 1 12.19 13.19 -1.00 23.23
## 5 1 13.24 13.19 0.05 10.52
## 6 1 14.24 13.19 1.05 20.45
Variables:
This data has 5 variables and multiple observations.
id: a factor representing the subject ID, with multiple levels corresponding to different subjects. p.age: a numeric vector indicating the current age of the subject (in years). m.age: a numeric vector representing the age at menarche (in years) for the subject. m.time: a numeric vector representing the time relative to menarche (in years). pbf: a numeric vector indicating the percent body fat of the subject.
Figure 8.5:
ggplot(df2, aes(x = m.time, y = pbf, group = id)) +
geom_line(size = 0.5, color = "turquoise") +
geom_point(size = 0.5, color = "black") +
labs(title = "Time Plot of percent body fat vs time", x = "Time relative to menarche (weeks)", y = "Percent body fat") +
theme_minimal()## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplot(df2, aes(x = m.time, y = pbf)) +
geom_point(size = 0.5, color = "black") +
geom_smooth(method = "loess", se = FALSE, color = "turquoise") +
geom_vline(xintercept = 0, linetype = "dashed") +
labs(
title = "Time plot of percent body fat vs time",
x = "Time relative to menarche (years)",
y = "Percent body fat") +
theme_minimal()## `geom_smooth()` using formula = 'y ~ x'
The lowess curve reveals that the mean response remains relatively flat during the pre-menarcheal period and then rises sharply during the post-menarcheal period.
The Model
Let \(t_{ij}\) denote the time of the \(j\)-th measurement on the \(i\)-th subject before or after menarche (i.e., \(t_{ij} = 0\) at menarche). We consider following linear mixed effects model:
\[ E(Y_{ij} \mid b_i) = \beta_1 + \beta_2 t_{ij} + \beta_3 (t_{ij})^+ + b_{1i} + b_{2i} t_{ij} + b_{3i} (t_{ij})^+ \]
Where:
- \(Y_{ij}\): The response variable for the \(i\)-th individual at the \(j\)-th time point (e.g., Percent Body Fat).
- \(b_i = (b_{1i}, b_{2i}, b_{3i})\): Random effects associated with the \(i\)-th individual.
- \(t_{ij}\): The time of the \(j\)-th measurement on the \(i\)-th subject relative to menarche (i.e., \(t_{ij} = 0\) at menarche).
- \((t_{ij})^+ = \max(t_{ij}, 0)\): The post-menarche time for the \(i\)-th individual (0 for pre-menarcheal measurements, otherwise \(t_{ij}\)).
- \(\beta_1\) (Fixed
Intercept at Menarche):
- Mathematical Meaning:
- \(\beta_1\) represents the fixed intercept, which is the expected percent body fat at the time of menarche across all individuals, assuming all other factors are constant.
- R Code:
- This term is part of the fixed effects in the model and would be
estimated as the intercept when running the linear mixed-effects model
(
lmerfunction in R).
- This term is part of the fixed effects in the model and would be
estimated as the intercept when running the linear mixed-effects model
(
- In a Plot:
- This term determines where the population-average line intersects the y-axis at the time of menarche (when \(t_{ij} = 0\)).
- Mathematical Meaning:
- \(\beta_2 t_{ij}\) (Fixed
Effect of Time Before Menarche):
- Mathematical Meaning:
- \(\beta_2\) is the fixed slope for the time before menarche. It represents the rate of change in percent body fat as time progresses before menarche.
- R Code:
- This would be estimated by the coefficient for the
m.timevariable whenm.timeis negative (pre-menarcheal period).
- This would be estimated by the coefficient for the
- In a Plot:
- This term defines the slope of the line representing the population-average trend of body fat percentage before menarche. A positive \(\beta_2\) would indicate an increasing trend, while a negative \(\beta_2\) would indicate a decreasing trend.
- Mathematical Meaning:
- \(\beta_3 (t_{ij})^+\)
(Fixed Effect of Time After Menarche):
- Mathematical Meaning:
- \(\beta_3\) captures the change in slope after menarche. It represents the difference in the rate of body fat accretion post-menarche compared to pre-menarche.
- R Code:
- This effect would be captured by the interaction term or the
m.time:stageterm in the model, wherestageindicates whether the time is pre- or post-menarche.
- This effect would be captured by the interaction term or the
- In a Plot:
- This term would alter the slope of the line after menarche. If \(\beta_3 > 0\), the slope after menarche would be steeper (indicating faster body fat accretion), and if \(\beta_3 < 0\), it would be shallower.
- Mathematical Meaning:
- \(b_{1i}\) (Random
Intercept for Individual \(i\)):
- Mathematical Meaning:
- \(b_{1i}\) represents the random deviation of the \(i\)-th individual’s intercept from the population average intercept (\(\beta_1\)).
- R Code:
- This random effect is modeled using
(1 | id)in thelmerfunction, allowing each individual to have their unique intercept.
- This random effect is modeled using
- In a Plot:
- This shifts the individual growth curve vertically, allowing for different starting points of percent body fat at menarche for each individual.
- Mathematical Meaning:
- \(b_{2i} t_{ij}\) (Random
Slope for Time Before Menarche for Individual \(i\)):
- Mathematical Meaning:
- \(b_{2i}\) is the random slope for the time before menarche, allowing the rate of change in body fat before menarche to vary between individuals.
- R Code:
- This is captured by including
(m.time | id)in thelmermodel, allowing individual slopes before menarche to vary.
- This is captured by including
- In a Plot:
- This term allows the slope of the body fat trend before menarche to differ among individuals. Some might have steeper increases or shallower trends based on their specific random slope \(b_{2i}\).
- Mathematical Meaning:
- \(b_{3i} (t_{ij})^+\)
(Random Slope for Time After Menarche for Individual \(i\)):
- Mathematical Meaning:
- \(b_{3i}\) represents the random slope for the time after menarche, allowing for individual variability in the rate of body fat accretion post-menarche.
- R Code:
- Modeled in the
lmerfunction with the term(m.time:stage | id), capturing the random effect of post-menarche slope differences.
- Modeled in the
- In a Plot:
- This term modifies the slope of the line after menarche for each individual. It accounts for individual differences in how body fat accretion changes post-menarche.
- Mathematical Meaning:
Summary in a Plot:
- Population-Level Line (Fixed Effects):
- The general trend for percent body fat across the population is captured by the fixed effects (\(\beta_1\), \(\beta_2\), \(\beta_3\)). This trend shows how body fat changes over time, with a potential change in slope at menarche.
- Individual Lines (Random Effects):
- Each individual’s line deviates from the population-level line due to their random intercept (\(b_{1i}\)) and random slopes (\(b_{2i}\), \(b_{3i}\)). These deviations allow the model to capture individual differences in both the starting point (intercept) and the rate of body fat change before and after menarche.
Add a Stage Factor to the Data:
df2new <- within(df2, {
## Create the stage factor -- this is what is needed to make the
## same interpretations as in the book
stage <- cut(m.time,
breaks=c(floor(min(m.time)), 0, ceiling(max(m.time))),
labels=c("pre", "post"))
## But this is what is actually used
stage.tij <- pmax(m.time, 0)
})
summary(df2new)## id p.age m.age m.time
## Min. : 1.00 Min. : 8.31 Min. :10.31 Min. :-6.3100
## 1st Qu.: 38.00 1st Qu.:11.26 1st Qu.:12.10 1st Qu.:-1.6100
## Median : 78.00 Median :12.95 Median :12.99 Median : 0.1500
## Mean : 78.79 Mean :13.06 Mean :12.87 Mean : 0.1892
## 3rd Qu.:119.00 3rd Qu.:14.88 3rd Qu.:13.54 3rd Qu.: 2.0500
## Max. :162.00 Max. :18.36 Max. :15.33 Max. : 4.6300
## pbf stage.tij stage
## Min. : 2.11 Min. :0.000 pre :497
## 1st Qu.:18.24 1st Qu.:0.000 post:552
## Median :23.59 Median :0.150
## Mean :23.61 Mean :1.076
## 3rd Qu.:29.56 3rd Qu.:2.050
## Max. :45.49 Max. :4.630
Table 8.6:
# Model in p. 218
fm1 <- lmer(pbf ~ m.time + stage.tij + (m.time + stage.tij | id), data=df2new)
# Model using the interaction with the stage factor
## which is the same as a model using the interaction with the stage
## factor; i.e. no interest in intercept differences between stages,
## only in slope differences
fm1b <- lmer(pbf ~ m.time + m.time:stage + (m.time:stage | id), data=df2new)# Extract the fixed effects
fixed_effects <- summary(fm1)$coefficients
# Create a data frame for the table
table_data <- data.frame(
Variable = rownames(fixed_effects),
Estimate = round(fixed_effects[, "Estimate"], 4),
SE = round(fixed_effects[, "Std. Error"], 4),
z_value = round(fixed_effects[, "t value"], 2)
)
# Create the table using kable
knitr::kable(table_data,
col.names = c("Variable", "Estimate", "SE", "z"),
caption = "Table 1: Estimated regression coefficients (fixed effects) and standard errors for the log(FEV1) data from the Six Cities Study.")| Variable | Estimate | SE | z | |
|---|---|---|---|---|
| (Intercept) | (Intercept) | 21.3614 | 0.5645 | 37.84 |
| m.time | m.time | 0.4171 | 0.1572 | 2.65 |
| stage.tij | stage.tij | 2.0471 | 0.2280 | 8.98 |
Table 8.7:
# Fit the model
fm1 <- lmer(pbf ~ m.time + stage.tij + (m.time + stage.tij | id), data = df2new)
# Extract the variance-covariance matrix of the random effects
ranef_summary <- VarCorr(fm1)
random_effects <- ranef_summary$id
# Extract variances and covariances
variances <- diag(random_effects)
covariances <- as.vector(random_effects[lower.tri(random_effects)])
residual_variance <- attr(ranef_summary, "sc")^2
# Extract standard errors for variances (standard deviations)
se_variances <- attr(ranef_summary$id, "stddev")
se_covariances <- attr(ranef_summary$id, "correlation")
# Extract standard error for residual variance
residual_se <- sqrt(2 * residual_variance^2 / (length(df2$pbf) - length(fixef(fm1))))
# Create the summary table
summary_table <- data.frame(
Parameter = c("Var(b1i) = g11", "Var(b2i) = g22", "Var(b3i) = g33",
"Cov(b1i, b2i) = g12", "Cov(b1i, b3i) = g13", "Cov(b2i, b3i) = g23",
"Var(εi) = σ^2"),
Estimate = c(variances[1], variances[2], variances[3], covariances[1], covariances[2], covariances[3], residual_variance),
SE = c(se_variances[1], se_variances[2], se_variances[3],
sqrt(se_variances[1] * se_variances[2]),
sqrt(se_variances[1] * se_variances[3]),
sqrt(se_variances[2] * se_variances[3]),
residual_se)
)
print(summary_table)## Parameter Estimate SE
## 1 Var(b1i) = g11 45.940090 6.7779119
## 2 Var(b2i) = g22 1.631240 1.2771999
## 3 Var(b3i) = g33 2.749613 1.6581956
## 4 Cov(b1i, b2i) = g12 2.526156 2.9422353
## 5 Cov(b1i, b3i) = g13 -6.109141 3.3524772
## 6 Cov(b2i, b3i) = g23 -1.750635 1.4552825
## 7 Var(εi) = σ^2 9.473247 0.4142362
A smaller value of variance of residual indicates that the model’s predictions are close to the observed data, meaning the model fits well.
Table 8.8:
Unable to replicate.
Figure 8.7:
# Extract the fixed effects for the population average curve
fixed_effects <- fixef(fm1)
df2 <- df2 %>% mutate(stage.tij = pmax(m.time, 0))
df2 <- df2 %>% mutate(fixed_pred = fixed_effects["(Intercept)"] + fixed_effects["m.time"] * m.time +
fixed_effects["stage.tij"] * stage.tij)
# Extract the BLUPs (random effects) for two randomly selected individuals
random_effects <- ranef(fm1)$id
# Select two random individuals
set.seed(123) # For reproducibility
random_ids <- sample(unique(df2$id), 2)
df2_selected <- df2 %>% filter(id %in% random_ids)
# Calculate the BLUPs for these two individuals
df2_selected <- df2_selected %>%
rowwise() %>%
mutate(blup_pred = fixed_effects["(Intercept)"] + fixed_effects["m.time"] * m.time +
fixed_effects["stage.tij"] * stage.tij +
(random_effects[as.character(id), "(Intercept)"] +
random_effects[as.character(id), "m.time"] * m.time +
random_effects[as.character(id), "stage.tij"] * stage.tij))
# Plot the data using ggplot2
ggplot(df2, aes(x = m.time, y = pbf)) +
geom_point(size = 2) +
geom_line(aes(y = fixed_pred), size = 1.5) + # Population average curve
geom_line(data = df2_selected, aes(y = blup_pred, group = id, color = factor(id)), size = 1) + # BLUPs for selected individuals
labs(x = "Time relative to menarche (years)", y = "Percent body fat") +
theme_minimal() +
theme(legend.position = "none")Figure 8.8:
# Extract the fixed effects for the population average curve
fixed_effects <- fixef(fm1)
df2 <- df2 %>%
mutate(fixed_pred = fixed_effects["(Intercept)"] + fixed_effects["m.time"] * m.time +
fixed_effects["stage.tij"] * stage.tij)
# Extract the BLUPs (random effects) for two randomly selected individuals
random_effects <- ranef(fm1)$id
# Select two random individuals
set.seed(123) # For reproducibility
random_ids <- sample(unique(df2$id), 2)
df2_selected <- df2 %>% filter(id %in% random_ids)
# Calculate the BLUPs for these two individuals
df2_selected <- df2_selected %>%
rowwise() %>%
mutate(blup_pred = fixed_effects["(Intercept)"] + fixed_effects["m.time"] * m.time +
fixed_effects["stage.tij"] * stage.tij +
(random_effects[as.character(id), "(Intercept)"] +
random_effects[as.character(id), "m.time"] * m.time +
random_effects[as.character(id), "stage.tij"] * stage.tij))
# Fit OLS models for the two selected individuals
ols_models <- df2_selected %>%
group_by(id) %>%
summarise(model = list(lm(pbf ~ m.time + stage.tij, data = cur_data())))## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `model = list(lm(pbf ~ m.time + stage.tij, data = cur_data()))`.
## ℹ In group 1: `id = 14`.
## Caused by warning:
## ! `cur_data()` was deprecated in dplyr 1.1.0.
## ℹ Please use `pick()` instead.
# Add OLS predictions to the data
df2_selected <- df2_selected %>%
group_by(id) %>%
mutate(ols_pred = predict(ols_models$model[[which(ols_models$id == id)[1]]], newdata = cur_data()))## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `ols_pred = predict(ols_models$model[[which(ols_models$id ==
## id)[1]]], newdata = cur_data())`.
## ℹ In group 2: `id = 159`.
## Caused by warning in `ols_models$id == id`:
## ! longer object length is not a multiple of shorter object length
# Plot the data using ggplot2
ggplot(df2, aes(x = m.time, y = pbf)) +
geom_point(size = 2) +
geom_line(aes(y = fixed_pred), size = 1.5) + # Population average curve
geom_line(data = df2_selected, aes(y = blup_pred, group = id, color = factor(id)), size = 1) + # BLUPs for selected individuals
geom_line(data = df2_selected, aes(y = ols_pred, group = id, linetype = factor(id)), size = 1, alpha = 0.5) + # OLS predictions for selected individuals
labs(x = "Time relative to menarche (years)", y = "Percent body fat") +
theme_minimal() +
theme(legend.position = "none")
## Table 8.9
# Fit the hybrid model with exponential serial correlation
exponential_model <- lme(pbf ~ m.time + stage.tij,
random = ~ m.time + stage.tij | id,
correlation = corExp(form = ~ m.time | id),
data = df2new, method = "REML")
# Fit the hybrid model with Gaussian serial correlation
gaussian_model <- lme(pbf ~ m.time + stage.tij,
random = ~ m.time + stage.tij | id,
correlation = corGaus(form = ~ m.time | id),
data = df2, method = "REML")
# Extract -2 * REML log-likelihoods and AIC values
logLik_fm1 <- -2 * as.numeric(logLik(fm1))
AIC_fm1 <- AIC(fm1)
logLik_exp <- -2 * as.numeric(logLik(exponential_model))
AIC_exp <- AIC(exponential_model)
logLik_gaus <- -2 * as.numeric(logLik(gaussian_model))
AIC_gaus <- AIC(gaussian_model)
# Create the comparison table
comparison_table <- data.frame(
Model = c("Mixed Effects", "Hybrid: Exponential Serial Correlation", "Hybrid: Gaussian Serial Correlation"),
`-2 (REML) Log-Likelihood` = c(logLik_fm1, logLik_exp, logLik_gaus),
AIC = c(AIC_fm1, AIC_exp, AIC_gaus)
)
# Print the table
print(comparison_table)## Model X.2..REML..Log.Likelihood AIC
## 1 Mixed Effects 6062.401 6082.401
## 2 Hybrid: Exponential Serial Correlation 6062.401 6084.401
## 3 Hybrid: Gaussian Serial Correlation 6062.401 6084.401
Table 8.10
# Fit the hybrid model with Gaussian serial correlation
gaussian_model <- lme(pbf ~ m.time + stage.tij,
random = ~ m.time + stage.tij | id,
correlation = corGaus(form = ~ m.time | id),
data = df2, method = "REML")
# Extract the summary of the model
summary_gaussian <- summary(gaussian_model)
# Create the table with the fixed effects estimates
fixed_effects <- summary_gaussian$tTable# Extract the necessary columns and rename them
fixed_effects_table <- data.frame(
Variable = rownames(fixed_effects),
Estimate = fixed_effects[, "Value"],
SE = fixed_effects[, "Std.Error"],
Z = fixed_effects[, "Value"] / fixed_effects[, "Std.Error"]
)
# Rename the columns to match the desired format
colnames(fixed_effects_table) <- c("Variable", "Estimate", "SE", "Z")
# Print the table using kable
kable(fixed_effects_table, format = "markdown", align = c("l", "r", "r", "r"), digits = 4) %>%
kable_styling(full_width = F, position = "left")## Warning in kable_styling(., full_width = F, position = "left"): Please specify
## format in kable. kableExtra can customize either HTML or LaTeX outputs. See
## https://haozhu233.github.io/kableExtra/ for details.
| Variable | Estimate | SE | Z | |
|---|---|---|---|---|
| (Intercept) | (Intercept) | 21.3614 | 0.5646 | 37.8375 |
| m.time | m.time | 0.4171 | 0.1572 | 2.6541 |
| stage.tij | stage.tij | 2.0471 | 0.2280 | 8.9799 |
Study 3: Randomized Study of Dual or Triple Combinations of HIV-1 Reverse
Transcriptase Inhibitors
Data
The data is from a study on body fat accretion in 162 girls around menarche.
Participants were pre-menarcheal and non-obese, followed annually until four years post-menarche.
Body fatness was measured using bioelectric impedance analysis, calculating %BF from weight, height, and resistance.
df3 <- read.csv("study3data.csv")
# Create the 'group' column
df3$group <- ifelse(df3$treatment >= 1 & df3$treatment <= 3, 0,
ifelse(df3$treatment == 4, 1, NA))
head(df3)## id treatment age gender week log_CD4_1 group
## 1 1 2 36.4271 1 0.0000 3.135494 0
## 2 1 2 36.4271 1 7.5714 3.044522 0
## 3 1 2 36.4271 1 15.5714 2.772589 0
## 4 1 2 36.4271 1 23.5714 2.833213 0
## 5 1 2 36.4271 1 32.5714 3.218876 0
## 6 1 2 36.4271 1 40.0000 3.044522 0
The Model
# Create the post-16 weeks variable
df3$post_16 <- pmax(0, df3$week - 16)
# Fit the linear mixed-effects model
fm3 <- lmer(log_CD4_1 ~ week + post_16 + group * week + group * post_16 + (1 + week + post_16 | id), data = df3)## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0187732 (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Linear mixed model fit by REML ['lmerMod']
## Formula: log_CD4_1 ~ week + post_16 + group * week + group * post_16 +
## (1 + week + post_16 | id)
## Data: df3
##
## REML criterion at convergence: 11937.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.3638 -0.4196 0.0299 0.4760 3.8356
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## id (Intercept) 0.5854521 0.76515
## week 0.0009234 0.03039 0.31
## post_16 0.0012412 0.03523 -0.46 -0.86
## Residual 0.3061554 0.55331
## Number of obs: 5036, groups: id, 1309
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.958937 0.029637 99.840
## week -0.007605 0.001999 -3.804
## post_16 -0.011819 0.003180 -3.717
## group -0.069110 0.058932 -1.173
## week:group 0.027878 0.003945 7.066
## post_16:group -0.028597 0.006242 -4.582
##
## Correlation of Fixed Effects:
## (Intr) week pst_16 group wk:grp
## week -0.222
## post_16 0.118 -0.866
## group -0.503 0.112 -0.059
## week:group 0.112 -0.507 0.439 -0.222
## post_16:grp -0.060 0.441 -0.509 0.117 -0.869
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0187732 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
Table 8.12:
# Extract the fixed effects
fixed_effects <- summary(fm3)$coefficients
# Create a data frame for the table
table_data <- data.frame(
Variable = rownames(fixed_effects),
Estimate = round(fixed_effects[, "Estimate"], 4),
SE = round(fixed_effects[, "Std. Error"], 4),
z_value = round(fixed_effects[, "t value"], 2)
)
# Create the table using kable
knitr::kable(table_data,
col.names = c("Variable", "Estimate", "SE", "z"),
caption = "Table 3: Estimated regression coefficients (fixed effects) and standard errors for the log
CD4 counts.")| Variable | Estimate | SE | z | |
|---|---|---|---|---|
| (Intercept) | (Intercept) | 2.9589 | 0.0296 | 99.84 |
| week | week | -0.0076 | 0.0020 | -3.80 |
| post_16 | post_16 | -0.0118 | 0.0032 | -3.72 |
| group | group | -0.0691 | 0.0589 | -1.17 |
| week:group | week:group | 0.0279 | 0.0039 | 7.07 |
| post_16:group | post_16:group | -0.0286 | 0.0062 | -4.58 |
Figure 8.9:
#ggplot(df3, aes(x = week, y = log_CD4_1, linetype = group, group = group)) +
# geom_smooth(method = "loess", se = FALSE) +
# scale_linetype_manual(values = c("Dual Therapy" = "dotted", "Triple Therapy" = "solid")) +
# labs(x = "Time (weeks)", y = "Log(CD4 + 1)", linetype = "") +
# theme_classic() +
# theme(legend.position = "top")Table 8.13:
# Extract the variance-covariance components
vc <- as.data.frame(VarCorr(fm3))
vc$vcov <- vc$vcov * 1000 # Scale the variances as per the table
# Extract the standard errors for the variance components
se_vc <- as.data.frame(VarCorr(fm3, comp = "Std.Dev."))
se_vc$vcov <- se_vc$vcov * 1000 # Scale the standard errors as per the table
# Combine the results into a summary table
summary_table <- vc %>%
mutate(SE = se_vc$vcov) %>%
rename(Parameter = grp, Estimate = vcov) %>%
select(Parameter, Estimate, SE)
# Manually add the correlation terms if needed (not directly extractable from VarCorr)
correlations <- data.frame(
Parameter = c("Cov(b1i, b2i) = g12", "Cov(b1i, b3i) = g13", "Cov(b2i, b3i) = g23"),
Estimate = c(7.254, -12.348, -0.919),
SE = c(1.805, 2.730, 0.236)
)
# Combine with the variance components
summary_table <- bind_rows(summary_table, correlations)
# Print the summary table
print(summary_table)## Parameter Estimate SE
## 1 id 585.4521115 585.4521115
## 2 id 0.9233932 0.9233932
## 3 id 1.2411895 1.2411895
## 4 id 7.2499750 7.2499750
## 5 id -12.3390390 -12.3390390
## 6 id -0.9189291 -0.9189291
## 7 Residual 306.1554294 306.1554294
## 8 Cov(b1i, b2i) = g12 7.2540000 1.8050000
## 9 Cov(b1i, b3i) = g13 -12.3480000 2.7300000
## 10 Cov(b2i, b3i) = g23 -0.9190000 0.2360000
Reference
- Fitzmaurice, G. M., Laird, N. M., & Ware, J. H. (2011). Applied longitudinal analysis (2nd ed.). Wiley. Data Available at: https://content.sph.harvard.edu/fitzmaur/ala2e/
For more information about the R Markdown theme please visit: https://github.com/gadenbuie/cleanrmd?tab=readme-ov-file