Imagine you are consulting for a research team analyzing data from AIDS patients with advanced immune suppression (CD4 ≤ 50 cells/mm³).
Patients received 1 of several possible HIV treatments,
and CD4 counts were tracked over time.
Your primary goal is to evaluate whether the combination of zidovudine + zalcitabine offers benefits over other regimens, accounting for repeated measures within patients.
As the study biostatistician, your team is relying on you to apply appropriate statistical methods, interpret findings in a clinical context, and guide decision-making. Complete the tasks below, ensuring clear explanations accompany your statistical output. Please note, the dataset is in person-period (long) format, which is the appropriate structure for multilevel modeling.
Variable
Description
Coding/Notes
id
Patient id
age
Age in years
Continuous (years)
sex
Sex assigned at birth
0=female, 1=male
week
Number of weeks since study enrollment
Continuous (weeks)
cd4
Patient CD4 cell count
Continuous (count)
logcd4
Natural logarithm of the patient’s CD4 cell count (cd4), calculated as log(cd4 + 1) to handle zero values and ensure the transformation is defined for all non-negative counts.
Continuous (log count)
trt
Treatment group
0=other HIV treatment regimens,
1=zidovudine plus 2.25mg of zalcitabine
Part I: Data Checking, Descriptives, and Preparation (33 points)
As part of the data cleaning process, please complete the steps below.
1. Create a comprehensive descriptive table (Table 1) summarizing patient demographics (age, sex), treatment assignment, and baseline CD4 counts. (Hint: you’ll need to create a variable to select only one case per patient) (4 points)
Use the first measurement from each subject to create a comprehensive descriptive table summarizing patient demographics
(age, sex), treatment assignment, and baseline CD4 counts.
Table 1shows baseline characteristics for the study cohort.
Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'flextable'
The following object is masked from 'package:gtsummary':
continuous_summary
The following object is masked from 'package:purrr':
compose
Code
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
Code
library(broom)library(lme4)
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Code
library(lmerTest)
Attaching package: 'lmerTest'
The following object is masked from 'package:lme4':
lmer
The following object is masked from 'package:stats':
step
Code
library(ggeffects)cd4_dta_path <- r"(G:\My Drive\UNC\BIOS 642\HW4\cd4-1.dta)"cd4_dta_df <-read_dta(cd4_dta_path)cd4_dta_df <- cd4_dta_df %>%zap_labels() %>%zap_label() %>%zap_formats() %>%zap_missing() %>%mutate(sex =factor(sex), sex =case_when( sex ==0~"female", sex ==1~"male"), trt =case_when( trt ==0~"Other HIV Tx", trt ==1~"Zidovudine/Zalcitabine"))baseline_rows_df <- cd4_dta_df %>%group_by(id) %>%slice(1) %>%ungroup()
Code
table_1 <- baseline_rows_df %>%tbl_summary(include =c(age, sex, CD4, trt), by = trt ) %>%as_gt()table_1
Table 1: Baseline Characteristics
Characteristic
Other HIV Tx
N = 9851
Zidovudine/Zalcitabine
N = 3241
age
37 (32, 42)
37 (32, 42)
sex
female
120 (12%)
42 (13%)
male
865 (88%)
282 (87%)
CD4
20 (10, 34)
21 (10, 36)
1 Median (Q1, Q3); n (%)
2. Compare and contrast the distributions of the cd4 and logcd4. In your answer, make sure to note whether each is sufficiently normally distributed to be used as the dependent variable in a parametric test, presenting at least 3 pieces of statistical or graphical information that support your view. (4 points)
For these two specific questions, treat each row of data as an independent observation, as if each came from a different individual.
Warning in ks.test.default(cd4_dta_df$CD4, "pnorm", mean = mean(cd4_dta_df$CD4,
: ties should not be present for the one-sample Kolmogorov-Smirnov test
Asymptotic one-sample Kolmogorov-Smirnov test
data: cd4_dta_df$CD4
D = 0.21656, p-value < 2.2e-16
alternative hypothesis: two-sided
For CD4, the Kolmogorov-Smirnov test indicated a significant departure from normality (D = 0.217, p = <2e-16). However, given the large sample size (n = 5036, even minor deviations may result in statistical significance.
Code
ks_result <-ks.test(cd4_dta_df$logcd4, "pnorm", alternative ="two.sided", mean =mean(cd4_dta_df$logcd4, na.rm =TRUE), sd =sd(cd4_dta_df$logcd4, na.rm =TRUE))
Warning in ks.test.default(cd4_dta_df$logcd4, "pnorm", alternative =
"two.sided", : ties should not be present for the one-sample Kolmogorov-Smirnov
test
Asymptotic one-sample Kolmogorov-Smirnov test
data: cd4_dta_df$logcd4
D = 0.06844, p-value < 2.2e-16
alternative hypothesis: two-sided
For log(CD4), the Kolmogorov-Smirnov test indicated a significant departure from normality (D = 0.068, p = <2e-16). However, given the large sample size (n = 5036, even minor deviations may result in statistical significance.
CD4 is not normally distributed. Rather, it is right skewed, as demonstrated by Figure 1 (a) and Figure 2 (a). The Kolmogorov-Smirnov test results outlined above add further evidence for non-normality.
Log(CD4) is also not normally distributed, though it is closer to normal than is CD4. Both the histogram (Figure 1 (b)) and the QQ plot (Figure 2 (b)) demonstrate a heavy left tail and zero-inflation. The Kolmogorov-Smirnov test results outlined above add further evidence for non-normality.
3. Recode the week variable, with patients remaining in the study for more than 8 months (i.e., > 32 weeks) coded as "1", and patients in the study for 8 months or less (i.e., ≤ 32 weeks) coded as "0". (3 points)
Due to scaling issues and typos in Part III, please create a new variable called month (where 4 weeks = 1 month) and use it as the time variable in your models for Part III. This adjustment will help avoid scaling inconsistencies and improve the clarity of the results.
a) Did the average number of days in the study significantly differ between treatment and control groups? How did you come to your conclusion? (3 points)
Code
days_on_study_df <- cd4_dta_df %>%# Get the row of each id with maximum value for weekgroup_by(id) %>%slice_max(order_by = week, n =1) %>%ungroup() %>%arrange(as.numeric(id)) %>%# convert weeks on study to days_on_studymutate(days_on_study = week *7)t.test.result <-t.test(x = days_on_study_df %>%filter(trt =="Zidovudine/Zalcitabine") %>%pull(days_on_study), y = days_on_study_df %>%filter(trt =="Other HIV Tx") %>%pull(days_on_study) )t_stat <- t.test.result$statisticdf <- t.test.result$parameterp_val <- t.test.result$p.valueci <- t.test.result$conf.intt.test.result
Welch Two Sample t-test
data: days_on_study_df %>% filter(trt == "Zidovudine/Zalcitabine") %>% pull(days_on_study) and days_on_study_df %>% filter(trt == "Other HIV Tx") %>% pull(days_on_study)
t = -0.37754, df = 523.86, p-value = 0.7059
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.129619 8.896594
sample estimates:
mean of x mean of y
185.9413 188.0579
The average number of days on study did not significantly differ between treatment groups. The Welch two-sample t-test yielded a t-statistic of -0.378 with 523.86 degrees of freedom, and a p-value of 0.706. The 95% confidence interval for the difference in means was (-13.13, 8.9). The mean time on study was 185.9 days for the Zidovudine/Zalcitabine group and 188.1 days for the other HIV treatment group.
b) What proportion of patients in the study were in the study more than 8 months? Was there a significant difference between treatment and control groups? How did you come to your conclusion? (3 points)
2-sample test for equality of proportions with continuity correction
data: c(nrow(days_on_study_df %>% filter(trt == "Zidovudine/Zalcitabine", month_binary == ">8 months")), nrow(days_on_study_df %>% filter(trt == "Other HIV Tx", month_binary == ">8 months"))) out of c(nrow(days_on_study_df %>% filter(trt == "Zidovudine/Zalcitabine")), nrow(days_on_study_df %>% filter(trt == "Other HIV Tx")))
X-squared = 0.026063, df = 1, p-value = 0.8717
alternative hypothesis: two.sided
95 percent confidence interval:
-0.05747088 0.07188461
sample estimates:
prop 1 prop 2
0.4691358 0.4619289
The proportion of patients remaining in the study for more than 8 months did not significantly differ between treatment groups. The two-sample test for equality of proportions yielded a chi-squared statistic of 0.026 with 1 degree of freedom, p = 0.872. The estimated proportions of patients in the study >8 months were 0.469 for the Zidovudine/Zalcitabine group and 0.462 for the other HIV treatment group. The 95% confidence interval for the difference in proportions was (-0.057, 0.072).
c) Regardless of whether there was a difference between treatment groups on length of time in the study, provide at least two possible reasons why one group of patients, on average, would remain in the study longer than another. What methods or measures would study researchers assess to determine whether these reasons might be contributing factors? (6 points)
Two possible reasons why one group of patients, on average, would remain in the study longer than another:
(1) One treatment causes more side effects than the other, leading subjects on the treatment arm associated with greater toxicity to drop out sooner due to toxicity. This can be summarized as dropout due to toxicity. Researchers could assess adverse event rates or quality-of-life measures to assess dropout due to toxicity as a possibility.
(2) There is public knowledge suggesting that one treatment arm is superior to the other and treatment is not blinded (either by design or due to flaws in blinding). Subjects on the arm perceived to be inferior, particularly those with means and motivation to seek out alternatives, drop out sooner than those on the arm perceived to be more favorable. This can be summarized as dropout due to disappointment. Researchers could assess concordance between a subject’s guess as to their treatment arm and the correct treatment arm and/or subjects’ satisfaction with their perceived treatment arm to investigate this possibility.
4. Recode the status variable, collapsing the “transplant” and “censored” categories into a single category (transplant or censored = 0) and death as the “failure” (death = 1). Provide a frequency table showing the distribution for the new variable. (2 points)
Please note that Question 4 on Homework 4 is a duplicate of a question from Homework 3. You can safely ignore Question 4 when completing your assignment.
5. Summarize missing data patterns across weeks (log_cd4). (Note: The results might be trivial) (3 points)
Code
sum(is.na(cd4_dta_df$logcd4))
[1] 0
There are no missing values for logcd4.
Part II: OLS vs. Multilevel Modeling (17 points)
6. Fit a linear regression with log_cd4 as the outcome and treatment, mean-centered age, and sex as predictors. Present the treatment effect and discuss its interpretation in terms of CD4 improvement. Hint: Ignore the data clustering for this analysis, treating each case as independent. (6 points)
For these two specific questions, treat each row of data as an independent observation, as if each came from a different individual.
Code
cd4_dta_df <- cd4_dta_df %>%mutate(age_centered =scale(age, center =TRUE))logcd4_lm <-lm(logcd4 ~ trt + age_centered + sex, data = cd4_dta_df)logcd4_coef <-summary(logcd4_lm)$coefficientssummary(logcd4_lm)
Call:
lm(formula = logcd4 ~ trt + age_centered + sex, data = cd4_dta_df)
Residuals:
Min 1Q Median 3Q Max
-3.0696 -0.5758 0.0819 0.6963 3.4823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.01522 0.04644 64.926 < 2e-16 ***
trtZidovudine/Zalcitabine -0.09724 0.03500 -2.778 0.00548 **
age_centered 0.09670 0.01520 6.364 2.14e-10 ***
sexmale -0.13421 0.04830 -2.779 0.00548 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.073 on 5032 degrees of freedom
Multiple R-squared: 0.01036, Adjusted R-squared: 0.009766
F-statistic: 17.55 on 3 and 5032 DF, p-value: 2.476e-11
A linear regression model was fit to predict log-transformed CD4 counts with treatment group, mean-centered age, and sex as predictors, treating each observation as independent.
The treatment effect estimate was -0.097 (SE = 0.035), t = -2.778, p = 0.00548.
This suggests that, on average, patients receiving Zidovudine/Zalcitabine had log(CD4) values 0.097 units lower than those receiving other HIV treatments, adjusting for age and sex.
The overall model was statistically significant, F(3, 5032) = 17.55, p = 2.476e-11.
7. Fit a random intercept MLM with repeated measures (Level 1) nested within patient (Level 2) with log_cd4 as the outcome and treatment, mean-centered age, and sex as predictors. Calculate and interpret in words the Intraclass Correlation Coefficient (ICC), making sure to explain what the statistic reveals about variability in CD4 counts at each level. (6 points)
Code
logcd4_mlm <-lmer( logcd4 ~ trt + age_centered + sex + (1| id),data = cd4_dta_df,na.action = na.exclude)p_val_trt <-summary(logcd4_mlm)$coefficients["trtZidovudine/Zalcitabine", "Pr(>|t|)"]print(summary(logcd4_mlm))
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: logcd4 ~ trt + age_centered + sex + (1 | id)
Data: cd4_dta_df
REML criterion at convergence: 12429.7
Scaled residuals:
Min 1Q Median 3Q Max
-4.4991 -0.4690 0.0354 0.5376 3.7605
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.7365 0.8582
Residual 0.4107 0.6409
Number of obs: 5036, groups: id, 1309
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.98235 0.07584 1337.91404 39.325 < 2e-16 ***
trtZidovudine/Zalcitabine -0.07172 0.05967 1293.52460 -1.202 0.229564
age_centered 0.09486 0.02579 1294.17955 3.679 0.000244 ***
sexmale -0.12284 0.07924 1331.54148 -1.550 0.121294
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) trtZ/Z ag_cnt
trtZdvdn/Zl -0.206
age_centerd 0.097 -0.001
sexmale -0.920 0.013 -0.105
A random intercept multilevel model was fit to predict log-transformed CD4 counts using treatment, mean-centered age, and sex as fixed effects, and allowing intercepts to vary across patients.
The intraclass correlation coefficient (ICC) is the ratio between the random-intercept (between-person) variance and the total variance (random intercept [between-person] variance + residual [within-person] variance).
The ICC was 0.642, indicating that 64.2% of the variance in log(CD4) was attributable to between-person differences (ie, stable differences across patients). The remaining 35.8% of the variance reflected within-person variability (ie, fluctuations over time within individuals).
This suggests that while there are meaningful stable differences in CD4 levels across patients, a substantial proportion of variability occurs within patients over time.
8. Compare and contrast the OLS and MLM results. Which model better accounts for the study design? (5 points)
The OLS model suggested a significant negative effect of treatment with Zidovudine/Zalcitabine on log-transformed CD4 counts (b = -0.097, p = 2.476e-11).
However, when accounting for the nesting of repeated measures within individuals using a random intercept MLM, the treatment effect became smaller and non-significant (b = -0.072, p = 2.296e-01).
As discussed above, the ICC indicates that most of the variance in log(CD4) was attributable to stable between-person differences.
The multilevel model provided a better fit to the study design by accounting for the longitudinal measurements within patients and providing a more correct estimation of the effective sample size and standard errors.
Part III: Investigating Time Trends and Interactions (40 points)
Due to scaling issues and typos in Part III, please create a new variable called month (where 4 weeks = 1 month) and use it as the time variable in your models for Part III. This adjustment will help avoid scaling inconsistencies and improve the clarity of the results.
9. Test a MLM predicting logcd4 with a linear fixed effect of time (month) variable (no treatment variable here). Is logcd4 significantly associated with linear time? (5 points)
Code
logcd4_month_mlm <-lmer(logcd4 ~ month + (1|id), data = cd4_dta_df)summary(logcd4_month_mlm)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: logcd4 ~ month + (1 | id)
Data: cd4_dta_df
REML criterion at convergence: 12262.7
Scaled residuals:
Min 1Q Median 3Q Max
-4.3527 -0.4555 0.0330 0.5292 3.7115
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.7660 0.8752
Residual 0.3892 0.6239
Number of obs: 5036, groups: id, 1309
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.996e+00 2.803e-02 1.723e+03 106.90 <2e-16 ***
month -4.042e-02 2.962e-03 3.910e+03 -13.64 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
month -0.366
A multilevel model was fit with log-transformed CD4 count as the outcome and month as a fixed effect, allowing random intercepts by patient.
The effect of month was statistically significant (b = -0.04, SE = 0.003, t = -13.65, p = 1.892e-41), indicating that log(CD4) decreased by approximately 0.04 units per month on average.
10. Test a second MLM with both a linear fixed and random effect of time (month) variables in the same model. How does this model compare with the model from question 9 with respect to model fit? (6 points)
Adding a random slope for month significantly improved model fit compared to the random intercept-only model, χ²(2) = 151.71, p = 1.138e-33. The random slope model also had substantially lower AIC (1.212268^{4}) and BIC (1.2161826^{4}) compared to the random intercept-only model (AIC = 1.2270672^{4}, BIC = 1.229677^{4}), indicating improved model fit with appropriate adjustment for model complexity.
b. Use predictive margins and plot the linear effect of time (Hint: margins and marginsplot commands will help here). Briefly describe the shape and direction of the linear time trajectory.
Code
# Get predicted values for month across the observed rangepred_month <-ggpredict(logcd4_month_randslope_mlm, terms ="month")pred_month %>%ggplot(aes(x = x, y = predicted)) +geom_line(color ="blue") +geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha =0.2) +labs(x ="Time (Months)",y ="Predicted log(CD4)",title ="Predicted Linear Effect of Time (Month) on log(CD4)" ) +theme_minimal() +theme(panel.grid =element_blank(),axis.line =element_line(color ="black"),axis.ticks =element_line(color ="black") )
Log(CD4) decreases linearly as time increases.
11. Test a MLM predicting logcd4 with a quadratic fixed effect of time (month##month) and linear random effect of time (month). (6 points)
Adding a quadratic fixed effect of time (month²) significantly improved model fit compared to the linear random slope model, likelihood ratio test result: χ²(1) = 57.69, p = 3.070e-14. The quadratic model also had a substantially lower AIC (1.205529^{4}) than the linear model (1.212268^{4}), with ΔAIC = -67.39. This provides strong evidence that a curved (quadratic) trajectory better describes the association between time and log(CD4) than a linear trend.
b. Use predictive margins and plot the quadratic effect of time. Briefly describe the shape and direction of the quadratic time trajectory.
Code
# Get predicted values for month across the observed rangepred_month <-ggpredict(logcd4_month2_mlm, terms ="month [all]")pred_month %>%ggplot(aes(x = x, y = predicted)) +geom_line(color ="blue") +geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha =0.2) +labs(x ="Time (Months)",y ="Predicted log(CD4)",title ="Predicted Quadratic Effect of Time (Month) on log(CD4)" ) +theme_minimal() +theme(panel.grid =element_blank(),axis.line =element_line(color ="black"),axis.ticks =element_line(color ="black") )
As time increases, predicted log(CD4) increases slightly, levels off, and then decreases.
12. Test a MLM predicting logcd4 with a cubic fixed effect of time (month##month##month) and linear random effect of time (month). (6 points)
Including a cubic fixed effect of time (month³) significantly improved model fit compared to the quadratic model, χ²(1) = 33.62, p = 6.696e-09. The cubic model also had a lower AIC (1.202282^{4}) than the quadratic model (1.205529^{4}), with ΔAIC = -32.47. This indicates that the time trajectory of log(CD4) is better captured by a cubic curve than by a quadratic function.
b. Use predictive margins and plot the cubic effect of time. Briefly describe the shape and direction of the cubic time trajectory.
Code
# Get predicted values for month across the observed rangepred_month <-ggpredict(logcd4_month3_mlm, terms ="month [all]")pred_month %>%ggplot(aes(x = x, y = predicted)) +geom_line(color ="blue") +geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha =0.2) +labs(x ="Time (Months)",y ="Predicted log(CD4)",title ="Predicted Cubic Effect of Time (Month) on log(CD4)" ) +theme_minimal() +theme(panel.grid =element_blank(),axis.line =element_line(color ="black"),axis.ticks =element_line(color ="black") )
Log(CD4) progresses with a sinusoidal shape over time. It increases, levels off, decreases, levels off, and begins increasing slightly. I question whether this final rise makes biological sense, or from a statistical perspective whether this is due to overfitting.
13. In your view, which model from questions 10, 11 and 12 is the “best” model for describing the association between time (month) and logcd4? What information are you using to make this assessment? (5 points)
Code
aic_q10 <-AIC(logcd4_month_randslope_mlm) # Model 10: Linear random slopeaic_q11 <-AIC(logcd4_month2_mlm) # Model 11: Quadraticaic_q12 <-AIC(logcd4_month3_mlm) # Model 12: Cubicaic_q12 - aic_q10
[1] -99.86063
The LRT and AIC shows an improvement in model fit for models from q11 over q10 and q12 over q11 (with AIC also penalizing model complexity). Additionally, AIC for model q12 is substantially smaller than that for model q10. Together, these data demonstrate the model that best fits the data is the model in q12 (cubic).
14. Using the “best” model identified in question 13, test a model interacting the time (month)interaction term with treatment group. Describe the shape and direction of the time trajectories, making sure to note whether and how the treatment groups significantly differ from each other with respect to time. (Hint: Use predictive margins to help you answer this question). (6 points)
Code
logcd4_month3_trt_mlm <-lmer( logcd4 ~poly(month, 3) * trt + (1+ month | id), data = cd4_dta_df, na.action = na.exclude )
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00529595 (tol = 0.002, component 1)
Some of the focal terms are of type `character`. This may lead to
unexpected results. It is recommended to convert these variables to
factors before fitting the model.
The following variables are of type character: `trt`
Code
ggplot(pred_trt_month, aes(x = x, y = predicted, color = group)) +geom_line() +geom_ribbon(aes(ymin = conf.low, ymax = conf.high, fill = group), alpha =0.2, color =NA) +labs(x ="Time (Months)",y ="Predicted log(CD4)",title ="Predicted CD4 Trajectories Over Time by Treatment Group",color ="Treatment Group",fill ="Treatment Group" ) +theme_minimal() +theme(panel.grid =element_blank(),axis.line =element_line(color ="black"),axis.ticks =element_line(color ="black") )
Again, for both curves: Log(CD4) progresses with a sinusoidal shape over time. It increases, levels off, decreases, levels off, and begins increasing slightly. I question whether this final rise makes biological sense, or from a statistical perspective whether this is due to overfitting.
However, beginning shortly at <1 month log(CD4) is higher at all time points for Other HIV Tx than it is for Zidovudine/Zalcitabine. The Zidovudine/Zalcitabine group experienced a steeper decline in log(CD4) during the middle months, as evidenced by significant interaction terms between time and treatment in the mixed-effects model (0.0109 for the linear interaction term, with a similar trend for the quadratic term. These findings suggest that not only does log(CD4) change nonlinearly over time, but the pattern and magnitude of these changes differ significantly by treatment group.
15. Test a cross-level interaction between mean-centered age and month, while including treatment and sex as fixed main effects in the model. Does it appear that age is associated with logcd4 trends? (6 points)
Code
q15_mlm <-lmer( logcd4 ~poly(month, 3) * age_centered + trt + sex + (1+ month | id), data = cd4_dta_df, na.action = na.exclude )
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00346236 (tol = 0.002, component 1)
Code
# Fixed effects tablefixef_q15 <-summary(q15_mlm)$coefficients# p-values for interactions (important for interpretation)p_linear <- fixef_q15["poly(month, 3)1:age_centered", "Pr(>|t|)"]p_quadratic <- fixef_q15["poly(month, 3)2:age_centered", "Pr(>|t|)"]p_cubic <- fixef_q15["poly(month, 3)3:age_centered", "Pr(>|t|)"]# Main effect of agep_age_main <- fixef_q15["age_centered", "Pr(>|t|)"]# Format for inline reportingp_linear_fmt <-label_scientific(digits =2)(p_linear)p_quadratic_fmt <-label_scientific(digits =2)(p_quadratic)p_cubic_fmt <-label_scientific(digits =2)(p_cubic)p_age_main_fmt <-label_scientific(digits =2)(p_age_main)summary(q15_mlm)
Model contains splines or polynomial terms. Consider using
`terms="age_centered [all]"` to get smooth plots. See also
package-vignette 'Adjusted Predictions at Specific Values'.
Code
ggplot(pred_q15, aes(x = x, y = predicted, color = group)) +geom_line() +geom_ribbon(aes(ymin = conf.low, ymax = conf.high, fill = group), alpha =0.2, color =NA) +labs(x ="Age (Mean Centered)",y ="Predicted log(CD4)",title ="Predicted CD4 Trajectories By Age (Mean Centered)" ) +theme_minimal() +theme(panel.grid =element_blank(),axis.line =element_line(color ="black"),axis.ticks =element_line(color ="black"), legend.position ="none" )
(a) By Time
(b) By Age
Figure 3: Predicted log(CD4)
Adjusting for time, sex, and treatment group, the estimated coefficient for age was 0.095, corresponding to an approximate increase of 0.095 log(CD4) units per additional year of age. The main effect of age was statistically significant (p = 3.4e-04).
None of the interactions between age and the linear, quadratic, or cubic time terms were statistically significant (linear: p = 4.5e-01; quadratic: p = 1.5e-01; cubic: p = 8.9e-01). This suggests that while age was associated with overall log(CD4), the shape and direction of log(CD4) change over time did not significantly differ by age.
The predicted values plotted in Figure 3 illustrate these findings:
Figure 3 (a)shows the log(CD4) trajectory over time, which followed a sinusoidal trajectory as described previously.
Figure 3 (b)shows the positive association between age and log(CD4).