HW4

Author

Dan Reef

Imagine you are consulting for a research team analyzing data from AIDS patients with advanced immune suppression (CD4 ≤ 50 cells/mm³).

Patients received 1 of several possible HIV treatments,

and CD4 counts were tracked over time.

Your primary goal is to evaluate whether the combination of zidovudine + zalcitabine offers benefits over other regimens, accounting for repeated measures within patients.

As the study biostatistician, your team is relying on you to apply appropriate statistical methods, interpret findings in a clinical context, and guide decision-making. Complete the tasks below, ensuring clear explanations accompany your statistical output. Please note, the dataset is in person-period (long) format, which is the appropriate structure for multilevel modeling.

Variable Description Coding/Notes
id Patient id
age Age in years Continuous (years)
sex Sex assigned at birth 0=female, 1=male
week Number of weeks since study enrollment Continuous (weeks)
cd4 Patient CD4 cell count Continuous (count)
logcd4 Natural logarithm of the patient’s CD4 cell count (cd4), calculated as log(cd4 + 1) to handle zero values and ensure the transformation is defined for all non-negative counts. Continuous (log count)
trt Treatment group 0=other HIV treatment regimens,
1=zidovudine plus 2.25mg of zalcitabine

Part I: Data Checking, Descriptives, and Preparation (33 points)

As part of the data cleaning process, please complete the steps below.

1.        Create a comprehensive descriptive table (Table 1) summarizing patient demographics (age, sex), treatment assignment, and baseline CD4 counts. (Hint: you’ll need to create a variable to select only one case per patient) (4 points)

Use the first measurement from each subject to create a comprehensive descriptive table summarizing patient demographics
(age, sex), treatment assignment, and baseline CD4 counts.

Table 1 shows baseline characteristics for the study cohort.

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(readxl)
library(skimr)
library(glue)
library(forcats)
library(gtsummary)
library(gt)
library(haven)
library(flextable)

Attaching package: 'flextable'

The following object is masked from 'package:gtsummary':

    continuous_summary

The following object is masked from 'package:purrr':

    compose
Code
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
Code
library(broom)
library(lme4)
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Code
library(lmerTest)

Attaching package: 'lmerTest'

The following object is masked from 'package:lme4':

    lmer

The following object is masked from 'package:stats':

    step
Code
library(ggeffects)

cd4_dta_path <- r"(G:\My Drive\UNC\BIOS 642\HW4\cd4-1.dta)"

cd4_dta_df <- read_dta(cd4_dta_path)

cd4_dta_df <- cd4_dta_df %>%
  zap_labels() %>%
  zap_label() %>%
  zap_formats() %>%
  zap_missing() %>%
  mutate(
    sex = factor(sex), 
    sex = case_when(
      sex == 0 ~ "female", 
      sex == 1 ~ "male"), 
    trt = case_when(
      trt == 0 ~ "Other HIV Tx", 
      trt == 1 ~ "Zidovudine/Zalcitabine"))

baseline_rows_df <- cd4_dta_df %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup()
Code
table_1 <- baseline_rows_df %>%
  tbl_summary(
    include = c(age, sex, CD4, trt), 
    by = trt
  ) %>% 
  as_gt()

table_1
Table 1: Baseline Characteristics
Characteristic Other HIV Tx
N = 9851
Zidovudine/Zalcitabine
N = 3241
age 37 (32, 42) 37 (32, 42)
sex

    female 120 (12%) 42 (13%)
    male 865 (88%) 282 (87%)
CD4 20 (10, 34) 21 (10, 36)
1 Median (Q1, Q3); n (%)


2.        Compare and contrast the distributions of the cd4 and logcd4. In your answer, make sure to note whether each is sufficiently normally distributed to be used as the dependent variable in a parametric test, presenting at least 3 pieces of statistical or graphical information that support your view. (4 points)

For these two specific questions, treat each row of data as an independent observation, as if each came from a different individual.

Code
cd4_dta_df %>%
  ggplot(aes(x = CD4)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5) + 
  geom_density(color = "black", linewidth = 1) +
  theme_minimal() + 
  theme(
    panel.grid = element_blank(), 
    axis.line = element_line(color = "black")
  )

cd4_dta_df %>%
  ggplot(aes(x = logcd4)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25) + 
  geom_density(color = "black", linewidth = 1) +
  theme_minimal() + 
  theme(
    panel.grid = element_blank(), 
    axis.line = element_line(color = "black")
  )
(a) Histogram of CD4
(b) Histogram of Log(CD4)
Figure 1: Histograms
Code
cd4_dta_df %>%
  ggplot(aes(sample = CD4)) +
  stat_qq() + 
  stat_qq_line() + 
  theme_minimal() + 
  theme(
    panel.grid = element_blank(), 
    axis.line = element_line(color = "black")
  ) +
  labs(
  x = "Theoretical Quantiles (Normal Distribution)",
  y = "Observed Quantiles of CD4"
  )

cd4_dta_df %>%
  ggplot(aes(sample = logcd4)) +
  stat_qq() + 
  stat_qq_line() + 
  theme_minimal() + 
  theme(
    panel.grid = element_blank(), 
    axis.line = element_line(color = "black")
  ) +
  labs(
  x = "Theoretical Quantiles (Normal Distribution)",
  y = "Observed Quantiles of Log(CD4)"
  )
(a) QQ Plot of CD4
(b) QQ Plot of Log(CD4)
Figure 2: QQ Plots
Code
ks_result <- ks.test(cd4_dta_df$CD4, "pnorm", 
                     mean = mean(cd4_dta_df$CD4, na.rm = TRUE), 
                     sd = sd(cd4_dta_df$CD4, na.rm = TRUE))
Warning in ks.test.default(cd4_dta_df$CD4, "pnorm", mean = mean(cd4_dta_df$CD4,
: ties should not be present for the one-sample Kolmogorov-Smirnov test
Code
D_val <- round(ks_result$statistic, 3)

p_val <- format.pval(ks_result$p.value, digits = 3, scientific = TRUE)

ks_result

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  cd4_dta_df$CD4
D = 0.21656, p-value < 2.2e-16
alternative hypothesis: two-sided

For CD4, the Kolmogorov-Smirnov test indicated a significant departure from normality (D = 0.217, p = <2e-16). However, given the large sample size (n = 5036, even minor deviations may result in statistical significance.

Code
ks_result <- ks.test(cd4_dta_df$logcd4, "pnorm", 
        alternative = "two.sided", 
        mean = mean(cd4_dta_df$logcd4, na.rm = TRUE), 
        sd = sd(cd4_dta_df$logcd4, na.rm = TRUE))
Warning in ks.test.default(cd4_dta_df$logcd4, "pnorm", alternative =
"two.sided", : ties should not be present for the one-sample Kolmogorov-Smirnov
test
Code
D_val <- round(ks_result$statistic, 3)

p_val <- format.pval(ks_result$p.value, digits = 3, scientific = TRUE)

ks_result

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  cd4_dta_df$logcd4
D = 0.06844, p-value < 2.2e-16
alternative hypothesis: two-sided

For log(CD4), the Kolmogorov-Smirnov test indicated a significant departure from normality (D = 0.068, p = <2e-16). However, given the large sample size (n = 5036, even minor deviations may result in statistical significance.

CD4 is not normally distributed. Rather, it is right skewed, as demonstrated by Figure 1 (a) and Figure 2 (a). The Kolmogorov-Smirnov test results outlined above add further evidence for non-normality.

Log(CD4) is also not normally distributed, though it is closer to normal than is CD4. Both the histogram (Figure 1 (b)) and the QQ plot (Figure 2 (b)) demonstrate a heavy left tail and zero-inflation. The Kolmogorov-Smirnov test results outlined above add further evidence for non-normality.

3.        Recode the week variable, with patients remaining in the study for more than 8 months (i.e., > 32 weeks) coded as "1", and patients in the study for 8 months or less (i.e., ≤ 32 weeks) coded as "0". (3 points)

Code
cd4_dta_df <- cd4_dta_df %>%
  mutate(
    week_binary = case_when(
      week <= 32 ~ "≤ 32 weeks", 
      week > 32 ~ "> 32 weeks"
    )
  )

cd4_dta_df %>% select(c(week, week_binary)) %>% head(n = 10)
# A tibble: 10 × 2
    week week_binary
   <dbl> <chr>      
 1  0    ≤ 32 weeks 
 2  7.57 ≤ 32 weeks 
 3 15.6  ≤ 32 weeks 
 4 23.6  ≤ 32 weeks 
 5 32.6  > 32 weeks 
 6 40    > 32 weeks 
 7  0    ≤ 32 weeks 
 8  8    ≤ 32 weeks 
 9 16    ≤ 32 weeks 
10 23    ≤ 32 weeks 

Due to scaling issues and typos in Part III, please create a new variable called month (where 4 weeks = 1 month) and use it as the time variable in your models for Part III. This adjustment will help avoid scaling inconsistencies and improve the clarity of the results.

Code
cd4_dta_df <- cd4_dta_df %>%
  mutate(
    month = week/4
  )

cd4_dta_df %>% select(c(week, month)) %>% head(n = 10)
# A tibble: 10 × 2
    week month
   <dbl> <dbl>
 1  0     0   
 2  7.57  1.89
 3 15.6   3.89
 4 23.6   5.89
 5 32.6   8.14
 6 40    10   
 7  0     0   
 8  8     2   
 9 16     4   
10 23     5.75


a)        Did the average number of days in the study significantly differ between treatment and control groups? How did you come to your conclusion? (3 points)

Code
days_on_study_df <- cd4_dta_df %>% 
  # Get the row of each id with maximum value for week
  group_by(id) %>%
  slice_max(order_by = week, n = 1) %>%
  ungroup() %>% 
  arrange(as.numeric(id)) %>%
  # convert weeks on study to days_on_study
  mutate(days_on_study = week * 7)

t.test.result <- t.test(
  x = days_on_study_df %>% 
    filter(trt == "Zidovudine/Zalcitabine") %>%
    pull(days_on_study), 
  y = days_on_study_df %>% 
    filter(trt == "Other HIV Tx") %>%
    pull(days_on_study)
  )

t_stat <- t.test.result$statistic
df <- t.test.result$parameter
p_val <- t.test.result$p.value
ci <- t.test.result$conf.int

t.test.result

    Welch Two Sample t-test

data:  days_on_study_df %>% filter(trt == "Zidovudine/Zalcitabine") %>% pull(days_on_study) and days_on_study_df %>% filter(trt == "Other HIV Tx") %>% pull(days_on_study)
t = -0.37754, df = 523.86, p-value = 0.7059
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.129619   8.896594
sample estimates:
mean of x mean of y 
 185.9413  188.0579 

The average number of days on study did not significantly differ between treatment groups. The Welch two-sample t-test yielded a t-statistic of -0.378 with 523.86 degrees of freedom, and a p-value of 0.706. The 95% confidence interval for the difference in means was (-13.13, 8.9). The mean time on study was 185.9 days for the Zidovudine/Zalcitabine group and 188.1 days for the other HIV treatment group.


b)        What proportion of patients in the study were in the study more than 8 months? Was there a significant difference between treatment and control groups? How did you come to your conclusion? (3 points)

Code
days_on_study_df <- days_on_study_df %>%
  mutate(
    month_binary = case_when(
      month > 8 ~ ">8 months", 
      month <= 8 ~ "≤8 months"
    ))

prop_by_month_binary_df <- days_on_study_df %>%
  group_by(month_binary) %>%
  summarize(
    n = n(),
    proportion = n / nrow(days_on_study_df)
  )

prop_over_8_months <- prop_by_month_binary_df %>% filter(month_binary == ">8 months") %>% pull(proportion)

prop_8_months_or_below <- prop_by_month_binary_df %>% filter(month_binary == "≤8 months") %>% pull(proportion)

prop_by_month_binary_df
# A tibble: 2 × 3
  month_binary     n proportion
  <chr>        <int>      <dbl>
1 >8 months      607      0.464
2 ≤8 months      702      0.536

Proportion of patients in the study >8 months: 46.37%.

Code
prop.test.result <- prop.test(
  x = c(
    nrow(days_on_study_df %>% filter(trt == "Zidovudine/Zalcitabine", month_binary == ">8 months")), 
    nrow(days_on_study_df %>% filter(trt == "Other HIV Tx", month_binary == ">8 months"))), 
  n = c(
    nrow(days_on_study_df %>% filter(trt == "Zidovudine/Zalcitabine")), 
    nrow(days_on_study_df %>% filter(trt == "Other HIV Tx"))
    )
  )

x2 <- prop.test.result$statistic
df <- prop.test.result$parameter
p.val <- prop.test.result$p.val
ci <- prop.test.result$conf.int
estimates <- prop.test.result$estimate

prop.test.result

    2-sample test for equality of proportions with continuity correction

data:  c(nrow(days_on_study_df %>% filter(trt == "Zidovudine/Zalcitabine", month_binary == ">8 months")), nrow(days_on_study_df %>% filter(trt == "Other HIV Tx", month_binary == ">8 months"))) out of c(nrow(days_on_study_df %>% filter(trt == "Zidovudine/Zalcitabine")), nrow(days_on_study_df %>% filter(trt == "Other HIV Tx")))
X-squared = 0.026063, df = 1, p-value = 0.8717
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.05747088  0.07188461
sample estimates:
   prop 1    prop 2 
0.4691358 0.4619289 

The proportion of patients remaining in the study for more than 8 months did not significantly differ between treatment groups. The two-sample test for equality of proportions yielded a chi-squared statistic of 0.026 with 1 degree of freedom, p = 0.872. The estimated proportions of patients in the study >8 months were 0.469 for the Zidovudine/Zalcitabine group and 0.462 for the other HIV treatment group. The 95% confidence interval for the difference in proportions was (-0.057, 0.072).

c)        Regardless of whether there was a difference between treatment groups on length of time in the study, provide at least two possible reasons why one group of patients, on average, would remain in the study longer than another. What methods or measures would study researchers assess to determine whether these reasons might be contributing factors? (6 points)

Two possible reasons why one group of patients, on average, would remain in the study longer than another:

(1) One treatment causes more side effects than the other, leading subjects on the treatment arm associated with greater toxicity to drop out sooner due to toxicity. This can be summarized as dropout due to toxicity. Researchers could assess adverse event rates or quality-of-life measures to assess dropout due to toxicity as a possibility.

(2) There is public knowledge suggesting that one treatment arm is superior to the other and treatment is not blinded (either by design or due to flaws in blinding). Subjects on the arm perceived to be inferior, particularly those with means and motivation to seek out alternatives, drop out sooner than those on the arm perceived to be more favorable. This can be summarized as dropout due to disappointment. Researchers could assess concordance between a subject’s guess as to their treatment arm and the correct treatment arm and/or subjects’ satisfaction with their perceived treatment arm to investigate this possibility.

4.        Recode the status variable, collapsing the “transplant” and “censored” categories into a single category (transplant or censored = 0) and death as the “failure” (death = 1). Provide a frequency table showing the distribution for the new variable. (2 points)

Please note that Question 4 on Homework 4 is a duplicate of a question from Homework 3. You can safely ignore Question 4 when completing your assignment.

5.        Summarize missing data patterns across weeks (log_cd4). (Note: The results might be trivial) (3 points)

Code
sum(is.na(cd4_dta_df$logcd4))
[1] 0

There are no missing values for logcd4.

Part II: OLS vs. Multilevel Modeling (17 points)

6.        Fit a linear regression with log_cd4 as the outcome and treatment, mean-centered age, and sex as predictors. Present the treatment effect and discuss its interpretation in terms of CD4 improvement. Hint: Ignore the data clustering for this analysis, treating each case as independent. (6 points)

For these two specific questions, treat each row of data as an independent observation, as if each came from a different individual.

Code
cd4_dta_df <- cd4_dta_df %>%
  mutate(age_centered = scale(age, center = TRUE))

logcd4_lm <- lm(logcd4 ~ trt + age_centered + sex, data = cd4_dta_df)

logcd4_coef <- summary(logcd4_lm)$coefficients

summary(logcd4_lm)

Call:
lm(formula = logcd4 ~ trt + age_centered + sex, data = cd4_dta_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0696 -0.5758  0.0819  0.6963  3.4823 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                3.01522    0.04644  64.926  < 2e-16 ***
trtZidovudine/Zalcitabine -0.09724    0.03500  -2.778  0.00548 ** 
age_centered               0.09670    0.01520   6.364 2.14e-10 ***
sexmale                   -0.13421    0.04830  -2.779  0.00548 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.073 on 5032 degrees of freedom
Multiple R-squared:  0.01036,   Adjusted R-squared:  0.009766 
F-statistic: 17.55 on 3 and 5032 DF,  p-value: 2.476e-11

A linear regression model was fit to predict log-transformed CD4 counts with treatment group, mean-centered age, and sex as predictors, treating each observation as independent.

The treatment effect estimate was -0.097 (SE = 0.035), t = -2.778, p = 0.00548.

This suggests that, on average, patients receiving Zidovudine/Zalcitabine had log(CD4) values 0.097 units lower than those receiving other HIV treatments, adjusting for age and sex.

The overall model was statistically significant, F(3, 5032) = 17.55, p = 2.476e-11.

7.        Fit a random intercept MLM with repeated measures (Level 1) nested within patient (Level 2) with log_cd4 as the outcome and treatment, mean-centered age, and sex as predictors. Calculate and interpret in words the Intraclass Correlation Coefficient (ICC), making sure to explain what the statistic reveals about variability in CD4 counts at each level. (6 points)

Code
logcd4_mlm <- lmer(
  logcd4 ~ trt + age_centered + sex + (1 | id),
  data = cd4_dta_df,
  na.action = na.exclude
)

p_val_trt <- summary(logcd4_mlm)$coefficients["trtZidovudine/Zalcitabine", "Pr(>|t|)"]

print(summary(logcd4_mlm))
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: logcd4 ~ trt + age_centered + sex + (1 | id)
   Data: cd4_dta_df

REML criterion at convergence: 12429.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.4991 -0.4690  0.0354  0.5376  3.7605 

Random effects:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 0.7365   0.8582  
 Residual             0.4107   0.6409  
Number of obs: 5036, groups:  id, 1309

Fixed effects:
                            Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)                  2.98235    0.07584 1337.91404  39.325  < 2e-16 ***
trtZidovudine/Zalcitabine   -0.07172    0.05967 1293.52460  -1.202 0.229564    
age_centered                 0.09486    0.02579 1294.17955   3.679 0.000244 ***
sexmale                     -0.12284    0.07924 1331.54148  -1.550 0.121294    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) trtZ/Z ag_cnt
trtZdvdn/Zl -0.206              
age_centerd  0.097 -0.001       
sexmale     -0.920  0.013 -0.105
Code
var_components <- as.data.frame(VarCorr(logcd4_mlm))

# Between-person variance (random intercept variance)
var_between <- var_components$vcov[var_components$grp == "id"]

# Within-person (residual) variance
var_within <- var_components$vcov[var_components$grp == "Residual"]

# ICC
icc <- var_between / (var_between + var_within)
print(icc)
[1] 0.6419785

A random intercept multilevel model was fit to predict log-transformed CD4 counts using treatment, mean-centered age, and sex as fixed effects, and allowing intercepts to vary across patients.

The intraclass correlation coefficient (ICC) is the ratio between the random-intercept (between-person) variance and the total variance (random intercept [between-person] variance + residual [within-person] variance).

The ICC was 0.642, indicating that 64.2% of the variance in log(CD4) was attributable to between-person differences (ie, stable differences across patients). The remaining 35.8% of the variance reflected within-person variability (ie, fluctuations over time within individuals).

This suggests that while there are meaningful stable differences in CD4 levels across patients, a substantial proportion of variability occurs within patients over time.

8.        Compare and contrast the OLS and MLM results. Which model better accounts for the study design? (5 points)

The OLS model suggested a significant negative effect of treatment with Zidovudine/Zalcitabine on log-transformed CD4 counts (b = -0.097, p = 2.476e-11).

However, when accounting for the nesting of repeated measures within individuals using a random intercept MLM, the treatment effect became smaller and non-significant (b = -0.072, p = 2.296e-01).

As discussed above, the ICC indicates that most of the variance in log(CD4) was attributable to stable between-person differences.

The multilevel model provided a better fit to the study design by accounting for the longitudinal measurements within patients and providing a more correct estimation of the effective sample size and standard errors.