PH724_Assignment4

Load R packages

library(tidyverse) 
library(knitr)
library(kableExtra)
library(survival)
library(survminer)
library(acro)
library(ggsurvfit)
library(casebase)
library(MASS)
library(dplyr)
library(glue)
library(tidycmprsk)
require(lubridate)
library(car)
library(forcats)
format(num, scientific = F)

##  [1] "function (x, ..., sigfig = NULL, digits = NULL, label = NULL, "                              
##  [2] "    scale = NULL, notation = c(\"fit\", \"dec\", \"sci\", \"eng\", \"si\"), "                
##  [3] "    fixed_exponent = NULL, extra_sigfig = NULL) "                                            
##  [4] "{"                                                                                           
##  [5] "    stopifnot(is.numeric(x))"                                                                
##  [6] "    stopifnot(is.null(digits) || is_integerish(digits))"                                     
##  [7] "    check_dots_empty()"                                                                      
##  [8] "    x[] <- as.numeric(x)"                                                                    
##  [9] "    if (missing(notation)) {"                                                                
## [10] "        notation <- NULL"                                                                    
## [11] "    }"                                                                                       
## [12] "    out <- set_num_opts(x, sigfig = sigfig, digits = digits, "                               
## [13] "        label = label, scale = scale, notation = notation, fixed_exponent = fixed_exponent, "
## [14] "        extra_sigfig = extra_sigfig)"                                                        
## [15] "    new_class <- c(\"pillar_num\", \"pillar_vctr\", \"vctrs_vctr\", "                        
## [16] "        \"double\")"                                                                         
## [17] "    class(out) <- new_class"                                                                 
## [18] "    out"                                                                                     
## [19] "}"

setwd('/Users/vanessa/Library/Mobile Documents/com~apple~CloudDocs/Courses/PH724-01 Advanced Methods in Epidemiology/Assignments/Assignment4_Long_Analysis')

#PART 1 BREAST CANCER COHORT ### Q1.1

If the researchers were interested in studying the age at onset of breast cancer, what type of truncation could present if participants entered the study at different ages. Would this concern be eliminated if the researchers studied the time since enrollment until onset of breast cancer instead?

The type of truncation that could occur when participants enter the study at different ages is left truncation. Left truncation happens when individuals who develop the event of interest (in this case, breast cancer) before they are eligible to enroll are systematically excluded from the study population. Specifically, women who develop breast cancer at younger ages and are therefore not available to be enrolled later in life would be missing from the study. If the researchers instead studied time since enrollment until onset of breast cancer, left truncation would still be a concern. Even though the analysis would shift to focus on the time from enrollment, the fact remains that participants had to survive without developing breast cancer up to the time of study entry. Thus, individuals who developed breast cancer earlier in life and were not available for enrollment would still be excluded. Therefore, selection bias due to left truncation could be present in either scenario.

Q1.2

Individual A enrolled in the study at age 40 and reported having breast cancer at the fifth exam after enrollment. Individual B enrolled in the study at age 50 and died from heart failure at age 61, without ever being diagnosed with breast cancer. Individual C enrolled in the study at age 42 and moved away from the community at age 54, without ever being diagnosed with breast cancer. Discuss the type(s) of censoring for each individual.

A) Non-censoring. The patient developed the outcome (time to event was observed) B) Right censoring. The patient died from a cause other than breast cancer without being diagnosed with breast cancer (competing risk) C) Right censoring. The patient moved and was lost to follow up so her breast cancer status after that point is unknown.

#PART 2 ADDICTS DATASET ### 2.1 What is the average follow-up time and the proportion of censored observations

addic <- read.csv("addicts.csv")
glimpse(addic)

## Rows: 238
## Columns: 6
## $ id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
## $ clinic <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ prison <int> 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, …
## $ dose   <int> 50, 55, 55, 30, 65, 55, 65, 60, 50, 65, 80, 60, 55, 70, 60, 60,…
## $ time   <int> 428, 275, 262, 183, 259, 714, 438, 796, 892, 393, 161, 836, 523…
## $ event  <int> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

mean_time <- mean(addic$time)
prop_cens <- mean(addic$event == 0)

# Tabulate
result <- data.frame(
  Mean_Follow_Up_Time = mean_time,
  Proportion_Censored = prop_cens
)

print(result)

##   Mean_Follow_Up_Time Proportion_Censored
## 1            402.5714           0.3697479

Q2.2

Plot the Kaplan-Meier estimator for survival function for the time until exit from maintenance along with 95% confidence intervals. What is the estimated probability and associated 95% confidence intervals that no exit will occur by one year (i.e., 365 days)? What is the estimated median time until exit from maintenance and its 95% confidence interval?

surv_1 <- survminer::surv_fit(Surv(time, event) ~ 1, data = addic)
survminer::ggsurvplot(surv_1, xlab = "Time in days", ylab = "Survival Pr", 
                      risk.table = TRUE, conf.int=TRUE, surv.median.line = "hv", color = "#0099FF",
                        title = "Kaplan-Meier Estimate of time to event")

?ggsurvplot

# Estimated survival probability at 365 days
year1 <- summary(surv_1, times = 365)

# Extract probability and 95% CI
prob_year1 <- year1$surv
ci_lower_year1 <- year1$lower
ci_upper_year1 <- year1$upper


# Estimated median survival time and 95% CI
med_stats <- summary(surv_1)$table
median_time <- med_stats["median"]
median_lower <- med_stats["0.95LCL"]
median_upper <- med_stats["0.95UCL"]

# Create summary table
results_tbl <- tibble(
  Metric = c("Pr Survival > 365 days", "Median survival time (days)"),
  Estimate = c(round(prob_year1, 3), round(median_time, 1)),
  `95% CI` = c(
    sprintf("(%.3f, %.3f)", ci_lower_year1, ci_upper_year1),
    sprintf("(%.1f, %.1f)", median_lower, median_upper)
  )
)

# Format and print
results_tbl %>%
  gt::gt() %>%
  gt::tab_header(title = "Kaplan-Meier Survival Estimates")

Metric	Estimate	95% CI
Kaplan-Meier Survival Estimates
Pr Survival > 365 days	0.606	(0.545, 0.675)
Median survival time (days)	504.000	(399.0, 560.0)

Q2.3

Plot the Kaplan-Meier estimator for survival functions of the time until exit from maintenance comparing patients with a history of incarceration versus those without. What are the estimated median times until exit from maintenance and their 95% confidence intervals for each group? Does the distribution of time until exit from maintenance differ by history of incarceration?

# Kaplan-Meier estimator stratified by incarceration history
surv_1 <- survfit(Surv(time, event) ~ prison, data = addic)

# Plot
ggsurvplot(
  surv_1,
  conf.int = TRUE,
  pval = TRUE,           # Adds p-value from log-rank test
  xlab = "Time (days)",
  ylab = "Survival Probability",
  title = "Kaplan-Meier Estimate by Incarceration History",
  legend.labs = c("No incarceration", "Incarceration"),
  legend.title = "History of incarceration"
)

# Median survival times and 95% CIs for each group
med_stats <- summary(surv_1)$table

# Prepare table
results_tbl <- tibble(
  Group = c("No incarceration", "Incarceration"),
  `Median survival time (days)` = round(med_stats[,"median"], 1),
  `95% CI` = paste0(
    "(", round(med_stats[,"0.95LCL"], 1), ", ", round(med_stats[,"0.95UCL"], 1), ")"
  )
)

# Format and print
library(gt)
results_tbl %>%
  gt() %>%
  tab_header(title = "Median Survival Times by Incarceration History")

Group	Median survival time (days)	95% CI
Median Survival Times by Incarceration History
No incarceration	523	(460, 652)
Incarceration	394	(268, 563)

Q2.4

The researchers aimed to estimate association between time to exit from maintenance and methadone dosage, while considering history of incarceration and clinic potential confounders. Is proportionality of hazards met? If so, fit a proportional hazards model and report finding.

# Fit Cox proportional hazards model
cox_model <- coxph(Surv(time, event) ~ dose + prison + clinic, data = addic)

# Test proportional hazards assumption
ph_test <- cox.zph(cox_model)

# Print proportional hazards test results
ph_test

##         chisq df       p
## dose    0.515  1 0.47310
## prison  0.839  1 0.35969
## clinic 11.159  1 0.00084
## GLOBAL 12.287  3 0.00646

# Plot Schoenfeld residuals if needed
ggcoxzph(ph_test)

# If PH assumption is met, summarize the model
#cox_model %>%
 # tbl_regression(
  #  exponentiate = TRUE,
   # label = list(
    #  dose ~ "Methadone Dose",
     # prison ~ "History of Incarceration",
      #clinic ~ "Clinic"
   # )
  #) %>%
  #modify_header(label = "**Variable**", estimate = "**HR**", conf.low = "**95% CI Lower**", conf.high = "**95% CI Upper**") %>%

Proportional hazard assumption is not met (p value = 0.006, we want to fail to reject H0)

#PART 3 TRANSPLANT DATASET ### Q3.1 Download the transplant.csv. This data set contains information about patients who underwent hematopoietic stem cell transplantation for acute leukemia. Among 177 participants, 56 experienced a relapse, 75 encountered other competing events, and 46 were censored. We are interested in whether the source of stem cells impacts the time to relapse. The research question focuses on whether stem cell source impacts time to relapse. Please perform a proper analysis, accounting for other competing events, to answer the research question of interest.

# Load data and packages
transplant <- read.csv("transplant.csv")
glimpse(transplant)

## Rows: 177
## Columns: 7
## $ Sex    <chr> "M", "F", "M", "F", "F", "M", "M", "F", "M", "F", "M", "M", "F"…
## $ D      <chr> "ALL", "AML", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "ALL", …
## $ Phase  <chr> "Relapse", "CR2", "CR3", "CR2", "CR2", "Relapse", "CR1", "CR1",…
## $ Age    <int> 48, 23, 7, 26, 36, 17, 7, 17, 26, 8, 29, 13, 29, 7, 40, 20, 18,…
## $ Status <int> 2, 1, 0, 2, 2, 2, 0, 2, 0, 1, 2, 2, 2, 1, 2, 2, 0, 2, 2, 2, 1, …
## $ Source <chr> "BM+PB", "BM+PB", "BM+PB", "BM+PB", "BM+PB", "BM+PB", "BM+PB", …
## $ ftime  <dbl> 0.67, 9.50, 131.77, 24.03, 1.47, 2.23, 124.83, 10.53, 123.90, 2…

library(cmprsk)
# Create factor variables
transplant <- transplant %>%
  mutate(
    Source = factor(Source),
    Status = factor(Status, levels = c(0, 1, 2)),
    Sex = factor(Sex),
    D = factor(D),
    Phase = factor(Phase)
  )

# Cause-specific hazards regression approach
cs_hr<- coxph(Surv(ftime, ifelse(Status == 1, 1, 0)) ~ Source + D + Sex + Age + Phase, data = transplant)
summary(cs_hr)

## Call:
## coxph(formula = Surv(ftime, ifelse(Status == 1, 1, 0)) ~ Source + 
##     D + Sex + Age + Phase, data = transplant)
## 
##   n= 177, number of events= 56 
## 
##                   coef exp(coef)  se(coef)      z Pr(>|z|)    
## SourcePB      0.381089  1.463878  0.576949  0.661 0.508917    
## DAML         -0.658082  0.517844  0.298633 -2.204 0.027549 *  
## SexM         -0.381256  0.683003  0.284635 -1.339 0.180421    
## Age          -0.006715  0.993308  0.012135 -0.553 0.580038    
## PhaseCR2      0.188738  1.207725  0.467073  0.404 0.686148    
## PhaseCR3      0.512107  1.668804  0.693629  0.738 0.460331    
## PhaseRelapse  1.515829  4.553194  0.396458  3.823 0.000132 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## SourcePB        1.4639     0.6831    0.4725    4.5353
## DAML            0.5178     1.9311    0.2884    0.9298
## SexM            0.6830     1.4641    0.3910    1.1932
## Age             0.9933     1.0067    0.9700    1.0172
## PhaseCR2        1.2077     0.8280    0.4835    3.0168
## PhaseCR3        1.6688     0.5992    0.4285    6.4987
## PhaseRelapse    4.5532     0.2196    2.0934    9.9033
## 
## Concordance= 0.724  (se = 0.038 )
## Likelihood ratio test= 31.28  on 7 df,   p=6e-05
## Wald test            = 30.16  on 7 df,   p=9e-05
## Score (logrank) test = 33.78  on 7 df,   p=2e-05

# Cox proportional hazards model stratified by Source (exploratory)
cs_hr_strat <- coxph(Surv(ftime, ifelse(Status == 1, 1, 0)) ~ D + Sex + Age + Phase + strata(Source), data = transplant)

# Summary
summary(cs_hr_strat)

## Call:
## coxph(formula = Surv(ftime, ifelse(Status == 1, 1, 0)) ~ D + 
##     Sex + Age + Phase + strata(Source), data = transplant)
## 
##   n= 177, number of events= 56 
## 
##                   coef exp(coef)  se(coef)      z Pr(>|z|)    
## DAML         -0.664363  0.514601  0.298277 -2.227 0.025925 *  
## SexM         -0.361297  0.696772  0.284525 -1.270 0.204148    
## Age          -0.006278  0.993741  0.012165 -0.516 0.605786    
## PhaseCR2      0.182709  1.200465  0.467230  0.391 0.695762    
## PhaseCR3      0.561730  1.753704  0.692953  0.811 0.417577    
## PhaseRelapse  1.507349  4.514745  0.396925  3.798 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## DAML            0.5146     1.9433    0.2868    0.9233
## SexM            0.6968     1.4352    0.3989    1.2170
## Age             0.9937     1.0063    0.9703    1.0177
## PhaseCR2        1.2005     0.8330    0.4804    2.9995
## PhaseCR3        1.7537     0.5702    0.4509    6.8202
## PhaseRelapse    4.5147     0.2215    2.0738    9.8287
## 
## Concordance= 0.712  (se = 0.04 )
## Likelihood ratio test= 30.41  on 6 df,   p=3e-05
## Wald test            = 28.81  on 6 df,   p=7e-05
## Score (logrank) test = 32.34  on 6 df,   p=1e-05

# Subdistribution hazards approach
covar<- model.matrix(~ Source + D + Sex + Age + Phase, 
                           data = transplant)[, -1]

sd_hr <- crr(ftime = transplant$ftime, 
              fstatus = as.numeric(transplant$Status),
              cov1 = covar,
              failcode = 1,
              cencode= 0)

# Summarize model
summary(sd_hr)

## Competing Risks Regression
## 
## Call:
## crr(ftime = transplant$ftime, fstatus = as.numeric(transplant$Status), 
##     cov1 = covar, failcode = 1, cencode = 0)
## 
##                 coef exp(coef) se(coef)      z p-value
## SourcePB      0.6988     2.011   0.4474  1.562 1.2e-01
## DAML          0.4183     1.519   0.3615  1.157 2.5e-01
## SexM          0.2087     1.232   0.3155  0.662 5.1e-01
## Age          -0.0187     0.981   0.0154 -1.209 2.3e-01
## PhaseCR2     -0.3303     0.719   0.3513 -0.940 3.5e-01
## PhaseCR3     -0.3424     0.710   0.5627 -0.609 5.4e-01
## PhaseRelapse -1.7838     0.168   0.4320 -4.130 3.6e-05
## 
##              exp(coef) exp(-coef)  2.5% 97.5%
## SourcePB         2.011      0.497 0.837 4.834
## DAML             1.519      0.658 0.748 3.086
## SexM             1.232      0.812 0.664 2.286
## Age              0.981      1.019 0.952 1.012
## PhaseCR2         0.719      1.391 0.361 1.431
## PhaseCR3         0.710      1.408 0.236 2.139
## PhaseRelapse     0.168      5.952 0.072 0.392
## 
## Num. cases = 177
## Pseudo Log-likelihood = -219 
## Pseudo likelihood ratio test = 25.6  on 7 df,

The choice of model to handle competing risks depends on the relationship between the exposure and the event of interest, which can influence how hazard ratios are distributed across exposure groups. Depending on this relationship, the cause-specific hazard ratios (csHR) and subdistribution hazard ratios (sdHR) may differ. It is necessary to compare hazard ratios from both models to assess these differences. The csRH might be more applicable for studying the etiology of diseases, whereas the sdRH might be more appropriate for predicting an individual’s risk for an outcome (like in our example) or resource allocation. In general, a sdRH >1 indicates that those with exposure will be seen to have a quicker time to event. In this specific dataset, the sdRH had a stronger association than the csRH for status (exposure). Therefore, as the subdistribution hazard maintains individuals who develop the competing event in the risk set, this implies that individuals, depending on their status are maintained in the risk set in a different proportion. For this specific example, I would use the sdHR model to treat competing risks.

#PART 4 CTS DATASET ### Q4.1 Compare baseline characteristics between the surgery group and the non-surgery group to assess randomization success

cts <- read.csv("cts.csv")
glimpse(cts)

## Rows: 116
## Columns: 18
## $ id          <int> 10014, 10015, 10016, 11050, 11068, 11071, 11078, 11086, 11…
## $ ctsaqs0     <dbl> 2.636364, 1.818182, 2.818182, 1.909091, 4.181818, 2.181818…
## $ ctsaqf0     <dbl> 1.666667, 1.444444, 2.000000, 1.888889, 4.000000, 2.000000…
## $ ctsaqs1     <dbl> 2.454545, 1.090909, 3.454545, 2.181818, 4.181818, 1.090909…
## $ ctsaqf1     <dbl> 2.222222, 1.333333, 2.777778, 1.666667, 4.111111, 1.571429…
## $ ctsaqs2     <dbl> 2.090909, 2.000000, 3.727273, 2.272727, 4.000000, 1.181818…
## $ ctsaqf2     <dbl> 2.333333, 1.555556, 3.375000, 1.888889, 4.222222, 1.222222…
## $ ctsaqs3     <dbl> 2.454545, 1.600000, NA, 1.272727, 3.727273, 1.090909, 2.18…
## $ ctsaqf3     <dbl> 2.000000, 1.111111, NA, 1.333333, 3.777778, 1.000000, 2.50…
## $ ctsaqs4     <dbl> 1.545455, 1.272727, 1.000000, 3.454545, 4.272727, 1.000000…
## $ ctsaqf4     <dbl> 1.444444, 1.444444, 1.000000, 2.888889, 4.000000, 1.000000…
## $ treatassign <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1…
## $ age         <dbl> 44.43258, 49.59617, 55.65229, 49.93566, 46.86379, 56.04928…
## $ gender      <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0…
## $ idgroup     <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 3…
## $ bmi         <dbl> 29.04959, 20.11707, 31.09001, NA, 61.50880, 35.86735, NA, …
## $ smoker      <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1…
## $ diabetes    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…

library(gtsummary)
library(dplyr)

# Modify factor variables
cts <- cts %>%
  mutate(
    treatassign = factor(treatassign, levels = c(0,1), labels = c("Non-Surgical", "Surgical")),
    gender = factor(gender, levels = c(0,1), labels = c("Male", "Female")),
    smoker = factor(smoker, levels = c(0,1), labels = c("Non-Smoker", "Smoker")),
    diabetes = factor(diabetes, levels = c(0,1), labels = c("No Diabetes", "Diabetes"))
  )

# Create the baseline characteristics table
cts %>%
  dplyr::select(treatassign, age, gender, bmi, smoker, diabetes) %>%
  tbl_summary(
    by = treatassign,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})", 
      all_categorical() ~ "{n} ({p}%)"
    ),
    label = list(
      age ~ "Age (years)",
      gender ~ "Gender",
      bmi ~ "BMI",
      smoker ~ "Smoking Status",
      diabetes ~ "Diabetes Status"
    ),
    missing = "no"
  ) %>%
  add_p() %>%
  bold_labels()

Characteristic	Non-Surgical N = 59¹	Surgical N = 57¹	p-value²
Age (years)	51 (9)	50 (10)	0.8
Gender			0.4
Male	25 (42%)	29 (51%)
Female	34 (58%)	28 (49%)
BMI	32 (9)	34 (8)	0.2
Smoking Status			0.7
Non-Smoker	40 (68%)	37 (65%)
Smoker	19 (32%)	20 (35%)
Diabetes Status			0.5
No Diabetes	48 (81%)	49 (86%)
Diabetes	11 (19%)	8 (14%)
¹ Mean (SD); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

We can see that the distribution of the variables at baseline are balanced among groups (no statistically significant differences).

Q4.2

Summarize and report the distribution of the CTSAQ symptom burden scores between the surgery group and the non-surgery group over time.

library(gtsummary)
library(dplyr)

# Reshape to long format
 cts_long <- cts %>%
  dplyr::select(id, treatassign, starts_with("ctsaqs")) %>%
  pivot_longer(
    cols = starts_with("ctsaqs"),
    names_to = "timepoint",
    values_to = "symptom_burden"
  ) %>%
  mutate(
    timepoint = as.character(timepoint),  # First convert to character
    timepoint = dplyr::recode(timepoint,
      "ctsaqs0" = "Baseline",
      "ctsaqs1" = "3 Months",
      "ctsaqs2" = "6 Months",
      "ctsaqs3" = "9 Months",
      "ctsaqs4" = "12 Months"
    )
  )

# Summarize symptom burden scores by treatment group and time
cts_long %>%
  tbl_strata(
    strata = timepoint,
    .tbl_fun = ~ .x %>%
      tbl_summary(
        by = treatassign,
        statistic = symptom_burden ~ "{mean} ({sd})",
        label = list(symptom_burden ~ "Symptom Burden Score"),
        missing = "no"
      )
  ) %>%
  bold_labels()

Characteristic	12 Months		3 Months		6 Months		9 Months		Baseline
Characteristic	Non-Surgical N = 59¹	Surgical N = 57¹	Non-Surgical N = 59¹	Surgical N = 57¹	Non-Surgical N = 59¹	Surgical N = 57¹	Non-Surgical N = 59¹	Surgical N = 57¹	Non-Surgical N = 59¹	Surgical N = 57¹
id	13,149 (13,014, 14,066)	13,160 (13,015, 14,063)	13,149 (13,014, 14,066)	13,160 (13,015, 14,063)	13,149 (13,014, 14,066)	13,160 (13,015, 14,063)	13,149 (13,014, 14,066)	13,160 (13,015, 14,063)	13,149 (13,014, 14,066)	13,160 (13,015, 14,063)
Symptom Burden Score	2.07 (0.88)	1.74 (0.76)	2.61 (0.85)	2.04 (0.87)	2.42 (0.80)	2.02 (1.03)	2.14 (0.75)	1.71 (0.76)	3.01 (0.64)	2.95 (0.77)
¹ Median (Q1, Q3); Mean (SD)

# Summarize graphically
ggplot(cts_long, aes(x = timepoint, y = symptom_burden, fill = treatassign)) +
  geom_boxplot(alpha=0.5) +
  geom_jitter(aes(color = treatassign),width = 0.2) +
  guides(fill = "none") +
  labs(
    title = "Distribution of CTSAQ Scores Over Time",
    x = "Time",
    y = "CTSAQ Symptom Burden Score",
    fill = "Treatment Group",
    color = "Treatment Group"
  ) +
  theme_classic()

# Trajectories over time
ggplot(cts_long, aes(timepoint,symptom_burden)) +
  geom_line(aes(group = factor(id))) +
  geom_smooth() +
  theme_bw()+
    labs(
    title = "Distribution of CTSAQ Scores Over Time by treatment group",
    x = "Time",
    y = "CTSAQ Symptom Burden Score"
    ) +
  facet_grid(~ treatassign)

The surgical and non-surgical groups had similar baseline symptom burden scores. Over time, both groups improved, but the surgical group consistently reported lower symptom burden at all follow-up points. By 3, 6, 9, and 12 months, differences between groups approached or exceeded 0.30, suggesting a possible clinical benefit from surgery sustained over one year.

Q4.3

Estimate and report the correlation matrix for the CTSAQ symptom burden scores over time.

#install.packages("corrr")
library(corrr)

# Compute correlation matrix
cts_corr <- cts %>%
  dplyr::select(ctsaqs0:ctsaqs4) %>%
  correlate(method = "pearson")

# Show matrix
cts_corr

Q4.4

Perform a linear mixed model to measure the impact of randomization to surgery on the rate of change in CTSAQ symptom scores over the course of one year. Report your methods and interpret the results. ### Q4.5 Repeat the analysis in Q4 using a generalized estimating equation and compare the findings.

library(lme4)
# Fit linear mixed effects model
# Random intercept model
glmm_cts <- lmer(
  symptom_burden ~ timepoint * treatassign + (1 | id),
  data = cts_long
) %>%
tbl_regression(
  exponentiate = FALSE,
  label = list(
    timepoint ~ "Time (months)",
    treatassign ~ "Treatment Assignment (Surgical vs Non-Surgical)",
    `timepoint:treatassign` ~ "Interaction (Time × Treatment)"
  )
) %>%
  modify_header(label = "**Linear Mixed Model (Random Intercept + Slope)**")


library(geepack)
gee_cts <- geeglm(
  symptom_burden ~ timepoint * treatassign,
  id = id,
  data = cts_long,
  family = gaussian,
  corstr = "exchangeable"
) %>%tbl_regression(
  exponentiate = FALSE,
  label = list(
    timepoint ~ "Time (months)",
    treatassign ~ "Treatment Assignment (Surgical vs Non-Surgical)",
    `timepoint:treatassign` ~ "Interaction (Time × Treatment)"
  )
) %>%
  modify_header(label = "**GEE Model**")

# Combine side-by-side
tbl_merge(
  tbls = list(glmm_cts, gee_cts),
  tab_spanner = c("**GLMM**", "**GEE**")
)

Linear Mixed Model (Random Intercept + Slope)	GLMM		GEE
Linear Mixed Model (Random Intercept + Slope)	Beta	95% CI¹	Beta	95% CI¹	p-value
Time (months)
12 Months	—	—	—	—
3 Months	0.53	0.29, 0.76	0.53	0.21, 0.85	0.001
6 Months	0.35	0.12, 0.59	0.35	0.08, 0.63	0.010
9 Months	0.10	-0.15, 0.35	0.10	-0.10, 0.30	0.3
Baseline	0.92	0.69, 1.2	0.92	0.69, 1.2	<0.001
Treatment Assignment (Surgical vs Non-Surgical)
Non-Surgical	—	—	—	—
Surgical	-0.30	-0.61, 0.01	-0.30	-0.62, 0.02	0.066
Interaction (Time × Treatment)
3 Months * Surgical	-0.29	-0.62, 0.05	-0.28	-0.69, 0.12	0.2
6 Months * Surgical	-0.12	-0.46, 0.23	-0.11	-0.45, 0.22	0.5
9 Months * Surgical	-0.15	-0.50, 0.20	-0.15	-0.39, 0.10	0.2
Baseline * Surgical	0.23	-0.10, 0.57	0.23	-0.13, 0.60	0.2
¹ CI = Confidence Interval

Both the linear mixed model and GEE show similar results. Symptom burden improves over time in both treatment groups, with statistically significant improvements at 3 months and 6 months compared to baseline. The surgical group shows slightly greater improvement (negative coefficients for the interaction terms), but none of the surgery × time interactions are statistically significant. At baseline, surgical and non-surgical groups do not differ substantially. Overall, both models suggest symptom burden decreases over time, but there is no strong statistical evidence that surgery accelerated symptom improvement compared to non-surgical treatment within the first year.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

PH724_Assignment4_VDC

Vanessa Davila

2025-04-29