Subgroup Analyses(Separate survival analyses for metastatic and in-situ patient groups.
Comparative Effectiveness of Treatments (Assessing and comparing survival outcomes among different drug treatments.)
# Load required packagespacman::p_load( lubridate, ggsurvfit, gtsummary, tidycmprsk, tidyverse, dplyr, gt, survival, broom, survminer, survRM2)# Loading the cleaned synthetic data from Synthea.df <-read_csv("parquet.csv")# Quick look at the data structureglimpse(df)
The package gtsummary allows us to create a publication ready summary statistics.
# Create a named list of column labels for the summary tablecolumn_labels <-list('gender'='Gender, n (%)','race'='Race, n (%)','ethnicity'='Ethnicity, n (%)','age_at_dx'='Age at Dx (years)','income'='Income (K USD)','death_reason'='Death Reason, n (%)','lines_of_treatment'='Lines of Treatment, n (%)')# Identify columns that have more than one distinct valuecolumns_include <-names(column_labels)columns_use <- df %>%select(all_of(columns_include)) %>%summarise(across(everything(), ~n_distinct(.))) %>%pivot_longer(everything(), names_to ="var", values_to ="n_levels") %>%filter(n_levels >1) %>%pull(var)# Build summary table, stratifying by 'progression' summary_stats <- df %>%select(all_of(c(columns_use, "progression"))) %>%tbl_summary(by ="progression",label = column_labels,statistic =list(all_categorical() ~"{n} ({p}%)",all_continuous() ~"{mean} ({sd})" ) ) %>%add_n() %>%add_p(test =list(all_categorical() ~"chisq.test",all_continuous() ~"aov" ),test.args =all_tests("chisq.test") ~list(simulate.p.value =TRUE, B =10000) )# Print the tablesummary_stats
Characteristic
N
in situ N = 7521
in situ -> metastatic N = 941
metastatic N = 811
p-value2
Race, n (%)
927
0.5
black
62 (8.2%)
7 (7.4%)
10 (12%)
other
56 (7.4%)
4 (4.3%)
6 (7.4%)
white
634 (84%)
83 (88%)
65 (80%)
Ethnicity, n (%)
927
0.4
hispanic
77 (10%)
8 (8.5%)
12 (15%)
nonhispanic
675 (90%)
86 (91%)
69 (85%)
Age at Dx (years)
927
68 (9)
68 (8)
67 (7)
0.3
Income (K USD)
927
114 (157)
131 (175)
111 (150)
0.6
Death Reason, n (%)
927
<0.001
Alive
607 (81%)
23 (24%)
34 (42%)
Missing
1 (0.1%)
0 (0%)
0 (0%)
Other causes
99 (13%)
8 (8.5%)
3 (3.7%)
prostate cancer
45 (6.0%)
63 (67%)
44 (54%)
Lines of Treatment, n (%)
846
<0.001
L1
62 (8.2%)
94 (100%)
0 (NA%)
L2
690 (92%)
0 (0%)
0 (NA%)
Unknown
0
0
81
1 n (%); Mean (SD)
2 Pearson’s Chi-squared test with simulated p-value (based on 10000 replicates); One-way analysis of means
The cohort of 927 participants consisted predominantly of white (84%-88%) and non-Hispanic individuals (85%-91%), with no significant racial or ethnic differences across disease stages (p=0.5 and p=0.3, respectively). Age at diagnosis is 68 years in the in-situ and the progression from in-situ to the metastatic stage. While the age at diagnosis at the metastatic stage is 67. Income were similar across groups ($111K-$131K). Death due to prostate cancer was significantly higher among those who progressed from in-situ to metastatic (67%) and metastatic-only groups (54%), compared to the in-situ group (6%) (p<0.001). Treatment intensity, reflected by lines of treatment, differed significantly by disease progression status (p<0.001)
Kaplan Meier Survival Curves
# Fit the Kaplan-Meier model for followup_neoplasm_months (time) and is_dead (event)fit <-survfit2(Surv(followup_neoplasm_months, is_dead) ~ progression, data = df)# Plot the KM curves with ggsurvfitggsurvfit(fit) +add_confidence_interval() +labs(x ="Follow-up (Months)",y ="Survival Probability" ) +theme_minimal()
The Kaplan Meier curve above shows the survival probability and the follow-up time for the groups. As observed, there is a clear difference in survival probability across the group. In-situ (red): have the highest survival probability with over 50% surviving beyond 10 years. The in-situ -> metastatic (green) experienced a sharp decline in survival, with most deaths occurring within 5 years (60 months). The metastatic (blue) group aslo experienced a sharp decline in survival within 5 years. In what follows, we will estimate the time point at which 50% of the study population is expected to die from disease progression.
Median survival
This will return the median survival time for each group (state). If a group does not reach the median survival probability, the function will return NA. From the table, we observe that patients in the in-situ state stayed above 0.5 over the study period (120 months), suggesting that over half of the patients with in-situ prostate cancer lived longer than 10 years (long term survival benefit). For those progressing from the in-situ to the metastatic state, the median survival time is approximately 31 months, whereas for those already in the metastatic state, it is about 40 months. Suggesting that disease progression worsens compared to in-situ group.
# Calculate median survival times for each groupmed_surv <-surv_median(fit)# Display in a gt tablegt(med_surv)
strata
median
lower
upper
progression=in situ
NA
NA
NA
progression=in situ -> metastatic
30.45585
26.15195
36.43532
progression=metastatic
39.75359
36.86242
42.97331
Survival Estimate in 30 and 60 months
While median survival times provide a useful overall summary of survival patterns, time-specific survival probabilities at clinically relevant follow-up points offer additional insight into the short- and mid-term prognosis across groups.
The estimate below shows survival probabilities at 30 (2.5 years) and 60 (5 years) months
# Display survival at 2.5 years (30 months) and 5 years (60 months)summary(fit, times =c(30, 60))
Call: survfit(formula = Surv(followup_neoplasm_months, is_dead) ~ progression,
data = df)
progression=in situ
time n.risk n.event survival std.err lower 95% CI upper 95% CI
30 465 59 0.902 0.0122 0.878 0.926
60 259 56 0.766 0.0199 0.728 0.806
progression=in situ -> metastatic
time n.risk n.event survival std.err lower 95% CI upper 95% CI
30 38 40 0.5034 0.0561 0.4046 0.626
60 4 28 0.0691 0.0327 0.0273 0.175
progression=metastatic
time n.risk n.event survival std.err lower 95% CI
30.000 39.000 17.000 0.725 0.057 0.621
upper 95% CI
0.846
We observe that patients in the in-situ state have a 90% survival probability at 30 months, which declines to 77% at 60 months. Patients who progress from in-situ to metastatic show a significantly lower survival probability—50% at 30 months and just 7% at 60 months. For those already in the metastatic state, survival at 30 months is approximately 72% A natural question is are there differences across the groups?
Cox Proportional Hazards Model.
Estimating the relative hazard of death associated with each progression compared to the in-situ group.
# Fit the Cox model: Surv(time, event) ~ progressioncox_model <-coxph(Surv(followup_neoplasm_months, is_dead) ~ progression, data = df)cox_model
Call:
coxph(formula = Surv(followup_neoplasm_months, is_dead) ~ progression,
data = df)
coef exp(coef) se(coef) z p
progressionin situ -> metastatic 2.2210 9.2161 0.1564 14.20 <2e-16
progressionmetastatic 1.9177 6.8053 0.1779 10.78 <2e-16
Likelihood ratio test=210.7 on 2 df, p=< 2.2e-16
n= 927, number of events= 263
The Cox model results show that patients who progress from the in-situ to the metastatic state have a hazard ratio of 9.22, meaning their risk of death is over nine times higher than patients who remain in the in-situ state. Patients who are already in the metastatic state have a hazard ratio of 6.81. Both groups have significantly elevated mortality risks compared to the in-situ group. Interestingly, the hazard is higher in those who progressed from in-situ to metastatic than in those who were metastatic at baseline. This leads us to investigate if the effect changes over time. To do this, we test the:
Proportional hazard assumption
We need to check the proportional hazard assumption to assess if the effect changes over time
cox.zph(cox_model)
chisq df p
progression 28.4 2 6.7e-07
GLOBAL 28.4 2 6.7e-07
As shown above, the global test for the proportional hazards assumption shows a significant p-value (p = 6.7e-07), indicating that the effect of the ‘progression’ variable on the hazard is not constant over time. This violation suggests that the proportional hazards assumption is not met. To address this, additional covariates should be included in the model or time-varying effects should be considered to account for heterogeneity and time-fixed effects.
Adjusted cox models
Accounting for patient characteristics such as age at diagnosis, income, race and ethnicity.
# Fit a Cox model adjusting for other variablescox_model_adj <-coxph(Surv(followup_neoplasm_months, is_dead) ~ progression + age_at_dx + income + race + ethnicity,data = df)# Summarize the adjusted modelsummary(cox_model_adj)
chisq df p
progression 27.489 2 1.1e-06
age_at_dx 0.425 1 0.51
income 0.117 1 0.73
race 2.105 2 0.35
ethnicity 0.267 1 0.61
GLOBAL 30.379 7 8.1e-05
# Display with gtsummarycox_model_adj %>%tbl_regression(exp =TRUE) %>%bold_p(t =0.05) %>%modify_caption("**Cox Proportional Hazards Model (Adjusted)**")
Cox Proportional Hazards Model (Adjusted)
Characteristic
HR
95% CI
p-value
progression
in situ
—
—
in situ -> metastatic
9.18
6.75, 12.5
<0.001
metastatic
6.85
4.83, 9.72
<0.001
age_at_dx
1.01
1.00, 1.03
0.040
income
1.00
1.00, 1.00
0.3
race
black
—
—
other
0.72
0.37, 1.39
0.3
white
0.98
0.64, 1.48
>0.9
ethnicity
hispanic
—
—
nonhispanic
0.82
0.55, 1.22
0.3
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio
Even after controlling for patient characteristics, we observe that patients progressing from in-situ to metastatic disease had a hazard ratio similar to the univariate case (HR) of 9.18 (95% CI: 6.75–12.49), indicating a 9-fold increased risk of death compared to those who remained in the in-situ state. Patients already in the metastatic state had an HR of 6.85 (95% CI: 4.83–9.72). Age at diagnosis was also significantly associated with survival (HR = 1.014, p = 0.0402), suggesting that each additional year of age increases the hazard by approximately 1.4%. Income, race, and ethnicity were not statistically significant predictors in the adjusted model. The non significance of income, race and ethnicity means they do not violate the proportional hazard assumption. The concordance statistic is 0.705, indicating good discrimination.
Further diagnosis. Stratified Cox Model
The idea is to stratify by disease progression levels while assuming that the effect on other covariates are constant across strata.
coxph(Surv(followup_neoplasm_months, is_dead) ~ age_at_dx + income + race + ethnicity +strata(progression), data = df)
Call:
coxph(formula = Surv(followup_neoplasm_months, is_dead) ~ age_at_dx +
income + race + ethnicity + strata(progression), data = df)
coef exp(coef) se(coef) z p
age_at_dx 0.0108583 1.0109174 0.0070961 1.530 0.126
income 0.0003780 1.0003781 0.0003827 0.988 0.323
raceother -0.4239805 0.6544367 0.3378339 -1.255 0.209
racewhite -0.0992986 0.9054723 0.2142302 -0.464 0.643
ethnicitynonhispanic -0.1605612 0.8516657 0.2062997 -0.778 0.436
Likelihood ratio test=5.2 on 5 df, p=0.3925
n= 927, number of events= 263
Results from the stratified cox model shows that non of the covariates included in the model independently affected the survival. This shows that the results from the cox model may be driven mainly by disease progression.
Further diagnosis
# Quick check of death counts by progression groupdf %>%group_by(progression) %>%summarise(deaths =sum(is_dead),total =n(),death_rate = deaths / total )
# A tibble: 3 × 4
progression deaths total death_rate
<chr> <dbl> <int> <dbl>
1 in situ 145 752 0.193
2 in situ -> metastatic 71 94 0.755
3 metastatic 47 81 0.580
Our further descriptive assessment shows that the death rate among patients in the in-situ state is approximately 19%. In contrast, patients who progressed from in-situ to metastatic had a significantly higher death rate of 76%. Among those initially diagnosed in the metastatic state, the death rate was 58%. These findings highlight the substantial increase in mortality risk associated with disease progression, particularly among patients who transition from in-situ to metastatic stages.
Plotting the fitted cox model
# Plot overall survival curve from the fitted Cox modelggsurvplot(survfit(cox_model), data = df, conf.int =TRUE)
In-situ ->Metastatic and Metastatic Group Analysis
What could explain the low survival rates in the in-situ -> metastatic and metastatic state. Here, the idea is to filter the data to focus on patients who progress from the in-situ to the metastatic state, as well as those already in the metastatic state.
# Filter to cases: "in situ -> metastatic" and "metastatic"met_df <- df %>%filter(progression %in%c("in situ -> metastatic", "metastatic"))
Plotting the Kaplan Meier curve for these specific groups
# Fit KM model using followup_metastasis_months as time variablekm_fit <-survfit2(Surv(followup_metastasis_months, is_dead) ~ progression, data = met_df)# Plot the curveskm_fit %>%ggsurvfit() +add_confidence_interval() +labs(x ="Follow-up (Months)",y ="Survival Probability",title ="Kaplan-Meier Survival Curves for Metastatic Groups" ) +theme_minimal()
# Extract median survival timesmedsurv_overall <-surv_median(km_fit)# Display in a tablemedsurv_overall %>%select(strata, median) %>%rename(Group = strata,Median_Survival = median ) %>%gt() %>%fmt_number(columns =c(Median_Survival), decimals =2) %>%tab_header(title ="Median Survival Times for Metastatic Groups")
Median Survival Times for Metastatic Groups
Group
Median_Survival
progression=in situ -> metastatic
27.50
progression=metastatic
39.75
# Check survival at the 50% time quantilesummary(km_fit, times =quantile(met_df$followup_metastasis_months, 0.5))
Call: survfit(formula = Surv(followup_metastasis_months, is_dead) ~
progression, data = met_df)
progression=in situ -> metastatic
time n.risk n.event survival std.err lower 95% CI
24.2793 43.0000 37.0000 0.5430 0.0557 0.4441
upper 95% CI
0.6639
progression=metastatic
time n.risk n.event survival std.err lower 95% CI
24.2793 45.0000 15.0000 0.7606 0.0539 0.6619
upper 95% CI
0.8740
At around 24 months, the in situ → metastatic group shows an estimated survival probability of 54.3% (95% CI: 44.4–66.4), whereas the metastatic group’s estimated survival is 76.1% (95% CI: 66.2–87.4). This suggests that patients whose disease progressed from in situ to metastatic experienced lower survival rates compared to those who were initially diagnosed with metastatic prostate cancer.
Conduct a log rank test between the two groups
# Perform a log-rank test between the two metastatic progression groupslogrank_test <-survdiff(Surv(followup_metastasis_months, is_dead) ~ progression, data = met_df)# Convert to a tidy formatlogrank_df <-tibble(Statistic = logrank_test$chisq,P_Value =1-pchisq(logrank_test$chisq, df =length(logrank_test$n) -1))# Display the log-rank test resultlogrank_df %>%gt() %>%fmt_number(columns =c(Statistic, P_Value), decimals =3) %>%tab_header(title ="Log-Rank Test for Survival Differences")
Log-Rank Test for Survival Differences
Statistic
P_Value
6.609
0.010
With a p-value 0.010. We can confirm there is a significant difference in survival between the two metastatic subgroups
Now, lets conduct a cox model for the metastatic group
# Fit Cox model for metastatic-only datacox_model_met <-coxph(Surv(followup_metastasis_months, is_dead) ~ progression, data = met_df)summary(cox_model_met)
Call:
coxph(formula = Surv(followup_metastasis_months, is_dead) ~ progression,
data = met_df)
n= 175, number of events= 118
coef exp(coef) se(coef) z Pr(>|z|)
progressionmetastatic -0.4855 0.6154 0.1908 -2.544 0.011 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
progressionmetastatic 0.6154 1.625 0.4234 0.8945
Concordance= 0.587 (se = 0.026 )
Likelihood ratio test= 6.61 on 1 df, p=0.01
Wald test = 6.47 on 1 df, p=0.01
Score (logrank) test = 6.59 on 1 df, p=0.01
Cox Proportional Hazards Model for Metastatic Groups
Predictor
HR
95% CI Lower
95% CI Upper
P-Value
progressionmetastatic
0.62
0.42
0.89
0.011
# Check PH assumptioncox.zph(cox_model_met)
chisq df p
progression 5.07 1 0.024
GLOBAL 5.07 1 0.024
#A significant value shows the the model is time dependent
Comparative effectiveness of treatment on in-situ
# Filter dataset for patients who remain 'in situ'insitu_df <- df %>%filter(progression =="in situ")# Kaplan-Meier by lines_of_treatmentkm_fit_insitu <-survfit2(Surv(followup_neoplasm_months, is_dead) ~ lines_of_treatment, data = insitu_df)# Plot survival curveskm_fit_insitu %>%ggsurvfit() +labs(x ="Follow-up (Months)",y ="Survival Probability",title ="Kaplan-Meier Survival for In Situ Disease (by Lines of Treatment)" ) +theme_minimal()
# Extract median survival timesmedsurv_overall <-surv_median(km_fit_insitu)# Display as a tablemedsurv_overall %>%select(strata, median) %>%rename(Group = strata,Median_Survival = median ) %>%gt() %>%fmt_number(columns =c(Median_Survival), decimals =2) %>%tab_header(title ="Median Survival for In Situ Disease by Lines of Treatment")
Median Survival for In Situ Disease by Lines of Treatment
Group
Median_Survival
lines_of_treatment=L1
29.40
lines_of_treatment=L2
NA
We observe a notable trend in survival outcomes based on the number of treatment lines received. Patients who received only one line of treatment experienced a sharp decline in survival probability during the early follow-up period. In contrast, those who received two lines of treatment demonstrated a consistently higher survival probability throughout the follow-up. This suggests that receiving additional lines of therapy may be associated with improved survival outcomes
Cox model by lines of treatment.
# Define formulaformula <-as.formula("Surv(followup_neoplasm_months, is_dead) ~ lines_of_treatment")# Fit Cox modelcoxph_model <-coxph(formula, data = insitu_df)# Extract hazard ratios, confidence intervals, p-valuescox_result <-tidy(coxph_model, exponentiate =TRUE, conf.int =TRUE) %>%select(term, estimate, conf.low, conf.high, p.value) %>%rename(Variable = term,Hazard_Ratio = estimate,CI_Lower = conf.low,CI_Upper = conf.high,P_Value = p.value )# Display with gtcox_result %>%gt() %>%fmt_number(columns =c(Hazard_Ratio, CI_Lower, CI_Upper), decimals =2) %>%fmt_number(columns = P_Value, decimals =3) %>%tab_header(title ="Cox Proportional Hazards Model by Lines of Treatment (In Situ Patients)") %>%cols_label(Variable ="Treatment Line",Hazard_Ratio ="HR",CI_Lower ="95% CI Lower",CI_Upper ="95% CI Upper",P_Value ="P-Value" )
Cox Proportional Hazards Model by Lines of Treatment (In Situ Patients)
Treatment Line
HR
95% CI Lower
95% CI Upper
P-Value
lines_of_treatmentL2
0.08
0.05
0.12
0.000
The Hazard Ratio value is lower than one for the group with two lines of therapy, and the confidence intervals do not cross 1, confirming that the risk of death is lower for the group with two lines of treatment than for the group with one line of treatment.