—library(here)
title: “Prostate Cancer Screening Simulation Analysis” author: “aysa hasanzade, MD” date: “December 14, 2025” output: html_document: toc: true toc_float: true toc_depth: 3 theme: flatly highlight: tango —
This report analyzes simulated data from a prostate cancer screening trial. The simulation includes: - 2000 patients randomized to screening or control arms - Realistic features: missing data, loss to follow-up, varying cancer aggressiveness - Time-to-event outcomes for cancer diagnosis
# Load the simulated data
trial_data <- read_csv(here("simulated_prostate_screening_trial.csv"))
# Quick overview
cat("Dataset dimensions:", dim(trial_data), "\n")
## Dataset dimensions: 2000 15
cat("Variables:", names(trial_data), "\n")
## Variables: patient_id arm age family_history psa_baseline has_cancer cancer_aggressiveness time_to_diagnosis vital_status time_to_death lost_to_followup psa_missing time_to_event event_type time_to_diagnosis_months
# Create a publication-ready Table 1
table1 <- trial_data %>%
select(arm, age, family_history, has_cancer, vital_status) %>%
mutate(
family_history = factor(family_history,
labels = c("No", "Yes")),
has_cancer = factor(has_cancer,
labels = c("No cancer", "Cancer detected")),
vital_status = factor(vital_status,
labels = c("Alive", "Deceased"))
) %>%
tbl_summary(
by = arm,
label = list(
age ~ "Age, years",
family_history ~ "Family history of prostate cancer",
has_cancer ~ "Cancer detected during trial",
vital_status ~ "Vital status at end of follow-up"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = list(all_continuous() ~ 1, all_categorical() ~ 1)
) %>%
add_p() %>%
modify_header(label = "**Variable**") %>%
bold_labels()
# Display the table
table1
| Variable | Control N = 1,0211 |
Screening N = 9791 |
p-value2 |
|---|---|---|---|
| Age, years | 64.9 (7.8) | 64.8 (7.7) | 0.7 |
| Family history of prostate cancer | 138.0 (13.5%) | 146.0 (14.9%) | 0.4 |
| Cancer detected during trial | 0.9 | ||
| No cancer | 946.0 (92.7%) | 905.0 (92.4%) | |
| Cancer detected | 75.0 (7.3%) | 74.0 (7.6%) | |
| Vital status at end of follow-up | 0.3 | ||
| Alive | 793.0 (77.7%) | 740.0 (75.6%) | |
| Deceased | 228.0 (22.3%) | 239.0 (24.4%) | |
| 1 Mean (SD); n (%) | |||
| 2 Wilcoxon rank sum test; Pearson’s Chi-squared test | |||
Interpretation: This table shows the baseline characteristics by trial arm. The p-value tests whether there are significant differences between groups (should be non-significant due to randomization).
# Calculate cancer detection rates
cancer_summary <- trial_data %>%
filter(has_cancer == 1) %>%
group_by(arm) %>%
summarise(
n_patients = n(),
mean_age = mean(age, na.rm = TRUE),
mean_time_to_dx = mean(time_to_diagnosis_months, na.rm = TRUE),
sd_time_to_dx = sd(time_to_diagnosis_months, na.rm = TRUE),
.groups = 'drop'
)
# Display as a clean table
cancer_summary %>%
kbl(caption = "Cancer Detection by Trial Arm") %>%
kable_classic(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| arm | n_patients | mean_age | mean_time_to_dx | sd_time_to_dx |
|---|---|---|---|---|
| Control | 75 | 63.01333 | 71.22327 | 84.43766 |
| Screening | 74 | 65.74324 | 68.07265 | 70.70905 |
# Prepare data for survival analysis (only cancer patients)
cancer_patients <- trial_data %>%
filter(has_cancer == 1) %>%
mutate(
# Ensure all have event (diagnosis) = 1
event = 1,
# Use time_to_diagnosis_months as survival time
time = time_to_diagnosis_months
)
# Create survival object
surv_obj <- Surv(time = cancer_patients$time, event = cancer_patients$event)
# Fit Kaplan-Meier curves by arm
km_fit <- survfit(surv_obj ~ arm, data = cancer_patients)
# Plot with survminer
diagnosis_plot <- ggsurvplot(
km_fit,
data = cancer_patients,
pval = TRUE, # Add log-rank p-value
pval.method = TRUE, # Add test name
conf.int = TRUE, # Add confidence intervals
risk.table = TRUE, # Add risk table
risk.table.height = 0.25, # Adjust table size
title = "Time to Prostate Cancer Diagnosis by Screening Arm",
xlab = "Time (months)",
ylab = "Proportion Without Diagnosis",
legend.title = "Trial Arm",
legend.labs = c("Control", "Screening"),
palette = c("#E7B800", "#2E9FDF"), # Custom colors
ggtheme = theme_minimal()
)
# Display the plot
diagnosis_plot
Key finding: The screening arm shows earlier diagnosis times (shift to the left in the curve), demonstrating the lead time effect of screening.
# Fit Cox model adjusting for age
cox_model <- coxph(surv_obj ~ arm + age, data = cancer_patients)
# Create regression table
cox_table <- tbl_regression(
cox_model,
exponentiate = TRUE, # Show Hazard Ratios (not log-HRs)
label = list(
arm ~ "Trial Arm (Screening vs Control)",
age ~ "Age (per year increase)"
)
) %>%
modify_header(estimate ~ "**Hazard Ratio**") %>%
bold_labels() %>%
italicize_levels()
cox_table
| Characteristic | Hazard Ratio | 95% CI | p-value |
|---|---|---|---|
| Trial Arm (Screening vs Control) | |||
| Control | — | — | |
| Screening | 0.95 | 0.68, 1.32 | 0.8 |
| Age (per year increase) | 1.03 | 1.00, 1.05 | 0.020 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
Interpretation: The hazard ratio for the screening arm represents how much faster/slower diagnoses occur compared to control, adjusted for age.
# Summarize missingness patterns
missing_summary <- trial_data %>%
summarise(
n_patients = n(),
# PSA missing (should be 100% in control, some % in screening)
psa_missing_overall = mean(is.na(psa_baseline)),
psa_missing_screening = mean(is.na(psa_baseline[arm == "Screening"])),
psa_missing_control = mean(is.na(psa_baseline[arm == "Control"])),
# Loss to follow-up
lost_to_fu = mean(lost_to_followup),
.groups = 'drop'
)
# Display
cat("### Missing Data Summary\n")
## ### Missing Data Summary
cat("Total patients:", missing_summary$n_patients, "\n")
## Total patients: 2000
cat("Overall PSA missing:", round(missing_summary$psa_missing_overall*100, 1), "%\n")
## Overall PSA missing: 51 %
cat("PSA missing in screening arm:", round(missing_summary$psa_missing_screening*100, 1), "%\n")
## PSA missing in screening arm: 0 %
cat("PSA missing in control arm:", round(missing_summary$psa_missing_control*100, 1), "%\n")
## PSA missing in control arm: 100 %
cat("Lost to follow-up:", round(missing_summary$lost_to_fu*100, 1), "%\n")
## Lost to follow-up: 30.5 %
This simulation has several limitations that mirror real trial challenges:
Next analytical steps: - Add competing risks analysis - Model interval-censoring (diagnoses only known at biopsy times) - Conduct power calculations based on these results
All code for this analysis is available. To reproduce: 1. Run
01_simulate_trial_data.R to generate data 2. Run this R
Markdown document (03_report.Rmd)
Analysis performed using R version 4.5.2 with key packages: tidyverse (2.0.0), survival (3.8.3).