—library(here)

title: “Prostate Cancer Screening Simulation Analysis” author: “aysa hasanzade, MD” date: “December 14, 2025” output: html_document: toc: true toc_float: true toc_depth: 3 theme: flatly highlight: tango —

1. Introduction

This report analyzes simulated data from a prostate cancer screening trial. The simulation includes: - 2000 patients randomized to screening or control arms - Realistic features: missing data, loss to follow-up, varying cancer aggressiveness - Time-to-event outcomes for cancer diagnosis

2. Load and Prepare Data

# Load the simulated data
trial_data <- read_csv(here("simulated_prostate_screening_trial.csv"))

# Quick overview
cat("Dataset dimensions:", dim(trial_data), "\n")
## Dataset dimensions: 2000 15
cat("Variables:", names(trial_data), "\n")
## Variables: patient_id arm age family_history psa_baseline has_cancer cancer_aggressiveness time_to_diagnosis vital_status time_to_death lost_to_followup psa_missing time_to_event event_type time_to_diagnosis_months

3. Patient Characteristics (Table 1)

# Create a publication-ready Table 1
table1 <- trial_data %>%
  select(arm, age, family_history, has_cancer, vital_status) %>%
  mutate(
    family_history = factor(family_history, 
                           labels = c("No", "Yes")),
    has_cancer = factor(has_cancer,
                       labels = c("No cancer", "Cancer detected")),
    vital_status = factor(vital_status,
                         labels = c("Alive", "Deceased"))
  ) %>%
  tbl_summary(
    by = arm,
    label = list(
      age ~ "Age, years",
      family_history ~ "Family history of prostate cancer",
      has_cancer ~ "Cancer detected during trial",
      vital_status ~ "Vital status at end of follow-up"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(all_continuous() ~ 1, all_categorical() ~ 1)
  ) %>%
  add_p() %>%
  modify_header(label = "**Variable**") %>%
  bold_labels()

# Display the table
table1
Variable Control
N = 1,021
1
Screening
N = 979
1
p-value2
Age, years 64.9 (7.8) 64.8 (7.7) 0.7
Family history of prostate cancer 138.0 (13.5%) 146.0 (14.9%) 0.4
Cancer detected during trial

0.9
    No cancer 946.0 (92.7%) 905.0 (92.4%)
    Cancer detected 75.0 (7.3%) 74.0 (7.6%)
Vital status at end of follow-up

0.3
    Alive 793.0 (77.7%) 740.0 (75.6%)
    Deceased 228.0 (22.3%) 239.0 (24.4%)
1 Mean (SD); n (%)
2 Wilcoxon rank sum test; Pearson’s Chi-squared test

Interpretation: This table shows the baseline characteristics by trial arm. The p-value tests whether there are significant differences between groups (should be non-significant due to randomization).

4. Cancer Detection Results

# Calculate cancer detection rates
cancer_summary <- trial_data %>%
  filter(has_cancer == 1) %>%
  group_by(arm) %>%
  summarise(
    n_patients = n(),
    mean_age = mean(age, na.rm = TRUE),
    mean_time_to_dx = mean(time_to_diagnosis_months, na.rm = TRUE),
    sd_time_to_dx = sd(time_to_diagnosis_months, na.rm = TRUE),
    .groups = 'drop'
  )

# Display as a clean table
cancer_summary %>%
  kbl(caption = "Cancer Detection by Trial Arm") %>%
  kable_classic(full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Cancer Detection by Trial Arm
arm n_patients mean_age mean_time_to_dx sd_time_to_dx
Control 75 63.01333 71.22327 84.43766
Screening 74 65.74324 68.07265 70.70905

5. Survival Analysis: Time to Diagnosis

# Prepare data for survival analysis (only cancer patients)
cancer_patients <- trial_data %>%
  filter(has_cancer == 1) %>%
  mutate(
    # Ensure all have event (diagnosis) = 1
    event = 1,
    # Use time_to_diagnosis_months as survival time
    time = time_to_diagnosis_months
  )

# Create survival object
surv_obj <- Surv(time = cancer_patients$time, event = cancer_patients$event)

# Fit Kaplan-Meier curves by arm
km_fit <- survfit(surv_obj ~ arm, data = cancer_patients)

# Plot with survminer
diagnosis_plot <- ggsurvplot(
  km_fit,
  data = cancer_patients,
  pval = TRUE,                 # Add log-rank p-value
  pval.method = TRUE,         # Add test name
  conf.int = TRUE,            # Add confidence intervals
  risk.table = TRUE,          # Add risk table
  risk.table.height = 0.25,   # Adjust table size
  title = "Time to Prostate Cancer Diagnosis by Screening Arm",
  xlab = "Time (months)",
  ylab = "Proportion Without Diagnosis",
  legend.title = "Trial Arm",
  legend.labs = c("Control", "Screening"),
  palette = c("#E7B800", "#2E9FDF"),  # Custom colors
  ggtheme = theme_minimal()
)

# Display the plot
diagnosis_plot

Key finding: The screening arm shows earlier diagnosis times (shift to the left in the curve), demonstrating the lead time effect of screening.

6. Cox Proportional Hazards Model

# Fit Cox model adjusting for age
cox_model <- coxph(surv_obj ~ arm + age, data = cancer_patients)

# Create regression table
cox_table <- tbl_regression(
  cox_model,
  exponentiate = TRUE,  # Show Hazard Ratios (not log-HRs)
  label = list(
    arm ~ "Trial Arm (Screening vs Control)",
    age ~ "Age (per year increase)"
  )
) %>%
  modify_header(estimate ~ "**Hazard Ratio**") %>%
  bold_labels() %>%
  italicize_levels()

cox_table
Characteristic Hazard Ratio 95% CI p-value
Trial Arm (Screening vs Control)


    Control
    Screening 0.95 0.68, 1.32 0.8
Age (per year increase) 1.03 1.00, 1.05 0.020
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Interpretation: The hazard ratio for the screening arm represents how much faster/slower diagnoses occur compared to control, adjusted for age.

7. Missing Data Assessment

# Summarize missingness patterns
missing_summary <- trial_data %>%
  summarise(
    n_patients = n(),
    # PSA missing (should be 100% in control, some % in screening)
    psa_missing_overall = mean(is.na(psa_baseline)),
    psa_missing_screening = mean(is.na(psa_baseline[arm == "Screening"])),
    psa_missing_control = mean(is.na(psa_baseline[arm == "Control"])),
    # Loss to follow-up
    lost_to_fu = mean(lost_to_followup),
    .groups = 'drop'
  )

# Display
cat("### Missing Data Summary\n")
## ### Missing Data Summary
cat("Total patients:", missing_summary$n_patients, "\n")
## Total patients: 2000
cat("Overall PSA missing:", round(missing_summary$psa_missing_overall*100, 1), "%\n")
## Overall PSA missing: 51 %
cat("PSA missing in screening arm:", round(missing_summary$psa_missing_screening*100, 1), "%\n")
## PSA missing in screening arm: 0 %
cat("PSA missing in control arm:", round(missing_summary$psa_missing_control*100, 1), "%\n")
## PSA missing in control arm: 100 %
cat("Lost to follow-up:", round(missing_summary$lost_to_fu*100, 1), "%\n")
## Lost to follow-up: 30.5 %

8. Limitations and Next Steps

This simulation has several limitations that mirror real trial challenges:

  1. Simplified cancer biology: Real prostate cancer has more complex progression
  2. Missing data mechanisms: Our simulation uses simplified missingness patterns
  3. Competing risks: Death from other causes isn’t fully modeled

Next analytical steps: - Add competing risks analysis - Model interval-censoring (diagnoses only known at biopsy times) - Conduct power calculations based on these results

9. Reproducibility

All code for this analysis is available. To reproduce: 1. Run 01_simulate_trial_data.R to generate data 2. Run this R Markdown document (03_report.Rmd)

Analysis performed using R version 4.5.2 with key packages: tidyverse (2.0.0), survival (3.8.3).