Phase 3 PFS Group-Sequential Simulation — Report

1. Why this design

The user provided a fixed Phase 3 oncology spec (implementation mode): a two-arm trial with 1:1 randomization, exponential PFS with a 60-month median in the placebo arm and HR = 0.74 for treatment, 1200 patients, and group-sequential testing with one interim and one final analysis at information fractions 0.66 and 1.00, using O’Brien-Fleming alpha spending, one-sided α = 0.025, and an 80% power target. The simulation answers four questions: power at the interim, overall power, expected trial duration under a binding efficacy stop, and the calendar-time distribution of the interim and final milestone triggers.

Adaptive sample-size reassessment was raised by the user as a possible follow-up but is explicitly out of scope for this run.

2. Confirmed parameters

Item	Value	Notes
Endpoint	PFS, TTE, exponential	single primary endpoint
Placebo hazard	`log(2) / 60` ≈ 0.01155 / month	median PFS = 60 mo
Treatment hazard	`log(2) / 60 * 0.74` ≈ 0.00855 / month	HR = 0.74
Sample size	1200 patients	1:1 randomization
Trial duration cap	96 months	generous upper bound; final fires on event count, not calendar
Accrual schedule	24/mo for months 0–6, then 42/mo	linear-midpoint approximation of 6→42 ramp
Dropout	exponential, `rate = -log(1 - 0.025) / 12` ≈ 0.00211 / month	2.5% by month 12, both arms
Stratification factors	none
Seed	NULL (auto per replicate)
Interim trigger	233 PFS events	IF = 0.66
Final trigger	353 PFS events	IF = 1.00
Interim z-bound	2.524 (one-sided upper)	from `gsDesign::gsSurv` with `sfLDOF`
Final z-bound	1.992 (one-sided upper)	from `gsDesign::gsSurv` with `sfLDOF`
Replicates	1000

2.5 Boundary computation (gsDesign)

Boundaries and required event counts come from scripts/boundaries.R. The information fractions are pre-specified in the protocol and the event counts deterministically fix them, so this calculation is constant across replicates — it is run once and the literals are hardcoded into actions.R.

library(gsDesign)

gs <- gsSurv(
  k         = 2,
  test.type = 1,
  alpha     = 0.025,
  beta      = 0.20,
  timing    = c(0.66, 1.00),
  sfu       = sfLDOF,
  lambdaC   = log(2) / 60,
  hr        = 0.74,
  ratio     = 1
)

--- Critical z-values (one-sided upper) ---
[1] 2.524189 1.991501

--- Cumulative alpha spent ---
[1] 0.005798279 0.025000000

--- Target events per stage (cumulative) ---
[1] 233 353

gsSurv also reports its own implied N (3120) and study duration (18 months) under its own assumed accrual model. These are not used in the simulation — the simulation uses the user-specified 1200 patients and the accrual schedule in §4. Only the event counts (233 / 353) and z-bounds (2.524 / 1.992) are taken from this calculation.
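As an independent sanity check on the hardcoded literals, the interim bound and an approximate event count can be reproduced in base R. This is a sketch and not part of scripts/boundaries.R. It assumes that sfLDOF with default settings is the standard Lan-DeMets O'Brien-Fleming-type spending function, and it uses Schoenfeld's fixed-design approximation, which is expected to land slightly below the group-sequential target of 353 events. All variable names are illustrative.

```r
alpha     <- 0.025   # one-sided type I error
beta      <- 0.20    # type II error (80% power)
hr        <- 0.74    # assumed hazard ratio
t_interim <- 0.66    # interim information fraction

# Lan-DeMets O'Brien-Fleming-type spending:
# alpha(t) = 2 - 2 * pnorm(z_{alpha/2} / sqrt(t))
alpha_spent_interim <- 2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(t_interim))

# At the first analysis the one-sided bound is simply the normal
# quantile of the alpha spent so far.
z_interim_bound <- qnorm(1 - alpha_spent_interim)

# Schoenfeld approximation for a fixed (single-look) design, 1:1 allocation.
d_fixed <- 4 * (qnorm(1 - alpha) + qnorm(1 - beta))^2 / log(hr)^2

print(c(alpha_spent = alpha_spent_interim,
        z_bound     = z_interim_bound,
        d_fixed     = d_fixed))
```

The final bound (1.992) cannot be recovered this way because it depends on the joint distribution of the interim and final statistics, which is what gsDesign integrates numerically.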

3. Arms

Both arms share the same endpoint structure — a single TTE PFS endpoint generated by rexp — with only the rate parameter differing. Inlining log(2) / 60 and log(2) / 60 * 0.74 at each call site keeps the placebo median and the hazard ratio visible to a reviewer.

Placebo

ep_pfs_placebo <- endpoint(
  name      = "pfs",
  type      = "tte",
  generator = rexp,
  rate      = log(2) / 60
)

placebo <- arm(name = "placebo")
placebo$add_endpoints(ep_pfs_placebo)

Median PFS = 60 months under exponential survival.

Treatment

ep_pfs_treatment <- endpoint(
  name      = "pfs",
  type      = "tte",
  generator = rexp,
  rate      = log(2) / 60 * 0.74
)

treatment <- arm(name = "treatment")
treatment$add_endpoints(ep_pfs_treatment)

HR = 0.74 implies median PFS ≈ 81.1 months under exponential survival. Constant hazard ratio — Cox PH and log-rank are both valid; we use log-rank per spec.
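The median-to-hazard conversions used in both endpoint definitions can be checked with a few lines of base R. This is a minimal sketch under the exponential assumption stated above; the variable names are illustrative and do not appear in the project scripts.

```r
median_placebo <- 60     # months
hr             <- 0.74   # treatment vs placebo

# Exponential survival: hazard = log(2) / median
hazard_placebo   <- log(2) / median_placebo
hazard_treatment <- hazard_placebo * hr

# Invert to get the implied treatment median
median_treatment <- log(2) / hazard_treatment

print(round(c(hazard_placebo, hazard_treatment), 5))
print(round(median_treatment, 1))
```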

4. Trial setup

accrual_rate <- data.frame(
  end_time       = c(6, Inf),
  piecewise_rate = c(24, 42)
)

tr <- trial(
  name         = "phase3_pfs_obf",
  n_patients   = 1200,
  duration     = 96,
  enroller     = StaggeredRecruiter,
  accrual_rate = accrual_rate,
  dropout      = rexp,
  rate         = -log(1 - 0.025) / 12
)
tr$add_arms(sample_ratio = c(1, 1), placebo, treatment)

Sample size 1200 with 1:1 randomization (600 per arm).
Duration 96 months: a generous calendar cap. The final fires on event count (353), so the cap only matters if events accrue more slowly than expected. In the production run the final milestone fires at about 53 months on average (SD 2.0; see §7), so 96 is comfortably above the slowest replicates.
Accrual modeled in two pieces. The spec’s ramp from 6/mo to 42/mo over months 0–6 is approximated as a constant 24/mo (linear midpoint) for that window; thereafter constant at 42/mo. With this schedule full enrollment of 1200 is reached in 6 + (1200 − 144)/42 ≈ 31 months.
Dropout exponential with rate = -log(1 - 0.025) / 12, matching 2.5% dropout by month 12 in both arms (single-landmark exponential default; no Weibull solver needed).
No stratification factors. Randomization is unstratified per spec.
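The accrual and dropout arithmetic in the list above can be reproduced as follows. This is a sketch that assumes the spec's ramp is linear from 6/mo at month 0 to 42/mo at month 6; the helper ramp_block_rates is hypothetical and is not part of the project scripts.

```r
ramp_start  <- 6      # patients/month at month 0
ramp_end    <- 42     # patients/month at month 6
ramp_months <- 6
steady_rate <- 42
n_total     <- 1200

# Midpoint rate of each equal-width block of a linear ramp
ramp_block_rates <- function(n_blocks) {
  mid <- (seq_len(n_blocks) - 0.5) / n_blocks
  ramp_start + (ramp_end - ramp_start) * mid
}

ramp_rate       <- ramp_block_rates(1)          # single-block approximation
enrolled_ramp   <- ramp_rate * ramp_months      # patients enrolled by month 6
enrollment_time <- ramp_months + (n_total - enrolled_ramp) / steady_rate

# Exponential dropout calibrated to a single 12-month landmark
dropout_rate  <- -log(1 - 0.025) / 12
dropout_by_12 <- 1 - exp(-dropout_rate * 12)

print(c(ramp_rate = ramp_rate, enrolled_ramp = enrolled_ramp,
        enrollment_time = enrollment_time, dropout_by_12 = dropout_by_12))
```

Calling ramp_block_rates(2) gives the rates for two 3-month blocks, which preserve the same 144 patients enrolled by month 6.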

5. Milestones

Two milestones, in chronological order. Both are event-driven on PFS so their information fractions are deterministic across replicates.

m_interim <- milestone(
  name   = "interim",
  when   = eventNumber(endpoint = "pfs", n = 233),
  action = action_interim
)

m_final <- milestone(
  name   = "final",
  when   = eventNumber(endpoint = "pfs", n = 353),
  action = action_final
)

Interim triggers when 233 PFS events have accrued (66% of the 353 final target). Calibration-run mean trigger time ≈ 39.1 months (sd 1.5).
Final triggers at 353 PFS events. Calibration-run mean trigger time ≈ 53.4 months (sd 2.0).
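The trigger times can be cross-checked without simulation by solving for the calendar time at which the expected cumulative event count reaches each target. The sketch below is illustrative only and is not part of the project scripts. It assumes deterministic enrollment at the piecewise rates from §4, an even split between arms, and independent exponential event and dropout times; the function name expected_events is hypothetical.

```r
# Assumed inputs, copied from the parameter table
lambda    <- c(placebo = log(2) / 60, treatment = log(2) / 60 * 0.74)
delta     <- -log(1 - 0.025) / 12              # dropout hazard, both arms
seg_start <- c(0, 6)                           # accrual segment start (months)
seg_end   <- c(6, 6 + (1200 - 24 * 6) / 42)    # accrual segment end (months)
seg_rate  <- c(24, 42) / 2                     # patients/month per arm (1:1)

# Expected cumulative PFS events at calendar time t (months)
expected_events <- function(t) {
  total <- 0
  for (lam in lambda) {
    h <- lam + delta                           # total hazard of leaving the risk set
    for (i in seq_along(seg_rate)) {
      a <- min(seg_start[i], t)
      b <- min(seg_end[i], t)
      # Integral over enrollment times s in [a, b] of 1 - exp(-h * (t - s))
      at_risk_integral <- (b - a) - (exp(-h * (t - b)) - exp(-h * (t - a))) / h
      # lam / h: probability that the exit is a PFS event rather than a dropout
      total <- total + seg_rate[i] * (lam / h) * at_risk_integral
    }
  }
  total
}

# Calendar time at which the expected count reaches each milestone target
t_interim_expected <- uniroot(function(t) expected_events(t) - 233, c(1, 96))$root
t_final_expected   <- uniroot(function(t) expected_events(t) - 353, c(1, 96))$root

print(c(interim = t_interim_expected, final = t_final_expected))
```

The roots are the times at which the expected count hits the target, not the mean of the random trigger time, so agreement with the simulated means should be close but not exact.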

The TrialSimulator engine never actually stops a trial early — every replicate runs through both milestones, and the binding efficacy stop is applied post-hoc in §7 from the saved reject_interim flag.

6. Action functions

Both actions follow the same shape: lock data, run a one-sided log-rank against placebo, save the test stat and decision flag. The OBF z-bounds are hardcoded from boundaries.R so no optimizer runs per replicate.

action_interim <- function(trial, ...) {
  data <- trial$get_locked_data(milestone_name = "interim")

  fit <- fitLogrank(
    formula     = Surv(pfs, pfs_event) ~ arm,
    placebo     = "placebo",
    data        = data,
    alternative = "less"
  )

  # OBF interim z-bound = 2.524 (one-sided upper) from scripts/boundaries.R.
  # fitLogrank with alternative = "less" returns z < 0 when treatment
  # hazard is lower, so the rejection rule is z <= -2.524.
  trial$save(value = fit$z, name = "z_interim")
  trial$save(value = fit$p, name = "p_interim")
  trial$save(value = as.integer(fit$z <= -2.524), name = "reject_interim")
}

Trigger: 233 PFS events (IF = 0.66).
Data lock: PFS event/censoring at calendar time of the trigger; both arms populated.
Analysis: one-sided log-rank via fitLogrank with alternative = "less" because the treatment is expected to have lower PFS hazard; this returns z < 0 under the alternative.
Adaptation: none — fixed group-sequential design. No trial$set_duration / $resize / $remove_arms calls.
Saves: z_interim, p_interim, reject_interim (1 if z ≤ −2.524). reject_interim feeds power-at-interim and the expected-duration calculation; z_interim and p_interim let a reviewer cross-check the decision flag against the test statistic.

action_final <- function(trial, ...) {
  data <- trial$get_locked_data(milestone_name = "final")

  fit <- fitLogrank(
    formula     = Surv(pfs, pfs_event) ~ arm,
    placebo     = "placebo",
    data        = data,
    alternative = "less"
  )

  # OBF final z-bound = 1.992 (one-sided upper) from scripts/boundaries.R.
  trial$save(value = fit$z, name = "z_final")
  trial$save(value = fit$p, name = "p_final")
  trial$save(value = as.integer(fit$z <= -1.992), name = "reject_final")
}

Trigger: 353 PFS events (IF = 1.00).
Data lock: all PFS events through the final lock; same arms.
Analysis: identical log-rank as the interim, on the larger locked sample.
Adaptation: none.
Saves: z_final, p_final, reject_final (1 if z ≤ −1.992). reject_final combines with reject_interim to give overall power.
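For reviewers who prefer the p-value scale, the two hardcoded z-bounds translate into nominal one-sided significance levels as sketched below. The snippet is base R only and does not call fitLogrank; it takes the sign convention described above (z < 0 favors treatment) as given, and the helper name reject is hypothetical.

```r
z_bounds <- c(interim = 2.524, final = 1.992)

# Nominal one-sided p-value thresholds implied by the bounds
nominal_p <- pnorm(-z_bounds)
print(round(nominal_p, 4))

# Decision rule used in both actions: reject when z <= -bound
reject <- function(z, bound) as.integer(z <= -bound)

# Three illustrative z statistics checked against the interim bound
print(reject(c(-2.60, -2.10, 0.30), z_bounds[["interim"]]))
```

The second value in the example (z = -2.10) would cross the final bound but not the interim bound, which is the usual O'Brien-Fleming pattern of a strict early threshold.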

7. Operating characteristics

All values from 1000 replicates of the production run (scripts/main.R).

Q1. Power at the interim

“What is the probability of crossing the OBF interim bound under the assumed HR = 0.74?”

mean(out$reject_interim)
# 0.415

gsDesign predicted ~0.407 for marginal interim crossing under HR = 0.74; the simulation’s 0.415 differs by 0.008, which is within one MCSE (≈ √(0.415·0.585/1000) ≈ 0.016).

Q2. Overall power

“What is the probability of rejecting H₀ at either the interim or the final?”

mean(out$reject_interim | out$reject_final)
# 0.805

Matches the 80% design target. MCSE ≈ √(0.805·0.195/1000) ≈ 0.013, so the 95% confidence band is roughly 0.78–0.83 — comfortably consistent with 0.80.
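The Monte Carlo standard errors quoted for Q1 and Q2 follow from the binomial variance of a proportion. A minimal sketch, with the power estimates entered as literals rather than read from out, and with mcse as an illustrative helper name:

```r
# Monte Carlo standard error of an estimated proportion
mcse <- function(p, n) sqrt(p * (1 - p) / n)

n_rep         <- 1000
power_interim <- 0.415
power_overall <- 0.805

mcse_interim <- mcse(power_interim, n_rep)
mcse_overall <- mcse(power_overall, n_rep)

# Normal-approximation 95% interval for overall power
ci_overall <- power_overall + c(-1, 1) * qnorm(0.975) * mcse_overall

print(round(c(mcse_interim = mcse_interim, mcse_overall = mcse_overall), 4))
print(round(ci_overall, 3))
```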

Q3. Expected trial duration

“How long does the trial run on average, accounting for binding efficacy stop at the interim?”

stop_time <- ifelse(out$reject_interim == 1,
                    out[["milestone_time_<interim>"]],
                    out[["milestone_time_<final>"]])
mean(stop_time)
# 47.46 months

Interpretation: with binding efficacy stopping, ~41.5% of replicates end at the interim trigger (mean 39.1 mo) and the rest at the final (mean 53.4 mo); the weighted average is ~47.5 months. Without the binding stop (every trial runs to final), expected duration would be the unconditional mean final time, 53.4 mo.
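The weighted average in the interpretation can be reproduced from the summary statistics alone. This sketch treats the interim decision as independent of the milestone trigger times, which is an approximation; the per-replicate calculation above does not rely on it. Variable names are illustrative.

```r
p_stop_interim <- 0.415   # share of replicates rejecting at the interim
mean_interim   <- 39.11   # mean interim trigger time (months)
mean_final     <- 53.38   # mean final trigger time (months)

# Expected duration with a binding efficacy stop
expected_duration <- p_stop_interim * mean_interim +
  (1 - p_stop_interim) * mean_final

# Average calendar time saved relative to always running to the final
months_saved <- mean_final - expected_duration

print(round(c(expected_duration = expected_duration,
              months_saved = months_saved), 2))
```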

Q4. Milestone trigger time distribution

summarizeMilestoneTime(out)

Milestone	Mean (mo)	Median (mo)	SD	n
interim	39.11	39.09	1.49	1000
final	53.38	53.31	2.02	1000

[Figure: Milestone trigger times]

The SDs are small relative to the means — accrual is large enough that event-count timing is concentrated. The 96-month duration cap is far above any observed final-trigger time, so no replicate is censored by the calendar limit.

8. Caveats and limitations

Accrual ramp approximation. The spec’s “ramp from 6/mo to 42/mo by month 6” was modeled as a constant 24/mo for months 0–6 (linear midpoint of the ramp). A finer piecewise approximation (e.g., two 3-month blocks at the block-midpoint rates of 15/mo and 33/mo, which enroll the same 144 patients by month 6) is straightforward if a sponsor reviewer wants it; the impact on event-time distributions is small because most events accrue after enrollment completes.
Binding efficacy interpretation. The expected-duration calculation assumes a binding interim efficacy stop. If the protocol treats the interim as non-binding, the relevant duration metric is mean(out[["milestone_time_<final>"]]) ≈ 53.4 months instead.
No covariate adjustment, no stratification. As specified. Cox-adjusted or stratified log-rank are easy substitutions if needed.
Adaptive SSR not implemented. The user noted this as a possible follow-up; this run is a fixed group-sequential design only.
MCSE on power. At 1000 replicates, overall power MCSE ≈ 0.013. If tighter precision is required for regulatory purposes (e.g., an MCSE of 0.005 on 0.80), increase to ~6400 replicates (0.80·0.20/0.005² = 6400).
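The replicate count needed for a given precision can be worked out by inverting the MCSE formula. The sketch below is illustrative and uses hypothetical helper names. It distinguishes a target on the MCSE itself from a target on the half-width of a 95% interval, since the two readings of "±0.005" lead to very different run sizes.

```r
# Replicates needed so that the MCSE of a proportion p equals target_mcse
replicates_for_mcse <- function(p, target_mcse) {
  p * (1 - p) / target_mcse^2
}

# Replicates needed so that a normal-approximation interval has the
# requested half-width at the given confidence level
replicates_for_halfwidth <- function(p, half_width, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  z^2 * p * (1 - p) / half_width^2
}

n_mcse      <- replicates_for_mcse(0.80, 0.005)
n_halfwidth <- replicates_for_halfwidth(0.80, 0.005)

print(round(c(n_mcse = n_mcse, n_halfwidth = n_halfwidth)))
```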