HIMALAYA (NCT03298451) — Simulation Report

Source: AstraZeneca SAP D419CC00002, Edition 4.0, 30-JUL-2021.


0. Resource Utilization

| Metric | Value |
|---|---|
| Input tokens (uncached) | 113 |
| Output tokens | 110,840 |
| Cache write tokens | 216,055 |
| Cache read tokens | 7,081,951 |
| Estimated cost (USD) | $4.60 |
| Model | claude-sonnet-4-6 |
| Session period (UTC) | 2026-05-11 03:29 – 04:15+ |
| TrialSimulator version | 1.18.1 (GitHub HEAD) |
| R version | 4.5.2 (2025-10-31) |
| rpact version | 4.4.0 |

0.5 Output Files and Reproduction

File tree:

runs/himalaya/
├── scripts/
│   ├── boundaries.R     3.4K   rpact OBF boundary derivation + NPH post-delay HR (run once)
│   ├── actions.R        6.2K   action_ia2, action_fa — H1 → H2(NI) → H3(sup) cascade
│   └── main.R          12.0K   endpoints, arms, trial, milestones, controller;
│                               three scenarios + OC summaries
├── output_planning.rds  46K    raw controller$get_output(), planning alternative, n=1000
├── output_null.rds      49K    raw controller$get_output(), global null, n=1000
├── output_nph_sens.rds 137K    list of three raw outputs, NPH sensitivity, n=1000 each
├── oc_summary.rds      138K    OC list saved by main.R for report
├── milestone_times.png  17K    milestone-time distribution, planning alternative
├── report.md            this document
└── report.html          rendered via markdown::mark_html

Reproduction:

cd runs/himalaya
Rscript scripts/boundaries.R          # once, to confirm boundary literals and NPH derivation
Rscript scripts/main.R                # regenerates all output_*.rds, oc_summary.rds, milestone_times.png
Rscript -e 'markdown::mark_html("report.md", output = "report.html")'

1. Design Rationale

HIMALAYA is a three-arm, Phase 3 randomized controlled trial in first-line unresectable advanced hepatocellular carcinoma. The primary question is whether the STRIDE regimen (single priming dose of tremelimumab followed by durvalumab, Arm C) is superior in overall survival to sorafenib (Arm D). A secondary question, gated behind the primary, is whether durvalumab monotherapy (Arm A) is non-inferior and potentially superior to sorafenib.

The simulation objective is to verify three classes of operating characteristics: (1) empirical power for H1, H2, and H3 under the sponsor’s planning alternative; (2) familywise error rate under the global null; and (3) sensitivity of H1 power to the specific parameterization of the 2-month treatment-effect delay. Calendar timing of the two looks is reported as a secondary check.

The planning alternative for Arm C (STRIDE) specifies a non-proportional hazards model: an average HR of 0.70 versus sorafenib with a 2-month delay in separation. Translation of this clinical specification to a piecewise-hazard model is derived analytically (see §2 and §2.5). Arm A (durvalumab monotherapy) follows proportional hazards at HR = 0.84. The unstratified analysis specified in the prompt is implemented; the SAP uses stratified analysis, so power estimates here are expected to be slightly conservative relative to the protocol.


2. Design Parameters

| Parameter | Value | Source / Notes |
|---|---|---|
| Trial arms | A (durva mono), C (STRIDE), D (sorafenib) | protocol |
| Allocation | 1 : 1 : 1 | protocol |
| N | 1,324 (≈441 per arm) | protocol (curator-supplied; redacted in SAP) |
| Accrual | Piecewise: 25/mo (0–4 mo), 60/mo (4–10 mo), 72/mo (10+ mo); capped at N=1,324 at ~month 22 | assumed: standard ramp; actual schedule not provided. User confirmed this assumption. |
| Dropout | None | protocol: “censoring only at administrative cutoffs” |
| Arm D OS | Exponential, rate = log(2)/11.5 = 0.060274/mo (median 11.5 mo) | protocol |
| Arm A OS | Exponential, rate = 0.84 × log(2)/11.5 = 0.050630/mo (HR = 0.84 vs D, PH) | protocol |
| Arm C OS, delay period | Hazard = log(2)/11.5 = 0.060274/mo for t ∈ [0, 2) (HR = 1.0) | derived; 2-month delay per SAP wording |
| Arm C OS, post-delay | Hazard = 0.6616 × log(2)/11.5 = 0.039875/mo for t ≥ 2 (HR = 0.6616 vs D) | derived; see §2.5; chosen so event-weighted average HR under Arm D exponential = 0.70 |
| Arm C OS generator | PiecewiseConstantExponentialRNG, risk = data.frame(end_time=c(2, 1e6), piecewise_risk=c(0.060274, 0.039875)) | derived |
| FWER budget | 0.049 two-sided (H1/H2/H3 chain); 0.001 two-sided reserved for ORR-IA1 (out of scope) | protocol |
| Alpha spending | Lan-DeMets approximation of O’Brien-Fleming (asOF) | protocol |
| H1 (C vs D superiority) IF | 404/515 = 0.7845 (IA2), 1.000 (FA); based on C+D event counts | protocol |
| H2/H3 (A vs D NI / superiority) IF | 453/560 = 0.8089 (IA2), 1.000 (FA); based on A+D event counts | protocol |
| H1 critical z at IA2 | 2.286840 (two-sided local α = 0.02221, SAP cross-check: 0.0222 ✓) | derived from boundaries.R (rpact) |
| H1 critical z at FA | 2.028910 (two-sided local α = 0.04247, SAP cross-check: 0.0425 ✓) | derived from boundaries.R (rpact) |
| H2/H3 critical z at IA2 | 2.244733 (two-sided local α = 0.02479, SAP cross-check: 0.0248 ✓) | derived from boundaries.R (rpact) |
| H2/H3 critical z at FA | 2.035601 (two-sided local α = 0.04179, SAP cross-check: 0.0418 ✓) | derived from boundaries.R (rpact) |
| H2 NI margin | HR = 1.08; reject NI null if upper CI bound = log_HR + z_crit × SE < log(1.08) = 0.07696 | protocol |
| IA2 trigger | 404 OS events in arms C+D | protocol |
| FA trigger | 515 OS events in arms C+D | protocol |
| Analysis | Unstratified log-rank (H1, H3); Cox PH for H2 NI CI | protocol (pilot: unstratified) |
| Replicates | 1,000 per scenario | protocol |
| Seed | NULL (auto per replicate, recorded in output) | software default |
| Trial duration ceiling | 60 months (generous; event triggers end the analysis window) | software default |

2.5 Decision Boundary Derivation

Arm C NPH post-delay HR

The SAP specifies an average HR of 0.70 for STRIDE versus sorafenib, with a 2-month delay in treatment-effect separation. The simulation implements this as a two-piece piecewise-constant hazard:

HR(t) = 1.0    for t ∈ [0, 2)
HR(t) = h      for t ≥ 2

The “average” is interpreted as the event-time-weighted average under Arm D’s exponential marginal distribution:

average HR = ∫₀^∞ HR(t) · λ_D · exp(−λ_D · t) dt
           = (1 − exp(−λ_D · 2)) · 1.0 + exp(−λ_D · 2) · h

Solving for h with λ_D = log(2)/11.5 = 0.060274 and average HR = 0.70:

S_D(2) = exp(−0.060274 × 2) = 0.886435
h = (0.70 − (1 − 0.886435)) / 0.886435 = 0.586435 / 0.886435 = 0.661566

The post-delay hazard for Arm C is therefore 0.6616 × log(2)/11.5 = 0.039875/month.
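The derivation above can be checked numerically. The following is a minimal base-R sketch; the helper name post_delay_hr is hypothetical and is not claimed to appear in scripts/boundaries.R. It solves the averaging identity for h at a given delay and evaluates the 2-month base case together with the 1- and 3-month variants used in §7.6.

```r
# Hypothetical helper (illustration only): post-delay HR h such that the
# Arm-D event-time-weighted average HR equals avg_hr for a given delay.
post_delay_hr <- function(delay, avg_hr = 0.70, median_d = 11.5) {
  lambda_d <- log(2) / median_d        # Arm D exponential hazard per month
  s_delay  <- exp(-lambda_d * delay)   # S_D(delay): Arm D survival at end of delay
  (avg_hr - (1 - s_delay)) / s_delay   # solve (1 - S) * 1 + S * h = avg_hr for h
}

delays   <- c(1, 2, 3)
h_values <- post_delay_hr(delays)
hazards  <- h_values * log(2) / 11.5   # Arm C post-delay hazard per month

print(data.frame(delay = delays,
                 h = round(h_values, 4),
                 post_delay_hazard = round(hazards, 6)))
```

Because h is an affine function of the target average HR, the same helper can be reused for other planning values by changing avg_hr.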

OBF boundary derivation (rpact, run from scripts/boundaries.R)

library(rpact)

design_h1 <- getDesignGroupSequential(
  kMax             = 2,
  informationRates = c(404/515, 1),
  alpha            = 0.049,
  sided            = 2,
  typeOfDesign     = "asOF"
)

design_h2 <- getDesignGroupSequential(
  kMax             = 2,
  informationRates = c(453/560, 1),
  alpha            = 0.049,
  sided            = 2,
  typeOfDesign     = "asOF"
)

Output (key fields):

H1 design:
  Critical values:               2.287, 2.029
  Cumulative alpha spending:     0.02221, 0.04900
  Stage levels (one-sided):      0.01110, 0.02123

H2/H3 design:
  Critical values:               2.245, 2.036
  Cumulative alpha spending:     0.02479, 0.04900
  Stage levels (one-sided):      0.01239, 0.02090

All four stage-level cross-checks match the SAP-stated two-sided local alpha values to four decimal places (0.0222, 0.0425, 0.0248, 0.0418).
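As a cross-check that does not need rpact, the sketch below recomputes the two-sided local alpha implied by each critical value, and rebuilds the IA2 boundaries from the Lan-DeMets O'Brien-Fleming-type spending function. It assumes the two-sided design spends 0.049/2 per side with the usual one-sided form α*(t) = 2(1 − Φ(z_{α/2}/√t)); the function name obf_spend_two_sided is hypothetical. Only the first-look boundary follows directly from the alpha spent; the FA boundary requires the recursive integration that rpact performs.

```r
# Critical values as reported in the rpact output above
z_crit <- c(h1_ia2 = 2.286840, h1_fa = 2.028910,
            h2_ia2 = 2.244733, h2_fa = 2.035601)

# Two-sided nominal (local) alpha implied by each critical value
local_alpha <- 2 * pnorm(-z_crit)
print(round(local_alpha, 4))

# Lan-DeMets OBF-type spending, applied per side and doubled (assumption:
# this mirrors the two-sided "asOF" design used in the report)
obf_spend_two_sided <- function(t, alpha_two_sided = 0.049) {
  a_side <- alpha_two_sided / 2
  2 * (2 * (1 - pnorm(qnorm(1 - a_side / 2) / sqrt(t))))
}

# Alpha spent at IA2 and the first-look boundary it implies
alpha_ia2 <- c(h1 = obf_spend_two_sided(404 / 515),
               h2 = obf_spend_two_sided(453 / 560))
z_ia2 <- qnorm(1 - alpha_ia2 / 2)
print(round(rbind(alpha_ia2, z_ia2), 4))
```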


3. Treatment Arms and Endpoints

Arm D — sorafenib (control)

ep_os_d <- endpoint(
  name      = "os",
  type      = "tte",
  generator = rexp,
  rate      = log(2) / 11.5
)
arm_d <- arm(name = "arm_d")
arm_d$add_endpoints(ep_os_d)

OS follows an exponential distribution with hazard rate log(2)/11.5 = 0.060274/month, corresponding to a median of 11.5 months. The exponential model is consistent with the SAP assumption.

Arm A — durvalumab monotherapy

ep_os_a <- endpoint(
  name      = "os",
  type      = "tte",
  generator = rexp,
  rate      = 0.84 * log(2) / 11.5
)
arm_a <- arm(name = "arm_a")
arm_a$add_endpoints(ep_os_a)

OS follows an exponential distribution with hazard rate 0.84 × log(2)/11.5 = 0.050630/month (median ≈ 13.69 months). The proportional hazards assumption (HR = 0.84 vs Arm D, constant over time) is specified in the SAP.

Arm C — STRIDE

risk_c <- data.frame(
  end_time       = c(2, 1e6),
  piecewise_risk = c(log(2) / 11.5, 0.661566 * log(2) / 11.5)
)
ep_os_c <- endpoint(
  name          = "os",
  type          = "tte",
  generator     = PiecewiseConstantExponentialRNG,
  risk          = risk_c,
  endpoint_name = "os"
)
arm_c <- arm(name = "arm_c")
arm_c$add_endpoints(ep_os_c)

OS follows a piecewise-constant hazard. For the first 2 months the hazard is identical to that of sorafenib (HR = 1.0); from month 2 onward it is 0.6616 times the sorafenib hazard (HR = 0.6616). This parameterization is derived so that the event-time-weighted average HR under sorafenib’s survival equals 0.70 (see §2.5). Neither the Gumbel-copula nor the illness-death generator was used, because the SAP does not specify a correlation structure for OS across arms; the three arms are generated independently. PiecewiseConstantExponentialRNG handles the os_event indicator automatically.
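To make the Arm C distribution concrete, here is a base-R sketch of piecewise-exponential sampling by inverting the cumulative hazard. It is an illustration only: rpwexp_sketch is a hypothetical name, and this is not claimed to be how PiecewiseConstantExponentialRNG is implemented. The check compares empirical survival at 2 and 12 months with the closed-form values.

```r
# Sample from a piecewise-constant hazard by drawing a unit-exponential
# cumulative hazard and mapping it back to the time axis.
rpwexp_sketch <- function(n, end_time, piecewise_risk) {
  start   <- c(0, head(end_time, -1))                          # left edge of each piece
  cum_haz <- c(0, cumsum(piecewise_risk * (end_time - start)))  # cumulative hazard at knots
  target  <- rexp(n)                                           # H(T) ~ Exp(1)
  piece   <- findInterval(target, cum_haz)                     # piece containing the target
  piece   <- pmin(piece, length(piecewise_risk))               # guard the far right tail
  start[piece] + (target - cum_haz[piece]) / piecewise_risk[piece]
}

set.seed(1)
lambda_d <- log(2) / 11.5
os_c <- rpwexp_sketch(1e5,
                      end_time       = c(2, 1e6),
                      piecewise_risk = c(lambda_d, 0.661566 * lambda_d))

empirical   <- c(S2 = mean(os_c > 2), S12 = mean(os_c > 12))
theoretical <- c(S2  = exp(-2 * lambda_d),
                 S12 = exp(-2 * lambda_d - 10 * 0.661566 * lambda_d))
print(round(rbind(empirical, theoretical), 4))
```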


4. Trial Configuration

accrual_rate <- data.frame(
  end_time       = c(4, 10, Inf),
  piecewise_rate = c(25, 60, 72)
)

tr <- trial(
  name         = label,
  n_patients   = 1324,
  duration     = 60,
  enroller     = StaggeredRecruiter,
  accrual_rate = accrual_rate,
  dropout      = NULL,
  silent       = silent
)
tr$add_arms(sample_ratio = c(1, 1, 1), arm_a, arm_c, arm_d)

Sample size and duration. The trial enrolls a maximum of 1,324 patients in a 1:1:1 allocation across arms A, C, and D (~441 per arm). The duration ceiling is 60 months; in practice enrollment is complete by approximately month 22 and analysis milestones are triggered by event counts well before month 60.

Accrual. A three-phase ramp is assumed: 25 patients/month for the first 4 months (ramp-up phase), 60/month for months 4–10 (build phase), and 72/month thereafter until N = 1,324 is reached (plateau phase). Enrollment completes at approximately month 22 under this schedule. The specific piecewise rates were not available from the SAP; the ramp-up shape is assumed based on common practice and confirmed by the user before simulation. The accrual schedule is a material sensitivity; see §7 (calendar timing discrepancy) and §8.
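The month-22 completion time follows directly from the assumed rates. A small base-R check, using a hypothetical helper enrollment_complete that walks through the piecewise schedule:

```r
# Piecewise accrual assumed in this report (patients per month, all arms)
accrual_rate <- data.frame(
  end_time       = c(4, 10, Inf),
  piecewise_rate = c(25, 60, 72)
)

# Hypothetical helper: calendar month at which cumulative enrollment reaches n
enrollment_complete <- function(n, accrual) {
  start    <- 0
  enrolled <- 0
  for (i in seq_len(nrow(accrual))) {
    capacity <- accrual$piecewise_rate[i] * (accrual$end_time[i] - start)
    if (enrolled + capacity >= n) {
      return(start + (n - enrolled) / accrual$piecewise_rate[i])
    }
    enrolled <- enrolled + capacity
    start    <- accrual$end_time[i]
  }
  NA_real_  # target not reachable under the supplied schedule
}

print(enrollment_complete(1324, accrual_rate))
```

The first two phases enroll 100 and 360 patients; the remaining 864 take 12 months at 72/month.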

Dropout. The SAP specifies no dropout in the simulation; dropout = NULL implements this directly.

Stratification. The SAP stratifies by etiology (HBV/HCV/other), ECOG performance status (0/1), and macrovascular invasion (yes/no). This pilot simulation uses unstratified analysis as specified in the prompt; power estimates are slightly conservative relative to the protocol’s stratified analysis.


5. Milestones

IA2 — Interim Analysis 2

m_ia2 <- milestone(
  name     = "ia2",
  when     = eventNumber(endpoint = "os", n = 404, arms = c("arm_c", "arm_d")),
  action   = action_ia2,
  h1_z_ia2 = h1_z_ia2,
  h2_z_ia2 = h2_z_ia2
)

IA2 is triggered when the 404th OS event occurs in the combined Arm C + Arm D population. The H2/H3 boundaries assume approximately 453 A+D OS events at this trigger (pre-specified IF for H2/H3 = 0.8089); the simulated A+D event count at IA2 is not tabulated in this report. The action function performs the H1 → H2(NI) → H3(sup) hypothesis cascade. The sponsor’s expected calendar time for IA2 is approximately 30 months from first subject randomized (FSR); the simulation observes a mean of 25.2 months (see §7.5).

FA — Final Analysis

m_fa <- milestone(
  name     = "fa",
  when     = eventNumber(endpoint = "os", n = 515, arms = c("arm_c", "arm_d")),
  action   = action_fa,
  h1_z_fa  = h1_z_fa,
  h2_z_fa  = h2_z_fa
)

FA is triggered when the 515th OS event occurs in the Arm C + Arm D population, at which point the SAP plans for approximately 560 A+D events (H2/H3 IF = 1.000). The action function tests any hypotheses in the cascade that were not rejected at IA2. The sponsor’s expected calendar time for FA is approximately 37.5 months; the simulation observes a mean of 30.6 months (see §7.5).


6. Milestone Actions

action_ia2

action_ia2 <- function(trial,
                       h1_z_ia2,
                       h2_z_ia2,
                       ...) {

  data <- trial$get_locked_data(milestone_name = "ia2")

  # H1: C vs D log-rank (one-sided; treatment benefit = lower hazard -> z < 0)
  cd_data <- data[data$arm %in% c("arm_c", "arm_d"), ]
  res_h1  <- fitLogrank(
    formula     = Surv(os, os_event) ~ arm,
    placebo     = "arm_d",
    data        = cd_data,
    alternative = "less"
  )
  h1_z      <- res_h1$z
  h1_reject <- as.integer(h1_z <= -h1_z_ia2)

  h2_ni  <- 0L
  h3_sup <- 0L

  if (h1_reject == 1L) {
    # H2: A vs D NI (margin HR = 1.08); only if H1 rejected
    ad_data  <- data[data$arm %in% c("arm_a", "arm_d"), ]
    res_h2   <- fitCoxph(
      formula     = Surv(os, os_event) ~ arm,
      placebo     = "arm_d",
      data        = ad_data,
      alternative = "less",
      scale       = "log hazard ratio"
    )
    se_h2     <- .cox_se(res_h2)
    upper_ci  <- res_h2$estimate + h2_z_ia2 * se_h2
    h2_ni     <- as.integer(!is.na(upper_ci) & upper_ci < log(1.08))

    if (h2_ni == 1L) {
      # H3: A vs D superiority; only if H2 NI achieved
      res_h3 <- fitLogrank(
        formula     = Surv(os, os_event) ~ arm,
        placebo     = "arm_d",
        data        = ad_data,
        alternative = "less"
      )
      h3_sup <- as.integer(res_h3$z <= -h2_z_ia2)
    }
  }

  trial$save(value = h1_reject, name = "h1_reject_ia2")
  trial$save(value = h2_ni,     name = "h2_ni_ia2")
  trial$save(value = h3_sup,    name = "h3_sup_ia2")
  trial$save(value = h1_z,      name = "h1_z_ia2")

  trial$save_custom_data(value = h1_reject, name = "h1_rejected",  overwrite = TRUE)
  trial$save_custom_data(value = h2_ni,     name = "h2_ni_done",   overwrite = TRUE)
  trial$save_custom_data(value = h3_sup,    name = "h3_sup_done",  overwrite = TRUE)
}

Trigger. IA2 fires when 404 Arm-C + Arm-D OS events accumulate. The locked data covers all enrolled patients at that calendar time, with OS event status up to the lock date.

Data lock. get_locked_data("ia2") returns all three arms. The H1 test subsets to arm_c and arm_d; the H2/H3 tests subset to arm_a and arm_d.

H1 analysis. OS is tested with an unstratified log-rank test via fitLogrank(..., alternative = "less"). The one-sided z-statistic is compared to the negative of the OBF critical value at IA2 (z ≤ −2.2868 to reject). This tests H0: HR(C/D) ≥ 1.

H2 analysis. If H1 is rejected, the NI test for Arm A uses the Cox proportional hazards model via fitCoxph(..., scale = "log hazard ratio"). The upper bound of the one-sided confidence interval is constructed as:

upper_ci = log_HR_estimate + z_crit_ia2 × SE
SE       = log_HR_estimate / z_Wald

Reject H2 NI null (HR(A/D) ≥ 1.08) if upper_ci < log(1.08) = 0.07696. The critical value z_crit_ia2 = 2.2447 corresponds to the OBF spending for H2 at IA2.
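The NI rule can be illustrated with a few numbers. In this sketch ni_reject is a hypothetical helper, and the standard error uses the common approximation SE ≈ 2/√(events) for 1:1 allocation with the planned 453 A+D events; both are assumptions for illustration, not simulation output.

```r
# NI is declared when the upper confidence bound for log HR(A/D)
# lies below log(margin).
ni_reject <- function(log_hr, se, z_crit, margin = 1.08) {
  upper_ci <- log_hr + z_crit * se
  !is.na(upper_ci) & upper_ci < log(margin)
}

# Assumed SE at IA2: 2 / sqrt(d) with d = 453 planned A+D events
se_ia2 <- 2 / sqrt(453)

# Observed HR equal to the planning value, and one closer to the margin
at_planning_hr <- ni_reject(log(0.84), se_ia2, z_crit = 2.244733)
near_margin    <- ni_reject(log(0.95), se_ia2, z_crit = 2.244733)
print(c(at_planning_hr = at_planning_hr, near_margin = near_margin))

# Largest observed HR that still meets NI at IA2 under these assumptions
max_hr_for_ni <- exp(log(1.08) - 2.244733 * se_ia2)
print(round(max_hr_for_ni, 3))
```

Under these assumptions the observed HR must sit well below 1.08 for the upper bound to clear the margin, which is why NI power depends strongly on the true HR of 0.84.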

H3 analysis. If H2 NI is achieved, OS superiority for Arm A is tested with an unstratified log-rank test. Reject H3 if z ≤ −2.2447 (same critical value as H2, since H3 uses the same A+D information fraction).

What gets saved. h1_reject_ia2, h2_ni_ia2, h3_sup_ia2 (binary rejection flags) feed into the cumulative-power OCs. h1_z_ia2 is saved for boundary verification. Custom data (h1_rejected, h2_ni_done, h3_sup_done) carry state forward to the FA action.

action_fa

action_fa <- function(trial,
                      h1_z_fa,
                      h2_z_fa,
                      ...) {

  data <- trial$get_locked_data(milestone_name = "fa")

  h1_rejected <- trial$get("h1_rejected")
  h2_ni_done  <- trial$get("h2_ni_done")
  h3_sup_done <- trial$get("h3_sup_done")
  if (is.null(h1_rejected)) h1_rejected <- 0L
  if (is.null(h2_ni_done))  h2_ni_done  <- 0L
  if (is.null(h3_sup_done)) h3_sup_done <- 0L

  # H1: test at FA only if not yet rejected at IA2
  h1_reject_fa <- 0L
  if (h1_rejected == 0L) {
    cd_data      <- data[data$arm %in% c("arm_c", "arm_d"), ]
    res_h1       <- fitLogrank(
      formula     = Surv(os, os_event) ~ arm,
      placebo     = "arm_d",
      data        = cd_data,
      alternative = "less"
    )
    h1_reject_fa <- as.integer(res_h1$z <= -h1_z_fa)
    trial$save(value = res_h1$z, name = "h1_z_fa")
  } else {
    trial$save(value = NA_real_, name = "h1_z_fa")
  }

  h1_ever <- as.integer(h1_rejected == 1L | h1_reject_fa == 1L)

  # H2: test at FA if H1 ever rejected and H2 NI not yet achieved
  h2_ni_fa <- 0L
  if (h1_ever == 1L && h2_ni_done == 0L) {
    ad_data  <- data[data$arm %in% c("arm_a", "arm_d"), ]
    res_h2   <- fitCoxph(
      formula     = Surv(os, os_event) ~ arm,
      placebo     = "arm_d",
      data        = ad_data,
      alternative = "less",
      scale       = "log hazard ratio"
    )
    se_h2    <- .cox_se(res_h2)
    upper_ci <- res_h2$estimate + h2_z_fa * se_h2
    h2_ni_fa <- as.integer(!is.na(upper_ci) & upper_ci < log(1.08))
  }

  h2_ever <- as.integer(h2_ni_done == 1L | h2_ni_fa == 1L)

  # H3: test at FA if H2 NI ever achieved and H3 not yet done
  h3_sup_fa <- 0L
  if (h2_ever == 1L && h3_sup_done == 0L) {
    ad_data  <- data[data$arm %in% c("arm_a", "arm_d"), ]
    res_h3   <- fitLogrank(
      formula     = Surv(os, os_event) ~ arm,
      placebo     = "arm_d",
      data        = ad_data,
      alternative = "less"
    )
    h3_sup_fa <- as.integer(res_h3$z <= -h2_z_fa)
  }

  trial$save(value = h1_reject_fa, name = "h1_reject_fa")
  trial$save(value = h2_ni_fa,     name = "h2_ni_fa")
  trial$save(value = h3_sup_fa,    name = "h3_sup_fa")
}

Trigger. FA fires when 515 Arm-C + Arm-D OS events accumulate (approximately 560 A+D events are planned at the same calendar time).

Data lock. The locked data include all enrolled patients, with OS censored at the FA lock date for patients without an event.

H1 analysis. H1 is tested at FA (z ≤ −2.0289) only if it was not rejected at IA2. If H1 was already rejected at IA2, h1_z_fa is saved as NA and the C+D data are not refitted — the pre-existing H1 rejection carries forward.

H2 analysis. H2 NI is tested at FA (upper CI bound using z_crit = 2.0356) if H1 has been rejected at either IA2 or FA and H2 NI was not achieved at IA2. The FA boundary for H2 is applied regardless of whether H2 was eligible at IA2: per the SAP’s pre-specified spending schedule (0.0248 at IA2, 0.0418 at FA), the spending clock runs at each look independently of the gating status.

H3 analysis. Follows the same logic: test if H2 NI ever achieved and H3 not yet rejected, using z_crit = 2.0356 for FA.

What gets saved. h1_reject_fa, h2_ni_fa, h3_sup_fa complete the per-replicate cascade record used for OC computation.


7. Operating Characteristics

All OCs below are from the production run (n = 1,000 replicates per scenario). Monte Carlo standard error (MCSE) = √(p̂(1 − p̂)/n).
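The MCSE values reported in the tables of this section can be reproduced from the point estimates alone. A minimal sketch, with an approximate 95% Monte Carlo interval added for orientation (the interval is an addition for illustration and is not reported elsewhere in this document):

```r
# Monte Carlo standard error of a simulated proportion
mcse <- function(p_hat, n) sqrt(p_hat * (1 - p_hat) / n)

# Point estimates as reported in §7.1 to §7.4 (n = 1,000 replicates each)
p_hat <- c(h1_ia2 = 0.777, h1_cum = 0.950, h2_cum = 0.809,
           h3_cum = 0.507, fwer   = 0.018)
se    <- mcse(p_hat, 1000)

# Normal-approximation 95% interval around each estimate
summary_tab <- cbind(estimate = p_hat,
                     mcse     = se,
                     lower    = p_hat - qnorm(0.975) * se,
                     upper    = p_hat + qnorm(0.975) * se)
print(round(summary_tab, 4))
```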

7.1 Planning alternative — H1 power

Research question. What is the probability of declaring OS superiority for STRIDE versus sorafenib at IA2, at FA, and cumulatively, under the sponsor’s planning assumptions?

| Look | Empirical power | MCSE | Sponsor cross-check |
|---|---|---|---|
| IA2 only | 0.777 | 0.0132 | ≥ 0.85 |
| FA only (first rejection at FA, as a share of all replicates) | 0.173 | 0.0120 | |
| Cumulative (IA2 or FA) | 0.950 | 0.0069 | ≥ 0.97 |

H1 power at IA2: mean(out$h1_reject_ia2). H1 cumulative: mean(pmax(out$h1_reject_ia2, out$h1_reject_fa)).
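The FA-only figure of 0.173 adds to the IA2 figure of 0.777 to give the cumulative 0.950, so it is a share of all replicates rather than a rate among replicates that reached FA without a rejection. The sketch below makes the distinction explicit on toy per-replicate flags constructed to match the table counts; these are not the contents of output_planning.rds.

```r
# Toy flags: 777 replicates reject at IA2, 173 first reject at FA, 50 never
out <- data.frame(
  h1_reject_ia2 = rep(c(1L, 0L, 0L), times = c(777, 173, 50)),
  h1_reject_fa  = rep(c(0L, 1L, 0L), times = c(777, 173, 50))
)

power_ia2 <- mean(out$h1_reject_ia2)
power_fa  <- mean(out$h1_reject_fa)                          # share of all replicates
power_cum <- mean(pmax(out$h1_reject_ia2, out$h1_reject_fa))

# Rejection rate at FA among replicates that did not reject at IA2
power_fa_conditional <- mean(out$h1_reject_fa[out$h1_reject_ia2 == 0L])

print(round(c(ia2 = power_ia2, fa = power_fa, cumulative = power_cum,
              fa_given_no_ia2 = power_fa_conditional), 3))
```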

The observed cumulative power (95.0%) is approximately 2 percentage points below the sponsor’s cross-check target of ≥97%. The discrepancy at IA2 (77.7% vs ≥85%, about 7 percentage points) is more pronounced. Probable contributing factors: (a) the accrual schedule assumed here is faster than the actual trial schedule, which advances IA2 in calendar time and shortens follow-up per patient at the trigger, so a larger share of the 404 events falls within Arm C’s 2-month delay period, when HR = 1; (b) unstratified analysis vs. the protocol’s stratified analysis; (c) the specific event-weighted definition of “average HR.” See §8.

7.2 Planning alternative — H2 NI power

Research question. What is the probability of declaring durvalumab monotherapy non-inferior to sorafenib (margin HR = 1.08), given that the cascade tests H2 only after H1 has been rejected?

| Look | Empirical power (cumulative) | MCSE | Sponsor cross-check |
|---|---|---|---|
| Cumulative (IA2 or FA) | 0.809 | 0.0124 | ~0.84 |

H2 NI cumulative: mean(pmax(out$h2_ni_ia2, out$h2_ni_fa)).

The 80.9% observed power is approximately 3 percentage points below the sponsor’s stated ~84%. The same contributing factors as H1 apply.

7.3 Planning alternative — H3 superiority power

Research question. What is the probability of declaring durvalumab monotherapy superior to sorafenib in OS, given that the cascade tests H3 only after H2 NI has been achieved?

| Look | Empirical power (cumulative) | MCSE |
|---|---|---|
| Cumulative (IA2 or FA) | 0.507 | 0.0158 |

H3 cumulative: mean(pmax(out$h3_sup_ia2, out$h3_sup_fa)).

No sponsor cross-check target was stated for H3 power. The 50.7% power is consistent with an HR of 0.84 being a moderate effect that crosses the superiority threshold in roughly half of replicates given the A+D sample and event counts available.

7.4 Familywise error rate — global null

Research question. Under the global null (no treatment benefit for either Arm C or Arm A), what is the probability of any false rejection?

The global null is implemented as HR(C/D) = 1.0 and HR(A/D) = 1.0, i.e., all three arms follow Arm D’s exponential distribution.

| Metric | Value | MCSE | Nominal |
|---|---|---|---|
| FWER (global null, n = 1,000) | 0.018 | 0.0042 | 0.0245 (one-sided) |

fwer = mean(pmax(out_null$h1_reject_ia2, out_null$h1_reject_fa)).

Under this configuration, every type I error requires rejecting H1: the alpha-recycling gating ensures H2 and H3 cannot be rejected without H1 first, so a false H3 rejection can occur only in replicates where H1 has already been falsely rejected, and at HR(A/D) = 1 the H2 NI null (H0: HR ≥ 1.08) is false, so rejecting H2 is a correct decision. The FWER therefore equals the H1 rejection probability, whose nominal one-sided level is 0.049/2 = 0.0245. The observed 1.8% (MCSE 0.42%) lies about 1.5 MCSE below the nominal 2.45%, which is compatible with correct boundary implementation.
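The distance between the observed and nominal rates can be quantified as follows. This is a sketch using the reported figures; the exact binomial test via binom.test is an added check that is not part of the original analysis.

```r
fwer_hat <- 0.018        # observed H1 false-rejection rate under the global null
n_rep    <- 1000         # replicates
nominal  <- 0.049 / 2    # nominal one-sided level for H1

# Distance from nominal in units of the Monte Carlo standard error
mcse_hat <- sqrt(fwer_hat * (1 - fwer_hat) / n_rep)
z_dist   <- (fwer_hat - nominal) / mcse_hat

# Exact two-sided binomial test of 18 rejections in 1,000 replicates
bt <- binom.test(x = round(fwer_hat * n_rep), n = n_rep, p = nominal)

print(round(c(mcse = mcse_hat, z = z_dist, p_value = bt$p.value), 4))
```

With 1,000 replicates the MCSE near a 2.45% rate is close to half a percentage point, so a tighter FWER check would need a larger replicate count.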

7.5 Calendar timing of milestones

Research question. When does IA2 fire, and when does FA fire, relative to first subject randomized?

The summarizeMilestoneTime function is applicable here because no binding early-stop rule is in effect: every replicate runs through both IA2 and FA milestones in the simulation regardless of H1 outcome.

| Milestone | Mean (months) | Median (months) | SD | Sponsor cross-check |
|---|---|---|---|---|
| IA2 | 25.2 | 25.2 | 0.57 | ~30 |
| FA | 30.6 | 30.6 | 0.77 | ~37.5 |

[Figure: milestone_times.png, milestone-time distribution (planning alternative)]

The simulated IA2 fires approximately 4.8 months earlier than the sponsor’s projection (25.2 vs. 30), and FA fires approximately 6.9 months earlier (30.6 vs. 37.5). The assumed accrual schedule (completing enrollment by ~month 22 with a fast plateau phase at 72/mo) is the most likely driver: if the actual trial accrued more slowly, the event milestones would occur later in calendar time. Because the analyses are triggered by event count, the timing discrepancy does not change the number of events analyzed. Under the delayed-effect model for Arm C, however, the accrual schedule still determines how many of those events fall within the 2-month delay period, so the power OCs are not strictly invariant to it (see §7.1 and §8).
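The simulated milestone times can be sanity-checked analytically by integrating the assumed accrual rate against each arm's event-time distribution and solving for the calendar time at which the expected C+D event count reaches the trigger. The sketch below uses only base R; the function names are hypothetical, and the result is the time at which the expected count hits the target, which should be close to, but is not identical to, the mean simulated trigger time.

```r
lambda_d <- log(2) / 11.5      # Arm D hazard per month
lambda_a <- 0.84 * lambda_d    # Arm A hazard per month (PH)
h_post   <- 0.661566           # Arm C post-delay HR (see §2.5)

# Event-time CDFs by time since randomization (0 before enrollment)
cdf_d <- function(x) 1 - exp(-lambda_d * pmax(x, 0))
cdf_a <- function(x) 1 - exp(-lambda_a * pmax(x, 0))
cdf_c <- function(x) {
  x <- pmax(x, 0)
  1 - ifelse(x < 2,
             exp(-lambda_d * x),
             exp(-lambda_d * 2 - h_post * lambda_d * (x - 2)))
}

# Assumed total accrual rate (patients/month); enrollment ends at month 22
accrual <- function(u) ifelse(u < 4, 25, ifelse(u < 10, 60, ifelse(u < 22, 72, 0)))

# Expected events by calendar month t in the arms whose CDFs are supplied,
# with one third of accrual allocated to each arm
expected_events <- function(t, cdfs) {
  integrand <- function(u) {
    accrual(u) / 3 * Reduce(`+`, lapply(cdfs, function(f) f(t - u)))
  }
  t_end <- min(t, 22)
  knots <- sort(unique(c(0, 4, 10, t - 2, t_end)))   # split at rate changes and kink
  knots <- knots[knots >= 0 & knots <= t_end]
  total <- 0
  for (i in seq_len(length(knots) - 1)) {
    total <- total + integrate(integrand, knots[i], knots[i + 1])$value
  }
  total
}

# Calendar month at which the expected event count reaches n_events
time_to_events <- function(n_events, cdfs) {
  uniroot(function(t) expected_events(t, cdfs) - n_events,
          interval = c(1, 60))$root
}

t_ia2     <- time_to_events(404, list(cdf_c, cdf_d))
t_fa      <- time_to_events(515, list(cdf_c, cdf_d))
ad_at_ia2 <- expected_events(t_ia2, list(cdf_a, cdf_d))

print(round(c(t_ia2 = t_ia2, t_fa = t_fa, ad_events_at_ia2 = ad_at_ia2), 1))
```

The last quantity is the expected A+D event count at the IA2 trigger under the simulation's own assumptions; it can be compared against the 453 events implied by the pre-specified H2/H3 information fraction.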

7.6 NPH translation sensitivity — H1 power

Research question. How sensitive is the cumulative H1 power to the assumed 2-month delay in treatment-effect separation, given that the average HR is held fixed at 0.70?

The delay length is varied over {1, 2, 3} months while re-deriving the post-delay HR (h) to maintain the event-weighted average HR at 0.70.

| Delay (months) | Post-delay HR h | Post-delay hazard (Arm C) | H1 cumulative power | MCSE |
|---|---|---|---|---|
| 1 | 0.6814 | 0.041068/mo | 0.971 | 0.0053 |
| 2 | 0.6616 | 0.039875/mo | 0.962 | 0.0060 |
| 3 | 0.6405 | 0.038608/mo | 0.936 | 0.0077 |

A shorter delay lets the treatment benefit begin earlier, so more events occur while the hazards differ; this yields higher power under the log-rank test even though the post-delay HR itself is closer to 1 (0.6814 at a 1-month delay vs. 0.6405 at a 3-month delay). A 3-month delay reduces H1 cumulative power by approximately 2.6 percentage points relative to the 2-month base case. The three translations span a 3.5 percentage point range (93.6% to 97.1%), indicating the H1 conclusion is robust to the delay assumption within this range.

Note: each delay scenario is an independent simulation run (n = 1,000). The delay-2 result (96.2%) differs slightly from the planning-alternative result (95.0%) due to different random seeds; both are consistent within Monte Carlo noise.
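The size of the gap between the two delay-2 runs can be put on a standard-error scale. A short sketch, treating the two runs as independent binomial samples of 1,000 replicates each:

```r
p_planning <- 0.950   # H1 cumulative power, planning-alternative run (§7.1)
p_delay2   <- 0.962   # H1 cumulative power, delay-2 sensitivity run
n_rep      <- 1000

# Standard error of the difference between two independent proportions
se_diff <- sqrt(p_planning * (1 - p_planning) / n_rep +
                p_delay2   * (1 - p_delay2)   / n_rep)
z_diff  <- (p_delay2 - p_planning) / se_diff

print(round(c(difference = p_delay2 - p_planning,
              se_diff    = se_diff,
              z          = z_diff), 4))
```

A difference of about 1.3 standard errors is unremarkable for two independent runs of this size.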


8. Limitations and Assumptions

  1. Accrual schedule assumed, not specified. The actual HIMALAYA accrual schedule was not available. Under the assumed ramp (25/60/72 per month), enrollment completes at ~month 22 and IA2 fires approximately 5 months earlier than the sponsor’s projection (25.2 vs. ~30 months). Power is computed at the correct event counts, not at a fixed calendar time, so this mismatch does not invalidate the power OCs. It does mean that patients have shorter follow-up at each event trigger than they would under slower accrual, so more of the analyzed events fall within Arm C’s delay period; this may modestly understate H1 power relative to the actual trial, consistent with the shortfall against the sponsor’s targets in §7.1.

  2. NPH translation: averaging convention. The post-delay HR (h = 0.6616) was derived using an event-time-weighted average under Arm D’s exponential survival. Other averaging conventions (e.g., calendar-weighted, restricted-mean based, or weighted by the combined event density from both arms) would yield slightly different values of h and marginally different power. The NPH sensitivity in §7.6 brackets the translation uncertainty by varying the delay length rather than the averaging convention; varying the averaging convention is expected to produce a range of similar size, but this was not simulated.

  3. Unstratified analysis. The SAP uses stratified log-rank and stratified Cox by etiology, ECOG, and macrovascular invasion. Unstratified analysis, as implemented here, is slightly less powerful. Power estimates are therefore conservative relative to the protocol’s pre-specified analysis.

  4. H2 NI CI construction. The upper CI bound is derived from the Wald statistic (SE = log_HR_estimate / z_Wald). For very small A+D event counts this approximation can be imprecise; with the planned A+D counts of approximately 453 at IA2 and 560 at FA, the Wald approximation is adequate.

  5. ORR at IA1 out of scope. The IA1 ORR / duration-of-response analysis (α = 0.001, two-sided) is not simulated. The FWER verification covers only the H1/H2/H3 chain (α = 0.049, two-sided).

  6. Simulation never stops early. TrialSimulator runs all milestones (IA2 and FA) in every replicate regardless of whether H1 would trigger an early stop in a real trial. The early-stop probability for H1 is derived post-hoc as the IA2-only rejection rate; expected study duration under an early-stop rule would need to be computed from saved decision flags if required.

  7. H3 power interpretation. H3 superiority is tested only if H2 NI is achieved, which in turn is tested only if H1 is rejected. The reported cumulative H3 power (50.7%) is the unconditional probability across all replicates. Because every H3 rejection implies an H2 rejection, the conditional power given H2 success is 0.507/0.809 ≈ 62.7%.