Heterogeneity of Treatment Effect in Win Statistics

Executive summary

Win-based methods—such as generalized pairwise comparisons (GPC), the win ratio (WR), win odds (WO), and net benefit / “proportion in favor of treatment”—summarize treatment effects by comparing outcomes for pairs of individuals across treatment arms using a pre-specified clinical priority rule.citeturn0search1turn0search0turn8search3turn8search2 These methods are increasingly used for hierarchical composite endpoints where traditional “time-to-first event” analyses may underrepresent clinically important outcomes.citeturn0search0turn0search1turn8search3

For patient-centered CER and PCORI-style decision support, the central HTE question becomes: how do win-based treatment effects vary across baseline covariates and patient profiles? This requires (i) formal definitions of conditional win estimands—especially conditional win probability \(p_W(x)\) and conditional net benefit \(\Delta(x)\)—and (ii) methods that estimate these targets under realistic complexities (censoring, missing data, clustering, recurrent/terminal events, and confounding).citeturn1search0turn1search1turn4search17turn6search0

A key conceptual distinction is between:

  • Marginal win estimands (e.g., the probability that a random treated individual “beats” a random control individual), typically identified in RCTs and interpretable for policy-level decisions; and
  • Conditional win estimands (e.g., \(p_W(x)\), \(\Delta(x)\)) that support individualized or subgroup decision-making but depend on pairing choices and may require stronger modeling assumptions.citeturn3search3turn7search0turn9search0

Importantly, win-probability-style estimands (Mann–Whitney / “probability of superiority”) are not the same as the (generally non-identifiable) “probability an individual benefits,” and paradoxes can arise if one conflates these.citeturn3search3turn3search7

Methodologically, the strongest current toolset for win-based HTE comprises: stratified/subgroup WR with homogeneity testing; probabilistic index models (PIMs) for pairwise regression; semiparametric proportional win-fractions regression; and causal weighting (IPTW) or censoring adjustment (IPCW/CovIPCW) for observational CER and informative censoring.citeturn13search0turn9search0turn3search0turn14search1turn14search0turn2search2 Emerging directions include nearest-neighbor pairing causal frameworks, win-fraction “pseudo-outcomes” to enable off-the-shelf ML HTE tools, and nonparametric MLE approaches for censoring + missingness.citeturn7search0turn5search1turn2search6

Win estimands and formal HTE definitions

Notation and win rule

Let \(A\in\{0,1\}\) denote treatment assignment (1=treatment, 0=control), \(X\) baseline covariates, and \(Y\) the (possibly multicomponent) outcome used for win comparisons.

Win methods define a pairwise comparison function (a “win kernel”) \[ W(Y_i, Y_j)\in\{-1,0,+1\}, \] where \(W(Y_i,Y_j)=+1\) means “\(i\) wins over \(j\),” \(-1\) means “\(i\) loses,” and \(0\) indicates a tie. For hierarchical composites, \(W(\cdot,\cdot)\) is determined by a priority order of components and possibly thresholds that define clinically negligible differences as ties.citeturn0search1turn0search0turn8search3
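A minimal sketch of such a kernel may help fix ideas. The two components (`survival_days`, `qol_score`), their "higher is better" orientation, and the tie thresholds are hypothetical choices made only for illustration, and censoring is deliberately ignored here.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    """Hypothetical two-component outcome; both components are 'higher is better'."""
    survival_days: float  # first priority
    qol_score: float      # second priority


def win_kernel(yi: Outcome, yj: Outcome, thresholds=(30.0, 5.0)) -> int:
    """Return +1 if i wins over j, -1 if i loses, 0 if tied on every component.

    Components are compared in priority order; a difference no larger than
    the component's threshold counts as a tie and defers to the next component.
    """
    for a, b, tol in zip(yi, yj, thresholds):
        diff = a - b
        if diff > tol:
            return 1
        if diff < -tol:
            return -1
        # tied at this priority level: fall through to the next component
    return 0
```

For example, `win_kernel(Outcome(310, 50), Outcome(300, 60))` is decided on the second component, because a 10-day survival difference falls inside the assumed 30-day tie margin.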

Marginal win estimands

A common estimand is the probability that a random treated individual has a more favorable outcome than a random control individual under the win rule: \[ p_W \;=\; \Pr\{W(Y^{(1)},Y^{(0)\,*})=+1\}, \] where \(Y^{(1)}\) is drawn from the marginal distribution under treatment and \(Y^{(0)\,*}\) is an independent draw from the marginal control distribution. The corresponding loss probability is \[ p_L \;=\; \Pr\{W(Y^{(1)},Y^{(0)\,*})=-1\}, \] and \(p_T=1-p_W-p_L\).

From these, standard summaries include:

  • Net benefit / “proportion in favor” difference:
    \[ \Delta \;=\; p_W - p_L. \] citeturn0search1turn8search3turn8search2

  • Win ratio:
    \[ \mathrm{WR} \;=\; \frac{p_W}{p_L}, \] typically defined with conventions about how ties are handled.citeturn0search0turn8search2

  • Win odds (incorporating ties):
    \[ \mathrm{WO} \;=\; \frac{p_W + \tfrac12 p_T}{p_L + \tfrac12 p_T}. \] citeturn2search1turn2search18
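The summaries above can be computed directly from all treated-versus-control pairs. The following is a small sketch on toy numbers (the data and the function names `win_summaries` and `sign_kernel` are illustrative, not taken from any package); it also checks the algebraic identity \(\mathrm{WO}=(1+\Delta)/(1-\Delta)\), which follows from \(p_W+p_L+p_T=1\).

```python
def sign_kernel(a, b):
    """Simple kernel for a single 'higher is better' outcome: +1, 0, or -1."""
    return (a > b) - (a < b)


def win_summaries(treated, control, kernel):
    """All-pairs estimates of p_W, p_L, p_T, net benefit, win ratio, win odds."""
    wins = losses = ties = 0
    for yt in treated:
        for yc in control:
            w = kernel(yt, yc)
            if w > 0:
                wins += 1
            elif w < 0:
                losses += 1
            else:
                ties += 1
    n_pairs = len(treated) * len(control)
    p_w, p_l, p_t = wins / n_pairs, losses / n_pairs, ties / n_pairs
    wo_denominator = p_l + 0.5 * p_t
    return {
        "p_W": p_w,
        "p_L": p_l,
        "p_T": p_t,
        "net_benefit": p_w - p_l,
        # WR and WO are undefined when their denominators are zero
        "win_ratio": p_w / p_l if p_l > 0 else float("inf"),
        "win_odds": (p_w + 0.5 * p_t) / wo_denominator
        if wo_denominator > 0 else float("inf"),
    }


treated = [3, 5, 7, 9]   # toy outcomes, treatment arm
control = [2, 5, 6]      # toy outcomes, control arm
summary = win_summaries(treated, control, sign_kernel)
```

With these toy data there are 8 wins, 3 losses, and 1 tie among 12 pairs, so the win ratio ignores the tie while the win odds split it evenly between the arms.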

Large-sample inference for \(\Delta\) and WR can be developed using multivariate multi-sample \(U\)-statistic theory.citeturn8search10turn8search2

Conditional win estimands and win-CATE analogues

To define HTE, introduce covariate-dependent (conditional) win probabilities. A useful general object is the two-profile conditional win probability \[ p_W(x,x') \;=\; \Pr\{W(Y^{(1)},Y^{(0)\,*})=+1 \mid X=x,\;X^{*}=x'\}, \] where \(Y^{(1)}\) is the treated potential outcome of an individual with profile \(X=x\) and \(Y^{(0)\,*}\) is the control potential outcome of an independent individual with profile \(X^{*}=x'\). PIMs are precisely regression models for such probabilistic-index parameters (probability of superiority) as functions of covariates of both individuals compared.citeturn9search0turn9search14turn1search6

Two specializations are central for HTE reporting:

  1. Same-profile conditional win probability \[ p_W(x) \equiv p_W(x,x), \] which compares two hypothetical draws from the treatment and control outcome distributions at the same baseline profile \(x\).

  2. Conditional net benefit (win difference) \[ \Delta(x) \;=\; p_W(x) - p_L(x), \] with \(p_L(x)=\Pr\{W(\cdot)=-1\mid X=x\}\) defined analogously.

These define win-based analogues of a CATE: rather than contrasting means, risks, or hazards, they contrast probabilities of being better under the win rule. The “effect modification” question becomes whether \(\Delta(x)\) or \(\log\{\mathrm{WR}(x)\}\) varies with \(x\).citeturn9search0turn3search0

Estimand taxonomy and pairing choices

A critical and often under-specified part of win-based HTE is the pairing operator: how treated and control individuals are paired for comparison.

  • All-pairs (complete) pairing: compare every treated with every control. This targets marginal estimands like \(p_W\) (random treated vs random control).citeturn0search1turn8search10turn7search10
  • Matched/stratified pairing: compare within strata or matched sets, changing the weighting of covariate regions and thus the implied estimand.citeturn13search0turn13search1turn7search0
  • Nearest-neighbor pairing: approximate “same-profile” comparisons by pairing treated and control individuals who are close in covariate space; this can target a different causal estimand, closer to an individual-level comparison than the one targeted by the conventional all-pairs estimator, and it can materially change conclusions in heterogeneous populations.citeturn7search0turn7search4

Causal interpretations under randomization and unconfoundedness

In an RCT, randomization yields identification of marginal win estimands because treatment assignment is independent of potential outcomes, enabling \(p_W\) and \(\Delta\) to be interpreted causally as population-level contrasts.citeturn1search1turn16search8turn3search3

In observational CER, causal interpretations require standard assumptions such as consistency/SUTVA, conditional exchangeability (unconfoundedness), and positivity.citeturn16search0turn12search10turn12search1 Under these, one can define causal win estimands that contrast the distributions of potential outcomes under \(A=1\) vs \(A=0\) using the win rule, and then estimate them via weighting or doubly robust methods.citeturn2search2turn15search3turn7search0

A critical caution is that the marginal “probability of superiority” estimand is not equal to the generally non-identifiable causal probability \(\Pr\{Y^{(1)} > Y^{(0)}\}\) for the same individual, and paradoxical reversals can occur if interpreted as such.citeturn3search7turn3search3
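A toy calculation illustrates why the two quantities differ. The sketch below uses two hypothetical joint distributions of \((Y^{(0)},Y^{(1)})\) that share identical marginals; the data and function names are invented for illustration. The marginal win and loss probabilities agree across the two scenarios, while the individual-level benefit probability does not, which is exactly why the latter cannot be recovered from the former.

```python
from fractions import Fraction


def marginal_win_loss(pairs):
    """All-pairs win/loss probabilities, using only the two marginal distributions."""
    y0 = [p[0] for p in pairs]
    y1 = [p[1] for p in pairs]
    n = len(y0) * len(y1)
    p_w = Fraction(sum(a > b for a in y1 for b in y0), n)
    p_l = Fraction(sum(a < b for a in y1 for b in y0), n)
    return p_w, p_l


def individual_benefit(pairs):
    """Pr{Y(1) > Y(0)} for the same individual; requires the unobservable joint."""
    return Fraction(sum(y1 > y0 for y0, y1 in pairs), len(pairs))


# Each tuple is (Y(0), Y(1)) for one hypothetical individual.
joint_a = [(1, 2), (2, 3), (3, 1)]  # two of three individuals benefit
joint_b = [(1, 1), (2, 2), (3, 3)]  # nobody benefits

same_marginals = marginal_win_loss(joint_a) == marginal_win_loss(joint_b)
```

Here both scenarios give \(p_W=p_L=1/3\), yet the individual benefit probability is \(2/3\) in the first and \(0\) in the second.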

flowchart TB
  A["Define clinical priority rule<br/>W(·,·): win/loss/tie"] --> B[Choose estimand class]
  B --> B1[Marginal: random treated vs random control<br/>p_W, p_L, Δ, WR, WO]
  B --> B2["Conditional: profile-specific<br/>p_W(x), Δ(x), WR(x)"]
  B --> B3[Pairing-sensitive causal targets<br/>stratified/matched/NN pairing]
  B1 --> C[Method choices]
  B2 --> C
  B3 --> C
  C --> C1[Design-based: stratified WR, subgroup tests]
  C --> C2[Regression: PIM, PW regression, semiparametric U-statistic]
  C --> C3[Causal: IPTW, IPCW/CovIPCW, DR/TMLE, cross-fitting]
  C --> C4[ML HTE: win-fraction pseudo-outcomes, causal trees/forests]
  C --> D[Inference layer]
  D --> D1[U-statistic asymptotics / sandwich]
  D --> D2[EIF / one-step / TMLE + cross-fitting]
  D --> D3[Bootstrap / permutation]
  D --> D4[Cluster-robust / cluster bootstrap]

Methods for identifying and estimating win-based HTE

This section profiles each method in turn: targeted estimand, assumptions, implementation steps, inference, software, pros/cons, and open gaps.

Stratified WR and subgroup analyses

Targeted estimand. Stratum-specific marginal win estimands \(p_W(s), \Delta(s), \mathrm{WR}(s)\) for strata \(S(X)=s\), and (optionally) a Mantel–Haenszel-type combined estimand across strata.citeturn13search0turn13search1

Assumptions.
- Subgroups/strata are well-defined and not overly sparse.citeturn13search0
- In observational settings, within-stratum exchangeability may still fail unless covariates are adequately controlled.citeturn12search10

Implementation steps.
1) Pre-specify strata and hierarchy. PCORI emphasizes transparent reporting and discourages claims of differential effects based on “significant in one subgroup but not another.”citeturn1search0turn1search4
2) Compute stratum-specific WR/WO/NB.citeturn13search0
3) Combine via stratified weighting (analogous to Mantel–Haenszel) if a single summary is desired.citeturn13search1
4) Test heterogeneity using a Cochran-style homogeneity test adapted in the stratified WR framework.citeturn13search1turn13search0
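Steps 2 to 4 can be sketched on toy data as follows. Two assumptions are made loudly here: the combined estimate uses Mantel–Haenszel-type weights \(1/N_s\) on stratum win and loss counts, and the homogeneity statistic is the generic inverse-variance Cochran \(Q\) on the log-WR scale with variances supplied by the user. Neither is guaranteed to match the exact estimator in the cited stratified WR work, and all names and data are hypothetical.

```python
def win_loss_counts(treated, control):
    """Count wins and losses for a single 'higher is better' outcome."""
    wins = sum(t > c for t in treated for c in control)
    losses = sum(t < c for t in treated for c in control)
    return wins, losses


def stratified_win_ratio(strata):
    """strata: dict mapping stratum name -> (treated outcomes, control outcomes).

    Returns stratum-specific WRs and a combined WR with assumed weights 1/N_s.
    Raises ZeroDivisionError if a stratum has no losses (sparse strata).
    """
    per_stratum, num, den = {}, 0.0, 0.0
    for name, (treated, control) in strata.items():
        wins, losses = win_loss_counts(treated, control)
        per_stratum[name] = wins / losses
        weight = 1.0 / (len(treated) + len(control))  # assumed MH-type weight
        num += weight * wins
        den += weight * losses
    return per_stratum, num / den


def cochran_q(log_wr, variances):
    """Generic Cochran Q for homogeneity of stratum log-WRs (variances supplied)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * x for w, x in zip(weights, log_wr)) / sum(weights)
    q = sum(w * (x - pooled) ** 2 for w, x in zip(weights, log_wr))
    return q, pooled  # compare q with chi-square on (number of strata - 1) df


strata = {
    "low_risk": ([5, 6, 7], [4, 6, 8]),
    "high_risk": ([6, 8, 9, 10], [1, 2, 7]),
}
per_stratum, combined = stratified_win_ratio(strata)
```

In this toy example the stratum WRs are 1 and 11 while the combined value is 47/17, which shows how a single summary can mask marked between-stratum differences.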

Inference. Plug-in variance estimators (built from win-count covariance estimators) and large-sample normal approximations; can be complemented with bootstrap for finite-sample robustness.citeturn13search1turn8search10

Software. Win-statistic packages commonly support stratification workflows; the WR R package, for example, supports stratified analyses for prioritized survival composites.citeturn3search5turn3search1

Pros/cons.
- Pros: highly interpretable for stakeholders; naturally aligned with PCORI HTE reporting expectations.citeturn1search0turn1search4
- Cons: limited to low-dimensional HTE; multiplicity and low power remain; sparse strata can destabilize estimates.citeturn1search0

Open gaps. Multiplicity-aware stratified win HTE procedures and principled integration with high-dimensional adjustment.

Probabilistic index models and pairwise regression

Targeted estimand. A conditional probabilistic index (pairwise win probability) of the form \[ \Pr\{Y \preceq Y^* \mid X, X^*\} \quad \text{or} \quad \Pr\{W(Y,Y^*)=+1\mid X,X^*\}, \] linked to a regression predictor via a link function (often logit/probit).citeturn9search0turn9search14turn1search6

Assumptions.
- Correct specification of the PIM link-linear structure (or use as a working model).citeturn9search0
- Pairwise pseudo-observations create dependence; robust variance or design-restricted comparisons are needed for clustered designs.citeturn5search2turn9search2

Implementation steps.
1) Define an ordering or win rule; define pairwise response (e.g., “wins” indicator).citeturn9search0
2) Construct pairwise covariates (often using contrasts \(X-X^*\)).citeturn9search0
3) Fit PIM with treatment and treatment×covariate interactions to estimate HTE in \(p_W(x,x')\) or in \(p_W(x,x)\).citeturn9search0turn9search22
4) Address computational scaling (\(O(n^2)\) pseudo-observations) using scalable algorithms if needed.citeturn9search3
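As a hedged illustration of steps 2 and 3, one working specification (an assumption for exposition, not a form prescribed by the cited sources) places the tie-adjusted win probability on the logit scale with pairwise covariate contrasts. Write \(Z=(A,\,X,\,AX)\) for a subject's treatment indicator, a scalar baseline covariate, and their interaction; \(\beta_A\), \(\beta_X\), and \(\beta_{AX}\) are hypothetical coefficient names.

\[ \operatorname{logit}\Big[\Pr\{W(Y,Y^*)=+1\mid Z,Z^*\}+\tfrac12\Pr\{W(Y,Y^*)=0\mid Z,Z^*\}\Big] \;=\; \beta_A\,(A-A^*)+\beta_X\,(X-X^*)+\beta_{AX}\,(AX-A^*X^*). \]

For a treated subject compared with a control subject sharing the profile \(x\) (so \(A=1\), \(A^*=0\), \(X=X^*=x\)), the linear predictor reduces to \(\beta_A+\beta_{AX}\,x\). Because \(p_W(x)+\tfrac12 p_T(x)=\{1+\Delta(x)\}/2\), the implied same-profile net benefit is

\[ \Delta(x) \;=\; 2\,\operatorname{expit}(\beta_A+\beta_{AX}\,x)-1, \]

so under this working model heterogeneity in \(\Delta(x)\) corresponds to \(\beta_{AX}\neq 0\).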

Inference. Semiparametric theory provides asymptotic normality and consistent covariance estimators; robust/sandwich methods are typical.citeturn9search0

Software. pim (CRAN) implements PIMs and provides practical guidance.citeturn1search7turn1search3

Pros/cons.
- Pros: continuous covariates + interactions for HTE; direct modeling of conditional win probability.citeturn9search0
- Cons: computational burden and dependence; interpretation requires mapping regression output to patient-facing probabilities.citeturn9search3turn9search0

Open gaps. Robust, scalable PIMs for large pragmatic trials; principled handling of censoring in hierarchical comparisons; and standardized patient-facing summaries of \(\Delta(x)\).

Proportional win-fractions regression

Targeted estimand. A semiparametric regression model for covariate-dependent win fractions whose special cases include the two-sample WR; coefficients can be interpreted as log win-ratio-type effects under a proportionality structure.citeturn3search0turn3search4

Assumptions.
- A “proportional win-fractions” structure (analogous in spirit to proportional hazards) underpinning the regression parameter interpretation.citeturn3search0
- Censoring assumptions relevant to the chosen win estimand; diagnostics are needed when follow-up-time dependence threatens interpretability.citeturn4search17turn4search1

Implementation steps.
1) Specify \(W(\cdot,\cdot)\) for the prioritized endpoint.citeturn3search0
2) Fit the proportional win-fractions regression; include treatment×covariate interactions for HTE.citeturn3search0turn3search4
3) Use model diagnostics proposed for the PW model (score-process-based checks).citeturn3search0
4) Translate fitted parameters into predicted \(p_W(x)\) or \(\Delta(x)\) for representative profiles.citeturn3search0
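Step 4 can be made concrete with a short derivation. The display below assumes that the proportional win-fractions structure takes the multiplicative form shown (a paraphrase; the cited work should be consulted for the exact statement), with a hypothetical covariate vector \(Z=(A,\,X,\,AX)\) and a same-profile tie probability \(p_T(x)\) that must be estimated separately and generally depends on the follow-up horizon.

\[ \frac{\Pr\{i \text{ wins over } j \text{ by time } t\mid Z_i,Z_j\}}{\Pr\{j \text{ wins over } i \text{ by time } t\mid Z_i,Z_j\}} \;=\; \exp\{\beta^\top(Z_i-Z_j)\}. \]

For a treated subject \(i\) and a control subject \(j\) at the same profile \(x\), \(Z_i-Z_j=(1,\,0,\,x)\), so

\[ \log\{\mathrm{WR}(x)\} \;=\; \beta_A+\beta_{AX}\,x. \]

Combining \(p_W(x)/p_L(x)=\mathrm{WR}(x)\) with \(p_W(x)+p_L(x)=1-p_T(x)\) then gives

\[ p_W(x) \;=\; \{1-p_T(x)\}\,\frac{\mathrm{WR}(x)}{1+\mathrm{WR}(x)}, \qquad \Delta(x) \;=\; \{1-p_T(x)\}\,\frac{\mathrm{WR}(x)-1}{\mathrm{WR}(x)+1}. \]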

Inference. Semiparametric \(U\)-process theory; model-based standard errors and diagnostics are provided in the methodological framework.citeturn3search0

Software. WR R package includes PW regression with vignettes for use.citeturn3search5turn3search1turn3search16

Pros/cons.
- Pros: regression-based HTE within a WR-aligned model family; bridges WR and Cox-type thinking.citeturn3search0turn3search4
- Cons: proportionality may fail; interpretability can be follow-up dependent if estimands are not time-restricted.citeturn4search17turn4search1

Open gaps. Extensions to doubly robust / ML nuisance fitting and to complex missingness.

Marginal covariate-adjusted win odds

Targeted estimand. A marginal win-odds estimand (linked to the marginal probabilistic index) with covariate adjustment for precision gains, rather than a conditional PIM coefficient.citeturn4search2turn4search10

Assumptions.
- Correct specification of the adjustment approach (as developed via the connection to the marginal probabilistic index).citeturn4search2
- Type I error may be inflated in small samples (as reported in simulations).citeturn4search2turn4search10

Implementation steps.
1) Define WO (and tie rule).citeturn2search1turn2search4
2) Implement covariate adjustment using the marginal-PI connection.citeturn4search2
3) Produce adjusted WO and CIs; evaluate small-sample performance if sample sizes are modest.citeturn4search2

Inference. Derived within the marginal PI framework; simulation-based operating characteristics reported by the authors.citeturn4search2

Software. Research code availability varies; the method is presented as an arXiv preprint with accompanying theory and simulation examples.citeturn4search2turn4search6

Pros/cons.
- Pros: retains marginal estimand interpretation (useful for CER/policy); can increase power when covariates are prognostic.citeturn4search2
- Cons: methodology is newer; careful attention to small-sample calibration is needed.citeturn4search2

Open gaps. Generalization to hierarchical time-to-event + non-survival mixtures with censoring and missing endpoints.

IPTW-adjusted win ratio

Targeted estimand. A marginal causal win estimand (ATE/ATT-type) depending on the weighting scheme; addresses baseline imbalance/confounding via inverse probability of treatment weighting.citeturn2search2turn14search14

Assumptions.
- Observational CER: conditional exchangeability, positivity, and a sufficiently accurate propensity model.citeturn12search10turn12search6
- Weight stability (diagnostics and truncation often needed).citeturn12search2turn12search10

Implementation steps.
1) Estimate propensity scores \(e(X)=\Pr(A=1\mid X)\).citeturn2search2turn12search10
2) Construct IPTW weights (ATE, stabilized ATE, or ATT).citeturn2search2
3) Compute weighted win counts and derive WR (and/or \(\Delta\), WO).citeturn2search2turn7search5
4) Check balance and overlap; consider truncation.citeturn12search2turn12search10
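Steps 2 and 3 can be sketched as below. The propensity scores are taken as given (in practice they would come from a fitted model), each treated-control pair is weighted by the product of the two subjects' weights, and the outcome is a single "higher is better" value. These are simplifying assumptions, and the function names are hypothetical rather than drawn from the WINS package.

```python
def iptw_weights(treatment, propensity, estimand="ATE"):
    """Unstabilized inverse-probability-of-treatment weights."""
    weights = []
    for a, e in zip(treatment, propensity):
        if estimand == "ATE":
            weights.append(1.0 / e if a == 1 else 1.0 / (1.0 - e))
        elif estimand == "ATT":
            weights.append(1.0 if a == 1 else e / (1.0 - e))
        else:
            raise ValueError("estimand must be 'ATE' or 'ATT'")
    return weights


def weighted_win_stats(outcome, treatment, weights):
    """Weighted win ratio and net benefit over all treated-control pairs."""
    w_win = w_loss = w_total = 0.0
    for yi, ai, wi in zip(outcome, treatment, weights):
        if ai != 1:
            continue
        for yj, aj, wj in zip(outcome, treatment, weights):
            if aj != 0:
                continue
            pair_weight = wi * wj  # assumed pair weight: product of subject weights
            w_total += pair_weight
            if yi > yj:
                w_win += pair_weight
            elif yi < yj:
                w_loss += pair_weight
    return {
        "win_ratio": w_win / w_loss,  # undefined if there are no weighted losses
        "net_benefit": (w_win - w_loss) / w_total,
    }


outcome = [3, 5, 7, 9, 2, 5, 6]
treatment = [1, 1, 1, 1, 0, 0, 0]
flat_ps = [0.5] * len(outcome)  # constant propensity, as in a 1:1 randomized trial
ate_stats = weighted_win_stats(outcome, treatment, iptw_weights(treatment, flat_ps))
```

With a constant propensity score every pair receives the same weight, so the weighted statistics reduce to the unweighted all-pairs values; this is a convenient sanity check before applying estimated scores.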

Inference. As developed in the IPTW-adjusted WR framework with derived variance estimators and simulation assessments.citeturn2search2turn14search17

Software. The WINS R package documents IPTW-adjusted WR and related win-statistic tooling.citeturn7search1turn7search5

Pros/cons.
- Pros: directly aligned with marginal causal estimands for CER; compatible with subgroup HTE by stratification or interaction modeling.citeturn2search2turn1search0
- Cons: sensitive to poor overlap; does not automatically handle censoring/missingness unless combined with IPCW/DR layers.citeturn12search2turn14search1

Open gaps. Unified weighting for simultaneous confounding + informative censoring + missing endpoints in win settings.

IPCW-adjusted WR and CovIPCW for dependent censoring

Targeted estimand. A win estimand corrected for censoring-induced bias (independent censoring via IPCW; dependent censoring via CovIPCW using baseline and/or time-dependent predictors).citeturn14search1turn14search0

Assumptions.
- IPCW validity requires correct modeling of the censoring mechanism (or at least correct weights for the relevant comparison contributions).citeturn14search1turn14search0
- For CovIPCW, dependent censoring is assumed predictable by included covariates/time-dependent covariates.citeturn14search0turn14search8

Implementation steps.
1) Model censoring (e.g., estimate \(\Pr(C\ge t\mid \text{covariates})\)).citeturn14search1
2) Construct IPCW (or CovIPCW) weights and apply to pairwise win contributions.citeturn14search1turn14search8
3) Compute adjusted WR/WO/NB; assess sensitivity to censoring model choices.citeturn14search1turn14search0
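The weighting idea can be sketched for a single time-to-event endpoint. This is a simplified illustration under stated assumptions, not the published estimator: censoring is taken to be independent of outcome within each arm, the censoring distribution is estimated by an arm-specific Kaplan–Meier curve, event and censoring times are assumed not to tie, and each decided pair is weighted by the inverse probability that both subjects remained uncensored long enough for the pair to be decided. All names are hypothetical.

```python
def censoring_survival(times, events):
    """Kaplan-Meier estimate of G(t) = Pr(C > t), treating censoring as the 'event'.

    Returns a function G(t, left_limit=False); left_limit=True gives G(t-).
    """
    censor_times = sorted({t for t, d in zip(times, events) if d == 0})
    steps, surv = [], 1.0
    for c in censor_times:
        at_risk = sum(t >= c for t in times)
        n_cens = sum(t == c and d == 0 for t, d in zip(times, events))
        surv *= 1.0 - n_cens / at_risk
        steps.append((c, surv))

    def G(t, left_limit=False):
        value = 1.0
        for c, s in steps:
            if c < t or (c == t and not left_limit):
                value = s
            else:
                break
        return value

    return G


def ipcw_win_ratio(treated, control):
    """treated/control: lists of (time, event); event=1 observed, 0 censored."""
    g_t = censoring_survival(*zip(*treated))
    g_c = censoring_survival(*zip(*control))
    w_win = w_loss = 0.0
    for ti, di in treated:
        for tj, dj in control:
            if dj == 1 and ti > tj:    # control event first: treated wins
                w_win += 1.0 / (g_t(tj) * g_c(tj, left_limit=True))
            elif di == 1 and tj > ti:  # treated event first: treated loses
                w_loss += 1.0 / (g_t(ti, left_limit=True) * g_c(ti))
            # otherwise the pair is tied or undecided because of censoring
    return w_win / w_loss
```

In a toy data set with one censored subject per arm, late decided pairs receive weights above 1 to stand in for comparable pairs lost to censoring; without any censoring every weight equals 1 and the unweighted win ratio is recovered.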

Inference. Asymptotic variance formulas and simulations are provided for IPCW and extensions for dependent censoring.citeturn14search1turn14search0

Software. The WINS R package supports IPCW-style adjusted win estimation (and documents CovIPCW concepts).citeturn7search1turn14search3

Pros/cons.
- Pros: addresses a major threat to interpretable win estimands: censoring and follow-up-time dependence.citeturn14search1turn4search17
- Cons: reliant on censoring model quality; can be complex with multiple endpoints and time-dependent covariates.citeturn14search0turn14search8

Open gaps. Cross-fitted IPCW for flexible ML censoring models while maintaining valid inference in win settings.

AIPW/TMLE and cross-fitting for win outcomes

Targeted estimand. Generally, a causal \(U\)-statistic estimand defined by a contrast kernel \(W(\cdot,\cdot)\) averaged over the marginal distributions of potential outcomes; this includes Mann–Whitney-type causal effects and is directly relevant to win-based kernels.citeturn15search3turn15search0turn12search1

Assumptions.
- Causal identification in observational settings: unconfoundedness and positivity.citeturn12search1turn12search10
- Double robustness typically requires at least one of the nuisance components (e.g., propensity or outcome/distribution model) to be correctly specified under the method’s assumptions.citeturn15search3turn15search5
- Cross-fitting requires sample-splitting and appropriate rate conditions to avoid overfitting bias.citeturn12search0turn12search12

Implementation steps (blueprint).
1) Specify the win kernel \(W\) and define the causal target (marginal \(\Delta\) vs conditional \(\Delta(x)\)).citeturn15search3turn7search0
2) Estimate nuisance functions: propensity \(e(X)\) and outcome/distributional components as required for the DR score.citeturn12search1turn15search3
3) Construct an orthogonal / EIF-based one-step or AIPW estimator; for flexible ML nuisances, use \(K\)-fold cross-fitting.citeturn12search0turn12search12
4) For TMLE-style targeting, use the TMLE framework (when an EIF is available) to update nuisance fits toward the estimand.citeturn7search3turn12search3

Inference.
- EIF-based variance estimation (sandwich) plus asymptotic normality where established.citeturn15search3turn12search0
- Bootstrap may be used but can be computationally heavy for \(U\)-statistic-like win kernels; this is discussed as a limitation in the DR Mann–Whitney literature.citeturn15search5turn9search3

Software. No dominant “TMLE-for-win-ratio” package is established; building blocks exist in general TMLE and DML ecosystems and win-statistic packages.citeturn7search3turn12search0turn7search2

Pros/cons.
- Pros: strongest route to causal win-based HTE with high-dimensional confounding and ML-based adjustment; principled inference via orthogonal scores.citeturn12search0turn15search3
- Cons: EIF derivations for general hierarchical, censored, mixed-type win rules are complex and remain a key research frontier.citeturn4search17turn2search6

Open gaps. Full EIF/TMLE development for prioritized survival + PRO mixtures, with censoring and missing endpoints, and scalable computation.

Nearest-neighbor pairing causal frameworks

Targeted estimand. A pairing-sensitive causal estimand for hierarchical outcomes that better approximates an individual-level causal notion by using nearest-neighbor pairing; shown to differ from the historical all-pairs WR/NB estimand and to avoid “reversed recommendations” under heterogeneity in constructed examples.citeturn7search0turn7search4

Assumptions.
- Identification assumptions (randomization or unconfoundedness) and adequate covariate support for meaningful nearest-neighbor pairing.citeturn7search0
- Curse-of-dimensionality concerns motivate additional modeling (distributional regression) in observational settings.citeturn7search0

Implementation steps.
1) Define the causal estimand explicitly (all-pairs vs NN-paired).citeturn7search0
2) Form NN pairs between treated and controls in covariate space.citeturn7search0
3) Compute win statistics on the NN-paired sample; extend via propensity weighting and augmented estimation as proposed.citeturn7search0
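The contrast between all-pairs and NN-paired estimands in steps 1 to 3 can be seen on a toy example. The sketch below assumes a single scalar covariate, absolute distance, matching with replacement, and no weighting or augmentation; the data and function names are hypothetical and the code is not the estimator proposed in the cited framework.

```python
def sign(a, b):
    """+1 if a beats b, -1 if a loses, 0 if tied ('higher is better')."""
    return (a > b) - (a < b)


def all_pairs_net_benefit(treated, control):
    """treated/control: lists of (x, y). Average sign over every treated-control pair."""
    total = sum(sign(yt, yc) for _, yt in treated for _, yc in control)
    return total / (len(treated) * len(control))


def nn_paired_net_benefit(treated, control):
    """Pair each treated unit with its nearest control in x (with replacement)."""
    total = 0
    for xt, yt in treated:
        # min() returns the first control among equidistant candidates
        _, yc = min(control, key=lambda c: abs(c[0] - xt))
        total += sign(yt, yc)
    return total / len(treated)


# Toy data: the covariate x is strongly prognostic; treatment adds 1 to y at each x.
treated = [(0.0, 2), (1.0, 11)]
control = [(0.1, 1), (0.9, 10)]
nb_all = all_pairs_net_benefit(treated, control)
nb_nn = nn_paired_net_benefit(treated, control)
```

Here every treated subject beats the control with a similar profile, so the NN-paired net benefit is 1, whereas the all-pairs value is 0.5 because the low-\(x\) treated subject loses to the high-\(x\) control.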

Inference. Provided in the proposed framework with consistency results and DR claims; more work is needed for broad, routine inferential tooling.citeturn7search0

Software. Research-stage; described as straightforward to implement, but not yet standardized.citeturn7search0

Pros/cons.
- Pros: directly addresses the “pairing defines the estimand” issue; offers a path to individual-centric, HTE-aware win estimands.citeturn7search0turn3search7
- Cons: sensitive to high-dimensional covariates; needs robust inference and extensions for censoring/missingness.citeturn7search0turn4search17

Open gaps. Integration with IPCW/missing-data methods and development of inferential theory under complex sampling and clustering.

Win-fraction pseudo-outcomes for ML

Targeted estimand. Individual-level win fraction summaries, used as pseudo-outcomes whose averages correspond to global win probability parameters; these enable regression and mixed models directly on win fractions.citeturn5search1turn5search9

Assumptions.
- The win fraction construction is rank-based and relies on the chosen endpoint ordering and tie handling; inference in clustered designs assumes a workable mixed-model framework for the pseudo-outcomes.citeturn5search1turn5search9

Implementation steps.
1) Convert each endpoint to ranks and compute per-subject win fractions against the opposite arm.citeturn5search9turn5search1
2) Aggregate across endpoints into a single “global win fraction.”citeturn5search9
3) Fit regression or linear mixed models; for HTE, include interactions or use ML regressors on the pseudo-outcome.citeturn5search9turn5search1
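For a single endpoint, the per-subject construction in step 1 can be sketched as follows. Ties are counted as one half (an assumed tie rule), the multi-endpoint aggregation of step 2 is omitted, and the function name is hypothetical. The resulting per-subject values can be passed to any regression or ML routine as an outcome.

```python
def win_fractions(own_arm, other_arm):
    """Per-subject fraction of opposite-arm subjects beaten, counting ties as 1/2."""
    fractions = []
    for y in own_arm:
        score = sum(1.0 if y > z else 0.5 if y == z else 0.0 for z in other_arm)
        fractions.append(score / len(other_arm))
    return fractions


treated = [3, 5, 7, 9]
control = [2, 5, 6]
wf_treated = win_fractions(treated, control)
wf_control = win_fractions(control, treated)

mean_treated = sum(wf_treated) / len(wf_treated)  # estimates p_W + p_T / 2
mean_control = sum(wf_control) / len(wf_control)  # estimates p_L + p_T / 2
net_benefit = mean_treated - mean_control         # equals p_W - p_L
```

The arm means recover the tie-adjusted win and loss probabilities (17/24 and 7/24 for these toy data), so a simple difference in mean pseudo-outcomes reproduces the all-pairs net benefit.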

Inference. Mixed-model-based interval estimation for cluster trials is demonstrated with simulation support for coverage and type I error.citeturn5search1turn5search9

Software. The method is designed to be implementable using standard tools, with R/SAS code provided in the work.citeturn5search9turn5search1

Pros/cons.
- Pros: makes win-based outcomes compatible with standard regression/ML pipelines; attractive for pragmatic cluster trials.citeturn5search9turn5search2
- Cons: pseudo-outcomes may obscure component-level interpretation unless decomposed; theoretical links to causal win-CATE need strengthening.citeturn5search9turn7search0

Open gaps. EIF-based pseudo-outcomes for \(\Delta(x)\) with honest ML inference under pairwise dependence.

Causal forests, meta-learners, and recursive partitioning adapted to win outcomes

Targeted estimand. Win-CATE analogues such as \(\tau(x)=\Delta(x)\), learned flexibly via ML HTE methods once an appropriate outcome or pseudo-outcome is defined. Causal trees/forests and meta-learners provide frameworks for subgroup discovery and individualized effect estimation under unconfoundedness.citeturn11search1turn11search0turn11search2

Assumptions.
- Unconfoundedness and positivity for causal CATE learning in observational data; randomization in RCTs.citeturn11search0turn11search2
- Valid inference requires honest splitting or asymptotic theory for forest estimators.citeturn11search1turn11search0

Implementation steps (win-adaptation pattern).
1) Choose a win-based target (e.g., \(\Delta(x)\)) and define a pseudo-outcome with \(\mathbb{E}[\text{pseudo}\mid X=x]=\Delta(x)\). EIF-based pseudo-outcomes are principled when available.citeturn15search3turn12search0
2) Fit causal forests (or meta-learners) using nuisance models for propensity and outcome components.citeturn11search0turn11search2
3) Use honest sub-sampling / cross-fitting to control overfitting.citeturn11search1turn12search12
4) Validate subgroups with pre-specification and careful reporting per PCORI standards.citeturn1search0turn1search4

Inference.
- Causal forests: asymptotic normality and interval methods are provided in the causal forest theory and implementations.citeturn11search0turn11search3
- Recursive partitioning: uses “honest” trees and built-in testing strategies for heterogeneity.citeturn11search1

Software. grf provides causal forests and related estimators, with support for HTE inference and some censoring/missingness options.citeturn11search3turn11search11turn11search23

Pros/cons.
- Pros: scalable high-dimensional HTE discovery; strong inferential foundation for forests and honest trees.citeturn11search0turn11search1
- Cons: win outcomes are pairwise/structured; defining correct pseudo-outcomes and preserving component interpretability requires methodological care.citeturn15search3turn5search9

Open gaps. A “standard recipe” for win-based causal forests with validated pseudo-outcomes and component-level explainability.

Mixed-effects and hierarchical PIMs

Targeted estimand. Cluster- or period-adjusted win estimands in clustered and stepped-wedge designs, often combining GPC with mixed models for cluster-period summaries or using within-cluster PIMs.citeturn5search2turn9search2

Assumptions.
- Correct handling of time trends and clustering is essential in stepped-wedge designs; failure can inflate type I error.citeturn5search2turn5search10
- Mixed-effects modeling assumptions for random effects; design-specific restrictions (e.g., within-cluster comparisons) may be required.citeturn5search2turn5search9

Implementation steps.
1) For stepped-wedge CRTs, compute cluster-period win odds (or related summaries) and fit hierarchical mixed-effects models; alternatively, apply a cluster-restricted PIM using within-cluster comparisons.citeturn5search2turn9search2
2) Include fixed effects for time/sequence and random effects for clusters (and possibly random slopes).citeturn5search2turn5search10
3) For HTE: include patient- and cluster-level interactions (e.g., treatment×baseline risk, treatment×cluster characteristics).citeturn1search0turn5search2

Inference. Simulation evidence shows that mixed-effects and cluster-restricted PIM approaches can maintain nominal type I error where naive methods fail in stepped-wedge settings.citeturn5search2turn5search6

Software. Standard mixed-model software + PIM tools; the win-fraction mixed-model approach emphasizes implementability in standard packages.citeturn5search9turn5search1

Pros/cons.
- Pros: directly addresses pragmatic trial designs common in PCORI portfolios; supports cluster-level HTE.citeturn5search10turn5search2
- Cons: win-based mixed modeling remains less standardized; methodological guidance is still emerging.citeturn5search2turn5search6

Open gaps. Formal semiparametric theory for hierarchical PIMs; small-sample corrections and robust variance tools for few clusters.

Semiparametric and U-statistic approaches

Targeted estimand. Many win estimators are (multivariate) \(U\)-statistics; semiparametric theory provides large-sample inference for WR and \(\Delta\) and supports stratified and multi-group extensions.citeturn8search10turn8search2turn0search1

Assumptions.
- Regularity conditions for \(U\)-statistic asymptotics; independence assumptions may be violated under clustering or repeated measures without correction.citeturn8search10turn5search2

Implementation steps.
1) Write the estimator as a \(U\)-statistic or sum of pairwise contributions.citeturn8search10turn0search1
2) Use derived asymptotic variance formulas; optionally use resampling (bootstrap/permutation) for calibration.citeturn8search10turn7search10turn2search18
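Both steps can be sketched for the net benefit with a sign kernel. The variance below is the generic first-order (projection-based) approximation for a two-sample \(U\)-statistic, estimated from the sample variances of each subject's average kernel value; it is offered as an assumption-laden illustration and may differ from the exact estimators in the cited work or in BuyseTest.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _sample_variance(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def net_benefit_with_se(treated, control):
    """Net benefit as a two-sample U-statistic with a first-order standard error."""
    def kernel(a, b):
        return (a > b) - (a < b)

    # Average kernel value for each treated subject against all controls,
    # and for each control subject against all treated subjects.
    h_treated = [_mean([kernel(t, c) for c in control]) for t in treated]
    h_control = [_mean([kernel(t, c) for t in treated]) for c in control]

    delta = _mean(h_treated)  # identical to the all-pairs average of the kernel
    variance = (_sample_variance(h_treated) / len(treated)
                + _sample_variance(h_control) / len(control))
    return delta, math.sqrt(variance)


delta_hat, se_hat = net_benefit_with_se([3, 5, 7, 9], [2, 5, 6])
wald_interval = (delta_hat - 1.96 * se_hat, delta_hat + 1.96 * se_hat)
```

With samples this small the Wald interval extends beyond the range \([-1, 1]\), which is one reason resampling or transformed-scale intervals are often preferred for calibration.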

Inference. Large-sample normal approximations for WR and “proportion in favor of treatment” are derived from multivariate multi-sample \(U\)-statistic limits.citeturn8search10turn8search2

Software. BuyseTest explicitly supports inference via asymptotic \(U\)-statistic theory or resampling methods.citeturn7search10turn7search6

Pros/cons.
- Pros: principled inference for classical win estimands; strong foundation for extensions.citeturn8search10
- Cons: scaling and dependence issues in complex designs; extensions to censored/missing hierarchical mixtures require additional work.citeturn4search17turn2search6

Open gaps. Efficient computation and EIF-based extensions for modern causal/ML HTE targets.

NPMLE for censoring and missingness in hierarchical endpoints

Targeted estimand. The win ratio for hierarchical endpoints when data include right-censoring and missing endpoints; a nonparametric MLE uses all observed information and yields a closed-form asymptotic variance.citeturn2search6turn14search7

Assumptions.
- As specified in the NPMLE framework: censoring and missingness mechanisms compatible with the modeling/identification assumptions (e.g., missing-at-random-type structures, depending on the exact setup).citeturn2search6turn14search7

Implementation steps.
1) Specify the two-level hierarchical endpoint structure and missingness indicators.citeturn2search6
2) Fit the NPMLE for the joint structure and derive WR.citeturn2search6turn14search7
3) Report asymptotic variance-based CIs; validate via simulation if possible.citeturn2search6turn14search7

Inference. A closed-form asymptotic variance estimator is provided.citeturn2search6turn14search7

Software. Research-stage; distributed with the preprint and emerging tooling.citeturn2search6

Pros/cons.
- Pros: directly addresses a common PCORI pain point (censoring + missing patient-centered outcomes) without heavy parametric assumptions.citeturn2search6turn10search17
- Cons: currently specialized (two hierarchical endpoints); generalization to richer hierarchies and ML-based missingness modeling remains open.citeturn2search6turn5search4

Open gaps. Extending NPMLE concepts to multi-endpoint hierarchies and integrating with causal DR methods.

Inference, censoring, recurrent/terminal events, and missing data

Variance estimation and inference toolbox

U-statistic asymptotics. The large-sample inference framework for WR and the “proportion in favor” parameter uses multivariate \(U\)-statistics to derive tests and confidence intervals, and it extends to stratified settings.citeturn8search10turn8search2

Influence functions and local efficiency. A general causal \(U\)-statistic framework provides naive IPW and locally efficient doubly robust estimators for contrast-function estimands—which directly applies to win kernels—and supports the path toward EIF-based win estimators.citeturn15search3turn15search0

Bootstrap and permutation. Win-odds inference work explicitly discusses exact permutation and bootstrap variance estimators and regression extensions via probabilistic index ideas.citeturn2search18turn2search16 Win-statistic software notes both resampling and asymptotic approaches.citeturn7search10turn7search6
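
A minimal sketch of the two resampling devices named above, applied to the net benefit for a single uncensored outcome (larger is better). The permutation test here is a Monte Carlo approximation rather than an exact enumeration, the bootstrap interval is a plain percentile interval, and all function names are hypothetical rather than part of any win-statistics package.

```python
import random

def net_benefit(treated, control):
    """(wins - losses) / number of treated-control pairs."""
    wins = sum(a > b for a in treated for b in control)
    losses = sum(a < b for a in treated for b in control)
    return (wins - losses) / (len(treated) * len(control))

def permutation_p_value(treated, control, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value under exchangeability of arm labels."""
    rng = random.Random(seed)
    observed = abs(net_benefit(treated, control))
    pooled = list(treated) + list(control)
    n1 = len(treated)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign arm labels at random
        if abs(net_benefit(pooled[:n1], pooled[n1:])) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)

def bootstrap_ci(treated, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling each arm separately with replacement."""
    rng = random.Random(seed)
    draws = sorted(
        net_benefit(rng.choices(treated, k=len(treated)),
                    rng.choices(control, k=len(control)))
        for _ in range(n_boot)
    )
    lower = draws[int((alpha / 2) * n_boot)]
    upper = draws[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

if __name__ == "__main__":
    treated = [5, 7, 8, 9, 6, 10]
    control = [4, 6, 6, 7, 3, 5]
    print(permutation_p_value(treated, control), bootstrap_ci(treated, control))
```

Resampling individuals independently is only appropriate when observations are independent; clustered designs would need whole clusters resampled or permuted.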

Cluster-robust inference. In cluster and stepped-wedge designs, ignoring clustering/time can compromise type I error, while mixed-effects and cluster-restricted PIM approaches can restore calibration.citeturn5search2turn5search6

Censoring and follow-up-time dependence

A recurring challenge is that some “traditional” WR estimands can be influenced by trial-specific censoring patterns if the estimand does not explicitly define a comparison horizon; this threatens interpretability and cross-study transportability.citeturn4search17turn4search1

IPCW/CovIPCW. IPCW-adjusted win ratio estimators are developed to remove bias under right censoring, with extensions (CovIPCW) for dependent censoring predictable by covariates/time-dependent covariates.citeturn14search1turn14search0turn14search8
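
To make the weighting idea concrete, the sketch below applies inverse-probability-of-censoring weights to a win ratio for a single time-to-event endpoint. It is a simplified illustration, not the published IPCW or CovIPCW estimator: it assumes independent censoring within each arm, no ties between event and censoring times, arm-specific Kaplan-Meier estimates of the censoring distribution evaluated just before the deciding event time, and hypothetical function names throughout.

```python
def censoring_survival(times, events):
    """Kaplan-Meier estimate of G(t-) = P(C >= t), treating censoring (event == 0) as the event."""
    censor_times = sorted({t for t, e in zip(times, events) if e == 0})
    steps = []
    for u in censor_times:
        at_risk = sum(t >= u for t in times)
        n_cens = sum(t == u and e == 0 for t, e in zip(times, events))
        steps.append((u, 1.0 - n_cens / at_risk))

    def g_left(t):
        g = 1.0
        for u, factor in steps:
            if u < t:  # only censoring strictly before t counts toward G(t-)
                g *= factor
        return g

    return g_left

def ipcw_win_ratio(t1, e1, t0, e0):
    """Weighted wins / weighted losses; event == 1 means the (bad) event was observed."""
    g1 = censoring_survival(t1, e1)
    g0 = censoring_survival(t0, e0)
    wins = losses = 0.0
    for ti, ei in zip(t1, e1):
        for tj, ej in zip(t0, e0):
            if ej == 1 and ti > tj:
                # Control event observed while the treated subject is still followed: treated wins.
                wins += 1.0 / (g1(tj) * g0(tj))
            elif ei == 1 and tj > ti:
                # Treated event observed while the control subject is still followed: treated loses.
                losses += 1.0 / (g1(ti) * g0(ti))
            # Otherwise the pair is undecided and contributes nothing.
    if losses == 0:
        raise ValueError("no weighted losses; the win ratio is undefined")
    return wins / losses
```

With no censoring every weight equals one and the function returns the ordinary wins-over-losses ratio; decided pairs at later times receive larger weights because fewer pairs remain uncensored long enough to be decided there.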

Censoring-aware GPC. GPC has been extended to right-censoring by defining pairwise contributions based on estimated survival functions.citeturn8search3turn8search7

Time-restricted estimands. A principled response is to define restricted-time win ratio estimands at a pre-specified horizon and provide corresponding estimation approaches, improving interpretability across studies.citeturn6search2turn4search5
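
One way to picture a horizon-restricted comparison is to truncate every subject's follow-up at the pre-specified horizon before applying the win rule, so that nothing after the horizon can decide a pair. The sketch below does this for a single time-to-event endpoint; `tau`, the tuple layout `(time, event)`, and the function names are assumptions made for illustration, and censoring before the horizon would still need separate handling (for example by weighting).

```python
def restrict(time, event, tau):
    """Administratively censor follow-up at horizon tau; event == 1 means the event was observed."""
    if time > tau:
        return tau, 0
    return time, event

def restricted_pair_result(treated, control, tau):
    """Return +1 (treated wins), -1 (treated loses), or 0 (tie or undecided) within [0, tau]."""
    t1, e1 = restrict(*treated, tau)
    t0, e0 = restrict(*control, tau)
    if e0 == 1 and t1 > t0:
        return 1   # control event observed first, within the horizon
    if e1 == 1 and t0 > t1:
        return -1  # treated event observed first, within the horizon
    return 0

if __name__ == "__main__":
    # Both events occur after tau = 5, so the pair is tied at the horizon.
    print(restricted_pair_result((10, 1), (12, 1), tau=5))
```

Because the horizon is fixed in advance, two studies with different follow-up lengths can target the same quantity as long as both cover the chosen window.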

RMT-IF (restricted mean time in favor). RMT-IF is defined as the net average time treated individuals spend in more favorable states than controls over a pre-specified time window, generalizing RMST to multistate settings and providing a patient-friendly time-based effect size.citeturn4search4turn4search16turn4search0
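
The verbal definition above can be written compactly. The notation here is an assumption made for illustration: \(Y^{(a)}(t)\) denotes the health state at time \(t\) of a randomly drawn individual from arm \(a\), coded so that larger values are worse, with the treated and control individuals drawn independently, and \(\tau\) is the pre-specified window.

\[
\mu(\tau) \;=\; E\!\left[\int_0^{\tau} I\{Y^{(1)}(t) < Y^{(0)}(t)\}\,dt\right] \;-\; E\!\left[\int_0^{\tau} I\{Y^{(1)}(t) > Y^{(0)}(t)\}\,dt\right].
\]

With only two states (alive and dead, with death times \(D^{(1)}\) and \(D^{(0)}\)), the integrand difference equals \(I\{D^{(1)} > t\} - I\{D^{(0)} > t\}\), so \(\mu(\tau) = E[\min(D^{(1)},\tau)] - E[\min(D^{(0)},\tau)]\), the difference in restricted mean survival times.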

Recurrent and terminal events

Win methods are naturally suited to settings where mortality competes with recurrent nonfatal events (e.g., hospitalizations). Variants such as “last-event-assisted” WR use more recurrent-event information than the standard WR.citeturn6search0turn6search16

For semi-competing risks (terminal and non-terminal events), event-specific win ratios and global tests based on them are proposed to enhance power and interpretability when effects differ by event type—an explicit form of component-level heterogeneity that can interact with patient-level HTE.citeturn5search3turn5search7

Inference for win-loss parameters with right-censored death and recurrent events has been developed, addressing a gap in censored multiple-event win inference.citeturn6search1turn6search5

Missing data: MAR, MNAR sensitivity, and joint models

PCORI and ICH emphasize clarity about estimands and sensitivity analyses for missing data.citeturn1search1turn1search0turn10search1

General missing data guidance. Missing data can undermine trial validity and should be addressed via design and principled analysis; major guidance recommends sensitivity analyses for plausible MNAR mechanisms.citeturn10search17turn10search13turn10search0

Win-specific missingness methods.
- Global win probability methods explicitly accommodate missing data and baseline adjustment across endpoints.citeturn5search4turn5search0
- NPMLE approaches explicitly address hierarchical endpoints with censoring and missing data.citeturn2search6turn14search7

MNAR sensitivity. Practical frameworks exist for MNAR sensitivity analysis and are recommended in applied settings when MNAR is plausible.citeturn10search0turn10search13

Joint models. When missingness is driven by informative dropout linked to longitudinal and survival processes, joint models are widely used to model the outcome and dropout/time-to-event jointly; these can be used as sensitivity/robustness tools around win analyses (particularly when PROs are missing due to deteriorating health).citeturn10search10turn10search18turn10search2

Method comparison table for PCORI-style CER

| Method | Estimand targeted | Causal vs associational | Handles censoring | Handles missing data | Covariate adjustment | ML nuisance estimation | Inference | Software | PCORI-style CER suitability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All-pairs WR / \(\Delta\) / WO | Marginal \(p_W,p_L,p_T\), \(\Delta\), WR/WO | Causal in RCT; associational in observational | Partially; estimand may be follow-up dependent | Limited | Not beyond ad hoc | No | \(U\)-stat asymptotics; boot/permutation | BuyseTest, WINSciteturn7search2turn7search1turn8search10 | Strong for transparent primary analyses; limited for confounding/HTE |
| Stratified WR | Stratum-specific marginal effects; MH-type combined | Causal in RCT; conditional on adequacy in observational | If combined with IPCW or restricted-time | Limited | Low-dim via strata | No | Plug-in variance + homogeneity test | Implementable via win packages | Strong for pre-specified subgroups and PCORI reportingciteturn13search0turn1search0 |
| PIM / pairwise regression | Conditional \(p_W(x,x')\) (and interactions) | Conditional; causal in RCT; causal in obs w/ assumptions | Not native; needs extension | Limited | Yes (interactions) | Possible (research) | Sandwich/semiparametric | pimciteturn9search0turn1search7 | Strong for modeling HTE; needs careful interpretation |
| Proportional win-fractions regression | Conditional win-fractions regression parameters | Conditional; causal in RCT | Requires estimand clarity; can combine with IPCW/time restriction | Not primary | Yes | Not standard | Semiparametric theory + diagnostics | WRciteturn3search5turn3search0 | Good for trials/pragmatic studies; needs robustness extensions |
| Marginal covariate-adjusted WO | Marginal WO via marginal PI | Causal in RCT | As per WO; extensions needed | Limited | Yes (precision) | Not emphasized | Theory + simulations | Research-stage | Promising for trial efficiency; still emergingciteturn4search2 |
| IPTW-adjusted WR | Marginal causal WR (ATE/ATT) | Causal under unconfoundedness | Not automatic; can combine with IPCW | Not automatic | Yes (via IPTW) | Yes (PS w/ ML + cross-fitting) | Derived variance + diagnostics | WINS + custom | High relevance for observational CERciteturn2search2turn12search10 |
| IPCW / CovIPCW | Censoring-robust marginal WR/WO/NB | Causal in RCT; under censoring-model assumptions | Yes | No | Yes (censoring model) | Yes (censor model ML) | Theory + simulations | WINS | Essential when censoring differs by covariatesciteturn14search1turn14search0 |
| DR / AIPW / TMLE + cross-fitting | Causal win estimands defined by win kernel | Causal under unconfoundedness | In principle (needs EIF) | In principle (needs EIF) | Yes | Yes (core strength) | EIF-based + cross-fit | General TMLE/DML toolchain | High potential; major open developmentciteturn15search3turn12search0turn7search3 |
| NN pairing causal win estimands | Pairing-sensitive causal estimands | Causal (assumptions) | Not central (extendable) | Not central (extendable) | Yes (pairing/PS) | Suggested via distributional regression | Research-stage | Research-stage | High potential for patient-centric HTE; needs broader toolingciteturn7search0 |
| Win-fraction pseudo-outcomes + ML | \(\Delta(x)\) analogues via pseudo-outcomes | Causal if pseudo-outcome identifies CATE | Needs integration | Possible (depends) | Yes | Yes | ML inference + resampling/forest theory | Standard ML + mixed models | Promising for scalable HTE + communication; needs validationciteturn5search9turn11search0 |
| Cluster/stepped-wedge GPC (mixed effects / cluster PIM) | Cluster/time-adjusted win estimands | Causal under design assumptions | Depends | Depends | Yes | Possible | Simulation-supported calibration | Standard mixed model + PIM | Highly relevant for pragmatic PCORI designsciteturn5search2turn5search10 |
| NPMLE (censoring + missingness) | WR for hierarchical endpoints | Causal in RCT | Yes | Yes | Limited (via modeling) | Not emphasized | Closed-form asymptotic variance | Research-stage | Highly relevant for patient-centered endpoints with missingnessciteturn2search6turn14search7 |

Practical guidance for applied researchers

Estimation workflow for win-based HTE in CER

A PCORI-aligned workflow emphasizes transparency, stakeholder relevance, and reproducibility.citeturn1search0turn1search4turn1search1

  1. Define the clinical priority rule \(W(\cdot,\cdot)\) with stakeholders. Document priorities, tie thresholds, and the rationale for patient-centered relevance.citeturn0search0turn0search1

  2. Specify the estimand and pairing choice. Explicitly state whether the target is marginal (\(p_W\), \(\Delta\)) or conditional (\(p_W(x)\), \(\Delta(x)\)), and whether all-pairs, stratified/matched, or NN pairing is used.citeturn4search17turn7search0turn1search1

  3. Primary overall analysis. Estimate marginal WR/WO/NB with \(U\)-statistic-based inference or resampling.citeturn8search10turn2search18turn7search10

  4. Pre-specified subgroup HTE. Use stratified WR and homogeneity tests; report effect sizes and uncertainty within each subgroup and avoid interpretive fallacies highlighted by PCORI.citeturn13search0turn1search0turn1search4

  5. Model-based HTE.

    • Use PIMs to model \(p_W(x,x')\) and derive \(p_W(x)\) for clinically relevant profiles.citeturn9search0turn1search7
    • Alternatively, use proportional win-fractions regression with interactions plus diagnostics.citeturn3search0turn3search1

  6. Observational CER adjustment. Use IPTW-adjusted WR; check balance/overlap and consider doubly robust extensions where feasible.citeturn2search2turn12search10turn15search3

  7. Censoring and missing data robustness.

    • If censoring is nontrivial or differs by covariates, use IPCW/CovIPCW or time-restricted estimands; report sensitivity.citeturn14search1turn6search2turn4search17
    • If endpoints (e.g., PROs) are missing, follow missing-data best practices and consider NPMLE or sensitivity analyses for MNAR.citeturn2search6turn10search13turn10search0
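
The pre-specified subgroup step of the workflow can be sketched as follows. This is a minimal illustration that assumes a single uncensored outcome where larger is better and a Mantel-Haenszel-type combination in which each stratum's win and loss counts are divided by the stratum size; the function names and data layout are hypothetical, and the homogeneity test and variance estimation described in the cited work are not implemented here.

```python
def count_wins_losses(treated, control):
    """Count treated wins and losses over all treated-control pairs (larger is better)."""
    wins = sum(a > b for a in treated for b in control)
    losses = sum(a < b for a in treated for b in control)
    return wins, losses

def stratified_win_ratio(strata):
    """strata maps a stratum label to (treated outcomes, control outcomes).

    Pairs are formed only within strata. Returns the combined ratio and the
    per-stratum win ratios.
    """
    numerator = denominator = 0.0
    per_stratum = {}
    for label, (treated, control) in strata.items():
        wins, losses = count_wins_losses(treated, control)
        size = len(treated) + len(control)
        numerator += wins / size      # stratum wins, weighted by 1 / stratum size
        denominator += losses / size  # stratum losses, same weight
        per_stratum[label] = wins / losses if losses else float("inf")
    return numerator / denominator, per_stratum

if __name__ == "__main__":
    strata = {"A": ([3, 4], [1, 5]), "B": ([2, 6, 7], [1, 3])}
    combined, by_stratum = stratified_win_ratio(strata)
    print(combined, by_stratum)
```

Reporting the per-stratum ratios next to the combined value keeps the subgroup comparison visible, but a claim that subgroups differ still needs a formal homogeneity or interaction test.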

Diagnostics checklist

  • HTE multiplicity and transparency: document subgroup definitions and the effective number of comparisons (PCORI).citeturn1search0turn1search4
  • Propensity overlap and weight stability: inspect extreme weights; consider truncation.citeturn12search2turn12search10
  • Censoring model diagnostics: compare censoring patterns by arm and covariates; assess sensitivity to IPCW specification.citeturn14search1turn14search0
  • PW regression diagnostics: assess proportional win-fractions model adequacy.citeturn3search0
  • Cluster/time effects (pragmatic designs): ensure time and clustering are modeled; validate calibration using simulation when possible.citeturn5search2turn5search10
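
The weight-stability item in the checklist can be made concrete with a short sketch. It assumes propensity scores have already been estimated elsewhere, uses a fixed cap as the truncation rule, summarizes stability with the Kish effective sample size, and weights each treated-control pair by the product of the two subjects' weights. All names and the cap value are illustrative assumptions rather than the procedure of the cited IPTW-adjusted WR work.

```python
def iptw_weights(treatment, propensity, cap=None):
    """Inverse probability of treatment weights, optionally truncated at `cap`."""
    weights = []
    for a, e in zip(treatment, propensity):
        w = 1.0 / e if a == 1 else 1.0 / (1.0 - e)
        weights.append(min(w, cap) if cap is not None else w)
    return weights

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

def weighted_win_ratio(outcome, treatment, weights):
    """Win ratio with each treated-control pair weighted by the product of subject weights."""
    data = list(zip(outcome, treatment, weights))
    wins = losses = 0.0
    for yi, ai, wi in data:
        if ai != 1:
            continue
        for yj, aj, wj in data:
            if aj != 0:
                continue
            if yi > yj:
                wins += wi * wj
            elif yi < yj:
                losses += wi * wj
    return wins / losses

if __name__ == "__main__":
    treatment = [1, 1, 0, 0]
    propensity = [0.5, 0.1, 0.5, 0.9]
    outcome = [3, 1, 2, 0]
    raw = iptw_weights(treatment, propensity)
    capped = iptw_weights(treatment, propensity, cap=5.0)
    # A large drop in effective sample size signals that a few subjects dominate the estimate.
    print(effective_sample_size(raw), effective_sample_size(capped))
    print(weighted_win_ratio(outcome, treatment, capped))
```

Comparing the estimate with and without truncation is a simple sensitivity check; a large change suggests limited overlap rather than a robust effect.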

Patient-facing communication templates

Win-based HTE is often easier to communicate on absolute scales (e.g., \(\Delta(x)\)) than ratio scales (WR).citeturn2search18turn0search1 Suggested templates:

  • Forest plots of subgroup \(\Delta(s)\) (or win probability \(p_W(s)\)) with confidence intervals, emphasizing that subgroup differences require direct statistical contrasts.citeturn1search4turn13search0
  • Component attribution plots: decompose net benefit into contributions from each prioritized component to show why treatment wins differ by subgroup. This aligns with the GPC framework’s emphasis on prioritized outcomes.citeturn0search1turn8search3
  • Win-probability curves across clinically meaningful thresholds (for time-to-event margins) or across time horizons (restricted-time estimands).citeturn6search2turn14search2
  • Personalized \(\Delta(x)\) plots: show predicted \(\Delta(x)\) vs baseline risk score or key covariates; report uncertainty bands and define the target population. Win-fraction pseudo-outcomes can support this style of reporting with standard regression/ML tools.citeturn5search9turn11search3
  • Time-based summaries (RMT-IF): present “net time in a more favorable state” over a defined window, which maps naturally to patient understanding.citeturn4search4turn4search16
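
The component attribution idea can be sketched by recording which priority level decides each pair, so that the per-level contributions add up to the overall net benefit. The example assumes each subject is a tuple of outcomes ordered from highest to lowest priority, that larger values are better on every component, and that there are no tie thresholds or censoring; the function names are hypothetical.

```python
def compare_hierarchical(treated, control):
    """Return (level, sign) for the first component that decides the pair.

    sign is +1 if treated wins and -1 if treated loses; (None, 0) means tied on all components.
    """
    for level, (a, b) in enumerate(zip(treated, control)):
        if a != b:
            return level, (1 if a > b else -1)
    return None, 0

def net_benefit_by_component(treated_arm, control_arm, n_levels):
    """Contribution of each priority level to the net benefit; the list sums to the net benefit."""
    contributions = [0.0] * n_levels
    n_pairs = len(treated_arm) * len(control_arm)
    for t in treated_arm:
        for c in control_arm:
            level, sign = compare_hierarchical(t, c)
            if level is not None:
                contributions[level] += sign / n_pairs
    return contributions

if __name__ == "__main__":
    # Each subject: (survived indicator, quality-of-life score), in priority order.
    treated_arm = [(1, 5), (0, 7)]
    control_arm = [(1, 3), (0, 6)]
    parts = net_benefit_by_component(treated_arm, control_arm, n_levels=2)
    print(parts, sum(parts))
```

Computing these contributions separately within each subgroup gives the bars for an attribution plot and shows whether a subgroup difference is driven by the top-priority outcome or by lower-priority components.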

References

Below are prioritized primary/official/original sources (≥25) with DOIs/URLs when available.

  1. Win ratio for hierarchical composite endpoints: foundational description and examples. European Heart Journal (2012). doi:10.1093/eurheartj/ehr352. citeturn0search0turn0search4
  2. Generalized pairwise comparisons of prioritized outcomes (GPC). Statistics in Medicine (2010). doi:10.1002/sim.3923. citeturn0search1turn0search5
  3. Large-sample \(U\)-statistic inference for WR and “proportion in favor.” Biostatistics (2016). doi:10.1093/biostatistics/kxv032. citeturn8search10turn8search2
  4. Extension of GPC to right-censoring. Statistical Methods in Medical Research (2018). doi:10.1177/0962280216658320. citeturn8search3turn8search7
  5. Stratified win ratio (MH-type estimator + homogeneity test). Journal of Biopharmaceutical Statistics (2018). doi:10.1080/10543406.2017.1397007. citeturn13search0turn13search1
  6. Stratified win statistics: WR, WO, and net benefit. Pharmaceutical Statistics (2023). citeturn13search7
  7. Semiparametric proportional win-fractions regression. Biometrics (2021). doi:10.1111/biom.13382. citeturn3search0turn3search4
  8. WR R package (PW regression and win methodology). https://cran.r-project.org/package=WR. citeturn3search5turn3search1
  9. IPCW-adjusted WR (independent censoring). Journal of Biopharmaceutical Statistics (2020). citeturn14search1turn14search9
  10. CovIPCW adjustment for dependent censoring. Pharmaceutical Statistics (2021). doi:10.1002/pst.2086. citeturn14search0turn14search4
  11. IPTW-adjusted WR for baseline imbalances/confounding. Journal of Biopharmaceutical Statistics (2023/2025). PMID page. citeturn2search2turn14search14
  12. WINS R package documentation. https://cran.r-project.org/web/packages/WINS/. citeturn7search1turn7search5
  13. Win odds accounting for ties. Statistics in Medicine (2021). doi:10.1002/sim.8967. citeturn2search1turn2search4
  14. Win odds inference and regression review. Statistics in Medicine (2023). citeturn2search18turn2search16
  15. Covariate adjustment for the win odds via marginal probabilistic index. arXiv (2025). https://arxiv.org/abs/2511.14292. citeturn4search2turn4search6
  16. Probabilistic index models (PIM): semiparametric regression for superiority probability. JRSS-B (2012). doi:10.1111/j.1467-9868.2011.01020.x. citeturn9search0turn9search14
  17. pim R package. https://cran.r-project.org/package=pim. citeturn1search7turn1search3
  18. Causal estimands for Wilcoxon–Mann–Whitney parameters and the “probability of benefit” paradox. Statistics in Medicine (2018). doi:10.1002/sim.7799. citeturn3search3turn3search7
  19. IPW causal inference for Mann–Whitney–Wilcoxon and related statistics. Statistics in Medicine (2014). doi:10.1002/sim.6026. citeturn12search1turn12search5
  20. Causal estimation using \(U\)-statistics: IPW and doubly robust estimators for contrast kernels. Biometrika (2018). doi:10.1093/biomet/asx071. citeturn15search0turn15search3
  21. Efficient/robust estimation for Mann–Whitney-type causal effects. International Statistical Review (2019). doi:10.1111/insr.12326. citeturn15search1turn15search14
  22. Double/debiased machine learning and cross-fitting for causal parameters. The Econometrics Journal (2018). doi:10.1111/ectj.12097. citeturn12search0turn12search4turn12search12
  23. Targeted learning and TMLE. Book (2011). doi:10.1007/978-1-4419-9782-1. citeturn7search3turn12search3
  24. Nearest-neighbor pairing causal framework for hierarchical win outcomes. arXiv (2025). https://arxiv.org/abs/2501.16933. citeturn7search0turn7search4
  25. Recursive partitioning for heterogeneous causal effects (honest trees). PNAS (2016). doi:10.1073/pnas.1510489113. citeturn11search1
  26. Causal forests for HTE inference. arXiv (2015); later published in JASA (2018). https://arxiv.org/abs/1510.04342. citeturn11search0
  27. Meta-learners (X-learner and framework). PNAS (2019). doi:10.1073/pnas.1804597116. citeturn11search2turn11search6
  28. grf package for causal forests and HTE inference. https://cran.r-project.org/package=grf. citeturn11search3turn11search11turn11search23
  29. On recurrent-event win ratio (last-event-assisted). Statistical Methods in Medical Research (2022). citeturn6search0turn6search16
  30. Win-loss parameters with right-censored death + recurrent events. Statistics in Medicine (2023). doi:10.1002/sim.9937. citeturn6search1turn6search5
  31. Event-specific win ratios for terminal and non-terminal events. Clinical Trials (2021). doi:10.1177/1740774520972408. citeturn5search3turn5search7
  32. Estimand clarity for WR and censoring dependence. (2024). citeturn4search17turn4search1
  33. Restricted time win ratio: estimands to estimation. Statistics in Biopharmaceutical Research (2025). doi:10.1080/19466315.2024.2332675. citeturn6search2turn6search10
  34. Restricted mean time in favor of treatment (RMT-IF). Biometrics (2023). doi:10.1111/biom.13570. citeturn4search4turn4search0turn4search16
  35. Global win probability beyond O’Brien–Wei–Lachin with missing data. Statistics in Medicine (2024). citeturn5search4turn5search0
  36. Win fractions and mixed-model inference for cluster trials with multiple endpoints. (2024/2025). citeturn5search9turn5search1
  37. GPC in stepped-wedge CRTs: simulation comparing mixed-effects and PIM approaches. arXiv (2026). https://arxiv.org/abs/2603.02003. citeturn5search2turn9search2
  38. NPMLE for WR with censoring and missing data in hierarchical endpoints. arXiv (2026). https://arxiv.org/abs/2602.13533. citeturn2search3turn2search6
  39. PCORI HTE standards (reporting, multiplicity, subgroup definition transparency). https://www.pcori.org/.../page-5. citeturn1search0turn1search4
  40. ICH E9(R1) estimands and sensitivity analysis addendum. PDF. citeturn1search1turn1search5
  41. Missing data in trials: National Academies report (2010). doi:10.17226/12955. citeturn10search13turn10search5
  42. Missing data impact and recommendations. NEJM (2012). doi:10.1056/NEJMsr1203730. citeturn10search17
  43. Practical MNAR sensitivity analysis tutorial. (2018). citeturn10search0
  44. Causal inference assumptions and target trials framework. Causal Inference: What If (Hernán & Robins). https://miguelhernan.org/whatifbook. citeturn16search8turn16search0