1 Title: The central role of the propensity score in observational studies for causal effects

Estimating the causal effects of the treatments is a missing data problem, since either \(r_{i0}\) or \(r_{i1}\) is missing.

In randomized experiments, the results in the two treatment groups can be directly compared because their units are likely to be similar, whereas in nonrandomized experiments such direct comparisons may be misleading because the units exposed to one treatment generally differ systematically from the units exposed to the other treatment. Balancing scores can be used to group treated and control units so that direct comparisons are more meaningful.

In a randomized trial, the propensity score (PS) is a known function. In a nonrandomized trial, the PS is unknown and must be estimated from the observed data, for example with a logit model. The second difference is that in a randomized trial the treatment assignment z and the response (r1, r0) are conditionally independent given x, that is, treatment assignment is strongly ignorable given the baseline covariates.
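As a rough illustration of estimating the PS with a logit model, here is a minimal sketch using a hand-written Newton-Raphson fit in NumPy. The simulated data, the function names `fit_logistic` and `propensity_scores`, and the coefficient values are all assumptions made for the example, not taken from the paper.

```python
import numpy as np

def fit_logistic(X, z, n_iter=50, tol=1e-10):
    """Fit a logistic regression of z on X (intercept added) by Newton-Raphson."""
    X1 = np.column_stack([np.ones(len(z)), X])       # add an intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        e = 1.0 / (1.0 + np.exp(-X1 @ beta))          # current fitted probabilities
        grad = X1.T @ (z - e)                         # score vector
        hess = X1.T @ (X1 * (e * (1 - e))[:, None])   # information matrix
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def propensity_scores(X, z):
    """Return the estimated PS e(x) = P(z = 1 | x) for every unit."""
    beta = fit_logistic(X, z)
    X1 = np.column_stack([np.ones(len(z)), X])
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

# Hypothetical observational data: assignment depends on two covariates.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))
true_e = 1.0 / (1.0 + np.exp(-(0.5 * x[:, 0] - 0.5 * x[:, 1])))
z = rng.binomial(1, true_e)
e_hat = propensity_scores(x, z)
```

In practice a library routine would normally replace the hand-written fit; the explicit version is only meant to show that the estimated PS is the fitted probability of treatment given the covariates.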

A balancing score b(x) is “finer” than e(x) if there exists some function f such that e(x)=f(b(x)).

If the treatment assignment is strongly ignorable, then adjustment for a balancing score b(x) is sufficient to produce unbiased estimates of the average treatment effect.

Average treatment effect: \(E(r_1) - E(r_0)\) (1.1). The quantity \(E(r_1|z=1)-E(r_0|z=0)\) (1.2) differs from (1.1) in general, since (1.2) involves the conditional distribution of \(r_t\) given z=t.
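A small numeric sketch of why (1.1) and (1.2) can differ. The six potential outcomes below are invented for illustration; in real data only one of the two responses is observed per unit.

```python
import numpy as np

# Hypothetical potential outcomes for six units.
r1 = np.array([5.0, 6.0, 7.0, 2.0, 3.0, 4.0])   # response under treatment
r0 = np.array([4.0, 5.0, 6.0, 1.0, 2.0, 3.0])   # response under control
z = np.array([1, 1, 1, 0, 0, 0])                # units with high responses got treated

ate = r1.mean() - r0.mean()                      # (1.1): 4.5 - 3.5
naive = r1[z == 1].mean() - r0[z == 0].mean()    # (1.2): 6.0 - 2.0
```

Every unit has a treatment effect of exactly 1, yet the direct comparison of observed group means gives 4 because assignment is related to the potential outcomes.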

2 Title: Full Matching in an Observational Study of Coaching for the SAT

Goal: evaluate the practical performance of full matching, modifying it to minimize variance as well as bias. With restrictions on the ratio of treated subjects to controls, full matching makes use of many more observations than pair matching and achieves far closer matches than matching with \(k\) controls.

As compared to 1:k matching or to matching with a variable number of controls (Ming and Rosenbaum 2000), pair matching is the least flexible and the least able to make use of a large reservoir of potential controls.

Difference between optimal and nearest matching:

Nearest available (greedy): move down the list of treated subjects from top to bottom, at each step matching a treated subject to the nearest available control, which is then removed from the list of controls available at the next step. Each match is made without attention to how it affects the possibilities for later matches, so the total distance over all matches need not be the minimum.

Optimal matching optimizes a global rather than a local criterion: it minimizes the total distance over all matched pairs.
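The contrast between greedy and optimal pair matching can be seen in a toy example. This is a sketch only: the scores are made up, the function names are hypothetical, and the optimal match is found by brute force over all assignments, which is feasible only for tiny problems (real implementations use network-flow or assignment algorithms).

```python
from itertools import permutations

def greedy_match(treated, controls):
    """Nearest-available pair matching, processing treated units in list order."""
    available = list(range(len(controls)))
    pairs = []
    for i, t in enumerate(treated):
        j = min(available, key=lambda c: abs(t - controls[c]))
        pairs.append((i, j))
        available.remove(j)          # a control can be used only once
    return pairs

def optimal_match(treated, controls):
    """Brute-force optimal pair matching: minimize the total distance."""
    best = min(
        permutations(range(len(controls)), len(treated)),
        key=lambda perm: sum(abs(t - controls[j]) for t, j in zip(treated, perm)),
    )
    return list(enumerate(best))

def total_distance(pairs, treated, controls):
    return sum(abs(treated[i] - controls[j]) for i, j in pairs)

treated = [50, 30]     # hypothetical covariate values for treated units
controls = [35, 90]    # hypothetical covariate values for controls

greedy_total = total_distance(greedy_match(treated, controls), treated, controls)
optimal_total = total_distance(optimal_match(treated, controls), treated, controls)
```

Greedy gives the first treated unit the control at 35 and leaves the second with the control at 90 (total 75), while the optimal assignment swaps them (total 45).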

Rosenbaum and Rubin (1985) pointed out that greedy algorithms and optimal algorithms have nearly the same performance with a large reservoir of controls; with limited controls, greedy algorithms can do much worse than optimal ones.

Dehejia and Wahba (1999) indicated that attempting to use most or all of the control reservoir invites sharp penalties in terms of bias.

The optimal full match uses all controls and balances every covariate, but some of its matched sets are too heavy with controls, and in others controls are quite sparse. These disparities stand behind the optimal full match’s disappointing relative precision.

Regression adjustments assume that we know or can reliably discern patterns relating pretreatment, treatment, and response variables, and require the statistician to specify and fit a corresponding statistical model. Adjustment by stratification assumes only that treatment and control groups sufficiently alike in terms of pretreatment characteristics are comparable in terms of response to treatment; but it requires the statistician to make precise what it means for groups to be sufficiently alike prior to treatment, and it requires a method for grouping subjects into sufficiently uniform blocks.

Main conclusions:

  • Optimal pair matching: minimizes the overall distance between treated and control subjects within the matched sets;
  • The ATT can be estimated by averaging the matched-set treatment effects, weighted by the number of treated subjects in each matched set;

3 Title: Augmenting both arms of a randomized controlled trial using external data: an application of the propensity-score integrated approaches

Introduction (example): A medical device company plans to conduct an RCT to evaluate the safety and effectiveness of a device in order to seek approval for its marketing in the US, with control therapy being optimal medical management, and primary endpoint being a one-year adverse event rate. Based on the projected enrollment rate, it is anticipated that enrollment will take five years. With an additional one year of follow-up, it is going to take six years for the trial to complete. The company wants to explore the possibility of leveraging external data to speed up the trial, so that this new technology can reach patients sooner. The company thinks that such a study design may be viable because a significant amount of clinical data has already been accumulated in the EU, where the device was CE marked, and a high-quality registry has been established in the EU for patients treated with the device. This registry can potentially be used as a source of external data that can be leveraged to augment the treated arm of the RCT. To augment the control arm, a high-quality disease registry in the US may be used. A proposal is thus put forward to conduct a study consisting of an RCT with both arms supplemented by data from these two external sources, respectively.

Objective: describe an appropriate statistical procedure that can be used to design the study and subsequently analyze the data.

Suppose a prospective RCT with a treated and a control arm is actively enrolling patients, and an unforeseen event, such as the COVID-19 pandemic, occurs which stops the enrollment before it is completed. While sometimes the enrollment may restart later, this is not always possible or practical. Using the current data to test study hypotheses would result in an underpowered study as the planned sample size is not reached. In such a case, PS-integrated approaches can be used to augment one or both arms of the RCT with external data (if it is appropriate to do so) to recover the lost power, thereby salvaging the study.

Due to their flexibility and adaptability, the PS-integrated approaches are a viable statistical innovation that may be utilized in a variety of situations as a tool for the leveraging of external data to further support regulatory purposes.

4 Title: Matching methods for causal inference: a review and a look forward

Goal: This paper provides a structure for thinking about matching methods and guidance on their use, coalescing the existing research (both old and new) and providing a summary of where the literature on matching methods is now and where it should be headed.

There are 2 stages in estimating the effect of some intervention: 1) design and 2) outcome analysis. Stage 1 uses only background information on the individuals in the study, designing the non-experimental study as a randomized experiment would be designed, without access to the outcome values. Matching methods are a key tool for Stage 1. After Stage 1 is finished, Stage 2 compares the outcomes of the treated and control individuals.

Matching methods highlight areas of the covariate distribution where there is not sufficient overlap between the treatment and control groups, such that the resulting treatment effect estimates would rely heavily on extrapolation. Selection models and regression models perform poorly in situations where there is insufficient overlap, but their standard diagnostics do not involve checking this overlap. Matching methods in part serve to make researchers aware of the quality of the resulting inferences.

Rubin (1976) showed that matching on the propensity score and Mahalanobis metric matching are Equal Percent Bias Reducing (EPBR); Rubin and Stuart (2006) showed that EPBR holds under much more general settings, in which the covariate distributions are discriminant mixtures of ellipsoidally symmetric distributions. EPBR methods reduce bias in all covariate directions (i.e., make the covariate means closer) by the same amount, ensuring that if close matches are obtained in some direction (such as the propensity score), then the matching is also reducing bias in all other directions.

When matching using propensity scores, there is little cost to including variables that are actually unassociated with treatment assignment, as they will be of little influence in the propensity score model. Including variables that are actually unassociated with the outcome can yield slight increases in variance. However, excluding a potentially important confounder can be very costly in terms of increased bias.

One strategy is to include a small set of covariates known to be related to the outcomes of interest, do the matching, and then check the balance on all of the available covariates, including any additional variables that remain particularly unbalanced after the matching.

Variables that may have been affected by the treatment of interest should not be included in the matching process (Rosenbaum, 1984; Frangakis and Rubin, 2002; Greenland, 2003).

Matching Distance:

  • Exact: \(D_{ij} = 0\) if \(X_i=X_j\); \(D_{ij} = \infty\) if \(X_i\neq X_j\)

  • Mahalanobis: \(D_{ij}=(X_i-X_j)'\Sigma^{-1}(X_i-X_j)\)

  • Propensity score: \(D_{ij}=|e_i - e_j|\), where \(e_k\) is the propensity score for individual k

  • Linear propensity score: \(D_{ij}=|\operatorname{logit}(e_i)-\operatorname{logit}(e_j)|\)
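The four distance measures listed above can be written as small functions. This is a sketch under the assumption that covariates are NumPy vectors and that the covariance matrix (called `sigma` here) is supplied by the user; the function names are illustrative. The Mahalanobis distance follows the quadratic form as written in these notes, without a square root.

```python
import numpy as np

def exact_distance(xi, xj):
    """0 if the covariate vectors are identical, infinity otherwise."""
    return 0.0 if np.array_equal(xi, xj) else np.inf

def mahalanobis_distance(xi, xj, sigma):
    """(xi - xj)' Sigma^{-1} (xi - xj) for a given covariance matrix sigma."""
    d = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(d @ np.linalg.solve(sigma, d))

def ps_distance(ei, ej):
    """Absolute difference in propensity scores."""
    return abs(ei - ej)

def logit(p):
    return np.log(p / (1.0 - p))

def linear_ps_distance(ei, ej):
    """Absolute difference on the logit (linear predictor) scale."""
    return abs(logit(ei) - logit(ej))

# Example calls with made-up values.
d_exact = exact_distance(np.array([1, 0]), np.array([1, 1]))
d_mahal = mahalanobis_distance([1.0, 2.0], [0.0, 0.0], np.eye(2))
d_ps = ps_distance(0.3, 0.5)
d_lin = linear_ps_distance(0.5, 0.5)
```

With an identity covariance matrix the Mahalanobis form reduces to the squared Euclidean distance, which is 5 for the vectors above.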

Neither the exact nor the Mahalanobis distance measure works well when X is high dimensional. Exact matching often leads to many individuals not being matched, which can result in larger bias (Rosenbaum and Rubin, 1985). Coarsened Exact Matching (Iacus et al. 2009), which transforms continuous variables into categorical ones, can be used to do exact matching on broader ranges of the variables. Mahalanobis matching can work quite well when the number of covariates is fewer than 8 (Rubin 1979 and Zhao 2004), but it does not perform well when the covariates are not normally distributed.

The propensity score (Rosenbaum and Rubin, 1983) summarizes all the covariates into one scalar: the probability of being treated. It has two properties:

  • It is a balancing score: at each value of PS, the distribution of covariate X defining the PS is the same in the treated and control groups;

  • If the treatment assignment is ignorable given the covariates, then the treatment assignment is also ignorable given the propensity score: the difference in means in the outcome between treated and control individuals with a particular PS value is an unbiased estimate of the treatment effect at that PS value.

With propensity score estimation, the concern is not with the parameter estimates of the model, but rather with the resulting balance of the covariates (Augurzky and Schmidt, 2001). Because of this, standard concerns about collinearity do not apply. Also, since they do not use covariate balance as a criterion, model fit statistics that measure classification ability (such as the c-statistic) and stepwise selection models are not helpful for variable selection. One helpful strategy is to examine the balance of the covariates, including their squares and interactions, in the matched samples.

Simulations showed that mis-estimation of the PS is not a big issue: the treatment effect estimates are more biased when the outcome model (where the interest is in interpreting a particular regression coefficient) is misspecified than when the PS model (where the goal is covariate balance) is misspecified. Future research should involve more systematic evaluations of propensity score estimation, perhaps through more sophisticated simulations as well as analytic work, and consideration should include how the propensity scores will be used, for example in weighting versus subclassification.

Nearest neighbor matching

  • proposed by Rubin 1973a
  • the most effective method for settings where the goal is to select individuals for follow-up
  • nearly always used for estimating the ATT

1:1 matching

  • Complaint: it discards a large number of observations and thus apparently leads to reduced power
  • The reduction in power is often minimal, for 2 reasons:
    • in a two-sample comparison of means, the precision is largely driven by the smaller group: if the treatment group stays the same size and only the control group decreases in size, the overall power may not actually be reduced very much (Ho et al., 2007);
    • the power increases when the groups are more similar because of the reduced extrapolation and higher precision (Snedecor and Cochran 1980)

Estimates from 1:1 matching can have lower standard deviations than estimates from a linear regression (Smith 1997), even though thousands of observations were discarded in the matching.

k:1 matching can lead to poor matches if there are no control individuals with a PS similar to that of a given treated individual. One strategy is to impose a caliper and only select a match within the caliper; this avoids poor matches, but it can lead to difficulties in interpreting effects if many treated individuals do not receive a match.

Optimal matching: minimizes a global distance measure, which matters most when there is competition for controls. If the goal is simply to find well-matched groups, simple nearest neighbor (greedy) matching may be sufficient; if the goal is well-matched pairs, then optimal matching may be preferable (Gu and Rosenbaum 1993).

Ratio Matching

Selecting the number of matches involves a bias-variance trade-off: selecting multiple controls for each treated individual will generally increase bias, since the 2nd, 3rd, and 4th closest matches are further away from the treated individual than the closest match. On the other hand, utilizing multiple matches can decrease variance due to the larger matched sample size.

Fixed k:1 matching is not optimal, since it does not account for the fact that some treated individuals may have many close matches while others may have few. A more advanced form is variable ratio matching, which allows the ratio to vary (Ming and Rosenbaum, 2001).

Matching with Replacement

  • Reduces bias, since controls that look similar to many treated individuals can be used multiple times;
  • The matching order does not matter with replacement;
  • inference becomes more complex: matched controls are no longer independent (using frequency weights)
  • The number of times each control is matched should be monitored.

Subclassification, full matching and weighting use all individuals. These methods can be thought of as giving all individuals weights between 0 and 1.

Subclassification:

Rosenbaum and Rubin (1985b) show that creating 5 PS subclasses removes at least 90% of the bias in the estimated treatment effect due to all of the covariates that went into the PS. With larger samples, more subclasses may be feasible and appropriate (Lunceford and Davidian 2004). More work is needed on choosing the number of subclasses: enough to get adequate bias reduction, but not so many that the within-subclass estimates become unstable.

Full matching: at least 1 treated and 1 control in each matched set.

  • optimal in minimizing the average of the distance between each treated individual and each control individual within each match set;
  • appealing to researchers who are reluctant to discard controls but want to obtain optimal balance on the PS.

Weighting adjustment: inverse weights in estimates of the ATE (inverse probability of treatment weighting, IPTW). The weight is defined as \(w_i = \frac{T_i}{\hat e_i}+\frac{1-T_i}{1-\hat e_i}\), where \(T_i\) is the treatment indicator and \(\hat e_i\) is the estimated PS for individual i.

An alternative weighting technique, weighting by the odds: \(w_i=T_i+ (1-T_i)\frac{\hat e_i}{1-\hat e_i}\). This weight assigns treated individuals a weight of 1 and controls a weight of \(\hat e_i/(1-\hat e_i)\), which targets the ATT.
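Both weighting formulas are one-liners once the PS has been estimated. The sketch below uses made-up treatment indicators and PS values; the function names are illustrative.

```python
import numpy as np

def ate_weights(t, e_hat):
    """IPTW weights: T/e + (1 - T)/(1 - e)."""
    return t / e_hat + (1 - t) / (1 - e_hat)

def odds_weights(t, e_hat):
    """Weighting by the odds: T + (1 - T) * e/(1 - e); treated units get weight 1."""
    return t + (1 - t) * e_hat / (1 - e_hat)

t = np.array([1, 1, 0, 0])               # hypothetical treatment indicators
e_hat = np.array([0.8, 0.5, 0.5, 0.2])   # hypothetical estimated PS
w_ate = ate_weights(t, e_hat)
w_odds = odds_weights(t, e_hat)
```

Here `w_ate` is [1.25, 2, 2, 1.25] and `w_odds` is [1, 1, 1, 0.25]. A PS close to 0 or 1 would make the IPTW weights very large, which is the source of the variance problem with extreme weights.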

Drawback of weighting adjustment: large variance with extreme weights.

Numerical Diagnostics

  • Standardized mean difference: \(\frac{\bar x_t - \bar x_c}{\sigma_t}\); compared before and after the matching;
  • For binary covariates, either the above formula or simple difference in proportions can be calculated.
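A minimal sketch of the standardized mean difference for one covariate, using the treated-group standard deviation in the denominator as in the formula above. The data and the function name are invented for illustration; the same function would be applied before and after matching and the two values compared.

```python
import numpy as np

def standardized_mean_difference(x, t):
    """(mean of x in treated - mean of x in controls) / SD of x in treated."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t)
    x_t, x_c = x[t == 1], x[t == 0]
    return (x_t.mean() - x_c.mean()) / x_t.std(ddof=1)

x = [1, 2, 3, 0, 1, 2]      # hypothetical covariate values
t = [1, 1, 1, 0, 0, 0]      # first three units are treated
smd = standardized_mean_difference(x, t)
```

The treated mean is 2, the control mean is 1, and the treated-group SD is 1, so the standardized difference is 1.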

Rubin (2001) presents three balance measures:

  • standardized difference of means of PS (the absolute value should be less than 0.25)
  • ratio of the variances of the PS in the treated and control groups (value should be 0.5 to 2)
  • for each covariate, the ratio of the variance of the residuals orthogonal to the PS in the treated and control groups

Common hypothesis tests and p-values should NOT be used as measures of balance:

  • balance is in-sample property, without reference to any broader population or super-population
  • hypothesis tests can be misleading, they often conflate changes in balance with changes in statistical power.

Imai et al. 2008 shows an example: randomly discarding controls seemingly leads to increased balance, simply because of the reduced power.

Graphical Diagnostics

  • examine the distribution of the PS in the original and matched groups; for weighting or subclassification, plots can use dots sized in proportion to the weights;
  • for continuous variables, QQ plots can be used to compare the empirical distributions of each variable in the treated and control groups. A QQ plot compares the quantiles of a variable in the treatment group against the corresponding quantiles in the control group. If the 2 groups have identical empirical distributions, all points lie on the 45 degree line.
  • Plot of comparing standardized mean difference.

Outcome Analysis: after the matching, this stage involves regression adjustments using the matched samples.

After k:1 matching:

It is more common to simply pool all the matches into matched treated and control groups and run analyses using the groups as a whole, since PS matching does not guarantee that the individual pairs will be well matched on the full set of covariates.

Weights should be incorporated into the analysis for matching with replacement or variable ratio matching. With matching with replacement, control group individuals receive a weight that reflects the number of times they were selected as a match. With variable ratio matching, controls receive a weight that is inversely proportional to the number of controls matched to their treated individual (e.g., if 1 treated individual has 3 controls, then each of those controls receives a weight of 1/3).
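The two weighting rules can be sketched as follows. The identifiers and the dictionary layout of the matched sets are assumptions made for the example, and the variable-ratio function assumes each control appears in only one matched set.

```python
from collections import Counter

def replacement_weights(matched_control_ids):
    """Frequency weights: how many times each control was selected as a match."""
    return dict(Counter(matched_control_ids))

def variable_ratio_weights(matched_sets):
    """matched_sets maps a treated id to its list of control ids; each control gets 1/k."""
    weights = {}
    for control_ids in matched_sets.values():
        for c in control_ids:
            weights[c] = 1.0 / len(control_ids)
    return weights

# Hypothetical matching results.
w_repl = replacement_weights(["c1", "c2", "c1"])
w_ratio = variable_ratio_weights({"t1": ["c1", "c2", "c3"], "t2": ["c4"]})
```

Control `c1` was used twice under matching with replacement and so gets weight 2; under variable ratio matching the three controls of `t1` each get 1/3 and the single control of `t2` gets 1.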

After subclassification or full matching:

There is often fairly substantial imbalance remaining in each subclass, and thus it is important to do regression adjustment within each subclass. When the number of subclasses is too large (and the number of individuals within each subclass too small) to estimate separate regression models, a joint model can be fit with subclass and subclass-by-treatment indicators (fixed effects); this is especially useful for full matching. The joint model estimates a separate effect for each subclass, but assumes the relationship between the covariates X and the outcome is constant across subclasses.

\(Y_{ij}= \beta_{0j} + \beta_{1j}T_{ij}+\gamma X_{ij}+e_{ij}\), where i indexes individuals and j indexes subclasses. Here \(\beta_{1j}\) is the treatment effect for subclass j; these effects are aggregated across subclasses to obtain an overall treatment effect: \(\beta=\sum_{j=1}^{J}\frac{N_j}{N}\beta_{1j}\), where \(N_j\) is the number of individuals in subclass j and N is the total number of individuals.
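The aggregation step is a size-weighted average of the subclass-specific effects. A minimal sketch, with made-up effects and subclass sizes and a hypothetical function name:

```python
def pooled_effect(subclass_effects, subclass_sizes):
    """Overall effect: sum over subclasses of (N_j / N) * beta_1j."""
    n_total = sum(subclass_sizes)
    return sum(n_j / n_total * b_j
               for b_j, n_j in zip(subclass_effects, subclass_sizes))

beta_1j = [2.0, 4.0]    # hypothetical subclass-specific treatment effects
n_j = [30, 10]          # hypothetical subclass sizes
overall = pooled_effect(beta_1j, n_j)
```

With these numbers the overall effect is 0.75 × 2 + 0.25 × 4 = 2.5, closer to the effect in the larger subclass.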

When it is possible to obtain 100% or nearly 100% bias reduction by matching on true or estimated PS, using the estimated PS will result in more precise estimates of the average treatment effect.

missing values

Generalized boosted models do not require fully observed covariates.

5 The use of propensity score to assess the generalizability of results from randomized trials

Participants in effectiveness trials are rarely representative of the target population of interest and effects often vary for different types of people and in different contexts; which means that the results that are seen in a randomized trial may not reflect the effects that would be seen if the intervention were implemented in a different target population.

PS to assess generalizability

The PS difference between the trial sample and the population (\(\Delta_p\)) is defined as the difference in average PS between those who are in the trial and those who are not in the trial:

\(\Delta_p=\frac{1}{n} \sum_{i: S_i=1}\hat{p}_i - \frac{1}{N-n} \sum_{i: S_i=0}\hat{p}_i\).

  • If the sample is a large random sample from the population, we expect \(E(p_i|S_i=1)=E(p_i|S_i=0)\) and \(E(\Delta_p)=0\);
  • In real example, in finite samples, \(\Delta_p\) is likely to be a small positive value, reflecting small chance differences between the trial participants and the population.
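Computing \(\Delta_p\) from estimated scores is a difference of two group means. The sketch below assumes `p_hat` holds the estimated probabilities of trial membership and `s` the membership indicator; both arrays are invented for illustration.

```python
import numpy as np

def ps_difference(p_hat, s):
    """Mean estimated PS among trial members minus the mean among non-members."""
    p_hat = np.asarray(p_hat, dtype=float)
    s = np.asarray(s)
    return p_hat[s == 1].mean() - p_hat[s == 0].mean()

p_hat = [0.6, 0.4, 0.3, 0.2, 0.1]   # hypothetical estimated membership probabilities
s = [1, 1, 0, 0, 0]                 # 1 = in the trial, 0 = not in the trial
delta_p = ps_difference(p_hat, s)
```

Here the trial members average 0.5 and the non-members 0.2, so \(\Delta_p\) is about 0.3.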

Simulations (Cochran and Rubin, 1973; Rubin, 1973) showed that PS means that differ by more than 0.25 SDs indicate a large amount of extrapolation and heavy reliance on the models being used for estimation.

IPTW, full matching and Subclassification

  • IPTW: giving each individual their own weight (\(w_i=1/\hat{P_i}\)); this weighting forms a pseudo-population with characteristics that are similar to those of the target population. If no one in the population was receiving the intervention of interest, then if the weights are effective, the weighted control group outcomes should be similar to the outcomes that are observed in the target population;

  • IPTW Concern: results can be unstable, especially if there are extreme weights, and the method is more sensitive to the specification of the PS model than are other PS approaches.

  • Subclassification: typically 5-10 subclasses are used, but the method may suffer from having too few subclasses and thus insufficient bias reduction.

  • Full matching: compromise between IPTW and subclassification; at least one member of the sample and at least one member of the target population but the ratio of the sample to population members in each subclass can vary. Full matching has been shown to be optimal in terms of reducing PS differences within subclasses (Rosenbaum, 1991).

With both subclassification and full matching, the control group members in the trial are given weights that are proportional to the number of population members in their subclass. Example: sample members in a subclass with two sample members and 10 population members would receive weights proportional to 5 (10/2), whereas sample members in a subclass with 10 sample members and two population members would receive weights that are proportional to 0.2 (2/10).

6 Title: Randomization, matching, and propensity scores in the design and analysis of experimental studies with measured baseline covariates [Travis Loux, 2015]

Regression or ANCOVA can effectively reduce the standard error of the effect estimator, but they are sensitive to model misspecification.

Reducing the SE also means higher efficiency, for the following reason:

\(MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = Var(\hat{\theta}) + Bias(\hat{\theta})^2\). Under randomization the bias is 0, so the MSE equals the variance. The approximate 95% CI of \(\bar{y}\) is \(\bar{y} \pm 2 s_{y}/\sqrt{n}\); as more covariates are put in the model, more of the variation in y is explained and the residual \(s_{y}\) decreases, which means the standard error decreases and efficiency improves.

The advantage of randomization:

  • reduce conditional bias of treatment effect;
  • improve the efficiency of effect estimate (as shown above);
  • prospective about which variables are associated with the outcome.

Stratification does not work efficiently when only a few strata exist.

Propensity-constrained randomization (PCR), proposed in this paper, differs from previous methods by avoiding stratification or matching of units; instead it uses the variance of the empirical PS of all units as a measure summarizing the similarity of the treatment groups.

Any one randomization is likely to yield some imbalance, resulting in empirical PS which deviate from constant. These deviations from balance are common in smaller samples. Then we consider empirical PS \(\hat {e} = \hat {P}(T=1|X=x)\).

A constant PS means the distributions of X are equal in the treatment and control groups, and vice versa. The variance of the fitted PS is used to measure the deviation of the distributions from perfect balance.

Less parametric methods (boosting, bagging, random forests and generalized additive models) may lead to more precise PS estimation when the logit model does not fit the data well; however, they may be more computationally intensive.

Conclusion:

  • IPTW can lead to improved results even when treatment is randomized, though the errors may increase in small samples when too many covariates are incorporated;
  • Restricted randomization should be employed for a relatively small study with a propensity-adjusted analysis plan;
  • PCR is an alternative when NBP (non-bipartite) matching, which finds the best possible pairs to form treatment contrasts before assigning treatment, is infeasible;
  • PCR could yield near-constant propensity scores;
  • Combining randomization methods (NBP matching before treatment assignment and then PCR on allocations generated from the paired subjects) is another option; such a combination yields significant improvements for the unadjusted estimators, but none for the IPTW or regression estimators.

Limitations:

  • covariates must be observed for all subjects prior to randomization;
  • when the number of covariates is large, PCR as well as other restricted randomization methods may fail to improve balance;
  • Computational complexities

7 Inverse probability weighting for covariate adjustment in randomized studies [Changyu Shen, Xiaochun Li, Lingling Li 2014]

A two-stage estimation procedure based on IPTW was proposed to achieve better precision without compromising objectivity: the covariate adjustment is performed before seeing the outcome, effectively reducing the possibility of selecting a “favorable” model that yields a strong intervention effect.

Problem with ANCOVA: ANCOVA can be used for covariate adjustment, but variable and model selection can then be used to choose a model that best accentuates the estimate and statistical significance of the treatment difference.

Main features:

  • the analysts never see the outcome data and the baseline covariates in the same data set for drawing the primary conclusion of a study, effectively reducing the possibility of selecting a favorable model through examination of the relationship between the covariates and the outcome (more stringent control on what data the analysts have access to);
  • the covariates to be adjusted do not need to be specified until the time of fitting the PS;
  • the two-stage IPW estimator offers an improvement in precision without compromising objectivity.

2-stage IPW steps:

  1. fit logit for A (treatment assignment indicator) with whatever covariates and obtain the estimated PS \(p(X,\hat {\alpha})\) for each subject;
  2. fit a linear model by OLS for A-r (r= \(p(X,\hat {\alpha})\)) using the same covariate X and obtain the fitted value q for each subject;
  3. Compute IPW \(\hat {\theta}_I\) and SE as follows:

\(\hat {\theta}_I = \frac{1}{n} \sum \left(\frac{Y_iA_i}{p(X_i,\hat{\alpha})} - \frac{Y_i(1-A_i)}{1-p(X_i,\hat{\alpha})}\right)\)

\(\Gamma = V-M^2 - H\), where

\(V = \frac {1}{n} \sum[(\frac {A_i}{r} - \frac {1-A_i}{1-r})Y_i]^2\), \(M = \frac {1}{n} \sum[(\frac {A_i}{r} - \frac {1-A_i}{1-r})Y_i]\), \(H = \frac {1}{nr^4(1-r)^4}\sum[Y_i^2(A_i-r)^4q_i^4]\), r here represents PS.

\(\color{red}{\text{Question}}\):

  1. \(\hat {\theta}_I\) is IPW or sample mean difference?
  2. If \(\hat {\theta}_I\) is IPW, IPTW=1/ps + (1-T)/(1-ps)?
  3. Any covariates in the logit and the same in OLS?
  4. Why and how improve the precision? More precise PS?

8 Propensity score weighting for covariate adjustment in randomized clinical trials [Shuxi Zeng, Fan Li, Rui Wang, Fan Li]

Method proposed: overlap weighting (OW).

Strength of OW:

  • OW leads to exact mean balance of any baseline covariate included in the PS model, completely removing chance imbalance, when the PS is estimated by logit;
  • OW has the same semiparametric variance lower bound as ANCOVA and IPW for continuous outcome;
  • OW outperforms IPW in finite samples and improves the efficiency over ANCOVA and augmented IPW when the degree of treatment effect heterogeneity is moderate or when the outcome model is incorrectly specified;

The OW estimator for the ATE in RCT is

\(\hat {\tau}^{OW} = \frac{\sum (1- \hat{e_i})Z_iY_i}{\sum (1- \hat{e_i})Z_i} - \frac{\sum \hat{e_i}(1-Z_i)Y_i}{\sum \hat{e_i} (1-Z_i)}\), where

\(\hat {e_i}\) is the estimated PS from logit, \(Z_i\) is the treatment indicator (0/1), \(Y_i\) is the outcome.

The OW estimator with a logit PS model leads to exact mean balance of any predictor included in the model:

\(\frac{\sum (1- \hat{e_i})Z_iX_{ji}}{\sum (1- \hat{e_i})Z_i} - \frac{\sum \hat{e_i}(1-Z_i)X_{ji}}{\sum \hat{e_i} (1-Z_i)}=0\), for j=1,…,p
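The exact-balance property can be checked numerically. The sketch below simulates a small randomized trial, fits the logit PS model by a hand-written Newton-Raphson routine, and evaluates the overlap-weighted contrast for the outcome and for each covariate. The data-generating model, the true effect of 1, and all function names are assumptions made for the illustration.

```python
import numpy as np

def fit_propensity(X, z, n_iter=50, tol=1e-12):
    """Logit PS model with intercept, fitted by Newton-Raphson; returns fitted PS."""
    X1 = np.column_stack([np.ones(len(z)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        e = 1.0 / (1.0 + np.exp(-X1 @ beta))
        hess = X1.T @ (X1 * (e * (1 - e))[:, None])
        step = np.linalg.solve(hess, X1.T @ (z - e))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

def ow_contrast(v, z, e):
    """Overlap-weighted difference in means of v between the two arms."""
    treated = np.sum((1 - e) * z * v) / np.sum((1 - e) * z)
    control = np.sum(e * (1 - z) * v) / np.sum(e * (1 - z))
    return treated - control

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))                    # two baseline covariates
z = rng.binomial(1, 0.5, size=n)               # 1:1 randomization
y = 1.0 * z + X[:, 0] + rng.normal(size=n)     # hypothetical outcome, true effect 1

e_hat = fit_propensity(X, z)
tau_ow = ow_contrast(y, z, e_hat)                                     # OW estimate of the ATE
imbalance = [ow_contrast(X[:, j], z, e_hat) for j in range(X.shape[1])]
```

The entries of `imbalance` are zero up to numerical error. This follows from the logit score equations \(\sum (Z_i-\hat{e_i})X_{ji}=0\) and \(\sum (Z_i-\hat{e_i})=0\), which make the two weighted numerators equal and the two denominators equal.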

Results:

\(\hat {\tau}^{IPW} = \frac{\sum Z_iY_i/\hat{e_i}}{\sum Z_i/\hat{e_i}} - \frac{\sum (1-Z_i)Y_i/(1-\hat{e_i})}{\sum (1-Z_i)/(1-\hat{e_i})}\)

\(\hat {\tau}^{Unadjusted} = \frac{\sum Z_iY_i}{\sum Z_i} - \frac{\sum (1-Z_i)Y_i}{\sum (1-Z_i)}\)

  • \(\hat {\tau}^{IPW}\), \(\hat {\tau}^{LR}\) (linear regression), \(\hat {\tau}^{OW}\) are consistently more efficient than the unadjusted estimator, and the relative efficiency increases with a larger sample size;
  • When N<100, OW leads to higher efficiency compared to LR and IPW;
  • OW becomes more efficient than LR when the randomization probability deviates from 0.5.

9 Covariate adjustment in subgroup analyses of randomized clinical trials: a propensity score approach [Siyun Yang, Fan Li, Laine E Thomas, Fan Li, 2021]

Goal: develop PS weighting for covariate adjustment to improve the precision and power of subgroup analyses in RCT.

The paper extends PS weighting to subgroup analyses by fitting a logit model with pre-specified covariate-subgroup interactions. Overlap weighting then exactly balances the covariates that enter with interaction terms within each subgroup.

Results:

  • SE of the adjusted estimators are smaller than those of the unadjusted estimator;
  • PS weighting estimator is as efficient as ANCOVA and more efficient when subgroup sample size is small (N<125), and when the outcome model is misspecified;
  • for the weighting estimators, the full-interaction propensity model consistently outperforms the standard main-effect propensity model;
  • \(\color{red}{\text{for large subgroups}}\), (number of subgroups? or N for each subgroup? > 125: prop of subgroups with N > 125?) OW and IPW perform similarly, but OW is more efficient than IPW in small samples (N<125);
  • when ANCOVA model is correctly specified, weighting methods with full interaction are as efficient as ANCOVA-S and are even more efficient when paired with OW under small subgroup sample size (<125);
  • when ANCOVA is misspecified, OW-Full outperforms the ANCOVA-S;

Define a vector of R pre-specified subgroup indicators, \(S_i=(S_{i1}, S_{i2}, ..., S_{iR})\). \(S_{ir}=1\) if unit i belongs to the \(r^{th}\) subgroup, and \(\color{red}{\text{one unit can belong to multiple subgroups simultaneously}}\).

The \(r^{th}\) subgroup average treatment effect is defined as

\(\tau_{r} = E(Y(1)-Y(0)|S_r=1)\).

In RCTs, investigators wish to test whether there is heterogeneous treatment effect across subgroup levels in one-at-a-time fashion for each r. Hence, the heterogeneous treatment effect across two subgroup levels can be formalized as

\(\delta_{r}^{HTE}=E(Y(1)-Y(0)|S_r=1)-E(Y(1)-Y(0)|S_r=0)\).

Due to the relatively small subgroup sample sizes, there is a higher chance for imbalance to occur within a subgroup, and adjusting for these imbalances may improve precision and power.

ANCOVA-S Model:

\(Y_i = \beta_0 + \beta_1^T \textbf{X}_\textbf{i} + \beta_2^T \textbf{S}_\textbf{i}+ \beta_3^T Z_i + \beta_4^T \textbf{X}_\textbf{i}\textbf{S}_\textbf{i} + \beta_5^T Z_i\textbf{S}_\textbf{i}+ \beta_6^T Z_i\textbf{X}_\textbf{i} + \beta_7^T \textbf{X}_\textbf{i} \textbf{S}_\textbf{i}Z_i + \epsilon_i\),

\(\hat {Y_{i}(1)}\) and \(\hat {Y_{i}(0)}\) are the predicted values from the above model when Z=1 and Z=0.

The “ANCOVA-S” estimator for subgroup average treatment effect is

\(\hat{\tau}_r^{ANCOVA} = \frac{\sum (\hat{Y_i(1)} - \hat{Y_i(0)})S_{ir}}{\sum S_{ir}}\).

PS for subgroup analysis is

\(e(X,S)=Pr(Z=1|X,S)\)

For PS weighting, it is crucial to include covariates predictive of the outcome, the subgroup variable of interest and its interactions with covariates.

When subgroup sample size is large, OW and IPW perform similarly, but OW outperforms IPW in terms of power with small subgroups.

OW-Full (the OW estimator with the PS estimated from the full-interaction model) outperforms ANCOVA-S under small subgroup sample size when ANCOVA is misspecified, because the exact balance property of OW guarantees that all covariates are balanced within small subgroups, which translates into improved power in estimating the subgroup average treatment effect.

As the number of subgroup variables increases, the number of pairwise interactions grows quickly, leading to non-convergence in model fitting and unstable PS estimates. Variable selection should then be applied in the PS model. [Yang S, Lorenzi E, Papadogeorgou G, et al. Propensity score weighting for causal subgroup analysis.]