Estimating the causal effects of the treatments is a missing data problem, since either \(r_{i0}\) or \(r_{i1}\) is missing.
In randomized experiments, the results in the two treatment groups can be compared directly because their units are likely to be similar, whereas in nonrandomized experiments such direct comparisons may be misleading because the units exposed to one treatment generally differ systematically from the units exposed to the other treatment. Balancing scores can be used to group treated and control units so that direct comparisons are more meaningful.
In a randomized trial, the propensity score (PS) is a known function. In a nonrandomized trial, the PS is unknown but can be estimated from the observed data, for example with a logit model. The second difference is that in a randomized trial the treatment assignment z and the response (r1, r0) are known to be conditionally independent given x (treatment assignment is strongly ignorable given the baseline covariates).
b(x) being “finer” than e(x) means that there exists some function f such that e(x)=f(b(x)).
If the treatment assignment is strongly ignorable, then adjustment for a balancing score b(x) is sufficient to produce unbiased estimates of the average treatment effect.
Average treatment effect: \(E(r_1) - E(r_0)\) (1.1). The quantity \(E(r_1|z=1)-E(r_0|z=0)\) (1.2) is different from (1.1), since (1.2) involves the conditional distribution of \(r_t\) given z=t.
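The gap between (1.1) and (1.2) can be seen on a toy table of potential outcomes. This is a minimal sketch: the six units, their covariate x, and all outcome values are invented for illustration, and both potential outcomes are listed only because the table is artificial.

```python
from fractions import Fraction

# Hypothetical units: (x, z, r1, r0). In real data one of r1, r0 is missing.
# Units with x=1 are treated more often and have lower outcomes either way.
units = [
    (1, 1, 4, 2),
    (1, 1, 4, 2),
    (1, 0, 4, 2),
    (0, 1, 8, 6),
    (0, 0, 8, 6),
    (0, 0, 8, 6),
]

def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))

# (1.1): average treatment effect over all units
ate = mean(r1 for _, _, r1, _ in units) - mean(r0 for _, _, _, r0 in units)

# (1.2): observed treated mean minus observed control mean
naive = (mean(r1 for _, z, r1, _ in units if z == 1)
         - mean(r0 for _, z, _, r0 in units if z == 0))

def stratum_effect(x_value):
    """Treated-minus-control difference among units sharing the same x."""
    treated = mean(r1 for x, z, r1, _ in units if x == x_value and z == 1)
    control = mean(r0 for x, z, _, r0 in units if x == x_value and z == 0)
    return treated - control

# Comparing within levels of x, then averaging over all units
adjusted = mean(stratum_effect(x) for x, *_ in units)
```

In this table the direct comparison (1.2) gives 2/3 while the average effect (1.1) is 2; comparing within levels of x recovers 2 because assignment depends only on x here.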
Goal: evaluate the practical performance of full matching, modifying it to minimize variance as well as bias. With restrictions on the ratio of treated to controls, full matching makes use of many more observations than pair matching, and achieves far closer matches than does matching with \(k\) controls.
As compared to 1:k matching or to matching with a variable number of controls (Ming and Rosenbaum 2000), pair matching is the least flexible and the least able to make use of a large reservoir of potential controls.
Difference between optimal and nearest matching:
Nearest available (greedy): move down the list of treated subjects from top to bottom, at each step matching a treated subject to the nearest available control, which is then removed from the list of controls available at the next step. Each match is made without attention to how it affects the possibilities for later matches, so the overall total distance need not be the minimum.
Optimal matching: optimizes a global criterion (the total distance over all matches) rather than each local match.
Rosenbaum and Rubin (1985) pointed out that greedy algorithms and optimal algorithms have nearly the same performance with a large reservoir of controls; with limited controls, greedy algorithms can do much worse than optimal ones.
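A small example makes the greedy versus optimal contrast concrete. This is a sketch under stated assumptions: the covariate values are invented, the distance is a plain absolute difference, and the optimal match is found by brute force, which is only practical for tiny inputs (real implementations use network-flow or assignment algorithms).

```python
from itertools import permutations

def greedy_match(treated, controls):
    """Match each treated value, in list order, to the nearest unused control."""
    available = list(range(len(controls)))
    pairs = []
    for t_index, t_value in enumerate(treated):
        best = min(available, key=lambda c: abs(t_value - controls[c]))
        pairs.append((t_index, best))
        available.remove(best)
    return pairs

def optimal_match(treated, controls):
    """Try every pairing and keep the one with the smallest total distance."""
    best_pairs, best_total = None, float("inf")
    for chosen in permutations(range(len(controls)), len(treated)):
        total = sum(abs(t - controls[c]) for t, c in zip(treated, chosen))
        if total < best_total:
            best_pairs, best_total = list(enumerate(chosen)), total
    return best_pairs

def total_distance(pairs, treated, controls):
    return sum(abs(treated[t] - controls[c]) for t, c in pairs)

# Hypothetical covariate values (for example ages), with a small control pool
treated = [50, 60]
controls = [58, 30]

greedy_total = total_distance(greedy_match(treated, controls), treated, controls)
optimal_total = total_distance(optimal_match(treated, controls), treated, controls)
```

Greedy matching gives the first treated unit the control at 58, leaving only the control at 30 for the second treated unit (total distance 38). The optimal pairing accepts a worse first match to reach a total of 22.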
Dehejia and Wahba (1999) indicated that attempting to use most or all of the control reservoir invites sharp penalties in terms of bias.
The optimal full match uses all controls and balances every covariate, but some of its matched sets are too heavy with controls, and in others controls are quite sparse. These disparities stand behind the optimal full matching’s disappointing relative precision.
Regression adjustments assume that we know or can reliably discern patterns relating pretreatment, treatment, and response variables, and require the statistician to specify and fit a corresponding statistical model. Adjustment by stratification assumes only that treatment and control groups sufficiently alike in terms of pretreatment characteristics are comparable in terms of response to treatment; but it requires the statistician to make precise what it means for groups to be sufficiently alike prior to treatment, and it requires a method for grouping subjects into sufficiently uniform blocks.
Main conclusions:
Introduction (example): A medical device company plans to conduct an RCT to evaluate the safety and effectiveness of a device in order to seek approval for its marketing in the US, with control therapy being optimal medical management, and primary endpoint being a one-year adverse event rate. Based on the projected enrollment rate, it is anticipated that enrollment will take five years. With an additional one year of follow-up, it is going to take six years for the trial to complete. The company wants to explore the possibility of leveraging external data to speed up the trial, so that this new technology can reach patients sooner. The company thinks that such a study design may be viable because a significant amount of clinical data has already been accumulated in the EU, where the device was CE marked, and a high-quality registry has been established in the EU for patients treated with the device. This registry can potentially be used as a source of external data that can be leveraged to augment the treated arm of the RCT. To augment the control arm, a high-quality disease registry in the US may be used. A proposal is thus put forward to conduct a study consisting of an RCT with both arms supplemented by data from these two external sources, respectively.
Objective: describe an appropriate statistical procedure that can be used to design the study and subsequently analyze the data.
Suppose a prospective RCT with a treated and a control arm is actively enrolling patients, and an unforeseen event, such as the COVID-19 pandemic, occurs which stops the enrollment before it is completed. While sometimes the enrollment may restart later, this is not always possible or practical. Using the current data to test study hypotheses would result in an underpowered study as the planned sample size is not reached. In such a case, PS-integrated approaches can be used to augment one or both arms of the RCT with external data (if it is appropriate to do so) to recover the lost power, thereby salvaging the study.
Due to their flexibility and adaptability, the PS-integrated approaches are a viable statistical innovation that may be utilized in a variety of situations as a tool for the leveraging of external data to further support regulatory purposes.
Goal: This paper provides a structure for thinking about matching methods and guidance on their use, coalescing the existing research (both old and new) and providing a summary of where the literature on matching methods is now and where it should be headed.
2 stages to estimate the effect of some intervention: 1) design and 2) outcome analysis. Stage 1 uses only background information on the individuals in the study, designing the non-experimental study as a randomized experiment would be designed, without access to the outcome values. Matching methods are a key tool for Stage 1. After Stage 1 is finished, Stage 2 compares the outcomes of the treated and control individuals.
Matching methods highlight areas of the covariate distribution where there is not sufficient overlap between the treatment and control groups, such that the resulting treatment effect estimates would rely heavily on extrapolation. Selection models and regression models perform poorly in situations where there is insufficient overlap, but their standard diagnostics do not involve checking this overlap. Matching methods in part serve to make researchers aware of the quality of the resulting inferences.
Rubin (1976) showed that matching on the propensity score and Mahalanobis metric matching are Equal Percent Bias Reducing (EPBR); Rubin and Stuart (2006) showed that EPBR holds under much more general settings, in which the covariate distributions are discriminant mixtures of ellipsoidally symmetric distributions. EPBR methods reduce bias in all covariate directions (i.e., make the covariate means closer) by the same amount, ensuring that if close matches are obtained in some direction (such as the propensity score), then the matching is also reducing bias in all other directions.
When matching using propensity scores, there is little cost to including variables that are actually unassociated with treatment assignment, as they will be of little influence in the propensity score model. Including variables that are actually unassociated with the outcome can yield slight increases in variance. However, excluding a potentially important confounder can be very costly in terms of increased bias.
One strategy is to include a small set of covariates known to be related to the outcomes of interest, do the matching, and then check the balance on all of the available covariates, adding to the matching any additional variables that remain particularly unbalanced.
Variables that may have been affected by the treatment of interest should not be included in the matching process (Rosenbaum, 1984; Frangakis and Rubin, 2002; Greenland, 2003).
Matching Distance:
Exact: \(D_{ij} = 0\) if \(X_i=X_j\); \(D_{ij} = \infty\) if \(X_i\neq X_j\)
Mahalanobis: \(D_{ij}=(X_i-X_j)'\Sigma^{-1}(X_i-X_j)\)
Propensity score: \(D_{ij}=|e_i - e_j|\), where \(e_k\) is the propensity score for individual k
Linear propensity score: \(D_{ij}=|\text{logit}(e_i)-\text{logit}(e_j)|\)
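The four distances can be written as small functions. This is a minimal sketch in plain Python: the function names, the example covariate vectors, and the inverse covariance matrix `sigma_inv` are all illustrative assumptions, and the Mahalanobis function expects the inverse covariance matrix to be supplied already.

```python
import math

def exact_distance(x_i, x_j):
    """0 if the covariate vectors are identical, infinity otherwise."""
    return 0.0 if x_i == x_j else math.inf

def mahalanobis_distance(x_i, x_j, sigma_inv):
    """Quadratic form (x_i - x_j)' Sigma^{-1} (x_i - x_j)."""
    diff = [a - b for a, b in zip(x_i, x_j)]
    size = len(diff)
    return sum(diff[r] * sigma_inv[r][c] * diff[c]
               for r in range(size) for c in range(size))

def logit(p):
    return math.log(p / (1 - p))

def ps_distance(e_i, e_j):
    return abs(e_i - e_j)

def linear_ps_distance(e_i, e_j):
    return abs(logit(e_i) - logit(e_j))

# Hypothetical inputs: two covariates with variances 2 and 0.5, no covariance
sigma_inv = [[0.5, 0.0], [0.0, 2.0]]
d_exact = exact_distance((1.0, 2.0), (3.0, 1.0))
d_mahal = mahalanobis_distance((1.0, 2.0), (3.0, 1.0), sigma_inv)
d_ps = ps_distance(0.5, 0.9)
d_linear = linear_ps_distance(0.5, 0.9)
```

On these inputs the exact distance is infinite, the Mahalanobis distance is 4, the PS distance is 0.4, and the linear PS distance is log(9), about 2.2. The logit scale stretches differences between scores near 0 or 1.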
Neither the exact nor the Mahalanobis distance measure works well when X is high dimensional. Exact matching often leads to many individuals not being matched, which can result in larger bias (Rosenbaum and Rubin, 1985). Coarsened Exact Matching (Iacus et al. 2009), which transforms continuous variables into categorical ones, can be used to do exact matching on broader ranges of the variables. Mahalanobis matching can work quite well when the number of covariates is less than 8 (Rubin 1979 and Zhao 2004), but it does not perform well when the covariates are not normally distributed.
The propensity score (Rosenbaum and Rubin, 1983) summarizes all the covariates into one scalar: the probability of being treated. Two properties:
It is a balancing score: at each value of PS, the distribution of covariate X defining the PS is the same in the treated and control groups;
If the treatment assignment is ignorable given the covariates, then the treatment assignment is also ignorable given the propensity score: the difference in means in the outcome between treated and control individuals with a particular PS value is an unbiased estimate of the treatment effect at that PS value.
With propensity score estimation, the concern is not with the parameter estimates of the model, but rather with the resulting balance of the covariates (Augurzky and Schmidt, 2001). Because of this, standard concerns about collinearity do not apply. Also, because they do not use covariate balance as a criterion, model fit statistics that measure classification ability (such as the c-statistic) and stepwise selection models are not helpful for variable selection. A helpful strategy is to examine the balance of the covariates, their squares, and their interactions in the matched samples.
Simulations showed that mis-estimation of the PS is not a big issue: the treatment effect estimates are more biased when the outcome model (where a particular regression coefficient is interpreted) is misspecified than when the PS model (whose purpose is covariate balance) is misspecified. Future research should involve more systematic evaluations of propensity score estimation, perhaps through more sophisticated simulations as well as analytic work, and consideration should include how the propensity scores will be used, for example in weighting versus subclassification.
Nearest neighbor matching
1:1 matching
Estimates from 1:1 matching can have a lower standard deviation than estimates from a linear regression (Smith 1997), even though thousands of observations were discarded in the matching.
k:1 matching can lead to poor matches if there are no control individuals with a PS similar to that of a given treated individual. One strategy is to impose a caliper and only select a match within the caliper. This avoids poor matches, but it can make effects difficult to interpret if many treated individuals do not receive a match.
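The caliper trade-off can be sketched with greedy 1:1 matching on the PS. The function name, the PS values, and the caliper width of 0.10 are assumptions chosen for illustration, not values recommended by the source.

```python
def caliper_match(treated_ps, control_ps, caliper):
    """Greedy 1:1 matching without replacement.

    Returns a dict mapping each treated index to a control index,
    or to None when no unused control lies within the caliper.
    """
    available = set(range(len(control_ps)))
    matches = {}
    for t, e_t in enumerate(treated_ps):
        candidates = [c for c in available
                      if abs(e_t - control_ps[c]) <= caliper]
        if candidates:
            best = min(candidates, key=lambda c: abs(e_t - control_ps[c]))
            matches[t] = best
            available.remove(best)
        else:
            matches[t] = None  # treated unit is left unmatched
    return matches

# Hypothetical estimated propensity scores
treated_ps = [0.30, 0.55, 0.90]
control_ps = [0.28, 0.50, 0.62]
matches = caliper_match(treated_ps, control_ps, caliper=0.10)
unmatched = [t for t, c in matches.items() if c is None]
```

Here the treated unit with a PS of 0.90 has no control within 0.10 and is dropped. Without the caliper it would be paired with the control at 0.62, a poor match; with it, the estimated effect no longer describes all treated units.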
Optimal matching: minimizes a global distance measure, which matters most when there is competition for controls. If the goal is simply to find well-matched groups, simple nearest neighbor (greedy) matching may be sufficient; if the goal is well-matched pairs, then optimal matching may be preferable (Gu and Rosenbaum 1993).
Ratio Matching
Selecting the number of matches involves a bias-variance trade-off: selecting multiple controls for each treated individual will generally increase bias, since the 2nd, 3rd, and 4th closest matches are further away from the treated individual than the 1st closest match. On the other hand, utilizing multiple matches can decrease variance due to the larger matched sample size.
k:1 matching is not optimal, since it does not account for the fact that some treated individuals may have many close matches while others may have few. A more advanced form is variable ratio matching, which allows the ratio to vary (Ming and Rosenbaum, 2001).
Matching with Replacement
Subclassification, full matching and weighting use all individuals. These methods can be thought of as giving all individuals weights between 0 and 1.
Subclassification:
Rosenbaum and Rubin 1985b show that creating 5 PS subclasses removes at least 90% of the bias in the estimated treatment effect due to all of the covariates that went into the PS. With larger samples, more subclasses may be feasible and appropriate (Lunceford and Davidian 2004). More work is needed on choosing the number of subclasses: enough to get adequate bias reduction, but not so many that the within-subclass estimates become unstable.
Full matching: at least 1 treated and 1 control in each matched set.
Weighting adjustment: inverse weights in estimates of the ATE (inverse probability of treatment weighting, IPTW). The weight is defined as \(w_i = \frac{T_i}{\hat e_i}+\frac{1-T_i}{1-\hat e_i}\), where \(\hat e_i\) is the estimated PS for individual i.
An alternative weighting technique: \(w_i=T_i+ (1-T_i)\frac{\hat e_i}{1-\hat e_i}\). This weight assigns treated individuals a weight of 1.
Drawback of weighting adjustment: large variance with extreme weights.
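Both weights are one-line functions of the treatment indicator and the estimated PS. The sketch below uses invented (T, PS) pairs; the last control unit has an estimated PS of 0.99 to show how an extreme weight arises.

```python
def ate_weight(t, e_hat):
    """IPTW weight: 1/e for treated units, 1/(1-e) for controls."""
    return t / e_hat + (1 - t) / (1 - e_hat)

def odds_weight(t, e_hat):
    """Alternative weight: 1 for treated units, e/(1-e) for controls."""
    return t + (1 - t) * e_hat / (1 - e_hat)

# Hypothetical (treatment indicator, estimated PS) pairs
sample = [(1, 0.8), (1, 0.5), (0, 0.5), (0, 0.2), (0, 0.99)]

ate_weights = [ate_weight(t, e) for t, e in sample]
odds_weights = [odds_weight(t, e) for t, e in sample]
```

The control with an estimated PS of 0.99 receives a weight of roughly 100 under the first formula and roughly 99 under the second, so a single unit can dominate the weighted control mean. This is the source of the large variance noted above.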
Numerical Diagnostics
Rubin 2001 presents three balance measures:
Common hypothesis tests and p-values should NOT be used as measures of balance:
Imai et al. 2008 shows an example: randomly discarding controls seemingly leads to increased balance, simply because of the reduced power.
Graphical Diagnostics
Outcome Analysis: after the matching, this stage involves regression adjustments using the matched samples.
After k:1 matching:
It is more common to simply pool all the matches into matched treated and control groups and run analyses using the groups as a whole, since PS matching does not guarantee that the individual pairs will be well matched on the full set of covariates.
Weights should be incorporated into the analysis for matching with replacement or variable ratio matching. - matching with replacement: control group individuals receive a weight that reflects the number of times they were selected as a match; - variable ratio matching: controls receive a weight that is inversely proportional to the number of controls matched to their treated individual (e.g., if 1 treated individual has 3 controls, then each of those controls receives a weight of 1/3).
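The two weighting rules can be sketched as follows. The unit labels and the data structures (a list of chosen controls, and a dict of matched sets) are assumptions made for this example.

```python
from collections import Counter

def replacement_weights(matched_controls):
    """Weight = number of times each control was selected as a match.

    matched_controls lists the control chosen for each treated unit.
    """
    return dict(Counter(matched_controls))

def variable_ratio_weights(matched_sets):
    """Weight = 1 / (number of controls sharing the same treated unit).

    matched_sets maps each treated unit to the list of its controls.
    """
    weights = {}
    for controls in matched_sets.values():
        for c in controls:
            weights[c] = 1 / len(controls)
    return weights

# Hypothetical matches: control c1 was reused for two treated units
with_replacement = replacement_weights(["c1", "c2", "c1"])

# Hypothetical matched sets: t1 has three controls, t2 has one
variable_ratio = variable_ratio_weights({"t1": ["c1", "c2", "c3"], "t2": ["c4"]})
```

With these inputs, c1 gets weight 2 under matching with replacement, and under variable ratio matching the three controls of t1 each get 1/3 while the single control of t2 gets 1.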
After subclassification or full matching:
A fair amount of imbalance may remain within each subclass, so it is important to do regression adjustment within each subclass. When the number of subclasses is too large (and the number of individuals within each subclass too small) to estimate separate regression models, a joint model can be fit with subclass and subclass-by-treatment indicators (fixed effects). This is especially useful for full matching. The joint model estimates a separate effect for each subclass, but assumes that the relationship between the covariates X and the outcome is constant across subclasses.
\(Y_{ij}= \beta_{0j} + \beta_{1j}T_{ij}+\gamma X_{ij}+e_{ij}\), where i indexes individuals and j indexes subclasses: - \(\beta_{1j}\): treatment effect for subclass j; these effects are aggregated across subclasses to obtain an overall treatment effect: \(\beta=\sum_{j=1}^{J}\frac{N_j}{N}\beta_{1j}\), where \(N_j\) is the number of individuals in subclass j and N is the total number of individuals.
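The aggregation step weights each subclass effect by that subclass's share of the sample, \(N_j/N\). A short worked example, with subclass effects and sizes invented for illustration:

```python
def overall_effect(subclass_effects, subclass_sizes):
    """Size-weighted average of subclass-specific treatment effects."""
    total = sum(subclass_sizes)
    return sum(n_j / total * b_j
               for b_j, n_j in zip(subclass_effects, subclass_sizes))

# Hypothetical subclass effects (beta_1j) and subclass sizes (N_j)
effects = [2.0, 1.0, 4.0]
sizes = [50, 30, 20]

beta = overall_effect(effects, sizes)
simple_average = sum(effects) / len(effects)
```

Here the overall effect is 0.5*2 + 0.3*1 + 0.2*4 = 2.1, which differs from the unweighted average of about 2.33 because the largest subclass has a smaller effect than the smallest one.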
When it is possible to obtain 100% or nearly 100% bias reduction by matching on true or estimated PS, using the estimated PS will result in more precise estimates of the average treatment effect.
missing values
Generalized boosted models do not require fully observed covariates.
Participants in effectiveness trials are rarely representative of the target population of interest, and effects often vary for different types of people and in different contexts. As a result, the effects seen in a randomized trial may not reflect the effects that would be seen if the intervention were implemented in a different target population.
PS to assess generalizability
The PS difference between the trial sample and the population (\(\Delta_p\)) is defined as the difference in average PS between those who are in the trial and those who are not in the trial:
\(\Delta_p=\frac{1}{n} \sum_{i: S_i=1}\hat{p}_i - \frac{1}{N-n} \sum_{i: S_i=0}\hat{p}_i\).
Simulations showed (Cochran and Rubin, 1973; Rubin, 1973) that PS means that differ by more than 0.25 SDs indicate a large amount of extrapolation and heavy reliance on the models being used for estimation.
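A minimal sketch of computing \(\Delta_p\) and comparing it with the 0.25 SD rule of thumb follows. The PS values and trial indicators are invented, and dividing by the standard deviation of the pooled PS is an assumption made here, since the notes do not state which SD is meant.

```python
from statistics import mean, pstdev

def ps_difference(p_hat, in_trial):
    """Mean PS among trial members minus mean PS among non-members."""
    trial = [p for p, s in zip(p_hat, in_trial) if s == 1]
    rest = [p for p, s in zip(p_hat, in_trial) if s == 0]
    return mean(trial) - mean(rest)

def standardized_ps_difference(p_hat, in_trial):
    """Delta_p divided by the pooled SD of the PS (an assumed convention)."""
    return ps_difference(p_hat, in_trial) / pstdev(p_hat)

# Hypothetical estimated PS values and trial membership indicators S_i
p_hat = [0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.2, 0.1]
in_trial = [1, 1, 1, 0, 0, 0, 0, 0]

delta_p = ps_difference(p_hat, in_trial)
delta_p_in_sds = standardized_ps_difference(p_hat, in_trial)
flag_extrapolation = delta_p_in_sds > 0.25
```

In this toy data \(\Delta_p\) is 0.32, far more than 0.25 pooled SDs, so the trial sample would be flagged as quite different from the rest of the population.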
IPTW, full matching and Subclassification
IPTW: giving each individual their own weight (\(w_i=1/\hat{P_i}\)); this weighting forms a pseudo-population with characteristics that are similar to those of the target population. If no one in the population was receiving the intervention of interest, then if the weights are effective, the weighted control group outcomes should be similar to the outcomes that are observed in the target population;
IPTW Concern: results can be unstable, especially if there are extreme weights, and the method is more sensitive to the specification of the PS model than are other PS approaches.
Subclassification: typically 5-10 subclasses are used, but the method may suffer from having too few subclasses and thus insufficient bias reduction.
Full matching: a compromise between IPTW and subclassification; each subclass contains at least one member of the sample and at least one member of the target population, but the ratio of sample to population members in each subclass can vary. Full matching has been shown to be optimal in terms of reducing PS differences within subclasses (Rosenbaum, 1991).
In both subclassification and full matching, the control group members in the trial are given weights that are proportional to the number of population members per sample member in their subclass. Example: sample members in a subclass with two sample members and 10 population members would receive weights proportional to 5 (10/2), whereas sample members in a subclass with 10 sample members and two population members would receive weights that are proportional to 0.2 (2/10).
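The subclass weights in this example can be reproduced directly. The subclass labels and the outcome values below are hypothetical; only the counts (2 and 10, 10 and 2) come from the example above.

```python
def subclass_weights(counts):
    """counts maps subclass -> (n_sample, n_population).

    Returns population members per sample member for each subclass.
    """
    return {name: n_pop / n_samp for name, (n_samp, n_pop) in counts.items()}

def weighted_mean(outcomes, subclasses, weights):
    """Outcome mean with each sample member weighted by its subclass weight."""
    total = sum(weights[s] for s in subclasses)
    return sum(weights[s] * y for y, s in zip(outcomes, subclasses)) / total

counts = {"A": (2, 10), "B": (10, 2)}
weights = subclass_weights(counts)

# Hypothetical outcomes: two sample members in A, ten in B
outcomes = [1.0, 3.0] + [0.0] * 10
subclasses = ["A", "A"] + ["B"] * 10

reweighted = weighted_mean(outcomes, subclasses, weights)
unweighted = sum(outcomes) / len(outcomes)
```

The two members of subclass A carry weight 5 each and so count for 10 of the 12 weighted units, matching their subclass's share of the target population. The weighted mean is about 1.67 against an unweighted mean of about 0.33.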
Regression or ANCOVA can effectively reduce the standard error of the effect estimator, but both are sensitive to model misspecification.
Reducing the SE also means greater efficiency, for the following reason:
\(\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2\). Under randomization the bias is 0, so the MSE equals the variance. The approximate CI of \(\bar{y}\) is \(\bar{y} \pm 2 s_{y}/\sqrt{n}\); as more variables are put in the model, more of the variation in y is explained and the residual \(s_{y}\) decreases, which means that the standard error decreases (greater efficiency).
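The decomposition of the mean squared error into variance plus squared bias can be checked exactly on a small discrete distribution. The estimator values, their probabilities, and the true parameter below are invented for illustration.

```python
from fractions import Fraction

def moments(values, probs, theta):
    """Return (MSE, variance, bias) of an estimator with a discrete distribution."""
    expect = sum(p * v for v, p in zip(values, probs))
    variance = sum(p * (v - expect) ** 2 for v, p in zip(values, probs))
    bias = expect - theta
    mse = sum(p * (v - theta) ** 2 for v, p in zip(values, probs))
    return mse, variance, bias

# Hypothetical estimator taking the values 1, 2, 4 with these probabilities
values = [Fraction(1), Fraction(2), Fraction(4)]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
theta = Fraction(2)  # assumed true parameter value

mse, variance, bias = moments(values, probs, theta)
```

Here the variance is 19/16, the bias is 1/4, and the MSE is 5/4 = 19/16 + 1/16. With zero bias the MSE would equal the variance, which is why a smaller standard error translates directly into efficiency under randomization.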
The advantage of randomization:
Stratification does not work efficiently when only a few strata exist.
Propensity-constrained randomization (PCR), proposed in this paper, differs from the previous methods by avoiding stratification or matching of units; instead it uses the variance of the empirical PS of all units as a measure of similarity between the treatment groups.
Any one randomization is likely to yield some imbalance, resulting in an empirical PS that deviates from a constant. These deviations from balance are common in smaller samples. We therefore consider the empirical PS \(\hat {e} = \hat {P}(T=1|X=x)\).
A constant PS means that the distribution of X is the same in the treatment and control groups, and vice versa. The variance of the fitted PS is used to measure how far the covariate distributions deviate from perfect balance.
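The variance-of-PS measure can be illustrated in a deliberately simplified setting. With a single binary covariate, a logit model with an intercept and that covariate is saturated, so its fitted PS is just the treated share at each covariate level and no iterative fitting is needed. The covariate values and the two candidate allocations are hypothetical, and the population variance is an assumed choice.

```python
from statistics import pvariance

def empirical_ps(x, t):
    """Treated share at each level of x (saturated-logit fit for one binary x)."""
    share = {}
    for level in set(x):
        arm = [ti for xi, ti in zip(x, t) if xi == level]
        share[level] = sum(arm) / len(arm)
    return [share[xi] for xi in x]

def ps_variance(x, t):
    """Variance of the fitted PS across all units; 0 means perfect balance on x."""
    return pvariance(empirical_ps(x, t))

# Hypothetical binary covariate and two candidate randomizations
x = [1, 1, 1, 1, 0, 0, 0, 0]
balanced = [1, 1, 0, 0, 1, 1, 0, 0]
imbalanced = [1, 1, 1, 0, 1, 0, 0, 0]

var_balanced = ps_variance(x, balanced)
var_imbalanced = ps_variance(x, imbalanced)
```

The balanced allocation gives a fitted PS of 0.5 for every unit and a variance of 0. The imbalanced allocation gives fitted values of 0.75 and 0.25 and a variance of 0.0625, so it would rank as less balanced under this measure.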
Less-parametric methods (boosting, bagging, random forests and generalized additive models) may lead to more precise PS estimation when the logit model does not fit the data well; however, they may be more computationally intensive.
Conclusion:
Limitations:
A two-stage estimation procedure based on IPTW was proposed to achieve better precision without compromising objectivity: the covariate adjustment is performed before seeing the outcome, effectively reducing the possibility of selecting a “favorable” model that yields a strong intervention effect.
Problem of IPTW: ANCOVA could be used for covariate adjustment, but variable and model selection can then be used to choose a model that best accentuates the estimate and statistical significance of the treatment difference.
Main features:
2-stage IPW steps:
\(\hat {\theta}_I = \frac{1}{n} \sum \left(\frac{Y_iA_i}{p(X_i,\hat{\alpha})} - \frac{Y_i(1-A_i)}{1-p(X_i,\hat{\alpha})}\right)\)
\(\Gamma = V-M^2 - H\), where
\(V = \frac {1}{n} \sum[(\frac {A_i}{r} - \frac {1-A_i}{1-r})Y_i]^2\), \(M = \frac {1}{n} \sum[(\frac {A_i}{r} - \frac {1-A_i}{1-r})Y_i]\), \(H = \frac {1}{nr^4(1-r)^4}\sum[Y_i^2(A_i-r)^4q_i^4]\), r here represents PS.
\(\color{red}{\text{Question}}\):
Method proposed: overlap weighting (OW).
Strength of OW:
The OW estimator for the ATE in RCT is
\(\hat {\tau}^{OW} = \frac{\sum (1- \hat{e_i})Z_iY_i}{\sum (1- \hat{e_i})Z_i} - \frac{\sum \hat{e_i}(1-Z_i)Y_i}{\sum \hat{e_i} (1-Z_i)}\), where
\(\hat {e_i}\) is the estimated PS from the logit model, \(Z_i\) is the treatment indicator (0/1), and \(Y_i\) is the outcome.
The OW estimator with a logit PS model leads to exact mean balance of any predictor included in the model:
\(\frac{\sum (1- \hat{e_i})Z_iX_{ji}}{\sum (1- \hat{e_i})Z_i} - \frac{\sum \hat{e_i}(1-Z_i)X_{ji}}{\sum \hat{e_i} (1-Z_i)}=0\), for j=1,…,p
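The exact-balance property can be checked numerically in a case where the logit fit is available in closed form. With one binary covariate, a logit model with an intercept and that covariate is saturated, so the fitted PS equals the treated share at each covariate level. The data below are invented, and this simplification is an assumption of the sketch rather than part of the method.

```python
def fitted_ps(x, z):
    """Treated share at each level of x (saturated-logit fit for one binary x)."""
    share = {}
    for level in set(x):
        arm = [zi for xi, zi in zip(x, z) if xi == level]
        share[level] = sum(arm) / len(arm)
    return [share[xi] for xi in x]

def ow_weighted_means(x, z, e):
    """Overlap-weighted mean of x among treated and among control units."""
    treated_w = [(1 - ei) * zi for ei, zi in zip(e, z)]
    control_w = [ei * (1 - zi) for ei, zi in zip(e, z)]
    treated_mean = sum(w * xi for w, xi in zip(treated_w, x)) / sum(treated_w)
    control_mean = sum(w * xi for w, xi in zip(control_w, x)) / sum(control_w)
    return treated_mean, control_mean

# Hypothetical trial with chance imbalance on x: 3 of 4 treated have x=1
x = [1, 1, 1, 1, 0, 0, 0, 0]
z = [1, 1, 1, 0, 1, 0, 0, 0]

e_hat = fitted_ps(x, z)
ow_treated, ow_control = ow_weighted_means(x, z, e_hat)

raw_treated = sum(xi for xi, zi in zip(x, z) if zi == 1) / sum(z)
raw_control = sum(xi for xi, zi in zip(x, z) if zi == 0) / (len(z) - sum(z))
```

The raw means of x are 0.75 among treated and 0.25 among controls, while the overlap-weighted means are both 0.5, so the weighted difference is exactly zero.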
Results:
\(\hat {\tau}^{IPW} = \frac{\sum Z_iY_i/\hat{e_i}}{\sum Z_i/\hat{e_i}} - \frac{\sum (1-Z_i)Y_i/(1-\hat{e_i})}{\sum (1-Z_i)/(1-\hat{e_i})}\)
\(\hat {\tau}^{Unadjusted} = \frac{\sum Z_iY_i}{\sum Z_i} - \frac{\sum (1-Z_i)Y_i}{\sum (1-Z_i)}\)
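The three estimators share one form: a weighted treated mean minus a weighted control mean, differing only in the weights. The sketch below makes that explicit. The data are invented (outcome generated as 2x + z, so the true effect is 1), and the PS values are supplied directly as if they had been estimated by a logit model on x.

```python
def weighted_difference(y, z, w_treated, w_control):
    """Weighted treated mean minus weighted control mean."""
    t_num = sum(w * zi * yi for w, zi, yi in zip(w_treated, z, y))
    t_den = sum(w * zi for w, zi in zip(w_treated, z))
    c_num = sum(w * (1 - zi) * yi for w, zi, yi in zip(w_control, z, y))
    c_den = sum(w * (1 - zi) for w, zi in zip(w_control, z))
    return t_num / t_den - c_num / c_den

def tau_ow(y, z, e):
    # Treated weighted by 1 - e, controls weighted by e
    return weighted_difference(y, z, [1 - ei for ei in e], e)

def tau_ipw(y, z, e):
    # Treated weighted by 1/e, controls weighted by 1/(1 - e)
    return weighted_difference(y, z, [1 / ei for ei in e],
                               [1 / (1 - ei) for ei in e])

def tau_unadjusted(y, z):
    ones = [1.0] * len(y)
    return weighted_difference(y, z, ones, ones)

# Hypothetical data: x is prognostic and imbalanced across arms by chance
x = [1, 1, 1, 1, 0, 0, 0, 0]
z = [1, 1, 1, 0, 1, 0, 0, 0]
y = [2 * xi + zi for xi, zi in zip(x, z)]
e_hat = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]

est_ow = tau_ow(y, z, e_hat)
est_ipw = tau_ipw(y, z, e_hat)
est_unadjusted = tau_unadjusted(y, z)
```

In this toy example the unadjusted difference is 2 because the treated arm happens to contain more units with x=1, while both weighted estimators return 1. With noisy outcomes the estimators would no longer be exact, so this only illustrates how the weights remove chance imbalance.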
Goal: develop PS weighting for covariate adjustment to improve the precision and power of subgroup analyses in RCT.
Extend PS weighting to subgroup analyses by fitting a logit model with pre-specified covariate-subgroup interactions. Overlap weighting exactly balances the covariate with interaction terms in each subgroup.
Results:
Define a vector of R pre-specified subgroup indicators, \(S_i=(S_{i1}, S_{i2}, ..., S_{iR})\). \(S_{ir}=1\) if unit i belongs to the \(r^{th}\) subgroup, and \(\color{red}{\text{one unit can belong to multiple subgroups simultaneously}}\).
The \(r^{th}\) subgroup average treatment effect is defined as
\(\tau_{r} = E(Y(1)-Y(0)|S_r=1)\).
In RCTs, investigators often wish to test whether there is a heterogeneous treatment effect across subgroup levels, one subgroup variable r at a time. Hence, the heterogeneous treatment effect across two subgroup levels can be formalized as
\(\delta_{r}^{HTE}=E(Y(1)-Y(0)|S_r=1)-E(Y(1)-Y(0)|S_r=0)\).
Due to the relatively small subgroup sample sizes, there is a higher chance for imbalance to occur within a subgroup, and adjusting for these imbalances may improve precision and power.
ANCOVA-S Model:
\(Y_i = \beta_0 + \beta_1^T \textbf{X}_\textbf{i} + \beta_2^T \textbf{S}_\textbf{i}+ \beta_3^T Z_i + \beta_4^T \textbf{X}_\textbf{i}\textbf{S}_\textbf{i} + \beta_5^T Z_i\textbf{S}_\textbf{i}+ \beta_6^T Z_i\textbf{X}_\textbf{i} + \beta_7^T \textbf{X}_\textbf{i} \textbf{S}_\textbf{i}Z_i + \epsilon_i\),
\(\hat {Y_{i}(1)}\) and \(\hat {Y_{i}(0)}\) are the predicted values from the above model when Z=1 and Z=0.
The “ANCOVA-S” estimator for subgroup average treatment effect is
\(\hat{\tau}_r^{ANCOVA} = \frac{\sum (\hat{Y_i(1)} - \hat{Y_i(0)})S_{ir}}{\sum S_{ir}}\).
PS for subgroup analysis is
\(e(X,S)=Pr(Z=1|X,S)\)
For PS weighting, it is crucial to include covariates predictive of the outcome, the subgroup variable of interest and its interactions with covariates.
When subgroup sample size is large, OW and IPW perform similarly, but OW outperforms IPW in terms of power with small subgroups.
OW-Full (the OW estimator with the PS estimated from the full-interaction model) outperforms ANCOVA-S under small subgroup sample sizes when ANCOVA is misspecified, because the exact balance property of OW guarantees that all covariates are balanced within small subgroups, which translates into improved power in estimating the subgroup average treatment effect.
As the number of subgroup variables increases, the number of pairwise interactions would grow exponentially, leading to non-convergence in model fitting and unstable PS estimates. Variable selection should therefore be applied in the PS model. [Yang S, Lorenzi E, Papadogeorgou G, et al. Propensity score weighting for causal subgroup analysis.]