1 Matching Method Review

Matching is a form of selection: the pruning and weighting of units to arrive at a weighted subset of the units from the original dataset. Ideally, and if done successfully, subset selection produces a new sample in which the treatment is unassociated with the covariates, so that a comparison of the outcomes of the treatment and control groups is not confounded by the measured and balanced covariates.

There has been some debate about the importance of stratification after subset selection; while some authors have argued that, with some forms of matching, pair membership is incidental (Stuart 2008; Schafer and Kang 2008), others have argued that correctly incorporating pair membership into effect estimation can improve the quality of inferences (Austin and Small 2014; Wan 2019).

1.1 Nearest Neighbor Matching

NNM is also known as greedy matching, in the sense that each pairing occurs without reference to how other units will be or have been paired, and therefore it does not aim to optimize any overall criterion.

NNM requires a distance measure to define which control unit is closest to each treated unit. The default is the PS difference; another popular distance is the Mahalanobis distance.

MDM (Mahalanobis distance matching): a distance of zero means identical covariate values. PSM reduces the entire covariate distribution to a single dimension, which means that two units with similar PSs will not necessarily have similar covariate values.

The order in which the treated units are to be paired must be specified. By default, matching proceeds in descending order of PS, starting from the highest, which allows the units that may have the hardest time finding close matches to be matched first. Random orderings are an alternative and can be tried multiple times to find an adequate matched sample. The order does not matter when matching with replacement.
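A minimal sketch of these options using the MatchIt package (assuming it is installed); the data frame dat and the variables treat, age and bmi are hypothetical.

# Sketch: nearest neighbor matching with MatchIt (hypothetical data).
library(MatchIt)

set.seed(1)
dat <- data.frame(
  treat = rbinom(200, 1, 0.3),
  age   = rnorm(200, 50, 10),
  bmi   = rnorm(200, 27, 4)
)

# PS-based NNM, pairing treated units in descending order of PS
m_ps <- matchit(treat ~ age + bmi, data = dat,
                method = "nearest", distance = "glm",
                m.order = "largest", replace = FALSE)

# Mahalanobis-distance NNM; with replacement, the match order no longer matters
m_md <- matchit(treat ~ age + bmi, data = dat,
                method = "nearest", distance = "mahalanobis",
                replace = TRUE)

summary(m_ps)   # balance diagnostics for the matched sample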

1.2 Optimal Pair Matching

Also called optimal matching; it is optimal in the sense that it attempts to choose matches that collectively minimize an overall criterion (the sum of the absolute pair distances in the matched sample).

Optimal pair matching requires the specification of a distance measure between units.

Optimal pair matching can assign matches 1:1 or 1:k. The algorithm is based on the Hungarian algorithm. [Algorithm details, too complicated to work through fully: cover the zero elements with the minimum number of lines it is possible to cover them with.] R package for this computation: clue (solve_LSAP). R package for pair matching: optmatch (pairmatch).
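A minimal sketch with the optmatch package (assuming it is installed); the data and variable names are hypothetical.

# Sketch: optimal 1:1 pair matching with optmatch (hypothetical data).
library(optmatch)

set.seed(2)
dat <- data.frame(
  treat = rbinom(100, 1, 0.4),
  x1    = rnorm(100),
  x2    = rnorm(100)
)

# Treated-by-control distance matrix (Mahalanobis distance for the formula
# method; pass a fitted glm instead for a PS distance)
d <- match_on(treat ~ x1 + x2, data = dat)

# 1:1 optimal pair matching; controls = 2 would request 1:2 matching
pm <- pairmatch(d, controls = 1, data = dat)
summary(pm)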

Rubin (1973, 1979) compared covariance adjustment of matched and unmatched samples, finding that covariance adjustment of matched samples is more robust. In his simulation, the expected response of a control was \(f(X)\) for some function \(f(\cdot)\) and the expected response of a treated unit was \(f(X)+\tau\); that is, the treatment increased the expected response by \(\tau\) for every unit. He estimated \(\tau\) using both linear covariance adjustment of unmatched samples and linear covariance adjustment of treated-minus-control differences within matched pairs, finding that linear covariance adjustment of matched samples gave better estimates of \(\tau\) when \(f(X)\) is not linear. — matching can increase efficiency.

Matching Distance:

  • Mahalanobis distance: \(M(x_1, x_2)=(x_1-x_2)^{T}S^{-1}(x_1-x_2)\);
  • PS distance: \(P(x_1,x_2)=|e(x_1)-e(x_2)|\);
    • some very different x’s may have similar PSs;
    • the PS makes no effort to locate a control that is close on all the coordinates of x.
  • Mahalanobis distance within propensity calipers (see the sketch after this list):
    • the distance is infinite if \(e(x_1)\) and \(e(x_2)\) differ by more than the caliper width c; otherwise it is the Mahalanobis distance.
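A small sketch of the caliper distance in R; the caliper width and the argument names are illustrative.

# Sketch: Mahalanobis distance within propensity score calipers.
# x1, x2: covariate vectors; e1, e2: propensity scores; S: covariate
# covariance matrix; caliper: maximum allowed PS difference (illustrative).
md_within_caliper <- function(x1, x2, e1, e2, S, caliper = 0.1) {
  if (abs(e1 - e2) > caliper) return(Inf)   # outside the caliper: forbidden
  d <- x1 - x2
  drop(t(d) %*% solve(S) %*% d)             # Mahalanobis distance
}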

Why is greedy matching worse?

x <- matrix(c(0,1,1,100), nrow = 2)
colnames(x) <- c("T1", "T2")
rownames(x) <- c("C1", "C2")
x
##    T1  T2
## C1  0   1
## C2  1 100

For greedy matching the total distance will be 100 (T1 is paired with C1 at distance 0, forcing T2 to be paired with C2 at distance 100), while for optimal matching the total distance will be 2 (T1 with C2 and T2 with C1, each at distance 1).
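This can be checked with the clue package mentioned above; solve_LSAP finds the assignment of rows to columns that minimizes the total distance.

# Optimal assignment for the 2 x 2 distance matrix x defined above
library(clue)
opt  <- solve_LSAP(x)                     # C1 -> T2, C2 -> T1
cols <- as.integer(opt)
sum(x[cbind(seq_len(nrow(x)), cols)])     # total distance 2, versus 100 for greedy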

Determine the number of controls in matching:

1.3 Optimal Full Matching

Papers:

  • Full matching in an observational study of coaching for the SAT (Ben Hansen, 2004);
  • Using full matching to estimate causal effects in nonexperimental studies: examining the relationship between adolescent marijuana use and adult outcomes (Stuart and Green, 2008);
  • The performance of inverse probability of treatment weighting and full matching on the propensity score in the presence of model misspecification when estimating the effect of treatment on survival outcomes (Austin and Stuart, 2015).

Each subclass contains one treated unit and one or more control units, OR one control unit and one or more treated units. It is optimal in the sense that the number of subclasses and the assignment of units to subclasses are chosen to minimize the sum of the absolute within-subclass distances in the matched sample.

Optimal full matching requires the specification of a distance measure between units. Full matching can be used to estimate the ATE; it is also less sensitive to the form of the PS model, since the original PSs are used only to create the subclasses, not to form the weights directly.

Units may be weighted in a way that makes them contribute less to the sample than unweighted units would, so the effective sample size of the full-matching weighted sample may be lower than even that of 1:1 pair matching.

Effective sample size: an estimate of the size of a simple random sample that would achieve the same level of precision as the weighted sample. It is used to summarize the amount of information in the data.
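One common definition (Kish's effective sample size) can be computed directly from the weights; this is a sketch, assuming w holds the weights for one group.

# Kish's effective sample size for a vector of weights w (one common
# definition; others exist).
ess <- function(w) sum(w)^2 / sum(w^2)

w <- c(1, 1, 0.5, 0.25, 0.25, 2)   # illustrative full-matching weights
ess(w)                             # smaller than length(w) = 6 because the weights are unequal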

1.3.1 Ben Hansen, 2004

Full matching, with or without restrictions, does a better job with missing data, facilitates fully adjusted but simple comparisons of treated and control groups, and lays bare heterogeneities of treatment effect that regression analyses obscure.

Nearest neighbor matching makes each match at a given stage without attention to how it affects the possibilities for later matches. Optimal matching optimizes globally rather than locally.

Optimal matches are never worse and often better than greedy matching (nearest neighbor).

The relative precision formula is

\(R(S, \tilde{S}) = \left(\sum_{s=1}^{S}w_{s}^{2}\left(\frac{1}{n_{ts}}+\frac{1}{n_{cs}}\right)\right) \Big/ \left(\sum_{\tilde{s}=1}^{\tilde{S}} w_{\tilde{s}}^{2}\left(\frac{1}{n_{t\tilde{s}}}+\frac{1}{n_{c\tilde{s}}}\right)\right)\),

where \(n_{ts}\) and \(n_{cs}\) are the numbers of treated and control units in subclass \(s\), and \(w_s\) is the subclass weight, \(\frac{\text{number of treated in subclass } s}{\text{total number of treated}}\). On the standard-error scale (the square root of this ratio), the relative precisions for 1:1, 1:3, 1:5 and 1:7 matching are 1, 0.82, 0.77 and 0.76. Matchings with multiple controls appear more precise.
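A quick check of these numbers, assuming each matched set contains one treated unit and k controls and all subclasses receive equal weight; the square root puts the ratio on the standard-error scale.

# Relative precision of 1:k matching versus 1:1 pair matching, assuming one
# treated unit and k controls per subclass with equal subclass weights.
# Within-subclass variance is proportional to 1/n_t + 1/n_c.
rel_precision <- function(k) sqrt((1 + 1 / k) / (1 + 1 / 1))
round(sapply(c(1, 3, 5, 7), rel_precision), 2)   # 1.00 0.82 0.77 0.76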

In general, if controls are K times as numerous as treated units, then adjustment using a 1:k matching amounts to no adjustment at all.

Within matched sets, the number of treated subjects divided by the number of controls is restricted to range from about half up to about twice what that ratio is in the sample as a whole. For example, with 4 treated and 5 controls in the sample, the allowed treated-to-control ratios range from about 1:2.5 (5/4 × 2 = 2.5 controls per treated) up to 1.6:1 (4/5 × 2 = 1.6 treated per control).
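A sketch of imposing this kind of restriction with optmatch::fullmatch (hypothetical data); the bounds used here are illustrative rather than the exact half-to-double rule.

# Sketch: optimal full matching with restricted set sizes (optmatch).
library(optmatch)

set.seed(3)
dat <- data.frame(
  treat = rbinom(300, 1, 0.3),
  x1    = rnorm(300),
  x2    = rnorm(300)
)

# PS distance from a logistic regression model
d <- match_on(glm(treat ~ x1 + x2, data = dat, family = binomial), data = dat)

# Bound the number of controls per treated unit in each matched set
# (roughly half to about twice the overall control:treated ratio of ~2.3 here)
fm <- fullmatch(d, min.controls = 1, max.controls = 4, data = dat)
summary(fm)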

Unlike both pair matching and ANCOVA, full matching’s estimate and SE do not assume treatment effects to be the same across units; they average estimates of individual treatment effects that can be quite different.

1.3.2 Stuart, 2008

Full matching has been shown to be particularly effective at reducing bias due to observed confounding variables.

The benefits of propensity score matching:

  • reduced bias in the estimation of causal effects using nonexperimental data, partly through reduced reliance on the outcome model itself (and hence on its assumptions, such as normality or linearity);
  • intuitive and easy explanations;
  • diagnostics are easy to understand and implement.

In non-experimental studies, researchers must assume that there are no unobserved differences between the treatment group and the comparison group after conditioning on the observed covariates; this is known as unconfounded treatment assignment, no hidden bias, or no unobserved confounding.

Rosenbaum and Rubin (1983b) indicated that, if the treatment assignment is independent of the potential outcomes given the full set of covariates, then it is also independent of the potential outcomes given the propensity score - this implies that the benefits of matching on all covariates individually are also attained when matching on the PS.

One of the drawbacks of k:1 nearest neighbor matching is that many potential comparison individuals may be discarded and not used in the analysis.

Drawbacks of subclassification:

  • some differences between treated and control units within each subclass may still exist, which can lead to substantial bias;
  • the difficulty of determining the number of subclasses;
  • comparing the balance on each covariate across different numbers of subclasses can be time consuming.

Simply including the PS in a regression model of the outcome, without discarding or down-weighting individuals who are dissimilar on background characteristics, is the least ideal use of the PS (it remains vulnerable to violations of model assumptions such as linearity). Simply replacing all the covariates with the PS in the outcome regression model may be worse than including the individual covariates in the model and not using the PS at all, as the PS is not designed for reducing dimensions in that way.

Full matching is a compromise between k:1 matching and subclassification. Full matching minimizes the sum of the distances between all pairs of treated and comparison individuals within each matched set, across all matched sets, which is similar to a global distance measure. One problem with full matching is the widely varying ratios of treated to comparison individuals, which can lead to a large variance of the resulting effect estimate.

One way to deal with these ratios is to limit the treated:control ratio within matched sets to be no less than half and no more than double what it was in the original data set (for example, with 1 treated per 5 controls originally, the allowed ratio varies from 2:5 to 1:10). Another way is to discard control units with PSs lower than the lowest PS in the treatment group (and discard treated units with PSs higher than the highest PS in the control group).
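A base-R sketch of that common-support rule (hypothetical data; ps is an estimated propensity score).

# Sketch: discard units outside the region of common PS support.
set.seed(4)
dat <- data.frame(treat = rbinom(500, 1, 0.4), x = rnorm(500))
dat$ps <- fitted(glm(treat ~ x, data = dat, family = binomial))

lo <- min(dat$ps[dat$treat == 1])   # lowest PS among the treated
hi <- max(dat$ps[dat$treat == 0])   # highest PS among the controls

# Keep controls at or above the lowest treated PS and treated units at or
# below the highest control PS, then match on the retained sample.
keep   <- (dat$treat == 1 & dat$ps <= hi) | (dat$treat == 0 & dat$ps >= lo)
dat_cs <- dat[keep, ]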

Full matching has the potential to produce matched samples with very good balance and thus may often be a good choice, particularly if there is a desire to use data on all available individuals.

1.3.3 Austin and Stuart, 2015

When the PS model was correctly specified, bias tended to be lower for full matching than for IPTW.

ATE: the average effect of treatment in an entire population or sample when that entire population or sample is moved from control to treated. It may be of interest if one can imagine applying the treatment to all eligible subjects.

ATT: the average effect of treatment in those subjects who were ultimately treated; one would not impose the treatment on all eligible subjects.

Four ways the PS can be used to estimate the effect of treatment in observational studies:

  • matching
  • IPTW
  • stratification
  • covariate adjustment

Full matching could be used to estimate both ATE and ATT.

Full matching resulted in estimates with lower MSE than did IPTW. Both full matching with a caliper restriction and IPTW in the restricted sample resulted in estimates with substantially lower MSE than the estimates obtained using conventional full matching or IPTW.

Dealing with extreme IPTW weights:

  • use of standardized weights, trimmed weights or truncated weights (see the sketch after this list);
  • non-parametric estimation of the PS model, e.g., random forests or generalized boosted models.
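A minimal sketch of weight truncation; the percentile cutoffs are illustrative.

# Sketch: truncate extreme IPTW weights at the 1st and 99th percentiles.
truncate_weights <- function(w, lower = 0.01, upper = 0.99) {
  q <- quantile(w, c(lower, upper))
  pmin(pmax(w, q[1]), q[2])
}

# e.g., ATE weights from a propensity score ps and treatment indicator treat:
# w       <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))
# w_trunc <- truncate_weights(w)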

1.4 Genetic Matching

Paper:

  • Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies (Alexis Diamond and Jasjeet S. Sekhon, 2013)

1.4.1 Diamond and Sekhon, 2013

Genetic matching: a method of multivariate matching that uses an evolutionary search algorithm to determine the weight each covariate is given.

Matching on a correctly specified PS will asymptotically balance the observed covariates, and it will asymptotically remove the bias conditional on such covariates. A misspecified PS model may increase the imbalance of some observed variables post-matching, especially if the covariates do not have normal distributions.

Genetic matching (GenMatch) eliminates the need to manually and iteratively check the PS; GenMatch uses a search algorithm to iteratively check and improve covariate balance, and it is a generalization of PS and Mahalanobis distance matching.

If the distributions of the observed confounders are not similar after matching on an estimated propensity score, the PS model must be misspecified or the sample size is too small for the PS to remove the conditional bias.

Mahalanobis distance (MD): a scalar quantity that measures the multivariate distance between individuals in different groups.

\(MD(X_i, X_j) = \sqrt{(X_i-X_j)^TS^{-1}(X_i-X_j)}\), where S is the sample covariance matrix of X.

MD does not perform well when covariates have non-ellipsoidal distributions.

Ellipsoidal distribution: a broad family of probability distributions that generalize the multivariate normal distribution. Intuitively, in the simplified two- or three-dimensional cases, the joint distribution forms an ellipse or an ellipsoid, respectively.

Due to sampling variation and non-exact matching, \(T \perp\!\!\!\!\perp X|\pi(x)\) may not hold after matching on the propensity score. Rosenbaum and Rubin (1985) therefore recommend that, in addition to PS matching, one should match on the individual covariates by minimizing the MD of X to obtain balance on X. They argue that the propensity score should be included among the covariates X, or that one may first match on the PS and then match based on the MD within PS strata.

GenMatch algorithm searches a range of distance metrics to find the particular measure that optimizes postmatching covariate balance. The algorithm weights each variable according to its relative importance for achieving the best overall balance.

\(GMD(X_i, X_j, W) = \sqrt{(X_i-X_j)^T(S^{-1/2})^TWS^{-1/2}(X_i-X_j)}\), where W is a diagonal matrix of weights and \(S^{-1/2}\) is obtained from the Cholesky decomposition of the covariance matrix (\(S=LL^T\), with L a lower triangular matrix); when W is the identity, the GMD reduces to the Mahalanobis distance.

X in the above equation can be replaced with Z, where Z is a matrix consisting of both the PS and the underlying covariates X.
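A small sketch of this distance in R (variable names are illustrative); with W equal to the identity it reproduces the ordinary Mahalanobis distance.

# Sketch: generalized Mahalanobis distance with a diagonal weight matrix W.
gmd <- function(xi, xj, W, S) {
  S_inv_half <- chol(solve(S))      # a square root of S^{-1}
  d <- S_inv_half %*% (xi - xj)
  sqrt(drop(t(d) %*% W %*% d))
}

# With W = identity, gmd() matches the usual Mahalanobis distance.
set.seed(5)
X <- matrix(rnorm(200), ncol = 2)
S <- cov(X)
gmd(X[1, ], X[2, ], W = diag(2), S = S)
sqrt(mahalanobis(X[1, ], X[2, ], S))   # same value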

If neither minimizing the MD nor matching on the propensity score minimizes the loss function, the algorithm will search for improved metrics that optimize balance.

In GenMatch, the default loss function requires the algorithm to minimize overall imbalance by minimizing the largest individual discrepancy, based on p-values from KS tests and paired t-tests for all variables being matched on.

Genetic Matching algorithm:

  • Step 1: set W to initial value;
  • Step 2: create a generation of candidate W’s of size pop.size (the population size of the genetic search; see the sketch after this list);
  • Step 3: do GMD matching for each W in generation;
  • Step 4: compute loss for each matched sample;
  • Step 5: identify sample with minimum loss;
  • Step 6: get W from the sample with minimum loss;
  • Step 7: check whether the stopping criterion on generations is satisfied; if not, go to Step 2;
  • Step 8: do GMD matching with optimal W.
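A minimal sketch with the Matching package (hypothetical simulated data); pop.size controls how many candidate weight matrices are evaluated per generation, so larger values search more thoroughly at greater computational cost.

# Sketch: genetic matching with the Matching package (hypothetical data).
library(Matching)

set.seed(6)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
tr <- rbinom(n, 1, plogis(0.5 * x1 - 0.5 * x2))
y  <- x1 + x2 + tr + rnorm(n)

ps <- fitted(glm(tr ~ x1 + x2, family = binomial))
X  <- cbind(ps, x1, x2)          # include the PS among the matching covariates

# Steps 1-7: search for the covariate weights W that optimize balance
gen <- GenMatch(Tr = tr, X = X, BalanceMatrix = cbind(x1, x2),
                estimand = "ATT", pop.size = 50, print.level = 0)

# Step 8: match with the optimal W, then check balance
m <- Match(Y = y, Tr = tr, X = X, Weight.matrix = gen, estimand = "ATT")
MatchBalance(tr ~ x1 + x2, match.out = m, nboots = 100)
summary(m)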

Strengths of GenMatch:

  • directly optimizes covariate balance;
  • by using an automated process to search the data for the best matches, GenMatch is able to obtain better levels of balance without requiring the analyst to correctly specify the PS.

Paper: Multivariate and propensity score matching software with automated balance optimization: the Matching package for R, Jasjeet S. Sekhon

Genetic matching is a method that finds the set of matches minimizing the discrepancy between the distributions of potential confounders in the treated and control groups; that is, covariate balance is maximized.

R package: Matching

A significant shortcoming of common matching methods such as Mahalanobis distance and PS matching is that they may make balance worse across measured potential confounders. Even if the covariates are distributed ellipsoidally (e.g., normally), these methods can make balance worse because in a given finite sample there may be departures from an ellipsoidal distribution.

PS matching has good theoretical properties iff the true PS model is known and the sample size is large.

Genetic matching is non-parametric and does not depend on knowing or estimating the PS, but the method is improved when a PS is incorporated.

A causal effect is defined as the difference between an observed outcome and its counterfactual.

ATE:

\(\tau = E(Y_{i1})-E(Y_{i0})\): the expected outcome if all the units were treated minus the expected outcome if all the units were controlled.

ATT:

\(\tau = E(Y_{i1}|T_i=1)-E(Y_{i0}|T_i=1)\): among the treated units, the expected outcome under treatment minus the expected outcome those same units would have had under control.

It is useful to combine the PS with Mahalanobis distance matching since PS matching is particularly good at minimizing the discrepancy along the PS and Mahalanobis distance is good at minimizing the distance between individual coordinates of X.

If one has a good PS model, one should include the PS as one of the covariates in GenMatch.

For a given set of matches resulting from a given W, the loss function is defined as the minimum p value observed across a series of standardized statistics.

For the genetic matching code, see the “genetic matching code” in Box research/matching method/genetic matching folder.

1.5 Exact Matching

Exact matching creates subclasses based on unique combinations of covariate values and assigns each unit to its corresponding subclass, so that only units with identical covariate values are placed into the same subclass.

Weaknesses:

  • few units may remain after matching, so precision is lost;
  • ineffective with continuous variables.

1.6 Coarsened Exact Matching

CEM first coarsens the covariates by creating bins and then performs exact matching on the coarsened versions of the covariates. The degree and method of coarsening can be controlled by the user to manage the trade-off between exact and approximate balancing.

Like exact matching, CEM is susceptible to the curse of dimensionality, making it a less viable solution with many covariates, especially with few units.
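A minimal base-R sketch of the idea (the cutpoints and variable names are illustrative); the cem and MatchIt packages provide full implementations.

# Sketch: coarsen covariates, then exact-match on the coarsened versions.
set.seed(7)
dat <- data.frame(
  treat = rbinom(300, 1, 0.4),
  age   = rnorm(300, 50, 12),
  bmi   = rnorm(300, 27, 5)
)

# Coarsen: bin the continuous covariates (cutpoints chosen by the analyst)
dat$age_c <- cut(dat$age, breaks = c(-Inf, 40, 55, 70, Inf))
dat$bmi_c <- cut(dat$bmi, breaks = c(-Inf, 25, 30, Inf))

# Exact matching on the coarsened covariates: keep only the strata that
# contain both treated and control units
dat$stratum <- interaction(dat$age_c, dat$bmi_c, drop = TRUE)
ok      <- with(dat, tapply(treat, stratum, function(t) any(t == 1) && any(t == 0)))
matched <- dat[dat$stratum %in% names(ok)[ok], ]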

Paper: Causal inference without balance checking: coarsened exact matching

The goal of matching is to prune observations from the data so that the remaining data have balance between the treated and control groups, meaning the empirical distributions of the covariates in the two groups are more similar. Exactly balanced data mean that controlling further for X is unnecessary, so a simple difference in means on the matched data can estimate the causal effect. Model dependence and statistical bias are much bigger problems than large variance. For most matching methods, it is easy to find a matching solution that improves balance between the treated and control groups, but the result often leaves balance worse for some other variables at the same time.

Propensity score and Mahalanobis matching are members of the class of matching methods known as “equal percent bias reducing” (EPBR), which does not guarantee imbalance reduction in any given data set; the property only holds on average across samples. A single use of these techniques can increase imbalance and model dependence by any amount.

CEM dominates commonly used existing matching methods in its ability to reduce imbalance, model dependence, estimation error, bias, variance, MSE and other criteria.

If one-to-one exact matching is used, then the difference in means of Y between the treated and control groups provides a fully nonparametric estimator of the causal effect. When the treated and controls do not match exactly, the estimator will necessarily incorporate some modeling assumptions designed to span the remaining differences, and so the results will be model dependent to some degree.

When one or more treated units have no reasonable matches among the set of available controls:

  • Option 1: remove those treated units;
  • Option 2: create control units from a given model;
  • Option 3: apply PS or Mahalanobis matching with calipers.

The purpose of matching is to reduce model dependence, and so it does not make sense to assume that the analysis model is correct. Matching as much of the entire empirical distribution as possible is the goal.

Multivariate Imbalance Measure:

\(L_1 = \frac{1}{2}\sum_{l_1,l_2,...l_k}|f_{l_1...l_k}-g_{l_1...l_k}|\),

where f and g are the relative empirical frequency distributions for the treated and control groups, computed over the cells defined by the coarsened covariates \(l_1, ..., l_k\). For example, suppose a single variable X1 is split into four bins, with treated relative frequencies 0.2, 0.3, 0.4 and 0.1 and control relative frequencies 0.1, 0.2, 0.2 and 0.5. Then L1 = 1/2(0.1+0.1+0.2+0.4) = 0.4, which means the two histograms overlap by 60%. L1 = 1 means no overlap at all and L1 = 0 means complete overlap.
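The worked example can be checked directly in R.

# L1 imbalance for the one-variable example above
f  <- c(0.2, 0.3, 0.4, 0.1)   # treated relative frequencies per bin
g  <- c(0.1, 0.2, 0.2, 0.5)   # control relative frequencies per bin
L1 <- sum(abs(f - g)) / 2
L1                            # 0.4, i.e., the histograms overlap by 60%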

Question: For each variable, there is a causal effect? For example, age was used to select treated units and control units by CEM; BMI will be also used to estimate causal effect with different treated and control units?

Paper: cem: software for coarsened exact matching

Matching is not a method of estimation: it is a way to preprocess a data set so that estimation of the sample average treatment effect on the treated (SATT) based on the matched data set will be less “model-dependent”.

See the example here: slu research folder - matching method - coarsened exact matching.

1.7 Subclassification

A form of coarsened exact matching. The bins are based on specified quantiles of the PS distribution either in the treated group, control group or overall, depending on the desired estimand.

The binning of the PS is based on dividing its distribution into approximately equally sized bins (e.g., at quintiles).
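A base-R sketch of subclassification on PS quintiles (hypothetical data; the weighting here targets the ATE).

# Sketch: subclassification on propensity score quintiles.
set.seed(8)
dat <- data.frame(x = rnorm(1000))
dat$treat <- rbinom(1000, 1, plogis(-0.5 + 0.8 * dat$x))
dat$y     <- 1 + dat$x + 0.5 * dat$treat + rnorm(1000)
dat$ps    <- fitted(glm(treat ~ x, data = dat, family = binomial))

# Cut the PS distribution into 5 approximately equal-sized bins
dat$subclass <- cut(dat$ps,
                    breaks = quantile(dat$ps, probs = seq(0, 1, 0.2)),
                    include.lowest = TRUE)

# Within-subclass differences in means, combined with subclass-size weights
by_sub <- split(dat, dat$subclass)
diffs  <- sapply(by_sub, function(d) mean(d$y[d$treat == 1]) - mean(d$y[d$treat == 0]))
wts    <- sapply(by_sub, nrow) / nrow(dat)
sum(wts * diffs)   # subclassification estimate of the treatment effect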

Paper: Propensity score matching and subclassification in observational studies with multi-level treatments [Shu Yang]

2 Paper 1

Title: Improving propensity score weighting using machine learning (within cite 4) [Brian K. Lee, Justin Lessler, Elizabeth A. Stuart, 5/10/2022]

This paper discusses PS estimation using logistic regression (logit), CART, pruned CART, and the ensemble methods of bagged CART, random forest, and boosted CART.

Under conditions of both moderate non-additivity and moderate non-linearity, logit had the poorest performance, whereas ensemble methods provided substantially better bias reduction and more consistent 95% CI coverage. The results suggest that ensemble methods, especially boosted CART, should be used for PS weighting.

CART advantages:

  • ability to handle missing data;
  • insensitive to outliers and monotonic transformations of variables;
  • interactions and non-linearities can be modelled as a result of splits.

CART disadvantages:

  • difficulty in modeling smooth functions and main effects (main effects: the effects of the baseline covariates themselves, with no interactions);
  • sensitive to overfitting (the tree can be pruned to a simpler one, which is less sensitive to noise and generalizes better to new data; see the sketch after this list).
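A small sketch of pruning with rpart (hypothetical data); the tree is pruned back at the complexity parameter with the smallest cross-validated error.

# Sketch: fit a classification tree for the treatment and prune it (rpart).
library(rpart)

set.seed(9)
dat <- data.frame(
  treat = factor(rbinom(500, 1, 0.4)),
  x1 = rnorm(500), x2 = rnorm(500), x3 = rnorm(500)
)

full <- rpart(treat ~ x1 + x2 + x3, data = dat, method = "class",
              control = rpart.control(cp = 0.001))

# Prune at the complexity parameter with the smallest cross-validated error
best_cp <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]
pruned  <- prune(full, cp = best_cp)

# Predicted probabilities of treatment from the pruned tree = propensity scores
ps <- predict(pruned, type = "prob")[, "1"]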

Setoguchi et al. found that neural networks produced the least-biased estimates in many scenarios. (Read: Evaluating uses of data mining techniques in PS estimation: a simulation study — Paper 2.)

Boosted CART estimates the PS using a linear combination of multiple CARTs (each a piecewise-constant function). To reduce the prediction error, each successive CART is estimated from a random subsample of the data. Also, a shrinkage coefficient that down-weights each additional CART helps to prevent overfitting. And the use of piecewise constants has the effect of flattening the estimated PS at the extreme values of the predictors.
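A sketch of boosted-CART PS estimation using the gbm package directly (the twang package wraps this with PS-specific stopping rules); the tuning values and data are illustrative.

# Sketch: boosted CART propensity scores with gbm (hypothetical data).
library(gbm)

set.seed(10)
n   <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rbinom(n, 1, 0.5))
dat$treat <- rbinom(n, 1, plogis(-0.3 + 0.6 * dat$x1 - 0.4 * dat$x2 * dat$x3))

fit <- gbm(treat ~ x1 + x2 + x3, data = dat,
           distribution = "bernoulli",
           n.trees = 3000,            # number of successive CARTs
           interaction.depth = 3,     # interactions arise from the tree splits
           shrinkage = 0.01,          # down-weights each additional CART
           bag.fraction = 0.5,        # random subsample for each CART
           cv.folds = 5, verbose = FALSE)

best <- gbm.perf(fit, method = "cv", plot.it = FALSE)   # number of trees to use
dat$ps <- predict(fit, newdata = dat, n.trees = best, type = "response")

# e.g., ATT weights: 1 for treated units, ps / (1 - ps) for controls
dat$w <- ifelse(dat$treat == 1, 1, dat$ps / (1 - dat$ps))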

Conclusions:

  • Logit regression PS estimates overestimated the true effect with increasing frequency as non-additivity and non-linearity increased;
  • Boosted CART’s superior CI coverage did not come at the expense of relatively large standard errors;
  • For all methods, covariate balance increased as the sample size increased.

Future Work

  • Logit with interaction terms?
  • Logit with splines (rcs)?
  • Random forests sometimes produced large propensity score weights; could the algorithm be calibrated to reduce the likelihood of extreme weights?
  • Machine learning methods are traditionally applied to larger data sets; this paper mentioned that ML methods performed well with N = 500. How about N = 100? What about more covariates than observations?

3 Paper 2

Title: Evaluating uses of data mining techniques in PS estimation: a simulation study (within cite 4) [Soko Setoguchi, Sebastian Schneeweiss,.. 5/10/2022]

Objective: in what situations do analyses based on various types of exposure PS models using recursive partitioning (RP) and neural networks (NN) produce unbiased and efficient results?

Simulations:

  • n = 2000;
  • binary outcome and 10 binary/continuous covariates, with 7 scenarios of non-linear or non-additive associations between exposure and covariates;
  • models: logit (all main effects), RP1 (without pruning), RP2 (with pruning) and NN;
  • c-statistic, SE and bias of the exposure-effect estimates were calculated.

c-statistic (concordance): in clinical studies, the c-statistic gives the probability that a randomly selected patient who experienced an event (e.g., a disease or condition) had a higher risk score than a patient who had not experienced the event. It is equal to the area under the receiver operating characteristic (ROC) curve and ranges from 0.5 to 1. It is used for measuring goodness of fit for binary outcomes in logistic regression models (see the sketch after the list below).

  • A value below 0.5 indicates a very poor model.
  • A value of 0.5 means that the model is no better at predicting an outcome than random chance.
  • Values over 0.7 indicate a good model.
  • Values over 0.8 indicate a strong model.
  • A value of 1 means that the model perfectly predicts those group members who will experience a certain outcome and those who will not.
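A small base-R sketch computing the c-statistic from predicted probabilities via the rank (Mann-Whitney) formulation; the data are illustrative.

# c-statistic (AUC) via the rank / Mann-Whitney formulation: the proportion of
# (event, non-event) pairs in which the event has the higher predicted risk
# (ties counted as 1/2).
cstat <- function(pred, outcome) {
  r  <- rank(pred)
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Example: c-statistic of a fitted logistic PS model (hypothetical data)
set.seed(11)
d <- data.frame(x = rnorm(300))
d$treat <- rbinom(300, 1, plogis(d$x))
fit <- glm(treat ~ x, data = d, family = binomial)
cstat(fitted(fit), d$treat)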

Results:

  • SEs of the effect estimates were greater in models with a high C, since a PS model with a high C creates less overlap in the PSs and results in a smaller matched dataset;
  • NN tended to produce the least biased estimates under non-additivity and non-linearity, whether the dataset was small or large; however, this tendency of NN was not enhanced in the simulation with only confounders;
  • A PS model with a higher C might result in lower precision of the effect estimates; the validity and appropriateness of a PS model should not be judged by the size of C;
  • Misspecification of the PS model introduces less bias than does misspecification of a multivariate logit outcome model;
  • NN has the ability to correctly detect the associations between the exposure and the covariates.

4 Simulations

See another rmd file [r publish].