A Practical Guide to Propensity Score Designs in R: Exploring MatchIt and WeightIt on the Lalonde Dataset
Author
Dinesh Kumar
Published
November 23, 2025
1. Introduction
In randomized trials, treatment allocation is determined by design, so treated and control groups are comparable at baseline on average. In observational settings, however, treatment is driven by clinical judgment, access, socioeconomic factors, or patient characteristics. As a result, the treated and control groups often differ even before treatment begins.
If we evaluate outcomes without addressing these differences, the analysis does not reflect a true treatment comparison. The purpose of propensity score designs is to first make the treated and control populations comparable, and only after that, study the outcome.
This tutorial focuses entirely on the design stage and compares multiple methods from MatchIt and WeightIt. We use the well-known Lalonde dataset, which contains information on a job-training program. The perspective here is practical: rather than recommending a single best method, we show how different methods alter covariate balance and sample structure.
The outcome analysis at the end is minimal by design, because the primary goal is to understand how each design handles confounding before modeling.
Choosing the right design based on the estimand
The target population (estimand) should drive the selection of the propensity score method.
The table below summarizes the commonly used estimands and the recommended designs.
| Estimand | Scientific Question | Target Population | Recommended Methods | Notes |
|---|---|---|---|---|
| ATT (Average Treatment Effect on the Treated) | What is the treatment effect among those who actually received the treatment? | Treated individuals | Nearest neighbor matching, matching weights | Mimics a matched cohort. May discard controls when no close matches exist. |
| ATE (Average Treatment Effect) | What would happen if everyone in the population were treated vs. untreated? | Full study population | IPTW via logistic regression (PS), IPTW via GBM, Entropy balancing, CBPS weighting | Uses full dataset. Sensitive to extreme PS values and weight variability; needs diagnostics. |
| ATO (Average Treatment Effect in the Overlap Population) | What is the treatment effect among patients who realistically could have received either treatment? | Region of good overlap | Overlap weighting | Produces highly stable estimates. Often the most robust when PS distributions do not overlap well. |
The estimand determines the scientific meaning of the treatment effect.
ATT focuses on the patients who received treatment, ATE focuses on the entire population, and ATO focuses on the subgroup where the treatment decision could plausibly have gone either way. The method should always be selected to match the estimand rather than the other way around.
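To make the link between estimand and design concrete, the standard propensity score weights for each estimand can be written out by hand. This is a minimal sketch, not a replacement for the package functions: it assumes the copy of `lalonde` shipped with MatchIt and an illustrative covariate set chosen here, and the object names (`ps_fit`, `w_ate`, `w_att`, `w_ato`) are hypothetical.

```r
library(MatchIt)  # used here only for its copy of the lalonde data
data("lalonde", package = "MatchIt")

# Assumed covariate set for illustration
ps_fit <- glm(treat ~ age + educ + race + married + nodegree + re74 + re75,
              data = lalonde, family = binomial())

e <- fitted(ps_fit)   # estimated propensity score P(treat = 1 | X)
z <- lalonde$treat    # treatment indicator (1 = treated, 0 = control)

# ATE: every unit is weighted up to represent the full population
w_ate <- ifelse(z == 1, 1 / e, 1 / (1 - e))

# ATT: treated units keep weight 1, controls are reweighted to resemble them
w_att <- ifelse(z == 1, 1, e / (1 - e))

# ATO: each unit is weighted by the probability of receiving the opposite treatment
w_ato <- ifelse(z == 1, 1 - e, e)

# Compare the spread of the weights across estimands
sapply(list(ATE = w_ate, ATT = w_att, ATO = w_ato), range)
```

The overlap weights are bounded between 0 and 1 by construction, which is why they cannot produce the extreme values that inverse probability weights can.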
Selecting the appropriate method: why and when to use each design
The table below summarizes the practical motivations and ideal use-cases for commonly used matching and weighting techniques.
| Method | Why use it | When it is most appropriate | Considerations |
|---|---|---|---|
| Nearest Neighbor Matching | Creates a sample that closely resembles a randomized experiment by pairing similar treated and untreated individuals | When the clinical or analytical team prefers a “matched cohort” framework and the sample size is sufficiently large for good matches | Loss of sample if good matches are unavailable; results apply to the treated population (ATT) |
| Optimal Matching | Finds globally optimal pairs rather than local greedy matches | When nearest neighbor matching results in poor balance or inefficient pairs | Slightly more complex; still susceptible to sample loss |
| Full Matching | Uses sets of multiple treated and control individuals to improve efficiency and preserve sample size | When maintaining sample size is important while still achieving good covariate balance | Produces weights rather than strict pairs; more complex to describe |
| Subclassification | Stratifies subjects into strata based on propensity score and compares within each stratum | When a simple and interpretable design is preferred, often as an exploratory analysis | Balance may not be perfect; strata must contain both treatment groups |
| Exact / Coarsened Exact Matching | Prevents unrealistic comparisons by forcing treated and untreated subjects to match within predefined covariate categories | When key categorical covariates should not be compared across levels (for example, race, disease stage) | May drop large numbers of participants when covariates are many or finely stratified |
| Mahalanobis Matching | Matches based on multivariate distance between covariates rather than the PS | When covariates are few and continuous and treatment assignment is not too imbalanced | Not scalable when many covariates or categorical variables exist |
| IPTW (Logistic Regression PS) | Creates a pseudo-population where treatment assignment is independent of covariates | When full sample retention and population-level (ATE) estimates are desired | Requires diagnostics for extreme weights; model misspecification can harm balance |
| Boosted IPTW (GBM) | Captures nonlinear and higher-order relationships automatically | When logistic regression weighting does not yield acceptable balance | More computationally intensive; requires tuning |
| Entropy Balancing | Directly balances covariate moments without explicitly modeling treatment | When strict covariate balance is required for ATE with minimal weight variability | Requires continuous or discretized covariates; less intuitive to communicate |
| CBPS Weighting | Integrates balance constraints into the PS estimation step | When logistic regression PS is inadequate and boosting is not preferred | Useful compromise between PS and balancing methods |
| Overlap Weighting | Focuses on the region where treated and control groups are most comparable | When the interest lies in a realistic treatment population and extreme PS values exist | Produces highly stable estimates; targets the ATO estimand |
| Matching Weights | Emulates matching behavior using weights rather than discarding subjects | When ATT is desired without sample loss associated with strict pair matching | Effectively a weighted form of PSM; easy to present when ATT is the goal |
In practice, method selection should be guided by the estimand, the structure of the dataset, and analytical goals. Matching is generally preferred when the scientific audience relates well to a “matched cohort” interpretation. Weighting is preferred when sample retention and statistical efficiency are priorities, particularly for survival analysis. Regardless of method, covariate balance diagnostics must be reviewed before any treatment effect is interpreted.
2. Data setup and causal question
The simplified causal question from the Lalonde study is:
Does participation in the training program (treat) improve earnings in 1978 (re78)?
We first inspect baseline differences to see how strongly confounded the dataset is.
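A baseline comparison of this kind can be sketched in base R. The snippet below is an illustration rather than the original table code: it assumes the `lalonde` data from MatchIt, and the split into continuous and categorical variables (`cont_vars`, `cat_vars`) is a choice made here.

```r
library(MatchIt)  # used here only for its copy of the lalonde data
data("lalonde", package = "MatchIt")

# Group sizes: treat == 1 is the training program, treat == 0 the comparison group
table(lalonde$treat)

# Group means of the continuous covariates
aggregate(cbind(age, educ, re74, re75) ~ treat, data = lalonde, FUN = mean)

# Wilcoxon rank sum tests for continuous covariates
# (exact = FALSE avoids warnings caused by tied values)
cont_vars <- c("age", "educ", "re74", "re75")
p_cont <- sapply(cont_vars, function(v) {
  wilcox.test(lalonde[[v]] ~ lalonde$treat, exact = FALSE)$p.value
})

# Pearson chi-squared tests (no continuity correction) for categorical covariates
cat_vars <- c("race", "married", "nodegree")
p_cat <- sapply(cat_vars, function(v) {
  chisq.test(table(lalonde[[v]], lalonde$treat), correct = FALSE)$p.value
})

round(c(p_cont, p_cat), 4)
```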
Tests used in the baseline comparison: Wilcoxon rank sum test; Pearson’s Chi-squared test.
As expected, there are substantial differences between the treated and control groups, especially in earnings prior to treatment (re74, re75). These imbalances are enough to create misleading estimates if the data is analyzed without adjustment.
We define a consistent propensity score formula to be used across all methods:
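One way to set this up is to store the formula in a single object and reuse it everywhere. The covariate set below is an assumption for illustration (all pre-treatment variables in MatchIt's `lalonde`), and the name `ps_formula` is hypothetical.

```r
library(MatchIt)  # used here only for its copy of the lalonde data
data("lalonde", package = "MatchIt")

# Treatment on the left, pre-treatment covariates on the right.
# The outcome re78 is deliberately excluded from the design stage.
ps_formula <- treat ~ age + educ + race + married + nodegree + re74 + re75
```

Keeping the formula in one object means every matching and weighting design is judged against the same set of confounders.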
3. MatchIt: Matching designs
3.1 Nearest neighbor matching
Nearest neighbor matching is often a first choice because it is intuitive and easy to communicate clinically. The trade-off is sample loss when good matches do not exist.
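A 1:1 nearest neighbor design on a logistic regression propensity score might look like the sketch below. The formula and the object names (`ps_formula`, `m_nn`, `matched_nn`) are assumptions for illustration; the settings shown are the MatchIt defaults written out explicitly.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Assumed covariate set for illustration
ps_formula <- treat ~ age + educ + race + married + nodegree + re74 + re75

# Greedy 1:1 matching without replacement on a logistic regression PS
m_nn <- matchit(ps_formula, data = lalonde,
                method = "nearest", distance = "glm", ratio = 1)

# Balance before and after matching, plus matched and unmatched counts
summary(m_nn)

# Matched dataset, with matching weights and subclass (pair) identifiers
matched_nn <- match.data(m_nn)
nrow(matched_nn)
```

Because no caliper is set and there are more controls than treated units, every treated unit receives a match and the sample loss falls entirely on the control group.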
3.2 Optimal matching
Optimal matching looks globally across all possible pairings instead of making greedy decisions pair by pair.
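In MatchIt, switching from greedy to optimal pairing is a one-argument change. The sketch below assumes the `optmatch` package is installed (MatchIt relies on it for `method = "optimal"`) and reuses an illustrative formula; `ps_formula` and `m_opt` are hypothetical names.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Assumed covariate set for illustration
ps_formula <- treat ~ age + educ + race + married + nodegree + re74 + re75

# Optimal 1:1 matching: minimizes the total within-pair PS distance
# across all pairs rather than choosing each pair greedily
m_opt <- matchit(ps_formula, data = lalonde,
                 method = "optimal", distance = "glm", ratio = 1)

summary(m_opt)
matched_opt <- match.data(m_opt)
```

The same units are usually retained as in nearest neighbor matching; what changes is who is paired with whom.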
In coarsened exact matching, the analyst controls how strict the comparisons should be by setting the cutpoints used to bin each continuous covariate.
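As a sketch of how cutpoints are supplied for coarsened exact matching, MatchIt accepts a named list through the `cutpoints` argument, where a single number is read as the number of bins for that covariate. The bin counts below are arbitrary illustrative choices, not recommendations, and the object names are hypothetical.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Assumed covariate set for illustration
ps_formula <- treat ~ age + educ + race + married + nodegree + re74 + re75

# Coarsened exact matching: continuous covariates are binned, then units
# are matched exactly on the binned values and on the categorical covariates
m_cem <- matchit(ps_formula, data = lalonde, method = "cem",
                 cutpoints = list(age = 4, educ = 3, re74 = 3, re75 = 3))

# Units in strata lacking either treated or control units are dropped
summary(m_cem)
matched_cem <- match.data(m_cem)
```

Fewer bins give looser comparisons and keep more of the sample; more bins give tighter comparisons at the cost of discarding more units.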
4. WeightIt: Weighting designs
Where matching creates pairs or sets, weighting creates a pseudo-population in which treatment is independent of observed covariates. We now explore several types of weight construction.
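The basic WeightIt workflow can be sketched as follows. The formula and object names are assumptions for illustration. `method = "glm"` requests a logistic regression propensity score in recent WeightIt releases (older releases used `method = "ps"` for the same thing).

```r
library(MatchIt)   # used here only for its copy of the lalonde data
library(WeightIt)
data("lalonde", package = "MatchIt")

# Assumed covariate set for illustration
ps_formula <- treat ~ age + educ + race + married + nodegree + re74 + re75

# IPTW from a logistic regression propensity score, targeting the ATE
w_iptw <- weightit(ps_formula, data = lalonde, method = "glm", estimand = "ATE")

# Entropy balancing, targeting the ATE
w_ebal <- weightit(ps_formula, data = lalonde, method = "ebal", estimand = "ATE")

# Overlap weights: same propensity score model, but targeting the ATO
w_ato <- weightit(ps_formula, data = lalonde, method = "glm", estimand = "ATO")

# Weight ranges and effective sample sizes for one of the designs
summary(w_iptw)
```

Only the `method` and `estimand` arguments change between designs, which makes it straightforward to fit several and compare their diagnostics side by side.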
No single method is universally optimal. Good practice is to compare designs with respect to:
1. Covariate balance (an absolute SMD below 0.1 is a common target).
2. Effective sample size (matching may drop observations, and highly variable weights reduce it).
3. Weight stability (avoid extreme or highly variable weights).
4. Target population (ATT, ATE, or ATO).
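These checks can be sketched with the cobalt package for balance and a small helper for effective sample size. The example uses an overlap-weighted design as a stand-in for any design under review; the helper `ess()` and the other object names are hypothetical, and the formula is an assumed covariate set.

```r
library(MatchIt)   # used here only for its copy of the lalonde data
library(WeightIt)
library(cobalt)
data("lalonde", package = "MatchIt")

# Assumed covariate set for illustration
ps_formula <- treat ~ age + educ + race + married + nodegree + re74 + re75

w_ato <- weightit(ps_formula, data = lalonde, method = "glm", estimand = "ATO")

# Balance before (un = TRUE) and after weighting, flagged against a 0.1 threshold
bal.tab(w_ato, stats = "mean.diffs", thresholds = c(m = 0.1), un = TRUE)

# Effective sample size: (sum of weights)^2 / sum of squared weights
ess <- function(w) sum(w)^2 / sum(w^2)

# ESS within each treatment group; compare against the raw group sizes
ess_by_group <- tapply(w_ato$weights, lalonde$treat, ess)
ess_by_group
table(lalonde$treat)
```

The same `bal.tab()` call accepts a `matchit` object, so matched and weighted designs can be assessed against the same threshold.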
In real-world evidence (RWE) studies, overlap weighting and full matching are often strong choices because they combine good balance with efficient sample use. Nearest neighbor matching remains common because it closely resembles the familiar idea of a matched cohort.
While Nearest Neighbor matching is the most popular method, notice how Entropy Balancing and Overlap Weighting achieved near-perfect balance (SMDs \(\approx\) 0) in this dataset without discarding data. In my own practice, I am increasingly moving toward these weighting methods for survival analysis to preserve statistical power.