In a randomized clinical trial, treatment is assigned at random by design. As a result, the groups are expected to be similar at baseline in age, disease severity, risk factors, and other characteristics, whether measured or not. Because of this balance, a difference in outcomes beyond what chance would produce can be attributed to the treatment itself.
Real-world evidence is different. In everyday clinical practice, treatment is not assigned randomly. It is influenced by clinical judgment and patient characteristics. For example, physicians might choose a newer therapy for patients who appear sicker or have a higher risk profile, while healthier or low-risk patients may continue with standard care. As a result, the treatment and control groups are no longer directly comparable at baseline.
If we simply compare outcomes between these groups without adjustment, we are not isolating the treatment effect. Instead, we are mixing the treatment effect with pre-existing differences between patients, which biases the estimate. For instance, a drug that genuinely improves survival may appear ineffective, or even harmful, if it is mostly prescribed to high-risk patients who already have a poor prognosis.
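To make this bias concrete, here is a minimal base R sketch on simulated data. Every variable name and number in it is an illustrative assumption: high-risk patients are treated more often, and the treatment truly lowers the risk of death by 5 percentage points.

```r
# Illustrative simulation of confounding by indication (all values are assumptions).
set.seed(42)
n <- 20000

# Baseline severity: 1 = high-risk patient, 0 = low-risk patient
severity <- rbinom(n, size = 1, prob = 0.4)

# Physicians treat high-risk patients more often (80% vs 20%)
treat <- rbinom(n, size = 1, prob = ifelse(severity == 1, 0.8, 0.2))

# Death risk: high severity adds 30 percentage points, treatment removes 5
p_death <- 0.10 + 0.30 * severity - 0.05 * treat
death   <- rbinom(n, size = 1, prob = p_death)

# Naive comparison ignores severity and mixes both effects
naive_diff <- mean(death[treat == 1]) - mean(death[treat == 0])

# Comparing within severity strata compares like with like
diff_by_stratum <- sapply(0:1, function(s) {
  mean(death[treat == 1 & severity == s]) -
    mean(death[treat == 0 & severity == s])
})

# Simple average of the two stratum-specific differences
# (the true effect is the same in both strata in this simulation)
adjusted_diff <- mean(diff_by_stratum)

round(c(naive = naive_diff, adjusted = adjusted_diff), 3)
```

In this setup the naive difference is positive, which makes a beneficial treatment look harmful, while the stratified comparison recovers a negative difference close to the simulated effect.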
Propensity score methods help correct this problem. By adjusting for measured baseline differences, they aim to approximate the conditions of a randomized trial within an observational dataset. After adjustment, both groups should look similar with respect to key covariates, so that outcome differences reflect the treatment rather than patient selection. Unlike randomization, these methods cannot balance confounders that were not measured.
In this tutorial, we focus on two widely used techniques in real-world studies:
- Propensity Score Matching (PSM), which pairs treated subjects with similar control subjects based on their estimated probability of receiving treatment.
- Inverse Probability of Treatment Weighting (IPTW), which uses weights to construct a balanced pseudo-population while retaining the full sample.
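Both techniques start from the same quantity, the propensity score. The following is a minimal base R sketch, assuming simulated data with hypothetical variables `age`, `severity`, and `treat`. It estimates the score with logistic regression and derives IPTW weights that target the average treatment effect (ATE); the coefficients are illustrative only.

```r
# Sketch: propensity scores and ATE-type IPTW weights on simulated data.
# The data-generating values below are illustrative assumptions.
set.seed(1)
n <- 2000
age      <- rnorm(n, mean = 65, sd = 10)
severity <- rbinom(n, size = 1, prob = 0.35)
treat    <- rbinom(n, size = 1, prob = plogis(-3 + 0.04 * age + 1.2 * severity))
df <- data.frame(age, severity, treat)

# Propensity score: estimated probability of treatment given the covariates
ps_model <- glm(treat ~ age + severity, data = df, family = binomial())
df$ps <- predict(ps_model, type = "response")

# IPTW weights for the ATE:
# treated patients get 1 / ps, controls get 1 / (1 - ps)
df$w <- ifelse(df$treat == 1, 1 / df$ps, 1 / (1 - df$ps))

# The weighted gap in mean age should be much smaller than the raw gap
is_t <- df$treat == 1
raw_gap      <- mean(df$age[is_t]) - mean(df$age[!is_t])
weighted_gap <- weighted.mean(df$age[is_t], df$w[is_t]) -
  weighted.mean(df$age[!is_t], df$w[!is_t])

round(c(raw = raw_gap, weighted = weighted_gap), 2)
```

PSM would instead use `df$ps` to pair each treated patient with a control whose score is close, discarding patients without a suitable partner.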
1.1 When each method is suitable
| Method | Suitable when | Advantages | Limitations |
|---|---|---|---|
| Exact Matching | Few categorical covariates and a large dataset | Complete covariate balance for matched variables | Drops many observations when several covariates exist |
| Propensity Score Matching | Cohorts with many covariates; need a matched control group structure | Easy to interpret and communicate to clinicians | Unmatched patients are discarded, reducing statistical power |
| IPTW | Time-to-event/survival analysis; priority on retaining the full sample | Retains the full sample, preserving statistical power relative to matching | May produce unstable estimates when weights become extreme |
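The limitation noted for IPTW, unstable estimates under extreme weights, can be checked directly. The sketch below uses simulated data with deliberately strong selection into treatment and reports Kish's effective sample size. The helper name `ess` is hypothetical, and truncating weights at the 1st and 99th percentiles is shown as one common remedy, an assumption here rather than a rule from this tutorial.

```r
# Sketch: diagnosing extreme IPTW weights (simulated data; values are assumptions).
set.seed(2)
n <- 2000
x     <- rnorm(n)
treat <- rbinom(n, size = 1, prob = plogis(2 * x))  # strong selection on x
ps    <- predict(glm(treat ~ x, family = binomial()), type = "response")
w     <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))

# Kish effective sample size: equals n when all weights are equal,
# and shrinks as a few observations dominate the weighted sample
ess <- function(w) sum(w)^2 / sum(w^2)

# Truncate weights at the 1st and 99th percentiles (illustrative choice)
bounds  <- quantile(w, probs = c(0.01, 0.99))
w_trunc <- pmin(pmax(w, bounds[1]), bounds[2])

c(n = n, max_weight = max(w), ess_raw = ess(w), ess_truncated = ess(w_trunc))
```

An effective sample size far below `n` signals that a handful of patients carry most of the weight, which is when IPTW estimates tend to become unstable.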
1.2 Standard workflow
Regardless of the specific technique used, the following workflow represents best practice:
1. Select baseline confounders that influence both treatment assignment and the outcome.
2. Estimate the propensity score using logistic regression or another model.
3. Apply matching or weighting to adjust for baseline differences.
4. Check balance using Standardized Mean Differences (SMD). As a general rule, final SMD values should be less than 0.1 in absolute value.
5. Conduct the outcome analysis, such as a Cox proportional hazards model or Kaplan-Meier curves, only after confirming adequate balance.
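The balance check in this workflow rests on the SMD, which can be computed by hand. The sketch below is one possible base R implementation that uses a pooled standard deviation in the denominator; the function name `smd` and its arguments are hypothetical. Packages may use a different denominator (for example the treated-group standard deviation when the estimand is the ATT), so values can differ slightly from packaged output.

```r
# Sketch: standardized mean difference (SMD) with a pooled-SD denominator.
# Function and argument names are hypothetical, chosen for illustration.
smd <- function(x, treat, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[treat == 1], w[treat == 1])
  m0 <- weighted.mean(x[treat == 0], w[treat == 0])
  # Standardize by the unweighted pooled SD so the denominator
  # stays fixed before and after adjustment
  s_pooled <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  (m1 - m0) / s_pooled
}

# Toy example: controls have mean 0, treated have mean 0.5, both with SD 1
set.seed(3)
treat <- rep(c(0, 1), each = 500)
x     <- rnorm(1000, mean = 0.5 * treat, sd = 1)

smd_x <- smd(x, treat)
abs(smd_x) < 0.1  # FALSE here: this covariate would be flagged as imbalanced
```

Passing matching or IPTW weights through `w` gives the adjusted SMD, which can then be compared against the 0.1 threshold.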
2. Setup and Data Simulation
The following code generates a simulated dataset where the true treatment effect is known. This structure helps demonstrate whether PSM and IPTW recover the correct effect.
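A minimal sketch of such a simulation in base R is given here. The covariate names `age`, `severity`, `comorb`, and `treat` and the sample size of 4,000 follow the outputs shown later in this tutorial. The seed, all coefficients, the outcome names `time` and `event`, and the true hazard ratio of 0.7 are illustrative assumptions, so the numbers it produces will not reproduce those outputs.

```r
# Sketch of a simulated cohort with a known treatment effect.
# Seed, coefficients, and the true hazard ratio are illustrative assumptions.
set.seed(2024)
n <- 4000

# Baseline confounders
age      <- rnorm(n, mean = 65, sd = 10)      # years
severity <- rbinom(n, size = 1, prob = 0.35)  # 1 = severe disease
comorb   <- rpois(n, lambda = 2)              # comorbidity count

# Treatment assignment depends on the confounders (confounding by indication)
lp_treat <- -3.2 + 0.04 * age + 1.3 * severity + 0.2 * comorb
treat    <- rbinom(n, size = 1, prob = plogis(lp_treat))

# Exponential event times; the true conditional hazard ratio for treatment is 0.7
true_hr <- 0.7
rate <- 0.05 * exp(log(true_hr) * treat + 0.03 * (age - 65) +
                     0.6 * severity + 0.15 * comorb)
t_event  <- rexp(n, rate = rate)
t_censor <- runif(n, min = 1, max = 10)       # administrative censoring

df <- data.frame(
  age, severity, comorb, treat,
  time  = pmin(t_event, t_censor),
  event = as.integer(t_event <= t_censor)
)

# Crude baseline comparison by treatment group
aggregate(cbind(age, severity, comorb) ~ treat, data = df, FUN = mean)
```

Because treated patients are older, sicker, and carry more comorbidities by construction, the crude group means differ before any adjustment.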
[Table: baseline characteristics by treatment group, with p-values from the Wilcoxon rank sum test and Pearson's Chi-squared test]
This initial summary reveals whether the treatment and control groups are imbalanced at baseline. A typical observational dataset will show statistically significant differences, highlighting the need for propensity score adjustment before outcome modeling.
3. Propensity Score Matching with MatchIt
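The following code performs 1:1 nearest neighbor matching without replacement on a propensity score estimated by logistic regression, using a caliper of 0.2 standard deviations of the propensity score.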
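A sketch of the matching call is given here, with arguments taken from the `Call:` echoed by `summary(m_out)` in Section 3.1. The data frame is a simulated stand-in with assumed coefficients, so its counts will differ from the printed results. The call spells the logistic-regression propensity score as `distance = "glm"` with `link = "logit"`, which is how current MatchIt releases write what older code expressed as `distance = "logit"`.

```r
library(MatchIt)

# Simulated stand-in for the tutorial's `df` (all values are assumptions)
set.seed(2024)
n <- 4000
df <- data.frame(
  age      = rnorm(n, mean = 65, sd = 10),
  severity = rbinom(n, size = 1, prob = 0.35),
  comorb   = rpois(n, lambda = 2)
)
lp <- -3.2 + 0.04 * df$age + 1.3 * df$severity + 0.2 * df$comorb
df$treat <- rbinom(n, size = 1, prob = plogis(lp))

# 1:1 nearest neighbor matching without replacement on the propensity score,
# with a caliper of 0.2 standard deviations of the propensity score
m_out <- matchit(
  treat ~ age + severity + comorb,
  data        = df,
  method      = "nearest",
  distance    = "glm",
  link        = "logit",
  caliper     = 0.2,
  std.caliper = TRUE,
  ratio       = 1
)

# Printing the object summarizes the matching specification
m_out
```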
A `matchit` object
- method: 1:1 nearest neighbor matching without replacement
- distance: Propensity score [caliper]
- estimated with logistic regression
- caliper: <distance> (0.041)
- number of obs.: 4000 (original), 2656 (matched)
- target estimand: ATT
- covariates: age, severity, comorb
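The printed object reports how many individuals were matched and how many were dropped: here, 2,656 of the 4,000 patients were retained (1,328 matched pairs) and 1,344 were left unmatched. The next step is to confirm whether matching meaningfully reduces baseline differences.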
3.1 Balance assessment after PSM
summary(m_out)
Call:
matchit(formula = treat ~ age + severity + comorb, data = df,
    method = "nearest", distance = "logit", caliper = 0.2, std.caliper = TRUE,
    ratio = 1)

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance        0.6617        0.4838          0.9398     1.0365    0.2466   0.3954
age            66.5547       62.6451          0.4009     0.9864    0.1106   0.1704
severity        0.5259        0.2060          0.6408          .    0.3200   0.3200
comorb          2.1491        1.6695          0.3307     1.3254    0.0480   0.1376

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
distance        0.5610        0.5333          0.1461     1.1644    0.0396   0.0881          0.1463
age            64.9905       64.5286          0.0474     1.0622    0.0135   0.0339          0.9537
severity        0.3102        0.2545          0.1116          .    0.0557   0.0557          0.3438
comorb          1.9337        1.8328          0.0696     1.1295    0.0101   0.0309          0.9232

Sample Sizes:
          Control Treated
All          1646    2354
Matched      1328    1328
Unmatched     318    1026
Discarded       0       0
love.plot(m_out, thresholds = c(m = 0.1), abs = TRUE, title = "Covariate Balance After PSM")
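After matching, the standardized mean differences should approach zero, and a well-balanced design has every covariate within the 0.1 threshold. In the matched-data summary above, age (0.047) and comorb (0.070) meet this threshold, while severity (0.112) sits slightly above it, so this specification may warrant a tighter caliper or a revised propensity score model before moving on.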
3.2 Cox model after PSM
After confirming balance, the matched sample can be used to estimate the treatment effect using a Cox proportional hazards model.
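One way to fit this model is sketched here with the MatchIt and survival packages. The data are a simulated stand-in: the outcome names `time` and `event`, the coefficients, and the true hazard ratio of 0.7 are all assumptions. The `cluster(subclass)` term requests robust standard errors that account for the matched pairs, which is a common choice rather than the only valid one.

```r
library(MatchIt)
library(survival)

# Simulated stand-in data (all values, and the names `time`/`event`, are assumptions)
set.seed(2024)
n <- 4000
df <- data.frame(
  age      = rnorm(n, mean = 65, sd = 10),
  severity = rbinom(n, size = 1, prob = 0.35),
  comorb   = rpois(n, lambda = 2)
)
lp <- -3.2 + 0.04 * df$age + 1.3 * df$severity + 0.2 * df$comorb
df$treat <- rbinom(n, size = 1, prob = plogis(lp))
rate <- 0.05 * exp(log(0.7) * df$treat + 0.03 * (df$age - 65) +
                     0.6 * df$severity + 0.15 * df$comorb)  # true HR = 0.7
t_event  <- rexp(n, rate = rate)
t_censor <- runif(n, min = 1, max = 10)
df$time  <- pmin(t_event, t_censor)
df$event <- as.integer(t_event <= t_censor)

# 1:1 nearest neighbor matching with a 0.2 SD caliper on the propensity score
m_out <- matchit(treat ~ age + severity + comorb, data = df,
                 method = "nearest", distance = "glm", link = "logit",
                 caliper = 0.2, ratio = 1)

# match.data() keeps matched patients and adds `weights` and `subclass` (pair id)
md <- match.data(m_out)

# Cox model on the matched sample with pair-clustered robust standard errors
fit <- coxph(Surv(time, event) ~ treat + cluster(subclass),
             data = md, weights = weights)

# Hazard ratio for treatment with a 95% confidence interval
exp(cbind(HR = coef(fit), confint(fit)))
```

The estimate from the matched sample is a marginal hazard ratio among patients resembling the treated group, so it need not equal the conditional value used to simulate the data, although it should point in the same direction.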
The optimal design is the one that:
1. Achieves an absolute SMD below 0.1 for all important covariates.
2. Does not require excessive sample loss or extremely large weights.