Propensity Score Matching and Weighting Tutorial

Author

Dinesh Kumar

Published

November 23, 2025

1. Introduction and Background

In a randomized clinical trial, treatment assignment is randomized by design. This means both groups start out similar in terms of age, disease severity, risk factors, and other baseline characteristics. Because of this balance, any difference in outcomes can be attributed with confidence to the treatment itself.

Real-world evidence is different. In everyday clinical practice, treatment is not assigned randomly. It is influenced by clinical judgment and patient characteristics. For example, physicians might choose a newer therapy for patients who appear sicker or have a higher risk profile, while healthier or low-risk patients may continue with standard care. As a result, the treatment and control groups are no longer directly comparable at baseline.

If we simply compare outcomes between these groups without any adjustment, we are not comparing the treatment effect alone. Instead, we are mixing treatment effects with pre-existing differences between patients. This leads to biased conclusions. For instance, a drug that genuinely improves survival may appear ineffective if it is mostly prescribed to high-risk patients who already have poor prognosis.

Propensity score methods help correct this problem. By adjusting for baseline differences, they create conditions that resemble a randomized trial, even in an observational dataset. After adjustment, the goal is that both groups look similar with respect to key covariates, so that outcome differences reflect treatment effects rather than patient selection.

In this tutorial, we focus on two widely used techniques in real-world studies:

  • Propensity Score Matching (PSM), which pairs treated subjects with similar control subjects based on their likelihood of receiving treatment

  • Inverse Probability of Treatment Weighting (IPTW), which uses weights to construct a balanced pseudo-population while retaining the full sample


1.1 When each method is suitable

Method Suitable When Advantages Limitations
Exact Matching Few categorical covariates and a large dataset Complete covariate balance for matched variables Drops many observations when several covariates exist
Propensity Score Matching Cohorts with many covariates; need a matched control group structure Easy to interpret and communicate to clinicians Unmatched patients are discarded, reducing statistical power
IPTW Weighting Time-to-event/survival analysis; priority on retaining full sample Retains sample size and increases power May produce unstable estimates when weights become extreme

1.2 Standard workflow

Regardless of the specific technique used, the following workflow represents best practice:

  1. Select baseline confounders that influence treatment assignment and outcome.
  2. Estimate the propensity score using logistic regression or another model.
  3. Apply matching or weighting to adjust baseline differences.
  4. Check balance using Standardized Mean Differences (SMD). As a general rule, final SMD values should be less than 0.1 in absolute value.
  5. Conduct outcome analysis, such as Cox proportional hazards or Kaplan Meier curves, only after confirming adequate balance.

2. Setup and Data Simulation

The following code generates a simulated dataset where the true treatment effect is known. This structure helps demonstrate whether PSM and IPTW recover the correct effect.

library(tidyverse)
library(survival)
library(survminer)
library(MatchIt)
library(WeightIt)
library(cobalt)
library(gtsummary)

set.seed(404)
n <- 4000

age <- rnorm(n, 65, 10)
severity <- sample(c(0, 1), n, replace = TRUE, prob = c(0.6, 0.4))
comorb <- rpois(n, 2)

ps_true <- plogis(-4 + 0.05 * age + 1.5 * severity + 0.3 * comorb)
treat <- rbinom(n, 1, ps_true)

lambda <- 0.001
hazard <- lambda * exp(
  log(0.8) * treat +
    log(1.02) * age +
    log(1.5) * severity
)

time <- rexp(n, hazard)
status <- ifelse(time > 365 * 2, 0, 1)
time <- pmin(time, 365 * 2)

df <- data.frame(
  id = 1:n,
  treat = treat,
  time = time,
  status = status,
  age = age,
  severity = severity,
  comorb = comorb
)

tbl_summary(df, by = treat, include = c(age, severity, comorb)) %>%
  add_overall() %>%
  add_p()
Characteristic Overall
N = 4,0001
0
N = 1,6461
1
N = 2,3541
p-value2
age 65 (58, 72) 63 (56, 70) 67 (60, 73) <0.001
severity 1,577 (39%) 339 (21%) 1,238 (53%) <0.001
comorb 2 (1, 3) 2 (1, 2) 2 (1, 3) <0.001
1 Median (Q1, Q3); n (%)
2 Wilcoxon rank sum test; Pearson’s Chi-squared test

This initial summary reveals whether the treatment and control groups are imbalanced at baseline. A typical observational dataset will show statistically significant differences, highlighting the need for propensity score adjustment before outcome modeling.


3. Propensity Score Matching with MatchIt

The following code performs 1:1 nearest neighbor matching based on the logit of the propensity score.

m_out <- matchit(
  treat ~ age + severity + comorb,
  data    = df,
  method  = "nearest",
  distance = "logit",
  ratio   = 1,
  caliper = 0.2,
  std.caliper = TRUE
)

m_out
A `matchit` object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]

             - estimated with logistic regression
 - caliper: <distance> (0.041)
 - number of obs.: 4000 (original), 2656 (matched)
 - target estimand: ATT
 - covariates: age, severity, comorb

Matching results reveal how many individuals were matched and how many were dropped. The next step is to confirm whether matching meaningfully reduces baseline differences.


3.1 Balance assessment after PSM

summary(m_out)

Call:
matchit(formula = treat ~ age + severity + comorb, data = df, 
    method = "nearest", distance = "logit", caliper = 0.2, std.caliper = TRUE, 
    ratio = 1)

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.6617        0.4838          0.9398     1.0365    0.2466
age            66.5547       62.6451          0.4009     0.9864    0.1106
severity        0.5259        0.2060          0.6408          .    0.3200
comorb          2.1491        1.6695          0.3307     1.3254    0.0480
         eCDF Max
distance   0.3954
age        0.1704
severity   0.3200
comorb     0.1376

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.5610        0.5333          0.1461     1.1644    0.0396
age            64.9905       64.5286          0.0474     1.0622    0.0135
severity        0.3102        0.2545          0.1116          .    0.0557
comorb          1.9337        1.8328          0.0696     1.1295    0.0101
         eCDF Max Std. Pair Dist.
distance   0.0881          0.1463
age        0.0339          0.9537
severity   0.0557          0.3438
comorb     0.0309          0.9232

Sample Sizes:
          Control Treated
All          1646    2354
Matched      1328    1328
Unmatched     318    1026
Discarded       0       0
love.plot(
  m_out,
  thresholds = c(m = 0.1),
  abs = TRUE,
  title = "Covariate Balance After PSM"
)

The standardized mean differences should approach zero after matching. A well-balanced design will have every covariate within the 0.1 threshold.


3.2 Cox model after PSM

After confirming balance, the matched sample can be used to estimate the treatment effect using a Cox proportional hazards model.

df_match <- match.data(m_out)

fit_psm <- coxph(
  Surv(time, status) ~ treat,
  data    = df_match,
  robust  = TRUE,
  cluster = subclass
)

tbl_regression(fit_psm, exponentiate = TRUE) %>%
  as_gt() %>%
  gt::tab_header(title = "PSM Adjusted Hazard Ratio (MatchIt)")
PSM Adjusted Hazard Ratio (MatchIt)
Characteristic HR 95% CI p-value
treat 0.82 0.76, 0.88 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

Robust standard errors with clustering account for the matched design and should always be used when analyzing matched survival data.


4. Inverse Probability Weighting with WeightIt

IPTW retains the entire sample while weighting patients according to their inverse probability of receiving the treatment.

w_out <- weightit(
  treat ~ age + severity + comorb,
  data     = df,
  method   = "ps",
  estimand = "ATE",
  stabilize = TRUE
)

summary(w_out$weights)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4573  0.6879  0.8101  1.0006  1.1146  8.4792 

Extreme weights can indicate limited overlap in propensity scores between treatment groups. Stabilization helps improve estimator stability.


4.1 Balance assessment after IPTW

bal.tab(
  w_out,
  stats = c("m"),
  thresholds = c(m = 0.1)
)
Balance Measures
               Type Diff.Adj    M.Threshold
prop.score Distance  -0.0033 Balanced, <0.1
age         Contin.   0.0055 Balanced, <0.1
severity     Binary  -0.0010 Balanced, <0.1
comorb      Contin.  -0.0098 Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         4
Not Balanced, >0.1     0

Variable with the greatest mean difference
 Variable Diff.Adj    M.Threshold
   comorb  -0.0098 Balanced, <0.1

Effective sample sizes
           Control Treated
Unadjusted  1646.  2354.  
Adjusted    1107.7 2002.69
love.plot(
  w_out,
  thresholds = c(m = 0.1),
  abs = TRUE,
  title = "Covariate Balance After IPTW (WeightIt)"
)

IPTW should only be used for outcome analysis once all key covariates achieve acceptable balance.


4.2 Cox model after IPTW

df_w <- df %>%
  mutate(w_iptw = w_out$weights)

fit_iptw <- coxph(
  Surv(time, status) ~ treat,
  data    = df_w,
  weights = w_iptw,
  robust  = TRUE,
  cluster = id
)

tbl_regression(fit_iptw, exponentiate = TRUE) %>%
  as_gt() %>%
  gt::tab_header(title = "IPTW Adjusted Hazard Ratio (WeightIt)")
IPTW Adjusted Hazard Ratio (WeightIt)
Characteristic HR 95% CI p-value
treat 0.83 0.77, 0.89 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

The resulting estimate reflects the Average Treatment Effect across the weighted pseudo-population.


5. Comparing PSM and IPTW

covs <- c("age", "severity", "comorb")

love.plot(
  list(PSM = m_out, IPTW = w_out),
  covs       = df[, covs],
  treat      = df$treat,
  stats      = "m",
  thresholds = c(m = 0.1),
  abs        = TRUE,
  title      = "Covariate Balance Comparison: PSM vs IPTW"
)

The optimal design is the one that: 1. Achieves SMD less than 0.1 for all important covariates 2. Does not require excessive sample loss or extremely large weights


6. Kaplan Meier survival curves after IPTW

fit_km <- survfit(
  Surv(time, status) ~ treat,
  data    = df_w,
  weights = w_iptw
)

ggsurvplot(
  fit_km,
  data        = df_w,
  risk.table  = TRUE,
  conf.int    = TRUE,
  pval        = TRUE,
  title       = "Adjusted Survival Curves using IPTW",
  ggtheme     = theme_minimal()
)

Weighted survival curves reflect treatment effect in the balanced pseudo-population rather than the original biased dataset.


7. Summary of results

fit_crude <- coxph(
  Surv(time, status) ~ treat,
  data = df
)

tbl_merge(
  tbls = list(
    tbl_regression(fit_crude, exponentiate = TRUE),
    tbl_regression(fit_psm, exponentiate = TRUE),
    tbl_regression(fit_iptw, exponentiate = TRUE)
  ),
  tab_spanner = c(
    "**Unadjusted**",
    "**PSM Adjusted**",
    "**IPTW Adjusted**"
  )
) %>%
  as_gt() %>%
  gt::tab_header(title = "Hazard Ratios Across Models")
Hazard Ratios Across Models
Characteristic
Unadjusted
PSM Adjusted
IPTW Adjusted
HR 95% CI p-value HR 95% CI p-value HR 95% CI p-value
treat 0.97 0.91, 1.04 0.4 0.82 0.76, 0.88 <0.001 0.83 0.77, 0.89 <0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

8. Key takeaways

  1. Balance diagnostics must always precede outcome modeling. A hazard ratio is unreliable if SMDs exceed 0.1 after adjustment.
  2. PSM produces an RCT-like matched cohort but reduces sample size and power when many subjects are unmatched.
  3. IPTW is often preferred in survival analysis when the full sample should be retained, but weight distributions must be checked for stability.
  4. Always use robust variance estimation when analyzing matched or weighted samples.