Observational data often contain inherent differences between comparison groups because subjects are not assigned through randomization. Such baseline imbalances can introduce confounding, making direct comparisons unreliable if not properly controlled. Propensity score methods provide a statistical approach to address this issue by estimating the probability of group membership based on observed characteristics and creating balanced samples with similar covariate distributions.
Propensity score matching (PSM) is widely used to emulate key features of randomized study designs in real-world datasets. By pairing individuals with comparable propensity scores, PSM reduces bias attributable to measured variables and enables more valid group comparisons. Nearest-neighbour matching is one of the most commonly applied algorithms due to its simplicity, transparency, and effectiveness in achieving covariate balance.
In this analysis, the Familial and Sporadic groups differed greatly in size and group membership was not randomized, so adjustment was needed before further investigation. Age- and sex-based propensity score matching was therefore implemented to construct comparable groups and ensure that downstream analyses are not driven by demographic imbalance.
This analysis used an observational dataset comparing individuals classified as Familial and Sporadic. Because group assignment was not randomized, propensity score matching based on age and sex was applied to reduce baseline differences and construct comparable groups prior to further analysis.
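For readers who want the estimand written out, the propensity score for an age- and sex-based model can be sketched as follows. This assumes a logistic regression model, which the report does not state explicitly; the symbols $T_i$ (1 = Familial, 0 = Sporadic), $\mathrm{Male}_i$ (indicator for male sex), and the coefficients $\beta_0, \beta_1, \beta_2$ are notation introduced here for illustration.

```latex
e(x_i) = \Pr\left(T_i = 1 \mid \mathrm{Age}_i, \mathrm{Male}_i\right),
\qquad
\log\frac{e(x_i)}{1 - e(x_i)} = \beta_0 + \beta_1\,\mathrm{Age}_i + \beta_2\,\mathrm{Male}_i
```

Individuals with similar fitted values of $e(x_i)$ are then treated as comparable with respect to age and sex.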
The dataset (n = 5,936) included a unique identifier (MRNO), age (years), sex (Gender_code), and group classification (dataset_code).
The aim of this analysis was to construct age- and sex-balanced comparison groups between Familial and Sporadic individuals prior to downstream analyses.
# Load haven, which provides read_sav() for SPSS files
library(haven)
# Import dataset
data <- read_sav("data/multiplex_18022026.sav")
# Preview the first rows of the dataset
head(data)
## # A tibble: 6 × 4
##   MRNO              Age Gender_code dataset_code
##   <chr>           <dbl> <dbl+lbl>   <dbl+lbl>
## 1 APKRN0000001309    22 0 [female]  1 [familial]
## 2 APVWD0000001348    18 0 [female]  1 [familial]
## 3 APVWD0000001795     8 0 [female]  1 [familial]
## 4 KABVG0000001379    20 1 [male]    1 [familial]
## 5 KLCLT0000003378    15 0 [female]  1 [familial]
## 6 PYSAR0000010680    33 1 [male]    1 [familial]
# Inspect the structure of the dataset
str(data)
## tibble [5,936 × 4] (S3: tbl_df/tbl/data.frame)
## $ MRNO : chr [1:5936] "APKRN0000001309" "APVWD0000001348" "APVWD0000001795" "KABVG0000001379" ...
## ..- attr(*, "format.spss")= chr "A16"
## ..- attr(*, "display_width")= int 16
## $ Age : num [1:5936] 22 18 8 20 15 33 15 37 18 30 ...
## ..- attr(*, "format.spss")= chr "F11.0"
## ..- attr(*, "display_width")= int 11
## $ Gender_code : dbl+lbl [1:5936] 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,...
## ..@ format.spss : chr "F8.0"
## ..@ display_width: int 13
## ..@ labels : Named num [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "female" "male"
## $ dataset_code: dbl+lbl [1:5936] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## ..@ format.spss : chr "F8.0"
## ..@ display_width: int 20
## ..@ labels : Named num [1:2] 0 1
## .. ..- attr(*, "names")= chr [1:2] "sporadic" "familial"
# Check missingness before cleaning
colSums(is.na(data))
##         MRNO          Age  Gender_code dataset_code
##            0            0            4            0
Records with missing sex were excluded, and the group and sex variables were converted to labelled factors.
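library(dplyr)  # provides %>%, filter() and mutate()

data_clean <- data %>%
  # Drop records with missing sex
  filter(!is.na(Gender_code)) %>%
  # Convert the coded variables to labelled factors
  mutate(
    dataset_code = factor(dataset_code, levels = c(0, 1),
                          labels = c("Sporadic", "Familial")),
    Gender_code = factor(Gender_code, levels = c(0, 1),
                         labels = c("Female", "Male"))
  )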
# Check missingness after
colSums(is.na(data_clean))
##         MRNO          Age  Gender_code dataset_code
##            0            0            0            0
Records with missing sex information (n = 4) were excluded prior to matching because propensity score estimation requires complete covariate information. No missing data were observed for age or group classification.
Standardized mean differences (SMD) are used instead of hypothesis testing because balance diagnostics, rather than statistical significance, are recommended for evaluating propensity score matching performance.
library(tableone)  # provides CreateTableOne()

vars <- c("Age", "Gender_code")
table_before <- CreateTableOne(vars = vars,
                               strata = "dataset_code",
                               data = data_clean,
                               test = FALSE)
print(table_before, smd = TRUE)
##                         Stratified by dataset_code
##                          Sporadic       Familial       SMD
##   n                       5845             87
##   Age (mean (SD))        29.10 (11.23)  26.86 (11.69)  0.195
##   Gender_code = Male (%)  3156 (54.0)      46 (52.9)   0.022
Baseline comparability between the Familial and Sporadic groups was assessed using standardized mean differences (SMD). An SMD greater than 0.10 was considered indicative of meaningful imbalance. Prior to matching, age showed evidence of imbalance (SMD = 0.195), while sex was already well balanced (SMD = 0.022).
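To make the SMD values concrete, the following minimal sketch recomputes them from the summary statistics in the table above, using the common pooled-standard-deviation definition. The helper names `smd_continuous` and `smd_binary` are hypothetical, and the assumption that this is the definition behind the reported values rests only on the fact that the numbers agree.

```r
# Standardized mean difference for a continuous covariate (pooled-SD form)
smd_continuous <- function(mean_1, sd_1, mean_0, sd_0) {
  (mean_1 - mean_0) / sqrt((sd_1^2 + sd_0^2) / 2)
}

# Standardized mean difference for a binary covariate, from group proportions
smd_binary <- function(p_1, p_0) {
  (p_1 - p_0) / sqrt((p_1 * (1 - p_1) + p_0 * (1 - p_0)) / 2)
}

# Inputs copied from the pre-matching table (Familial vs Sporadic)
age_smd <- smd_continuous(26.86, 11.69, 29.10, 11.23)
sex_smd <- smd_binary(46 / 87, 3156 / 5845)

# Absolute values, rounded as in the table
round(abs(c(Age = age_smd, Male = sex_smd)), 3)
```

Under these assumptions the absolute values come out at roughly 0.195 for age and 0.022 for sex, in line with the table.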
Age- and sex-matched analysis between familial and sporadic groups was performed using 1:1 nearest-neighbour propensity score matching without replacement. This approach pairs each treated subject with the control subject whose propensity score is closest, ensuring that each control is used at most once.
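The pairing rule described above can be illustrated with a small base-R sketch. This is a simplified greedy matcher written for illustration only, not the implementation used by any matching package (which may, for example, process treated subjects in a different order). The function `greedy_match` and the propensity scores are hypothetical, and the sketch assumes there are at least as many controls as treated subjects.

```r
# Greedy 1:1 nearest-neighbour matching without replacement (illustration only)
greedy_match <- function(ps_treated, ps_control) {
  available <- rep(TRUE, length(ps_control))
  matched <- integer(length(ps_treated))
  for (i in seq_along(ps_treated)) {
    d <- abs(ps_control - ps_treated[i])  # distance to every control
    d[!available] <- Inf                  # controls already used cannot be reused
    j <- which.min(d)                     # closest remaining control
    matched[i] <- j
    available[j] <- FALSE
  }
  matched  # index of the control paired with each treated subject
}

# Hypothetical propensity scores
ps_treated <- c(0.30, 0.32, 0.70)
ps_control <- c(0.10, 0.31, 0.33, 0.65, 0.90)
greedy_match(ps_treated, ps_control)
```

Here the second treated subject cannot take the control at 0.31, even though it is equally close, because that control was already paired with the first treated subject.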
In the matching framework, the Familial group serves as the treated group, and each Familial individual is matched to one Sporadic individual drawn from the pool of controls.
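The matching call itself is not shown in this report. Based on the Call echoed in the summary output, it may have looked like the sketch below, which runs the same `matchit()` arguments on simulated stand-in data so that it is self-contained. The objects `toy`, `m_out`, and `matched_toy` are hypothetical names, and the simulated values are not the study data.

```r
library(MatchIt)

# Hypothetical stand-in for data_clean (simulated; NOT the study data)
set.seed(1)
toy <- data.frame(
  Age = round(runif(500, 5, 70)),
  Gender_code = factor(sample(c("Female", "Male"), 500, replace = TRUE)),
  dataset_code = factor(rep(c("Sporadic", "Familial"), c(470, 30)),
                        levels = c("Sporadic", "Familial"))
)

# Same arguments as the Call echoed in the report's summary output
m_out <- matchit(dataset_code ~ Age + Gender_code, data = toy,
                 method = "nearest", replace = FALSE, ratio = 1)

# Extract the matched sample and check group sizes
matched_toy <- match.data(m_out)
table(matched_toy$dataset_code)

# Balance before and after matching
summary(m_out)
```

With the real data, replacing `toy` by `data_clean` and storing the result of `match.data()` as `matched_data` would correspond to the objects used elsewhere in this report.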
##
##  Sporadic Familial
##        87       87
##
## Call:
## matchit(formula = dataset_code ~ Age + Gender_code, data = data_clean,
##     method = "nearest", replace = FALSE, ratio = 1)
##
## Summary of Balance for All Data:
##                   Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                 0.0153        0.0147          0.1987     1.0875
## Age                     26.8621       29.1011         -0.1916     1.0834
## Gender_codeFemale        0.4713        0.4601          0.0225          .
## Gender_codeMale          0.5287        0.5399         -0.0225          .
##                   eCDF Mean eCDF Max
## distance             0.0485   0.1400
## Age                  0.0467   0.1400
## Gender_codeFemale    0.0112   0.0112
## Gender_codeMale      0.0112   0.0112
##
## Summary of Balance for Matched Data:
##                   Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                 0.0153        0.0153               0          1
## Age                     26.8621       26.8621               0          1
## Gender_codeFemale        0.4713        0.4713               0          .
## Gender_codeMale          0.5287        0.5287               0          .
##                   eCDF Mean eCDF Max Std. Pair Dist.
## distance                  0        0               0
## Age                       0        0               0
## Gender_codeFemale         0        0               0
## Gender_codeMale           0        0               0
##
## Sample Sizes:
##           Control Treated
## All          5845      87
## Matched        87      87
## Unmatched    5758       0
## Discarded       0       0
Matching quality was evaluated using standardized mean differences before and after matching. An SMD < 0.10 was considered evidence of adequate covariate balance. After matching, standardized mean differences for both age and sex were reduced to approximately zero, demonstrating successful covariate balance.
We then verify that covariate balance has been achieved in the matched sample.
table_after <- CreateTableOne(vars = vars,
                              strata = "dataset_code",
                              data = matched_data,
                              test = FALSE)
print(table_after, smd = TRUE)
##                         Stratified by dataset_code
##                          Sporadic       Familial       SMD
##   n                         87             87
##   Age (mean (SD))        26.86 (11.69)  26.86 (11.69)  <0.001
##   Gender_code = Male (%)    46 (52.9)      46 (52.9)   <0.001
The matching procedure produced 87 matched pairs (n = 174 individuals). Unmatched sporadic individuals were excluded from subsequent analyses, as propensity score matching constructs an analytical sample restricted to comparable subjects.
This graph serves to answer a single scientific question: Are the Familial and Sporadic groups comparable after matching?
Covariate balance improved substantially after matching, with all
standardized mean differences reduced to approximately zero, indicating
successful alignment of age and sex distributions between groups.
Propensity score matching successfully constructed age- and sex-balanced comparison groups between Familial and Sporadic individuals. The matching procedure reduced baseline imbalance in these two covariates to negligible levels, producing a well-aligned analytical cohort suitable for downstream analyses. This reproducible workflow demonstrates how propensity score methods can be applied to observational data to construct comparable groups and reduce bias from measured covariates; imbalance in unmeasured characteristics is not addressed by matching.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41-55.
Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39(1):33-38.
Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res. 2011;46(3):399-424.
Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med. 2014;33(6):1057-1069.
Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083-3107.