1 Background

Observational data often contain inherent differences between comparison groups because subjects are not assigned through randomization. Such baseline imbalances can introduce confounding, making direct comparisons unreliable if not properly controlled. Propensity score methods provide a statistical approach to address this issue by estimating the probability of group membership based on observed characteristics and creating balanced samples with similar covariate distributions.
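To make "estimating the probability of group membership" concrete, the propensity score is commonly estimated with a logistic regression of group on the observed covariates. The sketch below uses simulated data only: the `toy` data frame, its column names, and the coefficients used to generate it are assumptions made for illustration and are not the study data.

```r
# Toy data: hypothetical values, not the study dataset
set.seed(1)
n <- 500
toy <- data.frame(
  age  = round(runif(n, 5, 70)),
  male = rbinom(n, 1, 0.5)
)
# Assumed data-generating model: younger subjects are more often "treated"
toy$treated <- rbinom(n, 1, plogis(-1 - 0.03 * toy$age))

# Logistic regression of group membership on the covariates
ps_model <- glm(treated ~ age + male, data = toy, family = binomial())

# Fitted probabilities are the estimated propensity scores
toy$ps <- predict(ps_model, type = "response")
summary(toy$ps)
```

Each subject receives one score between 0 and 1, and matching then operates on this single number rather than on the covariates separately.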

Propensity score matching (PSM) is widely used to emulate key features of randomized study designs in real-world datasets. By pairing individuals with comparable propensity scores, PSM reduces bias attributable to measured variables and enables more valid group comparisons. Nearest-neighbour matching is one of the most commonly applied algorithms due to its simplicity, transparency, and effectiveness in achieving covariate balance.

In this analysis, the Familial and Sporadic groups differed greatly in size (87 versus 5,845 individuals with complete data), and age was not balanced between them at baseline. Age- and sex-based propensity score matching was therefore implemented to construct comparable groups and to ensure that downstream analyses are not driven by demographic imbalance.

2 Data and Study Design

This analysis used an observational dataset comparing individuals classified as Familial and Sporadic. Because group assignment was not randomized, propensity score matching based on age and sex was applied to reduce baseline differences and construct comparable groups prior to further analysis.

The dataset (n = 5,936) included a unique identifier (MRNO), age (years), sex (Gender_code), and group classification (dataset_code).

3 Objective of Analysis

The aim of this analysis was to construct age- and sex-balanced comparison groups between Familial and Sporadic individuals prior to downstream analyses.

4 Load Required Libraries

library(haven)      # Import and export SPSS (.sav) data
library(dplyr)      # Data cleaning and manipulation
library(tableone)   # Baseline comparison tables with SMDs (balance checking)
library(MatchIt)    # Propensity score matching

5 Data Import and Initial Inspection

# Import dataset
data <- read_sav("data/multiplex_18022026.sav")
# Preview the first rows of the dataset
head(data)

## # A tibble: 6 × 4
##   MRNO              Age Gender_code dataset_code
##   <chr>           <dbl> <dbl+lbl>   <dbl+lbl>   
## 1 APKRN0000001309    22 0 [female]  1 [familial]
## 2 APVWD0000001348    18 0 [female]  1 [familial]
## 3 APVWD0000001795     8 0 [female]  1 [familial]
## 4 KABVG0000001379    20 1 [male]    1 [familial]
## 5 KLCLT0000003378    15 0 [female]  1 [familial]
## 6 PYSAR0000010680    33 1 [male]    1 [familial]

# Check structure (variable types) 
str(data)

## tibble [5,936 × 4] (S3: tbl_df/tbl/data.frame)
##  $ MRNO        : chr [1:5936] "APKRN0000001309" "APVWD0000001348" "APVWD0000001795" "KABVG0000001379" ...
##   ..- attr(*, "format.spss")= chr "A16"
##   ..- attr(*, "display_width")= int 16
##  $ Age         : num [1:5936] 22 18 8 20 15 33 15 37 18 30 ...
##   ..- attr(*, "format.spss")= chr "F11.0"
##   ..- attr(*, "display_width")= int 11
##  $ Gender_code : dbl+lbl [1:5936] 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,...
##    ..@ format.spss  : chr "F8.0"
##    ..@ display_width: int 13
##    ..@ labels       : Named num [1:2] 0 1
##    .. ..- attr(*, "names")= chr [1:2] "female" "male"
##  $ dataset_code: dbl+lbl [1:5936] 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
##    ..@ format.spss  : chr "F8.0"
##    ..@ display_width: int 20
##    ..@ labels       : Named num [1:2] 0 1
##    .. ..- attr(*, "names")= chr [1:2] "sporadic" "familial"

# Check missing in each variable
colSums(is.na(data))

##         MRNO          Age  Gender_code dataset_code 
##            0            0            4            0

6 Data Cleaning

Records with missing sex are excluded, and the group and sex variables are converted to labelled factors.

data_clean <- data %>%
  filter(!is.na(Gender_code)) %>%
  mutate(
    dataset_code = factor(dataset_code, levels = c(0,1),
                          labels = c("Sporadic","Familial")),
    Gender_code  = factor(Gender_code, levels = c(0,1),
                          labels = c("Female","Male"))
  )

# Check missingness after
colSums(is.na(data_clean))

##         MRNO          Age  Gender_code dataset_code 
##            0            0            0            0

Records with missing sex information (n = 4) were excluded prior to matching because propensity score estimation requires complete covariate information, leaving 5,932 individuals for analysis. No missing data were observed for age or group classification.

7 Describe Data Before Matching

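Standardized mean differences (SMD) are used instead of hypothesis tests because balance is a property of the sample at hand, whereas p-values also depend on sample size. Balance diagnostics, rather than statistical significance, are therefore recommended for evaluating propensity score matching performance (Austin, 2009).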
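The dependence of p-values on sample size can be shown with a short calculation. The sketch below assumes two groups of equal size `n` with equal variances, so that the two-sample t statistic implied by an SMD of `d` is `d * sqrt(n / 2)`; the function name `p_from_smd` and the scenario values are illustrative assumptions.

```r
# Two-sample t-test p-value implied by a standardized mean difference `d`
# with `n` subjects per group (equal variances assumed; illustrative only)
p_from_smd <- function(d, n) {
  t_stat <- d * sqrt(n / 2)
  2 * pt(-abs(t_stat), df = 2 * n - 2)
}

d <- 0.10                       # same imbalance in both scenarios
p_small <- p_from_smd(d, 50)    # 50 subjects per group
p_large <- p_from_smd(d, 5000)  # 5,000 subjects per group
c(p_small = p_small, p_large = p_large)
```

The imbalance is identical in both scenarios, yet the first p-value is far from significant and the second is far below 0.05, which is why a significance test is a poor measure of balance.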

vars <- c("Age", "Gender_code")

table_before <- CreateTableOne(vars = vars,
                               strata = "dataset_code",
                               data = data_clean,
                               test = FALSE)

print(table_before, smd = TRUE)

##                         Stratified by dataset_code
##                          Sporadic      Familial      SMD   
##   n                       5845            87               
##   Age (mean (SD))        29.10 (11.23) 26.86 (11.69)  0.195
##   Gender_code = Male (%)  3156 (54.0)     46 (52.9)   0.022

Baseline comparability between familial and sporadic groups is assessed using standardized mean differences (SMD). An SMD greater than 0.10 is considered indicative of meaningful imbalance. Prior to matching, age shows evidence of imbalance (SMD = 0.195), while sex is already well balanced (SMD = 0.022).
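As a check on how these values arise, the SMDs can be recomputed by hand from the summary statistics in the table above. The sketch assumes the usual definition in which the difference is divided by the square root of the average of the two group variances; the helper names `smd_continuous` and `smd_binary` are introduced here for illustration and are not tableone functions.

```r
# SMD for a continuous variable: mean difference over the pooled SD
smd_continuous <- function(m1, s1, m2, s2) {
  abs(m1 - m2) / sqrt((s1^2 + s2^2) / 2)
}

# SMD for a binary variable, from the two group proportions
smd_binary <- function(p1, p2) {
  abs(p1 - p2) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
}

# Values copied from the table above (Sporadic vs Familial)
smd_age <- smd_continuous(29.10, 11.23, 26.86, 11.69)
smd_sex <- smd_binary(3156 / 5845, 46 / 87)
round(c(Age = smd_age, Male = smd_sex), 3)
```

Under these assumptions the results round to 0.195 for age and 0.022 for sex, in agreement with the printed table.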

8 Perform Propensity Score Matching (Age + Sex)

Age- and sex-matched analysis between familial and sporadic groups was performed using 1:1 nearest-neighbour propensity score matching without replacement. This approach pairs each treated subject with the control subject whose propensity score is closest, ensuring that each control is used at most once.

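In the matching framework, the Familial group is designated the treated group, and each Familial individual is matched to one Sporadic individual (control). With dataset_code coded as a factor with levels Sporadic and Familial, matchit() treats Familial as the treated group, as the sample sizes reported in the balance summary confirm.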

set.seed(2026)

match_model <- matchit(dataset_code ~ Age + Gender_code,
                       data = data_clean,
                       method = "nearest",
                       ratio = 1,
                       replace = FALSE)
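To make explicit what 1:1 nearest-neighbour matching without replacement does, the following base R sketch implements a simple greedy version on hypothetical propensity scores. It is not MatchIt's implementation: the function `greedy_match`, the processing order (treated subjects in the order given), and the scores are assumptions for illustration, and the sketch presumes there are at least as many controls as treated subjects.

```r
# Greedy 1:1 nearest-neighbour matching on a score, without replacement.
# Illustrative only: MatchIt's implementation has more options and safeguards.
greedy_match <- function(ps_treated, ps_control) {
  available <- rep(TRUE, length(ps_control))
  matched   <- integer(length(ps_treated))
  for (i in seq_along(ps_treated)) {
    dist <- abs(ps_control - ps_treated[i])
    dist[!available] <- Inf   # controls already used cannot be reused
    j <- which.min(dist)      # closest remaining control
    matched[i]   <- j
    available[j] <- FALSE
  }
  matched
}

# Hypothetical propensity scores
ps_treated <- c(0.30, 0.32, 0.80)
ps_control <- c(0.31, 0.10, 0.75, 0.35, 0.60)
greedy_match(ps_treated, ps_control)
```

The result gives, for each treated subject, the index of its matched control. Here the second treated subject (0.32) is closest to the first control (0.31), but that control has already been taken, so it is paired with the fourth control (0.35) instead. This is the practical meaning of "each control is used at most once".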

9 Extract Matched Dataset

matched_data <- match.data(match_model)

table(matched_data$dataset_code)

## 
## Sporadic Familial 
##       87       87

10 Assessment of Matching Quality

summary(match_model)

## 
## Call:
## matchit(formula = dataset_code ~ Age + Gender_code, data = data_clean, 
##     method = "nearest", replace = FALSE, ratio = 1)
## 
## Summary of Balance for All Data:
##                   Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                 0.0153        0.0147          0.1987     1.0875
## Age                     26.8621       29.1011         -0.1916     1.0834
## Gender_codeFemale        0.4713        0.4601          0.0225          .
## Gender_codeMale          0.5287        0.5399         -0.0225          .
##                   eCDF Mean eCDF Max
## distance             0.0485   0.1400
## Age                  0.0467   0.1400
## Gender_codeFemale    0.0112   0.0112
## Gender_codeMale      0.0112   0.0112
## 
## Summary of Balance for Matched Data:
##                   Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance                 0.0153        0.0153               0          1
## Age                     26.8621       26.8621               0          1
## Gender_codeFemale        0.4713        0.4713               0          .
## Gender_codeMale          0.5287        0.5287               0          .
##                   eCDF Mean eCDF Max Std. Pair Dist.
## distance                  0        0               0
## Age                       0        0               0
## Gender_codeFemale         0        0               0
## Gender_codeMale           0        0               0
## 
## Sample Sizes:
##           Control Treated
## All          5845      87
## Matched        87      87
## Unmatched    5758       0
## Discarded       0       0

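Matching quality was evaluated using standardized mean differences before and after matching, with an SMD < 0.10 taken as evidence of adequate covariate balance. After matching, the standardized mean differences for both age and sex were zero, and the standardized pair distances were also zero, meaning that every Familial individual was paired with a Sporadic individual of the same age and sex. Such exact balance is achievable here because the pool of potential controls (n = 5,845) is large relative to the treated group (n = 87) and only two covariates enter the model.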
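The pre-matching age SMD reported by summary(match_model) (-0.1916) differs slightly from the tableone value (0.195). This is consistent with MatchIt standardizing by the treated-group standard deviation under its default ATT estimand, while tableone pools the two group standard deviations. The arithmetic below uses rounded values copied from the outputs above, so it is a rough check rather than an exact reproduction.

```r
# Rounded values copied from the outputs above
mean_familial <- 26.8621
mean_sporadic <- 29.1011
sd_familial   <- 11.69
sd_sporadic   <- 11.23

# Standardized by the treated (Familial) SD
smd_treated_sd <- (mean_familial - mean_sporadic) / sd_familial

# Standardized by the pooled SD
smd_pooled_sd <- (mean_familial - mean_sporadic) /
  sqrt((sd_familial^2 + sd_sporadic^2) / 2)

round(c(treated_sd = smd_treated_sd, pooled_sd = smd_pooled_sd), 3)
```

The sign is negative simply because the Familial group is younger on average; tableone reports the absolute value.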

11 Final Matched Sample

We confirm that covariate balance has been achieved in the matched sample.

table_after <- CreateTableOne(vars = vars,
                              strata = "dataset_code",
                              data = matched_data,
                              test = FALSE)

print(table_after, smd = TRUE)

##                         Stratified by dataset_code
##                          Sporadic      Familial      SMD   
##   n                         87            87               
##   Age (mean (SD))        26.86 (11.69) 26.86 (11.69) <0.001
##   Gender_code = Male (%)    46 (52.9)     46 (52.9)  <0.001

The matching procedure produced 87 matched pairs (n = 174 individuals). Unmatched sporadic individuals were excluded from subsequent analyses, as propensity score matching constructs an analytical sample restricted to comparable subjects.
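Pair membership is recorded in the `subclass` column that match.data() adds to the matched dataset, so the pairs can be inspected directly. The sketch below runs on a small hypothetical stand-in (`toy_matched`, with invented values) rather than on the study data; the same code should apply to matched_data if its columns carry the names used here.

```r
# Hypothetical stand-in for the output of match.data(): two matched pairs
toy_matched <- data.frame(
  subclass     = factor(c(1, 1, 2, 2)),
  dataset_code = c("Familial", "Sporadic", "Familial", "Sporadic"),
  Age          = c(22, 22, 15, 16),
  Gender_code  = c("Female", "Female", "Male", "Male")
)

# Within each pair: absolute age gap and whether sex agrees
pairs <- split(toy_matched, toy_matched$subclass)
pair_check <- do.call(rbind, lapply(pairs, function(p) {
  data.frame(
    subclass = p$subclass[1],
    age_gap  = abs(diff(p$Age)),
    same_sex = length(unique(p$Gender_code)) == 1
  )
}))
pair_check
```

In the toy data the first pair is an exact match and the second differs by one year of age. A table of this kind makes it easy to spot any pair whose members are further apart than is acceptable.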

12 Visual Assessment of Matching

This graph serves to answer a single scientific question: Are the Familial and Sporadic groups comparable after matching?

plot(summary(match_model), var.order = "unmatched")

Covariate balance improved substantially after matching, with all standardized mean differences reduced to approximately zero, indicating successful alignment of age and sex distributions between groups.

13 Export Final Matched Dataset

dir.create("output", showWarnings = FALSE)  # ensure the output folder exists
write_sav(matched_data, "output/Age_Sex_Matched_Dataset.sav")

14 Conclusion

Propensity score matching constructed age- and sex-balanced comparison groups of Familial and Sporadic individuals. Matching reduced the baseline imbalance in age (SMD = 0.195 before matching) to zero and produced 87 matched pairs suitable for downstream analyses. Balance was achieved only on the two measured covariates, age and sex, so comparisons may still be affected by unmeasured confounders. The workflow is reproducible and illustrates how propensity score methods can make groups drawn from observational data comparable on measured characteristics.

15 References

Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41-55.
Rosenbaum PR, Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat. 1985;39(1):33-38.
Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res. 2011;46(3):399-424.
Austin PC. A comparison of 12 algorithms for matching on the propensity score. Stat Med. 2014;33(6):1057-1069.
Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083-3107.

Age and Sex Matching Between Familial and Sporadic Groups

TCHAKONDO Samadou

2026-02-23