Exact matching using R - MatchIt package

Mark Bounthavong

30 September 2023

Introduction

Recently, I was asked to help create a matching algorithm for a retrospective cohort study. The request was to perform an exact match on a single variable using a 2 to 1 ratio (unexposed to exposed). Normally, I would use a propensity score matching (PSM) approach, but the data did not have enough variables for each unique subject. With PSM, I tend to build a logit (or probit) model using variables that are theoretically associated with treatment assignment. However, this approach requires enough observed variables to construct the PSM model. For this request, there were only a few variables for each subject; the only variables available were the unique identifier, site, and a continuous variable.

Fortunately, R has a package called MatchIt that makes it possible to perform an exact match on one or more variables. In this tutorial, I’ll review how to perform an exact match using the MatchIt package on a dataset with few variables.

Create a dataframe

For this tutorial, let’s create a dataframe with four variables for 30 hypothetical subjects: patientid, treatment, site, and age. The patientid is the unique identifier for the subject. The treatment variable is the binary assignment (0 = No, 1 = Yes). The site variable is a categorical variable that assigns the subject to one of five sites (1, 2, 3, 4, 5). The age variable is the subject’s age in years at study entry.

## Create data [We will create a hypothetical dataframe with four variables]

## Variables: patientid, treatment, site, and age

data1 <- data.frame(
  patientid = c(1:30),                           ## N = 30 unique id

  treatment = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,    ## treatment is 0 and 1
                0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1),

  site = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5,         ## sites = 1, 2, 3, 4, 5
           1, 2, 3, 4, 5, 1, 2, 3, 4, 5,
           1, 2, 3, 4, 5, 1, 2, 3, 4, 5),

  age = c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
          66, 87, 23, 56, 45, 22, 67, 45, 76, 34,
          23, 45, 68, 59, 74, 55, 38, 21, 34, 44)
  )

data1
##    patientid treatment site age
## 1          1         0    1  23
## 2          2         0    1  24
## 3          3         0    2  45
## 4          4         0    2  34
## 5          5         0    3  22
## 6          6         0    3  34
## 7          7         0    4  44
## 8          8         0    4  55
## 9          9         0    5  66
## 10        10         0    5  31
## 11        11         0    1  66
## 12        12         0    2  87
## 13        13         0    3  23
## 14        14         0    4  56
## 15        15         0    5  45
## 16        16         0    1  22
## 17        17         0    2  67
## 18        18         0    3  45
## 19        19         0    4  76
## 20        20         0    5  34
## 21        21         1    1  23
## 22        22         1    2  45
## 23        23         1    3  68
## 24        24         1    4  59
## 25        25         1    5  74
## 26        26         1    1  55
## 27        27         1    2  38
## 28        28         1    3  21
## 29        29         1    4  34
## 30        30         1    5  44
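Before matching, it can help to confirm that a 2 to 1 exact match on site is feasible. The sketch below rebuilds only the treatment and site columns from the dataframe above and cross-tabulates them; the name site_by_treatment is introduced here for illustration only.

```r
# Rebuild the treatment and site columns (same values as data1 above)
treatment <- rep(c(0, 1), times = c(20, 10))
site <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, rep(1:5, times = 4))

# Rows are sites, columns are treatment groups (0 = unexposed, 1 = exposed)
site_by_treatment <- table(site = site, treatment = treatment)
site_by_treatment
```

Each site has four unexposed and two exposed subjects, so every exposed subject can receive two unexposed matches from the same site.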

Randomize the dataframe

Once you have the dataframe created, let’s randomize the order of the subjects. This step may not be necessary, but because nearest neighbor matching can be sensitive to the order of the rows, shuffling adds a little randomness to the process.

We set a seed using set.seed so that we can reproduce this randomized order.

## Randomize the data frame
set.seed(455793)
data_random = data1[sample(1:nrow(data1), replace = FALSE), ]
data_random
##    patientid treatment site age
## 28        28         1    3  21
## 23        23         1    3  68
## 4          4         0    2  34
## 9          9         0    5  66
## 5          5         0    3  22
## 17        17         0    2  67
## 8          8         0    4  55
## 20        20         0    5  34
## 29        29         1    4  34
## 16        16         0    1  22
## 12        12         0    2  87
## 15        15         0    5  45
## 6          6         0    3  34
## 24        24         1    4  59
## 22        22         1    2  45
## 21        21         1    1  23
## 25        25         1    5  74
## 13        13         0    3  23
## 3          3         0    2  45
## 14        14         0    4  56
## 2          2         0    1  24
## 30        30         1    5  44
## 19        19         0    4  76
## 10        10         0    5  31
## 18        18         0    3  45
## 7          7         0    4  44
## 11        11         0    1  66
## 26        26         1    1  55
## 27        27         1    2  38
## 1          1         0    1  23
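As a quick sanity check, shuffling should reorder the rows without dropping or duplicating any subject, and re-running with the same seed should give the same order. Here is a minimal sketch that uses a stand-in vector of ids rather than the full dataframe; ids, shuffle_a, and shuffle_b are illustrative names.

```r
ids <- 1:30  # stand-in for data1$patientid

set.seed(455793)
shuffle_a <- sample(ids, replace = FALSE)

set.seed(455793)
shuffle_b <- sample(ids, replace = FALSE)

identical(shuffle_a, shuffle_b)   # same seed, same order
setequal(shuffle_a, ids)          # every id is still present
anyDuplicated(shuffle_a) == 0     # no id is repeated
```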

MatchIt package

We’ll need to install and load the MatchIt package.

Note: Once you install the MatchIt package the first time, you don’t need to re-install each time you initiate a new R session. But you will need to load it using the library command.

## Load libraries
## Step 1: Install the packages using the following code:
##         install.packages("MatchIt")
## Step 2: Once it's installed, load the library:
library("MatchIt")
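If the script will be shared with others, one option is to check whether the package is installed before loading it. This is a sketch of that pattern using base R’s requireNamespace; has_matchit is an illustrative name, and the install step is left commented out so nothing is downloaded unless you choose to run it.

```r
# TRUE if MatchIt is already installed, FALSE otherwise
has_matchit <- requireNamespace("MatchIt", quietly = TRUE)

if (!has_matchit) {
  message("MatchIt is not installed; run install.packages(\"MatchIt\") first.")
  # install.packages("MatchIt")
} else {
  library("MatchIt")
}
```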

Perform the exact match

To perform the exact match, we’ll need to use the matchit function. For this example, we’ll perform an exact match on the site variable, but we’ll also include age as another matching condition. Since age is a continuous variable, it will be difficult to perform an exact match on it, so we’ll use another approach called nearest neighbor matching.
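To build intuition for combining an exact match with a nearest neighbor match, here is a simplified base R sketch. It is not what MatchIt does internally: the helper greedy_match is hypothetical, it measures closeness on age directly, and it assigns both controls to a treated subject before moving on to the next one. It only illustrates the idea of restricting the candidate pool to the same site and then picking the closest remaining controls.

```r
# Same values as data1 above
data1 <- data.frame(
  patientid = 1:30,
  treatment = rep(c(0, 1), times = c(20, 10)),
  site = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, rep(1:5, times = 4)),
  age = c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
          66, 87, 23, 56, 45, 22, 67, 45, 76, 34,
          23, 45, 68, 59, 74, 55, 38, 21, 34, 44)
)

# Hypothetical helper, not part of MatchIt: greedy nearest neighbor matching
# on age, restricted to controls from the same site as the treated subject.
greedy_match <- function(df, ratio = 2) {
  df$subclass <- NA_integer_
  treated_rows <- which(df$treatment == 1)
  for (k in seq_along(treated_rows)) {
    i <- treated_rows[k]
    df$subclass[i] <- k
    for (r in seq_len(ratio)) {
      # Controls at the same site that have not been used yet
      pool <- which(df$treatment == 0 &
                      df$site == df$site[i] &
                      is.na(df$subclass))
      if (length(pool) == 0) break
      # Pick the control whose age is closest to the treated subject's age
      nearest <- pool[which.min(abs(df$age[pool] - df$age[i]))]
      df$subclass[nearest] <- k
    }
  }
  df
}

toy_match <- greedy_match(data1)
toy_match[order(toy_match$subclass), ]
```

Because every site has four unexposed and two exposed subjects, this toy version places all 30 subjects into ten sets of three. The sets it forms may differ from the ones MatchIt produces.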

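In Step 1, we need to perform the exact match on the site variable while using the nearest neighbor approach to bring in the age variable. Note that with method = "nearest", matchit by default measures closeness with a propensity score estimated by logistic regression from the variables in the formula (the distance column in the output below), rather than with age directly.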

We’ll use the as.factor function to let R know that the treatment variable is a factor variable.

There are several options that will be necessary: data, exact, method, and ratio.

The data option assigns the dataframe.

The exact option tells R that we want to perform an exact match on the site variable.

The method option tells R that we are using the nearest neighbor approach.

The ratio option tells R how many unexposed subjects to match to each exposed subject (here, 2).

We will create an object called m.out that contains the matching results. Then, we need to convert this to a dataframe for analysis (Step 2).

## Match on site
## Step 1: Perform exact match on site and nearest neighbor with age
##         with a 2:1 ratio (control to treatment)
m.out <- matchit(as.factor(treatment) ~ site + age,
                 data = data_random,
                 exact = "site",
                 method = "nearest",
                 ratio = 2)
summary(m.out)
## 
## Call:
## matchit(formula = as.factor(treatment) ~ site + age, data = data_random, 
##     method = "nearest", exact = "site", ratio = 2)
## 
## Summary of Balance for All Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3339         0.333          0.0649     0.7338    0.0839
## site            3.0000         3.000          0.0000     1.0556    0.0000
## age            46.1000        44.950          0.0642     0.8299    0.0500
##          eCDF Max
## distance     0.20
## site         0.00
## age          0.15
## 
## Summary of Balance for Matched Data:
##          Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance        0.3339         0.333          0.0649     0.7338    0.0839
## site            3.0000         3.000          0.0000     1.0556    0.0000
## age            46.1000        44.950          0.0642     0.8299    0.0500
##          eCDF Max Std. Pair Dist.
## distance     0.20          0.8969
## site         0.00          0.0000
## age          0.15          0.8350
## 
## Sample Sizes:
##           Control Treated
## All            20      10
## Matched        20      10
## Unmatched       0       0
## Discarded       0       0

## Step 2: Convert to a dataframe for analysis
data_match1 <- match.data(m.out)
data_match1
##    patientid treatment site age  distance weights subclass
## 28        28         1    3  21 0.3144375       1        1
## 23        23         1    3  68 0.3511401       1        6
## 4          4         0    2  34 0.3265547       1        9
## 9          9         0    5  66 0.3450543       1        5
## 5          5         0    3  22 0.3151968       1        1
## 17        17         0    2  67 0.3525929       1        3
## 8          8         0    4  55 0.3385693       1        2
## 20        20         0    5  34 0.3200638       1        5
## 29        29         1    4  34 0.3222199       1       10
## 16        16         0    1  22 0.3194817       1        4
## 12        12         0    2  87 0.3688232       1        9
## 15        15         0    5  45 0.3285473       1        7
## 6          6         0    3  34 0.3243835       1        6
## 24        24         1    4  59 0.3417291       1        2
## 22        22         1    2  45 0.3351255       1        3
## 21        21         1    1  23 0.3202474       1        4
## 25        25         1    5  74 0.3514450       1        5
## 13        13         0    3  23 0.3159570       1        1
## 3          3         0    2  45 0.3351255       1        3
## 14        14         0    4  56 0.3393579       1        2
## 2          2         0    1  24 0.3210141       1        8
## 30        30         1    5  44 0.3277713       1        7
## 19        19         0    4  76 0.3553133       1       10
## 10        10         0    5  31 0.3177703       1        7
## 18        18         0    3  45 0.3329255       1        6
## 7          7         0    4  44 0.3299541       1       10
## 11        11         0    1  66 0.3540484       1        8
## 26        26         1    1  55 0.3452449       1        8
## 27        27         1    2  38 0.3296584       1        9
## 1          1         0    1  23 0.3202474       1        4

The subclass column indicates the matching groups. For example, subclass = 5 indicates that the subjects assigned a subclass value of 5 are matched 2 to 1 (unexposed to exposed). For our example, these are subjects 9, 20, and 25 (subjects 9 and 20 are unexposed; subject 25 is exposed).
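One way to verify the matched sets is to count the exposed subjects, unexposed subjects, and distinct sites within each subclass. The sketch below does this for six rows copied from the output above (subclasses 1 and 5); check_matches and matched_subset are hypothetical names introduced for illustration.

```r
# Six rows copied from the match.data() output above (subclasses 1 and 5)
matched_subset <- data.frame(
  patientid = c(28, 5, 13, 9, 20, 25),
  treatment = c(1, 0, 0, 0, 0, 1),
  site      = c(3, 3, 3, 5, 5, 5),
  subclass  = c(1, 1, 1, 5, 5, 5)
)

# Hypothetical helper: summarize each matched set
check_matches <- function(df) {
  n_treated <- tapply(df$treatment == 1, df$subclass, sum)
  n_control <- tapply(df$treatment == 0, df$subclass, sum)
  n_sites <- tapply(df$site, df$subclass, function(s) length(unique(s)))
  data.frame(
    subclass  = names(n_treated),
    n_treated = as.vector(n_treated),
    n_control = as.vector(n_control),
    n_sites   = as.vector(n_sites)
  )
}

check_matches(matched_subset)
```

The same helper can be applied to the full data_match1 dataframe, where each of the ten subclasses should show one exposed subject, two unexposed subjects, and a single site.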

Testing the matching

We can see if the matching algorithm balanced the two groups with a t test. Since we have age as a continuous variable, we can check to see if age is balanced between the two groups. According to the t test, the average age of subjects in the unexposed group (treatment = 0) is 44.95 and the average age of subjects in the exposed group (treatment = 1) is 46.10; the p-value is 0.8742, which indicates that the difference in average ages between the two groups is not statistically significant at the 5% two-tailed alpha level.

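## Perform a t-test to compare the ages between the treatment and control groups
## (var.equal = FALSE requests the Welch test; all 30 subjects were matched,
##  so data_match1 contains the same ages as data1)
t.test(age ~ treatment, data = data_match1, var.equal = FALSE)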
## 
##  Welch Two Sample t-test
## 
## data:  age by treatment
## t = -0.16046, df = 19.721, p-value = 0.8742
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -16.11389  13.81389
## sample estimates:
## mean in group 0 mean in group 1 
##           44.95           46.10
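The standardized mean difference for age reported by summary(m.out) can also be reproduced by hand, which is a useful complement to the t test because it does not depend on sample size. The sketch below assumes that the difference in means is scaled by the standard deviation of the exposed group, which is my understanding of the MatchIt default; age_control, age_treated, and smd_age are illustrative names.

```r
# Ages copied from data1 above, split by treatment group
age_control <- c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
                 66, 87, 23, 56, 45, 22, 67, 45, 76, 34)
age_treated <- c(23, 45, 68, 59, 74, 55, 38, 21, 34, 44)

# Standardized mean difference, scaled by the exposed group's SD
# (assumed here to mirror the scaling used in the summary output)
smd_age <- (mean(age_treated) - mean(age_control)) / sd(age_treated)
round(smd_age, 4)
```

The result is about 0.064, in line with the Std. Mean Diff. of 0.0642 shown for age in the balance summary.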

Conclusions

The MatchIt package allows us to perform exact matching on a single variable, but it also offers the flexibility to include other types of variables. There are many other features of the MatchIt package that we will explore in the future, such as performing propensity score matching and using its other arguments.

Acknowledgements

Noah Greifer, one of the authors of the MatchIt package, wrote a great article on the MatchIt package that was invaluable in the development of this tutorial.

Work in progress

This is a work in progress and may be updated in the future. This work is only for educational purposes.