Introduction
Recently, I was asked to help create a matching algorithm for a retrospective cohort study. The request was to perform an exact match on a single variable using a 2 to 1 ratio (unexposed to exposed). Normally, I would use a propensity score matching (PSM) approach, but the data did not have enough variables for each subject. With PSM, I tend to build a logit (or probit) model using variables that are theoretically associated with the treatment assignment. However, this approach requires enough observed variables to construct the model. For this request, there were few variables for each subject; the only variables available were the unique identifier, site, and a continuous variable.
Fortunately, R has a package called MatchIt that makes
it possible to perform an exact match on one or more variables. In this
tutorial, I’ll review how to perform an exact match using the
MatchIt package on a dataset with few variables.
Create a dataframe
For this tutorial, let’s create a dataframe with several variables
for 30 hypothetical subjects. This dataframe will contain the following
variables: patientid, treatment,
site, and age. The patientid is
the unique identifier for the subject. The treatment
variable is the binary assignment (0 = No, 1 = Yes). The
site variable is a categorical variable that assigns
each subject to one of five sites (1, 2, 3, 4, 5). The age
variable is the subject’s age in years at study entry.
## Create data [We will create a hypothetical dataframe with four variables]
## Variables: patientid, treatment, site, and age
data1 <- data.frame(
patientid = c(1:30), ## N = 30 unique id
treatment = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ## treatment is 0 and 1
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
site = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, ## sites = 1, 2, 3, 4, 5
1, 2, 3, 4, 5, 1, 2, 3, 4, 5,
1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
age = c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
66, 87, 23, 56, 45, 22, 67, 45, 76, 34,
23, 45, 68, 59, 74, 55, 38, 21, 34, 44)
)
data1
## patientid treatment site age
## 1 1 0 1 23
## 2 2 0 1 24
## 3 3 0 2 45
## 4 4 0 2 34
## 5 5 0 3 22
## 6 6 0 3 34
## 7 7 0 4 44
## 8 8 0 4 55
## 9 9 0 5 66
## 10 10 0 5 31
## 11 11 0 1 66
## 12 12 0 2 87
## 13 13 0 3 23
## 14 14 0 4 56
## 15 15 0 5 45
## 16 16 0 1 22
## 17 17 0 2 67
## 18 18 0 3 45
## 19 19 0 4 76
## 20 20 0 5 34
## 21 21 1 1 23
## 22 22 1 2 45
## 23 23 1 3 68
## 24 24 1 4 59
## 25 25 1 5 74
## 26 26 1 1 55
## 27 27 1 2 38
## 28 28 1 3 21
## 29 29 1 4 34
## 30 30 1 5 44
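Before matching, it can help to confirm that every site has enough unexposed subjects to support a 2 to 1 match. The sketch below rebuilds the same dataframe in a more compact form (the rep() calls are only a shorthand for the vectors typed out above) and cross-tabulates site by treatment with the base R table function. The object names site_counts and enough_controls are mine, chosen for illustration.

```r
## Rebuild the tutorial dataframe (same values as above, written compactly)
data1 <- data.frame(
  patientid = 1:30,
  treatment = rep(c(0, 1), times = c(20, 10)),
  site      = c(rep(1:5, each = 2), rep(1:5, times = 4)),
  age       = c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
                66, 87, 23, 56, 45, 22, 67, 45, 76, 34,
                23, 45, 68, 59, 74, 55, 38, 21, 34, 44)
)

## Count unexposed (0) and exposed (1) subjects within each site
site_counts <- table(site = data1$site, treatment = data1$treatment)
site_counts

## A 2:1 exact match on site needs at least two unexposed per exposed subject
enough_controls <- site_counts[, "0"] >= 2 * site_counts[, "1"]
all(enough_controls)
```

Each site has four unexposed and two exposed subjects, so every exposed subject can receive two unexposed matches from the same site.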
Randomize the dataframe
Once the dataframe is created, let’s randomize the order of the subjects. This step may not be necessary, but if we use a nearest neighbor approach, shuffling the rows first keeps the original row order from influencing which subjects are matched.
We set a seed using set.seed so that we can reproduce
this randomized order.
## Randomize the data frame
set.seed(455793)
data_random <- data1[sample(1:nrow(data1), replace = FALSE), ]
data_random
## patientid treatment site age
## 28 28 1 3 21
## 23 23 1 3 68
## 4 4 0 2 34
## 9 9 0 5 66
## 5 5 0 3 22
## 17 17 0 2 67
## 8 8 0 4 55
## 20 20 0 5 34
## 29 29 1 4 34
## 16 16 0 1 22
## 12 12 0 2 87
## 15 15 0 5 45
## 6 6 0 3 34
## 24 24 1 4 59
## 22 22 1 2 45
## 21 21 1 1 23
## 25 25 1 5 74
## 13 13 0 3 23
## 3 3 0 2 45
## 14 14 0 4 56
## 2 2 0 1 24
## 30 30 1 5 44
## 19 19 0 4 76
## 10 10 0 5 31
## 18 18 0 3 45
## 7 7 0 4 44
## 11 11 0 1 66
## 26 26 1 1 55
## 27 27 1 2 38
## 1 1 0 1 23
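To see what set.seed buys us, the short sketch below shuffles a small stand-in dataframe twice with the same seed and confirms that both runs give the same order. The toy dataframe and the shuffle_rows helper are hypothetical names used only for this illustration.

```r
## Hypothetical helper: set the seed, then return the rows in random order
shuffle_rows <- function(df, seed) {
  set.seed(seed)
  df[sample(nrow(df)), , drop = FALSE]
}

## Small stand-in dataframe (not the tutorial data)
toy <- data.frame(id = 1:6, site = c(1, 1, 2, 2, 3, 3))

run1 <- shuffle_rows(toy, seed = 455793)
run2 <- shuffle_rows(toy, seed = 455793)

identical(run1, run2)        ## same seed, same order
setequal(run1$id, toy$id)    ## every original id is still present
```

Without the call to set.seed, each run would generally produce a different order, and the matched sets further down could not be reproduced exactly.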
MatchIt package
We’ll need to install and load the MatchIt package.
Note: Once you install the MatchIt package the first
time, you don’t need to re-install each time you initiate a new R
session. But you will need to load it using the library command.
## Load libraries
## Step 1: Install the packages using the following code:
## install.packages("MatchIt")
## Step 2: Once it's installed, load the library:
library("MatchIt")
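If you share the script with others, one common pattern is to check whether the package is available before loading it. The sketch below uses the base R function requireNamespace, which returns TRUE or FALSE without attaching the package. It only prints a reminder and does not install anything; the has_matchit name is my own.

```r
## Check for MatchIt without attaching it
has_matchit <- requireNamespace("MatchIt", quietly = TRUE)

if (has_matchit) {
  library("MatchIt")
} else {
  message("MatchIt is not installed. Run install.packages(\"MatchIt\") once, then re-run.")
}
```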
Perform the exact match
To perform the exact match, we’ll need to use the
matchit function. For this example, we’ll perform an exact
match on the site variable, but we’ll also include
age as another matching condition. Since age
is a continuous variable, an exact match on it would be difficult, so
we’ll use another approach called the nearest neighbor approach.
In Step 1, we need to perform the exact match on the
site variable while using the nearest neighbor approach for
the age variable.
We’ll use the as.factor function to let R know that the
treatment variable is a factor variable.
There are several options that will be necessary: data,
exact, method, and
ratio.
The data option assigns the dataframe.
The exact option tells R that we want to perform an
exact match on the site variable.
The method option tells R that we are using the nearest
neighbor approach.
The ratio option tells R how many unexposed subjects to
match to each exposed subject (2 in this example).
We will create an object called m.out that contains the
matching results. Then, we need to convert this to a dataframe for
analysis (Step 2).
## Match on site
## Step 1: Perform exact match in site and nearest neighbor with age
## with a 2:1 ratio (control to treatment)
m.out <- matchit(as.factor(treatment) ~ site + age,
data = data_random,
exact = "site",
method = "nearest",
ratio = 2)
summary(m.out)
##
## Call:
## matchit(formula = as.factor(treatment) ~ site + age, data = data_random,
## method = "nearest", exact = "site", ratio = 2)
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3339 0.333 0.0649 0.7338 0.0839
## site 3.0000 3.000 0.0000 1.0556 0.0000
## age 46.1000 44.950 0.0642 0.8299 0.0500
## eCDF Max
## distance 0.20
## site 0.00
## age 0.15
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3339 0.333 0.0649 0.7338 0.0839
## site 3.0000 3.000 0.0000 1.0556 0.0000
## age 46.1000 44.950 0.0642 0.8299 0.0500
## eCDF Max Std. Pair Dist.
## distance 0.20 0.8969
## site 0.00 0.0000
## age 0.15 0.8350
##
## Sample Sizes:
## Control Treated
## All 20 10
## Matched 20 10
## Unmatched 0 0
## Discarded 0 0
## Step 2: Convert to a dataframe for analysis
data_match1 <- match.data(m.out)
data_match1
## patientid treatment site age distance weights subclass
## 28 28 1 3 21 0.3144375 1 1
## 23 23 1 3 68 0.3511401 1 6
## 4 4 0 2 34 0.3265547 1 9
## 9 9 0 5 66 0.3450543 1 5
## 5 5 0 3 22 0.3151968 1 1
## 17 17 0 2 67 0.3525929 1 3
## 8 8 0 4 55 0.3385693 1 2
## 20 20 0 5 34 0.3200638 1 5
## 29 29 1 4 34 0.3222199 1 10
## 16 16 0 1 22 0.3194817 1 4
## 12 12 0 2 87 0.3688232 1 9
## 15 15 0 5 45 0.3285473 1 7
## 6 6 0 3 34 0.3243835 1 6
## 24 24 1 4 59 0.3417291 1 2
## 22 22 1 2 45 0.3351255 1 3
## 21 21 1 1 23 0.3202474 1 4
## 25 25 1 5 74 0.3514450 1 5
## 13 13 0 3 23 0.3159570 1 1
## 3 3 0 2 45 0.3351255 1 3
## 14 14 0 4 56 0.3393579 1 2
## 2 2 0 1 24 0.3210141 1 8
## 30 30 1 5 44 0.3277713 1 7
## 19 19 0 4 76 0.3553133 1 10
## 10 10 0 5 31 0.3177703 1 7
## 18 18 0 3 45 0.3329255 1 6
## 7 7 0 4 44 0.3299541 1 10
## 11 11 0 1 66 0.3540484 1 8
## 26 26 1 1 55 0.3452449 1 8
## 27 27 1 2 38 0.3296584 1 9
## 1 1 0 1 23 0.3202474 1 4
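The distance column holds the estimated propensity score that the nearest neighbor step uses to pick the closest unexposed subjects within each site. As I understand the MatchIt defaults, this score comes from a logistic regression of the treatment on the variables in the formula, so the sketch below fits that model directly with the base R glm function. Treat it as an illustration of where the column comes from, under that assumption, rather than a reproduction of MatchIt’s internals. The ps_model and ps names are my own.

```r
## Rebuild the tutorial dataframe (same values as data1 above)
data1 <- data.frame(
  patientid = 1:30,
  treatment = rep(c(0, 1), times = c(20, 10)),
  site      = c(rep(1:5, each = 2), rep(1:5, times = 4)),
  age       = c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
                66, 87, 23, 56, 45, 22, 67, 45, 76, 34,
                23, 45, 68, 59, 74, 55, 38, 21, 34, 44)
)

## Logistic regression of treatment on site and age
## (assumption: this mirrors the default propensity score model)
ps_model <- glm(treatment ~ site + age, data = data1, family = binomial())

## Fitted probability of being exposed, one value per subject
data1$ps <- fitted(ps_model)
head(data1[order(data1$site, data1$ps), ])

## Average score in each group, comparable to the "distance" row of summary(m.out)
tapply(data1$ps, data1$treatment, mean)
```

Subjects with the same site and age receive the same score, which is why subjects 1 and 21 share a distance value in the output above.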
The subclass column indicates the matched sets. For
example, the subjects assigned a subclass value of 5 are
matched to one another 2
to 1 (unexposed to exposed). For our example, these are subjects 9, 20,
and 25.
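A quick way to confirm that the matched sets look as expected is to count the exposed and unexposed subjects in each subclass and check that each set comes from a single site. The sketch below does this with base R on six rows copied from the output above (subclasses 1 and 5). The check_matched_sets helper is a hypothetical name; it should work the same way on the full data_match1 dataframe.

```r
## Hypothetical helper: summarize each matched set
check_matched_sets <- function(matched) {
  n_treated <- tapply(matched$treatment == 1, matched$subclass, sum)
  n_control <- tapply(matched$treatment == 0, matched$subclass, sum)
  n_sites   <- tapply(matched$site, matched$subclass,
                      function(s) length(unique(s)))
  data.frame(
    subclass  = names(n_treated),
    n_treated = as.vector(n_treated),
    n_control = as.vector(n_control),
    one_site  = as.vector(n_sites) == 1
  )
}

## Rows for subclasses 1 and 5, copied from the matched output above
example_sets <- data.frame(
  patientid = c(28, 5, 13, 9, 20, 25),
  treatment = c(1, 0, 0, 0, 0, 1),
  site      = c(3, 3, 3, 5, 5, 5),
  subclass  = c(1, 1, 1, 5, 5, 5)
)

set_summary <- check_matched_sets(example_sets)
set_summary
```

For a 2 to 1 exact match on site, every row of the summary should show one exposed subject, two unexposed subjects, and one_site equal to TRUE.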
Testing the matching
We can see if the matching algorithm balanced the two groups with a t
test. Since we have age as a continuous variable, we can
check to see if age is balanced between the two groups.
According to the t test, the average age of subjects in the unexposed
group (treatment = 0) is 44.95 and the average age of
subjects in the exposed group (treatment = 1) is 46.10; the
p-value is 0.8742, which indicates that the difference in average ages
between the two groups is not statistically significant at the 5%
two-tailed alpha level.
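## Perform a Welch t-test to compare the ages between the treatment and control groups
## in the matched data (all 30 subjects were matched, so the result is the same as with data1)
t.test(age ~ treatment, data = data_match1, var.equal = FALSE)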
##
## Welch Two Sample t-test
##
## data: age by treatment
## t = -0.16046, df = 19.721, p-value = 0.8742
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -16.11389 13.81389
## sample estimates:
## mean in group 0 mean in group 1
## 44.95 46.10
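The Std. Mean Diff. column of summary(m.out) gives another view of the same comparison. The sketch below computes it for age by hand, assuming the difference in means is divided by the standard deviation of the exposed group. That assumption reproduces the 0.0642 shown for age in the summary output above. The variable names are my own.

```r
## Ages from the tutorial dataframe, split by treatment group
age_unexposed <- c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
                   66, 87, 23, 56, 45, 22, 67, 45, 76, 34)
age_exposed   <- c(23, 45, 68, 59, 74, 55, 38, 21, 34, 44)

## Difference in means (exposed minus unexposed)
mean_diff <- mean(age_exposed) - mean(age_unexposed)

## Standardize by the exposed-group standard deviation (assumption, see above)
smd_age <- mean_diff / sd(age_exposed)
round(smd_age, 4)
```

Unlike the p-value, this quantity does not shrink or grow with the sample size, which makes it convenient for comparing balance before and after matching.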
Conclusions
The MatchIt package allows us to perform exact matching
on a single variable, but it also gives us the flexibility to include other
types of variables. There are many other features of the
MatchIt package that we will explore in the future, such as
performing propensity score matching and using its other arguments.
Acknowledgements
Noah Greifer, one of the authors of the MatchIt package, wrote
a great article
on the MatchIt package that was invaluable in the
development of this tutorial.
Work in progress
This is a work in progress and may be updated in the future. This work is only for educational purposes.