Introduction
Recently, I was asked to help create a matching algorithm for a retrospective cohort study. The request was to perform an exact match on a single variable using a 2:1 ratio (unexposed to exposed). Normally, I would use a propensity score matching (PSM) approach, but the data did not have enough variables for each subject. With PSM, I tend to build a logit (or probit) model using variables that are theoretically associated with treatment assignment, which requires enough observed variables to construct the model. For this request, there were only a few variables for each subject: the unique identifier, site, and a continuous variable.
Fortunately, R has a package called MatchIt that makes it possible to perform an exact match on one or more variables. In this tutorial, I’ll review how to perform an exact match using the MatchIt package on a dataset with few variables.
Create a dataframe
For this tutorial, let’s create a dataframe with four variables for 30 hypothetical subjects. This dataframe will contain the following variables: patientid, treatment, site, and age. The patientid variable is the unique identifier for the subject. The treatment variable is the binary assignment (0 = No, 1 = Yes). The site variable is a categorical variable that assigns the subject to one of five sites (1, 2, 3, 4, 5). The age variable is the age in years for the subject upon study entry.
## Create data [We will create a hypothetical dataframe with four variables]
## Variables: patientid, treatment, site, and age
data1 <- data.frame(
patientid = c(1:30), ## N = 30 unique id
treatment = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ## treatment is 0 and 1
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
site = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, ## sites = 1, 2, 3, 4, 5
1, 2, 3, 4, 5, 1, 2, 3, 4, 5,
1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
age = c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
66, 87, 23, 56, 45, 22, 67, 45, 76, 34,
23, 45, 68, 59, 74, 55, 38, 21, 34, 44)
)
data1
## patientid treatment site age
## 1 1 0 1 23
## 2 2 0 1 24
## 3 3 0 2 45
## 4 4 0 2 34
## 5 5 0 3 22
## 6 6 0 3 34
## 7 7 0 4 44
## 8 8 0 4 55
## 9 9 0 5 66
## 10 10 0 5 31
## 11 11 0 1 66
## 12 12 0 2 87
## 13 13 0 3 23
## 14 14 0 4 56
## 15 15 0 5 45
## 16 16 0 1 22
## 17 17 0 2 67
## 18 18 0 3 45
## 19 19 0 4 76
## 20 20 0 5 34
## 21 21 1 1 23
## 22 22 1 2 45
## 23 23 1 3 68
## 24 24 1 4 59
## 25 25 1 5 74
## 26 26 1 1 55
## 27 27 1 2 38
## 28 28 1 3 21
## 29 29 1 4 34
## 30 30 1 5 44
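Before matching, it can help to confirm that a 2:1 exact match on site is feasible at all. The sketch below rebuilds only the treatment and site columns (the rep calls are my own shorthand for the vectors typed out above, not part of the original code) and cross-tabulates them. Each site should have at least twice as many unexposed subjects as exposed subjects.

```r
## Rebuild the treatment and site columns of data1 (shorthand for the vectors above)
treatment <- rep(c(0, 1), times = c(20, 10))
site <- c(rep(1:5, each = 2), rep(1:5, times = 4))

## Cross-tabulate treatment by site
counts <- table(treatment = treatment, site = site)
counts

## A 2:1 exact match on site needs at least 2 controls per treated subject in every site
feasible <- all(counts["0", ] >= 2 * counts["1", ])
feasible
```

In this dataset every site has four unexposed and two exposed subjects, so no exposed subject should be left without two candidates at the same site.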
Randomize the dataframe
Once you have the dataframe created, let’s randomize the order of the subjects. This process may not be necessary, but if we use a nearest neighbor approach, then this will add a little randomization to the chaos.
We set a seed using set.seed so that we can reproduce this randomized order.
## Randomize the data frame
set.seed(455793)
data_random <- data1[sample(1:nrow(data1), replace = FALSE), ]
data_random
## patientid treatment site age
## 28 28 1 3 21
## 23 23 1 3 68
## 4 4 0 2 34
## 9 9 0 5 66
## 5 5 0 3 22
## 17 17 0 2 67
## 8 8 0 4 55
## 20 20 0 5 34
## 29 29 1 4 34
## 16 16 0 1 22
## 12 12 0 2 87
## 15 15 0 5 45
## 6 6 0 3 34
## 24 24 1 4 59
## 22 22 1 2 45
## 21 21 1 1 23
## 25 25 1 5 74
## 13 13 0 3 23
## 3 3 0 2 45
## 14 14 0 4 56
## 2 2 0 1 24
## 30 30 1 5 44
## 19 19 0 4 76
## 10 10 0 5 31
## 18 18 0 3 45
## 7 7 0 4 44
## 11 11 0 1 66
## 26 26 1 1 55
## 27 27 1 2 38
## 1 1 0 1 23
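As a quick illustration of why the seed matters, here is a minimal sketch that shuffles the row numbers 1 to 30 twice with the same seed. The object names order_a and order_b are only for illustration.

```r
## Shuffle the row numbers twice, resetting the seed each time
set.seed(455793)
order_a <- sample(1:30, replace = FALSE)
set.seed(455793)
order_b <- sample(1:30, replace = FALSE)

## Same seed, same order
identical(order_a, order_b)

## Every row number appears exactly once, so no subject is lost or duplicated
identical(sort(order_a), 1:30)
```

Both checks should return TRUE. Without the set.seed call, each run would produce a different order.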
MatchIt package
We’ll need to install and load the MatchIt package.
Note: Once you install the MatchIt package the first time, you don’t need to re-install it each time you start a new R session. But you will need to load it using the library command.
## Load libraries
## Step 1: Install the packages using the following code:
## install.packages("MatchIt")
## Step 2: Once it's installed, load the library:
library("MatchIt")
Perform the exact match
To perform the exact match, we’ll need to use the matchit function. For this example, we’ll perform an exact match on the site variable, but we’ll also include age as another matching condition. Since age is a continuous variable, it will be difficult to perform an exact match on it, so we’ll use another approach called the nearest neighbor approach.
In Step 1, we need to perform the exact match on the site variable while using the nearest neighbor approach for the age variable. Note that with method = "nearest", matchit by default matches on a distance measure (a propensity score estimated from the variables in the formula, here site and age) rather than on age directly; this is the distance column in the output below.
We’ll use the as.factor function to let R know that the treatment variable is a factor variable.
There are several options that will be necessary: data, exact, method, and ratio.
The data option assigns the dataframe.
The exact option tells R that we want to perform an exact match on the site variable.
The method option tells R that we are using the nearest neighbor approach.
The ratio option tells R how many unexposed subjects to match to each exposed subject (here, 2).
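To make the idea concrete, here is a rough base R sketch of "exact on site, nearest on age" using a toy dataset with hypothetical values. This is not MatchIt's actual algorithm (which, among other things, matches on an estimated distance measure); the greedy_match function and the toy dataframe are my own illustration of the general logic.

```r
## Toy data: one site, two exposed subjects, four unexposed subjects (hypothetical values)
toy <- data.frame(
  id        = 1:6,
  treatment = c(1, 1, 0, 0, 0, 0),
  site      = c(1, 1, 1, 1, 1, 1),
  age       = c(30, 60, 28, 33, 58, 65)
)

## Greedy matching without replacement: within each site, give each exposed
## subject the `ratio` unexposed subjects closest in age
greedy_match <- function(df, ratio = 2) {
  pairs <- data.frame(treated_id = integer(0), control_id = integer(0))
  for (s in unique(df$site)) {
    treated  <- df[df$site == s & df$treatment == 1, ]
    controls <- df[df$site == s & df$treatment == 0, ]
    for (i in seq_len(nrow(treated))) {
      if (nrow(controls) == 0) break
      ## Rank the remaining controls by absolute age difference
      ord  <- order(abs(controls$age - treated$age[i]))
      pick <- head(ord, ratio)
      pairs <- rbind(pairs, data.frame(treated_id = treated$id[i],
                                       control_id = controls$id[pick]))
      ## Remove the chosen controls so they cannot be reused
      controls <- controls[-pick, ]
    }
  }
  pairs
}

matches <- greedy_match(toy, ratio = 2)
matches
```

In this toy example, subject 1 (age 30) is paired with subjects 3 and 4, and subject 2 (age 60) with subjects 5 and 6. In practice, the matchit function handles this for us, along with ties, ordering, and the balance summaries.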
We will create an object called m.out that contains the matching results. Then, we need to convert this to a dataframe for analysis (Step 2).
## Match on site
## Step 1: Perform exact match on site and nearest neighbor with age
## with a 2:1 ratio (control to treatment)
m.out <- matchit(as.factor(treatment) ~ site + age,
data = data_random,
exact = "site",
method = "nearest",
ratio = 2)
summary(m.out)
##
## Call:
## matchit(formula = as.factor(treatment) ~ site + age, data = data_random,
## method = "nearest", exact = "site", ratio = 2)
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3339 0.333 0.0649 0.7338 0.0839
## site 3.0000 3.000 0.0000 1.0556 0.0000
## age 46.1000 44.950 0.0642 0.8299 0.0500
## eCDF Max
## distance 0.20
## site 0.00
## age 0.15
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.3339 0.333 0.0649 0.7338 0.0839
## site 3.0000 3.000 0.0000 1.0556 0.0000
## age 46.1000 44.950 0.0642 0.8299 0.0500
## eCDF Max Std. Pair Dist.
## distance 0.20 0.8969
## site 0.00 0.0000
## age 0.15 0.8350
##
## Sample Sizes:
## Control Treated
## All 20 10
## Matched 20 10
## Unmatched 0 0
## Discarded 0 0
## Step 2: Convert to a dataframe for analysis
data_match1 <- match.data(m.out)
data_match1
## patientid treatment site age distance weights subclass
## 28 28 1 3 21 0.3144375 1 1
## 23 23 1 3 68 0.3511401 1 6
## 4 4 0 2 34 0.3265547 1 9
## 9 9 0 5 66 0.3450543 1 5
## 5 5 0 3 22 0.3151968 1 1
## 17 17 0 2 67 0.3525929 1 3
## 8 8 0 4 55 0.3385693 1 2
## 20 20 0 5 34 0.3200638 1 5
## 29 29 1 4 34 0.3222199 1 10
## 16 16 0 1 22 0.3194817 1 4
## 12 12 0 2 87 0.3688232 1 9
## 15 15 0 5 45 0.3285473 1 7
## 6 6 0 3 34 0.3243835 1 6
## 24 24 1 4 59 0.3417291 1 2
## 22 22 1 2 45 0.3351255 1 3
## 21 21 1 1 23 0.3202474 1 4
## 25 25 1 5 74 0.3514450 1 5
## 13 13 0 3 23 0.3159570 1 1
## 3 3 0 2 45 0.3351255 1 3
## 14 14 0 4 56 0.3393579 1 2
## 2 2 0 1 24 0.3210141 1 8
## 30 30 1 5 44 0.3277713 1 7
## 19 19 0 4 76 0.3553133 1 10
## 10 10 0 5 31 0.3177703 1 7
## 18 18 0 3 45 0.3329255 1 6
## 7 7 0 4 44 0.3299541 1 10
## 11 11 0 1 66 0.3540484 1 8
## 26 26 1 1 55 0.3452449 1 8
## 27 27 1 2 38 0.3296584 1 9
## 1 1 0 1 23 0.3202474 1 4
The subclass column indicates the matching groups. For example, subclass = 5 indicates that the subjects assigned a subclass value of 5 are matched 2 to 1 (unexposed to exposed). For our example, these are subjects 9, 20, and 25.
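One way to check the matching groups is to count the exposed and unexposed subjects in each subclass and confirm that they all share a site. The sketch below does this on a small excerpt (subclasses 5 and 7) copied by hand from the output above; the same two lines should work on the full data_match1 dataframe.

```r
## Excerpt of the matched data (subclasses 5 and 7), copied from the output above
excerpt <- data.frame(
  patientid = c(9, 20, 25, 15, 30, 10),
  treatment = c(0, 0, 1, 0, 1, 0),
  site      = c(5, 5, 5, 5, 5, 5),
  age       = c(66, 34, 74, 45, 44, 31),
  subclass  = c(5, 5, 5, 7, 7, 7)
)

## Count unexposed (0) and exposed (1) subjects in each subclass
composition <- table(subclass = excerpt$subclass, treatment = excerpt$treatment)
composition

## Number of distinct sites in each subclass (1 means the exact match on site held)
sites_per_subclass <- tapply(excerpt$site, excerpt$subclass,
                             function(s) length(unique(s)))
sites_per_subclass
```

Each subclass here has two unexposed subjects and one exposed subject from a single site, which is what the 2:1 exact match on site should produce.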
Testing the matching
We can see if the matching algorithm balanced the two groups with a t test. Since we have age as a continuous variable, we can check to see if age is balanced between the two groups.
According to the t test, the average age of subjects in the unexposed group (treatment = 0) is 44.95 and the average age of subjects in the exposed group (treatment = 1) is 46.10; the p-value is 0.8742, which indicates that the difference in average ages between the two groups is not statistically significant at the 5% two-tailed alpha level.
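## Perform a Welch t-test to compare the ages between the treatment and control groups
## in the matched data (all 30 subjects were matched, so the result is the same as for data1)
t.test(age ~ treatment, data = data_match1, var.equal = FALSE)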
##
## Welch Two Sample t-test
##
## data: age by treatment
## t = -0.16046, df = 19.721, p-value = 0.8742
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -16.11389 13.81389
## sample estimates:
## mean in group 0 mean in group 1
## 44.95 46.10
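A p-value depends on sample size, so it can also be useful to look at the standardized mean difference that the summary output reports. Here is a minimal sketch that computes it by hand for age. It assumes the difference is scaled by the standard deviation of the exposed group, which appears to be how the summary output above scales it.

```r
## Ages copied from data1 above
age_control <- c(23, 24, 45, 34, 22, 34, 44, 55, 66, 31,
                 66, 87, 23, 56, 45, 22, 67, 45, 76, 34)
age_treated <- c(23, 45, 68, 59, 74, 55, 38, 21, 34, 44)

## Standardized mean difference, scaled by the SD of the exposed group
## (assumption: this mirrors the scaling used in the summary output)
smd_age <- (mean(age_treated) - mean(age_control)) / sd(age_treated)
round(smd_age, 4)
```

This gives 0.0642, matching the Std. Mean Diff. value for age in the summary output. A commonly used rule of thumb treats absolute values below 0.1 as good balance.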
Conclusions
The MatchIt package allows us to perform exact matching on a single variable, but it also allows flexibility to include other types of variables. There are a lot of other features of the MatchIt package that we will explore in the future, such as performing propensity score matching and using its other arguments.
Acknowledgements
Noah Greifer, one of the authors of the MatchIt package, wrote a great article on the MatchIt package that was invaluable in the development of this tutorial.
Work in progress
This is a work in progress and may be updated in the future. This work is only for educational purposes.