Introduction
This is a tutorial on how to perform propensity score matching in R.
Propensity score matching is a statistical approach to balancing the observed covariates between groups.[1] In essence, the propensity score is the probability that an individual will be given the exposure conditional on their observable characteristics. A propensity score is estimated using regression model methods (e.g., logistic or probit) conditioned on the observed baseline covariates.
In randomized controlled trials, the subject is randomized into the treatment or control arms of a clinical trial. After randomization, the observed characteristics of the treatment and control groups are assessed for balance. If randomized is done correctly, then not only are the observed covariates balanced, but the unobserved covariates are equally balanced between the groups. In observational studies, this balanced is often not observed introducing potential bias or confounding.
To address this problem in observational studies, we can used multiple regression model or use propensity score matching techniques. Multiple regression model control (or adjust) for confounding by including the observable covariates into the regression model. With propensity score matching, the observable characteristics are used to estimate the probability of being in the exposure arm conditional on the observable characteristics. In theory, if this is done correctly, than the balanced observed with the observable characteristics would extend to the unobservable characteristics (e.g., similar to a randomized control trial).
In this article, we will review some of the propensity score matching methods to balanced the observable characteristics between groups.
The propensity score matching methods we will review are:
- Nearest neighbor approach
- Inverse probability weight (IPW) for average treatment effect (ATE)
- Inverse probability weight (IPW) for average treatment effect of the treated (ATT)
Causal Framework (ATE and ATT)
In a randomized control trial, we can estimate the average treatment effect (ATE), which is the difference in outcome \((Y)\) between the exposure \((Z = 1)\) and unexposed groups \((Z = 0)\). In reality, the individual is randomized to either the exposure or control arm. When we observe what happens to the individual, we can only observe what occurred to the individual in the arm they were assigned to. For example, an individual \(i\) can be assigned to the exposure arm; hence, we can only observe their outcome in the exposure arm \(Y_{i}(1)\). Conversely, an individual \(i\) in the control arm can only have their outcome in the control arm observed \(Y_{i}(0)\). Thus, we can only report on the outcome for that individual in that specific arm. This can be represented as:
\[\begin{aligned} Y_{i}(Y_{i} = Z_{i}Y_{i}(1) + (1 - Z_{i})Y_{i}(0)) \end{aligned}\]This expression states that a single outcome for the individual \(i\) is based on whether they were assigned to the exposure arm \(Z_{i}\), which generates the outcome in the exposure arm \(Y{i}(1)\), or the unexposed arm \((1 - Z_{i})\), which generate the outcome in the unexposed arm \(Y_{i}(0)\).
In other words, if \(Z = 1\), the expression reduces to:
\[\begin{aligned} Y_{i}(Y_{i} = Y_{i}(1)) \end{aligned}\]If \(Z = 0\), the expression reduces to:
\[\begin{aligned} Y_{i}(Y_{i} = Y_{i}(0)) \end{aligned}\]Thus, in this framework, the ATE is estimated as the:
\[\begin{aligned} ATE = E[Y_{i}(1) - Y_{i}(0)] \end{aligned}\]The ATE is defined as the treatment effect at the population level. We think of this as the difference in outcome when the everyone in the population who were in the unexposed group were instead in the exposure group. In reality this is an impossibility because we can only observe what happens to an individual based on their treatment assignment. However, if we can maintain this framework, we could have a situation where we can make a causal interpretation.
Similarly, we can also estimate the average treatment effect of the treated (ATT). The ATT is the difference in outcomes \((Y(1) - Y(0))\) among those individual who received treatment \(Z = 1\). It can be expressed as:
\[\begin{aligned} ATT = E[Y_{i}(1) - Y_{i}(0)|Z_{i} = 1] \end{aligned}\]In an ideal randomized controlled trial where the individual is in perfect adherence to their treatment, the ATE and ATT are exactly the same because the exposure and unexposed arms are the same.
There are other types of treatment effects such as the average treatment effect of the control group (ATC), which is the opposite of the ATT, sample average treatment effect (SATE), population average treatment effect (PATE), local average treatment effect (LATE), and conditional average treatment effect (CATE).[2]
In this article, we will only focus on the ATE and ATT since these are the two common types of analyses performed with propensity score matching techniques.
Motivating example
We will use the Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS) to evaluate the difference in total healthcare expenditures (costs) between responders who have a diagnosis of type 2 diabetes and responders who do not have a diagnosis of diabetes. (Note: Since MEPS is a survey administered to individual, we call them responders or respondents.)
Load libraries
These are the essential packages you need to install and load.
if (!require("pacman")) install.packages("pacman"); library("pacman")
p_load("MEPS",
"tidyverse",
"MatchIt",
"cobalt",
"broom",
"modelsummary",
"sandwich")
AHRQ MEPS data - Full-Year Consolidated File from 2021
We will use the 2021 Full-Year Consolidated File from MEPS to perform a propensity score match. This data contains information about the total healthcare expenditure of the responder and socioeconomic characteristics.
### Load data
### Load 2021 Full Year Consolidated Data File
hc2021 = read_MEPS(file = "h233")
names(hc2021) <- tolower(names(hc2021)) ## Convert column headers to lower case
## View first fix rows
head(hc2021)
## # A tibble: 6 × 1,488
## duid pid dupersid panel famid31 famid42 famid53 famid21 famidyr
## <dbl> <dbl> <chr> <dbl+lbl> <chr> <chr> <chr> <chr> <chr>
## 1 2320005 101 2320005101 23 [23 PANEL… A A A A A
## 2 2320005 102 2320005102 23 [23 PANEL… A A A A A
## 3 2320006 101 2320006101 23 [23 PANEL… A A A A A
## 4 2320006 102 2320006102 23 [23 PANEL… B B B B B
## 5 2320006 103 2320006103 23 [23 PANEL… A A A A A
## 6 2320012 102 2320012102 23 [23 PANEL… A A A A A
## # ℹ 1,479 more variables: cpsfamid <chr>, fcsz1231 <dbl+lbl>,
## # fcrp1231 <dbl+lbl>, ruletr31 <chr>, ruletr42 <chr>, ruletr53 <chr>,
## # ruletr21 <chr>, rusize31 <dbl+lbl>, rusize42 <dbl+lbl>, rusize53 <dbl+lbl>,
## # rusize21 <dbl+lbl>, ruclas31 <dbl+lbl>, ruclas42 <dbl+lbl>,
## # ruclas53 <dbl+lbl>, ruclas21 <dbl+lbl>, famsze31 <dbl+lbl>,
## # famsze42 <dbl+lbl>, famsze53 <dbl+lbl>, famsze21 <dbl+lbl>,
## # fmrs1231 <dbl+lbl>, fams1231 <dbl+lbl>, famszeyr <dbl+lbl>, …
## Rename columns / Reduce data to only a few variables
hc2021p = hc2021 %>%
rename(
age = age21x,
totexp = totexp21,
ertexp = ertexp21,
diabetes = diabdx_m18,
race = racev1x,
poverty = povcat21,
marital_status = marry21x) %>%
select(
dupersid,
age,
sex,
race,
poverty,
diabetes,
marital_status,
totexp,
ertexp)
hc2021p$year <- 2021
Subset data for adults (age >= 18 years)
We are interested in the adult population, so we will use the
subset()
function to restrict our sample to responders who
are 18 years and older.
hc2021p <- subset(hc2021p, age >= 18)
Subset data for adults with marital status information
There are some missing data on responders in regards to their marital
status. We will remove these using the subset()
function.
hc2021p <- subset(hc2021p, marital_status >= 1)
Label values for variables (Sex, Race, Poverty, and Marital Status)
Since the MEPS data uses specific codes for the values with each
variable, we will need to label these so that they are easier to
interpret. For example, the sex
variable in MEPS is coded
as = male
and 2 = female
. We will change that
to 1 = male
and 0 = female
. We also will
change each varaible to a factor so that we can use these in our
propensity score matching models.
### SEX
hc2021$sex <- factor(hc2021$sex, levels = c(1, 2), labels = c("Male", "Female"))
### RACE
hc2021p$race <- factor(hc2021p$race, levels = c(1, 2, 3, 4, 6), labels = c("White", "Black", "AI/AN", "Asian", "Multiple"))
### POVERTY
hc2021p$poverty <- factor(hc2021p$poverty, levels = c(1, 2, 3, 4, 5), labels = c("Poor", "Near Poor", "Low Income", "Middle Income", "High Income"))
### MARITAL STATUS
hc2021p$marital_status <- factor(hc2021p$marital_status, levels = c(1, 2, 3, 4, 5), labels = c("Married", "Widowed", "Divorced", "Separated", "Never Married"))
Keep respondents with diabetes indicator
The diabetes variable will need to be recoded from
1 = diabetes
and 2 = no diabetes
to
1 = diabetes
and 0 = no diabetes
. In our
dataset, there are 3184 respondents with diabetes and 19,262 respondents
without diabetes.
hc2021p <- hc2021p[hc2021p$diabetes %in% c(1, 2), ]
## Change diabetes indicator to 1 = Yes, 0 = No.
hc2021p$diabetes[hc2021p$diabetes == 1] = 1
hc2021p$diabetes[hc2021p$diabetes == 2] = 0
table(hc2021p$diabetes)
##
## 0 1
## 19262 3184
Propensity score matching
Once we have our data, we can perform the various propensity score matching approaches.
Recall, that we want to evaluate the difference in total expenditures between respondents with and without a diagnosis of diabetes in 2021. Hence, we will need to balance the observed characteristics between responders with and without diabetes.
Propensity score matching with nearest neighbor approach
The first propensity score matching technique is the nearest neighbor approach. This is where the algorithm will match an individual with another individual with the closest propensity score. We can refine how close using the caliper size. If there are ties, then the algorithm will randomly select a match. For those individuals that do not match, they are dropped from the data.
There are many ways to perform the nearest neighbor approach (e.g.,
Mahalanobis, Greedy, optimal). In this example, we will use the Greedy
method (also known as nearest
in R).
We will use a caliper size of 0.01 and perform matching without replacement. Matching can also be done with replacement where individuals in the control group will be used more than once in a match with the treatment group. This will result in an imbalanced in the number of individuals in the treatment and control group.
In our example, we will not apply the replacement option. Instead, we want to have 1:1 matching without replacement. This will give us equal number of responders who have and do not have a diagnosis of diabetes.
Lastly, we will estimate the propensity score using several observed characteristics (age, sex, race, poverty, and martial status).
For this exercise, we will use the MatchIt
package and
the matchit()
function to generate propensity scores.
##########################################################
## Propensity Score Matching - Nearest Neighbor Approach
##########################################################
## Propensity score matching
match1 <- matchit(diabetes ~ age + sex + race + poverty + marital_status,
data = hc2021p,
method = "nearest",
discard = "both",
caliper = 0.01)
Visual inspection
After we performed the matching, we will visually inspect to see how well the matching was done. We will generate a Love plot, which will have show us the standardized difference for each covariate used in the propensity score model between the responders with and without diabetes. Our aim is to have these covariates within an acceptable threshold. In this example, our threshold is a standardized difference of < 0.1. Additionally, we estimate the Kolmogorov-Smirnov test to see if the difference was statistically significant; our aim is to have a value that is less than 0.05.
## Inspect matching
summary(match1)
##
## Call:
## matchit(formula = diabetes ~ age + sex + race + poverty + marital_status,
## data = hc2021p, method = "nearest", discard = "both", caliper = 0.01)
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff.
## distance 0.2193 0.1290 0.8088
## age 63.3841 49.3954 1.0276
## sex 1.5380 1.5378 0.0004
## raceWhite 0.6982 0.7685 -0.1531
## raceBlack 0.2048 0.1389 0.1633
## raceAI/AN 0.0154 0.0077 0.0626
## raceAsian 0.0496 0.0573 -0.0352
## raceMultiple 0.0320 0.0277 0.0245
## povertyPoor 0.2296 0.1469 0.1966
## povertyNear Poor 0.0700 0.0458 0.0948
## povertyLow Income 0.1583 0.1285 0.0815
## povertyMiddle Income 0.2795 0.2797 -0.0004
## povertyHigh Income 0.2626 0.3990 -0.3100
## marital_statusMarried 0.4491 0.4652 -0.0323
## marital_statusWidowed 0.1674 0.0754 0.2463
## marital_statusDivorced 0.1947 0.1348 0.1514
## marital_statusSeparated 0.0377 0.0217 0.0840
## marital_statusNever Married 0.1511 0.3029 -0.4241
## Var. Ratio eCDF Mean eCDF Max
## distance 1.1806 0.2438 0.3785
## age 0.5344 0.2057 0.3576
## sex 1.0002 0.0001 0.0002
## raceWhite . 0.0703 0.0703
## raceBlack . 0.0659 0.0659
## raceAI/AN . 0.0077 0.0077
## raceAsian . 0.0076 0.0076
## raceMultiple . 0.0043 0.0043
## povertyPoor . 0.0827 0.0827
## povertyNear Poor . 0.0242 0.0242
## povertyLow Income . 0.0297 0.0297
## povertyMiddle Income . 0.0002 0.0002
## povertyHigh Income . 0.1364 0.1364
## marital_statusMarried . 0.0160 0.0160
## marital_statusWidowed . 0.0920 0.0920
## marital_statusDivorced . 0.0600 0.0600
## marital_statusSeparated . 0.0160 0.0160
## marital_statusNever Married . 0.1519 0.1519
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff.
## distance 0.2177 0.2177 0.0001
## age 63.2943 63.7663 -0.0347
## sex 1.5390 1.5349 0.0082
## raceWhite 0.7013 0.7379 -0.0798
## raceBlack 0.2030 0.1850 0.0446
## raceAI/AN 0.0139 0.0098 0.0333
## raceAsian 0.0496 0.0433 0.0291
## raceMultiple 0.0322 0.0240 0.0466
## povertyPoor 0.2280 0.2327 -0.0113
## povertyNear Poor 0.0695 0.0565 0.0507
## povertyLow Income 0.1582 0.1547 0.0095
## povertyMiddle Income 0.2804 0.2877 -0.0162
## povertyHigh Income 0.2640 0.2684 -0.0100
## marital_statusMarried 0.4490 0.4531 -0.0083
## marital_statusWidowed 0.1670 0.1601 0.0186
## marital_statusDivorced 0.1955 0.2097 -0.0359
## marital_statusSeparated 0.0369 0.0319 0.0265
## marital_statusNever Married 0.1516 0.1452 0.0176
## Var. Ratio eCDF Mean eCDF Max Std. Pair Dist.
## distance 1.0004 0.0001 0.0019 0.0003
## age 0.9761 0.0070 0.0256 0.1621
## sex 0.9988 0.0021 0.0041 0.2121
## raceWhite . 0.0366 0.0366 0.2476
## raceBlack . 0.0180 0.0180 0.2793
## raceAI/AN . 0.0041 0.0041 0.1513
## raceAsian . 0.0063 0.0063 0.2152
## raceMultiple . 0.0082 0.0082 0.2582
## povertyPoor . 0.0047 0.0047 0.2380
## povertyNear Poor . 0.0129 0.0129 0.1868
## povertyLow Income . 0.0035 0.0035 0.2206
## povertyMiddle Income . 0.0073 0.0073 0.1625
## povertyHigh Income . 0.0044 0.0044 0.1378
## marital_statusMarried . 0.0041 0.0041 0.1835
## marital_statusWidowed . 0.0069 0.0069 0.2098
## marital_statusDivorced . 0.0142 0.0142 0.2544
## marital_statusSeparated . 0.0051 0.0051 0.1956
## marital_statusNever Married . 0.0063 0.0063 0.1675
##
## Sample Sizes:
## Control Treated
## All 19262 3184
## Matched 3167 3167
## Unmatched 15974 16
## Discarded 121 1
## Visual inspection using Love plot
love.plot(match1,
stats = c("m", "ks"), ### m = mean difference; ks = Kolmogorov-Smirnov
thresholds = c(m = 0.1, ks = 0.05),
drop.distance = TRUE,
colors = c("dodgerblue4", "firebrick"),
shapes = c(16, 17),
stars = "none")
In the summary(match1)
output, we can review the number
of respondents who were matched. We started out with 3184 respondents
who reported having a diagnosis of diabetes. After matching, this number
was reduced to 3167.
We also see that there were an equal number of respondents without diabetes. However, 15,974 respondents without diabetes were not matched.
Moreover, based on the love plot, the covariates appear to be well-balanced between responders who have and do not have a diagnosis of diabetes.
Convert Propensity Score Matched results into an analyzable data
Now that we have the propensity score matched groups, we need to
create a dataset with these individuals. We will call this dataset
match_data
.
### Convert to data
match_data <- match.data(match1)
head(match_data)
## # A tibble: 6 × 13
## dupersid age sex race poverty diabetes marital_status totexp ertexp
## <chr> <dbl> <dbl+lbl> <fct> <fct> <dbl+lb> <fct> <dbl> <dbl>
## 1 2320012102 81 2 [2 FEM… White Low In… 1 [1 YE… Widowed 9813 0
## 2 2320024101 44 2 [2 FEM… White Low In… 0 Married 2106 0
## 3 2320028101 43 2 [2 FEM… White Middle… 1 [1 YE… Married 377 0
## 4 2320038101 83 1 [1 MAL… White High I… 0 Married 7677 0
## 5 2320038102 83 2 [2 FEM… White High I… 0 Married 3953 0
## 6 2320040101 63 1 [1 MAL… White High I… 1 [1 YE… Widowed 18740 0
## # ℹ 4 more variables: year <dbl>, distance <dbl>, weights <dbl>, subclass <fct>
Inverse Probability Weight (IPW) Methods
The nearest neighbor approach does a great job of identifying the best match, which can be displayed on a Table 1 demographics similar to a randomized controlled trial. However, there are a lot of individuals who were not matched and dropped from the data.
In some cases, this is undesirable.
Another method uses the inverse probability weight (IPW) method to keep as much of the data as possible by providing weights to each individual in the data. We use these weights (which are based on the propensity score) in the analysis to provide us with the weighted estimates of the treatment effect.
Using logistic regression to estimate the propensity score
We estimate the propensity score using a logistic regression model
with the observable covariates. The dependent variable will be the
diabetes
variable.
Once we perform the logistic regression model, we will need to estimated the predicted values of the probability of having a diagnosis of diabetes. This is essentially the propensity score.
There are two ways to estimate the predicted values. The first uses
the uagment_columns()
function. The second uses the
mutate()
function. I show how you can do both below.
After the predicted probability of having a diagnosis of diabetes, we can plot these distributions between the respondents with and without diabetes.
##########################################################
## Propensity Score Estimation
##########################################################
### Construct a logistic regression model
logit1 <- glm(diabetes ~ age + sex + race + poverty + marital_status,
data = hc2021p,
family = "binomial"(link = "logit"))
summary(logit1)
##
## Call:
## glm(formula = diabetes ~ age + sex + race + poverty + marital_status,
## family = binomial(link = "logit"), data = hc2021p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4945 -0.5963 -0.4024 -0.2621 2.8965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.810998 0.124969 -30.496 < 2e-16 ***
## age 0.045751 0.001448 31.604 < 2e-16 ***
## sex -0.141087 0.041395 -3.408 0.000654 ***
## raceBlack 0.474756 0.053747 8.833 < 2e-16 ***
## raceAI/AN 0.933197 0.177984 5.243 1.58e-07 ***
## raceAsian 0.288761 0.093057 3.103 0.001915 **
## raceMultiple 0.566715 0.118194 4.795 1.63e-06 ***
## povertyNear Poor -0.140514 0.090899 -1.546 0.122149
## povertyLow Income -0.287948 0.068344 -4.213 2.52e-05 ***
## povertyMiddle Income -0.407309 0.059286 -6.870 6.41e-12 ***
## povertyHigh Income -0.853575 0.061451 -13.890 < 2e-16 ***
## marital_statusWidowed -0.121386 0.065844 -1.844 0.065248 .
## marital_statusDivorced 0.016357 0.056997 0.287 0.774125
## marital_statusSeparated 0.354424 0.114800 3.087 0.002020 **
## marital_statusNever Married -0.197674 0.063481 -3.114 0.001846 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 18330 on 22445 degrees of freedom
## Residual deviance: 16263 on 22431 degrees of freedom
## AIC: 16293
##
## Number of Fisher Scoring iterations: 5
## Method 1: Using augment_columns
prob_fitted <- augment_columns(logit1,
data = hc2021p,
type.predict = "response") %>%
rename(propensity_score = .fitted)
## Inspect propensity scores (propensity_score)
summary(prob_fitted$propensity_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01311 0.05468 0.10972 0.14185 0.20192 0.68263
## Method 2: Using mutate
prob_fitted2 <- hc2021p %>%
mutate(
propensity_score = glm(diabetes ~ age + sex + race + poverty + marital_status,
data = hc2021p,
family = "binomial"(link = "logit")) %>%
predict(type = "response")
)
## Inspect propensity scores (propensity_score)
summary(prob_fitted2$propensity_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01311 0.05468 0.10972 0.14185 0.20192 0.68263
## Visualize the propensity score distributions between the diabetes and no-diabetes groups
ps_fitted <- prob_fitted2 %>%
tidyr::spread(diabetes, propensity_score, sep = "_")
ggplot(ps_fitted) +
geom_histogram(bins = 50, aes(diabetes_0)) +
geom_histogram(bins = 50, aes(x = diabetes_1, y = -after_stat(count))) +
ylab("count") + xlab("p") +
geom_hline(yintercept = 0, lwd = 0.5) +
scale_y_continuous(label = abs)
# Note: The code for the graph came from: https://livefreeordichotomize.com/posts/2019-01-17-understanding-propensity-score-weighting/
IPW for ATE
Once we have the propensity scores, we can estimate the IPW for the ATE analysis. The propensity score for the respondent \(ps_{i}\) is used to estimate the IPW of the ATE \(ipw_{ATE}\) based on whether they have a diagnosis of diabetes \(Z_{i} = 1\) or not \(Z_{i} = 0\). The formula to estimate the IPW for the ATE is:
\[\begin{aligned} ipw_{ATE} = \frac{Z_{i}}{ps_{i}} + \frac{1 - Z_{i}}{1 - ps_{i}} \end{aligned}\]Once the IPW for the ATE is estimated, we will save this as the
ate_fitted
dataset for our analysis.
##############
#### ATE IPW
##############
ate_fitted <- prob_fitted2 %>%
mutate(
ipw_ate = (diabetes / propensity_score) + ((1 - diabetes) / (1 - propensity_score))
)
## Visualize the propensity scores between the diabetes and no-diabetes groups
ps_ate <- ate_fitted %>%
tidyr::spread(diabetes, propensity_score, sep = "_")
ggplot(ps_ate) +
geom_histogram(bins = 50, aes(diabetes_1), alpha = 0.5) +
geom_histogram(bins = 50, aes(diabetes_1, weight = ipw_ate), fill = "green", alpha = 0.5) +
geom_histogram(bins = 50, alpha = 0.5, aes(x = diabetes_0, y = -..count..)) +
geom_histogram(bins = 50, aes(x = diabetes_0, weight = ipw_ate, y = -..count..), fill = "blue", alpha = 0.5) +
ylab("count") + xlab("p") +
geom_hline(yintercept = 0, lwd = 0.5) +
scale_y_continuous(label = abs)
# Note: The code for the graph came from: https://livefreeordichotomize.com/posts/2019-01-17-understanding-propensity-score-weighting/
We can visualize how well the IPW ATE does in providing weights to
the group with {green
and without blue
a
diagnosis of diabetes.
IPW for ATT
Once we have the propensity scores, we can estimate the IPW for the ATT analysis. The propensity score for the respondent \(ps_{i}\) is used to estimate the IPW of the ATT \(ipw_{ATT}\) based on whether they have a diagnosis of diabetes \(Z_{i} = 1\) or not \(Z_{i} = 0\). The formula to estimate the IPW for the ATT is:
\[\begin{aligned} ipw_{ATT} = \frac{ps_{i}Z_{i}}{ps_{i}} + \frac{ps_{i}(1 - Z_{i})}{1 - ps_{i}} \end{aligned}\]Once the IPW for the ATT is estimated, we will save this as the
att_fitted
dataset for our analysis.
##############
#### ATT IPW
##############
att_fitted <- prob_fitted2 %>%
mutate(
ipw_att = ((propensity_score * diabetes) / propensity_score) + ((propensity_score * (1 - diabetes)) / (1 - propensity_score))
)
## Visualize the propensity scores between the diabetes and no-diabetes groups
ps_att <- att_fitted %>%
tidyr::spread(diabetes, propensity_score, sep = "_")
ggplot(ps_att) +
geom_histogram(bins = 50, aes(diabetes_1), alpha = 0.5) +
geom_histogram(bins = 50, aes(diabetes_1, weight = ipw_att), fill = "green", alpha = 0.5) +
geom_histogram(bins = 50, alpha = 0.5, aes(x = diabetes_0, y = -..count..)) +
geom_histogram(bins = 50, aes(x = diabetes_0, weight = ipw_att, y = -..count..), fill = "blue", alpha = 0.5) +
ylab("count") + xlab("p") +
geom_hline(yintercept = 0, lwd = 0.5) +
scale_y_continuous(label = abs)
# Note: The code for the graph came from: https://livefreeordichotomize.com/posts/2019-01-17-understanding-propensity-score-weighting/
The IPW for ATT focused on those respondents who had diabetes. Thus,
the weights are generated based on those individuals who had a diagnosis
of diabetes. This is why there is a large grey
area with
the respondents who did not have a diagnosis of diabetes. It’s because
their weights have been reduced as denoted by the blue
area.
Comparison of propensity score matched methods
Let’s recap. So far, we have estimated the propensity scores for each respondents in our sample. We used three propensity score matching approaches. The first used the nearest neighbor approach where we matched the respondents without diabetes to respondents with diabetes in a 1:1 matching ratio without replacement. The second approach used the propensity score to estimate the inverse probability weights for respondents based on the ATE analysis. Last, we used the propensity score to estimate the inverse probability weights for respondents based on the the ATT analysis.
We can now estimate the total healthcare expenditures between respondents with and without a diagnosis of diabetes.
We will use a linear regression model where the total healthcare
expenditure totexp
is the dependent variable and the
diabetes
variable as the predictor of interest. We will not
need to include the observed characteristics (age, sex, race, poverty,
and marital status) since those were used to estimate the propensity
scores. (Note: In some cases, these characteristics may be added to
adjust for the model results.)
Compare the propensity score matching approaches
The first analysis will use the MEPS data without any adjustments or propensity score matching.
The second analysis will use the nearest neighbor matching approach.
The third analysis will use the IPW with ATE weights.
The fourth analysis will use the IPW with ATT weights.
In R
, we can apply the weights from the IPW using the
weight =
option in the glm()
function.
## No matching
model0 <- glm(totexp ~ diabetes,
data = hc2021p)
summary(model0)
##
## Call:
## glm(formula = totexp ~ diabetes, data = hc2021p)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -16703 -7519 -6096 -1224 2179645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7644.8 212.3 36.00 <2e-16 ***
## diabetes 9057.8 563.8 16.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 868430410)
##
## Null deviance: 1.9715e+13 on 22445 degrees of freedom
## Residual deviance: 1.9491e+13 on 22444 degrees of freedom
## AIC: 525691
##
## Number of Fisher Scoring iterations: 2
## Nearest neighbor matching
model1 <- glm(totexp ~ diabetes,
data = match_data)
summary(model1)
##
## Call:
## glm(formula = totexp ~ diabetes, data = match_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -16733 -10565 -7864 1186 816647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10564.8 494.9 21.346 <2e-16 ***
## diabetes 6167.8 699.9 8.812 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 775792433)
##
## Null deviance: 4.9726e+12 on 6333 degrees of freedom
## Residual deviance: 4.9123e+12 on 6332 degrees of freedom
## AIC: 147632
##
## Number of Fisher Scoring iterations: 2
## IPW with ATE
model2 <- glm(totexp ~ diabetes,
data = ate_fitted,
weight = ipw_ate)
summary(model2)
##
## Call:
## glm(formula = totexp ~ diabetes, data = ate_fitted, weights = ipw_ate)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -100962 -8266 -7037 -1649 2208362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8085.1 258.9 31.23 <2e-16 ***
## diabetes 6980.4 373.6 18.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1508361886)
##
## Null deviance: 3.4380e+13 on 22445 degrees of freedom
## Residual deviance: 3.3854e+13 on 22444 degrees of freedom
## AIC: 529952
##
## Number of Fisher Scoring iterations: 2
## IPW with ATT
model3 <- glm(totexp ~ diabetes,
data = att_fitted,
weight = ipw_att)
summary(model3)
##
## Call:
## glm(formula = totexp ~ diabetes, data = att_fitted, weights = ipw_att)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -16703 -3885 -2399 -1234 1001355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10704.8 283.5 37.76 <2e-16 ***
## diabetes 5997.8 402.6 14.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 260161980)
##
## Null deviance: 5.8968e+12 on 22445 degrees of freedom
## Residual deviance: 5.8391e+12 on 22444 degrees of freedom
## AIC: 541393
##
## Number of Fisher Scoring iterations: 2
#### Comparing the models together
model_list <- list("No matching" = model0,
"Nearest neighbor matching" = model1,
"IPW wih ATE" = model2,
"IPW with ATT" = model3)
modelsummary1 <- modelsummary(model_list,
stars = TRUE,
gof_omit = ".IC",
fmt = fmt_decimal(digits = 0, pdigits = 3),
statistic = NULL,
conf_level = 0.95,
vcov = "robust",
estimate = "{estimate} [{conf.low}, {conf.high}] {stars}",
notes = list("*** P<0.001"))
modelsummary1
No matching | Nearest neighbor matching | IPW wih ATE | IPW with ATT | |
---|---|---|---|---|
*** P<0.001 | ||||
(Intercept) | 7645 [7221, 8069] *** | 10565 [9534, 11596] *** | 8085 [7639, 8531] *** | 10705 [9978, 11432] *** |
diabetes | 9058 [8062, 10054] *** | 6168 [4795, 7540] *** | 6980 [5888, 8073] *** | 5998 [4840, 7156] *** |
Num.Obs. | 22446 | 6334 | 22446 | 22446 |
R2 | 0.011 | 0.012 | 0.011 | 0.002 |
Log.Lik. | -262842.503 | -73813.132 | -299527.066 | -266362.351 |
F | 317.532 | 77.625 | 156.822 | 103.017 |
RMSE | 29467.83 | 27848.65 | 29477.10 | 29603.86 |
Std.Errors | HC3 | HC3 | HC3 | HC3 |
Note that the No matching
, IPW with ATE
,
and IPW with ATT
had the complete number of respondents
(N = 22,246
) in the analysis. Whereas, the
Nearest neighbor matching
only had N = 6334
in
the analysis. This is because the Nearest neighbor matching
approach discarded respondents who did not match.
The results varied based on the methods used.
In the no matching approach, respondents with a diagnosis of diabetes were associated with a greater increase in healthcare expenditures compared to respondents without a diagnosis of diabetes (+$9058; 95% CI: $8062, $10,054).
In the nearest neighbor matching approach, respondents with a diagnosis of diabetes were associated with a greater increase in healthcare expenditures compared to respondents without a diagnosis of diabetes (+$6184; 95% CI: $4810, $7557).
In the IPW with ATE approach, respondents with a diagnosis of diabetes were associated with a greater increase in healthcare expenditures compared to respondents without a diagnosis of diabetes (+$6980; 95% CI: $5888, $8073).
In the IPW with ATT approach, the interpretation is a little different. If the respondents who had an actual diagnosis of diabetes did not have diabetes, the difference in total healthcare expenditure would have been +$5998 greater (95% CI: $4840, $7156).
Conclusions
Propensity score matching is a useful technique that can mitigate or eliminate confounding from observational studies. However, the application of these techniques will depend on your assumptions and knowledge about the causal framework of your research endeavors. For example, which covariates used in estimating the propensity score will depend on your causal framework, which can be illustrated by a directed acyclic graph (DAG) diagram. It’s not recommended that you include all observable characteristics into the propensity score estimation. Each covariate should be carefully considered and justified for inclusion to estimate a propensity score based on some kind of causal framework. How ever you use these methods, keep in mind that they are based on assumptions, and it is our responsibility to ensure that we are transparent and explicit about these in our methods.
I have provided a couple of references below to help expand on topics related to propensity score matching.
References
- Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behav Res. 2011;46(3):399-424. doi:10.1080/00273171.2011.568786.
- Garrido MM, Dowd B, Hebert PL, Maciejewski ML. Understanding Treatment Effect Terminology in Pain and Symptom Management Research. J Pain Symptom Manage. 2016;52(3):446-452. doi:10.1016/j.jpainsymman.2016.01.016.
Acknowledgements
There are numerous resources that were used to develop this tutorial.
The MatchIt
package GitHub site is a great resource on
how to use this in estimating the propensity scores in R. URL: https://kosukeimai.github.io/MatchIt/
Dr. Andrew Heiss has a wealth of tutorials on propensity matching and IPW. This was a great resource and inspiration for my article: URL: https://evalsp21.classes.andrewheiss.com/example/matching-ipw/
The “Live Free or Dichotomize” blog had excellent explanations of the IPW for various treatment effects by author Lucy D’Agostino McGowan: URL: https://livefreeordichotomize.com/posts/2019-01-17-understanding-propensity-score-weighting/
Francisco Yirá’s blog on Matching was an inspiration for this article: URL: https://www.franciscoyira.com/post/matching-in-r-3-propensity-score-iptw/
The cobalt
package was used to generate the Love plots:
URL: https://ngreifer.github.io/WeightIt/articles/WeightIt.html
Disclaimers
This is for educational purposes only.
This is a work in progress and updates may occur in the future.
Update notes
Fixed an error where the IPW equations were displayed incorrectly.