Introduction
Previously, I wrote a tutorial on how to perform an interrupted time series analysis (ITSA) in Stata, which is located on my RPubs site (link). That had me thinking about how to perform an ITSA using R. So, I decided to create a short tutorial on how to perform an ITSA in R using the Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS) dataset. I tried several ways to do this, and I think I’ve found a nice approach that leverages survey weights from the MEPS database. Although I’m writing this ITSA tutorial with this approach, I plan on researching other methods to perform an ITSA in R, so stay tuned.
This short tutorial will summarize how to perform an ITSA using a linear regression model with survey weights applied from the MEPS database. (Note: I previously wrote a tutorial on how to apply weights from MEPS data (link).)
Motivating example - Total Healthcare Costs from 2016 to 2021 by sex
We’ll use the MEPS database to compare the trends in total healthcare costs between males and females. We’ll assume that a policy or intervention was implemented in 2019, which may have changed the healthcare cost levels and trends for males and females.
I pooled data from 2016 to 2021 and adjusted the weights. I won’t go into the details of how to create this dataset in this tutorial since I provided a detailed summary in a previous tutorial; please refer to that tutorial for more on how the dataset was built (link).
# To install MEPS package in R, you need to do a couple of things.
### Step 1: Install the "devtools" package.
#install.packages("devtools")
### Step 2: Install the "MEPS" package from the AHRQ MEPS GitHub site.
#devtools::install_github("e-mitchell/meps_r_pkg/MEPS")
### Step 3: Load the MEPS package
library("MEPS") ## You need to load the library every time you restart R
### Step 4: Load the other libraries
library("survey")
library("foreign")
library("dplyr")
library("ggplot2")
library("questionr") # remotes::install_github("juba/questionr")
library("lspline") # devtools::install_github("mbojan/lspline", build_vignettes=TRUE)
library("ggeffects") # remotes::install_github("strengejacke/ggeffects")
library("margins")
# There are two ways to load data from the AHRQ MEPS website:
#### Method 1: Load data from the AHRQ MEPS website by file name
hc2021 = read_MEPS(file = "h233")
hc2020 = read_MEPS(file = "h224")
hc2019 = read_MEPS(file = "h216")
hc2018 = read_MEPS(file = "h209")
hc2017 = read_MEPS(file = "h201")
hc2016 = read_MEPS(file = "h192")
#### Method 2: Load data from the AHRQ MEPS website by year and file type
hc2021 = read_MEPS(year = 2021, type = "FYC")
hc2020 = read_MEPS(year = 2020, type = "FYC")
hc2019 = read_MEPS(year = 2019, type = "FYC")
hc2018 = read_MEPS(year = 2018, type = "FYC")
hc2017 = read_MEPS(year = 2017, type = "FYC")
hc2016 = read_MEPS(year = 2016, type = "FYC")
## Change column names to lowercase
names(hc2021) <- tolower(names(hc2021))
names(hc2020) <- tolower(names(hc2020))
names(hc2019) <- tolower(names(hc2019))
names(hc2018) <- tolower(names(hc2018))
names(hc2017) <- tolower(names(hc2017))
names(hc2016) <- tolower(names(hc2016))
# We need the pooled linkage file, which contains the variance estimation stratum (STRA9621) and primary sampling unit (PSU9621) for pooled analyses. (Note: The names of these variables in the linkage file change as new data years are added.)
linkage = read_MEPS(type = "Pooled linkage")
names(linkage) <- tolower(names(linkage)) # change variable name to lower case
# Select specific variables
### 2021
hc2021p = hc2021 %>%
rename(
perwt = perwt21f,
totexp = totexp21,
ertexp = ertexp21) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)
hc2021p$year <- 2021
### 2020
hc2020p = hc2020 %>%
rename(
perwt = perwt20f,
totexp = totexp20,
ertexp = ertexp20) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)
hc2020p$year <- 2020
### 2019
hc2019p = hc2019 %>%
rename(
perwt = perwt19f,
totexp = totexp19,
ertexp = ertexp19) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)
hc2019p$year <- 2019
### 2018
hc2018p = hc2018 %>%
rename(
perwt = perwt18f,
totexp = totexp18,
ertexp = ertexp18) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)
hc2018p$year <- 2018
### 2017
hc2017p = hc2017 %>%
rename(
perwt = perwt17f,
totexp = totexp17,
ertexp = ertexp17) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)
hc2017p$year <- 2017
### 2016
hc2016p = hc2016 %>%
rename(
perwt = perwt16f,
totexp = totexp16,
ertexp = ertexp16) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp,
ertexp)
hc2016p$year <- 2016
# Merge the data and adjust the person weights by dividing by 6 (the number of pooled years)
pool_data = bind_rows(hc2021p,
hc2020p,
hc2019p,
hc2018p,
hc2017p,
hc2016p) %>%
mutate(poolwt = perwt / 6)
# Add a new variable for the pre-post index year at 2019
pool_data$period[pool_data$year >= 2016 & pool_data$year < 2019] = 0
pool_data$period[pool_data$year >= 2019 & pool_data$year < 2022] = 1
head(pool_data)
## # A tibble: 6 × 11
## dupersid panel varstr varpsu perwt sex totexp ertexp year poolwt
## <chr> <dbl+lbl> <dbl> <dbl> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl>
## 1 2320005101 23 [23 PANE… 2079 1 6785. 2 [2 F… 4908 0 2021 1131.
## 2 2320005102 23 [23 PANE… 2079 1 6177. 1 [1 M… 21257 0 2021 1029.
## 3 2320006101 23 [23 PANE… 2028 1 1599. 2 [2 F… 827 0 2021 267.
## 4 2320006102 23 [23 PANE… 2028 1 1649. 1 [1 M… 0 0 2021 275.
## 5 2320006103 23 [23 PANE… 2028 1 2892. 1 [1 M… 0 0 2021 482.
## 6 2320012102 23 [23 PANE… 2069 2 1273. 2 [2 F… 9813 0 2021 212.
## # ℹ 1 more variable: period <dbl>
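As a quick optional check (a minimal sketch; not required for the analysis), we can cross-tabulate year against the new period variable to confirm that 2016-2018 were coded as 0 and 2019-2021 as 1.
# Sanity check (sketch): every year should fall into exactly one period
with(pool_data, table(year, period))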
# Reduce the linkage file to only include dupersid, panel, stra9621, psu9621
linkage_file = linkage %>%
select(dupersid, panel, stra9621, psu9621)
# Merge link file with main data
pool_data_linked = left_join(pool_data,
linkage_file,
by = c("dupersid", "panel"))
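Before moving on, a quick optional sketch can confirm that the merge worked; the counts of records missing the pooled design variables should be zero.
# Optional merge check (sketch): records missing the pooled stratum or PSU
sum(is.na(pool_data_linked$stra9621))
sum(is.na(pool_data_linked$psu9621))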
Interrupted Time Series Analysis Design
In this ITSA motivating example, we are going to compare the trends in total healthcare costs between males and females before and after a hypothetical policy intervention, which was implemented in 2019. The ITSA design is illustrated in the figure below.
For this example, the ITSA model is denoted by the following:
\(Y_i = \beta_0 + \beta_1(T_i) + \beta_2(X_i) + \beta_3(T_i * X_i) + \beta_4(Sex_i) + \beta_5(Sex_i * T_i) + \beta_6(Sex_i * X_i) + \beta_7(Sex_i * T_i * X_i) + \epsilon_i\),
where \(Sex_i\) is a new variable that denotes the groups (e.g., sexes), \(\beta_4\) denotes the difference in the outcomes between the groups at the beginning of the study (\(T = 0\)), \(\beta_5\) denotes the difference in the slopes between the groups before the intervention, \(\beta_6\) denotes the difference in the level changes between the groups immediately after implementation of the intervention \(X\), and \(\beta_7\) denotes the difference between the groups in the change in slopes before and after the intervention (i.e., the difference-in-differences).
ITSA for two groups (sex) before and after the intervention (X).
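To make the mapping between the formula and the data concrete, here is a minimal illustrative sketch (the variable names time, post, and female are hypothetical and are not used in the rest of this tutorial) showing how \(T\), \(X\), and the group indicator line up with the pooled MEPS variables.
# Illustrative only (sketch): T = years since the start of the study, X = post-intervention indicator
itsa_vars <- pool_data_linked %>%
  mutate(time   = year - 2016,               # T in the formula above
         post   = as.integer(year >= 2019),  # X in the formula above (same as "period")
         female = as.integer(sex == 2))      # group indicator
head(itsa_vars[, c("year", "time", "post", "female")])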
Set up the survey options
In order to estimate the appropriate costs and standard errors, we will need to set up the survey options and survey design using the weights from MEPS. We will use the svydesign function to load the primary sampling unit, individual weights, and strata, which we will then assign to an object called survey_design. Please see the code below:
# Set the survey options
options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)
# Define the survey design
survey_design = svydesign(
id = ~psu9621,
strata = ~stra9621,
weights = ~poolwt,
data = pool_data_linked,
nest = TRUE)
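As a quick optional check (a minimal sketch), we can confirm that the design object picked up the strata, PSUs, and pooled weights before running any analyses.
# Sketch: inspect the survey design and estimate an overall survey-weighted mean
summary(survey_design)          # strata, PSUs, and weight distribution
svymean(~totexp, survey_design) # survey-weighted mean of total expenditures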
Descriptive analysis
Once that’s done, we can do some descriptive analysis.
Using the svytable function, we can estimate the survey-weighted number of males and females in the United States (US). In our MEPS dataset, there are 160,228,773 males and 166,678,038 females, which is representative of the non-institutionalized civilian US population.
svytable(~sex, design = survey_design)
## sex
## 1 2
## 160228773 166678038
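If we want the survey-weighted proportions rather than counts, a small optional sketch wraps the same call in prop.table.
# Sketch: survey-weighted proportions of males (1) and females (2)
round(prop.table(svytable(~sex, design = survey_design)), 3)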
Moreover, we can look at the survey-weighted total healthcare costs for males and females over time. For instance, in 2016, the average annual total healthcare costs for males and females were $4351 and $5633 per person, respectively.
df1 <- svyby(~totexp, ~year+sex, survey_design, svymean)
df1
## year sex totexp se
## 2016.1 2016 1 4351.798 143.3687
## 2017.1 2017 1 4792.911 153.9208
## 2018.1 2018 1 5488.304 174.0677
## 2019.1 2019 1 5747.015 158.7999
## 2020.1 2020 1 5861.278 244.1048
## 2021.1 2021 1 6250.655 282.9494
## 2016.2 2016 2 5632.650 153.3156
## 2017.2 2017 2 5797.182 162.3600
## 2018.2 2018 2 6614.338 173.1030
## 2019.2 2019 2 6736.552 158.9384
## 2020.2 2020 2 6654.998 204.5959
## 2021.2 2021 2 7597.160 234.2284
We can also look at the survey-weighted average costs before and after the intervention (before and after 2019). For instance, before the intervention, the survey-weighted average total healthcare costs for males and females were $4880 and $6016, respectively. After the intervention, the survey-weighted average total healthcare costs for males and females were $5954 and $6997, respectively.
df2 <- svyby(~totexp, ~sex+period, survey_design, svymean)
df2
## sex period totexp se
## 1.0 1 0 4879.518 103.6334
## 2.0 2 0 6016.326 109.4863
## 1.1 1 1 5954.493 166.0967
## 2.1 2 1 6997.286 138.6562
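From these four cell means, we can also sketch a crude, unadjusted difference-in-differences by hand, i.e., (female post minus female pre) minus (male post minus male pre). This is optional; the svycontrast call assumes svyby is re-run with covmat = TRUE so that the covariances between the cells are available for a standard error.
# Sketch: crude difference-in-differences from the four survey-weighted cell means
cells <- coef(df2)  # order as printed above: male/pre, female/pre, male/post, female/post
(cells[4] - cells[2]) - (cells[3] - cells[1])
# With a standard error (assumes covmat = TRUE so between-cell covariances exist)
df2_cov <- svyby(~totexp, ~sex + period, survey_design, svymean, covmat = TRUE)
svycontrast(df2_cov, c(1, -1, -1, 1))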
We can visualize the trend in the survey-weighted total healthcare costs for males and females from 2016 to 2021 using the ggsurvey function. From the plot, we can see that the female group had a higher total healthcare cost trend compared to males.
# Plot total expenditures over time (Load the "questionr" package to use "ggsurvey")
ggsurvey(survey_design) +
aes(x = year, y = totexp, group = factor(sex), color = factor(sex)) +
stat_summary(fun = "mean", geom = "line", linewidth = 2) # "fun.y" and "size" are deprecated in newer ggplot2 releases
Linear regression model
We can now construct our linear regression model using the ITSA framework. Because we have survey weights that need to be applied, we will use the svyglm function. We will have several interaction terms according to the ITSA formula presented above. The triple interaction term (factor(period):factor(sex):year) denotes the difference-in-differences estimation. In other words, it is the difference between males and females in how the total healthcare cost trend changed after the intervention was implemented. Think of it as subtracting the before-after difference for males from the before-after difference for females; because we are taking a difference of two differences, we call it the difference-in-differences. In this model:
- sex denotes our grouping variable
- period denotes the binary variable indicating when the intervention was implemented (0 = before, 1 = after)
- year denotes the time variable in the model
# ITSA: Triple interaction approach
itsa1 <- svyglm(totexp ~ factor(sex) + factor(period) + year + factor(period):factor(sex) + factor(sex):year + factor(period):factor(sex):year, design = survey_design) ## ITSA linear model
itsa1.confint <- confint(itsa1) ## generate the 95% CI
# "cbind" the coefficient output with 95% CI output
round(
cbind(
summary(itsa1, df.resid = degf(itsa1$survey.design))$coef,
confint(itsa1, df = degf(itsa1$survey.design))
), 2)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1141706.66 227313.30 -5.02 0.00
## factor(sex)2 156608.67 280182.38 0.56 0.58
## factor(period)1 638235.37 399815.90 1.60 0.11
## year 568.46 112.71 5.04 0.00
## factor(sex)2:factor(period)1 -516623.61 486326.52 -1.06 0.29
## factor(sex)2:year -77.08 138.92 -0.55 0.58
## factor(sex)1:factor(period)1:year -316.27 198.07 -1.60 0.11
## factor(sex)2:factor(period)1:year -60.45 181.29 -0.33 0.74
## 2.5 % 97.5 %
## (Intercept) -1588358.77 -695054.56
## factor(sex)2 -393926.88 707144.22
## factor(period)1 -147370.28 1423841.03
## year 347.00 789.92
## factor(sex)2:factor(period)1 -1472215.58 438968.35
## factor(sex)2:year -350.04 195.88
## factor(sex)1:factor(period)1:year -705.47 72.93
## factor(sex)2:factor(period)1:year -416.68 295.78
Plot
We can plot our ITSA along with the counterfactuals. We create two new objects, plot.itsa1 and plot.itsa2, which contain the predicted values from our model. The first object, plot.itsa1, contains the predicted values from the full model, including the triple interaction. The second object, plot.itsa2, contains only the counterfactual terms.
#############
# Plot
#############
### Plot: Part 1 - Generate the predictions
plot.itsa1 <- ggpredict(itsa1, terms = c("year", "sex", "period"), ci.level = 0.95) # Full model
plot.itsa2 <- ggpredict(itsa1, terms = c("year", "sex"), ci.level = 0.95) # Counterfactual
### Plot: Part 2 - Plot the fitted values
# Start with the scatter (average values at each year by group)
colors <- c("darkgreen", "navyblue") # Create color vector
plot(df1$year, df1$totexp, pch = 19, cex = 1.5, lwd = 1,
col = colors[factor(df1$sex)],
xlab = "Time (year)",
ylab = "Average total healthcare costs ($)",
ylim = c(4000, 8000))
legend("topleft", title = "Groups",
legend = c("Male", "Female"),
fill = c("darkgreen", "navyblue"))
# Add the lines for the pre-post index periods
with(subset(plot.itsa1, x <= 2019 & group == 1 & facet == 0), lines(x, predicted, col = "darkgreen", lwd = 2))
with(subset(plot.itsa1, x <= 2019 & group == 2 & facet == 0), lines(x, predicted, col = "navyblue", lwd = 2))
with(subset(plot.itsa1, x >= 2019 & group == 1 & facet == 1), lines(x, predicted, col = "darkgreen", lwd = 2))
with(subset(plot.itsa1, x >= 2019 & group == 2 & facet == 1), lines(x, predicted, col = "navyblue", lwd = 2))
# Line for the index period
abline(v = 2019, col = "red", lty = 2, lwd = 2)
# Counterfactual lines
with(subset(plot.itsa2, x >= 2019 & group == 1), lines(x, predicted, col = "darkgreen", lty = 2, lwd = 2))
with(subset(plot.itsa2, x >= 2019 & group == 2), lines(x, predicted, col = "navyblue", lty = 2, lwd = 2))
Parallel trends assumption
The most critical parts of the ITSA results are the parallel trends assumption and the difference-in-differences estimator.
Let’s start with the parallel trends assumption. We need to test whether the trends for males and females are similar before the implementation of the intervention, which we can find in our model output. Based on our findings (see the factor(sex)2:year coefficient), the slopes for the two groups (males and females) before the intervention were not statistically different (average difference = -$77; 95% CI: -$350, $196). This means that the slopes are not significantly different between males and females before the intervention. See the ITSA figure below with annotations.
Once we are confident that the trends prior to the implementation of the intervention are not different, any changes we see after the intervention can be attributed to the intervention. This is an important condition for the difference-in-differences design.
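If you only want that single pre-period contrast without scanning the full regression table, a small optional sketch pulls the estimate and its 95% CI by coefficient name.
# Sketch: extract the pre-intervention slope difference between sexes from itsa1
pre_trend <- "factor(sex)2:year"
round(c(estimate = coef(itsa1)[[pre_trend]], confint(itsa1)[pre_trend, ]), 2)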
Additional estimations
We can use the margins command to estimate the partial derivatives of the total healthcare costs along different parts of the timeline.
For instance, the average differences in total healthcare costs between males and females before and after the intervention were $1020 and $777, respectively. I’ve summarized these results in a table below.
### Average difference total costs in the pre- and post-intervention periods.
margins1 <- margins(itsa1, type = "response", design = survey_design, at = (list(period = 0:1)), variables = "sex")
summary(margins1)
## factor period AME SE z p lower upper
## sex2 0.0000 1020.3870 251.6047 4.0555 0.0001 538.0012 1524.2733
## sex2 1.0000 777.3969 260.4250 2.9851 0.0028 242.0445 1262.8918
Differences in total healthcare costs between males and females.
We could also estimate the slopes, or the average annual change in total healthcare costs, for males and females after the intervention using the margins command. For instance, the average annual changes in total healthcare costs after the intervention for males and females are $252 and $431, respectively.
### Slopes after the intervention
margins3 <- margins(itsa1, type = "response", design = survey_design, at = (list(period = 1, sex = (1:2))), variables = "year")
summary(margins3)
## factor period sex AME SE z p lower upper
## year 1.0000 1.0000 252.1903 151.9044 1.6602 0.0969 -45.5369 549.9175
## year 1.0000 2.0000 430.9319 134.2040 3.2110 0.0013 167.8969 693.9669
Issues with the current model
However, it is not possible to compare the slopes using the R version of the margins package. I have tried to search for a way to do this, but the margins package for R currently can’t perform this comparison. You can follow the discussion on the margins GitHub page (link), which covers some potential solutions. However, I prefer to use Stata for these comparisons. Fortunately, there is a way to draw some important conclusions using a different method: the linear spline model.
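As an aside, here is a rough sketch (not the approach used in the rest of this tutorial) showing that the difference-in-differences of the slopes can also be recovered from the triple-interaction model itself, as a linear combination of its two triple-interaction coefficients using the model’s coefficient vector and variance-covariance matrix. The point estimate should match the difference-in-differences coefficient from the spline model presented below.
# Sketch: female slope change minus male slope change after 2019, from itsa1
b <- coef(itsa1)
V <- vcov(itsa1)
cvec <- setNames(rep(0, length(b)), names(b))
cvec["factor(sex)2:factor(period)1:year"] <- 1
cvec["factor(sex)1:factor(period)1:year"] <- -1
c(estimate = sum(cvec * b), se = sqrt(drop(t(cvec) %*% V %*% cvec)))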
Workaround using a linear spline model
We can get the difference in the level change immediately after implementation of the intervention at year == 2019 using the lspline function. We can also get the difference-in-differences estimate with this approach.
# ITSA: Linear splines approach
### Create a knot at 2019
spline1 <- svyglm(totexp ~ factor(sex) + factor(period) + lspline(year, c(2019), marginal = TRUE) + factor(sex):lspline(year, c(2019), marginal = TRUE) + factor(sex):factor(period), design = survey_design)
spline1.confint <- confint(spline1)
# Output with 95% CI
round(
cbind(
summary(spline1, df.resid = degf(spline1$survey.design))$coef,
confint(spline1, df = degf(spline1$survey.design))
), 2)
## Estimate Std. Error
## (Intercept) -1141706.66 227313.30
## factor(sex)2 156608.67 280182.38
## factor(period)1 -313.65 285.65
## lspline(year, c(2019), marginal = TRUE)1 568.46 112.71
## lspline(year, c(2019), marginal = TRUE)2 -316.27 198.07
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)1 -77.08 138.92
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)2 255.82 240.93
## factor(sex)2:factor(period)1 -118.53 370.05
## t value Pr(>|t|)
## (Intercept) -5.02 0.00
## factor(sex)2 0.56 0.58
## factor(period)1 -1.10 0.27
## lspline(year, c(2019), marginal = TRUE)1 5.04 0.00
## lspline(year, c(2019), marginal = TRUE)2 -1.60 0.11
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)1 -0.55 0.58
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)2 1.06 0.29
## factor(sex)2:factor(period)1 -0.32 0.75
## 2.5 % 97.5 %
## (Intercept) -1588358.77 -695054.56
## factor(sex)2 -393926.88 707144.22
## factor(period)1 -874.93 247.63
## lspline(year, c(2019), marginal = TRUE)1 347.00 789.92
## lspline(year, c(2019), marginal = TRUE)2 -705.47 72.93
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)1 -350.04 195.88
## factor(sex)2:lspline(year, c(2019), marginal = TRUE)2 -217.59 729.23
## factor(sex)2:factor(period)1 -845.64 608.58
The linear spline model creates knots that automatically replace some of the interactions in our previous model. For example, the interaction between sex and the second spline term, factor(sex)2:lspline(year, c(2019), marginal = TRUE)2, plays the role of the triple interaction factor(period):factor(sex):year from our previous model.
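Because the spline model is simply a reparameterization of the triple-interaction model, the two should produce the same fitted values; here is a quick optional check.
# Sketch: the spline model and the triple-interaction model describe the same fit
all.equal(as.numeric(fitted(itsa1)), as.numeric(fitted(spline1)))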
You’ll notice that some of the coefficient values are the same as in our first model. For instance, in our first model, the year coefficient was equal to $568, and in the linear spline model, the lspline(year, c(2019), marginal = TRUE)1 coefficient was also equal to $568. This denotes the average annual change in total healthcare costs for males before the implementation of the intervention at year == 2019.
However, you’ll also notice some differences. In the linear splines model, we have a couple of important coefficients that will help to complete our evaluation of the ITSA model.
Differences in level immediately after implementation of the intervention
One of the most important changes we are interested in is the immediate change after implementation of the intervention at year == 2019. This is captured by the factor(sex)2:factor(period)1 coefficient, which has a value equal to -$119 (95% CI: -$846, $609). This tells me that the level change at year == 2019 was smaller for females than for males; however, this difference was not statistically significant. See the ITSA figure below with annotations.
Difference-in-differences estimation
Lastly, we can identify the difference-in-differences estimate from the linear spline model output in the form of the factor(sex)2:lspline(year, c(2019), marginal = TRUE)2 coefficient, which has a value equal to $256 (95% CI: -$218, $729). This coefficient denotes the difference between males and females in the change in the average annual total healthcare expenditure trend after versus before the intervention. In other words, it is the difference in the slopes before and after the intervention for females minus the corresponding difference for males. Hence the term difference-in-differences. See the ITSA figure below with annotations.
ITSA figure with annotations.
Putting it all together
To put it all together, we can summarize our findings in a table containing the main coefficients from the ITSA model.
Summary table for main ITSA coefficients.
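Here is a minimal sketch of one way to assemble such a table directly from the spline model output (the layout below is illustrative).
# Sketch: main ITSA coefficients from the linear spline model with 95% CIs
rows <- c("factor(sex)2:lspline(year, c(2019), marginal = TRUE)1",  # pre-intervention slope difference
          "factor(sex)2:factor(period)1",                           # level change at 2019
          "factor(sex)2:lspline(year, c(2019), marginal = TRUE)2")  # difference-in-differences
ci <- confint(spline1)
summary_table <- data.frame(
  term     = c("Pre-intervention slope difference",
               "Level change at 2019",
               "Difference-in-differences"),
  estimate = round(coef(spline1)[rows], 2),
  lower_95 = round(ci[rows, 1], 2),
  upper_95 = round(ci[rows, 2], 2),
  row.names = NULL)
summary_table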
Conclusions
The ITSA is a very useful design for generating causal interpretations as long as its assumptions hold. For instance, the parallel trends assumption needs to be assessed. Without it, it will be difficult to argue that the intervention was the only factor that could have impacted the changes in the levels and trends between the groups.
In this tutorial, I ran into an issue with the first ITSA model using interaction terms: it didn’t directly provide all of the main coefficients needed for the ITSA. For instance, I had to use a linear spline model to identify the difference-in-differences estimate. Although the ITSA model with the triple interaction was useful for generating the plot, I could have started with the linear spline model and generated the ITSA plot afterwards with the triple interaction model.
At the end of the day, I was able to generate a summary table and a very nice figure of the ITSA. However, I will continue to learn more about other contrast packages such as emmeans and marginaleffects.
Although the margins package in R doesn’t have the pwcompare feature from Stata, there are some potential workarounds that I’ll investigate further. Discussion about this can be found in the margins GitHub issues forum (link).
Acknowledgements
As always, I am grateful to the R community for producing wonderful tools for novices like me. I’ve learned so much reading documents, forums, and issues online that it would be impossible to credit everyone. But I would like to highlight a few superstars that I think deserve special recognition.
I am grateful to the author of the margins package, Thomas J. Leeper. You can find his margins package on his GitHub site.
Additionally, the marginaleffects package by Vincent Arel-Bundock has helped me to compare some of R’s marginal effects commands with Stata. You can find his GitHub site here.
The emmeans package was developed by Russell V. Lenth, and I plan on learning more about it on his GitHub site (link). Note: I found a nice overview of the emmeans package by Ariel Muldoon that I encourage interested users to read (link).
The ggeffects package was very useful in helping me develop some very nice plots, which you can learn more about on its GitHub site (link).
Lastly, I found this amazing site on ITSA by Chrissy H. Roberts here. Her site helped me to create the ITSA plot in this tutorial. I hope you get a chance to read her blog on how to conduct an ITSA in R. I think she does a much better job than I did with this tutorial.
Work in progress
This is a work in progress, so expect some updates in the future.
Disclaimers
Any errors or mistakes are those of the author.
This is only for educational purposes.
This was built under R version 4.3.1 “Beagle Scouts”