Introduction
Previously, I wrote a tutorial on simple trend analysis using the Medical Expenditure Panel Survey (MEPS) data from the Agency for Healthcare Research and Quality (AHRQ). In that tutorial, I estimated the predicted values of the outcome (total healthcare expenditures) at specific time points for males and females (see figure below). However, I didn’t estimate the average slope across “all” time points.
Trends of total healthcare expenditure for males and females
Sometimes, we’re interested in the overall trend rather than the predicted values at each time point. Here are two questions this tutorial will try to answer:
What is the slope of the “best fit” line through the time points?
How do I plot the “best fit” line through these points?
To answer these questions, we’ll need to change our model and
generate the predicted slopes for males and females. We will achieve
this by modifying our model to use the time variable (e.g.,
year) as a continuous term instead of a factor (or
categorical) term. This will give R the appropriate
variable to estimate the slopes.
Motivating example
We will use the same data from the previous tutorial on simple trend analysis to answer the above questions. To learn more about each step, please refer to the following tutorial: MEPS Tutorial 5 - Simple Trend Analysis with Linear Models.
Step 1 - Load and clean data
############################ LOAD LIBRARIES ############################
### Load the MEPS package
library("MEPS") ## You need to load the library every time you restart R
### Load the other libraries
library("survey")
library("foreign")
library("dplyr")
library("ggplot2")
library("questionr") # remotes::install_github("juba/questionr")
library("lspline") # devtools::install_github("mbojan/lspline", build_vignettes=TRUE)
library("ggeffects") # remotes::install_github("strengejacke/ggeffects")
library("margins")
library("gtsummary") # remotes::install_github("ddsjoberg/gtsummary")
library("sjPlot") # plot marginal effects ("plot_model" function)
############################ LOAD DATA ############################
#### Load data from AHRQ MEPS website
hc2021 = read_MEPS(file = "h233")
hc2020 = read_MEPS(file = "h224")
hc2019 = read_MEPS(file = "h216")
hc2018 = read_MEPS(file = "h209")
hc2017 = read_MEPS(file = "h201")
hc2016 = read_MEPS(file = "h192")
#### Change column names to lowercase
names(hc2021) <- tolower(names(hc2021))
names(hc2020) <- tolower(names(hc2020))
names(hc2019) <- tolower(names(hc2019))
names(hc2018) <- tolower(names(hc2018))
names(hc2017) <- tolower(names(hc2017))
names(hc2016) <- tolower(names(hc2016))
############################ LOAD LINKAGE FILES ############################
#### We need the linkage file with the appropriate stratum of the primary sampling strata (STRA9621) and primary sampling unit (PSU9621). (Note: Each year, the linkage file sampling unit name changes)
linkage = read_MEPS(type = "Pooled linkage")
names(linkage) <- tolower(names(linkage)) # change variable name to lower case
############################ CREATE POOLED DATA ############################
#### Creating the pooled data file
#### Select specific variables for each year
### 2021
hc2021p = hc2021 %>%
rename(
perwt = perwt21f,
totexp = totexp21) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp)
hc2021p$year <- 2021
### 2020
hc2020p = hc2020 %>%
rename(
perwt = perwt20f,
totexp = totexp20) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp)
hc2020p$year <- 2020
### 2019
hc2019p = hc2019 %>%
rename(
perwt = perwt19f,
totexp = totexp19) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp)
hc2019p$year <- 2019
### 2018
hc2018p = hc2018 %>%
rename(
perwt = perwt18f,
totexp = totexp18) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp)
hc2018p$year <- 2018
### 2017
hc2017p = hc2017 %>%
rename(
perwt = perwt17f,
totexp = totexp17) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp)
hc2017p$year <- 2017
### 2016
hc2016p = hc2016 %>%
rename(
perwt = perwt16f,
totexp = totexp16) %>%
select(
dupersid,
panel,
varstr,
varpsu,
perwt,
sex,
totexp)
hc2016p$year <- 2016
############################ MERGE POOLED DATA ############################
#### Merge data and adjust the person weight by 6 years
pool_data = bind_rows(hc2021p,
hc2020p,
hc2019p,
hc2018p,
hc2017p,
hc2016p) %>%
mutate(poolwt = perwt / 6)
#### Merging with pooled survey primary sampling strata and unit
### Reduce the linkage file to only include dupersid, panel, stra9623, psu9623
### Note: The pooled linkage file is updated regularly, hence, the variable names will change; make sure to use the correct variable names.
linkage_file = linkage %>%
select(dupersid, panel, stra9623, psu9623)
### Merge link file with main data
pool_data_linked = left_join(pool_data,
linkage_file,
by = c("dupersid", "panel"))
############################ CREATE FACTOR VARIABLES ############################
#### Create factor variables:
### MALE
pool_data_linked$male[pool_data_linked$sex == 1] = 1
pool_data_linked$male[pool_data_linked$sex == 2] = 0
table(pool_data_linked$male)
##
## 0 1
## 95023 86626
pool_data_linked$male <- factor(pool_data_linked$male, levels = c(0, 1))
############################ SET SURVEY OPTIONS ############################
# Set the survey options
options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)
############################ SURVEY DESIGN ############################
# Define the survey design
survey_design = svydesign(
id = ~psu9623,
strata = ~stra9623,
weights = ~poolwt,
data = pool_data_linked,
nest = TRUE)
Step 2: Plot the total health expenditure trends
Next, we’ll visualize the trends for males and females.
############################ PLOT TOTXEP ############################
# Plot total expenditures over time (Load the "questionr" package to use "ggsurvey")
ggsurvey(survey_design) +
aes(x = year, y = totexp, group = factor(male), color = factor(male)) +
stat_summary(fun = "mean", geom = "point", size = 5) +
stat_summary(fun = "mean", geom = "line", size = 1)
Step 3: Run the regression model
After we visualize the total healthcare expenditures for males and females, we can construct the regression model.
In the previous
article, I had set up the year variable as a factor
variable. However, this prevents us from generating a smooth best fit
line using regression methods. Therefore, we will need to set up the
year variable as a continuous term.
I will call this model survey_model2.
############################ REGRESSION MODEL ############################
## Regression model:
survey_model2 <- svyglm(totexp ~ factor(male) + year + factor(male):year,
survey_design)
## Regression output:
# Summary table with 95% CI
round(
cbind(
summary(survey_model2, df.resid = degf(survey_model2$survey.design))$coef,
confint(survey_model2, df = degf(survey_model2$survey.design))
), 4)
## Estimate Std. Error t value Pr(>|t|) 2.5 %
## (Intercept) -715509.7454 104700.5713 -6.8339 0.0000 -921233.5362
## factor(male)1 -25302.6924 153943.2033 -0.1644 0.8695 -327782.2199
## year 357.6989 51.8849 6.8941 0.0000 255.7515
## factor(male)1:year 11.9953 76.2976 0.1572 0.8751 -137.9203
## 97.5 %
## (Intercept) -509785.9546
## factor(male)1 277176.8352
## year 459.6463
## factor(male)1:year 161.9108
We can get the change in the total healthcare expenditures for males and females using the following commands:
############################ SLOPE ESTIMATIONS ############################
# Female slope
coef(survey_model2)["year"]
## year
## 357.6989
# Male slope
coef(survey_model2)["year"] + coef(survey_model2)["factor(male)1:year"]
## year
## 369.6942
For females, the average increase in total healthcare expenditure per year was $271 and for males it was $267.
However, this doesn’t give us the 95% confidence interval.
Step 4: Estimate the slopes
To get those, we will need to use the margins
command.
############################ MARGINS ############################
ame1 <- margins(survey_model2,
design = survey_design,
variables = "year",
at = list(male = 0:1))
summary(ame1)
## factor male AME SE z p lower upper
## year 1.0000 357.6989 51.8844 6.8942 0.0000 256.0074 459.3905
## year 2.0000 369.6942 60.4552 6.1152 0.0000 251.2042 488.1842
For females, the average increase in total healthcare expenditure per year was $271 (95% CI: -$29, $572) and for males it was $267 (95% CI: -$93, $628).
Step 5: Plots the slopes for males and females
Once we have this, we can plot the best fit line across the average total healthcare expenditure for males and females.
To do that, we will need to make sure we generate the predicted
values using the survey weights. Then, we will create a new dataframe
that contains the survey-weighted predicted values, which we will plot
using ggplot.
############################ SURVEY-WEIGHTED PREDICTION ############################
### Average totexp by year for males and females
survey_means <- svyby(formula = ~ totexp,
by = ~ male + year,
design = survey_design,
FUN = svymean,
na.rm = TRUE) %>%
mutate(male = factor(male))
#### Create a "empty" dataframe to collect the prediction
pred_grid <- expand.grid(year = c(2016, 2017, 2018, 2019, 2020, 2021),
male = c(0, 1)
)
#### Generate the prediction and insert into the "empty" dataframe
pred_grid$prediction <- predict(survey_model2,
newdata = pred_grid)
##################### PLOT THE SURVEY-WEIGHTED PREDICTIONS ######################
#### Plot the survey-weighted average totexp at each year and add the best fit line
ggplot() +
geom_point(data = survey_means,
aes(x = year,
y = totexp,
color = factor(male),
group = factor(male)),
size = 5) +
geom_line(data = pred_grid,
aes(x = year,
y = prediction,
color = factor(male),
group = factor(male)),
linewidth = 1) +
scale_x_continuous(breaks = c(2016, 2017, 2018, 2019, 2020, 2021)) +
scale_color_discrete(labels = c("Female", "Male"), name = "Gender") +
labs(title = "Total Healthcare Expenditure Trends",
y = "Total Healthcare Expenditure",
x = "Time") +
theme_minimal()
Upon visual inspection, we can see that the fitted lines are both increasing for males and females at a rate of $271 (95% CI: -$29, $572) and $267 (95% CI: -$93, $628) per year, respectively
Conclusions
As I mentioend, the last tutorial did not plot the best fitted line
nor generate the average slope for males and females. In order for us to
get these, we had to change the year term from a factor
variable to a continuous variable in the regression model. Afterwards,
we had to estimate the predicted values and save those findings into a
dataframe so that we can plot these using ggplot.
Once we have made these changes, we can report the average rate of total healthcare expenditure change per year for males and females.
Acknowledgements
I used Claude Sonnet 4.6 to help create some of the codes to plot the survey-fitted average total healthcare expenditure and fitted lines from the regression model.
Work in Progress
This is a work in progress so expect some updates in the future.
Disclaimers
Any errors or mistakes are those of the author.
This is only for educational purposes.
This was built under R version 4.4.2 “Pile of Leaves”