MEPS Tutorial 8: Estimating slopes from a regression model using R

Mark Bounthavong

29 May 2026

Introduction

Previously, I wrote a tutorial on simple trend analysis using the Medical Expenditure Panel Survey (MEPS) data from the Agency for Healthcare Research and Quality (AHRQ). In that tutorial, I estimated the predicted values of the outcome (total healthcare expenditures) at specific time points for males and females (see figure below). However, I didn’t estimate the average slope across “all” time points.

Trends of total healthcare expenditure for males and females

Trends of total healthcare expenditure for males and females

Sometimes, we’re interested in the overall trend rather than the predicted values at each time point. Here are two questions this tutorial will try to answer:

  • What is the slope of the “best fit” line through the time points?

  • How do I plot the “best fit” line through these points?

To answer these questions, we’ll need to change our model and generate the predicted slopes for males and females. We will achieve this by modifying our model to use the time variable (e.g., year) as a continuous term instead of a factor (or categorical) term. This will give R the appropriate variable to estimate the slopes.

Motivating example

We will use the same data from the previous tutorial on simple trend analysis to answer the above questions. To learn more about each step, please refer to the following tutorial: MEPS Tutorial 5 - Simple Trend Analysis with Linear Models.

Step 1 - Load and clean data

############################ LOAD LIBRARIES ############################ 
### Load the MEPS package
library("MEPS") ## You need to load the library every time you restart R

### Load the other libraries
library("survey")
library("foreign")
library("dplyr")
library("ggplot2")
library("questionr") # remotes::install_github("juba/questionr")
library("lspline")  # devtools::install_github("mbojan/lspline", build_vignettes=TRUE)
library("ggeffects") # remotes::install_github("strengejacke/ggeffects")
library("margins")
library("gtsummary") # remotes::install_github("ddsjoberg/gtsummary")
library("sjPlot") # plot marginal effects ("plot_model" function)

############################ LOAD DATA ############################ 

#### Load data from AHRQ MEPS website
hc2021 = read_MEPS(file = "h233")
hc2020 = read_MEPS(file = "h224")
hc2019 = read_MEPS(file = "h216")
hc2018 = read_MEPS(file = "h209")
hc2017 = read_MEPS(file = "h201")
hc2016 = read_MEPS(file = "h192")

#### Change column names to lowercase
names(hc2021) <- tolower(names(hc2021))
names(hc2020) <- tolower(names(hc2020))
names(hc2019) <- tolower(names(hc2019))
names(hc2018) <- tolower(names(hc2018))
names(hc2017) <- tolower(names(hc2017))
names(hc2016) <- tolower(names(hc2016))

############################ LOAD LINKAGE FILES ############################ 
#### We need the linkage file with the appropriate stratum of the primary sampling strata (STRA9621) and primary sampling unit (PSU9621). (Note: Each year, the linkage file sampling unit name changes)
linkage = read_MEPS(type = "Pooled linkage") 
names(linkage) <- tolower(names(linkage)) # change variable name to lower case

############################ CREATE POOLED DATA ############################ 

#### Creating the pooled data file
#### Select specific variables for each year
### 2021
hc2021p = hc2021 %>%
  rename(
    perwt = perwt21f,
    totexp = totexp21) %>%
  select(
    dupersid, 
    panel, 
    varstr, 
    varpsu,
    perwt,
    sex,
    totexp)
hc2021p$year <- 2021

### 2020
hc2020p = hc2020 %>%
  rename(
    perwt = perwt20f,
    totexp = totexp20) %>%
  select(
    dupersid, 
    panel, 
    varstr, 
    varpsu,
    perwt,
    sex,
    totexp)
hc2020p$year <- 2020


### 2019
hc2019p = hc2019 %>%
  rename(
    perwt = perwt19f,
    totexp = totexp19) %>%
  select(
    dupersid, 
    panel, 
    varstr, 
    varpsu,
    perwt,
    sex,
    totexp)
hc2019p$year <- 2019

### 2018
hc2018p = hc2018 %>%
  rename(
    perwt = perwt18f,
    totexp = totexp18) %>%
  select(
    dupersid, 
    panel, 
    varstr, 
    varpsu,
    perwt,
    sex,
    totexp)
hc2018p$year <- 2018

### 2017
hc2017p = hc2017 %>%
  rename(
    perwt = perwt17f,
    totexp = totexp17) %>%
  select(
    dupersid, 
    panel, 
    varstr, 
    varpsu,
    perwt,
    sex,
    totexp)
hc2017p$year <- 2017

### 2016
hc2016p = hc2016 %>%
  rename(
    perwt = perwt16f,
    totexp = totexp16) %>%
  select(
    dupersid, 
    panel, 
    varstr, 
    varpsu,
    perwt,
    sex,
    totexp)
hc2016p$year <- 2016

############################ MERGE POOLED DATA ############################ 

#### Merge data and adjust the person weight by 6 years
pool_data = bind_rows(hc2021p, 
                      hc2020p,
                      hc2019p,
                      hc2018p,
                      hc2017p,
                      hc2016p) %>%
  mutate(poolwt = perwt / 6)

#### Merging with pooled survey primary sampling strata and unit
### Reduce the linkage file to only include dupersid, panel, stra9623, psu9623
### Note: The pooled linkage file is updated regularly, hence, the variable names will change; make sure to use the correct variable names.
linkage_file = linkage %>%
  select(dupersid, panel, stra9623, psu9623)


### Merge link file with main data
pool_data_linked = left_join(pool_data,
                             linkage_file, 
                             by = c("dupersid", "panel"))


############################ CREATE FACTOR VARIABLES ############################ 

#### Create factor variables:
### MALE
pool_data_linked$male[pool_data_linked$sex == 1] = 1
pool_data_linked$male[pool_data_linked$sex == 2] = 0
table(pool_data_linked$male)
## 
##     0     1 
## 95023 86626
pool_data_linked$male <- factor(pool_data_linked$male, levels = c(0, 1))


############################ SET SURVEY OPTIONS ############################ 

# Set the survey options
options(survey.lonely.psu = "adjust")
options(survey.adjust.domain.lonely = TRUE)


############################ SURVEY DESIGN ############################ 

# Define the survey design
survey_design = svydesign(
  id = ~psu9623,
  strata = ~stra9623,
  weights = ~poolwt,
  data = pool_data_linked,
  nest = TRUE)

Step 3: Run the regression model

After we visualize the total healthcare expenditures for males and females, we can construct the regression model.

In the previous article, I had set up the year variable as a factor variable. However, this prevents us from generating a smooth best fit line using regression methods. Therefore, we will need to set up the year variable as a continuous term.

I will call this model survey_model2.

############################ REGRESSION MODEL ############################ 

## Regression model:
survey_model2 <- svyglm(totexp ~ factor(male) + year + factor(male):year,
                        survey_design)

## Regression output:
# Summary table with 95% CI
round(
  cbind(
    summary(survey_model2, df.resid = degf(survey_model2$survey.design))$coef, 
    confint(survey_model2, df       = degf(survey_model2$survey.design))
  ), 4)
##                        Estimate  Std. Error t value Pr(>|t|)        2.5 %
## (Intercept)        -715509.7454 104700.5713 -6.8339   0.0000 -921233.5362
## factor(male)1       -25302.6924 153943.2033 -0.1644   0.8695 -327782.2199
## year                   357.6989     51.8849  6.8941   0.0000     255.7515
## factor(male)1:year      11.9953     76.2976  0.1572   0.8751    -137.9203
##                          97.5 %
## (Intercept)        -509785.9546
## factor(male)1       277176.8352
## year                   459.6463
## factor(male)1:year     161.9108

We can get the change in the total healthcare expenditures for males and females using the following commands:

############################ SLOPE ESTIMATIONS ############################ 

# Female slope
coef(survey_model2)["year"]
##     year 
## 357.6989
# Male slope
coef(survey_model2)["year"] + coef(survey_model2)["factor(male)1:year"]
##     year 
## 369.6942

For females, the average increase in total healthcare expenditure per year was $271 and for males it was $267.

However, this doesn’t give us the 95% confidence interval.

Step 4: Estimate the slopes

To get those, we will need to use the margins command.

############################ MARGINS ############################ 

ame1 <- margins(survey_model2, 
                design = survey_design, 
                variables = "year", 
                at = list(male = 0:1))
summary(ame1)
##  factor   male      AME      SE      z      p    lower    upper
##    year 1.0000 357.6989 51.8844 6.8942 0.0000 256.0074 459.3905
##    year 2.0000 369.6942 60.4552 6.1152 0.0000 251.2042 488.1842

For females, the average increase in total healthcare expenditure per year was $271 (95% CI: -$29, $572) and for males it was $267 (95% CI: -$93, $628).

Step 5: Plots the slopes for males and females

Once we have this, we can plot the best fit line across the average total healthcare expenditure for males and females.

To do that, we will need to make sure we generate the predicted values using the survey weights. Then, we will create a new dataframe that contains the survey-weighted predicted values, which we will plot using ggplot.

############################ SURVEY-WEIGHTED PREDICTION ############################ 

### Average totexp by year for males and females
survey_means <- svyby(formula = ~ totexp, 
                      by = ~ male + year, 
                      design = survey_design,
                      FUN = svymean,
                      na.rm = TRUE) %>%
                mutate(male = factor(male))

#### Create a "empty" dataframe to collect the prediction
pred_grid <- expand.grid(year = c(2016, 2017, 2018, 2019, 2020, 2021),
                         male = c(0, 1)
                         )

#### Generate the prediction and insert into the "empty" dataframe
pred_grid$prediction <- predict(survey_model2, 
                                newdata = pred_grid)


##################### PLOT THE SURVEY-WEIGHTED PREDICTIONS ###################### 

#### Plot the survey-weighted average totexp at each year and add the best fit line
ggplot() +
  geom_point(data = survey_means,
             aes(x = year, 
                 y = totexp, 
                 color = factor(male), 
                 group = factor(male)),
             size = 5) +
  geom_line(data = pred_grid,
            aes(x = year, 
                y = prediction, 
                color = factor(male), 
                group = factor(male)),
            linewidth = 1) +
  scale_x_continuous(breaks = c(2016, 2017, 2018, 2019, 2020, 2021)) +
  scale_color_discrete(labels = c("Female", "Male"), name = "Gender") +
  labs(title = "Total Healthcare Expenditure Trends",
       y     = "Total Healthcare Expenditure",
       x     = "Time") +
  theme_minimal()

Upon visual inspection, we can see that the fitted lines are both increasing for males and females at a rate of $271 (95% CI: -$29, $572) and $267 (95% CI: -$93, $628) per year, respectively

Conclusions

As I mentioend, the last tutorial did not plot the best fitted line nor generate the average slope for males and females. In order for us to get these, we had to change the year term from a factor variable to a continuous variable in the regression model. Afterwards, we had to estimate the predicted values and save those findings into a dataframe so that we can plot these using ggplot.

Once we have made these changes, we can report the average rate of total healthcare expenditure change per year for males and females.

Acknowledgements

I used Claude Sonnet 4.6 to help create some of the codes to plot the survey-fitted average total healthcare expenditure and fitted lines from the regression model.

Work in Progress

This is a work in progress so expect some updates in the future.

Disclaimers

Any errors or mistakes are those of the author.

This is only for educational purposes.

This was built under R version 4.4.2 “Pile of Leaves”