Cost as a dependent variable: Two-Part Model with Bootstrap using R

Mark Bounthavong

31 March 2026

Introduction

In this article, I wanted to expand on a previous post that describes using a two-part model to model cost (or total expenditure) as an outcome with data from the Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS). In the previous article, I used the twopartm package, which is great at leveraging the two-part model approach. However, it does not appear to handle data from complex survey designs like MEPS.

The best way to handle complex survey design data with weights using the two-part model approach is to perform the estimations for each part separately and then combine them.

With a little help from some AI chatbots, I was able to construct a viable code that not only estimates and combines both parts of the two-part model, but also allows me to bootstrap the results to generate 95% confidence intervals (CI).

Loading data

Step 1: Load the data from the AHRQ MEPS site.

MEPS provides a very nice package called MEPS that allows you to connect to their data repository to download any of their public use files. The code below connects to the MEPS data repository and downloads the data. The downloaded data is parsed to variables that I’m going to use in the two-part model such as the total expenditures, a variable for sex, and the weights/sampling units from the complex survey design. Note: I only include individuals with positive weights. The labled the dataframe hc2021p.

### LOAD Libraries
library("MEPS")
library("tidyverse")
library("survey")
library("gmodels")
library("boot")


### LOAD DATA
hc2021 = read_MEPS(file = "h233")
names(hc2021) <- tolower(names(hc2021))   ## Convert column headers to lower case

### We'll parse the data to a few key variables
hc2021p = hc2021 %>%
  rename(
    age = age21x,
    totexp = totexp21,
    ertexp = ertexp21,
    perwt21f = perwt21f,
    varstr = varstr,
    varpsu = varpsu) %>%
  select(
    dupersid,
    age,
    sex,
    totexp,
    ertexp,
    perwt21f,
    varstr,
    varpsu)
hc2021p$year <- 2021

hc2021p <- filter(hc2021p, perwt21f > 0) # Only select individuals with a positive weight
nrow(hc2021p) # Check number of rows
## [1] 27332

Step 2: Clean data and generate new variables

I next clean the data and create new variables for male and agecat. I wanted to change the sex variable where 0 == female and 1 == male. I also wanted to create a factor variable for age so that I could see the different effects at various age categories. Most importantly, I created a binary variable for the outcome to indicate whether is a zero or nonzero costs, which will be used in the first part of the the two-part model.

### CLEAN DATA
#### Generate male variable
hc2021p$male[hc2021p$sex == 2] = 0
hc2021p$male[hc2021p$sex == 1] = 1
table(hc2021p$male)
## 
##     0     1 
## 14441 12891
CrossTable(hc2021p$male, hc2021p$sex)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  27332 
## 
##  
##              | hc2021p$sex 
## hc2021p$male |         1 |         2 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |         0 |     14441 |     14441 | 
##              |  6811.025 |  6079.975 |           | 
##              |     0.000 |     1.000 |     0.528 | 
##              |     0.000 |     1.000 |           | 
##              |     0.000 |     0.528 |           | 
## -------------|-----------|-----------|-----------|
##            1 |     12891 |         0 |     12891 | 
##              |  7629.975 |  6811.025 |           | 
##              |     1.000 |     0.000 |     0.472 | 
##              |     1.000 |     0.000 |           | 
##              |     0.472 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |     12891 |     14441 |     27332 | 
##              |     0.472 |     0.528 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
#### Create a binary nonzero cost variable
hc2021p$nonzero[hc2021p$totexp == 0] = 0
hc2021p$nonzero[hc2021p$totexp >  0] = 1
table(hc2021p$nonzero)
## 
##     0     1 
##  3763 23569
#### Create 5-levels of age categories
hc2021p$agecat[hc2021p$age>=0  & hc2021p$age<30] = 0
hc2021p$agecat[hc2021p$age>=30 & hc2021p$age<40] = 1
hc2021p$agecat[hc2021p$age>=40 & hc2021p$age<50] = 2
hc2021p$agecat[hc2021p$age>=50 & hc2021p$age<60] = 3
hc2021p$agecat[hc2021p$age>=60 & hc2021p$age<99] = 4
hc2021p$agecat <- as.factor(hc2021p$agecat)
table(hc2021p$agecat)
## 
##    0    1    2    3    4 
## 8688 3269 3155 3581 8416

Step 3: Contructing the two-part model

I provide the code that I used to construct the two-part model. The first part of the model is a logit model where the outcome is the binary nonezero variable to indicate that the individual has either a zero or non-zero total expenditure. I also added the male and agecat variables as the predictors, which I then interacted with each other to create an interaction term male:agecat.

Using this code, the average difference in total expenditure between male and female is approximately -$228. In other words, males have a lower total expenditure compared to females.

#############################
### Two-part model
#############################

### APPLY survey weight
options(survey.lonely.psu = 'adjust')

mepsdsgn = svydesign(
  id = ~varpsu,
  strata = ~varstr,
  weights = ~perwt21f,
  data = hc2021p,
  nest = TRUE)

### PART 1 - Logit model
#### Run the logit model and predict the probabilities (include interaction term)
part1 <- svyglm(nonzero ~ male + agecat + male:agecat,
                design = mepsdsgn,
                family = quasibinomial(link = "logit"))

### PART 2 - Gamma model
#### Subset data with nonzero
mepsdsgn_nonzero <- subset(mepsdsgn, totexp > 0)

#### Run the gamma model and predict the probabilities (include interaction term)
part2 <- svyglm(totexp ~ male + agecat + male:agecat,
                design = mepsdsgn_nonzero,
                family = Gamma(link = "log"))

#### STORE OUTPUTS for each agecat levels by male/female
group0 <- data.frame(agecat = factor("0", 
                                     levels = levels(hc2021p$agecat)), male = 0)
group1 <- data.frame(agecat = factor("0", 
                                     levels = levels(hc2021p$agecat)), male = 1) 

#### PREDICTIONS
pred_part1_0 <- predict(part1, group0, type = "response")
pred_part1_1 <- predict(part1, group1, type = "response")

pred_part2_0 <- predict(part2, group0, type = "response")
pred_part2_1 <- predict(part2, group1, type = "response")

totexp0 <- pred_part1_0 * pred_part2_0
totexp1 <- pred_part1_1 * pred_part2_1

#### DIFFERENCE in costs between male and female
diff_groups <- totexp1 - totexp0
print(diff_groups)
##   response     SE
## 1  -227.66 0.0106
#### On average, males spend less than non-males by -$228.

Step 4: Write the Bootstrap Function

The next step involves writing the bootstrap function. The bootstrap function will run the two-part model and return the average total expenditures for males and females along with the difference between them. I named the function twopm.

#################################
#### Bootstrap code
#################################

#### Need to use this to estimate the 95% CIs
twopm <- function(data, indices) {  
  hc2021p <- data[indices, ]
  
  ### SVY DESIGN
  ### APPLY survey weight
  options(survey.lonely.psu = 'adjust')
  mepsdsgn = svydesign(
    id = ~varpsu,
    strata = ~varstr,
    weights = ~perwt21f,
    data = hc2021p,
    nest = TRUE)
  
  ### Two-part model
  
  ### Part 1 - Logit model
  #### Run the logit model and predict the probabilities (include interaction term)
  part1 <- svyglm(nonzero ~ male + agecat + male:agecat,
                  design = mepsdsgn,
                  family = quasibinomial(link = "logit"))
  
  ### Part 2 - Gamma model
  #### Subset data with nonzero
  mepsdsgn_nonzero <- subset(mepsdsgn, totexp > 0)
  
  #### Run the gamma model and predict the probabilities (include interaction term)
  part2 <- svyglm(totexp ~ male + agecat + male:agecat,
                  design = mepsdsgn_nonzero,
                  family = Gamma(link = "log"))
  
  #### STORE OUTPUTS
  group0 <- data.frame(agecat = factor("0", 
                                       levels = levels(hc2021p$agecat)), male = 0)
  group1 <- data.frame(agecat = factor("0", 
                                       levels = levels(hc2021p$agecat)), male = 1) 
  
  #### PREDICTIONS
  pred_part1_0 <- predict(part1, group0, type = "response")
  pred_part1_1 <- predict(part1, group1, type = "response")
  
  pred_part2_0 <- predict(part2, group0, type = "response")
  pred_part2_1 <- predict(part2, group1, type = "response")
  
  totexp0 <- pred_part1_0 * pred_part2_0
  totexp1 <- pred_part1_1 * pred_part2_1
  
  #### Summarize
  return(c(totexp0 = totexp0, 
           totexp1 = totexp1,
           diff = totexp1 - totexp0))
}

Step 5: Run the bootstrap

The next part involves using the boot() function, which is from the boot package, to perform the bootstrap analysis. Make sure to have the boot package loaded as a library. I used 1000 bootstrap replicates with replacement for this example.

### BOOTSTRAP 
set.seed(12321) # Set seed to reproduce findings
bootstrap_results <- boot(data = hc2021p,
                          statistic = twopm,
                          R = 1000)

Step 6: Return the bootstrap results

Once the bootstrap results are saved in the bootstrap_results object, we can generate the average total expenditures for males and females, the differences between them, and their corresponding 95% CIs. We can also plot the distributions for teach of the output.

The boot.ci() generates the 95% confidence interval.

#### Generate the summary of the average total expenditures and difference
bootstrap_results$t0
## totexp0.1 totexp1.1    diff.1 
##  3290.543  3062.885  -227.658
#### Generate the 95% CI of the mean differences
boot.ci(bootstrap_results, index = 1, type = "perc") # Males
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 1)
## 
## Intervals : 
## Level     Percentile     
## 95%   (2856, 3787 )  
## Calculations and Intervals on Original Scale
boot.ci(bootstrap_results, index = 2, type = "perc") # Females
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 2)
## 
## Intervals : 
## Level     Percentile     
## 95%   (2370, 3979 )  
## Calculations and Intervals on Original Scale
boot.ci(bootstrap_results, index = 3, type = "perc") # Diff
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 3)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-1121.6,   819.7 )  
## Calculations and Intervals on Original Scale
#### Plot the distribution of the mean differences
plot(bootstrap_results, index = 1)

plot(bootstrap_results, index = 2)

plot(bootstrap_results, index = 3)

Step 7: Generate the marginal effects at each age category

Next, we can generate the marginal effects at each age category along with their corresponding 95% CIs. This could be useful for you if you want to know whether the differences in total expenditures were significantly different at different age categories.

The print(slopes) command line in the function will generate the average difference in total expenditures at each age categories:

  • t1* = age < 30 yrs
  • t2* = age >= 30 & < 40 yrs
  • t3* = age >= 40 & < 50 yrs
  • t4* = age >= 50 & < 60 yrs
  • t5* = age >= 60 yrs

The boot.ci() function will generate the 95% confidence interval at each index where index = 1 corresponds to agecate = 1 and so on.

### Estimate the differences in slopes by year
#### Bootstrap code
slopes <- function(data, indices) {  
  hc2021p <- data[indices, ]
  
  ### SVY DESIGN
  ### APPLY survey weight
  options(survey.lonely.psu = 'adjust')
  mepsdsgn = svydesign(
    id = ~varpsu,
    strata = ~varstr,
    weights = ~perwt21f,
    data = hc2021p,
    nest = TRUE)
  
  ### Two-part model
  
  ### Part 1 - Logit model
  #### Inspect the total expenditure (totexp)
  summary(hc2021p$totexp)
  
  #### Run the logit model and predict the probabilities (include interaction term)
  part1 <- svyglm(nonzero ~ male + agecat + male:agecat, 
                  design = mepsdsgn,
                  family = quasibinomial(link = "logit"))
  
  ### Part 2 - Gamma model
  #### Subset data with nonzero
  mepsdsgn_nonzero <- subset(mepsdsgn, totexp > 0)
  
  #### Run the gamma model and predict the probabilities (include interaction term)
  part2 <- svyglm(totexp ~ male + agecat + male:agecat,
                  design = mepsdsgn_nonzero,
                  family = Gamma(link = "log"))
  
  #### STORE OUTPUTS
  years <- levels(hc2021p$agecat)
  
  results <- sapply(years, function(yr) {
    group0 <- data.frame(agecat = factor(yr, levels = years), male = 0)
    group1 <- data.frame(agecat = factor(yr, levels = years), male = 1) 
    
    #### PREDICTIONS
    pred_part1_0 <- predict(part1, group0, type = "response")
    pred_part1_1 <- predict(part1, group1, type = "response")
    
    pred_part2_0 <- predict(part2, group0, type = "response")
    pred_part2_1 <- predict(part2, group1, type = "response")
    
    totexp0 <- pred_part1_0 * pred_part2_0
    totexp1 <- pred_part1_1 * pred_part2_1
    
    #### Summarize
    return(totexp1 - totexp0)
  })
  
  return(results)
}

### Estimate the slopes at each age category
boot_fun_slopes <- function(data, indices) {
  d <- data[indices, ]
  slopes(d)
}

### Perform the boostrap
boot_slopes <- boot(hc2021p, 
                    boot_fun_slopes, 
                    R = 1000)

#### Mean diff in slopes between male and non-male at each years (use t0)
print(boot_slopes)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = hc2021p, statistic = boot_fun_slopes, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original    bias    std. error
## t1*  -227.6580  3.280211    508.8397
## t2* -3040.0937 13.152142    507.3912
## t3* -3852.4312 45.070767    813.3686
## t4*  -680.3140 -2.739276    977.6251
## t5*  -172.7028 18.594519    811.8814
#### Confidence interval at each agecat
boot.ci(boot_slopes, index = 1, type = "perc")  # agecat = 1
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 1)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-1130.5,   864.8 )  
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 2, type = "perc")  # agecat = 2
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 2)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-4036, -2045 )  
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 3, type = "perc")  # agecat = 3
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 3)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-5395, -2181 )  
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 4, type = "perc")  # agecat = 4
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 4)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-2710.8,  1191.2 )  
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 5, type = "perc")  # agecat = 5
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 5)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-1781.7,  1441.8 )  
## Calculations and Intervals on Original Scale

Step 8: Summarize findings

We can summarize our findings into a table. I used Excel to construct the tables that I was interested in from this example.

Table A. Total expenditures between males and females.

Group Mean 95% CI
Males $3291 $2856, $3787
Females $3063 $2370, $3979
Diff b/w Males & Females -$228 -$1122, $820

Table B. Difference in total expenditures between male and females by age categories.

Age Category Mean 95% CI
Age < 30 yr -$228 -$1131, $865
Age >= 30 yr and < 40 yr -$3040 -$4036, -$2045
Age >= 40 yr and < 50 yr -$3852 -$5359, -$2181
Age >= 50 yr and < 60 yr -$680 -$2711, $1191
Age >= 60 yr -$173 -$1782, $1442

Conclusions

The average total expenditures of males and females were $3291 and $3063, respectively. The average difference between them was -$228 (95% CI: -1122, 820). Statistically significant differences between males and females were reported for age categories >=30 and <40 years and >=40 and < 50 years.

The use of the two-part model can incorporate complex survey design methods, but we needed to construct each part separately and then combine them. We then created a function to use in a bootstrap analysis to generate the 95% CI from the two-part model. Lastly, we estimated the marginal effects at each age category by constructing a separate bootstrap function.

Overall, this experience has resulted in a useful code to construct a two-part model using bootstrap methods. I’m pleased with how this turned out, and I hope to work on this in future iterations.

References

A great paper on the boot package (“Resampling Methods in R: the boot package”) by Angelo J. Canty is available here.

Acknowledgements

I used Claude and ChatGPT to help construct the bootstrap code for this exercise. This was my first time using these tools, and I have to admit that they were extremely helpful. Although there were errors and hallucinations, I was able to take what was provided to construct a bootstrap function that I was happy with. I plan to experiment with these tools in the future given how impressed I’ve been with this experience.

Disclosures and Disclaimers

This is for educational purposes only.

Since this is a work in progress, expect to see updates in the future.