Cost as a dependent variable: Two-Part Model with Bootstrap using R

Mark Bounthavong

31 March 2026 (Updated: 03 April 2026)

Introduction

In this article, I wanted to expand on a previous post that describes using a two-part model to model cost (or total expenditure) as an outcome with data from the Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS). In the previous article, I used the twopartm package, which is great at leveraging the two-part model approach. However, it does not appear to handle data from complex survey designs like MEPS.

The best way to handle complex survey design data with weights using the two-part model approach is to perform the estimations for each part separately and then combine them.

With a little help from some AI chatbots, I was able to construct a viable code that not only estimates and combines both parts of the two-part model, but also allows me to bootstrap the results to generate 95% confidence intervals (CI).

Loading data

Step 1: Load the data from the AHRQ MEPS site.

MEPS provides a very nice package called MEPS that allows you to connect to their data repository to download any of their public use files. The code below connects to the MEPS data repository and downloads the data. The downloaded data is parsed to variables that I’m going to use in the two-part model such as the total expenditures, a variable for sex, and the weights/sampling units from the complex survey design. Note: I only include individuals with positive weights. The labled the dataframe hc2021p.

### LOAD Libraries
library("MEPS")
library("tidyverse")
library("survey")
library("gmodels")
library("boot")


### LOAD DATA
hc2021 = read_MEPS(file = "h233")
names(hc2021) <- tolower(names(hc2021))   ## Convert column headers to lower case

### We'll parse the data to a few key variables
hc2021p = hc2021 %>%
  rename(
    age = age21x,
    totexp = totexp21,
    ertexp = ertexp21,
    perwt21f = perwt21f,
    varstr = varstr,
    varpsu = varpsu) %>%
  select(
    dupersid,
    age,
    sex,
    totexp,
    ertexp,
    perwt21f,
    varstr,
    varpsu)
hc2021p$year <- 2021

hc2021p <- filter(hc2021p, perwt21f > 0) # Only select individuals with a positive weight
nrow(hc2021p) # Check number of rows
## [1] 27332

Step 2: Clean data and generate new variables

I next clean the data and create new variables for male and agecat. I wanted to change the sex variable where 0 == female and 1 == male. I also wanted to create a factor variable for age so that I could see the different effects at various age categories. Most importantly, I created a binary variable for the outcome to indicate whether is a zero or nonzero costs, which will be used in the first part of the the two-part model. Lastly, we only want to include complete cases. There are several NAs in the age variable, so we will remove those rows using the complete.cases() function.

### CLEAN DATA
#### Generate male variable
hc2021p$male[hc2021p$sex == 2] = 0
hc2021p$male[hc2021p$sex == 1] = 1
table(hc2021p$male)
## 
##     0     1 
## 14441 12891
#### Create a binary nonzero cost variable
hc2021p$nonzero[hc2021p$totexp == 0] = 0
hc2021p$nonzero[hc2021p$totexp >  0] = 1
table(hc2021p$nonzero)
## 
##     0     1 
##  3763 23569
#### Create 5-levels of age categories
hc2021p$agecat[hc2021p$age>=0  & hc2021p$age<30] = 0
hc2021p$agecat[hc2021p$age>=30 & hc2021p$age<40] = 1
hc2021p$agecat[hc2021p$age>=40 & hc2021p$age<50] = 2
hc2021p$agecat[hc2021p$age>=50 & hc2021p$age<60] = 3
hc2021p$agecat[hc2021p$age>=60 & hc2021p$age<99] = 4
hc2021p$agecat <- as.factor(hc2021p$agecat)
table(hc2021p$agecat)
## 
##    0    1    2    3    4 
## 8688 3269 3155 3581 8416
#### Only include complete cases
vars_needed <- c("nonzero", "male", "agecat", "totexp", "varpsu", "varstr", "perwt21f")
hc2021p_complete <- hc2021p[complete.cases(hc2021p[, vars_needed]), ]

Step 3: Contructing the two-part model

I provide the code that I used to construct the two-part model. The first part of the model is a logit model where the outcome is the binary nonezero variable to indicate that the individual has either a zero or non-zero total expenditure. I also added the male and agecat variables as the predictors, which I then interacted with each other to create an interaction term male:agecat.

Since we are combining the first part (logit model) with the second part (Gamma model), we need to do some predictions. In this example, the recycled predictions method is used to estimate the average total expenditures for males and females. In recycled predictions, we estimate the weighted average total expenditures if all individuals were males and then estimate the weighted average total expenditures if all individuals were females. Kleinman and Norton have a great paper on performing “recycled predictions.”[1]

\[\begin{align*} \text{Marginal Effect} = E[Y | male = 1] - E[Y | male == 0] \\ \end{align*}\]

Using this code, the average difference in total expenditure between male and female is approximately -$1098. In other words, males have a lower total expenditure compared to females.

#############################
### Two-part model
#############################

### APPLY survey weight
options(survey.lonely.psu = 'adjust')

mepsdsgn = svydesign(
  id = ~varpsu,
  strata = ~varstr,
  weights = ~perwt21f,
  data = hc2021p_complete,
  nest = TRUE)

### PART 1 - Logit model
#### Run the logit model and predict the probabilities (include interaction term)
part1 <- svyglm(nonzero ~ male + agecat + male:agecat,
                design = mepsdsgn,
                family = quasibinomial(link = "logit"))

### PART 2 - Gamma model
#### Subset data with nonzero
mepsdsgn_nonzero <- subset(mepsdsgn, nonzero == 1)

#### Run the gamma model and predict the probabilities (include interaction term)
part2 <- svyglm(totexp ~ male + agecat + male:agecat,
                design = mepsdsgn_nonzero,
                family = Gamma(link = "log"))

#### Recycled predictions
# E[Y | male = 1] - E[Y | male = 0]
recycled_pred_female <- hc2021p_complete
recycled_pred_female$male <- 0
  
recycled_pred_male <- hc2021p_complete
recycled_pred_male$male <- 1
  
#### Predict for male == 0
mu0_part1 <- predict(part1, recycled_pred_female, type = "response", na.action = na.pass)
mu0_part2 <- predict(part2, recycled_pred_female, type = "response", na.action = na.pass)

#### Predict for male == 1
mu1_part1 <- predict(part1, recycled_pred_male, type = "response", na.action = na.pass) 
mu1_part2 <- predict(part2, recycled_pred_male, type = "response", na.action = na.pass)
  
#### Combine
mu0 <- mu0_part1 * mu0_part2
mu1 <- mu1_part1 * mu1_part2
  
#### Weighted means ("margins")
valid <- !is.na(mu0) & !is.na(mu1)
weighted_mean <- weights(mepsdsgn, type = "sampling")[valid]
  
totexp0 <- sum(weighted_mean * mu0) / sum(weighted_mean)
totexp1 <- sum(weighted_mean * mu1) / sum(weighted_mean)
  
#### DIFFERENCE in costs between males and females
diff_groups <- totexp1 - totexp0
print(diff_groups)
## [1] -1098.382
#### On average, males spend less than females by -$1098.

Step 4: Write the Bootstrap Function

The next step involves writing the bootstrap function. The bootstrap function will run the two-part model and return the average total expenditures for males and females along with the difference between them. In this example, I performed a bootstrap with 1000 replications. I named the function twopm.

#################################
#### Bootstrap code
#################################

#### Need to use this to estimate the 95% CIs
twopm <- function(data, indices) {  
  hc2021p_complete <- data[indices, ]

  ### SVY DESIGN
  ### APPLY survey weight
  options(survey.lonely.psu = 'adjust')
  mepsdsgn = svydesign(
    id = ~varpsu,
    strata = ~varstr,
    weights = ~perwt21f,
    data = hc2021p_complete,
    nest = TRUE)
  
  ### Two-part model
  
  ### Part 1 - Logit model
  #### Run the logit model and predict the probabilities (include interaction term)
  part1 <- svyglm(nonzero ~ male + agecat + male:agecat,
                  design = mepsdsgn,
                  family = quasibinomial(link = "logit"))
  
  ### Part 2 - Gamma model
  #### Subset data with nonzero
  mepsdsgn_nonzero <- subset(mepsdsgn, nonzero == 1)
  
  #### Run the gamma model and predict the probabilities (include interaction term)
  part2 <- svyglm(totexp ~ male + agecat + male:agecat,
                  design = mepsdsgn_nonzero,
                  family = Gamma(link = "log"))
  
  #### Recycled predictions
  # E[Y | male = 1] - E[Y | male = 0]
  recycled_pred_female <- hc2021p_complete
  recycled_pred_female$male <- 0
  
  recycled_pred_male <- hc2021p_complete
  recycled_pred_male$male <- 1
  
  #### Predict for male == 0
  mu0_part1 <- predict(part1, recycled_pred_female, type = "response", na.action = na.pass)
  mu0_part2 <- predict(part2, recycled_pred_female, type = "response", na.action = na.pass)

  #### Predict for male == 1
  mu1_part1 <- predict(part1, recycled_pred_male, type = "response", na.action = na.pass) 
  mu1_part2 <- predict(part2, recycled_pred_male, type = "response", na.action = na.pass)
  
  #### Combine
  mu0 <- mu0_part1 * mu0_part2
  mu1 <- mu1_part1 * mu1_part2
  
  #### Weighted means ("margins")
  valid <- !is.na(mu0) & !is.na(mu1)
  weighted_mean <- weights(mepsdsgn, type = "sampling")[valid]
  
  totexp0 <- sum(weighted_mean * mu0) / sum(weighted_mean)
  totexp1 <- sum(weighted_mean * mu1) / sum(weighted_mean)
  
  #### Summarize
  return(c(totexp0 = totexp0, 
           totexp1 = totexp1,
           diff = totexp1 - totexp0))
}

Step 5: Run the bootstrap

The next part involves using the boot() function, which is from the boot package, to perform the bootstrap analysis. Make sure to have the boot package loaded as a library. I used 1000 bootstrap replicates with replacement for this example.

### BOOTSTRAP 
set.seed(12321) # Set seed to reproduce findings
bootstrap_results <- boot(data = hc2021p_complete,
                          statistic = twopm,
                          R = 1000)

Step 6: Return the bootstrap results

Once the bootstrap results are saved in the bootstrap_results object, we can generate the average total expenditures for males and females, the differences between them, and their corresponding 95% CIs. We can also plot the distributions for each of the output.

The boot.ci() generates the 95% confidence interval.

#### Generate the summary of the average total expenditures and difference
bootstrap_results$t0
##   totexp0   totexp1      diff 
##  7359.781  6261.399 -1098.382
#### Generate the 95% CI of the mean differences
boot.ci(bootstrap_results, index = 1, type = "perc") # Males
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 1)
## 
## Intervals : 
## Level     Percentile     
## 95%   (6961, 7751 )  
## Calculations and Intervals on Original Scale
boot.ci(bootstrap_results, index = 2, type = "perc") # Females
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 2)
## 
## Intervals : 
## Level     Percentile     
## 95%   (5790, 6787 )  
## Calculations and Intervals on Original Scale
boot.ci(bootstrap_results, index = 3, type = "perc") # Diff b/w Males and Females
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 3)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-1767,  -393 )  
## Calculations and Intervals on Original Scale
#### Plot the distribution of the mean differences
plot(bootstrap_results, index = 1) # Males

plot(bootstrap_results, index = 2) # Females

plot(bootstrap_results, index = 3) # Diff b/w Males and Females

Step 7: Generate the marginal effects at each age category

Next, we can generate the marginal effects at each age category along with their corresponding 95% CIs. This could be useful for you if you want to know whether the differences in total expenditures were significantly different at different age categories.

The print(slopes) command line in the function will generate the average difference in total expenditures at each age categories:

  • t1* = age < 30 yrs
  • t2* = age >= 30 & < 40 yrs
  • t3* = age >= 40 & < 50 yrs
  • t4* = age >= 50 & < 60 yrs
  • t5* = age >= 60 yrs

The boot.ci() function will generate the 95% confidence interval at each index where index = 1 corresponds to agecate = 1 and so on.

### Estimate the differences in slopes by year
#### Bootstrap code
slopes <- function(data, indices) {  
  hc2021p_complete <- data[indices, ]
  
  ### SVY DESIGN
  ### APPLY survey weight
  options(survey.lonely.psu = 'adjust')
  mepsdsgn = svydesign(
    id = ~varpsu,
    strata = ~varstr,
    weights = ~perwt21f,
    data = hc2021p_complete,
    nest = TRUE)
  
  ### Two-part model
  
  ### Part 1 - Logit model
  #### Run the logit model and predict the probabilities (include interaction term)
  part1 <- svyglm(nonzero ~ male + agecat + male:agecat, 
                  design = mepsdsgn,
                  family = quasibinomial(link = "logit"))
  
  ### Part 2 - Gamma model
  #### Subset data with nonzero
  mepsdsgn_nonzero <- subset(mepsdsgn, nonzero == 1)
  
  #### Run the gamma model and predict the probabilities (include interaction term)
  part2 <- svyglm(totexp ~ male + agecat + male:agecat,
                  design = mepsdsgn_nonzero,
                  family = Gamma(link = "log"))
  
  #### STORE OUTPUTS
  years <- levels(hc2021p_complete$agecat)
  
  results <- sapply(years, function(yr) {
    group0 <- data.frame(agecat = factor(yr, levels = years), male = 0)
    group1 <- data.frame(agecat = factor(yr, levels = years), male = 1) 
    
    #### PREDICTIONS
    pred_part1_0 <- predict(part1, group0, type = "response", na.action = na.pass)
    pred_part1_1 <- predict(part1, group1, type = "response", na.action = na.pass)
    
    pred_part2_0 <- predict(part2, group0, type = "response", na.action = na.pass)
    pred_part2_1 <- predict(part2, group1, type = "response", na.action = na.pass)
    
    totexp0 <- pred_part1_0 * pred_part2_0
    totexp1 <- pred_part1_1 * pred_part2_1
    
    #### Summarize
    return(totexp1 - totexp0)
  })
  
  return(results)
}

### Estimate the slopes at each age category
boot_fun_slopes <- function(data, indices) {
  d <- data[indices, ]
  slopes(d)
}

### Perform the boostrap
boot_slopes <- boot(hc2021p_complete, 
                    boot_fun_slopes, 
                    R = 1000)

#### Mean diff in slopes between males and females at each year
print(boot_slopes)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = hc2021p_complete, statistic = boot_fun_slopes, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original     bias    std. error
## t1*  -227.6580   9.519676    513.0284
## t2* -3040.0937 -21.011643    508.3212
## t3* -3852.4312 -15.328010    792.3462
## t4*  -680.3140  15.227059    962.9211
## t5*  -172.7028  -5.703279    833.6766
#### Confidence interval at each agecat
boot.ci(boot_slopes, index = 1, type = "perc")  # agecat = 1
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 1)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-1127.1,   892.7 )  
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 2, type = "perc")  # agecat = 2
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 2)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-4087, -2113 )  
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 3, type = "perc")  # agecat = 3
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 3)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-5427, -2350 )  
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 4, type = "perc")  # agecat = 4
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 4)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-2537.6,  1158.8 )  
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 5, type = "perc")  # agecat = 5
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_slopes, type = "perc", index = 5)
## 
## Intervals : 
## Level     Percentile     
## 95%   (-1786.7,  1472.1 )  
## Calculations and Intervals on Original Scale

Step 8: Summarize findings

We can summarize our findings into a table. I used Excel to construct the tables that I was interested in from this example.

Table A. Total expenditures between males and females.

Group Mean 95% CI
Males $6261 $5850, $6931
Females $7360 $7090, $7835
Diff b/w Males & Females -$1098 -$1494, -$189

Table B. Difference in total expenditures between male and females by age categories.

Age Category Mean 95% CI
Age < 30 yr -$228 -$1127, $893
Age >= 30 yr and < 40 yr -$3040 -$4087, -$2113
Age >= 40 yr and < 50 yr -$3852 -$5427, -$2350
Age >= 50 yr and < 60 yr -$680 -$2538, $1159
Age >= 60 yr -$173 -$1787, $1472

Conclusions

The average total expenditures of males and females were $3291 and $3063, respectively. The average difference between them was -$1098 (95% CI: -1767, 393). Statistically significant differences between males and females were reported for age categories >=30 and <40 years and >=40 and < 50 years. The use of the two-part model can incorporate complex survey design methods, but we needed to construct each part separately and then combine them. We then created a function to use in a bootstrap analysis to generate the 95% CI from the two-part model. Lastly, we estimated the marginal effects at each age category by constructing a separate bootstrap function.

Overall, this experience has resulted in a useful code to construct a two-part model using bootstrap methods. I’m pleased with how this turned out, and I hope to work on this in future iterations.

References

    1. Kleinman LC, Norton EC. What’s the Risk? A simple approach for estimating adjusted risk measures from nonlinear models including logistic regression. Health Serv Res. 2009 Feb;44(1):288-302. doi: 10.1111/j.1475-6773.2008.00900.x. Epub 2008 Sep 11. PMID: 18793213; PMCID: PMC2669627.
    1. A great paper on the boot package (“Resampling Methods in R: the boot package”) by Angelo J. Canty is available here.

Acknowledgements

I used Claude and ChatGPT to help construct the bootstrap code for this exercise. This was my first time using these tools, and I have to admit that they were extremely helpful. Although there were errors and hallucinations, I was able to take what was provided to construct a bootstrap function that I was happy with. I plan to experiment with these tools in the future given how impressed I’ve been with this experience.

Disclosures and Disclaimers

This is for educational purposes only.

Since this is a work in progress, expect to see updates in the future.