Introduction
In this article, I wanted to expand on a previous post that
describes using a two-part model to model cost (or total expenditure) as
an outcome with data from the Agency for Healthcare Research and Quality
(AHRQ) Medical Expenditure Panel Survey (MEPS). In the previous article,
I used the twopartm package, which is great at leveraging
the two-part model approach. However, it does not appear to handle data
from complex survey designs like MEPS.
The best way to handle complex survey design data with weights using the two-part model approach is to perform the estimations for each part separately and then combine them.
With a little help from some AI chatbots, I was able to construct a viable code that not only estimates and combines both parts of the two-part model, but also allows me to bootstrap the results to generate 95% confidence intervals (CI).
Loading data
Step 1: Load the data from the AHRQ MEPS site.
MEPS provides a very nice package called MEPS that
allows you to connect to their data repository to download any of their
public use files. The code below connects to the MEPS data repository
and downloads the data. The downloaded data is parsed to variables that
I’m going to use in the two-part model such as the total expenditures, a
variable for sex, and the weights/sampling units from the complex survey
design. Note: I only include individuals with positive weights. The
labled the dataframe hc2021p.
### LOAD Libraries
library("MEPS")
library("tidyverse")
library("survey")
library("gmodels")
library("boot")
### LOAD DATA
hc2021 = read_MEPS(file = "h233")
names(hc2021) <- tolower(names(hc2021)) ## Convert column headers to lower case
### We'll parse the data to a few key variables
hc2021p = hc2021 %>%
rename(
age = age21x,
totexp = totexp21,
ertexp = ertexp21,
perwt21f = perwt21f,
varstr = varstr,
varpsu = varpsu) %>%
select(
dupersid,
age,
sex,
totexp,
ertexp,
perwt21f,
varstr,
varpsu)
hc2021p$year <- 2021
hc2021p <- filter(hc2021p, perwt21f > 0) # Only select individuals with a positive weight
nrow(hc2021p) # Check number of rows
## [1] 27332
Step 2: Clean data and generate new variables
I next clean the data and create new variables for male
and agecat. I wanted to change the sex
variable where 0 == female and 1 == male. I
also wanted to create a factor variable for age so that I
could see the different effects at various age categories. Most
importantly, I created a binary variable for the outcome to indicate
whether is a zero or nonzero costs, which will be used in the first part
of the the two-part model.
### CLEAN DATA
#### Generate male variable
hc2021p$male[hc2021p$sex == 2] = 0
hc2021p$male[hc2021p$sex == 1] = 1
table(hc2021p$male)
##
## 0 1
## 14441 12891
CrossTable(hc2021p$male, hc2021p$sex)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 27332
##
##
## | hc2021p$sex
## hc2021p$male | 1 | 2 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 0 | 14441 | 14441 |
## | 6811.025 | 6079.975 | |
## | 0.000 | 1.000 | 0.528 |
## | 0.000 | 1.000 | |
## | 0.000 | 0.528 | |
## -------------|-----------|-----------|-----------|
## 1 | 12891 | 0 | 12891 |
## | 7629.975 | 6811.025 | |
## | 1.000 | 0.000 | 0.472 |
## | 1.000 | 0.000 | |
## | 0.472 | 0.000 | |
## -------------|-----------|-----------|-----------|
## Column Total | 12891 | 14441 | 27332 |
## | 0.472 | 0.528 | |
## -------------|-----------|-----------|-----------|
##
##
#### Create a binary nonzero cost variable
hc2021p$nonzero[hc2021p$totexp == 0] = 0
hc2021p$nonzero[hc2021p$totexp > 0] = 1
table(hc2021p$nonzero)
##
## 0 1
## 3763 23569
#### Create 5-levels of age categories
hc2021p$agecat[hc2021p$age>=0 & hc2021p$age<30] = 0
hc2021p$agecat[hc2021p$age>=30 & hc2021p$age<40] = 1
hc2021p$agecat[hc2021p$age>=40 & hc2021p$age<50] = 2
hc2021p$agecat[hc2021p$age>=50 & hc2021p$age<60] = 3
hc2021p$agecat[hc2021p$age>=60 & hc2021p$age<99] = 4
hc2021p$agecat <- as.factor(hc2021p$agecat)
table(hc2021p$agecat)
##
## 0 1 2 3 4
## 8688 3269 3155 3581 8416
Step 3: Contructing the two-part model
I provide the code that I used to construct the two-part model. The
first part of the model is a logit model where the outcome is the binary
nonezero variable to indicate that the individual has
either a zero or non-zero total expenditure. I also added the
male and agecat variables as the predictors,
which I then interacted with each other to create an interaction term
male:agecat.
Using this code, the average difference in total expenditure between male and female is approximately -$228. In other words, males have a lower total expenditure compared to females.
#############################
### Two-part model
#############################
### APPLY survey weight
options(survey.lonely.psu = 'adjust')
mepsdsgn = svydesign(
id = ~varpsu,
strata = ~varstr,
weights = ~perwt21f,
data = hc2021p,
nest = TRUE)
### PART 1 - Logit model
#### Run the logit model and predict the probabilities (include interaction term)
part1 <- svyglm(nonzero ~ male + agecat + male:agecat,
design = mepsdsgn,
family = quasibinomial(link = "logit"))
### PART 2 - Gamma model
#### Subset data with nonzero
mepsdsgn_nonzero <- subset(mepsdsgn, totexp > 0)
#### Run the gamma model and predict the probabilities (include interaction term)
part2 <- svyglm(totexp ~ male + agecat + male:agecat,
design = mepsdsgn_nonzero,
family = Gamma(link = "log"))
#### STORE OUTPUTS for each agecat levels by male/female
group0 <- data.frame(agecat = factor("0",
levels = levels(hc2021p$agecat)), male = 0)
group1 <- data.frame(agecat = factor("0",
levels = levels(hc2021p$agecat)), male = 1)
#### PREDICTIONS
pred_part1_0 <- predict(part1, group0, type = "response")
pred_part1_1 <- predict(part1, group1, type = "response")
pred_part2_0 <- predict(part2, group0, type = "response")
pred_part2_1 <- predict(part2, group1, type = "response")
totexp0 <- pred_part1_0 * pred_part2_0
totexp1 <- pred_part1_1 * pred_part2_1
#### DIFFERENCE in costs between male and female
diff_groups <- totexp1 - totexp0
print(diff_groups)
## response SE
## 1 -227.66 0.0106
#### On average, males spend less than non-males by -$228.
Step 4: Write the Bootstrap Function
The next step involves writing the bootstrap function. The bootstrap
function will run the two-part model and return the average total
expenditures for males and females along with the difference between
them. I named the function twopm.
#################################
#### Bootstrap code
#################################
#### Need to use this to estimate the 95% CIs
twopm <- function(data, indices) {
hc2021p <- data[indices, ]
### SVY DESIGN
### APPLY survey weight
options(survey.lonely.psu = 'adjust')
mepsdsgn = svydesign(
id = ~varpsu,
strata = ~varstr,
weights = ~perwt21f,
data = hc2021p,
nest = TRUE)
### Two-part model
### Part 1 - Logit model
#### Run the logit model and predict the probabilities (include interaction term)
part1 <- svyglm(nonzero ~ male + agecat + male:agecat,
design = mepsdsgn,
family = quasibinomial(link = "logit"))
### Part 2 - Gamma model
#### Subset data with nonzero
mepsdsgn_nonzero <- subset(mepsdsgn, totexp > 0)
#### Run the gamma model and predict the probabilities (include interaction term)
part2 <- svyglm(totexp ~ male + agecat + male:agecat,
design = mepsdsgn_nonzero,
family = Gamma(link = "log"))
#### STORE OUTPUTS
group0 <- data.frame(agecat = factor("0",
levels = levels(hc2021p$agecat)), male = 0)
group1 <- data.frame(agecat = factor("0",
levels = levels(hc2021p$agecat)), male = 1)
#### PREDICTIONS
pred_part1_0 <- predict(part1, group0, type = "response")
pred_part1_1 <- predict(part1, group1, type = "response")
pred_part2_0 <- predict(part2, group0, type = "response")
pred_part2_1 <- predict(part2, group1, type = "response")
totexp0 <- pred_part1_0 * pred_part2_0
totexp1 <- pred_part1_1 * pred_part2_1
#### Summarize
return(c(totexp0 = totexp0,
totexp1 = totexp1,
diff = totexp1 - totexp0))
}
Step 5: Run the bootstrap
The next part involves using the boot() function, which
is from the boot package, to perform the bootstrap
analysis. Make sure to have the boot package loaded as a
library. I used 1000 bootstrap replicates with replacement for this
example.
### BOOTSTRAP
set.seed(12321) # Set seed to reproduce findings
bootstrap_results <- boot(data = hc2021p,
statistic = twopm,
R = 1000)
Step 6: Return the bootstrap results
Once the bootstrap results are saved in the
bootstrap_results object, we can generate the average total
expenditures for males and females, the differences between them, and
their corresponding 95% CIs. We can also plot the distributions for
teach of the output.
The boot.ci() generates the 95% confidence interval.
#### Generate the summary of the average total expenditures and difference
bootstrap_results$t0
## totexp0.1 totexp1.1 diff.1
## 3290.543 3062.885 -227.658
#### Generate the 95% CI of the mean differences
boot.ci(bootstrap_results, index = 1, type = "perc") # Males
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 1)
##
## Intervals :
## Level Percentile
## 95% (2856, 3787 )
## Calculations and Intervals on Original Scale
boot.ci(bootstrap_results, index = 2, type = "perc") # Females
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 2)
##
## Intervals :
## Level Percentile
## 95% (2370, 3979 )
## Calculations and Intervals on Original Scale
boot.ci(bootstrap_results, index = 3, type = "perc") # Diff
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap_results, type = "perc", index = 3)
##
## Intervals :
## Level Percentile
## 95% (-1121.6, 819.7 )
## Calculations and Intervals on Original Scale
#### Plot the distribution of the mean differences
plot(bootstrap_results, index = 1)
plot(bootstrap_results, index = 2)
plot(bootstrap_results, index = 3)
Step 7: Generate the marginal effects at each age category
Next, we can generate the marginal effects at each age category along with their corresponding 95% CIs. This could be useful for you if you want to know whether the differences in total expenditures were significantly different at different age categories.
The print(slopes) command line in the function will
generate the average difference in total expenditures at each age
categories:
t1*= age < 30 yrst2*= age >= 30 & < 40 yrst3*= age >= 40 & < 50 yrst4*= age >= 50 & < 60 yrst5*= age >= 60 yrs
The boot.ci() function will generate the 95% confidence
interval at each index where index = 1 corresponds to
agecate = 1 and so on.
### Estimate the differences in slopes by year
#### Bootstrap code
slopes <- function(data, indices) {
hc2021p <- data[indices, ]
### SVY DESIGN
### APPLY survey weight
options(survey.lonely.psu = 'adjust')
mepsdsgn = svydesign(
id = ~varpsu,
strata = ~varstr,
weights = ~perwt21f,
data = hc2021p,
nest = TRUE)
### Two-part model
### Part 1 - Logit model
#### Inspect the total expenditure (totexp)
summary(hc2021p$totexp)
#### Run the logit model and predict the probabilities (include interaction term)
part1 <- svyglm(nonzero ~ male + agecat + male:agecat,
design = mepsdsgn,
family = quasibinomial(link = "logit"))
### Part 2 - Gamma model
#### Subset data with nonzero
mepsdsgn_nonzero <- subset(mepsdsgn, totexp > 0)
#### Run the gamma model and predict the probabilities (include interaction term)
part2 <- svyglm(totexp ~ male + agecat + male:agecat,
design = mepsdsgn_nonzero,
family = Gamma(link = "log"))
#### STORE OUTPUTS
years <- levels(hc2021p$agecat)
results <- sapply(years, function(yr) {
group0 <- data.frame(agecat = factor(yr, levels = years), male = 0)
group1 <- data.frame(agecat = factor(yr, levels = years), male = 1)
#### PREDICTIONS
pred_part1_0 <- predict(part1, group0, type = "response")
pred_part1_1 <- predict(part1, group1, type = "response")
pred_part2_0 <- predict(part2, group0, type = "response")
pred_part2_1 <- predict(part2, group1, type = "response")
totexp0 <- pred_part1_0 * pred_part2_0
totexp1 <- pred_part1_1 * pred_part2_1
#### Summarize
return(totexp1 - totexp0)
})
return(results)
}
### Estimate the slopes at each age category
boot_fun_slopes <- function(data, indices) {
d <- data[indices, ]
slopes(d)
}
### Perform the boostrap
boot_slopes <- boot(hc2021p,
boot_fun_slopes,
R = 1000)
#### Mean diff in slopes between male and non-male at each years (use t0)
print(boot_slopes)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = hc2021p, statistic = boot_fun_slopes, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -227.6580 3.280211 508.8397
## t2* -3040.0937 13.152142 507.3912
## t3* -3852.4312 45.070767 813.3686
## t4* -680.3140 -2.739276 977.6251
## t5* -172.7028 18.594519 811.8814
#### Confidence interval at each agecat
boot.ci(boot_slopes, index = 1, type = "perc") # agecat = 1
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_slopes, type = "perc", index = 1)
##
## Intervals :
## Level Percentile
## 95% (-1130.5, 864.8 )
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 2, type = "perc") # agecat = 2
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_slopes, type = "perc", index = 2)
##
## Intervals :
## Level Percentile
## 95% (-4036, -2045 )
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 3, type = "perc") # agecat = 3
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_slopes, type = "perc", index = 3)
##
## Intervals :
## Level Percentile
## 95% (-5395, -2181 )
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 4, type = "perc") # agecat = 4
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_slopes, type = "perc", index = 4)
##
## Intervals :
## Level Percentile
## 95% (-2710.8, 1191.2 )
## Calculations and Intervals on Original Scale
boot.ci(boot_slopes, index = 5, type = "perc") # agecat = 5
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_slopes, type = "perc", index = 5)
##
## Intervals :
## Level Percentile
## 95% (-1781.7, 1441.8 )
## Calculations and Intervals on Original Scale
Step 8: Summarize findings
We can summarize our findings into a table. I used Excel to construct the tables that I was interested in from this example.
Table A. Total expenditures between males and females.
Group |
Mean |
95% CI |
|---|---|---|
Males |
$3291 |
$2856, $3787 |
Females |
$3063 |
$2370, $3979 |
Diff b/w Males & Females |
-$228 |
-$1122, $820 |
Table B. Difference in total expenditures between male and females by age categories.
Age Category |
Mean |
95% CI |
|---|---|---|
Age < 30 yr |
-$228 |
-$1131, $865 |
Age >= 30 yr and < 40 yr |
-$3040 |
-$4036, -$2045 |
Age >= 40 yr and < 50 yr |
-$3852 |
-$5359, -$2181 |
Age >= 50 yr and < 60 yr |
-$680 |
-$2711, $1191 |
Age >= 60 yr |
-$173 |
-$1782, $1442 |
Conclusions
The average total expenditures of males and females were $3291 and $3063, respectively. The average difference between them was -$228 (95% CI: -1122, 820). Statistically significant differences between males and females were reported for age categories >=30 and <40 years and >=40 and < 50 years.
The use of the two-part model can incorporate complex survey design methods, but we needed to construct each part separately and then combine them. We then created a function to use in a bootstrap analysis to generate the 95% CI from the two-part model. Lastly, we estimated the marginal effects at each age category by constructing a separate bootstrap function.
Overall, this experience has resulted in a useful code to construct a two-part model using bootstrap methods. I’m pleased with how this turned out, and I hope to work on this in future iterations.
References
A great paper on the boot package (“Resampling Methods
in R: the boot package”) by Angelo J. Canty is available here.
Acknowledgements
I used Claude and ChatGPT to help construct the bootstrap code for this exercise. This was my first time using these tools, and I have to admit that they were extremely helpful. Although there were errors and hallucinations, I was able to take what was provided to construct a bootstrap function that I was happy with. I plan to experiment with these tools in the future given how impressed I’ve been with this experience.
Disclosures and Disclaimers
This is for educational purposes only.
Since this is a work in progress, expect to see updates in the future.