Introduction and Motivating Example
I occasionally facilitate an informal meeting with students to discuss a biomedical paper and its merits. We pick at the details and discuss its impact on policy and practice. Oftentimes, we go on a tangent and explore other complementary ideas and explore higher-level themes associated with healthcare. But, ultimately, our conversation always returns to the paper, and we find ourselves wishing that we could get the data and explore its contents. I dub these meetings our “Literary Cafe” because it occurs in the morning at a local coffee house. (Note: I enjoy my coffee when I read literature.)
In our recent Literary Cafe, we reviewed a paper by Wu and colleagued titled [1], “Patterns and costs associated with glucagon-like peptide 1 receptor agonist use in US adults with type 2 diabetes.” The article was published in the Journal of Managed Care and Specialty Pharmacy (JMCP) and investigated the healthcare costs and resource utilization among US adults with type 2 diabetes who received (or did not receive) a glucagon-like peptide 1 receptor agonist (GLP1-RA).
There has been a lot of papers on GLP1-RAs due to its effectiveness at managing metabolic outcomes, reducing body weight, and cardiovascular benefits. Additionally, there have been a dearth fo studies examining its costs and cost-effectiveness due to the high-cost of these agents.
The paper by Wu and colleagues uses publicly available data from the Agency for Healthcare Research and Quality (AHRQ) Medical Expenditure Panel Survey (MEPS) to compare the healthcare costs and resource utilization between adults with type 2 diabetes with and without a GLP1-RA. I was drawn to this study because we can validate the numbers in the paper by downloading and applying the inclusion/exclusion criteria of the study, which are provided in the Methods section. Assuming that the paper contains enough details, we should be able to replicate the numbers in the paper (or compe reasonably close).
Hence, this article will be about my attempt to replicat the study’s findings by useing the same data the authors use. I’ll divide this into two (maybe three) parts.
Objectives
In part 1, I’ll demonstrate how to get the same number of responders used in the study using MEPS data.
In part 2, I’ll use the statsistical methods in the paper to replicate the findings.
Part 1: Reproduce the sample cohort
In part 1, I’ll demonstrate how we can use the same publicly available data and reproduce the sample used in the paper by Wu and colleagues.
Wu and colleagues provided a flow diagram (attrition chart) that illustrates their step-by-step process for generating their analytic sample. Our objective is to reproduce these number. We’re going to focus on the unweighted numbers instead of the weighted numbers. You’ll notice in the paper that both the unweighted and weighted numbers are presented. Since MEPS is a complex survey design, it’s necessary to apply the appropriate weights to generate the representative US population. However for this example we’re only interested in the unweighted numbers in the flow diagram, so, I only listed the unweighted numbers in the flow diagram below.
Flow diagran recreated from Wu and colleagues [1]
Data Source
We will need to get the MEPS Full Year Consolidated File for 2021 and 2022.
To do that, we will need to install and load the MEPS
package from the MEPS GitHub
site.
While we load the MEPS package, we will also load the
other packages for this exercise.
#### To install the `MEPS` package, you'll need to have the `devtools` package:
library("devtools")
install_github("e-mitchell/meps_r_pkg/MEPS")
## vctrs (0.7.0 -> 0.7.1) [CRAN]
## bit (4.5.0.1 -> 4.6.0) [CRAN]
## hms (1.1.3 -> 1.1.4) [CRAN]
## vroom (1.6.5 -> 1.7.0) [CRAN]
## readr (2.1.5 -> 2.1.6) [CRAN]
## forcats (1.0.0 -> 1.0.1) [CRAN]
## readxl (1.4.4 -> 1.4.5) [CRAN]
## haven (2.5.4 -> 2.5.5) [CRAN]
##
## The downloaded binary packages are in
## /var/folders/c1/pcc3tx0x1cxdbdhs48h7whyr0000gn/T//Rtmp0UZweQ/downloaded_packages
## ── R CMD build ─────────────────────────────────────────────────────────────────
## checking for file ‘/private/var/folders/c1/pcc3tx0x1cxdbdhs48h7whyr0000gn/T/Rtmp0UZweQ/remotes128fb38c02574/e-mitchell-meps_r_pkg-9133161/MEPS/DESCRIPTION’ ... ✔ checking for file ‘/private/var/folders/c1/pcc3tx0x1cxdbdhs48h7whyr0000gn/T/Rtmp0UZweQ/remotes128fb38c02574/e-mitchell-meps_r_pkg-9133161/MEPS/DESCRIPTION’
## ─ preparing ‘MEPS’:
## checking DESCRIPTION meta-information ... ✔ checking DESCRIPTION meta-information
## ─ checking for LF line-endings in source and make files and shell scripts
## ─ checking for empty or unneeded directories
## ─ building ‘MEPS_1.4.0.tar.gz’
##
##
library("MEPS")
#### Load other libraries
library("survey")
library("foreign")
library("tidyverse")
library("psych")
library("haven")
library("gmodels")
Clean and Prepare Data
Once we’ve loaded the MEPS package, we can start
cleaning the data from MEPS.
The next code chunk downloads the 2021 Full Year Consolidated Data File from MEPS and performs some data cleaning functions. In this process, we will also download the Medical Conditions File and the Prescribed Medicines File.
At the end of this code chunk, we will have the following dataframe:
hc2021xcond2021rx2021
###################### 2021 data #####################
#### Load data from AHRQ MEPS website
hc2021 <- read_MEPS(file = "h233") # Full-year consolidated file
cond2021 <- read_MEPS(file = "h231") # Medical conditions file
rx2021 <- read_MEPS(file = "h229a") # Prescribed medicines file
## Change column names to lowercase
names(hc2021) <- tolower(names(hc2021))
names(cond2021) <- tolower(names(cond2021))
names(rx2021) <- tolower(names(rx2021))
#### Keep only those variables of interest
hc2021x <- hc2021 %>%
select(dupersid, totexp21, totslf21, totprv21, totmcr21, totmcd21, totva21, rxexp21, obvexp21,optexp21, ertexp21, iptexp21, hhaexp21, hhnexp21, opsexp21, perwt21f, varpsu, varstr, age21x, sex, racethx, panel)
# Rename variables
hc2021x <- hc2021x %>%
rename(totexp = totexp21,
slf = totslf21,
prv = totprv21,
mcr = totmcr21,
mcd = totmcd21,
va = totva21,
rxexp = rxexp21,
obvexp = obvexp21,
optexp = optexp21,
ertexp = ertexp21,
iptexp = iptexp21,
hhaexp = hhaexp21,
hhnexp = hhnexp21,
opsexp = opsexp21,
perwtf = perwt21f,
age = age21x)
# Add year
hc2021x$year <- c(2021)
Identify the responders with diabetes
Once the data cleaning is completed, we will need to use the Medical Conditions File to identify responders who have a diagnosis of type 2 diabetes.
We’ll take advantage of the Clinical Classification Software Refined (CCSR), which contains groupings of ICD-10 codes for diabetes-related categories. These include:
END002denotes diabetes mellitus without complicationEND003denotes diabetes mellitus with complicationEND005denotes diabetes mellitus, Type 2
## Diabetes filter
pattern <- c("END002", "END003", "END005") # Combine the CCSR codes
combined_pattern <- paste(pattern, collapse = "|") # adding an OR operator
In the Medical Conditions File, there are up to three places where
CCSR codes are listed: ccsr1x, ccsr2x, and
ccsr3x. We want to query each of these to see if any of the
diabetes-related codes are present.
diabetes2021 <- cond2021 %>%
unite("all_CCSR", ccsr1x:ccsr3x, remove = FALSE) %>%
filter(grepl(combined_pattern, all_CCSR))
Once we locate the diabetes-related CCSR codes, we need to create a
new indicate variable for diabetes called diabetes in the
dataframe. We will code this as 1 = diabetes and
0 = no diabetes.
diabetes2021$diabetes <- 1 # Add an indicator for diabetes
Then, we will need to ensure that each row is a unique responder (we want to eliminate duplicates).
diabetes2021_unique <- diabetes2021 %>%
distinct(dupersid, perwt21f, varpsu, varstr, .keep_all = TRUE) # Keep unique subjects
diabetes2021_sub <- diabetes2021_unique %>%
select(dupersid, diabetes, perwt21f, varpsu, varstr) # Select only necessary variables
diabetes2021_sub <- diabetes2021_sub %>%
rename(perwtf = perwt21f) # Rename variables
diabetes2021_sub <- zap_labels(diabetes2021_sub) # Remove any labels
Next, we want to merge the hc2021x dataframe with the
diabetes2021_sub dataframe. The hc2021x
dataframe contains the demographic and outcome data, and the
diabetes2021_sub dataframe contains information on the
diabetes == 1 indicator.
We will use the left_join() function to do the merge,
and we will merge using the dupersid, perwtf,
varpsu, and varstr.
# Merge data - LEFT JOIN
hc2021x_t2dm <- left_join(hc2021x, diabetes2021_sub, by = c("dupersid", "perwtf", "varpsu", "varstr"))
This complete the data cleaning for 2021. Now, we repeat the process for 2022.
Repeat Data Cleaning and Preparation for 2022 data
##################### 2022 data #####################
#### Load data from AHRQ MEPS website
## 2022
hc2022 <- read_MEPS(file = "h243") # Full-year consolidated file
cond2022 <- read_MEPS(file = "h241") # Medical Conditions File
rx2022 <- read_MEPS(file = "h239a") # Prescribed Medicines File
## Change column names to lowercase
names(hc2022) <- tolower(names(hc2022))
names(cond2022) <- tolower(names(cond2022))
names(rx2022) <- tolower(names(rx2022))
# Keep only the variables of interest
hc2022x <- hc2022 %>%
select(dupersid, totexp22, totslf22, totprv22, totmcr22, totmcd22, totva22, rxexp22, obvexp22, optexp22, ertexp22, iptexp22, hhaexp22, hhnexp22, opsexp22, perwt22f, varpsu, varstr, age22x, sex, racethx, panel)
# Rename variables
hc2022x <- hc2022x %>%
rename(totexp = totexp22,
slf = totslf22,
prv = totprv22,
mcr = totmcr22,
mcd = totmcd22,
va = totva22,
rxexp = rxexp22,
obvexp = obvexp22,
optexp = optexp22,
ertexp = ertexp22,
iptexp = iptexp22,
hhaexp = hhaexp22,
hhnexp = hhnexp22,
opsexp = opsexp22,
perwtf = perwt22f,
age = age22x)
# Add year
hc2022x$year <- c(2022)
## Diabetes filter
pattern <- c("END002", "END003", "END005") # Combine the CCSR codes
combined_pattern <- paste(pattern, collapse = "|") # adding an OR operator
diabetes2022 <- cond2022 %>%
unite("all_CCSR", ccsr1x:ccsr3x, remove = FALSE) %>%
filter(grepl(combined_pattern, all_CCSR))
diabetes2022$diabetes <- 1 # Add an indicator for diabetes
diabetes2022_unique <- diabetes2022 %>%
distinct(dupersid, perwt22f, varpsu, varstr, .keep_all = TRUE) # Keep unique subjects
diabetes2022_sub <- diabetes2022_unique %>%
select(dupersid, diabetes, perwt22f, varpsu, varstr) # Select only necessary variables
diabetes2022_sub <- diabetes2022_sub %>%
rename(perwtf = perwt22f) # Rename variables
diabetes2022_sub <- zap_labels(diabetes2022_sub) # Remove any labels
# Merge data - LEFT JOIN
hc2022x_t2dm <- left_join(hc2022x, diabetes2022_sub, by = c("dupersid", "perwtf", "varpsu", "varstr"))
Append data
Once we have cleaned and prepared the data for 2021 and 2022, we need to append the data together.
First, we’ll need to remove the labels in the dataframe. Sometimes,
the labels may create errors when you append two dataframes together. We
will use the zap_labels() function from the
haven package.
Second, we will use the rbind() package to append the
2021 and 2022 dataframes.
############################ APPEND ############################
## Remove label due to errors
hc2021x_t2dm <- zap_labels(hc2021x_t2dm)
hc2022x_t2dm <- zap_labels(hc2022x_t2dm)
## Append 2021 and 2022 data
hc <- rbind(hc2021x_t2dm, hc2022x_t2dm)
count(hc)
## # A tibble: 1 × 1
## n
## <int>
## 1 50767
Filter sample
Wu and colleagues perform some filtering operation to focus on
respsonders with a positive person weight (perwtf), are in
panels 23 or 25 if they are from the 2021 data, are in panels 24, 26, or
27 if they are from the 2022 data, adults (>=18 years), and has
diabetes.
After we restrict the sample to those with positive
perwtf, then the sample size is 49,079, which is the same
number in the first box of the flow diagram (or attrition chart).
############################ FILTER ############################
#### Keep if weight is positive
hc_positive <- subset(hc, perwtf > 0)
count(hc_positive)
## # A tibble: 1 × 1
## n
## <int>
## 1 49079
Then, we limit the sample to those who are in the panel 23 or 25 for 2021 and panels 24, 26, or 27 for 2022. This will yield a sample size of 34,411 responders, which is the value in the second box of the flow diagra (or attrition chart).
######################## SUBSET - FILTER ########################
#### Subset panels for 2021 and 2022 (Independence)
hc_panel <- hc_positive %>%
filter(((panel == 23 | panel == 25) & year == 2021) |
((panel == 24 | panel == 26 | panel == 27) & year == 2022))
count(hc_panel)
## # A tibble: 1 × 1
## n
## <int>
## 1 34411
Next, we restrict the sample to responders who are 18 years and older. This will yield a sample size of 27,460 responders, which is the third box in the flow diagram (or attrition chart).
## Keep if age >= 18 years
hc_age <- subset(hc_panel, age >= 18)
count(hc_age)
## # A tibble: 1 × 1
## n
## <int>
## 1 27460
Then, we will look at the subset of the sammple with diabetes. This
should yield a sample size of 3587 responders, which is the fourth box
of the flow diagram (or attrition chart). We will end up with the
hc_diabetes dataframe.
## Keep if diabetes == 1
hc_diabetes <- subset(hc_age, diabetes == 1)
count(hc_diabetes)
## # A tibble: 1 × 1
## n
## <int>
## 1 3587
Identify Responders with GLP1-RA prescription
In this part of the article, we will merge the Prescribed Medicines
File with the hc_diabetes dataframe.
First, we identify the responder on a GLP1-RA. We will take advantage
of the Multum Lexicon Variables format to identify the incretim mimetic,
which includes GLP1-RAs. To do this, we use the
tc1s1_1 == 373 variable. To see the list of other types of
categories the tc1s1_1 variable contains, we will need to
view the data dictionary (link).
We do this for 2021 and 2022 and names these dataframes as
rx2021_glp and rx2022_glp.
######################## DRUGS FILTER ########################
## Identify subjects on GLP1-RA in 2021
rx2021_glp <- rx2021 %>%
filter(tc1s1_1 == 373) %>%
select(dupersid, perwt21f, varpsu, varstr, tc1s1_1) %>%
distinct(dupersid, perwt21f, varpsu, varstr, tc1s1_1) %>%
rename(perwtf = perwt21f)
## Identify subjects on GLP1-RA in 2022
rx2022_glp <- rx2022 %>%
filter(tc1s1_1 == 373) %>%
select(dupersid, perwt22f, varpsu, varstr, tc1s1_1) %>%
distinct(dupersid, perwt22f, varpsu, varstr, tc1s1_1) %>%
rename(perwtf = perwt22f)
Second, we need to combine the rx2021_glp and
rx2022_glp dataframes together.
## Append GLP1-RA data
rx_comb <- rbind(rx2021_glp, rx2022_glp)
Third, we create an indicator for GLP1-RA called
glp == 1.
## Add indicator for GLP1-RA
rx_comb$glp <- 1
Fourth, we filter to only the unique responders by removing the
duplicates. We call this new dataframe rx_unique. .
## Filter uniques
rx_unique <- rx_comb %>%
distinct(dupersid, perwtf, varpsu, varstr, glp)
Fifth, we merge rx_unique with the
hc_diabetes dataframe using the dupersid,
perwtf, varpsu, and varstr
variables.
## Merge to the main dataframe
data_merge <- left_join(hc_diabetes, rx_unique, by = c("dupersid", "perwtf", "varpsu", "varstr"))
Last, we will finalize the indicator variable for GLP1-RA. This will yield a sample size of 637 for GLP1-RA and 2950 for no-GLP1-RA.
## GLP1-RA indicator variable
table(data_merge$glp)
##
## 1
## 637
data_merge$glp[is.na(data_merge$glp)] <- 0
table(data_merge$glp)
##
## 0 1
## 2950 637
Conclusions
By following the inclusion and exclusion criteria from the paper by Wu and colleagues, we can create the sample that they used for their study. In the next article, we will attempt to replicate the findings from the study.
References
- Wu J, Perez A, Sullivan PW. Patterns and costs associated with glucagon-like peptide-1 receptor agonist use in US adults with type 2 diabetes. J Manag Care Spec Pharm. 2025 Oct;31(10):1029-1038. doi: 10.18553/jmcp.2025.31.10.1029. PMID: 41004212; PMCID: PMC12467763.
Disclosures and Disclaimers
This is a work in progress; thus, I may update this in the future.
Any errors or mistakes are my fault, and I’ll try my best to fix them.
Lastly, this is for educational purposes only.