According to the Centers for Medicare & Medicaid Services, healthcare spending in the United States totals over 15% of GDP. Given this scale and the lack of price sensitivity among consumers, any impact we can have on healthcare spending, whether by maintaining the health of our population or by removing waste from our system, will pay magnified dividends.
It is broadly believed that having a Primary Care Provider (PCP) is a great way to maintain health and lower healthcare costs for an individual. In addition to potentially providing better health outcomes by way of preventative care, it is generally accepted that by becoming more familiar with the healthcare landscape, those with Primary Care Providers are more likely to choose lower cost options when seeking care: while others might go to the Emergency Room out of panic when a healthcare need arises, those more familiar with their options might choose to visit an Urgent Care or even their PCP, ultimately contributing to a decrease in healthcare costs.
I wanted to assess these claims, and explore whether having a dedicated (non-Emergency) “Usual Place of Care” is truly correlated with decreased healthcare costs on an individual basis over a one-year span.
The Medical Expenditure Panel Survey (MEPS) captures individual-level detail on a variety of healthcare metrics. It is administered by the Agency for Healthcare Research and Quality, and findings have been released annually since 1996. The sample in question is from 2017.
This survey is sent to a representative sample of the American non-institutionalized population and is composed of general demographic questions, questions related to healthcare habits and spending, and other related topics. Fields such as healthcare spending are recorded at the person/event level and aggregated to paint a full picture of healthcare spending over the calendar year.
The 2017 data set has 31,880 observations. Each observation reflects one individual. While there is family level data, each family member is treated as a unique observation with some fields shared with their family unit.
The 2017 dataset contains 1,561 variables. Of these, we are primarily concerned with two: total healthcare charges in 2017 (TOTTCH17) and the variable indicating whether the respondent has a Usual Source of Care (HAVEUS42). Looking over another variable, “Location of Usual Source of Care” (LOCATN42), we saw that “Emergency Room” qualified as a Usual Source of Care, and decided to leverage this variable as well. As the project developed, we also incorporated age (AGE17X), income level by way of “poverty category” (POVCAT17), private insurance status (PRIDE17), region (REGION53) and the respondent’s perception of their physical and mental health (RTHLTH53 and MNHLTH53).
This is an observational study. As per MEPS.gov: “The set of households selected for each panel of the MEPS HC is a sub-sample of households participating in the previous year’s National Health Interview Survey (NHIS) conducted by the National Center for Health Statistics. The NHIS sampling frame provides a nationally representative sample of the U.S. civilian noninstitutionalized population and reflects an oversample of Blacks and Hispanics.”
Due to the representative nature of the sample and its size, it is very likely to satisfy the conditions of inference, even after several rounds of filtering.
Initially, the population of interest was all American individuals. Due to the size and design of the sample, we can say that our findings can be generalized to the civilian non-institutionalized population. That being said, there will be some inherent bias: as the respondents were aware that they were being sampled, they may have been more aware of their health and healthcare spending during the year, skewing the results.
As this is an observational study and not an experiment, we cannot establish a cause/effect relationship for any of our variables.
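# load packages used throughout (lodown for MEPS access, dplyr and ggplot2 for analysis)
library(lodown)
library(dplyr)
library(ggplot2)
# load data
meps_cat <-
  get_catalog( "meps" ,
    output_dir = file.path( path.expand( "~" ) , "MEPS" ) )
meps_cons_df <-
  readRDS( file.path( path.expand( "~" ) , "MEPS" ,
    "2017/full year consolidated.rds" ) )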
After loading our data, let’s begin by analyzing how healthcare spending and Usual Source of Care are correlated. We will be using four variables. Here are their descriptions from the MEPS Codebook:
ACCELI42: Binary variable indicating those who are (1) and are not (-1) eligible for the access to care survey.
HAVEUS42: “Usual Source of Care” of survey respondent: -9 NOT ASCERTAINED; -8 DK; -1 INAPPLICABLE; 1 YES; 2 NO
TOTTCH17: Total healthcare charges, excluding pharmacy claims. It is worth clarifying that this is the amount charged, and not necessarily the amount paid by the respondent. As per the Data File Documentation: “This variable represents the sum of all fully established charges for care received and usually does not reflect actual payments made for services, which can be substantially lower due to factors such as negotiated discounts, bad debt, and free care … The total charge variable across services (TOTTCH17) excludes prescribed medicines.”
LOCATN42: This variable indicates in what location the Usual Source of Care operates: 1 OFFICE; 2 HOSPITAL, NOT ER; 3 HOSPITAL, ER. As the focus of our study is to see if primary care familiarity leads to a decrease in spending during the year, I elected to consider those whose Usual Source of Care was the Emergency Room as not having a Usual Source of Care at all.
# build df
Project_df_0 <- data.frame(meps_cons_df$acceli42, meps_cons_df$haveus42, meps_cons_df$tottch17,meps_cons_df$locatn42)
# remove respondents who were deemed ineligible for questions about Usual Source of Care.
Project_df_0 <- Project_df_0[ Project_df_0$meps_cons_df.acceli42 != -1, ]
# remove respondents who "did not know" if they had a Usual Source of Care
Project_df_0 <- Project_df_0[ Project_df_0$meps_cons_df.haveus42 > 0, ]
# Change Emergency Room go-ers to not having a Usual Source of Care
Project_df_0$meps_cons_df.haveus42 <-ifelse(Project_df_0$meps_cons_df.locatn42 > 2,2,Project_df_0$meps_cons_df.haveus42)
Project_df <- data.frame(Project_df_0$meps_cons_df.haveus42, Project_df_0$meps_cons_df.tottch17)
# rename columns
colnames(Project_df) <- c("Usual_Source_of_Care", "Total_Healthcare_Charges")
# rename numerical variables regarding care location.
Project_df$Usual_Source_of_Care <- factor(Project_df$Usual_Source_of_Care,
levels=c(1,2),
labels=c("Yes","No"))
At this point, we need to ensure that this sample satisfies the conditions of inference. We know that these observations are independent, and that they are a random sample. The sample is also clearly large enough.
We should investigate whether healthcare spending is normally distributed:
ggplot(Project_df, aes( x=Total_Healthcare_Charges)) +
geom_histogram(bins=30) +
scale_x_continuous(breaks=c(100000,1000000)) +
scale_y_continuous(breaks=c(1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Charges',
x = element_blank(),
y = element_blank(),
subtitle = 'Histogram')
ggplot(Project_df, aes( x=Total_Healthcare_Charges)) +
geom_histogram(bins = 40) +
scale_x_continuous(trans='log1p', breaks=c(500, 5000, 50000)) +
scale_y_continuous(trans='log1p', breaks=c(500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Charges',
x = element_blank(),
y = element_blank(),
subtitle = 'Histogram, Log(x+1) transformation on both axes')
It appears healthcare spending is nearly log-normal. It is interesting, but not unexpected, that a large number of respondents incurred “$0” in healthcare charges in 2017, producing a spike at zero (though not a majority, since the median charge is positive). The nearly invisible tail in the untransformed histogram implies that there are a relatively high number of very high spending individuals. While these might be considered outliers in other circumstances, they clearly represent a trend in spending and so do not qualify as outliers in the statistical sense. For the purposes of this analysis, we can continue.
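The two observations above (a spike at zero and a roughly log-normal shape among positive charges) can also be checked numerically. The following is only a sketch on simulated data, not the MEPS sample: `describe_charges` is a hypothetical helper, and `charges` is an invented stand-in for `Project_df$Total_Healthcare_Charges`.

```r
# Hypothetical helper: summarise how a vector of charges is distributed.
describe_charges <- function(charges) {
  positive <- charges[charges > 0]
  log_pos  <- log(positive)
  list(
    share_zero   = mean(charges == 0),  # fraction of people with no charges
    median       = median(charges),     # positive median => fewer than half are zero
    # sample skewness of log(charges); close to 0 if positive charges are log-normal
    log_skewness = mean((log_pos - mean(log_pos))^3) / sd(log_pos)^3
  )
}

# Simulated stand-in data (assumption): 15% zeros, the rest log-normal.
set.seed(1)
charges <- c(rep(0, 150), rlnorm(850, meanlog = 7, sdlog = 2))

result <- describe_charges(charges)
print(result)
```

On the real data, the same helper could be applied to the charges column directly; a log-skewness far from zero would suggest the log-normal description is only approximate.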
Moving on to the question at hand: Does having a non-ER Usual Source of Care correlate with decreased healthcare spending over a one year period?
ggplot(Project_df, aes( y=Total_Healthcare_Charges, fill=Usual_Source_of_Care)) +
geom_boxplot() +
facet_wrap(~Usual_Source_of_Care, nrow = 1) +
scale_x_continuous(trans='log1p', breaks=c(0, 500, 5000, 50000)) +
scale_y_continuous(trans='log1p', breaks=c(0, 500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Costs',
x = element_blank(),
subtitle = 'by Usual Source of Care (Y/N)')
ggplot(Project_df, aes( x=Total_Healthcare_Charges, fill=Usual_Source_of_Care)) +
geom_histogram(bins = 40) +
facet_wrap(~Usual_Source_of_Care) +
scale_x_continuous(trans='log1p', breaks=c(500, 5000, 50000)) +
scale_y_continuous(trans='log1p', breaks=c(500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Costs',
x = element_blank(),
subtitle = 'by Usual Source of Care (Y/N)')
Project_df %>%
group_by(Usual_Source_of_Care) %>%
summarise(Mean = mean(Total_Healthcare_Charges), Median = median(Total_Healthcare_Charges), Count = n(), )
## # A tibble: 2 x 4
## Usual_Source_of_Care Mean Median Count
## <fct> <dbl> <dbl> <int>
## 1 Yes 13200. 1638 23936
## 2 No 5160. 187 6547
As we can see above, our initial findings run counter to my hypothesis: those who had a non-ER Usual Source of Care incurred more healthcare costs in 2017.
Thinking through the above analysis, one grim truth of the American healthcare system comes into focus: access to care is not universal. We need to be careful not to use healthcare costs as a proxy for health. Over a longer time frame, it is possible that those who are ignoring their health today may incur higher healthcare costs in the future. That being said, as we are only looking at one year’s worth of data, we are encountering an additional source of bias: anyone who has a chronic disease or otherwise needs frequent medical attention will probably already have a primary care provider (or Usual Source of Care) to help manage these conditions. Meanwhile, those who could benefit from preventative care but are ignoring these needs are not incurring any healthcare costs. In this way, we are suffering from selection bias.
While we cannot review a larger time frame (previous years of this survey do not share a common identifier), we can further subset the population in a few different ways and see if for any subset of the population Usual Source of Care is correlated with decreased healthcare costs.
Additional variables:
AGE17X: Age of respondent, top-coded at 85 years.
RTHLTH53: General health as reported by the respondent: -8 DK; -7 REFUSED; -1 INAPPLICABLE; 1 EXCELLENT; 2 VERY GOOD; 3 GOOD; 4 FAIR; 5 POOR
MNHLTH53: Mental health as reported by the respondent: -9 NOT ASCERTAINED; -8 DK; -7 REFUSED; -1 INAPPLICABLE; 1 EXCELLENT; 2 VERY GOOD; 3 GOOD; 4 FAIR; 5 POOR
POVCAT17: Poverty level of the family to which the respondent belongs. This value is shared across the family, but costs are unique to the individual: 1 POOR/NEGATIVE; 2 NEAR POOR; 3 LOW INCOME; 4 MIDDLE INCOME; 5 HIGH INCOME. As per the codebook, these translate to the following percentages relative to the poverty line: “negative or poor (less than 100%), near poor (100% to less than 125%), low income (125% to less than 200%), middle income (200% to less than 400%), and high income (greater than or equal to 400%)”
REGION53: -1 INAPPLICABLE; 1 NORTHEAST; 2 MIDWEST; 3 SOUTH; 4 WEST
PRIDE17: Indicates whether the respondent had private insurance in December 2017. Private insurance includes exchange plans, but does not include Medicare or Medicaid. -1 INAPPLICABLE; 1 YES; 2 NO.
#Import data
Project_df_1 <- data.frame(meps_cons_df$acceli42, meps_cons_df$haveus42, meps_cons_df$tottch17, meps_cons_df$age17x, meps_cons_df$rthlth53, meps_cons_df$mnhlth53, meps_cons_df$povcat17, meps_cons_df$region53,meps_cons_df$pride17, meps_cons_df$locatn42 )
# Assign ER go-ers as not having USC
Project_df_1$meps_cons_df.haveus42 <-ifelse(Project_df_1$meps_cons_df.locatn42 > 2,2,Project_df_1$meps_cons_df.haveus42)
#Rename Cols
names(Project_df_1) <- c("Survey", "Usual_Source_of_Care","Total_Charges", "Age", "General_Health","Mental_Health","Poverty_Category","Region","Insurance","Location")
#Filter
Project_df_1 <- filter(Project_df_1, Survey != -1, Usual_Source_of_Care> 0, General_Health > 0, Mental_Health > 0, Region >0)
Project_df_1 <- Project_df_1[,-1]
dim(Project_df_1)
## [1] 30349 9
With over 30,000 observations, this continues to satisfy the conditions of inference. We can use linear regression to assess the predictive power of each of these variables.
Full_lm <- lm(Total_Charges ~ Usual_Source_of_Care + Age + General_Health + Mental_Health + Region + Insurance, data = Project_df_1)
summary(Full_lm)
##
## Call:
## lm(formula = Total_Charges ~ Usual_Source_of_Care + Age + General_Health +
## Mental_Health + Region + Insurance, data = Project_df_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44440 -13260 -4989 2931 2646998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5503.7 1562.7 -3.522 0.000429 ***
## Usual_Source_of_Care -6646.0 701.9 -9.469 < 2e-16 ***
## Age 231.4 13.2 17.535 < 2e-16 ***
## General_Health 8200.8 375.9 21.818 < 2e-16 ***
## Mental_Health -1292.5 380.9 -3.393 0.000692 ***
## Region -946.0 286.5 -3.302 0.000960 ***
## Insurance 2028.2 581.2 3.489 0.000485 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50020 on 30342 degrees of freedom
## Multiple R-squared: 0.05013, Adjusted R-squared: 0.04994
## F-statistic: 266.9 on 6 and 30342 DF, p-value: < 2.2e-16
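While the model is not particularly predictive, with an \(R^2\) of 0.05, we find that Usual Source of Care, Age and General Health have the largest t-statistics (in absolute value) of the predictors. Note that Usual_Source_of_Care is still coded numerically at this point (1 = yes, 2 = no), so its negative coefficient means that, holding the other variables fixed, not having a Usual Source of Care is associated with roughly $6,646 less in charges. With this in mind, it makes sense to subset the data by age and general health to see the role that Usual Source of Care plays within these bands.
#Rename binary variables.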
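The model above enters category codes such as region and insurance status as if they were numeric quantities. One alternative, which is not what the analysis here does, is to declare them as factors and to model `log1p` of charges given the heavy right skew. The sketch below uses simulated data only: the column names mirror `Project_df_1`, but the data frame `toy`, its values and the object `factor_lm` are invented for illustration.

```r
# Simulated stand-in for Project_df_1 (assumption: values are invented).
set.seed(42)
n <- 500
toy <- data.frame(
  Usual_Source_of_Care = sample(c(1, 2), n, replace = TRUE),
  Age                  = sample(18:85, n, replace = TRUE),
  General_Health       = sample(1:5, n, replace = TRUE),
  Region               = sample(1:4, n, replace = TRUE)
)
toy$Total_Charges <- round(rlnorm(n, meanlog = 6 + 0.02 * toy$Age, sdlog = 1.5))

# Declare the category codes as factors so lm() estimates one effect per level.
toy$Usual_Source_of_Care <- factor(toy$Usual_Source_of_Care,
                                   levels = c(1, 2), labels = c("Yes", "No"))
toy$Region <- factor(toy$Region, levels = 1:4,
                     labels = c("Northeast", "Midwest", "South", "West"))

# log1p() keeps zero-charge respondents in the model while damping the long tail.
factor_lm <- lm(log1p(Total_Charges) ~ Usual_Source_of_Care + Age +
                  General_Health + Region, data = toy)
coef_names <- names(coef(factor_lm))
print(coef_names)
```

With factors, each non-reference level gets its own coefficient (for example `RegionSouth`), so the fit no longer assumes that moving from region 1 to region 4 has a steady linear effect.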
Project_df_1$Usual_Source_of_Care <- factor(Project_df_1$Usual_Source_of_Care,
levels=c(1,2),
labels=c("Yes", "No"))
Project_df_1$Insurance<- factor(Project_df_1$Insurance,
levels=c(-1,1,2),
labels=c("NA", "Insured", "Uninsured"))
Let’s break up our sample into a few subsets. One of the more popular generational demographics is the Millennials. In 2017, this group was between 21 and 35 years old. We will also look at those in their 30s, those aged 30 to 49, and those 55 and older.
For each of these categories, we will further subset the population by how they rated their general physical health.
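The code that builds these subsets is not shown in the text. A minimal sketch of how it might look is given here, assuming the age bands described above; the data frame names match those used in the plots that follow, but the exact filter boundaries are assumptions, and `toy_df` is an invented stand-in for `Project_df_1`.

```r
# Stand-in for Project_df_1 (assumption): only the Age column matters here.
toy_df <- data.frame(Age = c(19, 21, 30, 35, 39, 44, 49, 55, 70, 85))

# Assumed age bands; boundaries are inclusive.
Project_df_1_M    <- subset(toy_df, Age >= 21 & Age <= 35)  # Millennials in 2017
Project_df_1_30   <- subset(toy_df, Age >= 30 & Age <= 39)  # thirties
Project_df_1_3040 <- subset(toy_df, Age >= 30 & Age <= 49)  # thirties and forties
Project_df_1_55   <- subset(toy_df, Age >= 55)              # 55 and older

# Row counts per subset, analogous to calling dim() on each one.
subset_sizes <- sapply(
  list(M = Project_df_1_M, `30s` = Project_df_1_30,
       `30-49` = Project_df_1_3040, `55+` = Project_df_1_55),
  nrow
)
print(subset_sizes)
```

Note that the bands overlap (a 32-year-old falls into three of them), so the subsets are alternative views of the sample rather than a partition of it.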
## [1] 30349 9
## [1] 5597 9
## [1] 3904 9
## [1] 7681 9
## [1] 8001 9
As each subset is sufficiently large, we can continue.
ggplot(Project_df_1_M, aes( y=Total_Charges, fill=Usual_Source_of_Care)) +
geom_boxplot() +
facet_wrap(~General_Health ~ Usual_Source_of_Care, nrow=1) +
scale_x_continuous(trans='log1p', breaks= NULL) +
scale_y_continuous(trans='log1p', breaks=c(0, 500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Costs, ages 21 to 35',
x = element_blank(),
subtitle = 'by Usual Source of Care (Y/N) and General Health (1 Healthiest - 5)')
ggplot(Project_df_1_30, aes( y=Total_Charges, fill=Usual_Source_of_Care)) +
geom_boxplot() +
facet_wrap(~General_Health ~ Usual_Source_of_Care, nrow=1) +
scale_x_continuous(trans='log1p', breaks= NULL) +
scale_y_continuous(trans='log1p', breaks=c(0, 500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Costs, ages 30 to 39',
x = element_blank(),
subtitle = 'by Usual Source of Care (Y/N) and General Health (1 Healthiest - 5)')
ggplot(Project_df_1_3040, aes( y=Total_Charges, fill=Usual_Source_of_Care)) +
geom_boxplot() +
facet_wrap(~General_Health ~ Usual_Source_of_Care, nrow=1) +
scale_x_continuous(trans='log1p', breaks= NULL) +
scale_y_continuous(trans='log1p', breaks=c(0, 500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Costs, ages 30 to 49',
x = element_blank(),
subtitle = 'by Usual Source of Care (Y/N) and General Health (1 Healthiest - 5)')
ggplot(Project_df_1_55, aes( y=Total_Charges, fill=Usual_Source_of_Care)) +
geom_boxplot() +
facet_wrap(~General_Health ~Usual_Source_of_Care, nrow=1) +
scale_x_continuous(trans='log1p', breaks= NULL) +
scale_y_continuous(trans='log1p', breaks=c(0, 500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Costs, ages 55+',
x = element_blank(),
subtitle = 'by Usual Source of Care (Y/N) and General Health (1 Healthiest - 5)')
In each of the above boxplots, as we move from left to right, we go from healthier groups to less healthy ones. As expected, healthcare spending increases as the population rates itself in increasingly poor health. What is noteworthy is how the subset without a Usual Source of Care remains considerably more static than those with one. In any case, it appears that those without a Usual Source of Care incur lower costs than those with one.
For our next analysis, we will remove the age band and divide the population by those who have private insurance and those who do not:
filter(Project_df_1, Insurance != "NA") %>%
ggplot(aes( y=Total_Charges, fill=Usual_Source_of_Care)) +
geom_boxplot() +
facet_wrap(~Insurance ~ Usual_Source_of_Care, nrow=1) +
scale_x_continuous(trans='log1p', breaks= NULL) +
scale_y_continuous(trans='log1p', breaks=c(0, 500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Costs',
x = element_blank(),
subtitle = 'by Usual Care Location and Private Insurance Status')
Here we find that among those who have a Usual Source of Care, healthcare costs are very similar regardless of insurance status. For those who do not have a Usual Source of Care, we see considerably higher costs from the group with insurance.
As a final exercise, let’s explore the impact of poverty and income level on healthcare costs.
ggplot(Project_df_1, aes( y=Total_Charges, fill=Usual_Source_of_Care)) +
geom_boxplot() +
facet_wrap(~Poverty_Category ~ Usual_Source_of_Care, nrow=1) +
scale_x_continuous(trans='log1p', breaks = NULL) +
scale_y_continuous(trans='log1p', breaks=c(0, 500,1000, 2000, 4000)) +
theme_bw() +
labs(title = 'Total Healthcare Costs',
x = element_blank(),
subtitle = 'by Usual Care Location and Family Income Category')
In the above boxplot, we move from a Poverty Category of 1 (family income lower than 100% of the poverty line) to 5 (400% of the poverty line or more). In all categories, those with a Usual Source of Care incur greater costs than those without.
It seems that, regardless of how we subset our population, we consistently find that those with a Usual Source of Care incur greater healthcare costs than those without. Let’s assess this difference in costs between these two populations and determine if it is statistically significant.
Independence: We know that this sample is random and representative of the broader population. Furthermore, each observation is independent from the others (family relationships aside). This condition is satisfied.
Normality: As we saw in the previous section, this sample is sufficiently large, and while there were a number of respondents with significantly higher than average healthcare costs, no single individual was an “outlier” in the statistical sense. This applies to both the entire survey, and the survey subset by Usual Source of Care status. This condition is satisfied.
While the data seems to indicate that the mean spending by those who have a Usual Source of Care is greater than those who do not, we want to determine if this difference could be attributed to sample variation.
Null Hypothesis:
\[ \mu_{uscT} - \mu_{uscF} = 0 \]
Alternative Hypothesis:
\[ \mu_{uscT} - \mu_{uscF} \neq 0 \]
Project_df %>%
group_by(Usual_Source_of_Care) %>%
summarise(Mean = mean(Total_Healthcare_Charges), Median = median(Total_Healthcare_Charges), Count = n(), SD = sd(Total_Healthcare_Charges) )
## # A tibble: 2 x 5
## Usual_Source_of_Care Mean Median Count SD
## <fct> <dbl> <dbl> <int> <dbl>
## 1 Yes 13200. 1638 23936 56652.
## 2 No 5160. 187 6547 32300.
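We can start with our point estimate, the difference in sample means: \[\bar{x}_{uscT} - \bar{x}_{uscF}\]
Standard error is calculated by \[\sqrt{\frac{\sigma_t^2}{n_t}+\frac{\sigma_f^2}{n_f}}\] where \(\sigma_t\) and \(\sigma_f\) are the group standard deviations (estimated from the sample) and \(n_t\) and \(n_f\) are the group sizes.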
Finally, regarding T-Distributions and Z-scores, as we have 30,000+ observations, and the smaller category has more than 6000, our T-Distribution is effectively indistinguishable from our normal distribution. We can use Z scores to determine our Confidence Interval.
# Create Arrays for our subsets
Project_df_y <- filter(Project_df,Usual_Source_of_Care == "Yes")
Project_df_y <- Project_df_y$Total_Healthcare_Charges
Project_df_n <- filter(Project_df,Usual_Source_of_Care == "No")
Project_df_n <- Project_df_n$Total_Healthcare_Charges
# point estimate
mean(Project_df_y) - mean(Project_df_n)
## [1] 8040.093
## [1] 541.6964
Z score:
95% = 1.96
99% = 2.576
We can build our confidence interval for both 95% and 99% certainty by \[ PointEstimate \pm ZScore * SE \]
## [1] 6978.37
## [1] 9101.82
## [1] 6644.68
## [1] 9435.50
Finally, our test statistic can be generated by \[\frac{PointEstimate - 0}{SE}\]
## [1] 14.84
We are 95% confident that the difference in means falls between roughly $6,978 and $9,102, and 99% confident that it falls between roughly $6,645 and $9,436, well away from our null hypothesis value of 0. These bounds follow from the point estimate of 8040.093 and the standard error of 541.6964 reported above. Additionally, our test statistic of about 14.84 is far beyond the range of any standard t or Z table. We can confidently reject our null hypothesis and say that, based on this data, those who have a Usual Source of Care incurred more healthcare charges than those who do not.
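The standard error, confidence interval and test statistic can be computed directly from the group means, standard deviations and counts in the summary table. The sketch below does so with a hypothetical helper, `two_sample_z`, which is not part of the original analysis. Because it starts from the rounded values printed in the table, its results will differ slightly in the decimals from a calculation on the raw vectors.

```r
# Hypothetical helper: two-sample comparison of means from summary statistics,
# using the normal approximation (reasonable here given the large group sizes).
two_sample_z <- function(mean_1, sd_1, n_1, mean_2, sd_2, n_2, conf = 0.95) {
  point_estimate <- mean_1 - mean_2
  se <- sqrt(sd_1^2 / n_1 + sd_2^2 / n_2)   # unpooled standard error
  z  <- qnorm(1 - (1 - conf) / 2)           # e.g. about 1.96 for 95%
  list(
    point_estimate = point_estimate,
    se             = se,
    lower          = point_estimate - z * se,
    upper          = point_estimate + z * se,
    statistic      = (point_estimate - 0) / se
  )
}

# Rounded inputs taken from the summary table: "Yes" group first, then "No".
ci_95 <- two_sample_z(13200, 56652, 23936, 5160, 32300, 6547, conf = 0.95)
ci_99 <- two_sample_z(13200, 56652, 23936, 5160, 32300, 6547, conf = 0.99)
print(ci_95)
print(ci_99)
```

On the raw vectors, base R's `t.test(Project_df_y, Project_df_n)` performs the closely related Welch t-test and reports a confidence interval in one call.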
At the start of this study, I expected to find that those with a Usual Source of Care were more familiar with the healthcare system, and as a result, spent less on healthcare by way of better preventative care and the avoidance of high priced emergency room visits. What we have demonstrated above is that, in this data sample, those with a Usual Source of Care incur significantly more healthcare spending than those who do not. With that in mind, I am cautious when discussing any broad implications.
By limiting our sample to one year, our selection of those who have a Usual Source of Care captures a broader section of the sampled population. It is perhaps easier to isolate those this classification excludes: individuals who are not engaged in their health, and those who have recently relocated or changed insurances. By comparing those who pursue healthcare such that they have a Usual Location to do so, to those who do not, it is not surprising at all that the former spends more over the course of the year.
The next logical step in this study would be to gauge the impact of having a Usual Source of Care on healthcare spending years later. A follow-up study could be centered on a data set with several years’ worth of data, and could compare the annual spending of those who had a Usual Source of Care some time before the period being studied with the spending of those who did not.