Loading Libraries

library(tidyverse)
library(dplyr)
library(plotly)
library(ggthemes)
library(RColorBrewer)
library(highcharter)

Dataset Description

As the pandemic has made health care a pressing concern in the U.S., I became interested in the country's health care system, and specifically in annual expenditures by gender and age. The dataset covers U.S. health care spending by payer and health condition from 1996 to 2016. It was obtained from the Global Health Data Exchange (GHDx), which is created and supported by the Institute for Health Metrics and Evaluation (IHME). The data provide estimates of U.S. health care spending for three types of payers: public insurance (including Medicare, Medicaid, and other government programs), private insurance, and out-of-pocket payments. The estimates are broken down by health condition, age group, sex, and type of care for 1996 through 2016.

# Load the dataset
spending <- read.csv("IHME_DEX_1996_2016_Y2020M03D18.csv")

The dataset consists of 106,113 observations and 22 variables. The publication includes a code book that describes each variable; the structure of the dataset is summarized below.

# Describe the structure of the dataset
dim(spending)
## [1] 106113     22
glimpse(spending)
## Rows: 106,113
## Columns: 22
## $ year_id        <int> 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004...
## $ age_group_id   <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ age_group_name <fct> 1 to 4, 1 to 4, 1 to 4, 1 to 4, 1 to 4, 1 to 4, 1 to...
## $ sex_id         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ sex            <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male...
## $ function.      <fct> AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, ...
## $ type_of_care   <fct> Ambulatory care, Ambulatory care, Ambulatory care, A...
## $ acause         <fct> _comm, _comm, _comm, _comm, _comm, _comm, _comm, _co...
## $ cause_name     <fct> "Communicable, maternal, neonatal, and nutritional d...
## $ cause_id       <int> 295, 295, 295, 295, 295, 295, 295, 295, 295, 295, 29...
## $ mean_all       <dbl> 2575.4918, 2474.5844, 2446.2181, 2395.2106, 2382.910...
## $ lower_all      <dbl> 2001.3225, 2099.6091, 2205.8107, 2166.8178, 2140.719...
## $ upper_all      <dbl> 3124.7140, 2863.7867, 2761.8325, 2643.9929, 2626.366...
## $ mean_pub       <dbl> 605.0806, 577.0712, 554.2746, 548.1434, 558.3348, 59...
## $ lower_pub      <dbl> 437.1480, 433.3323, 446.1502, 460.7274, 471.9634, 51...
## $ upper_pub      <dbl> 778.9976, 714.7635, 657.9403, 649.4515, 648.8706, 67...
## $ mean_pri       <dbl> 1650.9957, 1582.8692, 1580.1214, 1544.0267, 1528.086...
## $ lower_pri      <dbl> 1296.8102, 1358.7341, 1399.8077, 1362.8851, 1341.050...
## $ upper_pri      <dbl> 2060.7398, 1877.6249, 1821.0514, 1737.0441, 1704.843...
## $ mean_oop       <dbl> 319.41544, 314.64410, 311.82207, 303.04054, 296.4889...
## $ lower_oop      <dbl> 240.20634, 258.49445, 273.06185, 263.23870, 253.9895...
## $ upper_oop      <dbl> 410.54140, 377.89202, 357.32671, 341.05298, 331.8522...

Data Cleansing

One of the variables is named “function.”, which is easily confused with function, a reserved word in R. The variable is therefore renamed to type_id.

# Change the name of column "function." to "type_id"
spending <- spending %>% rename("type_id" = "function.")
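
The trailing dot in “function.” most likely comes from R itself: by default, read.csv() passes column headers through make.names(), which appends a dot to reserved words. The sketch below illustrates this with a hypothetical two-column CSV (not the IHME file) and shows a base R rename that has the same effect as the rename() call above.

```r
# Hypothetical inline CSV; only the header row matters here
csv_text <- "year_id,function\n1996,AM\n1997,IP"

# check.names = TRUE (the default) sanitises headers with make.names()
toy <- read.csv(text = csv_text)
names(toy)

# The same sanitising step applied directly to the raw header names
make.names(c("year_id", "function"))

# Base R rename, equivalent in effect to dplyr::rename(type_id = function.)
names(toy)[names(toy) == "function."] <- "type_id"
names(toy)
```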

In the code book, the causes “Well care” and “Expenditure on risk factors” have no assigned cause IDs. These entries are recorded as NA, which introduces unnecessary missing values. Furthermore, the levels of the acause variable already identify the levels of the cause_name variable. The cause_id variable is therefore removed. Another related pair of variables is type_id and type_of_care, where type_id provides abbreviated names for the levels of type_of_care. The abbreviated names are useful for the data visualizations presented in the next section of this report.

# Remove the cause_id  variable
spending <- spending %>% select(-cause_id)

After this removal the dataset contains no NAs, so it is clean and well organized.

# Check missing values
sum(is.na(spending))
## [1] 0
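
A single grand total does not show which column the missing values came from. As a minimal sketch, assuming a toy data frame with made-up cause IDs and spending values, colSums(is.na(...)) reports the count per column, which makes it easy to confirm that an identifier column was the only source of NAs.

```r
# Toy data frame with one deliberately incomplete column (hypothetical values)
toy <- data.frame(
  cause_name = c("Well care", "Injuries", "Neoplasms"),
  cause_id   = c(NA, 101, 102),
  mean_all   = c(10.5, 20.1, 30.7)
)

# Missing values per column, rather than one grand total
na_per_column <- colSums(is.na(toy))
na_per_column

# Dropping the incomplete identifier column leaves no missing values
toy_clean <- toy[, names(toy) != "cause_id"]
sum(is.na(toy_clean))
```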

Exploratory Data Analysis and Statistical Inference

Figure 1

The first visualization explores the distribution of the average spending on different types of care from 1996 to 2016.

# Group the dataset by year and care type and sum the mean spending estimates (in millions)
fig1_data <- spending %>% group_by(year_id, type_id) %>%
  summarise(sum1 = sum(mean_all)/10^6, .groups = "drop")
# Figure 1 
fig1 <- ggplot(fig1_data) +
  geom_col(aes(year_id, sum1, fill = type_id), position = "stack") +
  coord_flip() +
  scale_fill_brewer(palette = "Set1", name = "Care Types") +
  labs(title = "Average Health Care Spending by Care Types",
       subtitle = "1996 - 2016",
       x = "Year",
       y = "Average Spending in Million")
ggplotly(fig1)
  • AM Ambulatory care
  • IP Inpatient care
  • RX Prescribed pharmaceutical care
  • ER Emergency department care
  • LT Nursing care facility
  • GA Government Administration and Net Cost of Insurance Programs
  • DV Dental care

According to Figure 1, average spending increases gradually over time, from 3.6 million dollars in 1996 to over 8 million dollars in 2016. Ambulatory care and inpatient care account for most of the spending in each year. For example, in 2016, spending on ambulatory care is 2.68 million dollars and spending on inpatient care is 2.28 million dollars. Ambulatory care refers to medical services provided in outpatient settings. According to John Hargraves and Julie Reiff of the Health Care Cost Institute (HCCI), the cost of ambulatory care includes facility fees and maintenance costs, which make outpatient services more expensive than those provided in physician offices. Prescribed pharmaceutical care also grows, from 0.304 million dollars in 1996 to 0.56 million dollars in 2016. Emergency department care and dental care constitute small portions of total spending, with 0.4 and 0.37 million dollars in 2016, respectively.
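
The code for Figure 1 sums the mean_all estimates within each year and care type, so the plotted quantity is a total of mean estimates rather than an arithmetic average of rows. The sketch below uses made-up numbers (not IHME values) to show the two side by side; the column names total_spending and mean_spending are illustrative only. Setting .groups = "drop" returns an ungrouped result and silences the regrouping message.

```r
library(dplyr)

# Hypothetical spending estimates (not IHME values)
toy <- data.frame(
  year_id  = c(1996, 1996, 1996, 2016, 2016, 2016),
  type_id  = c("AM", "AM", "IP", "AM", "AM", "IP"),
  mean_all = c(100, 300, 200, 400, 600, 500)
)

toy_summary <- toy %>%
  group_by(year_id, type_id) %>%
  summarise(
    total_spending = sum(mean_all),   # what the figure code computes
    mean_spending  = mean(mean_all),  # a per-row average, for comparison
    .groups = "drop"                  # ungrouped result, no regrouping message
  )
toy_summary
```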

Figure 2

The bar graph gives more detail on how average spending by public and private insurance is distributed across types of care.

# Group by year and type of care, sum the mean spending estimates (in millions), and reshape the summary to long format
fig2_data <- spending %>% group_by(year_id, type_id) %>%
  summarise("Public Insurance" = sum(mean_pub)/10^6, "Private Insurance" = sum(mean_pri)/10^6, .groups = "drop") %>%
  gather("insurance", "average_spending", 3:4)
# Figure 2
fig2 <- ggplot(fig2_data) +
  geom_col(aes(year_id, average_spending, fill = type_id), position = "stack") +
  theme(legend.title = element_blank()) +
  labs(title = "Average Spending on Insurance by Care Types 1996 - 2016",
       x = "Year",
       y = "Average Spending in Million",
       fill = "") +
  facet_grid(~insurance) 
ggplotly(fig2)
  • AM Ambulatory care
  • IP Inpatient care
  • RX Prescribed pharmaceutical care
  • ER Emergency department care
  • LT Nursing care facility
  • GA Government Administration and Net Cost of Insurance Programs
  • DV Dental care

Faceting is useful for comparing average spending by private and public health insurance. From 1996 to 2016, average spending by private insurance is higher than spending by public insurance. According to the Federal Office of the Actuary, the annual growth rate of spending for private insurance is higher than for Medicare and Medicaid between 2007 and 2020. The trend is projected to continue through 2023. The bar charts also indicate that spending by both insurance types rises gradually. From 1996 to 2016, average spending by private insurance increases from 1.8 to almost 4 million dollars, and spending by public insurance increases from 1.3 to 3.4 million dollars. For both insurance types, ambulatory care and inpatient care account for large portions of spending throughout the period. Private insurance covers more of the expense for dental care and for government administration and net cost of insurance programs. For example, in 2016, private insurance pays 0.17 million dollars for dental care while public insurance pays 0.044 million dollars. On the other hand, public insurance covers more of the expense of nursing care facilities than private insurance, with 0.32 million dollars compared with 0.1 million dollars in 2016.
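
Faceting by insurance type relies on the data being in long format, with one row per year, care type, and payer. The code for Figure 2 reshapes with gather(); a sketch of the same step using tidyr's newer pivot_longer() is shown here on a hypothetical wide table (the values are made up). Selecting the columns by name rather than by position 3:4 keeps the reshape correct if the column order changes.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide summary: one column per payer (made-up values)
wide <- data.frame(
  year_id = c(1996, 2016),
  type_id = c("AM", "AM"),
  `Public Insurance`  = c(1.0, 2.5),
  `Private Insurance` = c(1.5, 3.0),
  check.names = FALSE   # keep the spaces in the column names
)

# Reshape to long format: one row per year, care type, and payer
long <- wide %>%
  pivot_longer(
    cols      = all_of(c("Public Insurance", "Private Insurance")),
    names_to  = "insurance",
    values_to = "average_spending"
  )
long
```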

Figure 3

The third figure shows how average out-of-pocket spending is distributed across care types.

# Group by year and type of care and sum the mean out-of-pocket estimates (in millions)
fig3_data <- spending %>% group_by(year_id, type_id) %>%
  summarise(average_spending = sum(mean_oop)/10^6, .groups = "drop")
# Figure 3
fig3 <- ggplot(fig3_data, aes(year_id, average_spending, color = type_id)) +
  geom_point() +
  geom_line() +
  theme_light() +
  labs(title = "Average Spending out of Pocket by Care Types",
       subtitle = "1996 - 2016",
       x = "Year",
       y = "Average Spending in Million",
       color = "")
ggplotly(fig3)
  • AM Ambulatory care
  • IP Inpatient care
  • RX Prescribed pharmaceutical care
  • ER Emergency department care
  • LT Nursing care facility
  • GA Government Administration and Net Cost of Insurance Programs
  • DV Dental care

The first interesting observation is that government administration and net cost of insurance programs are not paid out of pocket. In their article on administrative costs in the U.S. health care system, Emily Gee and Topher Spiro indicate that such costs are included in insurance premiums and providers’ reimbursement. The article also asserts that administrative costs in the United States make up a larger portion of health care spending than in other high-income countries such as Germany, Switzerland, and Australia. Furthermore, according to an analysis by the Peterson-KFF Health System Tracker, administrative costs rose from 2.8 percent of total health expenditures in 1970 to 7.9 percent in 2018.

In general, average spending for most types of care increases slightly over time, with the exception of prescribed pharmaceutical care. Spending on this type of care peaks in 2015 at 0.185 million dollars and decreases in the following years. The trend can be explained by the introduction in 2006 of Medicare Part D, which covers prescription drugs. Average out-of-pocket spending on this care type declined for adults aged 65 and over as public insurance started to cover the three most expensive prescription drugs: Lipitor, Plavix, and Nexium.

Figure 4

The graph shows the average spending by gender from 1996 to 2016.

# Exclude the combined "Both" category and group by gender and year
fig4_data <- spending %>% filter(sex != "Both") %>%
  group_by(sex, year_id) %>%
  summarise(average_spending = sum(mean_all)/10^6, .groups = "drop")
# Figure 4
fig4 <- highchart() %>%
  hc_add_series(data = fig4_data,
                type = "waterfall",
                hcaes(x = year_id,
                      y = average_spending, 
                      group = sex)) %>%
  hc_add_theme(hc_theme_monokai()) %>%
  hc_title(text = "Average Spending on Insurance by Gender") %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="Average Spending in Million")) %>%
  hc_tooltip(shared = TRUE)
fig4

According to the graph, health care spending for females is higher than for males, and both groups show the same gradual upward trend. Average spending for females increases from 1.43 million dollars in 1996 to 3.1 million dollars in 2016, while average spending for males grows by 56 percent over the same period, starting from 1 million dollars in 1996. Furthermore, the gap between average spending for females and males widens every year. For example, the difference is 0.424 million dollars in 1996 and increases to 0.841 million dollars in 2006. The two-sample t-test below tests whether mean spending for males is lower than for females at the 5 percent significance level (95 percent confidence).

μf: mean spending for females

μm: mean spending for males

Ho: μf = μm

Ha: μm < μf

#Two sample t-test
male <- spending %>%
  filter(sex == "Male")

female <- spending %>%
  filter(sex == "Female")

t.test(male$mean_all, female$mean_all, alternative="less",
       conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  male$mean_all and female$mean_all
## t = -13.354, df = 67753, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -342.6754
## sample estimates:
## mean of x mean of y 
##  948.5044 1339.3189

P-value < 2.2 × 10^-16 < 0.05 = α

Conclusion: Reject Ho

Real-World Interpretation: There is enough evidence to show that the average spending for males is less than the average spending for females.
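
The output above is labeled "Welch Two Sample t-test" because t.test() does not assume equal variances by default. As a minimal sketch, using two small made-up samples (not the IHME data), the Welch statistic, its degrees of freedom, and the one-sided p-value can be reproduced by hand and checked against t.test().

```r
# Hypothetical samples (not IHME values)
male_toy   <- c(900, 950, 1010, 870, 990, 940)
female_toy <- c(1300, 1280, 1400, 1350, 1290, 1420)

# Per-group variance of the sample mean
v_m <- var(male_toy) / length(male_toy)
v_f <- var(female_toy) / length(female_toy)

# Welch statistic: difference in means over the unpooled standard error
t_manual <- (mean(male_toy) - mean(female_toy)) / sqrt(v_m + v_f)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v_m + v_f)^2 /
  (v_m^2 / (length(male_toy) - 1) + v_f^2 / (length(female_toy) - 1))

# One-sided p-value for Ha: mean(male) < mean(female)
p_manual <- pt(t_manual, df = df_manual)

# Cross-check against the built-in test
res <- t.test(male_toy, female_toy, alternative = "less", conf.level = 0.95)
c(t_manual = t_manual, t_builtin = unname(res$statistic))
```

A large negative t value, as in the report's output, places the observed difference far into the lower tail, which is why the one-sided p-value is so small.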

According to the Health Resources and Services Administration, health care expenses are higher for women than for men because women use more expensive health care services such as childbirth delivery. For instance, in 2010, women accounted for 56 percent of total health care spending, with $7,860 per person.

Figure 5

The tile graph explores the distribution of the average spending by age group and cause.

# Change the level names of age groups for consistency
levels(spending$age_group_name)[levels(spending$age_group_name)=="1 to 4"] <- "01 to 04"
levels(spending$age_group_name)[levels(spending$age_group_name)=="5 to 9"] <- "05 to 09"

# Assign an order to the levels of age groups; any label not listed here becomes NA
spending$age_group_name <- factor(spending$age_group_name,
  levels = c("<1 year", "01 to 04", "05 to 09", "10 to 14", "15 to 19",
             "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44",
             "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69",
             "70 to 74", "75 to 79", "80 to 84", "85 plus"))

# Arrange data for visualization
fig5_data <- spending %>% filter(age_group_name != "All Ages") %>%
  group_by(age_group_name, acause) %>%
  summarise(average_spending = sum(mean_all)/10^6, .groups = "drop")
# Figure 5
fig5 <- fig5_data %>%
  ggplot(aes(acause, age_group_name, fill = average_spending)) +
  geom_tile(color = "grey50") +
  scale_fill_gradientn(colors = brewer.pal(9, "Blues"), trans = "sqrt") +
  theme_minimal() +  theme(panel.grid = element_blank()) +
  labs(title = "Average Spending on Diseases per Age Group",
       y = "Age Groups",
       x = "Causes",
       fill = "Average Spending") +
  theme(axis.text.x = element_text(angle = 45))
ggplotly(fig5)
  • _comm: Communicable, maternal, neonatal, and nutritional disorders
  • _dube: Diabetes, urogenital, blood, and endocrine diseases
  • _inj: Injuries
  • _mental: Mental and behavioral disorders
  • _neo: Neoplasms
  • _neuro: Neurological disorders
  • _otherncd: Other non-communicable diseases
  • cirrhosis: Cirrhosis of the liver
  • cvd: Cardiovascular diseases
  • digest: Digestive diseases
  • exp: Well care
  • msk: Musculoskeletal disorders
  • resp: Chronic respiratory diseases
  • rf: Expenditure on risk factors

The visualization shows that cardiovascular diseases are common among people over 50 years old, and the average spending increases with age, reaching 1.15 million dollars for the 85 plus group. Neurological disorders are also a problem for this age group, requiring average spending of up to 1.23 million dollars. Musculoskeletal disorders; diabetes, urogenital, blood, and endocrine diseases; injuries; and mental and behavioral disorders are more severe in the age groups between 30 and 64 years old. Among these, musculoskeletal disorders require the most spending, especially for people between 45 and 64 years old, at over 1 million dollars. Communicable, maternal, neonatal, and nutritional disorders occur more often among children under 1 year old, requiring average spending of 1.07 million dollars. In contrast, cirrhosis of the liver does not appear to be severe in any age group, while other non-communicable diseases appear in every age group, with average spending between 0.5 and 0.7 million dollars.
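
The age groups appear in the right order on the tile graph because the factor levels were set explicitly before plotting. One side effect of that step is worth knowing: factor() turns any value that is not in the supplied levels into NA. The sketch below uses a short hypothetical vector of labels (including an aggregate "All Ages" label) to show the behavior; in dplyr, filter() likewise drops rows where the condition evaluates to NA.

```r
# Hypothetical age labels, including an aggregate row
age <- factor(c("1 to 4", "5 to 9", "10 to 14", "All Ages"))

# Zero-pad two labels so that all labels share one format
levels(age)[levels(age) == "1 to 4"] <- "01 to 04"
levels(age)[levels(age) == "5 to 9"] <- "05 to 09"

# Re-level with an explicit order; "All Ages" is not listed
age_ordered <- factor(age, levels = c("01 to 04", "05 to 09", "10 to 14"))

levels(age_ordered)
is.na(age_ordered)   # the unlisted value has become NA

# Comparing NA with a string yields NA; which() keeps only the TRUE positions
kept <- age_ordered[which(age_ordered != "All Ages")]
length(kept)
```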

One question is whether average spending is equal across age groups. An ANOVA test is used to test this hypothesis at the 5 percent significance level (95 percent confidence).

Ho: μ5 = μ6 = μ7 = … = μ28 = μ30 = μ160

Ha: Not all μ are equal

# Anova test
fig_anova <- spending %>% filter(age_group_name != "All Ages")
results <- aov(fig_anova$mean_all ~ fig_anova$age_group_name)
summary(results)
##                              Df    Sum Sq   Mean Sq F value Pr(>F)    
## fig_anova$age_group_name     18 1.302e+10 723417217   305.9 <2e-16 ***
## Residuals                102482 2.423e+11   2364564                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

P-value < 2 × 10^-16 < 0.05 = α

Conclusion: Reject Ho

Real-World Interpretation: There is enough evidence to show that average health care spending is not equal across all age groups, that is, there is a relationship between age and health care spending.
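
In the ANOVA table, Df for the grouping variable is the number of groups minus one (18 for 19 age groups), and the F value compares between-group to within-group variation. As a sketch on simulated data (three hypothetical groups with made-up means, not the IHME data), the same test can be written with the data argument of aov(), and the table entries can be extracted by name for reporting.

```r
set.seed(1)

# Simulated spending for three hypothetical age groups (made-up means)
toy <- data.frame(
  age_group_name = factor(rep(c("20 to 24", "45 to 49", "85 plus"), each = 30)),
  mean_all       = c(rnorm(30, 500, 100), rnorm(30, 900, 100), rnorm(30, 1500, 100))
)

# Formula interface with a data argument instead of toy$... references
fit <- aov(mean_all ~ age_group_name, data = toy)
tab <- summary(fit)[[1]]

df_between <- tab[["Df"]][1]        # number of groups minus one
f_value    <- tab[["F value"]][1]   # between-group over within-group mean square
p_value    <- tab[["Pr(>F)"]][1]

c(df_between = df_between, f_value = f_value, p_value = p_value)
```

A significant F test only indicates that at least one group mean differs; it does not say which groups differ from each other.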

In their research on health expenditures across the population, Bradley Sawyer and Gary Claxton indicate that people aged 65 and over account for 36 percent of total health spending in 2016, while younger groups account for less: 20 percent for the 55-64 group, 13 percent for the 45-54 group, 10 percent for the 35-44 group, 11 percent for the 19-34 group, and 10 percent for the under 19 group. According to Health Affairs, in 2010 an average person over 65 spends three times more than an average working-age adult and five times more than an average child.

Conclusion

Health care spending in the U.S. has been increasing gradually for both private and public insurance. Average spending is not equal across age groups: people over 65 years old account for higher spending than the other groups. In addition, average health care expenditure for females is consistently higher than for males, and the disparity continues to widen over time. Further analysis, in both visualization and testing, should focus on the causes that affect females more than males, as this information can explain more of the disparity in average health care spending between the two genders.

Bibliography

Altman, Drew. https://www.kff.org/health-costs/perspective/public-vs-private-health-insurance-on-controlling-spending/. 16 April 2015. 2 July 2020.

Chevarley, Frances M. https://meps.ahrq.gov/data_files/publications/st288/stat288.shtml. July 2010. 2 July 2020.

Gee, Emily and Topher Spiro. https://www.americanprogress.org/issues/healthcare/reports/2019/04/08/468302/excess-administrative-costs-burden-u-s-health-care-system/. 8 April 2019. 2 July 2020.

Health Resources and Services Administration. “Women’s Health USA 2013.” Women’s Health USA data book. 2013.

LaPointe, Jacqueline. https://revcycleintelligence.com/news/care-volume-prices-are-increasing-in-the-outpatient-setting. 4 April 2019. 2 July 2020.

Lassman, David, et al. https://www.healthaffairs.org/doi/full/10.1377/hlthaff.2013.1224. 1 May 2014. 10 July 2020.