How do Individual (Demographic) Factors Impact Health Insurance Charges?

Contributors: Xavier Le, Yuel Abraham, Bryan Li, Theodor Hezkial

Introduction

Healthcare is something everyone needs, but the cost of health insurance can be very different from person to person. Some people pay a lot more than others, and we want to understand why. This report examines how individual factors affect how much a person is charged for health insurance.

The 2010 Affordable Care Act (ACA) changed the U.S. healthcare system with the aim of increasing access to health care, and it reshaped the health insurance industry. Before the ACA, insurance charges were informed by factors including health status and medical history. The ACA allows health insurance companies to weigh these five factors when deciding an individual’s charges: age, location, smoking status (i.e., tobacco use), plan category, and family vs. individual enrollment. This study primarily examines the first three factors: how do a person’s location, age, and tobacco usage affect how much they pay for health insurance?

These sweeping changes to the health insurance industry have prompted much research. Studies such as “State Health Insurance Coverage: 2013, 2019, and 2023” from the U.S. Census Bureau examined how age influences health insurance coverage rates. In part, the study highlights how southern states often had lower coverage rates among working-age adults (19-64). These findings motivate the present report to investigate how location may interact with other factors to influence health insurance charges.

Access to healthcare is a universally relevant issue. Financial issues, including the cost of insurance, are a significant barrier to receiving medical care. Being able to associate certain individual factors with increased health insurance charges can help people better prepare for future expenses and gain access to necessary care. If we can understand what makes insurance more expensive, people can plan better for their medical expenses, insurance companies can set more equitable pricing plans, and policymakers can work toward making healthcare more affordable.

Data and Methods

The data is sourced from the pre-cleaned Kaggle data set “Healthcare Insurance”. This data set includes a sample of 1,338 people (n = 1338) and their insurance charges. The following variables are included in this data set:

Age: The insured person’s age.
Sex: Gender (male or female) of the insured.
BMI (Body Mass Index): A measure of body fat based on height and weight.
Children: The number of dependents covered.
Smoker: Whether the insured is a smoker (yes or no).
Region: The geographic area of coverage.
Charges: The medical insurance costs incurred by the insured person.

The variables of interest for this report are Age, BMI, Smoker, Region, and Charges. For the purposes of our exploration, other variables were filtered from the set. A sample of the data is displayed below:

setwd("C:/Users/xavle/Documents/R Projects/B DATA 200/Health Insurance")
kaggle <- read.csv("insurance.csv", header = TRUE)
head(kaggle)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Analysis of the data set was performed in R. We used several libraries to facilitate our analysis:

library(tidyverse)
library(ggpubr) 
library(dplyr)
library(freqdist)

To start, we used a heat map to represent the health insurance cost by region. We summarised the charges for each region and mapped each region's mean charge to a fill color, with darker tiles indicating higher mean charges:

region_charges <- kaggle %>%
  group_by(region) %>%
  summarise(
    mean_charges = mean(charges, na.rm = TRUE), 
    median_charges = median(charges, na.rm = TRUE),
    min_charges = min(charges, na.rm = TRUE),
    max_charges = max(charges, na.rm = TRUE),
    count = n()
  )
  
print(region_charges)

## # A tibble: 4 × 6
##   region    mean_charges median_charges min_charges max_charges count
##   <chr>            <dbl>          <dbl>       <dbl>       <dbl> <int>
## 1 northeast       13406.         10058.       1695.      58571.   324
## 2 northwest       12418.          8966.       1621.      60021.   325
## 3 southeast       14735.          9294.       1122.      63770.   364
## 4 southwest       12347.          8799.       1242.      52591.   325

ggplot(region_charges, aes(x = region, 
                           y = "region", 
                           fill = mean_charges)) +
  geom_tile(color = "white") + 
  scale_fill_gradient(low = "salmon", high = "darkred") + 
  labs(title = "Heatmap of Mean Charges by Region", 
       x = "Region", 
       fill = "Mean Charges") +
  # round the label so each tile shows whole dollars
  geom_text(aes(label = round(mean_charges)), size = 3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We then wanted to understand the distribution of how much individuals were being charged in the sample. We used a histogram to show a count of how many individuals were being charged certain amounts for their health insurance:

kaggle%>%
ggplot(aes(x=charges)) +
  geom_histogram(fill="lightsteelblue3",
                 binwidth = 1000) +
  scale_x_continuous(breaks=seq(0,65000,5000)) +
  # labels
  labs(title = "Health Insurance Charges",
       x = "Charges ($)", 
       y = "Count")+
  # add median line
  geom_vline(aes(xintercept=median(charges)), 
                color="darkblue", linetype="dashed", linewidth=1)+
  # add label for median
  geom_label(aes(x=median(charges), 
                 label= median(charges)), y = Inf, vjust = 1.5)

Among respondents, the median annual health insurance charge was $9382.03. Each bin is $1,000 wide, and the x-axis is labeled every $5,000.

Results/Findings

Region

Because charges in the data set have a heavily right-skewed distribution, we focused on the median. Returning to region, we created a box plot to capture how charges were spread across U.S. regions. We observed that the medians and IQRs of the regions overlapped heavily, which implies that region by itself did not have an effect on the typical healthcare charges experienced.

However, we noted that a portion of the southeast region experienced charges that were higher than those in other regions. In other words, the range of charges experienced in the southeast extended higher than in other regions. This is verified by briefly collecting the average healthcare charges incurred by region, which shows that the southeast region had the highest average charges:

# bar chart of mean charges by region
avg_charges_per_region <- kaggle %>%
  group_by(region) %>%
  summarise(mean_charges = mean(charges, na.rm = TRUE))
avg_charges_per_region

## # A tibble: 4 × 2
##   region    mean_charges
##   <chr>            <dbl>
## 1 northeast       13406.
## 2 northwest       12418.
## 3 southeast       14735.
## 4 southwest       12347.

ggplot(avg_charges_per_region, aes(x = region, y = mean_charges, fill = mean_charges)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "yellow", high = "red")

# box plot of median charges by region
ggplot(kaggle, aes(x = region,
                     y = charges,
                     fill = region))+
  geom_boxplot(alpha=0.4, outlier.shape = NA)+
  # labels
  labs(title = "Health Insurance Charges by Region",
       x = "Region",
       y = "Charges ($)")+
  coord_cartesian(ylim = c(0, 45000))

charges_region <- kaggle%>%
  group_by(region)%>%
  mutate(charges_med = median(charges, na.rm = TRUE),
         charges_IQR = IQR(charges, na.rm = TRUE),
         charges_min = min(charges, na.rm = TRUE))%>%
  select(charges_med, charges_IQR, region)%>%
  arrange(region)%>%
  distinct(charges_med, charges_IQR)
charges_region

## # A tibble: 4 × 3
## # Groups:   region [4]
##   region    charges_med charges_IQR
##   <chr>           <dbl>       <dbl>
## 1 northeast      10058.      11493.
## 2 northwest       8966.       9992.
## 3 southeast       9294.      15085.
## 4 southwest       8799.       8711.

For this data set, the median and the mean tell us different things about how much a resident of a region is typically charged. Notably, the difference between mean and median is pronounced in the southeast: the mean is $14735.41, while the median, which is less influenced by high outliers, is only $9294.13.
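As a quick check of this point, the gap between mean and median can be computed for every region. The sketch below does not reload the data; it assumes the rounded values printed in the `region_charges` output above, and the data frame name `region_summary` is only illustrative.

```r
# Regional summary values copied (rounded) from the region_charges output above
region_summary <- data.frame(
  region         = c("northeast", "northwest", "southeast", "southwest"),
  mean_charges   = c(13406, 12418, 14735, 12347),
  median_charges = c(10058, 8966, 9294, 8799)
)

# Gap between mean and median: a larger gap suggests a longer right tail
region_summary$mean_minus_median <-
  region_summary$mean_charges - region_summary$median_charges

# Sort so the region with the largest gap comes first
region_summary[order(-region_summary$mean_minus_median), ]
```

With these rounded inputs the southeast gap is about $5,441, while the other three regions fall between roughly $3,300 and $3,600.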

This leads us to investigate how other demographic factors (i.e., smoking status and BMI) affect charges. Since we are interested in whether these factors may be indicators of higher charges, we used linear regression.

Before constructing these linear models, we split the data into separate sub-datasets. The first sorted numerical BMI into distinct weight categories: underweight (0-18.5), normal weight (18.5-25), overweight (25-30), and obese (30+). These weight categories were then grouped by region, with the median, minimum, and maximum charges and a count for each group:

weight_group <- cut(kaggle$bmi,
                    breaks = c(0, 18.5, 25, 30, Inf),
                    labels = c("Underweight", "Normal", "Overweight", "Obese"),
                    include.lowest = TRUE)

kaggle <- kaggle%>%
  mutate(weight_group)

bmi_region <- kaggle%>%
  select(weight_group, region, charges)%>%
  group_by(region, weight_group)%>%
  summarise(median_charges = median(charges, na.rm = TRUE),
            min_charges = min(charges, na.rm = TRUE),
            max_charges = max(charges, na.rm = TRUE),
            count = n())

## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.

bmi_region

## # A tibble: 15 × 6
## # Groups:   region [4]
##    region    weight_group median_charges min_charges max_charges count
##    <chr>     <fct>                 <dbl>       <dbl>       <dbl> <int>
##  1 northeast Underweight           9206.       1695.      15007.    10
##  2 northeast Normal                8689.       1702.      35069.    73
##  3 northeast Overweight            9224.       1708.      35148.    98
##  4 northeast Obese                11512.       1984.      58571.   143
##  5 northwest Underweight           5117.       1621.      32734.     7
##  6 northwest Normal                8017.       1625.      30167.    63
##  7 northwest Overweight            9249.       1632.      32787.   107
##  8 northwest Obese                 9379.       1640.      60021.   148
##  9 southeast Normal               11834.       1122.      27118.    41
## 10 southeast Overweight            8226.       1616.      38246.    80
## 11 southeast Obese                 9386.       1132.      63770.   243
## 12 southwest Underweight           3676.       1728.      19023.     4
## 13 southwest Normal                5080.       1242.      26237.    49
## 14 southwest Overweight            8782.       1252.      37830.   101
## 15 southwest Obese                 9630.       1256.      52591.   171
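Because the weight categories share their boundary values, it may help to see how `cut()` assigns a BMI that falls exactly on a boundary. This is a small sketch using the same breaks and labels as the `weight_group` definition; the values in `example_bmi` are made up for illustration and are not taken from the data set.

```r
# Same breaks and labels as the weight_group definition
bmi_breaks <- c(0, 18.5, 25, 30, Inf)
bmi_labels <- c("Underweight", "Normal", "Overweight", "Obese")

# Three boundary values plus one value just above the obese threshold
example_bmi <- c(18.5, 25, 30, 30.1)

# cut() uses right = TRUE by default, so each interval includes its upper bound
example_groups <- cut(example_bmi,
                      breaks = bmi_breaks,
                      labels = bmi_labels,
                      include.lowest = TRUE)

data.frame(bmi = example_bmi, weight_group = example_groups)
```

With the default right-closed intervals, a BMI of exactly 25 is labeled "Normal" and exactly 30 is labeled "Overweight". If the intent is for 25 to count as overweight and 30 as obese, passing `right = FALSE` to `cut()` would be one way to get that behavior.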

The second sub-dataset groups smoking status by region, displaying the median, minimum, and maximum charges and a count for each combination:

smoker_region <- kaggle%>%
  select(smoker, region, charges)%>%
  group_by(region, smoker)%>%
  summarise(
            median_charges = median(charges, na.rm = TRUE),
            min_charges = min(charges, na.rm = TRUE),
            max_charges = max(charges, na.rm = TRUE),
            count = n())

## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.

smoker_region

## # A tibble: 8 × 6
## # Groups:   region [4]
##   region    smoker median_charges min_charges max_charges count
##   <chr>     <chr>           <dbl>       <dbl>       <dbl> <int>
## 1 northeast no              8343.       1695.      32109.   257
## 2 northeast yes            28101.      12829.      58571.    67
## 3 northwest no              7257.       1621.      33472.   267
## 4 northwest yes            27489.      14712.      60021.    58
## 5 southeast no              6653.       1122.      36580.   273
## 6 southeast yes            37484.      16578.      63770.    91
## 7 southwest no              7348.       1242.      36911.   267
## 8 southwest yes            35165.      13845.      52591.    58

BMI

To understand how BMI varied across regions, we constructed a bar chart. Proportionally, the southeast region has the largest share of people categorized as obese:

# bar chart with bmi
ggplot(bmi_region, aes(x = region, y= count, fill = weight_group)) +
  geom_col(position = position_dodge())+
  # labels
  labs(title = "BMI groups by Region",
       x = "Region", 
       y = "Count")+
  geom_text(aes(label= count), size = 3, vjust = .9, position = position_dodge(.9))
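The bar chart shows raw counts, so the proportional comparison can be made explicit by converting the counts to within-region shares. The sketch below is one way to do that in base R; it assumes the counts printed in the `bmi_region` output above, and the names `bmi_counts`, `bmi_table`, and `bmi_share` are illustrative.

```r
# Counts copied from the bmi_region output
# (the southeast has no underweight row in that output)
bmi_levels <- c("Underweight", "Normal", "Overweight", "Obese")
bmi_counts <- data.frame(
  region = c(rep("northeast", 4), rep("northwest", 4),
             rep("southeast", 3), rep("southwest", 4)),
  weight_group = factor(c(bmi_levels, bmi_levels, bmi_levels[-1], bmi_levels),
                        levels = bmi_levels),
  count = c(10, 73, 98, 143,
            7, 63, 107, 148,
            41, 80, 243,
            4, 49, 101, 171)
)

# Region-by-weight-group table of counts; missing combinations become 0
bmi_table <- xtabs(count ~ region + weight_group, data = bmi_counts)

# Row proportions: the share of each weight group within each region
bmi_share <- prop.table(bmi_table, margin = 1)
round(bmi_share, 3)
```

Under these counts, about two thirds of the southeast sample (243 of 364) falls in the obese category, compared with roughly 44% to 53% in the other regions.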

In this linear regression model, we observe that age has a positive linear relationship with health insurance charges: as age increases, the charges incurred by the individual also increase. By coloring the points by weight group, we also observe that across ages, those with the highest charges were largely in the obese category:

# regression model with bmi
ggplot(kaggle, aes(x = age,
                   y = charges,
                   col = weight_group,
                   fill = weight_group))+ 
  geom_point(position = "jitter", alpha = 0.4) +
  geom_smooth(method="lm") +
  stat_regline_equation()+
  # labels
  labs(title = "Charges by BMI Across Ages",
       x = "Age", 
       y = "Charges ($)")

## `geom_smooth()` using formula = 'y ~ x'
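Because `col` is mapped to a grouping variable, `geom_smooth(method = "lm")` draws a separate regression line of charges on age for each group. The coefficients behind such lines can also be obtained directly with `lm()`. The sketch below is only an illustration: `fit_by_group()` is a hypothetical helper, not part of the original analysis, and it is run on the six sample rows shown by `head(kaggle)` earlier, grouped by `sex` purely so that every group has at least two rows. The resulting coefficients are not meaningful estimates.

```r
# Hypothetical helper: fit charges ~ age separately within each level of a
# grouping column and return one row of coefficients per group
fit_by_group <- function(data, group_col) {
  groups <- split(data, data[[group_col]], drop = TRUE)
  rows <- lapply(names(groups), function(g) {
    fit <- lm(charges ~ age, data = groups[[g]])
    data.frame(group = g,
               n = nrow(groups[[g]]),
               intercept = unname(coef(fit)[1]),
               slope_per_year = unname(coef(fit)[2]))
  })
  do.call(rbind, rows)
}

# The six sample rows shown by head(kaggle)
sample_rows <- data.frame(
  age     = c(19, 18, 28, 33, 32, 31),
  sex     = c("female", "male", "male", "male", "male", "female"),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

sample_fits <- fit_by_group(sample_rows, "sex")
sample_fits

# On the full data set, the same helper could be called as:
# fit_by_group(kaggle, "weight_group")
```

The `slope_per_year` column is the estimated change in charges per additional year of age within each group, which is the quantity the plotted lines and `stat_regline_equation()` labels summarize.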

Because the southeast sample had proportionally the most individuals categorized as obese, the higher costs experienced by some people in the southeast can be attributed, at least in part, to BMI. Overall, many of those in the obese category incurred the highest charges in this sample, across regions.

Smoker Status

To understand how tobacco usage varied across regions, we constructed a bar chart. We notice that the southeast region has the largest number of people with a history of smoking:

# bar chart
ggplot(smoker_region, aes(x = region, y= count, fill = smoker)) +
  geom_col(position = position_dodge())+
  # labels
  labs(title = "Smoking Status by Region",
       x = "Region", 
       y = "Count")+
  geom_text(aes(label= count), size = 3, vjust = .9, position = position_dodge(.9))
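The regional comparison can be put on a common scale by computing the share of smokers in each region, along with the ratio of median charges between smokers and non-smokers. The sketch below assumes the counts and rounded medians printed in the `smoker_region` output above; the data frame `smoker_summary` and its column names are illustrative.

```r
# Counts and rounded median charges copied from the smoker_region output
smoker_summary <- data.frame(
  region           = c("northeast", "northwest", "southeast", "southwest"),
  n_nonsmoker      = c(257, 267, 273, 267),
  n_smoker         = c(67, 58, 91, 58),
  median_nonsmoker = c(8343, 7257, 6653, 7348),
  median_smoker    = c(28101, 27489, 37484, 35165)
)

# Share of each region's sample that smokes
smoker_summary$smoker_share <-
  with(smoker_summary, n_smoker / (n_smoker + n_nonsmoker))

# How many times higher the smokers' median charge is than the non-smokers'
smoker_summary$median_ratio <-
  with(smoker_summary, median_smoker / median_nonsmoker)

smoker_summary[, c("region", "smoker_share", "median_ratio")]
```

With these inputs the southeast has both the largest smoker share (91 of 364, or 25%) and the largest ratio between smoker and non-smoker median charges.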

In this linear regression model, we observe that age has a positive linear relationship with health insurance charges: as age increases, the charges incurred by the individual also increase. By coloring the points by smoking status, we also observe that across ages, those with the highest charges were largely those with a history of smoking:

ggplot(kaggle, aes(x = age,
                   y = charges,
                   col = smoker,
                   fill = smoker))+ 
  geom_point(position = "jitter", alpha = 0.4) +
  geom_smooth(method="lm") +
  stat_regline_equation()+
  # labels
  labs(title = "Charges by Smoking Status Across Ages",
       x = "Age", 
       y = "Charges ($)")

## `geom_smooth()` using formula = 'y ~ x'

Because the southeast sample also had proportionally the most smokers, the higher costs experienced by some people in the southeast can be attributed, at least in part, to smoking status. Overall, those with a history of smoking incurred the highest charges in this sample, across regions.

Discussion

Although a person’s location is one of five factors insurance companies may consider when setting individual charges under the Affordable Care Act, we find that region by itself did not substantially affect the health insurance charges a person experienced. While the Kaggle data set was published with individuals divided into four U.S. regions (northwest, northeast, southwest, southeast), we suspect that these regional categories are too broad (given the sample size of n = 1338 across all states). A more specific breakdown of location may have given us more precision when analyzing region, potentially revealing a stronger association with charges. Had we been given the raw data, we might have sorted the categories as follows:

Northeast

New England: Maine, New Hampshire, Vermont, Massachusetts, Rhode Island, Connecticut
Mid-Atlantic: New York, New Jersey, Pennsylvania

Northwest

Pacific Northwest: Washington, Oregon, Idaho
Sometimes included: Montana, Alaska, California

Southeast

South Atlantic: Delaware, Maryland, Virginia, West Virginia, North Carolina, South Carolina, Georgia, Florida
East South Central: Kentucky, Tennessee, Alabama, Mississippi

Southwest

West South Central: Texas, Oklahoma, Arkansas, Louisiana
Mountain Southwest: Arizona, New Mexico, Nevada, (sometimes Colorado & Utah)
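If state-level data were available, the breakdown above could be applied with a simple lookup. This is only a sketch: the data set used in this report has no state column, so the `example_states` vector is hypothetical, and the states listed above as "sometimes" included are left out.

```r
# Finer divisions as listed above (states marked "sometimes" are omitted)
divisions <- list(
  "New England"        = c("Maine", "New Hampshire", "Vermont",
                           "Massachusetts", "Rhode Island", "Connecticut"),
  "Mid-Atlantic"       = c("New York", "New Jersey", "Pennsylvania"),
  "Pacific Northwest"  = c("Washington", "Oregon", "Idaho"),
  "South Atlantic"     = c("Delaware", "Maryland", "Virginia", "West Virginia",
                           "North Carolina", "South Carolina", "Georgia",
                           "Florida"),
  "East South Central" = c("Kentucky", "Tennessee", "Alabama", "Mississippi"),
  "West South Central" = c("Texas", "Oklahoma", "Arkansas", "Louisiana"),
  "Mountain Southwest" = c("Arizona", "New Mexico", "Nevada")
)

# Flatten into a named vector: names are states, values are divisions
state_to_division <- setNames(
  rep(names(divisions), lengths(divisions)),
  unlist(divisions, use.names = FALSE)
)

# Hypothetical state values; a state outside the listing maps to NA
example_states <- c("Texas", "Georgia", "Oregon", "Ohio")
example_divisions <- unname(state_to_division[example_states])
example_divisions
```

A state that is not covered by the listing, such as Ohio here, comes back as `NA`, which makes gaps in the categorization easy to spot before grouping.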

However, other demographic factors such as age, BMI, and smoking status were associated with increased charges. Proportionally, more members of the southeast region were categorized as obese and/or had a history of smoking. This may explain why the range of healthcare charges incurred by people in the southeast extended higher than in other regions.

Both obesity and smoking may predispose an individual to more medical complications. At the same time, insurance is more expensive for them. This leads to more barriers to healthcare for these more medically at-risk people.

Many states in the southeast region did not expand Medicaid under the Affordable Care Act (ACA), leaving a coverage gap for many low-income individuals who remain uninsured. Combined with the higher proportion of tobacco users and people categorized as obese, this further increases healthcare costs for those who do have insurance.

Our report has several potential limitations, and we must make some assumptions for the relationship between demographic factors and charges observed in the data to be representative of the U.S. population. Because our data is sourced from Kaggle, which allows users to self-publish data sets, the described method of data collection is somewhat vague. The publisher describes the data set as compiled from online data sets provided by hospitals and from on-site surveys of people at a hospital. Given this context, we must assume that the data comes from a random sample, both for the people interviewed and for the other data sets compiled here. Finally, higher health insurance charges may be attributable to other individual or regional factors that were not measured.

Contributions

Name	Contribution
Xavier Le	Compiled/finalized R Markdown report; Data and Methods; Histogram showing distribution of charges; Results - BMI and Smoker Status; Discussion Section; Presentation Slides
Yuel Abraham	Data and Methods; Heatmap showing mean charges by region; Presentation Slides
Bryan Li	Results - Region; Bar chart showing mean charges by region; Supported Yuel with Heatmap (in Data and Methods); Points made in Discussion section relating to Region; Presentation Slides
Theodor Hezkial	Intro; Data and Methods; Points made in Discussion section relating to ACA; Presentation Slides