Contributors: Xavier Le, Yuel Abraham, Bryan Li, Theodor Hezkial
Healthcare is something everyone needs, but the cost of health insurance can be very different from person to person. Some people pay a lot more than others, and we want to understand why. This report examines how individual factors affect how much a person is charged for health insurance.
The 2010 Affordable Care Act (ACA) changed the U.S. healthcare system with the aim of increasing access to health care, and it substantially reshaped the health insurance industry. Before the ACA, healthcare charges were informed by factors including health status and history. The ACA allows health insurance companies to weigh five factors when deciding an individual’s charges: age, location, smoking status (i.e., tobacco use), plan category, and family vs. individual enrollment. This study primarily examines the first three factors: how do a person’s location, age, and tobacco usage affect how much they pay for health insurance?
These sweeping changes to the health insurance industry have prompted considerable research. Studies such as “State Health Insurance Coverage: 2013, 2019, and 2023” from the U.S. Census Bureau examined how age influences health insurance coverage rates. In part, the study highlights that southern states often had lower coverage rates among working-age adults (19-64). These findings motivate the present report to investigate how location may interact with other factors to influence health insurance charges.
Access to healthcare is a universally relevant issue. Financial issues, including the cost of insurance, are a significant barrier to receiving medical aid. Being able to associate certain individual factors with increased health insurance charges can help people better prepare for future expenses and secure access to necessary care. If we can understand what makes insurance more expensive, people can plan better for their medical expenses, insurance companies can set more equitable pricing plans, and policymakers can work toward making healthcare more affordable.
The data are sourced from the pre-cleaned Kaggle data set “Healthcare Insurance”. This data set includes a sample of 1,338 people (n = 1338) and their insurance charges. The following variables are included in this data set:
Age: The insured person’s age.
Sex: Gender (male or female) of the insured.
BMI (Body Mass Index): A measure of body fat based on height and weight.
Children: The number of dependents covered.
Smoker: Whether the insured is a smoker (yes or no).
Region: The geographic area of coverage.
Charges: The medical insurance costs incurred by the insured person.
The variables of interest for this report are Age, BMI, Smoker, Region, and Charges. For the purposes of our exploration, other variables were filtered from the set. A sample of the data is displayed below:
setwd("C:/Users/xavle/Documents/R Projects/B DATA 200/Health Insurance")
kaggle <- read.csv("insurance.csv", header = TRUE)
head(kaggle)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
Analysis of the data set was performed in R. We used several libraries to facilitate our analysis:
library(tidyverse)
library(ggpubr)
library(dplyr)
library(freqdist)
To start, we used a heat map to represent health insurance cost by region. We created the heat map by computing the mean charge for each region and mapping it to a color gradient:
region_charges <- kaggle %>%
group_by(region) %>%
summarise(
mean_charges = mean(charges, na.rm = TRUE),
median_charges = median(charges, na.rm = TRUE),
min_charges = min(charges, na.rm = TRUE),
max_charges = max(charges, na.rm = TRUE),
count = n()
)
print(region_charges)
## # A tibble: 4 × 6
## region mean_charges median_charges min_charges max_charges count
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 northeast 13406. 10058. 1695. 58571. 324
## 2 northwest 12418. 8966. 1621. 60021. 325
## 3 southeast 14735. 9294. 1122. 63770. 364
## 4 southwest 12347. 8799. 1242. 52591. 325
ggplot(region_charges, aes(x = region,
y = "region",
fill = mean_charges)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "salmon", high = "darkred") +
labs(title = "Heatmap of Mean Charges by Region",
x = "Region",
fill = "Mean Charges") +
geom_text(aes(label = round(mean_charges)), size = 3) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We then wanted to understand the distribution of how much individuals were being charged in the sample. We used a histogram to show a count of how many individuals were being charged certain amounts for their health insurance:
kaggle%>%
ggplot(aes(x=charges)) +
geom_histogram(fill="lightsteelblue3",
binwidth = 1000) +
scale_x_continuous(breaks=seq(0,65000,5000)) +
# labels
labs(title = "Health Insurance Charges",
x = "Charges ($)",
y = "Count")+
# add median line
geom_vline(aes(xintercept=median(charges)),
color="darkblue", linetype="dashed", linewidth=1)+
# add label for median
annotate("label", x = median(kaggle$charges), y = Inf,
label = round(median(kaggle$charges), 2), vjust = 1.5)
Because the data show a heavily right-skewed distribution, we focused on the median rather than the mean. Returning to region, we created a box plot to capture how charges were spread across US regions. The medians and IQRs of the regions overlapped heavily, which suggests that region by itself did not have a clear effect on the typical healthcare charges experienced.
However, we noted that a portion of the southeast region experienced charges higher than those in other regions. In other words, the range of charges experienced in the southeast extended higher than in other regions. This is supported by the average healthcare charges incurred by region, which show that the southeast had the highest average charges:
# bar chart of mean charges by region
avg_charges_per_region <- kaggle %>%
group_by(region) %>%
summarise(mean_charges = mean(charges, na.rm = TRUE))
avg_charges_per_region
## # A tibble: 4 × 2
## region mean_charges
## <chr> <dbl>
## 1 northeast 13406.
## 2 northwest 12418.
## 3 southeast 14735.
## 4 southwest 12347.
ggplot(avg_charges_per_region, aes(x = region, y = mean_charges, fill = mean_charges)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "yellow", high = "red")
# box plot of charges by region
ggplot(kaggle, aes(x = region,
y = charges,
fill = region))+
geom_boxplot(alpha=0.4, outlier.shape = NA)+
# labels
labs(title = "Health Insurance Charges by Region",
x = "Region",
y = "Charges ($)")+
coord_cartesian(ylim = c(0, 45000))
charges_region <- kaggle%>%
group_by(region)%>%
mutate(charges_med = median(charges, na.rm = TRUE),
charges_IQR = IQR(charges, na.rm = TRUE),
charges_min = min(charges, na.rm = TRUE))%>%
select(charges_med, charges_IQR, region)%>%
arrange(region)%>%
distinct(charges_med, charges_IQR)
charges_region
## # A tibble: 4 × 3
## # Groups: region [4]
## region charges_med charges_IQR
## <chr> <dbl> <dbl>
## 1 northeast 10058. 11493.
## 2 northwest 8966. 9992.
## 3 southeast 9294. 15085.
## 4 southwest 8799. 8711.
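One rough way to see the right skew by region is to compare each region's mean with its median, since a larger gap suggests a longer upper tail. The sketch below re-enters the rounded means and medians from the regional summary output above by hand, so it runs without the CSV file. The names `region_summary`, `gap`, and `largest_gap` are ours, introduced only for illustration.

```r
# Rounded means and medians copied by hand from the regional summary output
region_summary <- data.frame(
  region = c("northeast", "northwest", "southeast", "southwest"),
  mean_charges = c(13406, 12418, 14735, 12347),
  median_charges = c(10058, 8966, 9294, 8799)
)

# Gap between mean and median: a positive gap is consistent with right skew
region_summary$gap <- region_summary$mean_charges - region_summary$median_charges

# Region with the largest mean-median gap
largest_gap <- region_summary$region[which.max(region_summary$gap)]

print(region_summary)
print(largest_gap)
```

With these rounded values every region has a positive gap, and the southeast has the largest one (about $5,400), which is consistent with its charges extending further into the upper range.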
This leads us to investigate how other demographic factors (i.e., smoking status and BMI) affect charges. Since we are interested in whether these factors may be an indicator of higher charges, we used linear regression.
Before constructing these linear models, we split the data into separate sub-datasets. The first sorted the numerical BMI into distinct weight categories: underweight (0-18.5), normal weight (18.5-25), overweight (25-30), and obese (30+). These weight categories were then grouped by region, with the median, minimum, and maximum charges:
weight_group <- cut(kaggle$bmi, c(0, 18.5, 25, 30, Inf), c("Underweight", "Normal", "Overweight", "Obese"), include.lowest=TRUE)
kaggle <- kaggle%>%
mutate(weight_group)
bmi_region <- kaggle%>%
select(weight_group, region, charges)%>%
group_by(region, weight_group)%>%
summarise(median_charges = median(charges, na.rm = TRUE),
min_charges = min(charges, na.rm = TRUE),
max_charges = max(charges, na.rm = TRUE),
count = n())
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
bmi_region
## # A tibble: 15 × 6
## # Groups: region [4]
## region weight_group median_charges min_charges max_charges count
## <chr> <fct> <dbl> <dbl> <dbl> <int>
## 1 northeast Underweight 9206. 1695. 15007. 10
## 2 northeast Normal 8689. 1702. 35069. 73
## 3 northeast Overweight 9224. 1708. 35148. 98
## 4 northeast Obese 11512. 1984. 58571. 143
## 5 northwest Underweight 5117. 1621. 32734. 7
## 6 northwest Normal 8017. 1625. 30167. 63
## 7 northwest Overweight 9249. 1632. 32787. 107
## 8 northwest Obese 9379. 1640. 60021. 148
## 9 southeast Normal 11834. 1122. 27118. 41
## 10 southeast Overweight 8226. 1616. 38246. 80
## 11 southeast Obese 9386. 1132. 63770. 243
## 12 southwest Underweight 3676. 1728. 19023. 4
## 13 southwest Normal 5080. 1242. 26237. 49
## 14 southwest Overweight 8782. 1252. 37830. 101
## 15 southwest Obese 9630. 1256. 52591. 171
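Because `cut()` uses right-closed intervals by default, a BMI that falls exactly on a break is assigned to the lower category. The sketch below applies the same breaks and labels to a few made-up boundary values (the vector `bmi_example` is hypothetical, not taken from the data). It contrasts the result with `right = FALSE`, which is shown only for comparison and is not what the analysis above used.

```r
# Same breaks and labels as the weight_group definition
breaks <- c(0, 18.5, 25, 30, Inf)
labels <- c("Underweight", "Normal", "Overweight", "Obese")

# Hypothetical BMI values sitting on or just past the breaks
bmi_example <- c(18.5, 25, 30, 30.1)

# Default behaviour: intervals are closed on the right, e.g. (18.5, 25]
default_groups <- cut(bmi_example, breaks, labels, include.lowest = TRUE)

# Alternative: intervals closed on the left, e.g. [18.5, 25)
left_closed_groups <- cut(bmi_example, breaks, labels,
                          include.lowest = TRUE, right = FALSE)

print(data.frame(bmi_example, default_groups, left_closed_groups))
```

Under the default, a BMI of exactly 25 is labeled "Normal" and exactly 30 is labeled "Overweight"; with `right = FALSE` they become "Overweight" and "Obese". Only values that sit exactly on a break are affected.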
The second sub-dataset groups smoking status by region, displaying median, minimum, and maximum charges for each combination of region and smoking status:
smoker_region <- kaggle%>%
select(smoker, region, charges)%>%
group_by(region, smoker)%>%
summarise(
median_charges = median(charges, na.rm = TRUE),
min_charges = min(charges, na.rm = TRUE),
max_charges = max(charges, na.rm = TRUE),
count = n())
## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.
smoker_region
## # A tibble: 8 × 6
## # Groups: region [4]
## region smoker median_charges min_charges max_charges count
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 northeast no 8343. 1695. 32109. 257
## 2 northeast yes 28101. 12829. 58571. 67
## 3 northwest no 7257. 1621. 33472. 267
## 4 northwest yes 27489. 14712. 60021. 58
## 5 southeast no 6653. 1122. 36580. 273
## 6 southeast yes 37484. 16578. 63770. 91
## 7 southwest no 7348. 1242. 36911. 267
## 8 southwest yes 35165. 13845. 52591. 58
To understand how BMI varied across regions, we constructed a bar chart. Proportionally, the southeast region has the largest share of people categorized as obese:
# bar chart with bmi
ggplot(bmi_region, aes(x = region, y= count, fill = weight_group)) +
geom_col(position = position_dodge())+
# labels
labs(title = "BMI groups by Region",
x = "Region",
y = "Count")+
geom_text(aes(label= count), size = 3, vjust = .9, position = position_dodge(.9))
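Since the bar chart shows raw counts, the proportional claim can be checked by dividing each region's obese count by its regional total. The following sketch copies the counts by hand from the `bmi_region` and regional summary outputs above. The vector names `obese_count`, `region_total`, and `obese_share` are ours, for illustration only.

```r
# Obese counts per region, copied from the bmi_region output
obese_count <- c(northeast = 143, northwest = 148, southeast = 243, southwest = 171)

# Total sample size per region, copied from the regional summary output
region_total <- c(northeast = 324, northwest = 325, southeast = 364, southwest = 325)

# Percentage of each region's sample categorized as obese
obese_share <- round(100 * obese_count / region_total, 1)
print(obese_share)
```

On these counts, roughly two thirds of the southeast sample falls in the obese category, compared with about 44 to 53 percent in the other regions.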
In this linear regression model, we observe that age shares a positive linear relationship with health insurance charges: as age increases, the insurance charges incurred by the individual also increase. By coloring points by weight group, we also observe that across ages, those with the highest charges were largely in the obese category:
# regression model with bmi
ggplot(kaggle, aes(x = age,
y = charges,
col = weight_group,
fill = weight_group))+
geom_point(position = "jitter", alpha = 0.4) +
geom_smooth(method="lm") +
stat_regline_equation()+
# labels
labs(title = "Charges by BMI Across Ages",
x = "Age",
y = "Charges ($)")
## `geom_smooth()` using formula = 'y ~ x'
Because the southeast sample had proportionally the most individuals categorized as obese, the higher costs experienced by some people in the southeast can be attributed, at least in part, to BMI. Overall, many of those in the obese category incurred the highest charges in this sample, across regions.
To understand how tobacco usage varied across regions, we constructed a bar chart. We notice that the southeast region has the largest number of people with a history of smoking:
# bar chart
ggplot(smoker_region, aes(x = region, y= count, fill = smoker)) +
geom_col(position = position_dodge())+
# labels
labs(title = "Smoking Status by Region",
x = "Region",
y = "Count")+
geom_text(aes(label= count), size = 3, vjust = .9, position = position_dodge(.9))
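To put the smoker counts on a comparable footing across regions of different sizes, they can be converted to percentages of each region's sample. This sketch copies the counts by hand from the `smoker_region` output above. The vector names `smoker_count`, `region_total`, and `smoker_share` are ours, for illustration only.

```r
# Smoker counts per region, copied from the smoker_region output
smoker_count <- c(northeast = 67, northwest = 58, southeast = 91, southwest = 58)

# Total sample size per region (smokers plus non-smokers)
region_total <- c(northeast = 324, northwest = 325, southeast = 364, southwest = 325)

# Percentage of each region's sample who are smokers
smoker_share <- round(100 * smoker_count / region_total, 1)
print(smoker_share)
```

On these counts, the southeast has the highest share of smokers (25 percent), against roughly 18 to 21 percent in the other regions, so it leads in proportion as well as in raw count.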
In this linear regression model, we again observe that age shares a positive linear relationship with health insurance charges: as age increases, the insurance charges incurred by the individual also increase. By coloring points by smoking status, we also observe that across ages, those with the highest charges were largely those with a history of smoking:
ggplot(kaggle, aes(x = age,
y = charges,
col = smoker,
fill = smoker))+
geom_point(position = "jitter", alpha = 0.4) +
geom_smooth(method="lm") +
stat_regline_equation()+
# labels
labs(title = "Charges by Smoking Status Across Ages",
x = "Age",
y = "Charges ($)")
## `geom_smooth()` using formula = 'y ~ x'
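As we understand it, when a grouping aesthetic such as `col = smoker` is set, `geom_smooth(method = "lm")` fits a separate straight line for each group. The sketch below illustrates the same idea with explicit `lm()` calls. It uses a small synthetic, noise-free data set (`toy`) whose intercepts and slope are invented for illustration, so the numbers are not estimates from the Kaggle sample.

```r
# Synthetic, noise-free data (NOT the Kaggle sample): charges rise linearly
# with age, with a higher made-up intercept for smokers
toy <- data.frame(
  age = rep(seq(20, 60, by = 10), times = 2),
  smoker = rep(c("no", "yes"), each = 5)
)
toy$charges <- ifelse(toy$smoker == "yes", 20000, 2000) + 250 * toy$age

# Fit one simple linear regression of charges on age per smoking group
fits <- lapply(split(toy, toy$smoker), function(d) lm(charges ~ age, data = d))

# Collect intercepts and slopes: one column per group
coefs <- sapply(fits, coef)
print(round(coefs, 2))
```

Running the same `split()` and `lm()` steps on `kaggle` in place of `toy` should give the per-group intercepts and slopes that the plotted regression equations summarize.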
Because the southeast sample had proportionally the most smokers, the higher costs experienced by some people in the southeast can be attributed, at least in part, to smoking status. Overall, those with a history of smoking incurred the highest charges in this sample, across regions.
Although a person’s location is one of the five factors insurance companies may consider when setting individual charges under the Affordable Care Act, we find that the region a person was from did not, by itself, substantially affect the health insurance charges they incurred. The Kaggle data set was published with individuals divided into four US regions (northwest, northeast, southwest, southeast), and we suspect that these regional categories are too broad (given the sample size of n = 1338 across all states). A more specific breakdown of location might have given us more precision when analyzing region, potentially revealing more of an association with charges. Had we been given the raw data, we might have considered sorting the categories as such:
Northeast
New England: Maine, New Hampshire, Vermont, Massachusetts, Rhode Island, Connecticut
Mid-Atlantic: New York, New Jersey, Pennsylvania
Northwest
Pacific Northwest: Washington, Oregon, Idaho
Sometimes included: Montana, Alaska, California
Southeast
South Atlantic: Delaware, Maryland, Virginia, West Virginia, North Carolina, South Carolina, Georgia, Florida
East South Central: Kentucky, Tennessee, Alabama, Mississippi
Southwest
West South Central: Texas, Oklahoma, Arkansas, Louisiana
Mountain Southwest: Arizona, New Mexico, Nevada, (sometimes Colorado & Utah)
However, other demographic factors such as age, BMI, and smoking status were associated with increased charges. Proportionally, more members of the southeast region were categorized as obese and/or had a history of smoking. This may explain why the range of healthcare charges incurred by people in the southeast extended much higher than in other regions.
Both obesity and smoking may predispose an individual to more medical complications. At the same time, insurance is more expensive for them. This leads to more barriers to healthcare for these more medically at-risk people.
Many states in the southeast region did not expand Medicaid under the Affordable Care Act (ACA), leaving a coverage gap for many low-income individuals who remain uninsured. Combined with the higher proportion of tobacco users and people categorized as obese, this may further increase healthcare costs for those who do have insurance.
Our report has several potential limitations, so we must make some assumptions for the relationship between demographic factors and charges observed in the data to be representative of the US population. Because our data are sourced from Kaggle, which allows users to self-publish data sets, the described method of collection is somewhat vague. The publisher describes the data set as a combination of online data sets provided by hospitals and on-site surveys of people at a hospital. Given this context, we must assume that the data come from a random sample, both for the people interviewed and for the other data sets compiled here. In addition, higher health insurance charges may be attributable to other individual or regional factors not measured here.
| Name | Contribution |
|---|---|
| Xavier Le | |
| Yuel Abraham | |
| Bryan Li | |
| Theodor Hezkial | |