This project examines the size of medical insurance charges in the U.S. and how strongly they are influenced by smoking status, BMI, age, sex, number of children, and region. The data set is fairly small and concise, so it needed little to no cleaning. The main work in the code was filtering variables and mutating the data set to produce proper visualizations. The data set comes from Medical Sphere, a data collection organization based in the UK. It was gathered by documenting insurance charges for individuals across the U.S. and then recording information about each individual, such as sex and smoking status. I chose this topic out of curiosity about the rising cost of necessities in the U.S. today, especially after recent events such as the killing of the UnitedHealthcare CEO, which may have been motivated by insurance prices being too high. Of the factors listed, I expected smoking status to contribute the most to insurance charges, since smoking causes a plethora of chronic illnesses. An article from Insurance Informant supports this, stating that “Smokers typically pay 15-20% higher insurance premiums than non-smokers with similar demographics and health conditions” (Cooper, 2025). This strengthened my expectation that smoking would be a heavy contributor.
Conclusion
Overall, this was an interesting topic to research and visualize. It gave me a better understanding of average insurance charges across the regions of the United States and of how strongly each factor affects the size of those charges. For the first visualization, I wish I could have shown both sexes and all four regions. One sex and one region were omitted despite my attempts to make them appear. For the second visualization, I wish I could have made the regions geographically interactive, with each region fully shaded rather than represented by a circle, but I could not get that to work either.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.2
setwd("C:/Users/chris/Downloads/DATA 110")
insurance <- read_csv("insurance.csv")
Rows: 1338 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): sex, smoker, region
dbl (4): age, bmi, children, charges
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
insurance |> head()
# A tibble: 6 × 7
    age sex      bmi children smoker region    charges
  <dbl> <chr>  <dbl>    <dbl> <chr>  <chr>       <dbl>
1    19 female  27.9        0 yes    southwest  16885.
2    18 male    33.8        1 no     southeast   1726.
3    28 male    33          3 no     southeast   4449.
4    33 male    22.7        0 no     northwest  21984.
5    32 male    28.9        0 no     northwest   3867.
6    31 female  25.7        0 no     southeast   3757.
Factors That Contribute the Most to Medical Insurance Charges in the U.S. (2025) (By Standard Deviation)
ggplot(coef_df, aes(x = reorder(Variable, Coefficient), y = Coefficient, fill = Coefficient)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(
    title = "Contribution of Different Variables to Medical Insurance Charges",
    subtitle = "(Smoking Status, BMI, Amount of Children, Sex, Region)",
    x = "Variable",
    y = "Effect on Charges (Standardized Coefficient)",
    caption = "Source: Medical Sphere"
  ) +
  theme_minimal()
The first visualization shows how strongly each variable affects the size of insurance charges, expressed as a standardized coefficient (in standard deviations). It shows that being a smoker contributes the most to the size of insurance charges. While I expected smoking to be a heavy contributor, I did not expect it to overwhelm every other factor to the point where its effect is at least five times larger than that of age. It is also curious that living in the southeast has a negative coefficient, meaning it is associated with lower insurance charges. This made me wonder whether people there are simply healthier than those in other regions or whether some policy in place alleviates these charges.
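The code that builds `coef_df` is not shown in this report. As a rough sketch only, one common way to get standardized coefficients is to z-score the numeric columns and fit a linear model with `lm()`. The helper name `standardized_coefs` and the choice of a plain linear model are assumptions made for illustration, not necessarily the method used for the plot above.

```r
library(dplyr)
library(tibble)

# Hypothetical helper (an assumption, not the report's own code):
# z-score the numeric columns, fit a linear model, and return one row per term.
standardized_coefs <- function(data) {
  scaled <- data |>
    mutate(across(c(age, bmi, children, charges), \(x) as.numeric(scale(x))))

  # Character columns (sex, smoker, region) are dummy-coded by lm().
  fit <- lm(charges ~ age + sex + bmi + children + smoker + region, data = scaled)

  coefs <- coef(fit)
  tibble(Variable = names(coefs), Coefficient = unname(coefs)) |>
    filter(Variable != "(Intercept)")  # the intercept is not a contributing factor
}
```

Under these assumptions, `coef_df <- standardized_coefs(insurance)` would produce the `Variable` and `Coefficient` columns the plot expects. With this kind of model, `lm()` treats the first level of each categorical variable as the baseline, so one sex and one region never get their own bar. Their effect is absorbed into the intercept, and the remaining levels are measured relative to them.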
The second visualization is an interactive map that represents each region of the U.S. and shows which regions have the highest and lowest smoker rates. Other factors can be seen by clicking on each region's dot. The southeast has the highest smoker rate of all the regions and also the highest insurance charges, which backs the incredible role smoking plays in insurance charges. What surprises me is that the previous visualization shows that living in the southeast is associated with lower charges, yet the region still has the highest charges and, unlike what I first believed, an unhealthy population. One possible explanation is that the first visualization measures each factor with the others held constant, so the region's higher smoker rate can still push its average charges up. The smoker rate correlation is also backed by the northeast, which has the second highest smoker rate and the second highest insurance charges. Average BMI seems to play a slight role, but this is contested because the southwest has a higher average BMI than the northeast and yet lower insurance charges. The same can be said for average age.
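The map code is not included in this report, so the following is only a minimal sketch of how a map like the one described could be put together with `leaflet`. The helper names `summarise_regions` and `region_map` are hypothetical, and the coordinates in `region_centers` are rough guesses chosen for illustration, since the data set contains no locations.

```r
library(dplyr)
library(tibble)
library(leaflet)

# Hypothetical helper: one row per region with its smoker rate and averages.
summarise_regions <- function(data) {
  data |>
    group_by(region) |>
    summarise(
      smoker_rate = mean(smoker == "yes"),
      avg_charges = mean(charges),
      avg_bmi = mean(bmi),
      avg_age = mean(age),
      .groups = "drop"
    )
}

# Assumed, approximate marker positions (not part of the data set).
region_centers <- tribble(
  ~region,     ~lat, ~lng,
  "northeast", 42.5,  -74.0,
  "northwest", 45.5, -117.0,
  "southeast", 33.0,  -84.0,
  "southwest", 34.5, -110.0
)

# Hypothetical helper: one circle per region, sized by smoker rate,
# with the other summary values shown in a popup on click.
region_map <- function(region_summary) {
  region_summary |>
    left_join(region_centers, by = "region") |>
    leaflet() |>
    addTiles() |>
    addCircleMarkers(
      lng = ~lng, lat = ~lat,
      radius = ~ 100 * smoker_rate,
      popup = ~ paste0(
        "<b>", region, "</b>",
        "<br>Smoker rate: ", round(100 * smoker_rate, 1), "%",
        "<br>Average charges: $", round(avg_charges),
        "<br>Average BMI: ", round(avg_bmi, 1),
        "<br>Average age: ", round(avg_age, 1)
      )
    )
}
```

With the data loaded as above, `insurance |> summarise_regions() |> region_map()` would draw the map. Sizing the circles by smoker rate keeps the comparison between regions visible before any dot is clicked.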
Works Cited:
Cooper, A. (2025, September 18). How Does Smoking Affect Your Health Insurance? Insurance Informant. https://insuranceinformant.com/how-does-smoking-affect-your-health-insurance.html