Medical Insurance Charges in the United States

Exploratory Data Analysis Report

Author: Adrian Julius Aluoch
Affiliation: Anywhere Data Consultancy
Published: September 16, 2024
Modified: September 20, 2024

Background


Medical insurance costs in the United States have become a focal point of discussion due to their significant impact on the financial stability of individuals and families, as well as on the broader healthcare system. These costs are not uniform but vary widely based on a complex interplay of factors, including age, gender, body mass index (BMI), smoking habits, the number of dependents, and geographical region. This variation can influence the accessibility and affordability of healthcare, making it a crucial area of study for both consumers and policymakers.

Existing research has shed light on certain aspects of this issue, particularly highlighting that behaviors such as smoking and physical conditions like a high BMI are commonly associated with increased healthcare costs. Smokers, for instance, often face higher premiums due to the higher risk of smoking-related health issues. Similarly, individuals with a high BMI are typically subject to higher charges due to the associated risks of chronic conditions. Despite these findings, the interactions between these factors and their relative contributions to overall insurance costs are not yet fully understood.

This gap in understanding poses a significant challenge for those involved in designing insurance pricing strategies, from policymakers to insurers and healthcare providers. Without a clear grasp of how these factors interact and impact insurance costs, it is difficult to create fair and effective pricing models. Addressing this challenge requires a deeper analysis to identify how different variables influence each other and contribute to the overall cost structure, ultimately helping to develop more equitable and efficient healthcare solutions.

Introduction


To address the challenge of understanding medical insurance charges, this report undertakes a thorough analysis of insurance costs across the United States, utilizing a comprehensive dataset that includes critical variables such as age, gender, body mass index (BMI), smoking status, number of children, and geographical region. This dataset provides a broad view of the factors that influence insurance pricing, allowing us to explore how these elements contribute to the overall cost structure.

Our analysis begins with an Exploratory Data Analysis (EDA), a methodical approach designed to uncover patterns and trends within the data. By scrutinizing the relationships between variables such as age, BMI, smoking status, and regional differences, we aim to identify significant patterns that can shed light on how these factors interplay. This exploration is crucial for understanding the underlying dynamics that drive insurance costs and for identifying any notable trends or anomalies in the data.

The insights gained from this initial analysis will be instrumental in forming a clearer picture of the factors influencing insurance pricing. By summarizing and visualizing the data, we aim to lay a solid foundation for future research and decision-making. This analysis will not only enhance our understanding of current cost drivers but also provide valuable information for strategic development and policy formulation, paving the way for more effective and equitable insurance pricing strategies.

Analysis


In this section, we delve into the dataset to uncover its structure and key characteristics, laying the groundwork for a more detailed exploration. Our journey begins with profiling the data, offering us a comprehensive view of its overall composition. This initial step reveals how different variables are distributed and highlights any notable features or potential issues within the dataset.

Once we have a clear understanding of the data’s structure, we shift our focus to the distribution of various variables. We start by examining numerical variables like Body Mass Index (BMI) and age, seeking to understand their range and central tendencies. This is followed by analyzing categorical variables, such as smoking status and region, to identify prevalent patterns and frequencies. For example, by looking at how BMI varies across genders and the distribution of smokers versus non-smokers, we gain insights into how these factors might impact health metrics and, ultimately, insurance costs.

The final stage of our analysis investigates the correlations between variables visually. We employ heatmaps to illustrate the strength and direction of the relationships between insurance charges and both numerical factors, such as BMI and age, and categorical factors like smoking status. Scatterplots further help us visualize these relationships, revealing significant trends and clusters. This correlation analysis is crucial for understanding how different factors interact and influence insurance charges, providing us with a deeper insight into the primary drivers of healthcare costs.

1. Profiling

| Variable | Stats / Values | Freqs (% of Valid) | Missing |
|---|---|---|---|
| age [numeric] | Mean (sd): 39.1 (14.1); min ≤ med ≤ max: 18 ≤ 39 ≤ 64; IQR (CV): 25 (0.4) | 47 distinct values | 0 (0.0%) |
| sex [factor] | 1. Male; 2. Female | 1406 (50.7%); 1366 (49.3%) | 0 (0.0%) |
| bmi [numeric] | Mean (sd): 30.7 (6.1); min ≤ med ≤ max: 16 ≤ 30.4 ≤ 53.1; IQR (CV): 8.6 (0.2) | 548 distinct values | 0 (0.0%) |
| children [numeric] | Mean (sd): 1.1 (1.2); min ≤ med ≤ max: 0 ≤ 1 ≤ 5; IQR (CV): 2 (1.1) | 0: 1186 (42.8%); 1: 672 (24.2%); 2: 496 (17.9%); 3: 324 (11.7%); 4: 52 (1.9%); 5: 42 (1.5%) | 0 (0.0%) |
| smoker [factor] | 1. Smoker; 2. Non-Smoker | 564 (20.3%); 2208 (79.7%) | 0 (0.0%) |
| region [factor] | 1. North East; 2. South East; 3. South West; 4. North West | 658 (23.7%); 766 (27.6%); 684 (24.7%); 664 (24.0%) | 0 (0.0%) |
| charges [numeric] | Mean (sd): 13261.4 (12151.8); min ≤ med ≤ max: 1121.9 ≤ 9333 ≤ 63770.4; IQR (CV): 11890 (0.9) | 1337 distinct values | 0 (0.0%) |

Figure 1. Summary of the data including variable names & types, summary statistics, valid/missing observations, counts & proportions and statistical plots.

The dataset contains 2,772 records and covers a fairly balanced demographic: ages range from 18 to 64, with a mean of 39.1 and a median of 39. The gender split is almost equal, with 50.7% male and 49.3% female. BMI averages 30.7 with a median of 30.4, both just above the conventional obesity threshold of 30, so at least half of the individuals fall in the obese range rather than merely the overweight range (25 to 30).

Interestingly, the largest single group of participants (42.8%) reports having no children, and a large majority (79.7%) are non-smokers. The individuals are spread fairly evenly across the four regions: the South East is slightly over-represented at 27.6%, while the other three regions each account for between 23.7% and 24.7%.

The charges, which represent medical costs, show considerable variability: the mean is $13,261 with a standard deviation of $12,152, and the median of $9,333 sits well below the mean, an early sign of right skew. Importantly, the dataset is complete with no missing values across all variables.
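The summary statistics reported in Figure 1 can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; it is not the tool that produced Figure 1, the function names `profile_numeric` and `profile_categorical` are hypothetical, and the toy columns are made up purely for illustration. Quartile conventions differ between tools, so the IQR may not match another package exactly.

```python
import statistics
from collections import Counter


def profile_numeric(values):
    """Summarise a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)  # sample standard deviation
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    return {
        "mean": mean,
        "sd": sd,
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
        "iqr": q3 - q1,
        "cv": sd / mean,  # coefficient of variation
        "distinct": len(set(present)),
        "missing": len(values) - len(present),
    }


def profile_categorical(values):
    """Return {level: (count, percent of valid)} for a categorical column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {level: (n, 100 * n / len(present)) for level, n in counts.items()}


# Made-up toy columns, used only to exercise the functions.
toy_age = [18, 25, 39, 52, 64]
toy_smoker = ["Smoker", "Non-Smoker", "Non-Smoker", "Non-Smoker", "Non-Smoker"]

print(profile_numeric(toy_age))
print(profile_categorical(toy_smoker))
```

Comparing the number of distinct values with the number of rows is also a quick way to spot repeated records in a continuous variable such as charges.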

2. Distribution

Figure 2. QQ-plot visualizing the distribution of numerical variables.

The QQ-plot shows that while age and BMI mostly follow a normal distribution, there are noticeable deviations at the extremes. For age, the deviations appear at both ends of the range, which is consistent with the variable being bounded between 18 and 64 and may reflect that truncation rather than genuine outliers. For BMI, the deviations are concentrated among the higher values, suggesting a small number of individuals whose BMI lies well above that of the rest of the population.

The number of children, however, shows a stepped pattern because it is a count variable that takes only the integer values 0 through 5, with no extreme outliers.

The most striking pattern is seen in the charges, where the distribution is heavily skewed, with a few individuals incurring significantly higher medical costs than others. This suggests that while most charges are concentrated at lower values, a small portion of the population faces disproportionately high expenses.
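To make the reading of a QQ-plot concrete, the sketch below computes the points such a plot would display and a simple skewness coefficient. It assumes standardised observations compared against standard normal quantiles with plotting positions (i - 0.5) / n; other conventions exist, and the report does not state which one Figure 2 uses. The function names and the toy values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (theoretical, observed) pairs for a normal QQ-plot."""
    ordered = sorted(values)
    n = len(ordered)
    mu, sigma = mean(ordered), stdev(ordered)
    # Standardise so that points from a normal sample fall near the line y = x.
    observed = [(x - mu) / sigma for x in ordered]
    # Plotting positions (i - 0.5) / n, one common convention among several.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


def sample_skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Made-up toy samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

print(qq_points(symmetric))
print(sample_skewness(symmetric), sample_skewness(right_skewed))
```

A clearly positive skewness, together with points that bend above the reference line in the upper tail, corresponds to the pattern described for charges.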


Figure 3. Boxplot visualizing the distribution of age across gender for both smokers and non-smokers.

The plot illustrates the distribution of age between smokers and non-smokers across genders. For males, smokers and non-smokers have a similar age range, but non-smokers display a slightly wider distribution, indicating more variation in ages, with non-smokers generally skewing slightly older.

The median age for male smokers and non-smokers appears quite close, suggesting that smoking status doesn’t create a significant difference in age for men. On the other hand, among females, the age distribution shows that non-smokers have a much wider range compared to smokers.

The median age for female non-smokers is noticeably higher, suggesting that women who do not smoke tend to be older compared to their smoking counterparts.

The overall story indicates a more pronounced age difference in smoking behavior among women than men, with female non-smokers tending to be older, while the age difference is minimal among men.


Figure 4. Boxplot visualizing the distribution of body mass index across gender for both smokers and non-smokers.

The data reveals an interesting relationship between smoking status, gender, and Body Mass Index (BMI). For males, smokers tend to have a slightly higher median BMI compared to non-smokers, though non-smokers show a broader range of BMI values, including more extreme outliers. In contrast, among females, non-smokers actually have a higher median BMI than smokers, with the non-smoking group showing more variation, including a number of individuals with significantly higher BMIs.

The difference in trends between the two genders suggests that the association between smoking and BMI may differ for men and women: male smokers generally have a higher BMI, while among women it is the non-smokers who tend to have higher values. Because this is a descriptive comparison of medians, the subtle shift should be read as an association rather than as evidence that smoking changes BMI.


Figure 5. Boxplot visualizing the distribution of insurance charge across gender for both smokers and non-smokers.

The plot on insurance charges across gender and smoking status highlights a stark contrast between smokers and non-smokers. For both men and women, smokers tend to incur significantly higher insurance charges than their non-smoking counterparts.

Among males, smokers experience a much wider range of charges, with the median hovering around $40,000, while non-smokers remain consistently lower, with a median just below $10,000. The same pattern holds for females: the spread of charges for female smokers appears slightly narrower than for male smokers, but it still far exceeds that of non-smoking females, whose median charge remains below $10,000.

Interestingly, non-smokers, both male and female, exhibit comparatively little variability in their charges, while smokers demonstrate far more variability, especially among males. This suggests that smoking not only increases insurance costs significantly but also introduces more unpredictability in the amount charged, particularly for men.
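The quantities a boxplot displays (quartiles, whiskers, and outliers) can be computed per group as follows. This is a minimal sketch that assumes the common 1.5 × IQR rule for whiskers and the "inclusive" quartile method of Python's `statistics.quantiles`; plotting libraries may use slightly different quartile definitions. The helper names and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def boxplot_stats(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }


def grouped_stats(rows, keys, value):
    """Apply boxplot_stats to `value` within each combination of `keys`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return {group: boxplot_stats(vals) for group, vals in groups.items()}


# Made-up toy rows; the charges are illustrative, not taken from the dataset.
toy_rows = [
    {"sex": "Male", "smoker": "Smoker", "charges": c} for c in [1, 2, 3, 4, 100]
] + [
    {"sex": "Male", "smoker": "Non-Smoker", "charges": c} for c in [1, 2, 3, 4, 5]
]

for group, stats in grouped_stats(toy_rows, ["sex", "smoker"], "charges").items():
    print(group, stats)
```

Grouping by `["region", "smoker"]` instead of `["sex", "smoker"]` gives the per-region breakdown in the same way.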


Figure 6. Boxplot visualizing the distribution of insurance charge across regions for both smokers and non-smokers.

The box plot shows a clear disparity in insurance charges between smokers and non-smokers across four regions: North East, South East, South West, and North West. In all regions, smokers face significantly higher charges compared to non-smokers. For smokers, the median insurance charge is consistently higher, ranging from around $30,000 to $40,000, with the highest charges observed in the South East region.

On the other hand, non-smokers have a much lower median charge, typically around $10,000 or less. There are also noticeable outliers in the non-smoker groups across all regions, but they do not affect the overall trend that non-smokers are charged less. The spread of charges for smokers tends to be larger, indicating greater variability in the costs they incur, whereas non-smokers experience less variability.

Overall, the plot highlights the significant impact of smoking status on insurance charges across all regions, with smokers consistently facing higher costs.

3. Correlation

Figure 7. Correlation heatmap visualizing the relationship between numerical variables.

The correlation heatmap reveals how various numerical factors relate to insurance charges. Notably, age has the strongest correlation with charges among the numerical variables, a modest positive correlation of 0.3, suggesting that as individuals get older, their insurance costs tend to rise.

BMI has a smaller but still positive correlation of 0.2, indicating that higher body mass index is also associated with higher charges, though the relationship is not as strong as with age. Interestingly, the number of children a person has shows almost no correlation with insurance charges, with a value near zero. This implies that the number of dependents has little linear association with the cost of insurance.

Overall, among the numerical variables, age and BMI show the strongest association with charges, while the number of children seems to play a negligible role in determining insurance costs.
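The values in a correlation heatmap are pairwise correlation coefficients. The sketch below assumes Pearson's coefficient, which the report does not state explicitly, and builds the full matrix for a set of named columns. The function names and the toy columns are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return a nested dict {a: {b: r_ab}} for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up toy columns, chosen so the signs of the correlations are obvious.
toy_columns = {
    "age": [20, 30, 40, 50, 60],
    "bmi": [31, 24, 35, 27, 30],
    "charges": [2000, 3500, 4200, 6100, 7000],
}

matrix = correlation_matrix(toy_columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Pearson's coefficient measures only linear association, so a value near zero does not by itself rule out a non-linear relationship.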


Figure 8. Correlation heatmap visualizing the relationship between categorical variables and the target variable insurance charges.

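The correlation heatmap of categorical variables highlights several important relationships in the data. The strongest correlation is between smoking and insurance charges: 0.79 for the smoker indicator and -0.79 for the non-smoker indicator. The two values mirror each other by construction, because the two indicators are complementary (each person is exactly one or the other), so they describe a single relationship rather than two separate findings.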

This indicates that smoking status is strongly associated with higher insurance costs. In contrast, the regions show very weak correlations with charges, indicating that where someone lives does not substantially influence their insurance costs. Additionally, there is little to no correlation between gender and charges, as seen by the nearly zero values for both males and females.

This suggests that insurance costs are not heavily influenced by gender. Overall, smoking status emerges as the most influential factor among the categorical variables, strongly impacting insurance charges, while region and gender seem to play much smaller roles.
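Correlating a two-level categorical variable with a numerical one amounts to a point-biserial correlation, which equals Pearson's coefficient computed on a 0/1 indicator. The sketch below assumes that this is how the categorical heatmap was built, which the report does not confirm; it also shows why the coefficients for a level and its complement are equal in magnitude and opposite in sign. The function name and toy values are hypothetical.

```python
from math import sqrt


def point_biserial(values, flags):
    """Correlation between a numeric sequence and a boolean indicator.

    Uses r = (M1 - M0) / s_n * sqrt(n1 * n0 / n**2), where s_n is the
    population standard deviation of all values.
    """
    n = len(values)
    in_group = [v for v, f in zip(values, flags) if f]
    out_group = [v for v, f in zip(values, flags) if not f]
    mean_all = sum(values) / n
    s_n = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_in = sum(in_group) / len(in_group)
    mean_out = sum(out_group) / len(out_group)
    return (mean_in - mean_out) / s_n * sqrt(len(in_group) * len(out_group) / n ** 2)


# Made-up toy data: two non-smokers followed by two smokers.
toy_charges = [10, 12, 30, 35]
is_smoker = [False, False, True, True]
is_non_smoker = [not f for f in is_smoker]

r_smoker = point_biserial(toy_charges, is_smoker)
r_non_smoker = point_biserial(toy_charges, is_non_smoker)
print(round(r_smoker, 3), round(r_non_smoker, 3))
```

For variables with more than two levels, such as region, each level gets its own indicator, and the resulting coefficients are no longer simple mirror images of one another.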


Figure 9. Scatterplot visualizing the relationship between age and insurance charge for both smokers and non-smokers.

The plot illustrates the relationship between age and insurance charges for both smokers and non-smokers. As age increases, there is a clear upward trend in charges for both groups, but smokers consistently face significantly higher charges compared to non-smokers.

The plot shows that smokers, even at younger ages, are paying more than their non-smoking counterparts, and this gap widens as they age. The lines of best fit for each group emphasize this disparity, with the slope for smokers rising more steeply, reflecting the higher financial burden placed on them due to their smoking habits.

Non-smokers experience a more moderate increase in charges as they age, staying substantially lower across the board.
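The lines of best fit in a grouped scatterplot can be obtained by fitting a separate least-squares line within each group. The following sketch assumes ordinary least squares with a single predictor; the report does not specify the fitting method behind Figure 9. The helper names and the toy rows, which are constructed so that each group lies exactly on a line, are hypothetical.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_by_group(rows, group_key, x_key, y_key):
    """Fit one line per level of `group_key`; returns {level: (slope, intercept)}."""
    xs, ys = defaultdict(list), defaultdict(list)
    for row in rows:
        xs[row[group_key]].append(row[x_key])
        ys[row[group_key]].append(row[y_key])
    return {level: fit_line(xs[level], ys[level]) for level in xs}


# Made-up toy rows: non-smokers follow 100 * age, smokers 1000 + 300 * age.
toy_rows = [
    {"smoker": "Non-Smoker", "age": a, "charges": 100 * a} for a in (20, 40, 60)
] + [
    {"smoker": "Smoker", "age": a, "charges": 1000 + 300 * a} for a in (20, 40, 60)
]

fits = fit_by_group(toy_rows, "smoker", "age", "charges")
for level, (slope, intercept) in fits.items():
    print(level, round(slope, 2), round(intercept, 2))
```

Comparing the fitted slopes across groups is a direct way to check whether the gap between the groups widens with age.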


Figure 10. Scatterplot visualizing the relationship between body mass index and insurance charge for both smokers and non-smokers.

The relationship between Body Mass Index (BMI) and insurance charges reveals stark differences between smokers and non-smokers. As BMI increases, smokers see a dramatic rise in insurance costs, with a strong positive correlation between BMI and charges.

Non-smokers, on the other hand, experience relatively stable insurance costs regardless of BMI, as their charges remain much lower and show only a slight upward trend. Smokers with higher BMIs are particularly burdened, with the steep incline in charges becoming more pronounced as BMI exceeds 30.

The gap between smokers and non-smokers widens as BMI increases, underscoring the significant financial impact of smoking, especially when combined with higher BMI levels.
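One simple way to quantify this kind of interaction is to compare mean charges across the four cells formed by smoking status and a BMI cut-off. The sketch below assumes a cut-off of 30 and uses a difference of differences as a rough interaction measure; this is an illustrative choice, not a method used in the report. The function names and toy rows are hypothetical.

```python
from collections import defaultdict


def mean_by_cell(rows, threshold=30.0):
    """Mean charges keyed by (smoker, bmi >= threshold)."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["smoker"], row["bmi"] >= threshold)].append(row["charges"])
    return {cell: sum(vals) / len(vals) for cell, vals in cells.items()}


def interaction_effect(means):
    """Extra charge linked to high BMI among smokers, beyond that among non-smokers."""
    smoker_gap = means[("Smoker", True)] - means[("Smoker", False)]
    non_smoker_gap = means[("Non-Smoker", True)] - means[("Non-Smoker", False)]
    return smoker_gap - non_smoker_gap


# Made-up toy rows, for illustration only.
toy_rows = [
    {"smoker": "Smoker", "bmi": 24.0, "charges": 20000},
    {"smoker": "Smoker", "bmi": 27.5, "charges": 22000},
    {"smoker": "Smoker", "bmi": 31.0, "charges": 40000},
    {"smoker": "Smoker", "bmi": 36.2, "charges": 42000},
    {"smoker": "Non-Smoker", "bmi": 23.1, "charges": 5000},
    {"smoker": "Non-Smoker", "bmi": 28.4, "charges": 7000},
    {"smoker": "Non-Smoker", "bmi": 30.0, "charges": 6000},
    {"smoker": "Non-Smoker", "bmi": 33.3, "charges": 8000},
]

means = mean_by_cell(toy_rows)
print(means)
print(interaction_effect(means))
```

A value near zero would indicate that high BMI is associated with the same increase in charges for both groups, while a large positive value corresponds to the widening gap described above.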

Conclusion


As we unravel the story behind medical insurance charges in the United States, a vivid picture emerges, illustrating the intricate interplay of demographics, lifestyle choices, and regional factors. Our analysis reveals a diverse population with an average age of 39 years and a nearly even gender distribution. The average BMI of 30.7 sits just above the conventional obesity threshold of 30, potentially impacting individuals' health and insurance costs. A large majority of participants (79.7%) are non-smokers, and the dataset is spread fairly evenly across regions, with no pronounced regional bias.

Delving deeper into the data, we observe that the distribution of insurance charges is far from uniform. Most individuals experience lower charges, but a small segment incurs significantly higher costs, pointing to a skewed distribution. The QQ-plot highlights deviations at the extremes, particularly among older individuals and those with higher BMI values, suggesting the presence of outliers that influence overall trends. The number of children, while showing a stepped distribution, does not exhibit extreme outliers or a strong correlation with insurance charges.

The most striking revelations come from the relationship between smoking status and insurance costs. Smokers, both male and female, face considerably higher charges compared to their non-smoking counterparts. For men, smokers face a median charge of around $40,000, while non-smokers have a median charge just below $10,000. Female smokers also experience higher charges, though with slightly less variability compared to their male peers. This disparity underscores the profound financial impact of smoking, which not only elevates insurance costs but also introduces greater variability, particularly among men.

Further analysis highlights that, among the numerical variables, age and BMI are most strongly associated with insurance charges, although both correlations (0.3 and 0.2) are far weaker than the 0.79 observed for smoking status. Interestingly, the number of children has little impact on insurance costs. The interaction between BMI and smoking status shows that while smokers experience a dramatic rise in charges as their BMI increases, non-smokers’ costs remain relatively stable. This difference becomes more pronounced as BMI exceeds 30, emphasizing how smoking exacerbates the financial burden associated with higher BMI.

Overall, our findings paint a clear picture of how lifestyle choices, particularly smoking, and personal attributes like age and BMI, influence medical insurance costs. This nuanced understanding provides a foundation for more targeted interventions and strategic decisions aimed at addressing the high costs associated with certain health behaviors and demographic factors.