Health Risk and Medical Insurance Charges

ECON 465 - Stage 3 Regression Presentation

Ceren Muratsu

2026-06-04

1. Economic Question

Research question

How do observable health-risk characteristics, especially smoking status and BMI (Body Mass Index), affect individual medical insurance charges?

Motivation

  • Insurance charges represent the financial cost of health risk.
  • Smoking and BMI are observable health-risk factors.
  • Understanding these risks can help insurers, businesses, and policymakers evaluate expected medical costs.

Main hypothesis

Smoking is associated with higher insurance charges, and the effect of BMI on charges is stronger among smokers.

2. Dataset and Variables

Dataset selected for Stage 3: Medical Insurance Charges dataset

Item              Description
Source            Kaggle - Medical Cost Personal Datasets
Observations      1,338
Outcome variable  charges
Main predictors   age, bmi, smoker
Controls          children, sex, region
Model type        Linear regression

Why this dataset?

The outcome, charges, is continuous and measured in dollars, so linear regression coefficients can be read directly as dollar changes in charges.

3. Key Distribution Findings

Measure           Approximate value
Mean charges      USD 13,270
Median charges    USD 9,382
Maximum charges   USD 63,770
Share of smokers  20.5 percent

Interpretation

The mean is well above the median, which indicates that charges are right-skewed. A relatively small high-cost group pulls the average up, so identifying the risk factors behind that group is economically important.

4. Descriptive Evidence: Smoking

Smoking status  Mean charges      Median charges
Non-smoker      about USD 8,434   about USD 7,345
Smoker          about USD 32,050  about USD 34,456

Main descriptive result

Smokers have much higher insurance charges than non-smokers.

Why regression is needed

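This raw comparison does not control for age, BMI, children, sex, or region, so part of the gap could reflect differences in those characteristics rather than smoking itself.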

5. Regression Models

Model    Formula                                                              Purpose
Model 1  charges ~ age + bmi + smoker                                         Baseline health-risk model
Model 2  charges ~ age + bmi + smoker + children + sex + region + bmi:smoker  Controls plus BMI-smoking interaction

Why include bmi:smoker?

It tests whether BMI has a different cost effect for smokers than for non-smokers.
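Written out as an equation, Model 2 can be sketched as follows. The notation is an assumption added for illustration: the beta symbols, the vector X_i (standing for the children, sex, and region controls), and the error term do not appear in the original slides.

```latex
\text{charges}_i = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{bmi}_i
  + \beta_3\,\text{smoker}_i + \beta_4\,(\text{bmi}_i \times \text{smoker}_i)
  + \gamma^{\top} X_i + \varepsilon_i

\frac{\partial\,\text{charges}_i}{\partial\,\text{bmi}_i} = \beta_2 + \beta_4\,\text{smoker}_i
```

With smoker coded 0 or 1, the BMI slope is \beta_2 for non-smokers and \beta_2 + \beta_4 for smokers, so a nonzero \beta_4 means BMI is priced differently in the two groups.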

6. Model Comparison and Final Choice

Model                 Test RMSE        Test R-squared
Model 1: Baseline     about USD 6,253  about 0.73
Model 2: Final model  about USD 4,758  about 0.84

Selected model: Model 2. It has the lower test RMSE and the higher test R-squared, and the BMI-smoking interaction gives it a clearer economic interpretation.
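For reference, the two test-set metrics used in this comparison can be computed as in the minimal sketch below. It uses only the Python standard library and the textbook definitions of RMSE and R-squared; the function names are illustrative and are not taken from the original analysis, which may have used a statistics package instead.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the outcome (here USD)."""
    n = len(actual)
    squared_errors = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_errors / n)

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`: 1 - SSE/SST."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1.0 - sse / sst
```

Both functions would be applied to the held-out test observations, passing the observed charges as `actual` and each model's predictions as `predicted`.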

7. Key Coefficients and P-Values

After selecting Model 2 based on test-set performance, I refitted the selected specification on the full dataset to report final coefficient estimates and p-values.

Variable      Coefficient estimate (USD)  p-value          Significant?
Age           263.62                      less than 0.001  Yes
BMI           23.53                       0.358            No
Smoker = Yes  -20,415.61                  less than 0.001  Yes
BMI x Smoker  1,443.10                    less than 0.001  Yes
Children      516.40                      less than 0.001  Yes

Note: Sex and region controls are included in the model but omitted from this table for presentation clarity.

8. Economic Interpretation

Age

Each additional year of age is associated with about USD 264 higher charges, holding other variables constant.

BMI and smoking

The BMI coefficient, which in a model with an interaction is the BMI slope for non-smokers, is not statistically significant in the selected model (p = 0.358).

However, the interaction term shows that:

For smokers, each additional BMI unit is associated with about USD 1,443 in extra charges beyond the BMI slope for non-smokers. The total BMI slope for smokers is therefore about 23.53 + 1,443.10 = USD 1,467 per BMI unit.

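Important point: The smoking coefficient should not be interpreted alone because the model includes an interaction. On its own, -20,415.61 is the estimated smoker/non-smoker difference at a BMI of zero, a value no person has. The difference at any realistic BMI combines this coefficient with the interaction term.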
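The arithmetic behind this interpretation can be sketched in a few lines of Python. The three coefficients are copied from the table in section 7; the function names and the BMI values in the loop are illustrative choices, not part of the original analysis.

```python
# Coefficients copied from the section 7 table (full-sample fit of Model 2).
B_BMI = 23.53            # BMI slope for non-smokers, USD per BMI unit
B_SMOKER = -20415.61     # smoker shift evaluated at BMI = 0, USD
B_BMI_SMOKER = 1443.10   # additional BMI slope for smokers, USD per BMI unit

def bmi_slope(smoker: bool) -> float:
    """Estimated change in charges per BMI unit for the given smoking status."""
    return B_BMI + (B_BMI_SMOKER if smoker else 0.0)

def smoker_gap(bmi: float) -> float:
    """Estimated smoker minus non-smoker difference in charges at a given BMI."""
    return B_SMOKER + B_BMI_SMOKER * bmi

def break_even_bmi() -> float:
    """BMI at which the estimated smoker gap equals zero."""
    return -B_SMOKER / B_BMI_SMOKER

if __name__ == "__main__":
    # Illustrative BMI values chosen for this sketch, not taken from the data.
    for bmi in (20, 25, 30, 35):
        print(f"BMI {bmi}: estimated smoker gap = USD {smoker_gap(bmi):,.2f}")
```

At a BMI of 30, for example, the estimated gap is -20,415.61 + 1,443.10 x 30, or about USD 22,877. The gap only reaches zero at a BMI of about 14, so across realistic BMI values the model implies higher charges for smokers despite the negative smoking coefficient.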

9. Main Result and Implication

Answer to the economic question

Observable health-risk characteristics are strongly related to medical insurance charges.

Main findings

  • Smoking status is associated with the largest differences in charges.
  • Age is positively associated with charges.
  • BMI is not statistically significant for non-smokers, but its association with charges is much stronger among smokers.

Implication: Preventive health policies targeting smoking and obesity-related risk may help reduce future medical costs.

10. Limitations and Conclusion

Limitations

  • The data are observational, so the results show association, not definite causation.
  • Omitted variables include medical history, income, insurance plan type, diet, and exercise, any of which could bias the estimates.
  • Charges are right-skewed, so linear regression may not fully capture very high-cost cases.

What I would do differently

With better data, I would include medical history and insurance plan type, and compare the linear model with a log-linear model because charges are right-skewed.
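In a log-linear model the outcome is log(charges), so coefficients are read as approximate percentage changes rather than dollar amounts. The helper below is a minimal sketch of that conversion using the standard formula 100 x (exp(beta) - 1); the function name is hypothetical, and no log-linear model has been estimated in this presentation, so no coefficient values are shown.

```python
import math

def percent_change_from_log_coef(beta: float) -> float:
    """Exact percent change in the outcome for a one-unit increase in a
    predictor, given its coefficient from a regression on log(outcome)."""
    return 100.0 * (math.exp(beta) - 1.0)
```

For small coefficients the result is close to 100 x beta, which is why log-linear coefficients are often read directly as percentages. For a large coefficient, such as one on a smoker indicator might be, the exact formula matters.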

Final conclusion

Medical insurance charges are strongly connected to observable health-risk profiles. The main result is not only that smokers have higher charges, but that BMI has a stronger association with charges among smokers.

This suggests that insurers, businesses, and policymakers should evaluate health risks jointly rather than one factor at a time.