How do observable health-risk characteristics, especially smoking status and BMI (Body Mass Index), affect individual medical insurance charges?
Motivation
Insurance charges represent the financial cost of health risk.
Smoking and BMI are observable health-risk factors.
Understanding these risks can help insurers, businesses, and policymakers evaluate expected medical costs.
Main hypothesis
Smoking is associated with higher insurance charges, and the association between BMI and charges is stronger among smokers than among non-smokers.
2. Dataset and Variables
Dataset selected for Stage 3: Medical Insurance Charges dataset
| Item | Description |
| --- | --- |
| Source | Kaggle - Medical Cost Personal Datasets |
| Observations | 1,338 |
| Outcome variable | charges |
| Main predictors | age, bmi, smoker |
| Controls | children, sex, region |
| Model type | Linear regression |
Why this dataset?
The outcome is continuous and measured in dollars, so linear regression coefficients can be read directly in dollar terms.
3. Key Distribution Findings
| Measure | Approximate value |
| --- | --- |
| Mean charges | USD 13,270 |
| Median charges | USD 9,382 |
| Maximum charges | USD 63,770 |
| Share of smokers | 20.5 percent |
Interpretation
Charges are right-skewed: the mean (USD 13,270) is well above the median (USD 9,382). A small group of high-cost individuals pulls the average up, which makes risk-factor analysis economically important.
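As a quick check on the skew claim, the sketch below uses only the rounded summary values reported in this section to show how far the mean and the maximum sit above the median. The variable names are illustrative and are not taken from the original analysis.

```python
# Rounded summary statistics reported in this section (USD).
mean_charges = 13_270
median_charges = 9_382
max_charges = 63_770

# A mean above the median is the usual sign of a right-skewed distribution.
mean_minus_median = mean_charges - median_charges
mean_over_median = mean_charges / median_charges
max_over_median = max_charges / median_charges

print(f"Mean exceeds median by USD {mean_minus_median:,}")
print(f"Mean is {mean_over_median - 1:.0%} above the median")
print(f"Maximum is {max_over_median:.1f} times the median")
```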
4. Descriptive Evidence: Smoking
| Smoking status | Mean charges | Median charges |
| --- | --- | --- |
| Non-smoker | about USD 8,434 | about USD 7,345 |
| Smoker | about USD 32,050 | about USD 34,456 |
Main descriptive result
Smokers have much higher insurance charges than non-smokers.
Why regression is needed
This comparison does not control for age, BMI, children, sex, or region.
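To put a number on the unadjusted comparison, the sketch below computes the raw smoker gap from the rounded group means reported in this section. It also checks that the two group means, weighted by the reported smoker share, roughly reproduce the overall mean of about USD 13,270. All inputs are rounded, so small discrepancies are expected.

```python
# Rounded values reported in this document (USD and share).
smoker_mean = 32_050
non_smoker_mean = 8_434
smoker_share = 0.205

# Unadjusted difference in mean charges; no controls are applied here.
raw_gap = smoker_mean - non_smoker_mean
ratio = smoker_mean / non_smoker_mean

# Consistency check: the share-weighted group means should be close
# to the overall mean reported in the distribution section.
implied_overall_mean = (
    smoker_share * smoker_mean + (1 - smoker_share) * non_smoker_mean
)

print(f"Raw gap: USD {raw_gap:,}")
print(f"Smoker mean is {ratio:.1f} times the non-smoker mean")
print(f"Implied overall mean: USD {implied_overall_mean:,.0f}")
```

This gap is purely descriptive. The regression models ask how much of it remains once age, BMI, and the other controls are held constant.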
5. Regression Models
| Model | Formula | Purpose |
| --- | --- | --- |
| Model 1 | charges ~ age + bmi + smoker | Baseline health-risk model |
| Model 2 | charges ~ age + bmi + smoker + children + sex + region + bmi:smoker | Controls plus BMI-smoking interaction |
Why include bmi:smoker?
It tests whether BMI has a different cost effect for smokers than for non-smokers.
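One way to see what the interaction term measures is to compare it with fitting a separate line for each group. In a simplified model with only bmi, smoker, and bmi:smoker (none of the other controls in Model 2), the interaction coefficient equals the difference between the smoker and non-smoker BMI slopes. The sketch below illustrates this on hypothetical, noise-free toy data; the numbers and the `fit_line` helper are assumptions for illustration and do not come from the insurance dataset.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical toy data (NOT from the insurance dataset): exact lines, no noise.
bmi = [20.0, 25.0, 30.0, 35.0]
non_smoker_charges = [5000 + 20 * b for b in bmi]
smoker_charges = [-15000 + 1400 * b for b in bmi]

intercept_ns, slope_ns = fit_line(bmi, non_smoker_charges)
intercept_s, slope_s = fit_line(bmi, smoker_charges)

# Coefficients of the equivalent model: charges ~ bmi + smoker + bmi:smoker
bmi_coef = slope_ns                     # BMI slope for non-smokers
smoker_coef = intercept_s - intercept_ns  # smoker gap at BMI = 0
interaction_coef = slope_s - slope_ns   # extra BMI slope for smokers

print(bmi_coef, smoker_coef, interaction_coef)
```

The toy data also shows why a large negative smoker coefficient can sit alongside much higher charges for smokers: the coefficient is the gap at BMI = 0, far from any realistic value.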
6. Model Comparison and Final Choice
| Model | Test RMSE | Test R-squared |
| --- | --- | --- |
| Model 1: Baseline | about USD 6,253 | about 0.73 |
| Model 2: Final model | about USD 4,758 | about 0.84 |
Selected model: Model 2. It has a lower test RMSE and a higher test R-squared than Model 1, and its interaction term directly tests the hypothesis that BMI matters more for smokers.
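For reference, the two comparison metrics can be computed as in the sketch below. The helper functions and the toy predictions are illustrative assumptions, not the original evaluation code; only the two RMSE figures at the end come from this document.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the outcome."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Share of outcome variance explained: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical toy values, used only to exercise the two functions.
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3800.0]
toy_rmse = rmse(y_true, y_pred)
toy_r2 = r_squared(y_true, y_pred)

# Relative RMSE improvement implied by the reported test results.
rmse_reduction = (6253 - 4758) / 6253
print(f"Model 2 test RMSE is about {rmse_reduction:.0%} lower than Model 1")
```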
7. Key Coefficients and P-Values
After selecting Model 2 based on test-set performance, I refitted the selected specification on the full dataset to report final coefficient estimates and p-values.
| Variable | Coefficient estimate | p-value | Significant? |
| --- | --- | --- | --- |
| Age | 263.62 | less than 0.001 | Yes |
| BMI | 23.53 | 0.358 | No |
| Smoker = Yes | -20,415.61 | less than 0.001 | Yes |
| BMI x Smoker | 1,443.10 | less than 0.001 | Yes |
| Children | 516.40 | less than 0.001 | Yes |
Note: Sex and region controls are included in the model but omitted from this table for presentation clarity.
8. Economic Interpretation
Age
Each additional year of age is associated with about USD 264 higher charges, holding other variables constant.
BMI and smoking
The BMI main effect (about USD 24 per BMI unit, p = 0.358) is the BMI slope for non-smokers, and it is not statistically significant.
The interaction term shows that, for smokers, each additional BMI unit is associated with about USD 1,443 more in charges than for non-smokers, for a total slope of about USD 1,467 per BMI unit.
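Important point: The smoking coefficient (-20,415.61) should not be interpreted alone. Because the model includes bmi:smoker, it is the estimated smoker gap at BMI = 0, a value no individual can have. The estimated gap at a realistic BMI is -20,415.61 + 1,443.10 x BMI.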
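The arithmetic behind this interpretation can be worked through directly from the reported coefficients. The sketch below is a minimal illustration that holds the other variables constant; the function names and the example BMI values are assumptions chosen for illustration.

```python
# Coefficients reported in the Model 2 table (USD).
BMI_COEF = 23.53
SMOKER_COEF = -20_415.61
INTERACTION_COEF = 1_443.10

def bmi_slope(is_smoker):
    """Estimated change in charges per additional BMI unit."""
    return BMI_COEF + (INTERACTION_COEF if is_smoker else 0.0)

def smoker_gap(bmi):
    """Estimated smoker minus non-smoker charges at a given BMI."""
    return SMOKER_COEF + INTERACTION_COEF * bmi

# BMI at which the estimated smoker gap crosses zero.
break_even_bmi = -SMOKER_COEF / INTERACTION_COEF

print(f"BMI slope, non-smokers: USD {bmi_slope(False):,.2f}")
print(f"BMI slope, smokers:     USD {bmi_slope(True):,.2f}")
for example_bmi in (25, 30, 35):
    print(f"Smoker gap at BMI {example_bmi}: USD {smoker_gap(example_bmi):,.2f}")
print(f"Gap is positive above BMI {break_even_bmi:.1f}")
```

The estimated gap turns positive at a BMI of roughly 14, so the negative smoker coefficient is consistent with smokers having higher estimated charges at any BMI above that level.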
9. Main Result and Implication
Answer to the economic question
Observable health-risk characteristics are strongly related to medical insurance charges.
Main findings
Smoking is the strongest risk category.
Age is positively associated with charges.
BMI is not significantly associated with charges among non-smokers, but its association with charges is much stronger among smokers.
Implication: Preventive health policies targeting smoking and obesity-related risk may help reduce future medical costs.
10. Limitations and Conclusion
Limitations
The data are observational, so the results show association and cannot establish causation.
Missing variables include medical history, income, insurance plan type, diet, and exercise.
Charges are right-skewed, so linear regression may not fully capture very high-cost cases.
What I would do differently
With better data, I would include medical history and insurance plan type, and compare the linear model with a log-linear model because charges are right-skewed.
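A log-linear model would regress log(charges) on the same predictors, so coefficients would read approximately as proportional rather than dollar changes. The sketch below illustrates the motivation only: on a hypothetical right-skewed toy sample (not the insurance data), taking logs shrinks the gap between mean and median relative to the spread. The `skew_gap` helper is an illustrative measure, not a statistic used in the original analysis.

```python
import math
import statistics

def skew_gap(values):
    """(mean - median) / standard deviation; positive when a right tail pulls the mean up."""
    gap = statistics.mean(values) - statistics.median(values)
    return gap / statistics.stdev(values)

# Hypothetical toy charges with a long right tail (NOT from the dataset).
toy_charges = [2000, 4000, 6000, 9000, 12000, 40000, 60000]
toy_log_charges = [math.log(c) for c in toy_charges]

raw_score = skew_gap(toy_charges)
log_score = skew_gap(toy_log_charges)

print(f"Skew gap, raw charges: {raw_score:.2f}")
print(f"Skew gap, log charges: {log_score:.2f}")
```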
Final conclusion
Medical insurance charges are strongly connected to observable health-risk profiles. The main result is not only that smokers have higher charges, but that BMI has a stronger association with charges among smokers.
This suggests that insurers, businesses, and policymakers should evaluate health risks jointly rather than one factor at a time.