Predicting Medical Insurance Charges Using Regression Analysis

Ayberk Kocakır

Economic Question & Motivation

Research Question

Which individual characteristics are associated with higher medical insurance charges?

Why is this important?

  • Healthcare costs affect households
  • Important for insurance pricing
  • Useful for healthcare policy

Dataset

Medical Cost Personal Dataset (Kaggle)

  • 1,338 observations
  • Outcome variable: charges

Predictors:

  • Age
  • Sex
  • BMI
  • Children
  • Smoking Status
  • Region

Probability Analysis

Distribution of Medical Insurance Charges

  • Distribution is positively skewed
  • A small group has very high healthcare costs
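
A minimal sketch of how the skew could be quantified, using a moment-based skewness coefficient. The `skewness` helper and the toy values are assumptions for illustration; they are not the dataset or the code behind these slides.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (positive means a long right tail)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy values (hypothetical, NOT from the dataset): many low charges, a few very high ones.
toy_charges = [2000, 3000, 4000, 5000, 6000, 8000, 40000, 50000]

print(round(skewness(toy_charges), 2))
# In a right-skewed sample the mean is pulled above the median by the tail.
print(mean(toy_charges) > median(toy_charges))
```

A positive coefficient, together with a mean above the median, is the numeric counterpart of the long right tail seen in the histogram.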

Smoking Status Analysis

Charges by Smoking Status

  • Smokers have substantially higher charges
  • Smoking appears to be a strong predictor
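
The comparison above amounts to averaging charges within each smoking group. The sketch below shows that calculation; the function name, the "yes"/"no" labels, and the toy rows are assumptions for illustration, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean

def mean_charges_by_group(records, key="smoker"):
    """Average the 'charges' field within each level of `key`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row["charges"])
    return {level: mean(vals) for level, vals in groups.items()}

# Hypothetical rows for illustration only; not taken from the dataset.
toy_rows = [
    {"smoker": "yes", "charges": 30000.0},
    {"smoker": "yes", "charges": 34000.0},
    {"smoker": "no", "charges": 7000.0},
    {"smoker": "no", "charges": 9000.0},
]

means = mean_charges_by_group(toy_rows)
print(means)
# Raw gap between groups; it is not adjusted for age, BMI, or other predictors.
print(means["yes"] - means["no"])
```

A raw group gap like this motivates including smoking status in the regression, where its effect is estimated alongside the other predictors.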

Modeling Approach

Model 1

charges ~ age + bmi + children

Model 2

charges ~ age + bmi + children + smoker + region + sex
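
Model 2 adds three categorical predictors, which a linear regression can only use once they are turned into 0/1 indicator columns (formula interfaces usually do this automatically). The sketch below shows one way that encoding could look; the helper names and the level labels are assumptions and should be checked against the actual data.

```python
def dummy_encode(value, levels):
    """Return 0/1 indicators for every level except the first (the reference level)."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]

# Level names are assumptions for illustration; check them against the actual data.
SMOKER_LEVELS = ["no", "yes"]
SEX_LEVELS = ["female", "male"]
REGION_LEVELS = ["northeast", "northwest", "southeast", "southwest"]

def model2_row(age, bmi, children, smoker, region, sex):
    """Build one numeric design-matrix row for Model 2 (intercept first)."""
    return (
        [1, age, bmi, children]
        + dummy_encode(smoker, SMOKER_LEVELS)
        + dummy_encode(region, REGION_LEVELS)
        + dummy_encode(sex, SEX_LEVELS)
    )

print(model2_row(40, 27.5, 2, "yes", "southeast", "male"))
# [1, 40, 27.5, 2, 1, 0, 1, 0, 1]
```

Dropping the first level of each variable avoids perfect collinearity with the intercept, so each indicator's coefficient is read relative to that reference level.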

Data Split:

  • 80% Training
  • 20% Test
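
A minimal sketch of a reproducible 80/20 split over the 1,338 observations. The function name and the seed are assumptions; the slides do not say how the split was actually drawn.

```python
import random

def train_test_indices(n_rows, train_share=0.8, seed=42):
    """Shuffle row indices reproducibly and cut them into train and test sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_share)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_indices(1338)
print(len(train_idx), len(test_idx))  # 1070 268
```

Both models should be fitted on the same training rows and scored on the same test rows so that their errors are directly comparable.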

Model Comparison

Model Performance

Model     RMSE     R²
Model 1   11,323   0.113
Model 2   6,263    0.733

Model 2 was selected because it has lower prediction error and higher explanatory power.
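
The two comparison criteria can be sketched as follows. RMSE measures prediction error in the units of charges; R² is used here as the measure of explanatory power, which is an assumption about how the second figure on the slide was computed. The helper names are illustrative.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the outcome."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Share of outcome variance explained: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Relative RMSE reduction implied by the two RMSE figures reported above.
rmse_model1, rmse_model2 = 11323, 6263
reduction = (rmse_model1 - rmse_model2) / rmse_model1
print(f"{reduction:.1%}")  # 44.7%
```

On the reported figures, Model 2 cuts prediction error by a little under half relative to Model 1.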

Economic Interpretation

Variable   Effect on Charges
Smoker     +24,199
Age        +262
BMI        +318

Key Finding:

Smoking status is the strongest predictor.
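
A short worked example of how the reported effects can be read. It assumes they are coefficients from a linear model, so each one is the change in predicted charges for a one-unit increase with the other predictors held fixed. The `charge_difference` helper is hypothetical.

```python
# Coefficients as reported on the slide (change in charges per one-unit increase).
COEFFICIENTS = {"smoker": 24199, "age": 262, "bmi": 318}

def charge_difference(smoker=0, age=0, bmi=0):
    """Predicted difference in charges between two people who differ by the given amounts."""
    return (COEFFICIENTS["smoker"] * smoker
            + COEFFICIENTS["age"] * age
            + COEFFICIENTS["bmi"] * bmi)

print(charge_difference(age=10))    # ten years older: 2620
print(charge_difference(bmi=5))     # five BMI points higher: 1590
print(charge_difference(smoker=1))  # smoker vs. non-smoker: 24199
# Smoker gap expressed in equivalent years of age: 92.4
print(round(COEFFICIENTS["smoker"] / COEFFICIENTS["age"], 1))
```

Under these assumptions, the smoking gap is roughly what the model would predict for an age difference of about 92 years, which illustrates why smoking dominates the other predictors.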

Limitations

  • No income information
  • No education data
  • No medical history
  • Limited sample size (1,338 observations)

Future Improvement:

  • Add socioeconomic variables
  • Add medical history variables

Conclusion

Key Findings

  • Smoking is the strongest predictor
  • Age and BMI are associated with higher charges
  • Model 2 performs substantially better than Model 1 (RMSE 6,263 vs. 11,323)
  • Lifestyle factors matter

Thank you for listening.

Questions?