Assignment 8

Modeling Healthcare Costs:

This dataset comes from a public repository on GitHub. It contains information on 1,338 individuals, including demographic features (age, sex, region), health-related variables (BMI, number of children, smoking status), and the individual medical costs billed by health insurance. It is commonly used to model and predict healthcare expenses and to teach regression modeling with right-skewed outcomes, and it was included in Brett Lantz’s Machine Learning with R (2013). In my original analysis, I used a Gamma regression model to predict medical costs (charges) from age, BMI, and smoking status. The Gamma distribution is suitable for positive, continuous, right-skewed data such as medical costs. This time, I use the modelsummary() function to present the same model in a more organized and concise format.
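Before fitting a Gamma model, it can help to confirm that the outcome is strictly positive and right-skewed. The sketch below is only an illustration: it uses simulated Gamma draws rather than the insurance data, the shape and rate values are arbitrary assumptions, and `check_gamma_suitability()` is a hypothetical helper that does not come from any package.

```r
# Hypothetical helper (not part of any package): rough checks for whether
# a numeric outcome is a plausible candidate for Gamma regression.
check_gamma_suitability <- function(x) {
  list(
    all_positive = all(x > 0),          # the Gamma distribution has strictly positive support
    mean         = mean(x),
    median       = median(x),
    right_skewed = mean(x) > median(x)  # rough heuristic, not a formal test
  )
}

# Simulated right-skewed data for illustration only (assumed shape and rate)
set.seed(1)
sim_costs <- rgamma(1000, shape = 1.5, rate = 1 / 8000)

checks <- check_gamma_suitability(sim_costs)
str(checks)
```

Once the insurance data are loaded, the same helper could be called as `check_gamma_suitability(insurance$charges)`.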

# Load the data

insurance <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
head(insurance)  
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
library(modelsummary)

# Convert "smoker" into a binary factor

insurance$smoker <- as.factor(insurance$smoker)

# Fit the Gamma Regression Model

m_gama <- glm(charges ~ age + bmi + smoker, 
              family = Gamma(link = "log"), 
              data = insurance)
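
Written out, the model fitted above treats charges as Gamma-distributed with a mean that depends on the predictors through a log link. The symbols below (the coefficients and the mean) are notation introduced here for illustration, not taken from the original analysis:

```latex
\log \mu_i = \beta_0 + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{bmi}_i + \beta_3\,\mathrm{smoker}_i,
\qquad \mu_i = \mathrm{E}[\mathrm{charges}_i],
\qquad \mathrm{smoker}_i =
\begin{cases} 1 & \text{if smoker = yes} \\ 0 & \text{if smoker = no} \end{cases}
```

Because the link is logarithmic, a one-unit increase in a predictor multiplies the expected charges by the exponential of its coefficient, which is why the coefficients are exponentiated when they are interpreted.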


# Display the model summary

modelsummary(m_gama, output = "markdown")
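|             | (1)        |
|:------------|:-----------|
| (Intercept) | 7.445      |
|             | (0.105)    |
| age         | 0.028      |
|             | (0.001)    |
| bmi         | 0.012      |
|             | (0.003)    |
| smokeryes   | 1.471      |
|             | (0.046)    |
| Num.Obs.    | 1338       |
| AIC         | 26448.3    |
| BIC         | 26474.3    |
| Log.Lik.    | -13219.142 |
| F           | 488.741    |
| RMSE        | 7443.59    |
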
  • This table shows that for each additional year of age, the log of expected charges increases by 0.028. This corresponds to an increase of about 2.8% in expected charges per year (exp(0.028) ≈ 1.028).

  • For each one-unit increase in BMI, expected charges increase by about 1.2% (exp(0.012) ≈ 1.012).

  • Smokers have much higher expected charges: exp(1.471) ≈ 4.35, so a smoker's expected charges are about 4.35 times those of a non-smoker of the same age and BMI.

  • Standard errors (the values in parentheses) show the uncertainty in each estimate. All three predictors are statistically significant: each coefficient is several times larger than its standard error.
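
The interpretation above can be checked with a few lines of arithmetic. This is a minimal sketch that works from the rounded estimates and standard errors in the table, not from the fitted object, so the results are approximate. The 40-year-old with a BMI of 30 is an assumed profile chosen only for illustration, and the estimate-to-standard-error ratio is a rough guide rather than the exact test statistic.

```r
# Rounded estimates and standard errors copied from the model summary table
est <- c("(Intercept)" = 7.445, age = 0.028, bmi = 0.012, smokeryes = 1.471)
se  <- c("(Intercept)" = 0.105, age = 0.001, bmi = 0.003, smokeryes = 0.046)

effects <- data.frame(
  estimate   = est,
  multiplier = exp(est),              # multiplicative effect on expected charges
  pct_change = 100 * (exp(est) - 1),  # percent change per one-unit increase
  wald_ratio = est / se               # estimate divided by its standard error
)
print(round(effects[-1, ], 3))  # drop the intercept row

# Expected charges for a hypothetical 40-year-old with BMI 30 (assumed profile)
eta_nonsmoker <- est[["(Intercept)"]] + est[["age"]] * 40 + est[["bmi"]] * 30
pred <- c(
  nonsmoker = exp(eta_nonsmoker),
  smoker    = exp(eta_nonsmoker + est[["smokeryes"]])
)
print(round(pred))
```

With the fitted model available, `exp(coef(m_gama))` gives the unrounded multipliers, and `modelsummary()` accepts `exponentiate = TRUE` to report exponentiated coefficients directly in the table.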

Conclusion:

Expected insurance charges rise with age, BMI, and smoking. Smoking has by far the largest effect: it is associated with a more than fourfold increase in expected medical costs. Age, BMI, and smoking status are all statistically significant predictors of healthcare costs, consistent with findings in health economics literature (Lantz, n.d.).

Lantz, Brett. n.d. “Insurance.csv.”