Ksen 2

Medical Cost Personal Datasets

Made by Kseniia Marchuk

Research Question

How does smoking status modify the effect of age and body mass index (BMI) on medical charges, while controlling for the number of children, sex, and region? Can we simplify the model by removing theoretically less important predictors without substantial loss in predictive quality?

Motivation

Insurance companies rely on rating factors to set premiums accurately. Knowing whether the effect of BMI on medical costs is confined to, or much stronger among, smokers helps build more precise risk-based pricing models.

Theoretical Framework

Age and other risk factors often affect insurance costs in a non-linear way. Su et al. (2020) demonstrate through a stochastic gradient boosting frequency-severity model that age influences claim costs non-linearly. This non-linearity supports the choice of a Generalized Additive Model (GAM) over a standard linear regression, as GAM can flexibly capture complex relationships without assuming a constant effect per additional year of age.

Generalized Additive Models outperform standard Generalized Linear Models (GLMs) when modelling insurance data with non-linear dependencies. Chen et al. (2023) show that GAMs provide a better fit than GLMs for aggregate claims by allowing non-parametric smooth terms for frequency and severity components. Similarly, Chen (2018) applied GAMs to real insurance data and found more accurate predictions compared to GLM approaches.

These sources confirm that GAM is well-suited for insurance pricing because variables such as age and BMI often exhibit non-linear effects that linear models cannot adequately capture.

Source 1: Age affects risks non-linearly. Su, X., Bai, M., & Chen, F. (2020). Stochastic gradient boosting frequency-severity model of insurance claims. PLoS ONE, 15(8), e0238000. https://doi.org/10.1371/journal.pone.0238000

Source 2: GAM is better than GLM for insurance data. Chen, T., Desmond, A. F., & Adamic, P. (2023). Generalized additive modelling of dependent frequency and severity distributions for aggregate claims. Statistics in Economics, 12(4). https://ideas.repec.org/a/spt/stecon/v12y2023i4f12_4_1.html

Source 3: How GAM helps in insurance. Chen, T. (2018). Generalized additive models for dependent frequency and severity of insurance claims [Master’s thesis, University of Guelph]. http://hdl.handle.net/10214/14769

Data and Exploratory Analysis

library(tidyverse)
library(mgcv)
library(gratia)
library(DHARMa)
library(performance)
library(scales)

insurance <- read.csv("insurance.csv") %>% 
  as_tibble() %>%
  mutate(
    smoker = as.factor(smoker),
    sex    = as.factor(sex),
    region = as.factor(region),
    children = as.numeric(children)
  )

glimpse(insurance)
Rows: 1,338
Columns: 7
$ age      <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
$ sex      <fct> female, male, male, male, male, female, female, female, male,…
$ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74…
$ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ smoker   <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, yes…
$ region   <fct> southwest, southeast, southeast, northwest, northwest, southe…
$ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,…
summary(insurance$charges)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1122    4740    9382   13270   16640   63770 

Visualization of key relationships

# Age vs Charges by smoker
insurance %>%
  ggplot(aes(x = age, y = charges, color = smoker)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE) +
  scale_y_log10(labels = label_dollar()) +
  facet_wrap(~ smoker, scales = "free_y") +
  theme_minimal() +
  labs(title = "Effect of Age on Medical Charges by Smoking Status",
       subtitle = "Y-axis is on log10 scale",
       y = "Medical Charges (log10 scale)", 
       x = "Age")

Interpretation: The plot reveals a clear interaction between age and smoking status. Among non-smokers, the relationship between age and charges (on log scale) is relatively modest and stable. Among smokers, the smoothed trend is steeper, particularly at older ages. This pattern supports modelling separate smooth terms for age by smoking status: s(age, by = smoker).

# BMI vs Charges by smoker
insurance %>%
  ggplot(aes(x = bmi, y = charges, color = smoker)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "gam", formula = y ~ s(x), se = FALSE) +
  scale_y_log10(labels = label_dollar()) +   # log scale with dollar labels, as in the age plot
  facet_wrap(~ smoker, scales = "free_y") +
  theme_minimal() +
  labs(title = "Effect of BMI on Medical Charges by Smoking Status",
       subtitle = "BMI appears to increase charges primarily among smokers\nY-axis is on log10 scale",
       y = "Medical Charges (log10 scale)", 
       x = "Body Mass Index (BMI)")

Interpretation: Among non-smokers, the smoothed relationship between BMI and charges (on log scale) is nearly flat across the observed range. Among smokers, the trend is strongly non-linear: charges remain relatively stable at lower BMI values but increase noticeably once BMI exceeds approximately 30. This visual evidence strongly supports the inclusion of separate smooth terms s(bmi, by = smoker) in the GAM.

Model Building

fit_full <- gam(
  charges ~ smoker + sex + region +
    s(age, by = smoker, k = 15) + 
    s(bmi, by = smoker, k = 12) + 
    s(children, k = 5),
  data = insurance,
  family = Gamma(link = "log"),
  method = "REML"
)

summary(fit_full)

Family: Gamma 
Link function: log 

Formula:
charges ~ smoker + sex + region + s(age, by = smoker, k = 15) + 
    s(bmi, by = smoker, k = 12) + s(children, k = 5)

Parametric coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      9.03873    0.04421 204.467  < 2e-16 ***
smokeryes        1.42442    0.04732  30.100  < 2e-16 ***
sexmale         -0.06808    0.03812  -1.786  0.07434 .  
regionnorthwest -0.06953    0.05465  -1.272  0.20347    
regionsoutheast -0.13232    0.05481  -2.414  0.01591 *  
regionsouthwest -0.17986    0.05472  -3.287  0.00104 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                   edf Ref.df       F  p-value    
s(age):smokerno  4.002  4.976 103.142  < 2e-16 ***
s(age):smokeryes 1.026  1.051   6.375   0.0113 *  
s(bmi):smokerno  2.899  3.697   1.233   0.3039    
s(bmi):smokeryes 4.277  5.372  11.959  < 2e-16 ***
s(children)      2.470  2.960  11.262 9.99e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.832   Deviance explained = 73.4%
-REML =  13098  Scale est. = 0.47819   n = 1338

Interpretation:

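The full GAM explains approximately 73.4% of the deviance. Smoking status has a strong positive effect. The smooth terms differ markedly by smoking status: age is strongly significant and non-linear for non-smokers (edf ≈ 4.0) but close to linear for smokers (edf ≈ 1.0), whereas BMI is significant only for smokers (p < 2e-16, versus p ≈ 0.30 for non-smokers). The smooth effect of children is also significant.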
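For reference, the structure fitted by fit_full can be written out as below. This is a sketch of the model implied by the gam() call; the symbols (mu_i, phi, the beta and gamma coefficients, the f functions, and s_i for the smoking status of observation i) are notation introduced here for illustration and do not appear in the original code.

```latex
\begin{aligned}
\text{charges}_i &\sim \mathrm{Gamma}(\mu_i,\ \phi),
  \qquad \operatorname{Var}(\text{charges}_i) = \phi\,\mu_i^{2}, \\
\log \mu_i &= \beta_0
  + \beta_1\,\mathbf{1}[\text{smoker}_i = \text{yes}]
  + \beta_2\,\mathbf{1}[\text{sex}_i = \text{male}]
  + \sum_{r} \gamma_r\,\mathbf{1}[\text{region}_i = r] \\
&\quad + f_{\text{age}}^{(s_i)}(\text{age}_i)
  + f_{\text{bmi}}^{(s_i)}(\text{bmi}_i)
  + f_{\text{ch}}(\text{children}_i),
  \qquad s_i \in \{\text{no},\ \text{yes}\}.
\end{aligned}
```

Each smooth with by = smoker is a separate function per smoking group, which is why the summary reports two rows for age and two for BMI. The "Scale est." in the output corresponds to the dispersion phi.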

Final (simplified) model

Based on exploratory visualizations and model diagnostics, the number of children was entered as a linear term because it is a discrete variable with only six distinct values. Sex and region were retained despite weak or marginal significance because they are standard rating factors in insurance pricing.

fit_final <- gam(
  charges ~ smoker + sex + region +
    s(age, by = smoker, k = 15) + 
    s(bmi, by = smoker, k = 12) + 
    children,
  data = insurance,
  family = Gamma(link = "log"),
  method = "REML"
)

summary(fit_final)

Family: Gamma 
Link function: log 

Formula:
charges ~ smoker + sex + region + s(age, by = smoker, k = 15) + 
    s(bmi, by = smoker, k = 12) + children

Parametric coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      8.93892    0.04747 188.315  < 2e-16 ***
smokeryes        1.42370    0.04746  30.001  < 2e-16 ***
sexmale         -0.06814    0.03828  -1.780 0.075322 .  
regionnorthwest -0.06584    0.05484  -1.200 0.230183    
regionsoutheast -0.13136    0.05502  -2.387 0.017116 *  
regionsouthwest -0.18147    0.05494  -3.303 0.000981 ***
children         0.09129    0.01647   5.543 3.59e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                   edf Ref.df       F p-value    
s(age):smokerno  3.919  4.874 104.257 < 2e-16 ***
s(age):smokeryes 1.015  1.030   6.664 0.00965 ** 
s(bmi):smokerno  2.877  3.669   1.209 0.31379    
s(bmi):smokeryes 4.263  5.356  12.001 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.834   Deviance explained = 73.2%
-REML =  13101  Scale est. = 0.48243   n = 1338
draw(fit_final) 

The final model explains 73.2% of the deviance. Smokers have substantially higher charges (multiplicative factor ≈ 4.15). The smooth effect of age is significant and non-linear for non-smokers, while nearly linear for smokers. BMI shows no meaningful association for non-smokers but a strong non-linear positive effect for smokers, particularly above BMI 30. The number of children has a small but significant positive linear effect.
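Because the model uses a log link, each parametric coefficient becomes a multiplicative effect on expected charges once exponentiated. The short sketch below reproduces that arithmetic from the estimates printed in the summary(fit_final) output; the object names (coefs, ratios, pct_change, five_children) are illustrative and not part of the original analysis.

```r
# Log-scale estimates copied from the summary(fit_final) output above
coefs <- c(
  smokeryes       =  1.42370,
  children        =  0.09129,
  sexmale         = -0.06814,
  regionsouthwest = -0.18147
)

# With a log link, exp(coefficient) is a ratio of expected charges
ratios <- exp(coefs)
print(round(ratios, 3))

# The same effects expressed as percentage changes in expected charges
pct_change <- 100 * (ratios - 1)
print(round(pct_change, 1))

# Implied ratio between a policyholder with five children and one with none
five_children <- exp(5 * coefs[["children"]])
print(round(five_children, 2))
```

Note that the smoker ratio is an overall level shift between the two groups: since age and BMI have smoker-specific smooths, the actual ratio between a smoker and a non-smoker also depends on their age and BMI.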

Model Comparison

fit_simple <- gam(charges ~ smoker + s(age) + children, 
                  family = Gamma(link = "log"), data = insurance)

AIC(fit_simple, fit_final)
                  df      AIC
fit_simple  8.275969 26651.88
fit_final  22.489876 26157.46

The final model has a substantially lower AIC (ΔAIC ≈ 494) at the cost of about 14 additional effective degrees of freedom, which justifies the more complex specification with smoker-specific smooths for age and BMI.
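The size of the AIC gap and its price in model complexity can be checked directly from the table printed by AIC(). The sketch below only re-uses those reported numbers; aic_table, delta_AIC, and extra_df are names introduced here for illustration.

```r
# Values copied from the AIC(fit_simple, fit_final) output above
aic_table <- data.frame(
  model = c("fit_simple", "fit_final"),
  df    = c(8.275969, 22.489876),
  AIC   = c(26651.88, 26157.46)
)

# Difference of each model from the best (lowest) AIC
aic_table$delta_AIC <- aic_table$AIC - min(aic_table$AIC)

# Additional effective degrees of freedom used by the final model
extra_df <- diff(aic_table$df)

print(aic_table)
print(round(extra_df, 2))
```

One caveat: fit_simple was fitted without a method argument, so mgcv uses its default smoothing-parameter criterion there, while fit_final uses REML. Refitting both with the same method would likely give a cleaner comparison, although a gap of this size is unlikely to be reversed.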

Diagnostics

sim_resid <- simulateResiduals(fit_final)
plot(sim_resid)

Interpretation of diagnostics: Diagnostic plots from the DHARMa package indicate some model misspecification. The QQ-plot shows deviations in the upper tail, and the dispersion test is significant. These results suggest that the Gamma distribution with log link does not fully capture the heavy right tail and variance structure of medical charges.

Results and Practical Implications

Smoking status strongly modifies the effects of both age and BMI on medical charges. For non-smokers, charges rise with age in a clearly non-linear way (edf ≈ 3.9) but show no detectable association with BMI (p ≈ 0.31). For smokers, the age effect on the log scale is close to linear (edf ≈ 1.0), and higher BMI, especially above about 30, is associated with markedly higher costs. These findings support differentiated, risk-based premium setting in health insurance, where smoking status acts as an important effect modifier.

Limitations

Residual diagnostics indicate that the Gamma(log) distribution does not perfectly capture the heavy-tailed nature of the charges variable.

Sex and region were retained as theoretically important insurance rating factors, even though their effects were weak or non-significant in this sample.

The analysis is cross-sectional; causal interpretations require caution.