suppressPackageStartupMessages({
library(bayesrules)
library(rstanarm)
library(bayesplot)
library(tidyverse)
library(tidybayes)
library(broom.mixed)
library(readr)
library(data.table)
library(janitor)
})
Bayesian linear regression analysis on health insurance cost
Background
Machine Learning with R by Brett Lantz is a book that introduces machine learning concepts and their implementation using the R programming language. However, a limitation is that the datasets used in the book are not freely accessible online unless you purchase the book and create a user account with Packt Publishing. This can pose a challenge for readers borrowing the book from a library or a friend. Although the datasets themselves are in the public domain, they have been cleaned and reformatted specifically for the book, making it difficult for others to access them in the same form.
The data
Columns
“age”: age of primary beneficiary
“sex”: insurance contractor gender, female, male
“bmi”: body mass index (kg / m^2), an objective measure of body weight relative to height; the ideal range is 18.5 to 24.9
“children”: Number of children covered by health insurance / Number of dependents
“smoker”: whether the beneficiary smokes (yes/no)
“region”: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
“charges”: Individual medical costs billed by health insurance
#import the data
data = read.csv(file = "insurance.csv")
str(data)
'data.frame': 1338 obs. of 7 variables:
$ age : int 19 18 28 33 32 31 46 37 37 60 ...
$ sex : chr "female" "male" "male" "male" ...
$ bmi : num 27.9 33.8 33 22.7 28.9 ...
$ children: int 0 1 3 0 0 0 1 3 2 0 ...
$ smoker : chr "yes" "no" "no" "no" ...
$ region : chr "southwest" "southeast" "southeast" "northwest" ...
$ charges : num 16885 1726 4449 21984 3867 ...
Data cleaning
summary(data)
 age sex bmi children
Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
Median :39.00 Mode :character Median :30.40 Median :1.000
Mean :39.21 Mean :30.66 Mean :1.095
3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
Max. :64.00 Max. :53.13 Max. :5.000
smoker region charges
Length:1338 Length:1338 Min. : 1122
Class :character Class :character 1st Qu.: 4740
Mode :character Mode :character Median : 9382
Mean :13270
3rd Qu.:16640
Max. :63770
data$sex = factor(data$sex)
data$children = factor(data$children)
data$age = as.numeric(data$age)
data$smoker = factor(data$smoker)
data$region = factor(data$region)
#check missing
which(is.na(data))
integer(0)
#check duplicate
data %>% group_by_all() %>%
summarise(count = n()) %>%
filter(count > 1)
`summarise()` has grouped output by 'age', 'sex', 'bmi', 'children', 'smoker',
'region'. You can override using the `.groups` argument.
# A tibble: 1 × 8
# Groups: age, sex, bmi, children, smoker, region [1]
age sex bmi children smoker region charges count
<dbl> <fct> <dbl> <fct> <fct> <fct> <dbl> <int>
1 19 male 30.6 0 no northwest 1640. 2
#remove duplicate
data = data %>% distinct()
Visualisation
The visualisations below show that health insurance charges are heavily right-skewed.
ggplot(data, aes(charges)) +
  geom_density()
ggplot(data, aes(x = log(charges), fill = smoker)) +
  geom_density()
ggplot(data, aes(x = age, y = log(charges))) +
  geom_point()
ggplot(data, aes(x = age, y = bmi, colour = smoker)) +
  geom_point()
ggplot(data, aes(x = age, y = log(charges), colour = smoker)) +
  geom_point()
ggplot(data, aes(x = sex, y = log(charges))) +
  geom_boxplot()
ggplot(data, aes(x = bmi, y = log(charges), colour = smoker)) +
  geom_point()
ggplot(data, aes(x = children, y = log(charges))) +
  geom_boxplot()
ggplot(data, aes(x = smoker, y = log(charges))) +
  geom_boxplot()
ggplot(data, aes(x = region, y = log(charges))) +
  geom_boxplot()
From the density plots, there appear to be two peaks among smokers, suggesting the presence of clustered groups within that subset. However, the dataset does not include variables that can explicitly identify these groups. Since the data was prepared for machine learning purposes (as indicated by its use in a machine learning textbook), linear regression might not be the most suitable choice for prediction. Nevertheless, I am using this freely available online dataset to practice Bayesian statistics.
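To make those peaks easier to see, here is a quick follow-up sketch (not part of the original set of plots) that restricts the density plot to smokers only, using the same data object as above.
# Follow-up sketch: density of log(charges) among smokers only,
# where the two peaks suggesting an unobserved grouping are most visible.
data %>%
  filter(smoker == "yes") %>%
  ggplot(aes(x = log(charges))) +
  geom_density()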
Specifying priors for the coefficients
Starting with the centered intercept: since the dataset contains a column called region with values like southwest and northeast, I assume the dataset reflects health insurance costs in the US. The average annual cost of health insurance in the US is about US$7,739 for an individual (from Google), and this average could plausibly lie somewhere between US$5,253 and US$14,424 (from Investopedia). Because the plots above show that charges are heavily right-skewed, a log transformation is applied to the outcome. To match the scale of the model, the prior mean of the centered intercept is set to log(7,739) ≈ 8.95 on the log scale, with a standard deviation of 0.25, which places most of the prior mass within the log of the quoted dollar range (roughly 8.57 to 9.58).
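As a quick sanity check on the prior scale, the dollar figures quoted above (from Google and Investopedia) can be translated onto the log scale directly:
# Translate the assumed dollar figures onto the log scale used by the model
log(7739)            # ~8.95: prior mean for the centered intercept
log(c(5253, 14424))  # ~8.57 and ~9.58: the quoted dollar range on the log scale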
Simulate the posterior
data[["smoker"]] =relevel(factor(data[["smoker"]]), ref = "no")
data = data %>% mutate(log_charges = log(charges))
model_1 = stan_glm(formula= log_charges ~ smoker+bmi+age,
data=data,
family=gaussian,
prior_intercept = normal(8.95, 0.25),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1,autoscale = TRUE),
chains = 4, iter = 5000 * 2, seed = 87453804)
Check the diagnostic plots for the stability of the simulation.
mcmc_trace(model_1)
mcmc_dens_overlay(model_1)
mcmc_acf(model_1)
Interpreting the posterior
tidy(model_1, effects = "fixed",
conf.int = TRUE, conf.level = 0.8)
# A tibble: 4 × 5
term estimate std.error conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 7.08 0.0711 6.98 7.17
2 smokeryes 1.54 0.0314 1.51 1.59
3 bmi 0.0108 0.00210 0.00810 0.0135
4 age 0.0351 0.000904 0.0339 0.0363
The bmi coefficient suggests that for every one-unit increase in BMI, we expect health insurance charges to increase by approximately 1%. The 80% credible interval reflects the uncertainty in this relationship: the slope plausibly lies anywhere between 0.008 and 0.013. The smoker coefficient suggests that a smoker's health insurance charges are expected to be about 366% higher than those of a non-smoker (roughly 4.7 times as much).
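Because the model is on the log scale, these point estimates can be exponentiated to read them as multiplicative effects on charges; a quick check using the estimates from the table above:
# Back-transform the log-scale estimates to multiplicative effects on charges
exp(0.0108)  # ~1.011: about a 1% increase in charges per additional BMI unit
exp(0.0351)  # ~1.036: about a 3.6% increase in charges per additional year of age
exp(1.54)    # ~4.66: smokers' charges roughly 4.7 times (about 366% higher than) non-smokers'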
How wrong is the model?
The figure below overlays 100 of the 20,000 posterior simulations on the observed data. The predictions effectively capture both the typical health insurance costs and the overall range observed in the data. However, they fail to account for the trimodal pattern present in the health insurance cost distribution. As previously discussed, the dataset lacks variables that could help identify potential clusters within the smoker group, and it was originally designed for machine learning purposes in a textbook context. This limitation does not imply that the model is inherently flawed; it simply highlights an opportunity for improvement with more informative predictors or a more tailored modeling approach.
pp_check(model_1, nreps = 100) + xlab("log(charges)")
Model evaluation and comparison
Let's build a second model, model 2, which includes all the variables provided in the dataset. We can then compare it with model 1 using cross-validation to evaluate the predictive accuracy of both models.
data[["children"]] =relevel(factor(data[["children"]]), ref = "0")
model_2 = stan_glm(formula= log_charges ~ smoker+bmi+age+children+region,
data=data,
family=gaussian,
prior_intercept = normal(8.95, 0.25),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1,autoscale = TRUE),
chains = 4, iter = 5000 * 2, seed = 87453804)
We will run 10-fold cross-validation for each model.
set.seed(84025797)
cv_model_1 = prediction_summary_cv(model = model_1, data = data, k = 10)
cv_model_2 = prediction_summary_cv(model = model_2, data = data, k = 10)
rbind(cv_model_1$cv %>% mutate(model = "1") %>% select(model, everything()),
      cv_model_2$cv %>% mutate(model = "2") %>% select(model, everything()))
  model       mae mae_scaled within_50 within_95
1     1 0.1616165  0.3469377 0.6806307 0.9386713
2     2 0.1478145  0.3311530 0.6941308 0.9364774
Overall, model 2 demonstrates slightly better predictive performance than model 1, as evidenced by its lower MAE and scaled MAE (the within_50 and within_95 columns report the proportion of observations that fall inside their 50% and 95% posterior prediction intervals). However, model 2, which uses five predictors, only marginally outperforms model 1, which uses just three. This highlights an important principle: more variables do not always result in a meaningfully better model. The modest improvement in prediction accuracy from the additional variables in model 2 does not obviously justify the increased complexity. In line with the principle of parsimony (choosing the simplest effective model), model 1 stands out as the preferred choice for being both efficient and interpretable while delivering comparable performance.
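To put the "modest improvement" into a number, the relative reduction in cross-validated MAE can be computed directly from the table above:
# Relative improvement in cross-validated MAE from model 2's extra predictors
(0.1616165 - 0.1478145) / 0.1616165  # ~0.085, i.e. roughly an 8.5% reduction in MAE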
However, looking at cross-validation alone is not sufficient. We will also compute the expected log-predictive density (ELPD) to help assess how well the models generalise to unseen data while penalising complexity and overfitting.
loo_1 = loo(model_1)
loo_2 = loo(model_2)
#results
c(loo_1$estimates[1],
loo_2$estimates[1])
[1] -877.5847 -819.6226
#compare ELPD
loo_compare(loo_1, loo_2)
        elpd_diff se_diff
model_2 0.0 0.0
model_1 -58.0 12.4
The ELPD comparison suggests that model 2 makes substantially better out-of-sample predictions than model 1, which is critical in real-world applications. Furthermore, model 1's ELPD sits well over two standard errors below model 2's (a difference of 58.0 with a standard error of 12.4), which implies that model 2 is meaningfully better in posterior predictive accuracy.
The principle of parsimony is still important, but in this case the gain in predictive accuracy outweighs the cost of the increased model complexity, as demonstrated by the ELPD comparison. Therefore, we should go with model 2.
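The standard-error claim can be checked directly from the loo_compare output above:
# How many standard errors separate the two models' ELPD estimates
58.0 / 12.4  # ~4.7, well beyond the usual two-standard-error rule of thumb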
Conclusion
The linear regression models were built to predict health insurance costs using variables such as age, smoker, bmi, children and region. The models capture the typical health insurance charges and the overall range observed in the data. However, they cannot capture the trimodal distribution of health insurance costs. This report is primarily a trial-and-error exercise in practicing Bayesian regression, so the low overall prediction accuracy is not a critical concern in this context. Instead, the results may highlight that simple linear regression is not the most suitable model choice for this dataset.