suppressPackageStartupMessages({
library(bayesrules)
library(rstanarm)
library(bayesplot)
library(tidyverse)
library(tidybayes)
library(broom.mixed)
library(readr)
library(data.table)
library(janitor)
})
Bayesian linear regression analysis on health insurance cost
Background
Machine Learning with R by Brett Lantz is a book that introduces machine learning concepts and their implementation using the R programming language. However, a limitation is that the datasets used in the book are not freely accessible online unless you purchase the book and create a user account with Packt Publishing. This can pose a challenge for readers borrowing the book from a library or a friend. Although the datasets themselves are in the public domain, they have been cleaned and reformatted specifically for the book, making it difficult for others to access them in the same form.
The data
Columns
“age”: age of primary beneficiary
“sex”: insurance contractor gender, female, male
“bmi”: body mass index (kg / m^2), an objective measure of body weight relative to height; the ideal range is 18.5 to 24.9
“children”: Number of children covered by health insurance / Number of dependents
“smoker”: whether the beneficiary smokes (yes/no)
“region”: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
“charges”: Individual medical costs billed by health insurance
#import the data
data = read.csv(file = "insurance.csv")
str(data)
'data.frame': 1338 obs. of 7 variables:
$ age : int 19 18 28 33 32 31 46 37 37 60 ...
$ sex : chr "female" "male" "male" "male" ...
$ bmi : num 27.9 33.8 33 22.7 28.9 ...
$ children: int 0 1 3 0 0 0 1 3 2 0 ...
$ smoker : chr "yes" "no" "no" "no" ...
$ region : chr "southwest" "southeast" "southeast" "northwest" ...
$ charges : num 16885 1726 4449 21984 3867 ...
Data cleaning
summary(data)
 age sex bmi children
Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
Median :39.00 Mode :character Median :30.40 Median :1.000
Mean :39.21 Mean :30.66 Mean :1.095
3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
Max. :64.00 Max. :53.13 Max. :5.000
smoker region charges
Length:1338 Length:1338 Min. : 1122
Class :character Class :character 1st Qu.: 4740
Mode :character Mode :character Median : 9382
Mean :13270
3rd Qu.:16640
Max. :63770
data$sex = factor(data$sex)
data$children = factor(data$children)
data$age = as.numeric(data$age)
data$smoker = factor(data$smoker)
data$region = factor(data$region)
#check missing
which(is.na(data))
integer(0)
#check duplicate
data %>% group_by_all() %>%
summarise(count = n()) %>%
filter(count > 1)
`summarise()` has grouped output by 'age', 'sex', 'bmi', 'children', 'smoker',
'region'. You can override using the `.groups` argument.
# A tibble: 1 × 8
# Groups: age, sex, bmi, children, smoker, region [1]
age sex bmi children smoker region charges count
<dbl> <fct> <dbl> <fct> <fct> <fct> <dbl> <int>
1 19 male 30.6 0 no northwest 1640. 2
#remove duplicate
data = data %>% distinct()
Visualisation
The visualisations below show that health insurance charges are heavily right-skewed.
ggplot(data, aes(charges)) +
  geom_density()
ggplot(data, aes(x = log(charges), fill = smoker)) +
  geom_density()
ggplot(data, aes(x = age, y = log(charges))) +
  geom_point()
ggplot(data, aes(x = age, y = bmi, colour = smoker)) +
  geom_point()
ggplot(data, aes(x = age, y = log(charges), colour = smoker)) +
  geom_point()
ggplot(data, aes(x = sex, y = log(charges))) +
  geom_boxplot()
ggplot(data, aes(x = bmi, y = log(charges), colour = smoker)) +
  geom_point()
ggplot(data, aes(x = children, y = log(charges))) +
  geom_boxplot()
ggplot(data, aes(x = smoker, y = log(charges))) +
  geom_boxplot()
ggplot(data, aes(x = region, y = log(charges))) +
  geom_boxplot()
From the density plots, there appear to be two peaks among smokers, suggesting the presence of clustered groups within that subset. However, the dataset does not include variables that can explicitly identify these groups. Since the data was prepared for machine learning purposes (as indicated by its use in a machine learning textbook), linear regression might not be the most suitable choice for prediction. Nevertheless, I am using this freely available online dataset to practice Bayesian statistics.
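To make those peaks easier to see, here is a quick follow-up sketch (not part of the original set of plots) that restricts the density plot to smokers only, using the same data object as above.
# Follow-up sketch: density of log(charges) among smokers only,
# where the two peaks suggesting an unobserved grouping are most visible.
data %>%
  filter(smoker == "yes") %>%
  ggplot(aes(x = log(charges))) +
  geom_density()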
Specifying priors for the coefficients
Starting with the centered intercept: since the dataset contains a column called region with values like southwest and northeast, I assume the dataset reflects health insurance costs in the US. The average annual cost of health insurance in the US is about US$7,739 for an individual (from Google), and this average could plausibly lie somewhere between US$5,253 and US$14,424 (from Investopedia). Because the plots above show that charges are heavily right-skewed, a log transformation is applied to the outcome. To match the scale of the model, the prior mean of the centered intercept is set to log(7,739) ≈ 8.95 on the log scale, with a standard deviation of 0.25, which places most of the prior mass within the log of the quoted dollar range (roughly 8.57 to 9.58).
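As a quick sanity check on the prior scale, the dollar figures quoted above (from Google and Investopedia) can be translated onto the log scale directly:
# Translate the assumed dollar figures onto the log scale used by the model
log(7739)            # ~8.95: prior mean for the centered intercept
log(c(5253, 14424))  # ~8.57 and ~9.58: the quoted dollar range on the log scale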
Simulate the posterior
data[["smoker"]] =relevel(factor(data[["smoker"]]), ref = "no")
data = data %>% mutate(log_charges = log(charges))
model_1 = stan_glm(formula= log_charges ~ smoker+bmi+age,
data=data,
family=gaussian,
prior_intercept = normal(8.95, 0.25),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1,autoscale = TRUE),
chains = 4, iter = 5000 * 2, seed = 87453804)
Check the diagnostic plots for the stability of the simulation.
mcmc_trace(model_1)
mcmc_dens_overlay(model_1)
mcmc_acf(model_1)
Interpreting the posterior
tidy(model_1, effects = "fixed",
conf.int = TRUE, conf.level = 0.8)
# A tibble: 4 × 5
term estimate std.error conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 7.08 0.0711 6.98 7.17
2 smokeryes 1.54 0.0314 1.51 1.59
3 bmi 0.0108 0.00210 0.00810 0.0135
4 age 0.0351 0.000904 0.0339 0.0363
The bmi coefficient suggests that for every one-unit increase in BMI, we expect health insurance charges to increase by approximately 1%. The 80% credible interval reflects the uncertainty in this relationship: the slope plausibly lies anywhere between 0.008 and 0.013. The smoker coefficient suggests that a smoker's health insurance charges are expected to be about 366% higher than those of a non-smoker (roughly 4.7 times as much).
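Because the model is on the log scale, these point estimates can be exponentiated to read them as multiplicative effects on charges; a quick check using the estimates from the table above:
# Back-transform the log-scale estimates to multiplicative effects on charges
exp(0.0108)  # ~1.011: about a 1% increase in charges per additional BMI unit
exp(0.0351)  # ~1.036: about a 3.6% increase in charges per additional year of age
exp(1.54)    # ~4.66: smokers' charges roughly 4.7 times (about 366% higher than) non-smokers'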
How wrong is the model?
The figure below overlays 100 of the 20,000 posterior simulations on the observed data. The predictions effectively capture both the typical health insurance costs and the overall range observed in the data. However, they fail to account for the trimodal pattern present in the health insurance cost distribution. As previously discussed, the dataset lacks variables that could help identify potential clusters within the smoker group, and it was originally designed for machine learning purposes in a textbook context. This limitation does not imply that the model is inherently flawed; it simply highlights an opportunity for improvement with more informative predictors or a more tailored modeling approach.
pp_check(model_1, nreps = 100) + xlab("log(charges)")
Model evaluation and comparison
Let's build a second model, model 2, which includes all the variables provided in the dataset. We can then compare it with model 1 using cross-validation to evaluate the predictive accuracy of both models.
data[["children"]] =relevel(factor(data[["children"]]), ref = "0")
model_2 = stan_glm(formula= log_charges ~ smoker+bmi+age+children+region,
data=data,
family=gaussian,
prior_intercept = normal(8.95, 0.25),
prior = normal(0, 2.5, autoscale = TRUE),
prior_aux = exponential(1,autoscale = TRUE),
chains = 4, iter = 5000 * 2, seed = 87453804)
We will run 10-fold cross-validation for each model.
set.seed(84025797)
cv_model_1 = prediction_summary_cv(model = model_1, data = data, k = 10)
cv_model_2 = prediction_summary_cv(model = model_2, data = data, k = 10)
rbind(cv_model_1$cv %>% mutate(model = "1") %>% select(model, everything()),
      cv_model_2$cv %>% mutate(model = "2") %>% select(model, everything()))
  model       mae mae_scaled within_50 within_95
1     1 0.1616165  0.3469377 0.6806307 0.9386713
2     2 0.1478145  0.3311530 0.6941308 0.9364774
Overall, model 2 demonstrates slightly better predictive performance than model 1, as evidenced by its lower MAE and scaled MAE (the within_50 and within_95 columns report the proportion of observations that fall inside their 50% and 95% posterior prediction intervals). However, model 2, which uses five predictors, only marginally outperforms model 1, which uses just three. This highlights an important principle: more variables do not always result in a meaningfully better model. The modest improvement in prediction accuracy from the additional variables in model 2 does not obviously justify the increased complexity. In line with the principle of parsimony (choosing the simplest effective model), model 1 stands out as the preferred choice for being both efficient and interpretable while delivering comparable performance.
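To put the "modest improvement" into a number, the relative reduction in cross-validated MAE can be computed directly from the table above:
# Relative improvement in cross-validated MAE from model 2's extra predictors
(0.1616165 - 0.1478145) / 0.1616165  # ~0.085, i.e. roughly an 8.5% reduction in MAE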
However, looking at cross-validation alone is not sufficient. We will also compute the expected log-predictive density (ELPD) to help assess how well the models generalise to unseen data while penalising complexity and overfitting.
loo_1 = loo(model_1)
loo_2 = loo(model_2)
#results
c(loo_1$estimates[1],
loo_2$estimates[1])
[1] -877.5847 -819.6226
#compare ELPD
loo_compare(loo_1, loo_2)
        elpd_diff se_diff
model_2 0.0 0.0
model_1 -58.0 12.4
The ELPD comparison suggests that model 2 makes substantially better out-of-sample predictions than model 1, which is critical in real-world applications. Furthermore, model 1's ELPD sits well over two standard errors below model 2's (a difference of 58.0 with a standard error of 12.4), which implies that model 2 is meaningfully better in posterior predictive accuracy.
The principle of parsimony is still important, but in this case the gain in predictive accuracy outweighs the cost of the increased model complexity, as demonstrated by the ELPD comparison. Therefore, we should go with model 2.
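The standard-error claim can be checked directly from the loo_compare output above:
# How many standard errors separate the two models' ELPD estimates
58.0 / 12.4  # ~4.7, well beyond the usual two-standard-error rule of thumb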
Conclusion
The linear regression models were built to predict health insurance costs using variables such as age, smoker, bmi, children and region. The models capture the typical health insurance charges and the overall range observed in the data. However, they cannot capture the trimodal distribution of health insurance costs. This report is primarily a trial-and-error exercise in practicing Bayesian regression, so the low overall prediction accuracy is not a critical concern in this context. Instead, the results may highlight that simple linear regression is not the most suitable model choice for this dataset.