Ksen
Medical Cost Personal Datasets (Insurance dataset) https://www.kaggle.com/datasets/mirichoi0218/insurance
Context
Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R.
Research Question
How does smoking (smoker) modify the effect of age and body mass index (bmi) on medical charges, while controlling for the number of children (children), and can insignificant predictors (region, sex) be discarded without loss of predictive quality?
Motivation
Insurance companies use rating factors. Understanding that BMI only affects charges for smokers (and not for non-smokers) allows for more accurate policy pricing. This is a direct application of pricing theory in insurance.
Theoretical Framework
Source 1: Age affects risks non-linearly. Su, X., Bai, M., & Chen, F. (2020). Stochastic gradient boosting frequency-severity model of insurance claims. PLoS ONE, 15(8), e0238000. https://doi.org/10.1371/journal.pone.0238000 This article states that in insurance, age affects costs non-linearly. This is exactly why I used a GAM instead of an ordinary linear model — because age does not add the same amount each year.
Source 2: GAM is better than GLM for insurance data. Chen, T., Desmond, A. F., & Adamic, P. (2023). Generalized additive modelling of dependent frequency and severity distributions for aggregate claims. Statistics in Economics, 12(4). https://ideas.repec.org/a/spt/stecon/v12y2023i4f12_4_1.html The authors prove that GAM works better than standard GLM because GLM cannot capture non-linear dependencies. I compared a simple GLM and a GAM using AIC and saw that the GAM is better.
Source 3: How GAM helps in insurance. Chen, T. (2018). Generalized additive models for dependent frequency and severity of insurance claims [Master’s thesis, University of Guelph]. http://hdl.handle.net/10214/14769 This work shows, using real insurance data, that GAM provides more accurate predictions than GLM.
All three sources confirm that GAM is the right choice for insurance data because age and other factors affect costs non-linearly, through complex curves.
Visualization
We begin by loading the necessary libraries and the insurance dataset. The data contains information about individual medical costs billed by health insurance, along with several potential predictors: age, sex, BMI, number of children, smoking status, and region.
library(tidyverse)
library(mgcv)
library(DHARMa)
library(performance)
insurance <- read.csv("insurance.csv") %>% as_tibble()
glimpse(insurance)
summary(insurance$charges)
I first looked at the data, then selected the variables.
Possible factors: age, smoker
Dependence of charges on age, separated by smoker
insurance %>%
  ggplot(aes(x = age, y = charges, color = smoker)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE) +
  scale_y_log10() + # log scale, because the model is Gamma with a log link
  facet_wrap(~smoker, scales = "free_y") +
  theme_minimal()
Interpretation:
In this graph, I show how age affects medical expenses, but separately for smoking and non-smoking people. The graph is divided into two parts. On the left are non-smokers, on the right are smokers. The horizontal axis shows age, the vertical axis shows medical expenses, but these expenses are shown on a logarithmic scale to better see the difference between low and high-cost patients.
Looking at the left part of the graph, which shows non-smokers, the trend line is almost horizontal. This means that age has practically no effect on how much a non-smoker pays for health insurance. A twenty-year-old non-smoker and a sixty-year-old non-smoker spend roughly the same. The points on the graph are scattered, but most of them are concentrated at the bottom, meaning non-smokers’ expenses are generally low.
A completely different picture is seen on the right side, showing smokers. Here, the trend line rises sharply, especially after age forty to fifty. This means that for smokers, medical expenses rise significantly with age. If a twenty-year-old smoker pays a little, by age sixty their expenses become huge. We also see that the points on the graph go much higher than for non-smokers; smokers’ expenses can reach fifty to sixty thousand, whereas non-smokers rarely exceed ten to twenty thousand.
From this graph, I draw the main conclusion: the relationship between age and expenses depends entirely on whether a person smokes. For non-smokers, age doesn’t matter; for smokers, age is the most important factor. That is why in my model I cannot simply add age and smoking, but must allow age to affect expenses differently for these two groups.
Dependence on BMI
I thought about which code to use to visualize the relationship between body mass index and medical expenses. Initially, I had a simple version without additional settings, but then I realized it hid an important detail. When I use a common scale for smokers and non-smokers, non-smokers get compressed at the bottom of the graph, and I can’t see their real trend. This is misleading.
Therefore, I will now use the second, improved code. First, I added scales = "free_y" so that each panel has its own vertical scale. Now I clearly see that for non-smokers, BMI has no effect on expenses — the line is horizontal. Second, I added theme_minimal() to make the graph look neat and professional. Third, I added a title, subtitle, and axis labels via labs() so that anyone, including a professor, immediately understands what is shown in the graph and what conclusion follows.
This improved code is what I will use in my work. It shows the data honestly, highlights the main conclusion that BMI increases expenses only for smokers, and is clean enough for a presentation. I will keep the first version only for rough drafts; it will not go into the final report.
Draft code for BMI relationship:
insurance %>%
  ggplot(aes(x = bmi, y = charges, color = smoker)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE) +
  scale_y_log10() +
  facet_wrap(~smoker)
Improved code for visualizing predictor-response relationship:
insurance %>%
  ggplot(aes(x = bmi, y = charges, color = smoker)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE) +
  scale_y_log10() +
  facet_wrap(~smoker, scales = "free_y") +
  theme_minimal() +
  labs(
    title = "BMI effect differs by smoking status",
    subtitle = "BMI increases charges ONLY for smokers",
    y = "Charges (log scale)",
    x = "Body Mass Index"
  )
Interpretation:
In this graph, I show how body mass index, i.e., a person’s weight status, affects medical expenses. Again, I split the graph into two parts: on the left, non-smokers; on the right, smokers. The horizontal axis shows BMI, where normal values are around twenty to twenty-five, and values above thirty indicate obesity. The vertical axis, as in the previous graph, shows medical expenses on a logarithmic scale.
First, look at the left part, which shows non-smokers. Here, the trend line is slightly arched but stays close to horizontal across the entire BMI range. This means that for a non-smoker, their weight status hardly affects medical expenses. A thin non-smoker and an obese non-smoker spend about the same on health. Even if the BMI rises above thirty, which is considered obesity, expenses remain at the same level. The points are scattered, but the trend shows no clear relationship.
Now look at the right part, which shows smokers. Here the situation is dramatically different. While the BMI is within the normal range, up to twenty-five, expenses remain low. But as soon as the BMI exceeds thirty, the trend line shoots up sharply. This means that if a smoker is overweight or obese, their medical expenses increase greatly. It turns out that the combination of smoking and obesity has a much larger effect than just the sum of these two factors separately.
From this graph, I draw a very important conclusion for my model. BMI matters only for smokers. For non-smokers, this variable can be ignored entirely. Therefore, in my statistical model, I must allow BMI to affect expenses differently depending on whether a person smokes. That is why I use the by = smoker construct in the model, which allows the same variable to work differently for different groups of people. Without this approach, the model would be inaccurate and misleading.
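One practical detail behind by = smoker: the mgcv documentation describes the by argument of s() as a numeric or factor variable, while read.csv() in R 4.0 and later returns text columns as character vectors by default. The sketch below shows the conversion on a small made-up table. The object demo and its values are hypothetical and are not rows from the insurance file.

```r
# Hypothetical miniature of the insurance table, with the text column
# stored as character, as read.csv() returns it in R >= 4.0.
demo <- data.frame(
  age    = c(19, 33, 46, 61),
  bmi    = c(27.9, 22.7, 33.4, 29.1),
  smoker = c("yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are untouched.
is_text <- vapply(demo, is.character, logical(1))
demo[is_text] <- lapply(demo[is_text], factor)

# Levels are sorted alphabetically, so "no" becomes the reference level.
levels(demo$smoker)
```

With a factor by variable, mgcv fits one smooth per level, which is why the main effect smoker is also kept in the model formula.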
Model Building
After studying the visualizations and seeing which variables really affect medical expenses, I moved on to building a clean model, without unnecessary noise. My previous model was overfitted due to too many variables, especially region and sex. I therefore write new code and build the model fit_gam_clean using a GAM.
First, I add the variable smoker by itself, because smokers generally pay more regardless of age and weight.
Second, I add age, but with the construct by = smoker. This allows age to affect expenses differently for smokers and non-smokers. From my first graph, I clearly saw that age is not important for non-smokers, while for smokers expenses rise sharply after age forty. The parameter k sets the upper limit on how flexible the curve is allowed to be. I started with k = 20 and reduced it to k = 10 in the final code.
Third, I add BMI, also with by = smoker. The second graph showed me that BMI only affects expenses for smokers, especially after the value of thirty, when obesity begins. I started with k = 15, slightly smaller than for age because the relationship is not as complex, and reduced it to k = 10 in the final code.
Fourth, I add the number of children as a separate smooth term. Although this is a discrete variable, I allow it to be smooth because the effect of the first child may differ from the effect of the third or fourth.
I completely remove region and sex because the visual analysis did not show any important effect.
I keep the Gamma family with a logarithmic link because medical expenses are positive and have a skewed distribution. I use the REML method for more accurate estimation of smoothness parameters.
After the model is built, I look at its summary via summary(fit_gam_clean). There I pay attention to the effective degrees of freedom of each smooth term. If they are close to one, the relationship is almost linear. If they are large, the non-linearity is strong. I also look at the p-values to understand which effects are statistically significant.
Now I have a clean, justified model that is not overfitted, yet reflects the real patterns in the data well.
fit_gam_clean <- gam(
  charges ~ smoker +
    s(age, by = smoker, k = 10) + # reduced from 20 to 10
    s(bmi, by = smoker, k = 10) + # reduced from 15 to 10
    s(children, k = 5), # children is discrete, k = 5 is enough
  data = insurance,
  family = Gamma(link = "log"),
  method = "REML"
)
summary(fit_gam_clean)
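The research question also asks whether region and sex can be dropped without losing predictive quality. One way to check this is to fit the model with and without them and compare AIC. The sketch below does so on synthetic stand-in data, so it runs without insurance.csv. The data frame sim, its coefficients, and the names fit_with and fit_without are assumptions made for illustration, and method = "ML" is my choice here because the two models differ in their parametric terms.

```r
library(mgcv)

set.seed(42)
n <- 500
# Synthetic stand-in for the insurance data (NOT the real file).
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  region = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                         n, replace = TRUE))
)
# By construction, sex and region have no effect on the synthetic charges.
mu <- exp(8 + 0.02 * sim$age +
          (sim$smoker == "yes") * (1.2 + 0.04 * (sim$bmi - 30)))
sim$charges <- rgamma(n, shape = 5, rate = 5 / mu)

fit_with <- gam(
  charges ~ smoker + sex + region + s(age, by = smoker) + s(bmi, by = smoker),
  data = sim, family = Gamma(link = "log"), method = "ML"
)
fit_without <- gam(
  charges ~ smoker + s(age, by = smoker) + s(bmi, by = smoker),
  data = sim, family = Gamma(link = "log"), method = "ML"
)

# Lower AIC is better; a near-zero or negative gain from sex and region
# would support dropping them.
aic_tab <- AIC(fit_with, fit_without)
aic_tab
```

On the real data, the same comparison would use insurance in place of sim and the formula of fit_gam_clean with and without sex and region.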
Model Summary Interpretation
Looking at the model summary, I see several important findings. The coefficient for smokers is 1.41. Because the model uses a log link, this coefficient acts multiplicatively: given the same age, weight, and number of children, a smoker pays about four times more for health insurance than a non-smoker, since the exponent of 1.41 is approximately 4.1.
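The conversion from the log-scale coefficient to a multiplicative effect is plain arithmetic. A minimal check, using the value 1.41 quoted from the summary (beta_smoker and ratio are illustrative names):

```r
beta_smoker <- 1.41        # smoker coefficient on the log scale, as reported above
ratio <- exp(beta_smoker)  # multiplicative effect on expected charges
round(ratio, 2)
```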
Now looking at the smooth terms. Age for non-smokers has 4 effective degrees of freedom and a very small p-value, meaning age does affect non-smokers’ expenses and the relationship is slightly curved, even though from my graph it seemed the line was horizontal. Age for smokers has only one effective degree of freedom, indicating an almost straight line, with a p-value of 0.01: the effect is present, but it is linear, not curved.
The most interesting result concerns BMI. For non-smokers, BMI shows a p-value of 0.34, which is not significant, so the model finds no evidence that weight status affects non-smokers’ expenses, in line with my second graph. For smokers, BMI has 4.3 effective degrees of freedom and a p-value less than 0.001. This is a strong and non-linear effect: the higher the BMI of a smoker, the more sharply their medical expenses increase, especially after the value of 30.
The number of children is also significant, with a p-value less than 0.001, and has 2.4 effective degrees of freedom. This means each additional child adds to expenses, but not strictly linearly; perhaps the first child gives a larger increase than the fourth.
The overall model quality is good. The adjusted R-squared is 0.837, meaning the model explains almost 84% of the variance in the data, and the explained deviance is 72.7%. The main conclusion from this model: BMI matters only for smokers, and for non-smokers it can be ignored; age affects both groups, but much more strongly and interestingly for smokers; and children are an additional factor to consider.
Diagnostics
sim_clean <- simulateResiduals(fit_gam_clean)
plot(sim_clean)
Interpretation:
Looking at this residual diagnostic plot from DHARMa, I see several problems. In the left graph, the QQ plot, the points deviate from the diagonal line. In an ideal model, all points should lie on the black dashed line, but here there is a clear deviation in the upper right corner, and the KS test reports a p-value of zero (that is, too small to display), indicating that the residual distribution does not match the theoretical one.
In the graph of residuals vs. fitted values, the red line should be flat (for DHARMa’s scaled residuals, which run from 0 to 1, the median line should sit at 0.5), but instead it bends, especially on the right side. The dispersion test also reports a p-value of zero, indicating a problem with residual spread.
Only the outlier test shows a p-value of 0.06, which is not significant. This is the only good news: there are no obvious outliers that would completely ruin the model.
What does this mean for my model? It means that the Gamma distribution with a log link is not ideal for my data. Medical expenses have too heavy a right tail: there are people with extremely high expenses that the model cannot predict well. Additionally, the variance grows faster than the Gamma distribution assumes. Perhaps the nature of medical expenses is more complex than a Gamma model can describe.
To fix this situation, I could try a Box-Cox transformation. However, within the scope of this research, the current model is still useful for understanding how age, smoking, and BMI affect expenses. I simply must honestly state in the limitations section that the residual distribution is not ideal, and this is a topic for future improvements.
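As a rough illustration of the Box-Cox idea, the sketch below estimates the transformation parameter lambda with MASS::boxcox() on a synthetic, right-skewed response. The data are simulated stand-ins, not the insurance file, and the linear model lm_fit is only a simple vehicle for the profile likelihood; it is not the GAM used in this report.

```r
set.seed(7)
n <- 300
age <- sample(18:64, n, replace = TRUE)
# Synthetic, strictly positive, right-skewed response (log-normal around an age trend).
charges <- exp(8 + 0.02 * age + rnorm(n, sd = 0.5))

# y = TRUE stores the response in the fit, which boxcox() needs.
lm_fit <- lm(charges ~ age, y = TRUE)

# Profile log-likelihood over a grid of lambda values; no plot is drawn.
bc <- MASS::boxcox(lm_fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # a value near 0 points to a log transformation
```

Calling MASS::boxcox() with the package prefix avoids attaching MASS, which would otherwise mask dplyr::select() from the tidyverse.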
Interpretation of Practical Effects
Look at the effective degrees of freedom (EDF):
summary(fit_gam_clean)
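The EDF and p-values can also be read programmatically from the s.table element of the summary object instead of from the printed output. The sketch below shows this on synthetic stand-in data; sim and fit_demo are hypothetical and do not reproduce the numbers reported here.

```r
library(mgcv)

set.seed(1)
n <- 500
# Synthetic stand-in for the insurance data (NOT the real file).
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
mu <- exp(8 + 0.02 * sim$age +
          (sim$smoker == "yes") * (1.2 + 0.04 * (sim$bmi - 30)))
sim$charges <- rgamma(n, shape = 5, rate = 5 / mu)

fit_demo <- gam(
  charges ~ smoker + s(age, by = smoker) + s(bmi, by = smoker),
  data = sim, family = Gamma(link = "log"), method = "REML"
)

# One row per smooth term: effective degrees of freedom and approximate p-value.
s_tab <- summary(fit_demo)$s.table
round(s_tab[, c("edf", "p-value")], 3)
```

An EDF close to 1 corresponds to an almost linear effect, while larger values indicate a more strongly curved smooth.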
Looking at my model summary, I see that it turned out quite good and explains almost 84% of the variance in the data, meaning I can predict medical expenses fairly accurately. Smokers pay about four times more than non-smokers; this is a very strong and expected effect.
Age affects both groups, but for smokers it works almost as a straight line, while for non-smokers the shape of the relationship is slightly more curved. The most important conclusion about BMI: for non-smokers it is not important. The p-value of 0.34 gives no evidence that weight status affects their expenses, whereas for smokers BMI increases expenses very strongly and in a non-linear way, especially after obesity begins.
The number of children is also significant, but the effect is not as strong. At the same time, the residual plot showed problems because the Gamma distribution is not entirely suitable for medical expenses, which have too many people with very high bills. Still, the model is useful for understanding that smoking and obesity together are much more dangerous than separately.
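Practical effects are easier to communicate as predicted charges than as coefficients on the log scale. The sketch below predicts on the response scale for a small grid of profiles. It runs on synthetic stand-in data, so sim, fit_demo, and the resulting numbers are illustrative assumptions and not results from the insurance file.

```r
library(mgcv)

set.seed(1)
n <- 500
# Synthetic stand-in for the insurance data (NOT the real file).
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
mu <- exp(8 + 0.02 * sim$age +
          (sim$smoker == "yes") * (1.2 + 0.04 * (sim$bmi - 30)))
sim$charges <- rgamma(n, shape = 5, rate = 5 / mu)

fit_demo <- gam(
  charges ~ smoker + s(age, by = smoker) + s(bmi, by = smoker),
  data = sim, family = Gamma(link = "log"), method = "REML"
)

# Four profiles: a 40-year-old, normal-weight vs obese, non-smoker vs smoker.
grid <- expand.grid(
  age    = 40,
  bmi    = c(25, 35),
  smoker = factor(c("no", "yes"), levels = levels(sim$smoker))
)
# type = "response" returns expected charges rather than values on the log scale.
grid$predicted <- as.numeric(predict(fit_demo, newdata = grid, type = "response"))
grid
```

Applied to fit_gam_clean, the grid would also need a children column, since that model includes the number of children.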
Visualize Effects
library(gratia)
draw(fit_gam_clean)
Interpretation:
The first graph (top left) shows how age affects expenses for non-smokers. The horizontal axis is age, the vertical axis is the strength of age’s effect on expenses. If the line goes up, expenses increase with age. If the line goes down, expenses decrease with age. If the line is around zero, age has no effect. The red shaded area around the line shows the model’s uncertainty: the wider it is, the less precise the conclusion.
In this graph, the line wiggles a bit, but almost all the time it is near zero, and the red area is very wide and constantly overlaps zero. This means that for non-smokers, age has practically no effect on medical expenses, and the model is uncertain about this weak effect.
The second graph (top right) shows age for smokers. Here the line rises sharply, starting around age fifty. The red area stays above zero on the right side of the graph. This means that for smokers after fifty, expenses increase strongly, and the model is confident about this. The difference between the first and second graphs is huge: for non-smokers, age doesn’t matter; for smokers, age is a key factor.
The third graph (middle left) shows BMI for non-smokers. On the horizontal axis are values from twenty to fifty, where thirty is already obesity. The line on this graph is almost horizontal, going straight from left to right. The red area is wide and constantly overlaps zero. This means that for non-smokers, whether you are thin or fat does not matter for medical expenses. An obese non-smoker and a thin non-smoker pay the same.
The fourth graph (middle right) shows BMI for smokers. Here the line behaves very differently. Up to the value of thirty, the line is near zero or slightly below; after thirty, it rises sharply. The red area on the right side does not overlap zero, meaning the model is confident. This indicates that while a smoker’s BMI stays below thirty, BMI does not strongly affect expenses. But once BMI exceeds thirty, obesity begins, and then expenses skyrocket. In other words, smoking and being fat at the same time is much more dangerous than just smoking or just being fat.
The fifth graph (bottom) shows the number of children. The horizontal axis runs from zero to five children. The line initially rises from zero to two children, then drops a bit and becomes almost flat. This means the first child increases expenses, the second adds as well, but the third and fourth add almost nothing. Possibly this is because people with one or two children buy more expensive family coverage, while those with three or more children can no longer afford it or receive discounts.
Overall, all five graphs together say the same simple rule: if you don’t smoke, you don’t need to worry about your age or your weight — expenses will be roughly the same and predictable. If you smoke, you need to watch your weight and especially fear age after fifty, because every extra kilogram and every year lived will greatly increase your medical expenses.
Simple GAM without Noise (Final Model)
fit_final <- gam(
  charges ~ smoker +
    s(age, by = smoker, k = 15) +
    s(bmi, by = smoker, k = 12) +
    children,
  data = insurance,
  family = Gamma(link = "log"),
  method = "REML"
)
fit_final
Interpretation of Final Model
So, when I looked at the fit_gam_clean model, I noticed that s(children) is a smooth term for the number of children, but children are discrete numbers from zero to five. The professor talked about noise and overfitting, and I thought that a smooth curve for only six distinct values is indeed unnecessary. In the fit_gam_clean model, the smooth term for children was significant, but I decided to simplify and make children an ordinary linear variable, without smoothness.
I also slightly increased the k parameters for age and BMI, just to give the model more flexibility if needed, but kept the REML method, which decides how much to curve.
In the end, I got the fit_final model, and here is what it showed. The REML score remained almost unchanged: 13108 versus 13106 in the previous model, meaning I didn’t lose quality, but the model became simpler, which is good.
The effective degrees of freedom for age for non-smokers were 4.08, so there is a curve, but it’s not crazy. For age for smokers, it was only 1.02, which is almost a straight line. This is interesting because the graph suggested that age affects smokers strongly, but the model says the effect is linear on the log scale, meaning each additional year multiplies expected expenses by roughly the same factor.
For BMI for non-smokers, the EDF was 2.91, but from the previous summary I remember this effect is not significant, which is fine. For BMI for smokers, it was 4.28, a moderate non-linearity, confirming that after 30 units, expenses begin to rise faster.
Thus, I decided to use fit_final instead of fit_gam_clean because it is simpler, has less noise (no smooth term for children), did not lose quality, and looks cleaner for a presentation to the professor. The professor wanted to see that I know how to select variables and not overfit the model, and fit_final demonstrates exactly that.
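The choice between a smooth and a linear term for children can be checked by fitting both versions and comparing their REML scores and AIC side by side. The sketch below does this on synthetic stand-in data; sim, fit_smooth, and fit_linear are hypothetical names, and the printed values will not match the 13108 and 13106 reported for the real models.

```r
library(mgcv)

set.seed(3)
n <- 500
# Synthetic stand-in for the insurance data (NOT the real file).
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
mu <- exp(8 + 0.02 * sim$age + 0.05 * sim$children + 1.2 * (sim$smoker == "yes"))
sim$charges <- rgamma(n, shape = 5, rate = 5 / mu)

fit_smooth <- gam(charges ~ smoker + s(age) + s(children, k = 5),
                  data = sim, family = Gamma(link = "log"), method = "REML")
fit_linear <- gam(charges ~ smoker + s(age) + children,
                  data = sim, family = Gamma(link = "log"), method = "REML")

# For method = "REML", the minimized REML score is stored in $gcv.ubre.
comparison <- data.frame(
  model = c("s(children, k = 5)", "linear children"),
  reml  = unname(c(fit_smooth$gcv.ubre, fit_linear$gcv.ubre)),
  aic   = c(AIC(fit_smooth), AIC(fit_linear))
)
comparison
```

If the two rows are close on both criteria, the simpler linear term loses little, which is the argument made above for fit_final.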
Diagnostics + Comparison with a Simpler Model
fit_simple <- gam(charges ~ smoker + s(age) + children,
family = Gamma(link="log"), data=insurance)
AIC(fit_simple, fit_final)
Interpretation:
I have two models. The first, fit_simple, is a very simple model that includes only smoking, a smooth effect of age without separating smokers and non-smokers, and the number of children as a linear variable. The second, fit_final, is my improved model, which allows age and BMI to affect smokers and non-smokers differently.
AIC is a number that shows how well the model explains the data, while penalizing complexity (extra parameters). The smaller the AIC value, the better the model. The simple model has an AIC of 26652, while my final model has an AIC of 26185. The difference is about 467 units in favor of the final model.
Adding complex effects, specifically different smooth curves for age and BMI depending on smoking, is therefore justified. The model became more complex, with nearly 19 effective degrees of freedom versus 8 in the simple model, but the reduction in AIC of 467 units is a very large difference. If the difference were less than 2 or even 10, one could say the simple model is almost as good. But here the difference is large, which strongly supports the conclusion that smoking changes how age and BMI affect medical expenses.
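The size of this gap can be made concrete with a short calculation on the two AIC values quoted above. The relative likelihood exp(-delta / 2) is a standard way to express how plausible the higher-AIC model is compared with the lower-AIC one; the variable names below are illustrative.

```r
aic_simple <- 26652   # AIC of fit_simple, as reported above
aic_final  <- 26185   # AIC of fit_final, as reported above

delta_aic <- aic_simple - aic_final  # difference in favor of fit_final
rel_lik   <- exp(-delta_aic / 2)     # relative likelihood of the simple model

delta_aic
rel_lik   # effectively zero
```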