Modeling Medical Insurance Charges

Author

Emrah Akbas

Published

February 27, 2026

Exploratory Data Analysis

Assigned dataset named insurance, so lets look at what these dataset look like including variable and rows.

str(insurance)
'data.frame':   1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : chr  "female" "male" "male" "male" ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : chr  "yes" "no" "no" "no" ...
 $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
 $ charges : num  16885 1726 4449 21984 3867 ...

The insurance dataset is about collection of metrics describing the profile of the person who is insured and it some other information regarding to insurance charges and region. We could see correlation plot to take a look at variables and its relations.

insurance$smoker = as.factor(insurance$smoker)
insurance$region = as.factor(insurance$region)
insurance$sex = as.factor(insurance$sex)

str(insurance)
'data.frame':   1338 obs. of  7 variables:
 $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
 $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
 $ children: int  0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
 $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
 $ charges : num  16885 1726 4449 21984 3867 ...
library(skimr)
Warning: package 'skimr' was built under R version 4.5.2
skim(insurance)
Data summary
Name insurance
Number of rows 1338
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
sex 0 1 FALSE 2 mal: 676, fem: 662
smoker 0 1 FALSE 2 no: 1064, yes: 274
region 0 1 FALSE 4 sou: 364, nor: 325, sou: 325, nor: 324

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 39.21 14.05 18.00 27.00 39.00 51.00 64.00 ▇▅▅▆▆
bmi 0 1 30.66 6.10 15.96 26.30 30.40 34.69 53.13 ▂▇▇▂▁
children 0 1 1.09 1.21 0.00 0.00 1.00 2.00 5.00 ▇▂▂▁▁
charges 0 1 13270.42 12110.01 1121.87 4740.29 9382.03 16639.91 63770.43 ▇▂▁▁▁
library(GGally)
library(ggplot2)
ggpairs(insurance, columns = 1:7) 

Positive & Negative Associations

age and charges are showing moderate positive relationship with Cor = 0.299 meaning that older individuals tend to have higher insurance charges.

There is a weak positive relationship between bmi vs charges Cor = 0.198 showing us that as bmi increase, insurance charges tend to increase.

children and charges showing very weak relationship Cor = 0.068, number of children have a minimal positive impact on charges. We could investigate their relationship using different visualization plots.

charges variable showing right skew distribution. There might be potential outliers in our dataset.

Since smoker is categorical variable and it doesn’t show any correlation number, however boxplot showing smoking status likely drives a major portion of insurance charges. Next we will investigate smoker status affect on insurance charges.

library(ggplot2)
ggplot(data = insurance, mapping = aes(y = charges, x = smoker, fill = smoker)) + geom_boxplot()

Boxplot shows us that Non-smokers usually pay less than $10,000, while most smokers pay between $20,000 and $40,000. Even the cheapest medical bills for smokers are typically higher than the average bills for non-smokers.

library(dplyr)
library(scales) 


insurance_summary <- insurance %>%
  group_by(children) %>%
  summarize(mean_charges = mean(charges))


ggplot(insurance_summary, aes(x = children, y = mean_charges, fill = as.factor(children))) +
  geom_col() +
  geom_text(aes(label = dollar(mean_charges)), 
            angle = 90, 
            color = "white", 
            position = position_stack(vjust = 0.5)) +
    scale_y_continuous(labels = comma) + 
    scale_x_continuous(breaks = seq(0, max(insurance_summary$children), by = 1)) +
  labs(title = "Average Charges by Number of Children",
       x = "Number of Children", 
       y = "Mean Charges",
       fill = "Children") +
  theme_minimal()

Bar chart shows how the number of children relates to average medical charges. Costs slowly increase as the number of children goes from zero to three, peaking at a high of $15,355.32 for families with three children. However, after that point, the charges drop significantly, with families of five children having the lowest average cost at $8,786.04.

ggplot(data = insurance, mapping = aes(y = charges, x = region, color = region)) + geom_boxplot()

While the median costs remain relatively consistent across all four regions, the distribution of expenses varies, with the southeast region showing the widest range of charges. Every region contains a significant number of high cost outliers, showing us that there are some individuals whose insurance charges far more than typical regional average.

Fitting Linear Regression Model and Interpretations of outputs

full_model <- lm(charges ~ ., data= insurance)
summary(full_model)

Call:
lm(formula = charges ~ ., data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11304.9  -2848.1   -982.1   1393.9  29992.8 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
age                256.9       11.9  21.587  < 2e-16 ***
sexmale           -131.3      332.9  -0.394 0.693348    
bmi                339.2       28.6  11.860  < 2e-16 ***
children           475.5      137.8   3.451 0.000577 ***
smokeryes        23848.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -353.0      476.3  -0.741 0.458769    
regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494 
F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Multiple R-squared = 0.7509 means that our model explains 75% of the variation in insurance charges.

Adjusted R-squared = 0.7494 means that after adjusting for the number of predictors, the model still explains 75% of the variation.

The coefficient for age is 256.9, meaning that holding sex, BMI, children, smoker status, and region fixed, a 1 year increase in age is associated with an increase, in predicted charges of about $256.90. The p-value < 2e-16 is highly significant, suggesting that age has a strong independent positive relationship with medical charges.

The coefficient for sex (male) is -131.3, meaning that holding all other variables fixed, males have predicted charges about $131 lower than females. However, the p-value = 0.693 is not statistically significant, meaning sex does not have a meaningful independent effect in this model.

The coefficient for BMI is 339.2, meaning that holding all other variables fixed, a 1-unit increase in BMI is associated with an increase in predicted charges of about $339.20. The p-value < 2e-16 is highly significant, suggesting BMI has a strong independent positive effect on medical charges.

The coefficient for children is 475.5, meaning that holding other variables fixed, each additional child is associated with an increase in predicted charges of about $475.50. The p-value = 0.000577 is statistically significant, indicating children has a meaningful positive effect on charges.

The coefficient for smoker (yes) is 23848.5, meaning that holding all other variables fixed, smokers have predicted charges about $23,848 higher than non-smokers. The p-value < 2e-16 is extremely significant, suggesting smoking status is the strongest predictor in the model. This is a very large magnitude effect relative to the average charge (~$13,270)

In terms of regional price indicators relative to the Northeast, both the Southeast and Southwest show statistically significant differences, with charges being approximately $1,035 lower (p = 0.031) and $960 lower (p = 0.045), respectively. In contrast, the Northwest shows a decrease of $353.0, which is not statistically significant (p = 0.459), indicating no meaningful difference in charges compared to the Northeast.

T-Test for our model and Fitting Reduced Model

The individual t-tests indicate that age, bmi, children, smoker, and two regional indicators significantly predict insurance charges, while sex and the northwest region do not have statistically significant independent effects on insurance charges.

reduced_model <- lm(charges ~ age + bmi + children + smoker + region, data = insurance)
summary(reduced_model)

Call:
lm(formula = charges ~ age + bmi + children + smoker + region, 
    data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-11367.2  -2835.4   -979.7   1361.9  29935.5 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
age                256.97      11.89  21.610  < 2e-16 ***
bmi                338.66      28.56  11.858  < 2e-16 ***
children           474.57     137.74   3.445 0.000588 ***
smokeryes        23836.30     411.86  57.875  < 2e-16 ***
regionnorthwest   -352.18     476.12  -0.740 0.459618    
regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6060 on 1330 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7496 
F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
anova(full_model, reduced_model)
Analysis of Variance Table

Model 1: charges ~ age + sex + bmi + children + smoker + region
Model 2: charges ~ age + bmi + children + smoker + region
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   1329 4.8840e+10                           
2   1330 4.8845e+10 -1  -5716429 0.1556 0.6933

An F-test comparing the full model to our reduced model excluding sex yielded F = 0.156 with p = 0.693. Since the p-value is much greater than 0.05, we fail to reject the null hypothesis that the coefficient for sex is zero. Removing sex does not significantly reduce model fit. Therefore, sex does not provide meaningful explanatory power and is excluded from the final model.

library(car)
Warning: package 'car' was built under R version 4.5.2
Loading required package: carData
Warning: package 'carData' was built under R version 4.5.2

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
vif(reduced_model)
             GVIF Df GVIF^(1/(2*Df))
age      1.016188  1        1.008061
bmi      1.104197  1        1.050808
children 1.003714  1        1.001855
smoker   1.006369  1        1.003179
region   1.098870  3        1.015838

The Variance Inflation Factor (VIF) results indicate no evidence of multicollinearity in our reduced model

Model diagnostics

par(mfrow = c(2,2))
plot(reduced_model)

The diagnostic plots suggest some violations of classical linear regression assumptions. The Residuals vs Fitted plot shows a slight curved pattern and increasing spread at higher fitted values, indicating potential non-linearity and heteroscedasticity.

The Scale-Location plot further confirms non-constant variance, as residual variability increases with predicted charges.

The Q-Q plot shows noticeable deviation in the upper tail, suggesting that residuals are not perfectly normally distributed, likely due to extreme high-cost observations.

The Residuals vs Leverage plot does not show any major influential points beyond Cook’s distance, although a few observations have slightly higher leverage. Overall, multicollinearity is not an issue in this model, but there is some heteroscedasticity and slight deviation from normality.

Model Remedy

At this point I will apply log transformation to charges in our reduced model that could help improving the model assumptions and make the results more reliable.

log_model <- lm(log(charges) ~ age + bmi + children + smoker + region, data = insurance)
summary(log_model)

Call:
lm(formula = log(charges) ~ age + bmi + children + smoker + region, 
    data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.10302 -0.19707 -0.05206  0.06564  2.15091 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.0008478  0.0719853  97.254  < 2e-16 ***
age              0.0346490  0.0008746  39.618  < 2e-16 ***
bmi              0.0130711  0.0021004   6.223 6.52e-10 ***
children         0.1013204  0.0101304  10.002  < 2e-16 ***
smokeryes        1.5472965  0.0302910  51.081  < 2e-16 ***
regionnorthwest -0.0633386  0.0350174  -1.809 0.070712 .  
regionsoutheast -0.1568166  0.0351952  -4.456 9.07e-06 ***
regionsouthwest -0.1285638  0.0351393  -3.659 0.000263 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4457 on 1330 degrees of freedom
Multiple R-squared:  0.7663,    Adjusted R-squared:  0.765 
F-statistic: 622.9 on 7 and 1330 DF,  p-value: < 2.2e-16

Multiple R-squared = 0.7663 means that the model explains about 76.6% of the variation in log(charges). Adjusted R-squared = 0.765 means that after adjusting for predictors, the model still explains about 76.5% of the variation, which shows a strong overall fit and slightly better than the our previous reduced model.

par(mfrow = c(2,2))
plot(log_model)

The red line is not flat, it shows a noticeable downward curve as fitted values increase. There is also a distinct “split” in the residuals left part of the plot. This suggest that there is non-linear relationship that our model is not capturing.

Q-Q residuals plot is showing heavy tail towards the right, meaning there are more extreme “outlier” errors than a normal distribution expects.

The Scale-Location plot shows decrease and increase aftterwards. This indicates heteroscedasticity, meaning the model’s prediction error isn’t constant across all data ranges.

Residuals vs Leverage plot does not indicate any major influential observations beyond Cook’s distance. While there are some high leverage points, none seem to be single handedly affecting the model, however they should be investigated closely.

plot(log_model, which = 4)

The Cook’s Distance plot shows a few observations with moderately higher influence (220, 431 and 1028), but all values remain well below 1. Although some exceed the 4/(1338 number of observations)≈0.003 guideline, none appear large enough to be considered highly influential. Therefore, no single observation is unduly driving the regression results, and the model appears stable.

cooksD <- cooks.distance(log_model)
inf_points <- which(cooksD > 0.003) # I utilized this threshold since it is close to 4/n=1338 
length(inf_points)
[1] 105
inf_points
   4   10   31   35   58   93   99  103  104  141  144  162  220  224  243  245 
   4   10   31   35   58   93   99  103  104  141  144  162  220  224  243  245 
 260  263  264  290  292  302  306  307  322  341  355  356  378  388  398  420 
 260  263  264  290  292  302  306  307  322  341  355  356  378  388  398  420 
 430  431  443  469  475  476  492  504  517  521  526  527  534  540  555  584 
 430  431  443  469  475  476  492  504  517  521  526  527  534  540  555  584 
 600  619  624  638  665  689  755  760  782  804  807  820  855  859  867  877 
 600  619  624  638  665  689  755  760  782  804  807  820  855  859  867  877 
 891  912  937  958  960  984  988 1009 1020 1028 1040 1043 1048 1081 1086 1094 
 891  912  937  958  960  984  988 1009 1020 1028 1040 1043 1048 1081 1086 1094 
1105 1121 1124 1135 1140 1143 1157 1158 1163 1190 1196 1197 1212 1216 1224 1266 
1105 1121 1124 1135 1140 1143 1157 1158 1163 1190 1196 1197 1212 1216 1224 1266 
1289 1292 1314 1316 1318 1319 1322 1329 1332 
1289 1292 1314 1316 1318 1319 1322 1329 1332 

Next we are going to compare our model coefficiency changes without those influential observations.

log_model_reduced <- lm(log(charges) ~ age + bmi + children + smoker + region,
                        data = insurance[-inf_points, ])
cbind(Original = coef(log_model),
  Without_Inf = coef(log_model_reduced)
)
                   Original  Without_Inf
(Intercept)      7.00084777  6.780727831
age              0.03464897  0.040752119
bmi              0.01307111  0.009396141
children         0.10132039  0.111972050
smokeryes        1.54729653  1.583431473
regionnorthwest -0.06333857 -0.091761940
regionsoutheast -0.15681659 -0.150078978
regionsouthwest -0.12856380 -0.119116357

age is showin 18% increase from 0.0346 to 0.0408 this is a moderate change but still positive and significant.

bmi is showing -28% decrease but still statistically significant, suggesting that some influential observations were affecting bmi estmiated affect.

par(mfrow = c(2,2))
plot(log_model_reduced)

Our diagnostic plots are not showing expected improvement, also taking out 105 observation points might not be the best way to approach predicitng charges as they cover 8% of our dataset. Therefore we are going to look for alternative option for model remedy.

Remedy 2

At this point we have to consider posibble interactions we could utilize to improve predicting power of our model. In our previous discoveries we have noticed that smoker and bmi has strong affect on insurance charges. Higher bmi and being smoker could have an ultimate impact on insurance charges, therefore combining those variable can give us better prediction without making changes to our dataset.

interaction_model <- lm(log(charges) ~ age + children + region +
                                         bmi*smoker,
                        data = insurance)

summary(interaction_model)

Call:
lm(formula = log(charges) ~ age + children + region + bmi * smoker, 
    data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.95236 -0.17362 -0.04363  0.04653  2.18043 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.2973126  0.0762591  95.691  < 2e-16 ***
age              0.0348687  0.0008466  41.185  < 2e-16 ***
children         0.1025063  0.0098040  10.456  < 2e-16 ***
regionnorthwest -0.0704737  0.0338945  -2.079   0.0378 *  
regionsoutheast -0.1621837  0.0340629  -4.761 2.13e-06 ***
regionsouthwest -0.1369021  0.0340154  -4.025 6.03e-05 ***
bmi              0.0032462  0.0022779   1.425   0.1544    
smokeryes        0.1749616  0.1466030   1.193   0.2329    
bmi:smokeryes    0.0447061  0.0046794   9.554  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4313 on 1329 degrees of freedom
Multiple R-squared:  0.7813,    Adjusted R-squared:   0.78 
F-statistic: 593.5 on 8 and 1329 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(interaction_model)

The interaction between BMI and smoking status significantly improves the model and changes how BMI and smoking should be interpreted. Rather than having separate independent effects, smoking amplifies the impact of BMI on medical costs. The interaction model provides a better statistical fit and a more realistic explanation of variation in insurance charges.

models <- list(
  Full = full_model,
  Reduced = reduced_model,
  Log = log_model,
  Log_Reduced = log_model_reduced,
  Interaction = interaction_model
)

model_summary <- data.frame(
  Model = names(models),
  R2 = sapply(models, function(m) summary(m)$r.squared),
  Adj_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
  Residual_SE = sapply(models, function(m) summary(m)$sigma)
)

model_summary
                  Model        R2    Adj_R2  Residual_SE
Full               Full 0.7509130 0.7494136 6062.1022885
Reduced         Reduced 0.7508839 0.7495727 6060.1775001
Log                 Log 0.7662799 0.7650497    0.4457101
Log_Reduced Log_Reduced 0.9084246 0.9079013    0.2718137
Interaction Interaction 0.7813001 0.7799836    0.4313125

eventhough log_reduced has better prediction power, I would like to go with Interaction Model because , we can keep influential observations without sacrificing accuracy. The interaction terms allow the model to account for complex relationships that simple models ignore, resulting in a significantly better fit and lower prediction error.

confint(interaction_model, level = 0.95)
                       2.5 %       97.5 %
(Intercept)      7.147711317  7.446913837
age              0.033207813  0.036529604
children         0.083273327  0.121739179
regionnorthwest -0.136966174 -0.003981141
regionsoutheast -0.229006650 -0.095360843
regionsouthwest -0.203631888 -0.070172287
bmi             -0.001222566  0.007714901
smokeryes       -0.112636963  0.462560072
bmi:smokeryes    0.035526252  0.053885864

Now lets Predict the Insurance Charges for “New Observation”

new_obs <- data.frame(
  age = c(25, 45),
  children = c(0, 2),
  region = c("northwest", "southwest"), 
  bmi = c(22.5, 30.0),
  smoker = c("no", "yes")               
)
log_pred <- predict(interaction_model, newdata = new_obs)

final_prediction <- exp(log_pred) #this would help me to communicate my model

print(final_prediction)
        1         2 
 3538.986 38102.805 

Observation 1 ($3,538.99): Age 25 with no children, non-smoker, 22.5 bmi and living in Northwest. This person is predicted to have relatively low insurance costs based on my model.

Observation 2 ($38,102.81): Age 45 having 2 kids, bmi 30, smoker and living in Southwest region. This person is predicted to cost nearly 11 times more than the first person.